The mission of the SRE (Site Reliability Engineering) team is to keep Shopee running efficiently and sustainably 24x7, and to build and maintain large-scale, highly available, high-performance distributed systems, guided by system availability and performance. SRE is a discipline that combines traditional software engineering with technical operations. The SRE team needs to dive deep into Shopee's development lines to ensure that systems remain highly scalable as they evolve rapidly. From the perspective of stability and performance, the scope includes business system design, basic platform components (middleware, container scheduling, caching, object storage, etc.), OS optimisation, and data center and network optimisation. We use engineering and service-oriented approaches to streamline the inefficient and complicated work of the traditional operation and maintenance model, and are committed to building a sound monitoring system to improve the efficiency of incident handling.
- Responsible for maintaining container-based computing platforms and traffic scheduling platforms using expertise in coding, algorithms and complexity analysis.
- Responsible for safeguarding system availability by actively participating in troubleshooting, investigation and SOP design.
- Responsible for improving system reliability and maintainability by enhancing system monitoring and operation automation.
- Responsible for boosting system performance and scalability through system architecture review and exploring state-of-the-art techniques.
- Responsible for improving system sustainability via capacity planning, cost optimisation and knowledge accumulation.
- Bachelor’s or higher degree in Computer Science, Engineering, Information Systems or related fields
- Well versed in containers and container-related solutions such as Kubernetes, Mesos, Docker, Kata, etc.
- Familiar with gateway-related solutions such as DPDK, LVS, OpenResty or Nginx.
- Strong foundation in OS and networking (TCP/IP)
- Solid programming foundation; familiar with common Python/Golang backend development frameworks.
- More than 3 years of experience in related fields; familiar with large-scale operation and maintenance.
- Adaptable, with good communication, collaboration and teamwork skills
- Well versed in English (spoken and written)
Skills Below Are Optional But Preferable
- Experience in developing traffic management or container management automation platforms
- Experience with Chaos Engineering
- Experience with Service Mesh