Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, and fault-tolerant systems. Infrastructure SRE ensures that ByteDance's infrastructure services have reliability and uptime appropriate to the needs of users while sustaining a fast pace of improvement. Our software development focuses on optimizing existing systems, building infrastructure, and eliminating work through automation.
In the SRE team, you’ll have the opportunity to manage the complex challenges of scale, while using expertise in coding, algorithms, complexity analysis, and large-scale system design. We embrace a culture of diversity, intellectual curiosity, openness, and problem-solving. We encourage close collaboration while promoting self-direction.
1. Ensure the reliable and efficient operation of ByteDance's core infrastructure, paying attention to system capacity, stability and cost
2. Build automated operation solutions for large-scale systems; cooperate with system development teams to ensure system reliability throughout the life cycle from system design to launch
3. Design and implement software platforms and monitoring frameworks for efficient, automated, and intelligent service-oriented architecture (SOA) governance
4. Participate in the design and implementation of an automation platform that can ensure rapid iteration of online large-scale clusters
5. Based on business usage scenarios, refine and provide governance best practices and services, including but not limited to analyzing performance bottlenecks on critical paths, locating and troubleshooting business problems, and driving upgrades toward high-availability architectures
1. BS/MS degree in Computer Science or a related major; work experience in related fields is preferred
2. Solid foundation in computer software; understanding of the Linux operating system, storage, network I/O, and related principles
3. Familiar with one or more programming languages or automation tools, such as Python, Go, Java, Shell, and Ansible
4. Ability to solve problems systematically, good communication skills, and a sense of ownership
5. Algorithmic thinking and strong data structure and system design skills are preferred
6. Storage track: experience with relevant systems is preferred, such as KV, Table, Graph, Redis, MySQL, MongoDB, MQ, etc.
7. Computing and big data track: experience with relevant systems is preferred, such as Kubernetes, Docker/containers, AIOps, Spark, Flink, Function as a Service (FaaS), RPC frameworks, Service Mesh, etc.