Our Cloud Native Infrastructure team operates a Kubernetes-like cluster management system that hosts microservices, serverless applications, big data processing, machine learning, distributed storage services, and edge computing platforms for the company.
The team builds and operates this container-based cluster management system to manage elastic computing resources, providing PaaS hosting capabilities to our developers and infrastructure services. The system is designed to manage ByteDance's fleet of machines across multiple data centers, running hundreds of millions of containers and applications for our business with high agility, massive scalability, high availability, and strong performance guarantees.
- Build an application orchestration framework to host various types of production workloads, covering services management, big data jobs, distributed machine learning systems, and distributed storage services;
- Build a performant container-based cluster management system to manage our hyper-scale resources and workloads, with horizontal scalability and ultra-low end-to-end container startup latency;
- Design and build a flexible distributed resource and task scheduling framework to meet diverse needs;
- Design and build cluster federation, horizontal scaling, vertical scaling, and co-location solutions to optimize resource utilization;
- Apply ML methodologies to the scheduler system to help reduce resource fragmentation, balance hot spots, and optimize datacenter power usage.
- Bachelor's or higher degree in Computer Science or related fields;
- Experience with one or more general-purpose programming languages, including but not limited to: Golang, Python, Rust, C/C++;
- Experience working in two or more of the following areas: Unix/Linux environments, distributed and parallel systems, networking systems, and developing large-scale software systems;
- Independent thinker who identifies problems proactively and has systematic problem-analysis and problem-solving skills.
- Experience in cloud-native application or framework development, including OAM, Dapr, and Vitess;
- Experience in large-scale cluster management systems, including Kubernetes, YARN, and Mesos;
- Experience developing large-scale resource and task scheduling systems;
- Experience in container runtimes and related projects, including containerd, Kata Containers, gVisor, and X-Containers;
- Being a contributor or committer in the open-source community is a plus.