The ByteDance Big Data Computing Engine team is responsible for ByteDance's offline, streaming, and real-time computing engines. We support many core businesses and teams, such as AML, recommendation, data warehouse, search, advertising, and streaming media, as well as security and risk control.
Offline computing is mainly based on Spark, with the number of daily task instances averaging in the hundreds of thousands, covering ETL, offline data processing, ad-hoc queries, and other scenarios. We support large-scale data processing for many of ByteDance's core businesses, including recommendation, advertising, and search.
Streaming computing is mainly based on Flink, with the total number of tasks reaching tens of thousands, covering business scenarios such as ETL, real-time monitoring, and real-time features. We also support the construction of ByteDance's internal real-time data warehouse and unified stream-batch processing scenarios.
The real-time computing engine is self-developed, covering real-time data warehousing, real-time online serving, high-frequency updates, online feature stores, and other machine learning scenarios. We also support ByteDance's internal advertising, live streaming, recommendation, and other scenarios that require both data processing and real-time online serving.
1. Research and develop an efficient, reliable, real-time distributed computing engine
2. Research and develop our distributed computing platform and machine learning training job scheduling system, and optimize their stability, performance, and other aspects
3. Develop an in-depth understanding of our business model and design abstract solutions to business problems
1. Proficiency in Java/Python/C++ or other programming languages and common algorithms, with the ability to develop and optimize large-scale distributed systems
2. In-depth research and relevant experience in open-source computing frameworks (Hadoop MapReduce/Spark/Flink preferred)
3. In-depth research and relevant experience in cluster resource management systems (Hadoop YARN/Mesos/Kubernetes preferred)
1. In-depth research and experience in machine learning training and scheduling frameworks
2. Experience in operating and managing ultra-large-scale clusters