The position is in charge of managing a team of people dedicated to proactively building reliability into the product. Because reliability in highly complex, integrated systems typically crosses between multiple programming languages, third-party services and integrations – as well as software and hardware – an SRE team needs to be multi-talented. Each individual in an SRE team should be highly skilled in one or two fields with a wide breadth of knowledge in many other IT operations and other software development skills.
Objectives of this Role
- Run the production environment by monitoring availability and taking a holistic view of system health
- Build software and systems to manage platform infrastructure and applications
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
- Provide primary operational support and engineering for multiple large distributed software applications
Daily and Monthly Responsibilities
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
- Partner with development teams to improve services through rigorous testing and release procedures
- Participate in system design consulting, platform management, and capacity planning
- Create sustainable systems and services through automation and uplifts
- Balance feature development speed and reliability with well-defined service level objectives
- Require to work day and night shift
Required Skills and Qualifications
- Bachelor’s degree in computer science or other highly technical, scientific discipline
- Ability to program (structured and OO) with one or more high level languages, such as Python, Java, C/C++, Ruby, and JavaScript
- Experience with distributed storage technologies like NFS, HDFS, Ceph, S3 as well as dynamic resource management frameworks (Mesos, Kubernetes, Yarn)
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
- Preferred Qualifications
- Previous success in technical engineering
- Coding experience beyond simple scripts