A site reliability engineer (SRE) spends up to 50% of their time on ops-related work such as handling issues, on-call duty, and manual intervention. Since the software system that an SRE oversees is expected to be highly automated and self-healing, the SRE should spend the other 50% of their time on development tasks such as new features, scaling, or automation. The ideal site reliability engineer candidate is either a software engineer with a good administration background or a highly skilled system administrator with knowledge of coding and automation.
As an SRE at NE Digital, you will drive initiatives to improve the automation, scalability, and reliability of our core services such as Fairprice Online, Scan&Go, Identity, my first skool, and many more. As a member of the NTUC Enterprise Center of Excellence, you will be exposed to the latest technologies, including AWS Cloud, Google Cloud Platform, Kubernetes, Kubeflow, ML/AI, and Big Data, in a hybrid/multi-cloud environment.
We are strong believers in DevSecOps, SRE, Agile and FinOps.
Roles & Responsibilities
- Work with release engineers to ensure that the software delivery pipeline is as efficient as possible.
- Collaborate closely with product developers to ensure that the designed solution responds to non-functional requirements such as availability, performance, security, and maintainability.
- Take responsibility for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
- Engage in and improve the whole lifecycle of services from inception and design, through deployment, operation, and refinement.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Document “tribal” knowledge.
Requirements
- Bachelor's degree in Computer Science, a related technical field involving systems engineering, or equivalent practical experience.
- Experience in Unix/Linux and/or Windows operating systems.
- Experience in analyzing and troubleshooting systems.
- Understanding of infrastructure monitoring, logging, alerting, release management, and configuration management.
- Understanding of networking (e.g. TCP/IP, routing, network topology, load balancers, DNS, NTP).
- Experience in one of the following: Python, Go, Perl, Ruby or shell scripting.
- Experience with public cloud platforms (AWS and/or GCP).
- Experience maintaining Internet-facing production-grade applications.
- Experience with software deployment and/or orchestration technologies, e.g., Puppet, Chef, Salt, Ansible, Docker, Kubernetes, Terraform.
- Experience in CI/CD (e.g., JIRA, Git, Jenkins, Nexus, ...).
- Experience in standard IT security practices (e.g., encryption, certificates, key management).
- Excellent communication and problem-solving skills, with strong attention to detail.
- Flexibility to work outside business hours, which may include weekends and/or holidays.
- Self-starter who is able to identify and perform tasks with minimal supervision.
- Experience with GSuite apps (Gmail, Gsheet, Gdoc, ...).