Flexport is looking for Site Reliability Engineers to help establish the company as the most trusted in the global trade ecosystem. Our SREs are responsible for creating a culture of reliability through the proactive development of services that help our engineering and IT teams do their jobs better.
What You’ll Do
- Establish and maintain monitoring capabilities that track service availability, capacity, and performance across our production environments.
- Develop SLOs in partnership with product and engineering teams to balance development velocity and service reliability.
- Lead the incident management program and build a blameless post-mortem culture.
- Improve change management processes to limit the impact of bad changes, detect problems quickly and accurately, and ensure safe rollback and recovery.
- Partner with development teams to improve testing and release procedures.
- Advise product teams in system design, platform management, and capacity planning.
- Create sustainable systems and services through automation and uplifts.
What You’ll Need
- 3+ years of SRE/DevOps experience in a fast-paced global environment.
- 3+ years of experience with developer tools including source code management, CI/CD pipelines, and configuration automation with Infrastructure as Code (CloudFormation, Terraform, etc.).
- 3+ years of experience with Linux server and container-based infrastructure.
- 3+ years of hands-on experience with AWS. Experience with Azure and GCP is a plus.
- 3+ years of experience with commercial or open-source infrastructure monitoring solutions.
- Experience with Kubernetes, relational and non-relational databases, and Windows server infrastructure is a plus.
- Excellence in problem-solving, strategic thinking, and collaboration with cross-functional teams.
- Strong interpersonal and communication skills.
Culture & Values
- Learn more at www.keyvalues.com/flexport