✨ About The Role
- Responsible for maintaining Pelago's system built on AWS, developing a deep understanding of the development workflow, and planning, implementing, and growing the AWS cloud infrastructure
- Troubleshoot issues on the platform, find root causes, and interface with engineering teams to resolve, while monitoring application metrics and proactively raising issues
- Own the reliability, availability, latency, performance, and capacity planning of the Pelago environment, performing incident response and blameless post-mortems
- Implement infrastructure as code for provisioning, configuration, and deployment using Terraform, and conduct load testing to identify bottlenecks before they impact customers
- Work alongside developers to drive automation, maximize efficiency, and improve reliability, with occasional on-call shifts required on a rotational basis
âš¡ Requirements
- Experience working in Site Reliability Engineering or reliability-focused production engineering roles, with a solid understanding of AWS, serverless architecture, and Terraform
- Proficiency in script development and scripting languages, with a strong troubleshooting background in identifying and remediating issues
- Ability to work on a system that has scaled to 50m+ users, adhering to security standards and best practices
- Team player mentality with strong communication and collaboration skills, able to work alongside developers to drive automation and improve reliability
- Passion for making an impact on the health of others, with a commitment to equitable access to healthcare and wellbeing