✨ About The Role
- The Site Reliability Engineer will be responsible for ensuring the reliability, scalability, and performance of Replit's infrastructure.
- This role involves designing and implementing observability solutions to monitor system health and performance.
- The engineer will drive automation and infrastructure as code using tools like Terraform and Ansible.
- A key responsibility is establishing Service Level Objectives (SLOs) and Service Level Indicators (SLIs) in collaboration with product and engineering teams.
- The role includes leading incident management efforts and conducting post-mortems to improve future responses.
⚡ Requirements
- The ideal candidate will have at least 3 years of experience in Site Reliability Engineering or similar roles such as DevOps or Systems Engineering.
- Strong programming skills in languages like Python or Go are essential for automating tasks and building resilient systems.
- A deep understanding of distributed systems is crucial for ensuring the reliability and performance of the infrastructure.
- Candidates should have experience with container orchestration platforms, particularly Kubernetes, and cloud-native technologies.
- Strong incident management skills are necessary, backed by a proven track record of leading incident response efforts.