✨ About The Role
- The role involves ensuring the reliability, scalability, and performance of Replit's infrastructure that serves millions of developers worldwide.
- Responsibilities include designing and implementing robust monitoring solutions and automating operational tasks to improve infrastructure reliability.
- The candidate will work with product and engineering teams to define and implement Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- Leading incident response and developing runbooks for critical services are key parts of the job.
- The position requires identifying and resolving performance bottlenecks and optimizing resource utilization across the infrastructure.
⚡ Requirements
- The ideal candidate will have more than 5 years of experience in Site Reliability Engineering or a similar role, such as DevOps or Systems Engineering.
- A strong programming background in languages like Python or Go is essential for automating tasks and building resilient systems.
- Candidates should possess a deep understanding of distributed systems and cloud infrastructure, particularly container orchestration platforms such as Kubernetes.
- Strong incident management skills are necessary, including experience leading incident response and conducting post-mortems.
- A passion for continuous learning and staying current with industry best practices and new technologies is highly valued.