✨ About The Role
- Ensure the reliability, scalability, and performance of systems as the company continues to expand
- Collaborate with researchers, data scientists, and platform developers to specify requirements and design solutions for the research platform
- Implement fault-tolerant and resilient design patterns to minimize service disruptions and build automation tools to improve system reliability
- Develop and maintain monitoring systems to proactively identify issues and anomalies in the production environment
- Participate in an on-call rotation to respond to critical incidents and ensure 24/7 system availability
⚡ Requirements
- Experienced reliability engineer with a track record of improving system reliability in a fast-paced, rapidly scaling company
- Proficient in cloud infrastructure, specifically Azure, and experienced in collaborating with cross-functional teams to ensure reliability and scalability
- Skilled in utilizing Infrastructure as Code (IaC) principles to automate infrastructure provisioning and configuration management
- Strong problem-solving and troubleshooting skills, with excellent communication and collaboration abilities
- Dedicated to creating a diverse, equitable, and inclusive culture while empowering colleagues with excellent tooling and systems