✨ About The Role
- The role involves ensuring the stability, resilience, and scalability of services through automation and infrastructure engineering.
- Responsibilities include deploying monitoring solutions, designing new SRE tools, and collaborating with engineering teams across multiple continents.
- The position requires leading or participating in troubleshooting complex incidents and ensuring high levels of observability for tech services.
- The engineer will work closely with feature teams to implement best practices for monitoring, alerting, and capacity planning.
- The job emphasizes maintaining developer velocity while ensuring product reliability and delivering a high-quality customer experience.
⚡ Requirements
- A skilled engineer with at least 5 years of experience in Site Reliability Engineering (SRE), DevOps, or a related engineering role.
- Strong understanding of SRE and DevOps methodologies, particularly as they apply to the build and deployment cycle of applications.
- A focus on observability, with experience in monitoring systems and tools such as Grafana, Loki, and Prometheus.
- A holistic view of application delivery, understanding the integration of monitoring, logging, alerting, and scaling systems.
- Proficient in a programming language or runtime such as Java, Python, or Node.js, with a systematic approach to problem-solving.
- Comfortable working in a cloud-native environment, particularly with AWS, and capable of supporting a global user base.
- A proactive individual with a bias for action, who takes initiative to resolve issues and improve processes.
- A growth mindset, willing to mentor less-experienced engineers and eager to learn from others.