The Red Hat ROSA OpenShift Site Reliability Engineering team is seeking a Site Reliability Engineering (SRE) Manager to join our team. OpenShift is Enterprise Kubernetes, and SRE-P delivers Red Hat OpenShift Service on AWS (ROSA), Azure Red Hat OpenShift (ARO), and OpenShift Dedicated (OSD) as cloud services for large enterprise customers.
As an SRE Manager, you'll lead a team of SREs in both the development and operations of our managed OpenShift services. You will interact with product managers, customers, and product engineers around the world to deliver a sophisticated cloud-based container orchestration platform for enterprise IT customers. You'll coach and develop SREs as they create Kubernetes Operators and other software to autonomously manage the environment, and guide problem management resolution (PMR) analysis when things go wrong. You will also leverage AI/ML tools to enhance observability, automate repetitive tasks, and predict potential service issues. You'll work in a fast-paced, globally distributed team while quickly learning new skills and creating ways to consistently meet service-level agreements (SLAs) for our global cloud services.
ROSA OpenShift SRE is a growing, sophisticated, global, fast-paced team inside the world's open source leader, with constant opportunities to learn new skills and innovate new solutions to meet our customers' demands. As an SRE Manager on this team, you'll directly contribute to Red Hat's success in the rapidly growing Kubernetes-as-a-service market.
Hire, develop, and retain a team of SREs developing and operating Red Hat's managed OpenShift service offerings.
Coach engineers on SRE principles: automation, toil reduction, and root cause analysis.
Manage high-visibility project delivery, including estimation, schedule, risks, and dependencies.
Drive adoption of AI/ML-assisted automation and observability solutions to improve operational efficiency and reliability.
Meet with customers in pre-sales and post-sales situations, supporting deep-dive conversations on product capabilities as well as escalations and problem resolution surrounding incidents.
Design processes and communication norms that facilitate coordination across a fast-moving, fast-growing, diverse global team.
Lead your team through frequent changes in organization, process, and technology commensurate with a high-growth cloud service in a competitive market.
Resolve customer issues in cooperation with Red Hat's global customer support team.
Participate in a periodic 24x7 management escalation on-call rotation.
2+ years of experience managing engineering teams.
Ability to lead distributed, remote teams working across multiple time zones.
Ability to discuss complex technical issues with SREs, product managers, and less-technical stakeholders including customers and senior leaders.
Experience with Agile project management methodologies such as Scrum and Kanban.
1+ years of experience with cloud providers such as Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure.
1+ years of experience with Kubernetes is a plus.
1+ years of experience with Docker-based containers is a plus.
Experience or familiarity with AI/ML applications in cloud operations, observability, or automation is a plus.