Site Reliability Engineer

Own the observability platform end-to-end (Prometheus, Grafana, and distributed tracing with OpenTelemetry); establish SLO/SLI frameworks and track SLAs across all systems and applications.
Lead major incident response as an incident commander; drive root-cause analysis and systemic remediation programs.
Lead the evolution of CWAN's cloud infrastructure on AWS, establishing scalability, resilience, and security standards across all services.
Serve as the primary owner of the Kubernetes (EKS) platform: design cluster topology, multi-tenancy models, autoscaling strategies, and upgrade lifecycle.
Define and enforce the organization's Infrastructure-as-Code standards using Terraform and Ansible; drive adoption of GitOps workflows.
Build and maintain CI/CD and automated deployment pipelines for all services and applications across environments.
Evaluate and introduce emerging technologies (eBPF, WASM, service meshes) to improve platform capabilities.
Partner with engineering leadership to embed reliability requirements into the SDLC; champion chaos engineering and resilience testing programs.
Mentor and grow mid-level and junior SREs across global teams through code reviews, pairing, and structured knowledge sharing.
Drive capacity planning, cost optimization, and FinOps practices across the AWS environment.
Contribute to the engineering roadmap and help define the long-term reliability strategy for the CWAN platform.

Qualifications

Required
7+ years of experience in Site Reliability Engineering, Platform Engineering, or related roles.
Proven track record leading major incident response and driving post-incident systemic improvements.
Strong experience building and operating observability stacks at scale.
Hands-on experience with monitoring, logging, and tracing tools such as Grafana, Prometheus, Mimir, OpenSearch, Dynatrace/Datadog, and VictoriaMetrics.
Demonstrated ability to mentor engineers and influence technical direction across time zones.
Deep expertise with Kubernetes in large-scale, multi-cluster production environments.
Advanced proficiency with AWS (EKS, RDS/Aurora, ElastiCache, Direct Connect, IAM/SCP, Cost Explorer).
Expert-level Infrastructure-as-Code skills with Terraform (modules, remote state, Atlantis or similar).
Hands-on experience with CI/CD platforms: Jenkins, GitHub Actions and GitLab CI.
Experience with GitOps workflows (ArgoCD, Rancher).
Proficiency in at least one general-purpose programming language (Python, Go, Java) for building tooling and automation.
Experience with security best practices in cloud environments (IAM least privilege, secrets management, etc.).

Preferred
Experience in financial services, FinTech, or other mission-critical, regulated environments.
Hands-on experience with service mesh (Istio) and eBPF-based observability tools.
Prior staff or principal engineer experience with cross-team technical influence in a global organization.
AWS and Kubernetes certifications at the Professional level.
Experience with multi-region active-active architectures and global load balancing.

Senior Site Reliability Engineer

Clearwater Analytics

About Clearwater Analytics