Senior Site Reliability Engineer

Build and deploy a comprehensive Azure observability stack with self-healing automation
Raleigh, North Carolina, United States
Senior
yesterday
USA Jobs

Provides a centralized online platform for searching and applying to employment opportunities across the United States.

No sponsorship will be provided for this role.

Location: On-site at the location listed in this posting.

Weekly Schedule: Monday-Friday, 9am-5pm

We are seeking a Senior Site Reliability Engineer who will be the guardian of our Azure infrastructure reliability. This role focuses on building comprehensive observability platforms, implementing intelligent monitoring systems, and proactively identifying issues before they impact production. You will create the tools and automation that predict, detect, and prevent problems rather than simply reacting to them. Your primary mission is ensuring our Azure infrastructure and applications never surprise us with failures.

The ideal candidate has deep expertise in Azure Monitor, Application Insights, Log Analytics, and KQL, combined with strong scripting skills in Python or PowerShell. You should have 5-7+ years of experience implementing observability platforms and a proven track record of preventing incidents through proactive monitoring and automation. You'll work with technologies like Prometheus, Grafana, OpenTelemetry, and Azure services (AKS, App Services, Azure SQL, Cosmos DB) while building self-healing automation and predictive analytics tools that keep our systems healthy.

Key Responsibilities:

Design and implement a comprehensive observability stack across all Azure resources and applications

Build intelligent alerting systems with anomaly detection and predictive capabilities to prevent incidents

Create self-healing automation and auto-remediation tools that resolve issues without human intervention

Develop internal monitoring platforms, dashboards, and CLI tools for engineering teams

Write KQL queries and analyze metrics/logs to identify optimization opportunities and predict failures

Implement continuous resource monitoring for Azure quotas, costs, security posture, and service health

Build capacity forecasting and trend analysis tools to prevent resource exhaustion

Reduce alert noise while improving coverage and actionability of monitoring systems

Participate in a light on-call rotation (our prevention-focused approach reduces reactive incidents)

About Us

First Horizon Corporation is a leading regional financial services company, dedicated to helping our clients, communities and associates unlock their full potential with capital and counsel. Headquartered in Memphis, TN, the banking subsidiary First Horizon Bank operates in 12 states across the southern U.S. The Company and its subsidiaries offer commercial, private banking, consumer, small business, wealth and trust management, retail brokerage, capital markets, fixed income, and mortgage banking services. First Horizon has been recognized as one of the nation's best employers by Fortune and Forbes magazines and a Top 10 Most Reputable U.S. Bank.

Benefit Highlights

Medical with wellness incentives, dental, and vision

HSA with company match

Maternity and parental leave

Tuition reimbursement

Mentor program

401(k) with 6% match

More: FirstHorizon.com/First-Horizon-National-Corporation/Careers/Our-Benefits

Engineering