
Lead Software Engineer - Resiliency

Develop automated solutions to improve platform reliability and incident response efficiency
Columbus, Ohio, United States
Senior
3 weeks ago

Lead Site Reliability Engineer

Be an integral part of an agile team that's constantly pushing the envelope to enhance, build, and deliver top-notch technology products. As a Lead Site Reliability Engineer at JPMorgan Chase within the Employee Compute Branch Team, you will play a pivotal role in designing, implementing, and overseeing automation for observability and notification across a diverse set of systems in a global Microsoft Windows environment. You will lead by example, bringing hands-on expertise in PowerShell and C# and instilling best practices in a team of highly experienced system engineers. Your work will directly impact the reliability, scalability, and efficiency of our platforms, with a strong focus on cloud (Azure and AWS) integration.

Job Responsibilities

  • Champion site reliability engineering culture and practices, exerting technical influence across the team.
  • Lead the design and hands-on implementation of automated observability and notification solutions using PowerShell and C#.
  • Drive initiatives to improve reliability and stability of applications and platforms through data-driven analytics and automation.
  • Collaborate with team members to define and implement service level indicators, objectives, and error budgets.
  • Architect and implement monitoring, alerting, and telemetry solutions using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.
  • Act as the primary technical lead during major incidents, quickly identifying and resolving issues to minimize impact.
  • Mentor and upskill system engineers, fostering a programming mindset and best practices in automation and reliability.
  • Facilitate cross-team and cross-region collaboration, ensuring alignment and knowledge sharing.
  • Document and share technical solutions and best practices within internal forums and communities of practice.
  • Engage with stakeholders to understand business needs and translate them into technical solutions, with increasing responsibility over time.
  • Break down complex problems into actionable work for the team, ensuring clear direction and accountability.

Required qualifications, capabilities, and skills

  • Formal training or certification in Site Reliability Engineering concepts and 5+ years of applied experience.
  • Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, and toil reduction, with proven ability to implement these practices.
  • Expert-level fluency in PowerShell and C# in a Microsoft Windows environment.
  • Hands-on experience with cloud platforms, specifically Azure and AWS.
  • Demonstrated experience in automated software testing (unit, integration, end-to-end).
  • Deep knowledge of software applications and technical processes, with emerging depth in one or more technical disciplines.
  • Proficiency and experience in observability, including white and black box monitoring, SLO alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.
  • Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform).
  • Experience with containerization and container orchestration (e.g., Docker, Kubernetes, ECS).
  • Ability to mentor and teach programming concepts to system engineers with non-programming backgrounds, fostering a programming mindset and best practices.
  • Excellent communication and strategic thinking skills, with the ability to collaborate across teams, regions, and stakeholder groups.

Preferred qualifications, capabilities, and skills

  • Experience leading teams or projects in a site reliability or automation-focused role.
  • Experience in financial services or other highly regulated, secure enterprise environments.
  • Experience with containerization and orchestration (e.g., Docker, Kubernetes, ECS).
  • Familiarity with complex data structures and algorithms.
  • Drive to self-educate and evaluate new technologies.
  • Ability to expand and collaborate across different levels and stakeholder groups.
  • Experience architecting self-healing or remediation automation (a plus, but not required at this stage).

JPMorgan Chase & Co. is an Equal Opportunity Employer, including Disability/Veterans
