Staff Software Engineer, Site Reliability

Build and operate scalable, self-healing reliability tools for the Intuit Identity platform.
San Diego, California, United States
Senior
$188,500 – 255,000 USD / year
Intuit

Provides financial management, tax preparation, and accounting software solutions for individuals, small businesses, and self-employed professionals.

Staff Software Engineer - Identity Team

Come join Intuit's Identity Team as a Staff Software Engineer who sits at the true intersection of software development and site reliability, building and operating large-scale systems that are secure, fault-tolerant, performant, highly available, affordable, and scalable. This is not a traditional SRE role - we're looking for an engineer who writes production-quality code as readily as they debug infrastructure, and who is as at home building new system reliability tools and capabilities as they are improving the resiliency and operational excellence of the Identity platform as a whole.

Identity is at the heart of all offerings across Intuit and is foundational to Intuit's strategic transformation. It is one of the company's most critical services, powering roughly 500 applications and services and enabling Intuit's 3 strategic big bets. Identity capabilities position Intuit at the center of the financial ecosystem and enable fluid exchange of identity, profile, and data across an ecosystem of financial institutions. Identity's technical stack is a cloud-native, microservices-based architecture running entirely on Kubernetes and AWS. The work is complex, the stakes are high, and the impact is company-wide.

Responsibilities

Software Development (~50%)

  • Design and build tools, automation systems, and self-healing mechanisms from the ground up - this means writing real, production code, not scripts
  • Develop self-service tools and services that enable Identity developers to troubleshoot and triage issues at scale
  • Build and evolve observability components to detect and isolate issues quickly across massive-scale systems
  • Contribute to cost and capacity management, uncovering cost-saving opportunities and developing automation to enforce optimization at scale
  • Leverage AI to build tools and solve complex operational and auto-healing problems at scale
  • Support and coach other engineers through pair programming and code review, helping ensure that all engineers are growing and feel part of a community. Be a role model for engineers and inspire a high technical bar for the team.

Systems Reliability & Infrastructure (~50%)

  • Act as the technical subject matter expert to evaluate and evangelize forward-looking processes, tools, technologies, and architecture that deliver high-quality, secure software faster and more efficiently while meeting availability, scale, and performance requirements in an AWS public cloud and Kubernetes environment.
  • Partner with Architecture, Product, and Operations on infrastructure target state and on Resilience and Operational Excellence (OpEx) best practices and patterns, and influence the roadmap to non-linearly improve all the "-ilities"
  • Contribute to FMEA (Failure Mode and Effects Analysis) and Chaos Engineering to proactively identify resiliency gaps and prepare for faster recovery during incidents.
  • Drive incident management and incident Root Cause Analysis (RCA) to continuously improve development and operational practices
  • Participate in a 12/7 on-call rotation

Qualifications

Required

  • BS/MS in Computer Science, Engineering, or equivalent experience
  • 10+ years of experience in software engineering, with demonstrated hands-on expertise designing, developing (not just scripting), and operating complex (high-scale and high-availability) distributed systems in a cloud-native architecture and AWS environment.
  • Experience using AI to build tools and solve complex operational and auto-healing problems.
  • Proficiency coding in Python, Java, Go, or similar languages, combined with strong operational skills
  • Experience with infrastructure as code (Terraform/CDK preferred), CI/CD pipelines (Jenkins, CircleCI, or similar), Kubernetes and containerization (Docker, ECS), and monitoring/alerting tools (Splunk, Wavefront, Grafana Mimir)
  • Ability to handle a fast-paced environment for iterative project turnarounds on mission-critical systems.
  • Ability to collaborate across a wide range of roles and experience levels. Strong communication skills.
  • Strong Linux/Unix fundamentals

Who Thrives in This Role

You're a software engineer who genuinely enjoys building tools and services that solve reliability and operational problems - not someone who codes occasionally. You're comfortable writing complex, scalable software and also rolling up your sleeves on an incident to drive investigation, resolution, and learnings toward continued improvement. You don't see 'dev' and 'ops' as separate worlds. If that's you, we want to talk.
