Software Engineer

Own production reliability for assigned services and drive automation to reduce incident MTTR.
Bangalore
Mid-Level
Posted 15 hours ago
Caterpillar

Designs, manufactures, and sells heavy machinery, engines, and equipment for construction, mining, energy, and transportation industries worldwide.

Caterpillar Inc. Job Posting

Your Work Shapes the World at Caterpillar Inc.

When you join Caterpillar, you're joining a global team that cares not just about the work we do but also about each other. We are the makers, problem solvers, and future world builders who are creating stronger, more sustainable communities. We don't just talk about progress and innovation here – we make it happen, with our customers, where we work and live. Together, we are building a better world, so we can all enjoy living in it.

Responsibilities:

  • Own production reliability for assigned services through proactive monitoring, alerting, and operational excellence.
  • Participate in 24x7 on‑call rotation, leading P1/P2 incident triage, stabilization, and resolution.
  • Ensure adherence to SLOs, SLIs, SLAs, and availability targets.

Alerting & Monitoring

  • Design, implement, and tune actionable alerts to reduce noise and false positives.
  • Build and maintain alerting using tools such as:
    • Datadog, Dynatrace, AppDynamics, Broadcom
    • CloudWatch, Azure Monitor
    • Synthetic monitoring tools (ThousandEyes or equivalents)
  • Create and maintain operational dashboards for application, infrastructure, and business KPIs.
  • Drive alert rationalization and standardization across teams.

Incident Management & RCA

  • Lead or contribute to Root Cause Analysis (RCA) and Post‑Incident Reviews (PIRs).
  • Perform event correlation across metrics, logs, traces, and deployments.
  • Identify recurring issues and partner with engineering teams for permanent fixes.
  • Produce clear RCA documentation including timeline, impact, root cause, and corrective actions.

Observability & Tooling

  • Implement and operate observability platforms covering:
    • Metrics, logs, traces
    • Service topology and dependency mapping
  • Work with OpenTelemetry‑based pipelines where applicable.
  • Improve visibility into upstream/downstream dependencies.
  • Support onboarding of applications into standard SRE tooling and frameworks.

Automation & Toil Reduction

  • Identify manual and repetitive operational tasks and automate them using scripting or workflows.
  • Contribute to self‑healing and auto‑remediation solutions.
  • Improve MTTR through automation, runbooks, and tooling enhancements.

Collaboration & Governance

  • Work closely with application teams, platform teams, and cloud engineers.
  • Review application designs from a reliability and operability perspective.
  • Contribute to SRE standards, best practices, and documentation.

Required Skills & Experience

Core Experience

  • 5–6 years of experience in SRE, DevOps, Production Support, or Platform Engineering
  • Strong experience handling production incidents (P1/P2) and RCAs

Monitoring & Alerting

  • Hands‑on experience with monitoring and alerting tools such as:
    • Datadog, Dynatrace, AppDynamics, Broadcom
    • CloudWatch, Azure Monitor
    • Synthetic monitoring tools (ThousandEyes or similar)
  • Experience designing noise‑free, service‑impact‑based alerts

RCA & Troubleshooting

  • Strong skills in log analysis, metric correlation, and distributed tracing
  • Experience performing structured RCAs and postmortems
  • Understanding of incident patterns, failure modes, and resilience

Cloud & Infrastructure

  • Experience with AWS and/or Azure
  • Working knowledge of containers and orchestration (ECS/EKS/Kubernetes)
  • Experience with databases (Postgres, Oracle, or similar)

Automation & Programming

  • Proficiency in at least one scripting language: Python, Bash, or JavaScript
  • Familiarity with CI/CD pipelines and IaC concepts (Terraform, CloudFormation – good to have)

Nice to Have

  • Experience with OpenTelemetry
  • Exposure to AIOps / event correlation / AI‑assisted RCA
  • Experience with service maps, dependency graphs, and topology modeling
  • Prior experience supporting mission‑critical or customer‑facing platforms

Behavioral & Soft Skills

  • Strong problem‑solving and analytical mindset
  • Clear communication during high‑pressure incidents
  • Ability to collaborate across engineering, product, and operations teams
  • Ownership mindset with a focus on long‑term reliability over short‑term fixes

What Success Looks Like in This Role

  • Reduced alert noise and faster incident detection
  • Improved MTTR and fewer repeat incidents
  • High‑quality RCAs leading to permanent improvements
  • Strong operational readiness of onboarded applications
