Caterpillar Inc. Job Posting

Your Work Shapes the World at Caterpillar Inc.

When you join Caterpillar, you're joining a global team who cares not just about the work we do – but also about each other. We are the makers, problem solvers, and future world builders who are creating stronger, more sustainable communities. We don't just talk about progress and innovation here – we make it happen, with our customers, where we work and live. Together, we are building a better world, so we can all enjoy living in it.

Responsibilities

Own production reliability for assigned services through proactive monitoring, alerting, and operational excellence.
Participate in a 24x7 on‑call rotation, leading P1/P2 incident triage, stabilization, and resolution.
Ensure services meet their SLOs, SLAs, and availability targets, as measured by their SLIs.

Alerting & Monitoring

Design, implement, and tune actionable alerts to reduce noise and false positives.
Build and maintain alerting using tools such as:

Datadog, Dynatrace, AppDynamics, Broadcom
CloudWatch, Azure Monitor
Synthetic monitoring tools (ThousandEyes or equivalents)

Create and maintain operational dashboards for application, infrastructure, and business KPIs.
Drive alert rationalization and standardization across teams.

Incident Management & RCA

Lead or contribute to Root Cause Analysis (RCA) and Post‑Incident Reviews (PIRs).
Perform event correlation across metrics, logs, traces, and deployments.
Identify recurring issues and partner with engineering teams for permanent fixes.
Produce clear RCA documentation including timeline, impact, root cause, and corrective actions.

Observability & Tooling

Implement and operate observability platforms covering:

Metrics, logs, traces
Service topology and dependency mapping

Work with OpenTelemetry‑based pipelines where applicable.
Improve visibility into upstream/downstream dependencies.
Support onboarding of applications into standard SRE tooling and frameworks.

Automation & Toil Reduction

Identify manual and repetitive operational tasks and automate them using scripting or workflows.
Contribute to self‑healing and auto‑remediation solutions.
Improve MTTR through automation, runbooks, and tooling enhancements.

Collaboration & Governance

Work closely with application teams, platform teams, and cloud engineers.
Review application designs from a reliability and operability perspective.
Contribute to SRE standards, best practices, and documentation.

Required Skills & Experience

Core Experience

5–6 years of experience in SRE, DevOps, Production Support, or Platform Engineering
Strong experience handling production incidents (P1/P2) and RCAs

Monitoring & Alerting

Hands‑on experience with monitoring and alerting tools such as:

Datadog, Dynatrace, AppDynamics, Broadcom
CloudWatch, Azure Monitor
Synthetic monitoring tools (ThousandEyes or similar)

Experience designing noise‑free, service‑impact‑based alerts

RCA & Troubleshooting

Strong skills in log analysis, metric correlation, and distributed tracing
Experience performing structured RCAs and postmortems
Understanding of incident patterns, failure modes, and resilience

Cloud & Infrastructure

Experience with AWS and/or Azure
Working knowledge of containers and orchestration (ECS/EKS/Kubernetes)
Experience with databases (Postgres, Oracle, or similar)

Automation & Programming

Proficiency in at least one scripting language: Python, Bash, or JavaScript
Familiarity with CI/CD pipelines and IaC concepts (experience with Terraform or CloudFormation is good to have)

Nice to Have

Experience with OpenTelemetry
Exposure to AIOps / event correlation / AI‑assisted RCA
Experience with service maps, dependency graphs, and topology modeling
Prior experience supporting mission‑critical or customer‑facing platforms

Behavioral & Soft Skills

Strong problem‑solving and analytical mindset
Clear communication during high‑pressure incidents
Ability to collaborate across engineering, product, and operations teams
Ownership mindset with a focus on long‑term reliability over short‑term fixes

What Success Looks Like in This Role

Reduced alert noise and faster incident detection
Improved MTTR and fewer repeat incidents
High‑quality RCAs leading to permanent improvements
Strong operational readiness of onboarded applications
