Cloud Site Reliability Engineer

We are looking for a Cloud Operations – Site Reliability Engineer to join our international shared services team. Working in the Cloud Operations team, you'll automate, configure and deploy the company's cloud-based SaaS infrastructure, ensuring the business operates efficiently and effectively. Your role will be to design, deploy and administer the customer-facing SaaS cloud platforms and deliver the best customer experience through high availability and performance. You will assist with technical leadership to keep our Forterro Cloud journey progressing and to minimize downtime.

As a Site Reliability Engineer (SRE), you will design, build, and operate highly reliable SaaS platforms across multiple Forterro product lines. You will focus on automation, reliability engineering, observability, and improving our global cloud infrastructure with a strong AWS, Kubernetes, and GitOps mindset. You will work closely with development teams to deliver predictable, scalable, fault-tolerant services using ArgoCD, Terraform, and modern SRE principles. SREs will be aligned to one or more products and will own the reliability of those products.

Responsibilities

Platform Operations Management & Automation

Install, configure, and automate applications, servers, infrastructure, and network systems to meet business IT requirements.
Work in alignment with the Platform team to design and engineer scalable SaaS platform architectures on AWS.
Work in collaboration with the Platform team to implement and maintain Infrastructure-as-Code using Terraform and Crossplane.
Own the AWS platform, aligning and defining new processes for SaaS delivery.
Own and evolve GitOps pipelines using ArgoCD.
Develop automated runbooks, playbooks, reusable modules, and self-healing/recovery workflows.

Monitoring, Observability & Reliability

Configure monitoring platforms and availability metrics to ensure maximum operating efficiency.
Implement modern observability tooling (Prometheus, Grafana, Loki, OpenTelemetry, Datadog).
Define and measure SLIs/SLOs, manage error budgets, and drive reliability improvements.
Improve resilience through load testing, chaos testing, and game days.

Operations, Troubleshooting & Incident Management

Troubleshoot system and network problems, diagnosing and resolving faults to restore availability quickly.
Conduct deep-dive root cause analysis and lead post-mortems for P1 incidents.
Take part in a team rota/on-call rotation to provide out-of-hours cover for critical events.
Prioritize work efficiently to minimize impact on business-critical systems.

Security, Compliance & Governance

Ensure the security posture of the infrastructure and maintain compliance with company standards.
Provide architectural guidance for new SaaS workloads and migrations.
Optimize AWS spend using FinOps best practices.

Collaboration, Best Practice & Stakeholder Engagement

Champion best practice and collaborate with professional services and development teams to provision customer application deployments and services.
Establish and maintain strong working relationships with stakeholders (co-workers, clients, third-party vendors).
Mentor CloudOps engineers and support their progression into SRE roles.
Assist the cloud practice by proposing solutions and contributing relevant business cases.
Work with the team to carry out cost analysis and optimize costs where possible.

Documentation & Continuous Improvement

Update internal administration systems, technical documentation, system diagrams, and operational guidelines.
Test and evaluate new technologies and products, producing evaluation reports.

Skills, Knowledge & Expertise

Experience & Background

3–7+ years in SRE, DevOps, CloudOps, or cloud engineering roles.
Experience in Cloud DevOps or as an AWS Engineer.
Experience working with SLIs/SLOs and reliability-based engineering practices.

Cloud, SaaS & Platform Expertise

Deep knowledge of AWS services, SaaS architectures, and multi-account governance.
Familiarity with PaaS technologies: containers, Kubernetes (EKS), orchestration, messaging (Pub/Sub), FaaS, etc.
Familiarity with backup solutions (Veeam, Kasten, Velero).
Experience with Cloud FinOps solutions is beneficial but not required.

Automation, IaC & GitOps

Strong expertise with Terraform, Git, CI/CD pipelines, and GitOps using ArgoCD.
Scripting and automation experience: Crossplane, Ansible, Python, YAML, PowerShell/Bash.

Monitoring, Observability & Performance

Experience with best-in-class system monitoring.
Expertise with observability tools such as Grafana, Datadog, Prometheus, Loki, OpenTelemetry.
Strong system design, troubleshooting, and performance analysis skills.

Soft Skills & Ways of Working

Ability to identify sensible, effective, low-risk, and low-complexity solutions.
Excellent attention to detail and data management skills.
Excellent communication, organizational, and cross-functional collaboration abilities.
Knowledge of or experience with Agile methodologies.

About Forterro

Founded in 2012, Forterro has grown to become a category leader in industrial software – with strongholds in Europe's top production economies, as well as regional service hubs and development centres around the world. From more than 40 office locations, our 2,500+ employees provide and support software for more than 13,000 industrial businesses.

Our products are deeply rooted in the demands of their local geography, and each is designed to strengthen and accelerate our customers' ability to operate efficiently and compete effectively.

Our Hiring Process

Stage 1: Applied

Stage 2: Technical Interview

Stage 3: Technical Assessment

Stage 4: Functional Interview

Stage 5: Offer

Stage 6: Hired
