Cloud/Site Reliability Engineer

Location: Providence, RI, US, 02903

Requisition ID: 17859

Brightstar is an innovative, forward-thinking global leader in lottery that builds on its renowned expertise in delivering secure technology and producing reliable, comprehensive solutions for its customers. As a premier pure-play global lottery company, our best-in-class lottery operations, retail and digital solutions, and award-winning lottery games enable our customers to achieve their goals, fulfill player needs, and distribute meaningful benefits to communities. Brightstar has a well-established local presence and is a trusted partner to governments and regulators around the world, creating value by adhering to the highest standards of service, integrity, and responsibility. Brightstar has approximately 6,000 employees.

About the Role

We are seeking a Cloud/Site Reliability Engineer to join our Cloud Infrastructure Engineering, Operations & Automation team. This role is designed for engineers who are passionate about building resilient systems, preventing incidents before they occur, and driving operational excellence through intelligent monitoring, AI-driven automation, and continuous improvement. You'll play a pivotal role in evolving our cloud-hosted environments to be more self-aware, self-healing, and scalable, ensuring high availability and performance of our applications and services. Your investigation of issues will also facilitate the engagement of L3 product engineers during production incidents.

Responsibilities

As a Cloud/Site Reliability Engineer, you will focus on Level 2 (L2) operational ownership with a strong emphasis on proactive monitoring, root cause analysis, and automation-driven remediation:

Monitoring & Observability
- Design and refine monitoring strategies using tools like Dynatrace, Prometheus, and ELK.
- Develop alerting standards that reduce noise and increase signal quality.
- Continuously improve observability to detect anomalies before they impact users.
Troubleshooting & Root Cause Analysis
- Assess key application workload metrics for performance and reliability, alongside infrastructure and middleware monitoring.
- Identify issues in public and hybrid cloud services and resources.
- Correlate alerts with telemetry and logs to identify systemic issues and improvement opportunities.
- Work with L3 product engineers and cloud vendors to resolve cases.
Automation & Self-Healing
- Design, build, and maintain robust automation pipelines using tools such as Terraform, Ansible, Jenkins, Helm, and Bash to streamline cloud operations.
- Develop and implement self-healing capabilities that proactively detect and remediate issues, minimizing manual intervention and downtime.
- Analyze operational workflows to identify repetitive tasks and transform them into scalable, automated solutions.
- Collaborate with the Architecture team to enhance and enforce cloud baseline standards for consistency and reliability.
- Automate incident response and recovery processes leveraging tools like PagerDuty to accelerate resolution and improve system resilience.
Cloud Infrastructure Operations
- Apply advanced experience with both Azure and AWS cloud service providers.
- Manage cloud infrastructure and services.
- Monitor and optimize cloud resource usage.
- Open and manage Microsoft support tickets in collaboration with L3.
- Participate in 24x7 On-Call rotation with after-hours support for critical incident response.

Qualifications

- Hands-on experience in cloud operations or site reliability engineering.
- Practical experience managing public cloud infrastructure and services (Azure or AWS knowledge preferred).
- Proficiency in scripting and automation (Terraform, PowerShell, Python, Bash).
- Experience with Infrastructure as Code (IaC) and GitOps principles.
- Hands-on experience with Kubernetes (K8s) and container orchestration.
- Expertise in monitoring tools (Dynatrace, Datadog, Prometheus, ELK).
- Strong analytical, troubleshooting, and communication skills.

Preferred Qualifications

- Experience applying agentic AI techniques to drive intelligent automation, optimize cloud services, accelerate troubleshooting and root cause analysis, and enhance system resilience and recoverability.
- Familiarity with AI/ML Ops or AI-assisted observability tools.
- Thorough understanding of Java application workloads and Java performance topics.
- Deep knowledge of one programming language (Java, Python, or Go).
- Strong Linux and networking skills.
- Understanding of software architecture patterns and application development principles.
- Public cloud certifications are a plus.
- Experience in a 24/7 operations environment.

Why Join Us?

You'll be part of a forward-thinking Cloud Infrastructure Engineering, Operations & Automation team that values prevention over reaction, automation over repetition, and collaboration over silos. Your work will directly contribute to building a more resilient, scalable, and intelligent cloud ecosystem.

Keys to Success

- Building collaborative relationships
- Decision making
- Drive results
- Foster innovation
- Personal energy
- Self-leadership
