Site Reliability Engineer - 100% Remote
LRS Consulting Services is seeking a highly skilled Site Reliability Engineer (SRE) to ensure our client’s cloud environments are reliable, scalable, and performing adequately to meet the needs of the business. This is a direct hire opportunity and is 100% remote. LRS Consulting Services has been delivering the highest quality consultants to our clients since 1979. We’ve built a solid reputation for dealing with our clients and our consultants with honesty, integrity, and respect. We work hard every day to maintain that reputation, and we’re very interested in candidates who can help us.
In this role, you will be responsible for ensuring the reliability, scalability, and performance of our client’s cloud-based systems. You will work closely with development, operations, and security teams to build and maintain robust monitoring, alerting, and automation solutions. The ideal candidate will have a solid understanding of infrastructure and cloud technologies, as well as experience working in a high-growth, large-enterprise, multi-location environment.
Day to day responsibilities include:
- Design, implement, and maintain scalable and reliable infrastructure on AWS and Azure.
- Conduct capacity planning and performance testing.
- Define and track SLIs, SLOs, RTOs, and RPOs.
- Define and test HA and DR strategies for in-scope cloud applications.
- Develop and manage Datadog dashboards, monitors, and alerts to ensure system health and performance.
- Automate operational tasks using Infrastructure as Code (IaC) tools such as Terraform or CloudFormation.
- Collaborate with development teams to improve system reliability and performance through observability best practices.
- Conduct root cause analysis and post-mortems for incidents, and drive continuous improvement.
- Participate in on-call rotations and respond to production incidents with a focus on minimizing downtime.
- Ensure compliance with security and operational standards.
- Partner with in-house and contract team members to implement projects and operations related to AWS and Azure.
- Support operational duties for the platforms, including responding to requests and monitoring performance.
- Assist in the selection, configuration, and deployment of services across cloud platforms.
Qualifications:
- Must have a minimum of 2 years of experience as a Site Reliability Engineer.
- Experience with AWS services (EC2, ECS/EKS, RDS, S3, Lambda, CloudWatch, etc.).
- Experience with monitoring, alerting, and performance tuning (Datadog preferred).
- Experience with scripting and automation (e.g., Python, Bash, or Go).
- Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
- Solid understanding of networking, security, and system administration.
- Experience with CI/CD tools (e.g., Jenkins, GitHub Actions, GitLab CI).
- AWS certifications preferred (e.g., AWS Certified DevOps Engineer, Solutions Architect).
- Experience with other observability tools (e.g., Prometheus, Grafana, ELK) preferred.
- Candidate must be able to effectively communicate in English (written & verbal).
- Corp to corp candidates will not be considered.
LRS is an equal opportunity employer. Applicants for employment will receive consideration without unlawful discrimination based on race, color, religion, creed, national origin, sex, age, disability, marital status, gender identity, domestic partner status, sexual orientation, genetic information, citizenship status or protected veteran status.