
Information Technology - Site Reliability Engineer

Develop automated tools for incident auto-remediation and system monitoring
St. Louis, Missouri, United States
Senior
17 hours ago
Apex Systems


A staffing and services firm specializing in the delivery of IT professionals for contract, contract-to-hire, and direct placements.



Function: Technology Engineering and Service Operations

Responsible for the provisioning of technology infrastructure and enabling services for the enterprise. Ensures the design, build and run of our technology platforms deliver for both our external and internal customers in an efficient manner while appropriately managing associated risks.

Configures software to automate consumable services for infrastructure and applications using state-of-the-art DevOps principles. Empowers development teams through the introduction, development and/or maintenance of efficient tools and processes. Ensures continuous, high velocity delivery and automated deployment through the use of software provisioning, configuration management, source code management and/or team collaboration applications. May develop own automation scripts and procedures in a variety of scripting languages.

Key Responsibilities:

  • Work in a DevOps environment responsible for the building and running of large-scale, massively distributed, fault-tolerant systems.
  • Work closely with development and operations teams to build highly available, cost-effective systems with extremely high uptime metrics.
  • Work with the cloud operations team to resolve trouble tickets, develop and run scripts, and troubleshoot issues.
  • Create new tools and scripts for auto-remediation of incidents, and establish end-to-end monitoring and alerting on all critical aspects.
  • Build infrastructure as code (IaC) patterns that meet security and engineering standards using one or more technologies (Terraform, scripting with cloud CLI, and programming with cloud SDK).
  • Participate in a team of first responders in a 24/7 environment, using a follow-the-sun operating model for incident and problem management.

Key Skills:

  • DevOps - Uses knowledge of DevOps operational practices and applies engineering skills to improve resilience of products/services. Designs, codes, verifies, tests, documents, modifies programs/scripts and integrated software services. Applies agreed SRE standards and tools to achieve a well-engineered result.
  • Systems Thinking - Uses knowledge of best practices and how systems integrate with others to improve their own work. Understands technology trends and uses that knowledge to identify factors that achieve the defined expectations of systems availability.
  • Operational Excellence - Prioritizes and organizes one's own work. Monitors and measures systems against key metrics to ensure availability of systems. Identifies new ways of working to make processes run smoother and faster.
  • Troubleshooting - Applies a methodical approach to routine issue definition and resolution. Monitors actions to investigate and resolve problems in systems, processes and services. Determines problem fixes/remedies. Assists with the implementation of agreed remedies and preventative measures. Analyzes patterns and trends.
  • Technical Communication/Presentation - Explains technical information and the impacts to stakeholders and articulates the case for action. Demonstrates strong written and verbal communication skills.

Experience:

  • BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), or equivalent job experience required
  • 4-7 years of experience in software engineering, systems administration, database administration, and networking.
  • 2+ years of experience developing and/or administering software in public cloud
  • Experience in monitoring infrastructure and application uptime and availability to ensure functional and performance objectives are met.
  • Experience in languages such as Python, Bash, Java, Go, JavaScript, and/or Node.js
  • Demonstrable cross-functional knowledge with systems, storage, networking, security and databases
  • System administration skills, including automation and orchestration of Linux/Windows using Terraform and containers (Docker, Kubernetes, etc.)
  • Proficiency with continuous integration and continuous delivery tooling and practices
  • Cloud certification strongly preferred

EEO Employer

Apex Systems is an equal opportunity employer. We do not discriminate or allow discrimination on the basis of race, color, religion, creed, sex (including pregnancy, childbirth, breastfeeding, or related medical conditions), age, sexual orientation, gender identity, national origin, ancestry, citizenship, genetic information, registered domestic partner status, marital status, disability, status as a crime victim, protected veteran status, political affiliation, union membership, or any other characteristic protected by law. Apex will consider qualified applicants with criminal histories in a manner consistent with the requirements of applicable law. If you have visited our website in search of information on employment opportunities or to apply for a position, and you require an accommodation in using our website for a search or application, please contact our Employee Services Department at employeeservices@apexsystems.com.
