Senior AI Site Reliability Engineer
As a Senior AI Site Reliability Engineer, you will bring an AI-first mindset to solving classic reliability challenges. You'll design, prototype, and deploy intelligent automation that improves observability, incident response, performance tuning, and operational efficiency across SQUIRE's platform. This role is highly cross-functional: you'll collaborate with engineering, infrastructure, and product teams to identify where AI can create leverage, then build and scale those solutions into production.
Reports To: Senior Director, Platform Engineering
Job Duties And Responsibilities:
- Develop and deploy AI/ML-driven solutions for monitoring, anomaly detection, and predictive alerting to improve system reliability and reduce MTTR.
- Use AI techniques to optimize capacity planning, autoscaling, and resource utilization across distributed systems.
- Automate repetitive operational tasks with intelligent agents and large-scale data analysis.
- Integrate LLMs and generative AI into incident response, post-mortem analysis, and business continuity.
- Partner with platform and product engineering teams to embed AI-based observability into services from the ground up.
- Continuously evaluate new AI/ML methods and tools to expand SQUIRE's AI-driven SRE capabilities.
- Drive a culture of experimentation: build prototypes, run pilots, measure results, and productionize what works.
- Mentor engineers on applying AI approaches to reliability problems; help establish standards and best practices.
Requirements And Qualifications:
- 5+ years of experience in Site Reliability Engineering, DevOps, or related roles.
- Proven experience using AI/ML (supervised learning, anomaly detection, LLMs, etc.) to solve operational or reliability problems.
- Strong background in distributed systems, cloud infrastructure (AWS preferred), and container orchestration (Docker, ECS, Elastic Beanstalk).
- Proficiency with observability stacks (Datadog, Sentry, Prometheus, etc.).
- Solid programming/scripting skills in Python, Go, or similar, with experience integrating ML/AI libraries and APIs.
- Hands-on experience with automation frameworks and infrastructure as code (Terraform, CloudFormation, etc.).
- Excellent analytical and problem-solving skills, with the ability to innovate in operational domains.
- Strong communication and collaboration skills across technical and non-technical stakeholders.
- English proficiency is a must: you will work with English-speaking coworkers and need to communicate your ideas clearly.
- Must be based in Buenos Aires.
- Availability to work on-site in our office in CABA two days a week (Tuesdays and Thursdays).
Nice To Have:
- Familiarity with generative AI/LLM deployment (e.g., for operational assistants, automated runbooks).
- Experience with predictive scaling, proactive fault detection, or automated incident management systems.
- Contributions to AIOps/MLOps tooling or open-source reliability projects.
- Background in applying AI to security operations or compliance monitoring.
SQUIRE is committed to working with and providing reasonable assistance to individuals with physical and mental disabilities. If you are an individual with a disability requiring an accommodation to apply for an open position, please email your request to recruiting@getsquire.com and someone on our team will respond to your request.
SQUIRE provides equal employment opportunities to all employees and applicants for employment and prohibits discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws. This applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation, and training.
SQUIRE will not discharge or in any other manner discriminate against employees or applicants because they have inquired about, discussed, or disclosed their own pay or the pay of another employee or applicant. However, employees who have access to the compensation information of other employees or applicants as a part of their essential job functions cannot disclose the pay of other employees or applicants to individuals who do not otherwise have access to compensation information, unless the disclosure is (a) in response to a formal complaint or charge, (b) in furtherance of an investigation, proceeding, hearing, or action, including an investigation conducted by the employer, or (c) consistent with the contractor's legal duty to furnish information.