Staff Machine Learning Operations (MLOps) Engineer
This role is within the global XOps team (which includes MLOps, LLMOps, AgentOps, and DevOps), whose mission is to deliver a world-class AI/ML developer experience for our software engineers and data scientists. You will join a high-performance, mission-driven interdisciplinary team that spans data science, software engineering, product management, cloud architecture, and security expertise. We believe in a culture founded on trust, mutual respect, growth mindsets, and an obsession with building extraordinary products with extraordinary people.
Role Summary
As a Staff MLOps Engineer (individual contributor), you will bring deep technical expertise, handle complex initiatives end-to-end, and make decisions with broad impact beyond individual tasks. You will independently design and optimize complete systems, resolve technical issues through systematic analysis, and apply industry best practices and advanced methodologies for continuous improvement. You'll lead the development of major ML/AI operational features that span multiple aspects of the ML/AI developer experience, from infrastructure to pipelines, deployment, monitoring, governance, and cost/risk optimization.
Key Responsibilities
Operational Excellence
- Lead cross-team initiatives that elevate operational excellence across the entire ML/AI ecosystem.
- Define org-level MLOps standards, best practices, and architectural patterns adopted across multiple engineering groups.
- Drive long-range improvements to platform maturity through automation, standardization, and advanced operational engineering.
- Create predictive, systemic cloud and ML workload strategies that optimize cost, performance, and resource allocation across teams.
- Anticipate operational, architectural, and organizational risks, and implement durable solutions that are adopted across teams.
ML/AI Cloud Operations & Engineering
- Architect multi-team AWS ML/AI infrastructure and define long-term reference architectures for platform evolution.
- Establish enterprise-level governance for model lifecycle, provisioning, drift detection, and compliance.
- Define org-wide standards for GenAI/LLM/Agent evaluation and experimentation.
- Own Kubernetes strategy across ML/AI infrastructure, defining cluster topologies, orchestration patterns, and reliability standards.
- Build reusable IaC frameworks and Terraform/CDK modules adopted across multiple teams.
- Mentor engineers across orgs on optimization, reliability, distributed workflows, and ML system design.
- Partner with senior engineering, science, and product leadership to steer ML/AI platform roadmap and strategy.
Required Skills & Experience
- Recognized expert in ML/AI platforms and MLOps, with influence extending across orgs or business units.
- Proven ability to architect cross-team solutions and define long-term platform directions.
- Expertise in creating scalable model governance, registries, and compliance standards used org-wide.
- Deep experience with ML observability frameworks and defining SLO/SLA measurement strategies.
- Expert-level distributed systems experience, including Ray, large-scale LLM/Agent pipelines, and scalable RL systems.
- Mastery of IaC/GitOps frameworks and ability to build reusable multi-team infrastructure patterns.
- Leadership experience in defining Kubernetes and workflow orchestration strategies across orgs.
- Deep experience architecting enterprise ML/AI support for foundation models and multimodal AI.
- Strong ability to lead large multi-team roadmap initiatives, influence cross-functional decision-making, and set long-range architectural direction.
- Executive-level communication and influence skills.
- Strong mentoring orientation and ability to upskill teams across the organization.
Preferred Skills – Robotics, ROS, and Industrial ML
- Deep experience building or scaling ML infrastructure for robotics systems, including deployments across diverse environments or tasks.
- Expertise with ROS/ROS2, including architecting reusable ROS components, integrating ML inference pipelines, and supporting heterogeneous robot fleets.
- Experience designing ML systems that support adaptive, multi-skilled robots for environments where tasks change frequently (for example, modular, batch-of-one manufacturing).
- Experience integrating ML pipelines with robot perception, calibration, planning, or control systems.
- Strong background applying Vision-Language Models (VLMs) for industrial perception, scene understanding, or few-shot classification in robotics contexts.
- Experience designing ML infrastructure enabling low-data adaptation, task generalization, or rapid skill-transfer for robots.
- Familiarity with robotic simulation toolchains and workflows, including sim-to-real strategies and dataset generation pipelines.
- Experience influencing robotics-engineering teams and shaping system architecture that bridges manufacturing domain expertise with ML/AI capabilities.
- Demonstrated ability to architect platform-level solutions enabling scalable robotics ML deployment across multiple teams/products.