Senior Machine Learning Operations (MLOps) Engineer
Analog Devices, Inc. is a global semiconductor leader that bridges the physical and digital worlds to enable breakthroughs at the Intelligent Edge. ADI combines analog, digital, and software technologies into solutions that help drive advancements in digitized factories, mobility, and digital healthcare, combat climate change, and reliably connect humans and the world. With revenue of more than $9 billion in FY24 and approximately 24,000 people globally, ADI ensures today's innovators stay Ahead of What's Possible™.
This role is within the global XOps team (MLOps, LLMOps, AgentOps, and DevOps), whose mission is to deliver a world-class AI/ML developer experience for our software engineers and data scientists. You will join a high-performance, mission-driven interdisciplinary team that spans data science, software engineering, product management, cloud architecture, and security expertise. We believe in a culture founded on trust, mutual respect, growth mindsets, and an obsession for building extraordinary products with extraordinary people.
Role Summary
As a Senior MLOps Engineer (individual contributor), you will bring deep technical expertise, handle complex assignments end-to-end, and make decisions with broad impact beyond individual tasks. You will independently design and optimize complete systems, resolve technical issues through systematic analysis, and apply industry best practices and advanced methodologies for continuous improvement. You will lead the development of major ML/AI operational features that span multiple aspects of the ML/AI developer experience: infrastructure, pipelines, deployment, monitoring, governance, and cost/risk optimization.
Key Responsibilities
Operational Excellence
- Foster and contribute to a culture of operational excellence: high-performance, mission-focused, interdisciplinary collaboration, trust, and shared growth.
- Drive proactive capability and process enhancements to ensure enduring value creation, compounding analytical value, and operational maturity of the ML/AI platform.
- Design and implement resilient cloud-based ML/AI operational capabilities that advance our system attributes: learnability, flexibility, extensibility, interoperability, and scalability.
- Lead efforts to improve cost efficiency, system performance, and risk mitigation, applying data-driven strategy, comprehensive analytics, and predictive capabilities at both the "tree" (component) and "forest" (system) levels of our ML/AI workloads and processes.
ML/AI Cloud Operations & Engineering
- Architect and implement scalable AWS ML/AI cloud infrastructure to support the end-to-end lifecycle of models, agents, and services.
- Establish governance frameworks for ML/AI infrastructure management (e.g., provisioning, monitoring, drift detection, lifecycle management) and ensure compliance with industry-standard processes.
- Define and ensure principled validation pathways (testing, QA, evaluation) for early-stage GenAI/LLM/Agent-based proofs of concept across the organization.
- Lead and provide guidance on Kubernetes (k8s) cluster management for ML workflows, including choosing/implementing workflow orchestration solutions (e.g., Argo vs Kubeflow) and data-pipeline creation, management, and governance using tools such as Airflow.
- Design and develop infrastructure-as-code (IaC) in AWS CDK (Python) and/or Terraform, together with GitOps practices, to automate infrastructure deployment and management.
- Monitor, analyze, and optimize cloud infrastructure and ML/AI model workloads for scalability, cost-efficiency, reliability, and performance.
- Collaborate with engineering, product, science, design, security, and operations teams to translate business requirements into scalable ML/AI solutions, and ensure smooth integration into production systems.
Required Skills & Experience
- Deep understanding of the Data Science Lifecycle (DSLC) and proven ability to shepherd data science or ML/AI projects from inception through production within a platform architecture.
- Expertise in feature stores, model registries, model governance, and compliance frameworks specific to ML/AI (e.g., explainability, audit trails).
- Experience with monitoring tools for ML/AI (latency/throughput SLAs, model drift, resource usage dashboards).
- Experience with Ray for scaling end-to-end workflows, including data processing and modeling (training, tuning, serving); experience scaling reinforcement learning (RL) workloads is a plus.
- Expertise in infrastructure-as-code and GitOps practices, with demonstrable skills in Terraform, AWS CDK (Python), Argo CD, and/or other IaC and CI/CD systems.
- Hands-on experience managing Kubernetes clusters (for ML workloads) and designing/implementing ML workflow orchestration solutions and data pipelines (e.g., Argo, Kubeflow, Airflow).
- Solid understanding of foundation models (LLMs) and their applications in enterprise ML/AI solutions.
- Strong background in AWS DevOps practices and cloud architecture, including AWS services such as Bedrock, SageMaker, EC2, S3, RDS, Lambda, and managed MLflow. Hands-on design and implementation of microservices architectures, APIs, and database management (SQL/NoSQL).
- Proven track record of monitoring and optimizing cloud/ML infrastructure for scalability and cost-efficiency.
- Excellent verbal and written communication skills: able to report findings, document designs, articulate trade-offs, and influence cross-functional stakeholders.
- Demonstrated ability to manage large-scale, complex projects across an organization, and lead development of major features with broad impact.
- Customer-obsessed mindset and a passion for building products that solve real-world problems, combined with high organization, diligence, and ability to juggle multiple initiatives and deadlines.
- Collaborative mindset: ability to foster positive team culture where creativity and innovation thrive.
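To give candidates a concrete flavor of the model-drift monitoring mentioned above, here is a minimal illustrative sketch (not ADI production code); the function name, threshold, and z-test approach are assumptions chosen for brevity, and real systems typically use richer statistics and tooling:

```python
import statistics

def drift_score(baseline, live, threshold=3.0):
    """Illustrative drift check: flag drift when the live sample mean
    shifts more than `threshold` standard errors from the baseline mean
    (a simple z-test on one numeric model input or output)."""
    base_mean = statistics.fmean(baseline)
    base_sd = statistics.stdev(baseline)
    live_mean = statistics.fmean(live)
    # Standard error of the live sample mean under the baseline spread.
    se = base_sd / len(live) ** 0.5
    z = abs(live_mean - base_mean) / se
    return z, z > threshold
```

In practice a check like this would run per feature on a schedule and feed the resource-usage and SLA dashboards described earlier.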