Staff Software Engineer - AI (DevOps)
Are you ready to partner closely with product, architecture, and engineering teams to define needs and technical strategy, lead research & development within the project life cycle, provide technical analysis and design, and support operations staff in executing, testing, and rolling out solutions? You will combine Staff-level software engineering with Lead-level DevOps/Platform expertise: constantly looking to optimize systems and services for security, automation, reliability, and performance/availability, while ensuring solutions adhere to architecture standards and organizational values. You will also help development teams use AI safely and effectively in their SDLC (e.g., GitHub Actions and other MCPS tooling), and drive best practices in AI/ML Ops.
About the Role
In this role as a Staff Software Engineer - AI (DevOps), you will:
- Architect and implement AI-driven solutions using agentic AI patterns, including MCP server architectures, orchestration workflows, and agentic pipelines.
- Design and operate scalable, secure, and cost-efficient AI platforms on cloud infrastructure (Azure and/or AWS) with Kubernetes as the primary runtime.
- Integrate LLMs, vector search, and retrieval-augmented generation (RAG) patterns using services such as Azure AI Foundry and Azure AI Search.
- Define and implement AI/ML Ops practices for model and pipeline lifecycle, including versioning, monitoring, evaluation, and governance.
- Plan, deploy, and maintain critical business applications and AI services in production and non-production cloud environments (Azure / AWS).
- Design and implement appropriate environments for those applications and services; engineer robust release management procedures and provide production support.
- Build and maintain CI/CD pipelines using MCPS tooling (e.g., Azure DevOps, Jenkins, GitHub Actions), including automation for building, testing, scanning, and deploying AI and non-AI workloads.
- Design and maintain infrastructure-as-code (e.g., Terraform, Bicep, Ansible) for cloud, Kubernetes, networking, and platform services.
- Develop and maintain agentic workflows that orchestrate tools, services, and data sources to support complex business processes.
- Use AI tools within the development lifecycle (e.g., AI-assisted code generation, GitHub Actions AI features, AI-driven test generation and triage) to increase velocity while maintaining quality and compliance.
- Collaborate with product and engineering teams to identify opportunities for AI automation in build, test, deployment, and operations workflows.
- Drive improvements to processes and design enhancements to automation to continuously improve production environments (reliability, observability, performance, cost).
- Perform daily system monitoring, verifying integrity and availability of services, reviewing system and application logs, and verifying completion of scheduled and automated tasks.
- Perform ongoing performance tuning, infrastructure upgrades, and resource optimization as required.
- Provide Tier II/Tier III support for incidents and requests from various constituencies; lead technical recovery for high-severity incidents impacting AI platforms and services.
- Establish and maintain monitoring, alerting, SLOs, and dashboards for AI services; contribute to disaster recovery planning and testing to ensure business continuity.
- Partner with security and compliance teams to ensure AI platforms and pipelines meet TR security, privacy, and governance standards, including access controls, data protection, and auditability.
- Provide leadership, technical support, user support, technical orientation, and technical education activities to project teams and staff across multiple locations.
- Influence broader technology groups in adopting cloud, Kubernetes, and AI technologies, processes, and best practices.
- Mentor and coach engineers (Dev, QA, DevOps, Data/ML) in modern DevOps, AI/ML Ops, and platform practices.
- Maintain and contribute to our knowledge base and documentation, including runbooks, design docs, and standards.
- Participate in and often lead technical design reviews, architecture decisions, and cross-team initiatives.
About You
You are a fit for the Staff Software Engineer - AI (DevOps) role if your background includes:
- 8+ years of overall software engineering / DevOps / platform engineering experience, including 3+ years in a Lead-level DevOps / Platform / SRE capacity, and 2+ years supporting AI-driven solutions at enterprise scale.
- Strong experience designing and operating solutions on cloud platforms (Azure and/or AWS), including core services such as compute, storage, networking, identity, and managed databases (e.g., RDS or Azure SQL), as well as services such as S3/CloudFront/CloudFormation or Azure equivalents where applicable.
- Hands-on expertise with Kubernetes and containerization (Docker), including building and deploying containerized workloads at scale; experience with managed Kubernetes (e.g., AWS EKS and/or Azure AKS).
- Deep knowledge and hands-on experience with CI/CD and MCPS tools, including at least two of: Azure DevOps (ADO), Jenkins, GitHub Actions, with a track record of planning, building, and deploying cloud-based solutions.
- Experience implementing and supporting MCP server architectures, orchestration workflows, and agentic pipelines in production environments.
- Demonstrated experience with AI/ML Ops concepts and tooling (e.g., model/pipeline versioning, evaluation, monitoring, rollout/rollback strategies).
- Strong scripting and programming skills, preferably in Python, Bash, and/or PowerShell; ability to build automation, tools, and integrations.
- Practical experience with Infrastructure as Code (e.g., Terraform, Bicep, Ansible) for provisioning and managing cloud and Kubernetes resources.
- Experience with Azure AI Foundry and Azure AI Search, or similar AI platform and vector search technologies.
- Solid understanding of Git, branching strategies, and GitOps principles and tools.
- Proven experience owning and operating continuous delivery / continuous deployment pipelines and production services, including monitoring, alerting, and incident response.
- Strong communication and collaboration skills, with experience influencing across teams and mentoring other engineers.
Nice to Have:
- Experience building and deploying .NET Core and/or Java-based solutions in cloud and Kubernetes environments.
- Strong understanding of API-first design and implementation, including secure, scalable APIs that integrate AI capabilities.
- Experience implementing comprehensive testing strategies (unit, integration, performance, chaos, and/or evaluation loops for AI systems) in a continuous deployment environment.
- Prior experience setting up monitoring tools (e.g., Prometheus, Grafana, CloudWatch, Azure Monitor, OpenTelemetry) and disaster recovery plans to ensure business continuity.
- Exposure to data and ML tooling (e.g., feature stores, experiment tracking, model registries) and how they integrate with CI/CD and production platforms.
- Experience working in regulated or compliance-sensitive environments (e.g., legal, tax, financial services) with attention to data protection and...
For full information, follow the application link. As a global business, we rely on diversity of culture and thought to deliver on our goals. To ensure we can do that, we seek talented, qualified employees in our operations around the world regardless of race, color, sex/gender, including pregnancy, gender identity and expression, national origin, religion, sexual orientation, disability, age, marital status, citizen status, veteran status, or any other protected classification under country or local law. Thomson Reuters is proud to be an Equal Employment Opportunity Employer providing a drug-free workplace.