Senior Cloud Infrastructure Engineer
Gatik, the leader in autonomous middle-mile logistics, is revolutionizing the B2B supply chain with its autonomous transportation-as-a-service (ATaaS) solution and prioritizing safe, consistent deliveries while streamlining freight movement by reducing congestion. The company focuses on short-haul, B2B logistics for Fortune 500 retailers and in 2021 launched the world's first fully driverless commercial transportation service with Walmart. Gatik's Class 3-7 autonomous trucks are commercially deployed across major markets, including Texas, Arkansas, and Ontario, Canada, driving innovation in freight transportation.
The company's proprietary Level 4 autonomous technology, Gatik Carrierâ„¢, is custom-built to transport freight safely and efficiently between pick-up and drop-off locations on the middle mile. With robust capabilities in both highway and urban environments, Gatik Carrierâ„¢ serves as an all-encompassing solution that integrates advanced software and hardware powering the fleet, facilitating effortless integration into customers' logistics operations.
About the Role
We are seeking a Senior Cloud Infrastructure Engineer to architect and manage the large-scale compute and data infrastructure powering our autonomous driving stack. While researchers develop perception, planning, and world models, your mission is to build the high-performance systems and pipelines that make their work possible. You will be the backbone of our AI platform, ensuring that multi-GPU clusters, distributed training frameworks, and automated workflows are scalable, resilient, and cost-effective. This role is onsite 5 days a week at our Mountain View, CA office!
What You'll Do
- Cloud-Native Orchestration & Kubernetes
- Advanced K8s Management: Architect and maintain mission-critical Kubernetes clusters optimized for heavy GPU/TPU workloads.
- GPU Scheduling: Implement and optimize Kubernetes-native GPU scheduling (NVIDIA GPU Operator) to ensure maximum hardware utilization.
- Infrastructure as Code: Drive the "Everything as Code" philosophy using Terraform, Helm, and cloud-native tools.
- Self-Healing Infrastructure: Deploy Autonomous AI Agents (LangGraph, CrewAI) to monitor cluster health and enable automated triage of hardware failures and NCCL timeouts.
- Data Engineering & CI/CD Pipelines
- Autonomy Data Pipelines: Build large-scale pipelines using Apache Airflow, Kafka, and Spark to process raw sensor data into training-ready formats.
- GitOps: Implement robust GitOps workflows using ArgoCD, Gitlab CI/CD to automate the deployment of both infrastructure and model artifacts.
- Observability: Maintain deep visibility into infrastructure health and model serving performance using Prometheus, Grafana, and OpenTelemetry.
- Agentic DevOps & CI/CD: Develop agent-driven workflows to optimize the developer experience, such as automated PR reviewers for Terraform and AI agents that proactively suggest Kubernetes resource-limit adjustments based on model training telemetry.
- Model Management & Lifecycle (MLOps)
- Experiment & Model Tracking: Design and maintain MLFlow and feature store integrations to provide a robust system of record for every model iteration.
- Workflow Automation: Build complex, automated model lifecycles using Airflow and Kubernetes to streamline the transition from training to simulation.
- High-Performance Serving: Support the deployment of models into simulation and production environments using Triton Inference Server, Ray Serve, and ONNX Runtime.
- Distributed Training & ML Systems Support
- Training Systems Support: Enable researchers to scale models (VLA, World Models) across multi-node setups using PyTorch Distributed (TorchElastic), Ray Train, and Horovod.
- Networking Optimization: Optimize low-level communication (e.g., NCCL tuning, InfiniBand, or RoCE v2) to minimize latency for 3D Gaussian Splatting (3DGS) and large-scale training.
- Hardware-Aware Orchestration: Partner with researchers to fine-tune performance across multi-node GPU clusters for FSDP and DeepSpeed workloads.
What We're Looking For
- Experience: 5+ years in Cloud Infrastructure, DevOps, or MLOps supporting high-scale compute environments.
- Kubernetes Mastery: Deep expertise in K8s, Helm, and container orchestration.
- Orchestration & Tooling: Strong background in Apache Airflow, Argo Workflows, MLFlow, and Terraform.
- Distributed Systems: Practical experience supporting frameworks like Ray and PyTorch Distributed.
- Core Skills: Proficiency in Python, Bash scripting, and a solid understanding of IAM/RBAC.
Bonus Qualifications
- Distributed Training Expertise: Deep understanding of FSDP, and DeepSpeed.
- AI Agent Orchestration: Experience building Agentic Workflows (LangGraph, AutoGen) for infrastructure automation or data curation.
- Advanced Protocols: Familiarity with Model Context Protocol (MCP) to connect AI agents with infrastructure tools.
Salary Range - $180,000- $240,000
More About Gatik
Founded in 2017 by experts in autonomous vehicle technology, Gatik has rapidly expanded its presence to Mountain View, Dallas-Fort Worth, Arkansas, and Toronto. As the first and only company to achieve fully driverless middle-mile commercial deliveries, Gatik holds a unique and defensible position in the AV industry, with a clear trajectory toward sustainable growth and profitability.
We have delivered complete, proprietary AV technology - an integration of software and hardware - to enable earlier successes for our clients in constrained Level 4 autonomy. By choosing the middle mile – with defined point-to-point delivery, we have simplified some of the more complex AV challenges, enabling us to achieve full autonomy ahead of competitors. Given extensive knowledge of Gatik's well-defined, fixed route ODDs and hybrid architecture, we are able to hyper-optimize our models with exponentially less data, establish gate-keeping mechanisms to maintain explainability, and ensure continued safety of the system for unmanned operations.