Senior Cloud Infrastructure Engineer

Own deployment pipelines for multi-node GPU clusters powering autonomous driving AI workloads
Mountain View, California, United States
Senior
$180,000 – 240,000 USD / year
Gatik AI

Provides autonomous middle-mile logistics solutions using self-driving trucks to enable efficient, short-haul goods transportation for retailers and distributors.

Gatik, the leader in autonomous middle-mile logistics, is revolutionizing the B2B supply chain with its autonomous transportation-as-a-service (ATaaS) solution and prioritizing safe, consistent deliveries while streamlining freight movement by reducing congestion. The company focuses on short-haul, B2B logistics for Fortune 500 retailers and in 2021 launched the world's first fully driverless commercial transportation service with Walmart. Gatik's Class 3-7 autonomous trucks are commercially deployed across major markets, including Texas, Arkansas, and Ontario, Canada, driving innovation in freight transportation.

The company's proprietary Level 4 autonomous technology, Gatik Carrier™, is custom-built to transport freight safely and efficiently between pick-up and drop-off locations on the middle mile. With robust capabilities in both highway and urban environments, Gatik Carrier™ serves as an all-encompassing solution that integrates advanced software and hardware powering the fleet, facilitating effortless integration into customers' logistics operations.

About the Role

We are seeking a Senior Cloud Infrastructure Engineer to architect and manage the large-scale compute and data infrastructure powering our autonomous driving stack. While researchers develop perception, planning, and world models, your mission is to build the high-performance systems and pipelines that make their work possible. You will be the backbone of our AI platform, ensuring that multi-GPU clusters, distributed training frameworks, and automated workflows are scalable, resilient, and cost-effective. This role is onsite 5 days a week at our Mountain View, CA office!

What You'll Do

  • Cloud-Native Orchestration & Kubernetes
    • Advanced K8s Management: Architect and maintain mission-critical Kubernetes clusters optimized for heavy GPU/TPU workloads.
    • GPU Scheduling: Implement and optimize Kubernetes-native GPU scheduling (NVIDIA GPU Operator) to ensure maximum hardware utilization.
    • Infrastructure as Code: Drive the "Everything as Code" philosophy using Terraform, Helm, and cloud-native tools.
    • Self-Healing Infrastructure: Deploy Autonomous AI Agents (LangGraph, CrewAI) to monitor cluster health and enable automated triage of hardware failures and NCCL timeouts.
  • Data Engineering & CI/CD Pipelines
    • Autonomy Data Pipelines: Build large-scale pipelines using Apache Airflow, Kafka, and Spark to process raw sensor data into training-ready formats.
    • GitOps: Implement robust GitOps workflows using Argo CD and GitLab CI/CD to automate the deployment of both infrastructure and model artifacts.
    • Observability: Maintain deep visibility into infrastructure health and model serving performance using Prometheus, Grafana, and OpenTelemetry.
    • Agentic DevOps & CI/CD: Develop agent-driven workflows to optimize the developer experience, such as automated PR reviewers for Terraform and AI agents that proactively suggest Kubernetes resource-limit adjustments based on model training telemetry.
  • Model Management & Lifecycle (MLOps)
    • Experiment & Model Tracking: Design and maintain MLflow and feature store integrations to provide a robust system of record for every model iteration.
    • Workflow Automation: Build complex, automated model lifecycles using Airflow and Kubernetes to streamline the transition from training to simulation.
    • High-Performance Serving: Support the deployment of models into simulation and production environments using Triton Inference Server, Ray Serve, and ONNX Runtime.
  • Distributed Training & ML Systems Support
    • Training Systems Support: Enable researchers to scale models (VLA, World Models) across multi-node setups using PyTorch Distributed (TorchElastic), Ray Train, and Horovod.
    • Networking Optimization: Optimize low-level communication (e.g., NCCL tuning, InfiniBand, or RoCE v2) to minimize latency for 3D Gaussian Splatting (3DGS) and large-scale training.
    • Hardware-Aware Orchestration: Partner with researchers to fine-tune performance across multi-node GPU clusters for FSDP and DeepSpeed workloads.

What We're Looking For

  • Experience: 5+ years in Cloud Infrastructure, DevOps, or MLOps supporting high-scale compute environments.
  • Kubernetes Mastery: Deep expertise in K8s, Helm, and container orchestration.
  • Orchestration & Tooling: Strong background in Apache Airflow, Argo Workflows, MLflow, and Terraform.
  • Distributed Systems: Practical experience supporting frameworks like Ray and PyTorch Distributed.
  • Core Skills: Proficiency in Python, Bash scripting, and a solid understanding of IAM/RBAC.

Bonus Qualifications

  • Distributed Training Expertise: Deep understanding of FSDP and DeepSpeed.
  • AI Agent Orchestration: Experience building Agentic Workflows (LangGraph, AutoGen) for infrastructure automation or data curation.
  • Advanced Protocols: Familiarity with Model Context Protocol (MCP) to connect AI agents with infrastructure tools.

Salary Range: $180,000 – $240,000

More About Gatik

Founded in 2017 by experts in autonomous vehicle technology, Gatik has rapidly expanded its presence to Mountain View, Dallas-Fort Worth, Arkansas, and Toronto. As the first and only company to achieve fully driverless middle-mile commercial deliveries, Gatik holds a unique and defensible position in the AV industry, with a clear trajectory toward sustainable growth and profitability.

We have delivered complete, proprietary AV technology, an integration of software and hardware, to enable earlier successes for our clients in constrained Level 4 autonomy. By choosing the middle mile, with its defined point-to-point delivery, we have simplified some of the more complex AV challenges, enabling us to achieve full autonomy ahead of competitors. Given our extensive knowledge of Gatik's well-defined, fixed-route ODDs and hybrid architecture, we are able to hyper-optimize our models with exponentially less data, establish gate-keeping mechanisms to maintain explainability, and ensure the continued safety of the system for unmanned operations.

Engineering