
Data Center Systems Operations Engineer

Ensure maximum data center availability and reliability for AI hardware infrastructure
San Francisco, California, United States
Expert
$250,000 – $300,000 USD / year
Lambda

A provider of high-performance GPUs for deep learning and AI research, as well as cloud services for machine learning workloads.

Systems Operations Engineer

We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be.

If you'd like to build the world's best deep learning cloud, join us.

*Note: We prefer that this role be based in one of our Bay Area offices, but we are open to remote work for the right candidate.

About the Job

As Lambda continues to scale its AI platform and customer base, infrastructure decisions must be tightly aligned with product roadmaps, platform growth, and fiscal discipline. The Systems Operations Engineer will own availability analysis, long-term improvement of utilization, input into strategic design, and implementation of key programs across the entire Infrastructure Stack.

This role sits within the Data Center Infrastructure (DC Infra) team and will work cross-functionally with Product, Platform Engineering, and Observability to understand overall health, analyze ongoing and potential issues, recommend and make changes to our overall design, and own key programs to improve the overall business.

This position is a critical link between HPC/HW systems and DC Infra, and will help ensure our designs and operations maximize availability and reliability across our entire Platform.

What You'll Do

Availability Analysis

  • Own end-to-end unification of availability (number of 9s) calculations across Lambda's data center products and footprints, from power/BMS/cooling systems down to the rack/GPU level, and provide adequate telemetry back to facilities, site operations, and the platform.
  • Work with the thermal/hardware team to understand the impact of AI workload characteristics on mechanical systems and the need for different BMS control methodologies as direct-to-chip liquid cooling (DLC) technologies improve and densities increase.
  • Coordinate across DC Infra team to calculate estimated availabilities for new data center designs.
  • Work with product teams and capacity forecasting to understand how design decisions that affect availability impact time to market and our ability to satisfy customer needs.

Utilization Analysis and Oversubscription Strategy

  • Own end-to-end utilization analysis across Lambda's entire data center infrastructure.
  • Analyze DC designs to understand peak possible capacity under varying conditions.
  • Build an oversubscription strategy and lead/own the company workstream to maximize available MW without impacting GPU reliability or customer experience.
  • Ensure appropriate availability considerations are included.

Observability and Analytics

  • Coordinate with the observability team to ensure appropriate points are monitored to understand data center load characteristics, especially under AI workloads.
  • Help the team understand where approximate warning/danger levels are.
  • Use observations and warning/danger levels to inform the basis of design (BOD) for future data centers and suggest upgrades/modifications to current data centers.
  • Develop strategy for a data center fleet health dashboard.
  • Help provide structure so that overall day-to-day and long-term health can be understood at a 20,000-foot level, with the ability to drill down into the details.

Power Capping Strategy and Implementation

  • Coordinate with the Site Operations team to strategize and build out power capping capabilities for worst-case scenario response and protection as we begin aggressively employing oversubscription.
  • Identify appropriate IT blocks where real-time data is monitored.
  • Analyze, propose, and implement a rigorous testing process that iteratively finds and eliminates stranded power and cooling capacity related to utilization.

Site Selection Technical Review

  • Conduct end-to-end technical evaluations of prospective data center sites, including power sufficiency and stability, cooling infrastructure and mechanical systems, and network topology feasibility.
  • Perform risk assessments and recommend sites based on infrastructure fit and growth capacity.
  • Coordinate with DC Infra, Legal, and Business Strategy teams to ensure site selections align with workload and deployment timelines.

Cluster-to-Facility Requirements Alignment

  • Collaborate with HPC Architecture team and Capacity Manager to translate cluster-level hardware and workload requirements into facility-level specifications.
  • Define infrastructure interface requirements (power, cooling, rack layouts, interconnects, monitoring) to ensure alignment between compute stack and facility capabilities.
  • Support long-term infrastructure roadmap development to accommodate future hardware designs, density shifts, and workload patterns.
  • Work with Capacity Manager to understand various levers that can be employed to accelerate growth during demand surges.

You

  • Self-starter with a proven ability to independently dive into the details to understand and solve hard problems across data center infrastructure and operations.
  • Ability to provide world-class analysis, boiling complex issues down to the root cause or a few key drivers.
  • 10+ years of experience working directly in, or closely with, data center infrastructure and HPC/HW operations.
  • Deep familiarity with AI or compute workload patterns, scaling dynamics, and infrastructure cost drivers.
  • Ability to synthesize complex technical and business inputs into clear, actionable strategic recommendations.
  • Excellent communication and collaboration skills across technical, operational, and financial stakeholders.

Preferred Experience

  • Prior experience in hyperscale or cloud infrastructure environments.
  • Familiarity with GPU cluster sizing, workload forecasting, or energy-efficient compute architectures.
  • Working knowledge of typical data center infrastructure designs, topologies, systems, and associated reliability/availability calculations.
  • Knowledge of DCIM tools, telemetry systems, or utilization analytics platforms.
  • Engineering degree from a university; Master's preferred.
  • Experience working across multi-disciplinary and non-technical teams to explain findings.

Salary Range Information

The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda

  • Founded in 2012, ~400 employees (2025) and growing fast.
  • We offer generous cash & equity compensation.
  • Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
  • We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability.
  • Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG.
  • Health, dental, and vision coverage for you and your dependents.
  • Wellness and Commuter stipends for select roles.
  • 401(k) plan with 2% company match (USA employees).
  • Flexible Paid Time Off Plan that we all actually use.

A Final Note:

You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer

Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
