View All Jobs 1402

Software Engineer, Hardware Health

Develop a comprehensive suite of hardware health tests for custom-built hyperscale supercomputers.
San Francisco Bay Area
Senior
$295,000 - 440,000 USD / year
6 months ago

✨ About The Role

- The role involves maintaining a suite of hardware health tests and collaborating with researchers and the Supercomputing team to address hardware faults - The team operates at a fast pace, providing individuals with a high degree of autonomy and the ability to affect change - Prior hardware expertise is not required for this role, but a strong ability to dig into noisy data and identify problems is essential - The Hardware Health team is incubated inside OpenAI's Research team, which focuses on training large-scale AI models of unprecedented capability - The team's goal is to maximize supercomputing capacity for research and minimize the impact of hardware faults on researchers

âš¡ Requirements

- Ideal candidate would have a balance of building and operational skills, with a proven track record in attention to detail and a strong sense of ownership - Must have excellent abilities in developing in python and shell scripting, as well as experience with SQL, PromQL, and Pandas for data analysis - Prior experience in identifying and rectifying data anomalies, developing reproducible analyses, and building dashboards and visualizations is required - Previous TL experience (2+ years) is preferred as this role involves team growth and a high degree of autonomy - Bonus points for expertise and interest in low-level details of hardware components, protocols, and Linux tooling, or visualization of large data centers and networks
+ Show Original Job Post
























Software Engineer, Hardware Health
San Francisco Bay Area
$295,000 - 440,000 USD / year
Engineering
About OpenAI
Building artificial general intelligence