✨ About The Role
- The role involves maintaining a suite of hardware health tests and collaborating with researchers and the Supercomputing team to address hardware faults
- The team operates at a fast pace, providing individuals with a high degree of autonomy and the ability to affect change
- Prior hardware expertise is not required for this role, but a strong ability to dig into noisy data and identify problems is essential
- The Hardware Health team is incubated inside OpenAI's Research team, which focuses on training large-scale AI models of unprecedented capability
- The team's goal is to maximize supercomputing capacity for research and minimize the impact of hardware faults on researchers
âš¡ Requirements
- Ideal candidate would have a balance of building and operational skills, with a proven track record in attention to detail and a strong sense of ownership
- Must have excellent abilities in developing in python and shell scripting, as well as experience with SQL, PromQL, and Pandas for data analysis
- Prior experience in identifying and rectifying data anomalies, developing reproducible analyses, and building dashboards and visualizations is required
- Previous TL experience (2+ years) is preferred as this role involves team growth and a high degree of autonomy
- Bonus points for expertise and interest in low-level details of hardware components, protocols, and Linux tooling, or visualization of large data centers and networks