✨ About The Role
- Design and build distributed systems used to train next-generation models
- Focus on building systems that distribute work efficiently across massive GPU clusters
- Design and implement methods to make the training stack more efficient and scale it up to next-generation supercomputers
- Implement methods to robustly train models in the presence of hardware failures
- Build tooling to improve understanding of problems in the largest training jobs
⚡ Requirements
- Experienced software engineer with a background in high performance computing and low-level systems
- Passionate about building stable and highly efficient distributed systems
- Enjoys delving into low-level details about performance optimization
- Thrives in designing and implementing methods to make training stacks more efficient and scalable
- Comfortable working on massive GPU clusters and designing systems to distribute work efficiently