✨ About The Role
- Work on building robust, debuggable, high-performance libraries to support distributed training workloads
- Profile, optimize, and design compute and data capabilities for scale
- Build and maintain tools used by researchers
- Deploy training frameworks to supercomputers, responding rapidly to the changing needs of ML systems
⚡ Requirements
- Experienced in working on large distributed systems and optimizing compute and data capabilities
- Strong software engineering skills with proficiency in Python
- Enjoys accelerating research and coming up with ideas to improve system performance
- Thrives in a fast-paced environment and can adapt to the evolving needs of ML system architectures
- Excited about deploying training frameworks to supercomputers and maintaining stability and performance