✨ About The Role
- As a Distributed Systems/ML engineer at OpenAI, you will be improving the training throughput for internal training frameworks.
- You will enable researchers to experiment with new ideas by providing them with efficient tools and systems.
- The role involves designing, implementing, and optimizing state-of-the-art AI models.
- You will profile and optimize the training framework to achieve high hardware utilization.
- The job requires collaboration with researchers to develop systems-efficient video models and architectures.
- You will apply the latest techniques to the internal training framework to enhance performance.
- The position is based in San Francisco, CA, and follows a hybrid work model with 3 days in the office per week.
- Relocation assistance is offered to new employees moving to San Francisco for the role.
- The role is part of the Sora team at OpenAI, which focuses on making video a key capability of the foundation models while ensuring their reliability and safety.
⚡ Requirements
- You should have experience working with multi-modal machine learning pipelines and a passion for diving deep into systems implementations.
- Strong software engineering skills, particularly in Python, are essential for success in this role.
- You should be driven by performance optimization while keeping systems maintainable.
- Understanding and optimizing training kernels should be within your skill set.
- You must be passionate about ensuring stable training dynamics in AI systems.
- The ideal candidate holds their code to a high standard and strives to write correct, reliable machine learning code.
- You should have a deep understanding of distributed systems and of performance at supercomputer scale.
- Collaborating with researchers to develop efficient video models and architectures will be a key part of your job.
- You should be someone who enjoys working in a hybrid research and product team environment.