✨ About The Role
- The role involves working closely with machine learning researchers to build and maintain OpenAI's compute infrastructure
- Responsibilities include deploying large clusters on Azure using Kubernetes, and building an internal experiment platform for running and training large AI models
- The team works at the cutting edge of speed and scale, combining High-Performance Computing (HPC) traditions with a modern cloud-based, containerized environment
- The job requires experience in high-performance computing, open-source contributions, and running large Kubernetes clusters with GPU workloads
- The successful candidate will have a direct impact on the success of OpenAI and the field of AI as a whole
⚡ Requirements
- Experienced in designing, implementing, and running production services and highly available distributed systems
- Proficient in working with highly performant bare-metal systems and large Kubernetes clusters with GPU workloads
- Skilled in managing and monitoring large-scale infrastructure deployments and debugging problems across the stack
- Comfortable with bash, Terraform, Python, and/or Chef, with experience on cloud platforms such as Azure, AWS, or GCP
- Able to quickly obtain a deep technical understanding of new domains and identify important problems to solve
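To give a concrete sense of the "Kubernetes clusters with GPU workloads" mentioned above, here is a minimal illustrative Pod manifest that requests a single GPU. This is a generic sketch, not OpenAI's actual configuration: the Pod name, container image, and entrypoint are hypothetical, and it assumes a cluster with the NVIDIA device plugin installed so that `nvidia.com/gpu` is a schedulable resource.

```yaml
# Illustrative only: a minimal Pod requesting one NVIDIA GPU.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3   # assumed base image
      command: ["python", "train.py"]           # hypothetical entrypoint
      resources:
        limits:
          nvidia.com/gpu: 1         # schedules the Pod onto a GPU node
```

In practice, roles like this involve managing many such workloads at once, plus the node pools, device plugins, and monitoring around them.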