✨ About The Role
- The Networking Engineer will design, manage, and optimize WAN and LAN infrastructure for OpenAI's supercomputers.
- Responsibilities include developing and maintaining data collection and monitoring systems to ensure network visibility and performance.
- The role involves troubleshooting and resolving network issues, including TCP/IP, BGP, and physical problems.
- Automating network issue detection and resolution to reduce operational overhead is a key task.
- Collaboration with hardware and systems engineers is necessary to meet the performance demands of distributed AI training workloads.
- The position requires writing code to instrument and observe the network, primarily using Python and some Rust.
- The role is based in San Francisco, CA, with a hybrid work model of 3 days per week in the office.
âš¡ Requirements
- The ideal candidate will have over 5 years of experience in networking or related infrastructure roles.
- Strong expertise in networking technologies, protocols, and design principles is essential for success in this position.
- Hands-on experience with troubleshooting complex networking issues in both LAN and WAN environments is crucial.
- A deep understanding of setting up TCP/IP networks from scratch, including BGP and ECMP routing, is required.
- Familiarity with optical connectors and optical circuit switches (OCS) will be beneficial.
- The candidate should possess excellent problem-solving and communication abilities, enabling effective collaboration across teams.
- A commitment to continuous learning and ownership of problems from end-to-end is expected.
- Experience with InfiniBand, RoCE, or RDMA in high-performance computing environments is a plus.