Software Engineer, GPU Infrastructure
Company: OpenAI
Location: San Francisco
Posted on: May 4, 2025
Job Description:
This role will support the fleet infrastructure
team at OpenAI. The fleet team focuses on running the world's
largest, most reliable, and frictionless GPU fleet to support
OpenAI's general-purpose model training and deployment. Work on
this team ranges from:
- Maximizing GPUs doing useful work by building user-friendly
scheduling and quota systems
- Running a reliable and low maintenance platform by building
push-button automation for kubernetes cluster provisioning and
upgrades
- Supporting research workflows with service frameworks and
deployment systems
- Ensuring fast model startup times through high-performance
snapshot delivery across blob storage down to hardware caching
- Much more!

About the Role

As an engineer within Fleet
infrastructure, you will design, write, deploy, and operate
infrastructure systems for model deployment and training on one of
the world's largest GPU fleets. The scale is immense, the timelines
are tight, and the organization is moving fast; this is an
opportunity to shape a critical system in support of OpenAI's
mission to advance AI capabilities responsibly.

This role is based in San Francisco, CA. We use a hybrid work model
of 3 days in the office per week and offer relocation assistance to
new employees.

In this role, you will:
- Design, implement and operate components of our compute fleet
including job scheduling, cluster management, snapshot delivery,
and CI/CD systems.
- Interface with researchers and product teams to understand
workload requirements
- Collaborate with hardware, infrastructure, and business teams
to provide a high-utilization and high-reliability service

You might thrive in this role if you:
- Have experience with hyperscale compute systems
- Possess strong programming skills
- Have experience working in public clouds (especially
Azure)
- Have experience working in Kubernetes
- Have an execution-focused mentality paired with a rigorous focus on
user requirements
- As a bonus, have an understanding of AI/ML workloads

About OpenAI

OpenAI is an AI research and deployment company dedicated to
ensuring that general-purpose artificial intelligence benefits all
of humanity. We push the boundaries of the capabilities of AI
systems and seek to safely deploy them to the world through our
products. AI is an extremely powerful tool that must be created
with safety and human needs at its core, and to achieve our
mission, we must encompass and value the many different
perspectives, voices, and experiences that form the full spectrum
of humanity.

We are an equal opportunity employer and do not
discriminate on the basis of race, religion, national origin,
gender, sexual orientation, age, veteran status, disability or any
other legally protected status.

OpenAI Affirmative Action and Equal Employment Opportunity Policy Statement

For US Based Candidates:
Pursuant to the San Francisco Fair Chance Ordinance, we will
consider qualified applicants with arrest and conviction records.

We are committed to providing reasonable accommodations to applicants
with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial
intelligence has the potential to help people solve immense global
challenges, and we want the upside of AI to be widely shared. Join
us in shaping the future of technology.

Compensation: $325K - $590K + Offers Equity