Applied Research Engineer – Training Infra

Company: Snorkel AI
Location: San Francisco
Posted on: April 1, 2026

Job Description:

ABOUT SNORKEL

At Snorkel, we believe meaningful AI doesn’t start with the model; it starts with the data. We’re on a mission to help enterprises transform expert knowledge into specialized AI at scale. The AI landscape has changed enormously from 2015, when Snorkel started as a research project in the Stanford AI Lab, to the generative AI breakthroughs of today. But one thing has remained constant: the data you use to build AI is the key to achieving differentiation, high performance, and production-ready systems. We work with some of the world’s largest organizations to empower scientists, engineers, financial experts, product creators, journalists, and more to build custom AI with their data faster than ever before. Excited to help us redefine how AI is built? Apply to be the newest Snorkeler!

THE ROLE

As an Applied Research Engineer at Snorkel AI, you will own the infrastructure that powers our model training and evaluation work. This is a hands-on role where you will build and operate GPU cluster infrastructure, training pipelines, and the tooling that allows our research and engineering teams to run experiments reliably and at scale. You will work closely with research scientists and engineers, translating training requirements into robust, reproducible systems and proactively removing infrastructure blockers before they slow down the work that matters most.

Snorkel AI operates in a fast-paced, high-impact environment. We are looking for someone who takes pride in operational excellence, loves solving complex distributed systems problems, and thrives when given real ownership.

Location: Redwood City, San Francisco, or remote

MAIN RESPONSIBILITIES

- Set up and manage GPU cluster infrastructure on major cloud providers (e.g., AWS HyperPod) for distributed model training, including networking, provisioning, and cost tracking.
- Build and operate job orchestration and scheduling systems (e.g., Kubernetes, Slurm, or cloud-native equivalents) to reliably launch and manage training, rollout, and evaluation jobs across multi-node clusters.
- Integrate and maintain ML training frameworks and post-training pipelines, ensuring they run stably and reproducibly at scale.
- Set up and maintain experiment tracking, dataset versioning, and model artifact management to support fast iteration.
- Monitor and optimize cluster health, inter-node communication, and resource utilization; implement fault tolerance and auto-recovery so long-running jobs survive node failures.
- Work closely with research scientists and ML engineers to understand requirements, unblock experiments, and evolve infrastructure as our training workload needs change.

PREFERRED QUALIFICATIONS

- Hands-on experience managing GPU clusters on major cloud providers, including provisioning, network configuration, and cost management.
- Experience with distributed compute orchestration tools such as Kubernetes, Slurm, or equivalent cluster management systems.
- Working knowledge of distributed training concepts: parallelism strategies, memory optimization techniques, and inter-node communication.
- Experience setting up, managing, and integrating ML experiment tracking and data/model versioning tools.
- Strong Python proficiency and solid software engineering fundamentals such as version control, modular design, and automation.
- Ability to work in a fast-moving, iterative environment and take end-to-end ownership of ambiguous infrastructure problems.
- Hands-on experience with post-training workflows such as supervised fine-tuning (SFT) or reinforcement learning (RLHF, GRPO, or similar) is a strong plus, but not required.

The salary range is $150,000.00 – $180,000.00.

This role is a great fit for engineers who love building reliable systems close to the frontier of AI research. We welcome applicants from a wide range of backgrounds, whether your experience comes from industry, research labs, or direct hands-on work with distributed infrastructure at scale.

BE YOUR BEST AT SNORKEL

Joining Snorkel AI means becoming part of a company that has market-proven solutions and robust funding and is scaling rapidly, offering a unique combination of stability and the excitement of high growth. As a member of our team, you’ll have meaningful opportunities to shape priorities and initiatives, influence key strategic decisions, and directly impact our ongoing success. Whether you’re looking to deepen your technical expertise, explore leadership opportunities, or learn new skills across multiple functions, you’re fully supported in building your career in an environment designed for growth, learning, and shared success.

Snorkel AI is proud to be an Equal Employment Opportunity employer and is committed to building a team that represents a variety of backgrounds, perspectives, and skills. Snorkel AI embraces diversity and provides equal employment opportunities to all employees and applicants for employment. Snorkel AI prohibits discrimination and harassment of any type on the basis of race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local law. All employment is decided on the basis of qualifications, performance, merit, and business need. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.

Keywords: Snorkel AI, Oakland, Applied Research Engineer – Training Infra, IT / Software / Systems, San Francisco, California

Other IT / Software / Systems Jobs

Senior Incident Response Engineer (San Jose, CA)
Description: Archer is an aerospace company based in San Jose, California building an all-electric vertical takeoff and landing aircraft with a mission to advance the benefits of sustainable air mobility. We are designing, (more...)
Company: Archer
Location: San Jose
Posted on: 04/3/2026

Technical Account Management (Remote Eligible - Costa Rica)
Description: For over 20 years, Smartsheet has helped people and teams achieve, well, anything. From seamless work management to smart, scalable solutions, we’ve always worked with flow. We’re building tools (more...)
Company: Smartsheet
Location: San Jose
Posted on: 04/3/2026

Manager, Data Science - GenAI Digital Assistant
Description: Manager, Data Science - GenAI Digital Assistant Data is at the center of everything we do. As a startup, we disrupted the credit card industry by individually personalizing every credit card offer using (more...)
Company: Capital One
Location: San Jose
Posted on: 04/3/2026

Senior Lead AI Engineer (Gen AI Platform Services)
Description: Senior Lead AI Engineer Gen AI Platform Services Overview: At Capital One, we are creating responsible and reliable AI systems, changing banking for good. For years, Capital One has been an industry (more...)
Company: Capital One
Location: San Jose
Posted on: 04/3/2026

Senior Visual Designer, Design Systems
Description: Senior Visual Designer, Design Systems What you can expect Senior UX Visual Designer with deep visual design skills to join Zoom’s Design System Team, leading creation of a cohesive design language (more...)
Company: Zoom
Location: San Jose
Posted on: 04/3/2026

Sr. Software Engineer - Back end
Description: Archer is an aerospace company based in San Jose, California building an all-electric vertical takeoff and landing aircraft with a mission to advance the benefits of sustainable air mobility. We are designing, (more...)
Company: Archer
Location: San Jose
Posted on: 04/3/2026

Full-Stack Crypto Software Engineer
Description: About us Curio builds bleeding-edge crypto games and infrastructure. Since 2021, we’ve been pioneers in the onchain game space, shipped mini games to thousands of users, and we are about to ship our (more...)
Company: Curio Research
Location: San Francisco
Posted on: 04/3/2026

Machine Learning Scientist (All Levels)
Description: About Abridge Abridge was founded in 2018 with the mission of powering deeper understanding in healthcare. Our AI-powered platform was purpose-built for medical conversations, improving clinical documentation (more...)
Company: Abridge
Location: San Francisco
Posted on: 04/3/2026

Staff AI Researcher / Engineer
Description: Archer is an aerospace company based in San Jose, California building an all-electric vertical takeoff and landing aircraft with a mission to advance the benefits of sustainable air mobility. We are designing, (more...)
Company: Archer
Location: San Jose
Posted on: 04/3/2026

WebRTC Engineering Lead
Description: What you can expect As a media tech lead, you will lead Zoom’s next generation real-time web media architecture, optimizing audio and video quality for web-based clients across multiple browsers and (more...)
Company: Zoom
Location: San Jose
Posted on: 04/3/2026
