System Development Engineer, Cloud AI/ML/storage server teams
Company: Amazon
Location: Cupertino
Posted on: April 4, 2026
Job Description:
We are seeking an experienced Systems Development Engineer to
lead the development of automation software, diagnostic tooling,
and fleet health infrastructure for our server platforms. You will
work across multiple teams and organizations to build scalable,
reliable systems that keep our storage and accelerated (AI/ML)
compute fleet healthy — with a vision toward zero-touch operations
where automation detects, diagnoses, and resolves issues without
human intervention. You will be a technical leader solving complex
architectural problems that may not be well-defined in advance. You
will own your team's systems, proactively identify deficiencies, and
write scalable and robust code to solve issues before they impact
customers. You will decompose large, difficult server testability,
reliability, and diagnosis problems into straightforward tasks and
components — leading delivery yourself and through others in
parallel — using a combination of hardware, software, system
design, processor architecture, diagnostics, and operations
knowledge. You will collaborate with a variety of roles (SDEs,
SDETs, Mechanical/Electrical/Hardware Engineers, TPMs, Managers,
Principals) and organizations through server conception, test
validation, qualification, launch, and operations — driving high
quality and reliability into current and future designs for AWS
server solutions. You will also work closely with ODMs and Design
Partners to ensure our tooling, diagnostics, and automation
requirements are met throughout the hardware development lifecycle
(NPI).

Key job responsibilities

Fleet Health & Predictive Infrastructure
- Build and own the automation infrastructure responsible for the
  health of the server fleet across storage and accelerator (AI/ML)
  compute platforms
- Design and implement predictive failure detection systems using
  telemetry, sensor data, error trending, and log correlation to
  identify hardware issues before they cause customer impact
- Drive toward zero-touch operations by building automation that
  detects, diagnoses, triages, and remediates hardware and software
  faults without human intervention
- Develop monitoring tools, dashboards, and alerting systems to
  provide real-time visibility into fleet health across lab and
  production environments
- Define and track fleet health metrics (failure rates, mean time to
  detect, mean time to repair, first-time fix rate, predictive
  accuracy)

Debugging & Troubleshooting
- Debug and resolve complex system-level issues across storage,
  compute, GPU, and networking in production environments
- Troubleshoot Linux boot and runtime failures across x86 and ARM
  architectures, including PCIe, power, NIC, NVMe, and GPU subsystems
- Perform root cause analysis on hardware failures, correlating
  across firmware, kernel, driver, and physical layers to isolate
  faults
- Build diagnostic tooling that automates root cause identification
  and reduces reliance on manual triage

Systems Development & Automation
- Lead the definition and development of software, automation, and
  enabling tools for server hardware programs; track and report
  progress
- Design and build scalable system-level software with a focus on
  durability, availability, security, and diagnostics
- Develop and maintain device drivers for Linux on ARM and x86
  architectures
- Build automation solutions using modern programming languages
  (Python, Ruby, Java, C/C++, etc.)
- Work with OS internals, storage subsystems, and accelerator/GPU
  software stacks in Linux-based environments
- Build, manage, and deploy CI/CD pipelines for rapid deployment of
  code changes to org-owned and customer-owned systems

Cross-Team Collaboration
- Work across internal HWEng teams to ensure new server hardware
  addresses data path and control path functionality needed by
  dependent service teams
- Work closely with internal customers to identify, early on, any
  potential problems onboarding new servers (storage or accelerated
  compute) into their ecosystem
- Engage with ODMs and design partners on testability, diagnostic,
  and automation requirements during hardware design and development
- Contribute to server design to improve robustness, testability,
  diagnosability, and reliability
- Partner with datacenter operations teams to close the loop between
  field failures and design improvements

About the team
Systems Development Engineers in
AWS Hardware Engineering wear many hats. From orchestration tooling
development to hardware integration to kernel driver debugging, we
dive deep into problems across the breadth of AWS. Our teams are
directly responsible for launching and maintaining server hardware
in the fleet — including storage servers powering distributed
storage platforms and AI/ML accelerator servers with GPUs. Located
in Seattle and Cupertino, we work with internal development teams,
ODMs, and design partners to deliver servers deployed in
datacenters worldwide.

- 2 years of non-internship professional software development
  experience
- 1 year of experience designing or architecting (design patterns,
  reliability, and scaling) new and existing systems
- 3 years of administrative experience in networking, storage
  systems, operating systems, and hands-on systems engineering
  experience
- Knowledge of systems engineering fundamentals (networking, storage,
  operating systems)
- Experience programming with at least one modern language such as
  C++, C#, Java, Python, Golang, PowerShell, Ruby
- Experience with PowerShell (preferred), Python, Ruby, or Java
- Experience working in an Agile environment using the Scrum
  methodology
- Familiarity building predictive failure detection or proactive
  remediation systems at fleet scale
- Familiarity with Linux kernel driver development
- Familiarity with storage, compute, GPU/accelerator platforms
  (NVIDIA), including driver integration, diagnostics, or performance
  validation
- Familiarity with distributed storage systems (block, object, or
  file)
- Familiarity with server hardware architecture, BMC/IPMI, firmware,
  PCIe topology, NVLink, and hardware diagnostics
- Familiarity working with ODMs or hardware design partners through
  the product development lifecycle
- Familiarity building zero-touch or self-healing automation for
  large-scale infrastructure
- Familiarity working in large-scale datacenter or cloud environments
- Track record of rapidly coming up to speed on new engineering
  disciplines and making impactful decisions
- Familiarity with hardware bring-up, validation, and fleet-wide
  deployment
- Familiarity with telemetry pipelines, anomaly detection, and
  operational metrics at scale

Amazon is an equal opportunity
employer and does not discriminate on the basis of protected
veteran status, disability, or other legally protected status.

Los Angeles County applicants: Job duties for this position include:
work safely and cooperatively with other employees, supervisors,
and staff; adhere to standards of excellence despite stressful
conditions; communicate effectively and respectfully with
employees, supervisors, and staff to ensure exceptional customer
service; and follow all federal, state, and local laws and Company
policies. Criminal history may have a direct, adverse, and negative
relationship with some of the material job duties of this position.
These include the duties and responsibilities listed above, as well
as the abilities to adhere to company policies, exercise sound
judgment, effectively manage stress and work safely and
respectfully with others, exhibit trustworthiness and
professionalism, and safeguard business operations and the
Company’s reputation. Pursuant to the Los Angeles County Fair
Chance Ordinance, we will consider for employment qualified
applicants with arrest and conviction records.

Our inclusive
culture empowers Amazonians to deliver the best results for our
customers. If you have a disability and need a workplace
accommodation or adjustment during the application and hiring
process, including support for the interview or onboarding process,
please visit
https://amazon.jobs/content/en/how-we-hire/accommodations for more
information. If the country/region you’re applying in isn’t listed,
please contact your Recruiting Partner.

The base salary range for
this position is listed below. Your Amazon package will include
sign-on payments and restricted stock units (RSUs). Final
compensation will be determined based on factors including
experience, qualifications, and location. Amazon also offers
comprehensive benefits including health insurance (medical, dental,
vision, prescription, Basic Life & AD&D insurance and option
for Supplemental life plans, EAP, Mental Health Support, Medical
Advice Line, Flexible Spending Accounts, Adoption and Surrogacy
Reimbursement coverage), 401(k) matching, paid time off, and
parental leave. Learn more about our benefits at
https://amazon.jobs/en/benefits.

USA, CA, Cupertino - 148,700.00 - 201,200.00 USD annually
USA, WA, Seattle - 129,200.00 - 174,800.00 USD annually
Keywords: Amazon, Oakland, System Development Engineer, Cloud AI/ML/storage server teams, Engineering, Cupertino, California