Hardware Engineer, GPU Infrastructure
at CoreWeave
Roseland, New Jersey, USA
Start Date | Expiry Date | Salary | Posted On | Experience | Skills | Telecommute | Sponsor Visa |
---|---|---|---|---|---|---|---|
Immediate | 17 Aug, 2024 | Not Specified | 18 May, 2024 | 2 year(s) or above | Automation, Components, IPMI | No | No |
Description:
CoreWeave is a specialized cloud provider, delivering a massive scale of GPU compute resources on top of the industry’s fastest and most flexible infrastructure. CoreWeave builds cloud solutions for compute-intensive use cases (VFX and rendering, machine learning and AI, batch processing, and Pixel Streaming) that are up to 35 times faster and 80% less expensive than the large, generalized public clouds. Learn more at www.coreweave.com.
CoreWeave is seeking a highly skilled and motivated Infrastructure/Hardware Engineer, focusing on GPU and PCIe troubleshooting, to join our Hardware Engineering team, reporting to the Director of Compute Architecture. In this role, you will play a crucial part in the design, development, troubleshooting, and optimization of our server hardware infrastructure. You will collaborate closely with cross-functional teams, external vendors, and stakeholders to ensure the successful delivery of highly performant and reliable hardware solutions.
THE IDEAL CANDIDATE WILL HAVE AT LEAST 2 YEARS OF PROFESSIONAL EXPERIENCE WITH THE FOLLOWING:
- Prior experience supporting and troubleshooting data center class GPUs (preferably A100 or newer).
- Proficiency in Ansible/Python and experience programmatically interacting with server BMCs using IPMI or Redfish (preferably Redfish).
- Experience using, integrating, and automating data center class GPU diagnostics and troubleshooting tools.
- In-depth knowledge of server hardware, components, and management technologies, particularly GPUs and PCIe devices.
- Proven ability to stay updated with the latest industry technologies and trends.
- Previous experience collaborating with hardware vendors.
- Strong passion for automation, with a commitment to automating processes comprehensively.
- Excellent documentation skills and attention to detail.
- Strong analytical and problem-solving abilities.
Responsibilities:
- Troubleshoot complex GPU- and PCIe-related failures.
- Partner with external vendors on failure analysis.
- Track component RMAs.
- Develop and maintain hardware/firmware management services.
- Automate all aspects of the server hardware lifecycle.
- Serve as the senior point of contact for hardware escalation and troubleshooting.
- Collaborate with cross-functional teams to define hardware requirements, specifications, and system architecture.
- Create and maintain accurate documentation of hardware designs, specifications, test procedures, and results.
- Analyze and optimize the performance of hardware systems, identify bottlenecks, and propose improvements for enhanced efficiency.
- Establish processes for internal hardware testing, deployment, and performance optimization.
REQUIREMENT SUMMARY
Min: 2.0, Max: 7.0 year(s)
Information Technology/IT
IT Software - Other
Software Engineering
Graduate
Proficient
1
Roseland, NJ, USA