GPU Cluster Performance Engineer

at  Advanced Micro Devices Inc

Austin, TX 78735, USA -

Start DateExpiry DateSalaryPosted OnExperienceSkillsTelecommuteSponsor Visa
Immediate20 Jul, 2024Not Specified27 Apr, 2024N/ANetwork Configuration,Performance Analysis,Scripting Languages,AutomationNoNo
Add to Wishlist Apply All Jobs
Required Visa Status:
CitizenGC
US CitizenStudent Visa
H1BCPT
OPTH4 Spouse of H1B
GC Green Card
Employment Type:
Full TimePart Time
PermanentIndependent - 1099
Contract – W2C2H Independent
C2H W2Contract – Corp 2 Corp
Contract to Hire – Corp 2 Corp

Description:

PREFERRED EXPERIENCE:

  • Proven experience in optimizing the performance of GPU clusters.
  • Strong understanding of GPU architectures, parallel computing concepts, and network protocols.
  • Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis.
  • Experience with system level performance analysis tools and methodologies for GPU clusters.
  • Analytical mindset with excellent problem-solving and debug skills.
  • Familiarity with cluster management tools and systems.
  • Excellent communication and collaboration skills for effective teamwork.
  • RDMA network configuration, troubleshooting and performance tuning.
  • Linux kernel networking expertise
  • Machine learning and/or HPC system design

Responsibilities:

WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.
AMD together we advance_
Responsibilities:

THE ROLE:

We are seeking a highly motivated and skilled GPU Cluster Performance Attainment Engineer to join our dynamic team. In this role, you will be at the forefront of optimizing and achieving peak performance for GPU clusters. The ideal candidate will have a strong background in GPU architectures, parallel computing, and hands-on experience in system level performance tuning and debug methodologies. The team fosters and encourages continuous technical innovation to showcase successes as well as facilitate continuous career development.

KEY RESPONSIBILITIES:

  • Performance Optimization: Collaborate with hardware and software teams to enhance the overall performance of GPU clusters, focusing on aspects such as RDMA throughput, latency, and collective communications.
  • Benchmarking and Analysis: Develop and execute comprehensive benchmarking strategies to assess baseline performance, analyze bottlenecks, and identify areas for improvement within GPU cluster environments.
  • Scalability Testing: Evaluate the scalability of GPU clusters by conducting thorough testing under various workloads, ensuring optimal performance across different cluster sizes, configurations, and networking technologies (IB & RoCE)
  • Performance Profiling: Utilize profiling tools and methodologies to analyze and identify performance bottlenecks, providing actionable insights for improvement.
  • Performance Tuning: Implement optimization strategies, including but not limited to protocol enhancements, load balancing techniques, and parallel processing optimizations.
  • Documentation: Create detailed documentation of performance analysis, tuning efforts, and outcomes, providing clear and concise reports for internal teams and stakeholders.
  • Collaboration: Work closely with cross-functional teams, including hardware engineers, software developers, and system architects, to integrate performance improvements into the GPU cluster architecture.
  • Continuous Learning: Stay current with the latest developments in GPU architectures, parallel processing, and emerging technologies to drive continuous improvement in GPU cluster performance.


REQUIREMENT SUMMARY

Min:N/AMax:5.0 year(s)

Information Technology/IT

IT Software - Other

Software Engineering

Graduate

Computer Science

Proficient

1

Austin, TX 78735, USA