HPC AI/ML Platform Manager at Ford Global Career Site

United States

Full Time

Start Date

Immediate

Expiry Date

13 Feb, 26

Salary

0.0

Posted On

15 Nov, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

HPC, AI/ML, Kubernetes, GPU, Linux, Networking, Containers, GIT, Go, Python, Agile, AI/ML Model Training, HPC Supercomputing, Architecture Frameworks, Vendor Coordination, Technical Presentation

Industry

Motor Vehicle Manufacturing

Description

- Managing the team responsible for the engineering and operations of the AI/ML infrastructure and middleware
- Includes CPU and GPU resources in both the HPC batch-based supercomputing environment as well as the HPC Kubernetes platform
- Some hands-on work is expected as well as being an after hours escalation contact for the regular on-call team
- Oversee the integration of related HPC infrastructure (e.g. HPC storage, high speed interconnects, directory services, authentication, etc.)
- Managing the team's Jira backlog including setting priorities that support application team deliverables
- Assist in setting infrastructure platform strategy
- Representing the service offering inside and outside of the broader organization including participation in status meetings with key customers
- Supporting application team's needs and requests including evaluating and supporting new components
- Manage team performance including mentoring, performance reviews, and general coaching
- Established and active employee resource groups
- Bachelor's Degree or equivalent professional experience
- 4 years of experience managing high-performance computing (HPC) and AI/ML infrastructure platforms, including Kubernetes and GPU batch clusters
- Proven ability to quickly learn and adapt complex technologies
- Strong foundation in Kubernetes, Linux, networking, containers
- Proficient with GIT and ability to code in Go or Python
- Demonstrated ability to manage a high-tech team
- Experience with Agile processes
- Be a self-starter
- Have the ability to develop and communicate a strong POV
- Good people and communication skills
- Having a passion for and being energized by work on infrastructure and middleware technologies
- Basic understanding of AI/ML model training frameworks (e.g. PyTorch)
- Familiarity with HPC supercomputing environments
- Knowledge of architecture frameworks, patterns, and reference architectures
- Ability to work with vendors to coordinate installations, resolve issues, manage the PO process
- Capability to work on multiple projects simultaneously
- Self-researcher with ability to research a technology and provide insight to project team
- Ability to present on technical topics

Responsibilities

Manage the team responsible for the engineering and operations of the AI/ML infrastructure and middleware. Oversee the integration of related HPC infrastructure and support application team's needs and requests.