HPC AI/ML Software Engineer (Scientist 2/3)

at Los Alamos National Laboratory

Los Alamos, New Mexico, USA

Start Date: Immediate
Expiry Date: 18 Jun, 2024
Salary: Not Specified
Posted On: 18 Mar, 2024
Experience: 1 year(s) or above
Skills: Docker, Algorithms, Triad, Visualization, Access, Analytics, Programming Languages, Integration, Python, ML, Federal Government, Addition, Data Analysis, DevOps, Pipelines, Job Scheduling, Unsupervised Learning, Deep Learning, Reinforcement Learning, Leadership, Kubernetes
Telecommute: No
Sponsor Visa: No

Description:

SCIENTIST 2 ($99,200 - $164,100)

The successful candidate will perform the full spectrum of tasks, including but not limited to:

  • Research, evaluate, and recommend AI/ML software for use on LANL systems.
  • Work closely with HPC users to integrate AI/ML models into production HPC platforms. Provide support and guidance to the HPC user community running AI/ML workflows on high performance computing systems.
  • Help establish best working practices for users and HPC around AI/ML workflows.
  • Together with subject matter experts, help develop and implement a plan for AI/ML data management (including but not limited to I/O optimization, transfer, and archival).
  • Collaborate with stakeholders to understand requirements and translate them into technical solutions.
  • Work closely with system administrators to troubleshoot problems encountered by applications running on HPC systems.
  • Contribute to the development of technical presentations, papers, technical reports, etc.
  • Stay updated on the latest advancements in AI/ML technologies and best practices.

MINIMUM JOB REQUIREMENTS:

This position requires both breadth and depth of expertise to create, recommend, and approve designs.

  • Proficiency in high- and low-level programming languages such as Python, C/C++, or equivalent languages.
  • Experience developing, supporting, and using AI/ML solutions and pipelines, together with a strong foundation in algorithms and techniques such as supervised and unsupervised learning, deep learning, and reinforcement learning.
  • Excellent problem-solving skills and attention to detail.
  • Strong communication and collaboration skills.
  • Ability to work effectively in a fast-paced, dynamic multi-disciplinary environment.

ADDITIONAL JOB REQUIREMENTS FOR SCIENTIST 3:

In addition to the requirements outlined above, qualification at the higher level requires:

  • Proven experience (2+ years) working as an AI and Machine Learning Engineer or in a similar role.
  • Leadership: Experience as the technical lead on technical projects.
  • HPC Computing Experience: Experience working in a production computing environment, preferably with HPC systems or at large scale. Working knowledge of networking concepts and practices.
  • Hands-on experience with popular ML frameworks and libraries such as TensorFlow, PyTorch, scikit-learn, etc.

EDUCATION/EXPERIENCE AT LOWER LEVEL

Position requires a Bachelor’s degree in a STEM field from an accredited college or university and 4 years of related experience or equivalent experience directly related to the occupation.

EDUCATION/EXPERIENCE AT HIGHER LEVEL

Position requires a Master’s degree in a STEM field from an accredited college or university and 6 years of relevant experience or an equivalent combination of education and experience directly related to the occupation.

DESIRED QUALIFICATIONS:

  • Experience with DevOps, including CI/CD pipelines.
  • Ability to develop and create solutions to difficult problems, often requiring integration of conflicting or incomplete data on a fast-paced schedule.
  • Familiarity with conducting data analysis, data preprocessing, and feature engineering to prepare data for model training.
  • Skills to train, validate, and fine-tune machine learning models using various techniques and algorithms.
  • Experience with containerization and orchestration tools such as Docker, Kubernetes, Charliecloud, etc.
  • Familiarity with GPU optimizations in a scientific computing environment working with large multi-physics, molecular dynamics, material, climate, or genomics models.

Work Location: The work location for this position is hybrid and is located in Los Alamos, NM. Hybrid is defined as working partially onsite/partially offsite but within 2 hours ground commute of this location. All work locations are at the discretion of management and can change at any time with appropriate notice.

Position commitment: Regular appointment employees are required to serve a period of continuous service in their current position in order to be eligible to apply for posted jobs throughout the Laboratory. If an employee has not served the time required, they may only apply for Laboratory jobs with the documented approval of their Division Leader. The position commitment for this position is 1 year.

Examples of experience and research areas include, but are not limited to:

  • Computational performance of AI/ML algorithms, including the ability to use modern computing hardware and software frameworks efficiently.
  • Data parallelism, model parallelism, collective communication patterns, and strategies for large-scale, distributed ML using Python frameworks (e.g., PyTorch, TensorFlow).
  • Visualization of large-scale HPC/AI discovery campaigns.
  • Analytics of operational data (power, energy, CPU/GPU utilization, job scheduling, large-scale storage and I/O traces, system logs) to enable data-driven intelligence and facility innovation.
  • Deployment and use of large language models and other foundation models.

Responsibilities:

WHAT YOU WILL DO

The High Performance Computing Environments group (HPC-ENV) is seeking driven HPC data scientists, at the Scientist 2 or 3 level, to work in the broad area where HPC and AI/ML overlap. This position may require engagement in every phase of the system development lifecycle, including requirements generation, system and software design, implementation, integration & test, and verification & validation.

The HPC Division supports the Los Alamos National Laboratory (LANL) mission by managing a world-class supercomputing center. We support stockpile stewardship for NNSA/DOE and accelerate scientific discovery for scientists. We integrate and support some of the world’s largest supercomputers during an exciting time in computing with the focus on traditional large scale simulations, data science, artificial intelligence, and machine learning.

HPC-ENV manages how users interact with the HPC systems at LANL which helps secure the nation and pushes the boundaries of science and innovation. Several teams within HPC-ENV are responsible for the broad range of HPC platforms, programming and runtime environments, software, application optimization and readiness, software engineering, user support & services for a large and diverse customer base. We provide support and services to many production platforms at a world-class computing facility to ensure customers can accomplish their research and mission at extreme scale.

This position will be filled at either the Scientist 2 or Scientist 3 level, depending on the skills of the selected candidate. Additional job responsibilities (outlined below) will be assigned if the candidate is hired at the higher level.
