AI/ML Operations Engineer (m/f/d, up to TV-L 13, 100%) at Höchstleistungsrechenzentrum Stuttgart (HLRS)
Stuttgart, Baden-Württemberg, Germany
Full Time


Start Date

Immediate

Expiry Date

12 May 2025

Salary

Up to TV-L 13

Posted On

12 Feb 2025

Experience

0 year(s) or above

Remote Job

No

Telecommute

No

Sponsor Visa

No

Skills

Good communication skills

Industry

Information Technology/IT

Description

As one of Europe’s leading facilities for high-performance computing, HLRS is a diverse community of scientists, engineers, and other professionals focused on discovering and developing new applications of powerful digital technologies and computer science methods.
The High-Performance Computing Centre Stuttgart (HLRS) was founded as Germany’s first federal high-performance computing (HPC) centre and operates one of the fastest supercomputers in the world. It offers a range of HPC solutions and services for universities, research institutions, and industry, and is a worldwide leader in engineering and global system sciences. Staff scientists at HLRS investigate emerging technologies such as Artificial Intelligence (AI), Cloud Computing, and Quantum Computing (QC) with the aim of realising hybrid workflows and lowering the barrier for non-experts to use HPC technologies. In this context, HLRS is significantly involved in national and international research projects across these research areas.

SHAPING THE FUTURE OF AI IN HPC

We are seeking a highly motivated AI/ML Operations Engineer to support the deployment, monitoring, and optimisation of AI infrastructure within the AI Factory HammerHAI at HLRS. The role focuses on ensuring scalable, secure, and high-performance AI services for a broad range of users, including start-ups, SMEs, industry, and research institutions. The successful candidate will integrate AI pipelines, deploy AI workloads in HPC environments, and develop monitoring and benchmarking frameworks. The position requires expertise in AI, ML operations, cloud-native technologies, and high-performance computing systems.
In this context, we are looking for an AI/ML Operations Engineer.

Responsibilities
  • Collect and analyse user requirements to tailor AI software architectures and stacks.
  • Assess AI software components, evaluating security, compatibility, and performance requirements.
  • Design, deploy, and optimise AI/ML pipelines in AI-optimised supercomputing environments.
  • Develop and implement best practices for MLOps, including automation, version control, and containerisation.
  • Test, deploy, benchmark, integrate, and monitor AI system services and pipelines with OpenStack and Kubernetes.
  • Analyse monitoring data, logs, and system metrics, setting up monitoring systems where necessary.
  • Provide technical support and guidance to users on deploying and optimising AI workloads.
  • Contribute to technical documentation, user guides, and best practice reports.