Senior Python Systems Engineer (Agent & Infrastructure) at ClearML
Germany
Full Time


Start Date

Immediate

Expiry Date

06 Jun, 26

Salary

0.0

Posted On

08 Mar, 26

Experience

10 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Python, Systems Programming, Kubernetes, Docker, K8s API, Container Orchestration, Networking, Linux Internals, Shell Scripting, asyncio, TCP/IP, GPU Management, DevOps, Systems Engineering, Custom Resource Definitions, High-Performance Computing

Industry

Software Development

Description
About the company

At ClearML, our mission is to make infrastructure management effortless across every phase of the AI lifecycle, from building and training AI models to large-scale production. Trusted by more than 2,000 organizations, AI builders and IT teams use our AI infrastructure platform to power everything from early-stage R&D to mission-critical public sector and enterprise-grade AI pipelines. We’re growing quickly and looking for curious, self-driven individuals who are excited to shape the future of AI and the infrastructure that powers it.

Our customers are tackling some of the world’s most important challenges: revolutionizing healthcare, discovering new medicines, securing global finance, protecting national security, and preserving our planet’s ecosystems.

About the Role

We are looking for a Senior Systems Engineer to own the execution layer of the ClearML platform. You will be responsible for some of the critical components that spin up containers, manage GPUs, and tunnel connections, making ClearML work seamlessly across multiple environments.

This role sits at the intersection of Software Engineering and DevOps. You will write Python code that orchestrates infrastructure, manages Docker containers, interacts with the Kubernetes API, and handles low-level networking.

Responsibilities

● Agent Development: Design and optimize the clearml-agent, a Python service responsible for pulling jobs, setting up environments, and executing ML pipelines.
● Kubernetes Integration: Write logic to interact directly with K8s APIs, manage Pod lifecycles, and handle Custom Resource Definitions (CRDs).
● Resource Management: Implement logic for dynamic resource allocation (GPU/CPU/Memory) and container orchestration.
● Systems Programming: Build robust daemons and services that interact with OS-level primitives (systemd, signals, I/O streams).
● Networking: Troubleshoot and optimize TCP/IP connections, DNS resolution, and firewall traversal to ensure seamless connectivity for users.

Requirements

● 8+ years of development experience with a strong focus on Systems Programming.
● Kubernetes Mastery: Deep understanding of Kubernetes architecture (beyond just writing YAML). You should know how to write code that controls K8s.
● Container Internals: Extensive experience with Docker, including building and maintaining images.
● Python for Systems: Experience using Python for automation, daemons, or CLI tools (using libraries like subprocess, socket, asyncio).
● Networking Fundamentals: Strong grasp of HTTP/S, WebSockets, TCP/IP, Proxies, and Reverse Proxies.
● OS Knowledge: Strong understanding of Linux internals and shell scripting.

Advantages

● Experience with GPU hardware management (NVIDIA drivers, CUDA, NVIDIA Container Toolkit).
● Experience building Kubernetes Operators/Controllers (using Kopf or Operator SDK).
● Background in HPC (High-Performance Computing) or Slurm/MPI.
● Experience with Go (Golang) is a plus (for specific K8s components).

Why Join ClearML?

ClearML is on a mission to make AI infrastructure effortless, from early experimentation to large-scale, production-grade systems. Trusted by 2,100+ organizations worldwide, our platform powers real-world, mission-critical AI workloads.

🌍 Fully Remote & Global: Work from anywhere with a distributed team of top-tier engineers.
🛠 Engineering-First & Autonomous: High ownership, real responsibility, and freedom to design and ship impactful solutions.
🚀 High Growth, High Impact: Your work directly affects thousands of users, from startups to large enterprises.
⚡ Technically Deep Challenges: Build complex, performance-critical systems at the core of modern AI infrastructure.
🔁 Fast Feedback, Real Users: See your work in production quickly and make a measurable difference.
We’re a remote-first team hiring across Europe and the US, with most of the team working within ±3 hours of GMT.
Responsibilities
The role involves designing and optimizing the clearml-agent, a Python service responsible for executing ML pipelines, and writing logic to interact directly with Kubernetes APIs to manage Pod life-cycles and CRDs. Responsibilities also include implementing dynamic resource allocation and building robust daemons that interact with OS-level primitives.