Member of Technical Staff, AI Infrastructure Team at Verda
Helsinki, Uusimaa, Finland
Full Time


Start Date

Immediate

Expiry Date

23 Aug, 26

Salary

0.0

Posted On

25 May, 26

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Distributed Systems, Networking, ML Training Infrastructure, NCCL, MPI, NVSHMEM, Nsight Systems, PyTorch Profiler, Perf, Performance Profiling, Distributed Training, Low-level Communication Systems, System Reliability, Benchmarking, Regression Testing, Fault-tolerance

Industry

Technology; Information and Internet

Description
About Verda
Verda is reimagining cloud infrastructure for AI workloads. We are a full-stack AI cloud company, meaning we install, operate, and optimize our compute for training and inference of AI models. Join Verda while it’s still being built, not once it’s finished!

Your responsibilities
In this role, you will focus on improving the networking and communication layer behind large-scale LLM training workloads. You will optimize collective communication performance across distributed GPU clusters, helping improve throughput, utilization, and reliability for communication-bound workloads. You will debug and analyze bottlenecks across the networking stack, building tooling and infrastructure for benchmarking, profiling, and regression testing of distributed training performance.

You will work closely with training, infrastructure, hardware, and networking teams to improve how workloads scale across clusters, contributing to both system reliability and overall training efficiency. This role is highly collaborative and research-adjacent, requiring curiosity, initiative, and willingness to go deep into low-level communication systems and distributed training infrastructure.

Your key competencies
Experience with distributed systems, networking, or large-scale ML training infrastructure
Experience with communication libraries such as NCCL, MPI, NVSHMEM, or similar technologies
Experience with profiling and debugging tools such as Nsight Systems, NCCL logs, PyTorch Profiler, or perf
Strong systems thinking and ability to analyze performance bottlenecks across distributed environments
Self-starter mindset with ability to independently define and drive technical projects
Strong curiosity about low-level systems, networking, and large-scale AI infrastructure

Representative projects
Build tools to identify NCCL bottlenecks, slow ranks, and communication tail latency
Build dashboards and regression infrastructure for training network health and performance
Implement fault-tolerance mechanisms to reduce cluster idle time and improve training efficiency

Practicalities
Location: Helsinki, Finland or London, UK
Hybrid mode: Working from either our Helsinki or London office three days a week
Employment type: Full-time and permanent

What's next
We’re building fast and this role needs the right person behind it. There’s no artificial deadline, but when we find who we’re looking for, we move. If this sounds like your next move, apply now. Please submit your application through our Careers page. We don’t accept applications sent by email.

Responsibilities
Focus on optimizing the networking and communication layer for large-scale LLM training workloads across distributed GPU clusters. Develop tooling for benchmarking, profiling, and regression testing to improve throughput and system reliability.