AI Infrastructure & Inference Engineer (GPU Systems) at CO-WORKER TECHNOLOGY
Norrtälje kommun, Sweden
Full Time


Start Date

Immediate

Expiry Date

02 Feb, 26

Salary

0.0

Posted On

04 Nov, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

GPU Infrastructure, Kubernetes, CI/CD, Model Serving, Optimization, Observability, Security, Python, Go, C++, Cloud Experience, Containers, Performance Tuning, Distributed Training, Feature Stores, Hybrid Cloud

Industry

Staffing and Recruiting

Description
The role
• Design, build and operate GPU infrastructure for training and low-latency inference (Kubernetes, autoscaling, CI/CD).
• Implement high-throughput model serving (e.g., NVIDIA Triton, vLLM, Text Generation Inference) with caching and canary releases (a serving sketch follows this description).
• Optimize models and runtimes for cost, latency and throughput (quantization, distillation, batching, parallelism; see the quantization sketch below).
• Establish observability and reliability for inference (telemetry, tracing, SLOs/alerts, capacity planning, FinOps; see the metrics sketch below).
• Contribute to security, governance and compliance for models, artifacts and datasets (secrets, access, audit).

What you bring
• MSc/BSc in Computer Science, Electrical/Computer Engineering or similar.
• 5+ years in systems/infra/SRE or ML platform work at scale.
• Strong experience with containers/Kubernetes and IaC (Terraform or similar).
• Hands-on with GPU stacks (CUDA basics, NCCL, drivers/containers) and performance tuning.
• Familiarity with model serving frameworks (Triton, vLLM, TGI/HF), queues and service meshes.
• Proficiency in Python and one systems language (Go/C++ preferred); solid CI/CD and observability.
• Cloud experience (AWS/Azure/GCP) and cost optimisation for GPU workloads.

Nice to have
• Experience with distributed training/inference (tensor/pipeline parallelism, MIG, RDMA).
• Experience with feature stores/vector DBs for RAG-style serving.
• On-prem GPU cluster management (Slurm, DCGM) or hybrid cloud.
• Security certifications or practical experience with regulated environments.
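To make the serving responsibility concrete, here is a minimal sketch of batched generation using vLLM's offline Python API. The model name, prompts and sampling settings are illustrative assumptions, not details from this posting; a production setup would sit behind a serving layer (Triton, an HTTP gateway) with caching and canary routing.

```
# Minimal sketch of batched LLM inference with vLLM's offline API.
# Model name, prompts and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed example model
sampling = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Summarise the benefits of continuous batching.",
    "Explain what a canary release is in one sentence.",
]

# vLLM handles continuous batching and KV-cache management internally,
# which is what makes it suited to high-throughput, low-latency serving.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```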
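On the optimization side, one simple starting point is post-training quantization. The sketch below applies PyTorch dynamic int8 quantization to a toy model purely to illustrate the mechanics of trading a little accuracy for lower latency and memory; LLM-scale work typically uses dedicated weight-only schemes, but the principle is the same.

```
# Hedged sketch: post-training dynamic quantization with PyTorch.
# The toy model is purely illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
model.eval()

# Replace Linear layers with int8 dynamically quantized equivalents.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 10])
```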
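For the observability and SLO side, a rough sketch of per-request latency instrumentation with prometheus_client is shown below; the metric name, histogram buckets and the predict() stub are assumptions for illustration, not part of the posting.

```
# Rough sketch: request-latency metrics for an inference service.
import time
from prometheus_client import Histogram, start_http_server

INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds",
    "End-to-end model inference latency",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def predict(payload):
    # Placeholder for the real model call (Triton/vLLM/TGI client, etc.).
    time.sleep(0.05)
    return {"result": "ok"}

def handle_request(payload):
    with INFERENCE_LATENCY.time():  # records the duration into the histogram
        return predict(payload)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    handle_request({"prompt": "hello"})
```

Histograms like this feed latency SLOs and alerting, and the same exporter can carry GPU utilisation and queue-depth metrics for capacity planning.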
Responsibilities
The role involves designing, building, and operating GPU infrastructure for training and low-latency inference. Responsibilities also include optimizing models for cost, latency, and throughput while ensuring observability and reliability for inference.