Site Reliability Engineer at Fluidstack

London, England, United Kingdom

Full Time

Start Date

Immediate

Expiry Date

11 Oct, 25

Salary

0.0

Posted On

13 Jul, 25

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Weka, Lustre, Storage Systems, English, Computer Engineering, Devops, Ceph, Vast, Computer Science, Writing, Maas, Infrastructure, Communication Skills, Ansible

Industry

Information Technology/IT

Description

ABOUT FLUIDSTACK

Fluidstack is building GPU supercomputers for top AI labs, governments, and enterprises. Our customers include Mistral, Poolside, Black Forest Labs, Meta, and more.
Our team is small, highly motivated, and focused on providing a world-class supercomputing experience. We put our customers first in everything we do, working hard not just to win the sale, but to win repeat business and customer referrals.
We hold ourselves and each other to high standards. We expect you to care deeply about the work you do, the products you build, and the experience our customers have in every interaction with us.
You must work hard, take ownership from inception to delivery, and approach every problem with an open mind and a positive attitude. We value effectiveness, competence, and a growth mindset.

Responsibilities

SREs at Fluidstack sit at the core of our infrastructure, working across software, hardware, and operations to ensure the reliability and performance of our global GPU cloud.
They partner closely with teams including networking, platform engineering, and data center operations to build systems that scale with the demands of AI workloads.
SREs are hands-on and possess deep systems knowledge and strong communication skills. You’ll be responsible for tackling complex production issues, deploying resilient infrastructure, and continuously improving the stability and observability of our platform as we grow.

A typical day may involve:

Deploying clusters of 1,000+ GPUs using custom-written playbooks, and modifying these tools as necessary to provide the perfect solution for a customer.
Validating correctness and performance of underlying compute, storage, and networking infrastructure, and working with providers to optimize these subsystems.
Migrating petabytes of data from public cloud platforms to local storage as quickly and cost-effectively as possible.
Debugging issues anywhere in the stack, from “this server’s fan is blocked by a plastic bag” to “optimizing S3 dataloaders from buckets in different regions”.
Building internal tooling to decrease deployment time and increase cluster reliability, including automation where the customer benefits clearly outweigh the implementation overhead.

This role involves taking part in an on-call rotation of up to one week per month.