SRE / HPC Engineer

at Fluidstack

Remote, Scotland, United Kingdom

Start Date: Immediate
Expiry Date: 21 Nov, 2024
Salary: Not Specified
Posted On: 22 Aug, 2024
Experience: N/A
Skills: Vast, System Administration, Kubernetes, Shared Storage, Bash, Platforms, NFS, Ansible, Automation, Python
Telecommute: No
Sponsor Visa: No

Description:

Fluidstack is an AI cloud. We work with many of the top AI companies on the planet, including Poolside, Meta, Modal, Reka, and many more.

SKILLS & EXPERIENCE

  • Experience with HPC systems, system administration, SRE, or DevOps.
  • Experience with large-scale workloads using orchestrators such as Slurm or Kubernetes.
  • Experience automating bare-metal machines and containers with tools such as Ansible, Bash, or Python.
  • Experience with shared storage on platforms such as NFS, DDN, Vast, CephFS, etc.
  • Experience provisioning large-scale clusters and networks with tools such as BCM or UFM.
  • Experience with large-scale GPU systems, working with NVIDIA GPUs and InfiniBand networks.
  • Fast learner, adaptable, and passionate about Fluidstack’s mission!

If any of the above bullets resonate with you, please reach out!

Responsibilities:

Our HPC Engineers make sure our GPU infrastructure is working at peak performance and offer top-tier support to our customers.

At its core, the role has three main responsibilities:

  • Deployment. We will be onboarding new clusters at least monthly. You will help take bare-metal servers and deploy them for our customers as high-performance compute as a service.
  • Automation. Our GPU fleet is large and growing. You will help us automate many of our processes and systems so that we can keep supporting customers as Fluidstack scales.
  • Support. This is a client-facing role. You will work closely with our customers to make sure that they can use our infrastructure to achieve their goals. You will work on everything from GPU debugging and Slurm management to training performance optimization.


REQUIREMENT SUMMARY

Min: N/A, Max: 5.0 year(s)

Information Technology/IT

IT Software - Other

Software Engineering

Graduate

Proficient

1

Remote, United Kingdom