HPC Operations Engineer at IREN
Sydney, New South Wales, Australia
Full Time


Start Date

Immediate

Expiry Date

28 Jun, 26

Salary

0.0

Posted On

30 Mar, 26

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

HPC System Architecture, Kubernetes, Slurm Workload Manager, HPC Management Tools, System Troubleshooting, Cloud Platforms, Network Solutions, Storage Solutions, Incident Response, Monitoring, Deployment, Maintenance, Technical Leadership, Documentation

Industry

technology;Information and Internet

Description
Job Type: Full-time | Location: Sydney | Department: IT | Reporting to: Senior Manager, Technical Operations Center | Work Location Type: Onsite

IREN is a leading AI Cloud Service Provider, delivering large-scale GPU clusters for AI training and inference. IREN’s vertically integrated platform is underpinned by its expansive portfolio of grid-connected land and data centers in renewable-rich regions across the U.S. and Canada.

The HPC Operations Engineer will provide Tier 2 operational support for the IREN global fleet as part of a 24x7x365 incident response team. They will ensure the timely resolution of site- and customer-impacting events, engaging vendor and product Tier 3 support when appropriate. They are also responsible for the ongoing improvement and refinement of our monitoring and response alerting, ensuring that we are able to provide immediate support for all possible events.

With 100% renewable energy, we build, own and operate our data centers and take pride in being at the forefront of sustainable solutions for the ever-evolving applications of high-performance compute. We believe that human progress is invaluable, but it should be done in the right way: responsibly, sustainably and with a positive impact on the communities we operate in.

* Minimum of 3-5 years of experience in HPC system architecture with proven expertise in designing, deploying, and managing HPC clusters.
* Extensive knowledge of Kubernetes, with a focus on its integration within HPC environments.
* Hands-on experience with the Slurm workload manager, or similar.
* Familiarity with HPC management tools and software, ensuring efficient system monitoring and troubleshooting.
* Proven track record of resolving complex system challenges and enhancing operational performance.
* Understanding of cloud platforms and their integration into HPC ecosystems.
* Deep knowledge of network and storage solutions commonly used in HPC setups.
* A degree or diploma in computer science, engineering, or a combination of education and experience appropriate to the role.
* Relevant certifications in Kubernetes, HPC technologies, or system architecture are advantageous.

* Respond to, triage, and resolve operational incidents as part of a 24x7x365 response team, supporting escalations to Tier 3 product operations when appropriate.
* Support the deployment and maintenance of HPC clusters, ensuring they operate effectively and maximize availability.
* Manage HPC software components such as Kubernetes, Slurm, cluster management software, and any infrastructure required to operate the HPC environment.
* Collaborate with product operations to ensure accurate monitoring and response for our global fleet.
* Draft comprehensive documentation, including operational procedures and best practice guidelines.
* Provide technical leadership and training to other team members, fostering an environment of continuous learning and improvement.

At IREN, we offer a highly competitive compensation package that includes base salary, annual performance incentives, and opportunities to build long-term wealth through equity programs. These offerings are part of our broader total rewards package, thoughtfully designed to support your health, well-being, and long-term success.

Compensation & Rewards
* Competitive salary range finalized based on experience and impact
* Short- and long-term incentive programs designed to reward both results and long-term company success

Wellbeing & Benefits
* Paid vacation to recharge, travel, or simply enjoy more life outside of work

We value diverse perspectives and believe that skills can be developed. If you’re passionate about this role, we want to hear from you, whether you meet every criterion or not. Your unique experiences might be exactly what we need!

IREN Limited is an equal opportunity employer that is committed to creating an inclusive workplace. We evaluate qualified applicants without regard to race, colour, religion, age, sex, sexual orientation, gender identity, genetic information, national origin, disability, veteran status, and other legally protected characteristics.

This job will remain posted until filled. While we appreciate all applications we receive, we are only able to contact candidates under consideration.

By applying for this position and submitting your resume and application materials, you consent to the processing of your personal information in accordance with our Job Applicant Privacy Statement available on our website at www.iren.com [http://www.iren.com/].
Responsibilities
The engineer will provide Tier 2 operational support for the global fleet as part of a 24x7x365 incident response team, ensuring timely resolution of site- and customer-impacting events. Responsibilities also include improving monitoring and response alerting, supporting cluster deployment and maintenance, and drafting operational documentation.