Site Reliability Engineer, HPC Infrastructure and Platforms (Hybrid Eligible)

at Oak Ridge National Laboratory

Oak Ridge, TN 37830, USA

Start Date: Immediate
Expiry Date: 01 Jul, 2024
Salary: Not Specified
Posted On: 05 Apr, 2024
Experience: 5 year(s) or above
Skills: Operating Systems, Automation, Third Party Vendors, Disabilities, Dashboards, Code Review, Computer Science, SELinux, IT, Management Software, Puppet, Testing, Python, HTML, Participation, GitLab, Metrics, GitHub, Resumes, Virtual Machines, Continuous Integration
Telecommute: No
Sponsor Visa: No

Description:

OVERVIEW:

The National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL) hosts several of the world’s most powerful computer systems. NCCS is seeking highly qualified individuals to play a key role in improving the security, performance, and reliability of its computing infrastructure, which supports multiple highly ranked Top500 supercomputers, including Frontier, the first exaflop supercomputer.

BASIC QUALIFICATIONS:

Bachelor’s Degree in computer science or closely related field and a minimum of 5 years of experience as an SRE/Systems Engineer. An equivalent combination of education and experience may be considered.

PREFERRED QUALIFICATIONS:

  • Excellent interpersonal/communication skills, and the ability to work as part of a team.
  • Strong working knowledge of Unix system fundamentals and common network protocols.
  • Experience managing Linux/UNIX operating systems in a heterogeneous environment.
  • Proven understanding of networked computing environment concepts.
  • Ability to develop and maintain programs and scripts that aid operations and automation, using shell (primarily Bash) and high-level languages (Python or Go).
  • Ability to proactively identify performance issues, problems, and areas for improvement.
  • Experience with continuous integration and continuous deployment software methodologies and how they apply to SRE/systems engineering.
  • Understanding of code review and familiarity with tools like GitHub and GitLab.
  • Experience using tools such as Nagios, Grafana, and Prometheus to monitor systems, collect metrics, and create dashboards.
  • Experience implementing systems/services using virtual machines and Kubernetes resources.
  • Experience deploying and maintaining automated configuration management software such as Puppet or Ansible.
  • Experience implementing systems-level security technologies like SELinux and following best security practices.

Responsibilities:

  • Improve the reliability, scalability, and quality of our Kubernetes- and Linux-based applications and services.
  • Define and implement critical metrics and processes, and drive continuous improvement.
  • Capture and analyze metrics to assist in tuning operating systems and applications.
  • Diagnose system operational problems quickly and effectively.
  • Participate in an on-call rotation providing 24-hour, 7-day support, and in off-hours maintenance windows.
  • Coordinate with vendors to resolve hardware and software problems.
  • Deliver ORNL’s mission by aligning behaviors, priorities, and interactions with our core values of Impact, Integrity, Teamwork, Safety, and Service. Promote diversity, equity, inclusion, and accessibility by fostering a respectful workplace – in how we treat one another, work together, and measure success.


REQUIREMENT SUMMARY

Min: 5.0, Max: 10.0 year(s)

Information Technology/IT

IT Software - Network Administration / Security

Software Engineering

Graduate
