Site Reliability Engineer, HPC Infrastructure and Platforms (Hybrid Eligible)
at Oak Ridge National Laboratory
Oak Ridge, TN 37830, USA
Start Date | Expiry Date | Salary | Posted On | Experience | Skills | Telecommute | Sponsor Visa |
---|---|---|---|---|---|---|---|
Immediate | 01 Jul, 2024 | Not Specified | 05 Apr, 2024 | 5 year(s) or above | Operating Systems, Automation, Third Party Vendors, Disabilities, Dashboards, Code Review, Computer Science, SELinux, IT, Management Software, Puppet, Testing, Python, HTML, Participation, GitLab, Metrics, GitHub, Resumes, Virtual Machines, Continuous Integration | No | No |
Description:
OVERVIEW:
The National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL), which hosts several of the world’s most powerful computer systems, is seeking highly qualified individuals to play a key role in improving the security, performance, and reliability of the NCCS computing infrastructure. This infrastructure supports multiple highly ranked TOP500 supercomputers, including Frontier, the first exaflop supercomputer.
BASIC QUALIFICATIONS:
Bachelor’s degree in computer science or a closely related field and a minimum of 5 years of experience as an SRE/Systems Engineer. An equivalent combination of education and experience may be considered.
PREFERRED QUALIFICATIONS:
- Excellent interpersonal/communication skills, and the ability to work as part of a team.
- Strong working knowledge of Unix system fundamentals and common network protocols.
- Experience managing Linux/UNIX operating systems in a heterogeneous environment.
- Proven understanding of networked computing environment concepts.
- Ability to develop and maintain programs and scripts that aid in operations and automation, using shell (primarily Bash) and high-level languages (Python or Go).
- Ability to proactively identify performance issues, problems, and areas for improvement.
- Experience with continuous integration and continuous deployment methodologies and an understanding of how they apply to SRE/systems engineering.
- Understanding of code review and familiarity with tools like GitHub and GitLab.
- Experience using tools such as Nagios, Grafana, and Prometheus to monitor systems, collect metrics, and create dashboards.
- Experience implementing systems/services using virtual machines and Kubernetes resources.
- Experience deploying and maintaining automated configuration management software such as Puppet or Ansible.
- Experience implementing systems-level security technologies like SELinux and following best security practices.
Responsibilities:
- Improve the reliability, scalability, and quality of our Kubernetes and Linux-based applications and services.
- Define and implement critical metrics and processes, and drive continuous improvement.
- Capture and analyze metrics to assist in tuning operating systems and applications.
- Diagnose system operational problems quickly and effectively.
- Participate in an on-call rotation that provides 24-hour, 7-day support, and in off-hours maintenance windows.
- Coordinate with vendors to resolve hardware and software problems.
- Deliver ORNL’s mission by aligning behaviors, priorities, and interactions with our core values of Impact, Integrity, Teamwork, Safety, and Service. Promote diversity, equity, inclusion, and accessibility by fostering a respectful workplace – in how we treat one another, work together, and measure success.
REQUIREMENT SUMMARY
Min: 5.0, Max: 10.0 year(s)
Information Technology/IT
IT Software - Network Administration / Security
Software Engineering
Graduate
Computer science or a closely related field and a minimum of 5 years of experience as an SRE/Systems Engineer
Proficient
1