Site Reliability Engineer at Kaztronix

Sunnyvale, CA 94089, USA

Full Time

Start Date

Immediate

Expiry Date

09 Oct, 25

Salary

0.0

Posted On

10 Jul, 25

Experience

8 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

STIG, Reliability Engineering, Security+, Complex Systems, Machine Learning, Computer Science, Puppet, DevOps, RMF, AWS, Bash, Computer Engineering, System Administration, Operations, Azure, SolarWinds, Software Development, NISPOM, Secure Coding, Kubernetes, Python

Industry

Information Technology/IT

Description

A global government contracting company is seeking a Site Reliability Engineer to join their team in Sunnyvale, CA!

As a Site Reliability Engineer, you will:

Design, implement, and maintain highly available and scalable systems and infrastructure to support classified applications and services
Develop and implement reliability-focused engineering practices, such as continuous integration, continuous deployment, and continuous monitoring, while ensuring compliance with classified system requirements
Collaborate with development teams to ensure that reliability and scalability are considered throughout the software development lifecycle, while maintaining the security and integrity of the classified system
Identify and mitigate potential sources of downtime and performance degradation, including infrastructure, application, and network issues, while ensuring that all troubleshooting and debugging activities are conducted in accordance with classified system procedures
Develop and maintain technical documentation, including system diagrams, architecture documents, and runbooks, while ensuring that all documentation is properly marked and handled in accordance with classified system requirements
Lead and participate in incident response and post-incident reviews to identify root causes and implement corrective actions, while ensuring that all incident response activities are conducted in accordance with classified system procedures
Collaborate with other teams, including development, operations, and security, to ensure that reliability and scalability are considered in all aspects of system design and operation, while maintaining the security and integrity of the classified system
Develop and maintain metrics and monitoring systems to measure system reliability and performance, while ensuring that all monitoring activities are conducted in accordance with classified system requirements
Stay up-to-date with industry trends and emerging technologies, and apply this knowledge to continuously improve system reliability and scalability, while maintaining the security and integrity of the classified system

BASIC QUALIFICATIONS

Bachelor’s degree in Computer Science, Engineering, or a related field
Minimum 8 years of experience in site reliability engineering, DevOps, or a related field, with a focus on classified systems
Must possess, or be able to obtain within 6 months of the start date, a valid DoD-approved 8140 (DoD 8570) IAT Level II or III certification, such as Security+, in good standing
Ability to obtain and maintain a Top Secret security clearance; US citizenship required
Experience with production use of vSphere/ESXi/vCenter and RHEL
Advanced proficiency in using Python, Bash, Ansible, Puppet, and Chef for system administration
Demonstrable proficiency with MRTG/PRTG, Nagios, SolarWinds or similar
Proven ability with Cloud and Container technologies: Kubernetes, Docker/Mirantis, AWS, and/or Azure
Strong technical background in systems administration, networking, and software development, with a focus on classified systems
Excellent problem-solving skills, with the ability to analyze complex systems and identify root causes of issues, while maintaining the security and integrity of the classified system
Knowledge of networking fundamentals, including TCP/IP, DNS, and routing protocols

DESIRED SKILLS

System integration experience with large-scale distributed infrastructure systems
Master’s degree in Computer Engineering or a related field
Data center operations/system administrator experience, preferably in a DoD environment (RMF, STIG, or NISPOM)
Certification in site reliability engineering, DevOps, or a related field, with a focus on classified systems
Experience with machine learning and artificial intelligence technologies, with a focus on classified systems
Strong knowledge of security principles and practices, including secure coding, secure deployment, and secure operations, with a focus on classified systems
Strong understanding of networking fundamentals, including TCP/IP, DNS, and routing protocols, with a focus on classified systems
Ability to support 24x7 on-call and off-shift work for mission-critical events and operations that may require extended hours or weekend support
Comfortable working in a fast-paced, dynamic, multi-disciplinary environment
Active Secret security clearance

Responsibilities

Please refer to the job description for details.