Production Support Engineer (REMOTE) - Veterans Affairs, VA Cleared at THUNDERYARD SOLUTIONS
Remote, Oregon, USA
Full Time


Start Date

Immediate

Expiry Date

15 Nov, 25

Salary

120000

Posted On

15 Aug, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Azure, Ansible, Docker, Python, Orchestration, Hosted Services, Containerization, Splunk, Reliability Engineering, Kubernetes, Bash, System Administration, Troubleshooting, AWS

Industry

Information Technology/IT

Description

WHY JOIN?

We are seeking an experienced Production Support Engineer to join our team supporting a Data & Analytics program for the U.S. Department of Veterans Affairs. This role blends elements of Site Reliability Engineering and Cloud Engineering, focusing on ensuring the stability, reliability, and performance of cloud-hosted services, data pipelines, and DevOps processes.
The ideal candidate will have a strong understanding of modern cloud platforms, monitoring/alerting systems, and CI/CD pipelines, with a problem-solving mindset for troubleshooting production issues in real time.

REQUIRED QUALIFICATIONS

  • 5+ years of experience in Production Support, Site Reliability Engineering, Cloud Engineering, or related roles.
  • Strong understanding of cloud-hosted services (AWS and/or Azure) and associated monitoring tools.
  • Experience with data pipelines, ETL/ELT processes, and data platform operations.
  • Proficiency with DevOps tools (e.g., Jenkins, GitLab CI/CD, Terraform, Ansible).
  • Solid grasp of Linux/Unix system administration and troubleshooting.
  • Familiarity with containerization and orchestration (Docker, Kubernetes).
  • Strong problem-solving skills and ability to work under pressure in high-availability environments.
  • Excellent communication and collaboration skills.

PREFERRED QUALIFICATIONS

  • Programming or scripting experience in Python, Bash, or similar languages.
  • Experience supporting large-scale Data & Analytics platforms.
  • Knowledge of monitoring/logging tools (CloudWatch, Azure Monitor, Prometheus, Grafana, Splunk, etc.).
  • Experience working within federal government environments.

RESPONSIBILITIES

  • Monitor, troubleshoot, and resolve issues impacting production systems, pipelines, and analytics applications.
  • Maintain and improve system reliability, scalability, and performance for cloud-hosted services.
  • Support and optimize CI/CD pipelines for data and analytics workloads.
  • Work closely with development, DevOps, and data engineering teams to ensure smooth deployments and minimal downtime.
  • Implement monitoring, logging, and alerting solutions to proactively detect and address issues.
  • Perform root cause analysis and create permanent fixes for recurring problems.
  • Support automation of operational tasks using scripting and programming languages.
  • Participate in incident response, on-call rotations, and post-incident reviews.