Team Lead/Production Monitoring/Reliability Engineer at Indev

Ashburn, VA 20147, USA

Full Time

Start Date

Immediate

Expiry Date

15 Sep, 25

Salary

From $115,000 per year

Posted On

15 Jun, 25

Experience

10 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Middleware, Javascript, Puppet, Ksh, Zsh, Powershell, Information Technology, Nexus, Gitlab, Bash, Telecommunications, Programming Languages, Csh, Ruby, Github, Computer Science, Splunk, Artifactory, Operating Systems, Version Control, Nginx, New Relic, Cloud

Industry

Information Technology/IT

Description

POSITION DESCRIPTION:

Indev is seeking a skilled Team Lead and Reliability Engineer to support our client’s mission by enhancing Production Monitoring and ensuring optimal service delivery for their applications. This role involves proactive issue identification, incident resolution, and system health optimization within a 24x7x365 operational environment. The ideal candidate will lead monitoring solutions, manage ITIL engineers, automate processes, and collaborate across IT and business teams to improve service reliability. Expertise in AWS environments, root cause analysis, and technical troubleshooting is essential, along with strong communication and leadership skills to drive continuous improvement.
This is a direct-hire, full-time position with salary and benefits. Indev provides a comprehensive benefits package, including Medical, Dental, Vision, 401(k) with match, Flexible Spending Account, and Paid Time Off (PTO) covering vacation and holiday pay.

REQUIRED QUALIFICATIONS:

Bachelor’s degree in Computer Science, Information Systems, Engineering, Business, or other related discipline with a minimum of 10 years of experience in information technology. Additional years of experience can be substituted for the degree.
Practical knowledge and hands-on experience with Agile development and DevSecOps practices
Background in systems engineering with expertise in one or more areas such as telecommunications, programming languages, operating systems, middleware, or database technologies
Proficiency with development and operations tools including:
Version control and artifact management systems like GitHub, GitLab, Bitbucket, Artifactory, or Nexus
Cloud and infrastructure monitoring tools, particularly AWS CloudWatch
Centralized logging and analytics platforms such as Splunk
Infrastructure as Code (IaC) and configuration management tools like Terraform or Puppet

PREFERRED QUALIFICATIONS:

Familiarity with observability platforms such as New Relic or other AI-driven operations tools
Proficient in one or more programming languages such as JavaScript, Ruby, or Go
Experience working with modern application delivery technologies including Nginx, HAProxy, Docker, Kubernetes, or equivalents
Understanding of messaging platforms, collaborative tools, app-level firewalls, proxy servers, and common operating systems
Comfortable working in Linux and Windows environments, with scripting experience in Bash, CSH, KSH, ZSH, or PowerShell
Experience utilizing monitoring and alerting frameworks such as Prometheus, Grafana, or Datadog

ABOUT US:

At Indev, we are redefining Intelligent Development—delivering forward-thinking, mission-driven technology solutions that empower federal agencies to modernize, automate, and innovate with confidence. We go beyond the status quo, thinking creatively and providing impactful, non-traditional solutions that drive federal technology transformation. Our team harnesses the power of AI-driven automation, mission analytics, and cloud-native technologies to create agile, secure, and efficient enterprises that are built for the future. Let’s innovate. www.indev.com.
Job Type: Full-time
Pay: From $115,000.00 per year

Benefits:

401(k)
401(k) matching
Dental insurance
Health insurance
Health savings account
Life insurance
Paid time off
Referral program
Vision insurance

Schedule:

Monday to Friday

Work Location: Hybrid remote in Ashburn, VA 20147

Responsibilities

Serve as Team Lead over staff of reliability engineers
Schedule Reliability Engineers so that the Emergency Operations Center is always staffed
Participate in outage calls when possible
Ensure SLAs for escalations and notifications are met
Coach and guide less experienced reliability engineers
In close coordination with PM, ensure all project deliverables are met
Provide regular improvement or refresher training sessions to reliability engineers
Aid Project Manager in gathering monitoring metrics and other material for presentations
Serve as liaison between Federal Staff and Contractors
Present Production Monitoring related material to Sr. Leadership
Aid PM in interviewing new candidates
Serve as Reliability Engineer
Triage and escalate events in accordance with Standard Operating Procedures (SOPs)
Assess initial severity, gather impacts, create tickets, engage support teams, and escalate issues properly
Effectively document incidents describing the issue, business impact, root cause and resolution
Monitor various applications to proactively identify system disruptions and preempt enterprise outages
Notify internal and external departments of performance issues and trends
Support maintenance and scheduled outages
Review and update tickets with current status information
Understand applications and their interdependencies
Monitor and support scheduled change activity in the production environment and escalate unexpected issues
Provide application verification support to support teams upon completion of scheduled changes in the production environment
Aid in development of the Root Cause Analysis (RCA) following an event
Provide shift reports detailing the health of the environment and any pending changes which may potentially impact applications
Provide documentation and presentation support as needed
Identify areas where improvements in processes or documentation will increase the team’s overall proficiency
Communicate clearly and effectively with IT, business process owners, and customers at all levels of the organization
Communicate overall status and health of the application to business and application support teams.
Participate in the creation and maintenance of technical and knowledge base documentation.