Site Reliability Engineer (SRE)- OpenShift at NTT Data
Plano, TX 75024, USA - Full Time


Start Date

Immediate

Expiry Date

11 Jun, 25

Salary

0.0

Posted On

11 Mar, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Kubernetes, Scripting, Openshift, Code, Jenkins, Devops, Computer Science, Systems Engineering, Infrastructure, Docker, Containerization, Orchestration, Azure, Reliability Engineering

Industry

Information Technology/IT

Description

Company Overview:
Req ID: 316280
NTT DATA strives to hire exceptional, innovative and passionate individuals who want to grow with us. If you want to be part of an inclusive, adaptable, and forward-thinking organization, apply now.
We are currently seeking a Site Reliability Engineer (SRE)- OpenShift to join our team in Dallas, Texas (US-TX), United States (US).
Job Description:

Key Responsibilities:

  • Building and maintaining reliable systems, ensuring high availability, and improving the overall performance of our infrastructure.
  • Designing, implementing, and managing observability solutions that provide deep insights into our systems and applications.
  • Reliability and Availability:
      • Ensure the reliability and availability of mission-critical systems.
      • Design and implement monitoring, alerting, and incident management strategies.
  • Performance and Scalability:
      • Optimize system performance, scalability, and capacity planning.
      • Conduct performance tuning and load testing to identify bottlenecks.
  • Automation and CI/CD:
      • Develop and maintain CI/CD pipelines for automated deployment.
      • Automate operational tasks and infrastructure management using scripts and tools.
  • Infrastructure Management:
      • Manage on-premises infrastructure and container orchestration platforms using OpenShift and Kubernetes.
      • Implement infrastructure as code (IaC) using Terraform or a similar tool.
  • Security and Compliance:
      • Ensure system security and compliance with industry standards.
      • Implement and maintain backup, disaster recovery, and high-availability solutions.
  • Collaboration and Communication:
      • Collaborate with development teams to build reliable and scalable software.
      • Communicate system status, incidents, and performance metrics to stakeholders.

Qualifications:

  • 5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
  • On-premises and cloud platforms (AWS, GCP, or Azure).
  • Containerization and orchestration (OpenShift, Docker, Kubernetes).
  • Hands-on experience with OpenShift.
  • Scripting (e.g., Python, Bash).
  • CI/CD tools (Jenkins, GitLab CI, CircleCI).
  • Monitoring and logging tools (Prometheus, Grafana, ELK stack).

Preferred Qualifications:

  • Infrastructure as Code (IaC) tools (Terraform, CloudFormation).
  • Security best practices and compliance standards.
  • Agile/Scrum development methodologies.

Education:

  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience)