Site Reliability Engineer (SRE) - OpenShift at NTT DATA

Plano, TX 75024, USA

Full Time

Start Date

Immediate

Expiry Date

11 Jun, 25

Salary

0.0

Posted On

11 Mar, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Kubernetes, Scripting, Openshift, Code, Jenkins, Devops, Computer Science, Systems Engineering, Infrastructure, Docker, Containerization, Orchestration, Azure, Reliability Engineering

Industry

Information Technology/IT

Description

Company Overview:
Req ID: 316280
NTT DATA strives to hire exceptional, innovative and passionate individuals who want to grow with us. If you want to be part of an inclusive, adaptable, and forward-thinking organization, apply now.
We are currently seeking a Site Reliability Engineer (SRE) - OpenShift to join our team in Dallas, Texas (US-TX), United States (US).
Job Description:

Key Responsibilities:

Building and maintaining reliable systems, ensuring high availability, and improving the overall performance of our infrastructure.
Designing, implementing, and managing observability solutions that provide deep insights into our systems and applications.
Reliability and Availability:
Ensure the reliability and availability of mission-critical systems.
Design and implement monitoring, alerting, and incident management strategies.
Performance and Scalability:
Optimize system performance, scalability, and capacity planning.
Conduct performance tuning and load testing to identify bottlenecks.
Automation and CI/CD:
Develop and maintain CI/CD pipelines for automated deployment.
Automate operational tasks and infrastructure management using scripts and tools.
Infrastructure Management:
Manage on-premises infrastructure and container orchestration platforms using OpenShift and Kubernetes.
Implement infrastructure as code (IaC) using Terraform or a comparable tool.
Security and Compliance:
Ensure system security and compliance with industry standards.
Implement and maintain backup, disaster recovery, and high-availability solutions.
Collaboration and Communication:
Collaborate with development teams to build reliable and scalable software.
Communicate system status, incidents, and performance metrics to stakeholders.

Qualifications:

5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
On-premises and cloud platforms (AWS, GCP, or Azure).
Containerization and orchestration (OpenShift, Docker, Kubernetes).
Hands-on experience with OpenShift.
Scripting (e.g., Python, Bash).
CI/CD tools (Jenkins, GitLab CI, CircleCI).
Monitoring and logging tools (Prometheus, Grafana, ELK stack).

Preferred Qualifications:

Infrastructure as Code (IaC) tools (Terraform, CloudFormation).
Security best practices and compliance standards.
Agile/Scrum development methodologies.

Education:

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience)
