Senior MLOps / AIOps Platform Engineer at NTT DATA
Chennai, Tamil Nadu, India
Full Time


Start Date

Immediate

Expiry Date

22 Jan, 26

Salary

0.0

Posted On

24 Oct, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

MLOps, AIOps, Platform Engineering, DevSecOps, Infrastructure as Code, Automation, Monitoring, Logging, Observability, IBM watsonx, Google Cloud Vertex AI, CI/CD, Containerization, Orchestration, Data Engineering, Cloud Certifications

Industry

IT Services and IT Consulting

Description
Platform Engineering & Operations
Lead the provisioning, configuration, and ongoing support of IBM watsonx and Google Cloud Vertex AI platforms. Ensure platforms are production-ready, secure, cost-efficient, and performant across training, inference, and orchestration workflows. Manage lifecycle tasks such as patching, upgrades, integrations, and service reliability. Partner with security, compliance, and product teams to align platforms with enterprise and regulatory standards.

Enterprise MLOps / AIOps Enablement
Define and implement standardized MLOps/AIOps practices across business units for consistency and scalability. Build and maintain reusable workflows for model development, deployment, retraining, and monitoring. Provide onboarding, enablement, and support to AI/ML teams adopting enterprise platforms and tools. Support the development and deployment of GenAI applications and maintain them at enterprise scale.

DevSecOps Integration
Embed security and compliance guardrails across the ML lifecycle, including CI/CD pipelines and IaC templates. Implement policy-as-code, access controls, vulnerability scanning, and automated compliance checks. Ensure all deployments meet enterprise and regulatory requirements (HIPAA, SOX, FedRAMP, etc.).

Infrastructure as Code & Automation
Design and maintain IaC templates (Terraform, Pulumi, Ansible, CloudFormation) for reproducible ML infrastructure. Build and optimize CI/CD pipelines for AI/ML assets including data pipelines, training workflows, deployment artifacts, and monitoring systems. Enforce best practices around automation, reusability, and observability of infrastructure and workflows.

Monitoring, Logging & Observability
Implement comprehensive observability for AI/ML workloads using Prometheus, Grafana, Stackdriver, or Datadog. Monitor both infrastructure health (CPU, memory, cost) and ML-specific metrics (model drift, data integrity, anomaly detection). Define KPIs and usage metrics to measure platform performance, adoption, and operational health.

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field. 5+ years in MLOps, DevOps, Platform Engineering, or Infrastructure Engineering. 2+ years applying DevSecOps practices (secure CI/CD, vulnerability management, policy enforcement). Hands-on experience configuring and managing enterprise AI/ML platforms (IBM watsonx, Google Vertex AI). Demonstrated success in building and scaling ML infrastructure, automation pipelines, and platform support models. Proficiency with IaC tools (Terraform, Pulumi, Ansible, CloudFormation). Deep understanding of containerization and orchestration (Docker, Kubernetes). Experience with model lifecycle tools (MLflow, TFX, Vertex Pipelines, or equivalents). Familiarity with secrets management, policy-as-code, access control, and monitoring tools. Working knowledge of data engineering concepts and their integration into ML pipelines.

Cloud certifications (e.g., GCP Professional ML Engineer, AWS DevOps Engineer, IBM Cloud AI Engineer). Experience supporting platforms in regulated industries (HIPAA, FedRAMP, SOX, PCI-DSS). Contributions to open-source projects in MLOps, automation, or DevSecOps. Familiarity with responsible AI practices including governance, fairness, interpretability, and explainability. Hands-on experience with enterprise feature stores, model monitoring frameworks, and fairness toolkits.
Responsibilities
Lead the provisioning, configuration, and ongoing support of IBM watsonx and Google Cloud Vertex AI platforms. Define and implement standardized MLOps/AIOps practices across business units for consistency and scalability.