Site Reliability Engineer on AI Platform, Director at Morgan Stanley
Bengaluru, Karnataka, India
Full Time


Start Date

Immediate

Expiry Date

09 Feb, 26

Salary

0.0

Posted On

11 Nov, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Site Reliability Engineering, Infrastructure Management, Automation, Infrastructure-as-Code, Monitoring Tools, Containerization, Kubernetes, Capacity Planning, Performance Tuning, Security Compliance, Disaster Recovery, Documentation, Generative AI, High-Performance Computing, ModelOps, Chaos Engineering

Industry

Financial Services

Description
- Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving).
- Design and build automation for core platform capabilities, reducing manual toil.
- Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
- Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards.
- Build Grafana dashboards for the various metrics scraped by Prometheus.
- Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation.
- Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting.
- Optimize cost vs. performance tradeoffs in large-scale compute environments.
- Harden systems for security, compliance, auditability, and data governance.
- Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems.
- Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms.
- Maintain runbooks, operational playbooks, documentation, and training materials.
- Participate in on-call rotations and respond to production incidents 24/7 as needed.
- Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability.

At least six years' relevant experience would generally be expected to find the skills required for this role.

- Production experience in SRE / infrastructure / operations for large-scale systems.
- Strong programming/scripting skills (Python, Go, Java, or equivalent).
- Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.).
- Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, PagerDuty, etc.).
- Understanding of SRE techniques.
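Of the duties listed here, the SLO/error-budget work is concrete enough to sketch. The following is a minimal Python illustration of the arithmetic behind an availability error budget; the 99.9% target and 30-day window are illustrative choices, not figures from this posting.

```python
# Hedged sketch of error-budget math for an availability SLO.
# The SLO target and window below are example values only.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means it is blown)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% availability SLO over 30 days allows roughly 43.2 minutes of downtime;
# 21.6 minutes of recorded downtime would leave half the budget unspent.
```

In practice these thresholds would drive Prometheus alerting rules and Grafana burn-rate panels rather than ad-hoc scripts; the functions above only show the underlying calculation.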
- Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.).
- Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures.
- Networking and systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage).
- Solid experience in capacity planning, performance tuning, scaling, and incident response.
- Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements.
- Experience in regulated environments (financial services, compliance, audit, security) is a strong plus.
- Excellent communication, documentation, and cross-team collaboration skills.
- Proven track record of reducing operational toil via automation.
- Proficiency with OpenTelemetry-compatible tools, including Grafana, Loki, Prometheus, and Cortex.
- Good knowledge of microservice-based architecture and industry standards for both public and private cloud.
- Knowledge of data pipeline technologies (Kafka, Spark, Flink, etc.).
- Good knowledge of various database engines (SQL, Redis, Kafka, Snowflake, etc.) for cloud application storage.
- Experience with Generative AI development, embeddings, and fine-tuning of Generative AI models.
- Experience in high-performance computing (HPC) and distributed GPU cluster scheduling (e.g., Slurm, Kubernetes GPU scheduling).
- Understanding of ModelOps / MLOps / LLMOps.
- Experience with chaos engineering, canary deployments, and blue/green rollouts.

Our values - putting clients first, doing the right thing, leading with exceptional ideas, committing to diversity and inclusion, and giving back - aren't just beliefs; they guide the decisions we make every day to do what's best for our clients, communities, and more than 80,000 employees in 1,200 offices across 42 countries. Our teams are relentless collaborators and creative thinkers, fueled by their diverse backgrounds and experiences.
We are proud to support our employees and their families at every point along their work-life journey, offering some of the most attractive and comprehensive employee benefits and perks in the industry. There's also ample opportunity to move about the business for those who show passion and grit in their work. To learn more about our offices across the globe, please copy and paste https://www.morganstanley.com/about-us/global-offices into your browser. We work to provide a supportive and inclusive environment where all individuals can maximize their full potential. Our skilled and creative workforce comprises individuals drawn from a broad cross section of the global communities in which we operate, reflecting a variety of backgrounds, talents, perspectives, and experiences. Our strong commitment to a culture of inclusion is evident through our constant focus on recruiting, developing, and advancing individuals based on their skills and talents.
Responsibilities
The role involves operating, monitoring, and maintaining infrastructure for GenAI applications while designing automation to reduce manual toil. Additionally, it includes leading incident response and collaborating across teams to ensure safe deployment and integration of new systems.