Databricks AI Platform SRE_Director_Infrastructure Production Management & at Morgan Stanley
Bengaluru, Karnataka, India
Full Time


Start Date

Immediate

Expiry Date

24 Feb, 26

Salary

0.0

Posted On

26 Nov, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Databricks, Infrastructure As Code, Terraform, Python, Java, Cloud Networking, Identity And Access Management, Monitoring, Logging, CI/CD, Agile, DevOps, Prometheus, Grafana, MLOps, Scrum, Kanban

Industry

Financial Services

Description
- Design and implement secure, scalable, and automated Databricks environments to support AI/ML workloads.
- Develop infrastructure-as-code (IaC) solutions using Terraform for provisioning Databricks, cloud resources, and network configurations.
- Build automation and self-service capabilities using Python, Java, and APIs for platform onboarding, workspace provisioning, orchestration, and monitoring.
- Collaborate with data science and ML teams to define compute requirements, governance policies, and efficient workflows across dev/qa/prod environments.
- Integrate the Databricks offering with cloud-native services on Azure/AWS.
- Champion CI/CD and GitOps for managing ML infrastructure and configurations.
- Ensure compliance with enterprise security and data governance policies using RBAC, audit controls, encryption, network isolation, and policies.
- Monitor platform performance, reliability, and usage, and drive improvements to optimize cost and resource utilization.
- At least 4 years' relevant experience would generally be expected to find the skills required for this role.
- Proven experience with Terraform for building and managing infrastructure.
- Strong programming skills in Python and Java.
- Hands-on experience with cloud networking, identity and access management, key vaults, monitoring, and logging in Azure.
- Hands-on experience with Databricks (workspace management, clusters, jobs, MLflow, Delta Lake, Unity Catalog, Mosaic AI).
- Deep understanding of Azure or AWS infrastructure (e.g. IAM, VNets/VPC, storage, networks, compute, key management, monitoring).
- Strong experience in distributed system design, development, and deployment using agile/DevOps practices.
- Experience with CI/CD pipelines (GitHub Actions or similar).
- Experience implementing monitoring and observability using Prometheus, Grafana, or Databricks-native solutions.
- Good communication skills, excellent teamwork experience, and the ability to mentor and develop more junior developers, including participating in constructive code reviews.
- Experience in multi-cloud environments (AWS/GCP) is a bonus.
- Experience working in highly regulated environments (finance, healthcare, etc.) is desirable.
- Experience with Databricks REST APIs and SDKs.
- Knowledge of MLflow, Mosaic AI, and MLOps tooling.
- Working with teams using Scrum, Kanban, or other agile practices.
- Proficiency with standard Linux command line and debugging tools.

Our values - putting clients first, doing the right thing, leading with exceptional ideas, committing to diversity and inclusion, and giving back - aren't just beliefs, they guide the decisions we make every day to do what's best for our clients, communities and more than 80,000 employees in 1,200 offices across 42 countries. Our teams are relentless collaborators and creative thinkers, fueled by their diverse backgrounds and experiences. We are proud to support our employees and their families at every point along their work-life journey, offering some of the most attractive and comprehensive employee benefits and perks in the industry. There's also ample opportunity to move about the business for those who show passion and grit in their work. To learn more about our offices across the globe, please copy and paste https://www.morganstanley.com/about-us/global-offices into your browser.

We work to provide a supportive and inclusive environment where all individuals can maximize their full potential. Our skilled and creative workforce is comprised of individuals drawn from a broad cross section of the global communities in which we operate and who reflect a variety of backgrounds, talents, perspectives, and experiences. Our strong commitment to a culture of inclusion is evident through our constant focus on recruiting, developing, and advancing individuals based on their skills and talents.
Responsibilities
Design and implement secure, scalable, and automated Databricks environments to support AI/ML workloads. Monitor platform performance, reliability, and usage, and drive improvements to optimize cost and resource utilization.