Software Engineer II - AI Infrastructure (Scheduler) - CoreAI at Microsoft
Redmond, Washington, United States
Full Time


Start Date

Immediate

Expiry Date

24 Feb, 26

Salary

0.0

Posted On

26 Nov, 25

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

C#, Service Fabric, Kubernetes, Dev-Ops, Machine Learning, Cloud Security, Data Structures, Algorithms, Unit Testing, Performance Engineering, NoSQL, Microservices, Control Plane Services, Concurrency Management, Resource Management

Industry

Software Development

Description
Work on the design and development of the core AI Infrastructure distributed and in-cluster services that support large scale AI training and inferencing.
Develop, test, and maintain control plane services written in C#, hosted on Service Fabric or Kubernetes (AKS) clusters.
Enhance systems and applications to ensure high stability, efficiency, and maintainability, low latency, and tight cloud security.
Provide operational support and DRI (on-call) responsibilities for the service.
Develop and foster a deep understanding of the machine learning concepts, use cases, and relevant services used by our customers.
Investigate use of tools and cloud services and prototype solutions for problems in our control plane space.
Embody our culture and values.

Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, Java, Scala, Rust, Go, TypeScript OR equivalent experience.
OOP proficiency and practical familiarity with common code design patterns.
2+ years of experience with service development in a distributed environment, in a dev-ops role, including concurrency management and stateful resource management.

Master's degree in Computer Science or a related technical field.
Hands-on experience with public cloud services at the IaaS level.
Advanced knowledge of C# and .NET.
Proficiency with use of complex data structures and algorithms, preferably in the setting of a resource allocator/scheduler, workflow/execution orchestration engine, database engine, or similar.
Significant experience with unit testing and writing testable code.
Technical communication skills: verbal and written.
First-hand experience with building large-scale, multi-tenant global services with high availability.
Experience with building and operating “stateful” and critical control plane services; handling challenges with data size and data partitioning; related use of a NoSQL cloud database.
Experience with mapping complex object models to relational and non-relational datastores.
Dev-ops experience with microservices architecture in a complex infrastructure and operational environment.
Service reliability and fundamentals engineering; instrumentation for KPIs or performance analysis; demonstrated service and code quality mindset.
Performance engineering: work on scalability, profiling; CPU, memory and I/O use optimization techniques.
Applied knowledge of Kubernetes: service model, workload packaging and deployment, programmatic extensibility (CRDs, operators); or equivalent knowledge of Service Fabric.
Server-side Windows programming and performance engineering.
Data analytics skills, in particular with Kusto.
Experience working in a geo-distributed team.
Responsibilities
Design and develop core AI Infrastructure services for large scale AI training and inferencing. Provide operational support and enhance systems for stability, efficiency, and security.