Machine Learning Engineer – Vision Team at Sarvam AI

Bengaluru, Karnataka, India

Full Time

Start Date

Immediate

Expiry Date

16 Jun, 26

Salary

0.0

Posted On

18 Mar, 26

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Python, PyTorch, Training Pipelines, Fine-tuning, Large Models, Multimodal Data Pipelines, Transformer Architectures, Evaluation Harnesses, Inference Optimization, Quantisation, Serving Infrastructure, Production-Grade Systems, Retrieval-Augmented Workflows, Secure Coding, Code Quality, System Reliability

Industry

Software Development

Description

Company Overview

Sarvam.ai is a pioneering generative AI startup headquartered in Bengaluru, India. Our mission is to make generative AI accessible and impactful for Bharat. Founded by a team of AI experts, we are building cost-effective, high-performance AI systems tailored for the Indian market, enabling enterprises to deploy speech, language, and vision models at scale. Join us to build the foundational vision model backbone that powers the next generation of AI systems for India and beyond.

Job Summary

As a Machine Learning Engineer, you will work across the full lifecycle of vision-language model (VLM) development: data, training, evaluation, and getting models into production. The team's scope will evolve as the field does; we want engineers who are comfortable with that.

Key Responsibilities

* Design and run training and fine-tuning pipelines for large vision-language models on GPU clusters
* Build multimodal data pipelines: ingestion, filtering, deduplication, synthetic generation, quality assurance
* Implement and experiment with new architectures and training techniques from research
* Build evaluation harnesses, benchmarks, and automated regression tracking
* Optimise models for inference: quantisation, batching, serving infrastructure
* Build robust pipelines and integrations that put vision model capabilities in the hands of end users
* Translate real-world problems into well-scoped ML tasks with the right data and evaluation strategy
* Work directly with clients to understand their use cases (document processing, visual search, form extraction, or whatever the problem is) and own the solution end to end
* Build production-grade systems on top of Sarvam Vision and open-source models: multimodal pipelines, retrieval-augmented workflows, structured output extraction, and more
* Debug and improve deployed solutions: latency, accuracy, edge cases, and integration with client infrastructure

Must-Have Skills

* Strong Python and PyTorch; comfortable reading and modifying model internals
* Hands-on experience training or fine-tuning large models; you've debugged a broken run
* Experience building data pipelines at scale
* Solid grounding in transformer architectures and modern training techniques
* Comfort with ambiguity; the roadmap is not fully pre-specified
* Strong focus on secure coding practices, code quality, and system reliability
* Undergraduate degree in a technical discipline (CS, statistics, physics, etc.)

Good to Have

* Experience with vision-language models or multimodal systems
* Distributed training (FSDP, DeepSpeed, Megatron-LM)
* Post-training methods: RLHF, DPO, or other alignment techniques
* Inference optimisation: quantisation, distillation, serving
* Prior exposure to vision-based AI systems or document processing pipelines
* Contributions to open-source backend projects or a solid GitHub portfolio

Location

Bengaluru, India (Hybrid)

Responsibilities

The engineer will design and manage training/fine-tuning pipelines for large vision-language models and build multimodal data pipelines covering ingestion, filtering, and quality assurance. They will also be responsible for optimizing models for inference and building production systems that integrate vision model capabilities for end users.