Lead Software Engineer at Armada

Full Time

Start Date

Immediate

Expiry Date

10 Apr, 26

Salary

0.0

Posted On

10 Jan, 26

Experience

10 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Kubernetes Operator Expertise, Deep Kubernetes Internals, Multi-Tenant Platform Architecture, High-Performance Storage & Networking, Infrastructure Automation, Observability & Monitoring, Programming Languages

Industry

Software Development

Description

About the Company

Armada is an edge computing startup that provides computing infrastructure to remote areas where connectivity and cloud infrastructure are limited, as well as areas where data needs to be processed locally for real-time analytics and AI at the edge. We’re looking to bring on the most brilliant minds to help further our mission of bridging the digital divide with advanced technology infrastructure that can be rapidly deployed anywhere.

Lead Software Engineer – AI Platform / GPUaaS

Experience: 10–15 years
Location: Bangalore, India
Role Type: Individual Contributor with Technical Leadership

About Armada

Armada is an edge computing startup building resilient, high-performance computing infrastructure for environments where traditional cloud connectivity is limited or unavailable. Our platforms power real-time analytics and AI workloads at the edge, enabling customers to deploy advanced compute anywhere. We’re on a mission to bridge the digital divide with rapidly deployable, production-grade infrastructure, and we’re looking for exceptional engineers to help us do it.

About the Role

As a Lead Software Engineer (L5) on the AI Platform team, you will serve as the primary architect for Armada’s GPU-as-a-Service (GPUaaS) platform. You will design and own the abstraction layers that transform complex GPU fabrics, storage systems, and high-performance networking into a seamless, self-service experience for AI workloads.

Your work will enable researchers and engineers to launch Jupyter environments, distributed training jobs, and LLM inference endpoints with a single click, while abstracting away the complexity of PCI topology, RDMA fabrics, and parallel storage, without compromising isolation, security, or performance in a multi-tenant environment.

You will operate at the intersection of distributed systems, Kubernetes internals, and GPU infrastructure, setting technical direction, driving architectural decisions, and mentoring engineers across levels.

What You’ll Do

Architectural Strategy

Lead the design of a globally scalable AI control plane for GPU, storage, and network orchestration.
Define architectural patterns for custom Kubernetes operators managing complex AI training and inference workloads.
Own the long-term scalability, extensibility, and evolution of the GPUaaS platform.

Systemic Multi-Tenancy

Architect hard isolation strategies at the kernel, hypervisor, and hardware levels (IOMMU, SR-IOV, device isolation).
Design secure multi-tenant execution models aligned with zero-trust networking principles.

Storage & Networking Strategy

Drive integration strategies for high-performance storage platforms such as VAST, Weka, and DDN.
Collaborate with hardware and networking vendors to optimize RDMA, GPUDirect, and RoCE v2 traffic patterns.
Design and evolve VXLAN / BGP-EVPN–based network architectures for AI workloads.

Feature Development

Develop and maintain custom Kubernetes operators for GPU, storage, and infrastructure automation.
Implement CRDs, reconciliation logic, and full lifecycle management for AI workloads.

Reliability, Performance & Scale

Define platform SLOs, capacity planning models, and GPU availability targets.
Establish benchmarking standards using MLPerf and custom training/inference stress tests.
Lead post-incident reviews and drive continuous performance optimization initiatives.

Technical Leadership & Mentorship

Set engineering standards through design reviews, architecture documents, and technical RFCs.
Mentor and grow L3/L4 engineers into strong platform owners.
Influence cross-functional teams across infrastructure, security, and product.

Required Skills (Mandatory)

Kubernetes Operator Expertise: Proven experience designing and operating production-grade Kubernetes controllers and operators using Go (Kubebuilder / Operator SDK).
Deep Kubernetes Internals: Strong understanding of etcd, API machinery, CRDs, controllers, and scheduling concepts.
Multi-Tenant Platform Architecture: Hands-on experience building secure, multi-tenant platforms with strong isolation and zero-trust networking.
High-Performance Storage & Networking: Knowledge of POSIX semantics, CSI drivers for parallel filesystems, and InfiniBand / RoCE v2 networking.
Infrastructure Automation: Experience with Ansible, Terraform, or equivalent automation frameworks.
Observability & Monitoring: Hands-on experience with Prometheus, OpenTelemetry (OTEL), Grafana, Splunk, or similar tools.
Programming Languages: Strong proficiency in Go and Python.

Preferred / Nice-to-Have Skills

AI Serving Frameworks: Experience with vLLM, Ray Serve, Triton Inference Server, or similar systems.
Virtualization & Low-Level Systems: Familiarity with VMware vSphere, OpenStack, KVM, or bare-metal provisioning platforms.
GPU Infrastructure: Experience with NVIDIA DGX/HGX systems, GPU Operator, DCGM, Nsight, or GPU performance profiling tools.
Distributed Training Systems: Exposure to PyTorch DDP, DeepSpeed, or large-scale distributed training frameworks.

Compensation & Benefits

For India-based candidates: We offer a competitive base salary along with equity options, providing an opportunity to share in the success and growth of Armada.

You're a Great Fit if You're

A go-getter with a growth mindset. You're intellectually curious, have strong business acumen, and actively seek opportunities to build relevant skills and knowledge.
A detail-oriented problem-solver. You can independently gather information, solve problems efficiently, and deliver results with a "get-it-done" attitude.
Able to thrive in a fast-paced environment. You're energized by an entrepreneurial spirit, capable of working quickly, and excited to contribute to a growing company.
A collaborative team player. You focus on business success and are motivated by team accomplishment rather than personal agenda.
Highly organized and results-driven. Strong prioritization skills and a dedicated work ethic are essential for you.

Equal Opportunity Statement

At Armada, we are committed to fostering a work environment where everyone is given equal opportunities to thrive. As an equal opportunity employer, we strictly prohibit discrimination or harassment based on race, color, gender, religion, sexual orientation, national origin, disability, genetic information, pregnancy, or any other characteristic protected by law. This policy applies to all employment decisions, including hiring, promotions, and compensation. Our hiring is guided by qualifications, merit, and the business needs at the time.

Unsolicited Resumes and Candidates

Armada does not accept unsolicited resumes or candidate submissions from external agencies or recruiters. All candidates must apply directly through our careers page. Any resumes submitted by agencies without a prior signed agreement will be considered unsolicited, and Armada will not be obligated to pay any fees.

Responsibilities

As a Lead Software Engineer, you will architect the GPU-as-a-Service platform, designing abstraction layers for AI workloads. You will also set technical direction and mentor engineers across various levels.