Manager, Software Engineering - AIOps at NVIDIA
Raanana, Center District, Israel
Full Time


Start Date

Immediate

Expiry Date

12 Aug, 26

Salary

0.0

Posted On

14 May, 26

Experience

10 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Software Engineering Management, AIOps, Distributed Systems, Python, Go, Kubernetes, Docker, CI/CD, Solution Design, LLM Agents, Cloud Native Architecture, Telemetry Data Analysis, Predictive Orchestration, System Software Engineering, SaaS Scaling, Autonomous Systems

Industry

Computer Hardware Manufacturing

Description
NVIDIA is at the forefront of the AI revolution, and the AIOps department is critical to ensuring our AI-driven data centers operate with unmatched efficiency. We are looking for a visionary, hands-on Software Engineering Manager to lead a team building the next generation of AI-based monitoring and operation platforms. This role focuses on leveraging AI Agents to automate, predict, and optimize data center performance at internet scale. If you are a resilient leader who excels in fast-paced environments and has a passion for autonomous system operations, we want you on our team.

What You’ll Be Doing:
Strategic Roadmap Development: Define software design and implementation roadmaps for AI-driven operations, ensuring data center availability, resiliency, and performance through autonomous agent-based monitoring.
Innovative AIOps Engineering: Lead the development of tools and proof-of-concepts focused on software-defined operations, utilizing AI agents to automate root cause analysis and proactive remediation.
Scalable Architecture: Build and scale monitoring applications that handle massive telemetry data from AI infrastructure across public, private, and hybrid cloud environments.
Agentic Frameworks: Oversee the integration of LLM-based agents into CI/CD and operational workflows to shift from reactive monitoring to predictive orchestration.
Team Leadership: Actively hire, mentor, and grow a high-performing engineering team, fostering a culture of technical excellence and creative problem-solving.
Customer Engagement: Directly contribute to internal and external customer engagements to align AIOps solutions with real-world data center challenges.

What We Need to See:
BS/MS degree in Computer Science or a related technical field (or equivalent experience).
8+ years of overall software engineering experience, including at least 2 years in a management or technical lead role.
Domain Expertise: 3+ years of experience in system software engineering for large-scale production systems, with a strong background in Solution Design and Distributed Systems.
Cloud Native Mastery: Deep experience with Docker and Kubernetes orchestration, alongside PaaS or IaaS cloud platforms.
Programming Proficiency: Strong programming skills in Python (essential for AI/ML workflows) and Go.
Operational Intelligence: Extensive knowledge of CI/CD pipelines and automated software-defined operations.
Exceptional written and verbal communication skills to bridge the gap between complex AI logic and operational requirements.

Ways to Stand Out from the Crowd:
AI/ML Background: Experience building or deploying AI Agents (LangChain, AutoGPT) or using ML models for anomaly detection and predictive analytics.
Infrastructure Knowledge: Familiarity with Ethernet switching, networking protocols, or NVIDIA’s hardware stack (GPUs/DPUs).
Control Systems: Experience in developing autonomous systems or closed-loop feedback monitoring tools.
SaaS Background: Proven track record of managing and scaling cloud-based SaaS applications.

NVIDIA pioneered accelerated computing. Today, our AI infrastructure powers global intelligence, transforming every industry. Learn more about NVIDIA.
Responsibilities
Lead a team in developing AI-based monitoring and operation platforms to automate and optimize data center performance. Define strategic roadmaps for autonomous agent-based monitoring and integrate LLM-based agents into operational workflows.