Manager of Infrastructure Engineering (Observability) at Miray Holdings

Redmond, Washington, United States

Full Time

Start Date

Immediate

Expiry Date

12 Apr, 26

Salary

0.0

Posted On

12 Jan, 26

Experience

10 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Infrastructure Engineering, SRE, Platform Roles, Observability Systems, Linux, Distributed Systems, Production Operations, GPU Infrastructure, HPC Environments, Bare-Metal Systems, Telemetry, Kubernetes, Alerting Strategy, Metrics Systems, Logging Systems, Distributed Tracing

Industry

Description

Voltage Park is your enterprise AI factory. We offer scalable compute power and on-demand and reserved bare-metal AI infrastructure using NVIDIA GPUs, with world-class service, performance, and value. Founded with the mission of making AI computing accessible to all, we provide flexible, affordable GPU solutions that power everyone from builders to enterprises.

Voltage Park is looking for a Manager of Infrastructure Engineering for our Infrastructure Engineering team. Our team is responsible for building automation, tooling, and API-driven systems to bridge the gap between our physical infrastructure and the systems that our customers depend on for AI/ML training, inference, and HPC workloads at scale.

In this role, you’ll design and implement systems that enable humans and software to interact programmatically with thousands of bare-metal servers, storage clusters, and high-performance networks. You will work closely with teams across Voltage Park to drive new infrastructure rollouts and improve the lifecycle management of existing resources. Observability is not a nice-to-have; it is foundational to how we operate safely, efficiently, and at scale.

QUALIFICATIONS

- 7+ years in infrastructure engineering, SRE, or platform roles
- 2+ years managing technical teams
- Deep experience designing and operating observability systems at scale
- Strong background in Linux, distributed systems, and production operations
- Experience in GPU, HPC, or AI infrastructure environments
- Hands-on experience with bare-metal systems and hardware-level telemetry (power, thermal, network, GPU)
- Comfort operating in environments with hardware dependencies, physical failure modes, and tight SLAs

Strong Technical Background In

- Metrics systems (Prometheus, VictoriaMetrics, Mimir, etc.)
- Logging systems (ELK / OpenSearch, Loki, ClickHouse, Kafka-based pipelines)
- Distributed tracing (OpenTelemetry, Jaeger, Tempo)
- Kubernetes observability (nodes, clusters, workloads, control plane)
- Alerting strategy, SLOs, SLIs, and error budgets
- High-cardinality, high-volume telemetry tradeoffs

Nice to Have

- Experience designing observability for monitoring hardware failure modes (GPU ECC, PCIe, NIC errors, power or thermal limits)
- Experience operating observability platforms across multiple data centers and failure domains
- Familiarity with capacity-aware or constraint-driven alerting (power, thermal, rack-level limits)
- Experience balancing telemetry cost, retention, and fidelity at large scale
- Prior experience evolving alerting from reactive to SLO-driven
- Experience building or scaling observability teams or platforms in high-growth environments

WHAT YOU'LL DO

Technical Ownership & Strategy

- Own Voltage Park’s observability strategy across infrastructure and platform layers
- Define standards for metrics, logs, traces, alerts, dashboards, and SLOs
- Drive architecture decisions for telemetry pipelines, storage, and retention
- Balance signal quality, system performance, and cost at scale

Team Leadership

- Build, manage, and mentor a team of infrastructure engineers focused on observability
- Set clear technical direction, priorities, and expectations
- Review designs, guide implementation, and raise the bar on operational rigor
- Partner closely with Engineering and Operations teams

Platform Engineering

- Design and operate high-throughput observability pipelines (metrics, logs, traces)
- Ensure observability platforms are reliable, scalable, and resilient
- Improve alert quality and reduce noise across production systems
- Enable self-service observability for internal engineering teams

Reliability & Operations

- Participate in and lead infrastructure incident response
- Use observability data to drive root-cause analysis and systemic improvements
- Build feedback loops from incidents into better tooling, alerts, and runbooks
- Help establish a culture of measurement-driven reliability

Voltage Park is an equal opportunity employer and makes employment decisions on the basis of merit. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected under federal, state, or local law. If you require an accommodation during the job application process, please notify your recruiter.

Responsibilities

The Manager of Infrastructure Engineering will own Voltage Park’s observability strategy across infrastructure and platform layers and lead a team of infrastructure engineers focused on observability. Responsibilities include designing and operating high-throughput observability pipelines (metrics, logs, traces) and improving alert quality across production systems.