🇧🇷 Datacenter Hardware & Network Support Technician (Remote, from Brazil) at GECI Int.
Full Time


Start Date

Immediate

Expiry Date

26 Sep, 26

Salary

0.0

Posted On

28 Jun, 26

Experience

2 years or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Hardware Troubleshooting, Network Diagnosis, GPU Hardware Fault Isolation, InfiniBand, Dell Server Hardware, English Fluency, System Fundamentals, Network Connectivity Troubleshooting

Industry

IT Services and IT Consulting

Description
Context
AS+ provides run support for GPU clusters operated by a cloud infrastructure partner. We are building a support team to handle day-to-day incidents on these clusters. This first role focuses on weekday coverage. The work sits low in the stack (hardware and network diagnosis) rather than in high-level HPC or application support.

Responsibilities
Diagnose and triage incidents on GPU compute clusters, determining whether a fault originates on our side or the client's.
Investigate hardware failures: collect and analyze hardware logs, identify failed components, and document findings for resolution or RMA.
Diagnose GPU hardware faults (failure detection and isolation, not performance tuning or porting).
Configure and troubleshoot network connectivity, including InfiniBand fabric.
Work directly with the client as first line of support, in English.

Required skills
Solid system and network fundamentals: low-level networking and connectivity diagnosis.
Hands-on hardware troubleshooting, ideally on Dell server hardware.
Ability to diagnose GPU hardware failures (no deep GPU expertise required).
InfiniBand knowledge (important).
Fluent English (all client communication is in English).

Not required
No advanced OS administration.
No Slurm or workload-scheduler expertise.
No HPC application or GPU-porting background.

Setup
Full remote.
Weekday coverage (first hire; the team will expand to cover a wider window).

Responsibilities
Diagnose and triage hardware and network incidents on GPU compute clusters. Collect hardware logs and identify failed components for resolution or RMA.