Edge-cloud Systems Engineer at Lutra
Canada
Full Time


Start Date

Immediate

Expiry Date

07 Mar, 26

Salary

Not disclosed

Posted On

07 Dec, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Linux, Kubernetes, AWS, Azure, GCP, Docker, Prometheus, Grafana, MQTT, Python, TypeScript, SQL, OpenCV, CUDA, PyTorch, TensorFlow

Description
Your opportunity

Our client is a well-funded, seed-stage AI startup that builds agents for the factory floor. They develop and distribute a software-first agent layer that plugs into the cameras and machines factories already have. Their models run and act at the edge, so agents can see, decide, and act in real time, while events and metrics flow into a dashboard that gives plant teams immediate visibility. They’re approaching a large (~$14B) and underserved market with a disruptive, asset-light alternative to hardware-heavy robotics and batch analytics, and they’ve already found early traction with clients in the food & beverage, pharma/cosmetics, and materials processing verticals.

As a senior production systems engineer, you’ll own the control and data planes for a fleet of edge devices. You’ll ship over-the-air updates and versioned model releases, provision and monitor devices, instrument and stream telemetry, and backhaul prioritized data for retraining. You’ll integrate with cameras and industrial equipment, enforce safe rollout/rollback policies, and lead incident response through root-cause analysis and remediation.

You’ll be joining a flat, dynamic environment in the midst of its scale-up phase, led by an accomplished ex-DeepMind researcher specializing in reinforcement learning, deep learning, and robotics. The company closed a $13.9M CAD seed round in March 2025 and is scaling R&D and delivery to meet accelerating demand, with headcount on track to double by year-end. Please note that this role involves occasional travel to client sites across Canada and the US.

Key responsibilities

- Systems architecture & reliability: Design and implement fault-tolerant services on Linux across on-prem and cloud environments; apply resilience patterns (circuit breakers, retries, bulkheads, failover) to meet deterministic SLOs under real-world load.
- Real-time performance & optimization: Profile hotspots, tune latency and throughput, and optimize concurrency, memory, and I/O paths to sustain strict timing constraints for production workloads.
- Edge integration & platform engineering: Integrate software with embedded/edge hardware and interfaces; package and deploy services via containers to bare metal, VMs, and Kubernetes while accounting for resource constraints at the edge.
- Observability, on-call & continuous improvement: Instrument systems with metrics, tracing, and logs (Prometheus/Grafana/ELK); define actionable alerts, lead incident response and post-mortems, and convert findings into reliability upgrades.
- CI/CD, testing & secure delivery: Own the unit/integration test strategy and automated pipelines; enforce secure coding practices, guardrails, and reviews that keep releases fast, stable, and auditable.

Tech stack

- Operating system: Linux
- Orchestration & compute: Kubernetes, on-prem bare metal, VMs
- Cloud providers: AWS, Azure, GCP
- Containers: Docker
- Monitoring, observability & logging: Prometheus, Grafana, ELK
- Messaging & IoT: MQTT, HTTP/REST, RabbitMQ, Apache Kafka
- Edge platforms: NVIDIA Jetson, Raspberry Pi (ARM)
- Cameras & vision I/O: GenICam, GigE Vision, USB3 Vision
- Industrial automation: PLC integration; protocols: EtherNet/IP, Modbus, PROFINET, OPC UA
- Backend: Python (Flask, FastAPI), TypeScript/Node.js
- Frontend: TypeScript/React
- Databases & storage: SQL, InfluxDB, MongoDB
- Scientific computing: NumPy, Pandas
- Computer vision: OpenCV
- GPU/acceleration: CUDA, TensorRT, ONNX, OpenVINO
- ML/DL frameworks: PyTorch, TensorFlow, Keras, scikit-learn

Your know-how

- You have 3+ years of experience designing and operating scaled production environments for manufacturing, robotics, IoT, and/or industrial automation applications.
- You have a software engineering skill set and a fantastic command of C/C++, Python, or TypeScript.
- You have experience building latency-sensitive, deterministic systems.
- You have experience with monitoring, observability, and alerting stacks and best practices.
- You have a solid understanding of networking, storage, and compute resources in on-prem environments.
- You have experience integrating software with embedded/edge hardware.
- You have experience collaborating effectively within and across cross-functional delivery teams.
- You are a contagiously curious person with entrenched learning habits.

It’s a bonus if:

- You are predisposed to mentorship and to building a culture of continuous improvement.
- You have deep expertise in computer vision, robotics, or manufacturing automation.
- You have production experience deploying AI models at the edge.
- You have experience scaling an AI and/or B2B SaaS venture.
- You have an academic research background in machine learning, computer vision, and/or artificial intelligence (likely, but not necessarily, reflected in a graduate degree in these fields).

Interested in learning more?

Please upload your resume or a .pdf export of your LinkedIn profile using the “Apply Now” button, or send your resume or LinkedIn profile URL to talent@lutrapartners.com with “Senior Production Systems Engineer, Edge AI” as the subject line. One of our talent partners will be in contact shortly.

Responsibilities
You will own the control and data planes for a fleet of edge devices, shipping over-the-air updates and monitoring devices. You will also integrate with cameras and industrial equipment and lead incident response through root-cause analysis.