Backend Engineer - AI/ML Infrastructure at CloudGeometry
Remote (Oregon, USA)
Full Time


Start Date

Immediate

Expiry Date

14 Nov 2025

Salary

Not specified

Posted On

14 Aug 2025

Experience

7 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Infrastructure, Containerization, Python, Amazon Web Services (AWS), Communication Skills, Software, TypeScript

Industry

Information Technology/IT

Description

At CloudGeometry, we’re redefining how modern data and AI systems are built. As a leading cloud-native engineering firm, we work with pioneering technology companies to deliver high-impact solutions across infrastructure, machine learning, and intelligent applications.
We are looking for five highly skilled AI Infrastructure Engineers to join our growing team supporting large-scale AI/ML systems. This is a hands-on engineering role focused on building scalable, secure, and production-ready infrastructure that powers ML workflows end to end, from experimentation to deployment and monitoring.

What You’ll Do

  • Design, implement, and maintain robust infrastructure for ML workflows across real-time and batch environments.
  • Build and support production-grade model lifecycle systems, including registration, versioning, and deployment workflows.
  • Develop APIs and backend services in TypeScript and Python to support model integration and orchestration.
  • Manage and optimize infrastructure using AWS and infrastructure-as-code (CDK preferred).
  • Work with Databricks MLflow for end-to-end model management, including asset bundling and serving pipelines.
  • Collaborate with cross-functional teams including ML scientists, backend engineers, and DevOps to deliver high-impact features.
  • Monitor and improve infrastructure reliability, security, and performance across diverse deployment targets.
  • Contribute to CI/CD workflows, container orchestration (Docker, ECS), and automation for ML pipelines.

Why Join CloudGeometry?
You’ll work alongside top-tier engineers across the US, LATAM, and Europe on cutting-edge projects in AI, cloud, and enterprise SaaS. We value deep technical curiosity, strong collaboration, and a bias for action in solving meaningful problems.

SKILLS

  • Large Language Models (LLM)
  • Software as a Service (SaaS)
  • Databricks Products
  • Python (Programming Language)
  • Infrastructure
  • TypeScript
  • MLflow
  • MLOps
  • Amazon Web Services (AWS)

REQUIREMENTS

What We’re Looking For

  • 7+ years in software or infrastructure engineering with proven experience supporting AI/ML systems.
  • Deep hands-on experience with AWS services and modern IaC practices (Terraform/CDK).
  • Strong backend programming skills in TypeScript and Python.
  • Production-level use of MLflow for model management and deployment.
  • Expertise in containerization (Docker), CI/CD automation, and orchestration tools.
  • Solid understanding of designing scalable and secure systems in cloud-native environments.
  • Strong communication skills, with the ability to bridge gaps between engineering and product stakeholders.
  • Comfortable in fast-paced, collaborative environments working across time zones.

Nice to Have

  • Exposure to LLM infrastructure and frameworks (e.g., DSPy, LangChain).
  • Knowledge of LLM performance metrics: latency, cost monitoring, and usage optimization.
  • Familiarity with semantic search tools and vector stores (e.g., OpenSearch, Pinecone).