Senior Data Engineer at Maincode
Melbourne, Victoria, Australia - Full Time

Start Date

Immediate

Expiry Date

16 Nov, 2025

Salary

180,000

Posted On

16 Aug, 2025

Experience

0 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Good communication skills

Industry

Information Technology/IT

Description

OVERVIEW

Maincode is building sovereign AI models in Australia. We are training foundation models from scratch, designing new reasoning architectures, and deploying them on state-of-the-art GPU clusters. Our models are built on datasets we create ourselves: curated, cleaned, and engineered for performance at scale. This is not buying off-the-shelf corpora or scraping without thought. This is building world-class datasets from the ground up.
As a Senior Data Engineer, you will lead the design and construction of these datasets. You will work hands-on to source, clean, transform, and structure massive amounts of raw data into training-ready form. You will design the architecture that powers data ingestion, validation, and storage for multi-terabyte to petabyte-scale AI training. You will collaborate with AI Researchers and Engineers to ensure every byte is high quality, relevant, and optimised for training cutting-edge large language models and other architectures.
This is a deep technical role. You will be writing code, building pipelines, defining schemas, and debugging unusual data edge cases at scale. You will think like both a data scientist and a systems engineer, designing for correctness, scalability, and future-proofing. If you want to build the datasets that power sovereign AI from first principles, this is your team.

How To Apply:

If you would like to apply for this job directly at the source, please click here.

Responsibilities
  • Design and build large-scale data ingestion and curation pipelines for AI training datasets
  • Source, filter, and process diverse data types, including text, structured data, code, and multimodal content, from raw form to model-ready format
  • Implement robust quality control and validation systems to ensure dataset integrity, relevance, and ethical compliance
  • Architect storage and retrieval systems optimised for distributed training at scale
  • Build tooling to track dataset lineage, reproducibility, and metadata at all stages of the pipeline
  • Work closely with AI Researchers to align datasets with evolving model architectures and training objectives
  • Collaborate with DevOps and ML engineers to integrate data systems into large-scale training workflows
  • Continuously improve ingestion speed, preprocessing efficiency, and data freshness for iterative training cycles