Senior Data Engineer at Maincode
Melbourne, Victoria, Australia - Full Time

Start Date

Immediate

Expiry Date

16 Nov, 2025

Salary

180,000

Posted On

16 Aug, 2025

Experience

0 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Good communication skills

Industry

Information Technology/IT

Description

OVERVIEW

Maincode is building sovereign AI models in Australia. We are training foundation models from scratch, designing new reasoning architectures, and deploying them on state-of-the-art GPU clusters. Our models are built on datasets we create ourselves: curated, cleaned, and engineered for performance at scale. This is not buying off-the-shelf corpora or scraping without thought. This is building world-class datasets from the ground up.
As a Senior Data Engineer, you will lead the design and construction of these datasets. You will work hands-on to source, clean, transform, and structure massive amounts of raw data into training-ready form. You will design the architecture that powers data ingestion, validation, and storage for multi-terabyte to petabyte-scale AI training. You will collaborate with AI Researchers and Engineers to ensure every byte is high quality, relevant, and optimised for training cutting-edge large language models and other architectures.
This is a deep technical role. You will be writing code, building pipelines, defining schemas, and debugging unusual data edge cases at scale. You will think like both a data scientist and a systems engineer, designing for correctness, scalability, and future-proofing. If you want to build the datasets that power sovereign AI from first principles, this is your team.

How To Apply:

If you would like to apply for this job directly at the source, please click here.

Responsibilities
  • Design and build large-scale data ingestion and curation pipelines for AI training datasets
  • Source, filter, and process diverse data types, including text, structured data, code, and multimodal content, from raw form to model-ready format
  • Implement robust quality control and validation systems to ensure dataset integrity, relevance, and ethical compliance
  • Architect storage and retrieval systems optimised for distributed training at scale
  • Build tooling to track dataset lineage, reproducibility, and metadata at all stages of the pipeline
  • Work closely with AI Researchers to align datasets with evolving model architectures and training objectives
  • Collaborate with DevOps and ML engineers to integrate data systems into large-scale training workflows
  • Continuously improve ingestion speed, preprocessing efficiency, and data freshness for iterative training cycles