Data Engineer at AI Robot Association

Tokyo, Japan

Full Time

Start Date

Immediate

Expiry Date

27 Jan, 26

Salary

0.0

Posted On

29 Oct, 25

Experience

2 years or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Data Engineering, Machine Learning, Robotics, Python, SQL, Distributed Systems, Cloud Services, Data Processing, Data Quality, Data Pipelines, Data Schemas, Workflow Orchestration, Sensor Data Processing, Image Processing, Sensor Fusion, ROS

Industry

Technology; Information and Internet

Description

About AIRoA

The AI Robot Association (AIRoA) is launching a groundbreaking initiative: collecting one million hours of humanoid robot operation data with hundreds of robots, and leveraging it to train the world’s most powerful Vision-Language-Action (VLA) models.

What makes AIRoA unique is not only the unprecedented scale of real-world data and humanoid platforms, but also our commitment to making everything open and accessible. We are building a shared “robot data ecosystem” where datasets, trained models, and benchmarks are available to everyone. Researchers around the world will be able to evaluate their models on standardized humanoid robots through our open evaluation platform.

For researchers, this means an opportunity to:

- Work on fundamental challenges in robotics and AI: multimodal learning, tactile-rich manipulation, sim-to-real transfer, and large-scale benchmarking.
- Access state-of-the-art infrastructure: hundreds of humanoid robots, GPU clusters, high-fidelity simulators, and a global-scale evaluation pipeline.
- Collaborate with leading experts across academia and industry, and publish results that will shape the next decade of robotics.
- Contribute to an initiative that will redefine the future of embodied AI, with all results made open to the world.

Key Responsibilities

You will play a critical role in building the data backbone powering next-generation robotics foundation models:

- Design and implement large-scale data pipelines that cover the full lifecycle of high-quality datasets for robotics foundation models: collection, processing, curation, and publishing.
- Design, build, and maintain data schemas, storage solutions, and query interfaces to enable VLA researchers to efficiently discover, query, and consume curated datasets.
- Collaborate closely with VLA researchers to capture evolving data requirements and continuously improve data pipelines through analysis and experimentation.
- Design and scale distributed data-processing pipelines capable of handling petabyte-scale multimodal datasets (e.g., RGB/Depth, point clouds) with full lineage and reproducibility.
- Define data-quality metrics and build feedback loops to continuously monitor and improve data quality.

Required Qualifications

- Master’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
- 3+ years of professional experience as a software engineer in data engineering, machine learning, or robotics.
- Experience developing high-quality, production-level software in a team environment.
- Experience in deploying distributed systems to popular cloud services such as AWS, GCP, or Azure.
- Experience with distributed data-processing frameworks (e.g., Spark, Flink, Ray) and workflow orchestration tools (e.g., Airflow, Kedro, Dagster).
- High proficiency in Python.

Preferred Qualifications

- Experience working with terabyte or petabyte-scale datasets.
- Expertise in data lake storage systems such as Apache Iceberg or Delta Lake with query systems such as Trino and catalog systems such as Nessie.
- Expertise in distributed processing frameworks like Spark, Flink, or Ray.
- Expertise in workflow tools such as Airflow, Kedro, or Dagster.
- Experience in analyzing, monitoring, and managing data quality.
- Experience with processing robotics-related sensor data (e.g., RGB/Depth images, point clouds), including knowledge of image/signal processing, sensor fusion, and time synchronization.
- Experience with ROS/ROS2.
- High proficiency in SQL.
- Experience optimizing system performance using performance analysis tools.

Other Qualifications (Language, etc.)

- Highly appreciated: English proficiency at business level; Japanese proficiency a plus.

There are currently no comparable projects in the world that collect data and develop foundation models on such a large scale. This is one of Japan’s leading national projects, supported by a substantial investment of 20.5 billion yen from NEDO.

This position will play a crucial role in determining the success of the project. You will have broad discretion and responsibility, and we are confident that, if successful, you will gain both a great sense of achievement and the opportunity to make a meaningful contribution to society. Furthermore, we strongly encourage engineers to actively build their careers through this project, for example by publishing research papers and engaging in academic activities.

Responsibilities

You will design and implement large-scale data pipelines for robotics foundation models, covering the full lifecycle of datasets. You will also collaborate with researchers to capture evolving data requirements and improve the pipelines through analysis and experimentation.