Start Date
Immediate
Expiry Date
25 Sep, 25
Salary
0.0
Posted On
27 Jun, 25
Experience
0 year(s) or above
Remote Job
Yes
Telecommute
Yes
Sponsor Visa
No
Skills
Good communication skills
Industry
Information Technology/IT
Location
Toronto, Ottawa, San Francisco, New York, London, Paris
Employment Type
Full time
Location Type
Remote
Department
Modelling
As a Pre-Training Data Engineer, you will play a pivotal role in developing the data infrastructure that underpins Cohere’s advanced language models. Your responsibilities will encompass the end-to-end management of training data, including ingestion, cleaning, filtering, and optimization, as well as data modeling to ensure datasets are structured and formatted for optimal model performance. You will work with diverse data sources—such as web data, code data, multilingual corpora, and synthetic data—to ensure their quality, diversity, and reliability.
In this role, you will design and implement scalable, robust pipelines for data processing, conduct data ablations to evaluate quality, and experiment with data mixtures to enhance model performance. By combining research and engineering, you will bridge the gap between raw data and cutting-edge AI models, directly contributing to improvements in critical training metrics like throughput and accelerator utilization.
Your work will be essential to Cohere’s mission of delivering efficient and reliable language understanding and generation capabilities, driving innovation in natural language processing. If you are passionate about transforming data into the foundation of AI systems, this role offers a unique opportunity to make a meaningful impact.
Please note: We have offices in London, Paris, Toronto, Ottawa, San Francisco, and New York, but we also embrace being remote-friendly! There are no restrictions on where you can be located for this role.