Thesis: Out-of-Distribution Detection Techniques on Trained Chemical Transf at Knightec Group
Gothenburg, Sweden
Full Time


Start Date

Immediate

Expiry Date

12 Jan, 26

Salary

0.0

Posted On

14 Oct, 25

Experience

0 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Programming Experience, Python, AI/ML Concepts, Software Integration, Hardware Integration, DNN Architecture, PyTorch, LLMs, LLM APIs, Statistics

Industry

Business Consulting and Services

Description
High Level Description
Recent advances in deep learning have made it possible to represent chemical structures as dense, high-dimensional embeddings in which the model captures subtle relationships learned from training samples. These embeddings are used to predict various chemical properties, which is crucial in fields such as drug discovery and chemical risk assessment. However, in real-world scenarios these models often encounter input samples that fall outside the scope of the model, so-called out-of-distribution (OOD) samples. Detecting OOD samples is critical for ensuring prediction accuracy and reliability.

Project Description
This thesis aims to systematically evaluate and categorize recent advances in OOD detection methods from the scientific literature and assess their applicability to chemical embeddings from a transformer model trained for chemical toxicity prediction [1]. The trained transformer model represents a chemical structure as text tokens, as modern LLMs do, and updates the token embeddings iteratively over several layers. The final layer outputs a single embedding, which is used for toxicity prediction. The central focus of the thesis is to investigate how OOD detection can be applied to the token embeddings to quantify how far a new chemical lies from the in-distribution data. Using energy-based or distance-based measures, such as cosine similarity, the project aims to evaluate OOD detection applied to the embedding vectors and the applicability of the methods to TRIDENT models [1,2]. The data to be used comprises ~10 000 chemicals from the ECOTOX database [3].

Who are we looking for?
We are looking for students who want to write a 30-credit MSc thesis.
You should have:
Required: Programming experience (Python), basic understanding of AI/ML concepts, interest in both software and hardware integration
Nice-to-haves: Experience with DNN architecture, PyTorch, LLMs and LLM APIs, statistics
Students should have studied computer science, AI/ML, robotics, or related fields where software and algorithms are relevant. An interest in data science is helpful but not required.

Purpose
The purpose of this research is to explore the use of OOD detection in life science by surveying the existing state of the art and applying it to a real-world scenario. By creating accurate OOD detection methods, this thesis aims to contribute towards more trustworthy AI models that can be incorporated in data-driven life science.

An Exciting Journey with Knightec Group
Semcon and Knightec have joined forces as Knightec Group. Together, we are Northern Europe’s leading strategic partner in product and digital service development. With a unique combination of cross-functional expertise and a holistic business understanding, we help our clients realize their strategies, from idea to complete solution.

Practical Information
This is a master’s thesis position, located at our office in Gothenburg. Start date: January 2026. Please submit your application as soon as possible, but no later than 2025-11-30. If you have any questions, you are welcome to contact Julia Hellberg. Note that due to GDPR, we only accept applications through our careers page.

References
[1] Mikael Gustavsson et al., Transformers enable accurate prediction of acute and chronic chemical toxicity in aquatic organisms. Sci. Adv. 10, eadk6669 (2024). DOI: 10.1126/sciadv.adk6669
[2] TRIDENT prediction tool: https://trident.serve.scilifelab.se/
[3] ECOTOX database: https://cfpub.epa.gov/ecotox/index.cfm
Responsibilities
The thesis aims to evaluate and categorize recent advances in out-of-distribution detection methods and their applicability to chemical embeddings from a transformer model. The project focuses on investigating how OOD detection can be applied to quantify how far new chemicals lie from in-distribution samples.
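
For applicants unfamiliar with distance-based OOD detection, the idea of quantifying how far a new chemical lies from in-distribution samples can be illustrated with a minimal sketch. It is not the TRIDENT implementation: the function names, the nearest-neighbour scoring rule, the threshold, and the three-dimensional toy vectors are all assumptions made for illustration, and real embeddings would come from the trained transformer.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def ood_score(query, reference_embeddings):
    """Cosine distance from the query to its nearest reference embedding.

    Hypothetical scoring rule: a score near 0 means the query points in almost
    the same direction as some in-distribution embedding; larger scores mean
    the query is further from everything seen during training.
    """
    best = max(cosine_similarity(query, ref) for ref in reference_embeddings)
    return 1.0 - best


def is_ood(query, reference_embeddings, threshold):
    """Flag the query as OOD when its score exceeds an assumed threshold."""
    return ood_score(query, reference_embeddings) > threshold


# Made-up toy vectors standing in for in-distribution chemical embeddings.
reference = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
near = [0.95, 0.05, 0.0]  # close to the reference set
far = [0.0, 0.0, 1.0]     # nearly orthogonal to the reference set

print(ood_score(near, reference), is_ood(near, reference, threshold=0.5))
print(ood_score(far, reference), is_ood(far, reference, threshold=0.5))
```

In practice the threshold would be calibrated on held-out in-distribution data rather than fixed by hand, and comparing such a rule against energy-based and other measures is the kind of evaluation the thesis describes.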