MSc Thesis Proposal by: Siyam Sajnan Chowdhury

Thursday, May 9, 2024 - 11:30

The School of Computer Science is pleased to present…

Investigating Large Language Model Embeddings for Predicting High-Frequency Drug Side Effects

Date: Thursday, 09 May 2024

Time: 11:30 am

Location: Essex Hall, Room 122


Large language models have brought about a paradigm shift in natural language processing. Their large scale, deep architectures, and pre-training on massive amounts of data enable them to learn rich and nuanced representations of language, and they have demonstrated impressive performance on natural language understanding tasks across different domains. This research investigates the performance of large language models in drug side-effect frequency prediction. Our methodology uses Galeano's dataset, a standard benchmark for drug side-effect frequency prediction. We used ChemBERTa, a large language model based on the BERT architecture, to embed the chemical structures of the drugs, and SimCSE, another BERT-based model that incorporates contrastive learning, to embed the side effects. Using these embeddings, we predicted the high-frequency side effects with a deep learning model. Measuring the frequency of side effects can help determine the therapeutic efficacy of a drug in clinical settings and help weigh the potential risks and benefits of certain drugs. The key objective of this research is to examine how well various large language models predict the frequencies of drug side effects.

Keywords: Large Language Models, ChemBERTa, SimCSE, Drug Side Effect Frequency

Thesis Committee:
Internal Reader: Dr. Dan Wu
External Reader: Dr. Andrew Swan
Advisor: Dr. Alioune Ngom

MAC STUDENTS ONLY - Register here