Multi-View Deep Learning Framework for Protein Sequence Classification
MSc Thesis Defense by: Jaber Al Siam
Date: 6 May 2026
Time: 2:00 PM
Location: 122 Essex Hall
Abstract:
Protein sequence classification is a fundamental problem in bioinformatics, with applications in functional annotation, drug discovery, and understanding biological systems. However, accurate classification remains challenging due to the complex relationships among sequence, structure, and function, as well as the heterogeneous nature of biological data.
This thesis presents a multi-view learning framework that integrates complementary protein representations to improve classification performance. Four modalities are considered: physicochemical properties encoded via Gramian Angular Fields (GAF), compositional patterns using Frequency Chaos Game Representation (FCGR), contextual embeddings from a pretrained protein language model (ProtBERT), and protein-protein interaction (PPI) features derived from graph embeddings.
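As an illustration of the first modality, the sketch below turns a protein sequence into a Gramian Angular Summation Field image. It is a minimal example under stated assumptions, not the thesis implementation: the choice of the Kyte-Doolittle hydropathy scale as the physicochemical property, the summation variant of GAF, and the function name gasf are all assumptions made here for illustration.

```python
import numpy as np

# Kyte-Doolittle hydropathy values per amino acid.
# Assumption: the thesis may encode different physicochemical properties.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gasf(sequence):
    """Return the Gramian Angular Summation Field of a protein sequence.

    Hypothetical helper: maps residues to hydropathy values, rescales them
    to [-1, 1], converts each value to an angle, and builds the matrix
    cos(phi_i + phi_j).
    """
    values = np.array([HYDROPATHY[aa] for aa in sequence], dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant series carries no contrast; map it to zeros.
        scaled = np.zeros_like(values)
    else:
        scaled = 2.0 * (values - lo) / (hi - lo) - 1.0
    # Guard against rounding slightly outside the arccos domain.
    scaled = np.clip(scaled, -1.0, 1.0)
    phi = np.arccos(scaled)
    return np.cos(phi[:, None] + phi[None, :])


image = gasf("MKTAYIAKQR")  # 10 x 10 symmetric matrix with entries in [-1, 1]
```

The resulting matrix can be treated as a single-channel image and passed to a convolutional model; in practice sequences of different lengths would need resizing or padding to a common shape.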
Experiments on a large-scale dataset derived from the RCSB Protein Data Bank (PDB), focusing on the top 20 most frequent protein classes, show that progressively combining complementary representations significantly improves performance, achieving a peak accuracy of 94.27%. These results demonstrate that representation diversity is the primary driver of performance gains in multi-view protein classification.
A systematic evaluation of fusion strategies shows that increasing fusion complexity yields only marginal improvements, whereas simple methods such as logit averaging and gated logit fusion achieve competitive, stable performance.
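The two simple fusion strategies named above can be sketched as follows. This is an illustrative example only: the array layout (views, batch, classes), the function names, and the softmax-normalized per-sample gate are assumptions, and the gating formulation used in the thesis may differ. In a trained model the gate scores would typically come from a small learned network; here they are passed in directly.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def average_logits(view_logits):
    """Logit averaging: unweighted mean over views.

    view_logits has shape (n_views, batch, n_classes).
    """
    return view_logits.mean(axis=0)


def gated_logit_fusion(view_logits, gate_scores):
    """Gated logit fusion (assumed form): per-sample weighted sum of view logits.

    gate_scores has shape (batch, n_views) and holds unnormalized scores;
    a softmax turns them into weights that sum to 1 for each sample.
    """
    weights = softmax(gate_scores, axis=-1)            # (batch, n_views)
    return np.einsum("bv,vbc->bc", weights, view_logits)


# Example: 4 views (GAF, FCGR, ProtBERT, PPI), 2 samples, 20 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 2, 20))
fused_avg = average_logits(logits)
fused_gated = gated_logit_fusion(logits, np.zeros((2, 4)))  # equal gates
predictions = fused_gated.argmax(axis=-1)
```

With equal gate scores the gated fusion reduces to plain logit averaging, which is one way to see why the two methods can perform similarly when no single view dominates.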
Thesis Committee:
Reader 1: Dr. Jessica Chen
Reader 2: Dr. Ikjot Saini
Advisor: Dr. Alioune Ngom
Chair: Dr. Xiaobu Yuan