MSc Thesis Defense: Multi-View Deep Learning Framework for Protein Sequence Classification by Jaber Al Siam

Wednesday, May 6, 2026 - 14:00

Multi-View Deep Learning Framework for Protein Sequence Classification

MSc Thesis Defense by: Jaber Al Siam

Date: 6 May 2026

Time: 2:00 PM

Location: 122 Essex Hall

Abstract:

Protein sequence classification is a fundamental problem in bioinformatics, with applications in functional annotation, drug discovery, and understanding biological systems. However, accurate classification remains challenging due to the complex relationships among sequence, structure, and function, as well as the heterogeneous nature of biological data.

This thesis presents a multi-view learning framework that integrates complementary protein representations to improve classification performance. Four modalities are considered: physicochemical properties encoded via Gramian Angular Fields (GAF), compositional patterns using Frequency Chaos Game Representation (FCGR), contextual embeddings from a pretrained protein language model (ProtBERT), and protein-protein interaction (PPI) features derived from graph embeddings.
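
As a rough illustration of the GAF view, the sketch below maps a protein sequence to a per-residue property series and then to a Gramian Angular Summation Field. It is a minimal sketch under stated assumptions, not the encoding used in the thesis: the Kyte-Doolittle hydropathy scale, the summation variant, the function name `gasf`, and the example sequence are all chosen here purely for illustration.

```python
import math

# Kyte-Doolittle hydropathy values per amino acid.
# Assumption: the thesis may use other physicochemical properties.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gasf(sequence, scale=KYTE_DOOLITTLE):
    """Return a Gramian Angular Summation Field as a list of rows.

    Hypothetical helper: the name and signature are illustrative only.
    """
    values = [scale[aa] for aa in sequence]
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant series carries no contrast; map it to zeros.
        scaled = [0.0 for _ in values]
    else:
        # Min-max rescale into [-1, 1] so that arccos is defined.
        scaled = [2.0 * (v - lo) / span - 1.0 for v in values]
    # Clamp to guard against floating-point drift outside [-1, 1].
    scaled = [max(-1.0, min(1.0, x)) for x in scaled]
    # Polar encoding: each value becomes an angle.
    phi = [math.acos(x) for x in scaled]
    # Entry (i, j) is cos(phi_i + phi_j).
    return [[math.cos(a + b) for b in phi] for a in phi]


image = gasf("MKTAYIAKQR")  # arbitrary example sequence
print(len(image), len(image[0]))
```

The result is a square, symmetric matrix with one row and column per residue, which can be treated as a single-channel image by a convolutional model. Sequences of different lengths would need resizing or padding before batching, a step this sketch leaves out.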

Experiments on a large-scale dataset derived from the RCSB Protein Data Bank (PDB), focusing on the top 20 most frequent protein classes, show that progressively combining complementary representations significantly improves performance, achieving a peak accuracy of 94.27%. These results demonstrate that representation diversity is the primary driver of performance gains in multi-view protein classification.

A systematic evaluation of fusion strategies shows that increasing fusion complexity yields only marginal improvements, whereas simple methods such as logit averaging and gated logit fusion achieve competitive, stable performance.
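
The two simple fusion methods named above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the thesis implementation: the helper names, the example logits, and the gate scores are hypothetical. In a trained model the gate scores would be produced by a learned gating network, whereas here they are fixed by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_logits(view_logits):
    """Logit averaging: unweighted mean of per-view logits, class by class."""
    n_views = len(view_logits)
    return [sum(column) / n_views for column in zip(*view_logits)]


def gated_logits(view_logits, gate_scores):
    """Gated logit fusion: softmax-normalized gate weights, one per view.

    Assumption: one scalar gate per view; the thesis may gate differently.
    """
    weights = softmax(gate_scores)
    return [
        sum(w * logit for w, logit in zip(weights, column))
        for column in zip(*view_logits)
    ]


# Hypothetical logits for one protein over three classes,
# one row per view (GAF, FCGR, ProtBERT, PPI).
logits = [
    [2.0, 0.0, -1.0],
    [1.0, 1.0, 0.0],
    [3.0, -1.0, -2.0],
    [0.0, 2.0, 1.0],
]

averaged = average_logits(logits)
gated = gated_logits(logits, gate_scores=[0.5, 0.0, 1.5, -1.0])
print(averaged, softmax(averaged))
print(gated, softmax(gated))
```

With equal gate scores the gated variant reduces to plain averaging, which is one way to see why the two methods can perform similarly when no single view dominates.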

Thesis Committee:

Reader 1: Dr. Jessica Chen

Reader 2: Dr. Ikjot Saini

Advisor: Dr. Alioune Ngom

Chair: Dr. Xiaobu Yuan