MSc Thesis Proposal of Prem Shanker Mohan: "Understanding BERT with Transformers on Small Datasets"

Tuesday, September 6, 2022 - 14:00 to 15:30


The School of Computer Science is pleased to present… 

MSc Thesis Proposal by: Prem Shanker Mohan 

Date: Tuesday, September 6, 2022
Time: 2:00 pm – 3:30 pm
Passcode: If you are interested in attending this event, contact the Graduate Secretary with sufficient notice before the event to obtain the passcode.


With the introduction of the attention mechanism, Bidirectional Encoder Representations from Transformers (BERT) has greatly advanced the study of sequence-to-sequence tasks in Natural Language Processing (NLP). When task-specific annotations are limited, NLP tasks are typically performed by pre-training a transformer-based model on large-scale general corpora and then fine-tuning it on domain-specific data. Instead of using shallow neural components for fine-tuning, additional transformer layers can be introduced into the architecture. Recent research shows that, once some initialization and optimization issues are resolved, these added transformer layers can yield performance gains despite the limited size of the available data, especially for well-structured data. Along this direction, we propose to perform comprehensive experiments to gain more insight into how the prediction models perform with respect to various structures in the datasets. In addition to the already studied datasets with structures for semantic parsing and logical reading comprehension, several other datasets with different structures will be identified and analyzed in the experiments.
Power normalization, which stems from the batch normalization widely adopted in computer vision, has been shown to outperform the layer normalization usually found in transformers. In the family of adaptive gradient methods, the momentumized, adaptive, dual averaged gradient method (MADGRAD) was recently proposed and reported to perform strongly on deep learning optimization problems from a variety of fields. For possible performance improvement in the setting considered in this thesis work, we will study the applicability of these two new methods for stochastic optimization.
Keywords: Transformer, BERT, NLP 

MSc Thesis Committee:  

Internal Reader: Dr. Aznam Yacoub
External Reader: Dr. Michael Wang
Advisor: Dr. Jessica Chen




5113 Lambton Tower 401 Sunset Ave. Windsor ON, N9B 3P4 (519) 253-3000 Ext. 3716