MSc Thesis Defense Announcement of Prem Shanker Mohan: "Effects of PowerNorm and MADGRAD on DT-Fixup Performance"

Monday, June 5, 2023 - 13:00 to 16:00


The School of Computer Science is pleased to present…

MSc Thesis Defense by: Prem Shanker Mohan

Date: Monday June 5th, 2023
Time: 1:00 pm – 3:00 pm
Location: Essex Hall, Room 122
Reminders:
1. Two-part attendance is mandatory (sign-in sheet, QR code).
2. Arrive 5-10 minutes prior to the event starting - LATECOMERS WILL NOT BE ADMITTED. Note that due to demand, if the room has reached capacity, admission is not guaranteed even if you are "early".
3. Please be respectful of the presenter by NOT knocking on the door for admittance once the door has been closed, whether the presentation has begun or not. If the room is at capacity, overflow is not permitted (i.e., sitting on floors), as this is a violation of the Fire Safety code.
4. Be respectful of the decision of the advisor/host of the event if you are not given admittance. The School of Computer Science has numerous events occurring soon.



With the introduction of the attention mechanism, Bidirectional Encoder Representations from Transformers (BERT) has greatly advanced research on sequence-to-sequence tasks in Natural Language Processing (NLP). When task-specific annotations are limited, NLP tasks are commonly performed by pre-training a transformer-based model on large-scale general corpora, followed by fine-tuning the model on domain-specific data. Instead of using shallow neural components for fine-tuning, additional transformer layers can be introduced into the architecture. Recent research shows that, by resolving certain initialization and optimization issues, these augmented transformer layers can yield performance gains despite the limited size of the available data, especially on well-structured data. Along this direction, we perform comprehensive experiments on the DT-Fixup algorithm, which is designed to mitigate these issues. To further improve DT-Fixup's performance, we propose to study the applicability of power normalization and the Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization (MADGRAD) in this setting. This is motivated by recent literature showing that power normalization, which stems from the batch normalization widely adopted in computer vision, outperforms the layer normalization usually found in transformers. Within the family of AdaGrad adaptive gradient methods, MADGRAD is a new optimization technique that performs exceptionally well on deep learning optimization problems from a variety of fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in NLP. Even on problems where adaptive methods typically perform badly, MADGRAD matches or exceeds both SGD and Adam in test set performance on each of these tasks.
This research will be performed on the ReClor and LogiQA datasets, selected according to their structure.
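To illustrate the normalization contrast mentioned above, the following is a minimal NumPy sketch (not the thesis's implementation): layer normalization standardizes each sample over its feature dimension, while power normalization, in simplified form, rescales activations by a running estimate of the batch's second moment without per-batch mean subtraction. The function names and the running-average coefficient `alpha` are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Standard layer normalization: per-sample mean and variance
    # computed over the feature dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def power_norm(x, gamma, beta, running_phi, alpha=0.9, eps=1e-5):
    # Simplified power normalization sketch: divide by a running
    # estimate of the quadratic mean (second moment) taken over the
    # batch dimension; no per-batch mean subtraction is performed.
    # Returns the normalized output and the updated running moment.
    phi = (x ** 2).mean(axis=0, keepdims=True)
    running_phi = alpha * running_phi + (1 - alpha) * phi
    return gamma * x / np.sqrt(running_phi + eps) + beta, running_phi

# Example: a batch of 4 samples with 8 features each.
x = np.random.randn(4, 8)
gamma, beta = np.ones(8), np.zeros(8)
y_ln = layer_norm(x, gamma, beta)
y_pn, phi = power_norm(x, gamma, beta, running_phi=np.ones((1, 8)))
```

Because power normalization relies on a running statistic rather than per-batch statistics alone, it behaves more stably under the small batch sizes typical of fine-tuning, which is part of its appeal in this setting.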
Keywords: RoBERTa, PowerNorm, MADGRAD, DT-Fixup  

MSc Thesis Committee:

Internal Reader: Aznam Yacoub
External Reader: Michael Wang
Advisor: Jessica Chen
Chair: Ahmad Biniaz



5113 Lambton Tower 401 Sunset Ave. Windsor ON, N9B 3P4 (519) 253-3000 Ext. 3716