MSc Thesis Proposal of Prem Shanker Mohan: "Understanding BERT with Transformers on Small Datasets"

Tuesday, September 6, 2022 - 14:00 to 15:30

SCHOOL OF COMPUTER SCIENCE 

The School of Computer Science is pleased to present… 

MSc Thesis Proposal by: Prem Shanker Mohan 

 
Date: Tuesday, September 6, 2022
Time: 2:00 pm – 3:30 pm
Passcode: If you are interested in attending this event, contact the Graduate Secretary at csgradinfo@uwindsor.ca with sufficient notice before the event to obtain the passcode.
 

Abstract:  

With the introduction of the attention mechanism, Bidirectional Encoder Representations from Transformers (BERT) has greatly advanced research on sequence-to-sequence tasks in Natural Language Processing (NLP). When task-specific annotations are limited, NLP tasks are typically addressed by pre-training a transformer-based model on large-scale general corpora and then fine-tuning it on domain-specific data. Instead of using shallow neural components for fine-tuning, additional transformer layers can be introduced into the architecture. Recent research shows that, by resolving certain initialization and optimization issues, these augmented transformer layers can yield performance gains despite the limited size of the available data, especially on well-structured data. Along this direction, we propose to perform comprehensive experiments to gain more insight into the performance of the prediction models with respect to various structures in the datasets. In addition to the already studied datasets with structures for semantic parsing and for logical reading comprehension, several other datasets with different structures will be identified and analyzed in the experiments.
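To make the fine-tuning setup concrete, the following is a minimal sketch, assuming PyTorch and the HuggingFace transformers library; the checkpoint name, the number of extra layers, and the head count are illustrative choices, not the author's exact configuration:

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class BertWithExtraLayers(nn.Module):
        """Pre-trained BERT with extra transformer layers instead of a shallow head."""
        def __init__(self, num_extra_layers=2, num_labels=2):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            hidden = self.bert.config.hidden_size  # 768 for bert-base
            layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12,
                                               batch_first=True)
            # The augmented layers are fine-tuned on the small, domain-specific data.
            self.extra = nn.TransformerEncoder(layer, num_layers=num_extra_layers)
            self.classifier = nn.Linear(hidden, num_labels)

        def forward(self, input_ids, attention_mask):
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            # Mask out padding positions in the added layers as well.
            h = self.extra(out.last_hidden_state,
                           src_key_padding_mask=~attention_mask.bool())
            return self.classifier(h[:, 0])  # predict from the [CLS] position

How the added layers are initialized and optimized is exactly the kind of issue the cited research addresses.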
 
Stemming from batch normalization, which is widely adopted in computer vision, power normalization has been shown to outperform the layer normalization usually found in transformers. In the family of adaptive gradient methods, the momentumized, adaptive, dual-averaged gradient method (MADGRAD) was recently proposed and has demonstrated strong performance on deep learning optimization problems from a variety of fields. For possible performance improvement in the setting considered in this thesis work, we will study the applicability of these two new methods for stochastic optimization.
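As a rough illustration of the two methods, the sketch below implements a simplified power-normalization module (activations rescaled by a running estimate of their quadratic mean; the published method also uses relaxed backpropagation, which this toy version omits) and shows drop-in use of the MADGRAD optimizer from its pip package. It assumes PyTorch, and all hyperparameters are illustrative:

    import torch
    import torch.nn as nn
    from madgrad import MADGRAD  # optimizer released with the MADGRAD paper

    class SimplePowerNorm(nn.Module):
        """Toy power normalization: scale by a running quadratic-mean estimate."""
        def __init__(self, dim, alpha=0.9, eps=1e-5):
            super().__init__()
            self.alpha, self.eps = alpha, eps
            self.weight = nn.Parameter(torch.ones(dim))
            self.bias = nn.Parameter(torch.zeros(dim))
            self.register_buffer("running_phi", torch.ones(dim))

        def forward(self, x):  # x: (batch, seq, dim)
            if self.training:
                phi = x.pow(2).mean(dim=(0, 1))  # quadratic mean per feature
                self.running_phi.mul_(self.alpha).add_((1 - self.alpha) * phi.detach())
            else:
                phi = self.running_phi
            return self.weight * x / torch.sqrt(phi + self.eps) + self.bias

    model = BertWithExtraLayers()  # the fine-tuning sketch above
    optimizer = MADGRAD(model.parameters(), lr=1e-4, momentum=0.9, weight_decay=0)

Unlike batch normalization, the module above normalizes by second moments only, which avoids the instability of per-batch mean estimates on the small batches typical of fine-tuning.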
 
Keywords: Transformer, BERT, NLP 
                       

MSc Thesis Committee:  

Internal Reader: Dr. Aznam Yacoub
External Reader: Dr. Michael Wang
Advisor: Dr. Jessica Chen
 

MSc Thesis Proposal Announcement 

[Logo: Vector Institute in Artificial Intelligence, approved artificial intelligence topic]

 

5113 Lambton Tower, 401 Sunset Ave., Windsor, ON N9B 3P4 | (519) 253-3000 Ext. 3716 | csgradinfo@uwindsor.ca