
The School of Computer Science is pleased to present…

Training Vision Transformer for Image Classification Task

PhD Dissertation Proposal by: Chris Khalil


Date: Tuesday, April 16, 2024

Time: 12:00 pm

Location: Essex Hall, Room 122


Vision Transformers (ViTs) are powerful computer vision models that are widely believed to be challenging to train. The current trend in computer vision is to pre-train ViTs on large private datasets containing 14-300 million images. However, these models require hundreds of compute-days to train, which is feasible only with significant capital. This proposal presents methods that overcome the difficulty of training ViTs, along with applications of ViTs to challenging computer vision problems.

We first describe a new pretext task that combines Token Merge and clustering to learn a function that maps high-dimensional inputs to lower-dimensional outputs. The new method is more powerful than similar methods and more efficient to train. Our initial results are promising, as our model has outperformed existing methods: under the linear evaluation protocol on ImageNet, it reaches 85.7% top-1 accuracy with a standard ViT-B.

Human-designed augmentations that leverage domain knowledge are essential for self-supervised learning to produce valuable representations, and our approach combines standard augmentations with Token Merge. However, a domain-knowledge-free augmentation strategy is desirable so that the approach can be applied to more general domains and out-of-distribution scenarios. Among the available augmentation techniques, masking is the most versatile and straightforward, as it can be applied to a wide range of input types with minimal domain knowledge. Such pixel-level recovery tasks, however, tend to waste modeling capacity on short-range dependencies and high-frequency details. Our goal is to overcome these issues when pre-training Vision Transformers by employing superpixels to construct a vocabulary for vision, akin to how words form the building blocks of human language.
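The abstract does not spell out the Token Merge operation. One common formulation (bipartite soft matching, as popularized by ToMe) can be sketched as follows; this is a generic illustration under that assumption, not the author's exact method, and all function names here are hypothetical.

```python
import numpy as np

def token_merge(tokens: np.ndarray, r: int) -> np.ndarray:
    """Merge the r most similar token pairs by averaging.

    `tokens` has shape (N, D). Tokens are split into two alternating
    sets A and B; each A token proposes its most similar partner in B,
    and the r highest-scoring pairs are averaged. A generic sketch of
    bipartite soft matching, not the proposal's exact algorithm.
    """
    a, b = tokens[::2], tokens[1::2]             # alternating partition
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_n @ b_n.T                            # cosine similarity matrix
    partner = sim.argmax(axis=1)                 # best B match per A token
    scores = sim[np.arange(len(a)), partner]
    merged_idx = np.argsort(-scores)[:r]         # r most similar pairs
    a_keep = np.ones(len(a), dtype=bool)
    a_keep[merged_idx] = False
    b_keep = np.ones(len(b), dtype=bool)
    b_keep[partner[merged_idx]] = False          # B tokens consumed by merges
    merged = (a[merged_idx] + b[partner[merged_idx]]) / 2
    return np.concatenate([a[a_keep], b[b_keep], merged], axis=0)
```

When the r selected pairs have distinct partners, the sequence shrinks from N tokens to N - r, which is what makes merging cheaper than full self-attention over all tokens at every layer.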
Thesis Committee:
Internal Reader: Dr. Saeed Samet
Internal Reader: Dr. Curtis Bright             
External Reader: Dr. Mohammad Hassanzadeh 
Advisor: Dr. Alioune Ngom
