MSc Thesis Defense Announcement by Ziyang Tian: "A Comparative Study of Document Representation Methods"

Thursday, December 5, 2019 - 13:00 to 15:00

SCHOOL OF COMPUTER SCIENCE

 

The School of Computer Science at the University of Windsor is pleased to present …

MSc Thesis Defense by: Ziyang Tian

 

Date: Thursday December 5th, 2019

Time:  1:00 pm – 3:00 pm

Location: Lambton Tower Room 3105

 

Abstract:

Document representation learning is crucial for downstream machine learning tasks such as document classification. Recent neural network approaches such as Doc2Vec and its variants are popular. However, comparisons between these approaches and traditional representation methods such as TF-IDF have not been conclusive, for several reasons: Doc2Vec has many hyper-parameters, which causes its performance to fluctuate, and traditional methods still have room for improvement. More importantly, document length and data size affect the results. This thesis conducts a comparative study of these methods and proposes improving the TF-IDF weighting with mutual information (MI). We find that Doc2Vec works well only for short documents, and only when the data size (the number of documents) is large. For long documents and small data sizes, MI performs better. The experiments are conducted on 11 data sets that cover a variety of combinations of document length and data size.

 

In addition, we study the relationship between TF-IDF and MI weighting. We find that their correlation is high overall (the Pearson correlation coefficient is over 0.9 on all the data sets used in our thesis). For medium-frequency words, the MI weight is always smaller than the TF-IDF weight. For rare words and popular words, however, MI diverges greatly from TF-IDF: the MI weight is higher than the TF-IDF weight for popular words but lower for rare words.
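As a rough illustration of the kind of comparison described above, the following Python sketch computes a TF-IDF weight and an MI-style weight for every (document, term) pair in a toy corpus and then measures their Pearson correlation. The abstract does not give the thesis's exact MI formula, so pointwise mutual information is used here as an assumption. The function names, the natural-log TF-IDF variant, and the toy corpus are likewise hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Map (doc index, term) to tf * log(number of docs / document frequency)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {
        (i, term): tf * math.log(n_docs / df[term])
        for i, doc in enumerate(docs)
        for term, tf in Counter(doc).items()
    }


def pmi_weights(docs):
    """Map (doc index, term) to pointwise mutual information (assumed MI weight).

    PMI(t, d) = log(p(t, d) / (p(t) * p(d))), with probabilities estimated
    from token counts over the whole corpus.
    """
    total = sum(len(doc) for doc in docs)
    cf = Counter(term for doc in docs for term in doc)
    return {
        (i, term): math.log(tf * total / (cf[term] * len(doc)))
        for i, doc in enumerate(docs)
        for term, tf in Counter(doc).items()
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy corpus for illustration only; not one of the thesis data sets.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock markets fell sharply today".split(),
]

tfidf = tfidf_weights(docs)
pmi = pmi_weights(docs)

# Compare the two weightings over the same (document, term) pairs.
keys = sorted(tfidf)
r = pearson([tfidf[k] for k in keys], [pmi[k] for k in keys])
print(f"Pearson correlation between TF-IDF and PMI weights: {r:.3f}")
```

A corpus this small says nothing about the correlation values reported in the abstract; the sketch only shows how two weighting schemes can be aligned pair by pair before they are compared.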

Thesis Committee:

Internal Reader: Dr. Alioune Ngom

External Reader: Dr. Behnam Shahrrava

Advisor: Dr. Jianguo Lu

Chair: TBD

5113 Lambton Tower, 401 Sunset Ave., Windsor, ON N9B 3P4 (519) 253-3000 Ext. 3716 csgradinfo@uwindsor.ca