SCHOOL OF COMPUTER SCIENCE
The School of Computer Science at the University of Windsor is pleased to present …
MSc Thesis Proposal by: Ziyang Tian
Date: Thursday December 5th, 2019
Time: 1:00 pm – 3:00 pm
Location: Lambton Tower Room 3105
Document representation learning is crucial for downstream machine learning tasks such as document classification. Recent neural network approaches such as Doc2Vec and its variants are popular. Comparisons between these approaches and traditional representation methods such as TF-IDF have been inconclusive for several reasons: Doc2Vec has many hyper-parameters, which causes its performance to fluctuate, and traditional methods still have room for improvement. More importantly, document length and data size affect the results. This thesis conducts a comparative study of these methods and proposes to improve TF-IDF weighting with mutual information (MI). We find that Doc2Vec works well only for short documents, and only when the data size (the number of documents) is large. For long documents and small data sizes, MI performs better. Extensive experiments are conducted on 11 data sets that cover a variety of combinations of document length and data size.
In addition, we study the relationship between TF-IDF and MI weighting. We find that their correlation is high overall (the Pearson correlation coefficient is over 0.9 on all the data sets used in our thesis). For medium-frequency words, the MI weighting is always smaller than the TF-IDF weighting. For rare words and popular words, however, MI diverges greatly from TF-IDF: the MI weighting is higher than TF-IDF for popular words but lower than TF-IDF for rare words.
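As a rough illustration of the kind of comparison described above, the following Python sketch computes two per-term weights on a toy corpus and their Pearson correlation. It rests on assumptions that are not taken from the thesis: the IDF component log(N/df) stands in for the TF-IDF weighting, MI is taken as the mutual information between a term's presence in a document and the document's class label, and the toy documents, labels, and function names are hypothetical. The exact definitions used in the thesis may differ.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Unsmoothed IDF, log(N / df), per term. Each doc is a list of tokens."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(n / c) for t, c in df.items()}

def mi_weights(docs, labels):
    """Mutual information (in nats) between term presence and class label.

    Assumed definition for illustration only.
    """
    n = len(docs)
    doc_sets = [set(doc) for doc in docs]
    class_counts = Counter(labels)
    vocab = set().union(*doc_sets)
    weights = {}
    for t in vocab:
        # Joint counts over (term present?, label); only observed cells appear.
        joint = Counter((t in d, y) for d, y in zip(doc_sets, labels))
        present = sum(c for (p, _), c in joint.items() if p)
        mi = 0.0
        for (p, y), c in joint.items():
            p_xy = c / n
            p_x = (present if p else n - present) / n
            p_y = class_counts[y] / n
            mi += p_xy * math.log(p_xy / (p_x * p_y))
        weights[t] = mi
    return weights

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy corpus: two classes, four short documents.
docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "now"],
    ["meeting", "agenda", "now"],
    ["agenda", "notes", "meeting"],
]
labels = ["spam", "spam", "ham", "ham"]

idf = idf_weights(docs)
mi = mi_weights(docs, labels)
terms = sorted(idf)
r = pearson([idf[t] for t in terms], [mi[t] for t in terms])

for t in terms:
    print(f"{t:8s} idf={idf[t]:.3f} mi={mi[t]:.3f}")
print(f"Pearson r = {r:.3f}")
```

On a corpus this small the correlation value is not meaningful; the sketch only shows how the two weightings could be placed side by side per term.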
Internal Reader: Dr. Alioune Ngom
External Reader: Dr. Behnam Shahrrava
Advisor: Dr. Jianguo Lu
MSc Thesis Defense Announcement
5113 Lambton Tower, 401 Sunset Ave., Windsor ON., N9B 3P4 (519) 253-3000 Ext. 3716 email@example.com