Friday, January 8, 2021 - 11:00 to 13:00
SCHOOL OF COMPUTER SCIENCE
The School of Computer Science is pleased to present…
MSc Thesis Defense by: Jayanth Prakash Kulkarni
Date: Friday, January 8th, 2021
Time: 11:00 am – 1:00 pm
Zoom URL: https://zoom.us/j/91852146902?
Passcode: If interested in attending the event, contact the Graduate Secretary at email@example.com for the passcode.
Automated terminology extraction is a crucial task in natural language processing and ontology construction. Termhood can be inferred using linguistic and statistical techniques. This thesis focuses on the statistical methods. Inspired by feature selection techniques in document classification, we experiment with a variety of metrics, including PMI (pointwise mutual information), MI (mutual information), and chi-squared. We find that PMI is better at identifying the top keywords in a domain, while MI recognizes more keywords overall. Based on this observation, we propose a hybrid approach, called HMI, that combines the best of PMI and MI. HMI outperforms both PMI and MI. The result is verified by measuring the overlap between the extracted keywords and the author-identified keywords in arXiv data. When the corpora are computer science and physics papers, the top-100 hit rate can reach 0.96 for HMI.
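For readers unfamiliar with the two base metrics, the following is a minimal sketch of how PMI and MI are commonly computed in feature selection from a 2x2 table of document counts for a term and a domain. It assumes the standard textbook formulation and is not necessarily the exact one used in the thesis. The function names and the count arguments (n11, n10, n01, n00) are hypothetical, and HMI is not shown because this announcement does not define it.

```python
import math


def pmi(n11, n10, n01, n00):
    """Pointwise mutual information between a term and a domain.

    Assumed counts (hypothetical naming):
      n11: documents in the domain that contain the term
      n10: documents outside the domain that contain the term
      n01: documents in the domain that lack the term
      n00: documents outside the domain that lack the term
    """
    n = n11 + n10 + n01 + n00
    if n11 == 0:
        return float("-inf")  # the term never occurs in the domain
    return math.log2(n * n11 / ((n11 + n10) * (n11 + n01)))


def mutual_information(n11, n10, n01, n00):
    """Mutual information: expectation of the log ratio over all four cells."""
    n = n11 + n10 + n01 + n00
    # Each entry: (joint count, term-side marginal, domain-side marginal)
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    total = 0.0
    for joint, term_marginal, domain_marginal in cells:
        if joint > 0:  # an empty cell contributes nothing
            total += (joint / n) * math.log2(
                n * joint / (term_marginal * domain_marginal)
            )
    return total
```

PMI scores only the cell where the term and the domain co-occur, while MI averages over all four cells, which is one reason the two metrics can rank candidate terms differently.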
We also demonstrate that terminologies can improve document embeddings. In this experiment, we treat each machine-identified multi-word terminology as a single word and use the transformed text as input for the document embedding. Compared with representations learnt from unigrams only, we observe an improvement of over 5.67% in F1 score on document classification tasks in arXiv data.
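The text transformation described above can be sketched as follows, assuming a greedy longest-match replacement over a whitespace-tokenized document. The function name merge_terms, the underscore joiner, and the matching strategy are illustrative assumptions, not details taken from the thesis.

```python
def merge_terms(tokens, terms, joiner="_"):
    """Replace each multi-word term in `tokens` with a single joined token.

    `terms` is an iterable of space-separated terms (hypothetical input
    format). At each position, the longest matching term wins.
    """
    term_set = {tuple(term.split()) for term in terms}
    max_len = max((len(term) for term in term_set), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in term_set:
                merged.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-word term starts here; keep the unigram.
            merged.append(tokens[i])
            i += 1
    return merged
```

The merged token list can then be passed to any embedding model that expects a sequence of word tokens, so that each terminology receives its own vocabulary entry.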
Keywords: Terminology extraction, document embedding, pointwise mutual information, mutual information, chi-squared
Internal Reader: Dr. Dan Wu
External Reader: Dr. Mohamed Belalia
Advisor: Dr. Jianguo Lu
Chair: Dr. Saeed Samet
MSc Thesis Defense Announcement
5113 Lambton Tower 401 Sunset Ave. Windsor ON, N9B 3P4 (519) 253-3000 Ext. 3716 firstname.lastname@example.org