MSc Thesis Defense By: Sumisha Surendran "Plagiarism Detection in Source Code Using Machine Learning"

Tuesday, May 28, 2024 - 11:00

The School of Computer Science is pleased to present…

Plagiarism Detection in Source Code Using Machine Learning

MSc Thesis Defense by: Sumisha Surendran


Date: Tuesday, May 28th, 2024

Time: 11:00 AM

Location: Memorial Hall Room 109



This thesis presents an advanced plagiarism detection system for source code, leveraging Convolutional Neural Networks (CNN), Long Short-Term Memory Networks (LSTM), and a Code Pre-trained Model (CodePTM) to identify various levels of plagiarism in programming submissions. The system enhances detection accuracy for different plagiarism types, from simple copy-paste to complex structural and semantic modifications, which are often challenging for traditional methods.


Our methodology includes preprocessing with tokenization, Abstract Syntax Tree (AST) generation, and vector transformation to capture lexical, structural, and semantic features. These features are analyzed using a CNN for structural patterns, an LSTM for contextual dependencies, and CodePTM for deep semantic relationships. This integrated approach significantly improves detection across a wide range of plagiarism types, achieving high recall and strong overall results across six defined levels of plagiarism.
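
The abstract does not say which programming languages or tools the thesis uses. As a rough illustration only, the three preprocessing stages named above (tokenization, AST generation, vector transformation) might look like the following Python sketch, which uses only the standard library. The function names, the node-type count vector, and the cosine similarity are assumptions made for this example and are not the thesis's method. The CNN, LSTM, and CodePTM stages are not shown.

```python
import ast
import io
import math
import tokenize
from collections import Counter

# Token types that carry no lexical content for comparison purposes.
_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def lexical_tokens(source: str) -> list[str]:
    """Tokenize Python source, dropping comments and layout-only tokens."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in _SKIPPED]


def structural_vector(source: str) -> Counter:
    """Parse the source into an AST and count node types.

    Identifier names and comments do not influence these counts, so the
    vector reflects structure rather than surface text (hypothetical feature).
    """
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Two hypothetical submissions: the second renames identifiers and adds a comment.
original = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
renamed = (
    "def add_all(values):\n"
    "    acc = 0  # running sum\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)

similarity = cosine_similarity(structural_vector(original),
                               structural_vector(renamed))
print(f"structural similarity: {similarity:.3f}")
```

In this toy case the token sequences differ because of the renamed identifiers, while the AST node-type counts are identical. This is the kind of disguised copy that a purely lexical comparison can miss.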


The results show the system’s strength in identifying complex plagiarism cases, making it a powerful tool for maintaining integrity and originality in academic programming assignments.


Keywords: CNN, LSTM, NLP, Source Code, Plagiarism Detection


Thesis Committee:

Internal Reader: Dr. Hossein Fani

External Reader: Dr. Guoqing Zhang

Advisor: Dr. Dan Wu

Chair: Dr. Peter Tsin

