The School of Computer Science is pleased to present…
Plagiarism Detection in Source Code Using Machine Learning
Date: Tuesday, May 28th, 2024
Time: 11:00 AM
Location: Memorial Hall Room 109
This thesis presents an advanced plagiarism detection system for source code, leveraging Convolutional Neural Networks (CNN), Long Short-Term Memory Networks (LSTM), and a Code Pre-trained Model (CodePTM) to identify various levels of plagiarism in programming submissions. The system enhances detection accuracy for different plagiarism types, from simple copy-paste to complex structural and semantic modifications, which are often challenging for traditional methods.
Our methodology includes preprocessing with tokenization, Abstract Syntax Tree (AST) generation, and vector transformation to capture lexical, structural, and semantic features. These features are analyzed using a CNN for structural patterns, an LSTM for contextual dependencies, and CodePTM for deep semantic relationships. This integrated approach improves detection across six defined levels of plagiarism and achieves high recall at each level.
The results show the system’s strength in identifying complex plagiarism cases, making it a powerful tool for maintaining integrity and originality in academic programming assignments.
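As a rough illustration of the preprocessing steps named in the abstract (tokenization, AST generation, and vector transformation), the sketch below uses only Python's standard library. It is not the thesis's implementation: it assumes the submissions are Python programs, and the helper names (lexical_tokens, ast_node_counts, to_vector, cosine) and the node-type count vector are hypothetical stand-ins for the learned representations that would be fed to the CNN, LSTM, and CodePTM.

```python
import ast
import io
import math
import tokenize
from collections import Counter


def lexical_tokens(source: str) -> list[str]:
    """Lexical view: token strings with comments and layout tokens dropped."""
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type not in skip]


def ast_node_counts(source: str) -> Counter:
    """Structural view: how often each AST node type occurs."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def to_vector(counts: Counter, vocabulary: list[str]) -> list[int]:
    """Project node counts onto a fixed, ordered vocabulary of node types."""
    return [counts.get(name, 0) for name in vocabulary]


def cosine(u: list[int], v: list[int]) -> float:
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical pair: the second snippet renames identifiers and adds a comment.
original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = ("def acc(values):\n    # sum\n    result = 0\n"
           "    for v in values:\n        result += v\n    return result\n")

counts_a = ast_node_counts(original)
counts_b = ast_node_counts(renamed)
vocabulary = sorted(set(counts_a) | set(counts_b))
similarity = cosine(to_vector(counts_a, vocabulary), to_vector(counts_b, vocabulary))
```

In this toy pair the token sequences differ because the identifiers were renamed, while the AST node-type counts are unchanged, which is why structural features help against plagiarism that goes beyond copy-paste. A count vector ignores ordering and nesting, so it is only a simplification of the structural and semantic features the abstract describes.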
Keywords: CNN, LSTM, NLP, Source Code, Plagiarism Detection
Internal Reader: Dr. Hossein Fani
External Reader: Dr. Guoqing Zhang
Advisor: Dr. Dan Wu
Chair: Dr. Peter Tsin