MSc Thesis Defense "Anomaly Detection in Large Datasets: A Case Study in Loan Defaults" By: Rayhaan Pirani

Thursday, August 24, 2023 - 10:00 to 12:00

The School of Computer Science at the University of Windsor is pleased to present...

Anomaly Detection in Large Datasets: A Case Study in Loan Defaults 
MSc Thesis Defense by: Rayhaan Pirani 

Date: Thursday, August 24th, 2023

Time: 10:00am – 12:00pm

Location: Essex Hall, Room #122
 

Abstract:

Given the rise in loan defaults, especially after the COVID-19 pandemic, lenders need to predict whether a customer might default on a loan in order to manage risk. This thesis proposes an early warning system architecture that uses anomaly detection, motivated by the unbalanced nature of real-world loan default data: most customers do not default on their loans and only a tiny percentage do, which results in an unbalanced dataset. We aim to evaluate anomaly detection methods for their suitability in handling unbalanced datasets. We conduct a comparative study of different anomaly detection approaches on four balanced and unbalanced datasets, comparing five approaches from each of three categories: supervised, unsupervised, and semi-supervised. The supervised algorithms are logistic regression, stochastic gradient descent (SGD), XGBoost, LightGBM, and CatBoost classifiers. The unsupervised methods are isolation forest, angle-based outlier detection (ABOD), outlier detection using empirical cumulative distribution functions (ECOD), copula-based outlier detection (COPOD), and deep one-class classification with an autoencoder (DeepSVDD). The semi-supervised methods are improving supervised outlier detection with unsupervised representation learning (XGBOD), feature encoding with autoencoders for weakly-supervised anomaly detection (FeaWAD), deep semi-supervised anomaly detection (DeepSAD), pairwise relation prediction network (PReNet), and deep anomaly detection with deviation networks (DevNet). We compare them using standard evaluation metrics: accuracy, precision, recall, F1 score, training and prediction time, and area under the receiver operating characteristic (ROC) curve. The results show that anomaly detection methods perform significantly better on unbalanced loan default data and are more suitable for real-world applications. The results also show that supervised methods work better for balanced datasets, and that boosting approaches are expected to perform well on peer-to-peer lending datasets.
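
As an illustration of why the abstract lists several evaluation metrics rather than accuracy alone, the following minimal Python sketch computes accuracy, precision, recall, F1 score, and ROC AUC on a toy unbalanced dataset. The data (1,000 loans with a 5% default rate), the two sets of predictions, and the function names are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def classification_metrics(y_true, y_pred):
    """Compute metrics with label 1 (default) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(accuracy, precision, recall, f1)


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outscores
    a random negative, with ties counted as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 50 defaulters followed by 950 non-defaulters.
y_true = [1] * 50 + [0] * 950

# Baseline that never predicts a default.
never_default = [0] * 1000
baseline = classification_metrics(y_true, never_default)

# Hypothetical detector: flags 40 of the 50 defaulters and
# raises 60 false alarms among the 950 non-defaulters.
detector = [1] * 40 + [0] * 10 + [1] * 60 + [0] * 890
detected = classification_metrics(y_true, detector)

print(baseline, roc_auc(y_true, never_default))
print(detected, roc_auc(y_true, detector))
```

On this toy data the baseline reaches 95% accuracy while catching no defaults (recall 0), whereas the hypothetical detector has slightly lower accuracy (93%) but a recall of 80%. Accuracy alone can therefore be misleading on unbalanced data, which is one reason to report precision, recall, F1 score, and ROC AUC alongside it.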

Keywords: Anomaly Detection, Unbalanced Dataset, Early Warning System, Loan Default.

MSc Thesis Committee: 

Internal Reader: Dr. Alioune Ngom

External Reader: Dr. Bharat Maheshwari

Advisor: Dr. Ziad Kobti

Chair: Dr. Kalyani Selvarajah