School of Computer Science Colloquium - Machine Learning on High-Dimensional Biological Data Analysis by Yan Yan

Friday, October 20, 2023 - 11:00

The School of Computer Science at the University of Windsor is pleased to present…

Speaker: Yan Yan

Title: Machine Learning on High-Dimensional Biological Data Analysis

Date: October 20, 2023

Time: 11:00 am – 12:00 pm

Location: Erie Hall Room 3123

Abstract:

In the rapidly evolving landscape of biological data analysis, machine learning has emerged as a powerful tool for analyzing high-dimensional, complex biological data. Here we present applications of penalized regression to two different types of biological data: whole-genome Single Nucleotide Polymorphisms (SNPs) and single-cell RNA sequencing (scRNA-seq) data. These data usually contain many more features than samples, resulting in the “large-p (features), small-n (samples)” problem. We show that combining various penalized regression models can potentially solve the problem and yield better performance.
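As a rough illustration of why an L1 penalty helps when features far outnumber samples, the sketch below fits a lasso by coordinate descent on synthetic data. It is not the speaker's method: the data, the penalty strength `alpha`, and all function names are assumptions made for this example.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it falls inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1 by cyclic coordinate descent."""
    n_samples, n_features = X.shape
    coef = np.zeros(n_features)
    residual = y.astype(float).copy()  # equals y - X @ coef while coef is all zeros
    col_norms = (X ** 2).sum(axis=0) / n_samples
    for _ in range(n_iters):
        for j in range(n_features):
            if col_norms[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j
            rho = X[:, j] @ residual / n_samples + col_norms[j] * coef[j]
            new_coef = soft_threshold(rho, alpha) / col_norms[j]
            if new_coef != coef[j]:
                residual -= X[:, j] * (new_coef - coef[j])
                coef[j] = new_coef
    return coef


# Synthetic "large-p, small-n" data: 40 samples, 500 features, 5 truly informative.
rng = np.random.default_rng(0)
n_samples, n_features = 40, 500
X = rng.standard_normal((n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, -1.0, 2.5]
y = X @ true_coef + 0.1 * rng.standard_normal(n_samples)

alpha = 0.2  # hypothetical penalty strength; in practice chosen by cross-validation
coef = lasso_coordinate_descent(X, y, alpha)
selected = np.flatnonzero(coef)
print(f"{selected.size} of {n_features} features kept")
```

Ordinary least squares has no unique solution here because there are more unknowns than equations; the penalty resolves this by setting most coefficients exactly to zero, which doubles as feature selection.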

In the scRNA-seq analysis, various penalized regression methods are rigorously compared for feature selection. Sparse group lasso (SGL) emerges as the top performer, surpassing six other methods. Building on these findings, a novel algorithm is proposed that uses SGL for gene selection in scRNA-seq data without the need for domain-specific gene grouping information. This approach consistently yields superior results in terms of area under the receiver operating characteristic curve (AUC).
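For readers unfamiliar with SGL, its penalty in the standard textbook form is sketched below. The abstract does not state the exact formulation used in the talk, so the loss $\mathcal{L}$, the group index $l$, the group sizes $p_l$, and the mixing parameter $\alpha$ are notation assumed here.

```latex
\hat{\beta} = \arg\min_{\beta}\;
  \mathcal{L}(\beta)
  + (1-\alpha)\,\lambda \sum_{l=1}^{m} \sqrt{p_l}\,\bigl\lVert \beta^{(l)} \bigr\rVert_2
  + \alpha\,\lambda\,\lVert \beta \rVert_1
```

With $\alpha = 1$ this reduces to the lasso and with $\alpha = 0$ to the group lasso; intermediate values select whole groups of genes while also zeroing out individual genes within the selected groups.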

In the SNP data study, "PentaPen" is introduced as a computational workflow for enhanced SNP detection accuracy, combining the strengths of five penalized models. PentaPen excels in situations with a high number of predictors and few samples, and harnesses the power of parallel computing, marking a significant advancement in SNP identification.

These studies underscore the pivotal role of machine learning methods in addressing high-dimensional data challenges across distinct data types. The developed methodologies shed light on new findings in complex biological data.

Keywords: penalized regressions, single-cell RNA sequencing, single nucleotide polymorphisms, LASSO, machine learning, Arabidopsis thaliana

Biography:

Yan Yan is an assistant professor in the School of Computer Science at the University of Guelph. Dr. Yan’s expertise is in bioinformatics and machine learning. Her research program focuses on multi-omics analysis, which provides novel information on the mechanisms of biological processes and cell states in disease development. Through the program, various computational studies have been developed, including peptide sequence prediction based on high-throughput tandem mass spectrometry, graph-based de novo sequencing algorithms on different MS/MS spectra and their combinations, and statistical association studies and deep learning on genome-wide population data. Dr. Yan currently holds an NSERC Discovery Grant (2021-2026) on “Computational Methods for Phenotype Prediction to Assist Plant Breeding”. She previously participated in research grants as a contributor and participant, including a Microsoft Azure Research Grant, an NSERC CREATE Grant, and the Canada First Research Excellence Fund.