The School of Computer Science presents...
Presenter: Reem Al-Saidi
Date: Monday, May 26, 2025
Time: 10:00 am
Location: 4th Floor (Workshop space) at 300 Ouellette Avenue (School of Computer Science Advanced Computing Hub)
This workshop focuses on the use of synthetic data generation techniques to enable responsible sharing of sensitive genomic and health data. With growing concern about privacy risks and data misuse, synthetic data has emerged as a promising way to advance research while reducing the risk of individual re-identification. Nevertheless, doubts persist about whether synthetic data genuinely safeguards privacy, particularly when it is produced by robust models trained on confidential genomic sequences.
The workshop explores state-of-the-art techniques used to generate synthetic genomic and health datasets, including generative adversarial networks (GANs), variational autoencoders (VAEs), and differentially private mechanisms. It also examines the limitations of these methods, particularly rare variant leakage, memorization risks, and utility degradation in downstream analyses such as clustering and classification. Participants will gain practical insight into the privacy–utility trade-off, learn how to assess both privacy leakage and data utility, and understand the legal and ethical frameworks necessary for working with synthetic genomic data.
Part 1: Introduction to Synthetic Genomics and Health Data
- Rationale for using synthetic data in genomics and health research
- Regulatory pressures and trust considerations
Part 2: Generation Techniques
- Generative models: GANs, VAEs, transformers
- Differential privacy integration in synthetic generation
- Applications to genetic datasets and patient health records
Part 3: Limitations and Privacy Risks
- Memorization and reconstruction attacks
- Membership inference in synthetic datasets
- Representation challenges in diverse genomes
Part 4: Measuring Utility and Privacy
- Utility metrics: predictive accuracy, clustering validity, biological relevance
- Privacy metrics: distance-based tests, DP bounds, adversarial risk evaluations
- Case studies with benchmark datasets
Part 5: Future Outlook
- Emerging defenses and privacy audits
- Use of synthetic data in regulated environments
- Open problems and research directions
Prerequisites:
- Foundational understanding of genomic data structures and privacy concerns in health data
- Basic familiarity with generative AI models
- Interest or experience in analyzing genomic or electronic health record (EHR) data
- Awareness of ethical and legal responsibilities in handling sensitive patient-level data
Reem is a Ph.D. student in the School of Computer Science at the University of Windsor. Her research focuses on applying privacy and security techniques to AI tools, establishing trust and reputation in AI applications, and assessing bias and fairness in NLP models.