Technical Workshop - Synthetic Health Data: A Double-Edged Sword for Privacy and Utility in Genomic (2nd Offering) by: Reem Al-Saidi

Tuesday, May 27, 2025 - 10:00

The School of Computer Science presents...

Synthetic Health Data: A Double-Edged Sword for Privacy and Utility in Genomic (2nd Offering)

 

Presenter:  Reem Al-Saidi

Date: Tuesday, May 27, 2025

Time:  10:00 am

Location: 4th Floor (Workshop space) at 300 Ouellette Avenue (School of Computer Science Advanced Computing Hub)

 

Abstract: 

This workshop focuses on synthetic data generation techniques that enable responsible sharing of sensitive genomic and health data. With growing concern about privacy risks and data misuse, synthetic data has emerged as a promising way to advance research while reducing the risk of individual re-identification. Nevertheless, doubts persist about whether synthetic data genuinely safeguards privacy, particularly when it is produced by powerful models trained on confidential genomic sequences. The workshop explores state-of-the-art techniques for generating synthetic genomic and health datasets, including generative adversarial networks (GANs), variational autoencoders (VAEs), and differentially private mechanisms. It also examines the limitations of these methods, particularly rare-variant leakage, memorization risks, and utility degradation in downstream analyses such as clustering and classification. Participants will gain practical insight into the privacy–utility trade-off, learn how to assess both privacy leakage and data utility, and understand the legal and ethical frameworks necessary for working with synthetic genomic data.

 

Workshop Outline:

Part 1: Introduction to Synthetic Genomic and Health Data

  • Rationale for using synthetic data in genomics and health research
  • Regulatory pressures and trust considerations

Part 2: Generation Techniques

  • Generative models: GANs, VAEs, transformers
  • Differential privacy integration in synthetic generation
  • Applications to genetic datasets and patient health records

Part 3: Limitations and Privacy Risks

  • Memorization and reconstruction attacks
  • Membership inference in synthetic datasets
  • Representation challenges in diverse genomes

Part 4: Measuring Utility and Privacy

  • Utility metrics: predictive accuracy, clustering validity, biological relevance
  • Privacy metrics: distance-based tests, DP bounds, adversarial risk evaluations
  • Case studies with benchmark datasets

Part 5: Future Outlook

  • Emerging defenses and privacy audits
  • Use of synthetic data in regulated environments
  • Open problems and research directions

 

Prerequisites:
  • Foundational understanding of genomic data structures and privacy concerns in health data
  • Basic familiarity with generative AI models
  • Interest or experience in analyzing genomic or electronic health record (EHR) data
  • Awareness of ethical and legal responsibilities in handling sensitive patient-level data

 

Biography: 

Reem is a Ph.D. student in the School of Computer Science at the University of Windsor. Her research focuses on applying privacy and security techniques to AI tools, establishing trust and reputation in AI applications, and assessing bias and fairness in NLP models.

Registration Link (only MAC students need to pre-register)