MSc Thesis Defense: CERA: Context-Engineered Reviews Architecture for LLM-based Synthetic Dataset Generation by Kap Thang

Tuesday, April 28, 2026 - 11:00


Date: April 28th, 2026

Time: 11:00 am – 12:30 pm

Location: Essex Hall 122

 

Abstract:

Aspect-Based Sentiment Analysis models require large-scale annotated datasets, which are scarce, expensive to create, and prone to class imbalance. While large language models offer a promising route to synthetic data generation, existing approaches lack factual grounding, struggle with the polite phenomenon, and provide limited aspect-level control. This thesis presents CERA (Context-Engineered Reviews Architecture), a training-free, three-phase framework for generating realistic, controllable synthetic review text. CERA integrates a Composition Phase, whose Subject Intelligence Layer provides agentic web-search factual grounding and multi-agent verification; a Generation Phase with configurable polarity balance and demographic-grounded personas; and an Evaluation Phase using multi-dimensional quality assessment. We evaluate CERA across three review domains (laptop, restaurant, hotel) using intrinsic text quality metrics, extrinsic evaluation via latent aspect detection, a factual grounding ablation study, and a user evaluation study. CERA achieves real-data-level corpus diversity where heuristic prompting collapses, generalizes across domains, and scales to 8,000 reviews with broadly stable semantic fidelity. A factual grounding ablation across six subjects and 360 datasets demonstrates that the Subject Intelligence Layer is essential: the full CERA pipeline achieves a Factual Score of 64–86% on novel subjects, while conditions without the layer collapse to below 10%. The user evaluation (N=50) shows that CERA reviews are selected as real 30% of the time in triplet identification (chance level 33%), outperforming heuristic prompting at 18%.

 

Keywords: Synthetic Data Generation, Aspect-Based Sentiment Analysis, Large Language Models, Controllable Text Generation

 

Thesis Committee:

Internal Reader: Dr. Arunita Jaekel

External Reader: Dr. Mahsa Hosseini

Advisor: Dr. Luis Rueda