HLFEval: A Hybrid LLM-Based Framework for Sentence-Level Evaluation of LLM Text Responses
MSc Thesis Proposal by:
Nehme Haidura
Date: Friday, May 22nd, 2026
Time: 2:30 PM
Location: Microsoft Teams
Abstract:
Evaluating the output of Large Language Models (LLMs) remains a challenging problem: existing evaluation methods often focus on isolated aspects of response quality and overlook important dimensions such as fluency, safety, and factuality. In this thesis, we propose HLFEval, a hybrid LLM-based framework for sentence-level evaluation of generated responses across four complementary dimensions: semantic relevance, fluency, safety, and factuality. The framework combines established evaluation methods, including GPT-2 perplexity for fluency, Detoxify for safety, and an ensemble of open-source LLM judges for factuality, into a unified composite scoring system with learned weights. Semantic relevance is measured with a novel LLM-assisted token-level scoring method that decomposes text into linguistic categories and constructs similarity heatmaps between reference and candidate responses, providing an alternative to BERTScore. The framework is evaluated on 12 non-overlapping batches under both independent and sequential training settings. Experimental results show stable convergence and consistent improvements under sequential training, with competitive performance maintained under independent training. These findings highlight the potential of HLFEval as a transparent and adaptive framework for multidimensional evaluation of generated responses.
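As an illustration only, the idea of a composite score over the four dimensions could be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: the names SentenceScores, perplexity_to_fluency, and composite_score are hypothetical, the perplexity mapping is an arbitrary monotone choice, the per-dimension scores are assumed to lie in [0, 1], and the weights are assumed to be non-negative and supplied by some separate training procedure.

```python
import math
from dataclasses import dataclass


@dataclass
class SentenceScores:
    """Per-sentence scores, each assumed to lie in [0, 1] (higher is better)."""
    relevance: float
    fluency: float
    safety: float
    factuality: float


def perplexity_to_fluency(perplexity: float) -> float:
    """Hypothetical mapping from perplexity (>= 1) to a fluency score in (0, 1].

    Lower perplexity gives a higher score; a perplexity of 1 maps to 1.0.
    """
    return 1.0 / (1.0 + math.log(max(perplexity, 1.0)))


def composite_score(scores: SentenceScores, weights: dict[str, float]) -> float:
    """Normalized weighted sum of the four dimension scores.

    The weights are assumed to be non-negative and learned elsewhere.
    """
    values = {
        "relevance": scores.relevance,
        "fluency": scores.fluency,
        "safety": scores.safety,
        "factuality": scores.factuality,
    }
    total = sum(weights[name] for name in values)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * values[name] for name in values) / total


# Example: equal weights over illustrative (made-up) sentence scores.
example = SentenceScores(relevance=1.0, fluency=0.5, safety=1.0, factuality=0.5)
equal_weights = {"relevance": 1.0, "fluency": 1.0, "safety": 1.0, "factuality": 1.0}
print(composite_score(example, equal_weights))
```

Normalizing by the weight total keeps the composite in [0, 1] whenever the individual scores are, which makes scores comparable across different weight settings.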
Thesis Committee:
Internal Reader: Dr. Jianguo Lu
External Reader: Dr. Esam Abdel-Raheem
Advisors: Dr. Ziad Kobti, Dr. Hussein Assaf
