AI
Friday, February 13, 2026

Quantitative evaluation frameworks for the trustworthiness of large language model outputs in medical domains

CDS 1646
12:00 PM - 1:00 PM

About

Although large language model (LLM)–based tools have become increasingly popular, their deployment in real-world clinical settings, where the cost of diagnostic errors is substantial, demands a far higher level of precision and reliability. Clinicians currently remain skeptical about relying on LLMs for clinical decision-making, largely because rigorous evidence supporting individual model outputs is lacking and how such outputs are generated is poorly understood. Even when an LLM produces a correct answer, clinicians often find it difficult to trust the result without a transparent justification. Addressing this trust gap is therefore an urgent need. In her first project, Yi proposes a scalable, entity-centric evaluation framework for medical question answering that assesses the clinical alignment and informativeness of LLM-generated responses by tracing and verifying clinically relevant medical entities within patient-specific contexts. This framework enables a more faithful and interpretable evaluation of medical LLM outputs than surface-level correctness checks. Building on this work, Yi’s ongoing research explores interpretability methods for analyzing the decision flow of LLMs, examining how patient information is processed through internal model representations and transformed into diagnostic summaries or clinical decisions. Together, these efforts aim to improve the transparency and trustworthiness of LLMs in clinical applications.
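To make the entity-centric idea concrete, the toy Python sketch below scores a model response by checking which clinically relevant entities it shares with a reference patient context. Everything in it, including the mini entity lexicon, the extract_entities matcher, and the alignment/informativeness definitions, is a hypothetical illustration rather than the framework presented in the talk; a real system would use a medical NER model and an ontology such as UMLS instead of string matching.

```python
"""Minimal sketch of an entity-centric evaluation (illustrative only).

The lexicon, extractor, and scoring below are hypothetical stand-ins;
the talk's actual framework is not specified in this abstract.
"""

# Hypothetical mini-lexicon of clinically relevant entities; a real
# system would use a medical NER model and an ontology (e.g., UMLS).
MEDICAL_ENTITIES = {
    "memory loss", "mri", "hippocampal atrophy",
    "donepezil", "alzheimer's disease", "mmse",
}

def extract_entities(text: str) -> set[str]:
    """Surface-match lexicon entries in lowercased text (toy extractor)."""
    text = text.lower()
    return {e for e in MEDICAL_ENTITIES if e in text}

def entity_scores(response: str, reference: str) -> dict[str, float]:
    """Alignment = precision of response entities against the reference;
    informativeness = recall of reference entities covered by the response."""
    resp, ref = extract_entities(response), extract_entities(reference)
    alignment = len(resp & ref) / len(resp) if resp else 0.0
    informativeness = len(resp & ref) / len(ref) if ref else 0.0
    return {"alignment": alignment, "informativeness": informativeness}

if __name__ == "__main__":
    reference = ("Patient shows memory loss; MRI reveals hippocampal "
                 "atrophy; MMSE 22/30, consistent with Alzheimer's disease.")
    response = ("Findings of memory loss and hippocampal atrophy on MRI "
                "suggest Alzheimer's disease; consider donepezil.")
    print(entity_scores(response, reference))
    # e.g. {'alignment': 0.8, 'informativeness': 0.8}
```

Under this reading, alignment penalizes entities in the response that the patient-specific context does not support, while informativeness rewards coverage of the entities the reference case contains.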

Speaker

Yi Liu

Yi Liu is advised by Professor Vijaya Kolachalama and has a general interest in free-form text evaluation and in methods for assessing open-ended model output and making it more reliable. Her work focuses on large language models in clinical settings, particularly medical question answering and diagnostic reasoning for Alzheimer’s disease. She is especially interested in evaluation frameworks and interpretability methods that reveal how medical evidence is represented, transformed, and used inside LLMs, as well as in approaches for detecting reasoning errors and improving the accuracy of model-generated clinical summaries.

Event Details

Date
Friday, February 13, 2026
Time
12:00 PM - 1:00 PM
Location
CDS 1646
Theme
AI