AI
Friday, February 13, 2026

Quantitative evaluation frameworks for the trustworthiness of large language model outputs in medical domains

CDS 1646
12:00 PM - 1:00 PM

About

Although large language model (LLM)–based tools have become increasingly popular, their deployment in real-world clinical settings, where the cost of diagnostic errors is substantial, demands a far higher level of precision and reliability. Clinicians currently remain skeptical about relying on LLMs for clinical decision-making, largely because rigorous evidence supporting individual model outputs is lacking and how such outputs are generated is poorly understood. Even when an LLM produces a correct answer, clinicians often find it difficult to trust the result without a transparent justification. Addressing this trust gap is therefore an urgent need. In her first project, Yi proposes a scalable, entity-centric evaluation framework for medical question answering that assesses the clinical alignment and informativeness of LLM-generated responses by tracing and verifying clinically relevant medical entities within patient-specific contexts. This framework enables a more faithful and interpretable evaluation of medical LLM outputs than surface-level correctness checks. Building on this work, Yi’s ongoing research explores interpretability methods for analyzing the decision flow of LLMs, examining how patient information is processed through internal model representations and transformed into diagnostic summaries or clinical decisions. Together, these efforts aim to improve the transparency and trustworthiness of LLMs in clinical applications.
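To make the entity-centric idea concrete, the toy Python sketch below scores a model response by checking which clinically relevant entities it shares with a reference patient context. Everything in it, including the mini entity lexicon, the extract_entities matcher, and the alignment/informativeness definitions, is a hypothetical illustration rather than the framework presented in the talk; a real system would use a medical NER model and an ontology such as UMLS instead of string matching.

```python
"""Minimal sketch of an entity-centric evaluation (illustrative only).

The lexicon, extractor, and scoring below are hypothetical stand-ins;
the talk's actual framework is not specified in this abstract.
"""

# Hypothetical mini-lexicon of clinically relevant entities; a real
# system would use a medical NER model and an ontology (e.g., UMLS).
MEDICAL_ENTITIES = {
    "memory loss", "mri", "hippocampal atrophy",
    "donepezil", "alzheimer's disease", "mmse",
}

def extract_entities(text: str) -> set[str]:
    """Surface-match lexicon entries in lowercased text (toy extractor)."""
    text = text.lower()
    return {e for e in MEDICAL_ENTITIES if e in text}

def entity_scores(response: str, reference: str) -> dict[str, float]:
    """Alignment = precision of response entities against the reference;
    informativeness = recall of reference entities covered by the response."""
    resp, ref = extract_entities(response), extract_entities(reference)
    alignment = len(resp & ref) / len(resp) if resp else 0.0
    informativeness = len(resp & ref) / len(ref) if ref else 0.0
    return {"alignment": alignment, "informativeness": informativeness}

if __name__ == "__main__":
    reference = ("Patient shows memory loss; MRI reveals hippocampal "
                 "atrophy; MMSE 22/30, consistent with Alzheimer's disease.")
    response = ("Findings of memory loss and hippocampal atrophy on MRI "
                "suggest Alzheimer's disease; consider donepezil.")
    print(entity_scores(response, reference))
    # e.g. {'alignment': 0.8, 'informativeness': 0.8}
```

Under this reading, alignment penalizes entities in the response that the patient-specific context does not support, while informativeness rewards coverage of the entities the reference case contains.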

Speaker

Yi Liu

Yi Liu is advised by Professor Vijaya Kolachalama and has a general interest in free-form text evaluation and in methods for assessing open-ended model output and making it more reliable. Her work focuses on large language models in clinical settings, particularly medical question answering and diagnostic reasoning for Alzheimer’s disease. She is especially interested in evaluation frameworks and interpretability methods that reveal how medical evidence is represented, transformed, and used inside LLMs, as well as in approaches for detecting reasoning errors and improving the accuracy of model-generated clinical summaries.

Event Details

Date
Friday, February 13, 2026
Time
12:00 PM - 1:00 PM
Location
CDS 1646
Theme
AI