TY - JOUR
T1 - Word-Sequence Entropy: Towards uncertainty estimation in free-form medical question answering applications and beyond
T2 - Engineering Applications of Artificial Intelligence
AU - Wang, Zhiyuan
AU - Duan, Jinhao
AU - Yuan, Chenxi
AU - Chen, Qingyu
AU - Chen, Tianlong
AU - Zhang, Yue
AU - Wang, Ren
AU - Shi, Xiaoshuang
AU - Xu, Kaidi
N1 - Publisher Copyright:
© 2024 Elsevier Ltd
PY - 2025/1
Y1 - 2025/1
N2 - Uncertainty estimation is crucial to the reliability of safety-critical systems in which humans interact with artificial intelligence (AI), particularly in healthcare engineering. However, a robust and general uncertainty measure for free-form answers has yet to be established for open-ended medical question-answering (QA) tasks, where generative inequality introduces a large number of irrelevant words and sequences into the generated set used for uncertainty quantification (UQ), which can bias the estimates. This paper proposes Word-Sequence Entropy (WSE), which calibrates uncertainty at both the word and sequence levels according to semantic relevance, highlighting keywords and amplifying the generative probability of trustworthy responses when performing UQ. We compare WSE with six baseline methods on five free-form medical QA datasets across seven popular large language models (LLMs), and show that WSE yields more accurate UQ under two standard criteria for correctness evaluation. Furthermore, with a view to real-world medical QA applications, taking the lower-uncertainty responses identified by WSE as final answers significantly improves LLM performance (e.g., a 6.36% gain in model accuracy on the COVID-QA dataset) without any additional task-specific fine-tuning or architectural modification.
KW - Generative inequality
KW - Open-ended medical question-answering
KW - Semantic relevance
KW - Uncertainty quantification
UR - http://www.scopus.com/inward/record.url?scp=85208764309&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85208764309&partnerID=8YFLogxK
U2 - 10.1016/j.engappai.2024.109553
DO - 10.1016/j.engappai.2024.109553
M3 - Article
AN - SCOPUS:85208764309
SN - 0952-1976
VL - 139
JO - Engineering Applications of Artificial Intelligence
JF - Engineering Applications of Artificial Intelligence
M1 - 109553
ER -