TY - GEN
T1 - Optimizing Manual Review Using Machine Learning in Interface Terminology Curation for Automatic EHR Highlighting
AU - Dehkordi, Mahshad Koohi H.
AU - Perl, Yehoshua
AU - Deek, Fadi P.
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Discharge notes are dense, information-rich documents that contain patient histories, diagnoses, treatments, and clinical observations, as well as post-discharge care instructions. While they provide essential data for clinical decision-making and research, they are often written using abbreviations and complex medical jargon, making them difficult for patients to interpret. Automatic highlighting of discharge notes enhances information accessibility, supports summarization and simplification, and improves clinical interoperability of the notes. Achieving accurate highlighting requires terminologies that include fine-granularity phrases, which existing reference terminologies such as SNOMED CT lack. To address this limitation, we previously proposed the Cardiology Interface Terminology (CIT), tailored for accurate highlighting of discharge notes of cardiology patients. Candidate concepts to be added to CIT were extracted from notes through concatenation and anchoring operations, with each phrase undergoing automatic and manual review before inclusion in CIT. Manual review of these phrases is a highly time-consuming and costly process. In this study, we propose a Machine Learning (ML)-assisted approach to reduce the manual review effort involved in terminology curation. We trained a Neural Network (NN) model on varying subsets of phrases generated through concatenation and anchoring to determine the minimum number of phrases that must be manually reviewed to effectively train the ML model to label the remaining phrases automatically. The optimal batch sizes were identified as 6,000 (out of 28,617) for concatenation and 3,000 (out of 9,845) for anchoring. The resulting terminology (CITML2+) achieved a coverage of 68.74% and breadth of 1.6 on the test dataset, closely matching the fully manually curated CIT+ (coverage 70.21%, breadth 1.6), with comparable completeness (97.4% vs. 98.6%) and conciseness (84.1% vs. 83.6%). These findings demonstrate that substantial reductions in manual review can be achieved without compromising highlighting quality, providing a scalable and efficient framework for curating interface terminologies across diverse medical domains.
AB - Discharge notes are dense, information-rich documents that contain patient histories, diagnoses, treatments, and clinical observations, as well as post-discharge care instructions. While they provide essential data for clinical decision-making and research, they are often written using abbreviations and complex medical jargon, making them difficult for patients to interpret. Automatic highlighting of discharge notes enhances information accessibility, supports summarization and simplification, and improves clinical interoperability of the notes. Achieving accurate highlighting requires terminologies that include fine-granularity phrases, which existing reference terminologies such as SNOMED CT lack. To address this limitation, we previously proposed the Cardiology Interface Terminology (CIT), tailored for accurate highlighting of discharge notes of cardiology patients. Candidate concepts to be added to CIT were extracted from notes through concatenation and anchoring operations, with each phrase undergoing automatic and manual review before inclusion in CIT. Manual review of these phrases is a highly time-consuming and costly process. In this study, we propose a Machine Learning (ML)-assisted approach to reduce the manual review effort involved in terminology curation. We trained a Neural Network (NN) model on varying subsets of phrases generated through concatenation and anchoring to determine the minimum number of phrases that must be manually reviewed to effectively train the ML model to label the remaining phrases automatically. The optimal batch sizes were identified as 6,000 (out of 28,617) for concatenation and 3,000 (out of 9,845) for anchoring. The resulting terminology (CITML2+) achieved a coverage of 68.74% and breadth of 1.6 on the test dataset, closely matching the fully manually curated CIT+ (coverage 70.21%, breadth 1.6), with comparable completeness (97.4% vs. 98.6%) and conciseness (84.1% vs. 83.6%). These findings demonstrate that substantial reductions in manual review can be achieved without compromising highlighting quality, providing a scalable and efficient framework for curating interface terminologies across diverse medical domains.
KW - Discharge notes
KW - EHRs
KW - Highlighting
KW - Interface Terminology
KW - Machine Learning
UR - https://www.scopus.com/pages/publications/105033539847
U2 - 10.1109/BIBM66473.2025.11357046
DO - 10.1109/BIBM66473.2025.11357046
M3 - Conference contribution
AN - SCOPUS:105033539847
T3 - Proceedings - 2025 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2025
SP - 6974
EP - 6980
BT - Proceedings - 2025 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2025
A2 - Liu, Juan
A2 - Huang, Jingshan
A2 - Wang, Xiaowo
A2 - Zhang, Fa
A2 - Zou, Xiufen
A2 - Tian, Tian
A2 - Hu, Xiaohua
A2 - Hu, Bin
A2 - Xiong, Yi
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2025
Y2 - 15 December 2025 through 18 December 2025
ER -