TY - GEN
T1 - MoE-MSC: Mixture of Experts with Multi-Stream Connector for Modality-Aware Structured Medical Image Captioning
T2 - 16th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2025
AU - Rubel, Al Shahriar
AU - Shih, Frank Y.
AU - Deek, Fadi P.
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s).
PY - 2025/12/10
Y1 - 2025/12/10
N2 - Large Vision-Language Models (LVLMs) have demonstrated promising capabilities in the medical image captioning task. Their architecture usually integrates a vision encoder, a Large Language Model (LLM), and a connector specifically designed to bridge the modality gap between vision and language. However, a lightweight connector struggles to filter appropriate visual features for an LLM to generate structured captions for medical images across various modalities. This paper proposes MoE-MSC: Mixture of Experts (MoE) with Multi-Stream Connector (MSC) for modality-aware structured medical image captioning. The captions are structured with the identified modality, anatomical structures, Region-of-Interest (RoI) analysis, lesion findings, and local-global relationships indicating the potential impact of lesion findings in the RoI on other regions. The MSC, with multiple cross-attentions, projects visual features from a vision encoder into a representation that enables an LLM to generate captions that focus on distinct aspects: modality and anatomical structures, RoI, and local-global relationships. Furthermore, the MoE enables modality-aware captioning, in which modality-specific visual features from a vision encoder are routed to specialized experts, offering scalability, enhanced interpretability, and reduced computational cost during inference. Our extensive experiments demonstrate the capabilities of our model with superior quantitative and qualitative results compared to relevant methods. The source code is available at https://github.com/alshahriarrubel/MoE-MSC.
AB - Large Vision-Language Models (LVLMs) have demonstrated promising capabilities in the medical image captioning task. Their architecture usually integrates a vision encoder, a Large Language Model (LLM), and a connector specifically designed to bridge the modality gap between vision and language. However, a lightweight connector struggles to filter appropriate visual features for an LLM to generate structured captions for medical images across various modalities. This paper proposes MoE-MSC: Mixture of Experts (MoE) with Multi-Stream Connector (MSC) for modality-aware structured medical image captioning. The captions are structured with the identified modality, anatomical structures, Region-of-Interest (RoI) analysis, lesion findings, and local-global relationships indicating the potential impact of lesion findings in the RoI on other regions. The MSC, with multiple cross-attentions, projects visual features from a vision encoder into a representation that enables an LLM to generate captions that focus on distinct aspects: modality and anatomical structures, RoI, and local-global relationships. Furthermore, the MoE enables modality-aware captioning, in which modality-specific visual features from a vision encoder are routed to specialized experts, offering scalability, enhanced interpretability, and reduced computational cost during inference. Our extensive experiments demonstrate the capabilities of our model with superior quantitative and qualitative results compared to relevant methods. The source code is available at https://github.com/alshahriarrubel/MoE-MSC.
KW - attention mechanism
KW - large language model
KW - large vision-language model
KW - medical image captioning
KW - mixture of experts
KW - modality-aware captioning
KW - multi-stream connector
KW - region-of-interest
KW - vision encoder
UR - https://www.scopus.com/pages/publications/105025569246
UR - https://www.scopus.com/pages/publications/105025569246#tab=citedBy
U2 - 10.1145/3765612.3767248
DO - 10.1145/3765612.3767248
M3 - Conference contribution
AN - SCOPUS:105025569246
T3 - BCB 2025 - Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
BT - BCB 2025 - Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
PB - Association for Computing Machinery, Inc
Y2 - 12 October 2025 through 15 October 2025
ER -