MoE-MSC: Mixture of Experts with Multi-Stream Connector for Modality-Aware Medical Image Captioning

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Large Vision-Language Models (LVLMs) have demonstrated promising capabilities in the medical image captioning task. Their architecture usually integrates a vision encoder, a Large Language Model (LLM), and a connector specifically designed to bridge the modality gap between vision and language. However, a lightweight connector struggles to filter the appropriate visual features for an LLM to generate structured captions for medical images across various modalities. This paper proposes MoE-MSC: Mixture of Experts (MoE) with Multi-Stream Connector (MSC) for modality-aware structured medical image captioning. The captions are structured with the identified modality, anatomical structures, Region-of-Interest (RoI) analysis, lesion findings, and local-global relationships indicating the potential impact of lesion findings in the RoI on other regions. The MSC, built from multiple cross-attention streams, projects visual features from a vision encoder into a representation that enables an LLM to generate captions focusing on different aspects of a caption: modality and anatomical structures, the RoI, and local-global relationships. Furthermore, the MoE enables modality-aware captioning, in which modality-specific visual features from a vision encoder are routed to specialized experts, offering scalability, enhanced interpretability, and reduced computational cost during inference. Our extensive experiments demonstrate the capabilities of our model with superior quantitative and qualitative results compared to relevant methods. The source code is available at https://github.com/alshahriarrubel/MoE-MSC.
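The abstract describes two mechanisms: a multi-stream connector in which several cross-attention streams compress visual tokens into aspect-specific query outputs, and an MoE router that dispatches modality-specific visual features to specialized experts. The minimal NumPy sketch below illustrates that general idea only; all shapes, the mean-pooled routing signal, and class names are hypothetical assumptions, not the authors' implementation (see their repository for the actual code).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CrossAttentionStream:
    """One stream: learned queries attend over visual tokens (hypothetical shapes)."""
    def __init__(self, num_queries, d_model, rng):
        self.queries = rng.standard_normal((num_queries, d_model)) * 0.02
        self.w_k = rng.standard_normal((d_model, d_model)) * 0.02
        self.w_v = rng.standard_normal((d_model, d_model)) * 0.02

    def __call__(self, visual_tokens):
        # visual_tokens: (num_tokens, d_model) from a vision encoder
        k = visual_tokens @ self.w_k
        v = visual_tokens @ self.w_v
        attn = softmax(self.queries @ k.T / np.sqrt(k.shape[-1]))
        return attn @ v  # (num_queries, d_model)

class MoEMultiStreamConnector:
    """Hypothetical sketch: a gate routes the image to top-k modality experts;
    each expert is a set of cross-attention streams whose outputs are
    concatenated into the token sequence handed to the LLM."""
    def __init__(self, num_experts=4, num_streams=3, num_queries=8,
                 d_model=32, top_k=1, seed=0):
        rng = np.random.default_rng(seed)
        self.router = rng.standard_normal((d_model, num_experts)) * 0.02
        self.top_k = top_k
        self.experts = [
            [CrossAttentionStream(num_queries, d_model, rng)
             for _ in range(num_streams)]
            for _ in range(num_experts)
        ]

    def __call__(self, visual_tokens):
        # Route on the mean-pooled visual feature (one simple proxy for "modality").
        gate = softmax(visual_tokens.mean(axis=0) @ self.router)
        top = np.argsort(gate)[-self.top_k:]
        weights = gate[top] / gate[top].sum()
        # Only the selected experts run, which is where inference savings come from.
        out = sum(
            w * np.concatenate([s(visual_tokens) for s in self.experts[e]], axis=0)
            for e, w in zip(top, weights)
        )
        return out  # (num_streams * num_queries, d_model)

connector = MoEMultiStreamConnector()
tokens = np.random.default_rng(1).standard_normal((49, 32))  # e.g. a 7x7 patch grid
llm_inputs = connector(tokens)
print(llm_inputs.shape)  # (24, 32): 3 streams x 8 queries each
```

Each stream's fixed query set can specialize to one caption aspect (modality/anatomy, RoI, local-global relationships), while the router keeps only the experts relevant to the detected imaging modality active at inference time.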

Original language: English (US)
Title of host publication: BCB 2025 - Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9798400722004
DOIs
State: Published - Dec 10 2025
Event: 16th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2025 - Philadelphia, United States
Duration: Oct 12 2025 - Oct 15 2025

Publication series

Name: BCB 2025 - Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

Conference

Conference: 16th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2025
Country/Territory: United States
City: Philadelphia
Period: 10/12/25 - 10/15/25

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Software
  • Biomedical Engineering
  • Health Informatics

Keywords

  • attention mechanism
  • large language model
  • large vision-language model
  • medical image captioning
  • mixture of experts
  • modality-aware captioning
  • multi-stream connector
  • region-of-interest
  • vision encoder
