TY - GEN
T1 - Adapting Vision Foundation Models for Real-Time Ultrasound Image Segmentation
AU - Zhang, Xiaoran
AU - Chen, Eric Z.
AU - Zhao, Lin
AU - Chen, Xiao
AU - Liu, Yikang
AU - Maihe, Boris
AU - Duncan, James S.
AU - Chen, Terrence
AU - Sun, Shanhui
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2026
Y1 - 2026
N2 - We propose a novel approach that adapts hierarchical vision foundation models for real-time ultrasound image segmentation. Existing ultrasound segmentation methods often struggle with adaptability to new tasks, relying on costly manual annotations, while real-time approaches generally fail to match state-of-the-art performance. To overcome these limitations, we introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features, interleaved with DINOv2 representations to enhance visual expressiveness. These enriched features are then decoded to produce precise and robust segmentation. We conduct extensive evaluations on six public datasets and one in-house dataset, covering both cardiac and thyroid ultrasound segmentation. Experiments show that our approach outperforms state-of-the-art methods across multiple datasets and excels with limited supervision, surpassing nnUNet by over 20% on average in the 1% and 10% data settings. Our method achieves ∼77 FPS inference speed with TensorRT on a single GPU, enabling real-time clinical applications.
KW - Real-time inference
KW - Ultrasound image segmentation
KW - Vision foundation model
UR - https://www.scopus.com/pages/publications/105017845705
U2 - 10.1007/978-3-032-04971-1_3
DO - 10.1007/978-3-032-04971-1_3
M3 - Conference contribution
AN - SCOPUS:105017845705
SN - 9783032049704
T3 - Lecture Notes in Computer Science
SP - 24
EP - 34
BT - Medical Image Computing and Computer Assisted Intervention, MICCAI 2025 - 28th International Conference, 2025, Proceedings
A2 - Gee, James C.
A2 - Hong, Jaesung
A2 - Sudre, Carole H.
A2 - Golland, Polina
A2 - Alexander, Daniel C.
A2 - Iglesias, Juan Eugenio
A2 - Venkataraman, Archana
A2 - Kim, Jong Hyo
PB - Springer Science and Business Media Deutschland GmbH
T2 - 28th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2025
Y2 - 23 September 2025 through 27 September 2025
ER -