FREQUENCY-GUIDED CONTEXTUAL IMAGE CAPTIONING

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Effective image captioning demands both a deep understanding of visual cues and an appreciation of their contextual importance. However, seamlessly integrating balanced contextual information remains a substantial challenge. In this paper, we present FreConCap, a novel Frequency-guided Contextual Image Captioning framework that addresses this challenge using high-frequency and background features together with object-level region features. We transform grid features into the frequency domain and filter out low-frequency components below a cutoff ratio, enhancing the fine details critical for detailed visual understanding. A Multi-Stream Cross Attention module is developed to reduce the modality gap between vision and language and to capture the interaction of text features with high-frequency local features, objects, context, and their relationships. Experiments on the MS COCO image captioning benchmark show that our approach outperforms existing methods, producing captions with richer contextual information.
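The abstract's frequency-guided step (transforming grid features into the frequency domain and suppressing low-frequency components below a cutoff ratio) can be sketched as a simple FFT-based high-pass filter. This is a minimal illustration, not the paper's implementation: the function name, the choice of a centered square mask, and the default cutoff value are all assumptions for demonstration.

```python
import numpy as np

def high_pass_grid_features(feats, cutoff_ratio=0.25):
    """Hypothetical sketch of frequency-guided filtering.

    feats: (H, W, C) array of grid features.
    cutoff_ratio: fraction of spatial frequencies around DC to suppress
                  (assumed interpretation of the paper's cutoff ratio).
    """
    H, W, _ = feats.shape
    # 2D FFT over the spatial dimensions, DC shifted to the center.
    freq = np.fft.fftshift(np.fft.fft2(feats, axes=(0, 1)), axes=(0, 1))
    # Zero a centered low-frequency square; keep the high frequencies.
    cy, cx = H // 2, W // 2
    ry, rx = int(H * cutoff_ratio / 2), int(W * cutoff_ratio / 2)
    mask = np.ones((H, W, 1))
    mask[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1, :] = 0.0
    # Back to the spatial domain; imaginary parts are numerical noise.
    filtered = np.fft.ifft2(np.fft.ifftshift(freq * mask, axes=(0, 1)),
                            axes=(0, 1))
    return filtered.real

# A constant (pure low-frequency) feature map is suppressed to ~0,
# while edges and fine texture would pass through.
flat = np.ones((8, 8, 4))
residual = np.abs(high_pass_grid_features(flat)).max()
```

Under this reading, the filtered features retain only fine spatial detail, which is what the abstract says the cutoff enhances for detailed visual understanding.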

Original language: English (US)
Title of host publication: 2025 IEEE International Conference on Image Processing, ICIP 2025 - Proceedings
Publisher: IEEE Computer Society
Pages: 1229-1234
Number of pages: 6
ISBN (Electronic): 9798331523794
DOIs
State: Published - 2025
Event: 32nd IEEE International Conference on Image Processing, ICIP 2025 - Anchorage, United States
Duration: Sep 14 2025 - Sep 17 2025

Publication series

Name: Proceedings - International Conference on Image Processing, ICIP
ISSN (Print): 1522-4880

Conference

Conference: 32nd IEEE International Conference on Image Processing, ICIP 2025
Country/Territory: United States
City: Anchorage
Period: 9/14/25 - 9/17/25

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Computer Vision and Pattern Recognition

Keywords

  • Encoder-Decoder
  • Frequency-Guided Feature
  • Image Captioning
  • Transformer
