Abstract
Effective image captioning relies on both visual understanding and contextual relevance. In this paper, we present two approaches to achieve these goals: BFC-Capb, a novel background-based image captioning approach, and its frequency-guided extension, BFC-Capf. First, we develop an Object-Background Attention (OBA) module to capture the interactions and relationships between object and background features. Then, we incorporate feature fusion with a spatial shift operation, enabling alignment with neighboring features while avoiding potential redundancy. This framework is extended to transform grid features into the frequency domain and filter out low-frequency components to enhance fine details. Our approaches are evaluated using traditional and recent metrics on the MS COCO image captioning benchmark. Experimental results show the effectiveness of our proposed approaches, which achieve better quantitative scores than relevant existing methods. Furthermore, our methods produce improved qualitative captions with richer background and more concise contextual information, including more accurate descriptions of objects and their attributes.
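The sketch below is a minimal illustration (not the authors' released code) of the two mechanisms named in the abstract: cross-attention between object and background features, and a frequency-domain filter that suppresses low-frequency components of grid features. All module names, tensor shapes, and the `cutoff` hyperparameter are illustrative assumptions.

```python
# Minimal sketch, assuming PyTorch; names, shapes, and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class ObjectBackgroundAttention(nn.Module):
    """Cross-attention where object features attend to background features (assumed design)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_feats: torch.Tensor, bg_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (B, N_obj, dim), bg_feats: (B, N_bg, dim)
        attended, _ = self.attn(query=obj_feats, key=bg_feats, value=bg_feats)
        return self.norm(obj_feats + attended)  # residual connection


def highpass_grid_features(grid: torch.Tensor, cutoff: int = 2) -> torch.Tensor:
    """Filter out low-frequency components of grid features (B, C, H, W) via a 2D FFT.

    `cutoff` (an assumed hyperparameter) is the half-width of the low-frequency
    band zeroed out around the centre of the shifted spectrum.
    """
    spec = torch.fft.fftshift(torch.fft.fft2(grid, norm="ortho"), dim=(-2, -1))
    _, _, h, w = spec.shape
    ch, cw = h // 2, w // 2
    spec[..., ch - cutoff:ch + cutoff + 1, cw - cutoff:cw + cutoff + 1] = 0
    filtered = torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1)), norm="ortho")
    return filtered.real


if __name__ == "__main__":
    oba = ObjectBackgroundAttention()
    objs = torch.randn(2, 10, 512)    # region/object features
    bg = torch.randn(2, 49, 512)      # background features
    print(oba(objs, bg).shape)        # torch.Size([2, 10, 512])

    grid = torch.randn(2, 512, 7, 7)  # grid features
    print(highpass_grid_features(grid).shape)  # torch.Size([2, 512, 7, 7])
```

The high-pass filter keeps the fine-detail (high-frequency) content of the grid features, which is the effect the frequency-guided extension aims for; the actual filtering scheme used in the paper may differ.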
| Original language | English (US) |
|---|---|
| Article number | 2555009 |
| Journal | International Journal of Pattern Recognition and Artificial Intelligence |
| Volume | 39 |
| Issue number | 8 |
| DOIs | |
| State | Published - Jun 30 2025 |
All Science Journal Classification (ASJC) codes
- Software
- Computer Vision and Pattern Recognition
- Artificial Intelligence
Keywords
- attention mechanism
- background features
- encoder-decoder
- frequency-guided component
- grid features
- Image captioning
- region features
- transformer