Abstract
The current transformer-based encoder-decoder architecture has demonstrated strong performance in image captioning, and it holds great potential for environmental-understanding applications, particularly in assisting visually impaired individuals. However, model complexity, coupled with often-overlooked confounding factors within scenes, reduces both efficiency and accuracy. This article introduces a de-confounding feature fusion transformer network (DFFTNet) for image captioning, specifically aimed at providing real-world assistance to the visually impaired. In the encoding phase, we introduce a distance-enhanced feature expansion (DEFE) module, which enriches the fine-grained details of image features while integrating relevant positional information into them. In the decoding phase, a causal adjustment (CA) module is proposed to eliminate confounding factors. Extensive experiments on different datasets demonstrate that our model effectively addresses visual-semantic confusion and outperforms state-of-the-art methods. Twenty volunteers are recruited to evaluate the captions DFFTNet generates in real-world settings while wearing our assistive navigation devices. The outcomes show its outstanding performance and its great potential for use by the visually impaired.
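The abstract does not provide implementation details of DFFTNet. The following is a minimal PyTorch sketch, under stated assumptions, of how a DEFE-style encoder enhancement and a CA-style adjustment before word prediction could slot into a standard transformer captioner; the distance bias, the learned confounder dictionary, and all module internals are illustrative placeholders, not the authors' design.

```python
# Hypothetical sketch only: DEFE/CA internals are assumptions, not the paper's method.
import torch
import torch.nn as nn


class DistanceEnhancedFeatureExpansion(nn.Module):
    """Stand-in for DEFE: expands region features and injects
    pairwise-distance information as an additive positional signal."""
    def __init__(self, dim: int):
        super().__init__()
        self.expand = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(),
                                    nn.Linear(2 * dim, dim))
        self.dist_proj = nn.Linear(1, dim)

    def forward(self, feats: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) region features; centers: (B, N, 2) region centers
        dist = torch.cdist(centers, centers).mean(dim=-1, keepdim=True)  # (B, N, 1)
        return feats + self.expand(feats) + self.dist_proj(dist)


class CausalAdjustment(nn.Module):
    """Stand-in for CA: mixes a learned confounder dictionary into the
    decoder states via attention, a back-door-style adjustment."""
    def __init__(self, dim: int, num_confounders: int = 64):
        super().__init__()
        self.confounders = nn.Parameter(torch.randn(num_confounders, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        z = self.confounders.unsqueeze(0).expand(h.size(0), -1, -1)
        adjusted, _ = self.attn(h, z, z)  # weight confounders per decoded token
        return h + adjusted


class CaptioningSketch(nn.Module):
    def __init__(self, vocab_size: int = 10000, dim: int = 512):
        super().__init__()
        self.defe = DistanceEnhancedFeatureExpansion(dim)
        self.backbone = nn.Transformer(d_model=dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, dim)
        self.ca = CausalAdjustment(dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, feats, centers, tokens):
        enc_in = self.defe(feats, centers)
        # A causal target mask would be added for real training; omitted here.
        h = self.backbone(enc_in, self.embed(tokens))
        return self.head(self.ca(h))


# Shape check with random inputs (2 images, 36 regions, 20 caption tokens).
model = CaptioningSketch()
logits = model(torch.randn(2, 36, 512), torch.rand(2, 36, 2),
               torch.randint(0, 10000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 10000])
```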
| Original language | English (US) |
|---|---|
| Article number | 4006112 |
| Journal | IEEE Transactions on Instrumentation and Measurement |
| Volume | 74 |
| DOIs | |
| State | Published - 2025 |
| Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Instrumentation
- Electrical and Electronic Engineering
Keywords
- Deep learning
- image captioning
- transformer
- visual perception
- wearable robotics