De-Confounding Feature Fusion Transformer Network for Image Captioning in Assistive Navigation Applications for the Visually Impaired

Zhengcai Cao, Ji Xia, Meng Chu Zhou

Research output: Contribution to journal › Article › peer-review

Abstract

Transformer-based encoder-decoder architectures have demonstrated strong performance in image captioning and have great potential for environmental understanding, particularly in assisting visually impaired individuals. However, model complexity, coupled with frequently overlooked confounding factors within scenes, tends to reduce both model efficiency and accuracy. This article introduces a de-confounding feature fusion transformer network (DFFTNet) for image captioning, aimed specifically at providing real-world assistance to the visually impaired. In the encoding phase, we introduce a distance-enhanced feature expansion (DEFE) module, which enriches the fine-grained details of image features while integrating relevant positional information into them. In the decoding phase, a causal adjustment (CA) module is proposed to eliminate confounding factors. Extensive experiments on different datasets demonstrate that our model effectively addresses visual-semantic confusion and outperforms state-of-the-art methods. Twenty volunteers wearing the assistive navigation devices we designed evaluate DFFTNet's generated outputs in real-world settings. The results show its strong performance and its potential for use by the visually impaired.

Original language: English (US)
Article number: 4006112
Journal: IEEE Transactions on Instrumentation and Measurement
Volume: 74
State: Published - 2025
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Instrumentation
  • Electrical and Electronic Engineering

Keywords

  • Deep learning
  • image captioning
  • transformer
  • visual perception
  • wearable robotics
