A Dual-Feature-Based Adaptive Shared Transformer Network for Image Captioning

Yinbin Shi, Ji Xia, Meng Chu Zhou, Zhengcai Cao

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Current models exhibit notable efficacy in image-captioning tasks. Mainstream research shows that combining dual visual features enhances visual representations and brings a performance boost. However, the incorporation of dual visual features complicates computation and expands parameters, hindering streamlined model deployment. In addition, the selection of region features requires a pretrained object detector, which neglects the model's ease of use for new scenarios and data. In this article, we propose a dual-feature adaptive shared transformer network that capitalizes on the merits of grid and shallow patch features while circumventing the extra complexity of dual channels. Specifically, we eschew complex features such as region features to facilitate straightforward dataset compilation and expedite inference. We propose an adaptive shared transformer (AST) block to conserve parameters and reduce the model's FLOPs. A gating mechanism is employed to adaptively compute the importance of each feature, thereby obtaining stronger visual features. Since flattening grid features before a transformer often leads to a loss of crucial spatial information, we incorporate the learning of relative geometric information based on grid features into our proposed method. Our analysis of various feature fusion techniques reveals that the AST approach outperforms its counterparts in terms of FLOPs and model size while still achieving high performance. Extensive experiments on different datasets indicate that our model demonstrates competitive performance on MSCOCO and outperforms state-of-the-art models on small-scale datasets.
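As an illustration of the two ideas named in the abstract (parameter sharing across the two feature streams, and a gate that weighs each feature), the following is a minimal sketch in plain Python. It is not the authors' AST block: the abstract does not specify the architecture, so the single shared linear projection, the sigmoid gate computed from the concatenated projections, and all names (`gated_shared_fusion`, `init_linear`, `shared`, `gate`) are assumptions made only for this example.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector given as a plain list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_linear(out_dim, in_dim, rng):
    """Hypothetical initializer: small random weights, zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def gated_shared_fusion(grid, patch, shared, gate):
    """Fuse two feature sequences position by position.

    Assumption: both streams pass through the SAME projection (`shared`),
    so the projection parameters are stored once instead of twice.
    Assumption: a sigmoid gate over the concatenated projections gives a
    per-channel weight in (0, 1) for the grid stream; the patch stream
    receives the complementary weight.
    """
    fused = []
    for g_vec, p_vec in zip(grid, patch):
        g_proj = linear(g_vec, *shared)  # shared parameters, stream 1
        p_proj = linear(p_vec, *shared)  # same parameters, stream 2
        alpha = [sigmoid(z) for z in linear(g_proj + p_proj, *gate)]
        fused.append([a * g + (1.0 - a) * p for a, g, p in zip(alpha, g_proj, p_proj)])
    return fused


rng = random.Random(0)
dim, positions = 4, 3
shared = init_linear(dim, dim, rng)    # one projection reused by both streams
gate = init_linear(dim, 2 * dim, rng)  # gate reads both projected streams
grid = [[rng.random() for _ in range(dim)] for _ in range(positions)]
patch = [[rng.random() for _ in range(dim)] for _ in range(positions)]
fused = gated_shared_fusion(grid, patch, shared, gate)
```

In this sketch each fused channel is a convex combination of the two projected streams, so the gate can only reweight them, and sharing the projection halves the projection parameters relative to keeping one projection per stream. Whether the published model shares parameters or gates features in this way is not stated in the abstract.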

Original language: English (US)
Pages (from-to): 1-13
Number of pages: 13
Journal: IEEE Transactions on Instrumentation and Measurement
Volume: 73
State: Published - 2024

All Science Journal Classification (ASJC) codes

  • Instrumentation
  • Electrical and Electronic Engineering

Keywords

  • Deep learning
  • image captioning
  • transformer
