Abstract
Current models exhibit notable efficacy on image-captioning tasks. Mainstream research shows that combining dual visual features enhances visual representations and yields a performance boost. However, incorporating dual visual features complicates computation and increases the parameter count, hindering streamlined model deployment. Moreover, extracting region features requires a pretrained object detector, which limits the model's ease of use on new scenarios and data. In this article, we propose a dual-feature adaptive shared transformer network that capitalizes on the merits of grid and shallow patch features while circumventing the extra complexity of dual channels. Specifically, we eschew complex features such as region features to facilitate straightforward dataset compilation and expedite inference. We propose an adaptive shared transformer block (AST) to conserve parameters and reduce the model's FLOPs. A gating mechanism adaptively computes the importance of each feature, yielding stronger visual representations. Since flattening grid features before a transformer often discards crucial spatial information, we incorporate the learning of relative geometric information from the grid features into our method. Our analysis of various feature fusion techniques reveals that the AST approach outperforms its counterparts in FLOPs and model size while still achieving high performance. Extensive experiments on different datasets show that our model achieves competitive performance on MSCOCO and outperforms state-of-the-art models on small-scale datasets.
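To make the fusion idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of two ingredients the abstract describes: a gate that adaptively weighs grid features against shallow patch features, and a single weight-shared transformer block reused across layers to keep parameters and FLOPs low. All class names, dimensions, and the use of PyTorch here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GatedDualFeatureFusion(nn.Module):
    """Adaptively weighs grid and patch features with a learned sigmoid gate (illustrative)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, grid_feats: torch.Tensor, patch_feats: torch.Tensor) -> torch.Tensor:
        # grid_feats, patch_feats: (batch, num_tokens, dim)
        g = self.gate(torch.cat([grid_feats, patch_feats], dim=-1))
        # Per-token convex combination: the gate decides which feature dominates.
        return g * grid_feats + (1.0 - g) * patch_feats


class SharedTransformerEncoder(nn.Module):
    """Applies one transformer block repeatedly, so depth adds no new parameters (illustrative)."""

    def __init__(self, dim: int, num_heads: int = 8, num_layers: int = 3):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_layers):  # same weights reused at every "layer"
            x = self.block(x)
        return x


if __name__ == "__main__":
    dim, tokens = 512, 49
    fuse = GatedDualFeatureFusion(dim)
    encoder = SharedTransformerEncoder(dim)
    grid = torch.randn(2, tokens, dim)   # e.g., flattened CNN grid features
    patch = torch.randn(2, tokens, dim)  # e.g., shallow ViT patch embeddings
    out = encoder(fuse(grid, patch))
    print(out.shape)  # torch.Size([2, 49, 512])
```

The sketch omits the relative geometric (positional) encoding the abstract mentions; in practice such information would typically be injected as a bias on the attention scores computed from the grid coordinates.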
Original language | English (US) |
---|---|
Pages (from-to) | 1-13 |
Number of pages | 13 |
Journal | IEEE Transactions on Instrumentation and Measurement |
Volume | 73 |
DOIs | |
State | Published - 2024 |
All Science Journal Classification (ASJC) codes
- Instrumentation
- Electrical and Electronic Engineering
Keywords
- Deep learning
- image captioning
- transformer