Abstract
Accuracy and diversity are two critical, quantifiable performance metrics for generating natural and semantically accurate captions. Because of the inherently conflicting and complex relationship between the two, efforts to enhance one typically cause the other to suffer. In this study, we demonstrate that the suboptimal accuracy levels derived from human annotations are unsuitable for machine-generated captions. To boost diversity while maintaining high accuracy, we propose an innovative variational transformer (VaT) framework. By integrating an 'invisible information prior' (IIP) and an 'auto-selectable Gaussian mixture model' (AGMM), we enable the encoder to learn precise linguistic information and object relationships across various scenes, thereby ensuring high accuracy. By incorporating a 'range-median reward' (RMR) baseline into the reinforcement-learning-based training process, we preserve a wider range of candidates with higher rewards, thereby guaranteeing outstanding diversity. Experimental results show that our method simultaneously improves accuracy and diversity by up to 1.1% and 4.8%, respectively, over the state of the art. Furthermore, our approach comes closest to human annotations in semantic retrieval, scoring 50.3 against the human score of 50.6. The method can therefore be readily put to industrial use.
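To make the RMR idea concrete, below is a minimal, speculative sketch of how a range-median baseline could be computed in an SCST-style reinforcement-learning loop. Everything here is an illustrative assumption: the function name `range_median_advantages`, the `keep_ratio` parameter, and the use of CIDEr-like scalar rewards are not taken from the paper, whose implementation details are not published in this record.

```python
import numpy as np

def range_median_advantages(rewards, keep_ratio=0.5):
    """Hypothetical range-median reward (RMR) baseline sketch.

    Instead of a single greedy-decoding baseline (as in standard
    SCST), take the median reward of the retained top range of
    sampled candidate captions, so every candidate above that
    median keeps a positive advantage. All names and parameters
    are illustrative assumptions, not the authors' exact method.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # Retain the top fraction of candidates by reward (the "range").
    k = max(1, int(len(rewards) * keep_ratio))
    top = np.sort(rewards)[-k:]
    # Baseline = median reward of the retained range.
    baseline = np.median(top)
    # Advantage for each sampled caption (reward minus baseline).
    return rewards - baseline

# Example: CIDEr-like rewards for six sampled captions of one image.
rewards = [0.8, 1.1, 0.95, 1.3, 0.7, 1.05]
print(range_median_advantages(rewards))
# -> [-0.3, 0.0, -0.15, 0.2, -0.4, -0.05]
```

Compared with a single greedy-decoding baseline, a median over a retained range leaves roughly half of the retained candidates with a non-negative advantage, which is one plausible reading of how "a wider range of candidates with higher rewards" is preserved during training.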
Original language | English (US) |
---|---|
Journal | IEEE Transactions on Neural Networks and Learning Systems |
DOIs | |
State | Accepted/In press - 2024 |
Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Software
- Computer Science Applications
- Computer Networks and Communications
- Artificial Intelligence
Keywords
- Auto-selectable Gaussian mixture model (AGMM)
- diverse generation
- image captioning
- retrieval
- variational transformer (VaT)