Abstract
Self-supervised learning has recently attracted considerable attention from researchers. CLIP is a vision-language model pre-trained with cross-modal contrastive learning. In this paper, we propose a novel optimal-transport-based prompt tuning method to improve the zero-shot generalization of the pre-trained CLIP model. Existing entropy-based approaches fail to account for the global structure of the output distribution and cannot effectively align distributions across domains. To resolve this issue, we develop Optimal Transport Test-Time Prompt Tuning (OT-TPT). By leveraging optimal transport, OT-TPT directly aligns distributions to provide a global regularization effect, thereby improving robustness to noise and distribution shifts. Moreover, a Sinkhorn regularization term is adopted to yield an efficient and smooth approximation that reduces distribution shift while improving zero-shot generalization. Experimental results show that the proposed OT-TPT achieves higher classification accuracy than existing state-of-the-art approaches.
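The abstract names Sinkhorn regularization as the device that makes the optimal transport alignment efficient and smooth, but gives no algorithmic detail. Below is a minimal, generic sketch of entropy-regularized optimal transport via Sinkhorn iterations, not the paper's OT-TPT implementation; the function name `sinkhorn`, the regularization strength `eps`, the 0-1 cost matrix, and the uniform target prior are all illustrative assumptions.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=100):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    a, b : source/target marginal distributions (each sums to 1)
    C    : cost matrix of shape (len(a), len(b))
    eps  : entropic regularization strength (smaller = closer to exact OT)
    Returns the transport plan P whose row sums ~ a and column sums ~ b.
    """
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # rescale columns toward marginal b
        u = a / (K @ v)                  # rescale rows toward marginal a
    return u[:, None] * K * v[None, :]   # P = diag(u) @ K @ diag(v)

# Hypothetical usage: align a predicted class distribution with a uniform prior,
# loosely analogous to the global alignment role OT plays in the abstract.
pred = np.array([0.7, 0.2, 0.1])         # e.g. averaged softmax over augmented views
prior = np.ones(3) / 3                   # assumed uniform target marginal
cost = 1.0 - np.eye(3)                   # 0-1 cost between class bins
P = sinkhorn(pred, prior, cost, eps=0.05)
print(P.sum(axis=1), P.sum(axis=0))      # row sums ~ pred, column sums ~ prior
```

The entropic term turns the exact (and expensive) linear-programming OT problem into a pair of simple matrix-scaling updates, which is why Sinkhorn iterations are the standard way to obtain the efficient, differentiable approximation the abstract refers to.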
| Original language | English (US) |
|---|---|
| Article number | 2551017 |
| Journal | International Journal of Pattern Recognition and Artificial Intelligence |
| Volume | 39 |
| Issue number | 14 |
| DOIs | |
| State | Published - Nov 1 2025 |
All Science Journal Classification (ASJC) codes
- Software
- Computer Vision and Pattern Recognition
- Artificial Intelligence
Keywords
- CLIP
- Test time prompt tuning
- Wasserstein barycenter
- machine learning
- optimal transport
- self-supervised learning