Learning to Prompt with Text Only Supervision for Vision-Language Models

Muhammad Uzair Khattak1 Muhammad Ferjad Naeem2 Muzammal Naseer1


Luc Van Gool2   Federico Tombari3,4
1 Mohamed bin Zayed University of AI   2 ETH Zurich   3 TU Munich   4 Google

Abstract

Foundational vision-language models such as CLIP are becoming a new paradigm in vision, due to their excellent generalization abilities. However, adapting these models for downstream tasks while maintaining their generalization remains a challenge. In the literature, one branch of methods adapts CLIP by learning prompts using visual information. While effective, most of these works require labeled data, which is not practical, and often struggle to generalize towards new datasets due to over-fitting on the source data. An alternative approach resorts to training-free methods by generating class descriptions from large language models (LLMs) and performing prompt ensembling. However, these methods often generate class-specific prompts that cannot be transferred to other classes, which incurs higher cost since LLM descriptions must be generated for each class separately. In this work, we propose to combine the strengths of both streams of methods by learning prompts using only text data derived from LLMs. As supervised training of prompts is not trivial in the absence of images, we develop a training approach that allows prompts to extract rich contextual knowledge from LLM data. Moreover, with LLM contextual data mapped within the learned prompts, it enables zero-shot transfer of prompts to new classes and datasets, potentially cutting the LLM prompt engineering cost. To the best of our knowledge, this is the first work that learns generalized prompts using text-only data. We perform extensive evaluations on 4 benchmarks where our method improves over prior ensembling works while being competitive to those utilizing labeled images. Our code and pre-trained models are available at https://github.com/muzairkhattak/ProText.

                            Method                    Do not require images   Transfer to unseen datasets
Prompt learning             CoOp [50]                 ✘                       ✓
methods                     CoCoOp [49]               ✘                       ✓
                            MaPLe [20]                ✘                       ✓
                            PromptSRC [21]            ✘                       ✓
Prompt ensembling           DCLIP [29]                ✓                       ✘
methods (LLM)               WaffleCLIP-Concept [39]   ✓                       ✘
                            CuPL [36]                 ✓                       ✘
                            ProText (Ours)            ✓                       ✓

Table 1. Existing methods improve CLIP's generalization by learning prompts with image supervision or using non-transferable prompt ensembling with LLM knowledge. In contrast, our approach, ProText, effectively learns prompts with text-only supervision which are transferable to new datasets and classes.

1. Introduction

The vision field is experiencing a new paradigm in its model-building approach with the emergence of foundational models [18, 23, 37, 47], which are large DNNs pre-trained on web-scale data. Among these, Vision-Language models (VLMs) such as CLIP [37] stand out as the latest highlights, which leverage contrastive pre-training on massive image-text pairs from the internet. During pre-training, CLIP learns to align image-text samples in a shared feature space. This allows CLIP to encode open-vocabulary concepts and generalize well to zero-shot recognition tasks. CLIP consists of two encoders to encode image and text inputs respectively. At inference, a hand-crafted prompt such as 'a photo of a CLS' is used as the text input. Text features of classes are compared with the visual feature, and the class with the highest similarity is assigned as the predicted label. Improving the quality of text templates, such as adding attributes [1] or class-specific details [19, 36], has been shown to improve CLIP performance. However, designing high-quality prompts that can best describe the test image remains a key challenge, as the image content is not known in advance.

In the literature, numerous techniques have been proposed to adapt CLIP for downstream recognition tasks. One branch of methods [6, 17, 27, 41, 49, 50] treats text prompts as learnable vectors and optimizes them using task-specific objectives such as cross-entropy. As prompts are learned in the embedding space, this allows them to be used with classes and datasets beyond those on which they were trained. While effective over the baseline CLIP, most of these methods require annotated image labels to optimize the prompts, which is often impractical, especially in real-world scenarios such as medical imaging, remote sensing, security,
[Figure 1: bar chart of average performance (%) over 10 datasets — CoOp 63.88, CLIP 65.15, CuPL 65.15, CoCoOp 65.74, PromptSRC 65.81, MaPLe 66.3, ProText (Ours) 67.23.]
Figure 1. Without using any images for supervision, ProText with text-only training improves over CLIP, CuPL, and prior 16-shot image-supervised methods in challenging cross-dataset transfer settings. Prompt ensembling based CuPL performs the same as CLIP as it cannot transfer class-specific LLM templates to cross-datasets.

surveillance, etc. Moreover, these methods tend to overfit on few-shot source samples and struggle to retain CLIP's generalization, especially in cross-dataset settings.

Alternatively, several methods [29, 36] have adopted the training-free approach of prompt ensembling by leveraging the capabilities of Large Language Models (LLMs). Instead of using hand-crafted templates, these methods mine dataset- or class-specific descriptors and captions from LLMs to enrich text features. These enriched features aim to better represent content that could possibly occur in test images, leading to improvements over baseline CLIP. Although these methods do not require image information, the knowledge acquired from LLMs is mostly specific to each class and not directly transferable across unseen classes and datasets since no optimization is performed. Additionally, generating LLM descriptions for each concept separately incurs additional LLM serving and prompt engineering costs.

In this work, we present a new paradigm to improve CLIP's generalization. Our motivation comes from combining the strengths of prompt learning and prompt ensembling approaches while effectively addressing their limitations. To this end, we introduce ProText: Prompt Learning with Text-Only Supervision. In contrast to previous methods, our approach instead proposes to learn prompts using text-only data obtained from LLMs. As supervised training of prompts is not trivial in an image-free setting, we develop a novel training framework that allows prompts to learn and extract rich contextual knowledge from LLM data. Moreover, as LLM contextual knowledge is mapped within the learned prompts, it enables zero-shot transfer of prompts to new classes and datasets, potentially leading to a substantial reduction in LLM serving and prompt engineering cost.

As shown in Tab. 1, our approach is different from prior methods as it does not require image samples to learn prompts; in addition, the adapted CLIP transfers well to unseen classes and datasets, therefore addressing a key limitation of LLM-based prompt ensembling techniques. We demonstrate the effectiveness of ProText by performing extensive evaluations on 4 benchmarks. In the challenging cross-dataset transfer setting, ProText without using any visual information achieves an average gain of +2.08% over CLIP while surpassing the performance of the previous best image-supervised prompt learning method MaPLe [20] by +0.93% (Fig. 1). Further, ProText with text-only supervision performs competitively against prior methods in the domain generalization, base-to-novel class, and text-only supervised settings. Our main contributions are summarized as follows:
• We present a new approach for prompt learning in CLIP using text-only supervision. Our method harmonically combines the strengths of prompt learning and prompt ensembling methods to improve CLIP's generalization.
• To optimize prompts with text-only data, we develop a training approach that allows prompts to learn a mapping by extracting rich contextual information from LLM data.
• As LLM contextual knowledge is mapped within the learned prompts, this enables prompts to be directly used with new classes and datasets, potentially cutting the additional LLM serving and prompt engineering cost.
• We validate the effectiveness of our method through extensive experiments across four benchmarks. Our ProText approach improves the generalization of CLIP across various settings and fares competitively against approaches that explicitly use labeled image samples for training.

2. Related Work

Foundational Vision-Language models (VLMs). VLMs [18, 33, 37, 46–48] leverage joint image-text pretraining using internet-scale data in a self-supervised fashion. Representative VLMs like CLIP [37] and ALIGN [18] have utilized around 400M and 1B image-text pairs during their pre-training. Using the contrastive learning objective, VLMs learn rich multi-modal features by attracting together the features of paired images and texts while repelling un-paired image-text features in a joint feature space. The resulting model learns open-vocabulary concepts interpretable through natural language, suitable for various downstream discriminative vision tasks such as open-vocabulary image classification [6, 20, 27, 31, 32, 50], detection [3, 10, 26, 30, 51], and segmentation [13, 24, 25]. Although promising, adapting VLMs effectively while maintaining their original generalization remains a crucial challenge. In this work, we propose a novel method to adapt CLIP with prompt learning through text-modality supervision to improve its performance on vision-modality tasks.

Prompt Learning for VLMs. Prompt learning [6, 9, 27, 40, 41, 49, 50] has emerged as an effective fine-tuning strategy to adapt large-scale models. This approach adds a small number of learnable embeddings along with the model inputs, which are optimized during training while the rest of the model is kept frozen. As the pre-trained model is unchanged during prompt learning, it has become particularly effective for VLMs such as CLIP, where maintaining the model's original generalizability is crucial. CoOp [50] is the pioneering prompt learning method for CLIP which learns text prompt embeddings to fine-tune CLIP. CoCoOp [49] improves CoOp's generalization by conditioning text prompts on visual features. MaPLe [20] proposes a multi-modal prompting framework to adapt both the vision and language branches of CLIP. UPL [17] adopts an unsupervised prompt learning approach to finetune CLIP. PromptSRC [21] improves prompt learning from a regularization perspective by making use of additional loss functions during training. While these methods improve baseline CLIP performance, most of them require image samples with labels, which is less practical, and generating pseudo-labels is often less effective. In contrast, we present a novel prompt learning approach that improves CLIP generalization without relying on any visual samples during training.

Training-Free Text Prompt Enhancement. With the emergence of LLMs such as GPT-3 [5], several approaches [29, 36, 39] have demonstrated their potential for improving the zero-shot generalization of CLIP. Instead of using hand-crafted templates for generating class features, these methods leverage LLMs to generate high-level concepts, class descriptions, and/or attributes which are used in one form or another to produce enriched text features. DCLIP [29] generates fine-grained per-class language descriptors and ensembles their similarities with the image to produce classification scores. WaffleCLIP [39] matches DCLIP performance with random descriptors and shows further gains with dataset-specific concepts generated via LLMs. CuPL [36] queries LLMs to generate class-specific prompt descriptions for text prompt ensembling. Although effective, most of these approaches generate class-specific text data from LLMs which is not directly transferable to unseen classes and new datasets since no training is performed. On the other hand, we aim to leverage the same LLM data via a novel text-only prompt learning technique which seamlessly allows the transfer of learned prompts toward unseen classes and new datasets.

3. Method

Given the language-interpretable nature of foundational VLMs such as CLIP [37], they are naturally suited for zero-shot recognition tasks. However, to achieve the full potential of CLIP's generalization for downstream tasks, adaptation still appears to be necessary. Numerous approaches have since been proposed to adapt the general knowledge of CLIP for user-specific downstream tasks. One line of methods adopts prompt learning [20, 27, 49, 50] to re-purpose CLIP features for downstream data. While effective, most of them require image samples with labels to learn the prompts, which is a hard requirement to meet. Another line of methods adopts training-free prompt ensembling techniques [29, 36, 39] with the help of LLMs. Although ensembling-based approaches do not require image information, the majority of these works generate class-specific LLM prompts that are not directly transferable to new classes and datasets.

To this end, we present a new paradigm for learning generalized transferable prompts for VLMs using text-only supervision. Our proposed adaptation framework, ProText: Prompt Learning with Text-only supervision, aims to address the challenges of existing approaches by learning transferable prompts without relying on images. Fig. 2 shows our ProText framework. First, we curate text-only LLM template data using the class names of a given dataset and an LLM such as GPT-3 [5]. As a text-supervised approach, ProText only requires CLIP text encoders during training. Specifically, we employ one frozen encoder with learnable prompts and a second frozen encoder without learnable prompts. Learnable prompts with class-name templates are input to the prompted text encoder to obtain the class-name template feature, and the frozen text encoder generates the LLM template feature from the class description obtained from LLM data. Next, we employ a contextual mapping training objective which maps the class-name template feature to the LLM template feature. Contextual mapping allows the prompts to learn a mapping function that embeds rich contextual knowledge from LLM data within the prompt vectors. As prompts are learned in the embedding space, they are directly compatible with new classes and datasets. At inference, the learned prompts are shipped with the CLIP model for standard zero-shot CLIP inference for visual recognition.

Below we explain our proposed approach in detail. We first revisit CLIP and previous methods, including prompt learning and prompt ensembling via LLMs, in Sec. 3.1, and then present our ProText approach in Sec. 3.2.

3.1. Preliminaries

Contrastive Language-Image Pre-training (CLIP). CLIP consists of an image encoder f and a text encoder g which map image and text inputs into visual and textual features respectively. We denote the CLIP parameters as θCLIP = {θf, θg}, where θf and θg refer to the image and text encoder parameters, respectively. An input image X is divided into M patches which are linearly projected to produce patch tokens, and a learnable class token CLS is prepended, resulting in the final sequence X̃ = {CLS, e1, e2, · · · , eM}. The image encoder f encodes the input patches via multiple transformer blocks to produce a latent visual feature representation f̃ = f(X̃, θf), where f̃ ∈ R^d. Next, the corresponding class label y is embedded in a text template, such as 'a photo of a [CLASS]', which can be formulated as Ỹ = {SOS, t1, t2, · · · , tL, ck, EOS}. Here {tl}_{l=1}^{L} and ck are the word embeddings corresponding to the text template and the label y, respectively, while SOS and EOS are the learnable start and end token embeddings. The text encoder g encodes Ỹ via multiple transformer blocks to produce the latent text feature g̃ = g(Ỹ, θg), where g̃ ∈ R^d.
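For concreteness, the frozen-CLIP feature extraction and zero-shot matching described above can be sketched with the publicly released OpenAI `clip` package [37]; the class names and image path below are placeholders, and this is only a minimal illustration rather than our training pipeline.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)   # frozen CLIP (theta_f, theta_g)

class_names = ["persian cat", "golden retriever", "sports car"]     # placeholder classes
texts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    f = model.encode_image(image)            # visual feature f~   (1 x d)
    g = model.encode_text(texts)             # text features  g~_i (C x d)
    f = f / f.norm(dim=-1, keepdim=True)     # unit-normalize so dot product = cosine similarity
    g = g / g.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * f @ g.t()   # temperature-scaled similarities
    probs = logits.softmax(dim=-1)

print(class_names[probs.argmax(dim=-1).item()])    # class with highest similarity
```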
[Figure 2: ProText framework diagram. Left panel, "Training with text-only supervision": learnable text prompts with the class-name template (e.g., "a photo of a persian cat") are fed to the prompted text encoder to produce the class-name feature, while a frozen text encoder embeds the LLM description obtained from GPT-3 (e.g., "How does a persian cat look like?" → "A persian cat is a large, long-haired cat with a broad face and round eyes."); the two features are tied by the contextual mapping loss. Right panel, "Inference on visual domain": the learned prompts are used with class-name templates and the image encoder for standard zero-shot CLIP inference.]
Figure 2. Overview of the ProText framework. (Left) First, diverse captions are generated for the training classes using an LLM like GPT-3. During training, the CLIP text encoders generate a prompted class-name feature (g̃p) from class-name templates with learnable prompts and a frozen LLM template feature (g̃) from the LLM-generated templates. Next, we employ a contextual mapping loss to guide the learnable prompts to learn a mapping from the prompted class-name feature to the LLM template feature containing more information about the class. This allows the learned prompts to exploit the internal knowledge of the text encoder complemented by LLM descriptions. (Right) At inference, the learned prompts are used with class-name templates, and the standard zero-shot CLIP inference protocol is followed. Moreover, the rich contextual information from LLM descriptions mapped within the learned prompts enables their transferability to new classes and datasets.

For zero-shot inference, text features of the text template with class labels {1, 2, · · · , C} are matched with the image feature f̃ as

    p(y = i | X̃) = exp(sim(g̃_i, f̃)/τ) / Σ_{j=1}^{C} exp(sim(g̃_j, f̃)/τ),

where sim(·, ·) denotes the cosine similarity and τ is the temperature.

Prompt Learning with CLIP. Being a parameter-efficient tuning method, prompt learning has emerged as a popular technique to adapt vision-language models like CLIP. Since most of the model is kept frozen during adaptation, prompt learning aims to reduce overfitting. Learnable prompts are appended either at the image side [2], the text encoder side [49, 50], or both sides. In this work, we learn hierarchical prompts at the text encoder, named Deep Language Prompting (DLP) [20], formulated as follows. T learnable language prompts P_t = {p_t^1, p_t^2, · · · , p_t^T} are appended with the text input tokens, resulting in Ỹ_p = {SOS, P_t, t1, t2, · · · , tL, ck, EOS}. The text encoder processes Ỹ_p and the prompted text feature is obtained as g̃_p = g(Ỹ_p, θg). We use deep prompting which learns hierarchical prompts at subsequent transformer blocks of the text encoder. The visual feature f̃ is obtained without utilizing learnable prompts. To adapt CLIP to an image classification task on dataset D, the prompts P_t are optimized in a supervised fashion using labeled image samples with a cross-entropy loss L_CE,

    L_CE = arg min_{P_t} E_{(X,y)∼D} L(sim(f̃, g̃_p), y).   (1)

Prompt Ensembling with LLM descriptions. Several methods have recently been proposed to adapt CLIP via training-free prompt ensembling techniques. The majority of these approaches leverage the capabilities of LLMs to mine rich descriptions, attributes, or high-level concepts of class names. The corresponding text features are either averaged [36], or the similarity score of each attribute with the image is calculated to obtain classification scores [29, 39].

In this work, we focus our comparison on a strong ensembling baseline, CuPL [36]. Specifically, a Large Language Model F such as GPT-3 [5] is used to generate class-specific descriptions for class labels {1, 2, · · · , C} using queries such as 'How does a CLASS look like?'. Text features of the descriptions of the same class are averaged together, which serves as the ensembled text feature. Finally, zero-shot inference is performed with those ensembled text features.
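As a companion to the zero-shot sketch earlier, the snippet below illustrates CuPL-style ensembling by averaging the text features of pre-generated LLM descriptions per class; the `llm_descriptions` dictionary is a hypothetical stand-in for data produced with CuPL's LLM queries.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Hypothetical pre-generated LLM descriptions (CuPL-style), one list per class name.
llm_descriptions = {
    "persian cat": ["A Persian cat is a large, long-haired cat with a broad face and round eyes.",
                    "A photo of a fluffy Persian cat lying on a sofa."],
    "sports car":  ["A sports car is a low, sleek two-door vehicle built for speed."],
}

@torch.no_grad()
def ensembled_class_features(descriptions):
    feats = []
    for cls, sentences in descriptions.items():
        tokens = clip.tokenize(sentences, truncate=True).to(device)
        g = model.encode_text(tokens)
        g = g / g.norm(dim=-1, keepdim=True)
        feats.append(g.mean(dim=0))                 # average description features per class
    feats = torch.stack(feats)
    return feats / feats.norm(dim=-1, keepdim=True) # re-normalize the ensembled features

text_classifier = ensembled_class_features(llm_descriptions)  # C x d, used as g~_i above
```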
3.2. Prompt Learning with Text-Only Supervision

While image-supervised prompt learning and LLM-based prompt ensembling methods have proven effective in adapting CLIP, they face notable challenges as outlined below.

Visual data dependency. Existing prompt learning methods require visual samples with labels to optimize prompts using Eq. 1. However, collecting samples and labels is difficult in critical scenarios like medical imaging, remote sensing, and surveillance. Pseudo-labels alleviate the label dependency but they are often less effective. Furthermore, these methods tend to overfit CLIP to source data distributions and compromise generalization across cross-datasets. For instance, CoOp utilizing labeled source samples reduces average CLIP performance by 1.27% on 10 cross-datasets.

LLM prompts transferability limitation. LLM-based prompt ensembling approaches like CuPL [36] generate class-specific LLM descriptions that cannot be directly transferred to unseen classes and datasets. While open-source LLMs exhibit lower performance, proprietary ones such as GPT-3 are required for generating data for new classes and datasets, leading to additional serving costs.

Our work aims to address the aforementioned limitations within a unified framework. Below we detail our strategy for curating text-to-text data via LLMs for training, followed by our text-only prompt learning framework.

3.2.1 Text-Only LLM data for Prompt Learning

As discussed in Sec. 3.1, optimizing prompts for downstream datasets typically requires image-label pairs. Since we explicitly aim to bypass this requirement, we first leverage LLMs to curate text data for prompt learning which consists of text inputs and text outputs. Given a set of classes {c_i}_{i=1}^{C}, we prepare text inputs {L_inputs^i}_{i=1}^{C} by wrapping each class name in a standard hand-written text template,

    L_inputs^i = 'a photo of a c_i'.

Next, we prepare text outputs corresponding to the L_inputs. Specifically, we query the GPT-3 model to generate detailed descriptions for each class name c_i. Similar to CuPL [36], we prompt GPT-3 with different queries Q conditioned on class names, such as 'How does a c_i look like?' and 'How can you identify a c_i?', to obtain text outputs,

    L_outputs^i = F(Q | c_i).

Similar to [36], we generate M text outputs per query Q and use N different queries, resulting in M × N text outputs per class category. We associate all L_outputs with the corresponding single L_inputs for each class c_i. As LLMs are pre-trained on internet-scale text corpora, they possess the capability of generating very diverse and high-quality descriptions and captions for different class categories, which results in high-quality text outputs. Finally, we combine L_inputs and L_outputs to create LLM-based text-to-text data for text-only prompt learning, D_PROMPT = {L_inputs^i, L_outputs^i}_{i=1}^{M×N×C}. We refer the readers to the supplementary material for additional details on the choice of LLM prompts and examples of D_PROMPT.

3.2.2 Contextual mapping with Prompt Learning

To leverage the LLM text-to-text data D_PROMPT for learning generalized transferable prompts, we propose a contextual mapping strategy that effectively learns a mapping function from standard class-name templates such as 'a photo of a c_i' to the text feature generated from an LLM description which contains more information about the class c_i. In other words, contextual mapping allows learnable prompts to map L_inputs to L_outputs in the text feature space of CLIP. The mapping function is realized in the form of learnable prompt vectors, which we found to be more effective in our ablations as compared to other techniques such as adapters via linear projection and MLP.

For an i-th training sample from D_PROMPT consisting of a text-to-text pair {L_inputs, L_outputs}_i, we obtain the prompted class-name feature g̃_p for L_inputs^i using the learnable prompts, and the frozen LLM feature g̃ for L_outputs^i without the prompt vectors, within the pre-trained latent space of CLIP. We then impose a contextual mapping constraint between the g̃_p and g̃ text features as follows,

    L_mapping = (1/d) ||g̃_p − g̃||_2^2.   (2)

As shown above, we utilize an MSE loss objective to enforce contextual mapping from L_inputs^i to L_outputs^i. We study other choices of consistency objectives in our ablations (Sec. 4.7).

Motivation for L_mapping. The contextual mapping objective allows the learnable prompts to exploit the internal knowledge of the CLIP text encoder to generate rich contextual features aligned with the LLM descriptions (L_outputs^i) for a given class. This strategy effectively learns prompts without using any visual information, and when trained using all training classes together, it enables prompts to capture versatile and generalized context from the LLM descriptions. These context-aware prompts become adaptable for use with any dataset and effectively enable the transferability of class-specific LLM descriptions to unseen classes and datasets. Consequently, this substantially reduces the per-dataset overhead associated with LLM serving and prompt engineering.

Inference. Once the text prompt vectors are optimized through our ProText framework in the text domain, they are ready to be shipped with CLIP for downstream visual-domain inference with a standard zero-shot CLIP inference setup. As shown in Fig. 2 (right), the learned prompts P_t are fused with each given class name to produce the prompted text features {g̃_p}_{i=1}^{C}. Finally, zero-shot inference is performed with the prompted text features and the input image feature f̃ to produce classification scores on test images.
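Putting Sec. 3.2.1 and 3.2.2 together, below is a minimal, shallow-prompt sketch of one contextual-mapping training step with the OpenAI CLIP package. The official ProText implementation uses deep language prompting inside the first transformer blocks of the text encoder (see the released code), so the `encode_text_prompted` helper, the single example pair, and the hyper-parameters here are simplified assumptions rather than the exact method.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)
model.float()
for p in model.parameters():
    p.requires_grad_(False)                       # CLIP stays frozen; only prompts are trained

T = 4                                             # number of learnable language prompts
d_model = model.ln_final.weight.shape[0]          # text transformer width (512 for ViT-B/16)
prompts = torch.nn.Parameter(0.02 * torch.randn(T, d_model, device=device))

def encode_text_prompted(tokens, prompt_vectors):
    # Re-implements CLIP's encode_text with prompt vectors inserted right after the SOS token.
    x = model.token_embedding(tokens)                                   # (B, 77, d)
    x = torch.cat([x[:, :1],
                   prompt_vectors.unsqueeze(0).expand(x.size(0), -1, -1),
                   x[:, 1:-prompt_vectors.size(0)]], dim=1)             # keep length at 77
    x = x + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    eot = tokens.argmax(dim=-1) + prompt_vectors.size(0)                # EOT shifts right by T
    return x[torch.arange(x.size(0)), eot] @ model.text_projection

optimizer = torch.optim.AdamW([prompts], lr=2e-3)

# One text-to-text pair from D_PROMPT (hypothetical example for the class "persian cat").
inp  = clip.tokenize(["a photo of a persian cat"]).to(device)
outp = clip.tokenize(["A Persian cat is a large, long-haired cat with a broad face and round eyes."],
                     truncate=True).to(device)

with torch.no_grad():
    g_llm = model.encode_text(outp)                # frozen LLM template feature g~
optimizer.zero_grad()
g_p = encode_text_prompted(inp, prompts)           # prompted class-name feature g~_p
loss = F.mse_loss(g_p, g_llm)                      # contextual mapping loss (Eq. 2)
loss.backward()
optimizer.step()
```

In practice one would loop this step over all M × N × C pairs of D_PROMPT; at inference the learned `prompts` are simply reused inside `encode_text_prompted` for any (possibly unseen) class names.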
4. Experiments

4.1. Evaluation settings

We perform evaluations in 4 benchmark settings. Prompt ensembling methods and ProText utilize text-only LLM data for adapting CLIP, while image-supervised prompt learning methods use image-label pairs for training.

Base-to-Novel Generalization. This setting evaluates the generalization of methods within a dataset. Following previous methods [49, 50], we split each dataset into base and novel classes. Models are trained on base classes and evaluated on the test sets of base and novel classes respectively.

Cross-dataset transfer. This setting evaluates the generalization ability of models trained on the ImageNet-1k [8] source dataset by directly transferring them to cross-datasets.

Domain Generalization. We evaluate the robustness of different methods on out-of-distribution datasets. We train models on the ImageNet-1k source dataset and evaluate their performance on four ImageNet variants with domain shifts.

Supervised setting. We provide a performance comparison of ProText with CuPL [36] with text-only data per dataset.

Datasets. For the aforementioned benchmarks, we use the same datasets as followed by previous works [20, 21, 49, 50]. For the cross-dataset transfer, domain generalization, and base-to-novel generalization settings, we use 11 image datasets that cover multiple recognition tasks. These include ImageNet [8] and Caltech101 [11], which contain generic objects; OxfordPets [35], StanfordCars [22], Flowers102 [34], Food101 [4], and FGVCAircraft [28] for fine-grained classification; SUN397 [45] for scene recognition; UCF101 [42] for action recognition; DTD [7] for texture classification; and EuroSAT [14] for satellite image categorization. For the domain generalization setting, we train models on ImageNet [8] as the source dataset and use ImageNet-A [16], ImageNet-R [15], ImageNet-Sketch [44] and ImageNetV2 [38] for out-of-distribution evaluation.

Implementation details. We use a publicly available pre-trained ViT-B/16 CLIP model from OpenAI [37]. We train ProText with Deep Language Prompting in the first 9 transformer blocks of the CLIP text encoder. For the cross-dataset transfer and domain generalization settings, we train ProText using T = 4 and T = 16 language prompts with 10 and 200 epochs respectively. Similar to [44], ProText and zero-shot CLIP use additional concepts where available with their prompts, such as 'a photo of a CLS, a type of flower' for OxfordFlowers [34]. For the base-to-novel and supervised settings, ProText uses the optimal prompt length and epoch configuration for each dataset. The optimal training configuration is obtained through a hyper-parameter search on the validation split of each dataset. To generate text-only data, we utilize the GPT-3 DaVinci-002 model [5] and generate class-specific descriptions using the LLM prompts provided by CuPL [36]. We use the publicly available CuPL data and generate descriptions for datasets not provided by CuPL. The AdamW optimizer is used with 5 warm-up epochs for training. We use a single 16-GB V100 GPU to train our models. Refer to the supplementary material for additional implementation details.
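For reference, the implementation details above can be summarized as a configuration sketch; the key names below are illustrative assumptions and not the official ProText configuration schema.

```python
# Hypothetical summary of the training setup described in the implementation details.
protext_config = {
    "clip_backbone": "ViT-B/16",            # pre-trained CLIP from OpenAI
    "prompt_type": "deep_language",         # Deep Language Prompting
    "prompt_depth": 9,                      # prompts in the first 9 text-encoder blocks
    "n_prompts": [4, 16],                   # T = 4 or T = 16 depending on the setting
    "epochs": [10, 200],                    # 10 or 200 epochs, matched to the prompt length
    "llm": "GPT-3 DaVinci-002",             # class descriptions generated with CuPL's LLM prompts
    "optimizer": "AdamW",
    "warmup_epochs": 5,
    "hardware": "1x 16GB V100",
}
```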
Method                    ImageNet Acc.
1: CLIP (ICML'21)         66.72
2: CLIP-Attribute         67.60
3: CLIP-80                68.32
4: DCLIP (ICLR'23)        68.03
5: WaffleCLIP (ICCV'23)   68.34
6: CuPL (ICCV'23)         69.62
7: ProText-Attribute      68.05
8: ProText-80             68.48
9: ProText-CuPL           70.22

Table 2. With the same text data, learning contextual prompts with text-only supervision improves CLIP performance in comparison to the prompt ensembling techniques.

4.2. Effectiveness of Text-Only Supervision

We first present an ablation to motivate our approach of learning prompts with text-only supervision. We train ProText with 3 types of text data and evaluate performance on ImageNet-1k [8]. ProText-Attribute uses the 46 templates from [1] which correspond to common image attributes such as rotation, blurriness, etc. ProText-80 is trained on the standard 80 templates provided by CLIP [37], and ProText-CuPL is trained on the class-specific LLM data employed by our main baseline CuPL [36] for its ensembling approach.

In Tab. 2, we compare ProText with CLIP and recent LLM-based ensembling methods. Prompt ensembling with the attribute templates and the 80 templates improves over the CLIP single-template result. Among the LLM-based ensembling methods, CuPL provides the highest performance of 69.62%. In contrast, ProText uses a learning-based approach and shows competitive performance against the prompt ensembling methods using the same text data. ProText-Attribute provides a gain of 0.45% over CLIP-Attribute while roughly maintaining its performance against CLIP-80. When equipped with CuPL LLM text data, ProText surpasses CuPL by 0.60%, leading to the highest performance against all methods. These results motivate our approach: instead of prompt ensembling, one can achieve competitive results by utilizing the same available text data to learn prompts. Next, we demonstrate the generalization of ProText such that the learned prompts transfer well across new classes and datasets.

Dataset         CLIP [37]              CuPL [36]              ProText (Ours)
                Base   Novel  HM       Base   Novel  HM       Base   Novel  HM
ImageNet        72.43  68.14  70.22    74.30  68.14  71.09    75.00  71.38  73.14
Caltech101      96.84  94.00  95.40    97.22  94.00  95.58    98.06  95.63  96.83
OxfordPets      91.17  97.26  94.12    94.42  97.26  95.82    94.95  98.00  96.45
StanfordCars    63.37  74.89  68.65    63.54  74.89  68.75    64.54  76.08  69.84
Flowers102      72.08  77.80  74.83    74.36  77.80  76.04    74.36  78.44  76.35
Food101         90.10  91.22  90.66    89.93  91.22  90.57    90.20  91.98  91.08
Aircraft        27.19  36.29  31.09    30.61  36.29  33.21    30.91  34.13  32.44
SUN397          69.36  75.35  72.23    76.02  75.35  75.68    76.14  79.14  77.61
DTD             53.24  59.90  56.37    62.85  59.90  61.34    63.08  61.59  62.33
EuroSAT         56.48  64.05  60.03    59.64  64.05  61.77    59.71  80.97  68.73
UCF101          70.53  77.50  73.85    75.28  77.50  76.37    75.54  79.50  77.47
Average         69.34  74.22  71.70    72.56  74.22  73.38    72.95  76.98  74.91

Table 3. Base-to-novel setting. ProText enables the transferability of learned prompts to new classes and improves over CuPL [36].

4.3. Base to novel class generalization

We now present results in the base-to-novel class generalization setting, where training data for only the base classes are available and the model is evaluated on both base and novel classes. For CuPL [36], we use base-class LLM templates for the base classes and zero-shot CLIP results for its novel classes. For ProText, we use base-class LLM templates for training and transfer the learned prompts to the novel classes.

Results are shown in Tab. 3. CuPL outperforms zero-shot CLIP on base classes while maintaining its performance on novel classes, as LLM prompts for new classes are not available. ProText shows consistent improvements over CuPL on base classes for 11 datasets.
                 Source                                              Target
Method           ImageNet  Caltech101  OxfordPets  StanfordCars  Flowers102  Food101  Aircraft  SUN397  DTD    EuroSAT  UCF101  Average
Methods utilizing labeled visual samples
CoOp             71.51     93.70       89.14       64.51         68.71       85.30    18.47     64.15   41.92  46.39    66.55   63.88
Co-CoOp          71.02     94.43       90.14       65.32         71.88       86.06    22.94     67.36   45.73  45.37    68.21   65.74
MaPLe            70.72     93.53       90.49       65.57         72.23       86.20    24.74     67.01   46.49  48.06    68.69   66.30
PromptSRC        71.27     93.60       90.25       65.70         70.25       86.15    23.90     67.10   46.87  45.50    68.75   65.81
Zero-shot & Prompt ensembling methods
CLIP             66.72     92.98       89.13       65.29         71.30       86.11    24.90     62.59   44.56  47.84    66.83   65.15
CuPL             69.62     92.98       89.13       65.29         71.30       86.11    24.90     62.59   44.56  47.84    66.83   65.15
Prompt learning with text-only supervision
ProText (Ours)   69.80     94.81       91.01       66.00         72.35       86.66    24.72     67.34   47.93  51.86    69.60   67.23

Table 4. Cross-dataset transfer setting. CuPL and CLIP perform the same on cross-datasets as CuPL source data cannot transfer to cross-datasets. Image-based models are trained on 16-shot ImageNet samples. ProText employs the same ImageNet data as CuPL for prompt learning.

Furthermore, with the same LLM base-class data as CuPL, ProText effectively transfers the learned prompts to novel classes and improves the CLIP and CuPL novel-class performance by 2.76% averaged across 11 datasets. This shows the advantage of ProText prompts to benefit unseen-class performance, potentially reducing the LLM prompt serving cost by half.

4.4. Cross-dataset transfer

In the cross-dataset transfer setting, we compare ProText with CLIP [37], CuPL [36], and image-supervised prompt learning methods. Since class-specific ImageNet LLM prompts limit its transfer to other datasets in CuPL, we assign CLIP results to CuPL for cross-datasets. Image-supervised methods [20, 21, 49, 50] are trained with 16-shot ImageNet data.

We show our main comparison results in Tab. 4. CuPL improves the ImageNet performance of CLIP by ensembling ImageNet LLM prompts, while its cross-dataset results remain the same as CLIP. In contrast, ProText effectively addresses the transferability challenges of CuPL using generalized prompts trained with the same ImageNet LLM data. Since ProText allows generalization to unseen datasets, these learned prompts can directly be used with CLIP for cross-datasets, leading to absolute average gains of +2.1% against CLIP and CuPL. With ProText, one can notably reduce proprietary LLM serving and prompt engineering costs, as prompts learned on one dataset are effectively transferable to other datasets. We next compare ProText with strong 16-shot image-supervised methods. Without using any visual samples, ProText demonstrates effective generalization on cross-datasets and consistently surpasses the previous state-of-the-art MaPLe on 9/10 datasets, leading to the highest average accuracy of 67.23%. This highlights that text-only methods like ProText can lead to better generalization of CLIP as compared to image-supervised methods which tend to overfit on the source sample distributions.

                Source     Target
Method          ImageNet   -V2     -S      -A      -R      Avg.
Methods utilizing labeled visual samples
CoOp            71.51      64.20   47.99   49.71   75.21   59.28
CoCoOp          71.02      64.07   48.75   50.63   76.18   59.91
MaPLe           70.72      64.07   49.15   50.90   76.98   60.27
Zero-shot & Prompt ensembling methods
CLIP            66.72      60.83   46.15   47.77   73.96   57.18
CuPL            69.62      63.27   49.02   50.72   77.05   60.01
Prompt learning with text-only supervision
ProText (Ours)  70.22      63.54   49.45   51.47   77.35   60.45

Table 5. Domain generalization. Prompt learning methods are trained on ImageNet and evaluated on datasets with domain shifts.

4.5. Domain generalization experiments

We present the results for the domain generalization task in Table 5. As the domain-shift variants of ImageNet share class names with ImageNet, CuPL employs prompt ensembling for each dataset and provides an average gain of +2.84% over CLIP. In contrast, ProText with learned prompts shows an additional gain of +0.44% against CuPL averaged over the 4 datasets. Moreover, ProText fares competitively with image-supervised methods by showing consistent improvements over CoOp, CoCoOp, and MaPLe. These results suggest that text-only supervision methods like ProText can serve as an effective alternative to improve the robustness of VLMs when no visual information is available for training.

Dataset        CLIP    CuPL    ProText   ∆
ImageNet       66.72   69.60   70.22     +0.62
Caltech101     92.98   94.32   95.29     +0.97
DTD            44.56   53.96   54.02     +0.06
EuroSAT        47.84   60.27   58.53     -1.74
StanfordCars   65.29   65.95   66.77     +0.82
Flowers102     71.30   73.85   74.42     +0.57
Aircraft       24.90   27.66   29.01     +1.35
SUN397         62.59   69.00   69.76     +0.76
OxfordPets     89.13   91.11   92.72     +1.61
UCF101         66.83   70.63   71.45     +0.82
Food101        86.11   86.11   86.68     +0.57
Average        65.15   69.31   69.90     +0.59

Table 6. ProText results with text supervision on each dataset. We compare ProText with CLIP and CuPL. Gains of ProText over CuPL are shown in the ∆ column.
Figure 3. Ablation: Prompt length (left) and prompt depth (right).
Figure 4. (Left) Effect of LLM data size on performance. (Right) Ablation on ensembling LLM descriptions for training ProText.

Method                       ImageNet Top-1
1: ProText-contrastive loss  68.12
2: ProText-L1 loss           69.96
3: ProText-MSE loss          70.22

Table 7. Ablation on the choice of loss for contextual mapping. MSE loss provides the highest results.

Method                   ImageNet Top-1
1: ProText-80 templates  68.48
2: ProText-Alpaca        67.10
3: ProText-GPT-3         70.22

Table 8. Effect on performance with different text data for training. GPT-3 text data shows the highest results.

Method               ImageNet Top-1
1: Linear Adaptor    69.36
2: MLP Adaptor       69.24
3: Prompt Learning   70.22

Table 9. Ablation on the choice of mapping network. Prompt learning shows optimal performance.

          Correct class confidence (%) ↑       Incorrect class confidence (%) ↓
Method    DTD    SUN    Caltech  UCF           DTD    SUN    Caltech  UCF
CLIP      30.5   49.3   84.5     56.4          1.51   0.13   0.16     0.44
ProText   33.1   54.2   89.1     59.5          1.45   0.12   0.11     0.40

Table 10. Confidence score analysis: ProText trained on ImageNet improves its logit confidence for correct classes in unseen datasets.

4.6. Supervised text-only training

In this setting, we compare ProText with CuPL for each dataset trained on LLM template data, and the results are shown in Tab. 6. While utilizing the same LLM data, ProText achieves consistent improvements over CuPL on 10/11 datasets with an average gain of +0.59%. This reflects the generalization of the ProText approach across various diverse image datasets, where it better utilizes the LLM data within the learned prompts. We also compare ProText with image-supervised methods and observe that ProText fares competitively with approaches utilizing up to 2-shot samples for training. This shows ProText as a potential alternative to image-supervised methods in extremely low-data regimes. Refer to the supplementary material for additional results.

4.7. Ablative analysis

On understanding ProText prompts. In Table 10, we present the average confidence scores obtained from ProText logits trained on ImageNet-1k text data when applied to cross-datasets. Compared to CLIP, ProText exhibits increased confidence scores for correct classes across various datasets, while marginally decreasing confidence scores for incorrect classes. This suggests that the prompts learned on ImageNet-1k provide complementary and transferable contextual cues, leading to improved results. We conjecture that ProText prompts potentially improve the classification of test samples situated near the decision boundary due to higher confidence for correct classes. Refer to the supplementary section for qualitative and additional analysis.

Loss metric in contextual mapping. We ablate on the choice of loss used for the contextual mapping module in Tab. 7. Distance-based losses improve over the contrastive loss. We conjecture that the contrastive loss treats samples with the same class labels in the same batch as negatives, leading to noisy training.

Choice of LLM for generating text data. ProText by default uses the GPT-3 [5] LLM to obtain text templates for training. Here we ablate on an open-source Alpaca [43] model as an alternative choice. As shown in Tab. 8, ProText with Alpaca templates performs worse than ProText-80 templates and ProText-GPT-3. We observed that Alpaca templates are often noisy, while GPT-3 descriptions contain more enriched class details which results in better performance.

Prompt learning versus adapter. While ProText employs prompt learning to learn the contextual mapping from LLM templates, here we ablate on adapters in Tab. 9. Similar to [12], we attach an adapter at the output of the CLIP text encoder. Adapters perform lower as compared to prompting. We conjecture that the adapter completely transforms the text features and loses CLIP generalization. In contrast, prompt learning appends learnable vectors to the CLIP text input without significant replacement and learns an effective mapping function.
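For completeness, the adapter baseline of Tab. 9 can be sketched as a small trainable layer on top of the frozen text encoder trained with the same mapping loss; this is our reading of the setup following [12], and the single example pair and learning rate are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)
model.float()
for p in model.parameters():
    p.requires_grad_(False)

feat_dim = model.text_projection.shape[1]                   # 512 for ViT-B/16
adapter = torch.nn.Linear(feat_dim, feat_dim).to(device)    # "Linear Adaptor" variant of Tab. 9
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

inp  = clip.tokenize(["a photo of a persian cat"]).to(device)
outp = clip.tokenize(["A Persian cat is a large, long-haired cat with a broad face and round eyes."],
                     truncate=True).to(device)
with torch.no_grad():
    g_in  = model.encode_text(inp)     # frozen class-name feature
    g_llm = model.encode_text(outp)    # frozen LLM description feature

optimizer.zero_grad()
loss = F.mse_loss(adapter(g_in), g_llm)   # the adapter transforms the whole text feature,
loss.backward()                           # which the ablation finds hurts CLIP generalization
optimizer.step()
```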
Training data size for text supervision. To assess the effect of the LLM template data size on ProText, we ablate on the number of descriptions per class in Fig. 4 (left). Increasing the number of descriptions for each class consistently improves the results. This suggests that we could further boost ProText performance as the quality and size of the text data increases.

Ensembling in ProText training. ProText uses multiple descriptions per class and enforces mapping of the class-name template feature to the feature of each LLM description for that class. We conduct an alternative experiment by ensembling a single feature from the multiple LLM descriptions per class and enforcing mapping on the ensembled LLM feature. As shown in Fig. 4 (right), ProText-ensembled performs lower than ProText with individual samples. We conjecture that learning on each description allows the model to utilize the additional context present in each description. Ensembling can potentially mask out less frequent details available in the text.

Prompt length and prompt depth. Fig. 3 (left) shows the effect of the prompt length for training ProText. Setting the prompt length to 16 leads to optimal performance. Fig. 3 (right) shows the effect of the prompt depth on final performance, where a prompt depth of 9 shows optimal results.

5. Conclusion

Prompt learning and LLM-based ensembling are effective techniques to improve CLIP's generalization. However, prompt learning often requires labeled images, which is less practical, while LLM-based ensembling methods are dominantly class-specific and not directly transferable to new classes. To address these challenges, we propose a new direction to adapt CLIP by learning generalized prompts with text-only supervision, without relying on visual data. We introduce a training strategy for prompts to learn a mapping function that embeds rich contextual knowledge from LLM text data within the prompts. The context learned by these prompts transfers well to unseen classes and datasets, potentially reducing the LLM prompt engineering and serving cost. We perform extensive evaluations on four benchmarks where our text-only approach performs favorably well over previous methods, including those utilizing labeled images.

Acknowledgements: We would like to thank Hanan Ghani and Jameel Hassan for their help in downloading datasets. We also thank Muhammad Jehanzeb Mirza for providing Alpaca LLM prompt data for the ablation experiments.

References

[1] Bang An, Sicheng Zhu, Michael-Andrei Panaitescu-Liess, Chaithanya Kumar Mummadi, and Furong Huang. More context, less distraction: Improving zero-shot inference of clip by inferring and describing spurious features. In Workshop on Efficient Systems for Foundation Models @ ICML2023, 2023. 1, 6
[2] Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, and Phillip Isola. Visual prompting: Modifying pixel space to adapt pre-trained models. arXiv preprint arXiv:2203.17274, 2022. 4
[3] Hanoona Bangalath, Muhammad Maaz, Muhammad Uzair Khattak, Salman H Khan, and Fahad Shahbaz Khan. Bridging the gap between object and image-level representations for open-vocabulary detection. NeurIPS, 35:33781–33794, 2022. 2
[4] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In ECCV, pages 446–461. Springer, 2014. 6
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 33:1877–1901, 2020. 3, 4, 6, 8
[6] Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. Plot: Prompt learning with optimal transport for vision-language models. In ICLR, 2022. 1, 2
[7] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, pages 3606–3613, 2014. 6
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009. 5, 6
[9] Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor G Turrisi da Costa, Cees GM Snoek, Georgios Tzimiropoulos, and Brais Martinez. Bayesian prompt learning for image-language model generalization. In CVPR, pages 15237–15246, 2023. 2
[10] Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. In CVPR, pages 14084–14093, 2022. 2
[11] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPR Workshop, pages 178–178. IEEE, 2004. 6
[12] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. IJCV, pages 1–15, 2023. 8
[13] Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image-level labels. In ECCV, pages 540–557. Springer, 2022. 2
[14] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. J-STARS, 12(7):2217–2226, 2019. 6
[15] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, pages 8340–8349, 2021. 6
[16] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, pages 15262–15271, 2021. 6
[17] Tony Huang, Jack Chu, and Fangyun Wei. Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649, 2022. 1, 3
[18] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICLR, pages 4904–4916. PMLR, 2021. 1, 2
[19] Woojeong Jin, Yu Cheng, Yelong Shen, Weizhu Chen, and Xiang Ren. A good prompt is worth millions of parameters? Low-resource prompt-based learning for vision-language models. arXiv preprint arXiv:2110.08484, 2021. 1
[20] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In CVPR, pages 19113–19122, 2023. 1, 2, 3, 4, 6, 7
[21] Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In ICCV, pages 15190–15200, 2023. 1, 3, 6, 7
[22] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV, pages 554–561, 2013. 6
[23] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023. 1
[24] Boyi Li, Kilian Q. Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation, 2022. 2
[25] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In CVPR, pages 7061–7070, 2023. 2
[26] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 2
[27] Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. Prompt distribution learning. In CVPR, pages 5206–5215, 2022. 1, 2, 3
[28] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013. 6
[29] Sachit Menon and Carl Vondrick. Visual classification via description from large language models. In ICLR, 2023. 1, 2, 3, 4
[30] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. In ECCV, pages 728–755. Springer, 2022. 2
[31] Muhammad Ferjad Naeem, Yongqin Xian, Luc V Gool, and Federico Tombari. I2dformer: Learning image to document attention for zero-shot image classification. NeurIPS, 2022. 2
[32] Muhammad Ferjad Naeem, Muhammad Gul Zain Ali Khan, Yongqin Xian, Muhammad Zeshan Afzal, Didier Stricker, Luc Van Gool, and Federico Tombari. I2mvformer: Large language model generated multi-view document supervision for zero-shot image classification. In CVPR, 2023. 2
[33] Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van Gool, and Federico Tombari. Silc: Improving vision language pretraining with self-distillation. arXiv preprint arXiv:2310.13355, 2023. 2
[34] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, pages 722–729. IEEE, 2008. 6, 2
[35] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, pages 3498–3505. IEEE, 2012. 6
[36] Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? Generating customized prompts for zero-shot image classification. In ICCV, pages 15691–15701, 2023. 1, 2, 3, 4, 5, 6, 7
[37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 1, 2, 3, 6, 7, 4
[38] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pages 5389–5400. PMLR, 2019. 6
[39] Karsten Roth, Jae Myung Kim, A Koepke, Oriol Vinyals, Cordelia Schmid, and Zeynep Akata. Waffling around for performance: Visual classification with random words and broad concepts. 2023. 1, 3, 4, 2
[40] Jameel Hassan Abdul Samadh, Hanan Gani, Noor Hazim Hussein, Muhammad Uzair Khattak, Muzammal Naseer, Fahad Khan, and Salman Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 2
[41] Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. NeurIPS, 35:14274–14289, 2022. 1, 2
[42] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 6
[43] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023. 8
[44] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019. 6
[45] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492. IEEE, 2010. 6
[46] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. In ICLR, 2021. 2
[47] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022. 1
[48] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella,
Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang,
Boxin Li, Chunyuan Li, et al. Florence: A new
foundation model for computer vision. arXiv preprint
arXiv:2111.11432, 2021. 2
[49] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei
Liu. Conditional prompt learning for vision-language mod-
els. In CVPR, pages 16816–16825, 2022. 1, 2, 3, 4, 5, 6,
7
[50] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei
Liu. Learning to prompt for vision-language models. IJCV,
130(9):2337–2348, 2022. 1, 2, 3, 4, 5, 6, 7
[51] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp
Krähenbühl, and Ishan Misra. Detecting twenty-thousand
classes using image-level supervision. In ECCV, pages 350–
368. Springer, 2022. 2
Learning to Prompt with Text Only Supervision for Vision-Language Models
Supplementary Material
The following sections provide supplementary mate- Layer # CTX 1 CTX 2 CTX 3 CTX 4
rial for our main paper. This includes additional analysis 1 a a for onto
and comparison experiments, implementation details, and 2 bi paper erup believes
specifics of our text-to-text data used for training. The con- 3 ilwx ered emon enclosure
tents are organized as follows: 4 devoted fly ced hair
• Additional analysis and comparison experiments (Sec. A) 5 sin tous cona emor
6 foto unwanted swagg curfew
• Additional Implementation details (Sec. B)
7 banan lift knob maz
• Details on Text-Only Data (Sec. C) 8 slow commuter helene nuff
9 chevron rear crepe opi
A. Additional Experiments
Table 11. Illustration of nearest words in CLIP word vocabulary
A.1. Additional Analysis. against ProText prompts in different transformer layers. ProText
prompts are trained on ImageNet-1k LLM prompt data.
Here we provide additional analysis experiments for our
ProText technique.
Qualitative Analysis. In order to understand the transferability of ProText prompts across new datasets, we visualize attention maps in Fig. 5. Specifically, we employ ProText prompts learned on the ImageNet-1k text-only dataset and transfer them to cross-datasets. We observe that ProText tends to focus on relevant image features while reducing its attention towards spurious features, as shown in the Oxford Pets and Caltech-101 images. For the texture image from DTD, ProText shows more global attention over the textured portion of the image, which is crucial for recognizing the correct texture given the fine-grained nature of texture classes. This suggests that ProText can learn complementary contextual features, which steer CLIP towards better transferability to new datasets without relying on visual samples.

[Figure 5: attention maps for samples from Oxford Pets, DTD, and Caltech-101; columns show the Input Image, CLIP, and ProText.]
Figure 5. Attention map visualizations for CLIP and ProText for cross-datasets. ProText is trained on ImageNet-1k text-only data.

Towards interpreting ProText prompts. Our main experiments in Sec. 4.7 demonstrated that ProText trained on the ImageNet-1k text dataset performs favorably across cross-datasets. Here, we are interested in studying how the ProText prompt vectors are interpreted in natural language. Specifically, we searched for the words in the CLIP vocabulary that are closest to the learned prompts using the Euclidean distance in the embedding space. The results in Table 11 show the nearest (valid) word for the ProText prompts across different transformer layers. Note that these words may not concretely correspond to the learned prompts, as we could only select the nearest ones. We observe that the represented words are diverse and contain connecting words that are common in web captions, such as "a," "for," and "onto". Additionally, since CLIP uses a BPE representation for tokenization, several subwords appear among the nearest words, such as "sin," "ced," and "banan." These subwords can collectively contribute to strong context priors, for instance deriving "banana" from "banan," "Mercedes" from "ced," and "casino" from "sin," which may be potentially relevant for downstream datasets like SUN397 and Stanford Cars. At the same time, some words, such as "ilwx" and "curfew", do not appear to contribute much to context enhancement. In summary, similar to the findings in [50], the learned vectors may encompass word representations not explicitly present in the existing vocabulary.
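For concreteness, the probing step can be sketched as follows (a minimal sketch of our own, not the released code): each learned context vector is compared against CLIP's token-embedding matrix with the Euclidean distance, and the closest token string is reported. The prompt_vectors argument, i.e., the trained ProText context vectors of a given layer, is assumed to be available.

import torch
import clip  # OpenAI CLIP package

model, _ = clip.load("ViT-B/16", device="cpu")
token_embeddings = model.token_embedding.weight.detach().float()  # (vocab_size, dim)
tokenizer = clip.simple_tokenizer.SimpleTokenizer()
id_to_token = {idx: tok for tok, idx in tokenizer.encoder.items()}

@torch.no_grad()
def nearest_words(prompt_vectors, k=1):
    # prompt_vectors: (num_ctx, dim) learned context vectors of one layer.
    dists = torch.cdist(prompt_vectors.float(), token_embeddings)  # Euclidean distances
    idx = dists.topk(k, largest=False).indices                     # closest token ids
    return [[id_to_token[int(i)].replace("</w>", "") for i in row] for row in idx]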
A.2. Additional comparisons with WaffleCLIP.

We present additional comparisons between ProText and the WaffleCLIP [39] approach. WaffleCLIP employs prompt ensembling by introducing random descriptors and characters alongside class names. Specifically, we perform a comparison with the WaffleCLIP-Concept variant, which incorporates high-level dataset concepts in its text prompts, such as 'a photo of a flower: a CLS' for OxfordFlowers [34]. Further details on the WaffleCLIP framework and its variants can be found in [39].
Cross-dataset transfer. In the cross-dataset transfer setting, all methods utilize only the LLM prompt information of the ImageNet source dataset. The results are shown in Tab. 12. CuPL shows the same performance as CLIP on the cross-datasets, as class-specific descriptions for the new datasets are not available in this setting. Overall, WaffleCLIP uses random descriptors, which leads to improvements over CLIP and CuPL. In contrast, ProText with text-only training on ImageNet-1k LLM templates shows consistent improvements over WaffleCLIP, surpassing it on 9/10 cross-datasets and reaching an averaged accuracy of 67.23% in this challenging setting.
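As a rough illustration of this protocol (a sketch under our own assumptions, with hypothetical helper names), the prompts learned on ImageNet text data are reused unchanged: a new dataset only contributes its class names, which are encoded with the learned prompts to build a zero-shot classifier for its images.

import torch

@torch.no_grad()
def build_transfer_classifier(class_names, encode_with_learned_prompts):
    # encode_with_learned_prompts: assumed callable wrapping CLIP's text encoder
    # together with the ProText prompts trained on ImageNet-1k LLM data.
    feats = torch.stack([encode_with_learned_prompts(f"a photo of a {c}.")
                         for c in class_names])             # (num_classes, dim)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def predict(image_features, classifier, logit_scale=100.0):
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    return (logit_scale * image_features @ classifier.t()).argmax(dim=-1)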
Text-only supervised setting. We additionally compare with WaffleCLIP in the text-only supervised setting. As shown in Tab. 13, WaffleCLIP improves over CLIP but lags behind CuPL, as it only relies on high-level dataset concepts and random descriptors. CuPL uses class-specific LLM descriptions for prompt ensembling and shows improved results. In contrast to these approaches, ProText adopts a learning-based approach using text data and shows the highest performance, surpassing both WaffleCLIP and CuPL on 10/11 datasets. This suggests that text-only prompt learning can serve as a better alternative to training-free prompt ensembling methods.

A.3. Comparison with image-supervised methods.

We show additional comparisons of ProText with image-supervised methods in terms of generalization performance. In the base-to-novel class generalization setting, we include prompt learning methods utilizing 16-shot image data, where we mainly focus on novel class performance for comparison. In the text-only supervised setting, we compare ProText with few-shot image-supervised methods, including CLIP Linear Probe, CoOp, and CoCoOp, which are trained with up to 2-shot data.

Unseen class generalization. All methods are trained on the seen classes of each dataset, and we specifically analyze their performance on unseen classes to study generalization. Results are shown in Tab. 14. Image-supervised prompt learning methods utilize 16-shot base-class labeled data and demonstrate improved accuracy on novel classes. For example, the previous state-of-the-art method, PromptSRC, achieves a substantial accuracy of 70.73% on ImageNet for novel classes. In comparison, ProText, leveraging text-only data, shows an improvement of +0.65% over PromptSRC for novel classes on ImageNet. In summary, ProText consistently outperforms PromptSRC on 9 out of 11 datasets for novel classes, leading to the highest novel class accuracy of 76.98% averaged over 11 datasets.

Supervised setting. In Tab. 15, we compare ProText with few-shot image-supervised methods including CLIP Linear Probe, CoOp, and CoCoOp. ProText shows improved averaged performance over the 1- and 2-shot Linear Probe. Similarly, ProText, without using any images for training, outperforms CoOp and CoCoOp trained with 1 and 2 shots on most datasets. This suggests that text-only training can be considered an effective alternative to image-supervised methods under extreme low-data regimes.

A.4. Additional ablation studies.

We present additional ablation experiments conducted on ProText, as outlined below.

Combining prompt ensembling and prompt learning. In our ProText approach, the learnable prompts used at inference are trained on text data. Here, we explore an alternative experiment that averages the text features of the ProText-learned prompts with the text features of the LLM templates obtained via prompt ensembling. Specifically, we average the LLM prompt features (e.g., CuPL features) and the ProText features for the same classes to study whether prompt learning and prompt ensembling could be complementary. The results are shown in Table 16. Combining ProText and CuPL features leads to a marginal improvement compared to ProText alone. We conjecture that since ProText uses the same LLM template data to learn its prompts, the LLM template features and the ProText features might not be strongly complementary.
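A minimal sketch of this ablation, under our own assumptions about tensor shapes (the released code may differ): the per-class text features from ProText and from CuPL prompt ensembling are averaged and re-normalized before being used as the zero-shot classifier.

import torch

def combine_text_classifiers(protext_feats, cupl_feats):
    # Both inputs: (num_classes, dim) L2-normalized class text features.
    combined = 0.5 * (protext_feats + cupl_feats)           # average per class
    return combined / combined.norm(dim=-1, keepdim=True)   # re-normalize

The combined features are then evaluated in exactly the same way as the individual classifiers.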
B. Additional Implementation details

Training details. For training ProText, we use the publicly available CLIP ViT-B/16 model from OpenAI [37]. Language prompts for each training run are initialized with 'a photo of a' for the first layer and randomly initialized for the remaining transformer layers of the text encoder of CLIP. All models are trained using the AdamW optimizer on a single 16-GB V100 GPU. For the cross-dataset and domain generalization benchmarks, we train ProText with T = 4 and T = 16 language prompts for 10 and 200 epochs, respectively. The warm-up epochs are set to 5 during training.

As text data from LLMs varies in quality and size across datasets, we have observed that training ProText on each dataset requires a custom training configuration to achieve the best performance. Therefore, ProText employs the optimal prompt length and epoch configuration for each dataset. The optimal training configurations are obtained through the validation splits of each dataset.

Base-to-novel generalization setting. In Tab. 17, we show the hyperparameters used for training models in the base-to-novel generalization setting. We use a learning rate of 0.03 for all datasets except UCF101, Food101, and OxfordFlowers, where a learning rate of 0.0025 is used.
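The recipe above can be summarized in the following training-loop skeleton (a sketch under stated assumptions, not the released implementation): only the prompt vectors are trainable, AdamW is used with a linear warm-up (the cosine decay afterwards is our assumption), and the text-to-text objective of the main paper is abstracted as a text_to_text_loss helper operating on (class-name template, LLM description) pairs.

import torch

def train_protext(prompted_text_encoder, pairs_loader, text_to_text_loss,
                  epochs, lr=0.03, warmup_epochs=5):
    # Only the learnable prompt vectors are expected to have requires_grad=True.
    params = [p for p in prompted_text_encoder.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=lr)
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=1e-2, total_iters=warmup_epochs)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=max(epochs - warmup_epochs, 1))
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])
    for _ in range(epochs):
        for template_tokens, description_tokens in pairs_loader:
            # The loss helper is assumed to encode both text sides (using the
            # prompted text encoder for the class-name template) and return the
            # text-to-text mapping loss described in the main paper.
            loss = text_to_text_loss(prompted_text_encoder,
                                     template_tokens, description_tokens)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return prompted_text_encoder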
Method              ImageNet (Source)  Caltech101  OxfordPets  StanfordCars  Flowers102  Food101  Aircraft  SUN397  DTD    EuroSAT  UCF101  Average
Zero-shot & prompt ensembling methods
CLIP                66.72              92.98       89.13       65.29         71.30       86.11    24.90     62.59   44.56  47.84    66.83   65.15
CuPL                69.62              92.98       89.13       65.29         71.30       86.11    24.90     62.59   44.56  47.84    66.83   65.15
WaffleCLIP-Concept  68.34              94.01       89.57       63.42         72.00       86.84    24.49     66.17   45.15  47.74    67.96   65.74
Prompt learning with text-only supervision
ProText (Ours)      69.80              94.81       91.01       66.00         72.35       86.66    24.72     67.34   47.93  51.86    69.60   67.23

Table 12. Cross-dataset transfer setting. Results comparison of ProText with CLIP, CuPL, and WaffleCLIP-Concept. Averages are computed over the target datasets. ProText overall shows consistent improvements over LLM-based prompt ensembling methods.

Dataset       CLIP   CuPL   WaffleCLIP-C  ProText  ∆
ImageNet      66.72  69.60  68.34         70.22    +0.62
Caltech101    92.98  94.32  94.01         95.29    +0.97
DTD           44.56  53.96  45.15         54.04    +0.06
EuroSAT       47.84  60.27  47.74         58.53    -1.74
StanfordCars  65.29  65.95  63.42         66.77    +0.82
Flowers102    71.30  73.85  72.00         74.42    +0.57
Aircraft      24.90  27.66  24.49         29.01    +1.35
SUN397        62.59  69.00  66.17         69.76    +0.76
OxfordPets    89.13  91.11  89.57         92.72    +1.61
UCF101        66.83  70.63  67.96         71.45    +0.82
Food101       86.11  86.11  86.84         86.68    +0.57
Average       65.15  69.31  65.97         69.90    +0.59

Table 13. ProText results with text supervision on each dataset. We compare ProText with CLIP, CuPL, and WaffleCLIP-Concept. Gains of ProText over CuPL are shown in the ∆ column.

Dataset       CuPL [36]  ProText (Ours)  CoOp [50]  CoCoOp [49]  MaPLe [20]  PromptSRC [21]  ∆
ImageNet      68.14      71.38           67.88      70.43        70.54       70.73           +3.2
Caltech101    94.00      95.63           89.81      93.81        94.36       94.03           +1.6
DTD           59.90      61.59           41.18      56.00        59.18       62.97           +1.7
EuroSAT       64.05      80.97           54.74      60.04        73.23       73.90           +17
StanfordCars  74.89      76.08           60.40      73.59        74.00       74.97           +1.2
Flowers102    77.80      78.44           59.67      71.75        72.46       76.50           +0.6
Aircraft      36.29      34.13           22.30      23.71        35.61       37.87           -2.2
SUN397        75.35      79.14           65.89      76.86        78.70       78.47           +3.8
OxfordPets    97.26      98.00           95.29      97.69        97.76       97.30           +0.7
UCF101        77.50      79.50           56.05      73.45        78.66       78.80           +2.0
Food101       91.22      91.98           82.26      91.29        92.05       91.53           +0.8
Average       74.22      76.98           63.22      71.69        75.14       76.10           +2.8

Table 14. Novel-class generalization comparison. We compare ProText with prompt ensembling and image-supervised methods on unseen-class performance in the base-to-novel class generalization setting. Gains of ProText over CuPL are shown in the ∆ column.

Text-only supervised setting. For our comparison with CuPL [36] in Table 15, ProText models are trained using the same LLM text data as utilized by CuPL. Hyperparameter values are shown in Table 18. All models are trained using a learning rate of 0.03, except for UCF101, EuroSAT, and OxfordFlowers, where a learning rate of 0.0025 is used.

C. Details on Text-Only Data

As discussed in Sec. 3.2.1, our ProText approach relies on text-only data (DPROMPT) curated from Large Language Models (LLMs) for training its language prompts. Here, we provide additional details on the curation of this text-only data. Specifically, we first provide information on the text queries used as input to the LLMs for generating prompts, followed by qualitative examples of DPROMPT.

C.1. Queries to LLMs to curate Text-Only Data

Following [36], we obtain class descriptions from LLMs by providing various queries as inputs. Specifically, we utilize the queries termed Full prompts by CuPL [36]. For instance, to generate class descriptions for the ImageNet-1k classes, we prompt GPT-3 with the following 5 queries:
• 'Describe what a(n) CLS looks like.'
• 'How can you identify a(n) CLS?'
• 'What does a(n) CLS look like?'
• 'Describe an image from the internet of a(n) CLS.'
• 'A caption of an image of a(n) CLS.'
Here, CLS denotes a class name present in the dataset. After generating the LLM class descriptions, we associate all descriptions of the same class with its class-name template given as 'A photo of a CLS'. This results in our text-only training data DPROMPT with text-to-text mapping pairs used to train ProText. Refer to [36] for the LLM queries of the other datasets used to generate class-specific descriptions. For standardized comparisons, we use publicly available CuPL data and generate descriptions for the datasets not provided by CuPL.
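For illustration, the pairing step can be written as the following small sketch (assumptions ours; llm_descriptions stands in for the per-class GPT-3 outputs gathered with the queries above, e.g., loaded from the publicly released CuPL files):

def build_dprompt(class_names, llm_descriptions):
    # Returns (class-name template, LLM description) text-to-text pairs.
    pairs = []
    for cls in class_names:
        template = f"A photo of a {cls}"
        for description in llm_descriptions.get(cls, []):
            pairs.append((template, description))
    return pairs

# Example with the classes illustrated in Sec. C.2:
# pairs = build_dprompt(["tench", "bath towel", "sandal"], llm_descriptions)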
Dataset       CLIP   CuPL   ProText  Linear Probe K=1  Linear Probe K=2  CoOp K=1  CoOp K=2  CoCoOp K=1  CoCoOp K=2
ImageNet      66.70  69.62  70.22    32.13             44.88             66.33     67.07     69.43       69.78
Caltech101    92.98  94.32  95.29    79.88             89.01             92.60     93.07     93.83       94.82
DTD           44.56  53.96  54.04    34.59             40.76             50.23     53.60     48.54       52.17
EuroSAT       47.84  60.27  58.53    49.23             61.98             54.93     65.17     55.33       46.74
StanfordCars  65.29  65.95  66.77    35.66             50.28             67.43     70.50     67.22       68.37
Flowers102    71.30  73.85  74.42    69.74             85.07             77.53     87.33     72.08       75.79
Aircraft      24.90  27.66  29.01    19.61             26.41             21.37     26.20     12.68       15.06
SUN397        62.59  69.00  69.76    41.58             53.70             66.77     66.53     68.33       69.03
OxfordPets    89.13  91.11  92.72    44.06             58.37             90.37     89.80     91.27       92.64
UCF101        66.83  70.63  71.45    53.66             65.78             71.23     73.43     70.30       73.51
Food101       86.11  86.11  86.68    43.96             61.51             84.33     84.40     85.65       86.22
Average       65.15  69.31  69.90    45.83             57.98             67.56     70.65     66.79       67.65

Table 15. ProText results with text supervision on each dataset. We compare ProText with CLIP [37], CuPL [36], and the image-supervised Linear Probe [37], CoOp [50], and CoCoOp [49] methods.

Method                          ImageNet Top-1
1: CuPL                         69.62
2: ProText                      70.22
3: Ensembling: ProText + CuPL   70.28

Table 16. Ablation on combining CuPL and ProText text features.

H.parameter    StanfordCars  Flowers102  Caltech101  OxfordPets  ImageNet  EuroSAT  Food101  SUN397  UCF101  Aircraft  DTD
Epochs         30            30          50          30          150       50       200      30      200     30        20
# Prompts (T)  4             8           4           8           4         8        4        8       4       16        16

Table 17. Hyper-parameter settings used for the base-to-novel generalization setting. The optimal configuration is selected using the validation splits of each dataset.

H.parameter    StanfordCars  Flowers102  Caltech101  OxfordPets  ImageNet  EuroSAT  Food101  SUN397  UCF101  Aircraft  DTD
Epochs         200           30          50          20          300       30       200      200     200     300       100
# Prompts (T)  16            16          4           8           4         16       4        16      16      4         8

Table 18. Hyper-parameters used for the text-only supervised setting.
C.2. Qualitative examples

As LLMs are pre-trained on internet-scale text corpora, they possess the capability of generating diverse and high-quality descriptions and captions for different class categories. Below, we show some examples of DPROMPT text-to-text pairs for the ImageNet-1k dataset.

Class: Tench
Class-name template: 'A photo of a Tench'
Associated LLM descriptions:
• 'A tench is a freshwater fish with a dark green back and light-colored sides.'
• 'A tench looks like a freshwater fish with a dark olive-green back, fading to yellowish-brown on the sides.'
• 'Tench are a freshwater fish that can grow up to 70cm long! They have olive-brown skin with dark spots, and their meat is white and firm.'
• 'This image shows a large, dark green tench swimming in a pond.'

Class: bath towel
Class-name template: 'A photo of a bath towel'
Associated LLM descriptions:
• 'A bath towel typically has a loops on one side and a smooth surface on the other.'
• 'A bath towel is a rectangular piece of fabric, usually Cotton, that is used to dry oneself after a bath or shower.'
• 'The image is of a white bath towel with a blue and green stripes.'
• 'A fluffy white bath towel draped over a towel rack.'

Class: sandal
Class-name template: 'A photo of a sandal'
Associated LLM descriptions:
• 'A sandal is a shoe typically made of leather or synthetic material that has an open toe and a strap or straps that go around the foot or up the ankle.'
• 'A sandal is usually a flat shoe with a strap that goes around the foot or ankle.'
• 'This sandal is from the ancient Egyptian city of Thebes.'
• 'When you are looking to identify a sandal, the first place to start is by looking at the features of the shoe.'
