Learning To Prompt With Text Only Supervision For Vision-Language Models
[Figure 1: bar chart of average accuracy over 10 cross-datasets — CoOp 63.88, CLIP 65.15, CuPL 65.15, CoCoOp 65.74, PromptSRC 65.81, MaPLe 66.30, ProText 67.23; cf. Tab. 4.]
Figure 1. Without using any images for supervision, ProText with text-only training improves over CLIP, CuPL, and prior 16-shot image-supervised methods in challenging cross-dataset transfer settings. Prompt-ensembling-based CuPL performs the same as CLIP, as it cannot transfer class-specific LLM templates to cross-datasets.
… in critical scenarios such as medical imaging, remote sensing, and surveillance. Moreover, these methods tend to overfit on few-shot source samples and struggle to retain CLIP's generalization, especially in cross-dataset settings.

Alternatively, several methods [29, 36] have adopted the training-free approach of prompt ensembling by leveraging the capabilities of Large Language Models (LLMs). Instead of using hand-crafted templates, these methods mine dataset- or class-specific descriptors and captions from LLMs to enrich text features. These enriched features aim to better represent content that could possibly occur in test images, leading to improvements over baseline CLIP. Although these methods do not require image information, the knowledge acquired from LLMs is mostly specific to each class and not directly transferable across unseen classes and datasets, since no optimization is performed. Additionally, generating LLM descriptions for each concept separately incurs additional LLM serving and prompt engineering costs.
In this work, we present a new paradigm to improve CLIP's generalization. Our motivation comes from combining the strengths of prompt learning and prompt ensembling approaches while effectively addressing their limitations. To this end, we introduce ProText: Prompt Learning with Text-Only Supervision. In contrast to previous methods, our approach proposes to learn prompts using text-only data obtained from LLMs. As supervised training of prompts is not trivial in an image-free setting, we develop a novel training framework that allows prompts to learn and extract rich contextual knowledge from LLM data. Moreover, as LLM contextual knowledge is mapped within the learned prompts, it enables zero-shot transfer of prompts to new classes and datasets, potentially leading to a substantial reduction in LLM serving and prompt engineering cost.

As shown in Tab. 1, our approach is different from prior methods as it does not require image samples to learn prompts; in addition, the adapted CLIP transfers well to unseen classes and datasets, therefore addressing a key limitation of LLM-based prompt ensembling techniques. We demonstrate the effectiveness of ProText by performing extensive evaluations on 4 benchmarks. On the challenging cross-dataset transfer benchmark, ProText improves over CLIP and CuPL while surpassing the performance of the previous best image-supervised prompt learning method MaPLe [20] by +0.93% (Fig. 1). Further, ProText with text-only supervision performs competitively against prior methods in the domain generalization, base-to-novel class, and text-only supervised settings. Our main contributions are summarized as follows:
• We present a new approach for prompt learning in CLIP using text-only supervision. Our method harmonically combines the strengths of prompt learning and prompt ensembling methods to improve CLIP's generalization.
• To optimize prompts with text-only data, we develop a training approach that allows prompts to learn a mapping by extracting rich contextual information from LLM data.
• As LLM contextual knowledge is mapped within the learned prompts, the prompts can be directly used with new classes and datasets, potentially cutting the additional LLM serving and prompt engineering cost.
• We validate the effectiveness of our method through extensive experiments across four benchmarks. Our ProText approach improves the generalization of CLIP across various settings and fares competitively against approaches that explicitly use labeled image samples for training.

2. Related Work

Foundational Vision-Language Models (VLMs). VLMs [18, 33, 37, 46–48] leverage joint image-text pretraining using internet-scale data in a self-supervised fashion. Representative VLMs like CLIP [37] and ALIGN [18] have utilized around 400M and 1B image-text pairs during their pre-training. Using the contrastive learning objective, VLMs learn rich multi-modal features by attracting together the features of paired images and texts while repelling un-paired image-text features in a joint feature space. The resulting model learns open-vocabulary concepts interpretable through natural language, suitable for various downstream discriminative vision tasks such as open-vocabulary image classification [6, 20, 27, 31, 32, 50], detection [3, 10, 26, 30, 51], and segmentation [13, 24, 25]. Although promising, adapting VLMs effectively while maintaining their original generalization remains a crucial challenge. In this work, we propose a novel method to adapt CLIP with prompt learning through text-modality supervision to improve its performance on vision-modality tasks.

Prompt Learning for VLMs. Prompt learning [6, 9, 27, 40, 41, 49, 50] has emerged as an effective fine-tuning strategy to adapt large-scale models. This approach adds a small number of learnable embeddings along with the model inputs, which are optimized during training while the rest of the model is kept frozen. As the pre-trained model is unchanged during prompt learning, it has become particularly effective for VLMs such as CLIP, where maintaining the model's original generalizability is crucial. CoOp [50] is the pioneering prompt learning method for CLIP, which learns text prompt embeddings to fine-tune CLIP. CoCoOp [49] improves CoOp's generalization by conditioning text prompts on visual features. MaPLe [20] proposes a multi-modal prompting framework to adapt both the vision and language branches of CLIP. UPL [17] adopts an unsupervised prompt learning approach to finetune CLIP. PromptSRC [21] improves prompt learning from a regularization perspective by making use of additional loss functions during training. While these methods improve baseline CLIP performance, most of them require image samples with labels, which is less practical, and generating pseudo-labels is often less effective. In contrast, we present a novel prompt learning approach that improves CLIP generalization without relying on any visual samples during training.

Training-Free Text Prompt Enhancement. With the emergence of LLMs such as GPT-3 [5], several approaches [29, 36, 39] have demonstrated their potential for improving the zero-shot generalization of CLIP. Instead of using hand-crafted templates for generating class features, these methods leverage LLMs to generate high-level concepts, class descriptions, and/or attributes which are used in one form or another to produce enriched text features. DCLIP [29] generates fine-grained per-class language descriptors and ensembles their similarity with the image to produce classification scores. WaffleCLIP [39] matches DCLIP performance with random descriptors and shows further gains with data-specific concepts generated via LLMs. CuPL [36] queries LLMs to generate class-specific prompt descriptions for text prompt ensembling. Although effective, most of these approaches generate class-specific text data from LLMs which is not directly transferable to unseen classes and new datasets, since no training is performed. On the other hand, we aim to leverage the same LLM data via a novel text-only prompt learning technique which seamlessly allows the transfer of learned prompts toward unseen classes and new datasets.
3. Method

Given the language-interpretable nature of foundational VLMs such as CLIP [37], they are naturally suited for zero-shot recognition tasks. However, to achieve the full potential of CLIP's generalization for downstream tasks, adaptation still appears to be necessary. Numerous approaches have since been proposed to adapt the general knowledge of CLIP for user-specific downstream tasks. One line of methods adopts prompt learning [20, 27, 49, 50] to re-purpose CLIP features for downstream data. While effective, most of them require image samples with labels to learn the prompts, which is a hard requirement to meet. Another line of methods adopts training-free prompt ensembling techniques [29, 36, 39] with the help of LLMs. Although ensembling-based approaches do not require image information, the majority of these works generate class-specific LLM prompts that are not directly transferable to new classes and datasets.

To this end, we present a new paradigm for learning generalized transferable prompts for VLMs using text-only supervision. Our proposed adaptation framework, ProText: Prompt Learning with Text-only supervision, aims to address the challenges of existing approaches by learning transferable prompts without relying on images. Fig. 2 shows our ProText framework. First, we curate text-only LLM template data using the class names of a given dataset and an LLM such as GPT-3 [5]. As a text-supervised approach, ProText only requires CLIP text encoders during training. Specifically, we employ one frozen encoder with learnable prompts and a second frozen encoder without learnable prompts. Learnable prompts with class-name templates are input to the prompted text encoder to obtain the class-name template feature, while the frozen text encoder generates the LLM template feature from the class description obtained from LLM data. Next, we employ a contextual mapping training objective which maps the class-name template feature to the LLM template feature. Contextual mapping allows the prompts to learn a mapping function that embeds rich contextual knowledge from LLM data within the prompt vectors. As prompts are learned in the embedding space, they are directly compatible with new classes and datasets. At inference, the learned prompts are shipped with the CLIP model for standard zero-shot CLIP inference for visual recognition. Below we explain our proposed approach in detail. We first revisit CLIP and previous methods, including Prompt Learning and Prompt Ensembling via LLMs, in Sec. 3.1, and then present our ProText approach in Sec. 3.2.
[Figure 2: ProText framework diagram. Example LLM query shown in the diagram: "How does a persian cat look like?" → GPT-3: "A persian cat is a large, long-haired cat with a broad face and round eyes."]
Figure 2. Overview of the ProText framework. (Left) First, diverse captions are generated for the training classes using an LLM like GPT-3. During training, the CLIP text encoders generate the prompted class-name feature (g̃p) from class-name templates with learnable prompts and the frozen LLM template feature (g̃) from LLM-generated templates. Next, we employ the contextual mapping loss to guide the learnable prompts to learn a mapping from the prompted class-name feature to the LLM template feature, which contains more information about the class. This allows the learned prompts to exploit the internal knowledge of the text encoder, complemented by LLM descriptions. (Right) At inference, the learned prompts are used with class-name templates, and the standard zero-shot CLIP inference protocol is followed. Moreover, the rich contextual information from LLM descriptions mapped within the learned prompts enables their transferability to new classes and datasets.
3.1. Preliminaries

Contrastive Language-Image Pre-training (CLIP). CLIP consists of an image encoder f and a text encoder g, which map image and text inputs into visual and textual features, respectively. We denote the CLIP parameters as θCLIP = {θf, θg}, where θf and θg refer to the image and text encoder parameters, respectively. An input image X is divided into M patches, which are linearly projected to produce patch tokens, and a learnable class token CLS is prepended, resulting in the final sequence X̃ = {CLS, e1, e2, · · ·, eM}. The image encoder f encodes the input patches via multiple transformer blocks to produce a latent visual feature representation f̃ = f(X̃, θf), where f̃ ∈ Rd. Next, the corresponding class label y is embedded in a text template, such as 'a photo of a [CLASS]', which can be formulated as Ỹ = {SOS, t1, t2, · · ·, tL, ck, EOS}. Here, {tl | l = 1, …, L} and ck are the word embeddings corresponding to the text template and the label y, respectively, while SOS and EOS are the learnable start and end token embeddings. The text encoder g encodes Ỹ via multiple transformer blocks to produce the latent text feature g̃ = g(Ỹ, θg), where g̃ ∈ Rd. For zero-shot inference, text features of the text template with class labels {1, 2, · · ·, C} are matched with the image feature f̃ as

$$\frac{\exp(\mathrm{sim}(\tilde{g}_y \cdot \tilde{f})/\tau)}{\sum_{i=1}^{C} \exp(\mathrm{sim}(\tilde{g}_i \cdot \tilde{f})/\tau)},$$

where sim(·) denotes the cosine similarity and τ is the temperature.
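To make this inference step concrete, here is a minimal PyTorch-style sketch of the zero-shot matching above (an illustrative rendering of the equation, not the authors' released code; the function name and temperature value are assumptions):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_probs(image_feat: torch.Tensor, text_feats: torch.Tensor,
                    tau: float = 0.01) -> torch.Tensor:
    """Temperature-scaled softmax over cosine similarities.

    image_feat: (d,)   visual feature f~ from the image encoder.
    text_feats: (C, d) text features g~_i, one per class template.
    Returns:    (C,)   class probabilities, as in the equation above.
    """
    image_feat = F.normalize(image_feat, dim=-1)   # unit norm: dot product == cosine sim
    text_feats = F.normalize(text_feats, dim=-1)
    sims = text_feats @ image_feat                 # (C,) cosine similarities sim(g~_i, f~)
    return torch.softmax(sims / tau, dim=-1)       # exp(sim/tau) / sum_j exp(sim_j/tau)
```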
Prompt Learning with CLIP. Being a parameter-efficient tuning method, prompt learning has emerged as a popular technique to adapt vision-language models like CLIP. Since most of the model is kept frozen during adaptation, prompt learning aims to reduce overfitting. Learnable prompts are appended either at the image side [2], the text encoder side [49, 50], or both sides. In this work, we learn hierarchical prompts at the text encoder, named Deep Language Prompting (DLP) [20], formulated as follows. T learnable language prompts Pt = {p1t, p2t, · · ·, pTt} are appended with the text input tokens, resulting in Ỹp = {SOS, Pt, t1, t2, · · ·, tL, ck, EOS}. The text encoder processes Ỹp, and the prompted text feature is obtained as g̃p = g(Ỹp, θg). We use deep prompting, which learns hierarchical prompts at subsequent transformer blocks of the text encoder. The visual feature f̃ is obtained without utilizing learnable prompts. To adapt CLIP to an image classification task on dataset D, the prompts Pt are optimized in a supervised fashion using labeled image samples with the cross-entropy loss LCE:

$$\mathcal{L}_{CE} = \arg\min_{P_t} \ \mathbb{E}_{(X,y)\sim \mathcal{D}} \ \mathcal{L}\big(\mathrm{sim}(\tilde{f}, \tilde{g}_p), y\big). \tag{1}$$
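The prompt-insertion step in this formulation can be sketched as follows (a minimal illustration of shallow language prompting under the notation above; the module and argument names are ours, and a deep-prompting variant would repeat the insertion at later transformer blocks):

```python
import torch
import torch.nn as nn

class LanguagePrompts(nn.Module):
    """Insert T learnable prompt vectors P_t after the SOS token.

    Produces Y~_p = {SOS, P_t, t_1..t_L, c_k, EOS} from the embedded
    template tokens; the rest of the CLIP text encoder stays frozen.
    """
    def __init__(self, num_prompts: int = 4, embed_dim: int = 512):
        super().__init__()
        # learnable prompt embeddings (the only trainable parameters)
        self.prompts = nn.Parameter(0.02 * torch.randn(num_prompts, embed_dim))

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (B, L, d) embeddings of [SOS, t_1..t_L, c_k, EOS]
        B = token_embeds.shape[0]
        p = self.prompts.unsqueeze(0).expand(B, -1, -1)          # (B, T, d)
        return torch.cat([token_embeds[:, :1], p, token_embeds[:, 1:]], dim=1)
```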
Prompt Ensembling with LLM descriptions. Several methods have recently been proposed to adapt CLIP via training-free prompt ensembling techniques. The majority of these approaches leverage the capabilities of LLMs to mine rich descriptions, attributes, or high-level concepts of class names. The corresponding text features are either averaged [36], or the similarity score of each attribute with the image is calculated to obtain classification scores [29, 39]. In this work, we focus our comparison on a strong ensembling baseline, CuPL [36]. Specifically, a Large Language Model F such as GPT-3 [5] is used to generate class-specific descriptions for class labels {1, 2, · · ·, C} using queries such as 'How does a CLASS look like?'. Text features of the same class description are averaged together, which serves as the ensembled text feature. Finally, zero-shot inference is performed with those ensembled text features.
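A minimal sketch of this ensembling recipe (illustrative only; `text_encoder` and `tokenizer` stand in for CLIP's text tower and tokenizer, and are assumptions rather than a specific library API):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensembled_class_features(text_encoder, tokenizer, class_descriptions):
    """class_descriptions: {class_name: [LLM description strings]}.

    Encodes every description of a class and averages the features into
    one ensembled text feature per class, giving a (C, d) matrix that is
    then used for standard zero-shot inference.
    """
    feats = []
    for name, descriptions in class_descriptions.items():
        g = text_encoder(tokenizer(descriptions))          # (M, d), one per description
        g = F.normalize(g, dim=-1)
        feats.append(F.normalize(g.mean(dim=0), dim=-1))   # average, then re-normalize
    return torch.stack(feats)                              # (C, d)
```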
3.2. Prompt Learning with Text-Only Supervision

While image-supervised prompt learning and LLM-based prompt ensembling methods have proven effective in adapting CLIP, they face notable challenges, as outlined below.
Visual data dependency. Existing prompt learning methods require visual samples with labels to optimize prompts using Eq. 1. However, collecting samples and labels is difficult in critical scenarios like medical images, remote sensing, and surveillance. Pseudo-labels alleviate the label dependency, but they are often less effective. Furthermore, these methods tend to overfit CLIP to source data distributions and compromise generalization across cross-datasets. For instance, CoOp utilizing labeled source samples reduces average CLIP performance by 1.27% on 10 cross-datasets.
LLM prompts transferability limitation. LLM-based prompt ensembling approaches like CuPL [36] generate class-specific LLM descriptions that cannot be directly transferred to unseen classes and datasets. While open-source LLMs exhibit lower performance, proprietary ones such as GPT-3 are required for generating data for new classes and datasets, leading to additional serving costs.

Our work aims to address the aforementioned limitations within a unified framework. Below we detail our strategy for curating text-to-text data via LLMs for training, followed by our text-only prompt learning framework.

3.2.1 Text-Only LLM data for Prompt Learning

As discussed in Sec. 3.1, optimizing prompts for downstream datasets typically requires image-label pairs. Since we explicitly aim to bypass this requirement, we first leverage LLMs to curate text data for prompt learning, which consists of text inputs and text outputs. Given a set of classes {ci} for i = 1, …, C, we prepare text inputs {L^i_inputs} by wrapping each class name in a standard hand-written text template,

$$L^{i}_{\text{inputs}} = \text{`a photo of a } c_i\text{'}.$$

Next, we prepare text outputs corresponding to the Linputs. Specifically, we query the GPT-3 model to generate detailed descriptions for each class name ci. Similar to CuPL [36], we prompt GPT-3 with different queries Q conditioned on the class names, such as 'How does a ci look like?' and 'How can you identify a ci?', to obtain the text outputs,

$$L^{i}_{\text{outputs}} = \mathcal{F}(Q \mid c_i).$$

Similar to [36], we generate M text outputs per query Q and use N different queries, resulting in M × N text outputs per class category. We associate all Loutputs with the corresponding single Linputs for each class ci. As LLMs are pre-trained on internet-scale text corpora, they possess the capability of generating very diverse and high-quality descriptions and captions for different class categories, which results in high-quality text outputs. Finally, we combine Linputs and Loutputs to create LLM-based text-to-text data for text-only prompt learning, DPROMPT = {L^i_inputs, L^i_outputs} for i = 1, …, M × N × C. We refer the readers to the supplementary material for additional details on the choice of LLM prompts and examples of DPROMPT.
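The curation procedure can be summarized with the following sketch (a minimal illustration assuming a hypothetical `query_llm` helper that returns M completions per query; this is not the paper's released code):

```python
def build_dprompt(class_names, queries, query_llm, m_per_query=10):
    """Return D_PROMPT as a list of (L_inputs, L_outputs) text pairs.

    Each class name c_i is wrapped in the hand-written template
    'a photo of a c_i' (L_inputs) and paired with every one of the
    M x N LLM-generated descriptions of c_i (L_outputs).
    """
    pairs = []
    for name in class_names:
        l_input = f"a photo of a {name}"
        for q in queries:                                          # N queries per class
            for desc in query_llm(q.format(name), n=m_per_query):  # M outputs per query
                pairs.append((l_input, desc))                      # one text-to-text pair
    return pairs

# Example with the query styles used in the paper:
# pairs = build_dprompt(
#     ["persian cat"],
#     ["How does a {} look like?", "How can you identify a {}?"],
#     query_llm)
```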
3.2.2 Contextual mapping with Prompt Learning

To leverage the LLM text-to-text data DPROMPT for learning generalized transferable prompts, we propose a contextual mapping strategy that effectively learns a mapping function from standard class-name templates, such as 'a photo of a ci', to the text feature generated from an LLM description, which contains more information about the class ci. In other words, contextual mapping allows learnable prompts to map Linputs to Loutputs in the text feature space of CLIP. The mapping function is realized in the form of learnable prompt vectors, which we found to be more effective in our ablations than other techniques such as adapters via linear projection and MLP.

For the ith training sample from DPROMPT, consisting of a text-to-text pair {Linputs, Loutputs}i, we obtain the prompted class-name feature g̃p for L^i_inputs using learnable prompts, and the frozen LLM feature g̃ for L^i_outputs without the prompt vectors, within the pre-trained latent space of CLIP. We then impose a contextual mapping constraint between the g̃p and g̃ text features as follows:

$$\mathcal{L}_{\text{mapping}} = \frac{1}{d} \sum_{i=1}^{d} \left\| \tilde{g}_p - \tilde{g} \right\|_2^2. \tag{2}$$

As shown above, we utilize an MSE loss objective to enforce contextual mapping from L^i_inputs to L^i_outputs. We study other choices of consistency objectives in our ablations (Sec. 4.7).
Motivation for Lmapping. The contextual mapping objective allows learnable prompts to exploit the internal knowledge of the text encoder of CLIP to generate rich contextual features aligned with the LLM descriptions (L^i_outputs) for a given class. This strategy effectively learns prompts without using any visual information, and when trained using all training classes together, it enables prompts to capture versatile and generalized context from the LLM descriptions. These context-aware prompts become adaptable for use with any dataset and effectively enable the transferability of class-specific LLM descriptions to unseen classes and datasets. Consequently, this substantially reduces the per-dataset overhead associated with LLM serving and prompt engineering.
Inference. Once the text prompt vectors are optimized through our ProText framework in the text domain, they are ready to be shipped with CLIP for downstream visual-domain inference with the standard zero-shot CLIP inference setup. As shown in Fig. 2 (right), the learned prompts Pt are fused with each given class name to produce the prompted text features {g̃p} for classes i = 1, …, C. Finally, zero-shot inference is performed with the prompted text features and the input image feature f̃ to produce classification scores on test images.
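Operationally, one optimization step under this objective might look as follows (an illustrative PyTorch-style sketch assuming encoder and tokenizer callables; not the released implementation):

```python
import torch
import torch.nn.functional as F

def contextual_mapping_step(prompted_encoder, frozen_encoder, tokenizer,
                            l_inputs, l_outputs, optimizer):
    """One optimization step of the contextual mapping objective (Eq. 2).

    prompted_encoder: frozen CLIP text encoder with learnable prompts attached.
    frozen_encoder:   plain frozen CLIP text encoder (no prompts).
    l_inputs / l_outputs: a batch of class-name templates / LLM descriptions.
    """
    g_p = prompted_encoder(tokenizer(l_inputs))    # prompted class-name features g~_p
    with torch.no_grad():
        g = frozen_encoder(tokenizer(l_outputs))   # frozen LLM template features g~
    loss = F.mse_loss(g_p, g)                      # mean of (1/d) * ||g~_p - g~||_2^2
    optimizer.zero_grad()
    loss.backward()                                # gradients reach only the prompt vectors
    optimizer.step()
    return loss.item()
```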
4. Experiments

4.1. Evaluation settings

We perform evaluations in 4 benchmark settings. Prompt ensembling methods and ProText utilize text-only LLM data for adapting CLIP, while image-supervised prompt learning methods use image-label pairs for training.
Base-to-Novel Generalization. This setting evaluates the generalization of methods within a dataset. Following previous methods [49, 50], we split each dataset into base and novel classes. Models are trained on the base classes and evaluated on the test sets of the base and novel classes, respectively.
Cross-dataset transfer. This setting evaluates the generalization ability of models trained on the ImageNet-1k [8] source dataset by directly transferring them to cross-datasets.
Domain Generalization. We evaluate the robustness of different methods on out-of-distribution datasets. We train models on the ImageNet-1k source dataset and evaluate their performance on four ImageNet variants with domain shifts.
Text-only supervised setting. We provide a performance comparison of ProText with CuPL [36] using text-only data per dataset.
Datasets. For the aforementioned benchmarks, we use the same datasets as previous works [20, 21, 49, 50]. For the cross-dataset transfer, domain generalization, and base-to-novel generalization settings, we use 11 image datasets that cover multiple recognition tasks. These include ImageNet [8] and Caltech101 [11], which contain generic objects; OxfordPets [35], StanfordCars [22], Flowers102 [34], Food101 [4], and FGVCAircraft [28] for fine-grained classification; SUN397 [45] for scene recognition; UCF101 [42] for action recognition; DTD [7] for texture classification; and EuroSAT [14] for satellite image categorization. For the domain generalization setting, we train models on ImageNet [8] as a source dataset and use ImageNet-A [16], ImageNet-R [15], ImageNet-Sketch [44], and ImageNetV2 [38] for out-of-distribution evaluation.
Implementation details. We use a publicly available pre-trained ViT-B/16 CLIP model from OpenAI [37]. We train ProText with Deep Language Prompting in the first 9 transformer blocks of the CLIP text encoder. For the cross-dataset transfer and domain generalization settings, we train ProText using T = 4 and T = 16 language prompts with 10 and 200 epochs, respectively. Similar to [44], ProText and zero-shot CLIP use additional concepts where available with their prompts, such as 'a photo of a CLS, a type of flower' for OxfordFlowers [34]. For the base-to-novel and supervised settings, ProText uses the optimal prompt length and the same epoch configuration for each dataset. The optimal training configuration is obtained through a hyper-parameter search on the validation split of each dataset. To generate text-only data, we utilize the GPT-3 DaVinci-002 model [5] and generate class-specific descriptions using the LLM prompts provided by CuPL [36]. We use publicly available CuPL data and generate descriptions for datasets not provided by CuPL. The AdamW optimizer is used with 5 warm-up epochs for training. We use a single 16-GB V100 GPU to train our models. Refer to the supplementary material for additional implementation details.

4.2. Effectiveness of Text-Only Supervision

We first present an ablation to motivate our approach of learning prompts with text-only supervision. We train ProText with 3 types of text data and evaluate performance on ImageNet-1k [8]. ProText-Attribute uses 46 templates from [1] which correspond to common image attributes such as rotation, blurriness, etc. ProText-80 is trained on the standard 80 templates provided by CLIP [37], and ProText-CuPL is trained on the class-specific LLM data employed by our main baseline CuPL [36] for its ensembling approach.

Method | ImageNet Acc.
1: CLIP (ICML'21) | 66.72
2: CLIP-Attribute | 67.60
3: CLIP-80 | 68.32
4: DCLIP (ICLR'23) | 68.03
5: WaffleCLIP (ICCV'23) | 68.34
6: CuPL (ICCV'23) | 69.62
7: ProText-Attribute | 68.05
8: ProText-80 | 68.48
9: ProText-CuPL | 70.22

Table 2. With the text-only text data, learning contextual prompts with text-only supervision improves CLIP performance in comparison to the prompt ensembling techniques.

In Tab. 2, we compare ProText with CLIP and recent LLM-based ensembling methods. Prompt ensembling with attribute templates and 80 templates improves over the single-template CLIP result. Among the LLM-based ensembling methods, CuPL provides the highest performance of 69.62%. In contrast, ProText uses a learning-based approach and shows competitive performance against prompt ensembling methods using the same text data. ProText-Attribute provides a gain of 0.45% over CLIP-Attribute while roughly maintaining its performance against CLIP-80. When equipped with the CuPL LLM text data, ProText surpasses CuPL by 0.60%, leading to the highest performance among all methods. These results motivate our approach: instead of prompt ensembling, one can achieve competitive results by utilizing the same available text data to learn prompts. Next, we demonstrate the generalization of ProText, such that the learned prompts transfer well across new classes and datasets.
4.3. Base to novel class generalization

We now present results in the base-to-novel class generalization setting, where training data for only the base classes are available and the model is evaluated on both base and novel classes. For CuPL [36], we use base-class LLM templates for the base classes and zero-shot CLIP results for its novel classes. For ProText, we use base-class LLM templates for training and transfer the learned prompts to the novel classes.

Dataset | CLIP [37] (Base / Novel / HM) | CuPL [36] (Base / Novel / HM) | ProText, Ours (Base / Novel / HM)
ImageNet | 72.43 / 68.14 / 70.22 | 74.30 / 68.14 / 71.09 | 75.00 / 71.38 / 73.14
Caltech101 | 96.84 / 94.00 / 95.40 | 97.22 / 94.00 / 95.58 | 98.06 / 95.63 / 96.83
OxfordPets | 91.17 / 97.26 / 94.12 | 94.42 / 97.26 / 95.82 | 94.95 / 98.00 / 96.45
StanfordCars | 63.37 / 74.89 / 68.65 | 63.54 / 74.89 / 68.75 | 64.54 / 76.08 / 69.84
Flowers102 | 72.08 / 77.80 / 74.83 | 74.36 / 77.80 / 76.04 | 74.36 / 78.44 / 76.35
Food101 | 90.10 / 91.22 / 90.66 | 89.93 / 91.22 / 90.57 | 90.20 / 91.98 / 91.08
Aircraft | 27.19 / 36.29 / 31.09 | 30.61 / 36.29 / 33.21 | 30.91 / 34.13 / 32.44
SUN397 | 69.36 / 75.35 / 72.23 | 76.02 / 75.35 / 75.68 | 76.14 / 79.14 / 77.61
DTD | 53.24 / 59.90 / 56.37 | 62.85 / 59.90 / 61.34 | 63.08 / 61.59 / 62.33
EuroSAT | 56.48 / 64.05 / 60.03 | 59.64 / 64.05 / 61.77 | 59.71 / 80.97 / 68.73
UCF101 | 70.53 / 77.50 / 73.85 | 75.28 / 77.50 / 76.37 | 75.54 / 79.50 / 77.47
Average | 69.34 / 74.22 / 71.70 | 72.56 / 74.22 / 73.38 | 72.95 / 76.98 / 74.91

Table 3. Base-to-novel setting. ProText enables the transferability of learned prompts to new classes and improves over CuPL [36].

Results are shown in Tab. 3. CuPL outperforms zero-shot CLIP on base classes while maintaining its performance on novel classes, as LLM prompts for new classes are not available. ProText shows consistent improvements over CuPL on base classes across the 11 datasets.
Method | ImageNet (source) | Caltech101 | OxfordPets | StanfordCars | Flowers102 | Food101 | Aircraft | SUN397 | DTD | EuroSAT | UCF101 | Average (target)
Methods utilizing labeled visual samples:
CoOp | 71.51 | 93.70 | 89.14 | 64.51 | 68.71 | 85.30 | 18.47 | 64.15 | 41.92 | 46.39 | 66.55 | 63.88
Co-CoOp | 71.02 | 94.43 | 90.14 | 65.32 | 71.88 | 86.06 | 22.94 | 67.36 | 45.73 | 45.37 | 68.21 | 65.74
MaPLe | 70.72 | 93.53 | 90.49 | 65.57 | 72.23 | 86.20 | 24.74 | 67.01 | 46.49 | 48.06 | 68.69 | 66.30
PromptSRC | 71.27 | 93.60 | 90.25 | 65.70 | 70.25 | 86.15 | 23.90 | 67.10 | 46.87 | 45.50 | 68.75 | 65.81
Zero-shot & prompt ensembling methods:
CLIP | 66.72 | 92.98 | 89.13 | 65.29 | 71.30 | 86.11 | 24.90 | 62.59 | 44.56 | 47.84 | 66.83 | 65.15
CuPL | 69.62 | 92.98 | 89.13 | 65.29 | 71.30 | 86.11 | 24.90 | 62.59 | 44.56 | 47.84 | 66.83 | 65.15
Prompt learning with text-only supervision:
ProText (Ours) | 69.80 | 94.81 | 91.01 | 66.00 | 72.35 | 86.66 | 24.72 | 67.34 | 47.93 | 51.86 | 69.60 | 67.23

Table 4. Cross-dataset transfer setting. CuPL and CLIP perform the same on cross-datasets, as CuPL source data cannot transfer to cross-datasets. Image-based models are trained on 16-shot ImageNet samples. ProText employs the same ImageNet data as CuPL for prompt learning.
Image-supervised prompt learning methods [20, 21, 49, 50] are trained with 16-shot ImageNet data. We show our main comparison results in Tab. 4. CuPL improves the ImageNet performance of CLIP by ensembling ImageNet LLM prompts, while its cross-dataset results remain the same as CLIP. In contrast, ProText effectively addresses the transferability challenges of CuPL using generalized prompts trained with the same ImageNet LLM data. Since ProText allows generalization to unseen datasets, these learned prompts can directly be used with CLIP on cross-datasets, leading to absolute average gains of +2.1% against CLIP and CuPL. With ProText, one can notably reduce proprietary LLM serving and prompt engineering costs, as prompts learned on one dataset are effectively transferable to other datasets. We next compare ProText with strong 16-shot image-supervised methods. Without using any visual samples, ProText demonstrates effective generalization on cross-datasets and consistently surpasses the previous state-of-the-art MaPLe on 9/10 datasets, leading to the highest average accuracy of 67.23%. This highlights that text-only methods like ProText can lead to better generalization of CLIP compared to image-supervised methods, which tend to overfit on the source sample distributions.

Method | ImageNet (source) | -V2 | -S | -A | -R | Avg. (target)
Methods utilizing labeled visual samples:
CoOp | 71.51 | 64.20 | 47.99 | 49.71 | 75.21 | 59.28
CoCoOp | 71.02 | 64.07 | 48.75 | 50.63 | 76.18 | 59.91
MaPLe | 70.72 | 64.07 | 49.15 | 50.90 | 76.98 | 60.27
Zero-shot & prompt ensembling methods:
CLIP | 66.72 | 60.83 | 46.15 | 47.77 | 73.96 | 57.18
CuPL | 69.62 | 63.27 | 49.02 | 50.72 | 77.05 | 60.01
Prompt learning with text-only supervision:
ProText (Ours) | 70.22 | 63.54 | 49.45 | 51.47 | 77.35 | 60.45

Table 5. Domain generalization. Prompt learning methods are trained on ImageNet and evaluated on datasets with domain shifts.
Dataset | CLIP | CuPL | ProText | ∆
ImageNet | 66.72 | 69.60 | 70.22 | +0.62
Caltech101 | 92.98 | 94.32 | 95.29 | +0.97
DTD | 44.56 | 53.96 | 54.02 | +0.06
EuroSAT | 47.84 | 60.27 | 58.53 | -1.74
StanfordCars | 65.29 | 65.95 | 66.77 | +0.82
Flowers102 | 71.30 | 73.85 | 74.42 | +0.57
Aircraft | 24.90 | 27.66 | 29.01 | +1.35
SUN397 | 62.59 | 69.00 | 69.76 | +0.76
OxfordPets | 89.13 | 91.11 | 92.72 | +1.61
UCF101 | 66.83 | 70.63 | 71.45 | +0.82
Food101 | 86.11 | 86.11 | 86.68 | +0.57
Average | 65.15 | 69.31 | 69.90 | +0.59

Table 6. ProText results with text supervision on each dataset. We compare ProText with CLIP and CuPL. Gains of ProText over CuPL are shown in the ∆ column.
Method | ImageNet (source) | Caltech101 | OxfordPets | StanfordCars | Flowers102 | Food101 | Aircraft | SUN397 | DTD | EuroSAT | UCF101 | Average (target)
Zero-shot & prompt ensembling methods:
CLIP | 66.72 | 92.98 | 89.13 | 65.29 | 71.30 | 86.11 | 24.90 | 62.59 | 44.56 | 47.84 | 66.83 | 65.15
CuPL | 69.62 | 92.98 | 89.13 | 65.29 | 71.30 | 86.11 | 24.90 | 62.59 | 44.56 | 47.84 | 66.83 | 65.15
WaffleCLIP-Concept | 68.34 | 94.01 | 89.57 | 63.42 | 72.00 | 86.84 | 24.49 | 66.17 | 45.15 | 47.74 | 67.96 | 65.74
Prompt learning with text-only supervision:
ProText (Ours) | 69.80 | 94.81 | 91.01 | 66.00 | 72.35 | 86.66 | 24.72 | 67.34 | 47.93 | 51.86 | 69.60 | 67.23

Table 12. Cross-dataset transfer setting. Results comparison of ProText with CLIP, CuPL, and WaffleCLIP. ProText overall shows consistent improvements over LLM-based prompt ensembling methods.
Dataset | CLIP | CuPL | WaffleCLIP-C | ProText | ∆
ImageNet | 66.72 | 69.60 | 68.34 | 70.22 | +0.62
Caltech101 | 92.98 | 94.32 | 94.01 | 95.29 | +0.97
DTD | 44.56 | 53.96 | 45.15 | 54.04 | +0.06
EuroSAT | 47.84 | 60.27 | 47.74 | 58.53 | -1.74
StanfordCars | 65.29 | 65.95 | 63.42 | 66.77 | +0.82
Flowers102 | 71.30 | 73.85 | 72.00 | 74.42 | +0.57
Aircraft | 24.90 | 27.66 | 24.49 | 29.01 | +1.35
SUN397 | 62.59 | 69.00 | 66.17 | 69.76 | +0.76
OxfordPets | 89.13 | 91.11 | 89.57 | 92.72 | +1.61
UCF101 | 66.83 | 70.63 | 67.96 | 71.45 | +0.82
Food101 | 86.11 | 86.11 | 86.84 | 86.68 | +0.57
Average | 65.15 | 69.31 | 65.97 | 69.90 | +0.59

Table 13. ProText results with text supervision on each dataset. We compare ProText with CLIP, CuPL, and WaffleCLIP-Concept. Gains of ProText over CuPL are shown in the ∆ column.

Text-only supervised setting. For our comparison with CuPL [36] in Table 15, ProText models are trained using the same LLM text data as utilized by CuPL. Hyperparameter values are shown in Table 18. All models are trained using a learning rate of 0.03, except for UCF101, EuroSAT, and Oxford-Flowers, where a learning rate of 0.0025 is used.

C. Details on Text-Only Data

As discussed in Sec. 3.2.1, our ProText approach relies on text-only data (DPROMPT) curated from Large Language Models (LLMs) for training its language prompts. Here, we provide additional details on the curation of the text-only data. Specifically, we first provide information on the text queries used as input to LLMs for generating prompts, followed by qualitative examples of DPROMPT.
C.1. Queries to LLMs to curate Text-Only Data

Following [36], we obtain class descriptions from LLMs by providing various queries as inputs. Specifically, we utilize the queries termed Full prompts by CuPL [36]. For instance, to generate class descriptions for the ImageNet-1k classes, we prompt GPT-3 with the following 5 queries:
• 'Describe what a(n) CLS looks like.'
• 'How can you identify a(n) CLS?'
• 'What does a(n) CLS look like?'
• 'Describe an image from the internet of a(n) CLS.'
• 'A caption of an image of a(n) CLS.'
Here, CLS denotes the class names present in the dataset. After generating the LLM class descriptions, we associate all descriptions of the same class with its class-name template, given as 'A photo of a CLS'. This results in our text-only training data DPROMPT with text-to-text mapping pairs used to train ProText. Refer to [36] for the LLM queries of the other datasets used to generate class-specific descriptions. For standardized comparisons, we use publicly available CuPL data and generate descriptions for datasets not provided by CuPL.
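For illustration, description generation with these queries might look as follows (a sketch using the legacy, pre-1.0 OpenAI completions interface that was current for GPT-3 DaVinci models; the sampling parameters here are assumptions, not CuPL's exact settings):

```python
import openai  # legacy (pre-1.0) OpenAI SDK interface

QUERIES = [
    "Describe what a(n) {} looks like.",
    "How can you identify a(n) {}?",
    "What does a(n) {} look like?",
    "Describe an image from the internet of a(n) {}.",
    "A caption of an image of a(n) {}.",
]

def describe_class(class_name: str, n_per_query: int = 10) -> list[str]:
    """Collect diverse class descriptions via the 5 queries above."""
    descriptions = []
    for query in QUERIES:
        response = openai.Completion.create(
            model="text-davinci-002",
            prompt=query.format(class_name),
            max_tokens=50,        # short, caption-like outputs
            temperature=0.9,      # high temperature for diversity (assumed value)
            n=n_per_query,        # M completions per query
        )
        descriptions += [c.text.strip() for c in response.choices]
    return descriptions
```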
Dataset | CuPL [36] | ProText (Ours) | CoOp [50] | CoCoOp [49] | MaPLe [20] | PromptSRC [21] | ∆
ImageNet | 68.14 | 71.38 | 67.88 | 70.43 | 70.54 | 70.73 | +3.2
Caltech101 | 94.00 | 95.63 | 89.81 | 93.81 | 94.36 | 94.03 | +1.6
DTD | 59.90 | 61.59 | 41.18 | 56.00 | 59.18 | 62.97 | +1.7
EuroSAT | 64.05 | 80.97 | 54.74 | 60.04 | 73.23 | 73.90 | +17
StanfordCars | 74.89 | 76.08 | 60.40 | 73.59 | 74.00 | 74.97 | +1.2
Flowers102 | 77.80 | 78.44 | 59.67 | 71.75 | 72.46 | 76.50 | +0.6
Aircraft | 36.29 | 34.13 | 22.30 | 23.71 | 35.61 | 37.87 | -2.2
SUN397 | 75.35 | 79.14 | 65.89 | 76.86 | 78.70 | 78.47 | +3.8
OxfordPets | 97.26 | 98.00 | 95.29 | 97.69 | 97.76 | 97.30 | +0.7
UCF101 | 77.50 | 79.50 | 56.05 | 73.45 | 78.66 | 78.80 | +2.0
Food101 | 91.22 | 91.98 | 82.26 | 91.29 | 92.05 | 91.53 | +0.8
Average | 74.22 | 76.98 | 63.22 | 71.69 | 75.14 | 76.10 | +2.8

Table 14. Novel-class generalization comparison. We compare ProText with prompt ensembling and image-supervised methods on unseen-class performance in the base-to-novel class generalization setting. Gains of ProText over CuPL are shown in the ∆ column.

Dataset | CLIP | CuPL | ProText | Linear Probe K=1 | K=2 | CoOp K=1 | K=2 | CoCoOp K=1 | K=2
ImageNet | 66.70 | 69.62 | 70.22 | 32.13 | 44.88 | 66.33 | 67.07 | 69.43 | 69.78
Caltech101 | 92.98 | 94.32 | 95.29 | 79.88 | 89.01 | 92.60 | 93.07 | 93.83 | 94.82
DTD | 44.56 | 53.96 | 54.04 | 34.59 | 40.76 | 50.23 | 53.60 | 48.54 | 52.17
EuroSAT | 47.84 | 60.27 | 58.53 | 49.23 | 61.98 | 54.93 | 65.17 | 55.33 | 46.74
StanfordCars | 65.29 | 65.95 | 66.77 | 35.66 | 50.28 | 67.43 | 70.50 | 67.22 | 68.37
Flowers102 | 71.30 | 73.85 | 74.42 | 69.74 | 85.07 | 77.53 | 87.33 | 72.08 | 75.79
Aircraft | 24.90 | 27.66 | 29.01 | 19.61 | 26.41 | 21.37 | 26.20 | 12.68 | 15.06
SUN397 | 62.59 | 69.00 | 69.76 | 41.58 | 53.70 | 66.77 | 66.53 | 68.33 | 69.03
OxfordPets | 89.13 | 91.11 | 92.72 | 44.06 | 58.37 | 90.37 | 89.80 | 91.27 | 92.64
UCF101 | 66.83 | 70.63 | 71.45 | 53.66 | 65.78 | 71.23 | 73.43 | 70.30 | 73.51
Food101 | 86.11 | 86.11 | 86.68 | 43.96 | 61.51 | 84.33 | 84.40 | 85.65 | 86.22
Average | 65.15 | 69.31 | 69.90 | 45.83 | 57.98 | 67.56 | 70.65 | 66.79 | 67.65

Table 15. ProText results with text supervision on each dataset. We compare ProText with CLIP [37], CuPL [36], and the image-supervised Linear Probe [37], CoOp [50], and CoCoOp [49] methods (K denotes the number of labeled shots per class).
Method | ImageNet Top-1
1: CuPL | 69.62
2: ProText | 70.22
3: Ensembling: ProText + CuPL | 70.28

Table 16. Ablation on combining CuPL and ProText text features.

Example LLM descriptions from DPROMPT for the ImageNet class 'tench':
• 'A tench is a freshwater fish with a dark green back and light-colored sides.'
• 'A tench looks like a freshwater fish with a dark olive-green back, fading to yellowish-brown on the sides.'