
arXiv:2403.12964v1 [cs.CV] 19 Mar 2024

Negative Yields Positive: Unified Dual-Path
Adapter for Vision-Language Models

Ce Zhang Simon Stepputtis Katia Sycara Yaqi Xie

School of Computer Science, Carnegie Mellon University


{cezhang, sstepput, katia, yaqix}@cs.cmu.edu

Abstract. Recently, large-scale pre-trained Vision-Language Models (VLMs) have demonstrated great potential in learning open-world visual
representations, and exhibit remarkable performance across a wide range
of downstream tasks through efficient fine-tuning. In this work, we inno-
vatively introduce the concept of dual learning into fine-tuning VLMs,
i.e., we not only learn what an image is, but also what an image isn’t.
Building on this concept, we introduce a novel DualAdapter approach to
enable dual-path adaptation of VLMs from both positive and negative
perspectives with only limited annotated samples. In the inference stage,
our DualAdapter performs unified predictions by simultaneously conduct-
ing complementary positive selection and negative exclusion across target
classes, thereby enhancing the overall recognition accuracy of VLMs in
downstream tasks. Our extensive experimental results across 15 datasets
validate that the proposed DualAdapter outperforms existing state-of-the-
art methods on both few-shot learning and domain generalization tasks
while achieving competitive computational efficiency. Code is available
at https://github.com/zhangce01/DualAdapter.

Keywords: Vision-Language Models · CLIP · Few-Shot Classification · Domain Generalization · Transfer Learning

1 Introduction
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP [48] and
ALIGN [24], provide a new paradigm for multi-modal learning and open-world
visual recognition [8, 78]. Recently, researchers have demonstrated the robustness
and effectiveness of these VLMs in interpreting complex visual and textual inputs,
as evidenced by their superior performance in a variety of vision-language tasks,
e.g., image captioning [21, 60, 73], visual question answering [25, 31, 52], and other
domains [7, 54, 58, 85].
Current VLMs excel in zero-shot inference [15,86], but struggle in transferring
to specific downstream tasks, primarily due to domain shifts and mismatches
between pre-training and task-specific data [10, 42]. Additionally, the extensive
number of parameters and demanding computational costs make it intractable
for individual practitioners to re-train or fine-tune the entire models to enhance
their task-specific performance [12, 84]. To address this, recent studies have
focused on designing efficient fine-tuning strategies that maintain the core pre-
trained encoders frozen while incorporating lightweight adapter [12, 79] modules

Fig. 1: Dual-path inference examples. We show four sample images from diverse
recognition tasks, along with their similarities (scaled by 100) to positive/negative text
descriptions after few-shot dual-path adaptation. Intuitively, an image is more likely to
be classified into a specific class if it exhibits higher positive similarity and lower negative
similarity. We can observe that while the positive branch alone may struggle to distin-
guish among some closely related classes, incorporating the negative path, which elim-
inates certain incorrect classes, enhances its ability to accurately identify the true class.
or optimizing prompts [83, 84]. These methods enable effective task-specific
knowledge extraction from a limited amount of annotated training samples, on
top of the inherited foundational knowledge from pre-training.
In this work, we introduce the innovative concept of dual inference in the con-
text of fine-tuning VLMs. Take, for instance, a basic binary pet classification task:
distinguishing cats from dogs. To identify an image as a cat, we can approach the
task from both the positive and negative perspectives: one might directly identify
the image as a cat (selection), or alternatively, one can also first determine that it
is not a dog (exclusion), thereby indirectly concluding it’s a cat. These approaches,
namely, positive selection and negative exclusion, complement each other and col-
lectively enhance the overall classification accuracy in general classification tasks.
Building on this concept, we introduce a novel DualAdapter approach for few-
shot adaptation of VLMs that conducts positive selection and negative exclusion
across a set of target classes. While CLIP typically performs classification tasks us-
ing prompts such as “A photo of a {CLASS}”, we recognize its ability to understand
negation, employing prompts like “A photo of no {CLASS}”. Utilizing this capability,
we design our DualAdapter to adapt VLMs from a dual perspective: we propose
to learn not only what an image is but also what an image isn’t by matching
the input image feature with the positive and negative text features. Intuitively,
our approach first determines a set of likely candidate classes through image
classification but, in a second step, refines these classification logits by asking the
inverse question of what an image isn’t across the entire set of target classes. In
Fig. 1, we showcase several dual-path inference examples to illustrate that when
positive selection struggles with distinguishing between similar classes, employing
negative exclusion can provide further insights to improve recognition accuracy.
Moreover, in DualAdapter, we symmetrically apply dual-path adaptation to
the visual modality, treating image features of a specific class as positive and
those from all other classes as negative. Recognizing the variability in image
quality and representativeness, where some image features are noisy and not
equally representative of their respective classes, we implement an unsupervised
similarity-based label refinement technique. This method assigns confidence scores
to each image, reducing the impact of outliers or less representative samples
during few-shot adaptation, ensuring more accurate and reliable classification.
To empirically validate DualAdapter, we perform thorough evaluations on
few-shot learning and domain generalization tasks using 15 diverse recognition
datasets. These experimental results demonstrate the effectiveness of DualAdapter
in adapting VLMs for downstream tasks and verify its superior robustness to dis-
tribution shifts. Additionally, our efficiency benchmarks indicate that DualAdapter
achieves competitive efficiency compared to other state-of-the-art methods.
In summary, our primary contributions are listed as follows:
• We explore and exploit the negative inference capabilities of VLMs, and
introduce a unique dual-path adaptation approach (i.e., DualAdapter) for CLIP.
• To mitigate the impact of noisy samples in few-shot adaptation, we introduce
a novel unsupervised similarity-based label refinement technique.
• Through extensive experiments, we demonstrate that our DualAdapter out-
performs other state-of-the-art methods in both adaptation performance and
generalizability across 15 diverse recognition datasets.

2 Related Work

Vision-Language Models. In recent years, extensive Vision-Language Models (VLMs) have been developed to bridge the vision and language modalities
through large-scale pre-training [24, 48]. Generally, these pre-trained VLMs can
be classified into four categories based on their pre-training objectives: masked
language modeling [27, 30, 37], masked region prediction [30, 37, 59], image-text
matching [27, 30, 31, 61], and contrastive learning [24, 33, 48, 74, 77]. Researchers
have demonstrated that these pre-trained VLMs exhibit remarkable open-world
visual understanding capabilities [8], achieving state-of-the-art performance across
a range of open-world tasks, such as zero-shot learning [41,48,51], open-world seg-
mentation [34,49,72], and open-world detection [9,16,53]. In this work, we mainly
focus on contrastive-based VLMs, especially CLIP [48], which has become the
current mainstream approach in this field. Nonetheless, we believe that the core
concept of this work is adaptable and can be extended to other VLMs as well.
Efficient Fine-Tuning for VLMs. Given the substantial size of the VLMs,
recent research efforts [12,79,80,84] are focusing on the development of lightweight
fine-tuning techniques to efficiently adapt VLMs for downstream visual tasks.
These methods can generally be divided into two primary categories: prompt-
based learning and adapter-style fine-tuning. Specifically, prompt-based learning
methods [2, 62, 83, 84] aim to optimize the input prompts from downstream data,
while adapter-style fine-tuning approaches [12, 64, 75, 79, 88] directly tune the
extracted visual and textual representations. Our proposed DualAdapter follows
the line of adapter-style fine-tuning, but it distinguishes itself from previous
work [75, 79] by incorporating negative adapters for the first time. Moreover, we
empirically demonstrate that leveraging negative cues from CLIP can effectively
improve both few-shot classification performance and generalization capability.
Negative Learning. Given that obtaining typical positive labels (which indicate
the categories an image belongs to) can be costly and labor-intensive, researchers
have proposed an alternative approach known as the indirect negative learn-
ing [23, 76] paradigm. This method focuses on learning from easily accessible
negative/complementary labels, which specify the categories to which an image
does not belong. In recent years, negative learning has been effectively applied to
various vision applications, such as image recognition [13, 23, 28], few-shot learn-
ing [22, 69], and semantic segmentation [47, 67]. In this work, empowered by the
strong image-text association capabilities of the advanced pre-trained CLIP [48]
model, we enable negative learning within CLIP without the need for explicit
negative labels. Instead, we design negative prompts and negative image samples
to guide the CLIP model to predict from a negative perspective, which in turn
enhances its overall positive prediction capability. To the best of our knowledge,
there are only two concurrent approaches (i.e., CLIPN [66] and LSN [43]) that
work in a similar manner: they both enable the negative predictions of CLIP to
identify unknown out-of-distribution samples. However, we formulate negative
learning from a different and more fundamental perspective, i.e., using negative
prediction to directly improve the accuracy of positive prediction.
Contrastive Learning. We also recognize the underlying concept behind our
proposed DualAdapter has parallels with existing supervised contrastive learning
literature [14, 26, 82]: Our method also aims to maximize similarities between pos-
itive image-text/image-image pairs, while simultaneously minimizing similarities
between negative pairs during few-shot adaptation. However, our work focuses
on a fundamentally different dual learning perspective, i.e., we explicitly train
two distinct positive and negative classifiers for ensembling using a single image
input at a time, rather than applying supervised contrastive loss to train a single
classifier with image pairs as inputs. It is also important to note that contrastive
learning is not directly applicable in our fine-tuning setting because we keep the
CLIP image encoder frozen. Thereby, the extracted image features are fixed and
cannot be pulled together or pushed apart.

3 DualAdapter: Unified Dual-Path Adaptation of VLMs


We introduce a novel framework DualAdapter, as illustrated in Fig. 2, to enable
efficient dual-path adaptation of VLMs. Our DualAdapter consists of four adapters:
two that positively adapt CLIP to enhance its ability to accurately identify the
true class of an input image, and two that negatively adapt CLIP to effectively
exclude incorrect candidate classes. Furthermore, we formulate few-shot learning
as a noisy problem, and propose an unsupervised similarity-based label refine-
ment approach to select high-quality representative samples to more effectively
adapt CLIP to downstream tasks. In the following sections, we first briefly review
zero-shot CLIP, then introduce each component of our DualAdapter in detail.
Fig. 2: An overview of our proposed DualAdapter. We introduce four adapters, each of which builds a cache with positive/negative text/image features that are updated
during few-shot training. Given an image to be classified, the classification logit for a
specific class increases when the image feature closely aligns with the corresponding
features in the positive cache and diverges from those in the negative cache.

3.1 Preliminary: A Revisit of CLIP

By leveraging contrastive training on billions of image-text pairs from the Internet, CLIP has shown an outstanding ability to align vision and language modalities [48].
In this work, we employ CLIP’s pre-trained visual encoder FV and textual encoder
FT to map the images and textual descriptions into a shared d-dimensional
embedding space. For a C-class classification task, CLIP performs zero-shot
predictions by assessing the similarity between the image feature and the text
feature specific to each class c ∈ {1, ..., C} as follows:

f_{\mathcal{I}} = \mathcal{F}_{\mathsf{V}}(\mathcal{I}), \quad f_{t_c} = \mathcal{F}_{\mathsf{T}}(\mathcal{T}_c), \quad \mathbb{P}(y=c\,|\,\mathcal{I}) = \frac{\exp\left(\cos\left(f_{t_c}, f_{\mathcal{I}}\right)/t\right)}{\sum_{t'} \exp\left(\cos\left(f_{t'}, f_{\mathcal{I}}\right)/t\right)},   (1)

where I is the input image, and T_c represents the text description for class c (e.g., “A photo of a {CLASS_c}”). The parameter t is the temperature in the softmax function specified by CLIP [48], and cos(·, ·) computes the cosine similarity. Given that both the image and text features are L2-normalized (∥f_t∥₂ = ∥f_v∥₂ = 1), the cosine similarity is effectively a dot product, i.e., cos(f_t, f_v) = f_v^⊤ f_t.
To streamline this process, a weight matrix can be precomputed and stored in a textual cache, which concatenates the textual features associated with each class, denoted as T_cache = [f_{t_1} f_{t_2} ··· f_{t_C}]^⊤ ∈ R^{C×d}. Subsequently, we can efficiently obtain the logit S and the final prediction P(y|I) via vectorized computation:

\mathcal{S} = f_{\mathcal{I}} \mathsf{T}_{\mathtt{cache}}^{\top} \in \mathbb{R}^{1\times C}, \quad \mathbb{P}(y|\mathcal{I}) = \mathsf{Softmax}\left(\mathcal{S}\right).   (2)
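To make the cache-based zero-shot classifier concrete, below is a minimal PyTorch-style sketch of Eqs. (1)-(2), assuming the image and text features have already been extracted and L2-normalized; the tensor names, dimensions, and temperature value are illustrative rather than taken from our released code.

```python
import torch
import torch.nn.functional as F

C, d = 10, 1024          # number of classes and feature dimension (illustrative)
t = 0.01                 # softmax temperature; CLIP learns its own value

# Assume f_t[c] = F_T(T_c) and f_image = F_V(I), both L2-normalized.
T_cache = F.normalize(torch.randn(C, d), dim=-1)   # textual cache, shape (C, d)
f_image = F.normalize(torch.randn(1, d), dim=-1)   # image feature, shape (1, d)

# Eq. (2): with unit-norm features, cosine similarity is a plain matrix product.
S = f_image @ T_cache.t()                          # logits, shape (1, C)
probs = torch.softmax(S / t, dim=-1)               # P(y | I), shape (1, C)
pred = probs.argmax(dim=-1)                        # predicted class index
```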

3.2 Positive Adaptation


Although the zero-shot CLIP model is trained on web-scale datasets and encodes
extensive prior knowledge, it cannot be effectively applied to downstream tasks
due to the presence of domain shifts and a deficiency in task-specific knowledge.
To address this, we efficiently leverage few-shot annotated samples to distill new,
task-specific knowledge. As we discussed in Section 1, our DualAdapter enables
few-shot adaptation from both positive and negative perspectives. In this section,
we first focus on the positive side, i.e., refining CLIP’s ability to more precisely
predict the true class of the input image in a specific downstream task.
Positive Textual Adapter. As shown in Eq. (2), CLIP can make zero-shot pre-
dictions utilizing a textual cache, which stores the textual features from positive
text descriptions (e.g., “A photo of a {CLASS}”). To avoid ambiguity, we denote
this cache as T^+_cache = [f^+_{t_1} f^+_{t_2} ··· f^+_{t_C}]^⊤ ∈ R^{C×d}. With a small set of annotated training images, we can improve this cache-based classifier by enhancing the inter-modal alignment between textual features and the few-shot image features from the same class. Specifically, we follow TaskRes [75] to achieve this by introducing a group of learnable parameters R^+_T ∈ R^{C×d}. By incorporating these parameters with the textual cache T^+_cache via element-wise addition in a residual manner, we integrate task-specific knowledge and enhance the inter-modal classifier:

\mathsf{T}_{\mathtt{cache}}^{+} \leftarrow \mathsf{Normalize}\left(\mathsf{T}_{\mathtt{cache}}^{+} + \mathcal{R}_{\mathsf{T}}^{+}\right), \quad \mathcal{S}_{\mathsf{T}}^{+} = f_{\mathcal{I}} \, \mathsf{T}_{\mathtt{cache}}^{+\,\top} \in \mathbb{R}^{1\times C}.   (3)
Here, Normalize denotes the L2-normalization applied to each row of the matrix.
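As a rough sketch of the positive textual adapter in Eq. (3) (our own illustration, assuming zero-initialized residuals and toy shapes), the residual is added to the cached text features, each row is re-normalized, and the adapted cache scores the image feature:

```python
import torch
import torch.nn.functional as F

C, d = 10, 1024
T_cache_pos = F.normalize(torch.randn(C, d), dim=-1)    # positive textual cache T+_cache
R_T_pos = torch.nn.Parameter(torch.zeros(C, d))         # learnable residual R+_T (zero init assumed)

# Eq. (3): residual update, row-wise re-normalization, then inter-modal scoring.
T_adapted = F.normalize(T_cache_pos + R_T_pos, dim=-1)  # (C, d)
f_image = F.normalize(torch.randn(1, d), dim=-1)
S_T_pos = f_image @ T_adapted.t()                       # S+_T, shape (1, C)
```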
Positive Visual Adapter. Inspired by Tip-Adapter [79], we further extend the
recognition capability of the CLIP model by constructing a visual cache-based
classifier, which operates by measuring intra-modal similarities between the input
image feature and few-shot image features. Given a C-class K-shot training
dataset from a specific downstream task, we store all CK image features in a
precomputed visual cache, denoted as V^+_cache = [f^{(1)}_{v_1} f^{(2)}_{v_1} ··· f^{(K)}_{v_C}]^⊤ ∈ R^{CK×d}. Their one-hot labels are also correspondingly recorded in L ∈ R^{CK×C}. Given an image feature f_I to be classified, we calculate its image-image affinities A^+ with all the few-shot training images, which are then multiplied by their corresponding one-hot labels L to obtain the classification logit S^+_V:

\mathcal{A}^{+} = \exp\left(-\beta\left(1 - f_{\mathcal{I}} \mathsf{V}_{\mathtt{cache}}^{+\,\top}\right)\right) \in \mathbb{R}^{1\times CK}, \quad \mathcal{S}_{\mathsf{V}}^{+} = \alpha \mathcal{A}^{+} L \in \mathbb{R}^{1\times C},   (4)
where α represents a balance factor and β denotes a modulating hyper-parameter.
Empirically, the value of α should be increased in cases of larger domain shifts,
where more domain-specific few-shot knowledge is needed for adaptation.
Symmetrically, we also introduce a set of learnable parameters R^+_V ∈ R^{C×d}, which are broadcast to R^{CK×d} and added to the positive visual cache:

\mathsf{V}_{\mathtt{cache}}^{+} \leftarrow \mathsf{Normalize}\left(\mathsf{V}_{\mathtt{cache}}^{+} + \mathcal{R}_{\mathsf{V}}^{+}\right).   (5)
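The positive visual adapter of Eqs. (4)-(5) can be sketched as follows (a PyTorch-style illustration under our own assumptions: class-major ordering of the cache, zero-initialized residuals, and arbitrary values for α and β):

```python
import torch
import torch.nn.functional as F

C, K, d = 10, 16, 1024
alpha, beta = 1.0, 5.5    # balance and modulating hyper-parameters (illustrative values)

# Few-shot visual cache: CK image features (class-major order) and their one-hot labels.
V_cache_pos = F.normalize(torch.randn(C * K, d), dim=-1)                  # (CK, d)
L_onehot = F.one_hot(torch.arange(C).repeat_interleave(K), C).float()     # (CK, C)

# Eq. (5): residual of shape (C, d) broadcast to (CK, d) and added to the cache.
R_V_pos = torch.nn.Parameter(torch.zeros(C, d))
V_adapted = F.normalize(V_cache_pos + R_V_pos.repeat_interleave(K, dim=0), dim=-1)

# Eq. (4): exponential image-image affinities, projected onto the one-hot labels.
f_image = F.normalize(torch.randn(1, d), dim=-1)
A_pos = torch.exp(-beta * (1.0 - f_image @ V_adapted.t()))                # (1, CK)
S_V_pos = alpha * A_pos @ L_onehot                                        # S+_V, shape (1, C)
```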
3.3 Negative Adaptation


Having introduced the positive branch of our DualAdapter, we now explore the
negative inference capabilities of VLMs, adapting them to downstream tasks
from a negative perspective. Specifically, we craft negative prompts and generate
negative image prototypes to direct CLIP towards more confidently excluding
incorrect classes based on the given input image.
Negative Textual Adapter. Recall that the positive textual cache, denoted as
T^+_cache, is constructed from positive class descriptor prompts such as “A photo of a {CLASS}.” Conversely, we introduce a set of negative prompts (e.g., “A photo of no {CLASS},” “A photo without {CLASS}”) that convey semantically opposite meanings¹. By leveraging the textual features f^-_t derived from the negative text descriptions linked with these prompts, we again precompute a weight matrix and construct the negative textual cache T^-_cache = [f^-_{t_1} f^-_{t_2} ··· f^-_{t_C}]^⊤ ∈ R^{C×d}. Intuitively, if the feature of an input image f_I closely resembles the negative textual feature for a specific class c, it strongly suggests that the image does not belong to class c, i.e., P(y = c|I) ∝ 1 − f_I^⊤ f^-_{t_c}. Considering that the zero-shot CLIP model is not designed to perform this kind of negative prediction, we incorporate a learnable residual R^-_T ∈ R^{C×d} to refine the negative text embeddings through supervised task-specific training. The prediction logit is subsequently given by:

\mathsf{T}_{\mathtt{cache}}^{-} \leftarrow \mathsf{Normalize}\left(\mathsf{T}_{\mathtt{cache}}^{-} + \mathcal{R}_{\mathsf{T}}^{-}\right), \quad \mathcal{S}_{\mathsf{T}}^{-} = \delta_{\mathsf{T}} \left(\mathbf{1} - f_{\mathcal{I}} \mathsf{T}_{\mathtt{cache}}^{-\,\top}\right) \in \mathbb{R}^{1\times C},   (6)

where δ_T is a fixed scaling parameter that adjusts S^-_T to match the mean value of S^+_T, and 1 ∈ R^{1×C} denotes an all-ones C-dimensional vector.
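A corresponding sketch of the negative textual adapter in Eq. (6) follows (again an illustration with assumed shapes; δ_T is fixed to 1 here instead of being matched to the mean of S^+_T):

```python
import torch
import torch.nn.functional as F

C, d = 10, 1024
delta_T = 1.0   # fixed scale; in practice matched to the mean of S+_T (simplified here)

# Negative textual cache built from prompts such as "A photo of no {CLASS}."
T_cache_neg = F.normalize(torch.randn(C, d), dim=-1)
R_T_neg = torch.nn.Parameter(torch.zeros(C, d))

# Eq. (6): the more an image resembles a class's negative description, the lower its score.
T_adapted = F.normalize(T_cache_neg + R_T_neg, dim=-1)
f_image = F.normalize(torch.randn(1, d), dim=-1)
S_T_neg = delta_T * (1.0 - f_image @ T_adapted.t())     # S-_T, shape (1, C)
```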
Negative Visual Adapter. In the visual domain, we also similarly pursue
some negative image features to enable negative predictions. To achieve this, we
naturally treat all (C−1)K images from the other C−1 classes as negative samples
for a given class c. These negative image samples hold the same property as the
negative text descriptions: they are explicitly selected to be non-representative of
a specific class, which means that the higher the similarity to them, the lower the
probability that the image is classified to that class. To mitigate individual biases,
we randomly select one image from each of the C − 1 classes and compute the
average of their extracted features to represent the negative image prototypes [56].
In this way, we can get a total of K negative image prototypes for each of the C classes, thereby constructing a negative visual cache V^-_cache ∈ R^{CK×d}. Similar to
Eq. (4), we compute the affinities between the input image and those negative
image prototypes, and calculate the classification logit:
\mathcal{A}^{-} = \delta_{\mathsf{V}} \exp\left(-\beta\left(\mathbf{1} - \left(\mathbf{1} - f_{\mathcal{I}} \mathsf{V}_{\mathtt{cache}}^{-\,\top}\right)\right)\right) = \delta_{\mathsf{V}} \exp\left(-\beta f_{\mathcal{I}} \mathsf{V}_{\mathtt{cache}}^{-\,\top}\right) \in \mathbb{R}^{1\times CK},   (7)

\mathcal{S}_{\mathsf{V}}^{-} = \alpha \mathcal{A}^{-} L \in \mathbb{R}^{1\times C},   (8)

where δ_V is another fixed scaling parameter that adjusts A^-/S^-_V to match the mean value of A^+/S^+_V, and L ∈ R^{CK×C} denotes the corresponding one-hot labels.
¹ The comprehensive collection of positive and negative prompts utilized for each dataset is available in the Supplementary Materials.
Fig. 3: Visualization of cosine similarities on the validation set of ImageNet [4]. (Left) Example features of the class dog stored in the cache; (Middle) Distribution of pairwise image-text similarities with dual-path features in textual adapters; (Right) Distribution of pairwise image-image similarities with dual-path features in visual adapters.

To further encode the diverse fine-grained visual knowledge associated with the
specific downstream task, we introduce an extra set of learnable residual parame-
ters, R^-_V ∈ R^{C×d}, which is similarly broadcast to refine the negative visual cache:

\mathsf{V}_{\mathtt{cache}}^{-} \leftarrow \mathsf{Normalize}\left(\mathsf{V}_{\mathtt{cache}}^{-} + \mathcal{R}_{\mathsf{V}}^{-}\right).   (9)
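The negative visual path of Eqs. (7)-(9) can be sketched as below; the prototype construction follows the description above (average one randomly sampled image from each of the other C−1 classes, K times per class), but the exact sampling strategy, initialization, and hyper-parameter values are our assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

C, K, d = 10, 16, 1024
alpha, beta, delta_V = 1.0, 5.5, 1.0   # illustrative hyper-parameter values

# Few-shot image features grouped by class: feats[c] has shape (K, d).
feats = F.normalize(torch.randn(C, K, d), dim=-1)

# Build K negative prototypes per class: for class c, average one randomly sampled
# image from each of the other C-1 classes (sampling details are an assumption).
protos = []
for c in range(C):
    others = [o for o in range(C) if o != c]
    for _ in range(K):
        idx = torch.randint(0, K, (len(others),))
        sampled = torch.stack([feats[o, i] for o, i in zip(others, idx)])   # (C-1, d)
        protos.append(sampled.mean(dim=0))
V_cache_neg = F.normalize(torch.stack(protos), dim=-1)                      # (CK, d)

# Eq. (9): learnable residual broadcast to the negative visual cache.
R_V_neg = torch.nn.Parameter(torch.zeros(C, d))
V_adapted = F.normalize(V_cache_neg + R_V_neg.repeat_interleave(K, dim=0), dim=-1)

# Eqs. (7)-(8): higher similarity to a class's negative prototypes lowers its score.
L_onehot = F.one_hot(torch.arange(C).repeat_interleave(K), C).float()       # (CK, C)
f_image = F.normalize(torch.randn(1, d), dim=-1)
A_neg = delta_V * torch.exp(-beta * (f_image @ V_adapted.t()))               # (1, CK)
S_V_neg = alpha * A_neg @ L_onehot                                           # S-_V, shape (1, C)
```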

3.4 Unified Dual-Path Adapter


To effectively adapt VLMs to downstream tasks, our DualAdapter combines the
functionalities of all four introduced adapters. As illustrated in Figure 3 (Left),
we provide an overview of the specific example features stored in the cache for
the class dog. By measuring the similarities between the input image feature
and these cached example features, our DualAdapter produces a unified final
prediction of the probability that the input image represents a dog. The same
process is consistently applied across all classes to perform recognition tasks.
DualAdapter Inference. To derive the final classification scores, we aggregate
the four outputs from dual-path (i.e., positive and negative) adapters spanning
both textual and visual modalities, expressed as:

\mathcal{S}_{\mathsf{final}} = \lambda\left(\mathcal{S}_{\mathsf{T}}^{+} + \mathcal{S}_{\mathsf{V}}^{+}\right) + (1 - \lambda)\left(\mathcal{S}_{\mathsf{T}}^{-} + \mathcal{S}_{\mathsf{V}}^{-}\right).   (10)

Here, λ serves as a tuning hyper-parameter to balance the contribution of positive and negative adapter logits. Throughout the training process, the collection of learnable parameters R = {R^+_T, R^+_V, R^-_T, R^-_V} is updated via stochastic gradient descent, employing a cross-entropy loss function for optimization.
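As a minimal sketch of the unified inference and one training step (only the two textual branches are shown for brevity; δ_T is fixed to 1 and all shapes and values are illustrative, not our actual implementation):

```python
import torch
import torch.nn.functional as F

C, d, lam = 10, 1024, 0.75   # lambda = 0.75 as used in the experiments

# Frozen caches (assumed precomputed and L2-normalized, as in the sketches above).
T_pos = F.normalize(torch.randn(C, d), dim=-1)
T_neg = F.normalize(torch.randn(C, d), dim=-1)

# Learnable residuals R+_T and R-_T (the visual residuals are handled analogously).
R_T_pos = torch.nn.Parameter(torch.zeros(C, d))
R_T_neg = torch.nn.Parameter(torch.zeros(C, d))
optimizer = torch.optim.AdamW([R_T_pos, R_T_neg], lr=1e-4)

# One toy training step on a batch of image features with ground-truth labels.
f_images = F.normalize(torch.randn(8, d), dim=-1)
labels = torch.randint(0, C, (8,))

S_T_pos = f_images @ F.normalize(T_pos + R_T_pos, dim=-1).t()          # Eq. (3)
S_T_neg = 1.0 - f_images @ F.normalize(T_neg + R_T_neg, dim=-1).t()    # Eq. (6), delta_T = 1
S_final = lam * S_T_pos + (1.0 - lam) * S_T_neg                        # Eq. (10), textual part only
loss = F.cross_entropy(S_final, labels)                                # cross-entropy objective
loss.backward()
optimizer.step()
optimizer.zero_grad()
```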
Effectiveness of Four Adapters. Fig. 3 qualitatively shows the effectiveness of
the four adapters we designed. In the middle and right sub-figure, we visualize the
distributions of the pairwise cosine similarities between the features of the input
image and the textual/visual features stored in the adapters on ImageNet [4]
validation set. From Fig. 3, we can observe the following statistical patterns: (1) As expected, the input image feature is more similar to positive features from the same class (blue > orange) and to negative features from other classes (red > green) across both modalities; (2) Within the textual modality, negative features
tend to be less similar to the input image than positive features (red < blue, green < orange), since negative prompts are less common in the CLIP [48] training corpus; (3) Within the visual modality, the similarity distribution of
negative features occupies an intermediate position between the positive features
of the same and different classes (orange < green/red < blue). This is because the
negative visual features are constructed by averaging the image features across
all other classes, inherently leading to a more generic representation that lacks
the distinctiveness characteristic of positive features within a single class.

3.5 Similarity-Based Label Refinement

Previous adapter-style fine-tuning methods [64,79] generally assume that the few-
shot training samples are meticulously curated to provide accurate representation
for each class. However, in practical scenarios, this assumption often does not hold
true. We recognize that not every image is of high quality or equally representative
of its respective class, which can be particularly detrimental in the context of
few-shot learning, where the availability of training images is extremely limited.
For example, we provide the t-SNE [38] visualization of few-shot image features
from the OxfordPets [45] dataset in Fig. 4 (Left). We argue that the few-shot
samples are highly noisy: there is even a data point from the Basset Hound class
that overlaps with the prototype (i.e., average feature) of Cocker Spaniel.
In response to this, we devise an unsupervised similarity-based label refinement
process to address this challenge. Recall that in our visual adapter, we calculate
the image-image affinities between the input image and few-shot training images,
with all few-shot images being assigned equal weights2 as detailed in Eq. (4).
As we discussed, this approach is possibly problematic in the noisy few-shot
learning setting. Consequently, we extend the visual adapter to assign non-uniform
confidence scores according to pairwise similarities, effectively downweighting
outliers and emphasizing more representative samples.
Specifically, our refinement process is based on an intuitive assumption: a representative image feature lies closer to the other image features from the same class than low-quality outliers do [26]. Based on this assumption, given the K-shot image features {f^{(i)}_{v_c}}_{i=1}^{K} from a specific class c, we calculate the average cosine similarity of each image feature to the others:

\mathsf{d}_c^{(i)} = \frac{1}{K-1} \sum_{j \neq i} \cos\left(f_{v_c}^{(i)}, f_{v_c}^{(j)}\right) = \frac{1}{K-1} \sum_{j \neq i} f_{v_c}^{(i)\top} f_{v_c}^{(j)}.   (11)

We then assign non-uniform confidence scores w^{(i)}_c and compute refined label values ℓ^{(i)}_c for all K-shot image features based on their similarities to others:

w_c^{(i)} = \frac{\exp(\mathsf{d}_c^{(i)}/\tau)}{\sum_{i'} \exp(\mathsf{d}_c^{(i')}/\tau)}, \quad \ell_c^{(i)} = K\, w_c^{(i)},   (12)

² Each few-shot image is assigned a one-hot label, corresponding to a weight of 1.
Fig. 4: Motivation for the proposed similarity-based label refinement. (Left) t-SNE [38] visualization of visual features for 4 representative classes from the Oxford-
Pets [45] dataset, with prototype [56] of each class denoted by the crossmark ×; (Middle)
Our refined labels for 4 example images from class Beagle; (Right) An illustrative plot of
the modeled distributions using one-hot labels and our refined labels for the Beagle class.

where τ is a temperature hyper-parameter to control the intensity of our refinement. In particular, as τ → ∞, our refinement becomes inactive, assigning a
label value of 1 to all instances. Conversely, as τ → 0, our refinement process
becomes more selective, assigning the highest label value exclusively to the most
representative image feature, while all other features are assigned 0.
This refinement process is applied across each class, wherein the original one-hot labels L are weighted by the new label values computed in Eq. (12), resulting in the updated soft labels 𝕃. This refinement is incorporated into both the positive and negative visual adapters; therefore, we can rewrite Eqs. (4) and (8) as:

\mathcal{S}_{\mathsf{V}}^{+} = \alpha \mathcal{A}^{+} \mathbb{L} \in \mathbb{R}^{1\times C}, \quad \mathcal{S}_{\mathsf{V}}^{-} = \alpha \mathcal{A}^{-} \mathbb{L} \in \mathbb{R}^{1\times C}.   (13)
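The refinement of Eqs. (11)-(13) reduces to a temperature-scaled softmax over pairwise similarities within a class, as in the following sketch (toy features; τ = 1 as used in our experiments, other values are illustrative):

```python
import torch
import torch.nn.functional as F

K, d, tau = 8, 1024, 1.0   # shot count, feature dimension, temperature

# K-shot L2-normalized image features of one class (toy data).
feats = F.normalize(torch.randn(K, d), dim=-1)           # (K, d)

# Eq. (11): average cosine similarity of each sample to the other K-1 samples.
sim = feats @ feats.t()                                  # (K, K)
d_c = (sim.sum(dim=1) - sim.diag()) / (K - 1)            # exclude self-similarity

# Eq. (12): softmax confidences; refined label values sum to K across the class.
w_c = torch.softmax(d_c / tau, dim=0)
ell_c = K * w_c

# Eq. (13): these values replace the 1s in the one-hot rows of L to form the soft labels.
```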

Effectiveness of Label Refinement. In Fig. 4 (Middle), we present our refined label values for 4 example images from the class Beagle in the OxfordPets [45] dataset.
Notably, our unsupervised refinement approach functions by measuring the
pairwise similarities among 8-shot images from class Beagle, without requiring
any extra parameters or additional training. Our similarity-based refinement
process allocates higher values to the more representative samples at the top,
and lower values to the lower-quality images at the bottom, where the Beagle is
either obscured or presented in black and white. In Fig. 4 (Right), we provide
an illustration of the modeled visual feature distribution for the Beagle class
using one-hot labels and our refined labels. Through our refinement process, we
are better equipped to capture the underlying feature distribution (typically,
a normal distribution [68, 81]) of the Beagle class, whereas the use of one-hot
labels restricts us to modeling merely a uniform distribution.

4 Experiments
In this section, we conduct extensive experiments on two tasks across 15 datasets.
These results demonstrate that our DualAdapter outperforms other state-of-the-art
methods, exhibiting enhanced adaptation and generalization capabilities.
4.1 Experimental Settings

Tasks and Datasets. To validate the effectiveness of our proposed DualAdapter, we evaluate it on two standard benchmarking tasks: few-shot learning and domain generalization. For the few-shot learning task,
we comprehensively evaluate our method on 11 well-known image classification
benchmarks: ImageNet [4], Caltech101 [11], OxfordPets [45], StanfordCars [29],
Flowers102 [44], Food101 [1], FGVCAircraft [39], DTD [3], SUN397 [71], Eu-
roSAT [18], and UCF101 [57]. For domain generalization, we investigate the
generalization capability of our DualAdapter on 4 variant datasets of ImageNet:
ImageNet-V2 [50], ImageNet-Sketch [65], ImageNet-A [20], and ImageNet-R [19].
Implementation Details. Following previous works [79, 83], we adopt ResNet-
50 [17] backbone as the visual encoder of CLIP in our experiments by default.
We adopt prompt ensembling, leveraging textual prompts from both CLIP [48]
and CuPL [46] to enhance model performance. For the negative prompts we used
for each dataset, please kindly refer to the Supplementary Materials. We set the
hyper-parameters λ and τ as 0.75 and 1, respectively. Our DualAdapter is trained
using the AdamW [36] optimizer with a cosine scheduler [35]. The batch size is set to 256. For R^+_T and R^+_V, the learning rate is set to 0.0001, while for R^-_T and R^-_V, the learning rate is set to 0.0005. Additionally, our model is trained for 200 epochs on the EuroSAT [18] dataset, and for 20 epochs on all other datasets.
To ensure the reliability of our results, we perform each experiment three times
using different initialization seeds and report the mean accuracy achieved. All
experiments are conducted on a single NVIDIA RTX 6000 Ada GPU.
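The optimizer setup described above can be sketched as two AdamW parameter groups with a cosine schedule (a hypothetical configuration snippet; the parameter shapes and the specific CosineAnnealingLR scheduler are our assumptions):

```python
import torch

C, d = 1000, 1024   # e.g., ImageNet classes and the ResNet-50 CLIP feature dimension
R_T_pos, R_V_pos, R_T_neg, R_V_neg = (torch.nn.Parameter(torch.zeros(C, d)) for _ in range(4))

epochs = 20          # 200 on EuroSAT, 20 on all other datasets
optimizer = torch.optim.AdamW([
    {"params": [R_T_pos, R_V_pos], "lr": 1e-4},   # positive residuals
    {"params": [R_T_neg, R_V_neg], "lr": 5e-4},   # negative residuals
])
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
```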
Baselines. We compare our proposed method with the following state-of-the-
art methods: zero-shot and linear probe CLIP [48], CoOp [84], CoCoOp [83],
ProGrad [87], CLIP-Adapter [12], Tip-Adapter-F [79], TPT [55], TaskRes [75],
and GraphAdapter [32]. For a fair comparison, we directly report the results of
these baselines from their respective original papers.

4.2 Few-Shot Learning

In Fig. 5, we compare the few-shot learning performance of our DualAdapter with other state-of-the-art methods on 11 image classification datasets. In the
top-left sub-figure, we also present the average classification accuracy across all 11
datasets. The results indicate that our DualAdapter consistently outperforms other
methods across various few-shot learning settings by substantial margins. More-
over, our DualAdapter demonstrates more pronounced performance improvements
in specialized classification tasks, such as satellite image and texture classification
on the EuroSAT [18] and DTD [3] datasets. With 16-shot training on these two
datasets, our DualAdapter surpasses Tip-Adapter-F [79] by a notable 3.56% and
3.50%, and enhances the performance of zero-shot CLIP [48] significantly by
45.56% and 30.50%, respectively. For full numerical results, please refer to our
Supplementary Materials. Overall, the consistently superior performance on 11
datasets fully demonstrates the effectiveness of our DualAdapter.
Fig. 5: Performance comparisons on few-shot learning on 11 image classification datasets. For each dataset, we report the mean accuracy and 95% confidence
interval over 3 random seeds of our DualAdapter on 1-/2-/4-/8-/16-shot settings.

4.3 Robustness to Natural Distribution Shifts


In Table 1, we compare the generalization performance of our DualAdapter with other state-of-the-art methods in the presence of distribution shifts. Specifically, all the models are trained solely on 16-shot ImageNet [4], and directly tested on 4 out-of-distribution ImageNet variant datasets. As shown in Table 1, our DualAdapter not only achieves state-of-the-art performance on the source dataset but also attains an average performance gain of 1.18% across 4 out-of-distribution (OOD) target datasets. These experimental results demonstrate that by enabling adaptation through both positive and negative paths, our DualAdapter can more effectively prevent overfitting to the source dataset, thereby enhancing robustness against distribution shifts.

Table 1: Performance comparison on robustness to distribution shifts. All the models are trained on 16-shot ImageNet [4] and directly tested on the OOD target datasets. The best results are in bold and the second best are underlined.

Method                   ImageNet (Source)   -V2     -Sketch   -A      -R      Avg. (Target)
Zero-Shot CLIP [48]      60.33               53.27   35.44     21.65   56.00   41.59
Linear Probe CLIP [48]   56.13               45.61   19.13     12.74   34.86   28.09
CoOp [84]                62.95               55.40   34.67     23.06   56.60   42.43
CoCoOp [83]              62.71               55.72   34.48     23.32   57.74   42.82
ProGrad [87]             62.17               54.70   34.40     23.05   56.77   42.23
TPT [55]                 60.74               54.70   35.09     26.67   59.11   43.89
TaskRes [75]             64.75               56.47   35.83     22.80   60.70   43.95
GraphAdapter [32]        64.94               56.58   35.89     23.07   60.86   44.10
DualAdapter (Ours)       66.52               57.87   36.38     25.73   61.12   45.28

4.4 Ablation Studies

Effectiveness of Different Components. In Table 2, we conduct a systematic analysis of the impacts of various components within our DualAdapter framework.
Table 2: Ablation studies for different variants of DualAdapter. We evaluate the few-shot adaptation capabilities of four DualAdapter variants on ImageNet [4].

#  Method          R^+_T  R^-_T  R^+_V  R^-_V  1-shot  2-shot  4-shot  8-shot  16-shot
1  DualAdapter_T   ✓      ✓      ✗      ✗      62.86   63.36   64.01   65.23   66.34
2  DualAdapter_V   ✗      ✗      ✓      ✓      62.21   62.37   62.68   63.72   65.30
3  DualAdapter+    ✓      ✗      ✓      ✗      62.83   63.31   63.95   65.13   66.27
4  DualAdapter−    ✗      ✓      ✗      ✓      62.65   63.07   63.60   64.36   65.12
5  DualAdapter     ✓      ✓      ✓      ✓      62.89   63.47   64.12   65.37   66.52

Fig. 6: More ablation results. (Left) Performance comparison of our DualAdapter with other methods on the few-shot learning task using different visual backbones; (Middle) Sensitivity analysis of λ from Eq. (10) on ImageNet [4]; (Right) Sensitivity analysis of τ from Eq. (12) on 3 datasets, with the value of τ presented on a logarithmic scale.

More specifically, we assess the performance of four distinct DualAdapter variants, each configured to allow two adapters to be updated while keeping the others
fixed. We have the following main observations: (1) Compared to zero-shot
CLIP, all four variants demonstrate a performance improvement of approximately
5%∼6% (from 60.33%) with 16-shot samples, indicating that each variant can
operate effectively; (2) Relatively, the textual variant (DualAdapterT ) and the
positive variant (DualAdapter+ ) demonstrate superior efficiency over the visual
and negative counterparts; (3) The full DualAdapter method, which combines all
four adapters together, outperforms the individual approaches by achieving the
highest accuracy of 66.52% in the 16-shot scenario.
Effects of Different Visual Backbones. We also implement our DualAdapter
with various visual encoders, including ResNet [17] and ViT [6], and evaluate
its performance against other fine-tuning approaches in Fig. 6 (Left). We can
see that our DualAdapter consistently exceeds other methods across all visual
backbones, indicating the general effectiveness of our approach.
Sensitivity Analysis of Hyper-Parameters. We provide sensitivity analysis
for the hyper-parameters λ and τ in Fig. 6. The hyper-parameter λ from Eq. (10)
controls the combination of the positive and negative predictions. Assigning λ =
0/1 simplifies our DualAdapter to solely rely on negative or positive predictions,
respectively. In Fig. 6 (Middle), we can observe that setting λ = 0.75 consistently
yields the optimal performance. Moreover, the temperature hyper-parameter τ
from Eq. (12) controls the intensity of our label refinement. In Fig. 6 (Right),
we vary τ from 0.04 to 25 and report the 16-shot accuracy of the DualAdapterV
Table 3: Performance comparison with Tip-Adapter(-F) [79] on ImageNet [4] with 16-/32-/64-/128-shot settings.

Method               16-shot  32-shot  64-shot  128-shot
CLIP [48]†           0-shot: 60.33
Tip-Adapter [79]     62.03    62.51    62.88    63.15
Tip-Adapter-F [79]   65.47    66.58    67.96    69.74
DualAdapter (Ours)   66.52    67.68    69.01    70.98

Table 4: Efficiency comparison with other existing methods on 16-shot ImageNet [4].

Method               Training  Epochs  GFLOPs  Param.   Accuracy
CLIP [48]            -         -       -       -        60.33
CoOp [84]            14 hr     200     >10     0.01M    62.95
ProGrad [87]         17 hr     200     >10     0.01M    63.45
CLIP-Adapter [12]    50 min    200     0.004   0.52M    63.59
Tip-Adapter-F [79]   5 min     20      0.030   16.38M   65.51
DualAdapter (Ours)   5 min     20      0.009   4.10M    66.52

variant. By refining the labels, we achieve 0.07% ∼ 0.31% higher performance across the three datasets. Our DualAdapter attains its optimal accuracy at different τ values, depending on the noise levels of the different datasets.
Scalability to More Shots. When scaling up to more than 16 shots, Tip-
Adapter-F [79] limits the cache size to prevent memory overflow. In contrast, our
DualAdapter features reduced memory requirements, enabling it to operate without
imposing cache size constraints, even in the 128-shot setting. Performance com-
parisons in Table 3 indicate that our DualAdapter outperforms Tip-Adapter-F [79]
more significantly in higher shot settings, e.g., by 1.24% in the 128-shot setting.
Efficiency Comparison. We also compare the efficiency of DualAdapter with
existing methods in Table 4. Our DualAdapter achieves the highest accuracy while
also exhibiting advantageous computational efficiency: (1) Compared to prompt-
based learning methods such as CoOp [84] and ProGrad [87], our DualAdapter
requires approximately 300× less training time and over 1000× fewer FLOPs since
we do not need to propagate gradients through the textual encoder; (2) Compared
to adapter-style fine-tuning methods, our DualAdapter is also competitive in
efficiency. It requires 10× less training time than CLIP-Adapter [12], and demands
3× fewer FLOPs and 4× fewer learnable parameters than Tip-Adapter-F [79].

5 Conclusion

In this work, we introduce DualAdapter, a novel approach to effectively adapt vision-language models for downstream tasks by integrating both positive and neg-
ative adapters. This innovative approach allows for effective few-shot adaptation
from a dual perspective. Furthermore, we develop an unsupervised similarity-
based label refinement approach to mitigate the adverse effects of low-quality
image samples during the few-shot adaptation process. Comprehensive evaluations
on 15 diverse datasets demonstrate that DualAdapter outperforms the state-of-
the-art methods in both few-shot learning and domain generalization tasks.
Limitations. We identify two potential limitations of our DualAdapter: (1) Its
efficacy in zero-shot situations is constrained due to the scarcity of original
negative descriptions in the CLIP training corpus; (2) Similar to other adapter-
style approaches such as Tip-Adapter-F [79], our DualAdapter fine-tuned on
a specific task cannot be directly applied to another task without additional
adaptation. However, Wang et al. [68] recently showed that adapter-style fine-tuning methods can be extended to these scenarios using the kNN algorithm.
Acknowledgement

This work has been funded in part by the Army Research Laboratory (ARL) under grants W911NF-23-2-0007 and W911NF-19-2-0146, and the Air Force Office of Sci-
entific Research (AFOSR) under grants FA9550-18-1-0097 and FA9550-18-1-0251.

References

1. Bossard, L., Guillaumin, M., Van Gool, L.: Food-101–mining discriminative com-
ponents with random forests. In: European Conference on Computer Vision. pp.
446–461. Springer (2014) 11, 21, 22, 24
2. Chen, G., Yao, W., Song, X., Li, X., Rao, Y., Zhang, K.: PLOT: Prompt learn-
ing with optimal transport for vision-language models. In: International Confer-
ence on Learning Representations (2023), https://openreview.net/forum?id=
zqwryBoXYnh 3
3. Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures
in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 3606–3613 (2014) 11, 22, 24
4. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale
hierarchical image database. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 248–255 (2009) 8, 11, 12, 13, 14,
22, 23, 24
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec-
tional transformers for language understanding. In: Proceedings of the Conference
of the North American Chapter of the Association for Computational Linguistics.
pp. 4171–4186 (2019) 24
6. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.:
An image is worth 16x16 words: Transformers for image recognition at scale. In:
International Conference on Learning Representations (2021), https://openreview.
net/forum?id=YicbFdNTTy 13
7. Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A.,
Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language
model. arXiv preprint arXiv:2303.03378 (2023) 1
8. Du, Y., Liu, Z., Li, J., Zhao, W.X.: A survey of vision-language pre-trained models.
In: Proceedings of the International Joint Conference on Artificial Intelligence. pp.
5436–5443 (2022) 1, 3
9. Du, Y., Wei, F., Zhang, Z., Shi, M., Gao, Y., Li, G.: Learning to prompt for
open-vocabulary object detection with vision-language model. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp.
14084–14093 (2022) 3
10. Fang, A., Ilharco, G., Wortsman, M., Wan, Y., Shankar, V., Dave, A., Schmidt,
L.: Data determines distributional robustness in contrastive language image pre-
training (clip). In: International Conference on Machine Learning. pp. 6216–6234.
PMLR (2022) 1
11. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few
training examples: An incremental bayesian approach tested on 101 object categories.
Computer Vision and Image Understanding 106(1), 59–70 (2007) 11, 22, 24
12. Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: Clip-
adapter: Better vision-language models with feature adapters. International Journal
of Computer Vision 132, 581–595 (2024) 1, 3, 11, 14, 22
13. Gao, Y., Zhang, M.L.: Discriminative complementary-label learning with weighted
loss. In: International Conference on Machine Learning. pp. 3587–3597. PMLR
(2021) 4
14. Gunel, B., Du, J., Conneau, A., Stoyanov, V.: Supervised contrastive learning for
pre-trained language model fine-tuning. In: International Conference on Learning
Representations (2021), https://openreview.net/forum?id=cu7IUiOhujH 4
15. Guo, Z., Zhang, R., Qiu, L., Ma, X., Miao, X., He, X., Cui, B.: Calip: Zero-shot
enhancement of clip with parameter-free attention. In: Proceedings of the AAAI
Conference on Artificial Intelligence. vol. 37, pp. 746–754 (2023) 1
16. Gupta, A., Narayan, S., Joseph, K., Khan, S., Khan, F.S., Shah, M.: Ow-detr:
Open-world detection transformer. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. pp. 9235–9244 (2022) 3
17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 770–778 (2016) 11, 13
18. Helber, P., Bischke, B., Dengel, A., Borth, D.: Eurosat: A novel dataset and deep
learning benchmark for land use and land cover classification. IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing 12(7), 2217–
2226 (2019) 11, 22, 24
19. Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai,
R., Zhu, T., Parajuli, S., Guo, M., et al.: The many faces of robustness: A critical
analysis of out-of-distribution generalization. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. pp. 8340–8349 (2021) 11, 24
20. Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., Song, D.: Natural adversarial
examples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 15262–15271 (2021) 11, 24
21. Hu, X., Gan, Z., Wang, J., Yang, Z., Liu, Z., Lu, Y., Wang, L.: Scaling up vision-
language pre-training for image captioning. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 17980–17989 (2022)
1
22. Huang, S., Ma, J., Han, G., Chang, S.F.: Task-adaptive negative envision for
few-shot open-set recognition. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 7171–7180 (2022) 4
23. Ishida, T., Niu, G., Hu, W., Sugiyama, M.: Learning from complementary labels.
In: Advances in Neural Information Processing Systems. vol. 30, pp. 5644–5654
(2017) 4
24. Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H.,
Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning
with noisy text supervision. In: International Conference on Machine Learning. pp.
4904–4916 (2021) 1, 3, 24
25. Khan, Z., BG, V.K., Schulter, S., Yu, X., Fu, Y., Chandraker, M.: Q: How to
specialize large vision-language models to data-scarce vqa tasks? a: Self-train on
unlabeled images! In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 15005–15015 (2023) 1
26. Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A.,
Liu, C., Krishnan, D.: Supervised contrastive learning. In: Advances in Neural
Information Processing Systems. vol. 33, pp. 18661–18673 (2020) 4, 9
27. Kim, W., Son, B., Kim, I.: Vilt: Vision-and-language transformer without convolu-
tion or region supervision. In: International Conference on Machine Learning. pp.
5583–5594. PMLR (2021) 3
28. Kim, Y., Yim, J., Yun, J., Kim, J.: Nlnl: Negative learning for noisy labels. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. pp.
101–110 (2019) 4
29. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-
grained categorization. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision Workshops. pp. 554–561 (2013) 11, 22, 24
30. Li, G., Duan, N., Fang, Y., Gong, M., Jiang, D.: Unicoder-vl: A universal encoder
for vision and language by cross-modal pre-training. In: Proceedings of the AAAI
Conference on Artificial Intelligence. vol. 34, pp. 11336–11344 (2020) 3
31. Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for
unified vision-language understanding and generation. In: International Conference
on Machine Learning. pp. 12888–12900. PMLR (2022) 1, 3
32. Li, X., Lian, D., Lu, Z., Bai, J., Chen, Z., Wang, X.: Graphadapter: Tuning vision-
language models with dual knowledge graph. In: Advances in Neural Information
Processing Systems. vol. 36, pp. 13448–13466 (2023) 11, 12, 23
33. Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., Yan, J.:
Supervision exists everywhere: A data efficient contrastive language-image pre-
training paradigm. In: International Conference on Learning Representations (2022),
https://openreview.net/forum?id=zq1iJkNk3uN 3
34. Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P.,
Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted clip.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 7061–7070 (2023) 3
35. Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts. In:
International Conference on Learning Representations (2016), https://openreview.
net/forum?id=Skq89Scxx 11
36. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International
Conference on Learning Representations (2019), https://openreview.net/forum?
id=Bkg6RiCqY7 11
37. Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: Pretraining task-agnostic visiolinguis-
tic representations for vision-and-language tasks. In: Advances in Neural Information
Processing Systems. vol. 32, pp. 13–23 (2019) 3
38. van der Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of Machine
Learning Research 9, 2579–2605 (2008) 9, 10
39. Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual
classification of aircraft. arXiv preprint arXiv:1306.5151 (2013) 11, 22, 24
40. Maniparambil, M., Vorster, C., Molloy, D., Murphy, N., McGuinness, K., O’Connor,
N.E.: Enhancing clip with gpt-4: Harnessing visual descriptions as prompts. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision
Workshops. pp. 262–271 (2023) 25
41. Nag, S., Zhu, X., Song, Y.Z., Xiang, T.: Zero-shot temporal action detection via
vision-language prompting. In: European Conference on Computer Vision. pp.
681–697. Springer (2022) 3
42. Nguyen, T., Ilharco, G., Wortsman, M., Oh, S., Schmidt, L.: Quality not quantity:
On the interaction between dataset design and robustness of clip. In: Advances in
Neural Information Processing Systems. vol. 35, pp. 21455–21469 (2022) 1
43. Nie, J., Zhang, Y., Fang, Z., Liu, T., Han, B., Tian, X.: Out-of-distribution detection
with negative prompts. In: International Conference on Learning Representations
(2024), https://openreview.net/forum?id=nanyAujl6e 4
44. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of
classes. In: Indian Conference on Computer Vision, Graphics and Image Processing.
pp. 722–729. IEEE (2008) 11, 22, 24
45. Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
pp. 3498–3505 (2012) 9, 10, 11, 21, 22, 24
46. Pratt, S., Covert, I., Liu, R., Farhadi, A.: What does a platypus look like? generat-
ing customized prompts for zero-shot image classification. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision. pp. 15691–15701 (2023)
11, 24, 25
47. Qiao, P., Wei, Z., Wang, Y., Wang, Z., Song, G., Xu, F., Ji, X., Liu, C., Chen, J.:
Fuzzy positive learning for semi-supervised semantic segmentation. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp.
15465–15474 (2023) 4
48. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International Conference on Machine Learning.
pp. 8748–8763. PMLR (2021) 1, 3, 4, 5, 9, 11, 12, 14, 22, 23
49. Rao, Y., Zhao, W., Chen, G., Tang, Y., Zhu, Z., Huang, G., Zhou, J., Lu, J.:
Denseclip: Language-guided dense prediction with context-aware prompting. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition. pp. 18082–18091 (2022) 3
50. Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do imagenet classifiers generalize
to imagenet? In: International Conference on Machine Learning. pp. 5389–5400.
PMLR (2019) 11, 24
51. Sanghi, A., Chu, H., Lambourne, J.G., Wang, Y., Cheng, C.Y., Fumero, M., Malek-
shan, K.R.: Clip-forge: Towards zero-shot text-to-shape generation. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp.
18603–18613 (2022) 3
52. Shen, S., Li, L.H., Tan, H., Bansal, M., Rohrbach, A., Chang, K.W., Yao, Z., Keutzer,
K.: How much can CLIP benefit vision-and-language tasks? In: International
Conference on Learning Representations (2022), https://openreview.net/forum?
id=zf_Ll3HZWgy 1
53. Shi, H., Hayat, M., Wu, Y., Cai, J.: Proposalclip: Unsupervised open-category object
proposal generation via exploiting clip cues. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 9611–9620 (2022) 3
54. Shridhar, M., Manuelli, L., Fox, D.: Cliport: What and where pathways for robotic
manipulation. In: Conference on robot learning. pp. 894–906. PMLR (2022) 1
55. Shu, M., Nie, W., Huang, D.A., Yu, Z., Goldstein, T., Anandkumar, A., Xiao, C.:
Test-time prompt tuning for zero-shot generalization in vision-language models.
In: Advances in Neural Information Processing Systems. vol. 35, pp. 14274–14289
(2022) 11, 12
56. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In:
Advances in Neural Information Processing Systems. vol. 30, pp. 4080–4090 (2017)
7, 10
57. Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes
from videos in the wild. arXiv preprint arXiv:1212.0402 (2012) 11, 22, 24
58. Stepputtis, S., Campbell, J., Phielipp, M., Lee, S., Baral, C., Ben Amor, H.:
Language-conditioned imitation learning for robot manipulation tasks. In: Advances
in Neural Information Processing Systems. vol. 33, pp. 13139–13150 (2020) 1
59. Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J.: Vl-bert: Pre-training of
generic visual-linguistic representations. In: International Conference on Learning
Representations (2020), https://openreview.net/forum?id=SygXPaEYvH 3
60. Surís, D., Menon, S., Vondrick, C.: Vipergpt: Visual inference via python execution
for reasoning. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. pp. 11888–11898 (2023) 1
61. Tan, H., Bansal, M.: Lxmert: Learning cross-modality encoder representations from
transformers. In: Proceedings of the Conference on Empirical Methods in Natural
Language Processing. pp. 5100–5111 (2019) 3
62. Tian, X., Zou, S., Yang, Z., Zhang, J.: Argue: Attribute-guided prompt tuning for
vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (2024) 3
63. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov,
N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and
fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023) 24
64. Udandarao, V., Gupta, A., Albanie, S.: Sus-x: Training-free name-only transfer of
vision-language models. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision. pp. 2725–2736 (2023) 3, 9
65. Wang, H., Ge, S., Lipton, Z., Xing, E.P.: Learning robust global representations by
penalizing local predictive power. In: Advances in Neural Information Processing
Systems. vol. 32, pp. 10506–10518 (2019) 11, 24
66. Wang, H., Li, Y., Yao, H., Li, X.: Clipn for zero-shot ood detection: Teaching clip
to say no. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. pp. 1802–1812 (2023) 4
67. Wang, Y., Wang, H., Shen, Y., Fei, J., Li, W., Jin, G., Wu, L., Zhao, R., Le, X.: Semi-
supervised semantic segmentation using unreliable pseudo-labels. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp.
4248–4257 (2022) 4
68. Wang, Z., Liang, J., Sheng, L., He, R., Wang, Z., Tan, T.: A hard-to-beat baseline
for training-free CLIP-based adaptation. In: International Conference on Learning
Representations (2024), https://openreview.net/forum?id=Js5PJPHDyY 10, 14
69. Wei, X.S., Xu, H.Y., Zhang, F., Peng, Y., Zhou, W.: An embarrassingly simple
approach to semi-supervised few-shot learning. In: Advances in Neural Information
Processing Systems. vol. 35, pp. 14489–14500 (2022) 4
70. Wu, W., Yao, H., Zhang, M., Song, Y., Ouyang, W., Wang, J.: Gpt4vis: What can
gpt-4 do for zero-shot visual recognition? arXiv preprint arXiv:2311.15732 (2023)
25
71. Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale
scene recognition from abbey to zoo. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. pp. 3485–3492 (2010) 11, 22, 24
72. Xu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Hu, H., Bai, X.: A simple baseline for
open-vocabulary semantic segmentation with pre-trained vision-language model.
In: European Conference on Computer Vision. pp. 736–753. Springer (2022) 3
73. Yang, X., Wu, Y., Yang, M., Chen, H., Geng, X.: Exploring diverse in-context
configurations for image captioning. In: Advances in Neural Information Processing
Systems. vol. 36, pp. 40924–40943 (2023) 1
74. Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: Coca:
Contrastive captioners are image-text foundation models. Transactions on Machine
Learning Research (2022), https://openreview.net/forum?id=Ee277P3AYC 3, 24
75. Yu, T., Lu, Z., Jin, X., Chen, Z., Wang, X.: Task residual for tuning vision-language
models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 10899–10909 (2023) 3, 4, 6, 11, 12, 21, 22, 23
76. Yu, X., Liu, T., Gong, M., Tao, D.: Learning with biased complementary labels. In:
European Conference on Computer Vision. pp. 68–83. Springer (2018) 4
77. Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer,
L.: Lit: Zero-shot transfer with locked-image text tuning. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18123–
18133 (2022) 3
78. Zhang, J., Huang, J., Jin, S., Lu, S.: Vision-language models for vision tasks: A
survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) 1
79. Zhang, R., Zhang, W., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H.: Tip-
adapter: Training-free adaption of clip for few-shot classification. In: European
Conference on Computer Vision. pp. 493–510. Springer (2022) 1, 3, 4, 6, 9, 11, 14,
22
80. Zhang, Y., Zhang, C., Liao, Z., Tang, Y., He, Z.: Bdc-adapter: Brownian distance co-
variance for better vision-language reasoning. In: British Machine Vision Conference.
BMVA (2023), https://papers.bmvc2023.org/0182.pdf 3
81. Zhang, Y., Gong, M., Liu, T., Niu, G., Tian, X., Han, B., Schölkopf, B., Zhang,
K.: Adversarial robustness through the lens of causality. In: International Confer-
ence on Learning Representations (2022), https://openreview.net/forum?id=
cZAi1yWpiXQ 10
82. Zheng, M., Wang, F., You, S., Qian, C., Zhang, C., Wang, X., Xu, C.: Weakly
supervised contrastive learning. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision. pp. 10042–10051 (2021) 4
83. Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Conditional prompt learning for vision-
language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. pp. 16816–16825 (2022) 2, 3, 11, 12, 22
84. Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language
models. International Journal of Computer Vision 130(9), 2337–2348 (2022) 1, 2,
3, 11, 12, 14, 22, 23
85. Zhou, Y., Sonawani, S., Phielipp, M., Ben Amor, H., Stepputtis, S.: Learning
modular language-conditioned robot policies through attention. Autonomous Robots
47(8), 1013–1033 (2023) 1
86. Zhou, Z., Lei, Y., Zhang, B., Liu, L., Liu, Y.: Zegclip: Towards adapting clip for
zero-shot semantic segmentation. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 11175–11185 (2023) 1
87. Zhu, B., Niu, Y., Han, Y., Wu, Y., Zhang, H.: Prompt-aligned gradient for prompt
tuning. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. pp. 15659–15669 (2023) 11, 12, 14
88. Zhu, X., Zhang, R., He, B., Zhou, A., Wang, D., Zhao, B., Gao, P.: Not all features
matter: Enhancing few-shot clip with adaptive prior refinement. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision. pp. 2605–2615
(2023) 3
Negative Yields Positive: Unified Dual-Path Adapter for Vision-Language Models
Supplementary Material

In this supplementary document, we provide additional details and experimental results to offer further understanding of and insights into our proposed DualAdapter. The document is organized as follows:
• Full numerical results for the few-shot learning task are detailed in Section A.1.
• We compare our DualAdapter with other state-of-the-art methods on the domain
generalization task, using an alternative ViT backbone, in Section A.2.
• More sensitivity analyses of the hyper-parameters are conducted in Section A.3.
• Detailed statistics for all utilized datasets are provided in Section B.1.
• We list the specific positive and negative prompts we used for each dataset
in Section B.2.
• Finally, we explore potential avenues for future research in Section C.

A Additional Experimental Results


A.1 Full Numerical Results on Few-Shot Learning
In Fig. 5 in Section 4.2 of the main text, we evaluated our DualAdapter on the
few-shot learning task and compared it with other state-of-the-art methods. In
Table A1, we present the corresponding full numerical results. For our DualAdapter,
we also report the 95% confidence interval over 3 random seeds to ensure the
reliability of our results, and the last column gives the average recognition
accuracy over all 11 datasets. The results indicate that our DualAdapter consistently
outperforms other state-of-the-art methods across the various few-shot settings
by substantial margins.
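For completeness, the sketch below shows one way such an interval can be computed from the per-seed accuracies. The paper does not state its exact aggregation formula, so the Student-t interval used here is an assumption, and the accuracies in the example are hypothetical.

```python
import numpy as np
from scipy import stats

def mean_and_ci95(per_seed_acc):
    """Mean accuracy and 95% confidence-interval half-width over random seeds.

    Uses a Student-t interval, one common choice for only 3 seeds; this is an
    assumption, not the paper's stated procedure.
    """
    acc = np.asarray(per_seed_acc, dtype=float)
    n = acc.size
    mean = acc.mean()
    sem = acc.std(ddof=1) / np.sqrt(n)           # standard error of the mean
    half_width = stats.t.ppf(0.975, df=n - 1) * sem
    return mean, half_width

# Hypothetical per-seed accuracies (not taken from the paper):
mean, hw = mean_and_ci95([67.3, 67.5, 67.6])
print(f"{mean:.2f} (+/- {hw:.2f})")
```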
Our DualAdapter demonstrates superior recognition performance in nearly all tested
scenarios, with a few exceptions in the lower-shot settings of the Food101 [1]
and OxfordPets [45] datasets. We attribute this to overfitting, a common issue
that is not exclusive to our approach but also affects many existing methods,
especially TaskRes [75]: while TaskRes consistently secures the second-best
performance on the other 9 datasets, it underperforms significantly on these two.
We hypothesize that this issue arises from noisy training data, whose images
exhibit intense colors and occasionally carry wrong labels [1, 45]. In this work,
we have designed a label refinement mechanism specifically to address this issue;
as a result, our DualAdapter remains relatively robust and achieves the second-best
results on these two datasets.
Table A1: Full numerical results on few-shot learning task. For each dataset,
we report the mean accuracy and 95% confidence interval over 3 random seeds of our
DualAdapter on 1-/2-/4-/8-/16-shot settings. † We report the zero-shot performance of
CLIP [48] for all settings. For TaskRes [75], we report the results using the enhanced base
classifier (i.e., TaskRes*). The best results are in bold and the second-best are underlined.

Columns (left to right): Caltech101 [11], DTD [3], EuroSAT [18], FGVCAircraft [39], Flowers102 [44], Food101 [1], ImageNet [4], OxfordPets [45], StanfordCars [29], SUN397 [71], UCF101 [57], Avg.

1-shot
Zero-shot CLIP [48]   84.52 40.33 41.80 16.98 65.46 77.31 60.33 85.51 54.26 58.56 61.44 58.77
CoOp [84]             87.43 44.13 50.51  9.80 67.90 73.71 57.15 86.51 55.48 60.10 62.10 59.53
CoCoOp [83]           86.01 45.14 35.08 17.81 67.52 77.42 60.84 86.96 57.22 62.28 62.84 59.92
CLIP-Adapter [12]     88.70 45.66 61.51 17.21 73.43 76.77 61.20 85.99 55.14 61.28 62.29 62.65
Tip-Adapter-F [79]    89.38 50.31 59.16 20.83 80.13 77.61 61.32 86.47 58.51 62.51 64.91 64.65
TaskRes [75]          88.80 50.20 61.70 21.41 79.17 74.03 61.90 83.60 59.13 62.33 64.77 64.28
DualAdapter (Ours)    90.87 53.13 67.70 23.61 84.17 77.93 62.89 86.90 61.34 65.13 68.25 67.45
                      (±0.33) (±0.48) (±0.47) (±0.18) (±0.34) (±0.22) (±0.13) (±0.31) (±0.28) (±0.08) (±0.23) (±0.28)

2-shot
Zero-shot CLIP [48]   84.52 40.33 41.80 16.98 65.46 77.31 60.33 85.51 54.26 58.56 61.44 58.77
CoOp [84]             87.92 45.04 60.43 18.25 77.47 72.26 55.88 82.36 58.10 59.82 64.13 61.97
CoCoOp [83]           89.47 46.20 38.51 20.22 70.70 78.81 61.86 88.81 58.28 63.50 65.23 61.96
CLIP-Adapter [12]     89.32 51.81 64.11 20.10 81.77 77.20 61.52 86.73 58.71 62.21 67.27 65.52
Tip-Adapter-F [79]    89.81 54.00 65.82 23.47 82.50 77.83 61.69 87.10 62.05 63.55 66.23 66.73
TaskRes [75]          90.27 55.13 65.83 24.13 86.57 75.17 62.63 84.63 63.70 64.97 70.00 67.54
DualAdapter (Ours)    91.19 59.17 74.40 26.22 88.43 78.35 63.47 87.68 64.77 66.73 70.25 70.06
                      (±0.26) (±0.31) (±0.87) (±0.37) (±0.18) (±0.17) (±0.06) (±0.28) (±0.21) (±0.33) (±0.48) (±0.31)

4-shot
Zero-shot CLIP [48]   84.52 40.33 41.80 16.98 65.46 77.31 60.33 85.51 54.26 58.56 61.44 58.77
CoOp [84]             89.17 53.38 70.20 21.72 85.81 72.72 59.93 87.22 61.92 63.46 67.08 66.60
CoCoOp [83]           90.31 47.90 63.56 20.56 72.72 79.51 62.52 88.60 59.90 64.90 67.90 65.31
CLIP-Adapter [12]     89.98 57.02 73.18 22.99 87.30 77.93 61.84 87.36 62.26 65.90 68.90 68.61
Tip-Adapter-F [79]    90.67 57.78 73.85 26.01 89.02 78.26 62.52 87.72 64.82 66.13 70.87 69.79
TaskRes [75]          90.97 60.70 73.83 25.70 90.20 76.10 63.57 86.33 67.43 67.27 70.93 70.28
DualAdapter (Ours)    92.21 66.01 76.54 28.95 92.04 78.74 64.12 88.13 67.96 68.59 73.46 72.43
                      (±0.21) (±0.49) (±0.59) (±0.29) (±0.25) (±0.11) (±0.17) (±0.26) (±0.36) (±0.29) (±0.51) (±0.32)

8-shot
Zero-shot CLIP [48]   84.52 40.33 41.80 16.98 65.46 77.31 60.33 85.51 54.26 58.56 61.44 58.77
CoOp [84]             90.15 59.88 76.51 25.93 90.84 71.52 60.91 86.40 68.49 65.63 71.81 69.82
CoCoOp [83]           90.14 52.21 64.13 22.03 75.88 79.59 62.40 88.74 60.87 65.37 68.25 66.33
CLIP-Adapter [12]     91.22 60.70 78.34 25.77 91.79 78.01 62.68 87.70 67.78 67.52 73.02 71.32
Tip-Adapter-F [79]    91.54 62.67 77.83 30.21 91.85 78.71 64.00 88.07 69.53 68.80 74.50 72.52
TaskRes [75]          92.40 64.77 79.33 31.48 94.73 76.40 64.67 87.17 71.83 68.73 75.33 73.35
DualAdapter (Ours)    93.40 67.78 81.62 33.90 95.23 79.23 65.37 89.29 72.08 70.93 76.84 75.06
                      (±0.23) (±0.55) (±0.54) (±0.63) (±0.26) (±0.21) (±0.09) (±0.21) (±0.51) (±0.15) (±0.39) (±0.34)

16-shot
Zero-shot CLIP [48]   84.52 40.33 41.80 16.98 65.46 77.31 60.33 85.51 54.26 58.56 61.44 58.77
CoOp [84]             91.61 63.11 82.36 31.01 94.39 73.80 62.95 87.30 72.51 69.11 75.70 73.07
CoCoOp [83]           90.90 57.53 70.77 22.40 79.14 79.68 62.71 89.93 62.22 67.21 70.81 68.48
CLIP-Adapter [12]     92.44 66.14 82.76 31.83 93.91 78.21 63.59 87.91 74.12 69.59 76.80 74.30
Tip-Adapter-F [79]    92.93 67.33 83.80 35.50 95.01 79.50 65.51 89.71 75.50 71.31 78.01 75.83
TaskRes [75]          93.43 67.13 84.03 36.30 96.03 77.60 65.73 87.83 76.83 70.67 77.97 75.78
DualAdapter (Ours)    93.77 70.83 87.36 40.27 96.51 79.87 66.52 90.58 77.48 72.32 80.28 77.80
                      (±0.33) (±0.77) (±0.84) (±0.53) (±0.58) (±0.34) (±0.13) (±0.38) (±0.68) (±0.24) (±0.26) (±0.50)
A.2 More Results on Domain Generalization


In Table 1 in Section 4.3 of the main text, we compared the generalization
performance of our DualAdapter with other state-of-the-art methods under
distribution shifts using the ResNet-50 visual backbone. In Table A2, we present
the corresponding comparison on the domain generalization task using the ViT-B/32
visual backbone. Consistently, our DualAdapter not only achieves state-of-the-art
performance on the source dataset but also attains an average performance gain of
0.57% across the 4 out-of-distribution (OOD) target datasets. This verifies that
our DualAdapter demonstrates superior generalizability compared to other
state-of-the-art methods, independent of the visual backbone utilized.

Table A2: Performance comparison on robustness to distribution shifts. All the
models are trained on 16-shot ImageNet [4] and directly tested on the OOD target
datasets. The best results are in bold and the second-best are underlined.

Method                   Source: ImageNet   Target: -V2   -Sketch   -A      -R      Avg. (Target)
Zero-Shot CLIP [48]      62.05              54.79         40.82     29.57   65.99   47.79
Linear Probe CLIP [48]   59.58              49.73         28.06     19.67   47.20   36.17
CoOp [84]                66.85              58.08         40.44     30.62   64.45   48.40
TaskRes [75]             68.20              59.20         42.50     31.43   69.33   50.62
GraphAdapter [32]        68.47              59.10         42.70     31.73   69.43   50.74
DualAdapter (Ours)       69.63              59.76         43.41     32.48   69.60   51.31
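As a quick sanity check of the reported 0.57% figure, the snippet below averages the per-dataset gains of DualAdapter over GraphAdapter (the strongest baseline in Table A2) on the four OOD targets; attributing the gain to this particular baseline is our reading of the table, not an explicit statement in the text.

```python
# Target-dataset accuracies from Table A2 (ViT-B/32): -V2, -Sketch, -A, -R.
dualadapter  = [59.76, 43.41, 32.48, 69.60]
graphadapter = [59.10, 42.70, 31.73, 69.43]  # strongest baseline per column

gains = [d - g for d, g in zip(dualadapter, graphadapter)]
print(round(sum(gains) / len(gains), 2))  # -> 0.57
```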

A.3 More Sensitivity Analyses of Hyper-Parameters


Building upon the sensitivity analyses of λ and τ detailed in Section 4.4 of the
main text, this section extends our examination to the sensitivity of the
parameters α and β on 16-shot ImageNet [4]. In our experiments on ImageNet [4],
we set the hyper-parameters α and β defined in Section 3 to 1.2 and 2.0,
respectively. To comprehensively investigate the effects of different
hyper-parameters, we conducted a sensitivity experiment in which we varied each
hyper-parameter individually and evaluated the performance on 16-shot
ImageNet [4], as reported in Table A3. We can see that our choice of α = 1.2 and
β = 2.0 yields the highest performance. Moreover, our DualAdapter maintains
robust performance when these two hyper-parameters are adjusted, since it
includes adapters from both the textual and visual modalities and each of our
four adapters works effectively on its own, as presented in Table 2 in the main text.

Table A3: Sensitivity of hyper-parameters. All the results are reported on 16-shot ImageNet [4].

α          0.0     0.5     1.0     1.2     1.5     2.0
Accuracy   66.34   66.40   66.47   66.52   66.44   66.36

β          1.0     1.5     2.0     2.5     3.0     3.5
Accuracy   66.38   66.44   66.52   66.50   66.48   66.40
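The sweep follows a simple one-at-a-time protocol: vary one hyper-parameter over the grid in Table A3 while the other is held fixed, presumably at its default value of 1.2 or 2.0. The sketch below illustrates this protocol; `evaluate_imagenet_16shot` is a hypothetical stand-in for training and evaluating DualAdapter on 16-shot ImageNet with the given values, not a function from the released code.

```python
# One-at-a-time sensitivity sweep over alpha and beta, mirroring Table A3.
BEST_ALPHA, BEST_BETA = 1.2, 2.0

def sweep(evaluate_imagenet_16shot):
    """Vary each hyper-parameter individually while fixing the other."""
    results = {"alpha": {}, "beta": {}}
    for alpha in [0.0, 0.5, 1.0, 1.2, 1.5, 2.0]:
        results["alpha"][alpha] = evaluate_imagenet_16shot(alpha=alpha, beta=BEST_BETA)
    for beta in [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]:
        results["beta"][beta] = evaluate_imagenet_16shot(alpha=BEST_ALPHA, beta=beta)
    return results
```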

B Additional Implementation Details


B.1 Dataset Details
In Table B4, we present detailed statistics for each dataset used in our
experiments, including the number of classes, the sizes of the training,
validation, and testing sets, and the original task of each dataset.
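All few-shot experiments draw K ∈ {1, 2, 4, 8, 16} labeled images per class from the training splits listed in Table B4. A generic sketch of this standard protocol is given below, assuming the split is provided as (image path, label) pairs; it is not the exact split code used in our experiments.

```python
import random
from collections import defaultdict

def build_k_shot_split(samples, k, seed=0):
    """Sample k labeled images per class to form a k-shot training subset.

    `samples` is a list of (image_path, label) pairs, e.g. the training split of
    one of the datasets in Table B4. This is a generic sketch of the common
    few-shot protocol, not the paper's split code.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))
    subset = []
    for label, items in by_class.items():
        rng.shuffle(items)
        subset.extend(items[:k])
    return subset
```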
Table B4: Detailed statistics of datasets used in experiments. Note that the last
4 ImageNet variant datasets are designed for evaluation and only contain the test sets.

Dataset Classes Training Validation Testing Task


Caltech101 [11] 100 4,128 1,649 2,465 Object recognition
DTD [3] 47 2,820 1,128 1,692 Texture recognition
EuroSAT [18] 10 13,500 5,400 8,100 Satellite image recognition
FGVCAircraft [39] 100 3,334 3,333 3,333 Fine-grained aircraft recognition
Flowers102 [44] 102 4,093 1,633 2,463 Fine-grained flowers recognition
Food101 [1] 101 50,500 20,200 30,300 Fine-grained food recognition
ImageNet [4] 1,000 1.28M - 50,000 Object recognition
OxfordPets [45] 37 2,944 736 3,669 Fine-grained pets recognition
StanfordCars [29] 196 6,509 1,635 8,041 Fine-grained car recognition
SUN397 [71] 397 15,880 3,970 19,850 Scene recognition
UCF101 [57] 101 7,639 1,898 3,783 Action recognition
ImageNet-V2 [50] 1,000 - - 10,000 Robustness of collocation
ImageNet-Sketch [65] 1,000 - - 50,889 Robustness of sketch domain
ImageNet-A [20] 200 - - 7,500 Robustness of adversarial attack
ImageNet-R [19] 200 - - 30,000 Robustness of multi-domains

Table B5: Positive and negative prompts used in experiments. In addition to these prompts, we also employ CuPL [46] prompts to further enhance performance.

Dataset                       Positive Prompts                        Negative Prompts
ImageNet [4] and its          “itap of a {CLASS}.”                    “itap without any {CLASS}.”
variants (ImageNet-V2 [50],   “a bad photo of the {CLASS}.”           “a bad photo with no {CLASS} in it.”
ImageNet-Sketch [65],         “a origami {CLASS}.”                    “a origami that isn’t a {CLASS}.”
ImageNet-A [20],              “a photo of the large {CLASS}.”         “a photo with no large {CLASS}.”
ImageNet-R [19])              “a {CLASS} in a video game.”            “a video game scene without a {CLASS}.”
                              “art of the {CLASS}.”                   “art that doesn’t include a {CLASS}.”
                              “a photo of the small {CLASS}.”         “a photo with no small {CLASS}.”
Caltech101 [11] “a photo of a {CLASS}.” “a photo without {CLASS}.”
DTD [3] “{CLASS} texture.” “not {CLASS} texture.”
EuroSAT [18] “a centered satellite photo of {CLASS}.” “a centered satellite photo without {CLASS}.”
FGVCAircraft [39] “a photo of a {CLASS}, a type of aircraft.” “a photo without {CLASS}, a type of aircraft.”
Flowers102 [44] “a photo of a {CLASS}, a type of flower.” “a photo without {CLASS}, a type of flower.”
Food101 [1] “a photo of {CLASS}, a type of food.” “a photo without {CLASS}, a type of food.”
OxfordPets [45] “a photo of a {CLASS}, a type of pet.” “a photo without {CLASS}, a type of pet.”
StanfordCars [29] “a photo of a {CLASS}.” “a photo of no {CLASS}.”
SUN397 [71] “a photo of a {CLASS}.” “a photo without {CLASS}.”
UCF101 [57] “a photo of a person doing {CLASS}.” “a photo of a person not doing {CLASS}.”

B.2 Positive and Negative Prompts

In Table B5, we detail the specific positive and negative prompts utilized for
each dataset. Additionally, as mentioned in Section 4.1, we incorporate prompts
from CuPL [46] to further enhance model performance.
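As a concrete illustration, the sketch below builds per-class positive and negative text embeddings from the generic templates in Table B5 using the public CLIP package, and combines the two similarity scores with a simple subtraction. This fusion rule is only a toy stand-in for the dual-path idea and is not DualAdapter's exact unified prediction; the class names are hypothetical.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

# Generic templates from Table B5 (e.g., Caltech101, SUN397).
POS_TEMPLATE = "a photo of a {}."
NEG_TEMPLATE = "a photo without {}."

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

def encode_prompts(classnames, template):
    """Encode one prompt per class and L2-normalize the text features."""
    tokens = clip.tokenize([template.format(c) for c in classnames]).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

classnames = ["golden retriever", "tabby cat", "goldfish"]  # hypothetical classes
pos_text = encode_prompts(classnames, POS_TEMPLATE)  # (num_classes, dim)
neg_text = encode_prompts(classnames, NEG_TEMPLATE)  # (num_classes, dim)

def dual_path_scores(image_features):
    """Toy dual-path fusion: reward positive similarity, penalize negative.

    DualAdapter's actual unified prediction combines learned positive and
    negative adapters; the subtraction here is only for illustration.
    """
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    sim_pos = 100.0 * image_features @ pos_text.t()   # similarity to "what it is"
    sim_neg = 100.0 * image_features @ neg_text.t()   # similarity to "what it isn't"
    return sim_pos - sim_neg
```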

C Future Work

In this work, we introduce the first dual-path adapter-style fine-tuning framework
for VLMs. We believe that the same concept of dual learning can also be applied
to prompt-based learning methods and extended to fine-tune other foundation
models (e.g., other VLMs [24, 74] and LLMs [5, 63]). Besides, we notice that
considerable research effort has been dedicated to designing better
prompts (e.g., using LLMs [40, 46, 70]) to fully exploit the capabilities of CLIP.
We hope that, with our work, future research can also be directed toward
investigating the use of negative prompts to better activate these negative
inference capabilities, further broadening the scope and effectiveness of CLIP.
