
Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

An Yang∗1 , Junshu Pan∗1,2† , Junyang Lin∗1 ,


Rui Men1 , Yichang Zhang1 , Jingren Zhou1 , Chang Zhou1♣
1 DAMO Academy, Alibaba Group
2 Beihang University
{ya235025, panjunshu.pjs, junyang.ljy, ericzhou.zc}@alibaba-inc.com

Abstract

The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining. In this work, we construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on the new dataset. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Furthermore, we propose a two-stage pretraining method, where the model is first trained with the image encoder frozen and then trained with all parameters being optimized, to achieve enhanced model performance. Our comprehensive experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in the setups of zero-shot learning and finetuning, and it is able to achieve competitive performance in zero-shot image classification based on the evaluation on the ELEVATER benchmark (Li et al., 2022b). We have released our codes, models, and demos.1

[Figure 1: An illustration of pretraining Chinese CLIP. To leverage the advantages of the existing pretrained models, we initialize the image encoder with the weights of OpenAI CLIP models, and the text encoder with the Chinese RoBERTa models. In Stage 1, we freeze the weights of the image encoder to avoid weight optimization, and in Stage 2, we unfreeze it and optimize both encoders. The figure depicts the two-tower image encoder and text encoder trained with a contrastive loss on pairs such as 一只金丝猴的照片 (Photo of a golden monkey).]

∗ Co-first authors. ♣ Corresponding author. † Work done as an intern in DAMO Academy.
1 Github: https://github.com/OFA-Sys/Chinese-CLIP; ModelScope: https://www.modelscope.cn/models

1 Introduction

Starting from the burst of pretraining in NLP, foundation models have attracted attention from multiple research communities. Foundation models that learn from large-scale unsupervised or weakly supervised data play as the basis of downstream models. Following natural language pretraining, the transfer of BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020) to multimodal pretraining was also successful (Lu et al., 2019; Chen et al., 2020; Zhang et al., 2021; Kim et al., 2021; Wang et al., 2021b, 2022a,b). They achieved new state-of-the-art performance in multiple vision-language downstream tasks, e.g., visual question answering, image captioning, cross-modal retrieval, etc. This reflects the great potential of building a foundation model pretrained on large-scale data for cross-modal representation learning.

A milestone of foundation models in multimodal representation learning is CLIP (Radford et al., 2021). Different from the conventional generative pretraining, CLIP is a contrastive-learning-based model pretrained on a large-scale dataset of around 400 million image-text pairs collected from the web. Despite the simplicity of the method, CLIP not only achieved outstanding performance in vision-language retrieval but more importantly played as a vision foundation model and demonstrated state-of-the-art performance in zero-shot image classification across a series of datasets. Furthermore, its simplicity facilitates its application to downstream tasks, and thus we have
witnessed a series of studies following or utilizing CLIP, e.g., DALL-E (Ramesh et al., 2021, 2022), ViLD (Gu et al., 2021), StyleCLIP (Patashnik et al., 2021), SLIP (Mu et al., 2021), CLIP4Clip (Luo et al., 2021), etc. CLIP, which builds a connection between vision and language, has been transforming the research in both multimodal representation learning and computer vision.

However, it is still a pity that there is no such model that strictly follows the design of CLIP in the community of Chinese multimodal representation learning. Previous Chinese multimodal pretrained models are mainly generative multimodal pretrained models (Lin et al., 2021a; Yang et al., 2021; Lin et al., 2021b), or contrastive-learning-based models with additional complex model and training task designs (Fei et al., 2021; Gu et al., 2022; Xie et al., 2022). We assume that it is sufficient for the simple implementation of CLIP on large-scale Chinese data to achieve satisfactory performance in downstream tasks, and we are convinced that the simple architectural design of CLIP is essential to the application of multimodal pretrained models in real-world scenarios.

In this work, we propose Chinese CLIP (CN-CLIP for short), the CLIP model pretrained on a large-scale dataset of Chinese image-text pairs. Specifically, we construct a dataset of nearly 200 million image-text pairs, most of which are publicly available data. We build 5 Chinese CLIP models, ranging from 77 to 958 million parameters. For the pretraining, we implement a two-stage pretraining method. In Stage 1, the image encoder is frozen, while in Stage 2, both encoders are optimized, as demonstrated in Figure 1. Our ablation study shows that it can achieve improved model performance compared with the other pretraining options. We evaluate Chinese CLIP on 3 Chinese cross-modal retrieval datasets, including MUGE2, Flickr30K-CN (Lan et al., 2017), and COCO-CN (Li et al., 2019c). Experimental results demonstrate that both the large-size and huge-size Chinese CLIP reach state-of-the-art performance on the 3 datasets in the setups of both zero-shot learning and finetuning. Additionally, we evaluate the capability of zero-shot image classification on the track “Image Classification in the Wild” of the ELEVATER benchmark (Li et al., 2022b)3. Specifically, we transform the datasets with manual translation for the evaluation of Chinese models.4 On the classification datasets, Chinese CLIP demonstrates competitive performance in comparison with both English and Chinese models.

2 https://tianchi.aliyun.com/muge
3 https://eval.ai/web/challenges/challenge-page/1832
4 We will soon release the datasets and templates for prompt ensembling.

In brief, our contributions are:

• We propose Chinese CLIP, a simple implementation of CLIP pretrained on our collected large-scale Chinese image-text pair data, and we implement a two-stage pretraining method to achieve improved performance.

• Chinese CLIP achieves state-of-the-art performance in cross-modal retrieval in the setups of zero-shot learning and finetuning and competitive performance in zero-shot image classification.

In the following sections, we introduce the detailed implementation of Chinese CLIP and illustrate the experimental results and analyses of the model. Finally, we review the related work, point out our limitations, and conclude the paper.

2 Method

In this section, we first introduce the preliminaries about CLIP and then illustrate the details of the data construction and the implementation of Chinese CLIP.

2.1 Preliminaries

Before introducing our implementation, we first demonstrate the architecture and training methods of CLIP. CLIP consists of a classical two-tower architecture with an image encoder and a text encoder. An image encoder can be a vision backbone, e.g., ResNet (He et al., 2016), ViT (Dosovitskiy et al., 2021), etc., and a text encoder can be a Transformer model, e.g., BERT (Devlin et al., 2019), GPT (Radford et al., 2018), etc. In practice, Radford et al. (2021) proposed CLIP models of multiple model scales with different vision and language backbones. The vision backbones include 5 ResNets and 3 ViTs. The 5 ResNets include ResNet-50, ResNet-101, RN50x4, RN50x16, and RN50x64, and the 3 Transformers include ViT-B/32, ViT-B/16, and ViT-L/14. The attention-pooled representation of the top-level features from ResNets or the top-level feature at the [CLS] position from ViTs is treated as the image representation. The top-level feature at the [EOS] position
from the Transformer is treated as the text representation.

To train the model, Radford et al. (2021) applied large-batch contrastive learning. Given N samples in a batch, we have N × N possible image-text pairs, where there are N positive samples and N^2 - N negative samples. Thus we have N^2 similarity scores with inner products of the L2-normalized features, and we optimize the cross-entropy loss with a learned temperature to control the softmax distribution.
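To make the objective concrete, the following PyTorch sketch (ours, not the released training code) implements the symmetric cross-entropy over the N × N similarity matrix with a learned temperature; the encoder outputs and the logit_scale parameter are assumed inputs.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, logit_scale):
    # image_feats, text_feats: [N, d] outputs of the two encoders
    # logit_scale: learnable scalar tensor (log of the inverse temperature)
    image_feats = F.normalize(image_feats, dim=-1)  # L2-normalize features
    text_feats = F.normalize(text_feats, dim=-1)
    # N x N similarity matrix scaled by the learned temperature
    logits_per_image = logit_scale.exp() * image_feats @ text_feats.t()
    logits_per_text = logits_per_image.t()
    # the i-th image matches the i-th text, so the diagonal holds the N positives
    labels = torch.arange(image_feats.size(0), device=image_feats.device)
    loss_i = F.cross_entropy(logits_per_image, labels)  # image-to-text direction
    loss_t = F.cross_entropy(logits_per_text, labels)   # text-to-image direction
    return (loss_i + loss_t) / 2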
CLIP can transfer to cross-modal retrieval. Also, the image encoder can play as a vision backbone to replace the original ResNets or ViTs trained on ImageNet (Deng et al., 2009). Furthermore, as CLIP builds a connection between vision and language, it is possible to implement open-vocabulary zero-shot classification by computing similarities between the given image and the candidate labels equipped with a template, e.g., “a photo of [label]”. To improve the performance in zero-shot image classification, it is a common practice to apply multiple templates to transform a label into multiple sentences, and calculate the mean of the scores of the sentences for a single label. This is also known as prompt ensembling.
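As an illustration of prompt ensembling, the sketch below (a simplified example, not the released inference code) averages the similarity scores of several template-filled sentences per label; the encode_text callable and the template strings are placeholder assumptions.

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_scores(image_feat, class_names, templates, encode_text):
    # image_feat: [d] L2-normalized image feature
    # encode_text: callable mapping a list of strings to a [T, d] feature tensor
    scores = []
    for name in class_names:
        texts = [t.format(name) for t in templates]      # e.g., "a photo of a {}"
        feats = F.normalize(encode_text(texts), dim=-1)  # [num_templates, d]
        scores.append((feats @ image_feat).mean())       # mean score over templates
    return torch.stack(scores)                           # one score per candidate label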
2.2 Data

One key to CLIP’s success should be the large-scale dataset for pretraining. Based on the experiments of a CLIP reimplementation (Ilharco et al., 2021), scaling up data and lengthening the training process can consistently improve the model performance in zero-shot learning. This year, the most recent multimodal pretrained models Wukong (Gu et al., 2022) and R2D2 (Xie et al., 2022) were pretrained on a public dataset of 100 million image-text pairs and an in-house dataset of 250 million samples, where only a subset of 23 million samples was released. To facilitate reimplementation, we aim at pretraining Chinese CLIP on as many publicly available data as possible, and thus we focus on collecting high-quality public datasets. We extract the Chinese data (with the mark “zh”) from the latest LAION-5B (Schuhmann et al., 2021), and collect the data from the Wukong dataset. However, due to the problems of unavailable links, we can only collect around 108 million samples and 72 million samples from LAION-5B and Wukong respectively. We additionally add the translated data from the classic English multimodal datasets, including Visual Genome (Krishna et al., 2017) and MSCOCO (Chen et al., 2015), where test sets are removed. Finally, we construct a dataset for Chinese multimodal pretraining with around 200 million image-text pairs.5

5 We add around 20 million high-quality internal image-text pairs to provide more diversity.

The following illustrates the procedure for data preprocessing. For the part of data from LAION-5B, we remove the samples with CLIP scores lower than 0.26 computed by mCLIP (Carlsson et al., 2022). Besides, we remove the samples with captions containing words in our internal blacklist. The blacklist contains words related to advertising, image filenames, etc. We remove those samples that are too short (fewer than 5 characters) or too long (more than 50 characters). For the images, we resize them to the resolution of 224 × 224 for most cases and 336 × 336 for the ViT-L/14@336px.
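A minimal sketch of these caption-level filtering rules is given below; the blacklist entries and argument names are invented for illustration, and the 0.26 threshold applies only to the LAION-5B portion scored by mCLIP.

BLACKLIST = {"促销", "限时", ".jpg", ".png"}  # hypothetical ad words and filename fragments

def keep_sample(caption, mclip_score=None, min_len=5, max_len=50, score_threshold=0.26):
    # Drop LAION-5B samples whose mCLIP image-text score is too low.
    if mclip_score is not None and mclip_score < score_threshold:
        return False
    # Drop captions that are too short or too long.
    if not (min_len <= len(caption) <= max_len):
        return False
    # Drop captions that hit the internal blacklist.
    if any(word in caption for word in BLACKLIST):
        return False
    return True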
2.3 Pretraining Method

There are multiple design choices for pretraining the Chinese CLIP models. One of the simplest methods should be pretraining from scratch, where both the image and text encoders are randomly initialized. However, we assume that its performance will be limited by the quantity and quality of the current pretraining data. To leverage the advantages of existing pretrained models, we initialize the models with weights from the pretrained checkpoints from the official release of CLIP 6 for the image encoder, and RoBERTa-wwm-ext and RBT3 7 for the text encoder. To adapt the model to the introduced pretraining data, one option is to pretrain it with “contrastive tuning”, similar to the way to transfer CLIP to downstream retrieval data. In comparison with contrastive tuning, Locked-image Tuning (LiT) (Zhai et al., 2022) demonstrated improved performance in downstream transfer.

In this work, we propose a two-stage pretraining method, as shown in Figure 1. The core idea is to first utilize LiT to enable the text encoder to read out high-quality representations from the foundation vision model from OpenAI CLIP, and then transfer the whole model to the domain of the introduced pretraining data. It is not sufficient to pretrain Chinese CLIP with solely LiT, as the image encoder should learn the information of the images of the Chinese datasets and model the distribution of such data.

6 https://github.com/openai/CLIP
7 https://github.com/ymcui/Chinese-BERT-wwm
Before the two-stage pretraining, we first initialize both encoders with pretrained models. In Stage 1, we “lock” the image encoder by freezing its parameters during pretraining. We only pretrain the text encoder for vision-language alignment, based on the assumption that the vision backbone with pretrained weights is already a powerful vision foundation model (Zhai et al., 2022; Gu et al., 2022). We pretrain it until there is no salient performance improvement in downstream tasks even if we prolong the pretraining process. Then we switch to Stage 2, where we “unlock” the image encoder by enabling its optimization. In Stage 2, we continue pretraining without any parameter frozen, so that the image encoder can learn to model the distribution of the image data from Chinese websites. In the ablation study, we discuss the influence of the initialization of the pretrained checkpoints and pretraining methods on the downstream performance. Experimental results show that the two-stage pretraining method can outperform either pretraining from scratch or directly finetuning from the pretrained models.
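The following PyTorch sketch illustrates the two-stage schedule under the assumption of a model object with an image_encoder attribute and a user-provided train_step function; it is a simplified illustration rather than our actual training script.

import torch

def set_image_encoder_trainable(model, trainable):
    for p in model.image_encoder.parameters():
        p.requires_grad = trainable

def two_stage_pretrain(model, stage1_batches, stage2_batches, train_step):
    # Stage 1: lock the image tower (LiT-style); only the text tower is optimized.
    set_image_encoder_trainable(model, False)
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad])
    for batch in stage1_batches:
        train_step(model, batch, opt)   # contrastive loss as in Section 2.1

    # Stage 2: unfreeze everything and continue contrastive pretraining.
    set_image_encoder_trainable(model, True)
    opt = torch.optim.AdamW(model.parameters())
    for batch in stage2_batches:
        train_step(model, batch, opt)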
2.4 Scaling

We develop 5 Chinese CLIP models of different sizes, spanning from around 77 to 958 million parameters. We include 1 ResNet-50 model, CN-CLIPRN50, and 4 ViT models, i.e., CN-CLIPViT-B/16, CN-CLIPViT-L/14, CN-CLIPViT-L/14@336px and CN-CLIPViT-H/14, where models are pretrained on images of the resolution of 224 × 224 unless otherwise specified. Table 1 presents the details of the model architecture. The smallest model CN-CLIPRN50 consists of a ResNet-50 for the image encoder and a RBT3 for the text encoder. The base-size model CN-CLIPViT-B/16 consists of a ViT-B/16@224px for the image encoder and a RoBERTa-wwm-Base for the text encoder. The large-size model CN-CLIPViT-L/14 consists of a ViT-L/14@224px for the image encoder and a RoBERTa-wwm-Base for the text encoder, while CN-CLIPViT-L/14@336px consists of a ViT-L/14@336px and a RoBERTa-wwm-Base. Specifically, we pretrain CN-CLIPViT-L/14@336px by continuing pretraining on the pretrained CN-CLIPViT-L/14. For the adaptation to a larger resolution, we initialize the image positional embedding by applying interpolation to the positional embedding of CN-CLIPViT-L/14, following Ilharco et al. (2021). The huge-size model CN-CLIPViT-H/14 consists of a ViT-H/14 for the image encoder and RoBERTa-wwm-Large for the text encoder. More implementation details are presented in Appendix A.
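The resolution adaptation can be sketched as below, assuming a ViT positional embedding of shape [1 + H*W, d] with a leading [CLS] entry; this follows the common OpenCLIP-style interpolation recipe rather than reproducing our exact code. For ViT-L/14, the grid grows from 224/14 = 16 to 336/14 = 24 patches per side.

import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=16, new_grid=24):
    # pos_embed: [1 + old_grid*old_grid, d]
    cls_pos, patch_pos = pos_embed[:1], pos_embed[1:]            # split off [CLS]
    d = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(new_grid * new_grid, d)
    return torch.cat([cls_pos, patch_pos], dim=0)                # [1 + new_grid*new_grid, d]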
3 Evaluation

To comprehensively probe the effects of Chinese CLIP, we follow the conventional practice that we first evaluate its basic capabilities of cross-modal retrieval, i.e., text-to-image retrieval and image-to-text retrieval, in different domains, including e-commerce and the general domain. Additionally, as the contrastive-learning-based pretraining builds a foundation vision model that is semantically connected with natural language, we follow Radford et al. (2021) and evaluate its capabilities of zero-shot classification. Specifically, we validate Chinese CLIP on the classification datasets of the ELEVATER benchmark, which is known as “Image Classification in the Wild”.

3.1 Cross-modal Retrieval

This section illustrates the datasets and evaluation metrics and reports our experimental results. We also provide an ablation study to show the effectiveness of our proposed two-stage pretraining method.

3.1.1 Datasets and Metrics

We validate Chinese CLIP on 3 cross-modal retrieval datasets, namely MUGE-Retrieval, Flickr30K-CN (Lan et al., 2017), and COCO-CN (Li et al., 2019c). MUGE-Retrieval is an image-text retrieval dataset, where data are extracted from Chinese E-commerce websites. Flickr30K-CN and COCO-CN are built from the classical datasets Flickr30K and MSCOCO-1K whose texts are translated into Chinese. Our evaluation includes setups of zero-shot learning and finetuning. For zero-shot learning, we use Chinese CLIP models to compute the similarity scores between images and texts and return the top-K most similar candidates. For finetuning, we finetune the Chinese CLIP models for cross-modal retrieval with contrastive tuning. The evaluation is the same as that in zero-shot learning. The evaluation metrics are Recall@K, where K = {1, 5, 10}, and Mean Recall (MR, i.e., the average of Recall@K).
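For clarity, the sketch below shows one way Recall@K and MR can be computed from a query-by-candidate similarity matrix, simplified to the case of a single ground-truth candidate per query; it is an illustration of the metrics rather than our evaluation code.

import torch

def recall_at_k(similarity, gt_index, ks=(1, 5, 10)):
    # similarity: [num_queries, num_candidates]; gt_index: [num_queries]
    ranked = similarity.argsort(dim=-1, descending=True)       # candidate ids sorted per query
    hits = ranked.eq(gt_index.unsqueeze(-1))                   # marks the ground-truth position
    recalls = {k: hits[:, :k].any(dim=-1).float().mean().item() for k in ks}
    recalls["MR"] = sum(recalls[k] for k in ks) / len(ks)      # Mean Recall over the K values
    return recalls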
Model #Params (All) Backbone (I) #Params (I) Backbone (T) #Params (T) Resolution
CN-CLIPRN50 77M ResNet-50 38M RBT3 39M 224 × 224
CN-CLIPViT-B/16 188M ViT-B/16 86M RoBERTa-wwm-Base 102M 224 × 224
CN-CLIPViT-L/14 406M ViT-L/14 304M RoBERTa-wwm-Base 102M 224 × 224
CN-CLIPViT-L/14@336px 407M ViT-L/14 304M RoBERTa-wwm-Base 102M 336 × 336
CN-CLIPViT-H/14 958M ViT-H/14 632M RoBERTa-wwm-Large 326M 224 × 224

Table 1: Hyperparameters of Chinese CLIP models of different sizes.

Tasks Zero-shot Finetuning


Metrics R@1 R@5 R@10 MR R@1 R@5 R@10 MR
Tiny-size Model
CN-CLIPRN50 42.6 68.6 77.9 63.0 48.6 75.1 84.0 69.2
Base-size Models
WukongViT-B/32 33.4 59.3 69.7 54.1 39.2 66.9 77.4 61.2
R2D2ViT-B - - - - 47.4 75.1 83.5 68.7
CN-CLIPViT-B/16 52.1 76.7 84.4 71.1 58.4 83.6 90.0 77.4
Large-size Models
WukongViT-L/14 42.7 69.0 78.0 63.2 52.7 77.9 85.6 72.1
R2D2ViT-L/14 49.5 75.7 83.2 69.5 60.1 82.9 89.4 77.5
CN-CLIPViT-L/14 56.3 79.8 86.2 74.1 63.3 85.6 91.3 80.1
CN-CLIPViT-L/14@336px 59.0 81.4 87.8 76.1 65.3 86.7 92.1 81.3
Huge-size Models
CN-CLIPViT-H/14 63.0 84.1 89.2 78.8 68.9 88.7 93.1 83.6

Table 2: Experimental results on MUGE-Retrieval. We report the performance of both baselines and Chinese CLIP
models on text-to-image retrieval and image-to-text retrieval in the setups of zero-shot evaluation and finetuning.

For comparison, we choose the base-size and large-size Wukong and R2D2 as the baselines, which are the previous SOTA models in Chinese multimodal representation learning. Following these baselines, we report validation performance on MUGE and test performance on Flickr30K-CN and COCO-CN. Note that in the setup of finetuning, R2D2 8 is essentially an end-to-end model of retrieval and ranking.

8 Since the original paper of R2D2 does not provide the patch size of their base-size model, here we denote the model as R2D2ViT-B.

3.1.2 Results

Table 2 reports the model performance on MUGE-Retrieval. For the base-size model, CN-CLIPViT-B/16 outperforms the baselines on all metrics and in both setups of zero-shot learning and finetuning. Specifically, for the base-size models, CN-CLIPViT-B/16 surpasses WukongViT-B/32 by 17.0 MR in zero-shot learning and surpasses R2D2ViT-B by 8.7 MR in finetuning. Besides, the tiny model CN-CLIPRN50 can outperform the base-size WukongViT-B/32 by 8.9 MR in zero-shot learning and 8.0 MR in finetuning.

For the large-size models, CN-CLIPViT-L/14 can outperform both baselines in all metrics, and CN-CLIPViT-L/14@336px pretrained on images of a larger resolution can achieve the state-of-the-art performance. CN-CLIPViT-L/14@336px outperforms R2D2ViT-L/14 by 6.6 MR in zero-shot learning and 3.8 MR in finetuning. When scaling to CN-CLIPViT-H/14, the performance is further improved. Compared with the best large-size model CN-CLIPViT-L/14@336px, CN-CLIPViT-H/14 surpasses it by 2.7 MR in zero-shot learning and 2.3 MR in finetuning.

Tables 3 and 4 report the model performance on Flickr30K-CN and COCO-CN. We focus on the evaluation of R@1. In both datasets, CN-CLIP achieves better performance than the baselines. For the base-size models, in the setup of zero-shot learning of Flickr30K-CN, CN-CLIPViT-B/16 surpasses WukongViT-B/32 by 17.0 R@1 in text-to-image retrieval and 8.4 R@1 in image-to-text retrieval, and in the finetuning setup, CN-CLIPViT-B/16 surpasses R2D2ViT-B by 0.8 R@1 in image retrieval and 0.9 R@1 in text retrieval. Similarly, in the finetuning setup of COCO-CN, CN-CLIPViT-B/16 surpasses R2D2ViT-B by 1.9 R@1 in image retrieval and 1.3 R@1 in text retrieval. For the tiny-size CN-CLIPRN50, it again achieves or surpasses the performance of WukongViT-B/32 in several metrics of Flickr30K-CN and COCO-CN.
Tasks Text-to-Image Image-to-Text
Setups Zero-shot Finetuning Zero-shot Finetuning
Metrics R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
Tiny-size Model
CN-CLIPRN50 48.8 76.0 84.6 66.7 89.4 94.1 60.0 85.9 92.0 84.2 96.7 98.0
Base-size Models
WukongViT-B/32 45.7 73.8 82.2 67.6 89.6 94.2 66.2 88.7 94.3 83.9 97.6 99.0
R2D2ViT-B - - - 78.3 94.6 97.0 - - - 92.6 99.1 99.8
CN-CLIPViT-B/16 62.7 86.9 92.8 79.1 94.8 97.4 74.6 93.5 97.1 93.5 99.0 99.5
Large-size Models
WukongViT-L/14 51.7 78.9 86.3 77.4 94.5 97.0 76.1 94.8 97.5 92.7 99.1 99.6
R2D2ViT-L/14 60.9 86.8 92.7 84.4 96.7 98.4 77.6 96.7 98.9 95.6 99.8 100.0
CN-CLIPViT-L/14 68.0 89.7 94.4 82.7 96.7 98.6 80.2 96.6 98.2 96.1 99.5 99.9
CN-CLIPViT-L/14@336px 69.0 90.7 95.4 84.4 97.1 98.7 83.3 97.2 98.5 96.6 99.8 100.0
Huge-size Models
CN-CLIPViT-H/14 71.2 91.4 95.5 83.8 96.9 98.6 81.6 97.5 98.8 95.3 99.7 100.0

Table 3: Experimental results on Flickr30K-CN. We report the performance of both baselines and Chinese CLIP
models on text-to-image retrieval and image-to-text retrieval in the setups of zero-shot evaluation and finetuning.

Specifically, CN-CLIPRN50 surpasses WukongViT-B/32 by 3.1 R@1 in the zero-shot learning of Flickr30K-CN image retrieval and by 2.6 R@1 in the finetuning of COCO-CN text retrieval.

For the large-size models, in the zero-shot setup of Flickr30K-CN, CN-CLIPViT-L/14 surpasses WukongViT-L/14 by 16.3 R@1 in text-to-image retrieval and 4.1 R@1 in image-to-text retrieval. CN-CLIPViT-L/14@336px further improves over CN-CLIPViT-L/14 by 1.0 R@1 in image retrieval and 3.1 R@1 in text retrieval. In the finetuning setup, CN-CLIPViT-L/14 surpasses R2D2ViT-L/14 by 0.5 R@1 in text retrieval. CN-CLIPViT-L/14@336px achieves equal performance with R2D2ViT-L/14 in image retrieval and surpasses it by 1.0 R@1 in text retrieval. Similarly, in the finetuning setup of COCO-CN, CN-CLIPViT-L/14 surpasses R2D2ViT-L/14 by 0.9 R@1 in text retrieval. CN-CLIPViT-L/14@336px further surpasses R2D2ViT-L/14 by 1.0 R@1 in image retrieval and 1.9 R@1 in text retrieval.

On Flickr30K-CN and COCO-CN, scaling from CN-CLIPViT-L/14 to CN-CLIPViT-H/14 improves the performance in almost all the metrics. Specifically, in the zero-shot setup of Flickr30K-CN, CN-CLIPViT-H/14 surpasses CN-CLIPViT-L/14 by 3.2 R@1 in image retrieval and 1.4 R@1 in text retrieval. Moreover, in the finetuning setup of COCO-CN, CN-CLIPViT-H/14 even surpasses CN-CLIPViT-L/14@336px with larger image resolution by 1.4 R@1 in image retrieval and 2.3 R@1 in text retrieval. We also compare our CN-CLIPViT-H/14 with a huge CLIP-like model, T-Bletchley.9 This model has 2.5 billion parameters and is pretrained on billions of multilingual image-caption pairs. In the finetuning setup of COCO-CN, with smaller sizes of model parameters and pretraining dataset, CN-CLIPViT-H/14 still surpasses T-Bletchley by 3.9 MR.

9 https://www.microsoft.com/en-us/research/blog/turing-bletchley-a-universal-image-language-representation-model-by-microsoft/

3.1.3 Ablation Study

Here we provide an ablation study on our proposed two-stage training method. To validate its significance and effectiveness, we design several setups for the ablation study. Our experiments are conducted on CN-CLIPViT-B/16. To evaluate the influence of initialization, we pretrain a model from scratch, and to examine the influence of LiT, we pretrain a model without freezing the image encoder. For a better demonstration, we report the curves of model performance of zero-shot retrieval on different datasets in terms of pretraining progress, indicated by the number of processed samples.

Figure 2 shows the performance on different tasks, namely MUGE text-to-image retrieval, and text-to-image and image-to-text retrieval on Flickr30K-CN and COCO-CN. In comparison with pretraining with pretrained model initialization, pretraining from scratch performs much worse though it shows consistent performance improvement in terms of pretraining progress.
Tasks Text-to-Image Image-to-Text
Setups Zero-shot Finetuning Zero-shot Finetuning
Metrics R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
Tiny-size Model
CN-CLIPRN50 48.1 81.3 90.5 66.8 91.1 97.0 51.6 81.2 90.5 68.4 93.3 97.8
Base-size Models
WukongViT-B/32 49.2 79.4 87.9 67.0 91.4 96.7 48.3 77.8 88.8 65.8 90.3 96.6
R2D2ViT-B - - - 75.1 94.2 98.1 - - - 76.1 95.3 98.5
CN-CLIPViT-B/16 62.2 86.6 94.9 77.0 97.1 99.0 57.0 84.1 93.6 77.4 96.2 98.9
Large-size Models
WukongViT-L/14 53.4 80.2 90.1 74.0 94.4 98.1 55.2 81.0 90.6 73.3 94.0 98.0
R2D2ViT-L/14 56.4 85.0 93.1 79.1 96.5 98.9 63.3 89.3 95.7 79.3 97.1 98.7
CN-CLIPViT-L/14 64.0 89.2 94.4 78.9 96.3 99.0 60.4 84.2 92.9 80.2 96.7 99.2
CN-CLIPViT-L/14@336px 64.7 89.6 94.6 80.1 96.7 99.2 63.4 87.2 94.4 81.2 97.2 99.1
Huge-size Models
CN-CLIPViT-H/14 69.2 89.9 96.1 81.5 96.9 99.1 63.0 86.6 92.9 83.5 97.3 99.2

Table 4: Experimental results on COCO-CN. We report the performance of both baselines and Chinese CLIP
models on text-to-image retrieval and image-to-text retrieval in the setups of zero-shot evaluation and finetuning.
Since machine translated COCO is included in our pretraining dataset, here the numbers of Chinese CLIP zero-shot
performances are shown in gray.

As to the importance of LiT, we observe different phenomena on different datasets. On MUGE, a dataset of samples originally collected from Chinese websites, we find that pretraining without LiT might be the best solution, though its performance gap with two-stage pretraining is quite small. However, on the other datasets, i.e., Flickr30K-CN and COCO-CN, whose samples are translated from the English datasets, we find that our two-stage pretraining performs significantly better than pretraining without LiT. Furthermore, we observe a common phenomenon that in the two-stage pretraining, switching from Stage 1 to Stage 2 can effectively boost the model performance to a higher level. This reflects the importance of adapting the pretrained model to the data distribution of the Chinese multimodal data, especially those concerned with visual information.

3.2 Zero-shot Image Classification

This section introduces the image classification datasets, and provides comprehensive experimental results and in-depth analyses to probe the inherent defects of Chinese CLIP.

3.2.1 Open-Domain Image Classification Benchmark in Chinese

Contrastive pretraining on image-text pairs builds a connection between vision and natural language. Natural language supervision instead of crowd-sourced labeling endows models with the capability of zero-shot image classification by computing similarities between the given image and the text descriptions of the labels in the candidate set. Recent progress in this field is the proposal of the ELEVATER benchmark (Li et al., 2022b). The ELEVATER benchmark includes “Image Classification in the Wild (ICinW)” for open-domain image classification and “Object Detection in the Wild (ODinW)” for open-domain object detection. We evaluate Chinese CLIP for zero-shot image classification on ICinW. ICinW consists of a series of image classification datasets, e.g., CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), FER-2013 (Radford et al., 2021), FGVC-Aircraft (Maji et al., 2013), KITTI-Distance (Fritsch et al., 2013), MNIST (Deng, 2012), Pascal-VOC-2007 (Everingham et al., 2010), etc. The evaluation metrics include accuracy, mean-per-class, etc.

In order to evaluate Chinese CLIP on the datasets, we first transform the datasets for Chinese models by translating labels and prompts into Chinese. Specifically, we translate the text descriptions of the labels and the templates for manual prompts to Chinese. For example, the labels in CIFAR-10 include “car, dog, ...”, and we manually translate the words into Chinese. There are also particular cases, such as the labels in FGVC-Aircraft (Maji et al., 2013), which are difficult to translate or transliterate. We search the names on Google and figure out the best Chinese name for each label.
[Figure 2: Comparison of base-size Chinese CLIP models with different training methods on MUGE, Flickr30k-CN and COCO-CN. Panels (a)-(e) plot zero-shot Mean Recall against the number of image-text pairs pretrained for MUGE image retrieval, Flickr30k-CN image retrieval, Flickr30k-CN text retrieval, COCO-CN image retrieval, and COCO-CN text retrieval; the curves correspond to CN-CLIP Base Stage 1, CN-CLIP Base Stage 2, CN-CLIP Base w/o two-stage method, and CN-CLIP Base from scratch.]

Model CIFAR10 CIFAR100 DTD EuroSAT FER FGVC KITTI MNIST PC VOC
Original benchmark
DeCLIP 90.9 66.8 44.9 39.9 23.3 9.0 39.7 13.6 55.3 80.6
GIT 88.5 61.1 42.9 43.4 41.4 6.7 22.1 68.9 50.0 80.2
ALIGN 94.9 76.8 66.1 52.1 50.8 25.0 41.2 74.0 55.2 83.0
OpenCLIP 93.5 76.2 56.4 53.7 50.3 20.8 28.8 70.9 50.5 82.3
CLIP 94.9 77.0 56.0 63.0 48.3 33.3 11.5 79.0 62.3 84.0
Translated benchmark
BriVL 72.3 35.9 18.8 25.5 - - - - - -
Wukong 95.4 77.1 40.9 50.3 - - - - - -
CN-CLIP 96.0 79.7 51.2 52.0 55.1 26.2 49.9 79.4 63.5 84.9

Table 5: Experimental results of the zero-shot image classification performance of models on ICinW.

Be that as it may, we cannot guarantee that we have the best Chinese translation, and more importantly, it is still hard for the Chinese pretrained model to understand some of the concepts, which may lead to unsatisfactory performance in the related downstream tasks. As to the templates, for some datasets, we use our translation of the templates provided by the ELEVATER toolkit,10 and for the others, we use the translation of the templates from OpenAI CLIP.11

10 https://github.com/Computer-Vision-in-the-Wild/Elevater_Toolkit_IC
11 The organized datasets with translated labels and templates will be released.

3.2.2 Experimental Results

Table 5 reports the performance of both English models and Chinese models. The baselines pretrained on English data include DeCLIP (Li et al., 2021b), ALIGN (Jia et al., 2021), CLIP (Radford et al., 2021), and OpenCLIP (Ilharco et al., 2021), and the baselines pretrained on Chinese data include BriVL (Fei et al., 2021) and Wukong (Gu et al., 2022). We report the results of the variant with the best downstream performance for the models.

We first focus on the comparison with the Chinese baselines. Chinese CLIP outperforms both baselines on the 4 datasets significantly. To be specific, Chinese CLIP surpasses BriVL on all datasets by large margins, e.g., over 100%, and also significantly outperforms Wukong.
Besides, we also compare Chinese CLIP with the foundation models, e.g., CLIP and ALIGN, which are pretrained on English data. It can be found that Chinese CLIP outperforms CLIP or ALIGN on CIFAR-10, CIFAR-100, FER-2013, KITTI-Distance, MNIST, PatchCamelyon, and Pascal-VOC-2007. Also, on the classification datasets for general concepts or objects common in both western and eastern culture, Chinese CLIP consistently achieves better performance. This indicates that Chinese CLIP is capable of categorizing images to general prototypes.

However, as to the classification concerned with proper nouns, e.g., FGVC-Aircraft, it is difficult for all models to achieve high accuracy. We assume that the related images and texts are not common in the pretraining datasets, and it is also hard for the models to understand the names of airplanes without finetuning. Specifically, for Chinese models, the translation or even transliteration can significantly affect the performance of Chinese CLIP. It encourages building a benchmark of “Image Classification in the Wild for Chinese Models”.

3.2.3 Analysis

Here we illustrate some findings of Chinese CLIP in the experiments, including the sensitivity to handcrafted prompts and the inability to understand negation.

Sensitivity to Handcrafted Prompts While the benchmark ELEVATER provides specific prompts for each dataset, we find that this is not always the best option, in comparison with our baseline, the translation of the prompts provided by OpenAI CLIP. The baseline with around 90 prompts performs the best on average. However, for some datasets, specific prompts designed with human knowledge can boost the performance significantly. A typical case is the classification of airplanes. We test CN-CLIPViT-L/14 with our specified prompts that are related to the knowledge of aircraft, e.g., “label, a photo of an airplane”, “label, a zoomed image of a fighter”, etc., and the translation of the OpenAI prompts. Experimental results show that the model can achieve an accuracy of 16.0 with the specified prompts but only 13.8 with the OpenAI prompts.

Inability to Understand Negation Previous studies (Khandelwal and Sawant, 2019; Hosseini et al., 2021) demonstrate that even strong NLP pretrained models like BERT often make mistakes in negation problems and thus show the inability to understand negation. In this work, we explore CLIP’s capability to understand negation. We conduct experiments on KITTI-Distance (Fritsch et al., 2013). This dataset provides 4 options for models to judge, including “next to a car”, “near a car”, “at a distance away from a car”, and “no car”. The last one is concerned with negation. We compare the model performance using the text “no cars” and “others” for the last label. Through the experiments with CN-CLIPViT-L/14, we observe that it is hard for the model to understand negation. By changing the label from “others” to “no cars”, the performance drops by 48.1% in accuracy (49.9 vs. 25.9). Similarly, in the experiments on PatchCamelyon, the performance drops from 63.5 to 50.2 by changing the labels from “mainly red” and “green block in the middle” to “no green block in the middle” and “green block in the middle”. This shows the limitation of the training of CLIP in learning negation. The texts in the pretraining datasets are mostly descriptions of the images, which indicate their objects or features but often do not indicate the absence of objects. In future research, we will explore heuristically building datasets with negation. For example, we sample objects that are not present in an image, construct a text description for negation with manual templates, and train CLIP models on such datasets.
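A possible sketch of this heuristic construction is given below; the templates and object vocabulary are illustrative assumptions rather than a dataset we have built.

import random

NEGATION_TEMPLATES = ["a photo with no {} in it", "there is no {} in this picture"]
VOCAB = ["dog", "car", "bicycle", "person", "cat"]  # hypothetical object vocabulary

def build_negation_caption(present_objects, rng=random):
    # Sample an object that is absent from the image annotation and fill a template.
    absent = [obj for obj in VOCAB if obj not in present_objects]
    if not absent:
        return None  # every vocabulary object appears in the image
    obj = rng.choice(absent)
    return rng.choice(NEGATION_TEMPLATES).format(obj)

# Example: build_negation_caption({"person", "car"}) may return "a photo with no dog in it".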
4 Related Work

This section first reviews the related work in general multimodal pretraining with a focus on contrastive pretraining, and further reviews the progress of Chinese multimodal pretraining.

Multimodal Pretraining There are mainly two lines of studies in multimodal pretraining, namely generative multimodal pretraining and contrastive multimodal pretraining. The core of generative multimodal pretraining is the transfer of BERT or T5 to cross-modal data, and a number of models consecutively created new state-of-the-art performance in a series of cross-modal downstream tasks (Chen et al., 2020; Li et al., 2019a,b; Lu et al., 2019; Lin et al., 2020; Li et al., 2020; Huang et al., 2020; Xu et al., 2021; Zhang et al., 2021; Shen et al., 2021; Wang et al., 2021b, 2022a; Li et al., 2021a, 2022c,a; Wang et al., 2021a, 2022b), e.g., image captioning (Chen et al., 2015), VQA (Antol et al., 2015; Goyal et al., 2017), etc. The rise of contrastive pretraining in multimodal representation learning started from the proposal of CLIP (Radford et al., 2021).
CLIP demonstrated that the simple contrastive learning with large-scale multimodal data could build an effective vision foundation model. It revolutionized both computer vision and multimodal representation learning. Following CLIP, a series of similar contrastive-learning-based multimodal pretrained models were proposed and reached new SOTAs in cross-modal retrieval and zero-shot classification (Jia et al., 2021; Yao et al., 2021; Yuan et al., 2021). Furthermore, CLIP can be adaptive to other models. A typical case is that CLIP is essential to many image generation models, e.g., DALL-E (Ramesh et al., 2021), DALL-E 2 (Ramesh et al., 2022), Stable Diffusion (Rombach et al., 2022), etc.

Chinese Multimodal Pretraining The success of multimodal pretraining encouraged the transfer of the existing methods to Chinese pretraining. Similarly, researchers have proposed models for both generative and contrastive pretraining. Lin et al. (2021a) proposed a model that unifies understanding and generation and pretrained it on an in-house large-scale dataset, and further extended it to larger scales (Lin et al., 2021a,b). Fei et al. (2021) started the research in contrastive multimodal pretraining for Chinese data and incorporated complex design for better performance. Gu et al. (2022) proposed the Wukong dataset, a large-scale public dataset for Chinese multimodal pretraining, and the corresponding pretrained model. Xie et al. (2022) proposed a new dataset called ZERO whose subset was released, and pretrained a new model called R2D2 that incorporated a contrastive two-tower architecture and an interactive module for cross-modal fusion. In this work, different from the previous studies that explored intricate model designs, we build a baseline Chinese CLIP without significant architectural changes from the original CLIP, and we use the simple two-stage pretraining to help Chinese CLIP reach outstanding performance.

5 Limitations and Future Work

A number of issues reflect the limitations of this work but also point out some directions for our future research. In this section, we generally discuss some limitations about the scale of data and model.

Data The core of CLIP pretraining is the simple but effective large-scale contrastive pretraining on extremely large-scale data. Though we have utilized around 200 million samples, compared with recent studies (Yuan et al., 2021; Chen et al., 2022) the scale of our pretraining data is relatively small. Thus one of our next-step studies is scaling up the quantity of the pretraining data to evaluate the performance improvement with data scaling. Furthermore, we still find it hard to decide what a “high-quality” dataset for CLIP is. In the previous studies (Jia et al., 2021; Li et al., 2021b), the preprocessing methods are mostly simple to avoid the loss of data. However, there are still many samples where the image and text are not matched properly, which may provide negative information to the pretraining. In our future research, we plan to use the pretrained Chinese CLIP model to compute a score for each image-text pair in a larger dataset, filter those whose scores are under the specified threshold, and pretrain the new models with the new data. This is one of the possible solutions to explore the relationship between data quality and pretraining effectiveness. Also, such cycling might bring continuous performance enhancement in downstream tasks.
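A minimal sketch of this cleaning cycle is shown below, assuming a scoring model with encode_image and encode_text methods; the interface and the threshold value are placeholders, since the actual threshold would be tuned on the new data.

import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_pairs(model, pairs, threshold=0.26):
    kept = []
    for image, text in pairs:
        img_feat = F.normalize(model.encode_image(image), dim=-1)
        txt_feat = F.normalize(model.encode_text(text), dim=-1)
        score = (img_feat * txt_feat).sum(-1).item()   # cosine similarity of the pair
        if score >= threshold:
            kept.append((image, text))                 # keep only well-matched pairs
    return kept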
Model Recently we have witnessed that in many domains the scaling of model size can lead to consistent performance improvement (Gordon et al., 2021; Wei et al., 2022), and in this work, we also find that the scaling of model size for Chinese CLIP can achieve steady performance improvement in different downstream tasks, including retrieval and classification. Recent studies have scaled ViT and also CLIP-like models to a much larger scale than our largest CN-CLIPViT-H/14, e.g., the 3B Swin-v2 (Liu et al., 2022), the 4B ViT-e (Chen et al., 2022), etc. In the future, we will continue exploring scaling up models in line with scaling up data in order to build a more effective Chinese CLIP. Another issue of model scaling connected with the real-world application is how to build effective small models. Experimental results show that our smallest Chinese CLIP, CN-CLIPRN50, performs much worse than the ViT variants. However, in real-world applications, effective small models that are available for deployment are usually more welcomed. Thus it is necessary to explore distillation for CLIP so that the capability of large models can be transferred to small models for application.

6 Conclusion

This work proposes Chinese CLIP, an implementation of CLIP (Radford et al., 2021) pretrained on large-scale Chinese datasets with contrastive learning.
Specifically, we collect image-text pairs from public datasets and construct a pretraining dataset of around 200 million samples. We implement pretraining by initializing encoders with the weights of the existing pretrained models and using our proposed two-stage pretraining method to build 5 Chinese CLIP models of different model sizes, ranging from 77 to 958 million parameters. Our comprehensive evaluation shows that Chinese CLIP can reach state-of-the-art performance on multiple cross-modal retrieval datasets, including MUGE, Flickr30K-CN, and COCO-CN, in both setups of zero-shot learning and finetuning. Furthermore, we follow the ELEVATER benchmark and construct image classification datasets with labels and prompts in Chinese. Through the experiments, we demonstrate that Chinese CLIP models can also achieve competitive performance in zero-shot image classification across 10 datasets. However, we still find that Chinese CLIP is limited in some aspects, such as sensitivity to prompts and difficulty in understanding negation. Besides, there is still much room in exploring data scaling and model scaling. In the future, we will continue our research in building Chinese CLIP models with better performance and exploring the features of contrastive-learning-based pretrained models.

References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: visual question answering. In ICCV 2015, pages 2425–2433. IEEE Computer Society.

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101 – mining discriminative components with random forests. In European conference on computer vision, pages 446–461. Springer.

Fredrik Carlsson, Philipp Eisen, Faton Rekathati, and Magnus Sahlgren. 2022. Cross-lingual and multilingual clip. In Proceedings of the Language Resources and Evaluation Conference, pages 6848–6854, Marseille, France. European Language Resources Association.

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. 2022. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: universal image-text representation learning. In ECCV 2020, volume 12375 of Lecture Notes in Computer Science, pages 104–120. Springer.

Gong Cheng, Junwei Han, and Xiaoqiang Lu. 2017. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883.

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. 2014. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613.

Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. 2019. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 113–123.

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting pre-trained models for Chinese natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 657–668, Online. Association for Computational Linguistics.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 248–255. IEEE Computer Society.

Li Deng. 2012. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine, 29(6):141–142.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019, pages 4171–4186. Association for Computational Linguistics.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR 2021. OpenReview.net.
Mark Everingham, Luc Van Gool, Christopher K. I. Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei
Williams, John M. Winn, and Andrew Zisserman. Fu, and Jianlong Fu. 2020. Pixel-bert: Aligning im-
2010. The pascal visual object classes (VOC) chal- age pixels with text by deep multi-modal transform-
lenge. Int. J. Comput. Vis., 88(2):303–338. ers. CoRR, abs/2004.00849.

Nanyi Fei, Zhiwu Lu, Yizhao Gao, Guoxing Yang, Gabriel Ilharco, Mitchell Wortsman, Ross Wightman,
Yuqi Huo, Jingyuan Wen, Haoyu Lu, Ruihua Song, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal
Xin Gao, Tao Xiang, et al. 2021. Wenlan 2.0: Dave, Vaishaal Shankar, Hongseok Namkoong,
Make ai imagine via a multimodal foundation model. John Miller, Hannaneh Hajishirzi, Ali Farhadi, and
arXiv preprint arXiv:2110.14378. Ludwig Schmidt. 2021. Openclip. If you use this
software, please cite it as below.
Li Fei-Fei, Rob Fergus, and Pietro Perona. 2004.
Learning generative visual models from few training Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana
examples: An incremental bayesian approach tested Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung,
on 101 object categories. In 2004 conference on Zhen Li, and Tom Duerig. 2021. Scaling up
computer vision and pattern recognition workshop, visual and vision-language representation learn-
pages 178–178. IEEE. ing with noisy text supervision. arXiv preprint
arXiv:2102.05918.
Jannik Fritsch, Tobias Kuehnl, and Andreas Geiger.
2013. A new performance measure and evalua- Aditya Khandelwal and Suraj Sawant. 2019. Neg-
tion benchmark for road detection algorithms. In bert: a transfer learning approach for negation
16th International IEEE Conference on Intelligent detection and scope resolution. arXiv preprint
Transportation Systems (ITSC 2013), pages 1693– arXiv:1911.04211.
1700. IEEE.
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj
Mitchell A. Gordon, Kevin Duh, and Jared Kaplan. Goswami, Amanpreet Singh, Pratik Ringshia, and
2021. Data and parameter scaling laws for neural Davide Testuggine. 2020. The hateful memes
machine translation. In EMNLP 2021, pages 5915– challenge: Detecting hate speech in multimodal
5922. Association for Computational Linguistics. memes. Advances in Neural Information Processing
Systems, 33:2611–2624.
Yash Goyal, Tejas Khot, Douglas Summers-Stay,
Dhruv Batra, and Devi Parikh. 2017. Making the Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt:
V in VQA matter: Elevating the role of image un- Vision-and-language transformer without convolu-
derstanding in visual question answering. In CVPR tion or region supervision. In ICML 2021, volume
2017, pages 6325–6334. IEEE Computer Society. 139 of Proceedings of Machine Learning Research,
pages 5583–5594. PMLR.
Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou,
Minzhe Niu, Hang Xu, Xiaodan Liang, Wei Zhang, Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-
Xin Jiang, and Chunjing Xu. 2022. Wukong: 100 Fei. 2013. 3d object representations for fine-
million large-scale chinese cross-modal pre-training grained categorization. In Proceedings of the
dataset and a foundation framework. arXiv preprint IEEE international conference on computer vision
arXiv:2202.06767. workshops, pages 554–561.

Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin John-
2021. Open-vocabulary object detection via vision son, Kenji Hata, Joshua Kravitz, Stephanie Chen,
and language knowledge distillation. arXiv preprint Yannis Kalantidis, Li-Jia Li, David A. Shamma,
arXiv:2104.13921. Michael S. Bernstein, and Li Fei-Fei. 2017. Vi-
sual genome: Connecting language and vision us-
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian ing crowdsourced dense image annotations. IJCV,
Sun. 2016. Deep residual learning for image recog- 123(1):32–73.
nition. In CVPR 2016, pages 770–778.
Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learn-
Patrick Helber, Benjamin Bischke, Andreas Dengel, ing multiple layers of features from tiny images.
and Damian Borth. 2019. Eurosat: A novel dataset
and deep learning benchmark for land use and Weiyu Lan, Xirong Li, and Jianfeng Dong. 2017.
land cover classification. IEEE Journal of Selected Fluency-guided cross-lingual image caption-
Topics in Applied Earth Observations and Remote ing. Proceedings of the 25th ACM international
Sensing, 12(7):2217–2226. conference on Multimedia.

Arian Hosseini, Siva Reddy, Dzmitry Bahdanau, Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang,
R Devon Hjelm, Alessandro Sordoni, and Aaron Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai
Courville. 2021. Understanding by understanding Xu, Zheng Cao, et al. 2022a. mplug: Effective and
not: Modeling negation in language models. arXiv efficient vision-language learning by cross-modal
preprint arXiv:2105.03519. skip-connections. arXiv preprint arXiv:2205.12005.
Chunyuan Li, Haotian Liu, Liunian Harold Li, Junyang Lin, An Yang, Yichang Zhang, Jie Liu, Jingren
Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Zhou, and Hongxia Yang. 2020. Interbert: Vision-
Jin, Yong Jae Lee, Houdong Hu, Zicheng Liu, et al. and-language interaction for multi-modal pretrain-
2022b. Elevater: A benchmark and toolkit for eval- ing. CoRR, abs/2003.13198.
uating language-augmented visual models. arXiv
preprint arXiv:2204.08790. Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda
Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang,
Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Li Dong, et al. 2022. Swin transformer v2: Scal-
Ming Zhou. 2019a. Unicoder-vl: A universal en- ing up capacity and resolution. In Proceedings of
coder for vision and language by cross-modal pre- the IEEE/CVF Conference on Computer Vision and
training. CoRR, abs/1908.06066. Pattern Recognition, pages 12009–12019.
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Lee. 2019. Vilbert: Pretraining task-agnostic visi-
2022c. Blip: Bootstrapping language-image pre- olinguistic representations for vision-and-language
training for unified vision-language understanding tasks. In NeurIPS 2019, pages 13–23.
and generation. arXiv preprint arXiv:2201.12086.
Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen
Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Lei, Nan Duan, and Tianrui Li. 2021. Clip4clip: An
Gotmare, Shafiq R. Joty, Caiming Xiong, and empirical study of clip for end to end video clip re-
Steven Chu-Hong Hoi. 2021a. Align before fuse: trieval. arXiv preprint arXiv:2104.08860.
Vision and language representation learning with
momentum distillation. In NeurIPS 2021, pages 9694–9705.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019b. Visualbert: A simple and performant baseline for vision and language. ArXiv, abs/1908.03557.

Xirong Li, Chaoxi Xu, Xiaoxu Wang, Weiyu Lan, Zhengxiong Jia, Gang Yang, and Jieping Xu. 2019c. Coco-cn for cross-lingual image tagging, captioning, and retrieval. IEEE Transactions on Multimedia, 21:2347–2360.

Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV.

Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. 2021b. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208.

Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, Jie Zhang, Jianwei Zhang, Xu Zou, Zhikang Li, Xiaodong Deng, Jie Liu, Jinbao Xue, Huiling Zhou, Jianxin Ma, Jin Yu, Yong Li, Wei Lin, Jingren Zhou, Jie Tang, and Hongxia Yang. 2021a. M6: A chinese multimodal pretrainer. CoRR, abs/2103.00823.

Junyang Lin, An Yang, Jinze Bai, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Yong Li, Wei Lin, Jingren Zhou, and Hongxia Yang. 2021b. M6-10T: A sharing-delinking paradigm for efficient multi-trillion parameter pretraining. CoRR, abs/2110.03888.

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. 2013. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.

Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. 2021. Slip: Self-supervision meets language-image pre-training. arXiv preprint arXiv:2112.12750.

Maria-Elena Nilsback and Andrew Zisserman. 2008. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE.

Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. 2012. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE Computer Society.

Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. 2021. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2085–2094.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In ICML 2021, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In ICML 2021, volume 139 of Proceedings of Machine Learning Research, pages 8821–8831. PMLR.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695.

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114.

Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. 2021. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383.

Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. 2011. The german traffic sign recognition benchmark: A multi-class classification competition. In IJCNN 2011, pages 1453–1460. IEEE.

Bastiaan S. Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. 2018. Rotation equivariant cnns for digital pathology. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 210–218. Springer.

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022a. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. CoRR, abs/2202.03052.

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. 2022b. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442.

Wenhui Wang, Hangbo Bao, Li Dong, and Furu Wei. 2021a. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. CoRR, abs/2111.02358.

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2021b. Simvlm: Simple visual language model pretraining with weak supervision. CoRR, abs/2108.10904.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Chunyu Xie, Heng Cai, Jianfei Song, Jincheng Li, Fanjing Kong, Xiaoyu Wu, Henrique Morimitsu, Lin Yao, Dexin Wang, Dawei Leng, et al. 2022. Zero and r2d2: A large-scale chinese cross-modal benchmark and a vision-language framework. arXiv preprint arXiv:2205.03860.

Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, and Fei Huang. 2021. E2e-vlp: End-to-end vision-language pre-training enhanced by visual learning. arXiv preprint arXiv:2106.01804.

An Yang, Junyang Lin, Rui Men, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Jiamang Wang, Yong Li, Di Zhang, Wei Lin, Lin Qu, Jingren Zhou, and Hongxia Yang. 2021. Exploring sparse expert models and beyond. CoRR, abs/2105.15082.

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. 2021. FILIP: Fine-grained interactive language-image pre-training. CoRR, abs/2111.07783.

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. 2021. Florence: A new foundation model for computer vision. CoRR, abs/2111.11432.

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. 2022. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133.

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In CVPR 2021, pages 5579–5588. Computer Vision Foundation / IEEE.
Model                    Embedding dimension    Vision Transformer (layers / width / heads)    Text Transformer (layers / width / heads)
CN-CLIPViT-B/16          512                    12 / 768 / 12                                  12 / 768 / 12
CN-CLIPViT-L/14          768                    24 / 1,024 / 16                                12 / 768 / 12
CN-CLIPViT-L/14@336px    768                    24 / 1,024 / 16                                12 / 768 / 12
CN-CLIPViT-H/14          1,024                  32 / 1,280 / 16                                24 / 1,024 / 24

Table 6: Detailed architecture hyperparameters of ViT-based CN-CLIP models.

Model          Embedding dimension    ResNet (blocks / width)    Text Transformer (layers / width / heads)
CN-CLIPRN50    1,024                  (3, 4, 6, 3) / 2,048       3 / 768 / 12

Table 7: Detailed architecture hyperparameters of the ResNet-based CN-CLIPRN50.
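The "Embedding dimension" column in Tables 6 and 7 denotes the width of the shared space into which both towers are projected before the contrastive loss is computed. The snippet below is only a schematic sketch of such output projections, not the released implementation; the class name is hypothetical and the default widths are taken from the CN-CLIPViT-B/16 row for illustration.

import torch
import torch.nn as nn

# Schematic sketch (hypothetical class, not the released code): each tower
# projects its own hidden width into the shared embedding dimension.
class DualEncoderProjection(nn.Module):
    def __init__(self, vision_width=768, text_width=768, embed_dim=512):
        super().__init__()
        self.visual_proj = nn.Parameter(torch.empty(vision_width, embed_dim))
        self.text_proj = nn.Parameter(torch.empty(text_width, embed_dim))
        # the text output projection is randomly initialized (see A.2)
        nn.init.normal_(self.visual_proj, std=vision_width ** -0.5)
        nn.init.normal_(self.text_proj, std=text_width ** -0.5)

    def forward(self, vision_feat, text_feat):
        # map pooled tower outputs into the shared space and L2-normalize
        v = vision_feat @ self.visual_proj
        t = text_feat @ self.text_proj
        return v / v.norm(dim=-1, keepdim=True), t / t.norm(dim=-1, keepdim=True)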

A Appendix

A.1 Model Architecture Details

As mentioned in Section 2.4 and Table 1, we release 5 Chinese CLIP models of different sizes. Here we provide more details of their model architectures in Table 6 and Table 7. We keep the architectures of the ResNet-50, ViT-B/16, and ViT-L/14 backbones consistent with OpenAI CLIP (https://github.com/openai/CLIP) and the architecture of ViT-H/14 consistent with LAION CLIP (https://laion.ai/blog/large-openclip/). This enables us to initialize the Chinese CLIP image encoders with their weights. The text encoders are Chinese RoBERTa models (Cui et al., 2020). Specifically, our most lightweight tiny-size Chinese CLIP uses the architecture of the 3-layer RBT3 model. The base-size and large-size Chinese CLIP models use the 12-layer architecture of RoBERTa-wwm-Base. For the huge-size CN-CLIP, we use the 24-layer architecture of RoBERTa-wwm-Large. The vocabulary size of the text tokenizer is 21,128.

A.2 Pretraining Details

Initialization As mentioned in Section 2.3, we initialize the image encoders of CN-CLIPRN50, CN-CLIPViT-B/16 and CN-CLIPViT-L/14 with the OpenAI CLIP weights. The image encoder of CN-CLIPViT-H/14 is initialized with LAION CLIP. Besides the ResNet or ViT parameters, the temperature and visual output projection parameters are also initialized with the pretrained CLIP weights. For the text encoder, we initialize the parameters with the released Chinese RoBERTa weights (https://github.com/ymcui/Chinese-BERT-wwm) of the corresponding model scale, with their pooler weights discarded. The text output projection weight is randomly initialized from a normal distribution.

Hyperparameters            Value
Batch size                 32,768
Maximum text length        50
Peak learning rate         1e-4
Learning rate schedule     Cosine
Maximum temperature        100
Weight decay               1e-3
Warmup iterations          5,000
Adam β1                    0.9
Adam β2                    0.999 (ResNet), 0.98 (ViT)
Adam ε                     1e-8 (ResNet), 1e-6 (ViT)

Table 8: Common pretraining hyperparameters in the first stage.

Stage 1 The pretraining hyperparameters of Stage 1 are shown in Table 8 and are shared by CN-CLIPRN50, CN-CLIPViT-B/16, CN-CLIPViT-L/14 and CN-CLIPViT-H/14. The hyperparameter values are generally similar to those of OpenAI CLIP (Radford et al., 2021). For data augmentation, we apply random resized cropping and AutoAugment (Cubuk et al., 2019) to the input images. We leverage all-gather communications across GPU workers to compute the contrastive loss on the global batch. With the image encoder frozen, the above 4 models are pretrained for around 20, 44, 64, and 26 epochs in this stage, respectively. For CN-CLIPRN50, the running variance and mean of the batch normalization layers are not updated in this stage. The optimal number of pretraining epochs is determined by measuring the mean recall on the 3 downstream zero-shot retrieval tasks during training. Mixed-precision training is activated. In this stage, we pretrain for 1.6 days using 64 NVIDIA V100 GPUs for CN-CLIPRN50, 4.5 days using 128 NVIDIA V100 GPUs for CN-CLIPViT-B/16, 11.5 days using 128 NVIDIA V100 GPUs for CN-CLIPViT-L/14, and 3.8 days using 184 NVIDIA A100 GPUs for CN-CLIPViT-H/14.
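The Stage-1 recipe combines a frozen image encoder with a contrastive loss computed over the batch gathered from all GPU workers. Below is a minimal PyTorch sketch of these two ingredients; it is illustrative rather than the released training code, and the helper names (freeze_image_tower, gather_features, clip_loss) are hypothetical. The features passed to clip_loss are assumed to be already L2-normalized.

import torch
import torch.nn.functional as F
import torch.distributed as dist

def freeze_image_tower(image_encoder: torch.nn.Module):
    # Stage 1: keep the pretrained image weights fixed; eval() also stops
    # BatchNorm running statistics from being updated (relevant for ResNet).
    for p in image_encoder.parameters():
        p.requires_grad_(False)
    image_encoder.eval()

def gather_features(feats: torch.Tensor) -> torch.Tensor:
    # All-gather normalized features from every worker so the contrastive
    # loss is computed against the global batch rather than the local one.
    gathered = [torch.zeros_like(feats) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, feats)
    gathered[dist.get_rank()] = feats  # keep gradients for the local shard
    return torch.cat(gathered, dim=0)

def clip_loss(img_feats, txt_feats, logit_scale):
    img_all, txt_all = gather_features(img_feats), gather_features(txt_feats)
    logits_per_image = logit_scale * img_feats @ txt_all.t()
    logits_per_text = logit_scale * txt_feats @ img_all.t()
    # positives lie on the diagonal of the local-vs-global similarity matrix
    offset = dist.get_rank() * img_feats.size(0)
    labels = torch.arange(img_feats.size(0), device=img_feats.device) + offset
    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2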
Model   Batch size: MUGE / Flickr / COCO   Peak learning rate: MUGE / Flickr / COCO   Maximum epochs: MUGE / Flickr / COCO   Warmup iterations: MUGE / Flickr / COCO
CN-CLIPRN50 24,576 24,576 24,576 5e-5 6e-5 5e-5 60 30 40 100 20 6
CN-CLIPViT-B/16 12,800 7,680 12,800 2e-5 5e-5 5e-5 20 16 30 40 20 6
CN-CLIPViT-L/14 4,096 4,096 4,096 3e-5 2e-5 6e-5 20 16 18 100 60 9
CN-CLIPViT-L/14@336px 8,192 8,192 8,192 2e-5 2e-5 4e-5 20 18 18 100 20 2
CN-CLIPViT-H/14 20,480 4,096 5,120 2e-5 6e-6 2e-5 20 18 18 20 6 10

Table 9: Detailed finetuning hyperparameters of CN-CLIP models.

Model   Text-to-Image: R@1 / R@5 / R@10   Image-to-Text: R@1 / R@5 / R@10   MR
R2D2ViT-B 42.2 69.4 77.8 43.4 69.8 78.4 63.5
R2D2ViT-L/14 60.7 82.0 86.9 61.5 82.9 87.7 77.0
CN-CLIPViT-B/16 55.4 79.0 85.2 56.6 79.5 85.6 73.5
CN-CLIPViT-L/14 61.6 83.6 89.0 62.5 83.9 89.1 78.3

Table 10: Finetuning results on the ICR dataset. We report the performance of the baselines, CN-CLIPViT-B/16, and CN-CLIPViT-L/14 on text-to-image and image-to-text retrieval.

Stage 2 In Stage 2, we unfreeze the image encoder and update all the model parameters. Except for the peak learning rate, batch size, and number of training epochs, all the hyperparameters of Stage 1 are kept unchanged. We decrease the learning rate to 2e-5 for subtler optimization. For CN-CLIPRN50, CN-CLIPViT-B/16 and CN-CLIPViT-L/14, the batch size is shrunk to 16,384, 16,384 and 4,608 respectively due to the limitation of GPU memory. When scaling to CN-CLIPViT-H/14, we implement gradient checkpointing, which enables a larger batch size of 32,768. These 4 models are pretrained for around 44, 15, 7 and 7 epochs in Stage 2, respectively. In this stage, we pretrain CN-CLIPRN50 for 5.8 days using 64 NVIDIA V100 GPUs, CN-CLIPViT-B/16 for 3.0 days using 128 NVIDIA V100 GPUs, CN-CLIPViT-L/14 for 8.0 days using 128 NVIDIA V100 GPUs, and CN-CLIPViT-H/14 for 2.2 days using 184 NVIDIA A100 GPUs.

To pretrain a model at a larger resolution, we interpolate the image positional embedding of CN-CLIPViT-L/14 to adapt it to the larger input size and continue pretraining with images of resolution 336 × 336. We start from CN-CLIPViT-L/14 and continue pretraining for 2 epochs. This pretraining only costs 0.7 days on 128 NVIDIA A100 GPUs.

A.3 Finetuning Details

As reported in Tables 2, 3 and 4, we mainly finetune CN-CLIP on 3 cross-modal retrieval datasets: MUGE, Flickr30K-CN, and COCO-CN. Most finetuning experiments are conducted on 32 NVIDIA A100 GPUs. The finetuning strategy and loss are consistent with the pretraining process. For time efficiency and full utilization of computation resources, we set the batch size as large as possible. We implement gradient checkpointing in the finetuning of CN-CLIPViT-L/14@336px and CN-CLIPViT-H/14 for a larger batch size. Table 9 shows the specific settings of batch size, peak learning rate, maximum epochs, and warmup iterations in the finetuning process. Other hyperparameters are set to the same values as in pretraining by default. We save the model parameters at the end of each epoch. For MUGE, we report the best results on the validation set. For Flickr30K-CN and COCO-CN, we choose the checkpoint with the best performance on the validation set and report the results on the test set.

A.4 Cross-modal Retrieval with Longer Texts

The results reported in Section 3.1.2 demonstrate the excellent cross-modal retrieval capability of Chinese CLIP. Note that the average text lengths of MUGE, Flickr30K-CN, and COCO-CN are 7.4, 19.7 and 16.8, respectively. We also conduct finetuning experiments on the ICR (Xie et al., 2022) dataset, whose average text length is 45.3. Experimental results are shown in Table 10. Since the texts in the ICR dataset are longer, we set the maximum text length to 128 for finetuning. The results show that Chinese CLIP achieves state-of-the-art performance in cross-modal retrieval tasks with longer texts.
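As a concrete illustration of the resolution adaptation described in Appendix A.2, the sketch below resizes a ViT positional embedding from the 224-pixel grid to the 336-pixel grid (16×16 to 24×24 patches for patch size 14). The function assumes the common [class token + patch grid] layout and is illustrative rather than the released code.

import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, old_size=224, new_size=336, patch=14):
    # pos_embed: [1, 1 + N, D] with a leading class token
    cls_tok, grid = pos_embed[:, :1], pos_embed[:, 1:]
    old_g, new_g = old_size // patch, new_size // patch          # 16 -> 24
    dim = grid.shape[-1]
    grid = grid.reshape(1, old_g, old_g, dim).permute(0, 3, 1, 2)  # [1, D, 16, 16]
    grid = F.interpolate(grid, size=(new_g, new_g), mode="bicubic", align_corners=False)
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_g * new_g, dim)
    return torch.cat([cls_tok, grid], dim=1)                       # [1, 1 + 24*24, D]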
Dataset #Labels Test Size Metric
Caltech-101 (Fei-Fei et al., 2004) 101 6,084 Mean-per-class
CIFAR-10 (Krizhevsky et al., 2009) 10 10,000 Accuracy
CIFAR-100 (Krizhevsky et al., 2009) 100 10,000 Accuracy
Country-211 (Radford et al., 2021) 211 21,100 Accuracy
DTD (Cimpoi et al., 2014) 47 1,880 Accuracy
EuroSAT (Helber et al., 2019) 10 5,000 Accuracy
FER-2013 (Radford et al., 2021) 7 3,589 Accuracy
FGVC-Aircraft (Maji et al., 2013) 100 3,333 Mean-per-class
Food-101 (Bossard et al., 2014) 101 25,250 Accuracy
GTSRB (Stallkamp et al., 2011) 43 12,630 Accuracy
Hateful-Memes (Kiela et al., 2020) 2 500 ROC AUC
KITTI-Distance (Fritsch et al., 2013) 4 711 Accuracy
MNIST (Deng, 2012) 10 10,000 Accuracy
Oxford Flowers-102 (Nilsback and Zisserman, 2008) 102 6,149 Mean-per-class
Oxford-IIIT Pets (Parkhi et al., 2012) 37 3,669 Mean-per-class
PatchCamelyon (Veeling et al., 2018) 2 32,768 Accuracy
Rendered-SST2 (Radford et al., 2021) 2 1,821 Accuracy
RESISC-45 (Cheng et al., 2017) 45 25,200 Accuracy
Stanford-Cars (Krause et al., 2013) 196 8,041 Accuracy
Pascal VOC-2007 (Everingham et al., 2010) 20 4,952 11-point mAP

Table 11: Details of the image classification datasets in the ELEVATER benchmark.
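Two metrics in Table 11 differ from plain accuracy: mean-per-class accuracy averages the per-class recall so that frequent classes do not dominate, and 11-point mAP follows the Pascal VOC protocol. As a small illustration, a sketch of mean-per-class accuracy with hypothetical NumPy label arrays:

import numpy as np

def mean_per_class_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # macro-averaged recall: accuracy computed within each class, then averaged
    classes = np.unique(y_true)
    per_class = [(y_pred[y_true == c] == c).mean() for c in classes]
    return float(np.mean(per_class))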

A.5 Details about the ELEVATER Benchmark

We present the data statistics and metrics of the 20 image classification datasets of the ICinW track in the ELEVATER benchmark in Table 11, as well as the experimental results of all Chinese CLIP models on zero-shot image classification in Table 12. It can be found that scaling up the model size consistently brings improvements in model performance. The predictable gains from scaling Chinese CLIP indicate that we can further scale up the model for better performance in future work. However, the tiny-size CN-CLIPRN50 still performs much worse than the significantly larger ViT variants. This shows that there is still much room for the small model to improve, and the knowledge transfer of CLIP from large models to small models should be an important research topic in multimodal representation learning.
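The zero-shot numbers in Table 12 follow the standard CLIP protocol: each class name is rendered into a Chinese prompt, encoded once by the text encoder, and every test image is assigned the class whose text embedding is most similar to its image embedding. A minimal sketch is given below; the load_from_name and tokenize calls follow the interface of our open-source repository, while the prompt template, label list, and image path are illustrative placeholders.

import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device, download_root="./")
model.eval()

labels = ["猫", "狗", "飞机"]                 # cat, dog, airplane (illustrative)
prompts = [f"一张{l}的照片" for l in labels]   # "a photo of a {label}"

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(labels[probs.argmax(dim=-1).item()])   # predicted class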
Dataset CN-CLIPRN50 CN-CLIPViT-B/16 CN-CLIPViT-L/14 CN-CLIPViT-L/14@336px CN-CLIPViT-H/14
Caltech-101 77.3 84.9 88.5 88.8 90.6
CIFAR-10 72.7 92.0 94.9 94.1 96.0
CIFAR-100 40.6 64.4 75.1 73.5 79.7
Country-211 7.7 15.2 21.0 25.4 25.3
DTD 36.9 43.6 44.2 43.8 51.2
EuroSAT 27.0 46.9 56.9 50.7 52.0
FER-2013 21.9 47.2 54.6 55.1 49.2
FGVC-Aircraft 5.4 12.8 16.0 17.1 26.2
Food-101 39.8 62.4 69.4 73.9 74.6
GTSRB 22.3 28.4 37.3 35.5 38.5
Hateful-Memes 50.3 56.2 53.4 52.8 54.7
KITTI-Distance 30.2 33.5 49.9 49.8 39.1
MNIST 50.2 67.6 69.8 65.0 79.4
Oxford Flowers-102 30.7 52.2 62.5 64.8 68.4
Oxford-IIIT Pets 48.7 73.0 81.6 83.1 83.5
PatchCamelyon 47.7 54.0 63.5 62.9 52.4
Rendered-SST2 50.1 52.3 61.4 62.9 61.0
RESISC-45 49.3 58.7 65.2 65.8 66.9
Stanford-Cars 27.3 42.3 49.8 54.1 71.8
VOC-2007 82.1 83.3 84.5 84.9 84.9
Average 40.9 53.5 60.0 60.2 62.3

Table 12: Zero-shot image classification results of Chinese CLIP models on ICinW.
