Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese
arXiv:2211.01335v2 [cs.CV] 3 Nov 2022

Abstract

The tremendous success of CLIP (Radford et al., 2021) ...
Table 2: Experimental results on MUGE-Retrieval. We report the performance of both baselines and Chinese CLIP
models on text-to-image retrieval and image-to-text retrieval in the setups of zero-shot evaluation and finetuning.
performance on MUGE and test performance on Flickr30K-CN and COCO-CN. Note that in the setup of finetuning, R2D2 [8] is essentially an end-to-end model of retrieval and ranking.

[8] Since the original paper of R2D2 does not provide the patch size of their base-size model, here we denote the model as R2D2ViT-B.

3.1.2 Results

Table 2 reports the model performance on MUGE-Retrieval. For the base-size model, CN-CLIPViT-B/16 outperforms the baselines on all metrics and in both setups of zero-shot learning and finetuning. Specifically, for the base-size models, CN-CLIPViT-B/16 surpasses WukongViT-B/32 by 17.0 MR in zero-shot learning and surpasses R2D2ViT-B by 8.7 MR in finetuning. Besides, the tiny model CN-CLIPRN50 can outperform the base-size WukongViT-B/32 by 8.9 MR in zero-shot learning and 8.0 MR in finetuning.

For the large-size models, CN-CLIPViT-L/14 outperforms both baselines in all metrics, and CN-CLIPViT-L/14@336px, pretrained on images of a larger resolution, achieves the state-of-the-art performance. CN-CLIPViT-L/14@336px outperforms R2D2ViT-L/14 by 6.6 MR in zero-shot learning and 3.8 MR in finetuning. When scaling to CN-CLIPViT-H/14, the performance is further improved. Compared with the best large-size model CN-CLIPViT-L/14@336px, CN-CLIPViT-H/14 surpasses it by 2.7 MR in zero-shot learning and 2.3 MR in finetuning.

Tables 3 and 4 report the model performance on Flickr30K-CN and COCO-CN. We focus on the evaluation of R@1. On both datasets, CN-CLIP achieves better performance than the baselines. For the base-size models, in the zero-shot setup of Flickr30K-CN, CN-CLIPViT-B/16 surpasses WukongViT-B/32 by 17.0 R@1 in text-to-image retrieval and 8.4 R@1 in image-to-text retrieval, and in the finetuning setup, CN-CLIPViT-B/16 surpasses R2D2ViT-B by 0.8 R@1 in image retrieval and 0.9 R@1 in text retrieval. Similarly, in the finetuning setup of COCO-CN, CN-CLIPViT-B/16 surpasses R2D2ViT-B by 1.9 R@1 in image retrieval and 1.3 R@1 in text retrieval. For the tiny-size CN-CLIPRN50, it again achieves or surpasses the performance of WukongViT-B/32 on several metrics of Flickr30K-CN and COCO-CN.
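For clarity on the metrics above: R@K is the fraction of queries whose ground-truth item appears among the top-K retrieved results, and MR (mean recall) is, as is standard for these retrieval benchmarks, the average of R@1, R@5, and R@10. The minimal Python sketch below is illustrative, not the paper's evaluation code; the function and variable names are placeholders.

```python
# Hedged sketch of how R@K and MR are typically computed for these benchmarks.
from typing import List

def recall_at_k(gold_ranks: List[int], k: int) -> float:
    """Percentage of queries whose ground-truth item appears in the top-k results.
    gold_ranks[i] is the 1-based rank of the correct item for query i."""
    return 100.0 * sum(r <= k for r in gold_ranks) / len(gold_ranks)

def mean_recall(gold_ranks: List[int]) -> float:
    """MR: average of R@1, R@5, and R@10 (assumed standard definition)."""
    return sum(recall_at_k(gold_ranks, k) for k in (1, 5, 10)) / 3

# Toy usage: ranks of the correct image for four text queries.
ranks = [1, 3, 7, 12]
print(recall_at_k(ranks, 1), recall_at_k(ranks, 5), recall_at_k(ranks, 10), mean_recall(ranks))
```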
Tasks Text-to-Image Image-to-Text
Setups Zero-shot Finetuning Zero-shot Finetuning
Metrics R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
Tiny-size Model
CN-CLIPRN50 48.8 76.0 84.6 66.7 89.4 94.1 60.0 85.9 92.0 84.2 96.7 98.0
Base-size Models
WukongViT-B/32 45.7 73.8 82.2 67.6 89.6 94.2 66.2 88.7 94.3 83.9 97.6 99.0
R2D2ViT-B - - - 78.3 94.6 97.0 - - - 92.6 99.1 99.8
CN-CLIPViT-B/16 62.7 86.9 92.8 79.1 94.8 97.4 74.6 93.5 97.1 93.5 99.0 99.5
Large-size Models
WukongViT-L/14 51.7 78.9 86.3 77.4 94.5 97.0 76.1 94.8 97.5 92.7 99.1 99.6
R2D2ViT-L/14 60.9 86.8 92.7 84.4 96.7 98.4 77.6 96.7 98.9 95.6 99.8 100.0
CN-CLIPViT-L/14 68.0 89.7 94.4 82.7 96.7 98.6 80.2 96.6 98.2 96.1 99.5 99.9
CN-CLIPViT-L/14@336px 69.0 90.7 95.4 84.4 97.1 98.7 83.3 97.2 98.5 96.6 99.8 100.0
Huge-size Models
CN-CLIPViT-H/14 71.2 91.4 95.5 83.8 96.9 98.6 81.6 97.5 98.8 95.3 99.7 100.0
Table 3: Experimental results on Flickr30K-CN. We report the performance of both baselines and Chinese CLIP
models on text-to-image retrieval and image-to-text retrieval in the setups of zero-shot evaluation and finetuning.
Specifically, CN-CLIPRN50 surpasses WukongViT-B/32 by 3.1 R@1 in the zero-shot learning of Flickr30K-CN image retrieval and by 2.6 R@1 in the finetuning of COCO-CN text retrieval.

For the large-size models, in the zero-shot setup of Flickr30K-CN, CN-CLIPViT-L/14 surpasses WukongViT-L/14 by 16.3 R@1 in text-to-image retrieval and 4.1 R@1 in image-to-text retrieval. CN-CLIPViT-L/14@336px further improves over CN-CLIPViT-L/14 by 1.0 R@1 in image retrieval and 3.1 R@1 in text retrieval. In the finetuning setup, CN-CLIPViT-L/14 surpasses R2D2ViT-L/14 by 0.5 R@1 in text retrieval. CN-CLIPViT-L/14@336px achieves equal performance with R2D2ViT-L/14 in image retrieval and surpasses it by 1.0 R@1 in text retrieval. Similarly, in the finetuning setup of COCO-CN, CN-CLIPViT-L/14 surpasses R2D2ViT-L/14 by 0.9 R@1 in text retrieval. CN-CLIPViT-L/14@336px further surpasses R2D2ViT-L/14 by 1.0 R@1 in image retrieval and 1.9 R@1 in text retrieval.

On Flickr30K-CN and COCO-CN, scaling from CN-CLIPViT-L/14 to CN-CLIPViT-H/14 improves the performance on almost all the metrics. Specifically, in the zero-shot setup of Flickr30K-CN, CN-CLIPViT-H/14 surpasses CN-CLIPViT-L/14 by 3.2 R@1 in image retrieval and 1.4 R@1 in text retrieval. Moreover, in the finetuning setup of COCO-CN, CN-CLIPViT-H/14 even surpasses CN-CLIPViT-L/14@336px, which uses a larger image resolution, by 1.4 R@1 in image retrieval and 2.3 R@1 in text retrieval. We also compare our CN-CLIPViT-H/14 with a huge CLIP-like model, T-Bletchley [9]. This model has 2.5 billion parameters and is pretrained on billions of multilingual image-caption pairs. In the finetuning setup of COCO-CN, despite its smaller model size and pretraining dataset, CN-CLIPViT-H/14 still surpasses T-Bletchley by 3.9 MR.

[9] https://www.microsoft.com/en-us/research/blog/turing-bletchley-a-universal-image-language-representation-model-by-microsoft/

3.1.3 Ablation Study

Here we provide an ablation study on our proposed two-stage pretraining method. To validate its significance and effectiveness, we design several setups for the ablation study. Our experiments are conducted on CN-CLIPViT-B/16. To evaluate the influence of initialization, we pretrain a model from scratch, and to examine the influence of LiT, we pretrain a model without freezing the image encoder. For a better demonstration, we report the curves of model performance of zero-shot retrieval on different datasets in terms of pretraining progress, indicated by the number of processed samples.

Figure 2 shows the performance on different tasks, namely MUGE text-to-image retrieval and text-to-image and image-to-text retrieval on Flickr30K-CN and COCO-CN. In comparison with pretraining with pretrained model initialization, pretraining from scratch performs much worse, though it shows consistent performance improvement in terms of pretraining progress.
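To make the ablation setups concrete, here is a minimal PyTorch-flavored sketch of the three configurations compared above: two-stage pretraining (Stage 1 locks the image encoder as in LiT, Stage 2 unlocks it), pretraining without LiT (image encoder trainable from the start), and pretraining from scratch (no pretrained initialization). The encoders, loss, and toy step are illustrative placeholders under these assumptions, not the authors' implementation.

```python
# Minimal sketch of the ablation setups; Tower is a stand-in for both the pretrained
# vision backbone and the Chinese text encoder, not the paper's actual modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Placeholder encoder that maps features to L2-normalized joint-space embeddings."""
    def __init__(self, in_dim=32, out_dim=16):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over in-batch negatives, as in CLIP-style contrastive tuning."""
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

image_encoder, text_encoder = Tower(), Tower()  # pretrained weights would be loaded here

# Stage 1 (LiT): lock the image encoder and tune only the text encoder.
set_trainable(image_encoder, False)
set_trainable(text_encoder, True)

# Stage 2: unlock the image encoder and continue contrastive tuning end to end.
# The "w/o LiT" ablation skips Stage 1; "from scratch" skips pretrained initialization.
set_trainable(image_encoder, True)

# Toy step showing the objective on a random batch of image/text features.
imgs, txts = torch.randn(4, 32), torch.randn(4, 32)
print(clip_loss(image_encoder(imgs), text_encoder(txts)).item())
```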
Tasks Text-to-Image Image-to-Text
Setups Zero-shot Finetuning Zero-shot Finetuning
Metrics R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
Tiny-size Model
CN-CLIPRN50 48.1 81.3 90.5 66.8 91.1 97.0 51.6 81.2 90.5 68.4 93.3 97.8
Base-size Models
WukongViT-B/32 49.2 79.4 87.9 67.0 91.4 96.7 48.3 77.8 88.8 65.8 90.3 96.6
R2D2ViT-B - - - 75.1 94.2 98.1 - - - 76.1 95.3 98.5
CN-CLIPViT-B/16 62.2 86.6 94.9 77.0 97.1 99.0 57.0 84.1 93.6 77.4 96.2 98.9
Large-size Models
WukongViT-L/14 53.4 80.2 90.1 74.0 94.4 98.1 55.2 81.0 90.6 73.3 94.0 98.0
R2D2ViT-L/14 56.4 85.0 93.1 79.1 96.5 98.9 63.3 89.3 95.7 79.3 97.1 98.7
CN-CLIPViT-L/14 64.0 89.2 94.4 78.9 96.3 99.0 60.4 84.2 92.9 80.2 96.7 99.2
CN-CLIPViT-L/14@336px 64.7 89.6 94.6 80.1 96.7 99.2 63.4 87.2 94.4 81.2 97.2 99.1
Huge-size Models
CN-CLIPViT-H/14 69.2 89.9 96.1 81.5 96.9 99.1 63.0 86.6 92.9 83.5 97.3 99.2
Table 4: Experimental results on COCO-CN. We report the performance of both baselines and Chinese CLIP
models on text-to-image retrieval and image-to-text retrieval in the setups of zero-shot evaluation and finetuning.
Since machine translated COCO is included in our pretraining dataset, here the numbers of Chinese CLIP zero-shot
performances are shown in gray.
As to the importance of LiT, we observe different phenomena on different datasets. On MUGE, a dataset of samples originally collected from Chinese websites, we find that pretraining without LiT might be the best solution, though its performance gap with two-stage pretraining is quite small. However, on the other datasets, i.e., Flickr30K-CN and COCO-CN, whose samples are translated from the English datasets, we find that our two-stage pretraining performs significantly better than pretraining without LiT. Furthermore, we observe a common phenomenon that in the two-stage pretraining, switching from Stage 1 to Stage 2 can effectively boost the model performance to a higher level. This reflects the importance of adapting the pretrained model to the data distribution of the Chinese multimodal data, especially those concerned with visual information.

3.2 Zero-shot Image Classification

This section introduces the image classification datasets, and provides comprehensive experimental results and in-depth analyses to probe the inherent defects of Chinese CLIP.

3.2.1 Open-Domain Image Classification Benchmark in Chinese

Contrastive pretraining on image-text pairs builds a connection between vision and natural language. Natural language supervision instead of crowdsourced labeling endows models with the capability of zero-shot image classification by computing similarities between the given image and the text descriptions of the labels in the candidate set. Recent progress in this field is the proposal of the ELEVATER benchmark (Li et al., 2022b). The ELEVATER benchmark includes "Image Classification in the Wild (ICinW)" for open-domain image classification and "Object Detection in the Wild (ODinW)" for open-domain object detection. We evaluate Chinese CLIP for zero-shot image classification on ICinW. ICinW consists of a series of image classification datasets, e.g., CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), FER-2013 (Radford et al., 2021), FGVC-Aircraft (Maji et al., 2013), KITTI-Distance (Fritsch et al., 2013), MNIST (Deng, 2012), Pascal-VOC-2007 (Everingham et al., 2010), etc. The evaluation metrics include accuracy, mean-per-class accuracy, etc.

In order to evaluate Chinese CLIP on these datasets, we first transform the datasets for Chinese models by translating labels and prompts into Chinese. Specifically, we translate the text descriptions of the labels and the templates for manual prompts into Chinese. For example, the labels in CIFAR-10 include "car, dog, ...", and we manually translate the words into Chinese. There are also particular cases, such as the labels in FGVC-Aircraft (Maji et al., 2013), which are difficult to translate or transliterate. We search the names on Google and figure out the best Chinese name for each label.
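As a concrete illustration of the evaluation procedure described above, the sketch below scores images against Chinese prompts built from translated label names and predicts the label with the highest image-text similarity. The prompt template, label names, and encoders are illustrative placeholders rather than the paper's exact prompts or models.

```python
# Hedged sketch of zero-shot classification by image-text similarity.
import torch
import torch.nn.functional as F

labels_zh = ["汽车", "狗", "猫"]   # translated class names (car, dog, cat) -- illustrative
template = "一张{}的照片"          # assumed template: "a photo of a {}"

def encode_text(prompts):
    """Placeholder text encoder: returns L2-normalized embeddings, one per prompt."""
    torch.manual_seed(0)
    return F.normalize(torch.randn(len(prompts), 16), dim=-1)

def encode_image(images):
    """Placeholder image encoder: returns L2-normalized embeddings, one per image."""
    return F.normalize(torch.randn(len(images), 16), dim=-1)

prompts = [template.format(name) for name in labels_zh]
text_emb = encode_text(prompts)             # (num_classes, dim)
img_emb = encode_image(["img_0", "img_1"])  # (batch, dim)

# Predicted class = label whose prompt embedding is most similar to the image embedding.
pred = (img_emb @ text_emb.t()).argmax(dim=-1)
print([labels_zh[i] for i in pred])
```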
Figure 2: Comparison of base-size Chinese CLIP models with different training methods on MUGE, Flickr30k-CN and COCO-CN. (Panels shown: MUGE Image Retrieval, Flickr30k-CN Image Retrieval, Flickr30k-CN Text Retrieval; y-axis: Zero-Shot Mean Recall.)
Model CIFAR10 CIFAR100 DTD EuroSAT FER FGVC KITTI MNIST PC VOC
Original benchmark
DeCLIP 90.9 66.8 44.9 39.9 23.3 9.0 39.7 13.6 55.3 80.6
GIT 88.5 61.1 42.9 43.4 41.4 6.7 22.1 68.9 50.0 80.2
ALIGN 94.9 76.8 66.1 52.1 50.8 25.0 41.2 74.0 55.2 83.0
OpenCLIP 93.5 76.2 56.4 53.7 50.3 20.8 28.8 70.9 50.5 82.3
CLIP 94.9 77.0 56.0 63.0 48.3 33.3 11.5 79.0 62.3 84.0
Translated benchmark
BriVL 72.3 35.9 18.8 25.5 - - - - - -
Wukong 95.4 77.1 40.9 50.3 - - - - - -
CN-CLIP 96.0 79.7 51.2 52.0 55.1 26.2 49.9 79.4 63.5 84.9
Table 5: Experimental results of the zero-shot image classification performance of models on ICinW.
References

Nanyi Fei, Zhiwu Lu, Yizhao Gao, Guoxing Yang, Yuqi Huo, Jingyuan Wen, Haoyu Lu, Ruihua Song, Xin Gao, Tao Xiang, et al. 2021. Wenlan 2.0: Make ai imagine via a multimodal foundation model. arXiv preprint arXiv:2110.14378.

Li Fei-Fei, Rob Fergus, and Pietro Perona. 2004. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE.

Jannik Fritsch, Tobias Kuehnl, and Andreas Geiger. 2013. A new performance measure and evaluation benchmark for road detection algorithms. In 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), pages 1693–1700. IEEE.

Mitchell A. Gordon, Kevin Duh, and Jared Kaplan. 2021. Data and parameter scaling laws for neural machine translation. In EMNLP 2021, pages 5915–5922. Association for Computational Linguistics.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR 2017, pages 6325–6334. IEEE Computer Society.

Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Minzhe Niu, Hang Xu, Xiaodan Liang, Wei Zhang, Xin Jiang, and Chunjing Xu. 2022. Wukong: 100 million large-scale chinese cross-modal pre-training dataset and a foundation framework. arXiv preprint arXiv:2202.06767.

Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. 2021. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR 2016, pages 770–778.

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. 2019. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226.

Arian Hosseini, Siva Reddy, Dzmitry Bahdanau, R Devon Hjelm, Alessandro Sordoni, and Aaron Courville. 2021. Understanding by understanding not: Modeling negation in language models. arXiv preprint arXiv:2105.03519.

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. 2021. Openclip. If you use this software, please cite it as below.

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918.

Aditya Khandelwal and Suraj Sawant. 2019. Negbert: A transfer learning approach for negation detection and scope resolution. arXiv preprint arXiv:1911.04211.

Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in Neural Information Processing Systems, 33:2611–2624.

Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In ICML 2021, volume 139 of Proceedings of Machine Learning Research, pages 5583–5594. PMLR.

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 2013. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73.

Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images.

Weiyu Lan, Xirong Li, and Jianfeng Dong. 2017. Fluency-guided cross-lingual image captioning. In Proceedings of the 25th ACM International Conference on Multimedia.

Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. 2022a. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005.

Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Yong Jae Lee, Houdong Hu, Zicheng Liu, et al. 2022b. Elevater: A benchmark and toolkit for evaluating language-augmented visual models. arXiv preprint arXiv:2204.08790.

Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. 2019a. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. CoRR, abs/1908.06066.

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022c. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086.

Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Gotmare, Shafiq R. Joty, Caiming Xiong, and Steven Chu-Hong Hoi. 2021a. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS 2021, pages 9694–9705.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019b. Visualbert: A simple and performant baseline for vision and language. ArXiv, abs/1908.03557.

Xirong Li, Chaoxi Xu, Xiaoxu Wang, Weiyu Lan, Zhengxiong Jia, Gang Yang, and Jieping Xu. 2019c. Coco-cn for cross-lingual image tagging, captioning, and retrieval. IEEE Transactions on Multimedia, 21:2347–2360.

Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV.

Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. 2021b. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208.

Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, Jie Zhang, Jianwei Zhang, Xu Zou, Zhikang Li, Xiaodong Deng, Jie Liu, Jinbao Xue, Huiling Zhou, Jianxin Ma, Jin Yu, Yong Li, Wei Lin, Jingren Zhou, Jie Tang, and Hongxia Yang. 2021a. M6: A chinese multimodal pretrainer. CoRR, abs/2103.00823.

Junyang Lin, An Yang, Jinze Bai, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Yong Li, Wei Lin, Jingren Zhou, and Hongxia Yang. 2021b. M6-10T: A sharing-delinking paradigm for efficient multi-trillion parameter pretraining. CoRR, abs/2110.03888.

Junyang Lin, An Yang, Yichang Zhang, Jie Liu, Jingren Zhou, and Hongxia Yang. 2020. Interbert: Vision-and-language interaction for multi-modal pretraining. CoRR, abs/2003.13198.

Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. 2022. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12009–12019.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS 2019, pages 13–23.

Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2021. Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860.

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. 2013. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.

Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. 2021. Slip: Self-supervision meets language-image pre-training. arXiv preprint arXiv:2112.12750.

Maria-Elena Nilsback and Andrew Zisserman. 2008. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE.

Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. 2012. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE Computer Society.

Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. 2021. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2085–2094.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In ICML 2021, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In ICML 2021, volume 139 of Proceedings of Machine Learning Research, pages 8821–8831. PMLR.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695.

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114.

Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. 2021. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383.

Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. 2011. The german traffic sign recognition benchmark: A multi-class classification competition. In IJCNN 2011, pages 1453–1460. IEEE.

Bastiaan S Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. 2018. Rotation equivariant cnns for digital pathology. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 210–218. Springer.

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022a. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. CoRR, abs/2202.03052.

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. 2022b. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442.

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2021b. Simvlm: Simple visual language model pretraining with weak supervision. CoRR, abs/2108.10904.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Chunyu Xie, Heng Cai, Jianfei Song, Jincheng Li, Fanjing Kong, Xiaoyu Wu, Henrique Morimitsu, Lin Yao, Dexin Wang, Dawei Leng, et al. 2022. Zero and r2d2: A large-scale chinese cross-modal benchmark and a vision-language framework. arXiv preprint arXiv:2205.03860.

Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, and Fei Huang. 2021. E2e-vlp: End-to-end vision-language pre-training enhanced by visual learning. arXiv preprint arXiv:2106.01804.

An Yang, Junyang Lin, Rui Men, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Jiamang Wang, Yong Li, Di Zhang, Wei Lin, Lin Qu, Jingren Zhou, and Hongxia Yang. 2021. Exploring sparse expert models and beyond. CoRR, abs/2105.15082.

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. 2021. FILIP: fine-grained interactive language-image pre-training. CoRR, abs/2111.07783.

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. 2021. Florence: A new foundation model for computer vision. CoRR, abs/2111.11432.

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. 2022. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133.

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In CVPR 2021, pages 5579–5588. Computer Vision Foundation / IEEE.
Table 10: Finetuning results on ICR dataset. We report the performance of baselines, CN-CLIPViT-B/16 and
CN-CLIPViT-L/14 on text-to-image and image-to-text retrieval.
Table 11: Details of the image classification datasets in the ELEVATER benchmark.
We set the maximum text length to 128 for finetuning. The results show that Chinese CLIP achieves state-of-the-art performance in cross-modal retrieval tasks with longer texts.

A.5 Details about the ELEVATER benchmark

We present the data statistics and metrics of the 20 image classification datasets of the ICinW track of the ELEVATER benchmark in Table 11, and also the experimental results of all Chinese CLIP models on zero-shot image classification in Table 12. It can be found that scaling up the model size consistently brings improvements in model performance. The predictable improvements from scaling Chinese CLIP indicate that we can further scale up the model for better performance in future work. However, the tiny-size CN-CLIPRN50 still performs much worse than the significantly larger ViT variants. This shows that there is still much room for the small model to improve, and the knowledge transfer of CLIP from large models to small models should be an important research topic in multimodal representation learning.
Table 12: Experimental results of the zero-shot image classification performance of models on ICinW. (Columns: Dataset, CN-CLIPRN50, CN-CLIPViT-B/16, CN-CLIPViT-L/14, CN-CLIPViT-L/14@336px, CN-CLIPViT-H/14.)