CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning
Guanbin Li
Sun Yat-sen University
liguanbin@mail.sysu.edu.cn
Liang Lin
Sun Yat-sen University
linliang@ieee.org
Abstract
We present CausalVLR (Causal Visual-Linguistic Reasoning), an open-source
toolbox containing a rich set of state-of-the-art causal relation discovery and
causal inference methods for various visual-linguistic reasoning tasks, such as
VQA, image/video captioning, medical report generation, model generalization
and robustness, etc. These methods are implemented in PyTorch and tested on NVIDIA
computing systems. The toolbox provides not only training and inference code but
also pretrained model weights. We believe this is by far the most complete
visual-linguistic causal reasoning toolbox. We hope
that the toolbox and benchmark could serve the growing research community by
providing a flexible toolkit to re-implement existing methods and develop their
own new causal reasoning methods. Code and models are available at https://
github.com/HCPLab-SYSU/Causal-VLReasoning. The project is under active
development by HCP-Lab1's contributors, and we will keep this document updated.
1 Introduction
The emergence of vast amounts of heterogeneous multi-modal data, including images [2, 9], videos
[13, 11, 10, 15], languages [8, 7, 24], audios [3], and multi-sensor [12, 14, 28, 18, 22, 5] data, has
led to the application of large language models (LLMs) such as ChatGPT [21] and ChatGLM [27] in
various vision and language tasks, showing promising performance. However, current LLMs heavily
rely on fitting extensive knowledge distributions, often capturing spurious correlations across different
modalities. Consequently, they struggle to learn a reliable chain-of-thought (CoT) [23] that reflects
essential causal relations within multi-modal knowledge, limiting their generalization and cognitive
abilities. Fortunately, causal inference [19, 20, 16, 17] provides a promising alternative for learning
robust and reliable cross-modal models, owing to its capacity for robust representation and
model learning with strong cognitive ability. For a detailed review of causal inference and visual
representation learning, please refer to our review paper [16].
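The spurious correlations mentioned above can be made precise with Pearl's do-operator [19]: a model that fits the observational conditional P(Y | X) absorbs confounding, whereas an interventional model targets P(Y | do(X)). As one textbook example (stated generically here, not as the formula of any specific CausalVLR method), the backdoor adjustment over a confounder Z reads:

```latex
P\bigl(Y \mid do(X=x)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X=x,\, Z=z\bigr)\, P\bigl(Z=z\bigr)
```

Estimating such interventional quantities, rather than raw conditionals, is what separates the causal methods collected in this toolbox from purely correlational fitting.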
Visual-linguistic reasoning endeavors to comprehend both visual and linguistic content while perform-
ing various reasoning tasks, such as visual question answering (VQA), visual dialog, image/video
1 https://www.sysu-hcp.net
captioning, and medical report generation. However, to date, there has been no comprehensive open-
source framework available for causality-aware visual-linguistic reasoning. With the aim of offering
a high-quality toolbox and a unified benchmark, we have developed CausalVLR: a PyTorch-based
open-source toolbox and benchmark designed specifically for visual-linguistic causal reasoning.
Figure 1 provides an overview of CausalVLR.
CausalVLR offers several key features: (1) Modular design: We decompose the visual-linguistic
reasoning framework into different components, allowing for the easy construction of customized
visual-linguistic reasoning frameworks by combining various modules. (2) Support for multiple
frameworks out of the box: The toolbox provides support for popular and contemporary visual-
linguistic reasoning frameworks. (3) High efficiency: All basic modules and operations are executed
on GPUs to ensure optimal performance. (4) State of the art: The toolbox is derived from the codebase
developed by the experienced HCP-Lab team, specializing in causal inference and visual-linguistic
reasoning, and continuous improvements are made. In addition to introducing the codebase and
benchmark results, we also share our experiences and best practices for visual-linguistic causal
reasoning. Ablation experiments involving hyperparameters, architectures, and training strategies
are conducted and discussed. Our aim is to contribute to future research and facilitate comparisons
between different methods. The remaining sections are organized as follows: First, we present
the various supported methods and highlight important features of CausalVLR. Next, we present
the benchmark results. Finally, we showcase representative studies on selected baselines. Table 1
provides a summary of representative algorithms from the HCP-Lab team.
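The modular design described in feature (1) can be sketched as follows. The class and component names below are illustrative assumptions for this note, not the actual CausalVLR API:

```python
import torch
from torch import nn

# Hypothetical sketch of a modular visual-linguistic reasoning pipeline:
# interchangeable components are composed into one framework.
class ReasoningPipeline(nn.Module):
    def __init__(self, visual_encoder, causal_module, answer_head):
        super().__init__()
        self.visual_encoder = visual_encoder
        self.causal_module = causal_module
        self.answer_head = answer_head

    def forward(self, frames):
        feats = self.visual_encoder(frames)       # e.g. CNN/ViT features
        deconfounded = self.causal_module(feats)  # e.g. a causal intervention step
        return self.answer_head(deconfounded)     # task-specific prediction

# Assemble a toy instance from stand-in modules.
pipeline = ReasoningPipeline(
    visual_encoder=nn.Linear(64, 32),
    causal_module=nn.Identity(),
    answer_head=nn.Linear(32, 10),
)
logits = pipeline(torch.randn(4, 64))  # batch of 4, 10 output classes
```

Because each stage is an ordinary `nn.Module`, swapping in a different encoder or causal module reconfigures the framework without touching the other components.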
2 Algorithms
This section provides a summary of three representative state-of-the-art (SOTA) algorithms for
visual question answering (VQA) and medical report generation tasks. All algorithms have been
implemented using PyTorch. The CausalVLR library will be continuously updated in the coming
years. In this section, we will provide a concise introduction to the selected algorithms.
2.1 VQA

Table 1: Representative visual-linguistic causal reasoning algorithms in CausalVLR.

Task                      | Method   | Key idea                            | Open source
Medical Report Generation | VLCI [1] | Implicit visual causal intervention | Yes

2.2 Medical Report Generation
• We design two deconfounding modules: the VDM aims to disentangle the region-based features
from images in the encoder, and the LDM aims to eliminate the spurious correlations caused by
the visual-linguistic embedding.
• To alleviate the problem of unpaired data when pre-training on visual-linguistic radiology
report generation (RRG) data, we combine PLM and MIM for cross-modal pre-training under
various data situations (e.g., unpaired data, a single modality), which is efficient and easy
to implement.
• We propose a lightweight Visual-Linguistic Causal Intervention (VLCI) framework for RRG,
which introduces mediators without additional knowledge, to implicitly deconfound the
visual-linguistic confounder by causal front-door intervention. Experimental results show
that VLCI achieves state-of-the-art performance on two datasets, IU-Xray and MIMIC-CXR.
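The causal front-door intervention underlying the last bullet can be illustrated numerically. The sketch below is a toy instance of the general front-door adjustment formula with treatment X, mediator M, and outcome Y; the distributions are synthetic and not taken from VLCI or the CausalVLR codebase:

```python
import numpy as np

# Front-door adjustment:
#   P(y | do(x)) = sum_m P(m | x) * sum_x' P(y | x', m) P(x')
# Build a random joint distribution P(x, m, y) over binary X, ternary M, binary Y.
rng = np.random.default_rng(0)
joint = rng.random((2, 3, 2))
joint /= joint.sum()                                      # normalize to a joint

p_x = joint.sum(axis=(1, 2))                              # P(x)
p_m_given_x = joint.sum(axis=2) / p_x[:, None]            # P(m | x)
p_y_given_xm = joint / joint.sum(axis=2, keepdims=True)   # P(y | x, m)

def p_y_do_x(x):
    """Interventional distribution P(Y | do(X=x)) via front-door adjustment."""
    p_y = np.zeros(2)
    for m in range(3):
        # Inner sum marginalizes the (possibly confounded) treatment x'.
        inner = sum(p_y_given_xm[xp, m] * p_x[xp] for xp in range(2))
        p_y += p_m_given_x[x, m] * inner
    return p_y

# A valid interventional distribution is non-negative and sums to one.
print(p_y_do_x(0), p_y_do_x(1))
```

The key point is that the mediator M lets us deconfound P(Y | do(X)) without ever observing the confounder, which is the role the mediators play in VLCI's implicit intervention.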
References
[1] Weixing Chen, Yang Liu, Ce Wang, Guanbin Li, Jiarui Zhu, and Liang Lin. Visual-linguistic
causal intervention for radiology report generation. arXiv preprint arXiv:2303.09117, 2023.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 770–778, 2016.
[3] Haoyuan Lan, Yang Liu, and Liang Lin. Audio-visual contrastive learning for self-supervised
action recognition. arXiv preprint arXiv:2204.13386, 2022.
[4] Junfan Lin, Keze Wang, Ziliang Chen, Xiaodan Liang, and Liang Lin. Towards causality-aware
inferring: A sequential discriminative approach for medical diagnosis. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 2023.
[5] Junfan Lin, Yuying Zhu, Lingbo Liu, Yang Liu, Guanbin Li, and Liang Lin. Denselight: Efficient
control for large-scale traffic signals with dense feedback. arXiv preprint arXiv:2306.07553,
2023.
[6] Yang Liu, Guanbin Li, and Liang Lin. Cross-modal causal relational reasoning for event-level
visual question answering. arXiv preprint arXiv:2207.12647, 2022.
[7] Yang Liu, Guanbin Li, and Liang Lin. Causality-aware visual scene discovery for cross-modal
question reasoning. arXiv preprint arXiv:2304.08083, 2023.
[8] Yang Liu, Guanbin Li, and Liang Lin. Cross-modal causal relational reasoning for event-level
visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence,
2023.
[9] Yang Liu, Jing Li, ZhaoYang Lu, Tao Yang, and ZiJian Liu. Combining multiple features for
cross-domain face sketch recognition. In Biometric Recognition: 11th Chinese Conference,
CCBR 2016, Chengdu, China, October 14-16, 2016, Proceedings 11, pages 139–146. Springer,
2016.
[10] Yang Liu, Zhaoyang Lu, Jing Li, and Tao Yang. Hierarchically learned view-invariant represen-
tations for cross-view action recognition. IEEE Transactions on Circuits and Systems for Video
Technology, 29(8):2416–2430, 2018.
[11] Yang Liu, Zhaoyang Lu, Jing Li, Tao Yang, and Chao Yao. Global temporal representation
based CNNs for infrared action recognition. IEEE Signal Processing Letters, 25(6):848–852,
2018.
[12] Yang Liu, Zhaoyang Lu, Jing Li, Tao Yang, and Chao Yao. Deep image-to-video adaptation and
fusion networks for action recognition. IEEE Transactions on Image Processing, 29:3168–3182,
2019.
[13] Yang Liu, Zhaoyang Lu, Jing Li, Chao Yao, and Yanzi Deng. Transferable feature representation
for visible-to-infrared cross-dataset human action recognition. Complexity, 2018:1–20, 2018.
[14] Yang Liu, Keze Wang, Guanbin Li, and Liang Lin. Semantics-aware adaptive knowledge
distillation for sensor-to-vision action recognition. IEEE Transactions on Image Processing,
30:5573–5588, 2021.
[15] Yang Liu, Keze Wang, Lingbo Liu, Haoyuan Lan, and Liang Lin. Tcgl: Temporal contrastive
graph for self-supervised video representation learning. IEEE Transactions on Image Processing,
31:1978–1993, 2022.
[16] Yang Liu, Yu-Shen Wei, Hong Yan, Guan-Bin Li, and Liang Lin. Causal reasoning meets visual
representation learning: A prospective study. Machine Intelligence Research, pages 1–27, 2022.
[17] Yang Liu, Yushen Wei, Hong Yan, Guanbin Li, and Liang Lin. Causal reasoning with spatial-
temporal representation learning: A prospective study. arXiv preprint arXiv:2204.12037, 2022.
[18] Jianyuan Ni, Raunak Sarbajna, Yang Liu, Anne HH Ngu, and Yan Yan. Cross-modal knowledge
distillation for vision-to-sensor action recognition. In ICASSP 2022-2022 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4448–4452. IEEE,
2022.
[19] Judea Pearl. Causality. Cambridge university press, 2009.
[20] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner,
Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the
IEEE, 109(5):612–634, 2021.
[21] Eva AM Van Dis, Johan Bollen, Willem Zuidema, Robert van Rooij, and Claudi L Bockting.
ChatGPT: five priorities for research. Nature, 614(7947):224–226, 2023.
[22] Kuo Wang, LingBo Liu, Yang Liu, GuanBin Li, Fan Zhou, and Liang Lin. Urban regional
function guided traffic flow prediction. Information Sciences, 634:308–320, 2023.
[23] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le,
Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In
Advances in Neural Information Processing Systems, 2022.
[24] Yushen Wei, Yang Liu, Hong Yan, Guanbin Li, and Liang Lin. Visual causal scene refinement
for video question answering. arXiv preprint arXiv:2305.04224, 2023.
[25] Yang Wu, Pengxu Wei, and Liang Lin. Scene graph to image synthesis via knowledge consensus.
In AAAI, 2023.
[26] Yao Xiao, Ziyi Tang, Pengxu Wei, Cong Liu, and Liang Lin. Masked images are counterfactual
samples for robust fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 20301–20310, 2023.
[27] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang,
Yifan Xu, Wendi Zheng, Xiao Xia, et al. GLM-130B: An open bilingual pre-trained model. In ICLR,
2023.
[28] Yuying Zhu, Yang Zhang, Lingbo Liu, Yang Liu, Guanbin Li, Mingzhi Mao, and Liang Lin.
Hybrid-order representation learning for electricity theft detection. IEEE Transactions on
Industrial Informatics, 19(2):1248–1259, 2022.