Is Knowledge All Large Language Models Needed For Causal Reasoning?

Hengrui Cai∗1, Shengjie Liu∗2, and Rui Song†3

1 University of California, Irvine, hengrc1@uci.edu
2 North Carolina State University, sliu56@ncsu.edu
3 Amazon, songray@gmail.com

∗Equal contribution.
†This work is not related to the author’s position at Amazon.

arXiv:2401.00139v1 [cs.AI] 30 Dec 2023

Abstract
This paper explores the causal reasoning of large language models (LLMs) to enhance
their interpretability and reliability in advancing artificial intelligence. Despite the profi-
ciency of LLMs in a range of tasks, their potential for understanding causality requires fur-
ther exploration. We propose a novel causal attribution model that utilizes “do-operators”
for constructing counterfactual scenarios, allowing us to systematically quantify the influ-
ence of input numerical data and LLMs’ pre-existing knowledge on their causal reasoning
processes. Our newly developed experimental setup assesses LLMs’ reliance on contex-
tual information and inherent knowledge across various domains. Our evaluation reveals
that LLMs’ causal reasoning ability depends on the context and domain-specific knowledge
provided, and supports the argument that knowledge is, indeed, what LLMs principally re-
quire for sound causal reasoning. On the contrary, in the absence of knowledge, LLMs still
maintain a degree of causal reasoning using the available numerical data, albeit with limi-
tations in the calculations. A Python implementation of our proposed method is available
at https://github.com/ncsulsj/Causal_LLM.

1 Introduction

Large language models (LLMs) have been at the forefront of advancing artificial intelligence by
leveraging their multi-billion parameter frameworks and extensive training corpuses, thereby
becoming dramatically influential in diverse fields (e.g., Vaswani et al., 2017; Kenton and
Toutanova, 2019; Lewis et al., 2019; Brown et al., 2020; Neelakantan et al., 2022; Stiennon et al.,
2022; OpenAI, 2023). The versatility of LLMs emphasizes the breadth of their training and
the sophistication of their design, reflecting a new era in computational intelligence. However,
as LLMs become increasingly adept at handling various forms of data and engaging in multiple
domains, their potential for reasoning, especially causal reasoning (Pearl et al., 2009), remains
a point of investigation (see a recent survey in Liu et al., 2023). This leads to the pressing
question: Can LLMs genuinely comprehend and process causal relationships, and if so, what
mechanisms underlie their inference-making process?

Figure 1: Illustration of the ability attribution of ChatGPT answering the causal question.
Several recent studies have attempted to explore this question. Kıcıman et al. (2023)
recently considered the use of LLMs in tasks that involve causal discovery. They integrated
the variable names into a contextual text template and tasked LLMs with discerning the
cause among the options. Gao et al. (2023) extended such an evaluation to the context of a
broad causal knowledge graph. Their findings revealed that LLMs outperformed traditional
machine learning methods on multiple benchmarks. Jin et al. (2023a) also examined the
capacity for causal discovery of LLMs through a correlation-to-causation inference framework.
This framework provided correlation descriptions among different nodes to LLMs and tested if
these models can determine the true causal dependencies. The results indicated that most LLMs
exhibited performance close to random on this task. Drawing inspiration from Pearl (2000)’s
Ladder of Causation, Jin et al. (2023b) compiled an extensive dataset originating from a set
of causal graphs and queries, which was then transformed into a narrative comprising causal
context and questions presented in natural language. Their experiment revealed that LLMs
struggled with the complex dataset even with the aid of their proposed prompting strategy.

Figure 2: Illustration of the ability attribution on answering the causal question by generating
counterfactual examples.

Despite the promising insights yielded by these studies, the oscillation of task difficulty, from
overly simplistic pure-context settings to profoundly complex designs, casts a veil over the true
extent of LLMs’ causal reasoning abilities. Moreover, all these studies mainly focus on context
information without investigating the numerical data that intrinsically reflect causalities
(Pearl et al., 2009). The mystery of LLMs’ capacity to access causality, and the mechanisms
by which they do so, remains unsolved, prompting further research into the delineation of their
inference capabilities.
In this paper, we focus on a more systematic and robust evaluation methodology to investigate
how LLMs such as ChatGPT conduct causal reasoning with varying input components.
Our objective is to attribute the contributions of the inherent causal knowledge embedded
within the variable names, and the explicit numerical data that denote causal links, given
the context provided for the causal task (as illustrated in Figure 1). The cornerstone of our
technique is the generation of counterfactual examples, employing the “do-operators” concep-
tualized by Pearl et al. (2009). Counterfactual generation allows us to produce a comprehensive
combination of all different input components to observe changes in model output, thus quan-
tifying the influence of each individual component on model causal reasoning performance.
Refer to Figure 2 for an illustration of the generation of counterfactual examples. We design
a series of extensive experiments on datasets from diverse domains to detect whether these recent
state-of-the-art LLMs possess causal reasoning abilities. The contributions of our research are
threefold and offer significant insights into the mechanisms through which LLMs engage in
causal reasoning.
• We introduce a causal attribution model and propose the definitions of marginal and
conditional attributions of knowledge and data, through the notion of “do-operators” (Pearl
et al., 2009). The proposed definitions differentiate and quantify the effects of omitted knowl-
edge or data on the accuracy of the model’s predictions, given provided task contexts. Our
approach thus provides a refined analysis of how LLMs utilize their inherent knowledge and
the numerical information contained in the input to formulate causal inferences.
• We further design a series of novel experiments that generate counterfactual settings
to estimate attribution scores in the proposed causal attribution model. This allows us to
comprehensively evaluate the weight of knowledge and data, given the optimal prompts of
causal reasoning tasks in LLMs. We conclude that, when equipped with adequate knowledge,
LLMs demonstrate a proficient capacity for causal reasoning that is consistent with human
logic and common sense, supporting the main argument that knowledge is, indeed, what LLMs
principally require for sound causal reasoning. In contrast, in the absence of substantial training
contexts, LLMs manage to maintain a degree of causal reasoning using the available numerical
data, although with limitations.
• Based on our main findings, we conduct innovative supporting analyses in terms of
robustness, generalizability, and explainability. First, we design a reverse causal inference
task that inverts the causal directions by switching the numerical data; LLMs produce nearly
the same causal conclusions as for the original data, reflecting the extremely high weight of
their intrinsic knowledge. We further examine LLMs’ broad causal reasoning in pairwise causal
discovery across an expansive range of domains, with results aligned with our main findings.
Finally, by leveraging Chain of Thought (CoT) prompting, we show that certain LLMs lack
computational skills and incorrectly interpret correlations as indicative of causal relationships.
This identifies directions for improving LLMs’ causal inference abilities.
The general applicability of our framework highlights the potential of our method to serve
as a tool for broader investigative research on the cognitive capacities of LLMs, facilitating a
deeper understanding of how these models process and analyze information to emulate complex
human-like reasoning patterns.

2 Causal Attribution Model

In this section, we formalize the causal attribution model for understanding how different com-
ponents of input affect the predictive performance of LLMs. For the i-th sample in our dataset,
we decompose the input into several distinct elements: the input context ci , which provides
the scenario for the causal inference task; the embedded causal knowledge within variable
names ki , which may carry implicit causal cues; and numerical data di , representing explicit
causal relationships. Any remaining information, including potential unmeasured confounders,
is denoted as ui . The goal is to compare the causal inference made by an LLM, represented
as ŷi, against the true causal responses yi. To dissect the influence of each input component
systematically, we employ the do-calculus, a powerful tool from the causal inference framework
that allows the modeling of interventions in a system (Pearl et al., 2009). This calculus fa-
cilitates the decomposition of each component’s effect by enabling hypothetical modifications
to the input data—akin to controlled experiments in traditional scientific inquiry. Applying
do-calculus, we define four key probabilities that represent the model’s accuracy under various
conditions of knowledge and data inclusion or exclusion. We first introduce the attribution to
knowledge given the data is fixed to quantify the impact of knowledge as follows.

Definition 2.1. Conditional Attribution of Knowledge Given Data:

CAKi = P(ŷi = yi | do(ki = ki, di = di), ci, ui) − P(ŷi = yi | do(ki = ∅, di = di), ci, ui). (1)

This equation defines the probability difference of an LLM making a correct causal inference
when variable names containing potential causal knowledge are included versus when they
are absent, given the same data. It quantifies the added value of linguistic cues present in
variable names that might carry implicit causal relationships. If the value of CAKi is near
zero, it suggests that incorporating the embedded knowledge into the prediction process has a
negligible effect on the accuracy of the model’s causal reasoning. A CAKi value approaching
one indicates that the knowledge component is highly beneficial and substantially enhances the
model’s predictive accuracy. Conversely, a negative value of CAKi implies that the presence
of this knowledge actually detracts from the model’s ability to reason causally, leading to less
accurate predictions. Similarly, we define the attribution to data given the knowledge below.

Definition 2.2. Conditional Attribution of Data Given Knowledge:

CADi = P(ŷi = yi | do(ki = ki, di = di), ci, ui) − P(ŷi = yi | do(ki = ki, di = ∅), ci, ui). (2)

Here, we assess the impact of numerical data on causal inference accuracy by comparing
the performance of the LLM with full numerical data access to its performance with numerical
data systematically removed while retaining the embedded knowledge component. This helps
in understanding the extent to which explicit numerical information contributes to the model’s
causal reasoning. It is worth noting that the first term in (1) and (2) is the same as the
original prediction accuracy given the full set of information. The second term is the prediction
accuracy after omitting knowledge or data. These conditional attributions are independent
when knowledge and data have no overlap. In addition to conditional attributions, we further
consider the marginal attributions, which examine how the presence of data or knowledge
alone can guide the LLM towards accurate causal inferences, in the absence of the other
component, as follows.

Definition 2.3. Marginal Attribution of Data

MADi = P(ŷi = yi | do(ki = ∅, di = di), ci, ui) − P(ŷi = yi | do(ki = ∅, di = ∅), ci, ui). (3)

Here, the marginal attribution of data measures the impact of data alone without any
auxiliary knowledge. This distinction is crucial for evaluating the independent effectiveness of
data in the inference process. In a similar logic, we define the marginal attribution of knowledge
as follows.

Definition 2.4. Marginal Attribution of Knowledge

MAKi = P(ŷi = yi | do(ki = ki, di = ∅), ci, ui) − P(ŷi = yi | do(ki = ∅, di = ∅), ci, ui). (4)

In contrast, this attribution assesses the unique influence of the knowledge embedded
within the variable names when the numerical data are deliberately omitted. This measures
the ability of LLMs to leverage their learned knowledge to fill in gaps where explicit numerical
data are lacking. It can be observed that the second term, that is, the baseline, in (3) and (4)
is the same. Similarly, when these components are manipulated independently, we can infer
their independent contributions to the LLM’s causal reasoning. Combining Definitions 2.1 to
2.4, we can show the following attribution result:

MADi − MAKi = CADi − CAKi.

The equation above establishes the internal consistency between conditional and marginal attri-
butions, revealing that the difference between the marginal attributions of data and knowledge
is equivalent to the difference between their respective conditional attributions. This alignment
underscores the robustness of our model in ascribing significance to knowledge and data inputs
in the causal reasoning of LLMs. By examining both conditional and marginal contributions,
we can better understand the interplay and significance of various input components in the
model’s decision-making framework.
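
This identity can be verified directly from Definitions 2.1 to 2.4. Writing P(k, d) as shorthand
for P(ŷi = yi | do(ki = k, di = d), ci, ui), we have

CADi − CAKi = [P(ki, di) − P(ki, ∅)] − [P(ki, di) − P(∅, di)] = P(∅, di) − P(ki, ∅),
MADi − MAKi = [P(∅, di) − P(∅, ∅)] − [P(ki, ∅) − P(∅, ∅)] = P(∅, di) − P(ki, ∅),

so the two differences coincide.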

3 Experiment Design

In this section, we detail the experimental design tailored to evaluate the causal reasoning ca-
pabilities of LLMs by attributing performance to various input components. We first introduce
the dataset involved in the causal discovery task in Section 3.1, which sets the stage for sub-
sequent experimental tasks. We optimize prompt training to enhance the accuracy of LLMs’
causal inference in Section 3.2, aligning with the baseline accuracy component described by the
first term in Definitions 2.1 and 2.2. Subsequent experiments are then meticulously designed
to measure LLMs’ performance in the absence of knowledge (see Section 3.3), data (see Section
3.4), or both (see Section 3.5), corresponding to the second terms in Definition 2.1, Definition
2.2, and Definitions 2.3-2.4, respectively. The findings from our causal attribution model sug-
gest that the numerical data may have a limited role in LLMs’ causal analysis, prompting us to
further investigate through a reverse causal inference experiment in Section 3.6 and a pairwise
causal discovery experiment in Section 3.7. These two experiments will challenge the models’
reliance on numerical data for causal reasoning. To ensure a robust and general evaluation,
we employ a suite of general-purpose autoregressive LLMs based on GPT (Radford et al.,
2019): GPT-3.5, known as ChatGPT, and its more advanced counterparts, GPT-4 and GPT-4
turbo (OpenAI, 2023), both accessed via the OpenAI API set to a temperature of zero for
consistency. Additionally, we extend our assessment to Claude 2 (Models, n.d.) and LLaMa2-13B
(Touvron et al., 2023) to encompass a broad spectrum of model architectures and training
backgrounds.

3.1 Dataset Construction for Causal Reasoning

Our experimental design is supported by a compilation of nine distinct datasets (Mooij et al.,
2015; Zheng et al., 2023), each providing verified causal relationships established through ex-
pert knowledge and empirical analysis. These datasets serve as a benchmark for evaluating
the LLMs’ performance in causal inference tasks. Specifically, the Galton dataset provides
historical height measurements within families, enabling studies on hereditary influence. The
Sachs dataset includes data on 11 phosphorylated proteins and phospholipids from individual
primary immune system cells, offering insights into cellular behavior under various experimen-
tal conditions. Data on the health impacts of alcohol, drawn from blood tests and consumption
patterns, make up the Alcohol dataset. The Ecosystem dataset comprises carbon flux mea-
surements alongside environmental light conditions, offering a complex interplay of ecological
factors. The MPG dataset reflects automotive efficiency, correlating car attributes with fuel
consumption and performance. The DWD dataset combines geographical and climatological
variables, such as altitude and temperature, to model environmental effects. The Cement
dataset relates material composition to structural strength, furnishing a concrete example
of causal reasoning in material science. The Stock dataset consists of stock return data for
various pairs of companies, such as Hang Seng Bank, allowing for a comprehensive analysis
of inter-company stock performance correlations. Lastly, the Arrhythmia dataset comprises
variables associated with Cardiac Arrhythmia.
Each dataset is carefully selected to encompass a range of domains—biology, physics, ge-
ography, atmospheric science, finance, and engineering—ensuring the LLMs are tested against
diverse causal reasoning scenarios. The detailed description of datasets is summarized in Table
1, including the number of nodes, the number of causal relations, the number of samples, and
the corresponding domain. All these datasets contain both variable names of causal meanings
with a set of numerical data supporting intrinsic causal relationships. We aim to evaluate the
LLMs’ causal inference capabilities against the established ground truth of these causal pairs.

Dataset                      Galton   Sachs    Alcohol  EcoSystem  MPG          DWD        Cement       Stock    Arrhythmia
Number of nodes              4        12       6        4          5            6          9            5        4
Number of causal relations   3        20       5        3          6            6          8            3        3
Number of samples            898      7466     345      721        392          349        1030         1331     450
Domain                       Biology  Biology  Biology  Physics    Engineering  Geography  Engineering  Finance  Biology

Table 1: Description of nine benchmark datasets for causal inference tasks.

3.2 Optimal Prompt Training

The initial phase of our experimental design involves zero-shot prompting strategies to engage
LLMs in the task of causal discovery, as suggested by Kojima et al. (2022). Our goal is to op-
timize prompt training to enhance the accuracy of LLMs’ causal inference and get the baseline
accuracy component, i.e., the first term in Equations (1) and (2). Our preliminary trials re-
veal that prompts containing certain directives like “provide” often stimulated non-committal
responses from LLMs, such as “I am sorry, but as an AI, I cannot deduce causal relationships
directly from raw data without context or additional statistical analysis,” indicating their
reluctance or inability to generate causal analyses from data without additional context or statistical
interpretation. To address this issue, we modify our prompts to use more suggestive language,
such as “suggest,” thereby guiding LLMs to produce more analytically useful responses. The
effectiveness of this modification was evaluated by applying the revised prompts to our datasets
and monitoring the output. Figure 3 lays out the attribution ability of ChatGPT by answering
the causal question, with encouraging prompt examples.
We detect a latent pattern for LLMs to construct causal relationships based on the sequence
in which data columns were presented—a potential artifact of their pre-training phase. In
particular, LLMs may by default consider the first variable as the cause of the second variable,
and the second variable subsequently causes the third variable. To mitigate this pattern-
induced bias and encourage more faithful causal reasoning, we propose randomization in the
column order for each dataset in each experiment, and evaluate the accuracy over 15 random
orders as replications. Furthermore, to understand the underlying logical reasoning of LLMs in
causal inference tasks, we expand our prompting repertoire by including advanced techniques
such as the Chain of Thought (CoT) prompting (Wei et al., 2022) with zero-shot format.
This method encourages LLMs to “think step by step”, and thus provides a sequential and
transparent reasoning path when tackling causal discovery tasks. This approach is particularly
valuable in unraveling the models’ internal logic.

Figure 3: Illustration of the ability attribution of ChatGPT answering the causal question with
encouraging prompt examples.
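
For concreteness, a minimal Python sketch of this randomized prompt construction is given
below; it assumes each dataset is loaded as a pandas DataFrame, and the prompt wording follows
the template quoted in Section 4.1.

```python
import random
import pandas as pd

def build_prompt(df: pd.DataFrame, seed: int) -> str:
    """Construct a zero-shot causal discovery prompt with a randomized column order."""
    cols = list(df.columns)
    random.Random(seed).shuffle(cols)          # mitigate the order-induced bias
    data_str = df[cols].to_csv(index=False)    # serialize the data under the shuffled order
    return ("Suggest causal pairs with direction among following variables after analyzing "
            f"following data: {data_str} "
            "MUST Suggest ONLY the causal pairs with direction without saying any other things.")

# One prompt per replication; accuracy is evaluated over 15 random column orders.
# prompts = [build_prompt(df, seed) for seed in range(15)]
```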

3.3 Ability Attribution: Omit Knowledge

In line with our goal to dissect the internal knowledge component of LLMs—referred to in
the second term of Equation (1) in Definition 2.1—we conduct experiments specifically tai-
lored to assess how LLMs perform when their access to explicit knowledge is systematically
restricted. Acknowledging that LLMs possess vast stores of implicit knowledge, akin to ex-
pansive knowledge graphs (Pan et al., 2020), we aim to understand the extent to which this
internal knowledge influences causal reasoning. To isolate the effect of knowledge from other
variables, we retain the numerical data intact but obscure the variable names by substituting
them with arbitrary terms such as “bryoto”, “nienet”, and “feouni”. See an illustration in
Figure 3. The selection of these placeholders is deliberate; we avoid sequences that LLMs
might interpret as inherently ordered or connected, such as alphabetical sequences or familiar
nomenclatures. This strategy prevents the models from leveraging any pre-existing associative
patterns that could confound the assessment of their knowledge-based reasoning capabilities.
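
A minimal sketch of this perturbation is shown below, assuming the data are held in a pandas
DataFrame; placeholder strings beyond the three quoted above are illustrative.

```python
import random
import pandas as pd

# Arbitrary, non-suggestive placeholders (avoiding alphabetical or otherwise ordered sequences).
PLACEHOLDERS = ["bryoto", "nienet", "feouni", "alphan", "qorvel", "miskat"]  # extend for larger datasets

def obscure_knowledge(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Keep the numerical data intact but strip the embedded knowledge from variable names."""
    rng = random.Random(seed)
    new_names = rng.sample(PLACEHOLDERS, k=df.shape[1])
    return df.rename(columns=dict(zip(df.columns, new_names)))
```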

3.4 Ability Attribution: Omit Data

Next, we develop a subsequent experiment to measure LLM performance in the absence of
data, paralleling the scenario described by the second term in Equation (2) in Definition 2.2.
To evaluate the influence of numerical information within the dataset, we exclude the data
values from the prompt and present the prompt with the column names organized in random
order, intentionally omitting the numerical data as shown in Figure 3. This method allows us
to probe the LLMs’ capacity for reasoning with structural but non-quantitative information,
a contrast to their operation with full data availability. This technique has been previously
implemented to some extent, such as in the work by Kıcıman et al. (2023), where LLMs were
tasked with causal discovery using limited variable sets, each containing 3-4 variables. These
previous studies laid the groundwork by demonstrating LLMs’ responses when prompted with
variable names and metadata only. Our experiment builds on this foundation by scaling the
challenge up to seven datasets with varied complexity, including the biologically rich Sachs
dataset, which encompasses a multitude of variables and interactions (Sachs et al., 2005).

3.5 Ability Attribution: Random Guess

To define the baseline for the LLMs’ performance in causal reasoning tasks without input data
and knowledge, we designed an experiment where both the data and knowledge inputs are
obfuscated. This approach simulates a scenario where LLMs must operate without access to
informative cues, hence relying on random guesses to generate causal pairs. This experiment
mirrors the conditions set forth by the subtractive terms in Definitions 2.3-2.4, stripping the
models of the structured context usually provided by known data and knowledge. The rationale
for this approach is twofold: first, to quantify the lower bound of LLM accuracy in causal
inference when deprived of meaningful input, and second, to assess the models’ default pattern
of responses in the complete absence of guiding information. By comparing the results of this
experiment with those obtained from data- and knowledge-rich inputs, we can more accurately
measure the added value of each component.

Figure 4: Illustration of the experiment design of the reverse causal inference task.

3.6 Reverse Causal Inference

Our causal attribution model’s preliminary results indicate that numerical data might play a
less significant role in the causal reasoning processes of LLMs (see more details in Section 4).
This insight leads us to design an additional experiment that examines the model’s reasoning
under reversed causal pairs, as detailed in Figure 4. The reverse causal inference task is critical
to evaluate the LLMs’ dependency on numerical data when inferring causality. We achieve this
goal by manipulating the datasets’ structure while keeping their data content intact. Specifi-
cally, we first establish the topological ordering of variables within the original Directed Acyclic
Graphs (DAGs) representing the datasets’ causal structures. We then systematically reverse
this order, crafting a new, inverted set of relationships for the LLMs to analyze. Despite pre-
serving the data values, this rearrangement of the causal sequence presents a novel challenge
to the models, as it disrupts the directionality they have ‘learned’ or ‘observed’ from previous
exposures to similar datasets. For a tangible example, we use the Galton Family dataset as
illustrated in Figure 4, where the original causal connections—Father’s Height and Mother’s
Height leading to Child’s Height, with Child’s Gender as an additional factor—are well estab-
lished. Inverting these relationships means that the LLMs are now presented with the premise
that the Child’s Height influences the heights of the parents, a reversal of the natural inference.
This test probes whether the LLMs can be aware of the logical fallacy in such an inversion,
thereby testing if LLMs will make use of numerical data beyond their learned knowledge.
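
A minimal sketch of constructing the reversed ground truth is given below, assuming the
original causal structure is available as a list of directed edges and using networkx for the
topological ordering; it is an illustration of the design, not the authors’ exact implementation.

```python
import networkx as nx

def reverse_causal_graph(true_edges):
    """Invert the topological ordering of the original DAG; every edge u -> v becomes v -> u."""
    g = nx.DiGraph(true_edges)
    assert nx.is_directed_acyclic_graph(g), "the ground-truth graph must be a DAG"
    reversed_order = list(nx.topological_sort(g))[::-1]    # inverted causal ordering
    reversed_edges = [(v, u) for (u, v) in true_edges]     # edges re-oriented under that ordering
    return reversed_order, reversed_edges

# Galton example: parents' heights and the child's gender originally cause the child's height.
galton = [("Father's Height", "Child's Height"),
          ("Mother's Height", "Child's Height"),
          ("Child's Gender", "Child's Height")]
print(reverse_causal_graph(galton)[1])   # all edges now point from the child's height outward
```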

3.7 Pairwise Causal Discovery

To further validate our causal attribution model’s preliminary results, we design an extra
simulation experiment aimed at assessing the model’s generalizability in the task of pairwise
causal discovery (Hoyer et al., 2008). Since the causal graph can be identified only up to its
Markov equivalence class unless additional model and noise assumptions are imposed (Geiger et al.,
1990), we conduct simulations involving non-Gaussian noise; the design principle is based
on the following theorem (Shimizu et al., 2006).

Theorem 3.1. (Shimizu et al., 2006) In the linear non-Gaussian noise setting, if the true
structural causal model is
Y := f(X) + U,   X ⊥⊥ U,

then there does not exist a structural causal model in the reverse direction

X := g(Y) + Ũ,   Y ⊥⊥ Ũ

that can generate data consistent with P (x, y).

Figure 5: Illustration of the experiment design of the pairwise causal discovery task.

To illustrate, in the Galton Family dataset, considering variables like Father’s Height and
Child’s Height, we simulate Father’s Height data using the equation

Father’s Height := f(Child’s Height) + U,

where f is a linear function and U is non-Gaussian noise such as chi-squared noise. Our
objective is to evaluate the capability of LLMs in detecting the simulated reversed causal
relationship ‘Child’s Height causes Father’s Height’. The evaluation is conducted under three
scenarios, as detailed in Figure 5: (1) employing LinGAM (Shimizu et al., 2011) directly on
simulated data as a baseline, where LinGAM is a numerical method to estimate causal ordering
and connection strengths based on non-Gaussianity; (2) formulating the prompt using data
accompanied by variable names to LLMs; (3) constructing the prompt with data along with
variable names, complemented by simplified instructions outlining the procedure for executing
LinGAM, to LLMs.
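
As an illustration of scenario (1), the following sketch simulates one pair under Theorem 3.1
and applies DirectLiNGAM; it assumes the open-source lingam package, and the specific constants
are illustrative rather than taken from the paper.

```python
import numpy as np
import lingam  # assumption: the open-source `lingam` package providing DirectLiNGAM

rng = np.random.default_rng(0)
# Simulate the reversed relationship "Child's Height causes Father's Height" with a
# linear function plus chi-squared (non-Gaussian) noise, as in Theorem 3.1.
child = rng.normal(66.0, 4.0, size=1000)                  # illustrative child heights
father = 0.8 * child + rng.chisquare(df=3, size=1000)     # simulated cause: child -> father

X = np.column_stack([child, father])                      # column 0: child, column 1: father
model = lingam.DirectLiNGAM()
model.fit(X)
print(model.causal_order_)   # expect [0, 1]: the child's height precedes the father's height
```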

4 Experiment Results
4.1 Implementation Details

We provide the implementation details for probing causal inference, using GPT-4 on the Sachs
dataset as an example. The following steps outline the specifics of the procedure (a code
sketch follows the list):

• Initialization: We initiate GPT-4 with a clear operational mandate by setting up the
system with an unambiguous instruction that defines its role: “You are a helpful assistant
to suggest potential causal pairs with direction (A → B means A causes B).” This precise
configuration is crucial to focus the model’s capabilities on generating causal relationships
with direction from the provided data.

• Prompt Generation: We first randomize the order of the Sachs dataset variables to
mitigate any ordering bias that may influence the LLM’s output. The prompt is then
constructed to direct the model’s attention strictly to the task at hand: “Suggest causal
pairs with direction among following variables after analyzing following data: {raw data}
MUST Suggest ONLY the causal pairs with direction without saying any other things.”
For the pairwise causal discovery experiment, we sample the relation from the ground
truth relations and generate the corresponding simulated data to construct the prompt,
as illustrated by Figure 5.

• Prompt Tailoring: In line with the experimental design, we tailor prompts to account
for different perturbations—removing data values, obscuring knowledge with placeholder
terms, or conducting reverse causal inference and pairwise causal discovery. Each varia-
tion is designed to test a specific aspect of the LLM’s causal reasoning under altered input
conditions (Figures 3, 4, and 5 illustrate the prompt adjustments for each scenario).

• Evaluation: We then extract the causal pairs predicted by GPT-4 in response to these
prompts and proceed to the evaluation phase. Here, we compute the evaluation metrics
by contrasting the LLM-generated causal pairs with the established ground truth of the
dataset. During reverse causal inference trials, we analyze two sets of outcomes: one
that aligns with the predictions for the reversed causal sequence and another set that
compares against the original causal graph.
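
A minimal sketch of one such call is given below; it assumes the openai Python client
(version 1.x) with the API key supplied via the environment, and a user prompt built as in the
steps above.

```python
from openai import OpenAI  # assumption: openai>=1.0 client; reads OPENAI_API_KEY from the environment

client = OpenAI()

SYSTEM_PROMPT = ("You are a helpful assistant to suggest potential causal pairs with direction "
                 "(A → B means A causes B).")

def query_causal_pairs(user_prompt: str, model: str = "gpt-4") -> str:
    """Send one causal discovery prompt and return the raw text of the model's reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # temperature fixed at zero for consistency across replications
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": user_prompt}],
    )
    return response.choices[0].message.content
```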

4.2 Evaluation Metrics

To quantify and evaluate the performance of LLMs in causal inference tasks attributable to different
components, we measure the proposed four attribution quantities in Definitions 2.1 to 2.4. In addi-
tion, we further utilize the true discovery rate (TDR), false discovery rate (FDR), and F1 score
as auxiliary metrics for assessing the overall accuracy. These rates are defined mathematically
as follows:

TDR = (Number of Correctly Predicted Causal Pairs) / (Total Number of True Causal Pairs),

FDR = (Number of Incorrectly Predicted Causal Pairs) / (Total Number of Predicted Causal Pairs),

F1 = 2 × TDR × (1 − FDR) / (1 − FDR + TDR).
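
These rates can be computed directly from the predicted and true sets of directed pairs; a
minimal sketch is given below (the example prediction is hypothetical).

```python
def causal_metrics(predicted_pairs, true_pairs):
    """Compute TDR, FDR, and F1 from sets of directed causal pairs (cause, effect)."""
    predicted, truth = set(predicted_pairs), set(true_pairs)
    correct = predicted & truth
    tdr = len(correct) / len(truth) if truth else 0.0
    fdr = (len(predicted) - len(correct)) / len(predicted) if predicted else 0.0
    precision = 1 - fdr  # F1 expressed through TDR (recall) and 1 - FDR (precision)
    f1 = 2 * tdr * precision / (precision + tdr) if (precision + tdr) > 0 else 0.0
    return {"TDR": tdr, "FDR": fdr, "F1": f1}

# Galton ground truth versus a hypothetical LLM prediction with one flipped edge.
truth_set = {("Father's Height", "Child's Height"), ("Mother's Height", "Child's Height"),
             ("Child's Gender", "Child's Height")}
pred_set = {("Father's Height", "Child's Height"), ("Mother's Height", "Child's Height"),
            ("Child's Height", "Child's Gender")}
print(causal_metrics(pred_set, truth_set))   # TDR = 2/3, FDR = 1/3, F1 = 2/3
```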

Drawing from these rates, we can derive the causal attributions following Definitions 2.1 to 2.4
below:

CAK = TDR (with raw data) − TDR (without knowledge),

CAD = TDR (with raw data) − TDR (without numerical data),

MAD = TDR (without knowledge) − TDR (with random guess),

MAK = TDR (without numerical data) − TDR (with random guess).
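
The attribution scores then follow by plain subtraction of scenario-level TDRs. As a sanity
check, a small sketch using the GPT-4 turbo mean TDRs reported for the MPG dataset in Table 4
recovers the corresponding entries of Table 2 up to rounding of the reported means:

```python
def attribution_scores(tdr_raw, tdr_omit_knowledge, tdr_omit_data, tdr_random_guess):
    """Derive CAK, CAD, MAD, and MAK from scenario-level true discovery rates."""
    return {
        "CAK": round(tdr_raw - tdr_omit_knowledge, 3),
        "CAD": round(tdr_raw - tdr_omit_data, 3),
        "MAD": round(tdr_omit_knowledge - tdr_random_guess, 3),
        "MAK": round(tdr_omit_data - tdr_random_guess, 3),
    }

# GPT-4 turbo on MPG (mean TDRs from Table 4): raw 0.68, omit knowledge 0.16,
# omit data 0.72, random guess 0.00.
print(attribution_scores(0.68, 0.16, 0.72, 0.00))
# {'CAK': 0.52, 'CAD': -0.04, 'MAD': 0.16, 'MAK': 0.72}; Table 2 reports 0.51, -0.05, 0.16, 0.72.
```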

These metrics enable us to quantify the specific contributions of knowledge and data to the
LLMs’ causal reasoning accuracy. Structural Hamming Distance (SHD) is also calculated to
compare the difference between the predicted adjacency matrix and the true adjacency ma-
trix. Every edge that is either missing or not in the original graph is counted as a mistake
and increases the distance by 1. The mean and standard deviation of these evaluation metrics
of different LLMs on various datasets are summarized in Table 2, Table 4, Table 5, Table
6, and Table 7 over 15 replications. The evaluation for full causal graph discovery considers
the following scenarios: Raw Data (using the original data), Omit Data, Omit Knowledge,
Reverse (evaluating with the reversed causal graph after causal order reversal), and Reverse-
Raw (evaluating with the original causal graph after causal graph reversal); the evaluation for
pairwise causal discovery evaluates the following three cases: (1) employing LinGAM directly
on simulated data; (2) formulating the prompt using data accompanied by variable names; (3)

Method/Dataset Sachs Galton Alcohol EcoSystem MPG DWD Cement Stock Arrhythmia
GPT-4 turbo
CAK 0.49 0.58 0.75 0.33 0.51 0.17 0.87 0.07 0.09
CAD 0.01 0 0.01 0.20 -0.05 -0.01 0 -0.12 -0.13
MAD 0.04 0.02 -0.03 0.03 0.16 0.29 -0.62 0.05 0.39
MAK 0.52 0.60 0.70 0.16 0.72 0.48 0.26 0.24 0.62
GPT-4
CAK 0.01 0.67 0.58 0.41 0.28 0.07 0.79 0.12 0.15
CAD -0.17 0 0.08 0.02 -0.14 -0.10 0 -0.12 -0.10
MAD 0.26 0.18 0.29 0.29 0.32 0.33 0.01 0.04 0.44
MAK 0.45 0.85 0.79 0.67 0.74 0.51 0.81 0.28 0.70
GPT-3.5
CAK 0.20 0.42 0.13 0.24 0.19 0.09 0.88 0.08 0.18
CAD 0.04 0.01 -0.14 0.12 -0.09 -0.15 0 -0.03 0.14
MAD 0.05 0.17 0.33 0.01 0.02 0.06 0.02 0.06 0.01
MAK 0.20 0.58 0.60 0.13 0.30 0.30 0.91 0.17 0.05
LLaMa2-13B
CAK 0.08 0.09 0.05 0.14 0.32 0.23 0.51 0.09 0.11
CAD -0.04 -0.48 -0.11 -0.07 -0.04 0.18 -0.15 0.12 0.08
MAD 0.01 0.24 0.01 0.10 0.05 0.12 -0.02 0.13 0.06
MAK 0.13 0.82 0.17 0.31 0.41 0.17 0.64 0.10 0.09
Claude 2
CAK 0.32 0.56 0.52 0.44 0.37 0.44 0.85 0.20 0.26
CAD -0.16 -0.07 -0.04 0.22 0.16 0.11 -0.06 0.14 -0.01
MAD 0.08 0.12 0.03 0.02 0.16 -0.12 -0.09 -0.04 0.08
MAK 0.57 0.74 0.59 0.24 0.37 0.21 0.82 0.02 0.35

Table 2: Attribution scores of LLMs for different datasets.

Method/Dataset Sachs Galton Alcohol EcoSystem MPG DWD Cement Stock Arrhythmia
LinGAM 1 1 1 1 1 1 1 1 1
GPT-4 turbo 0.35 0.20 0 0 0.10 0.05 0 0.50 0.25
GPT-4 turbo (educated) 0.40 0.20 0.10 0.10 0.25 0.30 0.05 0.20 0.25
GPT-4 0.45 0.30 0.55 0.40 0.30 0.60 0.30 0.45 0.25
GPT-4 (educated) 0.60 0.15 0.60 0.25 0.30 0.60 0.15 0.35 0.40
GPT-3.5 0.40 0.70 0.40 0.45 0.50 0.60 0.40 0.50 0.25
GPT-3.5 (educated) 0.40 0.55 0.05 0.45 0.55 0.45 0.05 0.5 0.45
LLaMa2-13B 0.50 0.15 0.25 0.40 0.45 0.40 0.20 0.75 0
LLaMa2-13B (educated) 0.10 0 0.10 0 0.25 0 0 0.50 0
Claude 2 0.45 0 0 0 0.15 0 0.15 0.30 0.45
Claude 2 (educated) 0.5 0.05 0.20 0.20 0.05 0.30 0.25 0.60 0.45

Table 3: Accuracy of LLMs for different datasets in the pairwise causal discovery analyses.

constructing the prompt with data along with variable names, complemented by simplified in-
structions outlining the procedure for executing LinGAM. In these three cases, accuracy serves
as the performance metric for evaluation over 20 replications, and the result is summarized in
Table 3.

Method/Dataset Sachs Galton Alcohol EcoSystem MPG DWD Cement Stock Arrhythmia
GPT-4 turbo
Raw Data 0.58±0.26 1±0 1±0 0.76±0.24 0.68±0.19 0.52±0.16 0.99±0.04 0.32±0.33 0.62±0.27
Omit Data 0.57±0.28 1±0 0.99±0.05 0.56±0.24 0.72±0.12 0.53±0.17 0.99±0.04 0.44±0.46 0.76±0.23
Omit Knowledge 0.09±0.07 0.42±0.44 0.25±0.24 0.43±0.39 0.16±0.16 0.35±0.20 0.12±0.15 0.25±0.27 0.53±0.35
Reverse 0.22±0.13 0.13±0.16 0.23±0.16 0.20±0.30 0.06±0.10 0.24±0.20 0.65±0.20 0.25±0.18 0.09±0.26
Reverse-Raw 0.49±0.28 1±0 0.95±0.09 0.84±0.18 0.58±0.13 0.46±0.12 0.97±0.12 0.19±0.19 0.62±0.27
Random Guess 0.05±0.19 0.40±0.49 0.28±0.41 0.40±0.49 0±0 0.06±0.12 0.73±0.44 0.20±0.40 0.13±0.34
GPT-4
Raw Data 0.34±0.24 1±0 1±0 0.91±0.17 0.60±0.14 0.40±0.11 1±0 0.30±0 0.60±0.19
Omit Data 0.52±0.17 1±0 0.92±0.17 0.89±0.22 0.74±0.07 0.51±0.12 1±0 0.42±0.23 0.70±0.13
Omit Knowledge 0.33±0.20 0.33±0.28 0.42±0.17 0.50±0.28 0.32±0.15 0.33±0.10 0.21±0.14 0.18±0.10 0.44±0.13
Reverse 0.30±0.16 0.13±0.21 0.32±0.09 0.16±0.16 0.30±0.13 0.26±0.12 0.72±0.41 0.30±0.03 0.36±0.18
Reverse-Raw 0.25±0.13 1±0 1±0 1±0 0.50±0.15 0.31±0.11 1±0 0.30±0.01 0.61±0.24
Random Guess 0.07±0.25 0.15±0.36 0.13±0.34 0.21±0.41 0±0 0±0 0.19±0.39 0.14±0.35 0±0
GPT-3.5
Raw Data 0.33±0.38 0.96±0.13 0.64±0.37 0.68±0.31 0.39±0.19 0.30±0.20 0.99±0.04 0.43±0.30 0.41±0.40
Omit Data 0.29±0.29 0.96±0.10 0.78±0.34 0.56±0.23 0.48±0.29 0.46±0.13 0.99±0.03 0.46±0.32 0.27±0.13
Omit Knowledge 0.13±0.11 0.55±0.29 0.51±0.41 0.43±0.43 0.20±0.15 0.21±0.15 0.11±0.07 0.36±0.32 0.22±0.28
Reverse 0.24±0.30 0.15±0.16 0.26±0.13 0.13±0.34 0.20±0.19 0.14±0.18 0.14±0.24 0.25±0.13 0.26±0.19
Reverse-Raw 0.40±0.35 1±0 0.65±0.39 0.93±0.18 0.41±0.25 0.33±0.23 0.95±0.13 0.37±0.21 0.41±0.31
Random Guess 0.09±0.09 0.15±0.36 0.17±0.10 0.42±0.48 0.18±0.17 0.16±0.25 0.08±0.07 0.29±0.30 0.21±0.27
LLaMa2-13B
Raw Data 0.29±0.23 0.52±0.33 0.21±0.12 0.43±0.28 0.60±0.26 0.46±0.27 0.63±0.36 0.38±0.21 0.40±0.16
Omit Data 0.32±0.17 1±0 0.32±0.20 0.50±0.40 0.64±0.22 0.28±0.30 0.78±0.26 0.26±0.25 0.32±0.22
Omit Knowledge 0.20±0.11 0.42±0.14 0.15±0.10 0.29±0.12 0.28±0.16 0.24±0.26 0.12±0.10 0.29±0.19 0.29±0.12
Reverse 0.29±0.13 0.18±0.20 0.25±0.11 0.33±0.10 0.25±0.25 0.40±0.17 0.22±0.24 0.23±0.19 0.27±0.14
Reverse-Raw 0.27±0.15 0.62±0.33 0.20±0.10 0.56±0.25 0.48±0.25 0.43±0.19 0.68±0.39 0.30±0.17 0.39±0.16
Random Guess 0.20±0.11 0.18±0.22 0.15±0.11 0.19±0.18 0.23±0.15 0.11±0.11 0.14±0.06 0.16±0.17 0.23±0.77
Claude 2
Raw Data 0.51±0.17 0.93±0.13 0.80±0.24 0.78±0.26 0.65±0.20 0.59±0.16 0.94±0.13 0.43±0.20 0.66±0.17
Omit Data 0.67±0.21 1±0 0.84±0.17 0.56±0.23 0.50±0.24 0.48±0.12 1±0 0.29±0.38 0.67±0.24
Omit Knowledge 0.19±0.11 0.38±0.26 0.28±0.18 0.37±0.26 0.28±0.16 0.15±0.12 0.09±0.14 0.23±0.19 0.4±0.29
Reverse 0.38±0.08 0.02±0.08 0.19±0.12 0.25±0.34 0.16±0.14 0.32±0.21 0.50±0.34 0.27±0.18 0.24±0.15
Reverse-Raw 0.57±0.12 0.89±0.16 0.87±0.16 0.89±0.20 0.64±0.26 0.51±0.12 0.95±0.12 0.37±0.17 0.63±0.17
Random Guess 0.10±0.12 0.26±0.28 0.25±0.08 0.31±0.27 0.12±0.17 0.27±0.22 0.18±0.33 0.27±0.30 0.32±0.33

Table 4: The results of true discovery rates of LLMs for different datasets.

4.3 Main Results

The proposed experiment on LLMs’ capabilities in causal reasoning has led to two key find-
ings. Firstly, the high marginal attribution of knowledge (MAK) scores presented in Table
2 emphasize the vital role of knowledge alone in enabling LLMs to derive sound causal inferences
across various datasets and models. Furthermore, the high conditional attribution of knowl-
edge (CAK) scores, detailed in Table 2, reveal that even in the presence of numerical data,
the variable names significantly improve LLMs’ causal reasoning accuracy by utilizing their
internal knowledge. This outcome suggests that an LLM’s performance in complex causal
analysis heavily depends on the extent and sophistication of its pre-existing knowledge base.
The comparative analysis of attribution scores among various LLMs, such as GPT-4 turbo,

Method/Dataset Sachs Galton Alcohol EcoSystem MPG DWD Cement Stock Arrhythmia
GPT-4 turbo
Raw Data 0.50±0.19 0±0 0±0 0.27±0.23 0.42±0.16 0.51±0.16 0.01±0.04 0.70±0.33 0.54±0.15
Omit Data 0.56±0.21 0±0 0.01±0.05 0.46±0.23 0.29±0.07 0.47±0.18 0.05±0.12 0.59±0.46 0.41±0.11
Omit Knowledge 0.91±0.07 0.80±0.19 0.79±0.13 0.72±0.24 0.84±0.16 0.65±0.20 0.88±0.15 0.84±0.16 0.58±0.26
Reverse 0.50±0.19 0.87±0.16 0.77±0.16 0.86±0.21 0.94±0.10 0.74±0.20 0.15±0.19 0.77±0.17 0.91±0.26
Reverse-Raw 0.52±0.29 0±0 0.08±0.16 0.17±0.19 0.50±0.14 0.51±0.21 0.03±0.12 0.83±0.15 0.49±0.20
Random Guess 0.95±0.19 0.60±0.49 0.72±0.41 0.60±0.49 1±0 0.94±0.12 0.27±0.44 0.80±0.40 0.87±0.34
GPT-4
Raw Data 0.66±0.24 0±0 0.40±0.33 0.41±0.18 0.46±0.14 0.67±0.09 0±0 0.70±0 0.48±0.11
Omit Data 0.45±0.18 0±0 0.41±0.25 0.46±0.07 0.28±0.08 0.52±0.12 0±0 0.61±0.23 0.37±0.07
Omit Knowledge 0.67±0.20 0.72±0.21 0.72±0.08 0.60±0.12 0.68±0.15 0.67±0.10 0.79±0.14 0.83±0.09 0.56±0.13
Reverse 0.66±0.24 0.87±0.21 0.68±0.09 0.84±0.16 0.70±0.13 0.74±0.12 0.28±0.41 0.71±0.03 0.67±0.08
Reverse-Raw 0.75±0.13 0±0 0.62±0.17 0.48±0.22 0.50±0.15 0.72±0.08 0.06±0.22 0.70±0.01 0.52±0.16
Random Guess 0.93±0.25 0.85±0.36 0.87±0.34 0.79±0.41 1±0 1±0 0.81±0.39 0.86±0.35 1±0
GPT-3.5
Raw Data 0.87±0.08 0.07±0.17 0.36±0.37 0.32±0.31 0.61±0.19 0.72±0.21 0.02±0.05 0.61±0.24 0.76±0.20
Omit Data 0.80±0.14 0.04±0.10 0.24±0.33 0.46±0.23 0.61±0.20 0.54±0.12 0.01±0.03 0.60±0.30 0.73±0.13
Omit Knowledge 0.89±0.10 0.67±0.12 0.78±0.10 0.69±0.30 0.81±0.14 0.80±0.13 0.91±0.06 0.67±0.28 0.67±0.20
Reverse 0.87±0.08 0.85±0.16 0.74±0.13 0.91±0.23 0.83±0.16 0.91±0.11 0.86±0.24 0.77±0.12 0.75±0.19
Reverse-Raw 0.81±0.13 0.11±0.21 0.42±0.38 0.10±0.20 0.62±0.21 0.71±0.22 0.07±0.19 0.67±0.20 0.68±0.15
Random Guess 0.91±0.09 0.70±0.35 0.83±0.10 0.59±0.48 0.82±0.17 0.83±0.26 0.92±0.07 0.77±0.22 0.69±0.27
LLaMa2-13B
Raw Data 0.71±0.23 0.53±0.32 0.79±0.13 0.59±0.22 0.40±0.26 0.55±0.26 0.34±0.37 0.67±0.18 0.62±0.14
Omit Data 0.67±0.17 0±0 0.66±0.18 0.50±0.40 0.36±0.22 0.72±0.30 0.23±0.27 0.74±0.25 0.64±0.23
Omit Knowledge 0.80±0.11 0.58±0.14 0.85±0.10 0.71±0.12 0.75±0.15 0.82±0.16 0.88±0.10 0.71±0.18 0.71±0.11
Reverse 0.71±0.23 0.82±0.20 0.75±0.11 0.68±0.07 0.75±0.25 0.53±0.19 0.79±0.23 0.77±0.18 0.74±0.13
Reverse-Raw 0.72±0.14 0.40±0.33 0.80±0.10 0.56±0.12 0.53±0.25 0.55±0.22 0.31±0.38 0.71±0.15 0.61±0.16
Random Guess 0.81±0.11 0.82±0.22 0.84±0.12 0.81±0.18 0.77±0.15 0.89±0.11 0.86±0.06 0.84±0.16 0.77±0.20
Claude 2
Raw Data 0.51±0.16 0.07±0.13 0.24±0.26 0.30±0.22 0.37±0.22 0.44±0.17 0.10±0.14 0.61±0.17 0.42±0.14
Omit Data 0.33±0.21 0±0 0.19±0.20 0.49±0.22 0.51±0.24 0.49±0.13 0±0 0.72±0.37 0.34±0.22
Omit Knowledge 0.80±0.11 0.66±0.20 0.73±0.15 0.66±0.21 0.73±0.15 0.85±0.12 0.91±0.14 0.78±0.17 0.67±0.18
Reverse 0.62±0.08 0.98±0.08 0.81±0.12 0.77±0.29 0.84±0.14 0.73±0.19 0.50±0.34 0.74±0.15 0.76±0.15
Reverse-Raw 0.45±0.11 0.12±0.18 0.19±0.21 0.24±0.24 0.38±0.24 0.47±0.13 0.08±0.12 0.61±0.19 0.41±0.15
Random Guess 0.90±0.12 0.79±0.21 0.75±0.08 0.87±0.19 0.88±0.17 0.73±0.22 0.82±0.33 0.73±0.30 0.68±0.33

Table 5: The results of false discovery rates of LLMs for different datasets.

GPT-4, GPT-3.5, Claude 2, and LLaMa2-13B, demonstrates a hierarchy in their knowledge
depth, with GPT-4 turbo leading the group, as shown in Figure 6. Consequently, when ade-
quately equipped with relevant knowledge, LLMs demonstrate a proficient capacity for causal
reasoning that is consistent with human logic and common sense.
Conversely, the relatively low marginal attribution of data (MAD) and conditional attri-
bution of data (CAD) scores across different datasets and models indicate that numerical data
alone is not a significant factor in the causal reasoning of LLMs. Without substantial contex-
tual knowledge, LLMs display limited but existent causal reasoning abilities based solely on
numerical data. This finding is validated by the results in Tables 4, 5, 6 and 7, which consis-
tently show high TDR, high F1, low FDR, and low SHD in scenarios where numerical data

Figure 6: The comparisons of the differences between attribution scores (MAK-MAD) among LLMs,
including GPT-4 turbo, GPT-4, GPT-3.5, Claude 2, and LLaMa2-13B, to demonstrate a hierarchy in
their knowledge depth.

Figure 7: Performance of GPT-4 turbo under Raw Data versus Reverse-Raw.

is omitted. In contrast, scenarios lacking variable names—which represent knowledge—result
in reduced TDR and F1 and increased FDR and SHD. Moreover, the results from the reverse
causal inference experiment reveal that altering the order of variables does not significantly
impact LLMs’ causal inference outcomes, as illustrated in Figure 7. This observation further
underscores the primacy of the knowledge component in guiding LLMs’ causal reasoning pro-
cesses. Hence, we conclude that knowledge is, indeed, what LLMs principally require for sound
causal reasoning.
Our experimental design additionally offers a comprehensive methodology for comparing
the performance of different LLMs across diverse datasets from the perspective of F1 score and
SHD. For the F1 score, in the context of raw data, as shown in Figure 8, GPT-4 turbo achieves

Method/Dataset Sachs Galton Alcohol EcoSystem MPG DWD Cement Stock Arrhythmia
GPT-4 turbo
Raw Data 0.53±0.20 1±0 1±0 0.74±0.23 0.62±0.14 0.50±0.15 0.99±0.04 0.30±0.33 0.51±0.16
Omit Data 0.47±0.20 1±0 0.99±0.05 0.55±0.23 0.72±0.09 0.52±0.18 0.97±0.08 0.42±0.46 0.65±0.13
Omit Knowledge 0.09±0.07 0.25±0.24 0.22±0.15 0.33±0.28 0.16±0.16 0.35±0.20 0.12±0.15 0.17±0.18 0.45±0.28
Reverse 0.28±0.14 0.13±0.16 0.23±0.16 0.16±0.24 0.06±0.10 0.25±0.20 0.64±0.19 0.24±0.17 0.09±0.26
Reverse-Raw 0.48±0.29 1±0 0.93±0.13 0.83±0.19 0.53±0.13 0.49±0.20 0.97±0.12 0.25±0.16 0.54±0.20
Random Guess 0.05±0.19 0.40±0.49 0.28±0.41 0.40±0.49 0±0 0.06±0.12 0.73±0.44 0.20±0.40 0.13±0.34
GPT-4
Raw Data 0.34±0.24 1±0 0.70±0.25 0.70±0.14 0.57±0.13 0.36±0.09 1±0 0.30±0 0.54±0.11
Omit Data 0.53±0.17 1±0 0.69±0.20 0.66±0.11 0.73±0.08 0.49±0.12 1±0 0.40±0.23 0.66±0.08
Omit Knowledge 0.33±0.20 0.29±0.21 0.31±0.11 0.43±0.16 0.32±0.15 0.33±0.10 0.21±0.25 0.17±0.09 0.44±0.13
Reverse 0.26±0.15 0.13±0.21 0.32±0.09 0.16±0.16 0.30±0.13 0.26±0.12 0.72±0.41 0.29±0.03 0.34±0.11
Reverse-Raw 0.25±0.13 1±0 0.53±0.14 0.67±0.16 0.50±0.15 0.29±0.09 0.95±0.19 0.30±0.01 0.52±0.15
Random Guess 0.07±0.25 0.15±0.36 0.13±0.34 0.21±0.41 0±0 0±0 0.19±0.39 0.14±0.35 0±0
GPT-3.5
Raw Data 0.15±0.10 0.94±0.15 0.64±0.37 0.68±0.31 0.39±0.19 0.29±0.21 0.99±0.04 0.41±0.26 0.28±0.25
Omit Data 0.22±0.17 0.96±0.10 0.77±0.34 0.55±0.23 0.41±0.20 0.46±0.13 0.99±0.03 0.42±0.29 0.28±0.13
Omit Knowledge 0.11±0.10 0.38±0.14 0.28±0.15 0.36±0.35 0.19±0.14 0.20±0.13 0.09±0.06 0.34±0.28 0.26±0.22
Reverse 0.12±0.09 0.15±0.16 0.26±0.13 0.11±0.27 0.18±0.17 0.11±0.13 0.14±0.24 0.24±0.12 0.25±0.19
Reverse-Raw 0.22±0.13 0.91±0.16 0.60±0.38 0.93±0.19 0.39±0.23 0.31±0.22 0.94±0.17 0.35±0.20 0.35±0.20
Random Guess 0.09±0.09 0.32±0.36 0.17±0.10 0.42±0.48 0.18±0.17 0.16±0.25 0.08±0.07 0.25±0.23 0.31±0.27
LLaMa2-13B
Raw Data 0.29±0.23 0.49±0.33 0.21±0.12 0.42±0.24 0.60±0.26 0.45±0.26 0.64±0.36 0.35±0.18 0.39±0.14
Omit Data 0.33±0.17 1±0 0.33±0.19 0.50±0.40 0.64±0.22 0.28±0.30 0.77±0.26 0.26±0.25 0.33±0.22
Omit Knowledge 0.20±0.11 0.42±0.14 0.15±0.10 0.29±0.12 0.26±0.15 0.19±0.16 0.12±0.10 0.29±0.18 0.29±0.12
Reverse 0.22±0.17 0.18±0.20 0.25±0.11 0.32±0.08 0.25±0.25 0.42±0.17 0.21±0.22 0.23±0.18 0.26±0.14
Reverse-Raw 0.27±0.15 0.61±0.33 0.20±0.10 0.47±0.13 0.47±0.25 0.44±0.20 0.69±0.38 0.29±0.15 0.39±0.16
Random Guess 0.20±0.11 0.18±0.22 0.15±0.12 0.19±0.18 0.23±0.15 0.11±0.11 0.14±0.06 0.16±0.16 0.23±0.20
Claude 2
Raw Data 0.50±0.16 0.93±0.13 0.78±0.25 0.72±0.24 0.64±0.21 0.57±0.16 0.91±0.13 0.41±0.18 0.62±0.15
Omit Data 0.67±0.21 1±0 0.82±0.18 0.52±0.19 0.50±0.24 0.49±0.12 1±0 0.29±0.38 0.66±0.23
Omit Knowledge 0.19±0.11 0.35±0.22 0.27±0.16 0.35±0.22 0.28±0.15 0.15±0.12 0.09±0.14 0.22±0.18 0.36±0.21
Reverse 0.38±0.08 0.02±0.08 0.19±0.12 0.24±0.30 0.16±0.14 0.29±0.19 0.50±0.34 0.26±0.16 0.24±0.15
Reverse-Raw 0.55±0.10 0.88±0.17 0.83±0.18 0.81±0.20 0.63±0.25 0.52±0.12 0.93±0.12 0.45±0.24 0.61±0.15
Random Guess 0.10±0.12 0.22±0.22 0.25±0.08 0.13±0.19 0.12±0.17 0.27±0.22 0.18±0.33 0.27±0.30 0.32±0.33

Table 6: The results of F1 scores of LLMs for different datasets.

Figure 8: The comparisons of F1 scores among LLMs, including GPT-4 turbo, GPT-4, GPT-3.5,
Claude 2, and LLaMa2-13B.

Method/Dataset Sachs Galton Alcohol EcoSystem MPG DWD Cement Stock Arrhythmia
GPT-4 turbo
Raw Data 20.40±8.82 0.14±0.35 2±0 1.20±1.22 4.47±1.75 5.40±0.71 2.67±3.53 4.67±1.58 2.64±0.97
Omit Data 17.73±2.52 0.16±0.35 1.80±0.40 2±1.37 2.80±1.05 5.3±0.97 0.80±2.99 3.27±1 2±0.65
Omit Knowledge 22.87±5.12 4.42±0.49 8.93±2.43 4.27±0.85 7.47±1.75 8±3.01 16.47±7 7.33±2.18 4±0.76
Reverse 22.53±1.75 4.86±0.35 7.93±0.25 4.33±0.94 7.93±1.18 8.13±0.81 12±3.35 4.73±2.02 4.14±0.35
Reverse-Raw 18.87±2.25 0.07±0.26 2.07±0.25 1.20±0.98 4.47±1.54 5.73±1.18 3.20±3.47 5±1.75 2.71±0.59
Random Guess 18.87±2.25 3.07±0.26 5.13±0.50 3±0 5.93±0.25 6±0 8±0 3.07±0.25 3.14±0.52
GPT-4
Raw Data 23.29±2.58 0.06±0 6.73±3.23 2.93±0.93 4±1.51 6±0 8.17±3.69 5.87±0.50 2.93±1
Omit Data 18.86±1.68 0±0 6.13±3.95 2.33±0.47 3±1.10 6±0.37 8.5±0.5 4.67±1.19 2.80±0.54
Omit Knowledge 25.43±2.41 4.60±1.08 10.27±2.35 4.07±1.12 7.20±2.04 11.60±2.89 21.3±9.26 7.80±1.42 4.07±0.77
Reverse 25.57±6.40 5±0 13.60±1.70 0.16±0.16 4.60±1.31 6±0 12.08±7.20 5.67±1.14 5.20±0.83
Reverse-Raw 22.93±2.60 0±0 6.53±3.40 2.87±1.45 5.87±2.73 6±0 8.42±5.09 6±0 3.60±0.88
Random Guess 20±0 3.33±0.47 5.53±0.96 3.27±0.44 6.13±1.02 6.20±0.65 9.17±1.67 4.40±1.40 3.53±0.50
GPT-3.5
Raw Data 25.47±6.23 0.86±1.19 5.33±2.12 1.87±1.75 6.33±1.70 7.73±2.74 2.13±3.54 4.67±1.49 4.40±0.71
Omit Data 20.73±2.29 0.50±0.82 3.67±1.62 1.84±1.77 5.53±1.45 5.27±0.44 2.53±3.44 3.44±1.34 3.40±0.61
Omit Knowledge 31±8.41 3.79±0.94 8.20±1.80 4.73±1 7.33±2.12 9.47±2.65 19.20±8.73 6±1.70 4.80±1.11
Reverse 27.20±8.57 4.79±0.67 9.47±2.31 4.13±0.81 6.20±1.42 9.60±1.62 13.93±2.72 5.11±1.45 4.60±0.88
Reverse-Raw 25.80±8.72 1.36±1.59 5.13±2.39 1.87±1.75 6.60±1.70 7.67±2.91 1.60±3.20 4.78±1.31 4.40±0.88
Random Guess 23.67±1.49 3.29±0.45 6.87±0.81 3.60±0.49 6.47±1.15 6.33±1.66 11.73±1.12 5.11±0.99 3.87±0.88
LLaMa2-13B
Raw Data 23.78±3.39 1.92±1.27 4.67±0.47 3.0±0.89 5.40±1.78 7.80±2.04 6±4.89 5.40±1.02 4.20±0.83
Omit Data 22±2.11 1±0 4±0.19 2.13±1.93 5.80±0.54 6.40±0.71 7.40±4.36 3±0 3.73±0.77
Omit Knowledge 23.44±2.41 3.38±1.08 7±0 4.27±0.44 7.07±1.12 7.67±1.40 10.33±1.07 4.80±1.47 4.53±0.81
Reverse 24.56±2.54 4.62±0.62 7.33±0.47 3.27±0.57 6.53±0.50 8.47±1.86 13.80±2.56 4.20±1.17 4.13±1.15
Reverse-Raw 24.11±2.47 2.23±1.42 5±1.63 3±0.89 5.33±1.07 8.47±2.06 9.13±5.43 5.20±0.98 4.07±0.85
Random Guess 24.44±2.22 3.38±0.49 6.67±0.47 3.73±0.77 6.40±1.36 7.60±1.36 12.87±1.20 5.20±0.75 3.80±0.75
Claude 2
Raw Data 19.60±1.36 0.93±0.93 2.20±1.47 2±0 5.53±1.02 5.27±1.24 2.40±2.70 4.60±1.89 2.73±1.12
Omit Data 17.07±1.81 1.07±0.57 2.07±1.69 2±0.96 5.27±0.85 4.93±0.93 3.53±3.42 3.13±0.34 1.73±1
Omit Knowledge 23.60±2.39 4.80±1.72 7.27±1.44 5.15±0.77 6.87±1.78 9±1.75 11.67±1.14 5.53±1.75 3.93±0.85
Reverse 22.07±1.06 4.40±0.71 8.13±0.96 3.23±0.80 6.20±1.17 6.73±1.39 13.33±4.66 4.27±1.29 5.13±0.62
Reverse-Raw 19.33±1.45 0.87±1.09 3.53±1.41 2±0 5.20±0.91 5.87±0.34 3.53±2.74 4.53±1.78 3.13±0.62
Random Guess 21.33±1.58 3.80±0.65 6.60±0.80 3.23±0.42 6.73±1.53 7.27±1.88 9.93±1.34 5±1.15 3.73±0.77

Table 7: The results of Structural Hamming Distance (SHD) of LLMs for different datasets.

the highest or comparable F1 scores across these nine datasets. Claude 2 demonstrates performance
similar to GPT-4 and slightly outperforms GPT-3.5. However, LLaMa2-13B exhibits a performance
gap across most datasets when compared to GPT-3.5. Despite the presence of performance
differences, these models exhibit consistent behavior across datasets. Specifically, if one model
performs poorly on a dataset, the other models also tend to exhibit lower performance on the
same dataset. For SHD, as illustrated in Figure 9, GPT-4 turbo achieves the lowest or
comparable SHD across the datasets. GPT-3.5 is slightly worse than Claude 2 but is comparable
with GPT-4 and LLaMa2-13B.
The findings derived from the pairwise causal discovery analysis presented in Table 3
indicate that within the context of the perfect causal relation simulation setting, LinGAM
consistently achieves accurate predictions of the relationship, whereas LLMs fall short in com-

Figure 9: The comparisons of Structural Hamming Distances among LLMs, including GPT-4 turbo,
GPT-4, GPT-3.5, Claude 2, and LLaMa2-13B.

parison to the numerical method, even after incorporating the simplified instructions from
LinGAM.

4.4 Supporting Analyses

We provide further supporting analyses to take a deeper look into the causal reasoning process of LLMs.
First, we observe that the order of variables influences the causal reasoning of LLMs.
In particular, in the case of ChatGPT, our experiments reveal that the order in which variables
are presented affects the model’s causal reasoning accuracy. We use Chain of Thought (CoT)
with the zero-shot prompt to investigate further. In the case of omitting knowledge, GPT-4/3.5, in
most cases, predicts the causal pairs following the column order. For example, if the columns are
ordered as bryoto, nienet, feouni, alphan, it outputs pairs such as bryoto → nienet,
nienet → feouni, feouni → alphan, or bryoto → nienet, feouni → alphan. This finding points
to an inherent bias in the model’s processing mechanism, likely stemming from its training
phase, and highlights the importance of considering variable sequencing in the prompt design
for causal inference tasks.
Second, we further identify that certain LLMs lack intrinsic computational skills or incor-
rectly interpret correlations as causation. A notable observation is the misuse of numerical
data by many LLMs. For example, in the case of omitting knowledge, using CoT, the LLM
conducts the correlation analysis first and then outputs the causal relation based on the cor-
relation without further performing a conditional independence test. Moreover, the result of
the correlation analysis is not correct (refer to Figure 10 to see the CoT of LLMs for causal
inference).

Figure 10: Chain of Thought for causal discovery when the knowledge is omitted.

To be specific, the correct causal pair is feouni → nienet with the increase of
feouni causing the increase of nienet. However, the column of nienet appears before that of
feouni, and GPT-4 outputs nienet → feouni. Hence, instead of employing appropriate computational
methods for discerning correlation versus causation, LLMs often resorted to flawed or simplistic
data interpretations, leading to inaccuracies in causal reasoning. This misapplication suggests
a need for improved training methodologies that emphasize the correct utilization of numerical
information in causal contexts.

5 Related Works
5.1 Large Language Models

Recent advancements in natural language processing have led to a rapid increase in the amount
of available text data (e.g., Vaswani et al., 2017; Kenton and Toutanova, 2019; Lewis et al.,
2019; Brown et al., 2020), which has spurred new research developments in various fields. Large
language models have been at the forefront of natural language processing advancements in
recent years. With the introduction of models such as GPT-3 by OpenAI, which contains 175
billion parameters, the ability of machines to understand and generate human-like text has seen
unprecedented growth (Brown et al., 2020). These models leverage deep learning and massive
amounts of data to achieve their performance, often requiring extensive computation resources
(Kaplan et al., 2020). However, they are not without challenges; issues such as bias, ethical use,
and environmental impact have sparked intense debate in the academic community (Bender
et al., 2021; Strubell et al., 2019). Furthermore, their application in real-world scenarios often
necessitates fine-tuning, which poses additional challenges in terms of data selection and model
robustness (Zhang et al., 2020). Despite these concerns, large language models continue to push
the boundaries of what’s possible, with applications ranging from writing assistance to complex
code generation (Chen et al., 2021).
Causal reasoning in large language models is an emerging area of study that seeks to endow
these models with the ability to understand cause-effect relationships within text. While large
language models like OpenAI’s GPT series have shown remarkable performance in generating
coherent and contextually relevant text, their ability to parse and apply causal reasoning has
been less explored (Weidinger et al., 2021). Recent studies, however, have begun to address this
by integrating causal inference mechanisms into the models’ architecture. For instance, Keith
et al. (2020) proposed a method for enhancing text generation with causal structure, showing
that it can improve a model’s reasoning capabilities. Riedel et al. (2019) introduced latent
causal variables into the training of language models, which helps in disentangling the under-
lying causal factors from observed data. Such enhancements are believed to make language
models not only better at language understanding and generation but also at more sophisti-
cated tasks like summarization, question answering, and decision-making that require causal
reasoning (Pearl and Mackenzie, 2019). Hence, understanding the ability of causal reasoning
in large language models is extremely important.
The task of causal inference poses a unique challenge to LLM capabilities. To address these
challenges, researchers have proposed innovative strategies to improve the causal reasoning of
LLMs. One such strategy is in-context learning, which involves providing LLMs with demon-
strative examples within their input context to guide their subsequent responses to queries.
This method, which has been explored in the works of Min et al. (2022) and Xie et al. (2021),
suggests that LLMs could derive substantial benefits from exposure to contextually relevant
precedents. However, the effectiveness of in-context learning is often limited by the maximum
token length that models can process, which restricts the amount of contextual information
that can be provided. An alternative approach is the development of Auto-GPT systems, as
proposed by Yang et al. (2023). These systems are designed to act on prompt instructions
and automatically incorporate external causal models for better decision-making. Auto-GPT
represents a step toward the autonomous integration of domain-specific expertise into the
decision-making processes of LLMs, using external databases or algorithms to improve their
causal reasoning abilities. In this work, we focus on disentangling the causal inference abilities of advanced LLMs, a topic that has attracted considerable interest in light of recent breakthroughs (e.g., Vaswani et al., 2017; Kenton and Toutanova, 2019; Lewis et al., 2019; Brown et al., 2020; Neelakantan et al., 2022; Stiennon et al., 2022; OpenAI, 2023).
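To illustrate the in-context learning strategy discussed above, the following minimal Python sketch assembles a few-shot prompt for a pairwise causal query; the demonstration pairs, variable names, and wording are hypothetical placeholders rather than the prompts used in our experiments.

# Minimal sketch of few-shot in-context learning for a causal query (illustrative only).
DEMONSTRATIONS = [
    ("altitude", "air pressure", "altitude causes air pressure"),
    ("smoking", "lung cancer", "smoking causes lung cancer"),
]

def build_icl_prompt(var_a, var_b):
    """Concatenate demonstrations and the target question into a single prompt."""
    lines = ["Decide the causal direction between the two variables."]
    for a, b, answer in DEMONSTRATIONS:
        lines.append(f"Q: Does {a} cause {b}, or does {b} cause {a}?")
        lines.append(f"A: {answer}")
    lines.append(f"Q: Does {var_a} cause {var_b}, or does {var_b} cause {var_a}?")
    lines.append("A:")
    return "\n".join(lines)

# The assembled prompt can be passed to any chat-completion endpoint; adding more
# demonstrations is limited by the model's maximum context length.
print(build_icl_prompt("water consumption", "plant growth"))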

5.2 Attribution Models for LLMs

Attribution models have gained prominence for their role in fairly allocating contributions
across features in predictive modeling. The Shapley value, emerging from cooperative game
theory, offers a principled method to apportion payoffs by considering the marginal contribu-
tion of each feature within all possible combinations of feature subsets. Adapted to machine
learning, this approach aids in gauging feature importance in complex models, including LLMs
(Lundberg and Lee, 2017). The introduction of SHAP (SHapley Additive exPlanations) values
by Lundberg and Lee (2017) marked a significant advance in model interpretability, applying
the Shapley value to detail the contribution of individual features to model outputs. This
method stands out for its rigorous theoretical underpinnings and fair feature treatment, ful-
filling efficiency and symmetry—traits not consistently found in alternative methods (Roth,
1988). The recent focus on computational efficiency has made Shapley-based methods more
accessible for high-dimensional datasets, although the inherent computational demand still
necessitates the use of approximation techniques for LLMs with numerous features (Strumbelj
and Kononenko, 2010; Datta et al., 2016; Kumar et al., 2020).
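To make the allocation rule concrete, the Shapley value of feature i with respect to a value function v over the feature set N is
\[
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\big(v(S \cup \{i\}) - v(S)\big).
\]
The following minimal Python sketch implements this exact computation; the toy value function and feature names are illustrative placeholders, not part of our method, and the nested enumeration over subsets makes plain why approximation techniques become necessary as the number of features grows.

from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley value of each feature under the coalition value function value_fn."""
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of f to the coalition `subset`.
                marginal = value_fn(set(subset) | {f}) - value_fn(set(subset))
                phi[f] += weight * marginal
    return phi

# Toy usage with a hypothetical additive-plus-interaction value function.
toy_value = lambda s: 2.0 * ("x1" in s) + 1.0 * ("x2" in s) + 0.5 * (("x1" in s) and ("x2" in s))
print(shapley_values(["x1", "x2", "x3"], toy_value))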
Furthermore, attribution models facilitate insight into the decision-making processes of
LLMs, such as GPT-3, by assigning relevance to different parts of the input data, which allows
for a deeper understanding of how these models predict or generate text (Wallace et al., 2019).
Visual interpretation methods like Layer-wise Relevance Propagation (LRP) and Integrated
Gradients have been crucial for evaluating individual input contributions to outputs, thereby
enhancing the transparency and trustworthiness of LLMs (Bach et al., 2015; Sundararajan
et al., 2017). Nevertheless, the intricacies of LLMs’ internal representations, which are often
intricate and intertwined, present substantial challenges in clear attribution. Studies indicate
that attention mechanisms, although commonly employed for interpretability, may not reliably
indicate the reasoning behind a model’s decisions (Jain and Wallace, 2019; Doshi-Velez and
Kim, 2017), highlighting the need for more sophisticated attribution methods that can capture
the nuances of large-scale neural network decisions.
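For reference, Integrated Gradients (Sundararajan et al., 2017) attributes to the i-th input dimension the quantity
\[
\mathrm{IG}_i(x) \;=\; (x_i - x_i') \int_0^1 \frac{\partial F\big(x' + \alpha\,(x - x')\big)}{\partial x_i}\, d\alpha,
\]
where F denotes the model output, x the input of interest, and x' a baseline input; in practice the integral is approximated by a finite Riemann sum over a modest number of interpolation steps.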

5.3 Causal Inference in LLMs

The past decade has witnessed increasing interest in causal inference to measure the effect or
impact of different components on the outcome of interest (e.g., Rubin, 1974, 1978; Rosenbaum
and Rubin, 1983; Robins, 2003; Rubin, 2006; Athey and Imbens, 2015; Chernozhukov et al.,
2018; Wager and Athey, 2018; Farrell et al., 2021). Relying on certain assumptions for causal
identification, there is a great popularity of causal effect estimation techniques, including tree
methods (see e.g., Athey and Imbens, 2016; Powers et al., 2018; Athey et al., 2019; Hahn
et al., 2020) and meta-learners (see e.g., Hill, 2011; Künzel et al., 2019; Kennedy, 2020; Nie
and Wager, 2021). However, all of these works are limited to structured data, and thus cannot be directly applied to the text data considered in this study. Recently, there have been
a few studies on leveraging NLP methods to address causal inference problems for text data,
such as text variables as confounders (e.g., Johansson et al., 2016; Keith et al., 2020; Roberts
et al., 2020; Mozer et al., 2020; Veitch et al., 2020), text variables as actions or treatments
(e.g., Tan et al., 2014; Fong and Grimmer, 2016; Wood-Doughty et al., 2018; Pryzant et al.,
2018; Wang and Culotta, 2019; Pryzant et al., 2020), and text variables as outcomes (e.g., Gill
and Hall, 2015; Egami et al., 2018; Sridhar and Getoor, 2019; Koroleva et al., 2019). Yet, all these works focus on the text itself (see e.g., Feder et al., 2022) and thus do not extend directly to more complex LLMs.
As the complexity of LLMs increases, the critical need for causal attribution—to identify
and quantify the influence of input factors on outputs—becomes more pronounced. This pro-
cess not only sheds light on the model’s decision-making but also bolsters transparency and
justifiability of model predictions (Chattopadhyay et al., 2019; Molnar, 2020). Techniques like
saliency mapping and influence functions are instrumental in highlighting relevant data and
tracing predictions back to training instances, thereby revealing how LLMs handle complex
inputs to produce coherent outputs (Simonyan et al., 2013; Koh and Liang, 2017). The advent
of causality-centric frameworks within LLMs is crucial for assuring model fidelity in practical scenarios: understanding the ‘why’ behind outputs makes them reliable and actionable (Goyal et al., 2019a; Wachter et al., 2017; Pearl, 2009). Furthermore, methods such as coun-
terfactual explanations and structural causal models have been key in unraveling the interplay
between input features and model predictions, which is vital for diagnosing failures and forti-
fying model robustness (Goyal et al., 2019b; Ribeiro et al., 2016). However, integrating causal
reasoning with LLMs is still an evolving and challenging field due to the models’ non-linear,
high-dimensional, and opaque nature (Moraffah et al., 2020; Kim et al., 2017). Our research
contributes to this field by proposing a novel causal attribution model that aligns with exper-
imental design, aiming to bridge the gap in current methodologies and facilitate the creation
of interpretable and ethically sound LLMs.
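As a simple illustration of the counterfactual reasoning referenced above, and not a description of our attribution model, the sketch below applies a do-style intervention to one input of a hypothetical black-box predictor and reads off its influence as the difference between the factual and intervened outputs; the predictor, variable names, and baseline are placeholders.

def black_box_predictor(context, numbers):
    """Hypothetical stand-in for an LLM-based predictor that mixes text and numerical inputs."""
    return sum(numbers) + (10.0 if "treatment" in context else 0.0)

def do_intervention(inputs, variable, value):
    """do(variable := value): copy the inputs and overwrite the chosen component."""
    intervened = dict(inputs)
    intervened[variable] = value
    return intervened

def attribution(inputs, variable, baseline):
    """Influence of `variable`: factual output minus output under do(variable := baseline)."""
    factual = black_box_predictor(inputs["context"], inputs["numbers"])
    intervened = do_intervention(inputs, variable, baseline)
    counterfactual = black_box_predictor(intervened["context"], intervened["numbers"])
    return factual - counterfactual

obs = {"context": "treatment group description", "numbers": [1.0, 2.0, 3.0]}
# Influence of the textual context, holding the numerical data fixed.
print(attribution(obs, "context", baseline=""))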

References

Athey, S. and Imbens, G. (2016), ‘Recursive partitioning for heterogeneous causal effects’,
Proceedings of the National Academy of Sciences 113(27), 7353–7360.

Athey, S. and Imbens, G. W. (2015), ‘Machine learning methods for estimating heterogeneous
causal effects’, stat 1050(5), 1–26.

Athey, S., Tibshirani, J. and Wager, S. (2019), ‘Generalized random forests’, The Annals of
Statistics 47(2), 1148–1178.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R. and Samek, W. (2015), ‘On
pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation’,
PLoS ONE 10(7), e0130140.

Bender, E. M., Gebru, T., McMillan-Major, A. and Shmitchell, S. (2021), On the dangers of stochastic parrots: Can language models be too big?, in ‘Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency’, pp. 610–623.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A.,
Shyam, P., Sastry, G., Askell, A. et al. (2020), ‘Language models are few-shot learners’,
Advances in neural information processing systems 33, 1877–1901.

Chattopadhyay, A., Manupriya, P., Sarkar, A. and Balakrishnan, A. (2019), ‘A framework for
evaluating the effects of input features on predictions in natural language processing’, arXiv
preprint arXiv:1902.07285 .

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H.,
Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf,
H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser,
L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis,
F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J.,
Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam,
J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer,
K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I. and Zaremba, W.
(2021), ‘Evaluating large language models trained on code’.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), ‘Double/debiased machine learning for treatment and structural parameters’, The Econometrics Journal 21(1), C1–C68.

Datta, A., Sen, S. and Zick, Y. (2016), Algorithmic transparency via quantitative input influ-
ence: Theory and experiments with learning systems, in ‘2016 IEEE Symposium on Security
and Privacy (SP)’, IEEE, pp. 598–617.

Doshi-Velez, F. and Kim, B. (2017), ‘Towards a rigorous science of interpretable machine learning’, arXiv preprint arXiv:1702.08608.

Egami, N., Fong, C. J., Grimmer, J., Roberts, M. E. and Stewart, B. M. (2018), ‘How to make
causal inferences using texts’, arXiv preprint arXiv:1802.02163 .

Farrell, M. H., Liang, T. and Misra, S. (2021), ‘Deep neural networks for estimation and
inference’, Econometrica 89(1), 181–213.

Feder, A., Keith, K. A., Manzoor, E., Pryzant, R., Sridhar, D., Wood-Doughty, Z., Eisenstein,
J., Grimmer, J., Reichart, R., Roberts, M. E. et al. (2022), ‘Causal inference in natural
language processing: Estimation, prediction, interpretation and beyond’, Transactions of
the Association for Computational Linguistics 10, 1138–1158.

Fong, C. and Grimmer, J. (2016), Discovery of treatments from text corpora, in ‘Proceedings
of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers)’, pp. 1600–1609.

Gao, J., Ding, X., Qin, B. and Liu, T. (2023), ‘Is chatgpt a good causal reasoner? a compre-
hensive evaluation’, arXiv preprint arXiv:2305.07375 .

Geiger, D., Verma, T. and Pearl, J. (1990), ‘Identifying independence in bayesian networks’,
Networks 20(5), 507–534.
URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/net.3230200504

Gill, M. and Hall, A. (2015), ‘How judicial identity changes the text of legal rulings’, Available
at SSRN 2620781 .

Goyal, A., Shroff, G., Gummadi, K. P. and Choudhury, M. (2019a), Explaining machine learn-
ing classifiers through diverse counterfactual explanations, in ‘Proceedings of the Conference
on Fairness, Accountability, and Transparency’, ACM, pp. 607–617.

Goyal, A., Shroff, G., Gummadi, K. P. and Choudhury, M. (2019b), Explaining machine learn-
ing classifiers through diverse counterfactual explanations, in ‘Proceedings of the Conference
on Fairness, Accountability, and Transparency’, ACM, pp. 607–617.

Hahn, P. R., Murray, J. S. and Carvalho, C. M. (2020), ‘Bayesian regression tree models for
causal inference: Regularization, confounding, and heterogeneous effects (with discussion)’,
Bayesian Analysis 15(3), 965–1056.

Hill, J. L. (2011), ‘Bayesian nonparametric modeling for causal inference’, Journal of Compu-
tational and Graphical Statistics 20(1), 217–240.

Hoyer, P. O., Janzing, D., Mooij, J., Peters, J. and Schölkopf, B. (2008), Nonlinear causal
discovery with additive noise models, in ‘Proceedings of the 21st International Conference
on Neural Information Processing Systems’, NIPS’08, Curran Associates Inc., Red Hook,
NY, USA, p. 689–696.

Jain, S. and Wallace, B. C. (2019), Attention is not explanation, in ‘Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies’, Vol. 1, Association for Computational Linguistics, pp. 3543–
3556.

Jin, Z., Chen, Y., Leeb, F., Gresele, L., Kamal, O., Lyu, Z., Blin, K., Gonzalez, F., Kleiman-
Weiner, M., Sachan, M. et al. (2023b), ‘Cladder: Assessing causal reasoning in language
models’.

Jin, Z., Liu, J., Lyu, Z., Poff, S., Sachan, M., Mihalcea, R., Diab, M. and Schölkopf, B.
(2023a), ‘Can large language models infer causation from correlation?’, arXiv preprint
arXiv:2306.05836 .

Johansson, F., Shalit, U. and Sontag, D. (2016), Learning representations for counterfactual
inference, in ‘International conference on machine learning’, PMLR, pp. 3020–3029.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S.,
Radford, A., Wu, J. and Amodei, D. (2020), ‘Scaling laws for neural language models’,
arXiv preprint arXiv:2001.08361 .

Keith, K. A., Jensen, D. and O’Connor, B. (2020), ‘Text and causal inference: A review of
using text to remove confounding from causal estimates’, arXiv preprint arXiv:2005.00649 .

Kennedy, E. H. (2020), ‘Optimal doubly robust estimation of heterogeneous causal effects’, arXiv preprint arXiv:2004.14497.

Kenton, J. D. M.-W. C. and Toutanova, L. K. (2019), Bert: Pre-training of deep bidirectional transformers for language understanding, in ‘Proceedings of NAACL-HLT’, pp. 4171–4186.

Kıcıman, E., Ness, R., Sharma, A. and Tan, C. (2023), ‘Causal reasoning and large language
models: Opening a new frontier for causality’, arXiv preprint arXiv:2305.00050 .

Kim, J., Rohrbach, A., Darrell, T., Canny, J. and Akata, Z. (2017), ‘Interpretable learning
for self-driving cars by visualizing causal attention’, Proceedings of the IEEE International
Conference on Computer Vision pp. 2942–2950.

Koh, P. W. and Liang, P. (2017), Understanding black-box predictions via influence func-
tions, in ‘Proceedings of the 34th International Conference on Machine Learning-Volume
70’, JMLR.org, pp. 1885–1894.

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. and Iwasawa, Y. (2022), ‘Large language models
are zero-shot reasoners’, Advances in neural information processing systems 35, 22199–22213.

Koroleva, A., Kamath, S. and Paroubek, P. (2019), ‘Measuring semantic similarity of clini-
cal trial outcomes using deep pre-trained language representations’, Journal of Biomedical
Informatics 100, 100058.

Kumar, P., Elenberg, E. R., Burke, M., Bhattacharyya, R., Ghosh, J. and Dimakis, A. G.
(2020), ‘Problems with shapley-value-based explanations as feature importance measures’,
arXiv preprint arXiv:2002.11097 .

Künzel, S. R., Sekhon, J. S., Bickel, P. J. and Yu, B. (2019), ‘Metalearners for estimating
heterogeneous treatment effects using machine learning’, Proceedings of the national academy
of sciences 116(10), 4156–4165.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V.
and Zettlemoyer, L. (2019), ‘Bart: Denoising sequence-to-sequence pre-training for natural
language generation, translation, and comprehension’, arXiv preprint arXiv:1910.13461 .

Liu, Y., Yao, Y., Ton, J.-F., Zhang, X., Cheng, R. G. H., Klochkov, Y., Taufiq, M. F. and Li,
H. (2023), ‘Trustworthy llms: a survey and guideline for evaluating large language models’
alignment’, arXiv preprint arXiv:2308.05374 .

Lundberg, S. M. and Lee, S.-I. (2017), ‘A unified approach to interpreting model predictions’,
Advances in Neural Information Processing Systems 30.

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H. and Zettlemoyer, L.
(2022), ‘Rethinking the role of demonstrations: What makes in-context learning work?’,
arXiv preprint arXiv:2202.12837 .

Anthropic (n.d.), ‘Model card and evaluations for Claude models’, https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf.

Molnar, C. (2020), Interpretable Machine Learning, Leanpub. https://christophm.github.io/interpretable-ml-book/.

Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J. and Schölkopf, B. (2015), ‘Distinguishing
cause from effect using observational data: methods and benchmarks’.

Moraffah, R., Ramezani, M., Liu, H., Papalexakis, E. and Kifer, D. (2020), ‘Causal inter-
pretability for machine learning - problems, methods and evaluation’, SIGKDD Explorations
22(1), 18–33.

Mozer, R., Miratrix, L., Kaufman, A. R. and Anastasopoulos, L. J. (2020), ‘Matching with
text data: An experimental evaluation of methods for matching documents and of measuring
match quality’, Political Analysis 28(4), 445–468.

Neelakantan, A., Xu, T., Puri, R., Radford, A., Han, J. M., Tworek, J., Yuan, Q., Tezak,
N., Kim, J. W., Hallacy, C., Heidecke, J., Shyam, P., Power, B., Nekoul, T. E., Sastry,
G., Krueger, G., Schnurr, D., Such, F. P., Hsu, K., Thompson, M., Khan, T., Sherbakov,
T., Jang, J., Welinder, P. and Weng, L. (2022), ‘Text and code embeddings by contrastive
pre-training’.

Nie, X. and Wager, S. (2021), ‘Quasi-oracle estimation of heterogeneous treatment effects’, Biometrika 108(2), 299–319.

OpenAI (2023), ‘Gpt-4 technical report’.

Pan, T., Hu, G. and Shen, W. (2020), ‘Identifying latent groups in spatial panel data using a Markov random field constrained product partition model’, arXiv preprint arXiv:2012.10541.

Pearl, J. (2000), Causality: Models, Reasoning, and Inference, Cambridge University Press.

Pearl, J. (2009), Causality: Models, Reasoning, and Inference, 2 edn, Cambridge University
Press.

Pearl, J. and Mackenzie, D. (2019), The Book of Why: The New Science of Cause and Effect,
Basic Books.

Pearl, J. et al. (2009), ‘Causal inference in statistics: An overview’, Statistics surveys 3, 96–146.

Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N. H., Hastie, T. and Tibshirani, R. (2018),
‘Some methods for heterogeneous treatment effect estimation in high dimensions’, Statistics
in medicine 37(11), 1767–1787.

Pryzant, R., Card, D., Jurafsky, D., Veitch, V. and Sridhar, D. (2020), ‘Causal effects of
linguistic properties’, arXiv preprint arXiv:2010.12919 .

Pryzant, R., Shen, K., Jurafsky, D. and Wagner, S. (2018), Deconfounded lexicon induction for
interpretable social science, in ‘Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long Papers)’, pp. 1615–1625.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. and Sutskever, I. (2019), ‘Language
models are unsupervised multitask learners’, OpenAI Blog 1(8), 2–6.

Ribeiro, M. T., Singh, S. and Guestrin, C. (2016), “Why should I trust you?”: Explaining the predictions of any classifier, in ‘Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining’, pp. 1135–1144.

Riedel, S., Bosselut, A., Holtzman, A., Forbes, M. and Choi, Y. (2019), Latent relation lan-
guage models, in ‘Proceedings of the AAAI Conference on Artificial Intelligence’, Vol. 33,
pp. 7373–7380.

Roberts, M. E., Stewart, B. M. and Nielsen, R. A. (2020), ‘Adjusting for confounding with
text matching’, American Journal of Political Science 64(4), 887–903.

Robins, J. M. (2003), ‘Semantics of causal dag models and the identification of direct and
indirect effects’, Highly Structured Stochastic Systems pp. 70–81.

Rosenbaum, P. R. and Rubin, D. B. (1983), ‘The central role of the propensity score in obser-
vational studies for causal effects’, Biometrika 70(1), 41–55.

Roth, A. E. (1988), The Shapley Value: Essays in Honor of Lloyd S. Shapley, Cambridge
University Press.

Rubin, D. B. (1974), ‘Estimating causal effects of treatments in randomized and nonrandomized studies’, Journal of Educational Psychology 66(5), 688.

Rubin, D. B. (1978), ‘Bayesian inference for causal effects: The role of randomization’, The
Annals of statistics 6, 34–58.

Rubin, D. B. (2006), Matched sampling for causal effects, Cambridge University Press.

Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A. and Nolan, G. P. (2005), ‘Causal protein-
signaling networks derived from multiparameter single-cell data’, Science 308(5721), 523–
529.
URL: https://www.science.org/doi/abs/10.1126/science.1105809

Shimizu, S., Hoyer, P. O., Hyvärinen, A. and Kerminen, A. (2006), ‘A linear non-gaussian
acyclic model for causal discovery’, Journal of Machine Learning Research 7, 2003–2030.
URL: http://jmlr.org/papers/v7/shimizu06a.html

Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P. O.
and Bollen, K. (2011), ‘Directlingam: A direct method for learning a linear non-gaussian
structural equation model’, Journal of Machine Learning Research 12, 1225–1248.
URL: http://jmlr.org/papers/v12/shimizu11a.html

Simonyan, K., Vedaldi, A. and Zisserman, A. (2013), ‘Deep inside convolutional networks:
Visualising image classification models and saliency maps’, arXiv preprint arXiv:1312.6034.

Sridhar, D. and Getoor, L. (2019), ‘Estimating causal effects of tone in online debates’, arXiv
preprint arXiv:1906.04177 .

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D.
and Christiano, P. (2022), ‘Learning to summarize from human feedback’.

Strubell, E., Ganesh, A. and McCallum, A. (2019), ‘Energy and policy considerations for deep
learning in nlp’, arXiv preprint arXiv:1906.02243 .

Strumbelj, E. and Kononenko, I. (2010), ‘An efficient explanation of individual classifications using game theory’, Journal of Machine Learning Research 11, 1–18.

Sundararajan, M., Taly, A. and Yan, Q. (2017), Axiomatic attribution for deep networks, in
‘International Conference on Machine Learning’, PMLR, pp. 3319–3328.

Tan, C., Lee, L. and Pang, B. (2014), ‘The effect of wording on message propagation: Topic-and
author-controlled natural experiments on twitter’, arXiv preprint arXiv:1405.1438 .

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra,
S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull,
G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal,
N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M.,
Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D.,
Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A.,
Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian,
R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I.,
Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S. and
Scialom, T. (2023), ‘Llama 2: Open foundation and fine-tuned chat models’.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u.
and Polosukhin, I. (2017), Attention is all you need, in I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett, eds, ‘Advances in Neural Informa-
tion Processing Systems’, Vol. 30, Curran Associates, Inc.
URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-
Paper.pdf

Veitch, V., Sridhar, D. and Blei, D. (2020), Adapting text embeddings for causal inference, in
‘Conference on Uncertainty in Artificial Intelligence’, PMLR, pp. 919–928.

Wachter, S., Mittelstadt, B. and Russell, C. (2017), ‘Counterfactual explanations without opening the black box: Automated decisions and the GDPR’, Harvard Journal of Law & Technology 31, 841.

Wager, S. and Athey, S. (2018), ‘Estimation and inference of heterogeneous treatment effects
using random forests’, Journal of the American Statistical Association 113(523), 1228–1242.

Wallace, E., Wang, J. T., Li, S. S., Singh, S. and Gardner, M. (2019), ‘Allennlp interpret: A
framework for explaining predictions of nlp models’, arXiv preprint arXiv:1909.09251 .

Wang, Z. and Culotta, A. (2019), When do words matter? understanding the impact of lexical
choice on audience perception using individual treatment effect estimation, in ‘Proceedings
of the AAAI Conference on Artificial Intelligence’, Vol. 33, pp. 7233–7240.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D. et al.
(2022), ‘Chain-of-thought prompting elicits reasoning in large language models’, Advances
in Neural Information Processing Systems 35, 24824–24837.

Weidinger, L., Kamm, L., Mendelsohn, J., Cotterell, R., Riedel, S. and Hovy, D. (2021),
‘Ethical and social risks of harm from language models’, arXiv preprint arXiv:2112.04359 .

Wood-Doughty, Z., Shpitser, I. and Dredze, M. (2018), Challenges of using text classifiers
for causal inference, in ‘Proceedings of the Conference on Empirical Methods in Natural
Language Processing. Conference on Empirical Methods in Natural Language Processing’,
Vol. 2018, NIH Public Access, p. 4586.

Xie, S. M., Raghunathan, A., Liang, P. and Ma, T. (2021), ‘An explanation of in-context
learning as implicit bayesian inference’, arXiv preprint arXiv:2111.02080 .

Yang, H., Yue, S. and He, Y. (2023), ‘Auto-gpt for online decision making: Benchmarks and
additional opinions’, arXiv preprint arXiv:2306.02224 .

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. and Artzi, Y. (2020), ‘Revisiting few-sample
bert fine-tuning’, arXiv preprint arXiv:2006.05987 .

Zheng, Y., Huang, B., Chen, W., Ramsey, J., Gong, M., Cai, R., Shimizu, S., Spirtes,
P. and Zhang, K. (2023), ‘Causal-learn: Causal discovery in python’, arXiv preprint
arXiv:2307.16405 .
