
International Journal of Accounting Information Systems 46 (2022) 100572

journal homepage: www.elsevier.com/locate/accinf

Explainable Artificial Intelligence (XAI) in auditing


Chanyuan (Abigail) Zhang a,*, Soohyun Cho b, Miklos Vasarhelyi b
a Rutgers Business School, United States & Research Institute of Economics and Management, Southwestern University of Finance and Economics, People's Republic of China
b Rutgers Business School, United States

ARTICLE INFO

Keywords: Explainable Artificial Intelligence (XAI); Auditing; Machine learning; Material restatement; LIME; SHAP

ABSTRACT

Artificial Intelligence (AI) and Machine Learning (ML) are gaining increasing attention regarding their potential applications in auditing. One major challenge of their adoption in auditing is the lack of explainability of their results. As AI/ML matures, so do techniques that can enhance the interpretability of AI, a.k.a., Explainable Artificial Intelligence (XAI). This paper introduces XAI techniques to auditing practitioners and researchers. We discuss how different XAI techniques can be used to meet the requirements of audit documentation and audit evidence standards. Furthermore, we demonstrate popular XAI techniques, especially Local Interpretable Model-agnostic Explanations (LIME) and Shapley Additive exPlanations (SHAP), using an auditing task of assessing the risk of material misstatement. This paper contributes to accounting information systems research and practice by introducing XAI techniques to enhance the transparency and interpretability of AI applications applied to auditing tasks.

1. Introduction

This paper introduces Explainable Artificial Intelligence (XAI) techniques to auditing practitioners and researchers. It also demonstrates popular XAI techniques applied in auditing. Artificial Intelligence (AI) can perform tasks typically requiring human intelligence (Russell and Norvig, 2002; IEEE, 2019). Recent years have seen a growing trend of AI, especially Machine Learning (ML)
applications in accounting and auditing (e.g., Perols et al., 2016; Bao et al., 2020; Brown et al., 2020; AICPA, 2020). ML is a
computational method that can identify hidden patterns in data and make predictions or classifications (Hastie et al., 2009; Alpaydin,
2020).1
Over the years, the focus of research and application of AI models in accounting and auditing has been improving their predictive
performance (e.g., Perols, 2011; Perols et al., 2016; Bao et al., 2020). However, as an AI model’s predictive performance increases, the
model’s explainability generally decreases (Virág and Nyitrai, 2014; DARPA, 2016; Baryannis et al., 2019). Many variables and multi-
layered calculations are often required to make a model more effective, making the model opaque like a “black box” (Lipton, 2018;
Miller, 2019). The lack of explainability of complicated AI models is one of the main challenges of AI adoption in accounting and
auditing (AICPA, 2020; CPAB, 2021).
While an AI model can provide auditors with a list of outlier transactions in an audit setting, it can hardly explain why they are

* Corresponding author.
E-mail addresses: abigail.zhang@rutgers.edu (C.(A. Zhang), scho@business.rutgers.edu (S. Cho), miklosv@business.rutgers.edu (M. Vasarhelyi).
1
We use the term “prediction” to broadly refer to either predicting a continuous value or categorical outcomes. The latter application is often
referred to as “classification”.

https://doi.org/10.1016/j.accinf.2022.100572

Available online 1 August 2022


1467-0895/© 2022 Elsevier Inc. All rights reserved.

identified as outliers and what auditors should look for in their further investigation. Existing standards regarding audit documentation
and audit evidence (e.g., PCAOB AS 1105; AS 1215) imply that if auditors cannot explain and document the inner workings or the
output of an AI model, they are restricted in how much reliance they can place on such tools (AICPA, 2020; CPAB, 2021).
As AI matures, explaining AI, rather than solely refining its performance, is becoming increasingly important in the accounting and
auditing profession (ACCA, 2020). In a recent survey of members of the Association of Chartered Certified Accountants (ACCA) and
Institute of Management Accountants (IMA), 54 % of respondents agreed that AI explainability affects the ability of professional accountants to display skepticism, more than twice the number who disagreed (ACCA, 2020).
The need to explain opaque AI programs is not unique to public accounting professionals. It is a legal mandate in banking, insurance, and healthcare to have interpretable, fair, and transparent models (Hall and Gill, 2019).2 To tackle the universal need for a
better interpretation of AI processes and outputs, computer scientists have developed a stream of research dedicated to XAI. XAI is a set
of techniques explaining a black-box machine learning algorithm (DARPA, 2016; Miller, 2019). XAI has been used to document,
understand, and validate black-box machine learning models used for credit lending (Hall and Gill, 2019). Although computer science
research has developed many XAI techniques, little work has been done to introduce XAI to accounting and auditing professionals. In the aforementioned survey of ACCA and IMA members, 51 % of respondents were not aware of XAI or AI explainability (ACCA,
2020).
This paper focuses on XAI in auditing mainly because the assurance profession is highly regulated and, therefore, is subject to strict
requirements for understanding and explaining analytics tools used in audit engagements. It is crucial to provide XAI literacy to audit
professionals because XAI can potentially help them better understand and document the inner workings/outputs of AI tools used for
audits, enhance professional skepticism while using AI tools, and guide practitioners on how to choose and implement appropriate XAI
techniques to make their AI applications more transparent, auditable, and responsible.
This paper contributes to the accounting and auditing literature in the following ways. First, it adds to the growing stream of
accounting and auditing research that adopts AI (e.g., Cecchini et al., 2010; Perols, 2011; Perols et al., 2016; Bao et al., 2020; Brown
et al., 2020; Ding et al., 2020; Bertomeu et al., 2021) by introducing XAI techniques to help interpret the inner workings and results of
advanced AI models. As more accounting and auditing research adopts AI, it is not enough to solely boost the performance of the AI
models. Instead, it becomes increasingly essential to explain the outputs of AI. This paper describes techniques that can be applied to
make AI artifacts in accounting and auditing more interpretable. It can also spark future accounting research along the vein of XAI.
Second, while studies from computer science have provided an overview of XAI (e.g., Adadi and Berrada, 2018; Miller, 2019;
Molnar, 2021), they do not directly link it to auditing practice and research. This research bridges such a knowledge gap by introducing
the state-of-the-art XAI knowledge to auditing professionals using familiar terms, linking XAI techniques to existing auditing standards,
and illustrating popular XAI methods with concrete auditing examples.
Third, this paper responds to the call for practice-relevant accounting research (Kaplan, 2011; Basu, 2012; Wood, 2016; Rajgopal,
2020; Burton et al., 2020a; Burton et al., 2020b; Christ et al., 2020; Christ et al., 2021) by discussing techniques and guidance that both
researchers and practitioners can adopt to enhance the interpretability of their AI applications. Furthermore, policymakers can also use
this paper as a baseline to suggest proper XAI techniques in public accounting practice.

2. Explainable Artificial Intelligence

The term “Explainable AI” was first coined by Van Lent et al. (2004) to describe the ability of a training system developed for the US
Army to explain its AI-driven decisions (Van Lent et al., 2004; Adadi and Berrada, 2018). In 2017, the Defense Advanced Research
Projects Agency (DARPA) launched its XAI program to develop techniques that can explain intelligent systems. DARPA defines XAI as a
suite of techniques that “produce explainable models that, when combined with effective explanation techniques, enable end-users to
understand, appropriately trust, and effectively manage the emerging generation of Artificial Intelligence (AI) systems” (DARPA, 2016,
p. 5). Following the XAI literature, this paper uses the terms “explainability” and “interpretability” interchangeably to refer to the degree to which a user
understands the cause of a decision (Biran and Cotton, 2017; Miller, 2019; Molnar, 2021). XAI emphasizes the role of ML algorithms
not just in providing an output but also in sharing with the user the supporting information on how the algorithm reached a particular
conclusion (ACCA, 2020).
The concept of explaining AI has existed since the mid-1970s, when expert systems were the norm for AI (Moore and Swartout,
1988; Adadi and Berrada, 2018; Miller, 2019). Expert systems are considered the “first wave of AI,” which are rule-based systems built
on specialist knowledge (Moore and Swartout, 1988; Liao, 2005; Launchbury, 2017). With the advances of the “second wave of AI,”
represented by statistical learning methods like machine learning, interest in XAI slowed down because the focus of AI research shifted towards model implementation and improving predictive power (Launchbury, 2017; Adadi and Berrada, 2018). As a result,
advances in ML have produced autonomous systems that can perceive, learn, decide, and act independently (DARPA, 2016). However,
recent years have seen a resurgence in XAI discussions and research mainly because many modern AI applications have been found to
lack transparency and interpretability, leading to limited applications and concerns regarding ethics and trust (e.g., Miller, 2019;
Adadi and Berrada, 2018). XAI is seen as a part of the “third wave AI,” which aims to generate algorithms that can explain themselves

2
Major regulatory statutes currently governing these industries include the Civil Rights Acts of 1964 and 1991, the Americans with Disabilities Act, the Genetic Information Nondiscrimination Act, the Health Insurance Portability and Accountability Act, the Equal Credit Opportunity Act (ECOA), the Fair Credit Reporting Act (FCRA), the Fair Housing Act, Federal Reserve SR 11-7, and European Union (EU) General Data Protection Regulation (GDPR) Article 22 (Burt, 2018; Hall and Gill, 2019).


(Launchbury, 2017). Adadi and Berrada (2018) further summarized the need for XAI into four aspects: explain to justify, explain to
control, explain to improve, and explain to discover. In 2020, Gartner classified XAI as one of the emerging technologies at the peak of inflated expectations.
XAI is consistent with the concept of “human-in-the-loop,” as the objective of XAI is to enable a better understanding of opaque AI
systems so that humans can better utilize such tools to assist their work (Bauer and Baldes, 2005; Tamagnini et al., 2017; Zhu et al.,
2018; Abdul et al., 2018). In recent years, XAI has been applied in areas such as transportation (e.g., Eliot, 2021), healthcare (e.g., Pawer et al., 2020; Khedkar et al., 2020), law (e.g., Atkinson et al., 2020), finance (e.g., Bussmann et al., 2020), and the military (e.g., Van Lent
et al., 2004).
Existing XAI methods mainly apply to supervised learning. They can be applied to tabular, textual, and image data (Molnar, 2021).
XAI techniques can be generally divided into two types: ante-hoc techniques and post-hoc techniques (Lipton, 2018; Adadi and
Berrada, 2018; Molnar, 2021). Fig. 1 provides a holistic summary of the common XAI techniques. The idea of ante-hoc techniques is to
directly adopt ML models that are inherently interpretable, such as decision trees and explainable neural networks (Hall and Gill, 2019;
Molnar, 2021). Common interpretable models and their advantages and disadvantages can be found in Molnar (2021). Although most
ante-hoc techniques are intuitive and straightforward, they are restricted to a selected list of models that are considered inherently
interpretable, most of which have inferior predictive performance compared to more complex ML models like deep neural networks (e.g., Virág and Nyitrai, 2014; Baryannis et al., 2019). In contrast, post-hoc techniques can be applied to any ML algorithm because they
generate an explanation after a model is trained (Molnar, 2021). In XAI, post-hoc techniques are mainstream because they are not
model-specific (Ribeiro et al., 2016; Hall and Gill, 2019; Molnar, 2021). Therefore, this paper focuses on post-hoc techniques.

3. XAI in auditing

When AI/ML is used in audit procedures, it is subject to auditing standards, such as audit evidence (e.g., PCAOB AS 1105) and audit
documentation (e.g., PCAOB AS 1215). Audit evidence standards require auditors to obtain sufficient and appropriate audit evidence
to provide a reasonable basis for their opinion when performing audit procedures (PCAOB AS 1105). Sufficiency measures the quantity
of audit evidence, and appropriateness represents the quality of audit evidence, which can be further broken down into relevance and
reliability. When AI/ML is used in audit procedures, the sufficiency of audit evidence can be increased as AI/ML can automate specific
audit procedures, such as automating full-population testing (No et al., 2019). When AI/ML assists audit procedures by detecting
anomalies or making predictions, auditors need to consider which assertions are the most relevant to the outputs from AI/ML and
whether the outcomes from the AI/ML are reliable. For example, when an ML algorithm predicts that one transaction is fraudulent, the
auditor needs to decide which assertion is affected (e.g., valuation or existence) and how reliably the ML algorithm makes this prediction before taking action.
Audit documentation standards require auditors to document the basis for the auditor’s conclusions concerning every relevant
financial statement assertion, including “records of the planning and performance of the work, the procedures performed, evidence
obtained, and conclusions reached by the auditor” (PCAOB AS 1215). Audit documentation should be in sufficient detail to clearly
understand its purpose, source, and the conclusions reached (PCAOB AS 1215). When AI/ML assists auditors’ judgments regarding any
financial statement assertions, auditors should document audit evidence (PCAOB AS 1105) obtained from AI/ML tools and their nature
(i.e., mechanism, reliability). Audit documentation is essential as it is the basis for the review of the quality of the work, and it facilitates the planning, performance, and supervision of the engagement (PCAOB AS 1215).
As implied by audit evidence and documentation standards (e.g., PCAOB AS 1105, AICPA AU-C 500, PCAOB AS 1215, AICPA AU-C 230), the following aspects of AI/ML application in audit engagements need to be documented:

1) What ML algorithm is used?
2) How does the algorithm work?
3) What data is used to train the model?
4) What is the overall performance of the model?
5) How does the trained model make a decision?
6) How does the model decide for a particular instance (e.g., a transaction or an audit engagement)?

Fig. 1. Summary of Common XAI Techniques.

Aspects 1 to 5 concern the overall (i.e., global) information about the AI/ML tool. Consequently, they are of primary concern for national audit offices when they develop, document, and evaluate the tools. These steps are performed so that audit partners and regulators can gain a basic understanding of the tools. Aspect 6 is concerned with how the AI/ML tool decides on a specific (i.e., local) instance, such as in a particular audit engagement or on a specific account/transaction. Hence, it is more relevant to field auditors who conduct the audit procedures.
Aspects 1 to 4 require general information about the AI/ML tool that is relatively easy to obtain, especially if the tool is developed
in-house by the national audit office. However, aspects 5 and 6 require explaining the inner workings of the tool and the rationale behind its decisions. Addressing aspects 5 and 6 becomes challenging when the AI/ML algorithm is complex and opaque, and thus falls into the scope of XAI.
Fig. 2 summarizes the common post-hoc XAI techniques based on their functions and relevance to audit evidence and audit
documentation standards (e.g., PCAOB AS 1215, AS 1105). Besides differences in terms of functions, XAI methods also differ in the
scope of their explanation. Some XAI methods are used for a global interpretation of an opaque model, while others are for a local
interpretation. The difference between global and local interpretation is that the former interprets a model by aggregating over all instances, while the latter focuses on interpreting individual predictions. The Appendix provides descriptions of the
post-hoc XAI techniques. In the next section, we demonstrate popular XAI techniques with a concrete set of data using ML to assess the
risk of material misstatement.

4. Demonstration of LIME and SHAP

In this demonstration, we first develop black-box ML models that can predict the risk of material restatement, which we use as a
proxy for material misstatement. Then, we apply two mainstream XAI techniques, Local Interpretable Model-agnostic Explanations
(LIME) and Shapley Additive exPlanations (SHAP), to the established ML models to obtain instance-level interpretations. We focus on
LIME and SHAP in this section because they are the favored XAI techniques (Lu et al., 2020; Slack et al., 2020; Senoner et al., 2021).
Furthermore, LIME and SHAP provide local-level interpretations relevant to field auditors who focus on specific audit engagements,
accounts, or transactions.3

4.1. Use Case: Risk of material misstatement assessment

ML is mainly applied to tabular data (i.e., data stored in columns and rows). Thus, this demonstration focuses on a use case where
tabular data is utilized. The demonstration use case is an audit task that ML can potentially assist: assessing the client’s risk of material
misstatement at the financial statement level in audit planning (AS 2101). The client’s risk of material misstatement is a product of
inherent risk and control risk (Louwers et al., 2020). With an assessed level of risk of material misstatement, auditors adjust their
detection risk to achieve an acceptable level of audit risk, which is the risk that auditors issue a clean audit opinion on a financial
statement that contains material misstatements (Louwers et al., 2020).

4.2. Sample and data

The dataset is adapted from Zhang et al. (2022). It spans from 2005 to 2017 with 26,841 firm-year observations. Since the existence
of material misstatement in the financial statement is not public knowledge until the financial statement is restated, we follow prior
literature to use material restatement as a proxy for the existence of material misstatement (Bertomeu et al., 2021). Material
restatement (RES) is a binary variable that equals one if an annual financial statement is restated due to either GAAP violations or
fraud, as disclosed on Form 8-K or Form 8-K/A (Bertomeu et al., 2021; Zhang et al., 2021). Table 1 presents the sample distribution by
year. The sample’s overall percentage of observations with material restatements is 3.53 %.4 In establishing ML models, RES is used as
the outcome variable.
We use two alternative sets of variables as the predictors for RES. The first set of predictors is 35 raw financial statement variables
commonly used in restatement/misstatement prediction literature (Beneish 1997, 1999; Dechow et al., 2011; Cecchini et al., 2010;
Perols 2011; Perols et al., 2016; Bao et al., 2020; Zhang et al., 2022). All raw financial statement variables are scaled by total assets. The second set of predictors is the 21 variables derived from financial statements identified by Bertomeu et al. (2021) to be the most
predictive of material misstatement.5 We use financial statement variables as predictors to mimic the situation where auditors evaluate
a client’s financial statement and identify risky accounts (Louwers et al., 2020). Tables 2 and 3 list the two sets of predictor variables,
respectively.

3
LIME and SHAP are XAI techniques that provide explanations for a given instance at the local level.
4
Unlike Zhang et al. (2022), who only kept the first year of restatement if the restatement affects financial statements in consecutive years, we retained all restatements since the objective is to predict the existence of material misstatement.
5
The 21 variables come from Table 7 in Bertomeu et al. (2021). We only included the variables that can be derived from financial statements.


Fig. 2. Overview of Post-hoc XAI Methods and Their Relevance to Auditing Standards.

Table 1
Sample Distribution by Year.
Data Year - Fiscal Number of firms Number of material restatements Percentage

2005 2324 205 8.82 %


2006 2132 114 5.35 %
2007 1901 68 3.58 %
2008 2090 80 3.83 %
2009 1923 56 2.91 %
2010 2005 69 3.44 %
2011 1976 65 3.29 %
2012 1993 65 3.26 %
2013 2102 57 2.71 %
2014 2090 52 2.49 %
2015 2074 50 2.41 %
2016 2110 34 1.61 %
2017 2121 33 1.56 %
Total 26,841 948 3.53 %

4.3. Establish a black-box machine learning model

The ML algorithm chosen for the demonstration is XGBoost. XGBoost is an ensemble machine learning method that aggregates
predictions from single models (Hastie et al., 2009; Bao et al., 2020). An introduction to the XGBoost algorithm is documented in Chen
and Guestrin (2016). XGBoost is used in this demonstration for two reasons: first, it is one of the most popular ML algorithms applied in
prediction tasks in a variety of domains (Chen and Guestrin, 2016; Nielsen, 2016; ACCA, 2020); second, XGBoost is compatible with
the open-source LIME and SHAP tools utilized in this demonstration.
We divide the dataset into training and testing, with the training data spanning from 2005 to 2014 (76.5 % of the population) and
the testing data covering 2015 to 2017 (23.5 % of the population). We further split the training dataset into “training for tuning” (years
2005 to 2012) and “testing for tuning” (years 2013 to 2014) to tune the hyperparameters of XGBoost.6 Specifically, we train XGBoost
models with different combinations of hyperparameters using the “training for tuning” dataset and apply the trained model to the hold-
out “testing for tuning” dataset. The model that generates the highest out-of-sample Area Under the Receiver Operating Characteristic
Curve (AUC) is chosen.7 Then, this “tuned” model is trained using the training dataset and applied to the hold-out testing dataset. Fig. 3
illustrates the sample splitting.
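To make the procedure concrete, the following Python sketch shows one way to implement the time-based split and grid search described above. It assumes a pandas DataFrame df with a fiscal-year column fyear, a list of predictor column names predictors, and the binary label RES; these names are illustrative and not the authors' code.

```python
# Illustrative sketch of the time-based split and hyperparameter search described above.
# Assumes a pandas DataFrame `df` with a fiscal-year column `fyear`, a list of predictor
# column names `predictors`, and the binary label `RES` (names are illustrative).
import itertools
import xgboost as xgb
from sklearn.metrics import roc_auc_score

def split(data, years):
    part = data[data["fyear"].isin(years)]
    return part[predictors], part["RES"]

X_tune_tr, y_tune_tr = split(df, range(2005, 2013))   # "training for tuning": 2005-2012
X_tune_te, y_tune_te = split(df, range(2013, 2015))   # "testing for tuning": 2013-2014
X_train, y_train = split(df, range(2005, 2015))       # full training set: 2005-2014
X_test, y_test = split(df, range(2015, 2018))         # hold-out test set: 2015-2017

# Hyperparameter grid from footnote 6 (learning_rate corresponds to eta)
grid = {"learning_rate": [0.01, 0.2, 0.3], "min_child_weight": [1, 5],
        "max_depth": [3, 6, 10], "subsample": [0.5, 1], "colsample_bytree": [0.5, 1]}

best_auc, best_params = -1.0, None
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    model = xgb.XGBClassifier(**params).fit(X_tune_tr, y_tune_tr)
    auc = roc_auc_score(y_tune_te, model.predict_proba(X_tune_te)[:, 1])
    if auc > best_auc:
        best_auc, best_params = auc, params

# Retrain the tuned model on the full training data and evaluate it out of sample
final_model = xgb.XGBClassifier(**best_params).fit(X_train, y_train)
print("Hold-out AUC:", roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1]))
```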

6
The hyperparameters and their values tuned for XGBoost are eta: [0.01, 0.2, 0.3], min_child_weight: [1, 5], max_depth: [3, 6, 10], subsample:
[0.5, 1], colsample_bytree: [0.5, 1].
7
Receiver Operating Characteristic curve is a plot of the true positive rate on the y-axis against the false positive rate on the x-axis for different
classification thresholds (Bradley, 1997).


Table 2
35 Raw Financial Variables (Excerpted from Zhang et al., 2022).
Variable Main Source

Cash and Short-Term Investments Beneish (1997, 1999), Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and
Perols et al., (2016)
Receivables - Total Beneish (1997, 1999), Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and
Perols et al., (2016)
Inventories - Total Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Short-Term Investments - Total Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Current Assets - Total Beneish (1997, 1999), Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and
Perols et al., (2016)
Property, Plant and Equipment - Total (Gross) Beneish (1997, 1999), Bao et al. (2020), Cecchini et al. (2010), Dechow et al. (2011), Perols (2011) and Perols et al. (2016)
Investment and Advances - Other Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Assets - Total Beneish (1997, 1999), Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and
Perols et al., (2016)
Accounts Payable - Trade Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Debt in Current Liabilities - Total Beneish (1997, 1999), Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and
Perols et al., (2016)
Income Taxes Payable Beneish (1997, 1999), Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and
Perols et al., (2016)
Current Liabilities - Total Beneish (1997, 1999), Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and
Perols et al., (2016)
Long-Term Debt - Total Beneish (1997, 1999), Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and
Perols et al., (2016)
Liabilities - Total Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Common/Ordinary Equity - Total Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Preferred/Preference Stock (Capital) - Total Bao et al. (2020), Cecchini et al. (2010), Dechow et al. (2011), Perols (2011) and Perols et al. (2016)
Retained Earnings Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Sales/Turnover (Net) Beneish (1997, 1999), Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and
Perols et al., (2016)
Cost of Goods Sold Beneish (1997, 1999), Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and
Perols et al., (2016)
Depreciation and Amortization Beneish (1997, 1999), Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and
Perols et al., (2016)
Interest and Related Expense - Total Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Income Taxes - Total Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Income Before Extraordinary Items Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Net Income (Loss) Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Long-Term Debt - Issuance Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Sale of Common and Preferred Stock Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Price Close - Annual - Calendar Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Common Shares Outstanding Bao et al., (2020), Cecchini et al., (2010), Dechow et al., (2011), Perols (2011) and Perols et al., (2016)
Common Shares Issued Perols (2011) and Perols et al., (2016)
Operating Activities - Net Cash Flow Perols (2011) and Perols et al., (2016)
Operating Income Before Depreciation Perols (2011) and Perols et al., (2016)
Property, Plant and Equipment - Total (Net) Perols (2011) and Perols et al. (2016)
Working Capital (Balance Sheet) Perols (2011) and Perols et al., (2016)
Selling, General and Administrative Expense Beneish (1997, 1999)
Amortization of Intangibles Beneish (1997, 1999)

The dataset is highly imbalanced in that instances with material restatements make up only a small portion of the population. To increase the “weight” of the restatement instances in the training process, we follow prior literature and randomly under-sample the non-restated instances so that the restated and non-restated classes have equal numbers of observations (Bao et al., 2020). The testing datasets are left intact. To avoid losing potentially useful non-restated instances in the random under-sampling process, we follow Perols et al. (2016) and adopt the multi-subset observation undersampling method (Chan and Stolfo 1998). Specifically, we aggregate predictions from models constructed using different subsets of the under-sampled non-restated instances combined with the same restated instances (Perols et al., 2016; Zhang et al., 2022).
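A minimal sketch of this multi-subset undersampling ensemble is shown below; it reuses the illustrative X_train, y_train, X_test, y_test, and best_params objects from the tuning sketch above and is not the authors' code.

```python
# Illustrative multi-subset undersampling ensemble (Chan and Stolfo 1998; Perols et al. 2016).
# Reuses the illustrative objects from the earlier tuning sketch.
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
pos_idx = np.where(y_train.values == 1)[0]     # restated firm-years
neg_idx = np.where(y_train.values == 0)[0]     # non-restated firm-years
n_subsets = 5                                  # number of under-sampled subsets

test_probs = []
for _ in range(n_subsets):
    # Randomly under-sample the non-restated class to match the restated class
    sampled_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, sampled_neg])
    model = xgb.XGBClassifier(**best_params)
    model.fit(X_train.iloc[idx], y_train.iloc[idx])
    test_probs.append(model.predict_proba(X_test)[:, 1])

# Aggregate by averaging predicted probabilities across the subset models
avg_prob = np.mean(test_probs, axis=0)
print("Ensemble hold-out AUC:", roc_auc_score(y_test, avg_prob))
```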
We follow prior literature to measure the model’s overall performance using AUC on the hold-out testing dataset (e.g., Bao et al.,
2020; Brown et al., 2020). AUC ranges from 0 to 1, with values above 0.5 indicating that the prediction is better than a random guess.
When using the first set of predictor variables (i.e., raw financial statement variables), the tuned XGBoost models have an average AUC
value of 0.67. When using the second set of predictor variables (i.e., variables derived from financial statements), the average AUC is
0.62. Both scenarios outperform the benchmark logistic regression model, which generates an average AUC of 0.56.


Table 3
21 Variables Derived from Financial Statements (Adapted from Bertomeu et al., 2021).
Predictor Calculation

% Soft assets (Total Assets-PP&E-Cash and Cash Equivalent)/Total Assets


Change in operating lease activity The change in the present value of future noncancelable operating lease obligations deflated by average total assets
Leverage Long-term Debt/Total Assets
Level of finance raised Financing Activities Net Cash Flow /Average total assets
Abnormal change in employees Percentage change in the number of employees - percentage change in assets
WC accruals [(Change in Current Assets- Change in Cash and Short-term Investments)-(Change in Current Liabilities- Change in Debt in
Current Liabilities- Change in Taxes Payable)]/ Average Total Assets
Book-to-market Equity/Market Value
Earnings to price Earnings/Market Value
Change in inventory Change in Inventory/Average Total Assets
Lag one year return Previous year’s annual buy-and-hold return inclusive of delisting returns minus the annual buy-and-hold value-weighted market
return
Return Annual buy-and-hold return inclusive of delisting returns minus the annual buy-and-hold value-weighted market return
Lag mean-adjusted absolute value of DD residuals The following regression is estimated for each two-digit SIC industry: Change in WC = b0 + b1 Change in CFOt-1 + b2 Change in CFOt + b3 Change in CFOt+1 + e. The mean absolute value of the residual is calculated for each industry and is then subtracted from the absolute value of each firm's observed residual.
Change in cash margin Percentage change in cash margin, where cash margin is measured as 1 - [(Cost of Goods Sold - Change in Inventory + Change in Accounts Payable)/(Sales - Change in Accounts Receivable)]
RSST accruals (Change in WC + Change in NCO + Change in FIN)/Average Total Assets, where WC = (Current Assets - Cash and Short-term Investments) - (Current Liabilities - Debt in Current Liabilities); NCO = (Total Assets - Current Assets - Investments and Advances) - (Total Liabilities - Current Liabilities - Long-term Debt); FIN = (Short-term Investments + Long-term Investments) - (Long-term Debt + Debt in Current Liabilities + Preferred Stock)
Change in cash sales Percentage change in cash sales (Sales - Change in Accounts Receivable)
Deferred tax expense Deferred tax expense for year t/ total assets for year t-1
Lag studentized DD residuals Scales each residual by its standard error from the industry-level regression
Change in receivables Change in Accounts Receivable/Average Total Assets
Change in return on assets (Earnings t /Average Total Assets t)- (Earningst − 1/Average Total Assetst − 1)
Performance matched discretionary accruals The difference between the modified Jones discretionary accruals for firm i in year t and the modified Jones discretionary accruals for the matched firm in year t, following Kothari et al. (2005); each firm-year observation is matched with another firm from the same two-digit SIC code and year with the closest return on assets.
Modified Jones model discretionary accruals The modified Jones model discretionary accruals are estimated cross-sectionally each year using all firm-year observations in the same two-digit SIC code: WC Accruals = b0 + b1 (1/Beginning assets) + b2 (Change in Sales - Change in Rec)/Beginning assets + b3 Change in PPE/Beginning assets + e. The residuals are used as the modified Jones model discretionary accruals.

4.4. Local Interpretable Model-agnostic Explanations (LIME)

LIME is used for instance-level interpretations. It uses a simple model to approximate a complex model (Ribeiro et al., 2016;
Molnar, 2021). LIME works by first creating perturbed instances (i.e., random variations of the instance of interest) and obtaining the
predicted outcomes for these perturbed instances from the complex model to be explained. Then, a simple model is fitted using the
perturbed instances as inputs and the complex model’s predictions for these perturbed instances as outputs. In this process, the

Fig. 3. Sample Split for Machine Learning Experiment.


perturbed instances are weighted by their proximity to the original instance. The features that are significant in the fitted simple model approximately drive the complex model's prediction for this instance.
We utilize an open-source Python package for LIME provided by Ribeiro et al. (2016). This research team developed the LIME
methodology.8 Fig. 4 presents the LIME interpretations based on the model that uses the first set of predictor variables (i.e., 35 raw
financial variables). Fig. 5 shows those for the model that uses the second set of predictor variables (i.e., 21 variables derived from
financial statements). In both figures, Panel A is for an instance that does not have a material restatement (CIK number 1750 in the
fiscal year 2015). Panel B shows another example of an instance with a material restatement (CIK number 2034 in the fiscal year 2015).
The center of the LIME panel presents LIME interpretations, which are each feature’s contribution toward prediction probability
(Ribeiro et al., 2016). For example, in Panel B, Fig. 5, if the variables “change in cash sales” and “leverage” are removed from the
model, the probability of this instance having a material restatement is expected to be 0.61 − 0.06 − 0.04 = 0.51. The leftmost LIME panel
shows the prediction probabilities for “Res” and “No Res,” respectively. The rightmost LIME panel lists the feature values of the
instance of interest.
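As a concrete illustration, the following sketch shows how an explanation like those in Figs. 4 and 5 can be produced with the open-source lime package; final_model, X_train, and X_test are the illustrative objects from the earlier sketches, not the authors' code.

```python
# Illustrative use of the open-source LIME package (https://github.com/marcotcr/lime);
# `final_model`, `X_train`, and `X_test` are the illustrative objects from the earlier sketches.
import pandas as pd
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=list(X_train.columns),
    class_names=["No Res", "Res"],
    mode="classification",
)

# Wrap the black-box prediction function so LIME's numpy perturbations keep column names
predict_fn = lambda a: final_model.predict_proba(pd.DataFrame(a, columns=X_train.columns))

# Explain one firm-year observation from the hold-out test set
exp = explainer.explain_instance(X_test.values[0], predict_fn, num_features=10)
print(exp.as_list())        # each feature's contribution toward the predicted probability
# exp.show_in_notebook()    # renders panels similar to Figs. 4 and 5 in a notebook
```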
The most salient advantage of LIME is that it can provide local-level interpretations. Furthermore, LIME presents contributions of
original features instead of highly reprocessed features that some black-box algorithms use to train models. For example, a deep neural
network may be trained on selected features of credit card transactions derived from a principal component analysis. However, LIME can be trained on the original unprocessed features, which retains the interpretability of these features. Furthermore, users can choose the features used to build the simple model that approximates the black-box model, providing more human-friendly explanations. The most significant disadvantage of LIME is that it is difficult to determine how many perturbed instances are needed when it is applied to tabular
data (Molnar, 2021). Different choices may result in different interpretation results (Alvarez-Melis and Jaakkola, 2018). Research is
ongoing in computer science to address this challenge.

4.5. Shapley Additive exPlanations (SHAP)

SHAP stands for Shapley Additive exPlanations. It assigns each feature an importance value that represents how this feature
contributes to the difference between the predicted outcome of an instance to be explained and that of the average population
(Lundberg and Lee, 2017). SHAP values are the Shapley values of a conditional expectation function of the model to be explained
(Lundberg and Lee, 2017). Shapley value is based on cooperative game theory. It is a method to assign payouts to players depending on
their contribution to the total payout (Shapley 1953; Lundberg and Lee, 2017). In the context of using ML to make predictions, the
“game” is to predict the outcome of an instance; the “gain” is the difference between the prediction for this instance and the average
prediction for all instances; and the “players” are the feature values of the instance that collaborate to receive the gain (Lundberg and
Lee, 2017; Molnar, 2021).
We utilize the open-source Python package for SHAP developed by Lundberg and Lee (2017), who proposed the SHAP method.9 For the same examples as in the LIME demonstration, we provide the SHAP interpretations produced by this open-source tool in Fig. 6 and Fig. 7.
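For readers who wish to reproduce this type of output, the sketch below shows how local waterfall plots (as in Figs. 6 and 7) and a global beeswarm plot (as in Fig. 8) can be generated with the shap package, again using the illustrative final_model and X_test objects from the earlier sketches.

```python
# Illustrative use of the open-source SHAP package (https://github.com/slundberg/shap);
# `final_model` and `X_test` are the illustrative objects from the earlier sketches.
import shap

explainer = shap.TreeExplainer(final_model)   # efficient Shapley values for tree ensembles
sv = explainer(X_test)                        # Explanation object, one row per firm-year

# Local explanation for one firm-year: log-odds contributions, as in Figs. 6 and 7
shap.plots.waterfall(sv[0])

# Global overview: distribution of SHAP values across the test set, as in Fig. 8
shap.plots.beeswarm(sv)
```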
E[f(X)] in Fig. 6 and Fig. 7 represents the average log odds ratio of the probability that an instance has a material restatement.
Moreover, f(X) in Fig. 6 and Fig. 7 is the log odds ratio for the instance of interest to have a material restatement. For example, in Panel B, Fig. 7, the log odds ratio for the firm with CIK number 2034 to have a material restatement in 2015 is 0.466, much greater than the average log odds ratio of −0.035.
SHAP explains how each feature contributes to the difference between the average log odds ratio and the log odds ratio for an
instance. In Fig. 7, for the example with a material restatement, its log odds ratio is 0.501 more than the average (0.466 − (−0.035)), and this difference is composed of contributions of 0.96 from “change in receivables,” −0.52 from “level of finance raised,” −0.51 from “change in inventory,” 0.4 from “SoftAssets,” 0.37 from “performance matched discretionary accruals,” 0.31 from “lag_studentized DD residuals,” 0.31 from “change in cash margin,” −0.31 from “wc accruals,” −0.21 from “btm,” and −0.3 from 12 other features (0.96 − 0.52 − 0.51 + 0.4 + 0.37 + 0.31 + 0.31 − 0.31 − 0.21 − 0.3 = 0.50).10 A similar interpretation can be applied to the instance without
a material restatement.
Not only can SHAP be used for local interpretation, it can also be used for a global overview (Lundberg et al., 2018). First, by plotting
the SHAP values of every feature for every observation, we can get an overview of which features are the most impactful per SHAP
values. Fig. 8 sorts features by the sum of the SHAP absolute values for all observations in the testing dataset, and then presents the
distribution of SHAP values for each feature. The red color represents high feature value, and blue represents low feature value. For
example, Fig. 8 shows the nine most impactful features: Liabilities – Total, Operating Income Before Depreciation, Sale of Common and
Preferred Stock, Retained Earnings, Income Taxes Payable, Current Assets – Total, Property, Plant and Equipment – Total (Net),
Receivables – Total, and Accounts Payable – Trade.
The most significant advantage of SHAP is that it allows contrasting explanations. Besides comparing an instance prediction to the
average prediction of the entire dataset, comparisons can also be made to a subset of the dataset or another instance prediction
(Molnar, 2021). Furthermore, the Shapley value is backed by a solid theory, cooperative game theory. It can deliver a complete
explanation that is fairly distributed among the feature values of the instance (Lundberg and Lee, 2017; Molnar, 2021). Even though

8
See: https://github.com/marcotcr/lime.
9
See: https://github.com/slundberg/shap.
10
0.500 is 0.001 short of 0.501 due to the rounding of the individual contributions.


Fig. 4. LIME Interpretations – using raw financial statement variables as predictors.

generating SHAP values requires access to the population dataset, SHAP is one of the most popular XAI techniques considering the aforementioned advantages (Das and Rad, 2020; Molnar, 2021).

4.6. Illustrative comparison and evaluation of LIME and SHAP

Since the reasons for restatement are public information, they create a unique opportunity to compare and evaluate the reliability
of SHAP and LIME interpretations. For illustration purposes, we use the firm-year observation with a material restatement (i.e., CIK
number 2034 in the fiscal year 2015) from the previous demonstrations as a specific example to compare LIME and SHAP. To obtain
the reasons for restatement, we consulted the Audit Analytics database. The restatement reasons for the firm with CIK number 2034 in
the fiscal year 2015 are revenue recognition, accounts receivables, investments, and cash issues.
As shown in Panel B, Fig. 4, when using raw financial statement variables, LIME fails to identify variables related to revenue
recognition, accounts receivable, investment, and cash that positively contribute to this instance’s propensity to have a material
restatement. In contrast, Panel B, Fig. 6 shows that SHAP is able to identify “Investment and Advances - Other” and “Sales/Turnover
(Net)” as investment and revenue-related features that positively contribute to the chance of having a material restatement.
When using variables derived from the financial statement, as shown in Panel B, Fig. 5, LIME correctly identified “change in cash
sales” and “change in cash margin” as red flags for cash-related issues, “lag mean-adjusted absolute value of DD residuals” for revenue recognition issues, and “change in receivables” for accounts receivable issues. Panel B, Fig. 7 shows that SHAP correctly identified
“change in receivables” to be an indicator for receivables-related restatement, “performance matched discretionary accruals” and
“lag_studentized DD residual” for potential revenue recognition issues, and “change in cash margin” for cash-related issues.
While the above comparison is based on one firm-year observation, it suggests that LIME and SHAP may provide only an imperfect overlap of interpretations. Future research can systematically compare and evaluate the reliability of XAI interpretations
based on extensive sample evidence.

5. Demonstration of other XAI techniques

The previous section demonstrates two mainstream XAI techniques: LIME and SHAP. This section demonstrates other post-hoc XAI
techniques, including the global surrogate model, permutation feature importance, partial dependence plot, accumulated local effects,
scoped rules, individual conditional expectation, and counterfactual explanations. Like LIME and SHAP, scoped rules, individual


Fig. 5. LIME Interpretations – using variables derived from financial statement as predictors.

conditional expectation, and counterfactual explanations provide local-level explanations (i.e., they explain individual predictions of any classifier in an interpretable and faithful manner). In contrast, the global surrogate model, permutation feature importance, partial dependence
plot, and accumulated local effects present global-level explanations. Since we showed in the previous section that the black-box
XGBoost model works best when the 35 raw financial variables are used as predictors, we construct the following demonstration
based on that model.

5.1. Global surrogate model

A global surrogate model is a simple model that approximates a black-box model (Islam et al., 2021; Molnar, 2021). After a black-
box model is trained, a global surrogate model can be constructed in the following steps. First, select a simple model, such as a linear regression or a decision tree. Second, train this interpretable model on the original features of the dataset and the predicted outcome from the black-box model. Third, evaluate the fit between the predicted outcome from the black-box model and that from the simple model. If the fit (usually measured by R squared) is reasonably high, the interpretable model can be considered a surrogate model.
Fig. 9 illustrates how to construct a global surrogate model.
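The steps above can be implemented in a few lines. The sketch below fits a shallow decision tree to the predictions of the illustrative XGBoost model from the earlier sketches and reports how faithfully it reproduces them; the object names and the choice of data used for fitting are illustrative, not the authors' exact setup.

```python
# Illustrative global surrogate: a shallow decision tree fit to the black-box model's
# predictions (`final_model` and `X_test` are the illustrative objects from earlier sketches).
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score, r2_score

# Steps 1-2: train an interpretable model on the original features and the
# black-box model's predicted labels
bb_labels = final_model.predict(X_test)
bb_probs = final_model.predict_proba(X_test)[:, 1]
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_test, bb_labels)

# Step 3: evaluate how well the surrogate reproduces the black-box predictions
print("Fidelity (accuracy):", accuracy_score(bb_labels, surrogate.predict(X_test)))
print("R squared:", r2_score(bb_probs, surrogate.predict_proba(X_test)[:, 1]))
print(export_text(surrogate, feature_names=list(X_test.columns)))  # decision paths
```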
Fig. 10 presents a decision tree surrogate model to approximate the decision-making of an XGBoost model that we construct in the
demonstration. The decision tree surrogate model can reasonably approximate the XGBoost model to be explained, with an accuracy of
76 % and an R squared of 21.6 %.11
The above decision tree model approximates the decision paths of the black-box XGBoost model, potentially providing auditors
with a method to explain and document how an opaque model works. In the illustrated example, the XGBoost model is likely to classify
an instance as having a material restatement if it has Income Before Extraordinary Items scaled by total assets above 182.65 and
Operating Income Before Depreciation scaled by total assets above 413.645.
The advantage of global surrogate models is that the choice of the surrogate model is flexible and independent from the black-box

11
Accuracy equals (true positives + true negatives)/total number of observations. In this case, accuracy = (605+3968)/(605+3968+221+133) = 93%; R squared is the proportion of the variance in the predictions made by the black-box XGBoost model that is explained by the predictions of the surrogate model.


Fig. 6. SHAP Interpretations – using raw financial statement variables as predictors.

model to be approximated. Besides, the idea behind surrogate models is intuitive and easy to accept. However, it is unclear what level of fit (i.e., R squared) is considered sufficient when creating surrogate models. Another issue with choosing a surrogate model is
that the intrinsic interpretability is subjective because a linear regression with too many parameters can also become hard to interpret.

5.2. Permutation feature importance (PFI)

PFI is a method to identify the features essential to prediction by finding the features whose permuted values most increase the prediction error (Breiman 2001a; Fisher, Rudin, and Dominici 2018; Molnar, 2021). Table 4 presents the ten features with the highest PFI; three of these features overlap with the impactful features identified by SHAP documented in Fig. 8.
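A PFI ranking like Table 4 can be produced with scikit-learn's permutation_importance, as sketched below using the illustrative final_model, X_test, and y_test objects from the earlier sketches.

```python
# Illustrative permutation feature importance with scikit-learn
# (object names follow the earlier sketches, not the authors' code).
import pandas as pd
from sklearn.inspection import permutation_importance

result = permutation_importance(
    final_model, X_test, y_test,
    scoring="roc_auc",    # importance = drop in AUC when a feature's values are permuted
    n_repeats=10,
    random_state=0,
)
pfi = pd.DataFrame({
    "feature": X_test.columns,
    "mean PFI": result.importances_mean,
    "std": result.importances_std,
}).sort_values("mean PFI", ascending=False)
print(pfi.head(10))   # ten most important features, analogous to Table 4
```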
The most significant advantage of PFI is its ease of interpretation and that it does not require retraining the model (Molnar, 2021).
Other methods (e.g., backward feature selection) that can identify important features require deleting certain features and then
retraining the model to see if the deletion has significantly impacted model performance (Hastie et al., 2009). Despite its merits, PFI is sensitive to the randomness introduced by permuting feature values, which may create unrealistic combinations of feature values (Molnar,
2021).

5.3. Partial dependence plot (PDP)

PDP presents the relationship between the predicted outcome and features of interest (Friedman, 2001; Goldstein et al., 2015;
Molnar, 2021). The features of interest can be either categorical or continuous. Due to the limitation of human perception, the number
of features of interest to be plotted in one PDP is usually 1 (2D figure) or 2 (3D figure) (Molnar, 2021). Panel A, Fig. 11 presents PDP plots for the features Current Assets - Total and Accounts Payable - Trade, respectively. Panel B, Fig. 11, is a PDP plot for the two features
combined.
From the above PDP plots, auditors can learn that the black box XGBoost model learned a negative relationship between total
current assets and the likelihood of having a material restatement and a positive relationship between accounts payable and the likelihood of having a material restatement. The PDP that combines the two features shows that when both features have low values, the likelihood of having a material restatement reaches its highest.
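One- and two-feature PDPs like those in Fig. 11 can be drawn with scikit-learn, as in the following sketch; the model and data objects are the illustrative ones from the earlier sketches.

```python
# Illustrative one- and two-feature PDPs with scikit-learn (feature names assume the
# 35 raw financial variables scaled by total assets, as in the earlier sketches).
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

features = [
    "Current Assets - Total",
    "Accounts Payable - Trade",
    ("Current Assets - Total", "Accounts Payable - Trade"),  # two-feature PDP, as in Panel B
]
PartialDependenceDisplay.from_estimator(final_model, X_test, features, kind="average")
plt.show()
```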
The advantage of PDP is that it is intuitive and easy to implement. However, a caveat for interpreting PDP is that it can only show correlational, not causal, relationships. Besides, PDP assumes that the input feature of interest is independent of the rest of the features


Fig. 7. SHAP Interpretation – using variables derived from financial statement as predictors.

Fig. 8. Scatter Plot of SHAP.

(Friedman 2001; Goldstein et al., 2015). However, this assumption is often violated in practice, producing unreliable PDP figures. The following section introduces an alternative to PDP that better addresses the issue of correlation among features.

5.4. Accumulated local effects (ALE) plot

The challenge of PDP is that the assumption of independence between features is often violated. ALE helps address this issue by
calculating differences in predictions, rather than averages of predictions, based on the conditional distribution of the features. Details of how ALE
works are documented in Apley and Zhu (2020) and Molnar (2021). The advantage of ALE is that it can work when the features are


Fig. 9. The Intuition Behind a Global Surrogate Model.

Fig. 10. A Decision Tree Surrogate Model.

Table 4
Feature Importance Ranked by PFI.
Feature Average PFI Std. Err.

Current Assets - Total 0.022 0.003


Cost of Goods Sold 0.019 0.002
Current Liabilities - Total 0.018 0.002
Receivables - Total 0.015 0.002
Depreciation and Amortization 0.013 0.002
Net Income (Loss) 0.013 0.002
Interest and Related Expense - Total 0.013 0.002
Liabilities - Total 0.013 0.003
Debt in Current Liabilities - Total 0.011 0.002
Income Before Extraordinary Items 0.009 0.002

correlated. Thus, it is preferred to PDP in most practical situations. However, even though ALE can be applied when features
are correlated, the interpretation remains difficult when features are strongly correlated. Nevertheless, in general, ALE is more reliable
than PDP. Furthermore, ALE is faster to compute than PDP. The disadvantage of ALE is that the plot can be very unstable when the
number of instances increases.
Fig. 12 presents ALE plots for Current Assets - Total and Accounts Payable - Trade. For Current Assets - Total, the ALE plot presents a pattern similar to the negative relationship shown in the PDP plot. For Accounts Payable - Trade, while the PDP plot showed a positive relationship, the ALE plot presented a non-linear relationship. This difference could be due to the high correlation between Accounts Payable - Trade and other features.
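To make the intuition concrete, the following simplified sketch computes a first-order ALE curve for one numeric feature by accumulating average prediction differences within quantile-based intervals. It omits several refinements of the full estimator in Apley and Zhu (2020) and uses the illustrative final_model and X_test objects from the earlier sketches.

```python
# A simplified first-order ALE computation for one numeric feature (a sketch only;
# see Apley and Zhu, 2020, for the full estimator).
import numpy as np

def ale_curve(model, X, feature, n_bins=20):
    x = X[feature].to_numpy()
    # Interval edges from empirical quantiles (the feature's own distribution)
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
    local_effects, counts = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (x >= lo) & (x <= hi) if hi == edges[-1] else (x >= lo) & (x < hi)
        if not in_bin.any():
            local_effects.append(0.0); counts.append(0); continue
        X_lo, X_hi = X[in_bin].copy(), X[in_bin].copy()
        X_lo[feature], X_hi[feature] = lo, hi
        # Prediction difference as the feature moves across the interval, holding the
        # other (possibly correlated) features at their observed values
        diff = model.predict_proba(X_hi)[:, 1] - model.predict_proba(X_lo)[:, 1]
        local_effects.append(diff.mean()); counts.append(int(in_bin.sum()))
    ale = np.cumsum(local_effects)                         # accumulate local effects
    ale -= np.average(ale, weights=np.maximum(counts, 1))  # center around zero
    return edges[1:], ale

# Example: ALE values for one of the raw financial variables (illustrative name)
edges, ale = ale_curve(final_model, X_test, "Accounts Payable - Trade")
```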


Fig. 11. PDP Interpretation.

Fig. 12. ALE Interpretation.

5.5. Scoped rules (Anchors)

An anchor is a decision rule that explains any black-box classification model (Ribeiro et al., 2018; Molnar, 2021). A decision rule anchors a prediction when changes in other feature values do not affect the prediction (Molnar, 2021). Like LIME, anchors are created from perturbed samples around the instance to be explained. Fig. 13 presents an anchor produced by an open-source Python
package for the same firm-year observations as in the previous demonstration section (i.e., CIK number 1750 in the fiscal year 2015
and CIK number 2034 in the fiscal year 2015).12

12
See: https://github.com/marcotcr/anchor/blob/master/notebooks/Anchor%20on%20tabular%20data.ipynb.


Fig. 13. Anchor explanation for a no-res and a res instance.

The anchor shows that the XGBoost model deems the restated instance “RES” mainly because its Long-Term Debt – Total scaled by
total assets is between 19.75 and 695.29 and that its Sales/Turnover (Net) scaled by total assets is above 13.25. The anchor also explains
that it applies to 24 % of the perturbed instances. In those cases, the explanation is 53 % accurate. Interestingly, for the non-restated
instance, the anchor fails to generalize a rule.
Anchors differ from LIME because LIME creates a surrogate model, whereas anchors directly generate easy-to-understand “IF-
THEN” rules (Molnar, 2021). On the downside, the anchor approach requires hyperparameter tuning, discretization of continuous variables, and repeated access to the ML model, and it lacks a clear benchmark for the coverage ratio. However, the advantage of the anchor approach is that it is easy to interpret, and it works when model predictions are non-linear or complex.

5.6. Individual conditional expectation (ICE)

ICE is similar to PDP in that both plot the relationship between the predicted outcome and a feature of interest (Goldstein et al., 2015). However, unlike PDP, which plots the average effect of the input feature, ICE visualizes the relationship for each instance
(Goldstein et al., 2015). Fig. 14 illustrates ICE for the features of total current assets and accounts payable for the last 50 instances in
the testing dataset.
Like PDP, the ICE plots show a generally negative relationship between total current assets and the predicted risk of material restatement and a generally positive relationship for accounts payable. However, ICE can reveal heterogeneous relationships across instances (Goldstein et al., 2015). For example, the PDP plot shows that, on average, when Current Assets – Total scaled by total assets is between 0.2 and 0.4, the risk of material restatement decreases with the value of total current assets. Nevertheless, the ICE plot shows that this negative relationship does not hold for all instances. The disadvantage of ICE compared to
PDP is that ICE can be overly crowded to view (Goldstein et al., 2015). Furthermore, similar to PDP, ICE is subject to the assumption
that the feature of interest is independent of all other features (Goldstein et al., 2015).
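ICE curves such as those in Fig. 14 can be generated with the same scikit-learn display used for PDP by setting kind to "individual" (or "both" to overlay the PDP average), as sketched below with the illustrative objects from the earlier sketches.

```python
# Illustrative ICE curves with scikit-learn: kind="individual" draws one curve per
# instance; kind="both" also overlays the PDP average (names follow the earlier sketches).
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    final_model,
    X_test.tail(50),                       # last 50 test instances, as in Fig. 14
    ["Current Assets - Total", "Accounts Payable - Trade"],
    kind="both",
)
plt.show()
```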

5.7. Counterfactual explanations

Counterfactual explanations enable us to imagine a hypothetical situation that contradicts what has happened (Wachter et al.,
2017; Dandl et al., 2020). For example, if a firm had a different value for a certain feature, how would that change the black-box model's prediction for this firm? Thus, counterfactual explanations can answer how to change certain features so that the predicted outcome would be different (Wachter et al., 2017; Dandl et al., 2020). In general, the way to identify a counterfactual explanation for an instance is first to find its most similar instance, measured by a chosen distance metric, that has the opposite prediction. Then, compare the
instance to be explained with its counterpart to spot differences, which are the counterfactual explanations (Dandl et al., 2020; Molnar,
2021). Fig. 15 illustrates the counterfactual explanation using Google’s what-if-tool (WIT) for an instance predicted to have material
misstatement.


Fig. 14. ICE Interpretation.

Fig. 15. Counterfactual explanations.

In the above counterfactual explanation panel, the instance to be explained is the red dot highlighted with a red circle. The red dots
represent instances that the black-box algorithm predicts as having material restatements, and the blue dots are instances predicted not
to have material restatements. Using a distance metric, the WIT identifies the most similar instance with the opposite prediction: the
blue dot highlighted with a blue circle.13 Then, by comparing and contrasting the two instances, we can conclude that, compared to its
closest peer with a flipped prediction, the instance of interest has higher amounts of Common Shares Outstanding, Current Liabilities – Total,
Depreciation and Amortization, and Liabilities – Total, suggesting that lowering the values of these variables could flip the prediction to
no material restatement.

13 Common distance measures are L1 (a.k.a., Manhattan Distance) and L2 (a.k.a., Euclidean Distance). See details of each measure at: https://www.cs.utah.edu/~jeffp/teaching/cs5955/L7-Distances.pdf.
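The nearest-counterfactual search that WIT performs can be approximated with a few lines of code. The sketch below is our own illustration, not WIT's implementation: it reuses the fitted classifier model and test-set DataFrame X_test assumed above, standardizes the features, and finds the closest instance (by L2 distance) that receives the opposite prediction.

```python
# A minimal sketch of a nearest-counterfactual search (not WIT's implementation).
# Assumes "model" and "X_test" from the earlier demonstration.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

preds = model.predict(X_test)                     # 1 = material restatement predicted
X_std = pd.DataFrame(StandardScaler().fit_transform(X_test),
                     columns=X_test.columns, index=X_test.index)

idx = X_test.index[preds == 1][0]                 # an instance predicted as misstated
candidates = X_std[preds == 0]                    # instances with the opposite prediction

# Euclidean (L2) distance from the instance of interest to every candidate
dists = np.sqrt(((candidates - X_std.loc[idx]) ** 2).sum(axis=1))
cf_idx = dists.idxmin()                           # closest opposite-prediction peer

# Features whose values differ most between the instance and its counterfactual
diff = (X_test.loc[idx] - X_test.loc[cf_idx]).sort_values(key=np.abs, ascending=False)
print(diff.head(5))
```

Positive differences indicate features for which the instance of interest has higher values than its counterfactual peer, mirroring the comparison shown in Fig. 15.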
The advantage of counterfactual explanation is that it is intuitive and relatively easy to implement. Furthermore, counterfactual
explanations do not require access to the raw data or the model itself, only to the model’s prediction function, which is useful when the
raw data or model is proprietary (Molnar, 2021). However, we can often identify more than one counterfactual instance, each with a
different explanation, making it challenging to choose which explanation to rely on (Molnar, 2021).

6. Future research suggestions

6.1. Testing XAI’s interpretability and reliability

Although the purpose of XAI is to increase the interpretability of black-box AI models, the interpretability and reliability of XAI
techniques themselves also demand validation. Future research can examine whether LIME and SHAP interpretations are understandable
to average auditors and how reliable the interpretations produced by LIME and SHAP are in an audit engagement. In
validating XAI’s reliability, the most common method is to test whether the XAI techniques enable humans to correctly determine
the outcome of a model prediction based on the input values (Doshi-Velez and Kim, 2017; Hall and Gill, 2019). However, since
including human subjects to test the reliability of XAI techniques is not always feasible due to expense or time constraints,
alternative methods can be adopted (Hall and Gill, 2019). For example, one can use simulated data with known characteristics or
explanations to test the explanations provided by XAI techniques (Hall and Gill, 2019). While we provide an illustrative comparison
and evaluation of LIME and SHAP based on an individual prediction, future research can systematically validate the reliability of
LIME and SHAP interpretations.
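As one concrete possibility, the sketch below (our own illustration under simplified assumptions, not a prescribed validation procedure) simulates data in which only the first two features drive the outcome, fits a gradient-boosted tree model, and checks whether SHAP's global importance ranking recovers the planted signal.

```python
# A minimal sketch: test whether SHAP recovers a known data-generating process.
# Only x0 and x1 truly drive the simulated outcome.
import numpy as np
import shap
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 6))
y = (2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=5000) > 0).astype(int)

model = XGBClassifier(n_estimators=200, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Mean absolute SHAP value per feature: x0 and x1 should dominate if the
# explanations are faithful to the data-generating process.
print(np.abs(shap_values).mean(axis=0))
```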

6.2. XAI to enhance “Auditor-in-the-Loop”

XAI is a “human-in-the-loop” application, as it serves as an interface between AI and humans to facilitate human–machine
cooperation (Adadi and Berrada, 2018; Miller, 2019). While this paper introduces different XAI techniques and demonstrates how
popular ones, LIME and SHAP, can be applied to auditing tasks, future research can empirically examine whether XAI helps auditors
better interpret ML outcomes and thus exercise professional judgment more appropriately, enhancing audit quality.

6.3. Cost and benefit analysis of XAI

When an audit firm adopts XAI, the adoption will generate additional fixed costs (e.g., software purchase or development costs).
After adoption, auditors’ use of XAI may also require additional effort (e.g., producing the XAI results and understanding them).
However, since XAI explains opaque AI/ML systems and provides algorithmic transparency, auditors can better understand and
document the output of the opaque AI/ML systems used in audits (Rudin, 2019; Lu et al., 2020). Furthermore, if auditors are held
liable for an audit failure caused by an insufficient interpretation of AI/ML outputs that could have been avoided by using XAI, then
auditors may be motivated to exert additional effort to use XAI (Ewert and Wagenhofer, 2019). We encourage future research to
theoretically and empirically examine the cost-benefit tension of XAI adoption in audits.

6.4. Unintended consequence of XAI

Although XAI techniques foster transparency and a better understanding of black-box AI models, they could also be used for
malicious purposes like adversarial attacks (Tramèr et al., 2016; Shokri et al., 2017; Hall and Gill, 2019; Slack et al., 2020). For
example, when LIME or SHAP explanations are used in audit engagements, once the client’s management team understands what
features are driving an AI model to conclude a high risk of material misstatement of the firm’s financial reports, they may intentionally
modify those features to make the auditor’s AI system favor their position. Thus, further studies are needed to explore the unintended
consequences of XAI adoption in an audit setting, especially the interactions between the auditors’ use of XAI and management’s
strategies in mitigating XAI’s influence.

6.5. Fairness of AI

Recent research shows that AI applications in different domains present ethical challenges (e.g., Munoko et al., 2020). XAI
techniques have the potential to enhance the fairness and, therefore, the ethical and responsible use of AI. Common methods to examine
the fairness of AI models include disparate impact analysis and reweighting (Kamiran and Calders, 2012; Feldman et al., 2015; Hall and
Gill, 2019). Disparate impact analysis assesses model predictions and errors across different groups, such as ethnicity or gender
(Feldman et al., 2015). Reweighting preprocesses the data by balancing different groups to ensure fairness during model training
(Kamiran and Calders, 2012). Future research can explore using XAI to unveil potentially biased AI applications in accounting and
auditing domains.
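To make the first of these methods concrete, the sketch below is a purely hypothetical illustration (the predictions and group labels are made up and not from our demonstration data). It computes a simple disparate impact ratio, i.e., the ratio of positive-prediction rates between two groups, which is often screened against the four-fifths (0.8) rule.

```python
# A hypothetical illustration of disparate impact analysis; the predictions and
# group labels below are made up and are not from the paper's demonstration.
import numpy as np
import pandas as pd

def disparate_impact(y_pred, group):
    """Ratio of the positive-prediction rates of the two groups (min over max)."""
    rates = pd.Series(y_pred).groupby(pd.Series(group)).mean()
    return rates.min() / rates.max()

# Example: high-risk flags produced by a model for clients in two (made-up) groups
y_pred = np.array([1, 0, 0, 1, 0, 0, 1, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
print(round(disparate_impact(y_pred, group), 2))   # 0.5: group B is flagged half as often as group A
```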

6.6. Explanation and trust

Explanation and trust are different aspects of an AI model: one can trust an AI model without the need to understand it, or one can
understand a model without trusting it (Hall and Gill, 2019). For example, before sophisticated XAI techniques were established, black-
box ML models like deep neural networks were still used in many applications if they were well-tested and had high predictive
performance (Gopinathan et al., 1998; Hall and Gill, 2019). On the other hand, suppose there is a simple model with poor prediction
performance. Even if one can easily interpret the model, the model’s output cannot be trusted (Hall and Gill, 2019). Future research
can extend this stream of debate on the necessity of explanation versus trust in different accounting/auditing settings.

6.7. Model locality

Model locality, or the multiplicity of good models, refers to the phenomenon that one machine learning algorithm may produce
multiple accurate models with similar, but not identical, internal architectures (Breiman, 2001; Hall and Gill, 2019). Therefore, it is
essential to note that the explanations apply only to the model on which the XAI technique is used. The explanations can vary across
different models trained on the same data set with the same machine learning algorithm (Hall and Gill, 2019). Although there is no
solution to address model locality, the impact of model variants can be reduced by using XAI techniques like SHAP that can account for
the perturbation of similar models (Hall and Gill, 2019). Future research can examine the robustness of each XAI method applied in an
audit setting by evaluating its ability to produce consistent and coherent interpretations.
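As a simple robustness check in this direction, the sketch below (our own illustration; it assumes the training and test sets X_train, y_train, and X_test from the earlier demonstration) fits the same algorithm under different random seeds and compares the resulting SHAP-based global importance rankings; low rank correlations would suggest that model locality is affecting the explanations.

```python
# A minimal sketch of an explanation-consistency check across model variants.
# Assumes "X_train", "y_train", and "X_test" from the earlier demonstration.
import numpy as np
import shap
from scipy.stats import spearmanr
from xgboost import XGBClassifier

importances = []
for seed in (1, 2, 3):
    m = XGBClassifier(n_estimators=200, subsample=0.8, random_state=seed)
    m.fit(X_train, y_train)
    sv = shap.TreeExplainer(m).shap_values(X_test)
    importances.append(np.abs(sv).mean(axis=0))    # mean |SHAP| per feature for this variant

# Spearman rank correlation of feature importance between the first variant and the others
for other in importances[1:]:
    rho, _ = spearmanr(importances[0], other)
    print(round(rho, 3))
```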

6.8. XAI to enhance the auditability of AI applications

In an audit setting, not only can XAI provide interpretations of the complex ML models used by auditors, but it could also assist auditors
in assuring the AI applications used by clients that affect the financial reporting process. Clients could also embed XAI in their AI
applications for internal use, which would, in turn, assist auditors’ understanding of clients’ information systems. With XAI’s insights
into the inner workings and rationale of an AI algorithm used by the client, external auditors could better assess the risk of material
misstatement arising from the use of that AI tool. Internal auditors can also use XAI to examine whether the client’s AI tool is working as
expected or needs improvement due to issues such as biased decisions. Future research can extend this line of discussion to explore
XAI’s role in assuring AI applications.

7. Concluding remarks

In response to the challenge that AI applications in auditing lack interpretability, this paper introduces XAI to auditing researchers
and practitioners. We map different XAI techniques to existing audit documentation and evidence standards, and we use an ML-based
auditing task to illustrate popular XAI techniques, especially LIME and SHAP. This paper provides auditing researchers and
practitioners with knowledge and tools to create more transparent AI applications in auditing.
A limitation of this paper is that, in the demonstration section, we measure material misstatement using a binary variable that
indicates whether or not the financial statement was materially restated rather than the dollar amount of accounts being restated. This
limitation is mainly due to a lack of access to the original financial statements to derive the misstatement amount. Future research can
examine whether XAI, especially LIME or SHAP, can inform auditors of the dollar amounts that could be misstated at the account
levels. In this study, based on one firm-year observation with a material restatement, we illustrated how to examine the reliability of
LIME and SHAP. Future research can extend our analysis by examining a large sample of observations. Future research can also
consider examining XAI reliability by using other settings like fraud detection, where the underlying reasons for fraud are available.
Furthermore, the demonstrations provided in this paper are based on supervised learning with tabular data, since most auditing tasks deal
with tabular data and most ML tasks involve supervised learning. Future research can extend our demonstrations by applying XAI techniques
to other forms of data (e.g., images and texts) or other ML tasks (e.g., unsupervised and semi-supervised learning).

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.

Acknowledgement

We thank the editor, anonymous reviewer(s), participants from the 12th Biennial Symposium on Information Integrity and Information
Systems Assurance, Aleksandr Kogan, Michael Alles, Andrea Rozario, Edward Wilkins, David Wood, Michael Leonardson, Steven Katz, and
Helen Brown-Liburd.

Appendix: A Summary of Common XAI Techniques

For each post-hoc method, we summarize its description, function, scope (global or local), advantages, disadvantages, and available open-source tools.

Partial Dependence Plot (PDP)
• Description: PDP shows the marginal effect of one or two features on the predicted outcome of a machine learning model (Friedman, 2001; Molnar, 2021).
• Function: Understand a feature’s relationship with the predicted outcome.
• Global or local: Global.
• Advantages: Intuitive; easy to implement.
• Disadvantages: The maximum number of features in a PDP is 2 due to human perception limitations; the assumption of independence between features is always violated in practice.
• Available open-source tools: Google What-If Tool (WIT) (https://pair-code.github.io/what-if-tool/get-started/#notebooks); Python scikit-learn “plot_partial_dependence” (https://scikit-learn.org/stable/modules/partial_dependence.html); R “pdp” package (https://bgreenwell.github.io/pdp/index.html).

Individual Conditional Expectation (ICE)
• Description: ICE displays one line per instance, showing how that instance’s prediction changes as a feature changes (Goldstein et al., 2015; Molnar, 2021).
• Function: Understand a feature’s relationship with the predicted outcome.
• Global or local: Local.
• Advantages: Intuitive; easy to implement; can uncover heterogeneous relationships.
• Disadvantages: The maximum number of features in ICE is 1 due to human perception limitations; the assumption of independence between features is always violated in practice; ICE plots can be overcrowded.
• Available open-source tools: Python scikit-learn “plot_partial_dependence” (https://scikit-learn.org/stable/modules/partial_dependence.html); R “pdp” package (https://bgreenwell.github.io/pdp/index.html).

Accumulated Local Effects (ALE) Plot
• Description: Accumulated local effects describe how features influence the prediction of a machine learning model on average. ALE plots are a faster and unbiased alternative to partial dependence plots (PDPs).
• Function: Understand a feature’s relationship with the predicted outcome.
• Global or local: Global.
• Advantages: Less prone to correlated features compared to PDP or ICE; easier to interpret compared to PDP or ICE; faster computation time than PDP or ICE.
• Disadvantages: Less intuitive to understand compared to PDP or ICE; the plot can be unstable when the number of intervals increases; more difficult to implement than PDP or ICE.
• Available open-source tools: Python “PyALE” package (https://pypi.org/project/PyALE/); R “ALEPlot” package (https://cran.r-project.org/web/packages/ALEPlot/index.html).

Permutation Feature Importance (PFI)
• Description: PFI measures feature importance by examining which features increase the prediction error the most after permuting their values to break the relationship between the feature and the true outcome (Molnar, 2021).
• Function: Understand feature importance.
• Global or local: Global.
• Advantages: Ease of interpretation; does not require retraining the model.
• Disadvantages: Prone to the randomness added in the feature value permutation process; the feature value permutation process may generate unrealistic data points.
• Available open-source tools: Python scikit-learn permutation feature importance (https://scikit-learn.org/stable/modules/permutation_importance.html); R “iml” package (https://cran.r-project.org/web/packages/iml/index.html).

Global Surrogate
• Description: A global surrogate model is “an interpretable model that is trained to approximate the predictions of a black box model” (Molnar, 2021).
• Function: Approximate the decision making of a black box model.
• Global or local: Global.
• Advantages: Intuitive; flexible choice of the surrogate model.
• Disadvantages: Conclusions drawn from the surrogate model cannot be treated as equivalent to those from the black box model; there is no clear cut-off for R squared; the choice of surrogate model is subjective.
• Available open-source tools: Demonstration Python codes (https://github.com/jphall663/interpretable_machine_learning_with_python); R “iml” package.

Local Surrogate (LIME)
• Description: A surrogate model that can interpret individual predictions (Ribeiro et al., 2016; Molnar, 2021).
• Function: Approximate the decision making of a black box model.
• Global or local: Local.
• Advantages: Has local fidelity (i.e., can be used to interpret individual predictions); explanations are human-friendly.
• Disadvantages: Vague definition of the neighborhood of an instance; results subject to the choice of neighbors.
• Available open-source tools: Python “LIME” package (https://github.com/marcotcr/lime/tree/master/lime); R “iml” package.

Scoped Rules (Anchors)
• Description: An anchor is a decision rule that explains individual predictions of any black box classification model (Molnar, 2021).
• Function: Approximate the decision making of a black box model.
• Global or local: Local.
• Advantages: Easy to interpret; works when model predictions are non-linear or complex.
• Disadvantages: Requires hyperparameter tuning, discretization of continuous variables, and constant calling to the ML model; has an unclear benchmark for the coverage ratio.
• Available open-source tools: Python “Anchor” package (https://github.com/marcotcr/anchor).

Shapley Values
• Description: The Shapley value is the average of all the marginal contributions to all possible coalitions.
• Function: Understand feature contributions.
• Global or local: Local.
• Advantages: Allows contrastive explanations; has solid theoretical backing; can deliver a full explanation that is fairly distributed among the feature values.
• Disadvantages: Requires a lot of computing power; is prone to misinterpretation; needs access to the data every time a Shapley value is calculated for a new data instance.
• Available open-source tools: Python “SHAP” package (https://github.com/slundberg/shap); R “iml” package.

SHAP (Shapley Additive exPlanations)
• Description: SHAP explains the prediction of an instance x by computing the contribution of each feature to the prediction (Molnar, 2021).
• Function: Understand feature contributions.
• Global or local: Local or global.
• Advantages: All of the Shapley value’s advantages apply; combines LIME and Shapley values; fast computation that supports global explanation.
• Disadvantages: KernelSHAP can be very slow and ignores feature dependence; TreeSHAP may produce unintuitive feature attributions.
• Available open-source tools: Python “SHAP” package (https://github.com/slundberg/shap).

References

Abdul, A., Vermeulen, J., Wang, D., Lim, B.Y., Kankanhalli, M., 2018. Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–18.
ACCA. (2020). Explainable AI: Putting the user at the core. Available at: https://www.accaglobal.com/gb/en/professional-insights/technology/Explainable_AI.
html#:~:text=Explainable%20AI%20systems%20are%20key,is%20doing%20%E2%80%93%20which%20needs%20explainability.
Adadi, A., Berrada, M., 2018. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160.
AICPA. (2020). The Data-Driven Audit: How Automation and AI are Changing the Audit and the Role of the Auditor. Available at: https://www.aicpa.org/content/
dam/aicpa/interestareas/frc/assuranceadvisoryservices/downloadabledocuments/the-data-driven-audit.pdf.
Alvarez-Melis, D., and Jaakkola, T. S. (2018). On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049.
Apley, D.W., Zhu, J., 2020. Visualizing the effects of predictor variables in black box supervised learning models. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 82 (4), 1059–1086.
Atkinson, K., Bench-Capon, T., Bollegala, D., 2020. Explanation in AI and law: Past, present and future. Artif. Intell. 103387.
AU-C section 230, Audit Documentation.
AU-C section 500, Audit Evidence.
Bao, Y., Ke, B., Li, B., Yu, Y.J., Zhang, J., 2020. Detecting Accounting Fraud in Publicly Traded US Firms Using a Machine Learning Approach. J. Account. Res.
Baryannis, G., Dani, S., Antoniou, G., 2019. Predicting supply chain risks using machine learning: The trade-off between performance and interpretability. Future Gener. Comput. Syst. 101, 993–1004.
Basu, S., 2012. How can accounting researchers become more innovative? Account. Horizons 26 (4), 851–870.
Bauer, M., and Baldes, S. (2005, January). An ontology-based interface for machine learning. In Proceedings of the 10th international conference on Intelligent user
interfaces (pp. 314-316).
Beneish, M.D., 1997. Detecting GAAP violation: Implications for assessing earnings management among firms with extreme financial performance. J. Account. Public
Policy 16 (3), 271–309.
Beneish, M.D., 1999. The detection of earnings manipulation. Financial Analysts Journal 55 (5), 24–36.
Bertomeu, J., Cheynel, E., Floyd, E., Pan, W., 2021. Using Machine Learning to Detect Misstatements. Rev. Acc. Stud. 26, 468–519.
Biran, O., and Cotton, C. (2017, August). Explanation and justification in machine learning: A survey. In IJCAI-17 workshop on explainable AI (XAI) (Vol. 8, No. 1, pp.
8-13).
Bradley, A.P., 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 30 (7), 1145–1159.
Breiman, L., 2001. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical science 16 (3), 199–231.
Brown, N.C., Crowley, R.M., Elliott, W.B., 2020. What are you saying? Using topic to detect financial misreporting. J. Account. Res. 58 (1), 237–291.
Burt, A. (2018). How will the GDPR impact machine learning? Answers to the three most commonly asked questions about maintaining GDPR-compliant machine
learning programs.
Burton, F. G., S. L. Summers, T. J. Wilks, and D. A. Wood. 2020a. Attention afforded accounting research by policy makers, academics, and the general public.
Working Paper, Brigham Young University.
Burton, F. G., S. L. Summers, T. J. Wilks, and D. A. Wood. 2020b. Creating relevance of accounting research (ROAR) scores to evaluate the relevance of accounting
research to practice. Working Paper, Brigham Young University.
Bussmann, N., Giudici, P., Marinelli, D., Papenbrock, J., 2020. Explainable ai in fintech risk management. Front. Artif. Intell. 3, 26.
Canadian Public Accountability Board (CPAB), 2021. Technology in the audit. Available at: https://www.cpab-ccrc.ca/docs/default-source/thought-leadership-
publications/2021-technology-audit-en.pdf?sfvrsn=f29b51ce_14.
Cecchini, M., Aytug, H., Koehler, G.J., Pathak, P., 2010. Detecting Management Fraud in Public Companies. Manage. Sci. 56 (7), 1146–1160.
Chan, P.K., Stolfo, S.J., 1998. Learning with non-uniform class and cost distributions: Effects and a distributed multi-classifier approach. Workshop Notes KDD-98
Workshop on Distributed Data Mining.
Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
Christ, M.H., Emett, S.A., Summers, S.L., Wood, D.A., 2021. Prepare for takeoff: Improving asset measurement and audit quality with drone-enabled inventory audit
procedures. Rev. Acc. Stud. 1–21.
Christ, M. H., Eulerich, M.,Krane, R., and Wood, D. A. (June 8, 2020). New Frontiers for Internal Audit Research. Available at SSRN: https://ssrn.com/
abstract=3622148 or https://doi.org/10.2139/ssrn.3622148.
Dandl, S., Molnar, C., Binder, M., Bischl, B., 2020. Multi-objective counterfactual explanations. Springer, Cham, pp. 448–469.
DARPA, 2016. Broad Agency Announcement Explainable Artificial Intelligence (XAI). Available at: https://www.darpa.mil/attachments/DARPA-BAA-16-53.pdf.
Das, A., and Rad, P. (2020). Opportunities and challenges in explainable artificial intelligence (xai): A survey. arXiv preprint arXiv:2006.11371.
Dechow, P.M., Ge, W., Larson, C.R., Sloan, R.G., 2011. Predicting material accounting misstatements. Contempor. Accounting Res. 28 (1), 17–82.
Ding, K., Lev, B., Peng, X., Sun, T., Vasarhelyi, M.A., 2020. Machine learning improves accounting estimates: evidence from insurance payments. Rev. Acc. Stud. 25,
1–37.
Doshi-Velez, F., and Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
Eliot, L. (2021). Explaining Why Explainable AI (XAI) Is Needed For Autonomous Vehicles And Especially Self-Driving Cars. Forbes. Available at: https://www.forbes.
com/sites/lanceeliot/2021/04/24/explaining-why-explainable-ai-xai-is-needed-for-autonomous-vehicles-and-especially-self-driving-cars/.
Ewert, R., Wagenhofer, A., 2019. Effects of Increasing Enforcement on Financial Reporting Quality and Audit Quality. J. Account. Res. 57 (1), 121–168.
Feldman, M., Friedler, S.A., Moeller, J., Scheidegger, C., Venkatasubramanian, S., 2015. Certifying and removing disparate impact. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259–268.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Ann. Stat. 1189–1232.


Goldstein, A., Kapelner, A., Bleich, J., Pitkin, E., 2015. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation.
J. Comput. Graph. Stat. 24 (1), 44–65.
Gopinathan, K. M., Biafore, L. S., Ferguson, W. M., Lazarus, M. A., Pathria, A. K., and Jost, A. (1998). US Patent No. 5,819,226. Washington, DC: US Patent and
Trademark Office.
Hall, P., Gill, N., 2019. An introduction to machine learning interpretability. O’Reilly Media, Incorporated.
Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning: data mining, inference, and prediction. Springer Science and Business Media.
IEEE, 2019. IEEE Position Statement – Artificial Intelligence. Available at: https://globalpolicy.ieee.org/wp-content/uploads/2019/06/IEEE18029.pdf.
Islam, S. R., Eberle, W., Ghafoor, S. K., and Ahmed, M. (2021). Explainable artificial intelligence approaches: A survey. arXiv preprint arXiv:2101.09429.
Kamiran, F., Calders, T., 2012. Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst. 33 (1), 1–33.
Kaplan, R.S., 2011. Accounting scholarship that advances professional knowledge and practice. Account. Rev. 86 (2), 367–383.
Khedkar, S., Gandhi, P., Shinde, G., Subramanian, V., 2020. Deep learning and explainable AI in healthcare using EHR. In Deep learning techniques for biomedical and
health informatics. Springer, Cham, pp. 129–148.
Launchbury, J. (2017). A DARPA perspective on artificial intelligence. Retrieved November, 11, 2019.
Liao, S.H., 2005. Expert system methodologies and applications—a decade review from 1995 to 2004. Expert Syst. Appl. 28 (1), 93–103.
Lipton, Z.C., 2018. The Mythos of Model Interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 16 (3), 31–57.
Louwers, T.J., Ramsay, R.J., Sinason, D.H., Strawser, J.R., Thibodeau, J.C., 2020. Auditing & assurance services, (8th ed.). McGraw Hill.
Lu, J., Lee, D., Kim, T. K., and Dank, D. (2020). Good Explanation for Algorithmic Transparency. Working paper.
Lundberg, S., and Lee, S. I. (2017). A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874.
Lundberg, S. M., Erion, G. G., and Lee, S. I. (2018). Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888.
Miller, T., 2019. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 267, 1–38.
Molnar, C. (2021). Interpretable machine learning. Lulu.com. Available at: https://christophm.github.io/interpretable-ml-book/.
Moore, J. D., and Swartout, W. R. (1988). Explanation in expert systems: A survey. University of Southern California, Marina del Rey, Information Sciences Institute.
Munoko, I., Brown-Liburd, H.L., Vasarhelyi, M., 2020. The ethical implications of using artificial intelligence in auditing. J. Bus. Ethics 167 (2), 209–234.
Nielsen, D. (2016). Tree boosting with XGBoost: Why does XGBoost win “every” machine learning competition? (Master’s thesis, NTNU).
No, W.G., Lee, K., Huang, F., Li, Q., 2019. Multidimensional audit data selection (MADS): A framework for using data analytics in the audit data selection process.
Accounting Horizons 33 (3), 127–140.
Pawar, U., O’Shea, D., Rea, S., O’Reilly, R., 2020. Explainable AI in healthcare. IEEE, pp. 1–2.
PCAOB. AS 1105, Audit Evidence.
PCAOB. AS 1215, Audit Documentation.
PCAOB. AS 2101, Audit Planning.
Perols, J., 2011. Financial statement fraud detection: An analysis of statistical and machine learning algorithms. Audit.: J. Pract. Theor. 30 (2), 19–50.
Perols, J.L., Bowen, R.M., Zimmermann, C., Samba, B., 2016. Finding needles in a haystack: Using data analytics to improve fraud prediction. Account. Rev. 92 (2),
221–245.
Rajgopal, S., 2020. Integrating Practice into Accounting Research. Manage. Sci.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016, August). “Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM
SIGKDD international conference on knowledge discovery and data mining (pp. 1135-1144).
Ribeiro, M. T., Singh, S., and Guestrin, C. (2018, April). Anchors: High-precision model-agnostic explanations. In Proceedings of the AAAI Conference on Artificial
Intelligence (Vol. 32, No. 1).
Rudin, C., 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1 (5), 206–215.
Russell, S., and Norvig, P. (2002). Artificial intelligence: a modern approach.
Shapley, L.S., 1953. A value for n-person games. In: Contributions to the Theory of Games II (Annals of Mathematics Studies 28). Princeton University Press, pp. 307–317.
Shokri, R., Stronati, M., Song, C., and Shmatikov, V. (2017, May). Membership inference attacks against machine learning models. In 2017 IEEE Symposium on
Security and Privacy (SP) (pp. 3-18). IEEE.
Slack, D., Hilgard, S., Jia, E., Singh, S., and Lakkaraju, H. (2020, February). Fooling lime and shap: Adversarial attacks on post hoc explanation methods. In
Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (pp. 180-186).
Tamagnini, P., Krause, J., Dasgupta, A., Bertini, E., 2017. Interpreting black-box classifiers using instance-level visual explanations. In: Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, pp. 1–6.
Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., and Ristenpart, T. (2016). Stealing machine learning models via prediction apis. In 25th {USENIX} Security Symposium
({USENIX} Security 16) (pp. 601-618).
Van Lent, M., Fisher, W., and Mancuso, M. (2004, July). An explainable artificial intelligence system for small-unit tactical behavior. In Proceedings of the national
conference on artificial intelligence (pp. 900-907). Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.
Virág, M., Nyitrai, T., 2014. Is there a trade-off between the predictive power and the interpretability of bankruptcy models? The case of the first Hungarian bankruptcy prediction model. Acta Oeconomica 64 (4), 419–440.
Wachter, S., Mittelstadt, B., Russell, C., 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. J.L. and Tech.
31, 841.
Wood, D.A., 2016. Comparing the publication process in accounting, economics, finance, management, marketing, psychology, and the natural sciences. Account.
Horizons 30 (3), 341–361.
Zhang, C., Cho, S., and Vasarhelyi, M. (2022). Identifying Informative Audit Quality Indicators Using Machine Learning. Rutgers Business School. Working paper.
Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3981622.
Zhu, J., Liapis, A., Risi, S., Bidarra, R., Youngblood, G.M., 2018. Explainable AI for designers: A human-centered perspective on mixed-initiative co-creation. In: 2018
IEEE Conference on Computational Intelligence and Games (CIG). IEEE, pp. 1–8.
