Field Experiments in Managerial Accounting Research
Sofia M. Lourenço
ISEG, Universidade de Lisboa
and Advance, CSG Research Center, Portugal
slourenco@iseg.ulisboa.pt
Boston — Delft
Contents
1 Introduction 2
2 Causal inference 7
2.1 Correlation, causation, and threats to causal claims . . . . 7
2.2 Remedies used in observational studies . . . . . . . . . . . 9
8 Concluding remarks 57
Acknowledgements 60
References 61
Field Experiments in Managerial
Accounting Research
Sofia M. Lourenço
ISEG, Universidade de Lisboa, and Advance, CSG Research Center,
Portugal; slourenco@iseg.ulisboa.pt
ABSTRACT
The use of field experiments in managerial accounting re-
search has increased substantially in the last couple of years.
One reason for this upsurge is the call in the literature to
address causality more confidently, which can be best ac-
complished with field experiments. This research method
provides a clear mechanism for identifying causal effects
in the field because the researcher introduces exogenous
manipulations in different experimental conditions to which
observational units (e.g., individuals, groups, or companies)
are randomly assigned. Hence, field experiments are well
suited to address managerial accounting phenomena that
are plagued with endogeneity concerns when analyzed by us-
ing the results of retrospective (observational) studies. This
manuscript provides an introduction to field experiments
and is especially directed toward managerial accounting re-
searchers who wish to consider adding this research method
to their toolbox.
1. I will discuss thoroughly the nature of these experiments in Section 3.2, "Types of experiments."
2. In field experiments, participants do not usually know that they are taking part in a study and, as such, there is no self-selection by participants. However, to the extent that in some field experiments the organizations choose to take (or not to take) part in a study, there is self-selection of the partner. This is a limitation of field experiments that is discussed in Section 7, "Limitations of Field Experiments." This drawback is common to the majority of field studies.
3. Presslee et al. (2013) use an experimental study in the field that deals with different types of incentives but, in the typology presented in this manuscript, this study is a quasi-experiment and not a field experiment, as it does not have random assignment of the experimental units to the treatment conditions. The allocation to the treatment conditions was decided by the company in which the quasi-experiment was implemented.
4. The exceptions are Kelly et al. (2017), who use independent retailer companies as experimental units in their field experiment, and Li and Sandino (2018), who use stores.
Table 2.1: Threats to causal claims and econometric remedies

Omitted correlated variable bias
What is this threat? A variable related to both the dependent and the independent variable is omitted from the identification strategy, which can lead to a spurious relationship between the two variables in the model (once the omitted correlated variable is considered in the model, the previous relationship disappears).
Econometric remedies to mitigate the threat∗: Include omitted correlated variables in the model.
Limits to remedies: There can be myriad omitted correlated variables. It is difficult for a researcher to argue that she has included all of the potentially omitted correlated variables in her model.

Selection (sampling) bias
What is this threat? Selection of study units or "treated" units is not random, which can limit the generalizability of the study and its internal validity due to biased estimates of the "treatment" effect.
Econometric remedies to mitigate the threat∗: Correct for selection (sampling) bias with sample selection models (e.g., Heckman, 1979), matching procedures (e.g., Rosenbaum and Rubin, 1983), or balancing procedures (e.g., Hainmueller, 2012).
Limits to remedies: If there are unobservable variables that are determining the selection and are also related to the outcome variable but not considered in the selection model or matching/balancing procedure, the remedies will not solve the selection (sampling) bias. It is difficult for the researcher to argue that she has all relevant variables in her selection model or matching/balancing procedure.

Reverse causality bias
What is this threat? The relationship does not go from the independent to the dependent variable, which is tested and documented by the researcher, but rather from the dependent to the independent variable.
Econometric remedies to mitigate the threat∗: Use longitudinal data and estimate the relationship using lags.
Limits to remedies: Even though this threat can be solved, other threats may still be present.

Simultaneous causality bias
What is this threat? Dependent and independent variables are mutually influencing each other. Because researchers only consider one side of this …
Econometric remedies to mitigate the threat∗: Use simultaneous equation models, instrumental variables (Angrist and Krueger, 2001), or structural models.
Limits to remedies: Some exogenous assumptions (variables) are still needed, which may not be credible or easy to find given the managerial accounting …
2.2. Remedies used in observational studies
and the interpretation of its results. For example, theory can help in
the selection of the variables, not only dependent and independent, but
also control variables that can be used to dismiss concerns regarding
omitted correlated variables. Theory can also help when interpreting
the results. If there are competing hypotheses, theory can provide an
explanation for potential contradictory results. Moreover, theory can
provide arguments for why A causes B and not the inverse. Hence,
theory can help researchers who are using non-experimental data to
make causal claims.
Study design can help alleviate part of the concerns related to causal
inference. For example, by choosing a longitudinal design instead of
a cross-sectional design, researchers will be able to make causal claims
with more confidence. Specifically, researchers will be able to mitigate
the threat of reverse causality by showing that the cause precedes the
effect. Researchers use data from t-1 (or some other lag) for the indepen-
dent variable to estimate the effect on the dependent variable in t. Even
in a cross-sectional study, such as a one-shot survey, researchers may
also be able to collect information from past periods that allows the use
of lagged variables or changes and, hence, enhances causal claims (Van
der Stede, 2014). It is also important in the study design stage to clearly
identify the variables for which data need to be collected. Similar to
the theory of the study, failing to identify and collect data for critical
variables—dependent, independent, and control variables—may limit
the researcher’s ability to make causal inferences. Hence, study design
can help address omitted correlated variables by carefully identifying
and planning to collect data for those variables. However, there may be
myriad potential confounders or difficulties in collecting data that make
it (almost) impossible to confidently dismiss the omitted correlated
variable concern. Researchers can also make use of qualitative field
evidence or institutional knowledge that can help them to make causal
claims (Ittner, 2014). Hence, the study design should consider possibili-
ties of collecting data about the specific setting in which the study is
taking place. That additional knowledge may help the researcher build
a compelling story and make convincing arguments for a certain causal
relationship.
though the researcher does not need an experiment to show that the
cause precedes the effect, only longitudinal data, other problems of
observational (retrospective) studies can still be present (e.g., omitted
correlated variables bias and selection bias).
Finally, the researcher needs to consider simultaneous causality bias.
This problem can be solved by simultaneous equations models, although
some exogenous variation is needed to ensure that the causal link of
interest can be identified. The common solution is to use instrumental
variables (Angrist and Krueger, 2001) that deal with endogeneity issues
in general. The problem in this case is to find a good instrument,
which needs to be relevant (significantly related to the endogenous
independent variable) and exogenous (not related to other variables
in the model). Another approach to making causal inferences with
observational data is to use structural modeling (Gow et al., 2016). This
approach requires the researcher to analytically model the phenomenon
of interest, which provides insights into the mechanisms that drive the
phenomenon. Nevertheless, because observational data are used, these
models still require the researcher to make some causal (exogeneity)
assumptions to ensure that the model is solvable.
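As an illustration of the instrumental-variables logic discussed above, the following sketch runs two-stage least squares by hand on simulated data. It is not from the original text: the data-generating process, coefficient values, and variable names are all assumptions chosen so that an omitted confounder biases OLS while a relevant, exogenous instrument recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# u is an unobserved confounder affecting both x and y; z is a valid
# instrument -- relevant (it moves x) and exogenous (unrelated to u).
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 1.0 * z + 1.0 * u + rng.normal(size=n)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)  # true causal effect of x on y is 2.0

def ols(X, target):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, target, rcond=None)[0]

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

beta_ols = ols(X, y)[1]  # biased upward: x is correlated with the confounder u

# Stage 1: project x on the instrument. Stage 2: regress y on the fitted x.
x_hat = Z @ ols(Z, x)
beta_iv = ols(np.column_stack([np.ones(n), x_hat]), y)[1]

print(f"OLS estimate:  {beta_ols:.2f} (biased away from 2.0)")
print(f"2SLS estimate: {beta_iv:.2f} (close to the true effect of 2.0)")
```

The sketch also makes the relevance requirement concrete: if the first-stage relationship between z and x were weak, the fitted values would carry little variation and the second-stage estimate would be unreliable.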
Difference-in-differences estimators (e.g., Athey and Imbens, 2006;
Card and Krueger, 1994), fixed effect estimators, and regression discontinuity
designs (e.g., Hahn et al., 2001; Thistlethwaite and Campbell, 1960)
are also part of the econometric tools that are available for quantitative
empirical accounting researchers who want to make causal claims using
retrospective data (Gow et al., 2016). These tools help researchers to
mitigate the threats previously described by controlling for baseline
differences, accounting for time-invariant and/or unit-invariant unob-
servable factors, and comparing more similar “treated” and “non-treated”
groups, respectively. However, these tools are also subject to criticism
in their ability to completely solve these problems (e.g., Abadie, 2005;
Bertrand et al., 2004). Overall, even though the use of these techniques
helps to create better counterfactuals, it always remains an open question
how time-variant or unit-variant unobservable factors have affected
non-random groups.
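To make the difference-in-differences logic concrete, here is a minimal simulation sketch; it is illustrative rather than drawn from any study in the text, and the group sizes, baseline gap, trend, and effect size are assumed. Two non-random groups start from different baselines and share a common time trend; differencing out each group's own pre-period removes both the baseline gap and the trend.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Treated units start from a higher baseline -- a level confound that a
# plain cross-sectional comparison would absorb into the "effect".
treated = rng.integers(0, 2, size=n)
baseline = 10.0 + 4.0 * treated + rng.normal(size=n)
trend = 1.0     # common time trend affecting both groups
effect = 2.5    # true treatment effect

y_pre = baseline + rng.normal(size=n)
y_post = baseline + trend + effect * treated + rng.normal(size=n)

# Naive post-period comparison mixes the 4.0 baseline gap into the estimate.
naive = y_post[treated == 1].mean() - y_post[treated == 0].mean()

# Difference-in-differences: compare each group's change over time,
# so group-specific levels and the common trend cancel.
did = (y_post[treated == 1] - y_pre[treated == 1]).mean() \
    - (y_post[treated == 0] - y_pre[treated == 0]).mean()

print(f"naive post-only comparison:  {naive:.2f}")  # ~6.5 = baseline gap + effect
print(f"difference-in-differences:   {did:.2f}")    # ~2.5, the true effect
```

Note what the sketch does not fix: if the groups had different trends (a violation of the parallel-trends assumption), the differenced estimate would again be biased, which is the criticism referenced above.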
Writing objectively about the study and acknowledging the limita-
tions of using observational data to make causal inferences can also help
the researcher to gain the goodwill of the reader to accept the causality
explanation that is offered. It is important that the researcher clearly
delineates any assumptions of the causal inference, for example which
variables are regarded as exogenous and which are regarded as endoge-
nous. Likewise, the researcher should separate the causal arguments
from the quantitative or qualitative evidence provided. This evidence
may only show correlation, at which point it is up to the researcher to
argue that the correlation is suggestive of causation in light of her theory
development. Hence, one should be cautious and use the appropriate
terminology, for example using "X is associated with Y" instead of "X
causes Y" (Van der Stede, 2014).
In sum, observational studies have internal and external validity
problems that can be mitigated, albeit not completely eliminated, with
the appropriate use of theory, study design, econometrics/statistical
analyses, and writing. Because of the limitations of all these solutions,
it is up to the researcher to develop her causal inference. Experimental
research is one solution to the identification problems of observational
studies.
3
The experimental method
1. I discuss these procedures in Section 6.
3.2. Types of experiments
the groups are similar at the start of the experiment and, thus, any
differences between them after the introduction of the manipulation
can be attributed to that manipulation. Thus, causal inference can
be directly made as the researcher has built the counterfactual—what
would have happened if the manipulation was not introduced—which is
the control group.
Put differently, the manipulation is an exogenous shock received by
the treatment group, whose units are randomly assigned. Because
the treatment group is, in expectation, similar to the control group and
both co-exist at the same time in the same context, all the identifica-
tion problems of observational studies disappear. There should not be
any omitted correlated variables as the manipulation is exogenous and
thus not correlated with any other variable. Moreover, even if some
variables that are correlated with the outcome are not considered in
the estimation model of the effect, due to randomization these (omit-
ted) variables should be similar across groups and thus should not
interfere in the estimation of the average treatment effect (difference
between groups). Selection bias should likewise not be present because
randomization should make the groups similar in expectation before
the manipulation and the manipulation is orthogonal to the groups.
Reverse and simultaneous causality bias are also not present as the
manipulation is exogenous and precedes the effect. Hence, the cause
cannot be dependent on the effect.
Overall, any endogeneity concern is eliminated with the exogenous
manipulation and the random assignment of experimental units to
treatment and control groups.
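The balancing role of randomization described above can be illustrated with a small simulation. This is a hypothetical sketch: the "motivation" covariate stands in for any characteristic the researcher never measures, and the sample size is arbitrary. Under random assignment, the unmeasured characteristic should be similar across groups in expectation.

```python
import random
import statistics

random.seed(42)
n = 10_000

# An unobserved characteristic (e.g., intrinsic motivation) that the
# researcher never measures and cannot control for.
motivation = [random.gauss(0, 1) for _ in range(n)]

# Random assignment: each unit receives the treatment with probability 1/2,
# independently of its (unobserved) characteristics.
assignment = [random.random() < 0.5 for _ in range(n)]

treat = [m for m, a in zip(motivation, assignment) if a]
control = [m for m, a in zip(motivation, assignment) if not a]

gap = statistics.mean(treat) - statistics.mean(control)
print(f"mean unobserved covariate, treatment - control: {gap:.3f}")  # close to 0
```

Because assignment is independent of every covariate, observed or not, any remaining group difference in the unmeasured characteristic is pure sampling noise that shrinks with sample size, which is exactly why the omitted-variable and selection concerns of observational studies do not arise here.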
2. However, some researchers have questioned whether these differences really exist (e.g., Camerer, 2015; Falk and Heckman, 2009).
Table 3.1: Types of experiments

Laboratory experiment
What is this experiment? An experiment completed by a standard pool of subjects, usually students who have signed up to participate in experimental research, in an artificial environment created by the researcher.
Advantages: High control (high internal validity).
Disadvantages: Limits to the external validity are due to the pool of subjects, self-selection to participate in a study, and an artificial environment.
Examples in Managerial Accounting Research (MAR): See reviews on the use of laboratory experiments in MAR by Bonner et al. (2000), Bonner and Sprinkle (2002), Sprinkle (2003), and Sprinkle and Williamson (2006).

Artefactual field experiment
What is this experiment? Similar to a laboratory experiment, but uses a non-standard pool of subjects.
Advantages: High control (high internal validity); higher external validity than lab experiments due to the pool of subjects used.
Disadvantages: Limits to the external validity are due to the self-selection of subjects to participate in a study and the artificial environment.
Examples in MAR: Not common in MAR, as researchers prefer framed field experiments that make use of participants' knowledge when using non-standard subjects.

Framed field experiment
What is this experiment? Similar to an artefactual field experiment in using a non-standard pool of subjects, but makes use of their specific knowledge/background in a given task.
Advantages: High control (high internal validity); higher external validity than artefactual field experiments due to the type of task used.
Disadvantages: Limits to the external validity are due to the self-selection of participants to participate in the study and the artificial environment (although close to what is experienced by the participants in the field).
Examples in MAR: Libby (1975); Harrison et al. (1988); Roodhooft and Warlop (1999); Bol and Smith (2011).

Natural field experiment (or field experiment)
What is this experiment? An experiment done in a naturally occurring environment (not created by the researcher) with subjects who are normally not aware that they are part of a study and participate by performing their daily routine/work/life.
Advantages: High control (high internal validity); greater external validity due to the absence of self-selection by participants, the realism of a field setting, and real stakes.
Disadvantages: Control is not as strict as in lab experiments; single-site field experiments and self-selection of partner organizations may limit the generalizability of the results; difficulties in replicability.
Examples in MAR: Aranda and Arellano (2010); Lourenço (2016); Kelly et al. (2017); Casas-Arce et al. (2017); Lourenço et al. (2018); Eyring and Narayanan (2018); Li and Sandino (2018).

Natural experiment
What is this experiment? Shocks (changes) to naturally occurring environments that researchers use to ex post create the treatment group (those who have received the shock) and the control group (those who have not received the shock).
Advantages: The use of field settings allows the researcher to address a broader set of topics than when using other types of experiments, because certain manipulations are not ethically accepted.
Disadvantages: Lack of randomization limits causal inference.
Examples in MAR: Not common in MAR due to the phenomena studied, but the strongest design for identifying causal effects when using naturally occurring data in financial accounting research (e.g., Gippel et al., 2015; Gow et al., 2016).

Quasi-experiment
What is this experiment? Similar to natural experiments, but the researcher can be involved in the design of the experiment (similar to a field experiment), even though she does not apply randomization.
Advantages: Similar to natural experiments or field experiments without randomization (lower internal validity).
Disadvantages: Lack of randomization limits causal inference.
Examples in MAR: Presslee et al. (2013).
3. In accounting research, artefactual field experiments and framed field experiments have been called in the past behavioral field experiments or field experiments (e.g., Barrett, 1971; Belkaoui, 1980; Boatsman and Robertson, 1974; Harrison et al., 1988; Ortman, 1975; Patton, 1978; Roodhooft and Warlop, 1999; Sharma and Iselin, 2003).
4. Further limitations of field experiments are discussed in Section 7, "Limitations of field experiments."
use these shocks to ex post create the treatment group (those who have
received the shock) and the control group (those who have not received
the shock). Natural experiments can be the only type of experiment
available if the manipulation by the researcher is not possible or
desirable (e.g., when ethical concerns are present). In this case, natural
experiments are the preferred, i.e., the most robust, research design for
naturally occurring (not manipulated) data (Gippel et al., 2015). Quasi-
experiments are similar to natural experiments in that they do not
employ random assignment to treatment and control groups. However,
some decision-maker (e.g., CEO of a firm, director of a governmental
agency, or a politician) has a say in the allocation of the experimental
units (e.g., individuals, groups, or organizations) to the treatment con-
ditions (Shadish et al., 2002). The allocation can also be the product
of self-selection of the experimental units to the treatment conditions.
Even though natural experiments and quasi-experiments are similar in
their non-random assignment to conditions, quasi-experiments can be
prospective studies wherein the researcher collaborates with a partner
organization, designs the manipulation, and collects data. The researcher
can also collect observable measures that dictate the assignment to the
treatment conditions, which allows her to later account for them in the
identification strategy. Some researchers do not differentiate between
quasi-experiments and natural experiments as they share the same
drawback—the lack of random assignment to the different conditions
(e.g., Harrison and List, 2004).
Due to their nature, i.e., exogenous shocks, natural experiments are
not common in managerial accounting research, which is focused on
decisions and behaviors that are internal, i.e., within an organization.
However, natural experiments have become the Holy Grail for identifying
causal effects in financial accounting research (e.g., Gippel et al., 2015;
Gow et al., 2016), and are now relatively common (e.g., Christensen
et al., 2016; Gormley et al., 2013; Kinney et al., 2011; Li and Zhang,
2015). An example of a quasi-experiment in managerial accounting
research is Presslee et al. (2013). In this study, the researchers designed
the manipulations, but the assignment of experimental units to the
treatment conditions was determined by the collaborating research site.
4
The promise of field experiments for managerial
accounting research
1. For example, in the review by Hesford et al. (2006), experiments are the fourth most widely used method in managerial accounting (12.7% of all manuscripts considered in the selection), after frameworks (19.5%), analytical (18.4%), and surveys (16.3%).
5
Some examples of field experiments in managerial accounting research
Despite the advantages of field experiments and the rise of this method
in other streams of research, the use of field experiments in managerial
accounting research is relatively scarce. Examples of recently published
field experiments in managerial accounting research mainly deal with the
implementation of strategic performance measurement systems (Aranda
and Arellano, 2010) or information sharing systems (Li and Sandino,
2018), and the performance effects of different types of incentives (Kelly
et al., 2017; Lourenço, 2016) or feedback (Casas-Arce et al., 2017; Eyring
and Narayanan, 2018; Lourenço et al., 2018). In this section, and given
the focus of this manuscript, I compare those studies with regard to
the multiple design choices that researchers have to make in their field
experiment. Table 5.1 presents the main findings of this comparison.
Table 5.1 shows that feedback and incentives are the most com-
mon topics addressed in field experiments in managerial accounting
research. This is not surprising given that field experiments within
one organization, which are the most frequent, tend to have individ-
uals as observational units. This leads to the analysis of individual
level effects and, as such, manipulations are also at the individual
level. The other two experiments that are considered focus on the
Table 5.1: Some examples of field experiments in managerial accounting research

Aranda and Arellano (2010)
Topic: PMS. Industry: Banking. Participants (unit of randomization): Managers.
Design: Between-subjects, 2 conditions. Manipulations: BSC, alternative PMS.
Outcome measure: Consensus about strategy (survey-based).
Questionnaires: Pre, Exp. Analysis of HTE: No. Data: Pre, Exp. Duration: 4 months.

Lourenço (2016)
Topic: Incentives. Industry: Retail. Participants: Sales reps.
Design: Between-subjects, 2×2×2. Manipulations: Monetary incentives, feedback, recognition.
Outcome measure: Sales relative to goals.
Questionnaires: Pre, Post. Analysis of HTE: Yes, in robustness tests. Data: Pre, Exp. Duration: 2 months.

Kelly et al. (2017)
Topic: Incentives. Industry: Retail. Participants: Sales reps (retailers).
Design: Between-subjects, 2 conditions. Manipulations: Cash, tangible rewards.
Outcome measure: Sales.
Questionnaires: Post. Analysis of HTE: No, but additional analysis of an effort measure (video views). Data: Pre, Exp. Duration: 6 months.

Casas-Arce et al. (2017)
Topic: Feedback. Industry: Repair. Participants: Professionals.
Design: Between-subjects, 2×2. Manipulations: Feedback frequency, feedback detail.
Outcome measure: Detractors (Net Promoter Score), process measures (use of PDA, on-time services).
Questionnaires: Pre. Analysis of HTE: Yes, in robustness tests. Data: Pre, Exp, Post. Duration: 3 months.

Lourenço et al. (2018)
Topic: Feedback. Industry: Healthcare (practice). Participants: Physicians.
Design: Between-subjects, 2 conditions. Manipulations: Feedback, no feedback.
Outcome measure: E-prescribing rate.
Questionnaires: Pre, Post. Analysis of HTE: Yes, main and additional analyses. Data: Pre, Exp, Post. Duration: 12 months.

Eyring and Narayanan (2018)
Topic: Feedback. Industry: Education. Participants: On-line students.
Design: Between-subjects, 3 conditions. Manipulations: No RPI, RPI with median, RPI with top quartile.
Outcome measure: Activity level (process measure), grade (performance measure).
Questionnaires: Pre, Post. Analysis of HTE: Yes, main and additional analyses. Data: Pre, Exp. Duration: 2 months.

Li and Sandino (2018)
Topic: Information sharing. Industry: Retail (stores). Participants: Sales reps.
Design: Between-subjects, 2 conditions. Manipulations: Information sharing system for creative work, no information sharing system.
Outcome measure: Quality of creative work, job engagement (attendance), financial performance (sales).
Questionnaires: Pre, Post. Analysis of HTE: Yes, as additional analyses. Data: Pre, Exp. Duration: 4 months.
1. The approval of the Institutional Review Board will be discussed in Section 6.5.
the fact that these types of employees represent the largest pool in the
partner organization and, as such, researchers maximize the power of
the experiments they are conducting. In some instances, these frontline
employees are considered as a group (Kelly et al., 2017; Li and Sandino,
2018), likely because either the research question focuses on an organi-
zational outcome, it is not possible to obtain individual data, or it is
undesirable to have multiple treatments within that group due to the
risk of contamination. Contamination refers to participants learning
about other treatments different from the one that they are experienc-
ing in the study. This is a threat to the study because participants
may react to information about multiple treatments rather than to
the specific treatment they are receiving. Thus, contamination can
invalidate the inferences of a study. Aranda and Arellano (2010) are
an exception to the use of frontline employees as participants because
they use middle managers. This option makes sense given the research
question of their study—the effect of the balanced scorecard (BSC) on
middle managers’ consensus with the strategy established by the top
management team. Managers are an interesting pool of participants,
given their decision-making power in many managerial accounting and
management control decisions. Hence, future field experiments in this
stream of research are likely to more often use managers as participants.
However, power concerns need to be taken into account given that the
number of managers within an organization is smaller than the number
of front-line employees, but clever research designs can mitigate this
concern.2
Eyring and Narayanan (2018) are also an exception to the use of
frontline employees because participants in their study are students,
who are closer to "customers" than to "employees". The use of
customers as the unit of analysis promises to be a fruitful choice for
future field experiments in managerial accounting research. Customers
are usually a larger population than employees and, thus, sample size
(and the power of the experiment) increases substantially. The use of
customers (or their decisions) as the unit of analysis is common in
other literatures, such as marketing (e.g., Ngwe et al., 2019).
2. Experimental designs will be discussed in Section 6.4.
by which the effects are taking place. To date, not many studies have
explored the possibility of having multiple outcome measures and those
that do tend to use some type of process measures (e.g., Casas-Arce
et al., 2017; Eyring and Narayanan, 2018; Li and Sandino, 2018). Thus,
there is room for researchers to be more creative about the outcome
measures considered in their studies and how, ultimately, these measures
translate into “hard” performance outcomes.
All of the studies described in Table 5.1 use questionnaires. The
combination of questionnaires and field experiments is fruitful be-
cause it allows the researcher to collect data other than what is al-
ready available in the partner organization. Questionnaires can be
done before the start of the experiment (pre-questionnaire), during
the experiment (exp-questionnaire), or after the end of the experiment
(post-questionnaire). Pre-questionnaires normally collect data in order
to establish a successful randomization, dismiss concerns about con-
tamination (e.g., Casas-Arce et al., 2017), and/or obtain baseline levels
of the outcome variable (e.g., Aranda and Arellano, 2010) or some
other variable used in the study (e.g., Li and Sandino, 2018). These
additional variables can be used to test heterogeneous treatment effects
(i.e., whether the treatment effect varies according to some observable
characteristic), identify the process or mechanism by which treatment
effects occur, or dismiss alternative explanations for the documented
effects. Exp-questionnaires occur during the experiment and aim to
collect process measures that can help researchers to explain the
mechanism by which the effects are taking place. This type of measure
can also come from "hard" data that the experiment produces (e.g.,
Kelly et al., 2017) or from the partner organization (e.g., Eyring and
Narayanan, 2018). Post-questionnaires occur after the experiment is
finished and can help researchers in performing manipulation checks
and additional analyses based on the perceptions of the participants
regarding the different manipulations.
Even though there are many advantages of using questionnaires
within a field experiment study, researchers must also consider the risks
associated with them. The major risk is that questionnaires prompt
participants to react in a certain way. Thus, the effects documented in
the field experiment may not be due to the manipulation per se but
The majority of the studies included in Table 5.1 perform some type of
heterogeneous treatment effect analysis. However, they do so mainly in
additional analyses (e.g., Li and Sandino, 2018). Only two studies use
heterogeneous treatment effects in the main analysis, i.e., they develop
different hypotheses for subsamples of their participants’ pool (Eyring
and Narayanan, 2018; Lourenço et al., 2018). In both studies, baseline
performance is the variable used for the heterogeneous treatment effect
analysis. Heterogeneous treatment effects promise to be a fruitful area
for future research. In the field, several opposite forces may come into
play and lead to average null effects. However, in subsamples of the
population there may be (positive and negative) statistically significant
effects that can only be documented by using heterogeneous treatment
effects. Researchers need to consider these opposite forces upfront so
that they can develop their hypotheses according to the characteristics
of the population.
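The scenario described above, where opposite subgroup effects cancel in the average but can be recovered by conditioning on an observable baseline characteristic, can be sketched with a short simulation. This is illustrative only: the subgroup split, effect sizes, and variable names are assumptions, and the interaction regression is a generic way to estimate subgroup effects rather than the specification of any study in Table 5.1.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4_000

# Observable baseline characteristic (e.g., above-median baseline performance)
# and random assignment to treatment.
high_baseline = rng.integers(0, 2, size=n)
treatment = rng.integers(0, 2, size=n)

# Opposite subgroup effects: +2 for low-baseline units, -2 for high-baseline
# units, so the average treatment effect is roughly zero.
effect = np.where(high_baseline == 1, -2.0, 2.0)
y = 5.0 + effect * treatment + rng.normal(size=n)

ate = y[treatment == 1].mean() - y[treatment == 0].mean()

# Interaction regression: y ~ 1 + treatment + high_baseline + interaction.
X = np.column_stack([np.ones(n), treatment, high_baseline,
                     treatment * high_baseline])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"average treatment effect:        {ate:.2f}")              # ~0: cancels
print(f"effect for low-baseline units:   {beta[1]:.2f}")          # ~+2
print(f"effect for high-baseline units:  {beta[1] + beta[3]:.2f}")  # ~-2
```

A researcher who looked only at the average here would conclude there is no effect; the interaction coefficients reveal the two offsetting responses, which is why the subgroup hypotheses need to be developed upfront.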
Data available in all experiments presented in Table 5.1 include
data prior to the experiment and data during the experiment. As
mentioned previously, these types of data allow researchers to apply
more sophisticated estimation techniques (e.g., difference-in-differences,
fixed effects) and analyze heterogeneous treatment effects that are
related to the baseline level of the outcome variable. Two studies also
use post-experimental data. This type of data allows the analysis of
the effects once the manipulations are removed (Lourenço et al., 2018)
or when they are spread to the entire pool of participants (e.g., Casas-
Arce et al., 2017). This kind of analysis can be an important feature of
the study because researchers can document whether participants go
back to the previous level of performance once the stimulus is removed
or whether they converge to the same level of performance once all
participants experience the same stimulus. Both of these issues have
great relevance to and impact for practitioners.
Finally, for the studies included in Table 5.1 there is a wide variation
in the duration of the experiment, i.e., the time during which participants
experience the manipulations. The smallest experimental window is two
months (e.g., Eyring and Narayanan, 2018; Lourenço, 2016) and the
largest is one year (Lourenço et al., 2018). Despite this difference, the
number of interactions with participants is not that large; in Lourenço
There are a few papers that provide insight on how to run field ex-
periments. For example, List (2011) enumerates 14 tips to conduct a
field experiment in economics.1 This advice ranges from methodolog-
ical issues, such as having a proper control group and large enough
sample size, to practicalities in the relationship with the partner or-
ganization, such as having a sponsor in the organization and making
the best use of organizational dynamics. Another important guide to
field experiments, particularly on randomization, is Duflo et al. (2008).
The authors discuss (i) the advantages of randomization, (ii) power
calculations, (iii) methods to implement randomization, and (iv) issues
that can interfere with statistical inference based on experimental data,
such as partial compliance and attrition.
Based on my own experience in running field experiments, in the
following subsections I offer practical advice on several dimensions
that are important for a successful field experiment project.
1 These tips are also reproduced in Floyd and List (2016).
6.3 Collaboration
covariates that can interfere with the identification of the treatment ef-
fect and that should be used for blocking (Imai et al., 2008). Conversely,
a simple (non-blocked) randomization could produce groups that are
unbalanced on variables correlated with the outcome of interest and, in
an extreme case, on the baseline level of the outcome variable itself. If
the sample size is
small, even a blocked randomization may lead to covariate imbalance,
which is a threat to identification. In these cases, some authors advocate
for rerandomization until covariate balance is achieved (Morgan and
Rubin, 2012). The randomization strategy should also specify whether
the units will be randomly assigned at a single point in time (e.g., before
the start of the experiment), in different waves (e.g., defined at the start
of the experiment), or in real time (e.g., randomization occurs as the
units join the experiment).
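To illustrate, the following sketch (with hypothetical units and a hypothetical balance threshold) pairs units on a baseline covariate, randomizes within each pair, and, in the spirit of Morgan and Rubin (2012), redraws the assignment until the group means of the covariate are close:

```python
import random
import statistics

random.seed(42)

# Hypothetical units with a baseline covariate to balance on.
units = [{"id": i, "baseline": random.gauss(100, 15)} for i in range(40)]

def blocked_assignment(units, key):
    """Pair units with similar covariate values and randomize within each pair."""
    ranked = sorted(units, key=key)
    assignment = {}
    for i in range(0, len(ranked), 2):
        pair = ranked[i:i + 2]
        random.shuffle(pair)
        assignment[pair[0]["id"]] = "treatment"
        if len(pair) > 1:
            assignment[pair[1]["id"]] = "control"
    return assignment

def rerandomize(units, key, max_gap=1.0, tries=1000):
    """Redraw the blocked assignment until group means differ by less than max_gap."""
    for _ in range(tries):
        assignment = blocked_assignment(units, key)
        groups = {"treatment": [], "control": []}
        for u in units:
            groups[assignment[u["id"]]].append(key(u))
        gap = abs(statistics.mean(groups["treatment"])
                  - statistics.mean(groups["control"]))
        if gap < max_gap:
            return assignment, gap
    raise RuntimeError("no balanced assignment found")

assignment, gap = rerandomize(units, key=lambda u: u["baseline"])
print(sum(1 for g in assignment.values() if g == "treatment"), round(gap, 2))
```

With pair matching on a single covariate, balance is usually achieved on the first draw; rerandomization matters more when blocking is coarse, several covariates must be balanced, or the sample is small.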
The choice of the randomization strategy may be linked to the power
the researchers want to have in the experiment. Sample size is critical
to gain sufficient statistical power to reject the null hypothesis of a zero
effect. Hence, researchers are advised to run power calculations prior
to the experiment. These power calculations enable the researcher to
estimate the number of observations needed to detect a pre-defined effect
size, for a given power and significance level. Power calculations can be
enhanced if the researcher has (i) baseline data of the outcome variable,
(ii) control variables correlated with the outcome, and (iii) information
about intra- or inter-cluster correlation in the case of group randomiza-
tion. There is specific software, such as Optimal Design, that researchers
can use to compute power calculations. Even though power calculations
enable the researcher to make better design choices, researchers may
not have all the data needed to run those calculations. Running
a pilot study, if possible, can give the researchers those data. However,
the pilot study may end up being the experiment itself if the collaborat-
ing research site is unwilling to extend randomization, and thus the
experiment, to all employees of the organization; in that case, power
calculations may simply not be feasible. Nevertheless, if the researcher
has baseline data available (even though it is not guaranteed that the
distribution of the outcome variable will remain the same during the
experiment), she can gain some insight into the sample size needed.
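As an illustration of the basic calculation (ignoring the refinements from baseline data, covariates, or clustering mentioned above), the normal-approximation sample size for a two-group comparison of means takes only a few lines of standard-library Python:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided, two-sample comparison of means,
    using the normal approximation: n = 2 * ((z_alpha + z_power) / d) ** 2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# A "small" standardized effect of 0.2 SD needs roughly 393 participants per group;
# a "medium" effect of 0.5 SD needs roughly 63.
print(n_per_group(0.2), n_per_group(0.5))  # 393 63
```

Dedicated tools such as Optimal Design additionally handle clustered randomization, where the intra-cluster correlation can sharply inflate the required sample size.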
If the sample size is fixed by definition, that is, the research site has a
6.6 Pilot
first trial or as the only trial, the researcher should make the most out
of the pilot.
6.7 Implementation
6.8 Analysis
The treatment status estimates are then used as regressors in the second
stage regression, which estimates the local average treatment effect in
the outcome variable of interest (Duflo et al., 2008). In most managerial
accounting field experiments, compliance will not be an issue because
the units (e.g., employees of an organization) receive the manipulation
as part of their job and are therefore unlikely to fail to comply with
the randomization protocol. Noncompliance in the sense of implemen-
tation departures from the design can, however, be a problem, particu-
larly in multi-location and/or multi-period field experiments. In this
case, the researcher has to identify where and when those deviations
took place and decide what data to analyze or exclude. The best strat-
egy will always be for the researcher to act ex ante, closely monitoring
the implementation and making sure that the design is appropriately
put into practice. Likewise, the contamination of treatments, i.e., par-
ticipants learning that other treatments are in place and reacting to
that (and not to their specific manipulation), can also complicate the
analysis because the researcher cannot easily disentangle the two effects ex post.
Again, the best strategy is to design an experiment in which the risk of
contamination is minimized (for example, by opting to randomize at
the group level and not at the individual level).
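With a single binary assignment indicator, the two-stage procedure described above collapses to the Wald estimator: the intention-to-treat effect on the outcome divided by the first-stage effect of assignment on take-up. A self-contained sketch on simulated data (all numbers hypothetical; the true effect is set to 2):

```python
import random
random.seed(7)

# Simulated experiment with partial compliance:
# z = random assignment, d = treatment actually received, y = outcome.
n = 10_000
z = [random.randint(0, 1) for _ in range(n)]
complier = [random.random() < 0.7 for _ in range(n)]  # 70% comply
d = [zi if c else 0 for zi, c in zip(z, complier)]    # never-takers ignore assignment
y = [5.0 + 2.0 * di + random.gauss(0, 1) for di in d]  # true treatment effect = 2

def mean(xs):
    return sum(xs) / len(xs)

# First stage: effect of assignment on take-up. Reduced form: effect on the outcome.
take_up = (mean([di for di, zi in zip(d, z) if zi == 1])
           - mean([di for di, zi in zip(d, z) if zi == 0]))
itt = (mean([yi for yi, zi in zip(y, z) if zi == 1])
       - mean([yi for yi, zi in zip(y, z) if zi == 0]))

late = itt / take_up  # local average treatment effect for compliers
print(round(late, 2))  # close to the true effect of 2
```

Note that the intention-to-treat effect (here about 1.4) understates the effect on compliers; dividing by the take-up rate recovers it, which is the logic of the second-stage regression.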
Another problematic issue, particularly in field experiments with
long time horizons, is attrition. In this case, the simplest solution for
the researcher is to show that attrition is not related to the treatment
(e.g., attrition rates are similar across conditions) and that attrition
does not interfere with the results (e.g., the results hold when those who
left are eliminated from the analysis). Attrition may be problematic if
dropouts differ among conditions because it may signal that attrition is
related to the manipulation.3 In this case, eliminating dropouts from
the analyses may lead to a biased estimate of the treatment effect.
For example, if those who leave are those for whom treatment had
a smaller or a negative effect, the researcher will overestimate the
3 If attrition is the product of the manipulation and the only outcome of interest,
then different levels of attrition can be the experiment's finding. However, if there
are other outcome variables of interest, and the level of attrition varies across groups,
then it will interfere with the estimate of the average treatment effect because the
researchers do not have data for those who left.
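A simple check of the first condition (similar attrition rates across conditions) is a 2x2 chi-square test on dropout counts; the counts below are hypothetical:

```python
import math

# Hypothetical counts: dropouts vs. completers in each condition.
dropped = {"treatment": 30, "control": 25}
stayed  = {"treatment": 470, "control": 475}

def attrition_chi2(dropped, stayed):
    """2x2 chi-square test that attrition rates are equal across two conditions."""
    a, b = dropped["treatment"], dropped["control"]
    c, d = stayed["treatment"], stayed["control"]
    n = a + b + c + d
    # Classic 2x2 formula: chi2 = n (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p

chi2, p = attrition_chi2(dropped, stayed)
# A large p-value is consistent with attrition being unrelated to treatment.
print(round(chi2, 3), round(p, 3))
```

Passing this test is necessary but not sufficient: equal attrition rates do not rule out that different kinds of participants left each condition, so the robustness check of re-running the analysis without dropouts remains useful.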
7
Limitations of field experiments
As with any other method, field experiments are also subject to criticism
(e.g., Camerer, 2015; Falk and Heckman, 2009). The most common
critique is that field experiments do not allow the researcher to create
the same level of control within the study as is possible in a laboratory
experiment. In the laboratory, the researcher can completely control
not only the manipulation that is going to be introduced, but also
how and when participants will receive it, as well as whether and to
what extent participants interact. Another concern refers to
the generalizability of the results of field experiments, as they usually
come from a single organization and/or a single site. This is an issue
because specific characteristics of the partner organization and of the
experimental participants may influence the results and, thus, they
may not generalize to other organizations/sites. This is a drawback
of any field study. A third limitation refers to the selection of the
partner organization. Usually, researchers do not randomly select an
organization to partner with but rather are dependent on organizations
that self-select to participate in the field experiment. Thus, as with
any self-selection, if these organizations are somehow different from
others, the results may not generalize. This is also a drawback common
1 Even though some research questions seem impossible to answer through exper-
imental methods, some researchers have engaged in ambitious research projects and
provided important insights. For example, Bloom et al. (2013), in a US$1.3 million
field experiment, examine the effects of adopting and using management practices such
as quality controls, inventory management, cost allocation, and performance-based
incentives. They randomly provide management consulting services to Indian textile
factories and compare the performance of these factories to that of a control group.
They find that productivity rose by 17% in the first year due to improved quality
and efficiency, and reduced inventory.
allow for directly addressing complex issues that, at first glance, would
hardly be considered for experimentation due to ethical concerns.2
Despite these counter-arguments, researchers should be aware of
the criticisms faced by field experiments. These critiques should be
weighed as cons that are counterbalanced by the pros when deciding
on the best method to answer a study's research question(s). The
limitations should also be acknowledged, as applicable and appropriate
given the specificities of each project, in the final write-up of the
manuscript.
2 For example, Bertrand and Mullainathan (2004) address the issue of discrimi-
nation in a field experiment in which resumes of job applicants are sent to employers
seeking to hire personnel. The authors find that otherwise identical resumes with
white-sounding names receive 50% more invitations to an interview than those with
black-sounding names.
8
Concluding remarks
References