Field Experiments in Managerial Accounting Research
Sofia M. Lourenço
ISEG, Universidade de Lisboa
and Advance, CSG Research Center, Portugal
slourenco@iseg.ulisboa.pt
Boston — Delft
Contents
1 Introduction 2
2 Causal inference 7
2.1 Correlation, causation, and threats to causal claims . . . . 7
2.2 Remedies used in observational studies . . . . . . . . . . . 9
8 Concluding remarks 57
Acknowledgements 60
References 61
Field Experiments in Managerial
Accounting Research
Sofia M. Lourenço
ISEG, Universidade de Lisboa, and Advance, CSG Research Center,
Portugal; slourenco@iseg.ulisboa.pt
ABSTRACT
The use of field experiments in managerial accounting re-
search has increased substantially in the last couple of years.
One reason for this upsurge is the call in the literature to
address causality more confidently, which can be best ac-
complished with field experiments. This research method
provides a clear mechanism for identifying causal effects
in the field because the researcher introduces exogenous
manipulations in different experimental conditions to which
observational units (e.g., individuals, groups, or companies)
are randomly assigned. Hence, field experiments are well
suited to address managerial accounting phenomena that
are plagued with endogeneity concerns when analyzed by us-
ing the results of retrospective (observational) studies. This
manuscript provides an introduction to field experiments
and is especially directed toward managerial accounting re-
searchers who wish to consider adding this research method
to their toolbox.
1. I will discuss thoroughly the nature of these experiments in Section 3.2, "Types of experiments."
2. In field experiments, participants do not usually know that they are taking part in a study and, as such, there is no self-selection by participants. However, to the extent that in some field experiments the organizations choose to take (or not to take) part in a study, there is self-selection of the partner. This is a limitation of field experiments that is discussed in Section 7, "Limitations of Field Experiments." This drawback is common to the majority of field studies.
3. Presslee et al. (2013) use an experimental study in the field that deals with different types of incentives but, in the typology presented in this manuscript, this study is a quasi-experiment and not a field experiment, as it does not have random assignment of the experimental units to the treatment conditions. The allocation to the treatment conditions was decided by the company in which the quasi-experiment was implemented.
4. The exceptions are Kelly et al. (2017), who use independent retailer companies as experimental units in their field experiment, and Li and Sandino (2018), who use stores.
Table 2.1: Threats to causal claims and econometric remedies

Omitted correlated variable bias
What is this threat? A variable related to both the dependent and the independent variable is omitted from the identification strategy, which can lead to a spurious relationship between the two variables in the model (once the omitted correlated variable is considered in the model, the previous relationship disappears).
Econometric remedies to mitigate the threat∗: Include omitted correlated variables in the model.
Limits to remedies: There can be myriad omitted correlated variables. It is difficult for a researcher to argue that she has included all of the potentially omitted correlated variables in her model.

Selection (sampling) bias
What is this threat? Selection of study units or "treated" units is not random, which can limit the generalizability of the study and its internal validity due to biased estimates of the "treatment" effect.
Econometric remedies to mitigate the threat∗: Correct for selection (sampling) bias with sample selection models (e.g., Heckman, 1979), matching procedures (e.g., Rosenbaum and Rubin, 1983), or balancing procedures (e.g., Hainmueller, 2012).
Limits to remedies: If there are unobservable variables that are determining the selection and are also related to the outcome variable but not considered in the selection model or matching/balancing procedure, the remedies will not solve the selection (sampling) bias. It is difficult for the researcher to argue that she has all relevant variables in her selection model or matching/balancing procedure.

Reverse causality bias
What is this threat? The relationship does not go from the independent to the dependent variable, which is tested and documented by the researcher, but rather from the dependent to the independent variable.
Econometric remedies to mitigate the threat∗: Use longitudinal data and estimate the relationship using lags.
Limits to remedies: Even though this threat can be solved, other threats may still be present.

Simultaneous causality bias
What is this threat? Dependent and independent variables are mutually influencing each other. Because researchers only consider one side of this …
Econometric remedies to mitigate the threat∗: Use simultaneous equation models, instrumental variables (Angrist and Krueger, 2001), or structural models.
Limits to remedies: Some exogenous assumptions (variables) are still needed, which may not be credible or easy to find given the managerial accounting …
2.2. Remedies used in observational studies
and the interpretation of its results. For example, theory can help in
the selection of the variables, not only dependent and independent, but
also control variables that can be used to dismiss concerns regarding
omitted correlated variables. Theory can also help when interpreting
the results. If there are competing hypotheses, theory can provide an
explanation for potential contradictory results. Moreover, theory can
provide arguments for why A causes B and not the inverse. Hence,
theory can help researchers who are using non-experimental data to
make causal claims.
Study design can help alleviate part of the concerns related to causal
inference. For example, by choosing a longitudinal design instead of
a cross-sectional design, researchers will be able to make causal claims
with more confidence. Specifically, researchers will be able to mitigate
the threat of reverse causality by showing that the cause precedes the
effect. Researchers use data from t-1 (or some other lag) for the indepen-
dent variable to estimate the effect on the dependent variable in t. Even
in a cross-sectional study, such as a one-shot survey, researchers may
also be able to collect information from past periods that allows the use
of lagged variables or changes and, hence, enhances causal claims (Van
der Stede, 2014). It is also important in the study design stage to clearly
identify the variables for which data need to be collected. Similar to
the theory of the study, failing to identify and collect data for critical
variables—dependent, independent, and control variables—may limit
the researcher’s ability to make causal inferences. Hence, study design
can help address omitted correlated variables by carefully identifying
and planning to collect data for those variables. However, there may be
myriad potential confounders or difficulties in collecting data that make
it (almost) impossible to confidently dismiss the omitted correlated
variable concern. Researchers can also make use of qualitative field
evidence or institutional knowledge that can help them to make causal
claims (Ittner, 2014). Hence, the study design should consider possibili-
ties of collecting data about the specific setting in which the study is
taking place. That additional knowledge may help the researcher build
a compelling story and make convincing arguments for a certain causal
relationship.
though the researcher does not need an experiment to show that the
cause precedes the effect, only longitudinal data, other problems of
observational (retrospective) studies can still be present (e.g., omitted
correlated variables bias and selection bias).
Finally, the researcher needs to consider simultaneous causality bias.
This problem can be solved by simultaneous equations models, although
some exogenous variation is needed to ensure that the causal link of
interest can be identified. The common solution is to use instrumental
variables (Angrist and Krueger, 2001) that deal with endogeneity issues
in general. The problem in this case is to find a good instrument,
which needs to be relevant (significantly related to the endogenous
independent variable) and exogenous (not related to other variables
in the model). Another approach to making causal inferences with
observational data is to use structural modeling (Gow et al., 2016). This
approach requires the researcher to analytically model the phenomenon
of interest, which provides insights into the mechanisms that drive the
phenomenon. Nevertheless, because observational data are used, these
models still require the researcher to make some causal (exogeneity)
assumptions to ensure that the model is solvable.
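As an illustration of the instrumental-variables logic discussed above, the following sketch runs two-stage least squares by hand on simulated data. It is not from the original text: the data-generating process, coefficient values, and variable names are all assumptions chosen so that an omitted confounder biases OLS while a relevant, exogenous instrument recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# u is an unobserved confounder affecting both x and y; z is a valid
# instrument -- relevant (it moves x) and exogenous (unrelated to u).
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 1.0 * z + 1.0 * u + rng.normal(size=n)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)  # true causal effect of x on y is 2.0

def ols(X, target):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, target, rcond=None)[0]

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

beta_ols = ols(X, y)[1]  # biased upward: x is correlated with the confounder u

# Stage 1: project x on the instrument. Stage 2: regress y on the fitted x.
x_hat = Z @ ols(Z, x)
beta_iv = ols(np.column_stack([np.ones(n), x_hat]), y)[1]

print(f"OLS estimate:  {beta_ols:.2f} (biased away from 2.0)")
print(f"2SLS estimate: {beta_iv:.2f} (close to the true effect of 2.0)")
```

The sketch also makes the relevance requirement concrete: if the first-stage relationship between z and x were weak, the fitted values would carry little variation and the second-stage estimate would be unreliable.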
Difference-in-differences estimators (e.g., Athey and Imbens, 2006;
Card and Krueger, 1994), fixed effect estimators, and regression discontinuity
designs (e.g., Hahn et al., 2001; Thistlethwaite and Campbell, 1960)
are also part of the econometric tools that are available for quantitative
empirical accounting researchers who want to make causal claims using
retrospective data (Gow et al., 2016). These tools help researchers to
mitigate the threats previously described by controlling for baseline
differences, accounting for time-invariant and/or unit-invariant unob-
servable factors, and comparing more similar “treated” and “non-treated”
groups, respectively. However, these tools are also subject to criticism
in their ability to completely solve these problems (e.g., Abadie, 2005;
Bertrand et al., 2004). Overall, even though the use of these techniques
helps to create better counterfactuals, it always remains an open question
how time-variant or unit-variant unobservable factors have affected
non-random groups.
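To make the difference-in-differences logic concrete, here is a minimal simulation sketch; it is illustrative rather than drawn from any study in the text, and the group sizes, baseline gap, trend, and effect size are assumed. Two non-random groups start from different baselines and share a common time trend; differencing out each group's own pre-period removes both the baseline gap and the trend.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Treated units start from a higher baseline -- a level confound that a
# plain cross-sectional comparison would absorb into the "effect".
treated = rng.integers(0, 2, size=n)
baseline = 10.0 + 4.0 * treated + rng.normal(size=n)
trend = 1.0     # common time trend affecting both groups
effect = 2.5    # true treatment effect

y_pre = baseline + rng.normal(size=n)
y_post = baseline + trend + effect * treated + rng.normal(size=n)

# Naive post-period comparison mixes the 4.0 baseline gap into the estimate.
naive = y_post[treated == 1].mean() - y_post[treated == 0].mean()

# Difference-in-differences: compare each group's change over time,
# so group-specific levels and the common trend cancel.
did = (y_post[treated == 1] - y_pre[treated == 1]).mean() \
    - (y_post[treated == 0] - y_pre[treated == 0]).mean()

print(f"naive post-only comparison:  {naive:.2f}")  # ~6.5 = baseline gap + effect
print(f"difference-in-differences:   {did:.2f}")    # ~2.5, the true effect
```

Note what the sketch does not fix: if the groups had different trends (a violation of the parallel-trends assumption), the differenced estimate would again be biased, which is the criticism referenced above.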
Writing objectively about the study and acknowledging the limita-
tions of using observational data to make causal inferences can also help
the researcher to gain the goodwill of the reader to accept the causality
explanation that is offered. It is important that the researcher clearly
delineates any assumptions of the causal inference, for example which
variables are regarded as exogenous and which are regarded as endoge-
nous. Likewise, the researcher should separate the causal arguments
from the quantitative or qualitative evidence provided. This evidence
may only show correlation, at which point it is up to the researcher to
argue that the correlation is suggestive of causation in light of her theory
development. Hence, one should be cautious and use the appropriate
terminology, for example using "X is associated with Y" instead of "X
causes Y" (Van der Stede, 2014).
In sum, observational studies have internal and external validity
problems that can be mitigated, albeit not completely eliminated, with
the appropriate use of theory, study design, econometrics/statistical
analyses, and writing. Because of the limitations of all these solutions,
it is up to the researcher to develop her causal inference. Experimental
research is one solution to the identification problems of observational
studies.
3
The experimental method
1. I discuss these procedures in Section 6.
3.2. Types of experiments
the groups are similar at the start of the experiment and, thus, any
differences between them after the introduction of the manipulation
can be attributed to that manipulation. Thus, causal inference can
be directly made as the researcher has built the counterfactual—what
would have happened if the manipulation was not introduced—which is
the control group.
Put differently, the manipulation is an exogenous shock received by
the treatment group, whose units are randomly assigned. Because
the treatment group is, in expectation, similar to the control group and
both co-exist at the same time in the same context, all the identifica-
tion problems of observational studies disappear. There should not be
any omitted correlated variables as the manipulation is exogenous and
thus not correlated with any other variable. Moreover, even if some
variables that are correlated with the outcome are not considered in
the estimation model of the effect, due to randomization these (omit-
ted) variables should be similar across groups and thus should not
interfere in the estimation of the average treatment effect (difference
between groups). Selection bias should likewise not be present because
randomization should make the groups similar in expectation before
the manipulation and the manipulation is orthogonal to the groups.
Reverse and simultaneous causality bias are also not present as the
manipulation is exogenous and precedes the effect. Hence, the cause
cannot be dependent on the effect.
Overall, any endogeneity concern is eliminated with the exogenous
manipulation and the random assignment of experimental units to
treatment and control groups.
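The balancing role of randomization described above can be illustrated with a small simulation. This is a hypothetical sketch: the "motivation" covariate stands in for any characteristic the researcher never measures, and the sample size is arbitrary. Under random assignment, the unmeasured characteristic should be similar across groups in expectation.

```python
import random
import statistics

random.seed(42)
n = 10_000

# An unobserved characteristic (e.g., intrinsic motivation) that the
# researcher never measures and cannot control for.
motivation = [random.gauss(0, 1) for _ in range(n)]

# Random assignment: each unit receives the treatment with probability 1/2,
# independently of its (unobserved) characteristics.
assignment = [random.random() < 0.5 for _ in range(n)]

treat = [m for m, a in zip(motivation, assignment) if a]
control = [m for m, a in zip(motivation, assignment) if not a]

gap = statistics.mean(treat) - statistics.mean(control)
print(f"mean unobserved covariate, treatment - control: {gap:.3f}")  # close to 0
```

Because assignment is independent of every covariate, observed or not, any remaining group difference in the unmeasured characteristic is pure sampling noise that shrinks with sample size, which is exactly why the omitted-variable and selection concerns of observational studies do not arise here.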
2. However, some researchers have questioned whether these differences really exist (e.g., Camerer, 2015; Falk and Heckman, 2009).
Table 3.1: Types of experiments

Laboratory experiment
What is this experiment? An experiment completed by a standard pool of subjects, usually students who have signed up to participate in experimental research, in an artificial environment created by the researcher.
Advantages: High control (high internal validity).
Disadvantages: Limits to the external validity are due to the pool of subjects, self-selection to participate in a study, and an artificial environment.
Examples in Managerial Accounting Research (MAR): See reviews on the use of laboratory experiments in MAR by Bonner et al. (2000), Bonner and Sprinkle (2002), Sprinkle (2003), and Sprinkle and Williamson (2006).

Artefactual field experiment
What is this experiment? Similar to a laboratory experiment, but uses a non-standard pool of subjects.
Advantages: High control (high internal validity); higher external validity than lab experiments due to the pool of subjects used.
Disadvantages: Limits to the external validity are due to the self-selection of subjects to participate in a study and the artificial environment.
Examples in MAR: Not common in MAR, as researchers prefer framed field experiments that make use of participants' knowledge when using non-standard subjects.

Framed field experiment
What is this experiment? Similar to an artefactual field experiment in using a non-standard pool of subjects, but makes use of their specific knowledge/background in a given task.
Advantages: High control (high internal validity); higher external validity than artefactual field experiments due to the type of task used.
Disadvantages: Limits to the external validity are due to the self-selection of participants to participate in the study and the artificial environment (although close to what is experienced by the participants in the field).
Examples in MAR: Libby (1975); Harrison et al. (1988); Roodhooft and Warlop (1999); Bol and Smith (2011).

Natural field experiment (or field experiment)
What is this experiment? An experiment done in a naturally occurring environment (not created by the researcher) with subjects who are normally not aware that they are part of a study and participate by performing their daily routine/work/life.
Advantages: High control (high internal validity); greater external validity due to the absence of self-selection by participants, the realism of a field setting, and real stakes.
Disadvantages: Control is not as strict as in lab experiments; single-site field experiments and self-selection of partner organizations may limit the generalizability of the results; difficulties in replicability.
Examples in MAR: Aranda and Arellano (2010); Lourenço (2016); Kelly et al. (2017); Casas-Arce et al. (2017); Lourenço et al. (2018); Eyring and Narayanan (2018); Li and Sandino (2018).

Natural experiment
What is this experiment? Shocks (changes) to naturally occurring environments that researchers use to ex post create the treatment group (those who have received the shock) and the control group (those who have not received the shock).
Advantages: The use of field settings allows the researcher to address a broader set of topics than when using other types of experiments, because certain manipulations are not ethically accepted.
Disadvantages: Lack of randomization limits causal inference.
Examples in MAR: Not common in MAR due to the phenomena studied, but the strongest design for identifying causal effects when using naturally occurring data in financial accounting research (e.g., Gippel et al., 2015; Gow et al., 2016).

Quasi-experiment
What is this experiment? Similar to natural experiments, but the researcher can be involved in the design of the experiment (similar to a field experiment), even though she does not apply randomization.
Advantages: Similar to natural experiments or field experiments without randomization (lower internal validity).
Disadvantages: Lack of randomization limits causal inference.
Examples in MAR: Presslee et al. (2013).
3. In accounting research, artefactual field experiments and framed field experiments have been called in the past behavioral field experiments or field experiments (e.g., Barrett, 1971; Belkaoui, 1980; Boatsman and Robertson, 1974; Harrison et al., 1988; Ortman, 1975; Patton, 1978; Roodhooft and Warlop, 1999; Sharma and Iselin, 2003).
4. Further limitations of field experiments are discussed in Section 7, "Limitations of field experiments."
use these shocks to ex post create the treatment group (those who have
received the shock) and the control group (those who have not received
the shock). Natural experiments can be the only type of experiment
available if the manipulation by the researcher is not possible or
desirable (e.g., when ethical concerns are present). In this case, natural
experiments are the preferred, i.e., the most robust, research design for
naturally occurring (not manipulated) data (Gippel et al., 2015). Quasi-
experiments are similar to natural experiments in that they do not
employ random assignment to treatment and control groups. However,
some decision-maker (e.g., CEO of a firm, director of a governmental
agency, or a politician) has a say in the allocation of the experimental
units (e.g., individuals, groups, or organizations) to the treatment con-
ditions (Shadish et al., 2002). The allocation can also be the product
of self-selection of the experimental units to the treatment conditions.
Even though natural experiments and quasi-experiments are similar in
their non-random assignment to conditions, quasi-experiments can be
prospective studies wherein the researcher collaborates with a partner
organization, designs the manipulation, and collects data. The researcher
can also collect observable measures that dictate the assignment to the
treatment conditions, which allows her to later account for them in the
identification strategy. Some researchers do not differentiate between
quasi-experiments and natural experiments as they share the same
drawback—the lack of random assignment to the different conditions
(e.g., Harrison and List, 2004).
Due to their nature, i.e., exogenous shocks, natural experiments are
not common in managerial accounting research, which is focused on
decisions and behaviors that are internal, i.e., within an organization.
However, natural experiments have become the Holy Grail for identifying
causal effects in financial accounting research (e.g., Gippel et al., 2015;
Gow et al., 2016), and are now relatively common (e.g., Christensen
et al., 2016; Gormley et al., 2013; Kinney et al., 2011; Li and Zhang,
2015). An example of a quasi-experiment in managerial accounting
research is Presslee et al. (2013). In this study, the researchers designed
the manipulations, but the assignment of experimental units to the
treatment conditions was determined by the collaborating research site.
4
The promise of field experiments for managerial
accounting research
1. For example, in the review by Hesford et al. (2006), experiments are the fourth most widely used method in managerial accounting (12.7% of all manuscripts considered in the selection), after frameworks (19.5%), analytical (18.4%), and surveys (16.3%).
5
Some examples of field experiments in managerial accounting research
Despite the advantages of field experiments and the rise of this method
in other streams of research, the use of field experiments in managerial
accounting research is relatively scarce. Examples of recently published
field experiments in managerial accounting research mainly deal with the
implementation of strategic performance measurement systems (Aranda
and Arellano, 2010) or information sharing systems (Li and Sandino,
2018), and the performance effects of different types of incentives (Kelly
et al., 2017; Lourenço, 2016) or feedback (Casas-Arce et al., 2017; Eyring
and Narayanan, 2018; Lourenço et al., 2018). In this section, and given
the focus of this manuscript, I compare those studies with regard to
the multiple design choices that researchers have to make in their field
experiment. Table 5.1 presents the main findings of this comparison.
Table 5.1 shows that feedback and incentives are the most com-
mon topics addressed in field experiments in managerial accounting
research. This is not surprising given that field experiments within
one organization, which are the most frequent, tend to have individ-
uals as observational units. This leads to the analysis of individual
level effects and, as such, manipulations are also at the individual
level. The other two experiments that are considered focus on the
Table 5.1: Some examples of field experiments in managerial accounting research

Aranda and Arellano (2010)
Topic: PMS. Industry: Banking. Participants (unit of randomization): Managers.
Design: Between-subjects, 2 conditions. Manipulations: BSC, alternative PMS.
Outcome measure: Consensus about strategy (survey-based).
Questionnaires: Pre, Exp. Analysis of HTE: No. Data: Pre, Exp. Duration: 4 months.

Lourenço (2016)
Topic: Incentives. Industry: Retail. Participants: Sales reps.
Design: Between-subjects, 2×2×2. Manipulations: Monetary incentives, feedback, recognition.
Outcome measure: Sales relative to goals.
Questionnaires: Pre, Post. Analysis of HTE: Yes, in robustness tests. Data: Pre, Exp. Duration: 2 months.

Kelly et al. (2017)
Topic: Incentives. Industry: Retail. Participants: Sales reps (retailers).
Design: Between-subjects, 2 conditions. Manipulations: Cash, tangible rewards.
Outcome measure: Sales.
Questionnaires: Post. Analysis of HTE: No, but additional analysis of an effort measure (video views). Data: Pre, Exp. Duration: 6 months.

Casas-Arce et al. (2017)
Topic: Feedback. Industry: Repair. Participants: Professionals.
Design: Between-subjects, 2×2. Manipulations: Feedback frequency, feedback detail.
Outcome measure: Detractors (Net Promoter Score), process measures (use of PDA, on-time services).
Questionnaires: Pre. Analysis of HTE: Yes, in robustness tests. Data: Pre, Exp, Post. Duration: 3 months.

Lourenço et al. (2018)
Topic: Feedback. Industry: Healthcare (practice). Participants: Physicians.
Design: Between-subjects, 2 conditions. Manipulations: Feedback, no feedback.
Outcome measure: E-prescribing rate.
Questionnaires: Pre, Post. Analysis of HTE: Yes, main and additional analyses. Data: Pre, Exp, Post. Duration: 12 months.

Eyring and Narayanan (2018)
Topic: Feedback. Industry: Education. Participants: On-line students.
Design: Between-subjects, 3 conditions. Manipulations: No RPI, RPI with median, RPI with top quartile.
Outcome measure: Activity level (process measure), grade (performance measure).
Questionnaires: Pre, Post. Analysis of HTE: Yes, main and additional analyses. Data: Pre, Exp. Duration: 2 months.

Li and Sandino (2018)
Topic: Information sharing. Industry: Retail (stores). Participants: Sales reps.
Design: Between-subjects, 2 conditions. Manipulations: Information sharing system for creative work, no information sharing system.
Outcome measure: Quality of creative work, job engagement (attendance), financial performance (sales).
Questionnaires: Pre, Post. Analysis of HTE: Yes, as additional analyses. Data: Pre, Exp. Duration: 4 months.
1. The approval of the Institutional Review Board will be discussed in Section 6.5.
the fact that these types of employees represent the largest pool in the
partner organization and, as such, researchers maximize the power of
the experiments they are conducting. In some instances, these frontline
employees are considered as a group (Kelly et al., 2017; Li and Sandino,
2018), likely because either the research question focuses on an organi-
zational outcome, it is not possible to obtain individual data, or it is
undesirable to have multiple treatments within that group due to the
risk of contamination. Contamination refers to participants learning
about other treatments different from the one that they are experienc-
ing in the study. This is a threat to the study because participants
may react to information about multiple treatments rather than to
the specific treatment they are receiving. Thus, contamination can
invalidate the inferences of a study. Aranda and Arellano (2010) are
an exception to the use of frontline employees as participants because
they use middle managers. This option makes sense given the research
question of their study—the effect of the balanced scorecard (BSC) on
middle managers’ consensus with the strategy established by the top
management team. Managers are an interesting pool of participants,
given their decision-making power in many managerial accounting and
management control decisions. Hence, future field experiments in this
stream of research are likely to more often use managers as participants.
However, power concerns need to be taken into account given that the
number of managers within an organization is smaller than the number
of front-line employees, but clever research designs can mitigate this
concern.2
Eyring and Narayanan (2018) are also an exception to the use of
frontline employees because participants in their study are students,
who are closer to "customers" than to "employees". The use of
customers as the unit of analysis promises to be a fruitful choice for
future field experiments in managerial accounting research. Customers
are usually a larger population than employees and, thus, sample size
(and the power of the experiment) increases substantially. The use of
customers (or their decisions) as the unit of analysis is common in
other literatures, such as marketing (e.g., Ngwe et al., 2019).
2. Experimental designs will be discussed in Section 6.4.
by which the effects are taking place. To date, not many studies have
explored the possibility of having multiple outcome measures and those
that do tend to use some type of process measures (e.g., Casas-Arce
et al., 2017; Eyring and Narayanan, 2018; Li and Sandino, 2018). Thus,
there is room for researchers to be more creative about the outcome
measures considered in their studies and how, ultimately, these measures
translate into “hard” performance outcomes.
All of the studies described in Table 5.1 use questionnaires. The
combination of questionnaires and field experiments is fruitful be-
cause it allows the researcher to collect data other than what is al-
ready available in the partner organization. Questionnaires can be
done before the start of the experiment (pre-questionnaire), during
the experiment (exp-questionnaire), or after the end of the experiment
(post-questionnaire). Pre-questionnaires normally collect data in order
to establish a successful randomization, dismiss concerns about con-
tamination (e.g., Casas-Arce et al., 2017), and/or obtain baseline levels
of the outcome variable (e.g., Aranda and Arellano, 2010) or some
other variable used in the study (e.g., Li and Sandino, 2018). These
additional variables can be used to test heterogeneous treatment effects
(i.e., whether the treatment effect varies according to some observable
characteristic), identify the process or mechanism by which treatment
effects occur, or dismiss alternative explanations for the documented
effects. Exp-questionnaires occur during the experiment and aim to
collect process measures that can help researchers to explain the
mechanism by which the effects are taking place. This type of measure
can also come from "hard" data that the experiment produces (e.g.,
Kelly et al., 2017) or from the partner organization (e.g., Eyring and
Narayanan, 2018). Post-questionnaires occur after the experiment is
finished and can help researchers in performing manipulation checks
and additional analyses based on the perceptions of the participants
regarding the different manipulations.
Even though there are many advantages of using questionnaires
within a field experiment study, researchers must also consider the risks
associated with them. The major risk is that questionnaires prompt
participants to react in a certain way. Thus, the effects documented in
the field experiment may not be due to the manipulation per se but
The majority of the studies included in Table 5.1 perform some type of
heterogeneous treatment effect analysis. However, they do so mainly in
additional analyses (e.g., Li and Sandino, 2018). Only two studies use
heterogeneous treatment effects in the main analysis, i.e., they develop
different hypotheses for subsamples of their participants’ pool (Eyring
and Narayanan, 2018; Lourenço et al., 2018). In both studies, baseline
performance is the variable used for the heterogeneous treatment effect
analysis. Heterogeneous treatment effects promise to be a fruitful area
for future research. In the field, several opposite forces may come into
play and lead to average null effects. However, in subsamples of the
population there may be (positive and negative) statistically significant
effects that can only be documented by using heterogeneous treatment
effects. Researchers need to consider these opposite forces upfront so
that they can develop their hypotheses according to the characteristics
of the population.
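The scenario described above, where opposite subgroup effects cancel in the average but can be recovered by conditioning on an observable baseline characteristic, can be sketched with a short simulation. This is illustrative only: the subgroup split, effect sizes, and variable names are assumptions, and the interaction regression is a generic way to estimate subgroup effects rather than the specification of any study in Table 5.1.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4_000

# Observable baseline characteristic (e.g., above-median baseline performance)
# and random assignment to treatment.
high_baseline = rng.integers(0, 2, size=n)
treatment = rng.integers(0, 2, size=n)

# Opposite subgroup effects: +2 for low-baseline units, -2 for high-baseline
# units, so the average treatment effect is roughly zero.
effect = np.where(high_baseline == 1, -2.0, 2.0)
y = 5.0 + effect * treatment + rng.normal(size=n)

ate = y[treatment == 1].mean() - y[treatment == 0].mean()

# Interaction regression: y ~ 1 + treatment + high_baseline + interaction.
X = np.column_stack([np.ones(n), treatment, high_baseline,
                     treatment * high_baseline])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"average treatment effect:        {ate:.2f}")              # ~0: cancels
print(f"effect for low-baseline units:   {beta[1]:.2f}")          # ~+2
print(f"effect for high-baseline units:  {beta[1] + beta[3]:.2f}")  # ~-2
```

A researcher who looked only at the average here would conclude there is no effect; the interaction coefficients reveal the two offsetting responses, which is why the subgroup hypotheses need to be developed upfront.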
Data available in all experiments presented in Table 5.1 include
data prior to the experiment and data during the experiment. As
mentioned previously, these types of data allow researchers to apply
more sophisticated estimation techniques (e.g., difference-in-differences,
fixed effects) and analyze heterogeneous treatment effects that are
related to the baseline level of the outcome variable. Two studies also
use post-experimental data. This type of data allows the analysis of
the effects once the manipulations are removed (Lourenço et al., 2018)
or when they are spread to the entire pool of participants (e.g., Casas-
Arce et al., 2017). This kind of analysis can be an important feature of
the study because researchers can document whether participants go
back to the previous level of performance once the stimulus is removed
or whether they converge to the same level of performance once all
participants experience the same stimulus. Both of these issues have
great relevance to and impact for practitioners.
Finally, for the studies included in Table 5.1 there is a wide variation
in the duration of the experiment, i.e., the time during which participants
experience the manipulations. The smallest experimental window is two
months (e.g., Eyring and Narayanan, 2018; Lourenço, 2016) and the
largest is one year (Lourenço et al., 2018). Despite this difference, the
number of interactions with participants is not that large; in Lourenço
There are a few papers that provide insight on how to run field ex-
periments. For example, List (2011) enumerates 14 tips to conduct a
field experiment in economics.1 This advice ranges from methodolog-
ical issues, such as having a proper control group and large enough
sample size, to practicalities in the relationship with the partner or-
ganization, such as having a sponsor in the organization and making
the best use of organizational dynamics. Another important guide to
field experiments, particularly on randomization, is Duflo et al. (2008).
The authors discuss (i) the advantages of randomization, (ii) power
calculations, (iii) methods to implement randomization, and (iv) issues
that can interfere with statistical inference based on experimental data,
such as partial compliance and attrition.
Based on my own experience in running field experiments, in the
following subsections I offer practical advice on several dimensions
that are important for a successful field experiment project.
1 These tips are also reproduced in Floyd and List (2016).
6.3 Collaboration
covariates that can interfere with the identification of the treatment ef-
fect and that should be used for blocking (Imai et al., 2008). Conversely,
a simple (non-blocked) randomization could produce groups that are
unbalanced on variables correlated with the outcome of interest and, in
an extreme case, on the baseline level of the outcome variable itself. If
the sample size is
small, even a blocked randomization may lead to covariate imbalance,
which is a threat to identification. In these cases, some authors advocate
for rerandomization until covariate balance is achieved (Morgan and
Rubin, 2012). The randomization strategy should also specify whether
the units will be randomly assigned at a single point in time (e.g., before
the start of the experiment), in different waves (e.g., defined at the start
of the experiment), or in real time (e.g., randomization occurs as the
units join the experiment).
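To illustrate, the following sketch (with hypothetical units and a hypothetical balance threshold) pairs units on a baseline covariate, randomizes within each pair, and, in the spirit of Morgan and Rubin (2012), redraws the assignment until the group means of the covariate are close:

```python
import random
import statistics

random.seed(42)

# Hypothetical units with a baseline covariate to balance on.
units = [{"id": i, "baseline": random.gauss(100, 15)} for i in range(40)]

def blocked_assignment(units, key):
    """Pair units with similar covariate values and randomize within each pair."""
    ranked = sorted(units, key=key)
    assignment = {}
    for i in range(0, len(ranked), 2):
        pair = ranked[i:i + 2]
        random.shuffle(pair)
        assignment[pair[0]["id"]] = "treatment"
        if len(pair) > 1:
            assignment[pair[1]["id"]] = "control"
    return assignment

def rerandomize(units, key, max_gap=1.0, tries=1000):
    """Redraw the blocked assignment until group means differ by less than max_gap."""
    for _ in range(tries):
        assignment = blocked_assignment(units, key)
        groups = {"treatment": [], "control": []}
        for u in units:
            groups[assignment[u["id"]]].append(key(u))
        gap = abs(statistics.mean(groups["treatment"])
                  - statistics.mean(groups["control"]))
        if gap < max_gap:
            return assignment, gap
    raise RuntimeError("no balanced assignment found")

assignment, gap = rerandomize(units, key=lambda u: u["baseline"])
print(sum(1 for g in assignment.values() if g == "treatment"), round(gap, 2))
```

With pair matching on a single covariate, balance is usually achieved on the first draw; rerandomization matters more when blocking is coarse, several covariates must be balanced, or the sample is small.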
The choice of the randomization strategy may be linked to the power
the researchers want to have in the experiment. Sample size is critical
to gain sufficient statistical power to reject the null hypothesis of a zero
effect. Hence, researchers are advised to run power calculations prior
to the experiment. These power calculations enable the researcher to
estimate the number of observations needed to detect a pre-defined effect
size, for a given power and significance level. Power calculations can be
enhanced if the researcher has (i) baseline data of the outcome variable,
(ii) control variables correlated with the outcome, and (iii) information
about intra- or inter-cluster correlation in the case of group randomiza-
tion. There is specific software, such as Optimal Design, that researchers
can use to compute power calculations. Even though power calculations
enable the researcher to make better design choices, researchers may
not have all the data needed to run those calculations. Running
a pilot study, if possible, can give the researchers those data. However,
the pilot study may end up being the experiment itself if the collaborat-
ing research site is unwilling to extend randomization, and thus the
experiment, to all employees of the organization; in that case, power
calculations may simply not be feasible. Nevertheless, if the researcher
has baseline data available (even though it is not guaranteed that the
distribution of the outcome variable will remain the same during the
experiment), she can gain some insight into the sample size needed.
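As an illustration of the basic calculation (ignoring the refinements from baseline data, covariates, or clustering mentioned above), the normal-approximation sample size for a two-group comparison of means takes only a few lines of standard-library Python:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided, two-sample comparison of means,
    using the normal approximation: n = 2 * ((z_alpha + z_power) / d) ** 2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# A "small" standardized effect of 0.2 SD needs roughly 393 participants per group;
# a "medium" effect of 0.5 SD needs roughly 63.
print(n_per_group(0.2), n_per_group(0.5))  # 393 63
```

Dedicated tools such as Optimal Design additionally handle clustered randomization, where the intra-cluster correlation can sharply inflate the required sample size.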
If the sample size is fixed by definition, that is, the research site has a
6.6 Pilot
first trial or as the only trial, the researcher should make the most out
of the pilot.
6.7 Implementation
6.8 Analysis
The treatment status estimates are then used as regressors in the second
stage regression, which estimates the local average treatment effect in
the outcome variable of interest (Duflo et al., 2008). In most managerial
accounting field experiments, compliance will not be an issue because
the units (e.g., employees of an organization) receive the manipulation
as part of their job and are therefore unlikely to fail to comply with
the randomization protocol. Noncompliance in the sense of implemen-
tation departures from the design can, however, be a problem, particu-
larly in multi-location and/or multi-period field experiments. In this
case, the researcher has to identify where and when those deviations
took place and decide what data to analyze or exclude. The best strat-
egy will always be for the researcher to act ex ante, closely monitoring
the implementation and making sure that the design is appropriately
put into practice. Likewise, the contamination of treatments, i.e., par-
ticipants learning that other treatments are in place and reacting to
that (and not to their specific manipulation), can also complicate the
analysis because the researcher cannot easily disentangle the two effects ex post.
Again, the best strategy is to design an experiment in which the risk of
contamination is minimized (for example, by opting to randomize at
the group level and not at the individual level).
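With a single binary assignment indicator, the two-stage procedure described above collapses to the Wald estimator: the intention-to-treat effect on the outcome divided by the first-stage effect of assignment on take-up. A self-contained sketch on simulated data (all numbers hypothetical; the true effect is set to 2):

```python
import random
random.seed(7)

# Simulated experiment with partial compliance:
# z = random assignment, d = treatment actually received, y = outcome.
n = 10_000
z = [random.randint(0, 1) for _ in range(n)]
complier = [random.random() < 0.7 for _ in range(n)]  # 70% comply
d = [zi if c else 0 for zi, c in zip(z, complier)]    # never-takers ignore assignment
y = [5.0 + 2.0 * di + random.gauss(0, 1) for di in d]  # true treatment effect = 2

def mean(xs):
    return sum(xs) / len(xs)

# First stage: effect of assignment on take-up. Reduced form: effect on the outcome.
take_up = (mean([di for di, zi in zip(d, z) if zi == 1])
           - mean([di for di, zi in zip(d, z) if zi == 0]))
itt = (mean([yi for yi, zi in zip(y, z) if zi == 1])
       - mean([yi for yi, zi in zip(y, z) if zi == 0]))

late = itt / take_up  # local average treatment effect for compliers
print(round(late, 2))  # close to the true effect of 2
```

Note that the intention-to-treat effect (here about 1.4) understates the effect on compliers; dividing by the take-up rate recovers it, which is the logic of the second-stage regression.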
Another problematic issue, particularly in field experiments with
long time horizons, is attrition. In this case, the simplest solution for
the researcher is to show that attrition is not related to the treatment
(e.g., attrition rates are similar across conditions) and that attrition
does not interfere with the results (e.g., the results hold when those who
left are eliminated from the analysis). Attrition may be problematic if
dropouts differ among conditions because it may signal that attrition is
related to the manipulation.3 In this case, eliminating dropouts from
the analyses may lead to a biased estimate of the treatment effect.
For example, if those who leave are those for whom treatment had
a smaller or a negative effect, the researcher will overestimate the
3 If attrition is the product of the manipulation and the only outcome of interest,
then different levels of attrition can be the experiment's finding. However, if there
are other outcome variables of interest, and the level of attrition varies across groups,
then it will interfere with the estimate of the average treatment effect because the
researchers do not have data for those who left.
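A simple check of the first condition (similar attrition rates across conditions) is a 2x2 chi-square test on dropout counts; the counts below are hypothetical:

```python
import math

# Hypothetical counts: dropouts vs. completers in each condition.
dropped = {"treatment": 30, "control": 25}
stayed  = {"treatment": 470, "control": 475}

def attrition_chi2(dropped, stayed):
    """2x2 chi-square test that attrition rates are equal across two conditions."""
    a, b = dropped["treatment"], dropped["control"]
    c, d = stayed["treatment"], stayed["control"]
    n = a + b + c + d
    # Classic 2x2 formula: chi2 = n (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p

chi2, p = attrition_chi2(dropped, stayed)
# A large p-value is consistent with attrition being unrelated to treatment.
print(round(chi2, 3), round(p, 3))
```

Passing this test is necessary but not sufficient: equal attrition rates do not rule out that different kinds of participants left each condition, so the robustness check of re-running the analysis without dropouts remains useful.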
7
Limitations of field experiments
As with any other method, field experiments are also subject to criticism
(e.g., Camerer, 2015; Falk and Heckman, 2009). The most common
critique is that field experiments do not allow the researcher to create
the same level of control within the study as is possible in a laboratory
experiment. In the laboratory, the researcher can completely control
not only the manipulation that is going to be introduced, but also
how and when participants will receive it, as well as whether and to
what extent participants interact. Another concern refers to
the generalizability of the results of field experiments, as they usually
come from a single organization and/or a single site. This is an issue
because specific characteristics of the partner organization and of the
experimental participants may influence the results and, thus, they
may not generalize to other organizations/sites. This is a drawback
of any field study. A third limitation refers to the selection of the
partner organization. Usually, researchers do not randomly select an
organization to partner with but rather are dependent on organizations
that self-select to participate in the field experiment. Thus, as with
any self-selection, if these organizations are somehow different from
others, the results may not generalize. This is also a drawback common
1 Even though some research questions seem impossible to answer through exper-
imental methods, some researchers have engaged in ambitious research projects and
provided important insights. For example, Bloom et al. (2013), in a US$1.3 million
field experiment, examine the effects of adopting and using management practices such
as quality controls, inventory management, cost allocation, and performance-based
incentives. They randomly provide management consulting services to Indian textile
factories and compare the performance of these factories to that of a control group.
They find that productivity rose by 17% in the first year due to improved quality
and efficiency, and reduced inventory.
allow for directly addressing complex issues that, at first glance, would
hardly be considered for experimentation due to ethical concerns.2
Despite these counter-arguments, researchers should be aware of
the criticisms faced by field experiments. These critiques should be
weighed as cons that are counterbalanced by the pros when deciding
on the best method to answer a study's research question(s). The
limitations should also be acknowledged, as applicable and appropriate
given the specificities of each project, in the final write-up of the
manuscript.
2 For example, Bertrand and Mullainathan (2004) address the issue of discrimi-
nation in a field experiment in which resumes of job applicants are sent to employers
seeking to hire personnel. The authors find that otherwise identical resumes with
white-sounding names receive 50% more invitations to an interview than those with
black-sounding names.
8
Concluding remarks
References