Article

Statistical Methods in Medical Research
2019, Vol. 28(12) 3697–3711
© The Author(s) 2018
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/0962280218814570
journals.sagepub.com/home/smm

Now trending: Coping with non-parallel trends in difference-in-differences analysis

Andrew M Ryan,1 Evangelos Kontopantelis,2 Ariel Linden3 and James F Burgess Jr4,†

Abstract
Difference-in-differences (DID) analysis is used widely to estimate the causal effects of health policies and interventions.
A critical assumption in DID is ‘‘parallel trends’’: that pre-intervention trends in outcomes are the same between treated
and comparison groups. To date, little guidance has been available to researchers who wish to use DID when the parallel
trends assumption is violated. Using a Monte Carlo simulation experiment, we tested the performance of several
estimators (standard DID; DID with propensity score matching; single-group interrupted time-series analysis; and
multi-group interrupted time-series analysis) when the parallel trends assumption is violated. Using nationwide data
from US hospitals (n = 3737) for seven data periods (four pre-intervention and three post-intervention), we used
alternative estimators to evaluate the effect of a placebo intervention on common outcomes in health policy (clinical
process quality and 30-day risk-standardized mortality for acute myocardial infarction, heart failure, and pneumonia).
Estimator performance was assessed using mean-squared error and estimator coverage. We found that mean-squared
error values were considerably lower for the DID estimator with matching than for the standard DID or interrupted
time-series analysis models. The DID estimator with matching also had superior performance for estimator coverage.
Our findings were robust across all outcomes evaluated.

Keywords
Difference-in-differences, health policy, Monte Carlo simulation, non-parallel trends, health services research

1 Introduction
Difference-in-differences (DID) analysis has experienced an explosion of popularity in health policy, health
economics, and medicine.1,2 The key identifying assumptions of DID are ‘‘common shocks’’ and ‘‘parallel
trends.’’3 The ‘‘common shocks’’ assumption holds that other phenomena occurring at the same time or after
the start of treatment will affect the outcomes of the treatment and comparison groups equally. The parallel trends
assumption states that, although treatment and comparison groups may have different levels of the outcome prior
to the start of treatment, their trends in pre-treatment outcomes should be the same. This implies that, absent
treatment, outcomes for the treatment and comparison groups are expected to change at the same rate.
A combination of visual and statistical evidence of parallel trends is often required for researchers to support
the validity of DID. But what if trends are not parallel? This is common in health policy applications in which
treatment and comparison groups may vary with respect to critical characteristics that determine outcomes.4
Guidance for researchers under these circumstances has been limited.

1 Department of Health Management and Policy, School of Public Health, University of Michigan, Ann Arbor, MI, USA
2 Center for Health Informatics, University of Manchester, Manchester, UK
3 Department of Medicine, Medical School, University of California, San Francisco, CA, USA
4 Veterans Affairs Boston Health Care System, US Department of Veterans Affairs, Boston University School of Public Health, Boston, MA, USA
† Deceased 26 June 2017.

Corresponding author:
Andrew M Ryan, University of Michigan, School of Public Health, 1415 Washington Heights, SPH II, Rm. M3124, Ann Arbor, MI 48109, USA.
Email: amryan@umich.edu

In this paper, we use a Monte Carlo simulation experiment to evaluate the accuracy of alternative estimators
when the parallel trends assumption is violated. In our simulation experiment, we seek to evaluate the effect of an
imaginary policy on clinical process quality (adherence with evidence-based guidelines) and 30-day mortality
among patients hospitalized in the United States with common medical conditions.

2 Estimating treatment effects with non-parallel trends


Using Rubin’s potential outcome framework,5 for each individual (i) drawn from a large population, the outcome
(Y_i), treatment (T_i), where T_i = 1 for those receiving treatment and T_i = 0 for those not receiving treatment, and
pre-treatment covariates (Xi) are observed. The average treatment effect on the treated (ATT)—the typical
quantity of interest in DID analysis—is given by

$$\text{ATT} = E(y_1 - y_0 \mid T = 1) \tag{1}$$

In typical DID analysis, y_1 is observed if a unit received treatment and y_0 is observed if a unit did not receive treatment, but the two are never observed simultaneously for any case. As shown by Wooldridge,6 the observed outcome for both treated and non-treated groups can be written as

$$Y = Y_0 + T(Y_1 - Y_0) \tag{2}$$

If treatment is statistically independent of the potential outcomes (as with random assignment), equation (2) is easily estimable as the difference in mean expectations: ATT = E(y | T = 1) − E(y | T = 0). In non-randomized studies, however, estimating the treatment effect shown in equation (2) is much more challenging. In a cross-sectional context, treatment and outcomes must be independent conditional on covariates (X) for consistent estimation of the treatment effect using standard regression analysis (i.e., not instrumental variables). That is

$$\text{ATT} = E(y_0 \mid X, T) + T\left[E(y_1 \mid X, T) - E(y_0 \mid X, T)\right] \tag{3}$$

In practice, because all relevant covariates are rarely observed, it is difficult to meet the assumption that treatment
and outcomes are independent, conditional on covariates.7 This is where DID analysis has a key advantage: under
the assumption that unobserved covariates are time-invariant, and that any ‘‘shocks’’ (S) in the post-treatment
period are the same for the treated and untreated units, unmeasured and unobserved covariates can be partialed
out, allowing the ATT to be consistently estimated. For the post-treatment and pre-treatment periods

$$\begin{aligned}
\text{ATT} ={}& \left[E(y_{1,\mathrm{post}} \mid X_{T=1}, S, T) - E(y_{0,\mathrm{post}} \mid X_{T=0}, S, T)\right] \\
&- \left[E(y_{1,\mathrm{pre}} \mid X_{T=1}, T) - E(y_{0,\mathrm{pre}} \mid X_{T=0}, T)\right] \\
={}& \left[E(y_{1,\mathrm{post}} \mid T) - E(y_{0,\mathrm{post}} \mid T)\right] - \left[E(y_{1,\mathrm{pre}} \mid T) - E(y_{0,\mathrm{pre}} \mid T)\right]
\end{aligned} \tag{4}$$

The time-invariant terms X_{T=1}, X_{T=0}, and S cancel out through the differencing process. Equation (4) can be estimated using the following regression equation

$$Y = \beta_0 + \beta_1 T + \beta_2\,\mathrm{post} + \delta\,(T \times \mathrm{post}) + \varepsilon \tag{5}$$
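For concreteness, equation (5) can be estimated by ordinary least squares on a long-format panel. The sketch below (Python with statsmodels; the toy data and column names are ours, not the authors') reads the DID estimate of the ATT off the treatment-by-post interaction, with standard errors clustered at the unit level:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy panel: 200 hospitals, 7 periods (4 pre, 3 post), true treatment effect = 0.
rng = np.random.default_rng(0)
n_units, n_periods = 200, 7
df = pd.DataFrame({
    "unit_id": np.repeat(np.arange(n_units), n_periods),
    "time": np.tile(np.arange(1, n_periods + 1), n_units),
})
df["treat"] = (df["unit_id"] < n_units // 2).astype(int)  # first half treated
df["post"] = (df["time"] >= 5).astype(int)                # intervention at t = 5
df["y"] = 0.5 * df["treat"] + rng.normal(size=len(df))    # level difference only

# Equation (5): the coefficient on treat:post is the DID estimate of the ATT.
did = smf.ols("y ~ treat + post + treat:post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit_id"]})
print(did.params["treat:post"], did.bse["treat:post"])  # should be near 0
```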

But when is it safe to assume that unobserved factors are actually time invariant in DID analysis? As a
heuristic, researchers often lean on the logic of the ‘‘parallel trends’’ assumption: in the pre-intervention period,
some set of observed and unobserved covariates influence the levels and trends in the outcomes for both
the treatment and comparison groups. If the observed trends are similar (parallel), it may be reasonable to
assume that observed and unobserved covariates are not changing differentially between treatment and
comparison groups prior to the intervention and therefore may not change differentially after the intervention.
However, if observed pre-intervention trends are different, it implies that the observed and unobserved covariates
may be changing at a differential rate for treatment and comparison groups, which may continue into the post-
intervention period. This undermines the identifying assumption related to time-invariant, unobserved covariates.

Instead, the true ATT is given by


$$\begin{aligned}
\text{ATT} ={}& \left[E(y_{1,\mathrm{post}} \mid X_{T=1,\mathrm{post}}, S, T) - E(y_{0,\mathrm{post}} \mid X_{T=0,\mathrm{post}}, S, T)\right] \\
&- \left[E(y_{1,\mathrm{pre}} \mid X_{T=1,\mathrm{pre}}, T) - E(y_{0,\mathrm{pre}} \mid X_{T=0,\mathrm{pre}}, T)\right] \\
={}& \left[E(y_{1,\mathrm{post}} \mid X_{T=1,\mathrm{post}}, T) - E(y_{0,\mathrm{post}} \mid X_{T=0,\mathrm{post}}, T)\right] \\
&- \left[E(y_{1,\mathrm{pre}} \mid X_{T=1,\mathrm{pre}}, T) - E(y_{0,\mathrm{pre}} \mid X_{T=0,\mathrm{pre}}, T)\right]
\end{aligned} \tag{6}$$

Estimating the ATT in equation (6)—when trends between treatment and comparison groups are not
parallel—is not straightforward. One approach involves modeling differential trends: this in effect acts as a
proxy for modeling underlying changing covariates, allowing them to be partialed out
$$Y = \beta_0 + \beta_1 T + \beta_2\,\mathrm{post} + \beta_3\,\mathrm{time} + \beta_4\,(T \times \mathrm{time}) + \delta\,(T \times \mathrm{post}) + \varepsilon \tag{7}$$

These estimators can specify time trends using flexible polynomials, but typically model differential linear trends
between treatment and comparison groups. This is the class of estimators that is most commonly applied when
researchers are concerned about violations to the parallel trends assumption.8–16 When linear trends are specified
for this estimator, the counterfactual is that treatment and comparison groups will continue (linearly) on their
separate, and different, pre-intervention trends after the program is implemented. We call this relaxation of the
parallel trends assumption the ‘‘persistent trends’’ assumption.
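A minimal sketch of the persistent-trends specification in equation (7), reusing the toy panel and imports from the sketch above; the group-specific linear trend enters through the treat:time interaction:

```python
# Equation (7), reusing df and smf from the earlier sketch: a common linear trend
# (time) plus a group-specific trend (treat:time); the ATT is still treat:post.
trend_did = smf.ols("y ~ treat + post + time + treat:time + treat:post",
                    data=df).fit(cov_type="cluster",
                                 cov_kwds={"groups": df["unit_id"]})
print(trend_did.params["treat:post"])
```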
Attempts to control for differential trends may be problematic for a number of reasons. For instance, regression
to the mean may attenuate the post-intervention slope for the group with a stronger pre-intervention trend,
particularly trends defined by few pre-intervention periods.17,18 Unmeasured covariates related to the
differential slope may also give rise to treatment heterogeneity, undermining the common shocks assumption of
DID analysis. Mora and Reggio12 review a number of papers that apply this class of estimator. They propose a
more flexible estimator, yet one that nonetheless relies on modeling differential group trends. Over a longer pre-
intervention period and outcomes with high serial correlation, pre-intervention trends may be sufficiently stable to
plausibly assume that they would continue absent treatment. Under these circumstances, conditioning on trends
and performing DID may reduce bias. On the other hand, if differences in levels between treatment and control
groups lead to differences in trends, it is possible that as groups approach a natural limit for an outcome,
conditioning on trends may introduce bias into the DID estimation.
An alternative to statistically controlling for non-parallel trends is the use of matching estimators.
This approach amounts to choosing a subset of the treatment and comparison groups that have similar pre-
intervention levels or trends. For instance, researchers may match treatment and comparison groups on pre-intervention outcomes and other relevant observables,19,20 use synthetic control methods,21 or apply weighting.22 These approaches assume that the alternative comparison group can provide a
methods,21 or weighting.22 These approaches assume that the alternative comparison group can provide a
counterfactual for the treatment group, regardless of the specific post-intervention path of either group. Rather
than simply making the common shocks assumption, these approaches make a ‘‘common trajectory’’ assumption
for the post-intervention period.23–25
While researchers may match on time-varying or time-invariant covariates—instead of past
outcomes—covariate matching may not be sufficient to address bias in health policy research for several
reasons. First, health services research relies heavily on administrative databases for analyses. While
databases tend to have rich information on covariates (e.g., age, gender, comorbidities, past use of health
care services), these covariates are often weakly correlated with outcomes. Similarly, organizational-level
covariates (such as hospital size, teaching status, and region) also tend to have weak correlations with study
outcomes and generally vary little over time. Matching on a rich set of covariates—strongly correlated with
treatment and outcomes—can reduce bias in DID. However, matching on patient factors or hospital factors
(time-varying or time-invariant) is typically insufficient to re-introduce parallel trends when the assumption is
violated. Instead, matching on past outcomes has greater potential to address estimator bias resulting from
violations to parallel trends.
Recent theoretical and simulation research has argued that matching on time-varying covariates or past
outcomes in the context of DID has the potential to increase estimator bias.26–28 Yet simulation evidence also
suggests that DID estimators that match on past outcomes can yield lower bias than standard DID,
particularly when matching is based on a larger number of pre-intervention periods (between 3 and 30)27,28 and
outcomes exhibit high serial correlation.27 As a result, the potential for using matching to reduce bias in DID
appears to depend on the research context.

Other options to address non-parallel trends include interrupted time-series analysis (ITSA)19 models. For
instance, the single-group (SG) ITSA is often implemented by specifying linear splines for the treated group
around one or more ‘‘knots,’’ typically representing a change in policy.29 Under these models, a pre-
intervention and post-intervention trend in outcome is estimated. The counterfactual is that, absent treatment,
the pre-intervention trend would continue linearly into the treatment period (and thus the estimator is the
‘‘difference in trends’’19). Alternatively, the multi-group (MG) ITSA specifies separate pre- and post-
intervention trends for both treated and comparison groups. The counterfactual is based on the difference
between pre- and post-intervention trends for the comparison group (the ‘‘DID in trends’’19).
Which of these approaches will provide more accurate point estimates for a given outcome under a given set of
circumstances is unclear. This tension is illustrated in Figure 1. Treated and control units both have positive, but
different, pre-intervention trends for the study outcome. In the post-intervention period, the trend for the control
group becomes negative, resulting in a post-intervention mean that is identical to the pre-intervention mean.
Under these circumstances, different estimators (described below) all provide alternative counterfactuals, only
one of which (at most) is correct.
In the current study, we perform a Monte Carlo simulation experiment to better understand the effects of DID
specification choices in the context of non-parallel trends. We compare the performance of DID estimators to
other common evaluation approaches while making different assumptions about the relationship between pre-intervention performance and treatment assignment.

Figure 1. Illustration of counterfactuals for different estimators.

3 Data and quality performance measures


Our simulations used data on hospital-level clinical process performance (i.e., adherence with evidence-based
guidelines for inpatient care) (online Figure A1) and risk-standardized 30-day mortality for three diagnoses
(online Figures A2 to A4). For clinical process performance, we constructed a composite measure of process-
of-care quality from 37 individual measures. The composite is created by using the ‘‘opportunities model,’’ which
is calculated as the sum of successfully achieved measures divided by the sum of opportunities that practices have
to achieve these measures.30 This quality measure is expressed as a percentage. We also evaluate risk-standardized
mortality data for acute myocardial infarction (AMI), congestive heart failure (CHF), and pneumonia.
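Written out, the opportunities-model composite for hospital j, with m indexing the 37 component measures, is

$$\mathrm{composite}_j = 100 \times \frac{\sum_{m=1}^{37} \mathrm{achieved}_{jm}}{\sum_{m=1}^{37} \mathrm{opportunities}_{jm}}$$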
For all outcomes, we used seven periods of annual data between 2008 and 2014 from Hospital Compare,
Medicare’s public quality reporting program.31
For each outcome, we created a balanced panel, excluding hospitals without quality data in each year.
This resulted in a sample of 3582 hospitals for clinical process (25,074 observations), 2260 hospitals for AMI
mortality (15,820 observations), 3414 hospitals for heart failure mortality (23,898 observations), and 3737
hospitals for pneumonia mortality (26,159 observations). To ensure that all our performance measures were
evaluated in common units, we standardized each measure by subtracting the mean (calculated across the entire
study period for all hospitals) and dividing by the standard deviation.
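That is, each hospital-year value y_{jt} was transformed as

$$z_{jt} = \frac{y_{jt} - \bar{y}}{s_y}$$

where \bar{y} and s_y are the mean and standard deviation of the outcome across all hospitals and periods.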
Over the observation period, clinical process performance increased (i.e., improved), 30-day AMI mortality
decreased (i.e., improved), and 30-day congestive heart failure and pneumonia mortality were relatively flat
(Figure 2).

Figure 2. Time series data for study outcomes. Note: CHF: congestive heart failure; AMI: acute myocardial infarction.

4 Simulation
We performed a Monte Carlo simulation experiment to estimate the effect of an imaginary policy on quality.
Expanding on the approach developed in Ryan, Burgess, and Dimick,1 we calculated hospital-level pre-
intervention levels and trends for each outcome. Hospitals’ probability of selection to treatment was then based
on pre-intervention levels and trends according to the following logistic specification
$$\Pr(\mathrm{selection}) = \frac{e^{\alpha_0 + \alpha_1(\mathrm{prelevel} + e) + \alpha_2(\mathrm{pretrend} + u)}}{1 + e^{\alpha_0 + \alpha_1(\mathrm{prelevel} + e) + \alpha_2(\mathrm{pretrend} + u)}} \tag{8}$$

where pre-level and pre-trend are hospitals’ standardized (mean 0, standard deviation 1) pre-intervention levels and
trends for each study outcome, respectively. The terms e and u are random noise (mean 0, standard deviation 0.5).
Based on randomly assigned values of α_1 (the relationship between pre-intervention levels and the log odds of selection), α_2 (the relationship between pre-intervention trends and the log odds of selection), and e and u (random noise), each hospital receives a probability of selection to treatment. Depending on the values of α_1 and α_2, pre-intervention levels or pre-intervention trends may be more important for treatment assignment. While different
hospitals will have different selection probabilities, in our base scenario, the selection equation is specified such that
we expect 50% of all hospitals to be assigned to treatment and 50% to control. Four hypothetical scenarios, varying
with respect to the relationship between pre-intervention levels, trends, and selection probability, are shown in Figure 3.
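A minimal sketch of this selection mechanism (the parameter values below, including α_0 = 0 to target the 50/50 base scenario, are illustrative; the paper randomly assigns α_1 and α_2):

```python
import numpy as np

# Selection to treatment per equation (8); parameter values are illustrative.
rng = np.random.default_rng(1)
n_hosp = 1000
pre_level = rng.standard_normal(n_hosp)  # standardized pre-intervention level
pre_trend = rng.standard_normal(n_hosp)  # standardized pre-intervention trend
a0, a1, a2 = 0.0, 0.5, 1.0               # a0 = 0 targets the 50/50 base scenario
e = rng.normal(0, 0.5, n_hosp)           # noise added to levels
u = rng.normal(0, 0.5, n_hosp)           # noise added to trends

logit = a0 + a1 * (pre_level + e) + a2 * (pre_trend + u)
pr_selection = np.exp(logit) / (1 + np.exp(logit))
treated = (rng.random(n_hosp) < pr_selection).astype(int)
print(treated.mean())  # approximately 0.5 in the base scenario
```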
We presumed that treatment occurred in 2012. All hospitals therefore had four observations in the pre-
intervention period and three in the post-intervention period. After hospitals were assigned to treatment, we assumed a treatment effect of 0, occurring over the course of the post-intervention period. We then

Figure 3. Illustration of pre-intervention levels and trends for heart failure mortality and selection probabilities under alternative
selection scenarios. Note: The x-axis shows hospitals’ pre-intervention trends (for heart failure mortality); the y-axis shows the
probability that a hospital would be selected for treatment given the assignment parameters (α_1 and α_2) in equation (8).

estimated treatment effects using a variety of estimators (see below). The goal of a given estimator was to estimate
an effect of 0. For each estimator, standard errors were calculated to be robust to hospital-level heteroskedasticity.
We ran 10,000 simulations. After each iteration, we captured the mean-squared error (MSE) and coverage
across the entire post-intervention period for each estimator. Coverage is equal to 1 if the true program effect (in
our case 0) is contained within the confidence interval of the estimate, and 0 otherwise.
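For concreteness, the two performance measures can be tallied per iteration as follows (a sketch with simulated stand-in estimates; the 1.96 normal critical value for the 95% confidence interval is our assumption):

```python
import numpy as np

true_effect = 0.0  # the placebo intervention has no effect

def iteration_metrics(estimate, se, z=1.96):
    """Squared error and coverage indicator for one simulation iteration."""
    sq_err = (estimate - true_effect) ** 2
    covered = int(abs(estimate - true_effect) <= z * se)  # 1 if the CI contains 0
    return sq_err, covered

# Stand-in estimates across 10,000 iterations:
rng = np.random.default_rng(2)
results = [iteration_metrics(rng.normal(0, 0.1), 0.1) for _ in range(10_000)]
sq_errs, coverage = (np.array(v) for v in zip(*results))
print(sq_errs.mean(), coverage.mean())  # MSE and coverage rate
```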
To assess how pre-intervention levels and trends affected estimator performance, we regressed MSE and
coverage on differences in pre-intervention levels and trends. We then used postestimation margins to calculate
MSE and coverage at different moments in the distribution of pre-intervention levels and trends. Our main results
show estimator performance for four different combinations of values for pre-intervention levels and trends: (1)
overall (unconditional on specific values of pre-intervention levels and trends); (2) when the absolute difference
between pre-intervention trends is set to 0; (3) when the absolute difference between pre-intervention trends is set
to 0.2 standard deviations of the outcome; and (4) when the absolute difference between pre-intervention trends is
set to 0.4 standard deviations of the outcome (Tables 2 and 3).

5 Estimators
We used several estimators to estimate the treatment effect for hospital j at time t. The standard DID model was
estimated as
$$Y_{jt} = \beta_0 + \beta_1 T_j + \beta_2\,\mathrm{post}_t + \delta\,(T_j \times \mathrm{post}_t) + \varepsilon_{jt} \tag{9}$$

where the treatment effect is identified by the parameter δ.


We also estimated equation (9) among a matched set of hospitals. The only variable used for matching was pre-
intervention levels of the outcome in each of the pre-intervention periods (t−1, t−2, t−3, and t−4). Matching was performed using propensity scores (1:1 with replacement, enforcing common support and calipers of 0.01).32
The matching procedure was implemented in Stata using the user-written command PSMATCH2.33
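A rough Python analogue of this matching step (the paper used Stata's PSMATCH2; the helper below is a simplified, hypothetical stand-in that treats the caliper as the common-support check):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ps_match(pre_outcomes, treated, caliper=0.01):
    """1:1 nearest-neighbor propensity-score matching with replacement.

    pre_outcomes: (n, 4) array of the outcome in periods t-1..t-4.
    treated: (n,) 0/1 treatment indicator.
    Returns matched (treated_index, control_index) pairs.
    """
    ps = LogisticRegression().fit(pre_outcomes, treated).predict_proba(
        pre_outcomes)[:, 1]
    controls = np.flatnonzero(treated == 0)
    pairs = []
    for i in np.flatnonzero(treated == 1):
        gaps = np.abs(ps[controls] - ps[i])
        j = int(gaps.argmin())
        if gaps[j] <= caliper:              # caliper stands in for common support
            pairs.append((i, controls[j]))  # with replacement: controls may repeat
    return pairs
```

Equation (9) would then be re-estimated on the units appearing in these matched pairs.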
The third specification is the single-group interrupted time-series analysis (SG-ITSA) design in which effects
were estimated only for the group exposed to treatment, using data from the treated group alone. We implemented this estimator using linear regression
$$Y_{jt} = \beta_0 + \beta_1\,\mathrm{time}_t + \beta_2\,\mathrm{post}_t + \delta\,(\mathrm{time}_t \times \mathrm{post}_t) + \varepsilon_{jt} \tag{10}$$

where the treatment effect is the average marginal effect of post_t

$$\frac{\partial E(y \mid \mathrm{time})}{\partial\,\mathrm{post}} = \frac{(\beta_2 + \delta\,\mathrm{time})\big|_{\mathrm{time}=5} + (\beta_2 + \delta\,\mathrm{time})\big|_{\mathrm{time}=6} + (\beta_2 + \delta\,\mathrm{time})\big|_{\mathrm{time}=7}}{3} \tag{11}$$

The fourth specification is the multi-group interrupted time-series analysis (MG-ITSA) design. By including
treatment and comparison groups while modeling linear trends, this estimator combines features of the DID and
ITSA estimators

$$Y_{jt} = \beta_0 + \alpha_1\,\mathrm{time}_t + \alpha_2\,(\mathrm{time}_t \times T_j) + \beta_1\,\mathrm{post}_t + \delta\,(T_j \times \mathrm{post}_t) + \varepsilon_{jt} \tag{12}$$

where the treatment effect is identified by the parameter δ.
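Both ITSA estimators can be sketched on the toy panel from section 2's sketches (treatment starting at time 5, as in the simulation); the formula terms follow equations (10) to (12) as printed:

```python
# SG-ITSA, equation (10): treated group only, knot at the treatment start.
tr = df[df["treat"] == 1]
sg = smf.ols("y ~ time + post + time:post", data=tr).fit(
    cov_type="cluster", cov_kwds={"groups": tr["unit_id"]})
b = sg.params
# Equation (11): average marginal effect of post over the post periods t = 5, 6, 7.
ame = sum(b["post"] + b["time:post"] * t for t in (5, 6, 7)) / 3

# MG-ITSA, equation (12): both groups, with a group-specific trend (time:treat).
mg = smf.ols("y ~ time + time:treat + post + treat:post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit_id"]})
print(ame, mg.params["treat:post"])  # delta is the MG-ITSA treatment effect
```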

6 Analysis
After the estimators were computed across the 10,000 simulation iterations, we evaluated their performance using
two measures: MSE and coverage. We also tested the performance of the estimators when treatment and
comparison groups differed with respect to pre-intervention levels and trends. We captured this value using
absolute differences in pre-intervention trends (|trend_pre,treat − trend_pre,control|). To evaluate how differences
in pre-intervention levels and trends affected estimated performance, we used data from our simulation output to
estimate the following generalized linear models (GLM) at the level of the simulation iteration (i)
$$Y_i = \beta_0 + \beta_1\,\big|\mathrm{trend}_{\mathrm{pre,treat}} - \mathrm{trend}_{\mathrm{pre,control}}\big|_i + \beta_2\,\big|\mathrm{level}_{\mathrm{pre,treat}} - \mathrm{level}_{\mathrm{pre,control}}\big|_i + \varepsilon_i \tag{13}$$

where Y is one of our two measures of estimator performance (MSE or coverage). For the MSE models, we
estimated GLM models with a log link (to account for bounding at 0 and the right skew). For the coverage models,
we estimated GLM models with a logit link from the binomial family (to account for bounding between 0 and 1).
We then used post-estimation to generate predictions of MSE and coverage values at the different values of the difference in pre-intervention trends (0, 0.2, and 0.4 standard deviations of the outcome).
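A sketch of these performance regressions (`sim` is a hypothetical per-iteration DataFrame; the Gaussian family for the MSE model is our assumption, as the paper specifies only the links):

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# `sim` is a hypothetical DataFrame with one row per simulation iteration:
# 'mse', 'coverage' (0/1), and absolute differences 'abs_dtrend', 'abs_dlevel'.
mse_glm = smf.glm("mse ~ abs_dtrend + abs_dlevel", data=sim,
                  family=sm.families.Gaussian(sm.families.links.Log())).fit()
cov_glm = smf.glm("coverage ~ abs_dtrend + abs_dlevel", data=sim,
                  family=sm.families.Binomial()).fit()  # logit link by default

# Predictions at trend differences of 0, 0.2, and 0.4 sd of the outcome,
# holding the level difference at its mean (a stand-in for Stata's margins).
new = pd.DataFrame({"abs_dtrend": [0.0, 0.2, 0.4],
                    "abs_dlevel": sim["abs_dlevel"].mean()})
print(mse_glm.predict(new), cov_glm.predict(new))
```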

7 Results
Table 1 shows descriptive statistics for the study outcomes among all eligible hospitals. Pre-intervention levels and
trends differed substantially across the study outcomes. Reliability (evaluating the ratio of within-hospital variation to total variation34) was very high for each measure.
Table 2 shows MSE values across estimators, outcomes, and differences in pre-intervention trends under
our base case simulation. It shows that MSE values were considerably lower for the DID estimator with
matching than for the other study estimators. For instance, for 30-day AMI mortality, overall MSE values
were 0.082 standard deviations of the outcome (sd_y) for the standard DID estimator, 0.002 sd_y for the matching DID estimator, 0.087 sd_y for the SG-ITSA estimator, and 0.096 sd_y for the MG-ITSA estimator. As the
difference in pre-intervention trends increased, MSE values increased for all estimators. However, this increase in
MSE was relatively small for the DID matching estimator (MSE increasing from 0.001 sd_y with no differences in pre-intervention trends to 0.003 sd_y with a difference in pre-intervention trends of 0.4 sd_y) and much larger for other estimators, such as the standard DID (MSE increasing from 0.049 sd_y with no differences in pre-intervention trends to 0.160 sd_y with a difference in pre-intervention trends of 0.4 sd_y). These findings were robust across the
three outcomes assessed.
Table 3 shows estimator performance for the coverage outcome. Patterns of results are similar to those observed
for MSE. Coverage is decreasing in the difference in pre-intervention trends and the DID estimator with matching
has the best performance, greatly outperforming the other approaches. For instance, overall for 30-day AMI
mortality, coverage values were 0.138 for the standard DID estimator, 0.90 for the matching DID estimator, 0.193
for the SG-ITSA estimator, and 0.233 for the MG-ITSA estimator. With differences in pre-intervention trends of 0.4 sd_y for 30-day AMI mortality, coverage values were 0.070 for the standard DID estimator, 0.815 for the
matching DID estimator, 0.005 for the SG-ITSA estimator, and 0.017 for the MG-ITSA estimator.
To further understand the reasons for variation in estimator performance, we plotted the relationship
between pre-intervention levels, trends, and estimator performance. We identified simulation iterations in which
a given estimator generated an estimate with a very low MSE value (MSE ≤ 0.01 sd_y) or a very high MSE value (MSE > 0.20 sd_y) for each study outcome (Figures 4 to 7). This analysis shows that estimates from the DID
estimator with matching were largely invariant to pre-intervention differences in levels and trends. For the
standard DID estimator, differences in trends led to larger errors when they were inversely correlated with
differences in levels (e.g., a positive difference in trends and a negative difference in levels). The opposite tended
to be true for the SG-ITSA and MG-ITSA estimators, where differences in pre-intervention trends led to larger

Table 1. Descriptive statistics of study outcomes.

                                            Standardized clinical   30-day mortality,   30-day mortality,   30-day mortality,
                                            process composite       AMI                 heart failure       pneumonia
Hospitals                                   3582                    2260                3414                3737
Hospital-year observations                  25,074                  15,820              23,898              26,159
Level, mean                                 0.14                    15.37               11.57               11.86
Level, standard deviation                   0.66                    1.72                1.57                1.85
Pre-intervention trend, mean                0.20                    −0.32               −0.17               −0.17
Pre-intervention trend, standard deviation  0.22                    0.63                0.55                0.65
Reliability                                 0.84                    0.84                0.90                0.91
Intraclass correlation                      0.42                    0.42                0.57                0.59

Note: AMI: acute myocardial infarction.



Table 2. Mean-squared error values for estimators and study outcomes.

Outcome                                              DID standard   DID w/matching   SG-ITSA   MG-ITSA

Clinical process
  Overall                                            0.049          0.001            0.223     0.063
  |trend_pre,treat − trend_pre,control| = 0          0.024          0.001            0.202     0.012
  |trend_pre,treat − trend_pre,control| = 0.2 sd_y   0.056          0.001            0.232     0.079
  |trend_pre,treat − trend_pre,control| = 0.4 sd_y   0.130          0.001            0.267     0.522
30-day mortality, AMI
  Overall                                            0.082          0.002            0.087     0.096
  |trend_pre,treat − trend_pre,control| = 0          0.049          0.001            0.025     0.026
  |trend_pre,treat − trend_pre,control| = 0.2 sd_y   0.088          0.002            0.080     0.088
  |trend_pre,treat − trend_pre,control| = 0.4 sd_y   0.160          0.003            0.260     0.296
30-day mortality, heart failure
  Overall                                            0.047          0.001            0.084     0.083
  |trend_pre,treat − trend_pre,control| = 0          0.029          0.001            0.028     0.021
  |trend_pre,treat − trend_pre,control| = 0.2 sd_y   0.053          0.001            0.088     0.085
  |trend_pre,treat − trend_pre,control| = 0.4 sd_y   0.096          0.003            0.276     0.348
30-day mortality, pneumonia
  Overall                                            0.056          0.001            0.148     0.071
  |trend_pre,treat − trend_pre,control| = 0          0.026          0.001            0.116     0.018
  |trend_pre,treat − trend_pre,control| = 0.2 sd_y   0.064          0.001            0.154     0.069
  |trend_pre,treat − trend_pre,control| = 0.4 sd_y   0.160          0.003            0.204     0.265

Note: Results are from 10,000 simulation iterations. DID: difference-in-differences; sd_y: standard deviation of the outcome; SG-ITSA: single-group interrupted time-series analysis; MG-ITSA: multi-group interrupted time-series analysis; AMI: acute myocardial infarction.

Table 3. Coverage values for estimators and study outcomes.

Outcome                                              DID standard   DID w/matching   SG-ITSA   MG-ITSA

Clinical process
  Overall                                            0.123          0.909            0.044     0.233
  |trend_pre,treat − trend_pre,control| = 0          0.281          0.941            0.000     0.764
  |trend_pre,treat − trend_pre,control| = 0.2 sd_y   0.059          0.897            0.011     0.038
  |trend_pre,treat − trend_pre,control| = 0.4 sd_y   0.007          0.829            0.950     0.000
30-day mortality, AMI
  Overall                                            0.138          0.900            0.193     0.233
  |trend_pre,treat − trend_pre,control| = 0          0.212          0.943            0.590     0.581
  |trend_pre,treat − trend_pre,control| = 0.2 sd_y   0.128          0.895            0.079     0.134
  |trend_pre,treat − trend_pre,control| = 0.4 sd_y   0.070          0.815            0.005     0.017
30-day mortality, heart failure
  Overall                                            0.155          0.889            0.164     0.221
  |trend_pre,treat − trend_pre,control| = 0          0.218          0.946            0.546     0.674
  |trend_pre,treat − trend_pre,control| = 0.2 sd_y   0.141          0.874            0.050     0.072
  |trend_pre,treat − trend_pre,control| = 0.4 sd_y   0.088          0.735            0.002     0.003
30-day mortality, pneumonia
  Overall                                            0.137          0.880            0.134     0.237
  |trend_pre,treat − trend_pre,control| = 0          0.227          0.935            0.069     0.675
  |trend_pre,treat − trend_pre,control| = 0.2 sd_y   0.118          0.867            0.146     0.092
  |trend_pre,treat − trend_pre,control| = 0.4 sd_y   0.056          0.751            0.283     0.005

Note: Results are from 10,000 simulation iterations. Coverage is equal to 1 if the true program effect (in our case 0) is contained within the confidence interval of the estimate, and 0 otherwise. DID: difference-in-differences; sd_y: standard deviation of the outcome; SG-ITSA: single-group interrupted time-series analysis; MG-ITSA: multi-group interrupted time-series analysis; AMI: acute myocardial infarction.

Figure 4. Mean-squared error values for clinical process outcome for alternative estimators across differences in pre-intervention levels and trends. Note: y-axis shows the difference in pre-intervention trends (trend_pre,treat − trend_pre,control); x-axis shows the difference in pre-intervention levels (level_pre,treat − level_pre,control). "Very low" MSE defined as MSE ≤ 0.01 sd_y; "very high" MSE defined as MSE > 0.20 sd_y. DID: difference-in-differences; ITSA: interrupted time-series analysis.

errors when they were positively correlated with differences in pre-intervention levels. In the case for which the pre-
intervention treatment trend is more negative than the comparison trend, the intuition behind this pattern of
results is as follows: when we use linear trends (in the ITSA models), the assumed counterfactual is lower than it
should be (leading to positive errors); when we use the standard DID, the assumed counterfactual is higher than it
should be (leading to negative errors).

8 Discussion
Our analysis examined the performance of DID estimators in the context of violations to the parallel trends
assumption. Our findings support two main practical conclusions for DID analysis in the context of non-parallel
pre-intervention trends: (1) the DID estimator with matching has much lower MSE and acceptable coverage,
compared to standard DID or ITSA estimators (coverage for which, in particular, is extremely low and never close
to the nominal 95%) and (2) the DID estimator with matching is least sensitive to deviations from the parallel
trends assumption. These findings were robust across the commonly used health policy outcomes evaluated in this
study (clinical process performance and 30-day risk-standardized mortality).
These results extend our previous work1 by considering different data generating processes, additional
outcomes, and different approaches toward accounting for non-parallel trends. Our results also provide some
bounds to understand the conditions under which the DID matching estimator outperforms the other estimators
considered in this study.
An interesting question that emerges from our analysis is why the estimators that modeled group-specific trends performed so poorly. We found that these estimators were very sensitive to differences in pre-intervention trends
between treatment and comparison groups—these estimators fared the worst in precisely the circumstances in
which they are commonly employed in empirical practice.

Figure 5. Mean-squared error values for 30-day AMI mortality outcome for alternative estimators across differences in pre-intervention levels and trends. Note: y-axis shows the difference in pre-intervention trends (trend_pre,treat − trend_pre,control); x-axis shows the difference in pre-intervention levels (level_pre,treat − level_pre,control). "Very low" MSE defined as MSE ≤ 0.01 sd_y; "very high" MSE defined as MSE > 0.20 sd_y. AMI: acute myocardial infarction.

Our findings can be directly translated to practical advice for researchers when performing DID analysis. In
short, if pre-intervention trends are not parallel, or if past outcomes are associated with changes in outcomes,
researchers should consider matching estimators for DID analysis. Our findings do not support modeling
differential trends in DID analysis in such a context.
Our study has a number of limitations. First, by assuming that the imaginary policy evaluated in our simulation had no effect, we are also assuming that no major events or external policies
affected our study outcomes for US hospitals between 2008 and 2014. While Hospital Value-Based
Purchasing was implemented during this period and was structured to improve these outcomes, recent
evidence suggests that it did not improve these outcomes.35 Also, even if an external event occurred during
this period and affected the study outcomes, there is no clear rationale why it would differentially affect our
inferences about the relative performance of estimators. Second, we did not consider estimator performance
for all possible outcomes in health policy. Given evidence that estimator performance varied across these
outcomes, performance is also likely to vary across outcomes that we have not considered in our study (such
as expenditures). Estimator performance may also vary under different assignment scenarios that we did not
consider, such as different ratios of treatment and comparison units, or misspecification of the relationship between pre-intervention levels, trends, and program assignment. For instance, because our simulation did
not include unobserved confounders between treatment assignment and the study outcomes, conditional
independence may hold after matching on the past outcomes. This may not be the case in other applied
settings. A general concern with matching is the possibility that, with limited overlap in the true distribution of pre-intervention levels and trends, treatment and comparison groups could be matched on noise, rather than signal. In such scenarios, bias from regression to the mean is a concern.36 Understanding the circumstances
under which this may occur is an important consideration for investigators in their own data and a topic for
future research.

Figure 6. Mean-squared error values for 30-day heart failure mortality outcome for alternative estimators across differences in pre-intervention levels and trends. Note: y-axis shows the difference in pre-intervention trends (trend_pre,treat − trend_pre,control); x-axis shows the difference in pre-intervention levels (level_pre,treat − level_pre,control). "Very low" MSE defined as MSE ≤ 0.01 sd_y; "very high" MSE defined as MSE > 0.20 sd_y. CHF: congestive heart failure.

Figure 7. Mean-squared error values for 30-day pneumonia mortality outcome for alternative estimators across differences in pre-intervention levels and trends. Note: y-axis shows the difference in pre-intervention trends (trend_pre,treat − trend_pre,control); x-axis shows the difference in pre-intervention levels (level_pre,treat − level_pre,control). "Very low" MSE defined as MSE ≤ 0.01 sd_y; "very high" MSE defined as MSE > 0.20 sd_y.

Third, we did not consider the universe of DID estimators (such as triple-difference estimators,37,38 lagged
dependent variable models,27 synthetic control and generalized synthetic control models, ITSA matching
approaches)20,21,39 and specification choices. We also did not evaluate the performance of estimators across all
relevant factors, such as the number of pre-intervention periods and serial correlation of the outcomes.27 Other
matching techniques40 or weighting techniques may outperform the simple propensity score matching routine that
we used in this study. Instead, we focused our study on estimators that are commonly used in applied health policy
research. Future research should examine the performance of additional estimators for different data generating
processes. Finally, we could have considered other measures of estimator performance (e.g., power), but we felt
that MSE and coverage are the two most relevant measures and are adequate to provide a concise overview of
performance.

9 Conclusion
DID analysis is a crucial tool for policy analysis. Our study suggests that specification choices in DID have major
effects on estimator bias, particularly when pre-intervention trends are not parallel. In such scenarios, we found
that matching on past outcomes can improve inference in DID models. Yet caution should be used when
combining matching with DID. Recent research has highlighted the potential for bias when matching,
particularly on past outcomes, is combined with DID.26–28 The bias, created by matching on noise which leads
to mean reversion, appears to be most severe in cases where few pre-intervention periods are used to match and
there is high outcome variation and low serial correlation.26–28 In the current study, matching was performed using
four pre-intervention periods and among outcomes with relatively high serial correlation (as seen by the high
reliability of the outcomes). Research from O'Neill et al. found that the performance of the DID estimator with matching on past outcomes improved from when matching was based on three pre-intervention periods to when matching was performed using 10 and 30 periods.27 Work from Chabé-Ferret found that DID with matching
outperformed standard DID when matching was performed on past outcomes using three pre-intervention
periods.28 Identifying the specific circumstances under which alternative estimators yield more accurate
inference remains a critical topic for future research.

Declaration of conflicting interests


The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this
article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD
Andrew M Ryan http://orcid.org/0000-0002-2566-7763
Evangelos Kontopantelis http://orcid.org/0000-0001-6450-5815

References
1. Ryan A, Burgess JF and Dimick J. Why we should not be indifferent to specification choices for difference-in-differences. Health Serv Res 2015; 50: 1211–1235.
2. Bertrand M, Duflo E and Mullainathan S. How much should we trust differences-in-differences estimates? Q J Econ 2004;
119: 249–275.
3. Angrist JD and Pischke J-S. Mostly harmless econometrics: an empiricist’s companion. Princeton: Princeton University
Press, 2009.
4. Ryan AM, Krinsky S, Maurer KA and Dimick JB. Changes in hospital quality associated with hospital value-based
purchasing. N Engl J Med 2017; 376: 2358–2366.
5. Rubin D. Estimating causal effects of treatments in randomized and non-randomized studies. J Educ Psychol 1974; 66:
688–701.
6. Wooldridge J. Estimating average treatment effects. Econometric analysis of cross sectional and panel data, 2nd ed.
Cambridge: The MIT Press, 2010, pp.903–982.

7. Dowd BE. Separated at birth: statisticians, social scientists, and causality in health services research. Health Serv Res 2011;
46: 397–420.
8. Wolfers J. Did unilateral divorce laws raise divorce rates? A reconciliation and new results. Am Econ Rev 2006; 96:
1802–1820.
9. Jayachandran S, Lleras-Muney A and Smith KV. Modern medicine and the twentieth century decline in mortality:
evidence on the impact of sulfa drugs. Am Econ J Appl Econ 2010; 2: 118–146.
10. Currie J, Walker R, Currie BJ and Walker R. Traffic congestion and infant health: evidence from E-ZPass. Am Econ J Appl
Econ 2011; 3: 65–90.
11. Redding S, Sturm D and Wolf N. History and industry location: evidence from German airports. Rev Econ Stat 2011; 93: 814–831.
12. Mora R and Reggio I. Treatment effect identification using alternative parallel assumptions. Working paper, 2013, http://hdl.handle.net/10016/16065.
13. Dimick J, Nicholas L, Ryan A, Thumma J and Birkmeyer J. Bariatric surgery complications before
and after implementation of a national policy restricting coverage to centers of excellence. J Am Med Assoc 2013; 309:
792–799.
14. McWilliams J, Landon B, Chernew M and Zaslavsky A. Changes in patients’ experiences in Medicare Accountable Care
Organizations. N Engl J Med 2014; 371: 1715–1724.
15. McKinnon B, Harper S, Kaufman J and Bergevin Y. Removing user fees for facility-based delivery services: a difference-
in-differences evaluation from ten sub-Saharan African countries. Health Policy Plan 2015; 30: 432–441.
16. Amaral-Garcia S, Bertoli P and Grembi V. Does experience rating improve obstetric practices? Evidence from Italy. Health Econ 2015; 24: 1050–1064.
17. Chay KY, McEwan PJ and Urquiola M. The central role of noise in evaluating interventions that use test scores to rank
schools. Am Econ Rev 2005; 95: 1237–1258.
18. Pritchett L and Summers LH. Asiaphoria meets regression to the mean. NBER Working Paper No. 20573, 2014.
19. Linden A. Conducting interrupted time series analysis for single and multiple group comparisons. Stata J 2015; 15:
480–500.
20. Linden A. A matching framework to improve causal inference in interrupted time-series analysis. J Eval Clin Pract 2018;
24: 408–415.
21. Abadie A, Diamond A and Hainmueller J. Synthetic control methods for comparative case studies: estimating the effect of California's Tobacco Control Program. J Am Stat Assoc 2010; 105: 493–505.
22. Linden A and Adams J. Applying a propensity-score based weighting model to interrupted time series data: improving
causal inference in program evaluation. J Eval Clin Pract 2011; 17: 1231–1238.
23. Fronstin P, Sepulveda M and Roebuck M. Medication utilization and adherence in a health savings account-eligible plan.
Am J Manag Care 2013; 19: 118–146.
24. Werner R, Duggan M, Duey K, Zhu J and Stuart E. The patient-centered medical home: an evaluation of a single private
payer demonstration in New Jersey. Med Care 2013; 51: 487–493.
25. Ryan A, Burgess JF, Borden W, Pesko M and Dimick J. The early effects of Medicare’s mandatory hospital pay-for-
performance program on quality. Health Serv Res 2015; 50: 81–97.
26. Daw J and Hatfield L. Matching and regression to the mean in difference-in-differences analysis. Health Serv Res 2018; 53: 4138–4156.
27. O'Neill S, Kreif N, Grieve R, Sutton M and Sekhon JS. Estimating causal effects: considering three alternatives to difference-in-differences estimation. Health Serv Outcomes Res Methodol 2016; 16: 1–21.
28. Chabé-Ferret S. Should we combine difference in differences with conditioning on pre-treatment outcomes? TSE Working Paper No. 17-824, Toulouse, 2017.
29. Kontopantelis E, Doran T, Springate DA, Buchan I and Reeves D. Regression based quasi-experimental approach when
randomisation is not an option: interrupted time series analysis. BMJ 2015; 350: h2750.
30. Landrum MB, Bronskill SE and Normand ST. Analytic methods for constructing cross-sectional profiles of health care providers. Health Serv Outcomes Res Methodol 2000; 1: 23–47.
31. Ryan A, Nallamothu BK and Dimick JB. Medicare’s public reporting initiative on hospital quality had modest or no
impact on mortality from three key conditions. Health Aff 2012; 31: 585–592.
32. Stuart E. Matching methods for causal inference: a review and a look forward. Stat Sci A Rev J Inst Math Stat
2010; 25: 1–21.
33. Leuven E and Sianesi B. PSMATCH2: Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing, http://ideas.repec.org/c/boc/bocode/s432001.html (2003, accessed 26 March 2015).
34. Winer B, Brown D and Michels K. Statistical principles in experimental design, 3rd ed. New York: McGraw-Hill, 1991.
35. Ryan AM, Krinsky S, Maurer KA and Dimick JB. Changes in hospital quality associated with hospital value-based
purchasing. N Engl J Med 2017; 376: 2358–2366.
36. Linden A. Assessing regression to the mean effects in health care initiatives. BMC Med Res Methodol 2013; 13: 119.

37. Berniell L, de la Mata D and Valdes N. Spillovers of health education at school on parents' physical activity. Health Econ 2013; 22: 1004–1020.
38. Keng S and Sheu S. The effect of national health insurance on mortality and the SES-health gradient: evidence from the
elderly. Health Econ 2013; 22: 52–72.
39. Xu Y. Generalized synthetic control method: causal inference with interactive fixed effects models. Polit Anal 2017; 25:
57–76.
40. Diamond A and Sekhon J. Genetic matching for estimating causal effects: a general multivariate matching method for
achieving balance in observational studies. Rev Econ Stat 2013; 95: 932–945.
