Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

General Articles

Section Editor: Thomas R. Vetter

E SPECIAL ARTICLE

Five Steps to Successfully Implement and Evaluate


Propensity Score Matching in Clinical Research Studies
Steven J. Staffa, MS, and David Zurakowski, MS, PhD

In clinical research, the gold standard level of evidence is the randomized controlled trial (RCT).
The availability of nonrandomized retrospective data is growing; however, a primary concern
of analyzing such data is comparability of the treatment groups with respect to confounding
variables. Propensity score matching (PSM) aims to equate treatment groups with respect to
measured baseline covariates to achieve a comparison with reduced selection bias. It is a
Downloaded from http://journals.lww.com/anesthesia-analgesia by BhDMf5ePHKav1zEoum1tQfN4a+kJLhEZgbsIHo4XMi0hCywCX1AWnYQp/IlQrHD3i3D0OdRyi7TvSFl4Cf3VC4/OAVpDDa8K2+Ya6H515kE= on 06/09/2021

valuable statistical methodology that mimics the RCT, and it may create an “apples to apples”
comparison while reducing bias due to confounding. PSM can improve the quality of anesthesia
research and broaden the range of research opportunities. PSM is not necessarily a magic
bullet for poor-quality data, but rather may allow the researcher to achieve balanced treatment
groups similar to a RCT when high-quality observational data are available. PSM may be more
appealing than the common approach of including confounders in a regression model because
it allows for a more intuitive analysis of a treatment effect between 2 comparable groups.
We present 5 steps that anesthesiologists can use to successfully implement PSM in their
research with an example from the 2015 Pediatric National Surgical Quality Improvement
Program: a validated, annually updated surgery and anesthesia pediatric database. The first
step of PSM is to identify its feasibility with regard to the data at hand and ensure availability
of data on any potential confounders. The second step is to obtain the set of propensity scores
from a logistic regression model with treatment group as the outcome and the balancing fac-
tors as predictors. The third step is to match patients in the 2 treatment groups with similar
propensity scores, balancing all factors. The fourth step is to assess the success of the match-
ing with balance diagnostics, graphically or analytically. The fifth step is to apply appropriate
statistical methodology using the propensity-matched data to compare outcomes among treat-
ment groups.
PSM is becoming an increasingly more popular statistical methodology in medical research. It
often allows for improved evaluation of a treatment effect that may otherwise be invalid due to
a lack of balance between the 2 treatment groups with regard to confounding variables. PSM
may increase the level of evidence of a study and in turn increases the strength and generaliz-
ability of its results. Our step-by-step approach provides a useful strategy for anesthesiologists
to implement PSM in their future research.  (Anesth Analg 2018;127:1066–73)

I
n clinical research, the study design regarded as the gold confound the treatment effect on the outcome of interest.1 Most
standard for level of evidence is the randomized controlled often, anesthesia researchers are not fortunate enough to be
trial (RCT). The RCT is a prospective study design in which able to run a RCT due to the time and resources needed to do
the investigator randomizes subjects to the treatment groups. so. More often, to the anesthesiologist, nonrandomized, retro-
The randomization is intended to inherently balance the 2 spective, observational data are available, because running a
comparison groups with respect to baseline variables that may prospective RCT is very time consuming and requires many
resources. A big concern of analyzing retrospective data is
From the Department of Anesthesiology, Perioperative and Pain Medicine, comparability of the treatment groups with respect to baseline
Boston Children’s Hospital, Harvard Medical School, Boston, Massachusetts. confounding variables.2 The availability of retrospective data
Accepted for publication November 21, 2017. is growing, so techniques to handle imbalances between treat-
Funding: None. ment groups with respect to confounders are ever important.
The authors declare no conflicts of interest. Matching techniques are available to equate treatment
Supplemental digital content is available for this article. Direct URL citations groups with respect to baseline characteristics. Propensity
appear in the printed text and are provided in the HTML and PDF versions of
this article on the journal’s website (www.anesthesia-analgesia.org). score matching (PSM) is an extremely useful matching tech-
American College of Surgeons National Surgical Quality Improvement nique that intuitively achieves the goal of balanced treat-
Program (ACS NSQIP) and the hospitals participating in the ACS NSQIP are ment groups for an assessment of the treatment effect on the
the source of the data used herein; they have not verified and are not respon-
sible for the statistical validity of the data analysis or the conclusions derived
outcome with reduced bias. PSM mimics the RCT, allowing
by the authors. the investigator to perform their data analyses with reduced
Reprints will not be available from the authors. concern of confounding.3 Of course, a RCT cannot be repli-
Address correspondence to Steven J. Staffa, MS, Department of Anesthesiol- cated from nonrandomized data, but PSM is a technique to
ogy, Perioperative and Pain Medicine, Boston Children’s Hospital, 300 Long- get closer to the level of evidence found in a RCT. The bal-
wood Ave, Boston, MA 02115. Address e-mail to Steven.Staffa@childrens.
harvard.edu. ance of treatment groups with respect to baseline variables
Copyright © 2018 International Anesthesia Research Society achieved by PSM is similar to the balance found between
DOI: 10.1213/ANE.0000000000002787 treatment groups due to randomization in a RCT, therefore

1066 www.anesthesia-analgesia.org October 2018 • Volume 127 • Number 4


Copyright © 2018 International Anesthesia Research Society. Unauthorized reproduction of this article is prohibited.
Five Steps to Propensity Score Matching

reducing bias due to confounding. The anesthesiologist can the data tab; SAS/STAT 14.2 (SAS Institute Inc, Cary, NC)
utilize PSM to get closer to the gold standard level I evidence has the capability to perform PSM with its PSMATCH pro-
of a RCT, thus improving the strength of their manuscript. cedure; and R (R Foundation for Statistical Computing,
PSM is an intuitive alternative to a multiple regression Vienna, Austria) features the packages Matching, MatchIt,
modeling approach. PSM incorporates the information pro- and Optmatch. We use R for figures in our clinical example.
vided by the baseline factors into 1 propensity score and is
used to balance the treatment groups of those factors. METHODS
In this statistical primer, we present a 5-step approach Step 1: Identify That PSM Is Viable and
for the anesthesia researcher to successfully implement and Appropriate
evaluate PSM to compare 2 treatments using observational PSM is a viable option when you have nonrandomized, obser-
data. Herein, we may refer to treated and untreated groups vational data of high quality that include information on
or more generally to treatment groups and use these terms baseline factors, exposures and treatments, and outcomes of
interchangeably, as PSM may be used to compare treatment interest.5 Most important is that information on potential con-
to standard care, to compare 2 active treatments, or for com- founding variables is included in the data and that these factors
parisons of related scenarios. PSM is growing in appeal in that are suspected to be related to the outcome of interest seem
the clinical research community and is quickly becoming a to be imbalanced between the 2 treatment groups. The nature
highly regarded approach for analyzing nonrandomized, of the nonrandomized data is such that perhaps 1 treatment is
observational data and adjusting for potentially confounding given to certain kinds of patients more readily in practice.
variables. Each step will be practically explained and applied For our NSQIP example, as noted above, we will exclude
to an anesthesia example. These 5 steps are as follows: the 300 cases where information on weight at surgery
is missing, as this is a confounder that we will consider.
1. Identify that PSM is viable and appropriate
Similar to analyses using multivariable regression, alterna-
2. Calculate the propensity scores
tive approaches to missing data are available and should be
3. Match subjects on the propensity scores
considered by the investigator based on the extent, reason
4. Assess balance diagnostics to determine the quality
for, and potential impact of missing data.6
of the matching
Table 1 shows the comparison of baseline factors between
5. Analyze the propensity-matched cohort
our 2 treatment groups. We should be concerned because
clear imbalances are seen between the 2 oxygen support
DATA SOURCE groups with respect to all baseline factors (all P < .001), and
We will demonstrate each step with an anesthesiology in this population, these factors are also known to be associ-
example using the Pediatric 2015 data from the American ated with the 30-day mortality, our outcome of interest. We
College of Surgeons National Surgical Quality Improvement see that the proportion of neonates is higher in the oxygen
Program (NSQIP). The 2015 American College of Surgeons support group (26.68%) as compared to the no oxygen sup-
NSQIP Pediatric data feature approximately 84,000 cases for port group (3.73%). The same is true for inotropic support
approximately 120 variables ranging from baseline factors (7.58% vs 0.29%), ventilator dependence (49.61% vs 1.50%),
to postoperative outcomes. The NSQIP data are audited to and occurrence of previous cardiac surgery (17.49% vs
ensure high quality and reliability.4 3.12%) comparing the 2 exposure groups. The mean weight
The example that we will use with the NSQIP data is the at surgery in the oxygen support group is 10.7 kg compared
following. We will explore if required oxygen support at to 31.49 kg in the no oxygen support group. PSM should be
the time of surgery is associated with 30-day postoperative considered to treat these imbalances.
mortality. Therefore, oxygen support is the exposure defin- Here, we used parametric tests to compare exposure
ing “treatment” group, and 30-day mortality is the outcome groups with respect to the baseline variables because we
of interest. The measured baseline factors what we suspect to have a very large sample size and our baseline factors are
be confounders are as follows: neonatal status (age <30 days not known to be skewed. If, for instance, we were looking at
at the time of surgery), intravenous inotropic pharmacologic hospital length of stay as a variable to be compared between
support at the time of surgery, ventilator dependence within the groups, we would opt for nonparametric testing because
48 hours before surgery, history of a previous cardiac surgery, hospital length of stay was generally skewed.7 For our con-
and weight at surgery (kg). Because data for weight at surgery tinuous variable (weight at surgery), an independent t test
are missing on 300 surgeries, these have been excluded from was done, and for categorical variables (neonatal status,
the analysis, bringing the total sample size to 83,756. Note that inotropic support, ventilator dependence, and history of a
the 2015 Pediatric NSQIP features a very low 30-day mortality previous cardiac surgery), a χ2 test was performed. The non-
event rate of 0.37% and includes information about all base- parametric analog to the t test that could be necessary is the
line confounding factors, so PSM should be considered. Wilcoxon rank sum test, and the appropriate test for small
We used the PSMATCH2 software package in Stata sample sizes for categorical data is Fisher exact test.
(StataCorp LLC, College Station, TX) to implement PSM
in our example. The Stata code used for the clinical exam- Step 2: Calculate the Propensity Scores
ple will be included in Supplemental Digital Content, The estimated propensity score can be conceptualized as the
Appendix, http://links.lww.com/AA/C198, for reference. patient’s probability of being treated as a function of mea-
Statistical software other than Stata can be used to perform sured baseline covariates. Thus, it is equal to a probability
PSM. SPSS (IBM Corp, Armonk, NY) has a PSM tab under of treatment given baseline factors.8 Most studies consider

October 2018 • Volume 127 • Number 4 www.anesthesia-analgesia.org 1067


Copyright © 2018 International Anesthesia Research Society. Unauthorized reproduction of this article is prohibited.
EE Special Article

Table 1.  Prematched Baseline Covariates


No Oxygen Support Group Oxygen Support Group Total Cohort
(n = 80,446) (n = 3310) (n = 83,756)
Baseline Covariate n or Mean % or SD n or Mean % or SD n or Mean % or SD P Value d
Neonate (age <30 d) 2997 3.73% 883 26.68% 3880 4.63% <.001 0.675
Inotropic support 231 0.29% 251 7.58% 482 0.58% <.001 0.382
Ventilator dependence 1208 1.50% 1642 49.61% 2850 3.40% <.001 1.322
Previous cardiac surgery 2509 3.12% 579 17.49% 3088 3.69% <.001 0.486
Weight at surgery (kg) 31.49 25.25 10.7 15.24 30.68 25.26 <.001 0.997
Abbreviations: d, absolute standardized mean difference; SD, standard deviation.

a binary treatment (treated versus untreated). In that sce-


nario, the propensity score can be estimated from a multi-
variable logistic regression model fit with treatment group
as the outcome. The covariates in the multivariable model
are the baseline factors on which we wish to balance our
treatment groups. Note that many variables can be used as
covariates in this model, and regardless of how many, we
will only obtain a single propensity score.
There are no stringent rules pertaining to the choice of
predictors of treatment group. However, it should be noted
that the model to obtain the propensity scores with treat-
ment group as the outcome cannot include post-baseline
variables that may be modified by treatment.8
A study by Austin8 considered the merits of including dif-
ferent sets of variables in the propensity model, including
“all measured baseline covariates, all baseline covariates that
are associated with treatment assignment, all covariates that
affect the outcome (ie, potential confounders), and all covari-
ates that affect both treatment assignment and the outcome Figure 1. Distribution of propensity scores by treatment group. The
(ie, true confounder).” This built on a study by Austin et al9 overlap in the distributions of propensity scores shows that match-
who performed a simulation study regarding relative ben- ing can be performed between the oxygen support group and no
efits of including the different set of baseline factors. They oxygen support group.
found that using a propensity score model that includes only
true confounders or includes all variables associated with to implement matching.5 Figure 1 below is a mirrored histo-
the outcome reduces bias slightly better than when the pro- gram of the propensity scores by treatment group in the pre-
pensity score model includes all measured variables or only matched “raw” data. We can see that the propensity scores
includes the variables associated with treatment selection. are distributed differently in each treatment group, which
Thus, we recommend the anesthesia researcher include vari- indicates that the groups have patients with generally differ-
ables associated with the outcome(s) of interest that occur ent baseline factors. One requirement of PSM is that there is
temporally before the treatment, a subset of which will be common support, such that we see some overlap in the mir-
confounders. Had an RCT been performed, we would expect ror histogram of propensity scores to be able to better per-
these variables to be balanced at baseline. form matching in our next step. Because oxygen support is
One does not need to perform an iterative model build- fairly rare in this dataset and therefore has a low probability
ing approach for the propensity score model; a covariate in the data, notice that most of the probability mass of the dis-
can still be included even if its associated P value is not sta- tribution of the propensity scores is close to 0, but we do see
tistically significant. The goal of this model is to accurately overlap of propensity scores between the 2 groups of interest.
produce a fitted probability of treatment group while using
variables that predict the outcome or that predict both the Step 3: Match Subjects on the Propensity
treatment and the outcome (a propensity score for each sub- Scores
ject using a logistic regression framework). It should be noted that there are several options for using
Once the logistic regression model has been fit, the pro- the propensity scores after they are obtained. We will focus
pensity score for each patient can be obtained as the fitted on an intuitive and commonly used option: matching using
probability. In our example, the propensity score model the propensity score. The other approaches have certain
includes variables known to be associated with 30-day appeals and limitations but have the same goal to balance
mortality and that are imbalanced between oxygen support treatment groups with respect to the measured baseline fac-
groups: neonatal status, inotropic support, ventilator depen- tors that went in to calculating the propensity scores. The
dence, previous cardiac surgery, and weight at surgery. other approaches are stratification by the propensity score,
Subjects with similar propensity scores may be matched using the propensity score for inverse probability of treat-
to one another, so we must see overlap in the distributions of ment weighting, and including the propensity score as a
propensity scores between the 2 treatment groups to be able regression covariate.8

1068   
www.anesthesia-analgesia.org ANESTHESIA & ANALGESIA
Copyright © 2018 International Anesthesia Research Society. Unauthorized reproduction of this article is prohibited.
Five Steps to Propensity Score Matching

Matching using the propensity score is exactly as it Step 4: Assess Balance Diagnostics to
sounds: most commonly, the investigator starts with a group Determine the Quality of the Matching
of treated patients and selects untreated patients with similar Once the matching is complete, we need to assess the bal-
propensity scores as matches to those treated patients. In this ance of the 2 groups on the baseline factors. If our match-
primer, we demonstrate 1-to-1 nearest-neighbor matching. ing was successful, then the 2 groups will be balanced and
That is, 1 patient from the treated group will get matched therefore we will be able to move on to Step 5 and isolate
with 1 patient from the untreated control group with a simi- a true treatment effect from any confounding. Quality of
lar propensity score.10 However, PSM is flexible enough to matching refers to the extent that the treatment groups
allow for 2-to-1 matching, or k-to-1 matching for any positive are balanced with respect to baseline factors once match-
integer k.11 Note that we choose 1-to-1 matching for demon- ing is done. Since P values comparing the baseline factors
strative purposes in our NSQIP example because it is a very between the 2 treatment groups are highly driven by sam-
common form of PSM. We will lose many data points, and ple size,15,16 we recommend 2 analytic measures of match-
it may be appropriate to perform a k-to-1 matching instead ing quality: the absolute standardized mean difference for
with a large dataset like the NSQIP. The investigator should each baseline factor17,18 and the reduction in pseudo R2 in
consider the number of untreated controls available for the logistic regression model for exposure group (the pro-
matching and potential quality of matches when determin- pensity score model).19
ing k. Other matching techniques that we will not discuss The absolute standardized mean difference is a numeric
but that are available as options to use include Mahalanobis summary that can be calculated for every baseline covari-
metric matching and interval matching. ate, whether continuous or binary. It compares each baseline
When matching is done, several choices must be made. factor between the treatment and the control groups after
The first is between matching with replacement and match- PSM is completed and uses a pooled standard deviation
ing without replacement.12 We will perform matching with- calculation. An absolute standardized mean difference <0.1
out replacement, so once an untreated subject is matched has been taken to indicate a negligible difference between
to a treated subject, they are not available to be matched groups for that covariate.18 For a continuous covariate, the
again to a different treated subject. When matching with- absolute standardized mean difference is defined as follows:
out replacement is performed, the estimates may depend
on the order that the matching is done. To minimize the
d=
(x − x )
t c
influence of possibly systematic orderings of the sub-
jects in the dataset, researchers might consider randomly st2 + sc2
ordering their data.13 Matching with replacement allows 2
an untreated control to act as the match for possibly mul- where xt and xc are the sample means and st2 and sc2 are
tiple-treated patients. This may perform better in the case the sample variances in the treatment and control groups,
where many treated patients are excluded when matching respectively. For a binary baseline factor, the absolute stan-
without replacement is done. This may occur if there is no dardized mean difference is defined as follows:
good overlap (known as common support) between the
propensity scores of the treatment and control patients or if
there are more or a similar number of treated patients than ( p t − p c )
d=
untreated controls. Investigators will need to use methods p t (1 − p t ) + p c (1 − p c )
to account for matching with replacement in Step 5. The sec-
2
ond is between optimal and greedy matching. With greedy
matching, a treated patient is selected, then an untreated where p t and p c are the prevalences in the treatment and
patient with the closest propensity score to that of the the control groups, respectively.17,18 Note that all absolute
treated subject is matched to the chosen treated patient. This standardized mean differences in our example are <0.1,
is done sequentially through the list of treated patients until reflecting balance between the treatment groups (Table 2).
the list is exhausted of treated patients for whom untreated We can look at the pseudo R2 from the logistic regres-
patients can be found.8 The other option is optimal match- sion model for treatment in the unmatched raw data, and
ing, which we use in our clinical example. Optimal match- then again in the matched data. The pseudo R2 is calculated
ing minimizes the total within-pair difference of propensity by any statistical software when running a logistic regres-
score.8 The third choice is the size of the caliper width. The sion. Because pseudo R2 is a measure of predictive value of
caliper width is the maximum difference between propen- a logistic regression model, a reduction in pseudo R2 from
sity scores allowed for 2 patients to potentially be matched. the propensity score model in the raw data compared to the
Austin14 suggests that when matching patients on the pro- same model in the matched data means that the baseline
pensity scores within a certain caliper width, the optimal factors are no longer predictive for determining treatment
caliper width is equal to 0.2 times the standard deviation of group, as desired.19 In our NSQIP example, the pseudo R2 for
the logit of the propensity scores (the logit of a given pro- the propensity score model in the unmatched data is 0.3556
pensity score “p” is equal to the natural logarithm of p/[1 where in the 1-to-1 matched cohort, it is 0.0008. This is a large
− p], also referred to as the log odds). reduction in pseudo R2, so the PSM has successfully reduced
Statistical software packages will perform the matching the predictive value of neonatal status, inotropic support,
since PSM by hand can be tedious for large datasets. Options ventilator dependence, history of a previous cardiac surgery,
for type of matching to be performed can be specified. and weight at surgery on predicting oxygen support group.

October 2018 • Volume 127 • Number 4 www.anesthesia-analgesia.org 1069


Copyright © 2018 International Anesthesia Research Society. Unauthorized reproduction of this article is prohibited.
EE Special Article

It is an indication that a successful balance has been achieved It is best to look at the absolute standardized mean dif-
if the pseudo R2 is reduced and becomes close to 0. There is ferences for categorical baseline factors after matching, but
no rule of thumb for how much of a reduction of pseudo R2 to visualize the balance achieved, we can construct bar
is a reasonable reduction. We wish to see any reduction and graphs of the proportions for each variable by treatment
for the pseudo R2 in the matched data to be close to 0. group before and after matching as shown in Figure 2.
The quality of the PSM can also be assessed graphically. A useful graph for evaluating the quality of PSM is a plot
We wish for the postmatching distributions of the baseline showing the absolute standardized mean difference before
covariates to be similar, not only for the means or propor- and after matching for all variables in the propensity score
tions to be close. For continuous baseline covariates, one model (Figure  3). We can see in Figure  3 that the PSM in
can construct box plots by treatment group to look at the our example was successful because all absolute standard-
distribution of this variable in the matched dataset. In our ized mean differences were >0.10 before matching, and after
NSQIP example, we have 1 continuous baseline covariate: matching, they are all <0.10.
weight at surgery (kg). We see in Figure 2 that the distribu- Balance diagnostics allow the investigator to assess the
tion of weight at surgery is similar in the oxygen support quality of PSM. The absolute standardized mean difference
group and the no oxygen support group. Therefore, PSM and reduction in pseudo R2 provide analytic summaries
has successfully incorporated information about weight at of quality of matching for all (continuous and categorical)
surgery and balanced the treatment groups with respect to baseline factors. Looking at P values alone is not advisable
this continuous variable. because they are highly driven by sample size. Box plots

Table 2.  Propensity-Matched Baseline Covariates


No Oxygen Support Group Oxygen Support Group Total Cohort
(n = 2036) (n = 2036) (n = 4072)
Baseline Covariate n or Mean % or SD n or Mean % or SD n or Mean % or SD P Value d
Neonate (age <30 d) 429 21.07% 435 21.37% 864 21.22% .818 0.007
Inotropic support 27 1.33% 40 1.96% 67 1.65% .109 0.050
Ventilator dependence 489 24.02% 508 24.95% 997 24.48% .489 0.022
Previous cardiac surgery 270 13.26% 248 12.18% 518 12.72% .301 0.032
Weight at surgery (kg) 12.11 15.44 12.56 17.21 12.33 16.35 .373 0.028
Abbreviations: d, absolute standardized mean difference; SD, standard deviation.

Figure 2. Before (A) and after (B) propensity score matching. The graphical assessment shows postmatching balance on the baseline
variables.

1070   
www.anesthesia-analgesia.org ANESTHESIA & ANALGESIA
Copyright © 2018 International Anesthesia Research Society. Unauthorized reproduction of this article is prohibited.
Five Steps to Propensity Score Matching

Figure 3. Absolute standardized mean differences before and after matching. For each variable, the absolute standardized mean difference is
<0.10 after matching, which indicates balance between the treatment groups. PSM indicates propensity score matching.

and bar plots can display the pre- and postmatching bal- outcome, we advise using generalized estimating equations
ance of baseline factors between the 2 treatment groups. with the identity link function and Gaussian family, where
Once balance diagnostics confirm good quality matching the “subject ID” is treated as the matched pair variable.
was done, then we can proceed to the final data analysis. Generalized estimating equation is flexible enough to also
However, good balance is not guaranteed to be observed be applied under logistic regression for binary outcomes
after PSM. The goal of PSM is to achieve baseline covariate and Poisson regression for count outcomes.22 Other analy-
balance between the treatment groups; however, the inves- sis possibilities that appropriately utilize the propensity-
tigator is only able to balance on baseline factors that are matched data include a paired t test for continuous outcome
measured. There may be unmeasured confounders, which and McNemar test for a binary outcome.7 For the analysis
is an issue with all datasets. If poor balance is observed, the of survival or time-to-event outcomes, Austin23 shows that
researcher may want to revise their propensity score model, propensity score methods can be used.
for example, by including interactions or high-order terms After PSM, from a conditional logistic regression model
and then checking the resulting covariate balance.20 Another taking into account propensity-matched pair, we find that
option available to the researcher is to make modifications oxygen support is a statistically significant predictor of
to the matching algorithm. This can include changing the 30-day mortality (odds ratio [OR] = 1.79, P = .012) (Table 3).
caliper width or utilizing matching with replacement if Use of traditional multivariable logistic regression in the
matching without replacement was used. The resulting unmatched data leads to a larger but similar OR (2.77)
covariate balance can be checked after each modification. and a smaller P value (P < .001). It is clear that adjustment
Continued evidence of poor balance may suggest that there for measured baseline factors using multivariable logistic
is insufficient common support and the population being regression controls for confounding as does PSM, because a
considered may not be appropriate for the hypothesis. simple logistic regression without any adjustment for base-
line factors produces the unadjusted OR of 29.88. Our results
Step 5: Analyze the Propensity-Matched Cohort using PSM are unchanged regarding statistical significance
Finally, it is time to analyze the data and evaluate a treat- and direction of the association as compared to using the
ment effect with the reduction of confounding bias due to traditional multivariable regression approach. Because the
PSM. To analyze the data after PSM, the matched pairs must 2015 Pediatric NSQIP collects data from participating insti-
be taken into account.21 tutions across the country, the findings of studies using its
For a binary outcome (like we have in our example with data are generalizable to similar institutions.
30-day mortality), it is appropriate to use a conditional
logistic regression model to assess the effect of the expo- DISCUSSION
sure.8 Conditional logistic regression is needed to take into When observational (nonrandomized) data are being analyzed,
account the matched pairs (or matched sets) created dur- there is no way to transform them into RCT data. PSM does not
ing PSM. It can be implemented by specifying the “group” produce a gold standard level of evidence like the RCT, but it
as being the matched pair (or matched set) from PSM. For is a highly effective methodology to balance treatment groups
continuous outcomes, to assess the treatment effect on the and leads to an intuitive analysis. There are certain strengths

October 2018 • Volume 127 • Number 4 www.anesthesia-analgesia.org 1071


Copyright © 2018 International Anesthesia Research Society. Unauthorized reproduction of this article is prohibited.
EE Special Article

REFERENCES
Table 3.  Comparison of Results With and Without 1. Akobeng AK. Understanding randomised controlled trials.
PSM Arch Dis Child. 2005;90:840–844.
30-d Mortality 2. Rubin DB. The design versus the analysis of observational
Analysis Odds Ratio 95% CI P Value studies for causal effects: parallels with the design of random-
Without PSM unadjusted 29.88 (23.79–37.52) <.001 ized trials. Stat Med. 2007;26:20–36.
Without PSM adjusted 2.77 (1.95–3.94) <.001 3. Rosenbaum PR, Rubin DB. The central role of the propensity
With PSM 1.79 (1.14–2.82) .012 score in observational studies for causal effects. Biometrika.
1983;70:41–55.
Abbreviations: CI, confidence interval; PSM, propensity score matching. 4. American College of Surgeons. Data collection, analysis, and
reporting. Available at: https://www.facs.org/quality-programs/
and limitations of the PSM approach. PSM balances treatment acs-nsqip/program-specifics/data. Accessed June 15, 2017.
groups with respect to confounders, making for a more objective 5. Winger DG, Nason KS. Propensity-score analysis in thoracic
surgery: when, why, and an introduction to how. J Thorac
analysis. Of course, PSM only allows for adjustment for mea- Cardiovasc Surg. 2016;151:1484–1487.
sured confounders, and this limitation is shared with all multi- 6. Rubin DB. Inference and missing data. Biometrika.
variable adjustment methods. The propensity score incorporates 1976;63:581–592.
information on multiple measured baseline factors on which to 7. Lee AH, Fung WK, Fu B. Analyzing hospital length of stay:
balance treatment groups, which cannot be done easily when mean or median regression? Med Care. 2003;41:681–686.
8. Austin PC. An introduction to propensity score methods for
matching by hand. Similar to adjusting for confounders using reducing the effects of confounding in observational studies.
multiple regression analysis, PSM requires the availability of Multivariate Behav Res. 2011;46:399–424.
data on these variables. Researches need to consider their specific 9. Austin PC, Grootendorst P, Anderson GM. A comparison
research scenario and data when considering use of PSM. PSM is of the ability of different propensity score models to balance
measured variables between treated and untreated subjects: a
more desirable when 1-to-1 or k-to-1 matching can be done with- Monte Carlo study. Stat Med. 2007;26:734–753.
out discarding too many data points. Note that we did discard 10. Rosenbaum PR, Rubin DB. Constructing a control group using
many data points due to a “rare exposure” of oxygen support in multivariate matching sampling methods that incorporate the
our NSQIP example to demonstrate 1-to-1 matching, but many- propensity score. Am Stat. 1985;39:33–38.
to-1 matching could have been used to keep a larger sample size. 11. Ming K, Rosenbaum PR. Substantial gains in bias reduction
from matching with a variable number of controls. Biometrics.
Our piece is an educational primer or general frame- 2000;56:118–124.
work; however, there are other aspects of PSM that we 12. Rosenbaum PR. Observational Studies. 2nd ed. New York, NY:
did not cover in this overview. These limitations much be Springer-Verlag; 2002.
acknowledged. We did not cover sample size and power 13. Austin PC. A comparison of 12 algorithms for matching on the
propensity score. Stat Med. 2014;33:1057–1069.
considerations as our focus was developing the 5-step
14. Austin PC. Optimal caliper widths for propensity-score

approach for implementing the propensity scores. Sample matching when estimating differences in means and differ-
size and power should be considered and discussed with a ences in proportions in observational studies. Pharm Stat.
biostatistician for all studies. Additionally, our primer does 2011;10:150–161.
not discuss the issue of adjustment for multiple testing that 15. Austin PC. A critical appraisal of propensity score match-

ing in the medical literature from 1996 to 2003. Stat Med.
may arise in certain research studies. 2008;27:2037–2049.
PSM is not a magic bullet for all observational (nonran- 16. Austin PC. The relative ability of different propensity score
domized) data. Although PSM is more appealing when the methods to balance measured covariates between treated and
outcome of interest is rare, multivariable regression adjust- untreated subjects in observational studies. Med Decis Making.
ment methods are very useful in other scenarios. PSM can 2009;29:661–677.
17. Austin PC. Using the standardized difference to compare
add unneeded steps to an analysis that could be done with- the prevalence of a binary variable between two groups
out it. However, PSM is an important procedure to obtain in observational research. Commun Stat Simul Comput.
balance of observed relevant covariates and allow more 2009;38:1228–1234.
objective group comparisons. 18. Flury BK, Riedwyl H. Standard distance in univariate and mul-
tivariate analysis. Am Stat. 1986;40:249–251.
19. Caliendo M, Kopeinig S. Some practical guidance for the

CONCLUSIONS implementation of propensity score matching. J Econ Surv.
Our 5-step approach provides a useful model for anesthe- 2008;22:31–72.
siologists to implement PSM in their future research. We 20. Imai K, Ratkovic M. Covariate balancing propensity score. J R
Statist Soc. 2014;76:243–263.
have demonstrated an example of 1-to-1 PSM from the 2015
21. Austin PC. Balance diagnostics for comparing the distribution
Pediatric NSQIP. PSM is becoming an increasingly more of baseline covariates between treatment groups in propensity-
popular methodology,24–27 and when implemented correctly, score matched samples. Stat Med. 2009;28:3083–3107.
it increases the level of evidence of a study in turn improv- 22. Austin PC. The performance of different propensity score

ing the strength and generalizability of its results. E methods for estimating marginal odds ratios. Stat Med.
2007;26:3078–3094.
23. Austin PC. The use of propensity score methods with sur-
DISCLOSURES vival or time-to-event outcomes: reporting measures of effect
Name: Steven J. Staffa, MS. similar to those used in randomized experiments. Stat Med.
Contribution: This author helped write the manuscript, conduct 2014;33:1242–1258.
the data analysis, and review the literature. 24. McCluskey SA, Karkouti K, Wijeysundera D, Minkovich L,
Name: David Zurakowski, MS, PhD. Tait G, Beattie WS. Hyperchloremia after noncardiac surgery
Contribution: This author helped write the manuscript, conduct is independently associated with increased morbidity and
the data analysis, and review the literature. mortality: a propensity-matched cohort study. Anesth Analg.
This manuscript was handled by: Thomas R. Vetter, MD, MPH. 2013;117:412–421.

1072   
www.anesthesia-analgesia.org ANESTHESIA & ANALGESIA
Copyright © 2018 International Anesthesia Research Society. Unauthorized reproduction of this article is prohibited.
Five Steps to Propensity Score Matching

25. Trent Magruder J, Grimm JC, Dungan SP, et al. Continuous surgery for mitral regurgitation: a propensity-matched study. J
intraoperative cefazolin infusion may reduce surgical site infec- Cardiothorac Vasc Anesth. 2013;27:445–450.
tions during cardiac surgical procedures: a propensity-matched 27. Perlas A, Chan VW, Beattie S. Anesthesia technique and

analysis. J Cardiothorac Vasc Anesth. 2015;29:1582–1587. mortality after total hip or knee arthroplasty: a retrospec-
26. Monaco F, Biselli C, Landoni G, et al. Thoracic epidural anes- tive, propensity score-matched cohort study. Anesthesiology.
thesia improves early outcome in patients undergoing cardiac 2016;125:724–731.

October 2018 • Volume 127 • Number 4 www.anesthesia-analgesia.org 1073


Copyright © 2018 International Anesthesia Research Society. Unauthorized reproduction of this article is prohibited.

You might also like