Propensity Score

Using Propensity Scores To Adjust For Treatment Selection Bias
R. Scott Leslie, MedImpact Healthcare Systems, Inc., San Diego, CA

ABSTRACT
A key strength of observational studies is the ability to estimate treatment effect on health outcomes in real world
conditions. However, studies that lack randomization of subjects into treatment groups must address selection bias to
properly estimate the effect of treatment as non randomized groups usually differ on observed and unobserved
characteristics. That is, the observed treatment effect may be due to the treatment itself or due to the differential
selection into treatment groups from non-randomization. Conventional regression adjustment, matching, and
stratification using propensity scores are widely used techniques to adjust for selection bias. Included in this tutorial is
a description of these propensity score methods, an explanation of the advantages and disadvantages of each
method, and an application of the methods using various examples. This tutorial is intended for intermediate level
statisticians, SAS programmers, and/or data analysts.
INTRODUCTION AND WHY USE PROPENSITY SCORES

Randomized control trials (RCTs) measure drug efficacy in controlled environments: however, can often be restricted
to subpopulations that limit generalizability of results. Observational studies, on the other hand, can evaluate
treatment effectiveness in routine care settings or everyday use patterns. However, a limitation of observational
studies is the lack of treatment assignment. Non randomized groups usually differ in observed and unobserved
characteristics causing selection bias when evaluating the effect of treatment. For example, the below figure shows
how 2 patient groups may differ in age, gender, health status, and previous medication use.
Patients eligible for Drug A or Drug B
Drug A Patient Characteristics
Drug B Patient Characteristics
Younger, More females,

Currently taking less medications,
Poorer medication use behavior
Older, More male,

Currently taking more medications,
Better medication use behavior
Statistical techniques such as matching, stratification, and regression adjustment are commonly used to account for
differences in treatment groups but may be limited if using too few covariates in the adjustment process. The use of
propensity score techniques avoids this limitation because it can summarize more or all the covariate information into
a single score. So what is a propensity score? The propensity score is the conditional probability of being treated
based on individual covariates. Rosenbaum and Rubin demonstrated that propensity scores can account for
imbalances in treatment groups and reduce bias by resembling randomization of subjects into treatment groups.
By using propensity scores to balance groups, traditional adjustment methods can better estimate treatment effect on
outcomes while adjusting for covariates. One method professed by Ralph B. DAgostino, Jr. to adjust for the non
randomized treatment selection is to use a propensity score method in conjunction with traditional regression
techniques. This process is performed using two steps, the first of which calculates propensity scores as the
probability of patients being included in each treatment group based on pre-treatment observables. The aim of this
step is to create balance treatment groups and simulate random treatment allocation. The second step utilizes the
created propensity scores with ANCOVA to more accurately estimate outcomes and study the possible covariate
predictors.
HOW TO CREATE PROPENSITY SCORES

The logistic model describes the relationship of several independent variables to a dichotomous dependent variable.
Furthermore, logistic regression is used to predict the probability of event occurring as a function of independent
variables (continuous and/or dichotomous)
The logistic model:
P( X ) =
1
1 + e ( + iXi1)
Propensity scores are easily created using PROC LOGISTIC. In the cases described in this paper, the dependent
variable is treatment group (treatment for the patient) and the independent variables are patient and baseline
characteristics. In other cases the dependent variable may be any dichotomous outcome (treated or untreated,
disenrolled or not, visited specialist or not). The GENMOD procedure for generalized linear models may also create
propensity scores by using the OUTPUT statement and keyword PREDICTED.
Example of Creating Propensity Scores Using PROC LOGISTIC
The following example illustrates the use of PROC LOGISTIC to create propensity scores. A large pharmacy claims
database was used to identify 19,433 patients using oral antidiabetic therapy. Patients were categorized into two drug
treatment groups, A and B, with the main objective of comparing compliance and adherence rates. Compliance was
measured as the proportion of days a medication was supplied over a 180 day period. Adherent patients were
identified as those reaching a threshold of 80% compliance. Selection bias is believed to be a factor as the two drug
treatment groups differ in patient tolerance, adverse events, and side effects which possibly influence drug choice and
compliance to each drug. Other variables controlled for in the analyses include demographic variables (age, gender)
and previous medication use/patterns measured in a 6 month baseline period prior to treatment. Previous medication
use was recorded by the use of specific cardiovascular, asthma, and antidepressant medications and previous
medication pattern use was measured by refill patterns of maintenance type medications.
PROC LOGISTIC calculates propensity scores as the conditional probability of each patient receiving a particular
treatment based on pre-treatment variables and can output the propensity score to a data set. In this example,
propensity scores were calculated based on the covariates listed in Table 1 below. The objective was to balance the
treatment groups so to reduce bias of treatment selection and obtain better idea of treatment effect on the outcome of
compliance. The logit function is specified in the LINK option to fit the binary logit model and the RSQUARE option
assesses the amount of variation explained by the independent variables. The propensity score is output to data set
named psdataset. The predicted probabilities are output to a variable named ps.
proc logistic data = wuss;
class naive0;
model tx (event=Drug A) = age female b_hmo pre_drug_cnt_subset naive0
pre_refill_pct copay_idxdrug pre_sulf pre_htn pre_asthma pre_pain pre_lipo
pre_depress
/link=logit rsquare;
output out=psdataset pred=ps;
run;
Where,
tx = treatment selection indicator, age = age of patient at date of treatment initiation (index date),
female = indicator variable, b_hmo= HMO insurance, copay_idxdrug = patient co-payment for initial prescription
All variables with the prefix pre_ describe utilization in the baseline period.
pre_drug_cnt = number of medications (generic) utilized excluding medications for the categories below
pre_sulf = sulfonylurea use
pre_htn = hypertension use
pre_asthma = asthma use
pre_pain = pain medication use
pre_lipo = lipotropic use
pre_depress = antidepressant use
pre_refill_pct = refill percentage for maintenance medications
After creating the propensity scores, an evaluation of the distributions can check comparability of the treatment
groups. Sizeable overlaps among the groups illustrate satisfactory overlap in covariate distributions and indicate that
the groups are comparable.
14
12
P
D
e
r
r
u
c
g
e
n
A
t
10
8
6
4
2
0
14
12
P
D
e
r
r
u
c
g
e
n
B
t
10
8
6
4
2
0
0. 33
0. 35
0. 37
0. 39
0. 41
0. 43
0. 45
0. 47
0. 49
0. 51
0. 53
0. 55
0. 57
0. 59
0. 61
0. 63
Est i m
at ed Pr obabi l i t y
Tables 1 and 2 demonstrate how propensity scoring can balance the groups. Table 1 contains the unadjusted values
(before propensity scores) for the two treatment groups. Descriptors include demographic variables (age, gender),
type of health plan insurance, and medication use in a 6 month period prior to treatment (baseline period). Age was
the age of the patient at date of treatment initiation (index date). Type of insurance was categorized by health
maintenance organization (HMO), Medicare, Medicaid, or self-insured. Prior medication use was measured for
sulfonylureas, antihypertensives, lipid lowering agents as well as asthma and antidepressant medications. Refill
patterns of medications for chronic diseases, or maintenance medications, were used to estimate patients behavior
to following prescribed dosing. The TABULATE procedure was used to create the table below and PROC TTEST
and PROC FREQ was used to test for differences between groups. Those observables that differed significantly in
the two groups of patients include type of insurance (proportion of patients enrolled in an HMO), oral antidiabetic,
sulfonylurea, and pain medication use.
Table 1. Unadjusted Demographic and Baseline Measures
UNADJUSTED VALUES
Drug A
Member Count
9,129
Age
Mean 57.7
SD
10.5
Female
%
43.1%
HMO *
%
73.3%
# of Drugs Utilized
Mean 4.9
Drug B
10,304
57.5
10.7
43.4%
75.3%
5.0
p-value
0.3786
0.7115
0.0018
0.0841
UNADJUSTED VALUES
Maintenance Medication Refill *
Prior Oral Antidiabetic Use *
Sulfonylurea *
Hypertension
Lipid Irregularity
Pain Management *
Antidepressant
Asthma
* p < .05
SD
%
%
%
%
%
%
%
%
Drug A
3.0
59.6%
77.5
55.3%
79.1%
68.3%
23.6%
19.5%
9.3%
Drug B
3.1
60.5%
80.4
51.3%
78.4%
68.5%
24.9%
18.4%
9.9%
p-value
0.0492
<.0001
<.0001
0.2381
0.7509
0.0357
0.0575
0.1967
Table 2 contains the adjusted values (after propensity scores) for the two treatment groups. PROC GLM was used to
compare groups while adjusting for the propensity score. Differences between groups were minimized when using
the propensity score method.
Table 2. Propensity Score Adjusted Demographic and Baseline Measures
WITH PROPENSTIY SCORING
Drug A
Drug B
Member Count
9,129
10,304
Age
Mean 57.4
57.4
SD
10.5
10.7
Female
%
42.9%
42.8%
HMO
%
75.2%
75.1%
# of Drugs Utilized
Mean 4.9
4.9
SD
3.0
3.0
Maintenance Medication Refill
%
60.3
60.3
Prior Oral Antidiabetic Use
%
80.2%
80.7%
Sulfonylurea
%
53.2%
53.6%
Hypertension
%
78.6%
78.7%
Lipid Irregularity
%
68.7%
68.8%
Pain Management
%
24.0%
23.9%
Antidepressant
%
18.8%
18.8%
Asthma
%
9.5%
9.5%
p-value
0.9329
0.9062
0.8922
0.9390
0.0687
0.3088
0.5204
0.8674
0.9303
0.8380
0.9112
0.9407
USE OF PROPENSITY SCORES

Once the propensity score is calculated what to do you with them? As explained above, the 3 methods commonly
used are matching on propensity score, stratification, and regression adjustment.
Regression Adjustment
Continuing with the study example described above, the created propensity scores were used in regression
adjustment where a propensity score weight, also referred to as the inverse probability of treatment weight (IPTW),
was calculated as the inverse of the propensity score (Hogan and Lancaster). The treatment selection model above
modeled the propensity to receive drug a. For those patients receiving drug b, the propensity score would be 1- ps
and the propensity score weight would be the inverse of 1-ps.
data psdataset;
set psdataset;
if druga=1 then ps_weight=1/ps;else ps_weight=1/(1-ps);
run;
Next, a propensity score-weighted linear regression model, using the GLM procedure, was fitted to compare drug
treatment on the outcome of compliance while controlling for other covariates. The LSMEANS statement computes
the least-squares means for the treatment variable allowing for multiple comparisons. The ADJUST=TUKEY option
uses the Tukey-Kramer method to adjust the least-squares means and the PDIFF and CL options give the p values
and corresponding confidence intervals for the differences in the least-squares means.
proc glm data=psdataset;

class tx naive0 female;
model p_dayscovered = tx age female b_hmo pre_drug_cnt_subset naive0 pre_refill_pct
copay_idxdrug pre_sulf pre_htn pre_asthma pre_pain pre_lipo
pre_depress
/solution;
lsmeans tx/OM ADJUST=TUKEY PDIFF CL;
weight ps_weight;
quit;
Results of the model above showed no difference between treatment groups (p = 0.2066). Comparing this result to
unadjusted compliance means (a model with drug treatment as the only independent variable), where there is a
significant difference (p=0.0071), shows the effect of controlling for selection bias and confounding. The increase in
compliance from the unadjusted means (0.6673 and 0.6793) to the adjusted means (0.7032 and 0.7082) was
probably due to the large number of patients with prior use of the drug. These patients had higher compliance values
than those new to therapy and therefore when this variable is placed in the model as a covariate the mean
compliance values rise.
Table 3. Mean Compliance by Treatment: Unadjusted vs. PS Adjusted Models
Model
Treatment
Compliance Outcome
95% Confidence Limits
Unadjusted Model
P = 0.0071
Drug A
0.6673
0.6601
0.6736
Drug B
0.6793
0.6733
0.6852
Propensity Score
Adjusted Model
P = 0.2066
Drug A
0.7032
0.6975
0.7089
Drug B
0.7082
0.7029
0.7135
Stratification
Stratification, subclassification or binning using propensity scores involves grouping subjects into classes or strata
based on the subjects observed characteristics. Once the propensity scores are calculated, subjects are placed into
strata (Cochran states that 5 strata can remove 90% of the bias) with the idea that subjects in the same stratum are
similar in the characteristics used in the propensity score development process. The tutorial by DAgostino details how
to perform this technique. Briefly, quintiles are used to group subjects into five strata after making sure that there is
adequate propensity score overlap between treatment groups. To prove that the propensity scores removed any bias
due to differences in covariates between treatment groups, t-tests or chi-square tests are conducted before and after
propensity score creation. Finally, outcomes and treatment effects can be assessed using models while adjusting for
the propensity scores. Continuing with the example and code above, subjects are divided into 5 classes based on the
common propensity score overlap using the RANK procedure. Checking for difference between treatment group
before and after stratifying subjects by propensity scores can be done using PROC FREQ, PROC TTEST and PROC
GLM.
proc rank data= psdataset groups=5 out = r;
ranks rnks + 1;
var ps;
run;
data quintile;
set r;
quintile = rnks + 1;
run;
/*check for differences in groups before propensity score*/
proc freq data=quintile;
tables tx*(female b_hmo naive0 pre_sulf pre_htn pre_asthma pre_pain pre_lipo
pre_depress)/chisq;run;
proc ttest data=quintile;

var age pre_drug_cnt_subset maintrefillratio;
class tx;run;
/*check for differences in groups while adjusting for propensity scores*/
proc glm data=quintile;
class tx quintile;
model age female b_hmo pre_drug_cnt_subset maintrefillratio naive0 pre_sulf
pre_htn pre_asthma pre_pain pre_lipo pre_depress = tx quintile;
lsmeans tx;
quit;
The result is similar to Tables 1 and 2 above showing minimal differences between groups when subclassifying
subjects. Outcomes can be compared within the 5 subclasses or averaged to report for the overall treatment groups.
Matching
Matching groups by propensity scores is a common method to balance groups on covariates. Once the propensity
score is calculated, subjects are matched by this single score as opposed to traditional direct matching by one or
more covariates. A disadvantage of matching methods includes incomplete matching and inexact matching. That is,
subjects may be excluded because of difficulty finding a match. Reducing this bias is well explained in Lori Parsons
papers (see reference section). Her papers offer code for performing case-control match using a greedy matching
algorithm. As she explains in the paper, cases were matched to controls based on propensity scores. Tables 1 and 2
below show how propensity scores were used to balance a treated (Early Intervention group) and untreated group of
subjects (Conservative group). Table 1 is the original population and includes results of rank-sum tests and chisquare tests showing differences between groups in many characteristics (only a few a shown in the tables). Table 2
shows how differences are eliminated after matching. This matched subset of patients can now be used to model
outcomes and assess effect of treatment.
Table 1: Original Population
Total Patients
Age (Meansd)
Male Gender
White Race
Hx Angina
Hx MI
Early Intervention N (%)

2,402
61.3 12.2
1,744 (72.6)
2,079 (91.8)
444 (18.5)
574 (23.9)
Conservative N (%)
17,735
68.213.0
10,914 (61.5)
15,002 (88.4)
4,441 (25.0)
5,382 (30.3)
p-value
Conservative N (%)
2,036
61.713.3
1,445 (71.0)
1,858 (91.3)
381 (18.7)
491 (24.1)
p-value
<0.0001
<0.0001
<0.0001
<0.0001
<0.0001
Table 2: Greedy 5 to 1 Digit Matched Population

Total Patients
Age (Meansd)
Male Gender
White Race
Hx Angina
Hx MI
Early Intervention N (%)

2,036
61.9 12.0
1,452 (71.3)
1,865 (91.6)
390 (19.2)
488 (24.0)
0.5405
0.8087
0.6952
0.7189
0.9124
SUMMARY OF AND STRENGTHS AND LIMITATIONS OF PROPENSITY SCORING

Calculating the propensity score as the conditional probability of treatment summarizes observed values into a single
score. The scores can then be used to control for selection bias by matching subjects, stratifying subjects, and/or as
a regressor. All techniques have the purpose of balancing groups to remove bias when assessing treatment effect on
outcomes.
Traditional techniques to control for bias may be limited if accounting for only a few covariates. Compared to multiple
regression, the propensity score methodology summarizes many observables and is less sensitive to model
misspecification (Perkins et al). Propensity scores can also diagnose if groups are comparable before moving onto
the modeling stage of analysis. If distributions of the propensity scores fail to show much overlap in covariate values,
the comparison groups are too different making it difficult to balance groups.
When creating propensity scores all covariates that affect both treatment and outcome must be included in the model
and it is assumed that all patients have a non zero probability of receiving each treatment. The technique only looks
at observed characteristics of a patient thus does not account for unobserved factors, such as patient attitudes,
socioeconomic status, and education level. This limitation is modified if unobserved covariates are correlated to
observed factors. Analyses also need large samples sizes in order to establish adequate variance in covariate
distributions.
CONCLUSION
This tutorial introduces potential selection bias in observational studies and describes how propensity score
methodology can control for overt bias when estimating treatment effectiveness. Propensity scoring methodology
attempts to balance groups before comparing outcomes between treatment groups. Commonly used techniques via
propensity scores include matching, stratification, and regression adjustment using the inverse of the propensity
score. Each method can be used in conjunction with traditional risk adjustment techniques to reduce bias and better
describe the effect of treatment on outcomes.
REFERENCES
DAgostino R.B. Sr, Kwan H. 1995. Measuring Effectiveness: What to Expect Without a Randomized Control Group.
Medical Care. 195:33 (4 suppl): AS95-AS105.
DAgostino R.B., Jr, DAgostino R.B., Sr. 2007. Estimating Treatment Effects Using Observational Data. JAMA. 297
(3). 314-316
Rosenbaum P.R. and Rubin D.B. 1983. The Central Role of the Propensity Score in Observational Studies for
Causal Effects, Biometrika, 70, 41-55.
DAgostino, R.B. 1998. Tutorial on Biostatistics: Propensity Score Methods for Bias Reduction in the comparison of a
treatment to a non-randomized control group. Statistics in Medicine 17, 2265-2281.
Hogan, J.W., Lancaster, T. 2004. Instrumental variable and propensity weighting for causal inference from
longitudinal observational studies. Statistical Methods in Medical Research 13: 17-48.
Obenchain, R.L., Melfi, C.A., Propensity Score and Heckman Adjustments for Treatment Selection Bias in Database
Studies.
Pasta, David J. 2000. Using Propensity Scores to Adjust for Group Differences: Examples Comparing Alternative
Surgical Methods. Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference, Indianapolis,
IN, 261-25.
Parsons, Lori. 2000. Using SAS Software to Perform a Case Control Match on Propensity Score in an
Observational Study. Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference,
Indianapolis, IN, 214-26.
SAS Institute Inc. 2004. SAS Procedures: The LOGISTIC Procedure. SAS OnlineDoc 9.1.3. Cary, NC: SAS
Institute Inc.
http://support.sas.com/documentation/onlinedoc/91pdf/sasdoc_91/stat_ug_7313.pdf
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
R. Scott Leslie
MedImpact Healthcare Systems, Inc.
10680 Treena Street
San Diego, CA 92131
Work Phone: 858-790-6685
Fax: 858-689-1799
Email: scott.leslie@medimpact.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. indicates USA registration.
Other brand and product names are trademarks of their respective companies.

Propensity Score

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Propensity Score

Uploaded by

Copyright:

Available Formats

Using Propensity Scores To Adjust For Treatment Selection Bias

R. Scott Leslie, MedImpact Healthcare Systems, Inc., San Diego, CA

INTRODUCTION AND WHY USE PROPENSITY SCORES

Patients eligible for Drug A or Drug B

Drug A Patient Characteristics

Drug B Patient Characteristics

Younger, More females,

Older, More male,

HOW TO CREATE PROPENSITY SCORES

USE OF PROPENSITY SCORES

proc glm data=psdataset;

proc ttest data=quintile;

Early Intervention N (%)

Table 2: Greedy 5 to 1 Digit Matched Population

Early Intervention N (%)

SUMMARY OF AND STRENGTHS AND LIMITATIONS OF PROPENSITY SCORING

You might also like