Basic Statistical Analysis Issues JP

Casemix data and statistics


Jim Pearse, Health Policy Analysis, Sydney, Australia
Introductory remarks
•  Casemix analyses are generally undertaken on large volumes
of data
•  Can represent the total population of interest (e.g. all hospital
admissions in a country)
•  Will almost certainly have variation in the quality of specific
data items
•  Are used for many different purposes
•  Despite their imperfections and complexity, they are at the
core of many health and economic classifications/studies

Lecture outline
Today the focus will be on some foundation
approaches to analysing and interpreting these data
including:
–  Describing cost and length of stay distributions
–  Models to test differences between sub-group means
–  Issues in modelling:
•  Explanatory vs predictive analysis
–  Assessing model performance

Example
For a particular DRG we have observed the following distribution of length of stay:

Days  Frequency    Days  Frequency
1     643          19    48
2     1,112        20    41
3     1,196        21    25
4     1,125        22    31
5     1,097        23    17
6     865          24    16
7     749          25    10
8     661          26    10
9     518          27    9
10    378          28    4
11    332          29    3
12    289          30    2
13    218          31    2
14    164          32    1
15    145          33    1
16    123          34    1
17    88           36    2
18    73           40    1
What can we say about these data?
Central tendency    Dispersion/Variation    Other

Central tendency
Mean (arithmetic mean, expected value): x̄, µ

Median (50th percentile, 2nd quartile)
Value for which 50% of observations are above
and 50% below
Mode
Most frequent value

Example
Where are the mean, median and mode of the length of stay distribution shown above?
Example
Where is the:
Mean: 6.48 days    Median: 5 days    Mode: 3 days
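The three figures can be reproduced from the frequency table. The snippet below is a minimal sketch using Python's standard `statistics` module; the names `LOS_FREQ` and `los` are introduced here for illustration and are not part of the slides.

```python
from statistics import mean, median, mode

# Length of stay in days -> number of admissions, copied from the example table.
LOS_FREQ = {
    1: 643, 2: 1112, 3: 1196, 4: 1125, 5: 1097, 6: 865,
    7: 749, 8: 661, 9: 518, 10: 378, 11: 332, 12: 289,
    13: 218, 14: 164, 15: 145, 16: 123, 17: 88, 18: 73,
    19: 48, 20: 41, 21: 25, 22: 31, 23: 17, 24: 16,
    25: 10, 26: 10, 27: 9, 28: 4, 29: 3, 30: 2,
    31: 2, 32: 1, 33: 1, 34: 1, 36: 2, 40: 1,
}

# Expand the table into one value per admission.
los = [days for days, count in LOS_FREQ.items() for _ in range(count)]

los_mean = mean(los)      # arithmetic mean
los_median = median(los)  # middle value of the sorted records
los_mode = mode(los)      # most frequent value

print(f"n={len(los)} mean={los_mean:.2f} median={los_median} mode={los_mode}")
```

Because the distribution has a long right tail, the mean sits above the median, which in turn sits above the mode.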

Dispersion (Variation)
Variance:
Expected value (mean) of the squared deviation
from the mean.

Standard deviation: SD, the square root of the variance

Example
Variance: 20.95852 days²    Standard deviation: 4.578047 days
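As a rough check, the values above can be recomputed from the frequency table. This sketch assumes the sample form of the variance (dividing by n − 1), which is what `statistics.variance` computes; the names `LOS_FREQ` and `los` are illustrative.

```python
from statistics import stdev, variance

# Length of stay in days -> number of admissions, copied from the example table.
LOS_FREQ = {
    1: 643, 2: 1112, 3: 1196, 4: 1125, 5: 1097, 6: 865,
    7: 749, 8: 661, 9: 518, 10: 378, 11: 332, 12: 289,
    13: 218, 14: 164, 15: 145, 16: 123, 17: 88, 18: 73,
    19: 48, 20: 41, 21: 25, 22: 31, 23: 17, 24: 16,
    25: 10, 26: 10, 27: 9, 28: 4, 29: 3, 30: 2,
    31: 2, 32: 1, 33: 1, 34: 1, 36: 2, 40: 1,
}
los = [days for days, count in LOS_FREQ.items() for _ in range(count)]

los_variance = variance(los)  # sample variance: squared deviations / (n - 1)
los_sd = stdev(los)           # square root of the sample variance

print(f"variance={los_variance:.5f} sd={los_sd:.6f}")
```

With 10,000 records the choice between n and n − 1 changes the variance only in the third decimal place.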


Dispersion (Variation)
Coefficient of variation: CV, c_v
The ratio of standard deviation to the mean


Often useful in getting a quick sense of the level of
variability within a particular class (e.g. DRG) compared
with another. Sometimes referred to as Relative
Standard Deviation
Example
Coefficient of variation: 4.578/6.48 ≈ 0.706
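To illustrate why the coefficient of variation helps when comparing classes, the sketch below computes it for two classes whose costs are on very different scales. The class names and cost values are made up for illustration and do not come from the lecture data.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (a unitless ratio)."""
    return stdev(values) / mean(values)

# Hypothetical costs for two made-up classes; not data from the slides.
costs_by_class = {
    "DRG_A": [1800, 2100, 1950, 2300, 2050, 1900],
    "DRG_B": [9000, 21000, 12500, 30500, 15000, 11000],
}

cv_by_class = {
    name: coefficient_of_variation(costs)
    for name, costs in costs_by_class.items()
}

for name, cv in cv_by_class.items():
    print(f"{name}: CV = {cv:.2f}")
```

Standard deviations alone would be hard to compare here because the two classes have very different mean costs; dividing by the mean puts them on a common scale.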



Dispersion (Variation)
Interquartile Range (IQR)
Distance between the 3rd and 1st quartiles
1st quartile: value below which 25% of observations fall
3rd quartile: value below which 75% of observations fall

Range
Distance between the minimum and maximum
value
Example
1st quartile: 3 days    3rd quartile: 9 days    IQR: 6 days
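The quartiles, IQR and range can be recomputed from the frequency table. This is a sketch using `statistics.quantiles` with its default method; statistical packages differ slightly in how they interpolate quantiles, but with this many tied values the common conventions agree. The names `LOS_FREQ` and `los` are illustrative.

```python
from statistics import quantiles

# Length of stay in days -> number of admissions, copied from the example table.
LOS_FREQ = {
    1: 643, 2: 1112, 3: 1196, 4: 1125, 5: 1097, 6: 865,
    7: 749, 8: 661, 9: 518, 10: 378, 11: 332, 12: 289,
    13: 218, 14: 164, 15: 145, 16: 123, 17: 88, 18: 73,
    19: 48, 20: 41, 21: 25, 22: 31, 23: 17, 24: 16,
    25: 10, 26: 10, 27: 9, 28: 4, 29: 3, 30: 2,
    31: 2, 32: 1, 33: 1, 34: 1, 36: 2, 40: 1,
}
los = [days for days, count in LOS_FREQ.items() for _ in range(count)]

# n=4 returns the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(los, n=4)
iqr = q3 - q1                     # spread of the middle half of the data
los_range = max(los) - min(los)   # distance between the extremes

print(f"Q1={q1} median={q2} Q3={q3} IQR={iqr} range={los_range}")
```

The range (39 days) is driven by a single 40-day stay, while the IQR (6 days) is unaffected by it.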



What can we say about these data?
Central tendency
–  Mean
–  Median
–  Mode
Variation
–  Variance
–  Standard deviation
–  Coefficient of variation
–  Interquartile range
–  Range
Other
–  Is zero possible?
–  Are negative values possible?
–  Other limits?
–  Skew

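What can we say about these data?
Most statistical software allows a quick summary of all
variables in a dataset. For example, the summary() function in R
reports the minimum, quartiles, median, mean and maximum of each numeric variable.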
Graphical ways of examining data
–  Histogram: y axis as a frequency value (count)
–  Histogram: y axis as a “density” (proportion of cases)
–  Density plots (kernel density plots, density trace): smoothed histogram
–  Density plots for comparing two distributions
–  Box plot (box-and-whisker plot) (Tukey 1977, Exploratory Data Analysis)

Box plot: the box runs from Q1 to Q3, so it holds the middle 50% of observations, with the median marked inside it. Points beyond Q3 + 1.5 * IQR are drawn individually as “outlier” points.
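The outlier rule on the box plot can be sketched in a few lines. The slide shows only the upper limit, Q3 + 1.5 * IQR; the matching lower limit, Q1 − 1.5 * IQR, is the usual Tukey convention and is included here on that basis. The function names and the sample of stays are hypothetical; the quartiles (3 and 9 days) are those of the length of stay example.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(values, q1, q3, k=1.5):
    """Separate values inside the fences from those outside them."""
    lower, upper = tukey_fences(q1, q3, k)
    inside = [v for v in values if lower <= v <= upper]
    outside = [v for v in values if v < lower or v > upper]
    return inside, outside

# Quartiles from the length of stay example: Q1 = 3 days, Q3 = 9 days.
lower_fence, upper_fence = tukey_fences(3, 9)

# Hypothetical handful of stays, chosen to straddle the upper fence.
sample_stays = [1, 2, 3, 5, 9, 14, 18, 19, 25, 40]
inside, outliers = split_outliers(sample_stays, 3, 9)

print(f"fences=({lower_fence}, {upper_fence}) outliers={outliers}")
```

For length of stay the lower fence is negative, so only long stays can be flagged; this is typical of right-skewed data bounded below.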
Comparing groups
[Figure panels: length of stay; cost]
More than one variable
–  Scatter plot
–  Scatter plot with brushing
–  Scatter plot with “smoothers”
More than one variable
Pearson correlation coefficient: r
Continuous variables
Value between +1 and −1
+1 perfect positive correlation
0 no linear relationship (independent variables have r = 0, but r = 0 alone does not imply independence)
−1 perfect negative correlation
Many other measures of correlation including
for categorical variables
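A minimal sketch of the Pearson coefficient, written out from its definition (covariance divided by the product of the standard deviations). The function name and the small data sets are hypothetical; the third data set shows that r can be 0 even when one variable is completely determined by the other.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical values for illustration only.
days = [1, 2, 3, 4, 5]
rising = [2 * d + 1 for d in days]      # exact increasing line
falling = [10 - d for d in days]        # exact decreasing line
curved = [(d - 3) ** 2 for d in days]   # depends on days, but not linearly

r_rising = pearson_r(days, rising)
r_falling = pearson_r(days, falling)
r_curved = pearson_r(days, curved)

print(r_rising, r_falling, r_curved)
```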
Correlation matrix
Correlation matrix plots
Statistical modelling
Response variable: What are we trying to
explain or predict?

Explanatory/predictor variables: The
variables that explain or predict different
levels of the response variable

The statistical model: How is the response
variable related to the explanatory variables?

Prior information: What do we know about
possible values and distributions?

Statistical modelling

Model + Data → Conclusions

Assumptions inherent in the model:
–  e.g. the nature of the distribution of the response variable
–  Assumptions about which explanatory/predictive variables are important


Linear model

Cost_i ~ Normal(µ_i, σ)

µ_i = α + β * predictor_i
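Under this model the maximum-likelihood estimates of α and β coincide with ordinary least squares. The sketch below fits a single-predictor model in closed form and reports the residual standard error as an estimate of σ. The function name and the length of stay and cost values are hypothetical.

```python
from math import sqrt

def fit_linear_model(x, y):
    """Least-squares fit of y = alpha + beta * x; returns (alpha, beta, sigma)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need equal-length sequences with at least 3 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx                 # slope
    alpha = mean_y - beta * mean_x   # intercept
    residuals = [b - (alpha + beta * a) for a, b in zip(x, y)]
    # Residual standard error: n - 2 because two parameters were estimated.
    sigma = sqrt(sum(r * r for r in residuals) / (n - 2))
    return alpha, beta, sigma

# Hypothetical data: cost against length of stay in days.
los_days = [1, 2, 3, 4, 5, 6]
cost = [1500, 2100, 2400, 3100, 3500, 4000]

alpha, beta, sigma = fit_linear_model(los_days, cost)
print(f"alpha={alpha:.1f} beta={beta:.1f} sigma={sigma:.1f}")
```

Here β is read as the expected change in cost for each additional day of stay, and α as the expected cost at zero days.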


Alternatives to linear model

Cost_i ~ Normal(µ_i, σ)
Alternative distributions: Poisson, gamma

µ_i = α + β * predictor_i
Alternative relationships between the predictor variables and the distribution parameters
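As one possible illustration of changing both parts at once (not necessarily a model used in the lecture), cost could be given a gamma distribution, with the mean tied to the predictor through a log link. The dispersion symbol φ is introduced here as an assumption.

```latex
\begin{aligned}
\text{Cost}_i &\sim \mathrm{Gamma}(\mu_i, \phi) \\
\log(\mu_i) &= \alpha + \beta \cdot \text{predictor}_i
\end{aligned}
```

The gamma distribution allows only positive, right-skewed values, and the log link keeps the predicted mean positive for any value of the predictor.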
Linear model: Example
Linear model
r², R², R squared, coefficient of determination

Several definitions. A common interpretation is that
it measures the proportion of the variance in the
response variable that is predicted/explained by the
predictor variable(s).

Range 0 to 1, unless we are calculating it for data that
was not used to generate the model, in which case it can be negative
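Using one common definition, R² = 1 − SS_res / SS_tot, the point about out-of-sample data is easy to demonstrate. In the sketch below all values are hypothetical: the first pair stands for predictions on the data used to fit a model, the second for predictions on new data that the model fits badly.

```python
def r_squared(observed, predicted):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical in-sample case: predictions track the observations closely.
fit_observed = [2, 4, 6, 8]
fit_predicted = [2.5, 3.5, 6.5, 7.5]
r2_in_sample = r_squared(fit_observed, fit_predicted)

# Hypothetical out-of-sample case: predictions are far from the new data,
# so they do worse than simply predicting the mean of the new observations.
new_observed = [3, 4, 5]
new_predicted = [8, 9, 10]
r2_out_of_sample = r_squared(new_observed, new_predicted)

print(f"in-sample R2={r2_in_sample:.2f} out-of-sample R2={r2_out_of_sample:.1f}")
```

A negative value means the model's predictions have a larger squared error than a constant prediction at the observed mean.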


R² and casemix

Traditionally the most common way of summarising how
well a classification (DRG system) explains/predicts cost
or length of stay.

An intuitive statistic, but in statistics it is not generally
accepted as a basis for selecting between models.


Assessing model performance
The problem of overfitting: Our models may work very
well for the data we have but very poorly for data we
don’t have

The remedies to overfitting:
•  Different statistics for selecting the best model
(information criteria such as AIC)
•  Cross validation
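Cross validation can be sketched with the standard library alone. The example below is an illustration under stated assumptions rather than a recommended workflow: the data are simulated, the model is a one-predictor least-squares line, and the helper names (`fit_line`, `rmse`, `k_fold_rmse`) are made up. Each record is predicted by a model fitted without it, and the resulting error is compared with the in-sample error.

```python
import random
from math import sqrt

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta

def rmse(x, y, alpha, beta):
    """Root mean squared error of the line's predictions."""
    return sqrt(sum((b - (alpha + beta * a)) ** 2 for a, b in zip(x, y)) / len(x))

def k_fold_rmse(x, y, k=5, seed=0):
    """Pool squared errors from predictions made on held-out folds only."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    squared_errors = []
    for fold in range(k):
        held_out = order[fold::k]        # every k-th shuffled record
        held_out_set = set(held_out)
        train = [i for i in order if i not in held_out_set]
        alpha, beta = fit_line([x[i] for i in train], [y[i] for i in train])
        squared_errors += [(y[i] - (alpha + beta * x[i])) ** 2 for i in held_out]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Simulated (hypothetical) data: cost rises with length of stay, plus noise.
rng = random.Random(1)
days = [rng.randint(1, 20) for _ in range(40)]
cost = [1000 + 500 * d + rng.gauss(0, 800) for d in days]

alpha, beta = fit_line(days, cost)
in_sample_rmse = rmse(days, cost, alpha, beta)
cv_rmse = k_fold_rmse(days, cost, k=5)
print(f"in-sample RMSE={in_sample_rmse:.0f}  5-fold CV RMSE={cv_rmse:.0f}")
```

The cross-validated error is the more honest guide to how the model would perform on data we don't have; the gap between the two figures widens as models become more flexible.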

Guidance

When reviewing a report or journal paper, always
consider whether the predictive performance was based
only on the source data or on out-of-sample data, or
used forms of cross validation.

Consider a range of performance metrics beyond R²
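As an example of what such a range might include, the sketch below computes three error-based metrics that are reported in the units of the response (or as a proportion of it). The choice of metrics, the function name and the cost values are assumptions made for illustration.

```python
from math import sqrt

def error_metrics(observed, predicted):
    """Return RMSE, MAE and MAPE for paired observed and predicted values."""
    errors = [o - p for o, p in zip(observed, predicted)]
    n = len(errors)
    return {
        # Root mean squared error: penalises large misses more heavily.
        "rmse": sqrt(sum(e * e for e in errors) / n),
        # Mean absolute error: average size of a miss, in the response's units.
        "mae": sum(abs(e) for e in errors) / n,
        # Mean absolute percentage error: needs non-zero observed values.
        "mape": sum(abs(e) / abs(o) for e, o in zip(errors, observed)) / n,
    }

# Hypothetical observed and predicted costs.
observed_cost = [100, 200, 300]
predicted_cost = [110, 190, 330]

metrics = error_metrics(observed_cost, predicted_cost)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```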
