Regression Modeling (Frees), Chapter 4: Multiple Linear Regression - II
Table: Summary Statistics of Logarithmic Face by Marital Status

  MARSTAT                    Number     Mean    Standard deviation
  Other (0)                      57    10.958        1.566
  Married (1)                   208    12.329        1.822
  Living together (2)            10    10.825        2.001
  Total                         275    11.990        1.871

• If we run a regression with the binary variables MAR0 and MAR2, then

  y = 2.605 + 0.452 LNINCOME + 0.205 EDUCATION + 0.248 NUMHH − 0.557 MAR0 − 0.789 MAR2.

• If you are married, then MAR0 = 0, MAR1 = 1 and MAR2 = 0, and

  ym = 2.605 + 0.452 LNINCOME + 0.205 EDUCATION + 0.248 NUMHH.

• The coefficient of MAR2, −0.789, is the difference in fitted value for someone living together, compared to a similar person who is married (the omitted category).
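For example, take two individuals with the same LNINCOME, EDUCATION and NUMHH, one living together and one married. Their fitted values differ by y_lt − y_m = −0.789; because the response is the logarithmic face amount, the living-together individual's fitted face amount is e^(−0.789) ≈ 0.454, or about 45%, of the married individual's.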
• However, there is not a perfect dependency among any two of the three. It turns out that Cor(MAR0, MAR1) = −0.90, Cor(MAR0, MAR2) = −0.10 and Cor(MAR1, MAR2) = −0.34.
• Any two out of the three produce the same model in terms of goodness of fit.

                          Model 1                Model 2                Model 3
  Explanatory Variable    Coefficient  t-ratio   Coefficient  t-ratio   Coefficient  t-ratio
  LNINCOME                   0.452      5.74        0.452      5.74        0.452      5.74
  EDUCATION                  0.205      5.30        0.205      5.30        0.205      5.30
  NUMHH                      0.248      3.57        0.248      3.57        0.248      3.57
  Intercept                  3.395      3.77        2.605      2.74        2.838      3.34
  MAR0                      -0.557     -2.15        0.232      0.44
  MAR1                                              0.789      1.59        0.557      2.15
  MAR2                      -0.789     -1.59                              -0.232     -0.44
Table: Term Life with Marital Status ANOVA Table

  Source        Sum of Squares    df    Mean Square
  Regression        343.28         5       68.66
  Error             615.62       269        2.29
  Total             948.90       274

Residual standard error s = 1.513, R² = 35.8%, Ra² = 34.6%.

• Model 1 appears the best in the sense that the t-ratios are larger (in absolute value). The p-values are close to statistically significant (0.113 for −1.59 and 0.032 for −2.15).
• Model 2 appears the worst in the sense that the t-ratios are smaller (in absolute value).
• Model 2 suggests that marital status is not statistically significant!!
• The three models are equivalent - same estimates, same fitted values - as long as you keep your interpretations straight.
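Numerically, the equivalence of the three codings is easy to check. Below is a minimal sketch assuming a pandas DataFrame `termlife` with columns LNFACE, LNINCOME, EDUCATION, NUMHH and MARSTAT; the file and column names are assumptions, not part of the original slides. It fits all three models and confirms the fits agree:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: LNFACE is the response; MARSTAT is coded
# 0 = other, 1 = married, 2 = living together.
termlife = pd.read_csv("termlife.csv")

# One fit per choice of omitted (baseline) marital category.
fits = {
    base: smf.ols(
        f"LNFACE ~ LNINCOME + EDUCATION + NUMHH + C(MARSTAT, Treatment({base}))",
        data=termlife,
    ).fit()
    for base in (1, 2, 0)  # Model 1, Model 2, Model 3
}

# Coefficients on the binaries differ, but goodness of fit is identical.
print([round(f.rsquared, 4) for f in fits.values()])
print(np.allclose(fits[1].fittedvalues, fits[0].fittedvalues))  # True
```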
The Role of Binary Variables

Example. How does Cost-Sharing in Insurance Plans affect Expenditures in Healthcare?

• Rand Health Insurance Experiment (HIE) - Keeler and Rolph (1988).
• Cost-Sharing
  • 14 health insurance plans were grouped by the co-insurance rate (the percentage paid as out-of-pocket expenditures, which varied over 0, 25, 50 and 95%).
  • One of the 95% plans limited annual out-of-pocket outpatient expenditures to $150 per person ($450 per family), providing in effect an individual outpatient deductible. This plan was analyzed as a separate group.
  • There were c = 5 categories of insurance plans.
• Adverse Selection
  • Individuals choose insurance plans, making it difficult to assess cost-sharing effects.
  • Adverse selection can arise because individuals in poor chronic health are more likely to choose plans with less cost sharing, thus giving the appearance that less coverage leads to greater expenditures.
  • In the Rand HIE, individuals were randomly assigned to plans, thus removing this potential source of bias.

Example. Cost-Sharing in Insurance Plans

• Although families were randomly assigned to plans, Keeler and Rolph (1988) used regression methods to control for participant attributes and isolate the effects of plan cost-sharing.
• Based on a sample of n = 1,967 episode expenditures.
• Logarithmic expenditure was the dependent variable.
• Cost-sharing was decomposed into four binary variables: "Co-ins25," "Co-ins50," and "Co-ins95," for coinsurance rates 25, 50 and 95%, respectively, and "Indiv Deductible" for the plan with individual deductibles.
  • The omitted category is the free insurance plan with 0% coinsurance.
• The HIE was conducted in six cities; location was decomposed into 5 binary variables: Dayton, Fitchburg, Franklin, Charleston and Georgetown, with Seattle being the omitted category.
• Age and sex were decomposed into 5 binary variables: "Age 0-2," "Age 3-5," "Age 6-17," "Woman age 18-65," and "Man age 46-65"; the omitted category was "Man age 18-45."
• Other control variables included a health status scale, socioeconomic status, number of medical visits in the year prior to the experiment (on a logarithmic scale), and race.
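As a small illustration of this kind of decomposition (the data frame and category labels below are hypothetical, not from the HIE files), the cost-sharing binaries with the free plan as omitted category could be built as:

```python
import pandas as pd

# Hypothetical episode-level data: `plan` takes the five plan categories.
episodes = pd.DataFrame(
    {"plan": ["free", "coins25", "coins50", "coins95", "indiv_ded"] * 4}
)

# One binary per plan category, then drop the free plan: it is the
# omitted category, absorbed by the regression intercept.
dummies = pd.get_dummies(episodes["plan"], prefix="plan").drop(columns="plan_free")
print(dummies.head())
```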
Statistical Inference for Several Coefficients

Procedure for Testing the General Linear Hypothesis

• Run the full regression and get the error sum of squares and mean square error, which we label as (Error SS)_full and s²_full.
• Run the reduced regression, with the null hypothesis imposed, and get its error sum of squares, (Error SS)_reduced.
• Compute

  F-ratio = ((Error SS)_reduced − (Error SS)_full) / (p s²_full).

• Reject the null hypothesis in favor of the alternative if the F-ratio exceeds an F-value.
• The F-value is a percentile from the F-distribution with df1 = p and df2 = n − (k + p + 1) degrees of freedom.
• Following our notation with the t-distribution, we denote this percentile as F_{p, n−(k+p+1), 1−α}, where α is the significance level.

Example. Term Life Insurance

• Our first (Chapter 3) regression, without the marital status binaries, serves as the reduced model, with (Error SS)_reduced = 630.43.
• The full regression with marital status yielded s = 1.513, R² = 35.8%, Ra² = 34.6%, and (Error SS)_full = 615.62.
• Comparing the two, we have

  F-ratio = ((Error SS)_reduced − (Error SS)_full) / (p s²_full) = (630.43 − 615.62) / (2 × 1.513²) = 3.235.

• Degrees of freedom are df1 = p = 2 and df2 = n − (k + p + 1) = 269.
• At α = 5%, the F-value is F_{0.95,2,269} = 3.029.
• Thus, we reject H0.
• The p-value is Pr(F_{2,269} > 3.235) = 0.0409.
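These figures are easy to reproduce; a minimal check with scipy, using only the quantities quoted above:

```python
from scipy import stats

n, k, p = 275, 3, 2                       # observations; retained and tested coefficients
ess_reduced, ess_full = 630.43, 615.62
s2_full = ess_full / (n - (k + p + 1))    # mean square error of the full model

f_ratio = (ess_reduced - ess_full) / (p * s2_full)
f_value = stats.f.ppf(0.95, dfn=p, dfd=n - (k + p + 1))
p_value = stats.f.sf(f_ratio, dfn=p, dfd=n - (k + p + 1))

print(f_ratio, f_value, p_value)          # about 3.235, 3.029, 0.041 -> reject H0
```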
The error sum of squares for the full model is itself a least squares minimum, taken over all candidate coefficients b*_0, ..., b*_{k+p}:

  (Error SS)_full = min_{b*_0, ..., b*_{k+p}} Σ_{i=1}^{n} ( y_i − (b*_0 + b*_1 x_{i,1} + ... + b*_{k+p} x_{i,k+p}) )².

Dividing by n − (k + p + 1) gives s²_full, which enters the F-ratio through the denominator p s²_full.
One Factor ANOVA Model

• Also available is the state in which the claim was filed (another categorical variable).
Box Plots of Logarithmic Claims by Risk Class

Figure: Box plots of logarithmic claims for the 18 risk classes C1, C11, C1A, C1B, C1C, C2, C6, C7, C71, C72, C7A, C7B, C7C, F1, F11, F6, F7, F71.

One Factor ANOVA Model - Notations

• The sample mean of the jth group is

  ȳ_j = (1/n_j) Σ_{i=1}^{n_j} y_ij.

• The one factor ANOVA model is

  y_ij = μ_j + ε_ij,  i = 1, ..., n_j,  j = 1, ..., c.
Method of Least Squares

• Estimate the parameters μ_j, j = 1, ..., c, using least squares.
• For candidate estimates μ*_j of μ_j, define the sum of squares

  SS(μ*_1, ..., μ*_c) = Σ_{j=1}^{c} Σ_{i=1}^{n_j} (y_ij − μ*_j)².

• Take a derivative with respect to each μ*_j, set it equal to zero, and solve. For example,

  (∂/∂μ*_1) SS(μ*_1, ..., μ*_c) = (∂/∂μ*_1) Σ_{i=1}^{n_1} (y_{i,1} − μ*_1)² = Σ_{i=1}^{n_1} (−2)(y_{i,1} − μ*_1) = 0.

• Thus, ȳ_1 is the least squares estimate of μ_1.
• In general, ȳ_j is the least squares estimate of μ_j.
• This yields least squares estimates without matrix computations.
  • Note that here, the dimension of X′X is c × c; this is 18 × 18 for our auto claims data.
  • For large auto insurers, it is common to have a risk classification system where c is in the thousands, making the usual estimation methods tedious.

ANOVA Table for the One Factor ANOVA Model

• Begin with the error sum of squares

  Error SS = SS(ȳ_1, ..., ȳ_c) = Σ_{j=1}^{c} Σ_{i=1}^{n_j} (y_ij − ȳ_j)².

• Define the "Factor" (Regression) Sum of Squares: Factor SS = Total SS − Error SS.
• The variability decomposition is summarized in the following analysis of variance (ANOVA) table.

Table: ANOVA Table for One Factor Model

  Source    Sum of Squares    df       Mean Square
  Factor    Factor SS         c − 1    Factor MS
  Error     Error SS          n − c    Error MS
  Total     Total SS          n − 1

• The usual (regression) definitions of R², s, and so forth, hold.
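A minimal sketch of these computations in Python (simulated data; the column names are illustrative, not the actual auto claims file):

```python
import numpy as np
import pandas as pd

# Illustrative data: logarithmic claims with a risk class label.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "risk_class": rng.choice(["C1", "C2", "C7"], size=300),
    "lnclaim": rng.normal(8.0, 1.0, size=300),
})

n, c = len(df), df["risk_class"].nunique()
grand_mean = df["lnclaim"].mean()
group_means = df.groupby("risk_class")["lnclaim"].transform("mean")  # ybar_j: the least squares fits

total_ss = ((df["lnclaim"] - grand_mean) ** 2).sum()
error_ss = ((df["lnclaim"] - group_means) ** 2).sum()
factor_ss = total_ss - error_ss

factor_ms = factor_ss / (c - 1)   # Factor MS
error_ms = error_ss / (n - c)     # Error MS = s^2
print(factor_ms / error_ms)       # the F-statistic for the one factor model
```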
Regression and the One Factor ANOVA Model

• To link the linear (regression) model to the one factor ANOVA model:
  • For a categorical variable with c levels, define c binary variables, x_1, x_2, ..., x_c.
  • Here, x_j indicates whether or not an observation falls in the jth level.
• Rewrite the one factor ANOVA model y = μ_j + ε as

  y = μ_1 x_1 + μ_2 x_2 + ... + μ_c x_c + ε.

• To include an intercept term, define τ_j = μ_j − μ, where μ is an, as yet, unspecified parameter.
  • The Greek "t", τ, is for "treatment" effects.
• With x_1 + x_2 + ... + x_c = 1, we have

  y = μ + τ_1 x_1 + τ_2 x_2 + ... + τ_c x_c + ε.

• A simpler expression is y_ij = μ + τ_j + ε_ij.

Reparameterizing the One Factor ANOVA Model

• The one factor ANOVA model can be expressed as

  y_ij = μ + τ_j + ε_ij,  i = 1, ..., n_j,  j = 1, ..., c.

• There are 1 + c parameters, too many (starting with c means): "overparameterized."
• Two ways to restrict the parameters (see the sketch below):
  • Drop one of the binary variables, for example,

    y = μ + τ_1 x_1 + τ_2 x_2 + ... + τ_{c−1} x_{c−1} + ε.

  • Minimize the sum of squares subject to the constraint that the sum of the taus equals zero. More formally, we use the weighted sum Σ_{j=1}^{c} n_j τ_j = 0.
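Both restrictions correspond to standard contrast codings. Here is a sketch with statsmodels on simulated data (names illustrative); note that patsy's `Sum` contrast imposes the unweighted constraint that the τ_j sum to zero, a slight variant of the weighted constraint above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"grp": rng.choice(list("ABC"), size=90)})
df["y"] = df["grp"].map({"A": 1.0, "B": 2.0, "C": 4.0}) + rng.normal(0, 0.5, size=len(df))

# Restriction 1: drop one binary (treatment coding; "A" is the baseline).
fit_drop = smf.ols("y ~ C(grp, Treatment('A'))", data=df).fit()

# Restriction 2: sum-to-zero coding (unweighted sum of the taus equals zero).
fit_sum = smf.ols("y ~ C(grp, Sum)", data=df).fit()

# Same fitted values either way; only the parameterization differs.
print(np.allclose(fit_drop.fittedvalues, fit_sum.fittedvalues))  # True
```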
The One Factor ANOVA Model - Matrix Expressions

Stack the observations group by group, so that the response vector is y = (y_{1,1}, ..., y_{n_1,1}, ..., y_{1,c}, ..., y_{n_c,c})′, and let X be the n × c matrix of the binary variables x_1, ..., x_c; each row of X contains a single one, in the column of that observation's group. Then X′X is diagonal with the group counts on the diagonal:

  X′X = diag(n_1, n_2, ..., n_c),

so the least squares estimates (X′X)^{−1} X′y are simply the group means ȳ_1, ..., ȳ_c.
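A quick numerical illustration of this structure (group sizes chosen arbitrarily):

```python
import numpy as np

# Indicator design matrix for c = 3 groups with n_j = 2, 3, 1 observations.
labels = np.array([0, 0, 1, 1, 1, 2])
X = np.eye(3)[labels]                     # each row has a single one

print(X.T @ X)                            # diag(2, 3, 1) -- the group counts
y = np.array([5.0, 7.0, 1.0, 2.0, 3.0, 10.0])
print(np.linalg.solve(X.T @ X, X.T @ y))  # group means: 6, 2, 10
```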
Combining Categorical and Continuous Explanatory Variables

E y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5

Here x1 indicates a female, x2 and x3 indicate the young and middle age categories (old is the omitted category), and x4 = x1 x2 and x5 = x1 x3 are the gender-age interaction terms.

  Gender   Age      x1  x2  x3  x4  x5   E y
  Male     Young    0   1   0   0   0    β0 + β2
  Male     Middle   0   0   1   0   0    β0 + β3
  Male     Old      0   0   0   0   0    β0
  Female   Young    1   1   0   1   0    β0 + β1 + β2 + β4
  Female   Middle   1   0   1   0   1    β0 + β1 + β3 + β5
  Female   Old      1   0   0   0   0    β0 + β1

Figure: Plot of the expected response versus the covariate for the regression model with variable intercept and variable slope.
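A sketch reproducing this parameterization with a patsy-style formula (data simulated, names illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "gender": rng.choice(["Male", "Female"], size=120),
    "age": rng.choice(["Young", "Middle", "Old"], size=120),
})
df["y"] = rng.normal(10, 2, size=120)  # placeholder response

# 'Male' and 'Old' are the omitted categories, matching the table: the main
# effects give beta1 (female), beta2 (young), beta3 (middle), and the
# interaction terms give beta4 and beta5.
fit = smf.ols(
    "y ~ C(gender, Treatment('Male')) * C(age, Treatment('Old'))", data=df
).fit()
print(fit.params)  # beta0..beta5 in the table's parameterization
```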