
MAST 6474 Introduction to Data Analysis I

Simple Linear Regression and Multiple Linear Regression

In many business situations, we can take advantage of the relationship between two or more variables to predict costs,
revenues, productivity, sales, etc. Computing a sample correlation coefficient does not reveal the precise linear relationship,
only its direction and strength.

The most common approach to estimating the relationship between variables is linear regression; we will use this approach for the remainder of this course. Linear regression determines the best-fitting linear relationship between variables. (Note that we sometimes find that relationships are nonlinear; handling such relationships requires modifications to linear regression and more sophisticated techniques.)

Example: A Linear Model of HMO Healthcare Expenses

What sort of relationship exists between cost and enrollment? Let’s consider the following model of the true relationship between healthcare expenses (y) and member months (x):

y = β₀ + β₁x + ε

This is a “linear model” where x (member months) is called the independent variable (or predictor variable), y is called the dependent variable (or response variable), β₀ (“beta zero”) is called the intercept, β₁ (“beta one”) is called the slope (or coefficient) of the independent variable x, and ε is the random error term (or residual). In regression analysis, the terms error and residual are used interchangeably. We will use the notation β₀ and β₁ to denote true “theoretical parameters,” and β̂₀ and β̂₁ to denote parameter estimates, or fitted values, that are determined from the sample data.

Importantly, we will assume that the error term ε has an expected value of 0, so the expected value of y, E(y), is

E(y) = β₀ + β₁x

Because the expected value of the error term is zero, linear regression models offer unbiased predictions.

Fitting a Regression Line

We fit linear regressions by minimizing the sum of squared error terms, an optimization technique called least squares. In
mathematical terms, we minimize

Σᵢ εᵢ² = Σᵢ (yᵢ − β̂₀ − β̂₁xᵢ)²

where β̂₀ and β̂₁ are the sample estimates of the true intercept β₀ and slope β₁.

β̂₀ and β̂₁ are chosen so as to minimize the sum of squared errors. For the paired enrollment-expense data (xᵢ, yᵢ) in the HMO example, the classical picture is a scatter plot of the observations with the fitted least squares line passing through them.
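
Outside of Excel, the same least squares estimates can be computed directly from the closed-form formulas β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β̂₀ = ȳ − β̂₁x̄. The short Python sketch below illustrates this; the numbers are made-up stand-ins for the HMO enrollment-expense pairs, not the actual dataset.

    import numpy as np

    # Made-up (member months, expenses) pairs standing in for the HMO sample.
    x = np.array([1.2e6, 0.8e6, 1.5e6, 0.6e6, 1.1e6])
    y = np.array([1.6e8, 1.1e8, 1.9e8, 0.9e8, 1.5e8])

    # Closed-form least squares estimates for y = beta0 + beta1*x + error.
    x_bar, y_bar = x.mean(), y.mean()
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar

    print(f"intercept estimate: {beta0_hat:,.2f}")
    print(f"slope estimate:     {beta1_hat:,.4f}")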

Excel computes the regression line that minimizes the sum of squares of the errors. To do this, select the “Data” tab from the menu bar, and then select “Data Analysis” from the “Analysis” panel on the right. Double click on “Regression” to open the regression dialog window.

For the HMO example, the sample enrollment data (xᵢ) is in column A with the label “Member Months” in the first row; the corresponding sample cost data (yᵢ) is in column B with the label “Expenses” in the first row. Because we have included labels at the top of our input and output columns, we check the “Labels” option in the regression window.
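
If you want to double-check Excel’s output with code, a regression of the same form can be fit in Python with the statsmodels library. This is only a sketch: the file name hmo.csv is a placeholder, and the column names are assumed to match the labels “Member Months” and “Expenses” described above.

    import pandas as pd
    import statsmodels.api as sm

    data = pd.read_csv("hmo.csv")  # placeholder path; use your own copy of the HMO data

    X = sm.add_constant(data["Member Months"])   # add the intercept column
    fit = sm.OLS(data["Expenses"], X).fit()

    # The summary reports the same quantities as Excel's Regression tool:
    # coefficients, standard errors, t statistics, p-values, and R-squared.
    print(fit.summary())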

Measuring the Fit of a Regression Line

R² (coefficient of determination). R² measures the proportion of the variation in the dependent variable (y) that is explained by the independent variable (x). This is the most commonly cited measure of how well the line “fits” the sample data. To make this clearer, imagine that we initially impose the condition β̂₁ = 0 (no slope; a horizontal line) so that the information provided by x (enrollment) is totally ignored in fitting our line. In this case, the sum of squared errors is minimized by setting the intercept β̂₀ = ȳ, which is the horizontal line y = ȳ. The sum of squared errors for this horizontal line is known as SST, the total sum of squares, and is calculated (by Excel) as

SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)²

Observe that SST reflects the total variation in the y variable, without the explanation provided by x (alternatively, with β₁ fixed at 0). SST will serve as a benchmark by which we will judge the improvement in fit due to the x variable.

If we allow the slope coefficient to take on non-zero values, then we are permitting costs (y) to be adjusted for enrollment (x). This should improve the overall fit of the line because now we get to choose both an intercept and a slope. The line that minimizes the sum of squared errors will fit at least as well as the line y = ȳ discussed earlier because it offers greater flexibility. The optimal choices for the intercept and slope parameters are denoted β̂₀ and β̂₁, and are computed by Excel.

These two parameters determine what we refer to as the regression line, which has the equation ŷ = β̂₀ + β̂₁x. The sum of squared errors remaining when this line is fitted is called the error sum of squares and is denoted SSE:

SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

where ŷᵢ = β̂₀ + β̂₁xᵢ is the y value of the least squares line at the i-th observation of the independent variable x (xᵢ), and yᵢ is the actual observed y value of that i-th observation. The value ŷᵢ = β̂₀ + β̂₁xᵢ is called the predicted value of yᵢ. The term yᵢ − ŷᵢ is known as the i-th residual.

The difference between SST and SSE represents the improvement in SST obtained by including an unconstrained slope term in the model, i.e., the improvement obtained by adjusting y to account for x (note that SSE ≤ SST). This difference is termed SSR, the regression sum of squares:

SSR = SST − SSE

You can think of SSR as measuring the “value added” by including x in the model compared to the intercept-only model, which does not include x. Technically speaking, SSR measures the amount of variability in y that is eliminated by including x in the model. R² simply converts this into a proportion using the formula R² = SSR/SST = 1 − SSE/SST. This is why people often make the statement “R² is the proportion of variation in y explained by x.” The error sum of squares, SSE, is the unexplained variation in y. Excel’s regression output includes SSR, SSE, and SST.
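
The relationships among SST, SSE, SSR, and R² can be verified numerically. The sketch below reuses the made-up data from the earlier least squares sketch (again, not the HMO sample):

    import numpy as np

    # Made-up data and least squares fit, as in the earlier sketch.
    x = np.array([1.2e6, 0.8e6, 1.5e6, 0.6e6, 1.1e6])
    y = np.array([1.6e8, 1.1e8, 1.9e8, 0.9e8, 1.5e8])
    beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0_hat = y.mean() - beta1_hat * x.mean()

    y_hat = beta0_hat + beta1_hat * x        # predicted values on the regression line
    sse = np.sum((y - y_hat) ** 2)           # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)        # total variation (intercept-only benchmark)
    ssr = sst - sse                          # "value added" by including x
    r2 = ssr / sst                           # equivalently 1 - sse/sst

    print(f"SST={sst:,.0f}  SSE={sse:,.0f}  SSR={ssr:,.0f}  R^2={r2:.4f}")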

Simple Linear Regression: Assumptions about the Error Term

By making some additional assumptions about the random error term in our model, we can gain additional statistical insights.

Besides having an expected value of zero, we will further assume that the error ε, for any value of x, is normally distributed with mean 0 and constant variance σ². In our notation, ε ~ N(0, σ²). This implies that y is also normally distributed, y ~ N(β₀ + β₁x, σ²).

The errors associated with different observations are independent of one another. This implies that the error for any one observation tells us nothing about the error for any other observation.

Though these assumptions are often tested in practice (to determine if they are at least approximately right), we will assume
they are true.
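
These assumptions are easy to visualize by simulation: the sketch below generates data exactly according to y = β₀ + β₁x + ε with independent N(0, σ²) errors. The parameter values are invented purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    beta0, beta1, sigma = 5.0e7, 90.0, 1.0e7   # invented "true" parameters
    n = 16

    x = rng.uniform(5e5, 2e6, size=n)          # member months for n hypothetical HMOs
    eps = rng.normal(0.0, sigma, size=n)       # independent errors, each N(0, sigma^2)
    y = beta0 + beta1 * x + eps                # expenses generated by the linear model

    # Because E(eps) = 0, the sample mean of the deviations should be close to zero.
    print(np.mean(y - (beta0 + beta1 * x)))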

A Hypothesis Test for the Slope Coefficient

Observe that our slope estimate β̂₁ would be different if we had estimated it using a different sample of data from the same population, so our computed value is really just one observation of a random variable (much like X̄, the estimator for μ, is a random variable). An estimate of the standard deviation of the slope estimator is known as the standard error of β̂₁, which is defined as


s_β̂₁ = s_ε / √( Σᵢ₌₁ⁿ (xᵢ − x̄)² )

where s_ε is the standard error of the entire regression model, whose formula is

s_ε = √( SSE / (n − 2) )

Roughly speaking, s_ε represents the unexplained variation in yᵢ per observation in the dataset (expected error, not average error in the sample). You will never need to compute either type of standard error; Excel does that for you. The estimated value of s_ε is reported in the top panel of your Excel output in the “Regression Statistics.” The estimated value of s_β̂₁ is reported in your Excel output to the right of the estimated slope “Coefficient.” In our HMO example, the standard error of the slope is s_β̂₁ = 6.752009305.

s_β̂₁ is used in an important hypothesis test for the slope. This hypothesis test determines whether or not the true slope of x is different from 0. More generally, this test determines whether there is a linear relationship between y and x. Under our model assumptions, one can show that the quantity

t = (β̂₁ − β₁) / s_β̂₁

has a t distribution with n − 2 degrees of freedom. Excel automatically tests the alternative hypothesis that β₁ ≠ 0, that is

H₀: β₁ = 0
Hₐ: β₁ ≠ 0

The null hypothesis is rejected at level α if the p-value of the test statistic is less than α for a two-tailed t test with n − 2 degrees of freedom. In this case we conclude (at level α) that the true slope is not equal to zero; we therefore conclude that a linear relationship exists between x and y. Fortunately for us, the test statistic (labeled “t Stat”) and the associated p-value (labeled “p-value”) for this hypothesis test are reported in the Excel output to the right of the estimated slope coefficient.
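
For reference, here is how the slope’s standard error, t statistic, and two-tailed p-value could be reproduced outside Excel; the data are the same made-up values used in the earlier sketches, so the numbers will not match the HMO output.

    import numpy as np
    from scipy import stats

    # Made-up data and least squares fit, as in the earlier sketches.
    x = np.array([1.2e6, 0.8e6, 1.5e6, 0.6e6, 1.1e6])
    y = np.array([1.6e8, 1.1e8, 1.9e8, 0.9e8, 1.5e8])
    n = len(x)
    beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0_hat = y.mean() - beta1_hat * x.mean()

    sse = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)
    s_e = np.sqrt(sse / (n - 2))                        # standard error of the regression
    s_b1 = s_e / np.sqrt(np.sum((x - x.mean()) ** 2))   # standard error of the slope

    t_stat = beta1_hat / s_b1                           # tests H0: beta1 = 0
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)     # two-tailed p-value

    print(f"t Stat = {t_stat:.3f}, p-value = {p_value:.4g}")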

Using Linear Regression to Make Predictions

Now we will use a simple linear regression model to predict the cost of healthcare for 100,000 employees. For this, we must “plug in” the appropriate value of x (100,000 employees × 12 months/employee = 1,200,000 member months) in order to predict y. The value 1,200,000 is given (based on information about the number of employees of the firm), so we will call it the “given value of x,” or x_g. Plugging the number of member months into the equation below, the predicted expense ŷ is

E(y | x_g) = ŷ = β̂₀ + β̂₁·x_g = 50,332,853 + 91.966899 × 1,200,000 = $160,693,131.

How accurate is this prediction? First, the least squares line can be used to predict expected expenses, but it is not perfect. The prediction is, after all, based on estimates from a sample of 16 HMOs. Second, the estimated model includes a random error term, ε, which captures deviations from the regression line for each HMO (these deviations represent information about expenses that is not proportional to the number of member months, such as the ages and/or health circumstances of members, climate, regulation, provider competition, and other market factors). Collectively, these factors add uncertainty to our prediction of expenses. We compute an interval estimate that captures this uncertainty.

The approximate 100(1 − α)% prediction interval for a new observation is given by

ŷ ± t(α/2, n−2) · s_ε · √(1 + 1/n)

This approximation works best if the given value of x, x_g, is near the sample mean of x. Note that the value from the inverse t distribution is based on n − 2 degrees of freedom. This value can be computed using the Excel function T.INV.2T.

As before, α is a small probability that represents our tolerance for making an error (typically, α = 0.05). All other values can be taken directly from the Excel output of the regression.
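
The arithmetic behind this interval is short enough to sketch in code. Below, ŷ and n come from the HMO example in these notes, but s_e is a placeholder to be read off your own Excel output, since that number is not reproduced in this text; stats.t.ppf(1 − α/2, df) plays the role of Excel’s T.INV.2T.

    import numpy as np
    from scipy import stats

    y_hat = 160_693_131      # predicted expenses at x_g = 1,200,000 member months
    n = 16                   # number of HMOs in the sample
    s_e = 12_000_000         # standard error of the regression (placeholder value)
    alpha = 0.05

    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)    # matches Excel's T.INV.2T(alpha, n-2)
    half_width = t_crit * s_e * np.sqrt(1 + 1 / n)

    print(f"approx. 95% prediction interval: "
          f"({y_hat - half_width:,.0f}, {y_hat + half_width:,.0f})")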

The approximate prediction interval is calculated in a similar way to a confidence interval, but the two are fundamentally different. A prediction interval is an interval estimate for a new observation from the population (in this case, from HMOs of similar size). A confidence interval is an interval estimate of the true population mean (in this case, the true mean or expected expenses of an HMO with a given number of member months).

Multiple Linear Regression

In many applications, a simple linear regression model (with only one x variable) does not adequately explain the variation in y.
This is because the y variable in most applications depends on more than one independent variable. Looking at it another way,
we can often take advantage of the relationships between y and several x variables to make better predictions for y. In such
cases, multiple linear regression should be used to capture the relationship between the dependent variable (e.g., production
cost) and a set of independent variables (e.g., cost of raw materials, product complexity, labor utilization, etc.). The general
model for multiple regression takes the form

y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₖxₖ + ε

where y is the dependent variable, x₁, x₂, ⋯, xₖ are the independent variables, β₀ is the “true” intercept, β₁, β₂, ⋯, βₖ are the “true” slopes, and ε is the normally distributed error term (as in simple linear regression, ε ~ N(0, σ²)). Also, as in simple linear regression, we assume that the errors are independent. The β’s are estimated using least squares, which we will do using Excel’s regression analysis tool.
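
To see the machinery behind Excel’s tool for the multiple-variable case, the sketch below fits a two-predictor model with numpy’s least squares routine. The data values are invented for illustration; only the structure (a column of ones for the intercept plus one column per x variable) is the point.

    import numpy as np

    # Invented data: x1 = doctor visits, x2 = hospital days, y = expenses.
    x1 = np.array([350_000.0, 420_000.0, 280_000.0, 510_000.0, 390_000.0])
    x2 = np.array([20_000.0, 31_000.0, 18_000.0, 27_000.0, 24_000.0])
    y = np.array([1.4e8, 1.8e8, 1.1e8, 1.9e8, 1.5e8])

    # Design matrix: a column of ones (intercept) followed by the predictor columns.
    X = np.column_stack([np.ones(len(y)), x1, x2])

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares (beta0, beta1, beta2)
    print("estimated intercept and slopes:", beta_hat)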

Example: A Linear Model of HMO Healthcare Expenses, Revisited

We return to the Individual Practice Association (IPA) HMO data used for estimating healthcare expenditures. For that application, “EXPENSES” was the dependent variable and “MEMBER MONTHS” was the independent variable. In arriving at a cost estimate for 100,000 employees, we simply used total enrollment (measured in member months). Suppose last year’s records indicate that your employees made 397,000 visits to the doctor and spent 26,000 days in a hospital. At the MLR tab in the Module 11 Notes Dataset, we find the data for this example, where “EXPENSES” and “MEMBER MONTHS” are the same variables we used before.

We might expect a more accurate prediction of expenses by specifying a model that uses actual services rendered (doctor
visits and hospital days) as predictors. The simplest multiple linear regression model relating expenses to utilization has the
form

y = β₀ + β₁x₁ + β₂x₂ + ε

where y = expenses, x₁ = number of doctor visits, and x₂ = number of hospital days. Performing the regression in this case gives the Excel output whose key results are discussed below.¹

The incremental cost of a doctor visit (assuming that hospital days are held fixed) is $42.96 in this model. The incremental cost
of a hospital day (assuming that doctor visits are held fixed) is $1960.47 in this model. Note that both of these interpretations
require that the other variable is held fixed (a situation we describe using the phrase “all other things being equal”).

To estimate the expenses associated with 397,000 doctor visits and 26,000 hospital days, we “plug in” the given values to our regression equation. The expected expenses ŷ for 397,000 doctor visits and 26,000 hospital days is

¹ Warning: This example is solely for illustrative purposes because the sample size would generally be considered too small for a good model. There are many “rules of thumb,” or ad hoc guidelines, for determining an appropriate sample size N for a regression model having k independent variables. These guidelines include N ≥ 104 + k, N ≥ 40k, and N ≥ 50 + 8k, among others; with k = 2 independent variables, these call for at least 106, 80, and 66 observations, respectively, far more than this example uses. You’ll notice that book problems routinely violate these data requirements.

ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ = 85,431,953.15 + 42.95611322 × 397,000 + 1,960.474897 × 26,000 = $153,457,877

Remember, this is an estimate of expected or mean expenses based on a sample of HMOs. We can again compute an approximate prediction interval for the expenses of a new HMO, much as we did for simple linear regression.

An approximate 100(1 − α)% prediction interval for a new observation is given by

ŷ ± t(α/2, n−k−1) · s_ε · √(1 + 1/n)

where the given values are x_g = (x₁, x₂, ⋯, xₖ) and ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + ⋯ + β̂ₖxₖ. Again, this approximation works best if the given values of the predictors are near their respective sample means. Observe that we need k given values, one for each independent variable. Also observe that the interval estimate is based on a t distribution with n − k − 1 degrees of freedom.
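
A sketch of the same calculation for the two-predictor model: the coefficients are the ones quoted above, while s_e and n are placeholders to be taken from your own Excel output (neither is reproduced in this text).

    import numpy as np
    from scipy import stats

    b0, b1, b2 = 85_431_953.15, 42.95611322, 1_960.474897   # coefficients quoted in the text
    s_e = 15_000_000        # standard error of the regression (placeholder)
    n, k = 16, 2            # sample size (assumed) and number of independent variables
    alpha = 0.05

    x1_g, x2_g = 397_000, 26_000                  # given doctor visits and hospital days
    y_hat = b0 + b1 * x1_g + b2 * x2_g            # about $153,457,877

    t_crit = stats.t.ppf(1 - alpha / 2, df=n - k - 1)
    half_width = t_crit * s_e * np.sqrt(1 + 1 / n)
    print(f"prediction: {y_hat:,.0f}  +/-  {half_width:,.0f}")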

It is worth noting that this multiple linear regression model with two independent variables does not fit the data as well as the simple linear regression model proposed earlier with x = member months. The simple linear regression produced an R² = 0.9298, considerably better than the 0.6425 achieved by the multiple linear regression above with x₁ = doctor visits and x₂ = hospital days.

Copyright Edward Fox and John Semple 2019
