Simple and Multiple Linear Regression
Linear Regression
Introduction
• Most popular method in Predictive Analytics for modeling the input/output relationship
• Foundation of all other methods: all other methods can be explained as variations of regression models
• Broad categorization of regression models:
Broad Category | Output(s) | Input(s) | Type of Regression
Simple | One | One | Simple Linear Regression
Multiple | One | Many | Multiple Linear Regression
• Note that this estimated value for the nutritional rating lies directly on the regression line, at the location (x = 1, ŷ = 57.3916)
• Now, there is one cereal in our data set that does have a sugar content of 1 gram: Cheerios, with rating 50.765
• The vertical distance between the actual rating and the predicted rating is 57.3916 − 50.765 = 6.6266
• This vertical distance is known as the prediction error, or residual
• The aim is to find the regression line that minimizes the sum of squared residuals, i.e., SSE = Σ(yᵢ − ŷᵢ)² = Σ(yᵢ − b₀ − b₁xᵢ)²
• Because the squared-error function is convex, we can safely use differentiation to minimize it
Least Squares Estimate
• As we are interested in sample estimates of the regression coefficients, the least-squares estimates are b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) ∕ Σ(xᵢ − x̄)² and b₀ = ȳ − b₁x̄ (see the sketch after this list)
• The slope indicates the estimated change in “rating” per unit increase in “sugar”
• For each increase of 1 gram in sugar content, the estimated nutritional rating
decreases by 2.4614 rating points
• For example, Cereal A, with five more grams of sugar than Cereal B, would have an estimated nutritional rating 5 × 2.4614 = 12.307 rating points lower than Cereal B
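Below is a minimal numeric sketch of these least-squares formulas in Python; the (x, y) arrays are hypothetical stand-ins for the cereal data, not the actual data set.

```python
import numpy as np

# Hypothetical stand-ins for (sugar content, nutritional rating) pairs
x = np.array([1.0, 4.0, 6.0, 9.0, 11.0, 15.0])
y = np.array([57.0, 49.0, 45.0, 38.0, 33.0, 23.0])

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
b0 = y_bar - b1 * x_bar                                            # intercept

y_hat = b0 + b1 * x             # fitted values, all lying on the regression line
sse = np.sum((y - y_hat) ** 2)  # sum of squared residuals, minimized by (b0, b1)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, SSE = {sse:.4f}")
```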
Dangers of Extrapolation
• Suppose that a new cereal (say, the Chocolate Frosted Sugar Bombs loved by
Calvin, the comic strip character written by Bill Watterson) arrives on the market
with a very high sugar content of 30 grams per serving
• Let us use our estimated regression equation to estimate the nutritional rating for Chocolate Frosted Sugar Bombs: ŷ = 59.853 − 2.4614(30) = −13.989
• In other words, Calvin’s cereal has so much sugar that its nutritional rating is
actually a negative number, unlike any of the other cereals in the data set
(minimum = 18) and analogous to a student receiving a negative grade on an
exam
• What is going on here?
• The negative estimated nutritional rating for Chocolate Frosted Sugar Bombs is
an example of the dangers of extrapolation
• Analysts should confine the estimates and predictions made using the regression
equation to values of the predictor variable contained within the range of the
values of x in the data set
Dangers of Extrapolation
• In the Cereals data set, the lowest sugar content is 0 grams and the highest is 15
grams, so that predictions of nutritional rating for any value of x (sugar content)
between 0 and 15 grams would be appropriate
• Extrapolation, making predictions for x-values lying outside this range, can be
dangerous, because we do not know the nature of the relationship between the
response and predictor variables outside this range
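One practical safeguard is to check each new x-value against the observed range before predicting. The sketch below is illustrative only; the helper predict_safe is our own, with the fitted coefficients taken from the cereal equation above.

```python
def predict_safe(x_new, b0, b1, x_train_min, x_train_max):
    """Predict b0 + b1 * x_new, warning when x_new falls outside the observed range."""
    if not (x_train_min <= x_new <= x_train_max):
        print(f"Warning: x = {x_new} lies outside [{x_train_min}, {x_train_max}]; "
              "this prediction is an extrapolation.")
    return b0 + b1 * x_new

# Cereal equation: rating = 59.853 - 2.4614 * sugars, sugars observed in [0, 15]
print(predict_safe(30.0, 59.853, -2.4614, 0.0, 15.0))  # -13.989, an extrapolation
```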
• (Figure: the danger of extrapolation)
Adequacy Measurement of Regression
• A least-squares regression line can be found to approximate the relationship between any two continuous variables, regardless of the quality of the relationship between them
• The question arises as to how we may determine whether an estimated
regression equation is useful for making predictions
• R², known as the coefficient of determination, measures how well the linear approximation produced by the least-squares regression line fits the observed data
• R² is found in terms of the residuals, or prediction errors
Adequacy Measurement of Regression
• SSE = Σ(yᵢ − ŷᵢ)² represents an overall measure of the error in prediction resulting from the use of the estimated regression equation
• Here we have SSE = 12. Is this value large?
• We are unable to state whether this value, SSE = 12, is large, because at this
point, we have no other measure to compare it to
• Now, imagine for a moment that we were interested in estimating the distance traveled without knowledge of the number of hours. That is, suppose that we did not have access to the x-variable information for use in estimating the y-variable
• Because we lack access to the predictor information, our best estimate for y is simply ȳ, the sample mean of the distance traveled
• We would be forced to use ȳ = 16 to estimate the number of kilometers traveled for every competitor, regardless of the number of hours that person had traveled
Adequacy Measurement of Regression
• The estimates for distance traveled when ignoring the time information are shown by the horizontal line ȳ = 16. Disregarding the time information, a person who walked for 2 hours is estimated to have covered 16 km, and a person who walked for 8 hours is estimated to have covered the same 16 km; this is clearly not optimal
Adequacy Measurement of Regression
• The data points in the figure seem to “cluster” tighter around the estimated regression line than around the line ȳ = 16, which suggests that, overall, the prediction errors are smaller when we use the x-information
• For example, consider competitor #10, who hiked y = 25 kilometers in x = 9 hours. If we ignore the x-information, then the estimation error would be y − ȳ = 25 − 16 = 9 kilometers
• Suppose that we proceeded to find (y − ȳ) for every record in the data set, and then found the sum of squares of these measures, just as we did for (y − ŷ) when we calculated the SSE
• This would lead us to SST, the total sum of squares, SST = Σ(yᵢ − ȳ)², which gives a measure of the variability in the response variable without reference to the predictor variables
Adequacy Measurement of Regression
• Would we expect SST to be larger or smaller than SSE? Larger, because the least-squares line uses the predictor information and so cannot do worse, in terms of sum of squared errors, than the horizontal line ȳ = 16
• The typical prediction error when ignoring the predictor is about 5 km, whereas if the predictor is considered, the typical error is s = √(SSE ∕ (n − 2)) = √(12 ∕ 8) ≈ 1.2 km
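The SSE/SST/R² decomposition is easy to verify numerically. In this sketch the (hours, km) data are hypothetical; the actual hiking data set behind the slides yields SSE = 12 and R² = 0.9474.

```python
import numpy as np

x = np.array([2.0, 3.0, 5.0, 7.0, 9.0])      # hours walked (hypothetical)
y = np.array([10.0, 12.0, 16.0, 20.0, 25.0])  # km covered (hypothetical)

b1, b0 = np.polyfit(x, y, 1)       # least-squares slope and intercept
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)     # error left over after using x
sst = np.sum((y - y.mean()) ** 2)  # total variability, ignoring x
ssr = sst - sse                    # variability explained by the regression
print(f"SSE = {sse:.3f}, SST = {sst:.3f}, R^2 = {ssr / sst:.4f}")
```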
Similarity between SSR and Slope
• The formula for SSR is exactly the same as that for the slope, except that its numerator is the square of the slope's numerator:
SSR = [Σ(xᵢ − x̄)(yᵢ − ȳ)]² ∕ Σ(xᵢ − x̄)², while b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) ∕ Σ(xᵢ − x̄)²
• So, using these sums we can directly calculate the value of R² = SSR ∕ SST = 0.9474
• In words, the linear relationship between time and distance accounts for 94.74% of the variability in the distance travelled. The regression model fits the underlying data quite well
Correlation Coefficient
• The correlation coefficient r (also known as the Pearson product moment correlation coefficient) is an indication of the strength of the linear relationship between two quantitative variables, and is defined as follows:
r = Σ(x − x̄)(y − ȳ) ∕ √[Σ(x − x̄)² · Σ(y − ȳ)²]
• The definition formula for the correlation coefficient above may be tedious, because the numerator requires the calculation of the deviations for both the x-data and the y-data
• An alternative computational formula is
r = [Σxy − (Σx)(Σy)∕n] ∕ √{[Σx² − (Σx)²∕n] · [Σy² − (Σy)²∕n]}
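Both formulas are straightforward to implement; the sketch below uses the definition formula on hypothetical data, and can be checked against np.corrcoef.

```python
import numpy as np

def pearson_r(x, y):
    """Correlation coefficient via the definition formula."""
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

x = np.array([2.0, 3.0, 5.0, 7.0, 9.0])      # hypothetical data
y = np.array([10.0, 12.0, 16.0, 20.0, 25.0])
print(pearson_r(x, y), np.corrcoef(x, y)[0, 1])  # the two values agree
```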
• Note that the vertical distance away from the regression line (indicated by the
vertical arrows) is greater for these two observations than for any other cereals,
indicating the largest residuals
Outliers, High Leverage Points, Influential Observations
• For example, the nutritional rating for All Bran Extra Fiber (93.7) is much higher
than predicted (59.85), based on its sugar content alone (0 grams)
• Similarly, the nutritional rating for 100% Bran (68.4) is much higher than would
have been estimated (45.08) based on its sugar content alone (6 grams)
• Residuals may have different variances, so that it is preferable to use the
standardized residuals in order to identify outliers
• Standardized residuals are residuals divided by their standard error, so that they
are all on the same scale
• Let sᵢ denote the standard error of the ith residual; then sᵢ = s·√(1 − hᵢ), where hᵢ denotes the leverage of the ith observation
• The standardized residual is then residualᵢ,standardized = (yᵢ − ŷᵢ) ∕ sᵢ
• A rough rule of thumb is to flag observations whose standardized residuals
exceed 2 in absolute value as being outliers
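A sketch of this rule of thumb using statsmodels, whose resid_studentized_internal gives exactly the standardized residual defined above; the data are hypothetical, with the last point made deliberately unusual.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

x = np.array([2.0, 3.0, 5.0, 7.0, 9.0, 4.0, 6.0, 8.0])
y = np.array([10.0, 12.0, 16.0, 20.0, 25.0, 13.0, 17.0, 30.0])  # last y unusual

results = sm.OLS(y, sm.add_constant(x)).fit()
std_resid = OLSInfluence(results).resid_studentized_internal  # (y - y_hat) / s_i

for i, r in enumerate(std_resid):
    if abs(r) > 2:  # rule-of-thumb flag for outliers
        print(f"Observation {i}: standardized residual = {r:.3f}")
```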
Outliers, High Leverage Points, Influential Observations
• A high leverage point is an observation that is extreme in the predictor space; that is, a high leverage point takes on extreme values for the x-variable(s), without reference to the y-variable
• The leverage of a point in the predictor space can be found using hᵢ = 1∕n + (xᵢ − x̄)² ∕ Σⱼ(xⱼ − x̄)²
• So, the leverage of any point depends solely upon its squared distance from the mean of the predictor
• The lower bound on leverage values is 1∕n, and the upper bound is 1.0
• An observation with leverage greater than about 2(m + 1)∕n or 3(m + 1)∕n may
be considered to have high leverage
• So, at this point, any observation can be categorized as an outlier or a high leverage point
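A short sketch of the leverage computation and the two rules of thumb, for simple linear regression with hypothetical x-values (the last one made extreme):

```python
import numpy as np

def leverage(x):
    """h_i = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2, for one predictor."""
    dx2 = (x - x.mean()) ** 2
    return 1.0 / len(x) + dx2 / dx2.sum()

x = np.array([2.0, 3.0, 5.0, 7.0, 9.0, 5.0, 6.0, 4.0, 5.0, 5.0, 16.0])
h = leverage(x)
n, m = len(x), 1
print("2(m+1)/n =", 2 * (m + 1) / n, " 3(m+1)/n =", 3 * (m + 1) / n)
print("high leverage points:", np.where(h > 2 * (m + 1) / n)[0])  # flags x = 16
```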
Outliers, High Leverage Points, Influential Observations
• Cook’s distance is the most common measure of the influence of an observation
• It works by considering both the size of the residual and the amount of leverage for that observation
• For the ith observation, Cook’s distance is computed as
Dᵢ = [residualᵢ,standardized² ∕ (m + 1)] · [hᵢ ∕ (1 − hᵢ)], where m is the number of predictors
• The left-hand ratio in the formula for Cook’s distance contains an element representing the residual, while the right-hand ratio contains functions of the leverage
• A rough rule of thumb for determining whether an observation is influential is if
its Cook’s distance exceeds 1.0
• More accurately, one may also compare the Cook’s distance against the
percentiles of the F-distribution with (m+1 , n - m - 1) degrees of freedom
• If the observed value lies within the first quartile of this distribution (lower than
the 25th percentile), then the observation has little influence on the regression;
however, if the Cook’s distance is greater than the median of this distribution,
then the observation is influential
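The following sketch computes Cook's distance with statsmodels and locates each value within the F(m+1, n−m−1) distribution via scipy; the data are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence
from scipy.stats import f

x = np.array([2.0, 3.0, 5.0, 7.0, 9.0, 4.0, 6.0, 8.0, 5.0, 16.0])
y = np.array([10.0, 12.0, 16.0, 20.0, 25.0, 13.0, 17.0, 22.0, 15.0, 39.0])

results = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = OLSInfluence(results).cooks_distance[0]  # D_i for each observation
n, m = len(x), 1

for i, d in enumerate(cooks_d):
    pct = 100 * f.cdf(d, m + 1, n - m - 1)  # percentile within F(m+1, n-m-1)
    print(f"obs {i}: D = {d:.4f} ({pct:.0f}th percentile)")
```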
Outliers, High Leverage Points, Influential Observations
• For the person who traveled 20 kilometers in 5 hours, the value of the Cook’s distance measure is 0.2465
• This Cook’s distance of 0.2465 lies within the 22nd percentile of the F(2, 8) distribution, indicating that the influence of the observation is small
• Now suppose that a hiker traveled 39 kilometers in 16 hours, providing an 11th observation
• We then have n = 11 and m = 1, so that observations having hᵢ > 2(m+1)∕n = 0.36 or hᵢ > 3(m+1)∕n = 0.55 may be considered to have high leverage
• This observation has hᵢ = 0.7007, which indicates that this durable hiker does indeed have high leverage; however, the scatter plot seems to indicate that this hiker (x = 16, y = 39) is not an outlier, because the observation lies near the regression line
Outliers, High Leverage Points, Influential Observations
• The standardized residual supports this, having a value of 0.46801
• Finally, the Cook’s distance for this observation is 0.2564, which is about the
same as our previous example, indicating that the observation is not influential
• The figure shows only a slight change in the regression line with (solid line) and without (dotted line) this observation
Outliers, High Leverage Points, Influential Observations
• So we have seen that an observation that is an outlier with low leverage, or an observation that is a high leverage point with a small residual, may not be particularly influential
Assumptions of Linear Regression
• Zero-Mean Assumption. The error term 𝜀 is a random variable with mean, or expected value, equal to zero. In other words, E(𝜀) = 0.
• Constant-Variance Assumption. The variance of 𝜀, denoted by 𝜎², is constant, regardless of the value of x.
• Independence Assumption. The values of 𝜀 are independent.
• Normality Assumption. The error term 𝜀 is a normally distributed random variable.
Assumptions of Linear Regression
• For each value of x, the values of y are normally distributed, with mean lying on the true regression line and constant variance 𝜎²
Verifying the Regression Assumptions
• The two main graphical methods used to verify regression assumptions are as
follows:
a. A normal probability plot of the residuals.
b. A plot of the standardized residuals against the fitted (predicted) values
• A normal probability plot is a quantile-quantile plot of the quantiles of a particular distribution against the quantiles of the standard normal distribution, for the purpose of determining whether the specified distribution deviates from normality (similar to a percentile, a quantile of a distribution is a value xₚ such that p% of the distribution values are less than or equal to xₚ)
• A lower value of the Anderson-Darling (AD) test statistic indicates conformity with normality; correspondingly, the higher the AD p-value, the stronger the indication of normality
Verifying the Regression Assumptions
• The second graphical method used to assess the validity of the regression
assumptions is a plot of the standardized residuals against the fits (predicted
values)
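Both plots are easy to produce with matplotlib and scipy. This sketch assumes a fitted statsmodels OLS results object named results (for example, from the code shown earlier).

```python
import matplotlib.pyplot as plt
import scipy.stats as stats
from statsmodels.stats.outliers_influence import OLSInfluence

# `results` is assumed to be a fitted statsmodels OLS results object
std_resid = OLSInfluence(results).resid_studentized_internal

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(results.resid, dist="norm", plot=ax1)  # (a) normal probability plot
ax1.set_title("Normal probability plot of residuals")

ax2.scatter(results.fittedvalues, std_resid)          # (b) residuals vs. fits
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("Fitted value")
ax2.set_ylabel("Standardized residual")
plt.tight_layout()
plt.show()
```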
Confidence Interval for the Slope
• A 95% confidence interval for the slope 𝛽₁ is given by b₁ ± t(n−2, 𝛼∕2) · SE(b₁); for the cereal data this yields (−2.9448, −1.9780)
• That is, for every additional gram of sugar, the nutritional rating will decrease by between 1.9780 and 2.9448 points
• As the point 𝛽1 = 0 is not contained within this interval, we can be sure of the
significance of the relationship between the variables, with 95% confidence
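In statsmodels, the confidence interval for the slope comes directly from conf_int(); the data below are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 4.0, 6.0, 9.0, 11.0, 15.0])    # hypothetical
y = np.array([57.0, 49.0, 45.0, 38.0, 33.0, 23.0])

results = sm.OLS(y, sm.add_constant(x)).fit()
print(results.conf_int(alpha=0.05))  # rows: 95% CIs for [intercept, slope]
```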
Confidence Interval for Correlation Coefficient
• Let 𝜌 (“rho”) represent the population correlation coefficient between the x and y
variables for the entire population
• Then the confidence interval for 𝜌 is as follows: r ± t(n−2, 𝛼∕2) · √[(1 − r²) ∕ (n − 2)]
• This confidence interval requires that both the x and y variables be normally
distributed
• Now, rating is not normally distributed, but the transformed variable ln(rating) is normally distributed. However, neither sugars nor any transformation of sugars is normally distributed
• Carbohydrates, however, shows normality that is just barely acceptable, with an AD
p-value of 0.081
• Thus, the assumptions are met for calculating the confidence interval for the
population correlation coefficient between ln rating and carbohydrates, but not
between ln rating and sugars
• From the regression output for the regression of ln(rating) on carbohydrates, we have r² = 2.5% = 0.025, and the slope b₁ is positive, so that the sample correlation coefficient is r = +√0.025 = 0.1581
• Thus, our 95% confidence interval for 𝜌 is given by (-0.0703, 0.3865). As zero is
included in this interval, then we conclude that ln rating and carbohydrates are not
linearly correlated
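A sketch of this interval as a small function; n = 77 is our assumption for the size of the Cereals data set, and small rounding differences from the slide's interval are expected.

```python
import numpy as np
from scipy.stats import t

def corr_conf_int(r, n, alpha=0.05):
    """r +/- t_(n-2, alpha/2) * sqrt((1 - r^2) / (n - 2))."""
    half_width = t.ppf(1 - alpha / 2, n - 2) * np.sqrt((1 - r ** 2) / (n - 2))
    return r - half_width, r + half_width

# ln(rating) vs. carbohydrates: r = 0.1581, n = 77 cereals (assumed)
lo, hi = corr_conf_int(0.1581, 77)
print(f"({lo:.4f}, {hi:.4f})")  # roughly (-0.07, 0.39), straddling zero
```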
Intervals on Response
• Point estimates for values of the response variable for a given value of the
predictor value may be obtained by an application of the estimated regression
equation
• Unfortunately, these kinds of point estimates do not provide a probability
statement regarding their accuracy
• The analyst is therefore advised to provide for the end-user two types of
intervals, which are as follows:
• A confidence interval for the mean value of y given x and
• A prediction interval for the value of a randomly chosen y, given x
Both of these intervals require that the residuals be normally distributed
• Confidence interval for the mean of the response, given a particular value xₚ of the predictor:
ŷₚ ± t(n−2, 𝛼∕2) · s · √[1∕n + (xₚ − x̄)² ∕ Σ(xᵢ − x̄)²]
• Prediction interval for a randomly chosen value of the response, given xₚ:
ŷₚ ± t(n−2, 𝛼∕2) · s · √[1 + 1∕n + (xₚ − x̄)² ∕ Σ(xᵢ − x̄)²]
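statsmodels computes both intervals in one call via get_prediction(): the mean_ci_* columns give the confidence interval for the mean response and the obs_ci_* columns give the prediction interval. The data are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 4.0, 6.0, 9.0, 11.0, 15.0])    # hypothetical
y = np.array([57.0, 49.0, 45.0, 38.0, 33.0, 23.0])
results = sm.OLS(y, sm.add_constant(x)).fit()

x_new = sm.add_constant(np.array([5.0, 10.0]))    # predictor values of interest
frame = results.get_prediction(x_new).summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])
```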
MLR with Categorical Predictors
• Note that it is not necessary to define a third indicator variable “shelf 3,” because cereals located on shelf 3 will have zero values for each of the shelf 1 and shelf 2 indicator variables
• In fact, one should not define this third dummy variable because the resulting
covariate matrix will be singular, and the regression will not work
• The category that is not assigned an indicator variable is denoted the reference category.
Here, shelf 3 is the reference category
• Later, we shall measure the effect of the location of a given cereal (e.g., on shelf 1) on
nutritional rating, with respect to (i.e., with reference to) shelf 3, the reference category
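A sketch of building the two indicator variables with pandas, using shelf 3 as the reference category; the small data frame is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "rating": [50.8, 34.4, 68.4, 29.5, 40.4, 59.4],  # hypothetical ratings
    "shelf":  [1, 2, 3, 1, 2, 3],
})

# Indicators for shelves 1 and 2 only; shelf 3 rows get 0 in both columns
df["shelf1"] = (df["shelf"] == 1).astype(int)
df["shelf2"] = (df["shelf"] == 2).astype(int)
print(df)
# Adding a "shelf3" indicator too would make shelf1 + shelf2 + shelf3 = 1,
# a column linearly dependent on the constant term: a singular matrix.
```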
MLR with Categorical Predictors Only
• So, let us construct a multiple regression model using only the two indicator variables; in this case, our regression equation is
ŷ = b₀ + b₃(shelf 1) + b₄(shelf 2)
• Before we run the regression, let us think about what the regression coefficient
values might be
• On the basis of the dot plot, we would expect b₄ to be negative, because the shelf 2 cereals have a lower mean rating compared with shelf 3 cereals
• We might also expect b₃ to be essentially negligible but slightly positive, reflecting the slightly greater mean rating for shelf 1 cereals compared with shelf 3 cereals
• The fitted values are b₀ = 45.2, b₃ = 0.68, and b₄ = −10.247, with R² = 0.112 and s = 13.74
MLR with Categorical & Continuous Predictors
• Next, let us proceed to perform multiple regression, for the linear relationship
between nutritional rating and sugar content, fiber content, and shelf location,
using the two dummy variables
• The regression equation is given as
ŷ = b₀ + b₁(sugars) + b₂(fiber) + b₃(shelf 1) + b₄(shelf 2)
• The model above can be broken down by shelf location
• Thus, the regression equations for cereals located on the various shelves are given as the following:
Shelf 1: ŷ = (b₀ + b₃) + b₁(sugars) + b₂(fiber)
Shelf 2: ŷ = (b₀ + b₄) + b₁(sugars) + b₂(fiber)
Shelf 3: ŷ = b₀ + b₁(sugars) + b₂(fiber)
• Note that these estimated regression equations are exactly the same, except for the y-intercept
MLR with Categorical & Continuous Predictors
• The other results for the above model are shown in the regression output
Adjusted R²: Penalizing Models That Include Insignificant Variables
• Recall that adding a variable to the model will increase the value of the coefficient of determination R², regardless of the usefulness of the variable
• Therefore, in the interests of parsimony, we should find some way to penalize the R² measure for models that include predictors that are not useful
• Fortunately, such a penalized form of R² does exist, and is known as the adjusted R²:
R²adj = 1 − (1 − R²)(n − 1) ∕ (n − m − 1)
• If R²adj is much less than R², then this is an indication that at least one variable in the model may be extraneous, and the analyst should consider omitting that variable from the model
• For the current problem, let us now run the regression model with and without the shelf 1 indicator variable, respectively
• With shelf 1: Penalty = R² − R²adj = 0.828 − 0.818 = 0.010
• Without shelf 1: Penalty = R² − R²adj = 0.824 − 0.817 = 0.007
• So, the regression model not including shelf 1 suffers a smaller penalty than does the model that includes it, which would make sense if shelf 1 is not a useful predictor of nutritional rating
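The adjusted R² formula above reproduces the slide's penalties exactly, taking n = 77 for the Cereals data set (our assumption), with m = 4 and m = 3 predictors respectively:

```python
def adjusted_r2(r2, n, m):
    """R2_adj = 1 - (1 - R2) * (n - 1) / (n - m - 1), for m predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - m - 1)

# With shelf 1 (sugars, fiber, shelf1, shelf2) and without it (3 predictors)
print(round(0.828 - adjusted_r2(0.828, 77, 4), 3))  # penalty = 0.01
print(round(0.824 - adjusted_r2(0.824, 77, 3), 3))  # penalty = 0.007
```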
Sequential Sum of Squares
• Which variables must be included in the model?
• The sequential sums of squares represent a partitioning of SSR, the regression
sum of squares
• The sequential sums of squares partition the SSR into the unique portions of the
SSR that are explained by the particular predictors, given any earlier predictors
• Thus, the values of the sequential sums of squares depend on the order that the
variables are entered into the model
• For example, the sequential sums of squares for the model
ŷ = b₀ + b₁(sugars) + b₂(fiber) + b₃(shelf 1) + b₄(shelf 2)
are as follows
• The sequential sum of squares shown for sugars is 8711.9, and represents the
variability in nutritional rating that is explained by the linear relationship
between rating and sugar content.
• In other words, this first sequential sum of squares is exactly the value of SSR for the simple linear regression of rating on sugars
Sequential Sum of Squares
• The second sequential sum of squares, for fiber content, equals 3476.6.
• This represents the amount of unique additional variability in nutritional rating
that is explained by the linear relationship of rating with fiber content, given
that the variability explained by sugars has already been extracted
• The third sequential sum of squares, for shelf 1, is 7.0.
• This represents the amount of unique additional variability in nutritional rating
that is accounted for by location on shelf 1 (compared to the reference class
shelf 3), given that the variability accounted for by sugars and fiber has already
been separated out
• This tiny value for the sequential sum of squares for shelf 1 indicates that the
variable is probably not useful for estimating nutritional rating
• Finally, the sequential sum of squares for shelf 2 is a moderate 159.9
Sequential Sum of Squares
• Now, suppose we changed the ordering of the variables into the regression
model. This would change the values of the sequential sums of squares
• For example, suppose we perform an analysis based on the following model, with the indicator variables entered first:
ŷ = b₀ + b₃(shelf 1) + b₄(shelf 2) + b₁(sugars) + b₂(fiber)
• This time, the indicator variables are able to “claim” their unique portions of the variability before the other variables are entered, thus giving them larger values for their sequential sums of squares
• We use the sequential sums of squares, in the context of a partial F-test, to perform variable selection (described shortly)
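Sequential sums of squares correspond to Type I ANOVA in statsmodels; the sketch below assumes a DataFrame df with the columns named in our formulas.

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# df is assumed to hold columns rating, sugars, fiber, shelf1, shelf2
model1 = smf.ols("rating ~ sugars + fiber + shelf1 + shelf2", data=df).fit()
print(anova_lm(model1, typ=1))  # sequential SS follow this entry order

# Entering the indicators first lets them "claim" variability first
model2 = smf.ols("rating ~ shelf1 + shelf2 + sugars + fiber", data=df).fit()
print(anova_lm(model2, typ=1))
```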
Multicollinearity
• Suppose that we are now interested in adding the predictor potassium to the model, so that our new regression equation looks like
ŷ = b₀ + b₁(sugars) + b₂(fiber) + b₃(shelf 1) + b₄(shelf 2) + b₅(potassium)
• Correlations among the predictors: sugar & fiber = −0.19, sugar & potassium = 0.001, fiber & potassium = 0.912
Multicollinearity
• If one has started the MLR without first checking the correlations, is there any way to detect multicollinearity from the results?
• We may ask for the variance inflation factors (VIFs) to be reported
• The VIF for the ith predictor is calculated as VIFᵢ = 1 ∕ (1 − Rᵢ²)
• Here Rᵢ² is the coefficient of multiple determination obtained by regressing xᵢ on the other predictor variables
• Suppose that xᵢ is completely uncorrelated with the remaining predictors, so that Rᵢ² = 0; then VIFᵢ = 1 ∕ (1 − 0) = 1
• Minimum value of VIF is 1 and there is no upper limit to the value
• A rough rule of thumb for interpreting the value of the VIF is to consider VIFi ≥ 5
to be an indicator of moderate multicollinearity, and to consider VIFi ≥ 10 to be
an indicator of severe multicollinearity
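statsmodels reports VIFs through variance_inflation_factor; this sketch assumes a DataFrame df containing the predictor columns named below.

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# df is assumed to hold the predictor columns
X = sm.add_constant(df[["sugars", "fiber", "shelf1", "shelf2", "potassium"]])
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 3))
```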
Multicollinearity
• The estimated regression equation for this model, together with the VIFs for each predictor, is given in the regression output
• The p-value for potassium is not very small (0.082), so at first glance the variable may or may not be included in the model; also, the p-value for the shelf 2 indicator variable (0.374) has increased to such an extent that we should perhaps not include it in the model
• The VIF for fiber is 6.952 and the VIF for potassium is 7.157, with both values indicating
moderate-to-strong multicollinearity
• What should we do in this case: PCA, or a user-defined composite? We can use W = (fiber + potassium)∕2 as a new predictor variable
• Note that we need to standardize the variables involved in the composite, to avoid the
possibility that the greater variability of one of the variables will overwhelm that of the
other
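A sketch of constructing the standardized composite and re-running the regression; the DataFrame df and its column names are assumptions for illustration.

```python
import statsmodels.formula.api as smf

def standardize(s):
    """Return the z-scores of a pandas Series."""
    return (s - s.mean()) / s.std()

# Composite of the standardized variables, so neither one dominates
df["W"] = (standardize(df["fiber"]) + standardize(df["potassium"])) / 2

model_w = smf.ols("rating ~ sugars + shelf1 + shelf2 + W", data=df).fit()
print(model_w.summary())
```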
Multicollinearity
• With the user-defined composite W, the regression is re-run, with the following results