Simple Linear Regression

- There is only one independent variable and one dependent variable

- The independent variable is the variable that can possibly predict the dependent variable

Correlation vs Regression

Scatter plot – shows the relationship between two variables (IV and DV)

Correlation – used to measure the strength of the association (linear relationship) between two variables.
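
As a minimal sketch of how the strength of a linear association can be measured, the function below computes the sample Pearson correlation coefficient using only the standard library. The function name `pearson_r` and the income/spending figures are hypothetical illustrations, not values from these notes.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations around the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: income (IV) and spending (DV)
income = [10, 20, 30, 40, 50]
spending = [8, 15, 24, 30, 41]
r = pearson_r(income, spending)
print(f"r = {r:.4f}")  # close to +1: strong positive (upward) relationship
```

A value near +1 corresponds to an upward pattern in the scatter plot, a value near -1 to a downward pattern, and a value near 0 to no linear pattern.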

Regression analysis is used to:

- Predict the value of a DV based on the value of at least one IV
- Explain the impact of changes in an IV on the DV

*Income and spending, wage

DV: the variable we wish to predict or explain
IV: the variable used to predict or explain the DV

Direction of the relationship:
Positive - upward
Negative - downward

Equation of Simple Linear Regression:
Beta sub 0 (β0) is the constant
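
For reference, the equation itself did not survive in these notes. The form below is the standard textbook statement of the simple linear regression model, given here as an assumption about what the notes refer to. The symbols $\beta_1$ (slope) and $\varepsilon_i$ (random error) are the usual names and do not appear in the text above.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
```

Here $\beta_0$ is the constant (the Y intercept), $\beta_1$ is the change in the mean of Y per unit change in X, and $\varepsilon_i$ is the error for observation $i$.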


Only one IV, X

Relationship between X and Y is described by a linear function

Changes in Y are assumed to be related to changes in X

Representation of the linearity of the relationship: if the relationship is weak, the points are scattered
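
The idea of describing Y by a linear function of a single X can be sketched with the usual least-squares formulas. This is a minimal illustration, not the notes' own procedure: the names `fit_simple_linear` and `predict` and the data are hypothetical.

```python
def fit_simple_linear(x, y):
    """Return (b0, b1), the least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx               # slope: change in predicted Y per unit of X
    b0 = mean_y - b1 * mean_x    # intercept: the constant term
    return b0, b1


def predict(b0, b1, x_new):
    """Predicted value of the DV for a given value of the IV."""
    return b0 + b1 * x_new


# Hypothetical data: income (IV) and spending (DV)
income = [10, 20, 30, 40, 50]
spending = [8, 15, 24, 30, 41]
b0, b1 = fit_simple_linear(income, spending)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, prediction at X=35: {predict(b0, b1, 35):.2f}")
```

The slope answers the "explain the impact of changes in an IV" question, and `predict` answers the "predict the value of a DV" question.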
Measures of Variation

When using the least-squares method to determine the regression coefficients for a set of data, you need to compute three measures of variation. The first measure, the total sum of squares (SST), is a measure of variation of the Yi values around their mean, Ȳ. The total variation, or total sum of squares, is subdivided into explained variation and unexplained variation.

The explained variation, or regression sum of squares (SSR), represents variation that is explained by the relationship between X and Y, and the unexplained variation, or error sum of squares (SSE), represents variation due to factors other than the relationship between X and Y. Figure 13.6 shows these different measures of variation.

Assumptions

When hypothesis testing and the analysis of variance were discussed in Chapters 9 through 12, the importance of the assumptions to the validity of any conclusions reached was emphasized. The assumptions necessary for regression are similar to those of the analysis of variance because both are part of the general category of linear models (reference 4).

The four assumptions of regression (known by the acronym LINE) are as follows:

• Linearity
• Independence of errors
• Normality of error
• Equal variance

The first assumption, linearity, states that the relationship between variables is linear. Relationships between variables that are not linear are discussed in Chapter 15.

The second assumption, independence of errors, requires that the errors εi are independent of one another. This assumption is particularly important when data are collected over a period of time. In such situations, the errors in a specific time period are sometimes correlated with those of the previous time period.

The third assumption, normality, requires that the errors are normally distributed at each value of X. Like the t test and the ANOVA F test, regression analysis is fairly robust against departures from the normality assumption. As long as the distribution of the errors at each level of X is not extremely different from a normal distribution, inferences about β0 and β1 are not seriously affected.

The fourth assumption, equal variance, or homoscedasticity, requires that the variance of the errors εi be constant for all values of X. In other words, the variability of Y values is the same when X is a low value as when X is a high value. The equal-variance assumption is important when making inferences about β0 and β1. If there are serious departures from this assumption, you can use either data transformations or weighted least-squares methods (see reference 4).

Residual Analysis

Residual analysis visually evaluates these assumptions and helps you to determine whether the regression model that has been selected is appropriate.

The residual, or estimated error value, ei, is the difference between the observed (Yi) and predicted (Ŷi) values of the dependent variable for a given value of Xi. A residual appears on a scatter plot as the vertical distance between an observed value of Y and the prediction line.

Equation (13.14) defines the residual.

RESIDUAL
The residual is equal to the difference between the observed value of Y and the predicted value of Y.

$e_i = Y_i - \hat{Y}_i$ (13.14)

Evaluating the Assumptions

Recall from Section 13.4 that the four assumptions of regression (known by the acronym LINE) are linearity, independence, normality, and equal variance.

Linearity. To evaluate linearity, you plot the residuals on the vertical axis against the corresponding Xi values of the independent variable on the horizontal axis. If the linear model is appropriate for the data, you will not see any apparent pattern in the plot. However, if the linear model is not appropriate, the residual plot will show a relationship between the Xi values and the residuals, ei.

You can see such a pattern in Figure 13.9. Panel A shows a situation in which, although there is an increasing trend in Y as X increases, the relationship seems curvilinear because the upward trend decreases for increasing values of X. This quadratic effect is highlighted in Panel B, where there is a clear relationship between Xi and ei. By plotting the residuals, the linear trend of X with Y has been removed, thereby exposing the lack of fit in the simple linear model. Thus, a quadratic model is a better fit and should be used in place of the simple linear model. (See Section 15.1 for further discussion of fitting curvilinear models.)
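
The residuals and the three measures of variation described in this part of the notes can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `variation_summary` and the data are hypothetical, and the least-squares fit uses the usual closed-form formulas rather than anything specific to the textbook.

```python
def variation_summary(x, y):
    """Fit y = b0 + b1*x by least squares and return residuals and sums of squares."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x

    predicted = [b0 + b1 * a for a in x]
    residuals = [obs - pred for obs, pred in zip(y, predicted)]  # e_i = Y_i - Yhat_i

    sst = sum((obs - mean_y) ** 2 for obs in y)           # total variation
    ssr = sum((pred - mean_y) ** 2 for pred in predicted)  # explained variation
    sse = sum(e ** 2 for e in residuals)                   # unexplained variation
    return {"b0": b0, "b1": b1, "residuals": residuals,
            "SST": sst, "SSR": ssr, "SSE": sse, "r2": ssr / sst}


# Hypothetical data (not the Sunflowers Apparel data)
x = [10, 20, 30, 40, 50]
y = [8, 15, 24, 30, 41]
summary = variation_summary(x, y)
print(f"SST={summary['SST']:.2f}  SSR={summary['SSR']:.2f}  SSE={summary['SSE']:.2f}")

# Residuals listed against X: look for any pattern (curve, fan shape) by eye
for xi, ei in zip(x, summary["residuals"]):
    print(f"X={xi:>3}  residual={ei:+.2f}")
```

For a least-squares fit with an intercept, SST equals SSR plus SSE and the residuals sum to zero, which makes a quick sanity check on any hand calculation.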
To determine whether the simple linear regression model is appropriate, return to the evaluation of the Sunflowers Apparel data. Figure 13.10 displays the predicted annual sales values and residuals.

To assess linearity, the residuals are plotted against the independent variable (store size, in thousands of square feet) in Figure 13.11. Although there is widespread scatter in the residual plot, there is no clear pattern or relationship between the residuals and X. The residuals appear to be evenly spread above and below 0 for different values of X. You can conclude that the linear model is appropriate for the Sunflowers Apparel data.

Independence. You can evaluate the assumption of independence of the errors by plotting the residuals in the order or sequence in which the data were collected. If the values of Y are part of a time series (see Section 2.6), one residual may sometimes be related to the previous residual. If this relationship exists between consecutive residuals (which violates the assumption of independence), the plot of the residuals versus the time in which the data were collected will often show a cyclical pattern. Because the Sunflowers Apparel data were collected during the same time period, you do not need to evaluate the independence assumption for these data.

Normality. You can evaluate the assumption of normality in the errors by organizing the residuals into a frequency distribution as shown in Table 13.3. You cannot construct a meaningful histogram because the sample size is too small. And with such a small sample size (n = 14), it can be difficult to evaluate the normality assumption by using a stem-and-leaf display (see Section 2.5), a boxplot (see Section 3.3), or a normal probability plot (see Section 6.3).

From the normal probability plot of the residuals in Figure 13.12, the data do not appear to depart substantially from a normal distribution. The robustness of regression analysis with modest departures from normality enables you to conclude that you should not be overly concerned about departures from this normality assumption in the Sunflowers Apparel data.

Equal Variance. You can evaluate the assumption of equal variance from a plot of the residuals with Xi. For the Sunflowers Apparel data of Figure 13.11 on page 540, there do not appear to be major differences in the variability of the residuals for different Xi values. Thus, you can conclude that there is no apparent violation in the assumption of equal variance at each level of X.

To examine a case in which the equal-variance assumption is violated, observe Figure 13.13, which is a plot of the residuals with Xi for a hypothetical set of data. This plot is fan shaped because the variability of the residuals increases dramatically as X increases. Because this plot shows unequal variances of the residuals at different levels of X, the equal-variance assumption is invalid.

Multiple regression models use two or more independent variables to predict the value of a dependent variable.

Compare the multiple regression model to the simple linear regression model [Equation (13.1) on page 522].

In the simple linear regression model, the slope, β1, represents the change in the mean of Y per unit change in X and does not take into account any other variables. In the multiple regression model with two independent variables [Equation (14.2)], the slope, β1, represents the change in the mean of Y per unit change in X1, taking into account the effect of X2.

As in the case of simple linear regression, you use the least-squares method to compute sample regression coefficients (b0, b1, and b2) as estimates of the population parameters (β0, β1, and β2). Equation (14.3) defines the regression equation for a multiple regression model with two independent variables.

MULTIPLE REGRESSION EQUATION WITH TWO INDEPENDENT VARIABLES

$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$ (14.3)

r², Adjusted r², and the Overall F Test

This section discusses three methods that you can use to evaluate the overall multiple regression model: the coefficient of multiple determination, r², the adjusted r², and the overall F test.

Coefficient of Multiple Determination

Recall from Section 13.3 that the coefficient of determination, r², measures the proportion of the variation in Y that is explained by the independent variable X in the simple linear regression model. In multiple regression, the coefficient of multiple determination represents the proportion of the variation in Y that is explained by the set of independent variables.

Equation (14.4) defines the coefficient of multiple determination for a multiple regression model with two or more independent variables.

$r^2 = \dfrac{SSR}{SST}$ (14.4)

In the OmniPower example, from Figure 14.2 on page 580, SSR = 39,472,730.77 and SST = 52,093,677.44. Thus,

$r^2 = \dfrac{39{,}472{,}730.77}{52{,}093{,}677.44} = 0.7577$

The coefficient of multiple determination (r² = 0.7577) indicates that 75.77% of the variation in sales is explained by the variation in the price and in the promotional expenditures.

The coefficient of multiple determination also appears in the Figure 14.2 results on page 580, and is labeled R Square in the Excel results and R-Sq in the Minitab results.

Adjusted r²

When considering multiple regression models, some statisticians suggest that you should use the adjusted r² to take into account both the number of independent variables in the model and the sample size. Reporting the adjusted r² is extremely important when you are comparing two or more regression models that predict the same dependent variable but have a different number of independent variables. Equation (14.5) defines the adjusted r².

$r^2_{adj} = 1 - \left[(1 - r^2)\,\dfrac{n - 1}{n - k - 1}\right]$ (14.5)

where k is the number of independent variables in the regression equation. For the OmniPower data, k = 2 and n = 34 (consistent with the 2 and 31 degrees of freedom used in the F test below), so

$r^2_{adj} = 1 - \left[(1 - 0.7577)\,\dfrac{34 - 1}{34 - 2 - 1}\right] = 1 - 0.2579 = 0.7421$

Therefore, 74.21% of the variation in sales is explained by the multiple regression model adjusted for the number of independent variables and sample size. The adjusted r² also appears in the Figure 14.2 results on page 580, and is labeled Adjusted R Square in the Excel results and R-Sq(adj) in the Minitab results.

Test for the Significance of the Overall Multiple Regression Model

You use the overall F test to determine whether there is a significant relationship between the dependent variable and the entire set of independent variables (the overall multiple regression model). Because there is more than one independent variable, you use the following null and alternative hypotheses:

$H_0: \beta_1 = \beta_2 = 0$ (there is no linear relationship between the dependent variable and the independent variables)
$H_1:$ at least one $\beta_j \neq 0$, $j = 1, 2$ (there is a linear relationship between the dependent variable and at least one of the independent variables)

Using a 0.05 level of significance, the critical value of the F distribution with 2 and 31 degrees of freedom found from Table E.5 is approximately 3.32 (see Figure 14.4 below). From Figure 14.2 on page 580, the FSTAT test statistic given in the ANOVA summary table is 48.4771. Because 48.4771 > 3.32, or because the p-value = 0.000 < 0.05, you reject H0 and conclude that at least one of the independent variables (price and/or promotional expenditures) is related to sales.

--------------------------------------------------
Factor Analysis

Exploratory factor analysis (EFA)
- used only if you cannot find or identify a variable in the RRL.
- processed by forming the variables into a structure called factors.
- variables come from item-indicators or item-questions that describe the factor.
- basis: higher factor loading - the higher number in the group will be the name of the factor.
- respondents: min. 100 observations.
- if you are going to split the samples for validation purposes, the respondents must be 200 or higher.

General rule:
Minimum is to have at least five times as many observations as the number of variables to be analyzed. More acceptable is 10:1 (10 respondents per variable).

Interpretation:

Measure of Sampling Adequacy (MSA) through KMO and Bartlett's Test of Sphericity are needed to proceed with factor analysis.

KMO (Kaiser-Meyer-Olkin Test): it should be greater than 0.5 for a satisfactory factor analysis to proceed.
0.90 or above, marvelous
0.80 or above, meritorious
0.70 or above, middling
0.60 or above, mediocre
0.50 or above, miserable
Below 0.50, unacceptable

Bartlett's test: it should be significant (0.05 or less) to make the factor analysis reliable.

Total Variance Explained:
The cumulative variance explained should be 60% or higher to make it valid and reliable.

Communalities of Variables
A variable below 0.50 is a candidate for deletion, as it fails to explain the other variables.
If it is at or above 0.40 but below 0.50, it is at the researcher's discretion whether to remove it or keep it as a significant factor.

Scree Plot
- the eigenvalue should be greater than 1 to be significant.
- if below 1, the factors are no longer considered in the analysis.
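
The factor analysis rules of thumb listed in these notes can be collected into a few small checks. This is only a sketch that encodes the thresholds as the notes state them. All function names are hypothetical, and computing KMO, communalities or eigenvalues from raw data is assumed to be done elsewhere.

```python
def kmo_label(kmo):
    """Map a KMO value to the verbal rating used in the notes."""
    if kmo >= 0.90:
        return "marvelous"
    if kmo >= 0.80:
        return "meritorious"
    if kmo >= 0.70:
        return "middling"
    if kmo >= 0.60:
        return "mediocre"
    if kmo >= 0.50:
        return "miserable"
    return "unacceptable"


def sample_size_ok(n_obs, n_vars, split_for_validation=False):
    """Minimum 100 respondents (200 if splitting) and at least 5 per variable."""
    minimum = 200 if split_for_validation else 100
    return n_obs >= minimum and n_obs >= 5 * n_vars


def communality_action(communality):
    """Suggested handling of a variable based on its communality."""
    if communality < 0.40:
        return "candidate for deletion"
    if communality < 0.50:
        return "researcher's discretion"
    return "retain"


def retained_factors(eigenvalues):
    """1-based positions of factors whose eigenvalue is greater than 1."""
    return [i for i, ev in enumerate(eigenvalues, start=1) if ev > 1]


# Hypothetical values for illustration only
print(kmo_label(0.85))
print(sample_size_ok(n_obs=120, n_vars=20))
print(communality_action(0.45))
print(retained_factors([3.2, 1.4, 0.9, 0.5]))
```

The 10:1 ratio is described in the notes as "more acceptable" rather than required, so the sketch enforces only the 5:1 minimum.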


