ES031 MultipleCorrelationMLR


MODULE 4

MULTIPLE
CORRELATION &
MULTIPLE LINEAR
REGRESSION
LEARNING OBJECTIVES
LO1: Appraise multiple regression techniques to build
empirical models for engineering and scientific data

LO2: Deduce regression coefficients, use the
regression model to estimate the mean response and
to make predictions, and assess regression model
adequacy
OUTLINE
• Multiple Correlation Analysis
• Multiple Linear Regression
• Test For The Significance Of The Overall
Multiple Regression Model
• Inference Concerning The Population
Regression Coefficients
SUPPLEMENTARY VIDEOS
WATCH: Multiple Correlation Analysis and Multiple Linear Regression

WATCH: Multiple Linear Regression in Excel and Coefficient of Multiple Determination, r2

WATCH: Test For The Significance of the Overall Multiple Regression Model and Inference Concerning The Population Regression Coefficients
MULTIPLE
CORRELATION
ANALYSIS
SPURIOUS CORRELATION

• A spurious correlation wrongly implies a causal
relationship between two variables.

• It is a mathematical relationship in which two or
more events or variables are not causally related
to each other even though the data suggest that
they are.
EXAMPLE
Divorce rate in Maine and per capita consumption of margarine
OTHER EXAMPLES
CORRELATION
MODULE 4

CORRELATION w/ multiple
independent variables

The degree of relationship among more than two
variables is measured by the multiple correlation
coefficient, denoted by R
CORRELATION
MODULE 4

MULTIPLE CORRELATION
COEFFICIENT
The multiple correlation coefficient for one dependent variable y
and two independent variables x1 and x2 is computed as:

$$R = \sqrt{\frac{r_{yx_1}^2 + r_{yx_2}^2 - 2\,r_{yx_1}\,r_{yx_2}\,r_{x_1x_2}}{1 - r_{x_1x_2}^2}}$$

where:
r_yx1, r_yx2 = correlation coefficients between the dependent variable
and each independent variable
r_x1x2 = correlation coefficient between the two independent variables
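The two-predictor formula above translates directly into code. A minimal Python sketch (the function name is ours, not part of the module):

```python
import math

def multiple_correlation(r_yx1, r_yx2, r_x1x2):
    """Multiple correlation coefficient R for one dependent variable
    and two independent variables, from the three pairwise correlations."""
    numerator = r_yx1**2 + r_yx2**2 - 2 * r_yx1 * r_yx2 * r_x1x2
    return math.sqrt(numerator / (1 - r_x1x2**2))
```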
CORRELATION
MODULE 4

Using the formulas introduced in Module 6, the correlation
coefficient between two variables is computed as follows:

$$r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$
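These sums carry over line for line into code. A small sketch, assuming plain Python lists of equal length (the helper name pearson_r is ours):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient via the S_xy, S_xx, S_yy sums above."""
    n = len(x)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(xi**2 for xi in x) - sum(x)**2 / n
    s_yy = sum(yi**2 for yi in y) - sum(y)**2 / n
    return s_xy / math.sqrt(s_xx * s_yy)
```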
MULTIPLE LINEAR REGRESSION
MODULE 4

ILLUSTRATIVE EXAMPLE

In an article in IEEE Transactions on Instrumentation and
Measurement (2001, Vol. 50, pp. 2033-2040), powdered mixtures of
coal and limestone were analyzed for permittivity. The errors in the
loss factor measurement were the response.

Construct the correlation matrix between the variables.

Density   Dielectric Constant   Loss Factor
0.749     2.05                  0.016
0.798     2.15                  0.020
0.849     2.25                  0.022
0.877     2.30                  0.023
0.929     2.40                  0.026
0.963     2.47                  0.028
0.997     2.54                  0.031
1.046     2.64                  0.034
1.133     2.85                  0.039
1.170     2.94                  0.042
1.215     3.05                  0.045
SCATTER PLOTS

As with simple linear regression, check the existence of a linear
relationship between the variables using scatter plots. Both graphs
show an upward trend indicating a linear relationship. Variables
having no linear relationship cannot be used in linear correlation and
regression analysis.
MULTIPLE LINEAR REGRESSION
MODULE 4

ILLUSTRATIVE EXAMPLE
Using the Correlation function under the Data Analysis tool, Excel
displays a correlation matrix between variables, where each value in the
matrix is the correlation coefficient of the intersecting variables. For
multiple independent variables, the correlation matrix provides a better
display of the correlation coefficients between variables.

Variable                    Loss Factor (y)   Density (x1)   Dielectric Constant (x2)
Loss Factor (y)             1
Density (x1)                0.9977            1
Dielectric Constant (x2)    0.9987            0.9987         1

The coefficients indicate a strong, positive correlation between the
Loss Factor and the Density (r_yx1 = 0.9977) and between the Loss
Factor and the Dielectric Constant (r_yx2 = 0.9987).
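Outside Excel, the same matrix can be reproduced with NumPy's corrcoef, which plays the role of the Data Analysis Correlation tool. A sketch, assuming numpy is available:

```python
import numpy as np

# Dielectric example: loss factor (y), density (x1), dielectric constant (x2)
y  = [0.016, 0.020, 0.022, 0.023, 0.026, 0.028, 0.031, 0.034, 0.039, 0.042, 0.045]
x1 = [0.749, 0.798, 0.849, 0.877, 0.929, 0.963, 0.997, 1.046, 1.133, 1.170, 1.215]
x2 = [2.05, 2.15, 2.25, 2.30, 2.40, 2.47, 2.54, 2.64, 2.85, 2.94, 3.05]

# Entry [i, j] is the correlation between the i-th and j-th variable,
# matching the matrix above (rounded to 4 decimal places).
print(np.corrcoef([y, x1, x2]).round(4))
```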
MULTIPLE LINEAR REGRESSION
MODULE 4

ILLUSTRATIVE EXAMPLE
Multiple Correlation Coefficient R

r_yx1 = 0.9977
r_yx2 = 0.9987
r_x1x2 = 0.9987

$$R = \sqrt{\frac{0.9977^2 + 0.9987^2 - 2(0.9977)(0.9987)(0.9987)}{1 - 0.9987^2}} = 0.9987$$
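Plugging the three pairwise coefficients into the formula confirms the result (a standalone sketch):

```python
import math

r_yx1, r_yx2, r_x1x2 = 0.9977, 0.9987, 0.9987
num = r_yx1**2 + r_yx2**2 - 2 * r_yx1 * r_yx2 * r_x1x2
R = math.sqrt(num / (1 - r_x1x2**2))
print(round(R, 4))  # 0.9987
```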
MULTIPLE LINEAR
REGRESSION
MULTIPLE LINEAR REGRESSION
MODULE 4

MULTIPLE LINEAR REGRESSION

An extension of simple linear regression to more
than one independent variable:

Y = f(X1, X2, …, Xk)

(dependent variable Y: one; independent variables Xi: many)
MULTIPLE LINEAR REGRESSION
MODULE 4

NEW CONSIDERATIONS
Adding more independent variables to a multiple regression
procedure does not mean the regression will be “better” or offer
better predictions; in fact it can make things worse. This is called
OVERFITTING.
The addition of more independent variables creates more
relationships among them. So not only are the independent variables
potentially related to the dependent variable, they are also potentially
related to each other. When this happens, it is called
MULTICOLLINEARITY.
The ideal case is for all independent variables to be correlated with
the dependent variable but NOT with each other.
MULTIPLE LINEAR REGRESSION
MODULE 4
ASSUMPTIONS

Multiple Linear Regression has the following assumptions:


1. The values of the independent and dependent variables are
normally distributed
2. The variances of the y values are the same for each
value of the independent variable
3. There is a linear relationship between the dependent and
independent variables
4. The independent variables are not correlated
(no multicollinearity)
5. The values of the y variable are independent
MULTIPLE LINEAR REGRESSION
MODULE 4

The Multiple Regression Model With k


Independent Variables
Idea: Examine the linear relationship between 1 dependent (Y) & 2 or more
independent variables (Xi).

Multiple Regression Model with k Independent Variables:


$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$

(Y-intercept β0, population slopes β1 … βk, random error εi)
Where:
β0 = Y intercept
β1 = slope of Y with variable X1, holding X2, X3, X4, . . . , Xk constant
β2 = slope of Y with variable X2, holding X1, X3, X4, . . . , Xk constant
β3 = slope of Y with variable X3, holding X1, X2, X4,. . . , Xk constant

βk = slope of Y with variable Xk, holding X1, X2, X3,. . . , Xk-1 constant
εi = random error in Y for observation i
MULTIPLE LINEAR REGRESSION
MODULE 4

Multiple Regression Model With Two


Independent Variables

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$
Where:
β0 = Y intercept
β1 = slope of Y with variable X1, holding X2 constant
β2 = slope of Y with variable X2, holding X1 constant
εi = random error in Y for observation i
MULTIPLE LINEAR REGRESSION
MODULE 4

MULTIPLE REGRESSION EQUATION


The coefficients of the multiple regression model are estimated
using sample data.

Multiple regression equation with k independent variables:

$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$$

(Ŷi = estimated or predicted value of Y; b0 = estimated intercept;
b1, …, bk = estimated slope coefficients)
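In practice the b coefficients are obtained by ordinary least squares. A minimal NumPy sketch (the function name is ours; X is assumed to be an (n, k) array of predictor columns):

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Estimate b0, b1, ..., bk by ordinary least squares."""
    X_design = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return coefs  # array([b0, b1, ..., bk])
```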
MULTIPLE LINEAR REGRESSION
MODULE 4

INTERPRETING COEFFICIENTS

$$\hat{y} = 27 + 9x_1 + 12x_2$$

x1 = capital investment ($1000s)
x2 = marketing expenditures ($1000s)
ŷ = predicted sales ($1000s)

In multiple regression, each coefficient is interpreted as the estimated
change in y corresponding to a one-unit change in a variable, when all
other variables are held constant.

So in this example, $9000 is an estimate of the expected increase in
sales y, corresponding to a $1000 increase in capital investment (X1)
when marketing expenditures (X2) are held constant.
MULTIPLE LINEAR REGRESSION
MODULE 4

ILLUSTRATIVE EXAMPLE

• A distributor of frozen dessert pies wants to evaluate
factors thought to influence demand.

• Dependent variable: Pie sales (units per week)

• Independent variables: Price (in $)
                         Advertising ($100s)

• Data are collected for 15 weeks.
MULTIPLE LINEAR REGRESSION
MODULE 4

ILLUSTRATIVE EXAMPLE
Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7

Multiple regression equation:
Sales = b0 + b1 (Price) + b2 (Advertising)
MULTIPLE LINEAR REGRESSION
MODULE 4

Excel Multiple Regression Output

Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

ANOVA        df   SS          MS          F         Significance F
Regression    2   29460.027   14730.013   6.53861   0.01201
Residual     12   27033.306   2252.776
Total        14   56493.333

              Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept     306.52619      114.25389         2.68285   0.01993    57.58835   555.46404
Price         -24.97509       10.83213        -2.30565   0.03979   -48.57626    -1.37392
Advertising    74.13096       25.96732         2.85478   0.01449    17.55303   130.70888
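The same output can be reproduced outside Excel. This sketch assumes the statsmodels package is installed and types in the 15 weeks of data from the table above:

```python
import numpy as np
import statsmodels.api as sm

price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
         7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00]
advertising = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
               3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]
sales = [350, 460, 350, 430, 350, 380, 430, 470,
         450, 490, 340, 300, 440, 450, 300]

# Ordinary least squares with an intercept term
X = sm.add_constant(np.column_stack([price, advertising]))
model = sm.OLS(sales, X).fit()
print(model.summary())  # coefficients, r², F and t statistics, p-values
```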
MULTIPLE LINEAR REGRESSION
MODULE 4

The Multiple Regression Equation

Sales = 306.526 – 24.975(Price) + 74.131(Advertising)


Where:
Sales is in number of pies per week.
Price is in $.
Advertising is in $100’s.

b1 = -24.975: Sales will decrease, on average, by 24.975 pies per
week for each $1 increase in selling price, net of the effects of
changes due to advertising.

b2 = 74.131: Sales will increase, on average, by 74.131 pies per
week for each $100 increase in advertising, net of the effects of
changes due to price.
MULTIPLE LINEAR REGRESSION
MODULE 4

Using The Regression Equation to Make


Predictions
Predict sales for a week in which the selling price is
$5.50 and advertising is $350:

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)


= 306.526 - 24.975 (5.50) + 74.131 (3.5)
= 428.6216

Note that Advertising is in $100s, so $350 means that X2 = 3.5.

Predicted sales is 428.6216 pies.
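The arithmetic is easy to verify in a couple of lines (using the rounded coefficients, so the last digits differ slightly from the slide):

```python
b0, b1, b2 = 306.526, -24.975, 74.131
sales_hat = b0 + b1 * 5.50 + b2 * 3.5   # Advertising of $350 means X2 = 3.5
print(round(sales_hat, 3))  # 428.622 (428.6216 with the unrounded coefficients)
```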
MULTIPLE LINEAR REGRESSION
MODULE 4

COEFFICIENT OF MULTIPLE
DETERMINATION, r2
• The coefficient of determination for the multiple regression model, called the
coefficient of multiple determination, is denoted by R²

• Reports the proportion of total variation in Y explained by all X variables taken


together.

• It tells us how good the multiple regression model is and how well the
independent variables included in the model explain the dependent variable.

$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$
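With the ANOVA values from the pie sales output above, the ratio works out directly (a one-line sketch):

```python
SSR, SST = 29460.027, 56493.333   # regression and total sums of squares (pie sales ANOVA)
r_squared = SSR / SST
print(round(r_squared, 5))  # 0.52148
```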
MULTIPLE LINEAR REGRESSION
MODULE 4

MULTIPLE COEFFICIENT OF
DETERMINATION IN EXCEL
$$r^2 = \frac{SSR}{SST} = \frac{29{,}460.027}{56{,}493.333} = 0.52148$$

52.1% of the variation in pie sales is explained by the variation in
price and advertising.

Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

ANOVA        df   SS          MS          F         Significance F
Regression    2   29460.027   14730.013   6.53861   0.01201
Residual     12   27033.306   2252.776
Total        14   56493.333

              Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept     306.52619      114.25389         2.68285   0.01993    57.58835   555.46404
Price         -24.97509       10.83213        -2.30565   0.03979   -48.57626    -1.37392
Advertising    74.13096       25.96732         2.85478   0.01449    17.55303   130.70888
MULTIPLE LINEAR REGRESSION
MODULE 4

COEFFICIENT OF MULTIPLE
DETERMINATION

Consideration of R²:

The coefficient of multiple determination has one major shortcoming.
The value of R² generally increases as we add more and more
explanatory variables to the regression model (even if they do not
belong in the model).

Just because we can increase the value of R² does not imply that the
regression equation with a higher R² value does a better job of predicting
the dependent variable. Such a value will be misleading.
MULTIPLE LINEAR REGRESSION
MODULE 4

ADJUSTED COEFFICIENT OF
MULTIPLE DETERMINATION

To eliminate this shortcoming of R², it is preferable to use the
adjusted coefficient of multiple determination, denoted by R̄².
Note that it is the coefficient of multiple determination adjusted for
degrees of freedom.

The value of R̄² may increase, decrease, or stay the same as we add
more explanatory variables to our regression model. If a new variable
added to the regression model contributes significantly to explaining
the variation in y, R̄² increases; otherwise it decreases.
MULTIPLE LINEAR REGRESSION
MODULE 4

ADJUSTED COEFFICIENT OF
MULTIPLE DETERMINATION
Shows the proportion of variation in Y explained by all X
variables adjusted for the number of X variables used:

$$r^2_{adj} = 1 - \left[(1 - r^2)\left(\frac{n-1}{n-k-1}\right)\right]$$
(where n = sample size, k = number of independent variables)

• Penalizes excessive use of unimportant independent variables.
• Smaller than r².
• Useful in comparing among models.
MULTIPLE LINEAR REGRESSION
MODULE 4

Using the frozen dessert pie example:

$$r^2_{adj} = 1 - \left[(1 - r^2)\left(\frac{n-1}{n-k-1}\right)\right]$$

$$r^2_{adj} = 1 - (1 - 0.52148)\left(\frac{15-1}{15-2-1}\right) = 0.4417$$

44.17% of the variation in sales is explained by price and
advertising, after adjusting for the number of variables.
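The adjustment is a one-line function. A sketch reproducing the pie sales figure:

```python
def adjusted_r2(r2, n, k):
    """Coefficient of multiple determination adjusted for degrees of freedom."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.52148, n=15, k=2), 4))  # 0.4417
```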
TEST FOR THE SIGNIFICANCE
OF THE OVERALL MULTIPLE
REGRESSION MODEL
MULTIPLE LINEAR REGRESSION
MODULE 4

IS THE OVERALL MODEL SIGNIFICANT?

• The F-test of overall significance indicates whether your regression
model provides a better fit to the data than a model containing no
independent variables.
• Shows if there is a linear relationship between all of the X variables
considered together and Y.
• Uses the F-test statistic.

HYPOTHESES:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)
MULTIPLE LINEAR REGRESSION
MODULE 4

F Test for Overall Significance

• Test statistic:

$$F_{STAT} = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n-k-1)}$$

where F_STAT has numerator d.f. = k and
denominator d.f. = (n - k - 1)
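Computed from the pie sales ANOVA values reported above (a short sketch):

```python
SSR, SSE = 29460.027, 27033.306   # regression and residual sums of squares
n, k = 15, 2
F_stat = (SSR / k) / (SSE / (n - k - 1))
print(round(F_stat, 4))  # 6.5386, with df1 = 2 and df2 = 12
```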
MULTIPLE LINEAR REGRESSION
MODULE 4

F Test for Overall Significance


H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = .05, df1 = 2, df2 = 12

Test Statistic:
F_STAT = MSR / MSE = 6.5386

Critical Value:
F0.05 = 3.885

Decision:
Since the F_STAT test statistic falls in the rejection region
(p-value < .05), reject H0.

Conclusion:
There is evidence that at least one independent variable affects Y.
MULTIPLE LINEAR REGRESSION
MODULE 4

If the test results say that the model fits better without the
current set of independent variables, the next action should
be one of the following:

1. Change or omit one or more of the independent variables
until the F-test indicates that the new set of independent
variables fits the regression model significantly, or
2. Gather more samples until the F-test result becomes
significant.
INFERENCE CONCERNING
THE POPULATION
REGRESSION
COEFFICIENTS
MULTIPLE LINEAR REGRESSION
MODULE 4

ARE INDIVIDUAL VARIABLES SIGNIFICANT?

• Use t tests of individual variable slopes.
• Shows if there is a linear relationship between the variable Xj
and Y, holding constant the effects of the other X variables.

HYPOTHESES:
• H0: βj = 0 (no linear relationship)
• H1: βj ≠ 0 (linear relationship does exist between Xj and Y)
MULTIPLE LINEAR REGRESSION
MODULE 4

ARE INDIVIDUAL VARIABLES SIGNIFICANT?

H0: βj = 0 (no linear relationship between Xj and Y)


H1: βj ≠ 0 (linear relationship does exist between Xj and Y)

Test Statistic:

$$t_{STAT} = \frac{b_j - 0}{S_{b_j}} \qquad (df = n - k - 1)$$
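Applied to the pie sales coefficients and standard errors from the Excel output (a short sketch):

```python
# t_STAT = (b_j - 0) / S_bj for each slope, df = 15 - 2 - 1 = 12
b_price, se_price = -24.97509, 10.83213
b_adv, se_adv = 74.13096, 25.96732
print(round(b_price / se_price, 3))  # -2.306
print(round(b_adv / se_adv, 3))      #  2.855
```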
MULTIPLE LINEAR REGRESSION
MODULE 4

Inferences about the Slope: t Test Example


H0: βj = 0
H1: βj ≠ 0
d.f. = 15 - 2 - 1 = 12
α = .05
tα/2 = ±2.1788

From the Excel output:
For Price, t_STAT = -2.306, with p-value .0398.
For Advertising, t_STAT = 2.855, with p-value .0145.

The test statistic for each variable falls in the rejection region
(p-values < .05).

Decision: Reject H0 for each variable.

Conclusion: There is evidence that both Price and Advertising affect
pie sales at α = .05.
MULTIPLE LINEAR REGRESSION
MODULE 4

This section of the Regression Data Analysis tool will show which
variable(s) do not belong in the regression model

              Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept     306.52619      114.25389         2.68285   0.01993    57.58835   555.46404
Price         -24.97509       10.83213        -2.30565   0.03979   -48.57626    -1.37392
Advertising    74.13096       25.96732         2.85478   0.01449    17.55303   130.70888

If a variable's p-value > alpha, omit that variable: it is not
significantly correlated with the dependent variable. If all x
variables are insignificant, find other x variables or increase the
sample size until the results become significant.
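The omit-and-refit rule described above can be automated as backward elimination. A sketch under stated assumptions: statsmodels is available, X is an (n, k) NumPy array of predictor columns, and the function name is ours:

```python
import numpy as np
import statsmodels.api as sm

def backward_eliminate(X, y, names, alpha=0.05):
    """Refit repeatedly, dropping the predictor with the largest
    p-value above alpha, until all remaining slopes are significant
    (or only one predictor is left)."""
    names = list(names)
    while True:
        model = sm.OLS(y, sm.add_constant(X)).fit()
        pvals = np.asarray(model.pvalues)[1:]   # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha or X.shape[1] == 1:
            return model, names
        X = np.delete(X, worst, axis=1)         # omit the non-significant variable
        del names[worst]
```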
CORRELATION & LINEAR REGRESSION ANALYSIS
MODULE 4

SUMMARY
MULTIPLE CORRELATION ANALYSIS
1. Check for a linear relationship between variables using a scatter diagram
2. Compute the correlation coefficients between variables, presented in a
correlation matrix
3. Compute the multiple correlation coefficient, R

MULTIPLE LINEAR REGRESSION
1. Formulate the multiple linear regression model
2. Test the overall significance of the regression model using the F-test
3. If the F-test results in Do Not Reject H0, omit non-significant variables
using tests of the individual coefficients
4. Compute the adjusted coefficient of multiple determination, R̄², from the
final regression model, or the coefficient of determination, r², if the final
regression model has only one independent variable
END OF PRESENTATION
