
Learning Objective:

At the end of the lesson, the student should be able to:


• Differentiate between correlation analysis and regression analysis;
• Interpret a regression equation and use it to make predictions;
• Interpret the meaning of the regression coefficients b0 and b1;
• Explain the least squares method and interpret R²;
• Interpret the regression results in Excel;
• Evaluate the assumptions of regression analysis and know what to do if the
assumptions are violated;
• Identify when binary logistic regression is appropriate.
Correlation Analysis
▪ The analysis of bivariate data typically begins with a scatter plot that
displays each observed pair of data (x, y) as a dot on the x-y plane.
▪ Correlation analysis is used to measure the strength of the linear
relationship between two variables.
▪ Correlation is only concerned with strength of the relationship.
▪ No causal effect is implied with correlation.
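▪ The sample correlation coefficient (Pearson’s r) measures the degree of
linearity in the relationship between two random variables X and Y, with
values in the interval [-1, 1]. Spearman’s rho, a rank-based alternative,
takes values in the same interval but measures monotonic rather than
strictly linear association.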

Statistical Analysis with Software Applications, Mc Graw Hill


Correlation Analysis
Scatter plots showing various correlation coefficient values

In Excel, use either of these functions to get the value of the
correlation coefficient:
1. =CORREL(array1, array2)
2. =PEARSON(array1, array2)
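
Outside Excel, the same coefficient can be computed directly from its definition. The sketch below is a minimal Python version, not part of the original slides; the function name `pearson_r` is illustrative, and the data are the study-hours and exam-score pairs from Example 1 later in this lesson.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Example 1 data: hours studied (X) and exam score (Y) for 10 students
hours = [1, 5, 7, 8, 10, 11, 14, 15, 15, 19]
scores = [53, 74, 59, 43, 56, 84, 96, 69, 84, 83]

r = pearson_r(hours, scores)
print(round(r, 4))
```

For these data r is roughly 0.63, which should agree with =CORREL applied to the same two columns.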



Correlation Analysis
Test for significant correlation using Student’s t:
▪ The sample correlation coefficient r is an estimate of the population
correlation coefficient ρ (the Greek letter rho).
▪ There is no flat rule for a “high” correlation because sample size must
be taken into consideration.
▪ To test the hypothesis H0: ρ = 0, the test statistic is

$$t_{\text{computed}} = r\sqrt{\frac{n-2}{1-r^{2}}}$$

with n − 2 degrees of freedom.

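▪ After calculating this value, we can find its two-tailed p-value by using
Excel’s function =T.DIST.2T(t, deg_freedom), where deg_freedom = n − 2.
Since T.DIST.2T does not accept a negative first argument, use
=T.DIST.2T(ABS(t), deg_freedom) when t is negative.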
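
As a rough cross-check of the Excel calculation, the test statistic t = r·sqrt((n − 2)/(1 − r²)) and its two-tailed p-value can be sketched in Python. This is only an illustration under stated assumptions: `two_tailed_p` is a hand-rolled numerical approximation (Simpson's rule on the Student's t density) standing in for T.DIST.2T, and r = 0.6278 with n = 10 is the correlation for the Example 1 data used later in this lesson.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Approximate two-tailed p-value: 1 - 2 * area under the density on [0, |t|]."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1 - 2 * area)

def correlation_t(r, n):
    """Test statistic for H0: rho = 0."""
    return r * math.sqrt((n - 2) / (1 - r * r))

r, n = 0.6278, 10          # assumed inputs (Example 1)
t = correlation_t(r, n)
p = two_tailed_p(t, n - 2)
print(round(t, 3), round(p, 3))
```

With these inputs t is about 2.28 on 8 degrees of freedom and the p-value is slightly above 0.05, so the correlation is not significant at the 5% level.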



Regression Analysis
▪ The hypothesized relationship may be linear, quadratic or some other
form.
▪ The next slide presents some of the possible patterns.
▪ The module will focus on the simple linear model commonly referred
to as a simple regression equation.



Regression Analysis: Types of relationships

Source: Statistics for Manager Using Microsoft Excel, 5e @ 2008 Prentice-Hall, Inc
Simple Linear Regression Model
▪ Only one independent variable, X
▪ The relationship between X and Y is described by a linear
function.
▪ The changes in Y are related to changes in X.



The population regression model
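
In standard textbook notation (an assumption here, since the symbol for the error term does not appear elsewhere in these slides), the population model can be written as:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where β0 is the population intercept, β1 is the population slope, and ε_i is the random error for observation i.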



Simple Linear Regression Equation
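
The sample equation estimates the population model using the coefficients b0 and b1. A common way to write it, together with the least squares estimates, is sketched below; the bar notation for sample means is standard shorthand rather than the slides' own.

$$\hat{Y}_i = b_0 + b_1 X_i$$

$$b_1 = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}, \qquad b_0 = \bar{Y} - b_1\bar{X}$$

The least squares method chooses b0 and b1 so that the sum of squared differences between the observed and predicted values of Y is as small as possible.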



Interpreting an Estimated Regression Equation
The slope tells us how much, and in what direction, the dependent or response
variable will change for each one unit increase in the predictor variable. On the
other hand, the intercept is meaningful only if the predictor variable would
reasonably have a value equal to zero.
Equation:
𝑆𝑎𝑙𝑒𝑠 = 268 + 7.37 𝐴𝑑𝑠
Interpretation:
Each extra P1 million of advertising is associated with an average increase of
P7.37 million in sales. The equation implies average sales of P268 million with
zero advertising. However, the intercept may not be meaningful because Ads = 0
may be outside the range of observed data.





Prediction Using Regression
One of the main uses of regression is to make predictions. Once we have a fitted
regression equation that shows the estimated relationship between X and Y, we can
plug in any value of X (within the range of our sample x values) to obtain the
prediction for Y.



Assumptions of Regression (L.I.N.E)
▪ Linearity – the relationship between X and Y is linear
▪ Independence of errors – the error values (difference between
observed and estimated values) are statistically independent.
▪ Normality of error – the error values are normally distributed
for any given value of X
▪ Equal variance or homoskedasticity – the probability
distribution of the errors has constant variance.



Example 1.
What is the relationship between the number of hours a student
studies and his or her exam score? Shown in the table are the
data for 10 students.
Student   Hours, X   Score, Y
1         1          53
2         5          74
3         7          59
4         8          43
5         10         56
6         11         84
7         14         96
8         15         69
9         15         84
10        19         83



Example 1 Excel output
For the scatterplot:
1. Highlight the X array and the Y array.
2. Choose Insert.
3. Choose Scatter among the chart types available.
4. Edit the axis labels.



Example 1 Excel output
For regression:
1. Go to Data, then choose Data Analysis.
2. Choose Regression among the Data Analysis tools.
3. Fill in the necessary fields (for example, Input Y Range and Input X Range).
4. Click OK.



Example 1 Excel output

The regression equation is:


𝑆𝑐𝑜𝑟𝑒 = 49.477 + 1.9641 ∗ 𝑋 (ℎ𝑜𝑢𝑟𝑠)
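
The coefficients reported by Excel can be reproduced from the least squares formulas b1 = S_xy / S_xx and b0 = ȳ − b1·x̄. The following is a minimal Python sketch using the Example 1 data; the function name `least_squares_fit` is illustrative and not part of the lesson.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    b1 = s_xy / s_xx              # slope
    b0 = mean_y - b1 * mean_x     # intercept
    return b0, b1

# Example 1 data: hours studied (X) and exam score (Y)
hours = [1, 5, 7, 8, 10, 11, 14, 15, 15, 19]
scores = [53, 74, 59, 43, 56, 84, 96, 69, 84, 83]

b0, b1 = least_squares_fit(hours, scores)
print(f"Score = {b0:.3f} + {b1:.4f} * Hours")
```

The printed coefficients should match the intercept and slope in the Excel output up to rounding.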



Example 1 Excel output

Intercept = 49.477



Example 1 Interpretation of coefficients
𝑆𝑐𝑜𝑟𝑒 = 49.477 + 1.9641 ∗ 𝑋 (ℎ𝑜𝑢𝑟𝑠)

▪ b1 measures the change in the average value of Y as a result of a
one-unit change in X.
▪ Here, b1 = 1.9641 tells us that the mean exam score increases by
1.9641 points, on average, for each additional hour of studying for
the examination.
▪ Also, b0 = 49.477 tells us that a student who did not study would be
expected to score about 49. Because X = 0 lies just outside the
observed range of 1 to 19 hours, this interpretation should be treated
with caution.
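
To illustrate prediction with this fitted equation while staying inside the sample range of X, here is a small hedged sketch. The helper `predict_score` and the choice of 12 hours are hypothetical; the coefficients are the rounded Excel values, and the range limits of 1 and 19 hours come from the Example 1 data table.

```python
B0, B1 = 49.477, 1.9641      # rounded coefficients from the Excel output
X_MIN, X_MAX = 1, 19         # smallest and largest study hours in the sample

def predict_score(hours):
    """Predicted mean exam score; refuses to extrapolate beyond the sample."""
    if not X_MIN <= hours <= X_MAX:
        raise ValueError(
            f"{hours} hours is outside the observed range [{X_MIN}, {X_MAX}]"
        )
    return B0 + B1 * hours

pred = predict_score(12)
print(round(pred, 2))
```

Raising an error for out-of-range inputs is one possible design; a gentler alternative would be to return the prediction together with a warning.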
Assessing Fit: Coefficient of determination, R²
▪ The coefficient of determination is the proportion of the total
variation in the dependent variable that is explained by the
variation in the independent variable.
▪ It is also called r-squared and is obtained by:

$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}, \qquad 0 \le r^2 \le 1$$
Example 1 Coefficient of determination

$$r^2 = \frac{SSR}{SST} = \frac{1020.341}{2588.90} = 0.39412$$

39.41% of the variation in scores is explained by the variation in
study hours.
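
The sums of squares behind this ratio can be recomputed from the raw data. The sketch below is an illustration in Python using the Example 1 data; the variable names are assumptions, and SSE (the error sum of squares) is included only to show that SSR + SSE = SST.

```python
# Example 1 data: hours studied (X) and exam score (Y)
hours = [1, 5, 7, 8, 10, 11, 14, 15, 15, 19]
scores = [53, 74, 59, 43, 56, 84, 96, 69, 84, 83]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Least squares coefficients
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
s_xx = sum((x - mean_x) ** 2 for x in hours)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x
predicted = [b0 + b1 * x for x in hours]

sst = sum((y - mean_y) ** 2 for y in scores)                  # total variation
ssr = sum((y_hat - mean_y) ** 2 for y_hat in predicted)       # explained
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(scores, predicted))  # unexplained

r2 = ssr / sst
print(round(ssr, 3), round(sst, 2), round(r2, 5))
```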



Standard Error of Estimate
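▪ The standard deviation of the variation of observations
around the regression line is estimated by:

$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2}{n-2}}$$

where SSE is the error (residual) sum of squares and n is the sample size.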



Example 1 Standard error of estimate

$S_{YX} = 14.002$
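
This value can be checked by taking the square root of SSE / (n − 2), where SSE is the sum of squared residuals. A minimal Python sketch with the Example 1 data follows; the variable names are illustrative.

```python
import math

# Example 1 data: hours studied (X) and exam score (Y)
hours = [1, 5, 7, 8, 10, 11, 14, 15, 15, 19]
scores = [53, 74, 59, 43, 56, 84, 96, 69, 84, 83]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n
s_xx = sum((x - mean_x) ** 2 for x in hours)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)) / s_xx
b0 = mean_y - b1 * mean_x

# Sum of squared residuals, then divide by the n - 2 degrees of freedom
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
s_yx = math.sqrt(sse / (n - 2))
print(round(s_yx, 3))
```

Relative to exam scores that range from 43 to 96, a typical deviation of about 14 points from the regression line is fairly large.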



Comparing Standard Errors
$S_{YX}$ is a measure of the variation of observed Y values from the
regression line.
The magnitude of $S_{YX}$ should always be judged relative to the
size of the Y values in the sample data.



Inferences about the slope using the t-test
▪ The t-test for a population slope is used to determine if there
is a linear relationship between X and Y.
▪ Null and alternative hypotheses
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)



Example 1 Inferences about the slope
The estimated regression equation is:

𝑆𝑐𝑜𝑟𝑒 = 49.477 + 1.9641 ∗ 𝑋 (ℎ𝑜𝑢𝑟𝑠)

The slope of this model is 1.9641.


Is there a relationship between the study
hours and the student’s exam score?



Example 1 Excel output

$$t = \frac{b_1 - 0}{S_{b_1}} = \frac{1.9641 - 0}{0.8610} = 2.2812$$

$$df = n - 2 = 10 - 2 = 8$$

p-value: =T.DIST.2T(2.281221, 8) = .052



Example 1 Inference about slope

p-value: =T.DIST.2T(2.281221, 8) = .052

• H0: β1 = 0
• H1: β1 ≠ 0

Do not reject the null hypothesis at α = 0.05 since p = .052 > α.

There is insufficient evidence of a linear relationship between study hours and exam scores.
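
The whole slope test can be traced end to end in code. The sketch below is a hedged Python illustration: the standard error of the slope is taken as S_YX / sqrt(S_xx), which is the usual textbook formula and reproduces the 0.8610 shown in the Excel output, and `two_tailed_p` is a hand-rolled numerical approximation (Simpson's rule) rather than Excel's T.DIST.2T.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Approximate two-tailed p-value via Simpson's rule on [0, |t|]."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)

# Example 1 data: hours studied (X) and exam score (Y)
hours = [1, 5, 7, 8, 10, 11, 14, 15, 15, 19]
scores = [53, 74, 59, 43, 56, 84, 96, 69, 84, 83]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n
s_xx = sum((x - mean_x) ** 2 for x in hours)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)) / s_xx
b0 = mean_y - b1 * mean_x

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
s_yx = math.sqrt(sse / (n - 2))      # standard error of estimate
se_b1 = s_yx / math.sqrt(s_xx)       # standard error of the slope

t_stat = (b1 - 0) / se_b1            # H0: beta1 = 0
p_value = two_tailed_p(t_stat, n - 2)

alpha = 0.05                         # assumed significance level
print(round(t_stat, 4), round(p_value, 3), p_value < alpha)
```

Because the p-value is only slightly above 0.05, the decision is sensitive to the chosen significance level and to the small sample of 10 students.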



Checking the assumptions by examining the residuals

Residual Analysis for Linearity: Plot X against residuals

Aside from visually examining the scatter plot of the IV and DV to assess linearity, the
scatter plot of the IV vs the residuals may also be examined. A curved pattern in the
residual plot indicates that the relationship is not linear, in which case another model
should be used.
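
The residuals that go into such a plot are simply the observed values minus the fitted values. As a minimal sketch (plain Python, using the Example 1 data from this lesson and illustrative variable names), they can be listed alongside X and then charted in any plotting tool:

```python
# Example 1 data: hours studied (X) and exam score (Y)
hours = [1, 5, 7, 8, 10, 11, 14, 15, 15, 19]
scores = [53, 74, 59, 43, 56, 84, 96, 69, 84, 83]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n
s_xx = sum((x - mean_x) ** 2 for x in hours)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)) / s_xx
b0 = mean_y - b1 * mean_x

# Residual e_i = observed Y_i minus predicted Y_i
residuals = [y - (b0 + b1 * x) for x, y in zip(hours, scores)]

for x, e in zip(hours, residuals):
    print(f"X = {x:2d}   residual = {e:8.3f}")
```

Least squares residuals always sum to zero, so what matters in the plot is their pattern (curvature, funnel shapes), not their average.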
Checking the assumptions by examining the residuals

Residual Analysis for Equal Variance: Plot X against residuals



Checking the assumptions by examining the residuals

Residual Analysis for Equal Variance: Plot predicted values against residuals



Checking the assumptions by examining the residuals
Residual Analysis for Normality:
1. Examine the Stem-and-Leaf Display of the Residuals
2. Examine the Box-and-Whisker Plot of the Residuals
3. Examine the Histogram of the Residuals
4. Construct a normal probability plot.
5. Construct a Q-Q plot.



Checking the assumptions by examining the residuals

If residuals are normal, the probability plot and the Q-Q plot should be
approximately linear.
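
One way to build the points of a normal Q-Q plot by hand is to pair each sorted residual with the standard normal quantile at its plotting position. The sketch below is illustrative only: the plotting position (i − 0.5)/n is one common convention among several, the helper name `normal_qq_pairs` is hypothetical, and the residuals are recomputed from the Example 1 data.

```python
from statistics import NormalDist

def normal_qq_pairs(residuals):
    """Pair sorted residuals with standard normal quantiles at (i - 0.5) / n."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Example 1 data and residuals from the least squares fit
hours = [1, 5, 7, 8, 10, 11, 14, 15, 15, 19]
scores = [53, 74, 59, 43, 56, 84, 96, 69, 84, 83]
n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n
s_xx = sum((x - mean_x) ** 2 for x in hours)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)) / s_xx
b0 = mean_y - b1 * mean_x
residuals = [y - (b0 + b1 * x) for x, y in zip(hours, scores)]

pairs = normal_qq_pairs(residuals)
for q, e in pairs:
    print(f"normal quantile = {q:6.3f}   residual = {e:8.3f}")
```

Plotting the residual against the normal quantile for each pair gives the Q-Q plot; points lying close to a straight line are consistent with normal residuals.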



Checking the assumptions by examining the residuals
What can we do when residuals are not normal?
1. Consider trimming outliers – but only if they clearly are
mistakes.
2. Can you increase the sample size? If so, it will help assure
asymptotic normality of the estimates.
3. You could try a logarithmic transformation of the variables.
However, this is a new model specification with a different
interpretation of coefficients.
4. You could do nothing, just be aware of the problem.
Checking the assumptions by examining the residuals

Residual Analysis for Independence of Errors: Plot time-series X against residuals

Independence of errors means that the distribution of errors is random and is not
influenced by or correlated with the errors in prior observations.


Clearly, independence can be checked when we know the order in which the
observations were made. The opposite of independence is autocorrelation.



Measuring Autocorrelation
▪ Another way of checking for independence of errors is by
testing the significance of the Durbin-Watson statistic.
▪ The Durbin-Watson statistic detects the presence of
autocorrelation and is used when data are collected over time.
▪ Autocorrelation exists if residuals in one time period are
related to residuals in another period.

▪ The presence of autocorrelation of errors (or residuals)
violates the regression assumption that residuals are
statistically independent.



The Durbin-Watson (DW) Statistic
▪ The DW statistic is used to test for autocorrelation.
H0: residuals are not correlated
H1: autocorrelation is present

$$D = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$

▪ The possible range is 0 ≤ D ≤ 4
▪ D should be close to 2 if H0 is true
▪ D less than 2 may signal positive autocorrelation; D greater than 2 may
signal negative autocorrelation

The value of DW can be obtained from software like SPSS, Gretl and JASP.
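
The statistic can also be computed directly from its definition: the sum of squared differences between successive residuals divided by the sum of squared residuals. The Python sketch below is an illustration only; the function name `durbin_watson` and both residual series are hypothetical, chosen to show a value above 2 and a value below 2.

```python
def durbin_watson(residuals):
    """D = sum_{i>=2} (e_i - e_{i-1})^2 / sum_i e_i^2, residuals in time order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Hypothetical residuals that flip sign every period (negative autocorrelation)
alternating = [1.0, -1.0, 1.0, -1.0]
# Hypothetical residuals that drift slowly (positive autocorrelation)
trending = [2.0, 1.5, 1.0, -1.0, -1.5, -2.0]

dw_alternating = durbin_watson(alternating)
dw_trending = durbin_watson(trending)
print(dw_alternating, round(dw_trending, 3))
```

The function only returns D; judging significance still requires the Durbin-Watson critical values or a software-reported p-value.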
Example 1 Excel output for assessing assumptions
The residual plot shows that the assumptions of linearity and constant
variance are satisfied.

The assumption of normality of residuals is satisfied since the points in the
normal probability plot follow a straight line.



Example 1 in JASP: Correlation Analysis



Example 1 in JASP: Regression analysis with assumption checks



Strategies when performing regression analysis
▪ Start with a scatter plot of Y against X to observe a possible
relationship.
▪ Perform residual analysis to check the assumptions.
▪ Plot the residuals vs X to check for violations of
assumptions such as equal variance.
▪ Use a histogram, stem and leaf display, box and whisker
plot or normal probability plot of the residuals to uncover
possible non-normality.



Strategies when performing regression analysis
▪ If there is any violation of any assumption, use alternative
methods or models.
▪ If there is no evidence of assumption violation, then test for
the significance of the regression coefficients.
▪ Avoid making predictions or forecasts outside the relevant
range.
