Simple Linear Regression & Correlation

• Correlation

• Prediction

• Linear regression

• R-Square

Variables

Univariate Data (one variable)

We want to summarize the values of the variable.

1. Annual incomes of people in a survey.
2. Dividend yield of all the stocks I own.
3. Number of defective items produced by the
   machine each day over a 100-day period.

Bivariate Data (two variables)

We want to summarize the data values of each variable
as well as try to see how the two variables are related.

1. Annual rate of stock price increase & the dividend rate.
2. Early exposure of children to music lessons & their
   future math and science abilities.
3. Tuition rate at colleges and the ranking of colleges
   in terms of "quality".

More than 2 variables

We want to find the inter-relationships between the variables.

EXAMPLE:

Sales and Advertising

Are the variables related?

Is the relationship Direct or Inverse?

Direct (or "increasing"):
An increase in one variable is associated
with an increase in the other variable.

Inverse (or "decreasing"):
An increase in one variable is associated
with a decrease in the other variable.

How strong is the relationship? Can we use one
variable to predict the other?

By quantifying the relationship between sales and
advertising, a manager can decide how much
advertising to undertake.

Typical Statistical Relationships

[Four example scatterplots: Performance vs. Test Score,
Error Rate vs. Practice Time, Sales vs. Advertising, and
Unit Cost vs. Production Rate.]

A measure of the strength of a linear association
between two variables is:

The Correlation Coefficient r

[Six example scatterplots of increasing linear association:
r = 0.0, 0.4, 0.6, 0.8, 0.95, 0.99.]

Interpretation of r

Always: −1 ≤ r ≤ 1
r has no units of measurement.

r near 0: Weak Association
r near ±1: Strong Association

r > 0: Direct Relationship
r < 0: Inverse Relationship

High correlation does Not necessarily imply causality.

Would you expect a +, −, or 0 value for r?

1. Price of a Big Mac and its sales.

2. Height and I.Q.

3. Price of housing and amount of air pollution.

4. Money people paid to have their taxes prepared (x)
   and the $ fines charged by the IRS (y).

EXAMPLE:
The following data were collected on our company's 5
salespeople. We've recorded the number of years of
sales experience for each one, along with the amount
of sales generated last month.

To what extent does Experience affect Sales? In
particular, we would like to predict Sales from
Experience.

Years of      Last Month's Sales
Experience    (in $1000)

16            15
 6             3
12            10
 1             0
10             7

[Scatterplot of the 5 data points plus a trendline.]

Computing the Correlation Coefficient

Equivalent formulas for r:

    r = \frac{\frac{1}{n-1}\sum(X - \bar{X})(Y - \bar{Y})}{S_X S_Y}     (1)

where S_X and S_Y are the standard deviations of X and Y:

    S_X = \sqrt{\frac{\sum(X - \bar{X})^2}{n-1}}    and    S_Y = \sqrt{\frac{\sum(Y - \bar{Y})^2}{n-1}}

and n = the number of pairs of data values.

The numerator of (1) is the Covariance. This form is
commonly used in Finance. You might see it written:

    r = \frac{\mathrm{cov}(X, Y)}{S_X S_Y}     (2)

The simplest formula for computations is:

    r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\sum X^2 - n\bar{X}^2}\,\sqrt{\sum Y^2 - n\bar{Y}^2}}     (3)

(a) It doesn't matter which variable we call X and which Y.

(b) The value of r is not affected by changing the units of
    measurement of the X and/or Y variables.
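To see formula (3) in action, here is a minimal Python sketch (NumPy) that
computes r for the Sales data introduced in the next example and cross-checks
it against NumPy's built-in correlation:

    import numpy as np

    # Years of experience (X) and last month's sales in $1000 (Y)
    # for the 5 salespeople in the example.
    x = np.array([16, 6, 12, 1, 10], dtype=float)
    y = np.array([15, 3, 10, 0, 7], dtype=float)

    n = len(x)
    xbar, ybar = x.mean(), y.mean()

    # Computational formula (3)
    num = np.sum(x * y) - n * xbar * ybar
    den = np.sqrt(np.sum(x**2) - n * xbar**2) * np.sqrt(np.sum(y**2) - n * ybar**2)
    r = num / den
    print(r)                        # about 0.985

    # Cross-check against NumPy's built-in correlation
    print(np.corrcoef(x, y)[0, 1])  # same value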

Example:
For the Sales problem, find r.

 X    Y    X²    Y²    XY
16   15
 6    3
12   10
 1    0
10    7
Sum

Prediction

A typical scatter plot:

[Scatterplot for BANA 303: Course Grade (y-axis, 20-100)
vs. Exam 3 score (x-axis, 30-100).]

Draw a trendline through the “middle” of the data.

Use the trendline to give a prediction of the Course
Grade for someone who scored a 65 on Exam 3.
(Estimate using the line.)

The Linear Regression Model

In the population, we assume there is a linear
relationship between X and Y. Due to "other variables,"
all the data points do not lie on a straight line.

The descriptive model we use is:

    Y = \beta_0 + \beta_1 X + \varepsilon

where the error term ε has a normal distribution with
mean 0 and unknown standard deviation σ.

Note: the Greek letters β₀, β₁ are population parameters and ε is a
random variable. This equation is not used for computation.

Unlike correlation, we now make a distinction between
the two variables:

X is the variable that is used to predict Y.

X = Independent (Predictor) Variable
Y = Dependent (Response) Variable
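To make the model concrete, here is a small simulation sketch; all parameter
values (β₀ = 2, β₁ = 1.5, σ = 3) are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    beta0, beta1, sigma = 2.0, 1.5, 3.0    # hypothetical population parameters
    x = rng.uniform(0, 20, size=50)        # predictor values
    eps = rng.normal(0, sigma, size=50)    # error term: normal, mean 0, sd sigma
    y = beta0 + beta1 * x + eps            # responses scattered around the line

    # Because of eps, the points scatter around the true line y = 2 + 1.5x,
    # which is why a fitted line will not pass through every point.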

Regression Analysis

Is there one line that "best" fits the data?

We will choose the line that minimizes the sum of
squared prediction errors.

This leads to one particular line (called the least
squares line, or best fit line, or regression line).

SS(Residuals)

A typical (X, Y) pair is a point on the scatterplot. Using
any value of X, we can predict a value of Y by using
the regression equation:

    \hat{Y} = b_0 + b_1 X

and then compare to the actual Y value, giving Y − Ŷ
as the prediction error:

    e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)

The values Y − Ŷ (i.e., the e_i) are called the residuals. Each of
these values is a vertical distance on the plot.

[Scatterplot of a data set with r = 0.75, with the residuals
shown as vertical distances from the fitted line.]

    \sum(Y - \hat{Y})^2

is the sum of squared errors (SSE), or the SS(Residuals),
over the data set. It is a measure of how close the line is
to the scatterplot.

The Regression Line is the line that minimizes this error
measure over all possible lines, i.e.,

    \min_{b_0, b_1} \sum_i e_i^2

The Standard Error

The value of \sum(Y - \hat{Y})^2 can be anything, and it
doesn't mean much by itself. Instead, we can compute the
Root Mean Square Error (RMSE), or Standard Error:

    Standard Error = s = \sqrt{\frac{\sum(Y - \hat{Y})^2}{n-2}}

    RMSE = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{e_1^2 + e_2^2 + \cdots + e_n^2}{n-2}}

The Standard Error (or RMSE) s is an estimate of σ, the
standard deviation of the data around the regression line.

Smaller values indicate more accurate predictions.

We can use the rule of thumb:

The prediction accuracy of the regression line is
about \hat{Y} \pm 2s.

The units of measurement of the Standard Error are
the same as the units of the Y variable.
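Continuing the NumPy sketch from earlier, the Standard Error for the Sales
data can be computed as follows (np.polyfit is used here as a stand-in for
the slope/intercept formulas given on the next page):

    import numpy as np

    x = np.array([16, 6, 12, 1, 10], dtype=float)
    y = np.array([15, 3, 10, 0, 7], dtype=float)

    b1, b0 = np.polyfit(x, y, 1)       # least squares slope and intercept
    resid = y - (b0 + b1 * x)          # residuals e_i = y_i - yhat_i

    sse = np.sum(resid**2)             # SS(Residuals)
    s = np.sqrt(sse / (len(x) - 2))    # Standard Error (RMSE)
    print(s)                           # about 1.15, matching the printout later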

The equation of this estimated Regression Line:

    \hat{Y} = b_0 + b_1 X

where b_0 and b_1 are computed from:

    b_1 = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}

and

    b_0 = \bar{Y} - b_1\bar{X}

Useful properties of the OLS linear fit

• Intercept: the regression line passes through (x̄, ȳ)

• Slope: Cov(X, Y) determines the direction of the line

• The sum of the residuals around the best fitted line is
  zero: \sum_i e_i = 0

Example:
For the Sales problem, find the regression equation.
(Use the table of computations we used for r.)

Note that we must make sure we have selected the
correct variable as X and as Y.

1. If an employee has 10 years of experience, what will
   be the predicted monthly sales?

2. If an employee is a new hire out of college with no
   prior sales experience, what is the predicted
   monthly sales?
   Interpret your result.
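A short computational sketch of these two questions, using the formulas above:

    import numpy as np

    x = np.array([16, 6, 12, 1, 10], dtype=float)   # years of experience
    y = np.array([15, 3, 10, 0, 7], dtype=float)    # sales in $1000

    n, xbar, ybar = len(x), x.mean(), y.mean()
    b1 = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x**2) - n * xbar**2)
    b0 = ybar - b1 * xbar
    print(b0, b1)        # about -2.068 and 1.008 (compare the printout below)

    print(b0 + b1 * 10)  # question 1: about 8.0, i.e., roughly $8,000
    print(b0 + b1 * 0)   # question 2: about -2.1 -- a negative prediction;
                         # sales can't be negative, so the intercept should
                         # not be taken literally for a brand-new hire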

Software Notes

• For creating a correlation matrix
  – Analyze → Multivariate Methods → Multivariate
  – Select all the columns for which you want the correlations
  – Click red triangle → Covariance Matrix

• Fit Y by X (for Regression)
  – Select the Response and Predictor variables to get a scatterplot
    of the data
  – Click red triangle → Fit Line, to get the regression line along with
    the estimated equation and other info
  – Click the red triangle next to Linear Fit, and
    • Click Plot Residuals
    • Click Save Residuals / Fitted values to add columns to the data table

Regression Printout

SUMMARY OUTPUT Sales in $1000's, Experience in Years

Regression Statistics
R Square 0.9711
Adjusted R Square 0.9614
Standard Error 1.1536
Observations 5

ANOVA
df SS MS F Significance F
Regression 1 134.0076 134.0076 100.6964 0.0021
Residual 3 3.9924 1.3308
Total 4 138

            Coefficients   Standard Error   t Stat     P-value
Intercept   -2.0682        1.0406           -1.9875    0.1410
Years        1.0076        0.1004           10.0348    0.0021

Note:
The slope b₁ and the correlation r will always have the
same sign: +, −, or 0,

because of the formula b_1 = r \frac{s_Y}{s_X}, which we'll use later in the chapter.

R-Square

The square of the correlation is called R-Square, or r².

Always: 0 ≤ r² ≤ 1. If r² is "near" 1, it means that the
regression line is a good fit to the data.

In addition to being the square of r, R-Square has
another, even more important interpretation:

The Percentage of Variance "Accounted For" or
"Explained"
Example:
In our data set, r² = 0.9711.

So we have "explained" 97.11% of the variance of Y
(Monthly Sales) by knowing X (Years of Experience).
There may be many reasons why Sales vary from
person to person, but by taking into account their
Experience (via the regression equation), we are able
to understand most of the individual variation in Y. In
fact, we see that we can account for 97.11% of this
variation.

From the output, note that the SS or Sum of Squares
can be written:

    SS(Total) = SS(Regression) + SS(Residual)
    (SST = SSR + SSE)

where:

    SSR = SS(Regression) = "Explained SS" = \sum(\hat{Y} - \bar{Y})^2

    SSE = SS(Residual) = "Unexplained SS" = \sum(Y - \hat{Y})^2

    SST = SS(Total) = Total SS = \sum(Y - \bar{Y})^2

The SST term does not depend on the X variable and how
well it predicts Y. Note also that

    Var(Y) = SST / (n − 1)

Another way to compute R-Square:

    r^2 = \frac{SS(Regression)}{SS(Total)}

or

    r^2 = 1 - \frac{SS(Residual)}{SS(Total)}
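A quick numeric check in Python, reproducing the ANOVA sums of squares on
the printout:

    import numpy as np

    x = np.array([16, 6, 12, 1, 10], dtype=float)
    y = np.array([15, 3, 10, 0, 7], dtype=float)

    b1, b0 = np.polyfit(x, y, 1)
    yhat = b0 + b1 * x

    sst = np.sum((y - y.mean())**2)     # 138.0       (Total SS)
    ssr = np.sum((yhat - y.mean())**2)  # about 134.0 (Explained SS)
    sse = np.sum((y - yhat)**2)         # about 4.0   (Unexplained SS)

    print(ssr / sst)      # R-Square, about 0.9711
    print(1 - sse / sst)  # the same value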

Computing b0 and b1 from r

If we know 5 summary measures about our data:

    X̄, Ȳ, s_X, s_Y, and r

then we can compute the regression line...

    b_1 = r \frac{s_Y}{s_X}

    b_0 = \bar{Y} - b_1\bar{X}

where
    r = correlation between X and Y
    s_Y = standard deviation of the Y values
    s_X = standard deviation of the X values

1. The equation for b₁ verifies a previous statement:
   b₁ and r will always have the same sign.

2. The b₁ equation shows that slope and correlation
   both measure the association between X and Y, but
   in different ways.

Example:
A student scores 55% on test #1 in Finance. Predict
his overall grade using regression based on last
semester's data.

           Mean    Standard Deviation
Test #1    70.1    14.6
Overall    73.0    11.4

The correlation was 0.75 between Test #1 and the
Overall Grade.

Solution:
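A worked sketch using the formulas from the previous page:

    b₁ = r (s_Y / s_X) = 0.75 × (11.4 / 14.6) ≈ 0.586
    b₀ = Ȳ − b₁X̄ = 73.0 − 0.586 × 70.1 ≈ 31.9

    ŷ = 31.9 + 0.586 × 55 ≈ 64.1

So the predicted overall grade is about 64%. Notice the prediction is pulled
toward the class mean of 73: because r < 1, a student 1 standard deviation
below the mean on Test #1 is predicted to be only 0.75 standard deviations
below the mean overall.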

Sampling Distributions for Intercept and Slope

• If certain assumptions hold, we can find the distributions of b₀ and b₁:

    \frac{b_0 - \beta_0}{SE(b_0)} \sim T_{n-2}    and    \frac{b_1 - \beta_1}{SE(b_1)} \sim T_{n-2}

• We can use these distributions for making inferences about the
  relationship between X and Y in the population:
  • Confidence intervals
  • Hypothesis tests

• We can also use these distributions to construct prediction intervals for
  values of the response variable (y) for a given value of the predictor
  variable (x)
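As an illustration, a sketch of the slope t-statistic for the Sales printout
(b₁ = 1.0076, SE(b₁) = 0.1004, n = 5), using SciPy's t distribution:

    from scipy import stats

    b1, se_b1, n = 1.0076, 0.1004, 5

    t_stat = (b1 - 0) / se_b1                        # tests H0: beta1 = 0
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
    print(t_stat, p_value)                           # about 10.03 and 0.0021,
                                                     # matching the printout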

Question:
How do we know if the regression line is useful?

1. If the r or r² for a problem is "too small" or very
   close to 0, it indicates that the x variable may not
   really be related to y.

   H0: population correlation ρ = 0
   Ha: population correlation ρ ≠ 0

2. If the slope of the regression line is close to 0, then
   the predicted y value is approximately the same for
   any value of x. So knowing x is not much help in
   predicting y.

   H0: population slope β₁ = 0
   Ha: population slope β₁ ≠ 0

These are equivalent hypothesis tests.

continued...

For this test we can look at the p-value on the
computer printout for the slope. With a small p-value
we have more confidence that the regression model
on the printout is valid and not just sampling error.
(We will use α = 0.05.)

If the p-value < 0.05, then:

We reject H0, and so the population correlation
between X and Y is ≠ 0.
This means X and Y are really related -- it's not just
luck.

3. Even if the p-value < 0.05, we should examine
   r². If it is too small (maybe < 30%), then it means
   that the x variable might be related to y but is not a
   very good predictor of y, since a lot of the
   variability of y is not related to x.

So a good predictor x has:

    p-value < 0.05 and r² ≥ 0.30

Inference (II): Confidence Intervals

• We can use the point estimate b₁ and SE(b₁)
  to construct a confidence interval for the
  slope parameter.
• Given a confidence level (1 − α), the
  corresponding interval is given by

    b_1 \pm t_{\alpha/2,\,n-2} \cdot SE(b_1)

  (The interval is available directly from the JMP output.)
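For the Sales data, a sketch of the 95% interval for the slope (printout
values; t quantile from SciPy):

    from scipy import stats

    b1, se_b1, n = 1.0076, 0.1004, 5
    alpha = 0.05

    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)    # about 3.182 for df = 3
    print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # roughly (0.69, 1.33)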

Things to Worry About

1. On a scatterplot of the values, note if any of the points
   look like outliers. This may indicate either data
   entry errors or cases that are not typical of the data
   set (and could be deleted). An outlier may radically
   change r and the regression line of a small data set.

2. Correlation is not causation.

3. If the population model is non-linear (e.g., a
   parabola), then computing r or using a regression
   line may give misleading results.

4. Be careful when using the regression line to
   extrapolate beyond the range of observed data.

Example:
In finance, the risk of a stock is measured by its "beta
value." This is the slope in a linear regression model[1]

    E(Y) = \beta_0 + \beta_1 X

which relates the rate of return[2] (Y) of a specific stock
to the rate of return for the overall stock market (X),
namely the S&P 500 index. This regression model
measures the relationship between % changes in the
price of a specific stock relative to % changes in the
overall stock market.

The risk of a stock is related to (i) the outlook for the
stock market or overall economy and also to (ii) the
specific prospects for the company[3]. The beta value, or
β₁, measures only the risk of a stock due to (i).

Terminology:
Beta is also described as measuring a stock's:
-- correlated relative volatility
-- non-diversifiable risk
-- systematic risk
-- market risk

[1] This is stated more precisely using the current risk-free interest rate: subtract this interest rate from the data
(weekly returns) for both X and Y. If that rate is roughly constant over the time period we are analyzing, then the
regression model as given above is essentially the same. Also, the intercept in this model should be 0 if the market
is efficient.
[2] Return = (dividends received + change in price) / (price at the start of the day, week, or month). This is the % change.
[3] The risk specific to a company is not of major importance since it can be reduced through diversification. The risk
measured by β₁ is more serious since it combines with the similar risk of the other stocks in our portfolio.

Example:

Weekly closing prices and % changes for Apple
Computer and the S&P500 Index:

Date       S&P500    S&P500 % Change    Apple     Apple % Change
1/4/10 1144.98 211.98
1/11/10 1136.03 -0.0078 205.93 -0.0285
1/19/10 1091.76 -0.0390 197.75 -0.0397
1/25/10 1073.87 -0.0164 192.06 -0.0288
2/1/10 1066.19 -0.0072 195.46 0.0177
2/8/10 1075.51 0.0087 200.38 0.0252
2/16/10 1109.17 0.0313 201.67 0.0064
2/22/10 1104.49 -0.0042 204.62 0.0146
3/1/10 1138.70 0.0310 218.95 0.0700
3/8/10 1149.99 0.0099 226.60 0.0349
3/15/10 1159.90 0.0086 222.25 -0.0192
3/22/10 1166.59 0.0058 230.90 0.0389
3/29/10 1178.10 0.0099 235.97 0.0220
4/5/10 1194.37 0.0138 241.79 0.0247
4/12/10 1192.13 -0.0019 247.40 0.0232
4/19/10 1217.28 0.0211 270.83 0.0947
4/26/10 1186.69 -0.0251 261.09 -0.0360
5/3/10 1110.88 -0.0639 235.86 -0.0966
5/10/10 1135.68 0.0223 253.82 0.0761
5/17/10 1087.69 -0.0423 242.32 -0.0453
5/24/10 1089.41 0.0016 256.88 0.0601
6/1/10 1064.88 -0.0225 255.96 -0.0036
6/7/10 1091.60 0.0251 253.51 -0.0096
6/14/10 1117.51 0.0237 274.07 0.0811
6/21/10 1076.76 -0.0365 266.70 -0.0269
6/28/10 1022.58 -0.0503 246.94 -0.0741

Mean                 -0.004              0.007
Std Dev               0.027              0.049

No dividends were paid during this time period; otherwise we would have had
to adjust the prices, but finance.yahoo.com does that for us automatically.
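The % change columns can be reproduced from the closing prices (see footnote 2);
a minimal sketch for the first few S&P500 weeks:

    import numpy as np

    # First five S&P500 weekly closes from the table above
    close = np.array([1144.98, 1136.03, 1091.76, 1073.87, 1066.19])

    pct_change = close[1:] / close[:-1] - 1
    print(np.round(pct_change, 4))  # [-0.0078 -0.039 -0.0164 -0.0072], as in the table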

[Scatterplot, Jan-June 2010: Apple % Weekly Change (y-axis,
-0.15 to 0.15) vs. S&P500 % Weekly Change (x-axis, -0.08 to 0.04).]

The slope of the regression line is positive. This is
surprising, since over this time period the S&P500
decreased while Apple increased (see the means on the
previous page).

Regression Statistics
Multiple R 0.8238
R Square 0.6787
Adjusted R Square 0.6647
Standard Error 0.0281
Observations 25

ANOVA
df SS MS F Significance F
Regression 1 0.03848 0.03848 48.58683 4.18706E-07
Residual 23 0.01822 0.00079
Total 24 0.05670

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept   0.0135         0.0057           2.3729    0.0264    0.0017      0.0253
S&P500      1.5045         0.2158           6.9704    0.0000    1.0580      1.9510

Write out the regression equation.

What is the beta for Apple?
(It was listed as 1.43 on finance.yahoo.com, based on additional data.)

Slope β > 1: aggressive stock
Slope β < 1: conservative stock
Slope β = 1: average (equal to the S&P500)
Slope β = 0: uncorrelated with the S&P500
Slope β < 0: moves opposite to the S&P500

A stock with a slope of 2 would tend to increase twice as fast as the S&P 500 and
decrease twice as fast. A stock with a slope of only 0.50 would tend to increase or
decrease only 1/2 as fast as the S&P 500.
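As a check, a sketch estimating Apple's beta directly from the 25 pairs of
weekly % changes in the table:

    import numpy as np

    sp500 = np.array([-0.0078, -0.0390, -0.0164, -0.0072,  0.0087,
                       0.0313, -0.0042,  0.0310,  0.0099,  0.0086,
                       0.0058,  0.0099,  0.0138, -0.0019,  0.0211,
                      -0.0251, -0.0639,  0.0223, -0.0423,  0.0016,
                      -0.0225,  0.0251,  0.0237, -0.0365, -0.0503])
    apple = np.array([-0.0285, -0.0397, -0.0288,  0.0177,  0.0252,
                       0.0064,  0.0146,  0.0700,  0.0349, -0.0192,
                       0.0389,  0.0220,  0.0247,  0.0232,  0.0947,
                      -0.0360, -0.0966,  0.0761, -0.0453,  0.0601,
                      -0.0036, -0.0096,  0.0811, -0.0269, -0.0741])

    beta, intercept = np.polyfit(sp500, apple, 1)
    print(beta, intercept)   # about 1.50 and 0.013, matching the printout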

continued…

What is the correlation between the % weekly price
changes of Apple and the S&P500?

Note: This correlation r measures how a stock and the S&P500
tend to move together. But so does b₁. To see the difference,
suppose a stock's weekly % price change was always exactly
25% of the weekly % price change of the S&P500. Then
b₁ = 0.25 while r = 1.0.

Recall the relationship between slope and correlation:

    b_1 = r \frac{s_{Apple}}{s_{S&P500}}

From the data set we can compute the standard
deviations for the 2 columns of % changes:

    s_Apple = 0.049 and s_S&P500 = 0.027

s_Apple is a measure of the volatility of Apple stock without
relating it to the overall market[4].

[4] A weekly (or daily) s is usually converted to yearly data and then stated as a %. The annual volatility:
    Apple: 0.049 × √52 = 35.3%
    S&P500: 0.027 × √52 = 19.5%

continued…

The relative volatility of Apple is

    s_Apple / s_S&P500 = 0.049 / 0.027 = 1.815,

i.e., 81.5% higher than the S&P500.

Then b₁ = 1.815 r, and this is why the beta for Apple is
much larger than its correlation.

Note: A stock can be very volatile, like Apple, but if r
is close to 0, then its beta (slope) is close to 0. So a
stock's beta only measures the risk that is related to
market fluctuations.

Everything else equal, an investor likes stocks with
low betas below 1 -- even better near 0, or better yet
negative. This is useful for building low-risk portfolios.

Here are some more examples (as of summer 2010):

              Relative
              Volatility     r       Beta
Ford          2.18           0.75    1.63
Apple         1.83           0.82    1.50
Google        1.29           0.66    0.86
Coca-Cola     0.87           0.55    0.48
Kellogg       0.70           0.52    0.36
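A small sketch verifying that Beta ≈ (Relative Volatility) × r for the table
above (small rounding differences expected):

    import numpy as np

    rel_vol = np.array([2.18, 1.83, 1.29, 0.87, 0.70])  # Ford ... Kellogg
    r       = np.array([0.75, 0.82, 0.66, 0.55, 0.52])

    print(np.round(rel_vol * r, 2))  # [1.64 1.5  0.85 0.48 0.36] -- close to
                                     # the Beta column, up to rounding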

Example:
Here is a data set consisting of the number of people
who play golf in this country and the number of
missed days at work in the entire U.S. due to reported
injury or illness.

Year    Golfers       Missed Days
1960     4,400,000    370,000,000
1965     7,750,000    400,000,000
1970     9,700,000    417,000,000
1975    12,036,000    433,000,000
1980    13,000,000    485,000,000
1985    14,700,000    500,000,000

Predict the number of Missed Days at work as a
function of the number of Golfers.

Solution:
A fitted regression line gives...

    \hat{y} = 304,525,000 + 12,600X

with r² = 0.91.

How do you interpret such a high r² -- as cause &
effect?

Example:
X    Y
1    5
2    2
3    1
4    2
5    5

(a) Calculate the correlation coefficient r. On the basis
    of r, does it seem that X and Y are related in the
    population?

(b) Plot the 5 data points. Explain the result of (a) now
    that you have seen the plot.

[Blank axes for plotting: X from 0 to 6.]
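For part (a), a quick check in Python (a sketch, not a substitute for the
hand computation):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([5, 2, 1, 2, 5], dtype=float)

    print(np.corrcoef(x, y)[0, 1])  # 0.0 -- yet the plot shows a strong
                                    # (parabolic) pattern, so r only
                                    # detects *linear* association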
