Simple Linear Regression & Correlation

• Correlation

• Prediction

• Linear regression

• R-Square

Variables

Univariate Data (one variable)

We want to summarize the values of the variable.

1. Annual incomes of people in a survey.
2. Dividend yield of all the stocks I own.
3. Number of defective items produced by the
   machine each day over a 100-day period.

Bivariate Data (two variables)

We want to summarize the data values of each variable
as well as try to see how the two variables are related.

1. Annual rate of stock price increase & the dividend rate.
2. Early exposure of children to music lessons & their
   future math and science abilities.
3. Tuition rate at colleges and the ranking of colleges
   in terms of "quality".

More than 2 variables

We want to find the inter-relationships between the variables.

EXAMPLE:

Sales and Advertising

Are the variables related?

Is the relationship Direct or Inverse?

Direct (or "increasing"):
An increase in one variable is associated
with an increase in the other variable.

Inverse (or "decreasing"):
An increase in one variable is associated
with a decrease in the other variable.

How strong is the relationship? Can we use one
variable to predict the other?

By quantifying the relationship between sales and
advertising, a manager can decide how much
advertising to undertake.

Typical Statistical Relationships

[Four example scatterplots: Performance vs. Test Score,
Error Rate vs. Practice Time, Sales vs. Advertising, and
Unit Cost vs. Production Rate.]

A measure of the strength of a linear association
between two variables is:

The Correlation Coefficient r

[Six example scatterplots of increasing linear association:
r = 0.0, 0.4, 0.6, 0.8, 0.95, 0.99.]

Interpretation of r

Always: −1 ≤ r ≤ 1
r has no units of measurement.

r near 0: Weak Association
r near ±1: Strong Association

r > 0: Direct Relationship
r < 0: Inverse Relationship

High correlation does Not necessarily imply causality.

Would you expect a +, −, or 0 value for r?

1. Price of a Big Mac and its sales.

2. Height and I.Q.

3. Price of housing and amount of air pollution.

4. Money people paid to have their taxes prepared (x)
   and the $ fines charged by the IRS (y).

EXAMPLE:
The following data were collected on our company's 5
salespeople. We've recorded the number of years of
sales experience for each one, along with the amount
of sales generated last month.

To what extent does Experience affect Sales? In
particular, we would like to predict Sales from
Experience.

Years of      Last Month's Sales
Experience    (in $1000)

16            15
 6             3
12            10
 1             0
10             7

[Scatterplot of the 5 data points plus a trendline.]

Computing the Correlation Coefficient

Equivalent formulas for r:

    r = \frac{\frac{1}{n-1}\sum(X - \bar{X})(Y - \bar{Y})}{S_X S_Y}     (1)

where S_X and S_Y are the standard deviations of X and Y:

    S_X = \sqrt{\frac{\sum(X - \bar{X})^2}{n-1}}    and    S_Y = \sqrt{\frac{\sum(Y - \bar{Y})^2}{n-1}}

and n = the number of pairs of data values.

The numerator of (1) is the Covariance. This form is
commonly used in Finance. You might see it written:

    r = \frac{\mathrm{cov}(X, Y)}{S_X S_Y}     (2)

The simplest formula for computations is:

    r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\sum X^2 - n\bar{X}^2}\,\sqrt{\sum Y^2 - n\bar{Y}^2}}     (3)

(a) It doesn't matter which variable we call X and which Y.

(b) The value of r is not affected by changing the units of
    measurement of the X and/or Y variables.
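To see formula (3) in action, here is a minimal Python sketch (NumPy) that
computes r for the Sales data introduced in the next example and cross-checks
it against NumPy's built-in correlation:

    import numpy as np

    # Years of experience (X) and last month's sales in $1000 (Y)
    # for the 5 salespeople in the example.
    x = np.array([16, 6, 12, 1, 10], dtype=float)
    y = np.array([15, 3, 10, 0, 7], dtype=float)

    n = len(x)
    xbar, ybar = x.mean(), y.mean()

    # Computational formula (3)
    num = np.sum(x * y) - n * xbar * ybar
    den = np.sqrt(np.sum(x**2) - n * xbar**2) * np.sqrt(np.sum(y**2) - n * ybar**2)
    r = num / den
    print(r)                        # about 0.985

    # Cross-check against NumPy's built-in correlation
    print(np.corrcoef(x, y)[0, 1])  # same value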

Example:
For the Sales problem, find r.

 X    Y    X²    Y²    XY
16   15
 6    3
12   10
 1    0
10    7
Sum

Prediction

A typical scatter plot:

[Scatterplot for BANA 303: Course Grade (y-axis, 20-100)
vs. Exam 3 score (x-axis, 30-100).]

Draw a trendline through the “middle” of the data.

Use the trendline to give a prediction of the Course
Grade for someone who scored a 65 on Exam 3.
(Estimate using the line.)

The Linear Regression Model

In the population, we assume there is a linear
relationship between X and Y. Due to "other variables,"
all the data points do not lie on a straight line.

The descriptive model we use is:

    Y = \beta_0 + \beta_1 X + \varepsilon

where the error term ε has a normal distribution with
mean 0 and unknown standard deviation σ.

Note: the Greek letters β₀, β₁ are population parameters and ε is a
random variable. This equation is not used for computation.

Unlike correlation, we now make a distinction between
the two variables:

X is the variable that is used to predict Y.

X = Independent (Predictor) Variable
Y = Dependent (Response) Variable
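To make the model concrete, here is a small simulation sketch; all parameter
values (β₀ = 2, β₁ = 1.5, σ = 3) are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    beta0, beta1, sigma = 2.0, 1.5, 3.0    # hypothetical population parameters
    x = rng.uniform(0, 20, size=50)        # predictor values
    eps = rng.normal(0, sigma, size=50)    # error term: normal, mean 0, sd sigma
    y = beta0 + beta1 * x + eps            # responses scattered around the line

    # Because of eps, the points scatter around the true line y = 2 + 1.5x,
    # which is why a fitted line will not pass through every point.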

Regression Analysis

Is there one line that "best" fits the data?

We will choose the line that minimizes the sum of
squared prediction errors.

This leads to one particular line (called the least
squares line, or best fit line, or regression line).

SS(Residuals)

A typical (X, Y) pair is a point on the scatterplot. Using
any value of X, we can predict a value of Y by using
the regression equation:

    \hat{Y} = b_0 + b_1 X

and then compare to the actual Y value, giving Y − Ŷ
as the prediction error:

    e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)

The values Y − Ŷ (i.e., the e_i) are called the residuals. Each of
these values is a vertical distance on the plot.

[Scatterplot of a data set with r = 0.75, with the residuals
shown as vertical distances from the fitted line.]

    \sum(Y - \hat{Y})^2

is the sum of squared errors (SSE), or the SS(Residuals),
over the data set. It is a measure of how close the line is
to the scatterplot.

The Regression Line is the line that minimizes this error
measure over all possible lines, i.e.,

    \min_{b_0, b_1} \sum_i e_i^2

The Standard Error

The value of \sum(Y - \hat{Y})^2 can be anything, and it
doesn't mean much by itself. Instead, we can compute the
Root Mean Square Error (RMSE), or Standard Error:

    Standard Error = s = \sqrt{\frac{\sum(Y - \hat{Y})^2}{n-2}}

    RMSE = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{e_1^2 + e_2^2 + \cdots + e_n^2}{n-2}}

The Standard Error (or RMSE) s is an estimate of σ, the
standard deviation of the data around the regression line.

Smaller values indicate more accurate predictions.

We can use the rule of thumb:

The prediction accuracy of the regression line is
about \hat{Y} \pm 2s.

The units of measurement of the Standard Error are
the same as the units of the Y variable.
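Continuing the NumPy sketch from earlier, the Standard Error for the Sales
data can be computed as follows (np.polyfit is used here as a stand-in for
the slope/intercept formulas given on the next page):

    import numpy as np

    x = np.array([16, 6, 12, 1, 10], dtype=float)
    y = np.array([15, 3, 10, 0, 7], dtype=float)

    b1, b0 = np.polyfit(x, y, 1)       # least squares slope and intercept
    resid = y - (b0 + b1 * x)          # residuals e_i = y_i - yhat_i

    sse = np.sum(resid**2)             # SS(Residuals)
    s = np.sqrt(sse / (len(x) - 2))    # Standard Error (RMSE)
    print(s)                           # about 1.15, matching the printout later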

The equation of this estimated Regression Line:

    \hat{Y} = b_0 + b_1 X

where b_0 and b_1 are computed from:

    b_1 = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}

and

    b_0 = \bar{Y} - b_1\bar{X}

Useful properties of the OLS linear fit

• Intercept: the regression line passes through (x̄, ȳ)

• Slope: Cov(X, Y) determines the direction of the line

• The sum of the residuals around the best fitted line is
  zero: \sum_i e_i = 0

Example:
For the Sales problem, find the regression equation.
(Use the table of computations we used for r.)

Note that we must make sure we have selected the
correct variable as X and as Y.

1. If an employee has 10 years of experience, what will
   be the predicted monthly sales?

2. If an employee is a new hire out of college with no
   prior sales experience, what is the predicted
   monthly sales?
   Interpret your result.
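A short computational sketch of these two questions, using the formulas above:

    import numpy as np

    x = np.array([16, 6, 12, 1, 10], dtype=float)   # years of experience
    y = np.array([15, 3, 10, 0, 7], dtype=float)    # sales in $1000

    n, xbar, ybar = len(x), x.mean(), y.mean()
    b1 = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x**2) - n * xbar**2)
    b0 = ybar - b1 * xbar
    print(b0, b1)        # about -2.068 and 1.008 (compare the printout below)

    print(b0 + b1 * 10)  # question 1: about 8.0, i.e., roughly $8,000
    print(b0 + b1 * 0)   # question 2: about -2.1 -- a negative prediction;
                         # sales can't be negative, so the intercept should
                         # not be taken literally for a brand-new hire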

Software Notes

• For creating a correlation matrix
  – Analyze → Multivariate Methods → Multivariate
  – Select all the columns for which you want the correlations
  – Click red triangle → Covariance Matrix

• Fit Y by X (for Regression)
  – Select the Response and Predictor variables to get a scatterplot
    of the data
  – Click red triangle → Fit Line, to get the regression line along with
    the estimated equation and other info
  – Click the red triangle next to Linear Fit, and
    • Click Plot Residuals
    • Click Save Residuals / Fitted values to add columns to the data table

Regression Printout

SUMMARY OUTPUT Sales in $1000's, Experience in Years

Regression Statistics
R Square 0.9711
Adjusted R Square 0.9614
Standard Error 1.1536
Observations 5

ANOVA
df SS MS F Significance F
Regression 1 134.0076 134.0076 100.6964 0.0021
Residual 3 3.9924 1.3308
Total 4 138

            Coefficients   Standard Error   t Stat     P-value
Intercept   -2.0682        1.0406           -1.9875    0.1410
Years        1.0076        0.1004           10.0348    0.0021

Note:
The slope b₁ and the correlation r will always have the
same sign: +, −, or 0,

because of the formula b_1 = r \frac{s_Y}{s_X}, which we'll use later in the chapter.

R-Square

The square of the correlation is called R-Square, or r².

Always: 0 ≤ r² ≤ 1. If r² is "near" 1, it means that the
regression line is a good fit to the data.

In addition to being the square of r, R-Square has
another, even more important interpretation:

The Percentage of Variance "Accounted For" or
"Explained"
Example:
In our data set, r² = 0.9711.

So we have "explained" 97.11% of the variance of Y
(Monthly Sales) by knowing X (Years of Experience).
There may be many reasons why Sales vary from
person to person, but by taking into account their
Experience (via the regression equation), we are able
to understand most of the individual variation in Y. In
fact, we see that we can account for 97.11% of this
variation.

From the output, note that the SS or Sum of Squares
can be written:

    SS(Total) = SS(Regression) + SS(Residual)
    (SST = SSR + SSE)

where:

    SSR = SS(Regression) = "Explained SS" = \sum(\hat{Y} - \bar{Y})^2

    SSE = SS(Residual) = "Unexplained SS" = \sum(Y - \hat{Y})^2

    SST = SS(Total) = Total SS = \sum(Y - \bar{Y})^2

The SST term does not depend on the X variable and how
well it predicts Y. Note also that

    Var(Y) = SST / (n − 1)

Another way to compute R-Square:

    r^2 = \frac{SS(Regression)}{SS(Total)}

or

    r^2 = 1 - \frac{SS(Residual)}{SS(Total)}
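A quick numeric check in Python, reproducing the ANOVA sums of squares on
the printout:

    import numpy as np

    x = np.array([16, 6, 12, 1, 10], dtype=float)
    y = np.array([15, 3, 10, 0, 7], dtype=float)

    b1, b0 = np.polyfit(x, y, 1)
    yhat = b0 + b1 * x

    sst = np.sum((y - y.mean())**2)     # 138.0       (Total SS)
    ssr = np.sum((yhat - y.mean())**2)  # about 134.0 (Explained SS)
    sse = np.sum((y - yhat)**2)         # about 4.0   (Unexplained SS)

    print(ssr / sst)      # R-Square, about 0.9711
    print(1 - sse / sst)  # the same value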

Computing b0 and b1 from r

If we know 5 summary measures about our data:

    X̄, Ȳ, s_X, s_Y, and r

then we can compute the regression line...

    b_1 = r \frac{s_Y}{s_X}

    b_0 = \bar{Y} - b_1\bar{X}

where
    r = correlation between X and Y
    s_Y = standard deviation of the Y values
    s_X = standard deviation of the X values

1. The equation for b₁ verifies a previous statement:
   b₁ and r will always have the same sign.

2. The b₁ equation shows that slope and correlation
   both measure the association between X and Y, but
   in different ways.

Example:
A student scores 55% on test #1 in Finance. Predict
his overall grade using regression based on last
semester's data.

           Mean    Standard Deviation
Test #1    70.1    14.6
Overall    73.0    11.4

The correlation was 0.75 between Test #1 and the
Overall Grade.

Solution:
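A worked sketch using the formulas from the previous page:

    b₁ = r (s_Y / s_X) = 0.75 × (11.4 / 14.6) ≈ 0.586
    b₀ = Ȳ − b₁X̄ = 73.0 − 0.586 × 70.1 ≈ 31.9

    ŷ = 31.9 + 0.586 × 55 ≈ 64.1

So the predicted overall grade is about 64%. Notice the prediction is pulled
toward the class mean of 73: because r < 1, a student 1 standard deviation
below the mean on Test #1 is predicted to be only 0.75 standard deviations
below the mean overall.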

Sampling Distributions for Intercept and Slope

• If certain assumptions hold, we can find the distributions of b₀ and b₁:

    \frac{b_0 - \beta_0}{SE(b_0)} \sim T_{n-2}    and    \frac{b_1 - \beta_1}{SE(b_1)} \sim T_{n-2}

• We can use these distributions for making inferences about the
  relationship between X and Y in the population:
  • Confidence intervals
  • Hypothesis tests

• We can also use these distributions to construct prediction intervals for
  values of the response variable (y) for a given value of the predictor
  variable (x)
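As an illustration, a sketch of the slope t-statistic for the Sales printout
(b₁ = 1.0076, SE(b₁) = 0.1004, n = 5), using SciPy's t distribution:

    from scipy import stats

    b1, se_b1, n = 1.0076, 0.1004, 5

    t_stat = (b1 - 0) / se_b1                        # tests H0: beta1 = 0
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
    print(t_stat, p_value)                           # about 10.03 and 0.0021,
                                                     # matching the printout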

Question:
How do we know if the regression line is useful?

1. If the r or r² for a problem is "too small" or very
   close to 0, it indicates that the x variable may not
   really be related to y.

   H0: population correlation ρ = 0
   Ha: population correlation ρ ≠ 0

2. If the slope of the regression line is close to 0, then
   the predicted y value is approximately the same for
   any value of x. So knowing x is not much help in
   predicting y.

   H0: population slope β₁ = 0
   Ha: population slope β₁ ≠ 0

These are equivalent hypothesis tests.

continued...

For this test we can look at the p-value on the
computer printout for the slope. With a small p-value
we have more confidence that the regression model
on the printout is valid and not just sampling error.
(We will use α = 0.05.)

If the p-value < 0.05, then:

We reject H0, and so the population correlation
between X and Y is ≠ 0.
This means X and Y are really related -- it's not just
luck.

3. Even if the p-value < 0.05, we should examine
   r². If it is too small (maybe < 30%), then it means
   that the x variable might be related to y but is not a
   very good predictor of y, since a lot of the
   variability of y is not related to x.

So a good predictor x has:

    p-value < 0.05 and r² ≥ 0.30

Inference (II): Confidence Intervals

• We can use the point estimate b₁ and SE(b₁)
  to construct a confidence interval for the
  slope parameter.
• Given a confidence level (1 − α), the
  corresponding interval is given by

    b_1 \pm t_{\alpha/2,\,n-2} \cdot SE(b_1)

  (The interval is available directly from the JMP output.)
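For the Sales data, a sketch of the 95% interval for the slope (printout
values; t quantile from SciPy):

    from scipy import stats

    b1, se_b1, n = 1.0076, 0.1004, 5
    alpha = 0.05

    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)    # about 3.182 for df = 3
    print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # roughly (0.69, 1.33)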

Things to Worry About

1. On a scatterplot of the values, note if any of the points
   look like outliers. This may indicate either data
   entry errors or cases that are not typical of the data
   set (and could be deleted). An outlier may radically
   change r and the regression line of a small data set.

2. Correlation is not causation.

3. If the population model is non-linear (e.g., a
   parabola), then computing r or using a regression
   line may give misleading results.

4. Be careful when using the regression line to
   extrapolate beyond the range of observed data.

Example:
In finance, the risk of a stock is measured by its "beta
value." This is the slope in a linear regression model[1]

    E(Y) = \beta_0 + \beta_1 X

which relates the rate of return[2] (Y) of a specific stock
to the rate of return for the overall stock market (X),
namely the S&P 500 index. This regression model
measures the relationship between % changes in the
price of a specific stock relative to % changes in the
overall stock market.

The risk of a stock is related to (i) the outlook for the
stock market or overall economy and also to (ii) the
specific prospects for the company[3]. The beta value, or
β₁, measures only the risk of a stock due to (i).

Terminology:
Beta is also described as measuring a stock's:
-- correlated relative volatility
-- non-diversifiable risk
-- systematic risk
-- market risk

[1] This is stated more precisely using the current risk-free interest rate: subtract this interest rate from the data
(weekly returns) for both X and Y. If that rate is roughly constant over the time period we are analyzing, then the
regression model as given above is essentially the same. Also, the intercept in this model should be 0 if the market
is efficient.
[2] Return = (dividends received + change in price) / (price at the start of the day, week, or month). This is the % change.
[3] The risk specific to a company is not of major importance since it can be reduced through diversification. The risk
measured by β₁ is more serious since it combines with the similar risk of the other stocks in our portfolio.

Example:

Weekly closing prices and % changes for Apple
Computer and the S&P500 Index:

Date       S&P500    S&P500 % Change    Apple     Apple % Change
1/4/10 1144.98 211.98
1/11/10 1136.03 -0.0078 205.93 -0.0285
1/19/10 1091.76 -0.0390 197.75 -0.0397
1/25/10 1073.87 -0.0164 192.06 -0.0288
2/1/10 1066.19 -0.0072 195.46 0.0177
2/8/10 1075.51 0.0087 200.38 0.0252
2/16/10 1109.17 0.0313 201.67 0.0064
2/22/10 1104.49 -0.0042 204.62 0.0146
3/1/10 1138.70 0.0310 218.95 0.0700
3/8/10 1149.99 0.0099 226.60 0.0349
3/15/10 1159.90 0.0086 222.25 -0.0192
3/22/10 1166.59 0.0058 230.90 0.0389
3/29/10 1178.10 0.0099 235.97 0.0220
4/5/10 1194.37 0.0138 241.79 0.0247
4/12/10 1192.13 -0.0019 247.40 0.0232
4/19/10 1217.28 0.0211 270.83 0.0947
4/26/10 1186.69 -0.0251 261.09 -0.0360
5/3/10 1110.88 -0.0639 235.86 -0.0966
5/10/10 1135.68 0.0223 253.82 0.0761
5/17/10 1087.69 -0.0423 242.32 -0.0453
5/24/10 1089.41 0.0016 256.88 0.0601
6/1/10 1064.88 -0.0225 255.96 -0.0036
6/7/10 1091.60 0.0251 253.51 -0.0096
6/14/10 1117.51 0.0237 274.07 0.0811
6/21/10 1076.76 -0.0365 266.70 -0.0269
6/28/10 1022.58 -0.0503 246.94 -0.0741

Mean                 -0.004              0.007
Std Dev               0.027              0.049

No dividends were paid during this time period; otherwise we would have had
to adjust the prices, but finance.yahoo.com does that for us automatically.
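The % change columns can be reproduced from the closing prices (see footnote 2);
a minimal sketch for the first few S&P500 weeks:

    import numpy as np

    # First five S&P500 weekly closes from the table above
    close = np.array([1144.98, 1136.03, 1091.76, 1073.87, 1066.19])

    pct_change = close[1:] / close[:-1] - 1
    print(np.round(pct_change, 4))  # [-0.0078 -0.039 -0.0164 -0.0072], as in the table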

[Scatterplot, Jan-June 2010: Apple % Weekly Change (y-axis,
-0.15 to 0.15) vs. S&P500 % Weekly Change (x-axis, -0.08 to 0.04).]

The slope of the regression line is positive. This is
surprising, since over this time period the S&P500
decreased while Apple increased (see the means on the
previous page).

Regression Statistics
Multiple R 0.8238
R Square 0.6787
Adjusted R Square 0.6647
Standard Error 0.0281
Observations 25

ANOVA
df SS MS F Significance F
Regression 1 0.03848 0.03848 48.58683 4.18706E-07
Residual 23 0.01822 0.00079
Total 24 0.05670

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept   0.0135         0.0057           2.3729    0.0264    0.0017      0.0253
S&P500      1.5045         0.2158           6.9704    0.0000    1.0580      1.9510

Write out the regression equation.

What is the beta for Apple?
(It was listed as 1.43 on finance.yahoo.com, based on additional data.)

Slope β > 1: aggressive stock
Slope β < 1: conservative stock
Slope β = 1: average (equal to the S&P500)
Slope β = 0: uncorrelated with the S&P500
Slope β < 0: moves opposite to the S&P500

A stock with a slope of 2 would tend to increase twice as fast as the S&P 500 and
decrease twice as fast. A stock with a slope of only 0.50 would tend to increase or
decrease only 1/2 as fast as the S&P 500.
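As a check, a sketch estimating Apple's beta directly from the 25 pairs of
weekly % changes in the table:

    import numpy as np

    sp500 = np.array([-0.0078, -0.0390, -0.0164, -0.0072,  0.0087,
                       0.0313, -0.0042,  0.0310,  0.0099,  0.0086,
                       0.0058,  0.0099,  0.0138, -0.0019,  0.0211,
                      -0.0251, -0.0639,  0.0223, -0.0423,  0.0016,
                      -0.0225,  0.0251,  0.0237, -0.0365, -0.0503])
    apple = np.array([-0.0285, -0.0397, -0.0288,  0.0177,  0.0252,
                       0.0064,  0.0146,  0.0700,  0.0349, -0.0192,
                       0.0389,  0.0220,  0.0247,  0.0232,  0.0947,
                      -0.0360, -0.0966,  0.0761, -0.0453,  0.0601,
                      -0.0036, -0.0096,  0.0811, -0.0269, -0.0741])

    beta, intercept = np.polyfit(sp500, apple, 1)
    print(beta, intercept)   # about 1.50 and 0.013, matching the printout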

continued…

What is the correlation between the % weekly price
changes of Apple and the S&P500?

Note: This correlation r measures how a stock and the S&P500
tend to move together. But so does b₁. To see the difference,
suppose a stock's weekly % price change was always exactly
25% of the weekly % price change of the S&P500. Then
b₁ = 0.25 while r = 1.0.

Recall the relationship between slope and correlation:

    b_1 = r \frac{s_{Apple}}{s_{S&P500}}

From the data set we can compute the standard
deviations for the 2 columns of % changes:

    s_Apple = 0.049 and s_S&P500 = 0.027

s_Apple is a measure of the volatility of Apple stock without
relating it to the overall market[4].

[4] A weekly (or daily) s is usually converted to yearly data and then stated as a %. The annual volatility:
    Apple: 0.049 × √52 = 35.3%
    S&P500: 0.027 × √52 = 19.5%

continued…

The relative volatility of Apple is

    s_Apple / s_S&P500 = 0.049 / 0.027 = 1.815,

i.e., 81.5% higher than the S&P500.

Then b₁ = 1.815 r, and this is why the beta for Apple is
much larger than its correlation.

Note: A stock can be very volatile, like Apple, but if r
is close to 0, then its beta (slope) is close to 0. So a
stock's beta only measures the risk that is related to
market fluctuations.

Everything else equal, an investor likes stocks with
low betas below 1 -- even better near 0, or better yet
negative. This is useful for building low-risk portfolios.

Here are some more examples (as of summer 2010):

              Relative
              Volatility     r       Beta
Ford          2.18           0.75    1.63
Apple         1.83           0.82    1.50
Google        1.29           0.66    0.86
Coca-Cola     0.87           0.55    0.48
Kellogg       0.70           0.52    0.36
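A small sketch verifying that Beta ≈ (Relative Volatility) × r for the table
above (small rounding differences expected):

    import numpy as np

    rel_vol = np.array([2.18, 1.83, 1.29, 0.87, 0.70])  # Ford ... Kellogg
    r       = np.array([0.75, 0.82, 0.66, 0.55, 0.52])

    print(np.round(rel_vol * r, 2))  # [1.64 1.5  0.85 0.48 0.36] -- close to
                                     # the Beta column, up to rounding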

Example:
Here is a data set consisting of the number of people
who play golf in this country and the number of
missed days at work in the entire U.S. due to reported
injury or illness.

Year    Golfers       Missed Days
1960     4,400,000    370,000,000
1965     7,750,000    400,000,000
1970     9,700,000    417,000,000
1975    12,036,000    433,000,000
1980    13,000,000    485,000,000
1985    14,700,000    500,000,000

Predict the number of Missed Days at work as a
function of the number of Golfers.

Solution:
A fitted regression line gives...

    \hat{y} = 304,525,000 + 12,600X

with r² = 0.91.

How do you interpret such a high r² -- as cause &
effect?

Example:
X    Y
1    5
2    2
3    1
4    2
5    5

(a) Calculate the correlation coefficient r. On the basis
    of r, does it seem that X and Y are related in the
    population?

(b) Plot the 5 data points. Explain the result of (a) now
    that you have seen the plot.

[Blank axes for plotting: X from 0 to 6.]
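For part (a), a quick check in Python (a sketch, not a substitute for the
hand computation):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([5, 2, 1, 2, 5], dtype=float)

    print(np.corrcoef(x, y)[0, 1])  # 0.0 -- yet the plot shows a strong
                                    # (parabolic) pattern, so r only
                                    # detects *linear* association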
