University of Padua

Statistics for Management


Simple Linear Regression (1)

Omar Paccagnella

Department of Statistical Sciences


University of Padua

omar.paccagnella@unipd.it
http://www.stat.unipd.it/~paccagnella

Academic year 2018/19



Introduction
What happens if:

• Data are time-oriented
• There is more than one variable

Unit   Shop surface (m²)   Weekly sales (1000 €)
 1            95                  43.2
 2           144                 132.0
 3           210                 155.0
 4           156                  76.0
 5           188                 100.9
 6           321                 187.4
 7           250                 185.0
 8           115                  60.7
 9           178                  82.9
10           105                  61.3
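
For the examples in these notes it helps to have the ten observations above available in software. Below is a minimal Python sketch (not part of the original slides); the names surface and sales are arbitrary choices.

```python
# Example data from the slide: shop surface in m² (X) and weekly sales in 1000 € (Y)
surface = [95, 144, 210, 156, 188, 321, 250, 115, 178, 105]
sales = [43.2, 132.0, 155.0, 76.0, 100.9, 187.4, 185.0, 60.7, 82.9, 61.3]
```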


Introduction

• A scatter diagram or scatter plot (that is, a two-dimensional graph) may help to show a relationship between two variables
• Is this relationship linear? Is it positive or negative? If linear, could we summarise such a relationship by fitting a straight line through the data points (like a trend for a time series)?
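
As an illustration (not part of the original slides), such a scatter plot can be drawn with Python's matplotlib, reusing the surface and sales lists from the sketch above.

```python
import matplotlib.pyplot as plt

# Reusing the surface (X) and sales (Y) lists from the earlier sketch
plt.scatter(surface, sales)              # one point per shop
plt.xlabel("Shop surface (m²)")
plt.ylabel("Weekly sales (1000 €)")
plt.title("Weekly sales vs shop surface")
plt.show()
```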


The Correlation Coefficient

It measures the extent to which two variables (usually called X and Y) are linearly related to each other (in other words, the strength of such a linear relationship).

• In the population (which contains all possible values of the pair X − Y of interest): ρ
• In the (random) sample drawn from this population: r

Often the two variables are measured in different units (in the example, square metres and €); nevertheless it is important to measure the extent to which X and Y are related.


The Correlation Coefficient


• Standardize the variables (constructing the Z-scores):

ZX = (X − X̄) / SX        ZY = (Y − Ȳ) / SY

  – X̄: average value (mean) of X
  – SX: standard deviation of X
  – n: number of units

• Calculate the mean cross product of the Z-scores:

r = (1/(n − 1)) Σ ZX ZY = Σ (X − X̄)(Y − Ȳ) / [√(Σ (X − X̄)²) · √(Σ (Y − Ȳ)²)]

  – Correlation, not causation, is measured
  – −1 ≤ r ≤ 1 (check the value r = 0.8927 in the previous example)
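
As a hedged illustration (not in the slides), the same quantity can be computed in Python directly from the z-score definition, reusing the surface and sales lists defined earlier; with the shop data it reproduces r ≈ 0.89.

```python
import math

def correlation(x, y):
    """Sample correlation r as the mean cross product of z-scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))  # sample sd of X
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))  # sample sd of Y
    zx = [(xi - mx) / sx for xi in x]
    zy = [(yi - my) / sy for yi in y]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

print(correlation(surface, sales))   # about 0.89 for the shop example (slide value: 0.8927)
```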


Fitting a Straight Line

• Could we find a straight line that is able to summarise the pattern of all X − Y data points?
• Could we fit the best straight line?
• Could we exploit this best straight line to forecast unknown (future?) values of the variable of interest (Y)?


Fitting a Straight Line

• We may introduce a mathematical procedure to calculate both the Y-intercept and the slope of the best-fitting straight line.
• Since many straight lines can be drawn, the most common approach to determining the best fit is the method of least squares (OLS, Ordinary Least Squares).

The best-fitting line is the one that minimises the sum of the squared distances between the data points and the line itself, as measured in the vertical (Y) direction.


Fitting a Straight Line

• In the population, the straight line may be mathematically defined as:

Y = β0 + β1 X

• In the sample, the straight line may be mathematically defined as:

Y = b0 + b1 X

where b0 and b1 are estimates of the true (but unknown) population intercept and slope.

Using the sample values, we can predict the Y values on the fitted line

Ŷ = b̂0 + b̂1 X

Ŷ is the value of Y that we would observe if the data points lay exactly on the line.


Least Squares

The idea behind this method is that the line will be appropriate to describe the relationship under investigation if the observed values are close to the straight line.

The distance between observed and fitted values is the residual:

ei = Yi − Ŷi = Yi − b0 − b1 Xi

According to the OLS criterion, the values of b0 and b1 are chosen in order to minimise the sum of squared errors (residuals):

SSE = f(b0, b1) = Σ ei² = Σ (Yi − b0 − b1 Xi)²        (sums over i = 1, …, n)
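
As an illustration (not in the slides), SSE can be evaluated for any candidate pair (b0, b1); the trial values below are arbitrary and simply show that different lines give different sums of squared residuals. The surface and sales lists from the first sketch are reused.

```python
def sse(x, y, b0, b1):
    """Sum of squared residuals for a candidate line Y = b0 + b1*X."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Two arbitrary candidate lines: the smaller SSE indicates the better fit
print(sse(surface, sales, 0.0, 0.6))
print(sse(surface, sales, -10.0, 0.7))
```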


Least Squares
First-order conditions (that is, the derivatives of f(b0, b1) with respect to b0 and b1) are applied to minimise SSE. After a little calculus:

b̂1 = Σ (X − X̄)(Y − Ȳ) / Σ (X − X̄)²

b̂0 = Ȳ − b̂1 X̄

The least squares slope is related to the sample correlation coefficient, so that

b̂1 = [√(Σ (Y − Ȳ)²) / √(Σ (X − X̄)²)] · r

Hence, b̂1 and r are proportional to one another and have the same sign.
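
A minimal Python sketch of these closed-form estimates (not from the slides), reusing the surface and sales lists; with the shop data the slope is positive, as the sign of r ≈ 0.89 requires.

```python
def ols_fit(x, y):
    """Closed-form OLS estimates: b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², b0 = ȳ - b1·x̄."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

b0, b1 = ols_fit(surface, sales)
print(b0, b1)            # fitted intercept and (positive) slope
print(b0 + b1 * 150)     # predicted weekly sales for a hypothetical 150 m² shop
```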


The linear regression model

According to the least squares criterion, we have the identity

Observation = Fit + Residual

formally
Y = Ŷ + (Y − Ŷ )

• The fit represents the overall pattern in the data
• The residuals represent deviations from the pattern


The linear regression model


Observed data are a sample of observations on an underlying relation that holds in the population.

For all values of X, the observed values of Y are identically distributed around a mean µY that depends linearly on X:

µY = β0 + β1 X

As X changes, the means of the distributions of the possible values of Y lie along a straight line. This is the so-called population regression line.

• Observed values of Y vary because of the presence of unknown (and unmeasured) factors.
• This variation is the same for all values of X and is measured by the standard deviation σ.
• The distance between a Y value and its mean is called the error (ε).

The linear regression model

In the simple linear regression model:


• Y is the response or dependent variable.
• X is the controlled or explanatory (independent) variable.
• The dependent variable is the sum of its mean and a random deviation (ε) from this mean.
• Deviations represent variation in Y due to unobserved factors that prevent the pair (X, Y) values from lying exactly on the straight line.

The population regression model may then be written as:

Y = β0 + β1 X + ε
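
To make the role of the error term concrete, here is a small simulation sketch (not part of the slides); the values β0 = 10, β1 = 0.5 and σ = 5 are arbitrary illustration choices, not estimates from the shop data.

```python
import random

random.seed(1)
beta0, beta1, sigma = 10.0, 0.5, 5.0   # hypothetical population parameters

# Each observed Y is its mean beta0 + beta1*X plus a random error with standard deviation sigma
for x in range(10, 110, 10):
    y = beta0 + beta1 * x + random.gauss(0, sigma)
    print(x, round(y, 1))
```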


The linear regression model

The sample regression line may be regarded as an estimate of the population regression line,

µY = β0 + β1 X

and the residuals e = Y − Ŷ may be regarded as estimates of the error components ε.

Therefore:

Y = b0 + b1 X + e


Some notes

• We may also write

b1 = Cov(X, Y) / Var(X)

if Var(X) ≠ 0
• b1 = 0 if and only if Cov(X, Y) = 0, that is, the two variables are not linearly related (uncorrelated)
• Cov(X, Y) provides the sign of the b1 estimate
• The regression line always passes through the means of X and Y, i.e. through the point (X̄, Ȳ)
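
As a quick numerical check (not in the slides), the covariance/variance formula gives the same slope as the least-squares expression seen earlier; the surface and sales lists are reused.

```python
def slope_cov_var(x, y):
    """Slope b1 = Cov(X, Y) / Var(X); the (n - 1) denominators cancel out."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    var = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    return cov / var

print(slope_cov_var(surface, sales))   # equals the OLS slope b̂1 computed earlier
```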


Steps of a Linear Regression Analysis

• Hypothesis on the linear functional relationship between the variable of interest and the other variable(s)
• Estimation of the parameters of this functional relationship, based on the available sample data
• Statistical testing of model estimates and goodness of fit
• Robustness checks on the main assumptions of the linear regression model
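
These steps are usually carried out with statistical software. As one possible illustration (the slides do not prescribe any package), here is a sketch with Python's statsmodels on the shop data, reusing the surface and sales lists from the first sketch.

```python
import statsmodels.api as sm

X = sm.add_constant(surface)      # design matrix: intercept column plus X
model = sm.OLS(sales, X).fit()    # step 2: estimation by ordinary least squares
print(model.summary())            # step 3: significance tests, R² and goodness of fit
```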
