University of Padua

Statistics for Management


Simple Linear Regression (1)

Omar Paccagnella

Department of Statistical Sciences


University of Padua

omar.paccagnella@unipd.it
http://www.stat.unipd.it/~paccagnella

Academic year 2018/19



Introduction
What happens if:

• Data are time-oriented
• There is more than one variable

Unit   Shop surface (m²)   Weekly sales (1000 €)
 1            95                  43.2
 2           144                 132.0
 3           210                 155.0
 4           156                  76.0
 5           188                 100.9
 6           321                 187.4
 7           250                 185.0
 8           115                  60.7
 9           178                  82.9
10           105                  61.3
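
For the examples in these notes it helps to have the ten observations above available in software. Below is a minimal Python sketch (not part of the original slides); the names surface and sales are arbitrary choices.

```python
# Example data from the slide: shop surface in m² (X) and weekly sales in 1000 € (Y)
surface = [95, 144, 210, 156, 188, 321, 250, 115, 178, 105]
sales = [43.2, 132.0, 155.0, 76.0, 100.9, 187.4, 185.0, 60.7, 82.9, 61.3]
```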


Introduction

• A scatter diagram or scatter plot (that is, a two-dimensional graph) may help to show a relationship between two variables
• Is this relationship linear? Is it positive or negative? If linear, could we summarise such a relationship by fitting a straight line through the data points (like a trend for a time series)?
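
As an illustration (not part of the original slides), such a scatter plot can be drawn with Python's matplotlib, reusing the surface and sales lists from the sketch above.

```python
import matplotlib.pyplot as plt

# Reusing the surface (X) and sales (Y) lists from the earlier sketch
plt.scatter(surface, sales)              # one point per shop
plt.xlabel("Shop surface (m²)")
plt.ylabel("Weekly sales (1000 €)")
plt.title("Weekly sales vs shop surface")
plt.show()
```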


The Correlation Coefficient

It measures the extent to which two variables (usually called X and Y) are linearly related to each other (in other words, the strength of such a linear relationship).

• In the population (which contains all possible values of the pair X − Y of interest): ρ
• In the (random) sample drawn from this population: r

Often the two variables are measured in different units (in the example, square metres and €); nevertheless it is important to measure the extent to which X and Y are related.


The Correlation Coefficient


• Standardize the variables (constructing the Z-scores):

ZX = (X − X̄) / SX        ZY = (Y − Ȳ) / SY

  – X̄: average value (mean) of X
  – SX: standard deviation of X
  – n: number of units

• Calculate the mean cross product of the Z-scores:

r = (1/(n − 1)) Σ ZX ZY = Σ (X − X̄)(Y − Ȳ) / [√(Σ (X − X̄)²) · √(Σ (Y − Ȳ)²)]

  – Correlation, not causation, is measured
  – −1 ≤ r ≤ 1 (check the value r = 0.8927 in the previous example)
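
As a hedged illustration (not in the slides), the same quantity can be computed in Python directly from the z-score definition, reusing the surface and sales lists defined earlier; with the shop data it reproduces r ≈ 0.89.

```python
import math

def correlation(x, y):
    """Sample correlation r as the mean cross product of z-scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))  # sample sd of X
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))  # sample sd of Y
    zx = [(xi - mx) / sx for xi in x]
    zy = [(yi - my) / sy for yi in y]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

print(correlation(surface, sales))   # about 0.89 for the shop example (slide value: 0.8927)
```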


Fitting a Straight Line

• Could we find a straight line that is able to summarise the pattern of all X − Y data points?
• Could we fit the best straight line?
• Could we exploit this best straight line to forecast unknown (future?) values of the variable of interest (Y)?


Fitting a Straight Line

• We may introduce a mathematical procedure to calculate both the Y-intercept and the slope of the best-fitting straight line.
• Since many straight lines can be drawn, the most common approach to determining the best fit is the method of least squares (OLS, Ordinary Least Squares).

The best-fitting line is the one that minimises the sum of the squared distances between the data points and the line itself, as measured in the vertical (Y) direction.


Fitting a Straight Line

• In the population, the straight line may be mathematically defined as:

Y = β0 + β1 X

• In the sample, the straight line may be mathematically defined as:

Y = b0 + b1 X

where b0 and b1 are estimates of the true (but unknown) population intercept and slope.

Using the sample values, we can predict the Y values on the fitted line

Ŷ = b̂0 + b̂1 X

Ŷ is the value of Y that we would observe if the data points lay exactly on the line.


Least Squares

The idea behind this method is that the line will be appropriate to describe the relationship under investigation if the observed values are close to the straight line.

The distance between observed and fitted values is the residual:

ei = Yi − Ŷi = Yi − b0 − b1 Xi

According to the OLS criterion, the values of b0 and b1 are chosen in order to minimise the sum of squared errors (residuals):

SSE = f(b0, b1) = Σ ei² = Σ (Yi − b0 − b1 Xi)²        (sums over i = 1, …, n)
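
As an illustration (not in the slides), SSE can be evaluated for any candidate pair (b0, b1); the trial values below are arbitrary and simply show that different lines give different sums of squared residuals. The surface and sales lists from the first sketch are reused.

```python
def sse(x, y, b0, b1):
    """Sum of squared residuals for a candidate line Y = b0 + b1*X."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Two arbitrary candidate lines: the smaller SSE indicates the better fit
print(sse(surface, sales, 0.0, 0.6))
print(sse(surface, sales, -10.0, 0.7))
```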


Least Squares
First-order conditions (that is, the derivatives of f(b0, b1) with respect to b0 and b1) are applied to minimise SSE. After a little calculus:

b̂1 = Σ (X − X̄)(Y − Ȳ) / Σ (X − X̄)²

b̂0 = Ȳ − b̂1 X̄

The least squares slope is related to the sample correlation coefficient, so that

b̂1 = [√(Σ (Y − Ȳ)²) / √(Σ (X − X̄)²)] · r

Hence, b̂1 and r are proportional to one another and have the same sign.
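
A minimal Python sketch of these closed-form estimates (not from the slides), reusing the surface and sales lists; with the shop data the slope is positive, as the sign of r ≈ 0.89 requires.

```python
def ols_fit(x, y):
    """Closed-form OLS estimates: b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², b0 = ȳ - b1·x̄."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

b0, b1 = ols_fit(surface, sales)
print(b0, b1)            # fitted intercept and (positive) slope
print(b0 + b1 * 150)     # predicted weekly sales for a hypothetical 150 m² shop
```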


The linear regression model

According to the least squares criterion, we have the identity

Observation = Fit + Residual

formally
Y = Ŷ + (Y − Ŷ )

• The fit represents the overall pattern in the data
• The residuals represent deviations from the pattern


The linear regression model


Observed data are a sample of observations on an underlying relation that holds in the population.

For all values of X, the observed values of Y are identically distributed around a mean µY that depends linearly on X:

µY = β0 + β1 X

As X changes, the means of the distributions of the possible values of Y lie along a straight line. This is the so-called population regression line.

• Observed values of Y vary because of the presence of unknown (and unmeasured) factors.
• This variation is the same for all values of X and is measured by the standard deviation σ.
• The distance between a Y value and its mean is called the error (ε).

The linear regression model

In the simple linear regression model:


• Y is the response or dependent variable.
• X is the controlled or explanatory (independent) variable.
• The dependent variable is the sum of its mean and a random deviation (ε) from this mean.
• Deviations represent variation in Y due to unobserved factors that prevent the pair (X, Y) values from lying exactly on the straight line.

The population regression model may then be written as:

Y = β0 + β1 X + ε
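
To make the role of the error term concrete, here is a small simulation sketch (not part of the slides); the values β0 = 10, β1 = 0.5 and σ = 5 are arbitrary illustration choices, not estimates from the shop data.

```python
import random

random.seed(1)
beta0, beta1, sigma = 10.0, 0.5, 5.0   # hypothetical population parameters

# Each observed Y is its mean beta0 + beta1*X plus a random error with standard deviation sigma
for x in range(10, 110, 10):
    y = beta0 + beta1 * x + random.gauss(0, sigma)
    print(x, round(y, 1))
```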


The linear regression model

The sample regression line may be regarded as an estimate of the population regression line,

µY = β0 + β1 X

and the residuals e = Y − Ŷ may be regarded as estimates of the error components ε.

Therefore:

Y = b0 + b1 X + e


Some notes

• We may also write

b1 = Cov(X, Y) / Var(X)

if Var(X) ≠ 0
• b1 = 0 if and only if Cov(X, Y) = 0, that is, the two variables are not linearly related (uncorrelated)
• Cov(X, Y) provides the sign of the b1 estimate
• The regression line always passes through the means of X and Y, i.e. through the point (X̄, Ȳ)
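
As a quick numerical check (not in the slides), the covariance/variance formula gives the same slope as the least-squares expression seen earlier; the surface and sales lists are reused.

```python
def slope_cov_var(x, y):
    """Slope b1 = Cov(X, Y) / Var(X); the (n - 1) denominators cancel out."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    var = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    return cov / var

print(slope_cov_var(surface, sales))   # equals the OLS slope b̂1 computed earlier
```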


Steps of a Linear Regression Analysis

• Hypothesis on the linear functional relationship between the variable of interest and the other variable(s)
• Estimation of the parameters of this functional relationship, based on the available sample data
• Statistical testing of model estimates and goodness of fit
• Robustness checks on the main assumptions of the linear regression model
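
These steps are usually carried out with statistical software. As one possible illustration (the slides do not prescribe any package), here is a sketch with Python's statsmodels on the shop data, reusing the surface and sales lists from the first sketch.

```python
import statsmodels.api as sm

X = sm.add_constant(surface)      # design matrix: intercept column plus X
model = sm.OLS(sales, X).fit()    # step 2: estimation by ordinary least squares
print(model.summary())            # step 3: significance tests, R² and goodness of fit
```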
