
REGRESSION ANALYSIS

SECTION 1
INTRODUCTION
HISTORY

1805: Legendre, method of least squares
1809: Gauss, method of least squares
1822-1911: Sir Francis Galton, coined the term "regression"
HISTORY

1871-1951: George Udny Yule, joint distribution was assumed to be Gaussian
1857-1936: Karl Pearson, joint distribution was assumed to be Gaussian
1890-1962: Sir Ronald Fisher, weakened the assumption of Yule and Pearson
HISTORY
Gauss published a further development of
the theory of least squares, including a
version of the Gauss–Markov theorem

1805 –
1821
1809

The earliest form of regression was the method of least


squares, published by Legendre in 1805, and by Gauss in
1809. They both applied the method to the problem of
determining, from astronomical observations, the orbits of
bodies about the Sun (mostly comets, but also later the
then newly discovered minor planets).
HISTORY

1890 / 1897 / 1903

The term "regression" was coined by Francis Galton to describe a biological phenomenon: the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean). For Galton, regression had only this biological meaning.

Galton's work was later extended by Udny Yule and Karl Pearson to a more general statistical context. In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian.
HISTORY

1922 / 1925

This assumption was weakened by R.A. Fisher. He assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.

1950s / 1960s

Economists used electromechanical desk calculators to calculate regressions.
HISTORY

Before 1970, it sometimes took up to 24 hours to receive the result from one regression.

Regression methods continue to be an area of active research. In recent decades, new methods have been developed for robust regression; regression involving correlated responses such as time series and growth curves; regression in which the predictor (independent variable) or response variables are curves, images, graphs, or other complex data objects; regression methods accommodating various types of missing data; nonparametric regression; Bayesian methods for regression; regression in which the predictor variables are measured with error; regression with more predictor variables than observations; and causal inference with regression.
STATISTICAL MODELING
Describes Relationship between Variables

Deterministic Models
- Hypothesize exact relationships
- Suitable when prediction error is negligible
- Example: Body mass index (BMI) is a measure of body fat based on height and weight
- Metric formula: BMI = weight (kg) / height (m)²
- Non-metric formula: BMI = 703 × weight (lb) / height (in)²

Probabilistic Models
- Hypothesize 2 components: a deterministic component + random error
- Example: Systolic blood pressure of newborns is 6 times the age in days + random error
- Random error may be due to factors other than age in days (e.g. birthweight)
STATISTICAL MODELING
Types of Probabilistic Models

Probabilistic models include regression models, correlation models, and other models.
TYPES OF REGRESSION MODELS

Simple (1 explanatory variable): linear or non-linear
Multiple (2+ explanatory variables): linear or non-linear


REGRESSION ANALYSIS

In analyzing data for the health sciences disciplines, we find that it is frequently desirable to learn something about the relationship between two numeric variables. We may, for example, be interested in studying the relationship between blood pressure and age, height and weight, the concentration of an injected drug and heart rate, the consumption level of some nutrient and weight gain, the intensity of a stimulus and reaction time, or total family income and medical care expenditures. The nature and strength of the relationships between variables such as these may be examined using linear models such as regression and correlation analysis, two statistical techniques that, although related, serve different purposes.

REGRESSION ANALYSIS

Regression analysis is helpful in assessing specific forms of the relationship between variables, and the ultimate objective when this method of analysis is employed usually is to predict or estimate the value of one variable corresponding to a given value of another variable.
REGRESSION MODELING STEPS

1. Define problem or question
2. Specify model
3. Collect data
4. Do descriptive data analysis
5. Estimate unknown parameters
6. Evaluate model
7. Use model for prediction

SIMPLE VS. MULTIPLE

Simple regression:
1. β₁ represents the unit change in Y per unit change in X
2. does not take into account any other variable besides the single independent variable

Multiple regression:
1. βᵢ represents the unit change in Y per unit change in Xᵢ
2. takes into account the effect of the other independent variables
3. each βᵢ is a net regression coefficient
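To make the contrast concrete, here is a minimal sketch fitting a simple and a multiple model to the same synthetic data with statsmodels; the data, seed, and coefficient values are invented for illustration.

```python
# Sketch: simple vs. multiple regression on made-up data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=100)

# Simple: one explanatory variable; the single slope ignores x2 entirely.
simple = sm.OLS(y, sm.add_constant(x1)).fit()

# Multiple: each beta_i is a net regression coefficient, i.e. the unit
# change in Y per unit change in X_i holding the other X's fixed.
multiple = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(simple.params)    # [beta0, beta1]
print(multiple.params)  # [beta0, beta1, beta2]
```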
ASSUMPTIONS

1. CONTINUOUS VARIABLES: the two variables should be either interval or ratio variables
2. LINEARITY: the Y variable is linearly related to the value of the X variable
3. INDEPENDENCE OF ERROR: the error (residual) is independent for each value of X
4. NO SIGNIFICANT OUTLIERS: outliers can have a negative effect on the regression analysis
5. HOMOSCEDASTICITY: the variation around the line of regression should be constant for all values of X
6. NORMALITY: the values of Y should be normally distributed at each value of X
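A rough sketch of how some of these assumptions can be checked from the residuals of a fitted line (synthetic data; the checks and thresholds are illustrative, not exhaustive):

```python
# Sketch: residual-based checks for normality, homoscedasticity, outliers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 80)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=x.size)

b1, b0 = np.polyfit(x, y, 1)   # fitted slope and intercept
resid = y - (b0 + b1 * x)

# Normality of Y at each X ~ normality of the residuals (Shapiro-Wilk).
print("normality p-value:", stats.shapiro(resid).pvalue)

# Homoscedasticity: compare residual spread in the low-X and high-X halves.
lo, hi = resid[x < np.median(x)], resid[x >= np.median(x)]
print("spread low/high X:", lo.std(ddof=1), hi.std(ddof=1))

# Outliers: flag residuals more than 3 standard deviations out.
z = (resid - resid.mean()) / resid.std(ddof=1)
print("outlier count:", int((np.abs(z) > 3).sum()))
```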

GOAL

Develop a statistical model that can predict the values of a dependent (response) variable based upon the values of the independent (explanatory) variables.
SECTION 2
LINEAR REGRESSION ANALYSIS
TYPES OF CORRELATION

Positive correlation Negative correlation No correlation


SIMPLE LINEAR REGRESSION

Simple linear regression describes the linear relationship between a predictor variable, plotted on the x-axis, and a response variable, plotted on the y-axis.

[Scatterplot: Independent Variable (X) on the x-axis, Dependent Variable (Y) on the y-axis]
LINEAR EQUATION

Y = mX + b
m = slope = change in Y / change in X
b = Y-intercept

A straight line is the simplest model of the relationship between two interval-scaled attributes, and its slope gives us an indication of the existence of an association between them. Therefore, an objective way to investigate an association between interval attributes is to draw a straight line through the center of the cloud of points and measure its slope. If the slope is zero, the line is horizontal and we conclude that there is no association. If it is non-zero, then we can conclude that there is an association.

So we have two problems to solve:
- how to draw the straight line that best models the relationship between attributes, and
- how to determine whether its slope is different from zero.
LINEAR REGRESSION MODEL
Relationship Between Variables Is a Linear Function

Yᵢ = β₀ + β₁Xᵢ + εᵢ

where Yᵢ is the dependent (response) variable, Xᵢ is the independent (explanatory) variable, β₀ is the population Y-intercept, β₁ is the population slope, and εᵢ is the random error.
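As a concrete illustration, here is a minimal sketch of simulating data from this model; the parameter values (β₀ = 4, β₁ = 2) and the uniform X are arbitrary choices for the example.

```python
# Sketch: generate data from Y_i = beta0 + beta1 * X_i + eps_i.
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = 4.0, 2.0             # population Y-intercept and slope
X = rng.uniform(0, 10, size=50)     # independent (explanatory) variable
eps = rng.normal(0, 1.0, size=50)   # random error
Y = beta0 + beta1 * X + eps         # dependent (response) variable
print(Y[:5])
```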
LINEAR REGRESSION MODEL
Relationship Between Variables Is a Linear Function

Population linear regression model (unknown relationship):
Yᵢ = β₀ + β₁Xᵢ + εᵢ, where εᵢ is the random error and E(Y) = β₀ + β₁Xᵢ.

Sample linear regression model (fitted to the observed values):
Yᵢ = β̂₀ + β̂₁Xᵢ + ε̂ᵢ, with fitted line Ŷᵢ = β̂₀ + β̂₁Xᵢ.
SECTION 3
THE ORDINARY LEAST SQUARES METHOD (OLS)

How to fit data to a linear model? The ordinary least squares (OLS) method.
THE ORDINARY LEAST SQUARES METHOD
Overview

"Best fit" means the differences between the actual Y values and the predicted Y values are at a minimum. But positive differences offset negative ones, so we square the errors: OLS minimizes the Sum of the Squared Errors (SSE),

SSE = Σ(Yᵢ − Ŷᵢ)² = Σε̂ᵢ²
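A small sketch of this point on made-up numbers: the raw residuals of both a good and a bad line can sum to roughly zero, but the squared errors rank the fits correctly.

```python
# Sketch: why we square the errors before summing.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def sse(b0, b1):
    resid = y - (b0 + b1 * x)
    return np.sum(resid ** 2)

# OLS line: residuals sum to ~0 AND the SSE is minimal.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print("OLS SSE:", sse(b0, b1))

# Flat line through ybar: its residuals also sum to 0, but SSE is larger.
print("flat-line SSE:", sse(y.mean(), 0.0))
```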
LEAST SQUARES GRAPHICALLY

The fitted line is Ŷᵢ = β̂₀ + β̂₁Xᵢ, and each residual is ε̂ᵢ = Yᵢ − Ŷᵢ.

LS minimizes Σε̂ᵢ² = ε̂₁² + ε̂₂² + ε̂₃² + ε̂₄²

The sum of squares of the residuals, Σ(Yᵢ − Ŷᵢ)², is at a minimum: we must find the values of β̂₀ and β̂₁ that minimize it.
THE REGRESSION COEFFICIENTS

β₁ = Sxy / Sxx = σxy / σx²
β₀ = Ȳ − β₁X̄

COEFFICIENT EQUATIONS

Prediction equation: Ŷᵢ = β̂₀ + β̂₁Xᵢ
Sample slope: β̂₁ = Sxy / Sxx = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
Sample Y-intercept: β̂₀ = ȳ − β̂₁x̄
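A direct translation of these formulas into Python, on illustrative data:

```python
# Sketch: compute the OLS coefficients from the slide's formulas.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)

beta1_hat = Sxy / Sxx                        # sample slope
beta0_hat = y.mean() - beta1_hat * x.mean()  # sample Y-intercept
y_pred = beta0_hat + beta1_hat * x           # prediction equation
print(beta0_hat, beta1_hat)
```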
INTERPRETATION

1. Slope (β̂₁): the estimated Y changes by β̂₁ for each 1-unit increase in X.
   If β̂₁ = 2, then Y is expected to increase by 2 for each 1-unit increase in X.

2. Y-intercept (β̂₀): if β̂₀ = 4, then the average Y is expected to be 4 when X is 0.
REQUIRED STATISTICS

X̄ = ΣX / n
Ȳ = ΣY / n
DESCRIPTIVE STATISTICS

Var(X) = Σ(Xᵢ − X̄)² / (n − 1) = Sxx / (n − 1)
Var(Y) = Σ(Yᵢ − Ȳ)² / (n − 1) = Syy / (n − 1), where Syy = SST
Covar(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1) = Sxy / (n − 1)
REGRESSION STATISTICS

The Sum of Squares Regression (SSR) is the sum of the squared differences between the prediction for each observation and the mean of the observed Y values.

The Total Sum of Squares (SST) is equal to SSR + SSE.

SSR = Σ(Ŷ − Ȳ)²  (measure of explained variation)
SSE = Σ(Y − Ŷ)²  (measure of unexplained variation)
SST = Σ(Y − Ȳ)² = SSR + SSE  (measure of total variation in Y)
[Diagram: the total variance of Y to be explained by the predictors is SST; the part of Y's variance explained by X1 is SSR, and the part not explained by X1 is SSE.]
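A sketch computing the three sums of squares on illustrative data and verifying the identity SST = SSR + SSE:

```python
# Sketch: SSR, SSE, SST from an OLS fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SSR = np.sum((y_hat - y.mean()) ** 2)  # explained variation
SSE = np.sum((y - y_hat) ** 2)         # unexplained variation
SST = np.sum((y - y.mean()) ** 2)      # total variation
assert np.isclose(SST, SSR + SSE)
print("R^2:", SSR / SST)
```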
THE COEFFICIENT OF DETERMINATION

The proportion of the total variation (SST) that is explained by the regression (SSR) is known as the Coefficient of Determination, and is often referred to as R²:

R² = SSR / SST

The value of R² can range between 0 and 1; the higher its value, the more accurate the regression model is. It is often reported as a percentage.

THE COEFFICIENT OF DETERMINATION

R² is an important measure of association between variables. It is represented as R² because its value is the square of another measure of association frequently used, the correlation coefficient, which is represented by r.

Although we can obtain R² from r, the two measures are not completely equivalent:
- R² has values between 0 and 1
- r ranges from -1 to +1
- r, in addition to providing a measure of the strength of an association, also informs us of the type of association

In both cases, the greater the absolute value of the coefficient, the greater the strength of the association. Unlike the coefficient of determination, the correlation coefficient is an abstract value that has no direct and precise interpretation, somewhat like a score.
THE COEFFICIENT OF DETERMINATION

These two measures are related to the degree of dispersion of the observations about the regression line. In a scatterplot, when the two variables are independent, the points will be distributed over the entire area of the plot; the regression line is horizontal and the coefficient of determination is zero. When an association exists, the regression line is oblique and the points are more or less spread along the line. The higher the strength of the association, the less the dispersion of the points around the line and the greater will be R² and the absolute value of r. If all the points are on the line, R² has value 1 and r has value +1 or -1.
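A brief sketch of the relation between the two measures on made-up data: r is the signed square root of R², carrying the sign of the slope.

```python
# Sketch: r vs. R^2 for a strong negative association.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.8, 8.2, 5.9, 4.1, 2.0])  # decreasing: negative association

r = np.corrcoef(x, y)[0, 1]
print("r:", r)          # close to -1
print("R^2:", r ** 2)   # close to +1
```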
THE COEFFICIENT OF DETERMINATION

The importance of these measures of association comes from the fact that it is very common to find evidence of association between two variables, and it is the strength of the association that tells us whether it has some important meaning. In clinical research, associations explaining less than 50% of the variance of the dependent variable, that is, associations with R² less than 0.50 or, equivalently, with r between -0.70 and +0.70, are usually not regarded as important.
STANDARD ERROR OF REGRESSION

The Standard Error of a regression is a measure of its variability. It can be used in a similar manner to standard deviation, allowing for prediction intervals.

Sₑ = √(S²ₑ)
To estimate the standard error of the regression coefficient:

1. From the regression equation, compute the predicted values ŷ of the dependent variable.
2. Compute the variance of the residuals from y and ŷ:
   S²ₑ = Σ(y − ŷ)² / (n − 2) = SSE / (n − 2) = MSE (Mean Squared Error)
3. Obtain the sum of squares of x from the variance of x:
   Σ(x − x̄)² = S²ₓ (n − 1)
4. The standard error of the regression coefficient is:
   SE(β̂₁) = √( S²ₑ / Σ(x − x̄)² )

 
This estimate of the true standard error of β̂₁ is unbiased on the condition that the dispersion of the points about the regression line is approximately the same along the length of the line. This will happen if the variance of Y is the same for every value of X, that is, if Y is homoscedastic. If this condition is not met, then the estimate of the standard error of β̂₁ may be larger or smaller than the true standard error, and there is no way of telling which.

In summary, we can estimate the standard error of the regression coefficient from our sample and construct confidence intervals, under the following assumptions:
- The dependent variable has a normal distribution for all values of the independent variable.
- The variance of the dependent variable is equal for all values of the independent variable.
- If the independent variable is interval, its distribution is normal.
- The relationship between the two variables is linear.
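A sketch of steps 1-4 on illustrative data, extended with a 95% confidence interval for the slope (the t critical value uses n − 2 degrees of freedom):

```python
# Sketch: standard error and 95% CI for the slope.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.9, 6.1, 7.8, 10.2, 11.9])
n = x.size

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)                        # step 1

mse = np.sum(resid ** 2) / (n - 2)               # step 2: S_e^2 = MSE
se_b1 = np.sqrt(mse / np.sum((x - x.mean()) ** 2))  # steps 3-4

t_crit = stats.t.ppf(0.975, df=n - 2)
print("slope:", b1)
print("95% CI:", (b1 - t_crit * se_b1, b1 + t_crit * se_b1))
```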
F TEST IN LINEAR REGRESSION

We can test the null hypothesis that β₁ = 0 with a different test based on analysis of variance.

The figure compares a situation where the null hypothesis is true, on the left, with a situation where the null hypothesis is false, on the right. When the two variables are independent, β₁ = 0 and the slope of the sample regression line will be very nearly zero (not exactly zero because of sampling variation).

RESIDUAL MEAN SQUARE: An estimate of the variance of Y for fixed values of X can be obtained from the variance of the residuals, that is, the variance of the departures of each y from the value predicted by the regression.

If the null hypothesis is false, the regression line will be steep and the departures of the values y from the regression line will be less than the departures from ȳ. Therefore, the residual mean square will be smaller than the total variance of Y. We can compare the two estimates by taking their ratio. The resulting variance ratio would follow an F distribution if the two estimates of the variance were independent, and if the null hypothesis were false the variance ratio would have a value much larger than expected under H₀.
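A sketch of this variance-ratio test on illustrative data, with 1 and n − 2 degrees of freedom:

```python
# Sketch: F test for H0: beta1 = 0 via the variance ratio MSR / MSE.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.9, 6.1, 7.8, 10.2, 11.9])
n = x.size

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

MSR = np.sum((y_hat - y.mean()) ** 2) / 1   # regression mean square
MSE = np.sum((y - y_hat) ** 2) / (n - 2)    # residual mean square
F = MSR / MSE
p = stats.f.sf(F, 1, n - 2)                 # upper-tail F probability
print("F:", F, "p-value:", p)
```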
SECTION 4
QUIZ

The regression line is drawn so that:


A. The line goes through more points than any other possible line, straight or
curved
B. The line goes through more points than any other possible straight line.
C. The same number of points are below and above the regression line.
D. The sum of the absolute errors is as small as possible.
E. The sum of the squared errors is as small as possible.

E. The sum of the squared errors is as small as possible.


Q U I Z

In order for the regression technique to give the best and minimum variance prediction, all the
following conditions must be met, EXCEPT for:
A. The relation is linear.
B. We have not omitted any significant variable.
C. Both the X and Y variables (the predictors and the response) are normally distributed.
D. The residuals (errors) are normally distributed.
E. The variance around the regression line is about the same for all values of the predictor.

C. Both the X and Y variables (the predictors and the response) are normally
distributed.
Q U I Z

If a regression has the problem of heteroscedasticity,


A. The predictions it makes will be wrong on average.
B. The predictions it makes will be correct on average, but we will not be certain of
the RMSE (root-mean-square error)
C. It will also have the problem of an omitted variable or variables.
D. It will also be based on a non-linear equation

The assumption of homoscedasticity (meaning “same variance”) is central to linear regression models.  Homoscedasticity
describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the
independent variables and the dependent variable) is the same across all values of the independent variables. 
Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an
independent variable.  The impact of violating the assumption of homoscedasticity is a matter of degree, increasing as
heteroscedasticity increases.

B. Heteroscedasticity implies that the variance will differ for different values of the
regressor.
Q U I Z

In regression, the equation that describes how the response variable (y) is
related to the explanatory variable (x) is:
a. the correlation model
b. the regression model
c. used to compute the correlation coefficient
d. None of these alternatives is correct.

b. the regression model



Q U I Z

The relationship between number of beers consumed (x) and blood alcohol content (y) was studied
in 16 male college students by using least squares regression. The following regression equation
was obtained from this study:
y= -0.0127 + 0.0180x
The above equation implies that:
a. each beer consumed increases blood alcohol by 1.27%
b. on average it takes 1.8 beers to increase blood alcohol content by 1%
c. each beer consumed increases blood alcohol by an average amount of 1.8%
d. each beer consumed increases blood alcohol by exactly 0.018

c. each beer consumed increases blood alcohol by an average amount of 1.8%


Q U I Z

SSE can never be


a. larger than SST
b. smaller than SST
c. equal to 1
d. equal to zero

a. larger than SST


Q U I Z

Regression modeling is a statistical framework for developing a


mathematical equation that describes how
a. one explanatory and one or more response variables are related
b. several explanatory and several response variables response are related
c. one response and one or more explanatory variables are related
d. All of these are correct.

c. one response and one or more explanatory variables are related


Q U I Z

In regression analysis, the variable that is being predicted is the


a. response, or dependent, variable
b. independent variable
c. intervening variable
d. is usually x

a. response, or dependent, variable


Q U I Z

In least squares regression, which of the following is not a required


assumption about the error term ε?
a. The expected value of the error term is one.
b. The variance of the error term is the same for all values of x.
c. The values of the error term are independent.
d. The error term is normally distributed.

a. The expected value of the error term is one.


Q U I Z

Larger values of r² imply that the observations are more closely grouped about the
a. average value of the independent variables
b. average value of the dependent variable
c. least squares line
d. origin

c. least squares line


Q U I Z

In a regression analysis, if r² = 1, then


a. SSE must also be equal to one
b. SSE must be equal to zero
c. SSE can be any positive value
d. SSE must be negative

b. SSE must be equal to zero


Q U I Z

In regression analysis, the variable that is used to explain the change in the
outcome of an experiment, or some natural process, is called
a. the x-variable
b. the independent variable
c. the predictor variable
d. the explanatory variable
e. all of the above (a-d) are correct
f. none are correct

e. all of the above (a-d) are correct


Q U I Z

In the case of an algebraic model for a straight line, if a value for the x variable is
specified, then
a. the exact value of the response variable can be computed
b. the computed response to the independent value will always give a minimal
residual
c. the computed value of y will always be the best estimate of the mean response
d. none of these alternatives is correct.

a. the exact value of the response variable can be computed


Q U I Z

 
In a regression and correlation analysis, if r² = 1, then
a. SSE = SST
b. SSE = 1
c. SSR = SSE
d. SSR = SST

d. SSR = SST
Q U I Z

If the coefficient of determination is a positive value, then the regression equation


a. must have a positive slope
b. must have a negative slope
c. could have either a positive or a negative slope
d. must have a positive y intercept

c. could have either a positive or a negative slope


Q U I Z

If two variables, x and y, have a very strong linear relationship, then


a. there is evidence that x causes a change in y
b. there is evidence that y causes a change in x
c. there might not be any causal relationship between x and y
d. None of these alternatives is correct.

c. there might not be any causal relationship between x and y


Q U I Z

In regression analysis, if the independent variable is measured in kilograms, the


dependent variable
a. must also be in kilograms
b. must be in some unit of weight
c. cannot be in kilograms
d. can be any units

d. can be any units


Q U I Z

In a regression analysis if SSE = 200 and SSR = 300, then the coefficient of
determination is
a. 0.6667
b. 0.6000
c. 0.4000
d. 1.5000

b. 0.6000
Q U I Z

A fitted least squares regression line


a. may be used to predict a value of y if the corresponding x value is given
b. is evidence for a cause-effect relationship between x and y
c. can only be computed if a strong linear relationship exists between x and y
d. None of these alternatives is correct.

a. may be used to predict a value of y if the corresponding x value is given


Q U I Z

You have carried out a regression analysis; but, after thinking about the relationship
between variables, you have decided you must swap the explanatory and the
response variables. After refitting the regression model to the data you expect that:
a. the value of the correlation coefficient will change
b. the value of SSE will change
c. the value of the coefficient of determination will change
d. the sign of the slope will change
e. nothing changes

b. the value of SSE will change


Q U I Z

Suppose you use regression to predict the height of a woman’s current boyfriend by using her
own height as the explanatory variable. Height was measured in feet from a sample of 100
women undergraduates, and their boyfriends, at Dalhousie University. Now, suppose that the
height of both the women and the men are converted to centimeters. The impact of this
conversion on the slope is:
a. the sign of the slope will change
b. the magnitude of the slope will change
c. both a and b are correct
d. neither a nor b are correct

d. neither a nor b are correct


Q U I Z

A residual plot:
a. displays residuals of the explanatory variable versus residuals of the response
variable.
b. displays residuals of the explanatory variable versus the response variable.
c. displays explanatory variable versus residuals of the response variable.
d. displays the explanatory variable versus the response variable.
e. displays the explanatory variable on the x axis versus the response variable on
the y axis.

c. displays explanatory variable versus residuals of the response variable.


Q U I Z

When the error terms have a constant variance, a plot of the residuals versus
the independent variable x has a pattern that
a. fans out
b. funnels in
c. fans out, but then funnels in
d. forms a horizontal band pattern
e. forms a linear pattern that can be positive or negative

d. forms a horizontal band pattern


THANK YOU
S E R V E R . R A R E D I S . O R G / E D U
