
Regression Analysis

BY
DR. ISMAIL B
PROFESSOR
DEPARTMENT OF STATISTICS
MANGALORE UNIVERSITY
MANGALAGANGOTHRI

e-mail: prof.ismailb@gmail.com

Descriptive Statistics

Using the p-value to make the decision

The p-value is a probability, computed assuming the null
hypothesis is true, that the test statistic would take a value as
extreme as or more extreme than the one actually observed.
Since it is a probability, it is a number between 0 and 1; the
closer it is to 0, the more unlikely the observed result is under
the null hypothesis.
So if the p-value is small, we reject the null hypothesis.

Using the p-value to make the decision

How small is small? Smaller than the level of significance,
usually 0.05 or 0.01. So, using the p-value to make the decision:

If 0.01 < p < 0.05: significant (S)

If p < 0.01: highly significant (HS)
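This decision rule can be sketched in Python; the function name and the 0.05/0.01 cutoffs follow the convention above, and the labels (S, HS, NS) are illustrative:

```python
def label_significance(p, alpha=0.05, alpha_high=0.01):
    """Classify a p-value with the usual cutoffs: < .01 highly
    significant (HS), < .05 significant (S), otherwise NS."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie in [0, 1]")
    if p < alpha_high:
        return "HS"
    if p < alpha:
        return "S"
    return "NS"

print(label_significance(0.03), label_significance(0.0004))  # S HS
```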

What is the relationship between the variables?

An equation is used, with:
one numerical dependent (response) variable, the quantity to be predicted: Y;
one or more numerical or categorical independent (explanatory) variables: X.
Different techniques are used for different levels of measurement.

Types of Regression Models

Regression models split first by the number of explanatory
variables, then by the form of the relationship:

Regression Models
- 1 explanatory variable: Simple regression
  - Linear
  - Non-linear
- 2+ explanatory variables: Multiple regression
  - Linear
  - Non-linear

Types of Regression Models

The same tree applies whether the dependent variable enters in
linear or log-linear form:

Regression Models (dependent variable: linear or log-linear)
- 1 explanatory variable: Simple regression (linear or non-linear)
- 2+ explanatory variables: Multiple regression (linear or non-linear)

Linear Equations

Y = bX + a

b = slope = change in Y / change in X
a = Y-intercept

The simple linear regression model is given by

Y = a + bX + e

Simple Linear Regression Model

The relationship between the variables is a linear function: the
straight line that best fits the data.

Yi = β0 + β1 Xi + εi

where Yi is the dependent (response) variable, Xi the independent
(explanatory) variable, β0 the Y-intercept (constant term), β1 the
slope, and εi the random error.

Linear Regression Model

[Figure: scatter of observed values around the fitted line]

Observed value:   Yi = β0 + β1 Xi + εi
Regression line:  E(Y|X) = β0 + β1 Xi

εi = random error, the vertical distance from the observed value
to the line.

Assumptions

1. E(εi) = 0: the disturbances have zero mean.
2. V(εi) = σ², i = 1, 2, ..., n: the disturbances have constant
   variance.
3. E(εi εj) = 0 for i ≠ j: the disturbances are uncorrelated.
4. The explanatory variable X is non-stochastic, i.e. fixed in
   repeated samples, and hence not correlated with the disturbances.
5. Σ_{t=1}^{n} x_t² / n ≠ 0 and has a finite limit as n → ∞.
   This assumption states that we have at least two distinct values of X.

The Sum of Squares

[Figure: the three deviations at a point Xi, relative to the mean Ȳ]

SST = Σ(Yi − Ȳ)²   (total)
SSR = Σ(Ŷi − Ȳ)²   (regression)
SSE = Σ(Yi − Ŷi)²  (error)

BLUE: the least squares estimator of β is

β̂ = Σ xi yi / Σ xi²   (xi, yi in deviations from their means)

α̂ = Ȳ − β̂ X̄

We can write the SLR model for all the observations as

Y = Xβ + ε,

where Y = (y1, y2, ..., yn)' and X is the n × 2 matrix whose rows
are (1, Xi). Then

β̂ = (X'X)⁻¹ X'Y,   V(β̂) = (X'X)⁻¹ σ²
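A minimal pure-Python sketch of the deviation-form formulas above (the function name and the sample data are illustrative):

```python
def ols_simple(X, Y):
    """OLS estimates for Y = alpha + beta*X + e, via deviation sums:
    beta-hat = sum(xi*yi)/sum(xi^2), alpha-hat = Ybar - beta-hat*Xbar."""
    n = len(X)
    xbar = sum(X) / n
    ybar = sum(Y) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
    sxx = sum((x - xbar) ** 2 for x in X)
    beta = sxy / sxx
    alpha = ybar - beta * xbar
    return alpha, beta

# A perfect line y = 1 + 2x recovers its own coefficients.
a, b = ols_simple([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```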

V(β̂) = σ² / Σ xi²

Estimation of σ²:

σ̂² = s² = Σ ei² / (n − 2),   where ei = Yi − α̂ − β̂ Xi   (residual)

Testing H0: β = 0:

t_obs = β̂ / S.E.(β̂),   S.E.(β̂) = √(s² / Σ xi²)

If |t_obs| > t_{α/2, n−2}, reject H0: β = 0 at the α% significance
level.

A (1 − α)100% (e.g. 95%) confidence interval for β is
β̂ ± t_{α/2, n−2} · se(β̂).
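The residual variance and slope t-statistic above can be sketched in a few lines of Python (function name and data are illustrative):

```python
import math

def slr_t_stat(X, Y):
    """Slope t-statistic for H0: beta = 0 in simple linear regression:
    t = beta-hat / sqrt(s^2 / sum(xi^2)), with s^2 = sum(ei^2)/(n-2)."""
    n = len(X)
    xbar, ybar = sum(X) / n, sum(Y) / n
    sxx = sum((x - xbar) ** 2 for x in X)
    beta = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sxx
    alpha = ybar - beta * xbar
    e = [y - alpha - beta * x for x, y in zip(X, Y)]
    s2 = sum(ei ** 2 for ei in e) / (n - 2)  # sigma^2-hat on n-2 d.f.
    return beta / math.sqrt(s2 / sxx)

t = slr_t_stat([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
print(round(t, 2))  # 3.58
```

Compare |t| against t_{α/2, n−2} from a t-table to make the decision.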

A measure of fit:

ei = Yi − Ŷi,   Σ ei = 0   as long as there is a constant in the
regression.

R² = Σ ŷi² / Σ yi² = 1 − Σ ei² / Σ yi²   (yi, ŷi in deviations
from the mean)

(i) R² is the squared correlation between Y and Ŷ.
(ii) In simple regression, R² is the squared correlation between
Y and X.

If the intercept is not present, the uncentered R² is used as the
measure of fit:

uncentered R² = 1 − Σ ei² / Σ Yi²

The centered R² is

centered R² = 1 − Σ ei² / Σ (Yi − Ȳ)²

Prediction:

Y0 = α + β X0 + ε0

The BLUP of E(Y0) is

Ŷ0 = α̂ + β̂ X0,

using the Gauss-Markov result, with

V(Ŷ0) = σ² (1/n + (X0 − X̄)² / Σ xi²).

One can construct 95% confidence intervals for these predictions
for every value of X0, given by

Ŷ0 ± t_{0.025, n−2} · s · √(1 + 1/n + (X0 − X̄)² / Σ xi²),

where t_{0.025, n−2} is the 2.5% critical value of the
t-distribution with n − 2 d.f.
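The prediction interval above can be sketched as follows; the critical value t_{0.025, n−2} is passed in from a t-table, and the function name and data are illustrative:

```python
import math

def prediction_interval(X, Y, x0, t_crit):
    """95% prediction interval for a new Y at x0:
    Y0-hat +/- t_crit * s * sqrt(1 + 1/n + (x0 - xbar)^2 / sum(xi^2)).

    t_crit is the 2.5% critical value of the t-distribution with
    n-2 d.f., taken from a t-table (e.g. 3.182 for 3 d.f.)."""
    n = len(X)
    xbar, ybar = sum(X) / n, sum(Y) / n
    sxx = sum((x - xbar) ** 2 for x in X)
    beta = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sxx
    alpha = ybar - beta * xbar
    s2 = sum((y - alpha - beta * x) ** 2 for x, y in zip(X, Y)) / (n - 2)
    y0 = alpha + beta * x0
    half = t_crit * math.sqrt(s2 * (1 + 1 / n + (x0 - xbar) ** 2 / sxx))
    return y0 - half, y0 + half

lo, hi = prediction_interval([1, 2, 3, 4, 5], [2, 3, 5, 4, 6], 3.0,
                             t_crit=3.182)
print(round(lo, 2), round(hi, 2))  # 1.23 6.77
```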

Example:

Annual consumption of 10 households, each selected randomly from a
group of households with a fixed personal disposable income. Both
income and expenditure are measured in thousands of rupees.

Solution:

Yi = α + β Xi + εi

β̂ = 0.8095 is the estimated marginal propensity to consume: the
extra consumption brought about by an extra rupee of disposable
income.

α̂ = Ȳ − β̂ X̄ = 6.5 − (0.8095)(7.5) = 0.4286.

This is the estimated consumption at zero personal disposable
income. The fitted values from this regression, the true values,
and the residuals are shown in the figure.

V̂(β̂) = s² / Σ xi² = 0.005941,   s² = 0.311905

SE(β̂) = 0.077078

and the estimated variance of α̂ is

V̂(α̂) = s² (1/n + X̄² / Σ xi²) = 0.365374,   SE(α̂) = 0.60446

The test statistic for H0: β = 0 is

t0 = β̂ / SE(β̂) = 10.50

p-value = P(|t8| > 10.5) < 0.0001, so we reject H0; hence X is
highly significant.

For H0: α = 0, t0 = 0.709, which is not significant since the
p-value = 0.498. Therefore we do not reject H0.
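The arithmetic on this slide can be reproduced directly from its reported figures (small differences come only from the slide's 4-digit rounding):

```python
import math

beta_hat = 0.8095            # slide's estimated MPC
ybar, xbar = 6.5, 7.5        # sample means from the example

alpha_hat = ybar - beta_hat * xbar   # ~0.4286, the slide's intercept
se_beta = math.sqrt(0.005941)        # ~0.077078
se_alpha = math.sqrt(0.365374)       # ~0.60446

t_beta = beta_hat / se_beta          # ~10.50 -> reject H0: beta = 0
t_alpha = alpha_hat / se_alpha       # ~0.709 -> do not reject H0: alpha = 0
print(round(t_beta, 1), round(t_alpha, 3))
```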

R² = r²_xy = (Σ xi yi)² / (Σ xi² Σ yi²) = 0.9324
   = 1 − Σ ei² / Σ yi² = 0.9324.

This means that personal disposable income explains 93.24% of the
variation in consumption.

The Sum of Squares

SST = Total Sum of Squares:
measures the variation of the Yi values around their mean Ȳ.

SSR = Regression Sum of Squares:
the explained variation, attributable to the relationship between
X and Y.

SSE = Error Sum of Squares:
the variation attributable to factors other than the relationship
between X and Y.

The Coefficient of Determination

r² = SSR / SST = regression sum of squares / total sum of squares

It measures the proportion of the variation in the dependent
variable explained by the regression line.
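The decomposition SST = SSR + SSE, and r² = SSR/SST, can be checked numerically with a short pure-Python sketch (function name and dataset are illustrative):

```python
def sums_of_squares(X, Y):
    """Return (SST, SSR, SSE) for the least-squares line of Y on X."""
    n = len(X)
    xbar, ybar = sum(X) / n, sum(Y) / n
    sxx = sum((x - xbar) ** 2 for x in X)
    beta = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sxx
    alpha = ybar - beta * xbar
    yhat = [alpha + beta * x for x in X]
    sst = sum((y - ybar) ** 2 for y in Y)
    ssr = sum((yh - ybar) ** 2 for yh in yhat)
    sse = sum((y - yh) ** 2 for y, yh in zip(Y, yhat))
    return sst, ssr, sse

sst, ssr, sse = sums_of_squares([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
assert abs(sst - (ssr + sse)) < 1e-9  # holds when an intercept is fitted
print(round(ssr / sst, 2))  # 0.81
```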

Simple Linear Regression

[Figures: worked simple linear regression example]

The multiple regression model is

Y = a0 + a1 X1 + a2 X2 + ... + an Xn + e,   e ~ N(0, σ²)

Y: response variable
X: explanatory variables
e: error

Assumptions:

Errors are independent (no autocorrelation)
Errors are normally distributed
Errors have zero mean and constant variance
No multicollinearity
Regressors are not random variables (fixed in repeated samples)

Multiple Regression


Regression diagnostics ask three questions:

Are the assumptions of multiple regression complied with?

Is the model adequate?

Is there anything unusual about any of the data points?

Plot the ACF of the residuals.

[Figure: residuals versus the fitted values (response is Crimrate)]

Durbin-Watson statistic (values range from 0 to 4).

Remedy?

Plot the residuals versus the fitted values.

Remedy?

Autocorrelated Regression

[Figure: residual plot showing autocorrelation]

Multicollinearity: check by means of the correlation matrix.

Symptoms: variance inflation, and large changes in the regression
coefficients when variables are added or deleted.
A variance inflation factor (VIF) > 4 indicates multicollinearity:

VIF = 1 / (1 − R²)

(The Durbin-Watson statistic, with values ranging from 0 to 4,
checks for autocorrelation rather than collinearity.)

Remedy?
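The VIF formula above is a one-liner; here R² is the value from regressing one explanatory variable on the remaining ones (the example values are illustrative):

```python
def vif(r_squared):
    """Variance inflation factor for one regressor: 1 / (1 - R^2),
    where R^2 comes from regressing it on the other regressors."""
    return 1.0 / (1.0 - r_squared)

print(vif(0.75))  # 4.0 -- at the slide's VIF > 4 rule-of-thumb boundary
print(vif(0.5))   # 2.0
```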

Logistic Regression

Logistic regression is a form of regression used when the
dependent variable is dichotomous (binary) and the independent
variables are of any type. Continuous variables are not used as
the dependent variable.
Logistic regression does not assume a linear relationship between
the dependent and independent variables, and does not assume
normality or homoscedasticity.
It does assume that the observations are independent and that the
independent variables are linearly related to the logit of the
dependent variable.
A scatter plot of the outcome variable (Y) versus an independent
variable shows all points falling on one of two parallel lines,
representing Y = 0 and Y = 1, so it does not provide a clear
picture of any linear relationship.
In linear regression the quantity E(Y|X) can take any value in the
range (−∞, ∞), whereas in logistic regression E(Y|X) lies in (0, 1).

Let π(x) = E(Y|X). The specific form of π(x) we use in the
logistic regression model is

π(x) = exp(β0 + β1 x) / (1 + exp(β0 + β1 x)).

The logit transformation of π(x) is given by

g(x) = ln( π(x) / (1 − π(x)) ) = β0 + β1 x.

The logit g(x) is linear in the parameters, continuous, and may
range over (−∞, ∞) depending on the range of x. We may express the
value of the outcome variable given x as

y = π(x) + ε.
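A short sketch of π(x) and its logit; the coefficient values are illustrative, and the round trip shows that the logit of π(x) recovers the linear predictor β0 + β1·x:

```python
import math

def pi(x, b0, b1):
    """Logistic response: pi(x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p):
    """Logit transform: g = ln(p / (1 - p)), linear in the parameters."""
    return math.log(p / (1.0 - p))

# pi(x) always lies in (0, 1), and logit(pi(x)) = b0 + b1*x.
b0, b1, x = -1.5, 0.8, 2.0
p = pi(x, b0, b1)
print(0.0 < p < 1.0, round(logit(p), 6))  # True 0.1
```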

Binary Logistic Regression

[Figures: worked binary logistic regression example]

Thanks !!!

