
STATISTICS IN ECONOMICS AND BUSINESS

Nguyen Huyen Trang


Faculty of Statistics - National Economics University
trangtk@neu.edu.vn
LECTURE 10: REGRESSION

➢ Causal relationships
➢ Methods of correlation
➢ Regression analysis
CAUSAL RELATIONSHIPS

There are many examples in economics where we need to consider the relationship between two or more variables:
▪ Consumption and income
▪ Inflation and unemployment
▪ Output and costs
▪ Advertising and Sales Revenue
THE RELATIONSHIP BETWEEN X AND Y

• Correlation: Is there a relationship between two variables?

• Regression: How well does an independent variable predict the dependent variable?
METHODS OF CORRELATION

• Scatter Plots
• Covariance
• Correlation coefficient
SCATTER PLOTS

Indicative of the type of relationship between your two variables

[Scatter plots: a positive relationship (height in cm vs. age in weeks) and a negative relationship (reliability vs. age of car)]
SCATTER PLOTS

[Scatter plot: no relationship between the two variables]
COVARIANCE

Using the formulas:

Variance gives information on the variability of a single variable:

$$S_x^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

Covariance gives information on the degree to which two variables vary together:

$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$$
COVARIANCE

Measures the combined variability of $X$ and $Y$:

$$\mathrm{Cov}(X, Y) = s_{XY} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

▪ When X and Y tend to move in the same direction: cov(x, y) > 0
▪ When X and Y tend to move in opposite directions: cov(x, y) < 0
▪ When there is no consistent relationship: cov(x, y) = 0
EXERCISE 1

Two data sets, side by side:

Subject |  x  |  y  | (x−x̄)(y−ȳ) |  x  |  y  | (x−x̄)(y−ȳ)
1       | 101 | 100 | 2500        | 54  | 53  | 9
2       |  81 |  80 |  900        | 53  | 52  | 4
3       |  61 |  60 |  100        | 52  | 51  | 1
4       |  51 |  50 |    0        | 51  | 50  | 0
5       |  41 |  40 |  100        | 50  | 49  | 1
6       |  21 |  20 |  900        | 49  | 48  | 4
7       |   1 |   0 | 2500        | 48  | 47  | 9
Mean    |     |     |             |     |     |
Sum of (x−x̄)(y−ȳ):                Sum of (x−x̄)(y−ȳ):
Covariance:                        Covariance:
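For readers who want to check their answers, here is a minimal Python sketch of the covariance formula above (the `covariance` helper is an illustrative name, not a library function):

```python
def covariance(x, y):
    # Sample covariance: sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1.
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# The two data sets from the Exercise 1 table.
x1, y1 = [101, 81, 61, 51, 41, 21, 1], [100, 80, 60, 50, 40, 20, 0]
x2, y2 = [54, 53, 52, 51, 50, 49, 48], [53, 52, 51, 50, 49, 48, 47]

print(covariance(x1, y1))  # 7000 / 6, approximately 1166.67
print(covariance(x2, y2))  # 28 / 6, approximately 4.67
```

Both data sets follow the same perfectly linear relationship, yet the covariances differ enormously because the second data set has much smaller deviations; this is exactly the interpretability problem the correlation coefficient addresses next.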
CORRELATION COEFFICIENTS

▪ The size of the covariance does not really tell us anything by itself, because it depends on the units of X and Y

→ Standardize this measure → Correlation Coefficient (r)

▪ r measures the direction and strength of the linear relationship between two variables
CORRELATION COEFFICIENTS

$$r = \frac{\mathrm{Cov}(X, Y)}{S_X S_Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

The value of r ranges between -1 and 1

The sign of r denotes the direction of association

The magnitude (absolute value) of r denotes the strength of association


CORRELATION

Graph and Correlation Coefficient (r)

[Scatter plots illustrating r: r = 0.8 (strong, positively correlated), r = 0.5 (weak, positively correlated), r = −0.5 (negatively correlated), r = 0 (not correlated)]
CORRELATION COEFFICIENTS

r runs from −1 to 1: the closer to −1, the stronger the negative (indirect, inverse) relationship; the closer to +1, the stronger the positive (direct) relationship; values near 0 indicate no relationship.
CORRELATION COEFFICIENTS

Absolute value of r | General interpretation
0.8 – 1.0           | Very strong
0.6 – 0.8           | Strong
0.4 – 0.6           | Moderate
0.2 – 0.4           | Weak
0.0 – 0.2           | Very weak or no relationship

The strength of the correlation depends on how closely the data points in the scatter plot follow a pattern.
EXERCISE 2

A sample of 6 children was selected, and data about their age in years and weight in kilograms were recorded as shown in the following table. It is required to find the correlation between age and weight.

Child No. | Age (years) | Weight (kg)
1         | 7           | 12
2         | 6           | 8
3         | 8           | 12
4         | 5           | 10
5         | 6           | 11
6         | 9           | 13
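Before working this by hand, here is a minimal Python sketch of the correlation formula applied to Exercise 2 (`correlation` is an illustrative helper name, not a library call); note that the (n − 1) factors cancel between numerator and denominator, so they are omitted:

```python
from math import sqrt

def correlation(x, y):
    # Pearson r: covariance standardized by both standard deviations.
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

age = [7, 6, 8, 5, 6, 9]
weight = [12, 8, 12, 10, 11, 13]
print(correlation(age, weight))  # approximately 0.76: a strong positive correlation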
REGRESSION ANALYSIS

➢ Correlation describes the strength of a linear relationship between two variables
➢ Regression tells us how to draw the straight line described by the correlation
REGRESSION ANALYSIS

A technique concerned with predicting some variables by knowing others
The process of predicting variable Y using variable X
Tells you how values of Y change as a function of changes in values of X
REGRESSION ANALYSIS

• Regression analysis is used to:
  • Predict the value of a dependent variable based on the value of at least one independent variable
  • Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable
SINGLE REGRESSION

• Variable: $Y \leftarrow X$

Y                   | X
Dependent variable  | Independent variable
Explained variable  | Explanatory variable
Controlled variable | Control variable
Predictand          | Predictor
Endogenous          | Exogenous
Example: Quantity of sales depends on Price
Advertising expenditure affects Revenue
Selling price of a home depends on its size
REGRESSION MODEL

➢ Single Regression
➢ Least Squares estimation method
➢ Coefficient of determination
➢ Population vs Sample Regression
➢ T-test for Coefficient
➢ Confidence interval of Coefficient
➢ F-test for goodness of fit
SIMPLE LINEAR REGRESSION MODEL

The population regression model:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where $Y_i$ is the dependent variable, $\beta_0$ the population Y-intercept, $\beta_1$ the population slope coefficient, $X_i$ the independent variable, and $\varepsilon_i$ the random error term.
SIMPLE LINEAR REGRESSION MODEL

[Graph of $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$: for a given $X_i$, the observed value of Y lies a random error $\varepsilon_i$ away from the predicted value of Y on the line; the line has intercept $\beta_0$ and slope $\beta_1$]
SIMPLE LINEAR REGRESSION MODEL

The sample regression model:

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

where $\hat{y}_i$ is the estimated (or predicted) y value for observation i, $\hat{\beta}_0$ the estimate of the regression intercept, $\hat{\beta}_1$ the estimate of the regression slope, and $x_i$ the value of x for observation i.

The individual random error terms $e_i$ have a mean of zero:

$$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$$

SIMPLE LINEAR REGRESSION MODEL

➢ $\hat{\beta}_0$ is the estimated average value of y when the value of x is zero (if x = 0 is in the range of observed x values)

➢ $\hat{\beta}_1$ is the estimated change in the average value of y as a result of a one-unit change in x
[Scatter plot with the fitted line $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$; the residuals $e_1, \dots, e_{10}$ are shown as the vertical distances from each observed point to the line]
LEAST SQUARES METHOD

▪ Used to estimate the coefficients
▪ Estimated regression: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
▪ Minimize $RSS = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2$

The normal equations:

$$\sum y_i = n \hat{\beta}_0 + \hat{\beta}_1 \sum x_i$$

$$\sum x_i y_i = \hat{\beta}_0 \sum x_i + \hat{\beta}_1 \sum x_i^2$$

Solving them gives:

$$\hat{\beta}_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}; \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
LEAST SQUARES METHOD

The coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$, and other regression results in this topic, will be found using a computer:
• Hand calculations are tedious
• Statistical routines are built into Excel
• Other statistical analysis software can be used
EXERCISE 3

• A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
• A random sample of 10 houses is selected
• Dependent variable (Y) = house price in $1000s
• Independent variable (X) = square feet
EXERCISE 3

House Price in $1000s (Y) | Square Feet (X)
245                       | 1400
312                       | 1600
279                       | 1700
308                       | 1875
199                       | 1100
219                       | 1550
405                       | 2350
324                       | 2450
319                       | 1425
255                       | 1700
EXERCISE 3
House price model: scatter plot

[Scatter plot: House Price ($1000s), 0–450, against Square Feet, 0–3000]
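As a cross-check on the software output that follows, here is a minimal Python sketch of the least squares formulas from above (`ols_fit` is an illustrative name, not a library function); run on the house data it should reproduce the coefficients in the output below:

```python
def ols_fit(x, y):
    # Closed-form least squares estimates:
    #   beta1 = (mean(x*y) - mean(x)*mean(y)) / (mean(x^2) - mean(x)^2)
    #   beta0 = mean(y) - beta1 * mean(x)
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    xy_bar = sum(xi * yi for xi, yi in zip(x, y)) / n
    x2_bar = sum(xi ** 2 for xi in x) / n
    beta1 = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # in $1000s

beta0, beta1 = ols_fit(square_feet, price)
print(beta0, beta1)  # approximately 98.24833 and 0.10977
```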
SPSS OUTPUT

Regression Statistics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10

ANOVA
df SS MS F Significance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000

             Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Constant     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet  0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
EXERCISE 3

$$\hat{y}_i = 98.24833 + 0.10977\, x_i$$

▪ $\hat{\beta}_0$ is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values). Here, no houses had 0 square feet, so $\hat{\beta}_0 = 98.24833$ just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet
▪ Here, $\hat{\beta}_1 = 0.10977$ tells us that the average value of a house increases by 0.10977 × $1000 = $109.77 for each additional square foot of size
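As an illustration of using the estimated equation, consider a hypothetical 2,000-square-foot house (not one of the sampled homes, but within the observed range of sizes); its predicted price would be:

$$\hat{y} = 98.24833 + 0.10977 \times 2000 \approx 317.79 \;(\$1000\text{s}) \approx \$317{,}788$$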
ANALYSIS OF VARIANCE

• $SST = \sum (Y_i - \bar{Y})^2$: Total Sum of Squares: total variation in the dependent variable
• $SSR = \sum (\hat{Y}_i - \bar{Y})^2$: Sum of Squares of Regression: variation in the dependent variable explained by variation in the independent variable
• $SSE = \sum (Y_i - \hat{Y}_i)^2$: Sum of Squares of Residual: variation due to errors
• $SST = SSR + SSE$

where $\bar{y}$ = average value of the dependent variable, $y_i$ = observed values of the dependent variable, and $\hat{y}_i$ = predicted value of y for the given $x_i$ value.
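The decomposition can be verified directly from the SS column of the ANOVA table in the output above:

$$SSR + SSE = 18934.9348 + 13665.5652 = 32600.5000 = SST$$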
COEFFICIENT OF DETERMINATION

• Goodness of fit measure:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

• $0 \le R^2 \le 1$
• $R^2$ measures the proportion of variation in the dependent variable that is explained by the variation in the independent variable
• $R^2 = 0$: horizontal regression line
• $R^2 = 1$: all points lie exactly on a straight line
COEFFICIENT OF DETERMINATION

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \qquad (0 \le R^2 \le 1)$$

[Four scatter plots of Y against X, illustrating $R^2 = 1$ (all points exactly on the line), a high $R^2$, a low $R^2$, and $R^2 \approx 0$]
SPSS OUTPUT

Regression Statistics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10

ANOVA
df SS MS F Significance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000

             Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Constant     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet  0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
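Checking "R Square" against the ANOVA table above:

$$R^2 = \frac{SSR}{SST} = \frac{18934.9348}{32600.5000} \approx 0.58082$$

so about 58% of the variation in house prices is explained by variation in square footage.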
POPULATION VS SAMPLE REGRESSION

▪ Population regression model:
$$Y = \beta_0 + \beta_1 X + u$$
▪ Sample regression model:
$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + e_i$$
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$
▪ $\hat{\beta}_0$ is the estimate of $\beta_0$
▪ $\hat{\beta}_1$ is the estimate of $\beta_1$
▪ Least Squares method: $RSS = \sum e_i^2 \to \min$
▪ Calculated standard errors: $SE(\hat{\beta}_0)$, $SE(\hat{\beta}_1)$
T-TEST FOR COEFFICIENT

▪ Based on the assumption of normality of the errors
▪ The test compares the value of $\beta_1$ with a value $b$
▪ Two-tail hypothesis pair:

$$H_0: \beta_1 = b; \qquad H_1: \beta_1 \ne b$$

$$T\text{-stat} = \frac{\hat{\beta}_1 - b}{SE(\hat{\beta}_1)}$$

$$T\text{-critical}: \; t^{(n-k)}_{\alpha/2}$$

where $n$ is the number of observations and $k$ is the number of coefficients.
T-TEST FOR COEFFICIENT

In all three cases the test statistic is $T = \dfrac{\hat{\beta}_1 - b}{SE(\hat{\beta}_1)}$.

Hypothesis pair                            | Critical value         | Reject $H_0$ when
$H_0: \beta_1 = b$; $H_1: \beta_1 > b$     | $t^{(n-k)}_{\alpha}$   | $T > t^{(n-k)}_{\alpha}$
$H_0: \beta_1 = b$; $H_1: \beta_1 < b$     | $-t^{(n-k)}_{\alpha}$  | $T < -t^{(n-k)}_{\alpha}$
$H_0: \beta_1 = b$; $H_1: \beta_1 \ne b$   | $t^{(n-k)}_{\alpha/2}$ | $\lvert T\rvert > t^{(n-k)}_{\alpha/2}$
SPSS OUTPUT

Regression Statistics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10

ANOVA
df SS MS F Significance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000

             Coefficients  Standard Error  t Stat   Sig.     Lower 95%  Upper 95%
Constant     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet  0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
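A worked two-tail test for the slope in the output above, with $b = 0$, $n = 10$, and $k = 2$ (the critical value $t^{(8)}_{0.025} = 2.306$ is taken from standard t tables):

$$T = \frac{0.10977 - 0}{0.03297} \approx 3.329 > t^{(8)}_{0.025} = 2.306$$

so we reject $H_0: \beta_1 = 0$ at the 5% level; equivalently, Sig. = 0.01039 < 0.05. Square footage has a statistically significant effect on house price.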
CONFIDENCE INTERVAL FOR COEFFICIENT

• At a confidence level of $(1 - \alpha)$:

$$\hat{\beta}_j - t^{(n-k)}_{\alpha/2}\, SE(\hat{\beta}_j) < \beta_j < \hat{\beta}_j + t^{(n-k)}_{\alpha/2}\, SE(\hat{\beta}_j)$$
SPSS OUTPUT

Regression Statistics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10

ANOVA
df SS MS F Significance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000

             Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Constant     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet  0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
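A worked 95% confidence interval for the slope, again using $t^{(8)}_{0.025} = 2.306$ from standard t tables:

$$0.10977 \pm 2.306 \times 0.03297 = 0.10977 \pm 0.07603 \;\Rightarrow\; (0.03374,\; 0.18580)$$

matching the Lower 95% and Upper 95% columns in the output above.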
F-TEST FOR GOODNESS OF FIT

• $H_0: R^2 = 0$ (model is overall insignificant)
• $H_1: R^2 \ne 0$ (i.e. $R^2 > 0$; model is overall significant)
• F-stat:

$$F = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}$$

• Critical value: $F^{(k-1,\,n-k)}_{\alpha}$
• If F-stat > F-critical ⇒ Reject $H_0$
• Or: P-value < $\alpha$

SPSS OUTPUT

Regression Statistics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10

ANOVA
df SS MS F Significance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000

             Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Constant     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet  0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
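A worked check of the F-test against the output above, with $k = 2$ and $n = 10$ (the critical value $F^{(1,8)}_{0.05} = 5.32$ is taken from standard F tables):

$$F = \frac{0.58082 / 1}{(1 - 0.58082)/8} \approx 11.08 > F^{(1,8)}_{0.05} = 5.32$$

so we reject $H_0$ and conclude the model is overall significant; equivalently, Significance F = 0.01039 < 0.05.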
T-TEST VS F-TEST

▪ The F-test is an overall, multiple-coefficient test
▪ The T-test is a single-coefficient test
• Sometimes the F-test and T-test conflict
