
BUSINESS STATISTICS CLASS    Lecturer: Dr. Phan Nguyen Ky Phuc

Regression Analysis

Problem: given a data set with one output and multiple inputs

Y   72  76  78  70  68  80  82  65  62  90
X1  12  11  15  10  11  16  14   8   8  18
X2   5   8   6   5   3   9  12   4   3  10

Assume that the relationship between the output and the inputs is linear, so it can be expressed as

Y = \beta_0 + \sum_{i=1}^{k} \beta_i X_i + \epsilon

where \epsilon \sim N(0, \sigma^2) can be interpreted as noise. In this model \beta_0, \beta_1, \beta_2, \ldots, \beta_k are unknown. Our objective is to reconstruct a prediction model of Y that creates the minimum error based on the given data set.

Assume that our forecasting model is

\hat{Y} = b_0 + \sum_{i=1}^{k} b_i X_i

So the squared error between the forecasting model and the output of a record is

error^2 = (\hat{Y} - Y)^2 = \left( b_0 + \sum_{i=1}^{k} b_i X_i - Y \right)^2

For example, for the 1st record (Y = 72, X_1 = 12, X_2 = 5) the squared error is

error^2 = (\hat{Y} - Y)^2 = (b_0 + 12 b_1 + 5 b_2 - 72)^2

Since we consider the whole data set, the squared errors must be summed over all records:

L = \sum_{n} error^2 = \sum_{n} (\hat{Y} - Y)^2 = \sum_{n} \left( b_0 + \sum_{i=1}^{k} b_i X_i - Y \right)^2

where n is the number of records in the data set. For the above data set, n = 10.
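The loss L can be evaluated directly on the data set; a minimal sketch in plain Python (no libraries), using the columns Y, X1, X2 from the table above:

```python
# Minimal sketch: evaluate the loss L = sum of squared errors over all
# 10 records of the data set above, for a given guess (b0, b1, b2).
Y  = [72, 76, 78, 70, 68, 80, 82, 65, 62, 90]
X1 = [12, 11, 15, 10, 11, 16, 14,  8,  8, 18]
X2 = [ 5,  8,  6,  5,  3,  9, 12,  4,  3, 10]

def loss(b0, b1, b2):
    """L = sum over all records of (b0 + b1*X1 + b2*X2 - Y)^2."""
    return sum((b0 + b1 * x1 + b2 * x2 - y) ** 2
               for y, x1, x2 in zip(Y, X1, X2))

# A rough guess gives a large L; the fitted coefficients found later
# in the notes shrink it to about 25.56 (the SSE in the ANOVA table).
print(loss(50.0, 1.0, 1.0))
```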



To minimize the squared error, we take the first derivative of L with respect to each coefficient and set it equal to zero:

\frac{\partial L}{\partial b_0} = \sum_{n} 2 \left( b_0 + \sum_{i=1}^{k} b_i X_i - Y \right) = 0

\frac{\partial L}{\partial b_i} = \sum_{n} 2 X_i \left( b_0 + \sum_{i=1}^{k} b_i X_i - Y \right) = 0

Solving this system of linear equations, we obtain the values of b_i.
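For this data set (n = 10, k = 2), the two conditions above yield the following system of normal equations, with the sums computed from the data table at the top of the notes:

```latex
% Normal equations for the 10-record example.
% Coefficients are \sum 1, \sum X_1, \sum X_2, etc.;
% right-hand sides are \sum Y, \sum X_1 Y, \sum X_2 Y.
\begin{aligned}
 10\,b_0 +  123\,b_1 +  65\,b_2 &= 743  \\
123\,b_0 + 1615\,b_1 + 869\,b_2 &= 9382 \\
 65\,b_0 +  869\,b_1 + 509\,b_2 &= 5040
\end{aligned}
```

Solving this 3×3 system reproduces the coefficient values quoted below.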

With the above example,

b_0 = 47.1649, \quad b_1 = 1.5990, \quad b_2 = 1.1487

The next two problems that concern us are:

 Whether this regression model is valid (good enough to explain the data)
 Whether we can exclude some inputs, i.e., simplify the current model while still explaining the data

To answer the 1st concern we use the F-test.

ANOVA Table for Multiple Regression

Source of Variation   Sum of Squares   Dof            Mean Square                 F Ratio
Regression            SSR              k              MSR = SSR / k               F = MSR / MSE
Error                 SSE              n - (k + 1)    MSE = SSE / (n - (k + 1))
Total                 SST              n - 1

n: number of data records

k: number of inputs

R^2 = 1 - \frac{SSE}{SST}; \qquad \bar{R}^2 = 1 - \frac{MSE}{MST} (the adjusted R^2)

In the above example k=2, n=10.


\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2

SST = SSE + SSR

(Total sum of squares) = (Sum of squares for error) + (Sum of squares for regression)

ANOVA table

Source       Dof   SS       MS       F       p
Regression   2     630.54   315.27   86.34   0.000
Error        7     25.56    3.65
Total        9     656.10

R-sq = 96.1%   R-sq(adj) = 95.0%
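The entries of this ANOVA table follow from the sum-of-squares identity above; a sketch with NumPy (assumed available) that recomputes them from the raw data:

```python
import numpy as np

# Reproduce the ANOVA decomposition for the example (k = 2 inputs, n = 10).
Y  = np.array([72, 76, 78, 70, 68, 80, 82, 65, 62, 90], dtype=float)
X1 = np.array([12, 11, 15, 10, 11, 16, 14,  8,  8, 18], dtype=float)
X2 = np.array([ 5,  8,  6,  5,  3,  9, 12,  4,  3, 10], dtype=float)
n, k = len(Y), 2

X = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ b

SST = np.sum((Y - Y.mean()) ** 2)        # total sum of squares
SSE = np.sum((Y - Y_hat) ** 2)           # error sum of squares
SSR = np.sum((Y_hat - Y.mean()) ** 2)    # regression sum of squares

MSR = SSR / k
MSE = SSE / (n - (k + 1))
F   = MSR / MSE
R2  = 1 - SSE / SST

print(round(SSR, 2), round(SSE, 2), round(SST, 2))  # 630.54 25.56 656.1
print(round(F, 2), round(R2, 3))                    # 86.34 0.961
```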

To answer the 2nd concern we use a t-test for each coefficient.

In the t-test for coefficient \beta_i the hypothesis test is H_0: \beta_i = 0 versus H_1: \beta_i \neq 0. Running the software on the above data set, we obtain

Predictor   Coef     Stdev    t-ratio   p
Constant    47.165   2.470    19.09     0.000
X1          1.5990   0.2810   5.69      0.000
X2          1.1487   0.3052   3.76      0.007

To look up the critical value in the t-table we use the degrees of freedom of SSE, i.e., n - (k + 1) = 7.
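The Stdev and t-ratio columns can be recomputed from Var(b) = MSE (X'X)^{-1}; a sketch with NumPy (assumed available), using the same data arrays as above:

```python
import numpy as np

# Recompute standard errors and t-ratios for each coefficient.
Y  = np.array([72, 76, 78, 70, 68, 80, 82, 65, 62, 90], dtype=float)
X1 = np.array([12, 11, 15, 10, 11, 16, 14,  8,  8, 18], dtype=float)
X2 = np.array([ 5,  8,  6,  5,  3,  9, 12,  4,  3, 10], dtype=float)
n, k = len(Y), 2

X = np.column_stack([np.ones(n), X1, X2])
b = np.linalg.solve(X.T @ X, X.T @ Y)            # least-squares coefficients
MSE = np.sum((Y - X @ b) ** 2) / (n - (k + 1))   # dof of SSE = n - (k+1) = 7

# Var(b) = MSE * (X'X)^{-1}; the standard error is the sqrt of the diagonal.
se = np.sqrt(MSE * np.diag(np.linalg.inv(X.T @ X)))
t  = b / se                                      # t-ratio per coefficient

for name, bi, si, ti in zip(["Constant", "X1", "X2"], b, se, t):
    print(f"{name:8s}  {bi:8.4f}  {si:6.4f}  {ti:6.2f}")
```

Each t-ratio is compared against the t-table with 7 degrees of freedom, matching the p-values in the table above.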
