
STT151A

Statistics for Research

Part 1: Regression Analysis - Multiple Regression Model


Review of the Least Squares Method using Statistica
Example 3. Least Squares Method using Statistica
Use the FIES data, where the dependent variable is total household income and the
independent variable (predictor) is the total number of family members.

Based on the least squares method, the line of best fit to the data is
ŷ = 1692911 + 1861.3x
Model Validation using Coefficient of Determination

Total Household Income = 1692911 + 1861.3 (Total Number of Family Members)

Adjusted R² = 0.021044, which is close to zero; hence the model is not a good
predictor of household income. Only 2.1% of the variation in total household
income is explained by the total number of family members.
There are likely other factors at play.
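For reference (this formula is not on the slide but is the standard definition), the adjusted R² corrects the ordinary R² for the number of predictors k and the sample size n, which is why it stays near zero here:

\[
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
\]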

Looking at the correlation matrix, the variables Members with age less than 5 years old
and Members aged 5-17 years old are inversely (negatively) linearly correlated with
total income. These could be factors worth looking into.
• We may also look into the scatter plot of total household income and total number of
family members. The line of best fit runs across the numbers of family members, but we
can observe that another pattern forms within each number of family members.

For example, we can see that the highest income is from a family of 8 members.
We may look further into more variables, such as:
How many members are younger than 17?
How many members are employed?
The assumptions that can be tested are the following (see the sketch after this list):
(1) A family with more employed members has a higher income, keeping the total
number of family members constant.
(2) A family with more members younger than 17 years old has a lower income, keeping
the total number of family members constant.
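In regression terms, these two statements are hypotheses about the signs of partial regression coefficients in a model such as the one below (the predictor names are illustrative, not the actual FIES variable names):

\[
\widehat{\mathrm{Income}} = b_0 + b_1(\mathrm{FamSize}) + b_2(\mathrm{Employed}) + b_3(\mathrm{Under17})
\]

with the expectation that \(b_2 > 0\) (assumption 1) and \(b_3 < 0\) (assumption 2), each interpreted while the other predictors are held constant.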

This is multiple linear regression, in which more than one variable is considered as a
predictor of the dependent variable.
Multiple Linear Regression Model (MLRM)
Multiple Regression Model
• extension of the simple linear regression model
• functional relationship between a single dependent variable (Y) and several independent or explanatory variables (Xi's)
• estimation is based on least squares; inferences about linear combinations of parameters and prediction are also straightforward extensions of the procedures in SLR
Multiple Regression Model
Yi = β0 + β1X1i + β2X2i + β3X3i + ⋯ + βkXki + εi,  i = 1, 2, …, n;
where Yi = ith observed value of Y
X1i = ith observed value of X1
Xki = ith observed value of Xk
β0 = true y-intercept (value of Y when all X's are zero)
β1 = partial regression coefficient due to X1 (change in Y for every unit change in X1 at fixed levels of X2, X3, …, Xk)
βk = partial regression coefficient due to Xk (change in Y for every unit change in Xk at fixed levels of X1, X2, …, Xk−1)
εi = random error, with εi ~ N(0, σ²)
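Although not written out on the slide, the model is often expressed compactly in matrix form, and the least squares estimates are the direct extension of the SLR formulas:

\[
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad
\mathbf{b} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{Y},
\]

where X is the n × (k + 1) design matrix whose first column is all 1's (for the intercept) and whose remaining columns hold the observed values of X1, …, Xk.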
Assumptions
Assumption #1: Your dependent variable should be measured on a continuous scale (i.e., an interval or ratio variable).
Assumption #2: You have two or more independent variables, which can be either continuous (i.e., an interval or ratio variable) or
categorical (i.e., an ordinal or nominal variable).
Assumption #3: You should have independence of observations (i.e., independence of residuals), which you can easily check using
the Durbin-Watson statistic.
Assumption #4: There needs to be a linear relationship between (a) the dependent variable and each of your independent variables,
and (b) the dependent variable and the independent variables collectively.
Assumption #5: Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as
you move along the line.
Assumption #6: Your data must not show multicollinearity, which occurs when you have two or more independent variables that are
highly correlated with each other. This leads to problems with understanding which independent variable contributes to the variance
explained in the dependent variable, as well as technical issues in calculating a multiple regression model.
Assumption #7: There should be no significant outliers, high leverage points or highly influential points.
Assumption #8: Finally, you need to check that the residuals (errors) are approximately normally distributed (we explain these terms
in our enhanced multiple regression guide). Two common methods to check this assumption include using: (a) a histogram (with a
superimposed normal curve) and a Normal P-P Plot; or (b) a Normal Q-Q Plot of the studentized residuals.
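As a rough illustration of how the residual-based checks in Assumptions #3, #6, and #8 could be run outside Statistica, the sketch below uses Python's statsmodels and scipy on a small synthetic data set (the data and variable names are placeholders, not the FIES file):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats

# Synthetic placeholder data, used only to make the sketch runnable.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df["y"] = 2 + 1.5 * df["x1"] - 0.8 * df["x2"] + rng.normal(size=50)

X = sm.add_constant(df[["x1", "x2"]])   # design matrix with intercept
model = sm.OLS(df["y"], X).fit()

# Assumption #3: independence of residuals (values near 2 suggest no autocorrelation)
print("Durbin-Watson:", durbin_watson(model.resid))

# Assumption #6: multicollinearity (VIFs well above ~10 are a warning sign)
for i, name in enumerate(X.columns):
    print(name, "VIF =", variance_inflation_factor(X.values, i))

# Assumption #8: approximate normality of residuals (Shapiro-Wilk test)
print("Shapiro-Wilk p-value:", stats.shapiro(model.resid).pvalue)
```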
Important Assumptions
1. None of the independent variables should be an exact linear combination of the other independent
variables.

Example: X2 = cX1 and X3 = X1 + X2 are not allowed, since a predictor that is an exact linear combination of the others makes the least squares estimates impossible to compute.


2. The number of observations (n) must exceed the number of independent variables by at least 2,
that is, n ≥ k + 2, so that at least one degree of freedom (n − k − 1 ≥ 1) is left for estimating the error variance.
Example 1. Data was taken from eight varieties of rice to determine if grain yield (Y)
may be predicted using plant height (X1) and number of tillers (X2).

Variety   Grain Yield (Y)   Plant Height (X1)   No. of Tillers (X2)
A         5755              110.5               14.5
B         5939              105.4               16.0
C         6010              118.1               14.6
D         6545              104.5               18.2
E         6730               93.6               15.4
F         6750               84.1               17.6
G         6899               97.8               17.9
H         7862               75.6               19.4
Total     52490             769.6              133.6

Yi = β0 + β1X1i + β2X2i + β3X3i + ⋯ + βkXki + εi,  i = 1, 2, …, n
Grain Yieldi = β0 + β1(Plant Heighti) + β2(Number of Tillersi) + εi,  i = 1, 2, …, 8
Adjusted R-squared = 0.78868278
Model significance: p < 0.00885
P-values:
Plant height: 0.0749 (not significant)
No. of tillers: 0.182474 (not significant)
Interpretation: Although the overall model is significant, both independent variables have
non-significant individual p-values. Hence the model is not a good predictor of grain yield.
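The figures above were produced in Statistica; as a sketch of how such a fit could be reproduced elsewhere (assuming Python with pandas and statsmodels is available), the data from the table can be entered directly:

```python
import pandas as pd
import statsmodels.api as sm

# Rice data from Example 1 (eight varieties), as listed in the table
rice = pd.DataFrame({
    "yield":   [5755, 5939, 6010, 6545, 6730, 6750, 6899, 7862],
    "height":  [110.5, 105.4, 118.1, 104.5, 93.6, 84.1, 97.8, 75.6],
    "tillers": [14.5, 16.0, 14.6, 18.2, 15.4, 17.6, 17.9, 19.4],
}, index=list("ABCDEFGH"))

X = sm.add_constant(rice[["height", "tillers"]])
fit = sm.OLS(rice["yield"], X).fit()

print(fit.rsquared_adj)   # adjusted R-squared
print(fit.f_pvalue)       # overall model significance (F-test)
print(fit.pvalues)        # p-values of the individual coefficients
```

The printed values should be close to those reported above, up to rounding and any differences in how the software handles the data.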
Example 2. How will you interpret this multiple regression model?
Example 3.
Predicted Income = b0 + b1(ComExp) + b2(famsize)
Predicted Income = 84164.96 + 27.63(ComExp) + 10839.03(famsize)
R-squared = 0.51699, p < 0.0001 (model is significant); about 51.7% of the variation in income is explained by the two predictors.
The p-values for ComExp and famsize are both significant.
84164.96:
The EXPECTED income for a household with ComExp and famsize both equal to 0 is 84164.96.
(INVALID: such a household does not occur in the data, so this is only the intercept, not a meaningful prediction.)
27.63:
The EXPECTED increase in income for every unit increase in ComExp, holding famsize
constant.
10839.03:
The EXPECTED increase in income for every unit increase in famsize, holding ComExp
constant. (A worked prediction using these coefficients is sketched below.)
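As a hypothetical worked example (the input values 50000 and 5 are chosen only for illustration; they are not from the data), the fitted equation predicts:

\[
\widehat{\mathrm{Income}} = 84164.96 + 27.63(50000) + 10839.03(5)
= 84164.96 + 1381500 + 54195.15 \approx 1519860.11
\]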
Example 4.
Data File. This example is based on the data file Poverty.sta. Open this data file by selecting Open Examples from the File menu (classic
menus) or by selecting Open Examples from the Open menu on the Home tab (ribbon bar); it is in the Datasets folder. The data are based
on a comparison of 1960 and 1970 Census figures for a random selection of 30 counties. The names of the counties were entered as case
names.
Research Problem. To analyze the correlates of poverty, that is, the variables that best predict the percent of families below the poverty
line in a county. Thus, you will treat variable 3 (Pt_Poor) as the dependent or criterion variable, and all other variables as the independent
or predictor variables.

This spreadsheet shows the standardized regression coefficients (b*) and the raw regression coefficients (b). The magnitude of these beta
coefficients enables you to compare the relative contribution of each independent variable to the prediction of the dependent variable. As
is evident in the spreadsheet shown above, the variables POP_CHNG, PT_RURAL, and N_EMPLD are the most important predictors of poverty;
of those, only the first two are statistically significant. The regression coefficient for POP_CHNG is negative: the less the
population increased, the greater the number of families who lived below the poverty level in the respective county. The regression weight
for PT_RURAL is positive: the greater the percent of rural population, the greater the poverty level.
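The comparison of relative contributions relies on the standardized (beta) coefficients, which can also be computed outside Statistica. The sketch below assumes Python with statsmodels and uses a synthetic stand-in for the Poverty.sta variables (the real Census data are not reproduced here); the beta weights are obtained by z-scoring every variable before fitting:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in for a few Poverty.sta variables (NOT the real Census data),
# used only to make the sketch runnable. Column names follow the text.
rng = np.random.default_rng(1)
poverty = pd.DataFrame({
    "POP_CHNG": rng.normal(size=30),
    "PT_RURAL": rng.normal(size=30),
    "N_EMPLD":  rng.normal(size=30),
})
poverty["PT_POOR"] = -0.5 * poverty["POP_CHNG"] + 0.4 * poverty["PT_RURAL"] + rng.normal(size=30)

# Standardized (beta) coefficients: z-score every variable, then fit OLS.
z = (poverty - poverty.mean()) / poverty.std()
Xz = sm.add_constant(z[["POP_CHNG", "PT_RURAL", "N_EMPLD"]])
beta_fit = sm.OLS(z["PT_POOR"], Xz).fit()

print(beta_fit.params)   # beta weights, comparable in magnitude across predictors
```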
