
Regression Analysis

Gerson JC Antonio, PhD


Faculty, DWCL Graduate School
Outline
• Correlation
• Types of regression
• R²
• Model fitting
• Regression model
Correlation
• In a correlational study, the goal is to check whether two variables are related.
• Example: Relationship of number of hours of study and test score (e.g., 100 points)
  • Independent variable (X): number of hours of study
  • Dependent variable (Y): test score

Interpretation in correlation
• In correlation, the information we might get is that test score is positively "related" to the number of hours (r = .75, p = .004).
• Our interpretation is that as study hours increase, the test score tends to increase.
• Correlation does not provide a "prediction value": it does not tell us how many hours of study yield how much of a score.

What is lacking in correlation?
• If we knew how long we should study and the corresponding predicted score, then we could plan strategically.
• This is addressed by the regression model.


REGRESSION ANALYSIS
• can provide us the relationship of the dependent and independent variables, estimate their relationship, and determine whether the relationship is significant
• can also validate a claim or theory
• can be used if a researcher is interested in determining the effect, impact, or relationship of the independent variables to the dependent variable (whether the relationship is positive or negative)
Defining regression
Regression provides a prediction of the value (amount) of one variable when one knows the value of the other.
Types:
• Simple linear regression (SLR)
• Multiple linear regression (MLR)
SIMPLE LINEAR REGRESSION
➢ predicts the value of y given the value of x
➢ used when there is a relationship between x (independent variable) and y (dependent variable)
➢ data should be normally distributed, with variables measured at the interval or ratio level
➢ y = a + bx, where a = intercept and b = slope of the line
What is SLR?
• Simple linear regression is a linear regression model with a single regressor variable (X).
• Simple linear model: yᵢ = β₀ + β₁xᵢ + eᵢ, i = 1, 2, …, n
  where:
  yᵢ = the value of the response variable in the ith trial
  β₀ and β₁ = the parameters of the model
  xᵢ = the value of the regressor in the ith trial
  eᵢ = the random error in the ith trial
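For readers who want to try this model outside SPSS, here is a minimal Python sketch using scipy.stats.linregress; the hours/score values are hypothetical, purely for illustration.

```python
# A minimal sketch of fitting y_i = b0 + b1*x_i + e_i by ordinary
# least squares. The hours/score data below are hypothetical.
from scipy import stats

hours = [1, 2, 3, 4, 5, 6, 7, 8]          # regressor X
score = [62, 64, 65, 68, 70, 71, 73, 75]  # response Y

fit = stats.linregress(hours, score)
print(f"intercept (b0) = {fit.intercept:.2f}")
print(f"slope (b1)     = {fit.slope:.2f}")
print(f"R-squared      = {fit.rvalue ** 2:.3f}")
```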
The regression model for number of hours and test score is:

yᵢ = β₀ + β₁xᵢ + eᵢ

Test score = 60 + 1.89 (hours of study)

What does the model tell us?
• Score of 60 (the intercept): if you rely only on classroom instruction and do not add an hour of study at home, your score is 60/100 (if the test given has 100 points).
• β₁ = 1.89: for each additional hour of study, 1.89 points are added to your score.
• eᵢ: the error term indicates that there are other variables that were not included in the model.
Test score = 60 + 1.89 (1 hour of study)
Predicted score = 60 + 1.89 = 61.89
How many hours should you spend studying to get a score of 88.00?
Answer: about 15 hours
(88 − 60) / 1.89 ≈ 14.8, or roughly 15 hours
Check: 60 + 15(1.89) = 60 + 28.35 = 88.35 ≈ 88
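The same arithmetic can be checked with a few lines of plain Python, using the coefficients reported on the slide (a sketch, not SPSS output):

```python
# Worked check of the study-hours example, using the slide's
# reported coefficients (intercept 60, slope 1.89).
b0, b1 = 60.0, 1.89

predicted_1h = b0 + b1 * 1        # predicted score after 1 hour
hours_for_88 = (88.0 - b0) / b1   # hours needed for a score of 88

print(predicted_1h)               # 61.89
print(round(hours_for_88, 1))     # 14.8, i.e. roughly 15 hours
```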
Note:
• The regression model can be obtained when values of the two variables are known.
• The predicted score applies only to a person bearing the values of the two variables (X and Y); it cannot be applied to other cases.
• The regression model may be used if the regression analysis indicates that it is significant.
Model fitting
• The previous model informs us of the predicted score.
• However, prior to adopting the model, we should determine whether the model fits the data, i.e., whether it is close to reality.
How do we check if the model fits the data?
• The R square (R²) tells us how much variance in the dependent variable is explained by the regressor variable in the calculation.
• The R² between the number of hours of study and test score is R² = .632.
• How do we interpret this?
  • The regressor variable (hours) accounts for 63.20% of the variance in the dependent variable (test score); or
  • The increase in the test score can be explained by hours of study by 63.20%.
How do we check if the model fits the data?
• The R² also tells us the type of model fit.
• The ranges for model fitting are the following:
  < 0.1         poor fit
  0.11 to 0.30  modest fit
  0.31 to 0.50  moderate fit
  > 0.50        strong fit
How do we interpret?
R² = .632
Interpretation: The model has a strong fit.
SPSS Prompts
1. Click Analyze > Regression > Linear
2. Transfer the independent variable height into the Independent(s)
box, and the dependent variable weight into the Dependent box.
3. Click Statistics
4. Tick the following checkboxes: Model fit, R squared change,
Descriptives, Part and partial correlations
5. Under the Regression Coefficients area, tick Estimates and set the
Confidence Intervals level to 95%
6. Click Continue, then OK.
EXAMPLE

The heights of ten students were determined to be 65, 64, 64, 62,
62, 60, 60, 59, 58, and 57 inches, and their weights were
determined to be 110, 108, 107, 104, 98, 96, 94, 92, 90, and 89
pounds. Determine whether height correlates with weight, and find
the regression equation. Use the .05 level of significance.
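As a cross-check outside SPSS, the sketch below runs the same regression on the data above with scipy.stats.linregress; it should reproduce the coefficients reported on the next slides.

```python
# Sketch reproducing the height/weight example with the data above.
from scipy import stats

height = [65, 64, 64, 62, 62, 60, 60, 59, 58, 57]      # inches
weight = [110, 108, 107, 104, 98, 96, 94, 92, 90, 89]  # pounds

fit = stats.linregress(height, weight)
print(f"Weight = {fit.intercept:.3f} + {fit.slope:.3f} * height")
print(f"R-squared = {fit.rvalue ** 2:.3f}, p = {fit.pvalue:.6f}")
# Expected, per the slides: Weight = -73.084 + 2.813(height), R^2 = .953
```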

This summary table shows whether the regression line (line of best fit)
is significantly different from a horizontal line. If p < 0.05, it indicates
that the regression model significantly predicts the outcome variable.
Since R² = .953, the model is a good fit for the data.

The coefficients table provides the necessary information to
predict the dependent variable from the independent variable.

Regression Equation: Weight = -73.084 + 2.813 (height)


Reporting of Results
A simple linear regression was calculated to predict (dependent
variable) based on (independent variable). A significant regression
equation was found (F(__,__) = ___.___, p < .___). With an R² of .____, it
can be said that the increase in (dependent variable) can be explained by
(independent variable) by _____. Participants' predicted ______ is
equal to _____ + _____(independent variable) (unit of measure) when
(independent variable) is measured in (unit of measure). (Dependent
variable) increased _______ (unit of measure of DV) for each (unit of
measure) of (independent variable).

Reporting of Results
A simple linear regression was calculated to predict weight based
on height. A significant regression equation was found (F(1, 8) =
161.880, p < .05). With an R² of .953, it can be said that the increase in
weight can be explained by height by 95.30%. Participants' predicted
weight is equal to -73.084 + 2.813 (height) pounds when height is
measured in inches. Participants' average weight increased 2.813 pounds
for each inch of height.

EXAMPLE # 2
A study is conducted on the relationship of the number of absences
and the grades of 15 students in English. Determine the relationship
of the two variables.

Number of Absences    Grades in English
1                     90
2                     85
2                     80
3                     75
3                     80
8                     65
6                     70
1                     95
4                     80
5                     80
5                     75
1                     92
2                     89
1                     80
9                     65
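For readers who prefer to verify the analysis in Python rather than SPSS, a sketch with the data above:

```python
# Regress grades in English on number of absences (15 students above).
from scipy import stats

absences = [1, 2, 2, 3, 3, 8, 6, 1, 4, 5, 5, 1, 2, 1, 9]
grades = [90, 85, 80, 75, 80, 65, 70, 95, 80, 80, 75, 92, 89, 80, 65]

fit = stats.linregress(absences, grades)
print(f"Grade = {fit.intercept:.2f} + {fit.slope:.2f} * absences")
print(f"r = {fit.rvalue:.3f}, R^2 = {fit.rvalue ** 2:.3f}, p = {fit.pvalue:.4f}")
```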
Multiple Linear Regression (MLR)

• Multiple linear regression enables us to predict and weigh the relationship between two or more regressor variables (X1, X2, X3, …) and the dependent variable (Y).
Example of MLR

• The relationship between number of study hours and achievement
• Dependent variable (Y): achievement (pertains to average grade)
• Regressor (X): number of study hours
• What other variables can be added to predict achievement?
Other predictors/regressors of achievement
(Moreira, Diaz, Vaz, & Vaz, 2013):

• Persistence (X2)
• Motivation (X3)
• Study skill (X4)

If the data fit the model, the MLR model will be:

Achievement = β₀ + β₁(study hours) + β₂(persistence) + β₃(motivation) + β₄(study skills) + eᵢ
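A hedged sketch of fitting such an MLR model with statsmodels follows; all data are simulated, and the variable names and coefficients are assumptions made only for illustration.

```python
# Sketch of fitting the MLR model above with statsmodels OLS.
# All data are simulated; names and coefficients are assumed.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
X = pd.DataFrame({
    "study_hours": rng.uniform(0, 10, n),
    "persistence": rng.uniform(1, 5, n),
    "motivation": rng.uniform(1, 5, n),
    "study_skill": rng.uniform(1, 5, n),
})
# Simulated achievement with random error (demo only)
y = (60 + 1.5 * X["study_hours"] + 2.0 * X["persistence"]
     + 1.0 * X["motivation"] + 1.2 * X["study_skill"]
     + rng.normal(0, 3, n))

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())  # coefficients, F-test, R^2, adjusted R^2
```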
Assumptions of normality/linearity

• P-plot
• Shapiro-Wilk test
• Kolmogorov-Smirnov test
Assumptions of normality/linearity

• In regression, the differences between what the model predicts and the observed data are called residuals.
• If the residuals are normally distributed, the relationship can be treated as linear.
How to validate normality/linearity?

• P-plots can demonstrate linearity.
• If the plotted points fall along a diagonal line, then the distribution of residuals is roughly normal.
• This means that the plot shows the error terms are normally distributed.
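Outside SPSS, a comparable check can be sketched with scipy's probplot (the data here are hypothetical):

```python
# Probability plot of residuals, analogous to SPSS's P-plot.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

x = np.arange(1, 11, dtype=float)
y = np.array([62, 63, 66, 67, 70, 71, 74, 74, 78, 79], dtype=float)

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()  # points near the diagonal suggest roughly normal errors
```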
How to validate normality/linearity?

• To further validate the assumption of normality, the Shapiro-Wilk and Kolmogorov-Smirnov tests are used.
• If the sig. (p-value) is greater than .05 for the Shapiro-Wilk and Kolmogorov-Smirnov tests, the distribution of errors is said to be normal.
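Both tests are also available in scipy; a minimal sketch on hypothetical residual values:

```python
# Shapiro-Wilk and Kolmogorov-Smirnov normality tests on residuals.
import numpy as np
from scipy import stats

residuals = np.array([-1.2, 0.4, 0.9, -0.3, 1.1, -0.8, 0.2, -0.5, 0.7, -0.4])

sw_stat, sw_p = stats.shapiro(residuals)
# The K-S test against a standard normal needs standardized residuals.
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
ks_stat, ks_p = stats.kstest(z, "norm")

print(f"Shapiro-Wilk: W = {sw_stat:.3f}, p = {sw_p:.3f}")
print(f"K-S:          D = {ks_stat:.3f}, p = {ks_p:.3f}")
# p > .05 on both tests suggests normally distributed errors.
```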
Assumption of homoscedasticity

• Homoscedasticity refers to whether residuals are equally distributed, or whether they tend to bunch together at some values.
• It is checked using residual plots.
Influential Outliers

• These are extreme observations or cases.
• In a residual plot, they are the points that lie far beyond the scatter of the remaining residuals.
• Their presence could cause a misleading fit.
Durbin-Watson

• Provides information about whether the assumption of independent errors is acceptable.
• Values of the Durbin-Watson statistic should be close to 2.0.
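statsmodels exposes this statistic directly; a minimal sketch with hypothetical residuals:

```python
# Durbin-Watson statistic; values near 2.0 suggest independent errors.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

residuals = np.array([-1.2, 0.4, 0.9, -0.3, 1.1, -0.8, 0.2, -0.5, 0.7, -0.4])
print(f"Durbin-Watson = {durbin_watson(residuals):.2f}")
```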
Assumption of no multicollinearity

• Predictor variables or regressors (X1, X2, X3, …) should not be correlated with each other.
• Regressors should be independent from each other.
Assumption of no multicollinearity

• Check the variance inflation factor (VIF) values.
• VIF values should be below 10.00, and better if below 5.00.
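A sketch of computing VIF per regressor with statsmodels (simulated data; the column names are placeholders):

```python
# Variance inflation factor for each regressor.
# VIF below 10.00 (better, below 5.00) is desired.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
X = pd.DataFrame({
    "study_hours": rng.uniform(0, 10, 40),
    "persistence": rng.uniform(1, 5, 40),
    "motivation": rng.uniform(1, 5, 40),
})
X = sm.add_constant(X)  # include the intercept when computing VIF

for i, name in enumerate(X.columns):
    if name != "const":
        print(f"VIF({name}) = {variance_inflation_factor(X.values, i):.2f}")
```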
Guide to interpretation

• Statistics
• Regression Plots
Statistics
• The procedures available through this button include checking the assumptions of no multicollinearity (Collinearity diagnostics) and independence of errors (Durbin-Watson).
Dialog box for STATISTICS
Statistics
• Estimates: gives us the estimated coefficients of the regression model (i.e., the estimated β-values).
• Confidence intervals: produces confidence intervals for each of the unstandardized regression coefficients.
Statistics
• Model fit: provides a statistical test of the model's ability to predict the outcome variable (the F-test), and the values of R and the adjusted R².
• R squared change: displays the change in R² resulting from the inclusion of a new predictor (for MLR only).
Statistics
• Descriptives: displays a correlation matrix to assess whether predictors are highly correlated.
• Collinearity diagnostics: this option obtains collinearity statistics such as the VIF (≤ 10.0) and tolerance (.20 – 1.00) (for MLR only).
Statistics
• Durbin-Watson: Displays the Durbin-Watson test statistic, which tests
for correlations between errors. Specifically, it tests whether adjacent
residuals are independent (desired value is 2.0 or close to 2.0)
Residual plots
• The Plots button provides the means to create graphs for regression to check the validity of assumptions.
Dialog box for Plots
Plots
• ZRESID (standardized residuals, or errors): these values are the standardized differences between the observed data and the values that the model predicts.
• ZPRED (standardized predicted values of the dependent variable based on the model): these values are standardized forms of the values predicted by the model.
Plots
• ZRESID & ZPRED are useful in:
• Determining assumptions on errors
• Identifying if errors are homoscedastic
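The equivalent diagnostic plot can be sketched outside SPSS as follows (hypothetical data):

```python
# Standardized residuals (ZRESID) vs. standardized predicted values
# (ZPRED). A patternless band around zero supports linearity and
# homoscedasticity.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

x = np.arange(1, 11, dtype=float)
y = np.array([62, 63, 66, 67, 70, 71, 74, 74, 78, 79], dtype=float)

fit = stats.linregress(x, y)
pred = fit.intercept + fit.slope * x
resid = y - pred

zpred = (pred - pred.mean()) / pred.std(ddof=1)
zresid = (resid - resid.mean()) / resid.std(ddof=1)

plt.scatter(zpred, zresid)
plt.axhline(0, linewidth=1)
plt.xlabel("Standardized predicted (ZPRED)")
plt.ylabel("Standardized residual (ZRESID)")
plt.show()
```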
Dialog box for Save
• This box saves regression
diagnostics. Each statistic has a
corresponding column
in SPSS output.
Task 7.4

Proceed to Task 7.4 tomorrow.
Due: July 19, 11:59 PM
