
Regression Analysis

GLENN REY A. ESTRADA


Faculty, Mathematics Unit
Leyte Normal University
glennrey.estrada@lnu.edu.ph
Historical Note
 the earliest form of regression analysis was published by
Legendre and Gauss in the early 19th century
 the term “regression” was first coined by Francis Galton
in the 19th century to describe a biological phenomenon
 extended by Udny Yule, Karl Pearson and Ronald Fisher
to a more general statistical context
Simple Linear Regression
Standard Multiple Regression
Simple Linear Regression

A simple linear regression assesses the linear relationship between two continuous variables to predict the value of a dependent variable based on the value of an independent variable.
A linear regression analysis is most often used to:
 determine the change in the dependent variable for a
one unit change in the independent variable;
 determine how much of the variation in the dependent
variable is explained by the independent variable; and
 predict new values for the dependent variable given the
independent variable. 
Basic Requirements
 have one dependent variable that is measured at the continuous level
 have one independent variable that is measured at the continuous level
 there should be a linear relationship between the dependent and
independent variables
 there should be independence of observations
 there should be no significant outliers
 the variances along the line of best fit remain similar as you move along the
line, known as homoscedasticity
 the residuals (errors) of the regression line are approximately normally
distributed
Fitting a Linear Regression Model
A simple linear regression allows for a linear relationship to be
modelled between an independent variable and dependent
variable where the independent variable is predicting the
dependent variable.

Y = β0 + β1X + ε
but it can be estimated as follows:

Ypred = b0 + b1X + e
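As a sketch of how b0 and b1 are estimated outside SPSS Statistics, the model can be fitted by ordinary least squares in Python; the file name cholesterol.csv and its column names are hypothetical stand-ins for this example's data:

# Minimal sketch: fitting Ypred = b0 + b1*X by ordinary least squares.
# Assumes a hypothetical CSV with columns 'time_tv' and 'cholesterol'.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("cholesterol.csv")          # hypothetical data file
X = sm.add_constant(data["time_tv"])           # adds the intercept column for b0
model = sm.OLS(data["cholesterol"], X).fit()   # estimates b0 and b1

print(model.params)        # b0 (const) and b1 (time_tv)
print(model.resid.head())  # residuals e = Y - Ypred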
Understanding Residuals in Regression
Null and Alternative Hypotheses

H0: β1 = 0, the coefficient of the slope equals 0
HA: β1 ≠ 0, the coefficient of the slope does not equal 0
What Will Be Calculated

 A linear regression equation.
 The statistical significance of β1 (null hypothesis significance testing).
 A measure of effect size.
 Confidence and prediction intervals.
Example
Studies show that exercising can help prevent heart disease. Within
reasonable limits, the more you exercise, the less risk you have of suffering
from heart disease. One way in which exercise reduces your risk is by
reducing a fat in your blood called cholesterol. The more you exercise, the
lower your cholesterol concentration. It has recently been shown that the
amount of time you spend watching TV, an indicator of a sedentary lifestyle,
might be a good predictor of heart disease; that is, the more TV you watch,
the greater your risk of heart disease.
Therefore, a researcher decided to determine if cholesterol concentration
was related to time spent watching TV in otherwise healthy 45 to 65 year old
men (an at-risk category of people). They believed that there would be a
positive relationship: the more time people spent watching TV, the greater
their cholesterol concentration. The researcher also wished to be able to
predict cholesterol concentration and to know the proportion of cholesterol
concentration that time spent watching TV could explain.
For a linear regression, we have three variables:
 The independent variable, time_tv, which is the average daily
time spent watching TV in minutes;
 The dependent variable, cholesterol, which is the cholesterol
concentration in mmol/L; and
 The chronological case number, caseno, which is used for easy
elimination of any cases flagged when checking assumptions.
The Variable View in SPSS Statistics
The Data View in SPSS Statistics
Assumptions
 have one dependent variable that is measured at the continuous level
 have one independent variable that is measured at the continuous level
 there should be a linear relationship between the dependent and
independent variables
 there should be independence of observations
 there should be no significant outliers
 the variances along the line of best fit remain similar as you move along the
line, known as homoscedasticity
 the residuals (errors) of the regression line are approximately normally
distributed
Establishing if a Linear Relationship Exists
Creating a scatterplot in SPSS Statistics
Interpretation of Linearity
 A scatterplot of cholesterol concentration against average daily
time spent watching TV was plotted. Visual inspection of this
scatterplot indicated a linear relationship between the variables.
 Linearity was established by visual inspection of the scatterplot.
Linear Regression Procedure
Independence of Observations

There was independence of residuals, as assessed by a Durbin-Watson statistic of 1.957.
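The same statistic can be reproduced outside SPSS Statistics; a minimal sketch with statsmodels, reusing the fitted model object from the earlier sketch:

# Durbin-Watson statistic on the regression residuals; values near 2
# suggest no first-order autocorrelation of the residuals.
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(model.resid)   # 'model' is the fitted OLS object
print(f"Durbin-Watson = {dw:.3f}")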
Dealing with Outliers
Casewise Diagnostics

The Casewise Diagnostics table highlights any case whose standardized residual is greater than ±3 standard deviations, which we have instructed SPSS Statistics to treat as an outlier in the linear regression procedure. A value greater than ±3 is a common cut-off criterion for deciding whether a particular residual might represent an outlier.
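A comparable check outside SPSS Statistics might look like the following sketch, which flags cases beyond the ±3 cut-off (continuing with the model object fitted earlier):

# Flag cases whose standardized residual exceeds ±3, mirroring the
# Casewise Diagnostics table.
import numpy as np

influence = model.get_influence()
std_resid = influence.resid_studentized_internal   # standardized residuals
flagged = np.where(np.abs(std_resid) > 3)[0]
print("Cases flagged as potential outliers:", flagged)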
There are generally three reasons for finding outliers in
your data:
Data entry errors
Measurement errors
Genuinely unusual values
What to do next
1. If you do not want to remove an outlier, or feel you cannot, you have three choices for how to proceed:
 Transform the dependent variable;
 Run the linear regression with and without the outlier, and if there is no appreciable difference in the results, keep the outlier; or
 Run a regression with robust standard errors.
2. Remove the outlier(s).
Testing for Homoscedasticity

There was homoscedasticity, as assessed by visual inspection of a plot of standardized residuals versus standardized predicted values.
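One way to reproduce this diagnostic plot outside SPSS Statistics, as a sketch reusing the fitted model and the standardized residuals computed above:

# Homoscedasticity check: standardized residuals against standardized
# predicted values; a roughly constant spread supports the assumption.
import matplotlib.pyplot as plt
from scipy.stats import zscore

z_pred = zscore(model.fittedvalues)
plt.scatter(z_pred, std_resid)     # std_resid from the outlier check
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predicted value")
plt.ylabel("Standardized residual")
plt.show()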
Examples of Heteroscedasticity
How to Apply a Transformation
Linearity and Heteroscedasticity
Moderately Positively Skewed Data
Moderately Negatively Skewed Data
Strongly Positively Skewed Data
Strongly Negatively Skewed Data
Extremely Positively Skewed Data
Extremely Negatively Skewed Data
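The slide titles above correspond to the transformations commonly recommended for each degree of skew; the sketch below shows those conventions (the reflection constant k and the choice of base-10 logs are the usual textbook defaults, not requirements):

# Transformations commonly recommended for skewed data
# (a convention in the spirit of Tukey's ladder of powers).
import numpy as np

y = data["cholesterol"].to_numpy()
k = y.max() + 1                  # reflection constant for negative skew

sqrt_pos  = np.sqrt(y)           # moderate positive skew
log_pos   = np.log10(y)          # strong positive skew (requires y > 0)
recip_pos = 1 / y                # extreme positive skew
sqrt_neg  = np.sqrt(k - y)       # moderate negative skew (reflect first)
log_neg   = np.log10(k - y)      # strong negative skew
recip_neg = 1 / (k - y)          # extreme negative skew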
Checking for Normality of Residuals
Normal P-P Plot

Residuals were normally distributed, as assessed by visual inspection of a normal probability plot.
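A sketch of an equivalent normal P-P plot outside SPSS Statistics, again reusing the fitted model:

# Normal P-P plot of the residuals; points close to the 45-degree line
# support approximately normally distributed residuals.
import statsmodels.api as sm
import matplotlib.pyplot as plt

pp = sm.ProbPlot(model.resid)
pp.ppplot(line="45")
plt.title("Normal P-P plot of the regression residuals")
plt.show()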
Different Types of Non-normality
Interpreting Results

 First, determine whether the linear regression model is a good fit for the data.
 Second, understand the coefficients of the regression model.
 Third, use SPSS Statistics to make predictions of the dependent variable based on values of the independent variable.
Percentage (or proportion) of Variance Explained

Average daily time spent watching TV accounted for 12.9% of the variation in
cholesterol concentration with adjusted R2 = 12.0%, a medium size effect according
to Cohen (1988).
Statistical Significance of the Model
Average daily time spent watching TV statistically significantly
predicted cholesterol concentration, F(1, 97) = 14.40, p < .001.
Interpreting the Coefficients

Substituting the values of the coefficients into the regression equation, you
have:
cholesterol concentration = -0.944 + (0.037)(time_tv)
Predicting Cholesterol Concentration
Predictions were made to determine mean cholesterol concentration
for those people who watched a daily average of 160, 170 and 180
minutes of TV. For 160 minutes, mean cholesterol concentration was
predicted as 4.98 mmol/L, 95% CI [4.73, 5.23]; for 170 minutes it was
predicted as 5.35 mmol/L, 95% CI [5.24, 5.45]; and for 180 minutes it
was predicted as 5.72 mmol/L, 95% CI [5.53, 5.90].
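These interval estimates can be reproduced as a sketch with the fitted model from earlier; get_prediction returns the predicted means with their 95% confidence limits:

# Predicted mean cholesterol with 95% confidence intervals for
# 160, 170 and 180 minutes of average daily TV time.
import pandas as pd
import statsmodels.api as sm

new_X = sm.add_constant(pd.DataFrame({"time_tv": [160, 170, 180]}),
                        has_constant="add")
pred = model.get_prediction(new_X).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])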
Reporting
A linear regression was run to understand the effect of average daily time spent watching TV on cholesterol concentration. To assess linearity, a scatterplot of cholesterol concentration against average daily time spent watching TV, with a superimposed regression line, was plotted. Visual inspection of this plot indicated a linear relationship between the variables. There was homoscedasticity and normality of the residuals. One participant was an outlier, with a cholesterol concentration of 7.98 mmol/L, and was removed from the analysis as they did not represent the target population.

The prediction equation was: cholesterol concentration = -0.94 + (0.03697)(time). Average daily time spent watching TV statistically significantly predicted cholesterol concentration, F(1, 97) = 14.39, p < .001, accounting for 12.9% of the variation in cholesterol concentration with adjusted R2 = 12.0%, a medium size effect according to Cohen (1988). An extra minute of average daily time spent watching TV leads to a 0.037 mmol/L, 95% CI [0.018, 0.056], increase in cholesterol concentration. Predictions were made to determine mean cholesterol concentration for those people who watched a daily average of 160, 170 and 180 minutes of TV. For 160 minutes, mean cholesterol concentration was predicted as 4.98 mmol/L, 95% CI [4.73, 5.23]; for 170 minutes it was predicted as 5.35 mmol/L, 95% CI [5.24, 5.45]; and for 180 minutes it was predicted as 5.72 mmol/L, 95% CI [5.53, 5.90].
MULTIPLE REGRESSION

 A multiple regression is used to predict a continuous dependent variable based on multiple independent variables. Multiple regression also allows you to determine the overall fit of the model and the relative contribution of each of the predictors to the total variance explained.
A multiple regression analysis is most often used to:
 predict new values for the dependent variable given the
independent variables; and
 determine how much of the variation in the dependent
variable is explained by the independent variables.
Basic Requirements
 have one dependent variable that is measured at the continuous level
 have two or more independent variables that are measured either at
the continuous or nominal level
 there should be independence of errors (residuals)
 there should be a linear relationship between the predictor variables and the
dependent variable
 there should be homoscedasticity of residuals (equal error variances)
 there should be no multicollinearity
 there should be no significant outliers, high leverage points or highly influential
points
 the errors (residuals) should be approximately normally distributed
Fitting a Multiple Regression Model

Multiple regression allows for a relationship to be modelled between multiple independent variables and a single dependent variable, where the independent variables are being used to predict the dependent variable. Considering, for example, four independent variables "X1" through "X4" and the dependent variable "Y", a multiple regression models the following:

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε

but it can be estimated as follows:

Ypred = b0 + b1X1 + b2X2 + b3X3 + b4X4 + e
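As with the simple model, this can be sketched outside SPSS Statistics; the file name vo2max.csv and the 0/1 coding of gender are assumptions for illustration:

# Minimal sketch of the multiple regression fit for the VO2max example.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("vo2max.csv")    # hypothetical data file
X = sm.add_constant(df[["age", "weight", "heart_rate", "gender"]])
fit = sm.OLS(df["VO2max"], X).fit()
print(fit.summary())              # b0..b4, R2, adjusted R2 and the F-test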


Example
A health researcher wants to be able to predict maximal aerobic capacity
(VO2max), an indicator of fitness and health. Normally, to perform this procedure
requires expensive laboratory equipment and necessitates that an individual
exercise to their maximum (i.e., until they can no longer continue exercising due
to physical exhaustion). This can put off those individuals that are not very
active/fit and those individuals that might be at higher risk of ill health (e.g., older
unfit subjects). For these reasons, it has been desirable to find a way of predicting
an individual's VO2max based on more easily and cheaply measured attributes.
To this end, the researcher recruits 100 participants to perform a maximum
VO2max test, but also records their age, weight, heart rate and gender. Heart rate
is the average over the last 5 minutes of a much easier, lower-workload, 20-minute
cycling test. The researcher's goal is to be able to predict VO2max based on age,
weight, heart rate and gender.
Setting up the Data
In this example, we have the following six variables:
 The dependent variable, VO2max, which is the maximal aerobic capacity;
 The independent variable, age, which is the participant's age in years;
 The independent variable, weight, which is the participant's weight;
 The independent variable, heart_rate, which is the participant's heart rate;
 The independent variable, gender, which has two categories: "Male" and "Female"; and
 The case identifier, caseno, which is used for easy elimination of any cases (e.g., participants) flagged when checking assumptions.
The Variable View in SPSS Statistics
The Data View in SPSS Statistics
Multiple Regression Procedure
This procedure will have created five new variables in your Variable View and Data View windows, as highlighted in the Variable View window.
Assumptions
 have one dependent variable that is measured at the continuous level
 have two or more independent variables that are measured either at
the continuous or nominal level
 there should be independence of errors (residuals)
 there should be a linear relationship between the predictor variables and
the dependent variable
 there should be homoscedasticity of residuals (equal error variances)
 there should be no multicollinearity
 there should be no significant outliers, high leverage points or highly
influential points
 the errors (residuals) should be approximately normally distributed
Independence of Observations

There was independence of residuals, as assessed by a Durbin-Watson statistic of 1.910.
Testing for Linearity
 Establishing if a linear relationship exists between the
dependent and independent variables "collectively"
using a scatterplot
 Establishing if a linear relationship exists between the
dependent variable and each independent variable
using "partial regression plots"
Establishing if a linear relationship exists between
the dependent and independent variables
"collectively" using a scatterplot
Testing for Homoscedasticity

There was homoscedasticity, as assessed by visual inspection of a plot of studentized residuals versus unstandardized predicted values.
Checking for Multicollinearity
Tolerance and VIF
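A sketch of how tolerance and VIF can be computed outside SPSS Statistics, using the design matrix X from the multiple regression sketch; a tolerance below 0.1 (equivalently, a VIF above 10) is a common cut-off:

# Tolerance and VIF for each predictor; tolerance = 1 / VIF.
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = ["age", "weight", "heart_rate", "gender"]
for i, name in enumerate(predictors, start=1):   # column 0 is the constant
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.3f}")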
Checking for Unusual Points
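The three kinds of unusual points can be screened with the cut-offs used later in the report (±3 for studentized deleted residuals, 0.2 for leverage, 1 for Cook's distance); a sketch continuing from the multiple regression fit:

# Screen for outliers, high leverage points and highly influential points.
import numpy as np

infl = fit.get_influence()
sdr  = infl.resid_studentized_external   # studentized deleted residuals
lev  = infl.hat_matrix_diag              # leverage values
cook = infl.cooks_distance[0]            # Cook's distances

print("Residual outliers :", np.where(np.abs(sdr) > 3)[0])
print("High leverage     :", np.where(lev > 0.2)[0])
print("Highly influential:", np.where(cook > 1)[0])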
Checking for Normality
 a histogram with superimposed normal curve and a P-P Plot,
which were both produced by the options selected in
the Linear Regression: Plots dialogue box (these
use standardized residuals); or
 a Normal Q-Q Plot of the studentized residuals (SRE_1).
Normal Q-Q Plot of the Studentized Residuals
Understanding the Model
Determining How Well the Model Fits

 the multiple correlation coefficient;
 the percentage (or proportion) of variance explained;
 the statistical significance of the overall model; and
 the precision of the predictions from the regression model.
Multiple Correlation Coefficient (R)
Total variation explained (R2 and adjusted R2)

R2 for the overall model was 57.7%, with an adjusted R2 of 55.9%, a large size effect according to Cohen (1988).
Statistical Significance of the Model

Age, weight, heart rate and gender statistically significantly predicted VO2max, F(4, 95) = 32.393, p < .001.
Interpreting the Coefficients

The regression equation for the current example can be expressed in the following form:
predicted VO2max = b0 + (b1 x age) + (b2 x weight) + (b3 x heart_rate) + (b4 x gender)
predicted VO2max = 87.83 – (0.165)(age) – (0.385)(weight) – (0.118)(heart_rate) + (13.208)(gender)
Predicting the Dependent Variable

predicted VO2max = 87.83 – (0.165)(age) – (0.385)(weight) – (0.118)(heart_rate) + (13.208)(gender)


SPSS Statistics procedure to make predictions
and calculate 95% confidence intervals
Predictions were made to determine mean VO2max for male individuals who were
30 years old, weighed 80 kg and with a test heart rate of 133 bpm. Mean VO2max
was predicted as 49.63 ml/min/kg, 95% CI [47.96, 51.29].
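As a sketch, the same prediction and interval can be obtained from the fitted model above (gender = 1 is assumed to code "Male" here):

# Predicted mean VO2max with a 95% CI for a 30-year-old, 80 kg male
# with a test heart rate of 133 bpm.
import pandas as pd

new_case = pd.DataFrame({"const": [1.0], "age": [30.0], "weight": [80.0],
                         "heart_rate": [133.0], "gender": [1.0]})
print(fit.get_prediction(new_case).summary_frame(alpha=0.05))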
Reporting the Main Findings

A multiple regression was run to predict VO2max from gender, age, weight and heart
rate. There was linearity as assessed by partial regression plots and a plot of studentized
residuals against the predicted values. There was independence of residuals, as
assessed by a Durbin-Watson statistic of 1.910. There was homoscedasticity, as assessed
by visual inspection of a plot of studentized residuals versus unstandardized predicted
values. There was no evidence of multicollinearity, as assessed by tolerance values
greater than 0.1. There were no studentized deleted residuals greater than ±3 standard
deviations, no leverage values greater than 0.2, and values for Cook's distance above 1.
The assumption of normality was met, as assessed by a Q-Q Plot. The multiple regression
model statistically significantly predicted VO2max, F(4, 95) = 32.393, p < .001,
adj. R2 = .56. All four variables added statistically significantly to the prediction, p < .05.
Summarizing the Multiple Regression Analysis
