
Statistical Analysis using IBM SPSS (Part 2):
Introduction to Survey Analysis


Correlations
Linear Regression
Simple Linear Regression
Multiple Linear Regression
Logistic Regression
Binomial
There is no one standard set of methods that can be applied to survey data.

Nevertheless, there is a preferred logic to analyzing survey data.
Type of
Relationship
• Non-Causal Relationship
 Also known as Association or Correlation. Two variables are said to be related if there is an association or correlation between the two.
 Two variables can be related to each other without the relationship being entirely the result of one variable affecting the other.
• Causal Relationship
 One variable has a direct influence on the other.
 If two variables are causally related, it is possible to conclude that changes in the explanatory variable X will have a direct impact on Y; adjusting the value of that variable will cause the other to change.
Correlational
Analysis
A statistical technique used to determine the existence of association between two or more variables.
• It determines the strength or degree of association between variables
• Concerned largely with the study of interdependency or co-variation between variables
• Does not express causality, i.e., that one variable is a function of the other
Correlational
Analysis
Using the data “Scatterplot_Discuss1”, let us answer the following:

1. Is there a relationship between advertising and sales?
2. How strong is the relationship between advertising and sales?
Using the Scatter
Plot
How to Create a Scatter
Plot
Open the data “Scatterplot_Discuss1.xlsx” in your SPSS.
1. Click on Graphs >> Chart Builder
2. Choose Scatter/Dot in the gallery
3. Drag Simple Scatter plot in the canvas
4. Select and drag “ADS” to the x – axis
5. Select and drag “SALES” to the y – axis
6. Click “OK”
Finding the
Correlation
Coefficient
Using the same data for the scatter plot.
1. Click “Analyze” >> “Correlate” >>
“Bivariate”
2. Select “ADS” and “SALES” and
transfer to the variable list
3. Tick “Pearson” and “Spearman”
4. Tick “Show only the lower triangle”
5. Click “OK”
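For readers who want to check the SPSS output outside the GUI, here is a minimal NumPy sketch of both coefficients. The data values are made up, and the simple double-argsort ranking ignores ties (SPSS uses average ranks for tied values):

```python
import numpy as np

def pearson(x, y):
    # Pearson r: covariance scaled by the product of standard deviations
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def spearman(x, y):
    # Spearman rho: Pearson r computed on the ranks of the data
    # (simple ranking without tie correction; SPSS averages tied ranks)
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

print(pearson([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))
```

Pearson measures linear association; Spearman, being rank-based, also picks up monotone but nonlinear relationships, which is why the dialog offers both.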
Correlation
Analysis Example
Open the “Pizza.xlsx” data in SPSS.

1. Construct a scatter plot for the pair of variables.
2. Determine the correlation coefficient and interpret the results.
Correlation
Analysis Exercise
Open the “Car_Sales.xlsx” data in SPSS.
Perform and answer the following.

1. Construct a scatter plot for Price vs. Horsepower, and Fuel capacity vs. Fuel efficiency.
2. Find the correlation of each bivariate pair. Decide whether to use Pearson or Spearman.
3. Interpret the results.
Introductory Concepts
of Regression
Using the data “Scatterplot_Discuss1”, can we answer the following?
1. Is there a relationship between advertising and
sales?
2. How strong is the relationship between advertising
and sales?
3. Is the relationship linear?
4. How accurate can we predict future sales?
5. What will be the value of sales given a value for
advertisement?
Introductory
Concepts in
Regression
Steps of the Model Building Process
The art of model building is an integral part of regression
analysis.
• Planning and data collection
 Statement of the problem
 Selection of potentially relevant variables
 Data collection
• Actual Model Building
 Identification/Specification stage
 Estimation and testing of the parameters
 Diagnostics and remedial measures
Introductory
Concepts in
Regression
Purpose of Regression Analysis
 Regression analysis deals with the study of the dependence of one variable, called the dependent variable, on one or more variables, the predictors (also called explanatory or independent variables). The goal is to estimate and/or predict the dependent variable in terms of the values of the independent variables.
 The main objective of regression analysis is to extract the structural relationship between the dependent and independent variable(s).
Regression
Analysis
• Appropriate for survey data and other
cross sectional data.
• Can be used to study time series data by
making some modifications
• It involves model building
Introductory
Concepts in
Regression
Type of Variable
• Independent Variable
 Also known as the predictor, explanatory variable, regressor variable, or exogenous variable. It affects the system, but its variability cannot be found within the system.
• Dependent Variable
 Also known as the response variable or endogenous variable. Its variability is explained by the system.
Introductory
Concepts in
Regression
Simple vs. Multiple Linear Regression
• Simple Regression
 Very straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X
• Multiple Linear Regression
 Extension of the simple linear regression model to
accommodate multiple predictors
Introductory
Concepts in
Regression
Some Important Notes in Regression
• It is possible to apply several regression analyses that each employ a single predictor. This approach, however, overlooks the possibility that the predictor variables may be inter-correlated, or that they may interact in their effects on the response variable.
• Multiple regression analysis takes into consideration (even
exploits) these possibilities, and therefore is eminently
suited for analyzing the collective and separate effects of
two or more predictor variables on response variables
Simple Linear
Regression
Simple Linear Regression
• A regression method involving just two measures, used to explore and quantify the relation between two variables known as:
 Independent – predictor, explanatory, or exogenous variable
E.g. Advertisement (Continuous)
 Dependent – response or endogenous variable
E.g. Sales (Continuous)
Simple Linear
Regression
How to Run a Simple Regression?
Using the data of sales and advertisement, let us create a Simple Linear Regression.

1. Click “Analyze” >> “Regression” >> “Linear”.
2. Transfer “SALES” to the dependent variable box.
3. Transfer “ADS” to the independent variable box.
4. Click “OK”
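The same fit can be reproduced outside SPSS with the closed-form OLS formulas. The advertising and sales numbers below are invented stand-ins, since the Scatterplot_Discuss1 values are not reproduced in these slides:

```python
import numpy as np

# Hypothetical ADS and SALES values standing in for Scatterplot_Discuss1
ads = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS: slope = Sxy / Sxx, intercept = ybar - slope * xbar
xc = ads - ads.mean()
slope = (xc * (sales - sales.mean())).sum() / (xc ** 2).sum()
intercept = sales.mean() - slope * ads.mean()
print(f"SALES = {intercept:.2f} + {slope:.2f} * ADS")
```

The B column of the SPSS Coefficients table reports exactly these two numbers (intercept as the “Constant” row, slope as the predictor row).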
Simple Linear
Regression
Based on the results, the estimated regression equation is given by the coefficients in the output.
Simple Linear
Regression
Assumptions
• One dependent variable measured at the continuous level.
• One independent variable measured at the continuous level.
• There should be a linear relationship between the dependent and independent variables.
• No autocorrelation – there should be independence of observations.
• There should be no significant outliers.
• The variances along the line of best fit should remain similar as you move along the line (homoscedasticity).
• The residuals (errors) of the regression line should be approximately normally distributed.
Simple Linear
Regression
Let us rerun the SLR:
1. Click on “Recall” >> “Linear Regression”
2. Click “Statistics” and tick “Descriptives”, “Confidence intervals”, and “Casewise diagnostics” >> click “Continue”
3. Click “Plots”, transfer “ZRESID” to the y-axis and “ZPRED” to the x-axis.
4. Tick “Histogram” and “Normal probability plot”
5. Click “Continue” and click “OK”
Simple Linear
Regression
Model Summary
• The composite correlation of the model is equal to the
absolute value of the correlation between the dependent
and independent variable
• The composite R-square is 0.711, which means that 71.1% of the variation in sales can be explained by advertisement.
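In simple linear regression the two quantities on this slide are linked exactly: R² is the square of Pearson's r between the two variables (√0.711 ≈ 0.843). A quick check with made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical advertisement values
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # hypothetical sales values

r = np.corrcoef(x, y)[0, 1]               # Pearson correlation
slope, intercept = np.polyfit(x, y, 1)    # OLS fit
resid = y - (intercept + slope * x)
r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
assert abs(r2 - r ** 2) < 1e-10           # R-squared equals r squared in SLR
```

This identity only holds with a single predictor; with multiple regressors, R² is the square of the correlation between the observed and fitted values instead.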
Simple Linear
Regression
ANOVA Table
• The ANOVA (F-test) table provides a test that at least one regressor has a linear relationship with the response variable. In this case, since there is only one predictor, this means ADS has a linear relationship to SALES.
Simple Linear
Regression
Regression Coefficient
• Column B contains the coefficients used to create our regression line. The value 1.435 means that every one-unit increase in advertisement results in an estimated 1.435 average increase in sales.
Simple Linear
Regression
Casewise Diagnostics
• This result shows which among the 54 cases/data points have residuals further than 2 standard deviations. These are cases 27 and 28, with standardized residuals of -2.211 and -2.552 respectively.
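The casewise cutoff is easy to reproduce with standardized residuals. Below, a fabricated dataset with one deliberately shifted case shows how points beyond ±2 standard deviations get flagged:

```python
import numpy as np

# Toy data on an exact line y = 2x, with case 6 pushed 10 units off on purpose
x = np.arange(1, 13, dtype=float)
y = 2 * x
y[5] += 10.0                                   # inject an outlier at x = 6

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
std_resid = resid / resid.std(ddof=2)          # ddof=2: two estimated parameters
flagged = np.where(np.abs(std_resid) > 2)[0]   # SPSS's default ±2 SD cutoff
print(flagged)                                 # only the injected case is flagged
```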
Simple Linear
Regression
Residual Summary
• This part of the result gives a summary of the model's predicted values. A negative residual represents an overestimation, while a positive residual represents an underestimation by the model.
Simple Linear
Regression
Residual Plots
• The residual plots allow us to assess whether the assumptions of the analysis are met. First, we look at the histogram of the residuals to see if they follow a normal curve; second, at the P-P plot of the standardized residuals of the model; and third, at the scatter plot of the standardized predicted values against the residuals of the model.
Multiple Linear
Regression
• An extension of the SLR, where more than one independent variable is involved.
• The formal statement of the model is

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

where:
Y – the value of the response variable
β₀, β₁, β₂, …, βₖ – the parameters of the model
X₁ – the value of the first predictor variable
X₂ – the value of the second predictor variable
Xₖ – the value of the kth predictor variable
ε – the random error/residual
Multiple Linear
Regression
• The linearity or non-linearity of the model depends on the parameters.
• The value of the highest power of a predictor is the order of the model.
• Since more than one predictor is involved, we are concerned with answering questions such as:
 Which combination of variables will do the job of prediction?
 Which among the variables is the most important?
Multiple Linear
Regression
Assumptions of MLR
• There should be one dependent variable measured at a continuous level.
• Two or more independent variables, with at least one at a continuous level.
• No autocorrelation – there should be independence of errors.
• A composite linear relationship between the dependent and independent variables should exist.
• Homoscedasticity of residuals (equal error variance).
• No multicollinearity.
• No significant outliers.
• Errors are normally distributed.
Multiple Linear
Regression
How to run a MLR?
Open the “Car_Sales.xlsx” data in SPSS.
• Click “Analyze” >> “Regression” >> “Linear”
• Select and transfer “Price in thousands” to the dependent list
• Select and transfer “Engine size, Horsepower, Wheelbase, Width, Length, Curb weight, Fuel capacity, and Fuel efficiency” to the independent list
• Click “Statistics”, tick “Descriptives”, “Collinearity diagnostics”, and “Casewise diagnostics”
• Click “Continue”
Multiple Linear
Regression
How to run a MLR?
Continuation…
• Click “Plots”, transfer “ZRESID” to the y-axis and “ZPRED” to the x-axis.
• Tick “Histogram” and “Normal probability plot”
• Click “Continue”
• Click “OK”
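The coefficients SPSS reports here are ordinary least-squares estimates. A sketch with simulated data (invented stand-ins for two of the Car_Sales predictors) shows the same computation via a design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
hp = rng.normal(150, 30, n)                # hypothetical horsepower values
eng = rng.normal(3.0, 0.8, n)              # hypothetical engine sizes
price = 5 + 0.1 * hp + 4 * eng + rng.normal(0, 2, n)  # true model plus noise

X = np.column_stack([np.ones(n), hp, eng])          # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, price, rcond=None)    # OLS coefficient estimates
print(beta)                                # close to the true (5, 0.1, 4)
```

Because the data are simulated from a known model, you can see the estimates recovering the true coefficients; with real survey data the truth is unknown, which is why the t-tests and diagnostics below matter.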
Multiple Linear
Regression
The Result
Model Summary
• R²
 The coefficient of multiple determination, which is the general measure of the goodness-of-fit of a model to the sample data.
 It gives the amount of variability in the response variable that can be explained by the set of k independent variables.
• Adjusted R²
 Has the same interpretation as the coefficient of multiple determination, but is the one to use when comparing regression models with varying numbers of regressors.
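The adjustment is a simple penalty for the number of predictors k relative to the sample size n. A sketch of the standard formula (the 0.711 and n = 54 figures echo the earlier simple-regression example):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: penalizes R² for using k predictors on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With one predictor and 54 cases, the penalty is tiny:
print(round(adjusted_r2(0.711, 54, 1), 4))
```

Adding a useless predictor always raises R² a little but can lower adjusted R², which is what makes it the better yardstick for comparing models of different sizes.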
Multiple Linear
Regression
The Result
ANOVA Table
• F-test or ANOVA test
 It is a general test of the correctness of the model.
 It tests the hypothesis that at least one of the independent predictors has a linear relationship to the dependent variable.
 The null hypothesis – all the parameters are equal to 0 – is rejected if the p-value (Sig.) is less than or equal to the level of significance.
Multiple Linear
Regression
The Result
Coefficient Table
• Collinearity Statistics (VIF) – the value of the VIF should not be higher than 10.
• t-test & p-value – to conclude that a regressor has a linear relationship with the response variable, the p-value should be less than or equal to the level of significance.
• Unstandardized coefficients – the values needed to build the linear model.
• Standardized coefficients – the values used to determine which among the predictors has the largest effect on the dependent variable.
Multiple Linear
Regression
The Result
Casewise and Residual
Diagnostics
• Casewise Diagnostics – identifies which among the cases is/are outliers that may be causing problems in the model.
• Residual Statistics – gives us an idea of whether the model's predicted values overestimated or underestimated the actual values.
Multiple Linear
Regression
The Result
The Plots
• Histogram and Normal Probability
Plot
 It shows the distribution of the residuals of the model.
 It visually tests whether the residuals of the model follow a normal distribution.

Arturo J Patungan Jr
Multiple Linear
Regression
The Result
The Plots
• Scatter Plot
 The scatter plot of the residuals gives us an idea of the homogeneity of variance (homoscedasticity) of the residuals.
 A model with homoscedasticity (no problem of heteroskedasticity) has relatively the same spread of points across a horizontal pattern.
Multiple Linear
Regression
Diagnostic Checking and Remedial Measures in a
Regression Analysis
• Multicollinearity – when predictors are highly correlated
 How to detect?
 Perfect multicollinearity makes the computer scream at you!
 Milder forms may be detected by significant F-statistics or a high R² accompanied by t-statistics which are not significant.
 It can also be detected by a high correlation between pairs of regressors.
 The rule of thumb commonly used is: multicollinearity is not serious if no VIF is greater than 10.
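The VIF itself is just 1/(1 − R²) from regressing each predictor on the remaining ones. A NumPy sketch (the function name and simulated data are mine):

```python
import numpy as np

def vif(X):
    """VIF for each column of X: 1 / (1 - R²) from regressing it on the rest."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
a = rng.normal(size=200)
b = a + rng.normal(0, 0.05, 200)     # b is nearly a copy of a
print(vif(np.column_stack([a, b])))  # both VIFs far above the rule-of-thumb 10
```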
Multiple Linear
Regression
Diagnostic Checking and Remedial Measures in a
Regression Analysis
• Multicollinearity
 Remedies
 Drop one of the variables that are causing the problem.
 Add more observations to the data.
 Perform principal component analysis or factor analysis before performing the regression.
Multiple Linear
Regression
Diagnostic Checking and Remedial Measures in a
Regression Analysis
• Serial Correlation/Autocorrelation – error terms are correlated with each other
 How to detect?
 Use the Durbin–Watson test, which tests for first-order autoregressive serial correlation.
 The Durbin–Watson statistic should be close to 2 to conclude there is no serial correlation:
 Durbin–Watson = 2 – no serial correlation
 0 < Durbin–Watson < 2 – positive autocorrelation
 2 < Durbin–Watson < 4 – negative autocorrelation
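SPSS reports this statistic when Durbin-Watson is ticked under Statistics, but the formula is short enough to sketch directly (using synthetic residuals to show the extremes of the 0–4 scale):

```python
import numpy as np

def durbin_watson(resid):
    # DW = sum of squared successive differences over the sum of squared residuals
    resid = np.asarray(resid, float)
    d = np.diff(resid)
    return float((d ** 2).sum() / (resid ** 2).sum())

print(durbin_watson(np.ones(100)))              # 0.0 – extreme positive autocorrelation
print(durbin_watson(np.tile([1.0, -1.0], 50)))  # 3.96 – strong negative autocorrelation
```

Independent residuals give values near 2, matching the slide's rule of thumb.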
Multiple Linear
Regression
Diagnostic Checking and Remedial Measures in
a Regression Analysis
• Serial Correlation/ Autocorrelation
 Remedies
 Reintroduction of an important omitted variable may
remove the problem
 If the source of the serial correlation is an incorrect
functional form, then specifying the correct functional
form will solve the problem.
Multiple Linear
Regression
Diagnostic Checking and Remedial Measures in
a Regression Analysis
• Heteroskedasticity – the error terms do not have equal (constant) variance
 How to detect?
 A funnel-shaped residual plot indicates nonconstant variance.
 If the residuals form a horizontal band centered around 0, the indication is that the variance is constant.
Multiple Linear
Regression
Diagnostic Checking and Remedial Measures in
a Regression Analysis
• Heteroskedasticity
 Remedy
 If the nature of heteroskedasticity is “known”, or
believed to be of a certain form, then a suitable
transformation (e.g. logarithmic) of the
variables might remove the problem.
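A sketch of that remedy with simulated data: multiplicative noise makes the spread grow with the response on the raw scale, and a log transform restores a roughly constant residual variance (all values below are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
y = np.exp(0.5 + 0.3 * x) * rng.lognormal(0, 0.2, 200)  # multiplicative errors

# After the log transform the model is linear with ~constant error variance:
log_y = np.log(y)                       # log y = 0.5 + 0.3 x + Normal(0, 0.2)
slope, intercept = np.polyfit(x, log_y, 1)
resid = log_y - (intercept + slope * x)

# residual spread is similar in the lower and upper halves of x
lo, hi = resid[:100].std(), resid[100:].std()
print(round(lo / hi, 2))
```

The transform only helps when the heteroskedasticity really is multiplicative; if the variance pattern has another form, a different transformation (or weighted least squares) is needed.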
Multiple Linear
Regression
Exercises
• Use the Pizza data.
• Create a MLR model with the following:
 Dependent Variable – Calories
 Independent Variables - Moisture, Protein,
Fats, Ash, Sodium, Carbohydrates
• Test the Multicollinearity Assumption
• Test significance of predictors
• Test Serial Correlations
• Test Heteroskedasticity assumption
Multiple Linear
Regression
Other Consideration in the Regression Model
• Indicator Variables
 Indicator or dummy variables are used to include categorical or
qualitative regressors in the regression analysis
 Dummy variables are also used to compare the responses of different
groups.
 Dummy variables assume only the values 0 and 1; generally 1 denotes
the presence of a characteristic, while 0 denotes the absence of the
characteristic. However, assignment of labels to the values is generally
arbitrary.
 Even though one of the independent variables in the model is
qualitative, it is possible to include interaction effects or interaction terms
in the model by including cross-product terms.
Multiple Linear
Regression
Other Consideration in the Regression Model
• Number of Indicator Variables
 In general, if a qualitative variable has m categories, we
can define (m-1) dummy variables.
 If there is more than one qualitative variable, define the appropriate number of dummy variables for each qualitative variable.
 Dummy variables can also be used to model seasonality (time-
series data).
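A sketch of m − 1 dummy coding for a hypothetical three-category variable (the region labels are made up; the omitted category becomes the reference level):

```python
import numpy as np

region = np.array(["North", "South", "East", "South", "North", "East"])
levels = ["East", "North", "South"]   # m = 3 categories; "East" is the reference

# m - 1 = 2 dummy columns: a row of all zeros encodes the reference category
dummies = np.column_stack([(region == lvl).astype(int) for lvl in levels[1:]])
print(dummies)
```

Each regression coefficient on a dummy column is then interpreted as the difference in the response between that category and the reference category.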
