SPSS Tutorials


Module 1

Type of file
- Introduction
o To open a file = file > open > data (there are also other file types)
o Then, select the dataset and click open > a dataset window and an output
window will appear
o DATA.sav: data view and variable view
o OUTPUT.spo: each shown in a specific output window
o SYNTAX.sps: shown in a specific syntax window + can have a script in the SPSS
coding language (e.g. useful when needed to repeat the same command with
small changes)
- Data
o Data view: reports each case in a row and the values of each variable in
columns
o Variable view: for each variable (in rows), there are different types of info
reported in columns
 If we click on the row of the chosen variable, we are moved in its
position within the data view
o Variable view info: name, type (e.g. numeric, string…), label (details the name
of the variable), values (reports the legend of the levels, e.g. 1 “yes”, 2 “no”),
missing (reports the levels for missing values, e.g. -1 “no answer”), measure
(reports the variable type as the scale of measurement used, i.e. scale, ordinal,
or nominal)
- Output
o Output window: reports everything that we asked SPSS to do and contains
the results that it provides us for the analysis
o Analyze > descriptive statistics > frequencies > drag variable and click OK
o Log and analysis (e.g. frequencies), which is divided into title, notes, statistics,
and variable question (e.g. do you have a phone?)
o Can be saved: file > save or save as...
- Syntax
o In the output file, log reports the code corresponding to the command
launched through the drop-down menu
 We can use a syntax file to draft the script with the command lines of
the analysis we want to perform according to the SPSS language
o Analyze > descriptive statistics > frequencies and drag the variable to analyze
> if you click “paste” instead of “ok”, then the command will appear in the
syntax window
 > from there, you can select it, click on the green play button, and the
results of the analysis will appear in the output window
 Command name is blue, variable name is black, parameter is red, and
the option is green
 Can add comments using * at the beginning of the phrase
 Can copy and paste commands, e.g. changing variable name and
running the analysis again
- The four main menus
o In the graphical user interface of SPSS, there are 4 very relevant menus: data,
transform, analyze, and graph
1. Data: collects commands that perform operations at the dataset level, e.g.
merging datasets, sorting cases, filtering cases, extracting a smaller portion of
the dataset by selecting some variables…
2. Transform: commands for managing and preparing the dataset before doing
an analysis, e.g. recoding a variable, creating a new variable through a math
operation on other variables, collapsing some variables, etc.
3. Analyze: commands to perform analysis (descriptive statistics and model
computation)
4. Graph: commands to ask SPSS to plot a graph

Data management
- Creating a smaller dataset
o File > save as… > variables… (a new window appears – only the selected
variables will be saved) > “drop all” and select the needed variables >
continue (and paste or save)
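Clicking “paste” instead of “save” produces the corresponding syntax. A rough sketch of what it looks like (the output path and the kept variable names are placeholders):

```spss
* Save a smaller dataset containing only the selected variables.
SAVE OUTFILE='C:\data\small_dataset.sav'
  /KEEP=PESEX PEEDUCA PTC1Q10.
```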
- Selecting cases
o Data > select cases > “if condition is satisfied” > “if…” (a new window
appears) > select the needed variables, drag them and write the condition
(e.g. PESEX=2) > continue > choose final output
 “Filter out unselected cases”: from the variable view, double-click on the
row of the variable PESEX to jump to it in the data view; the cases with
a value different than 2 will be crossed out
 If you launch a command, e.g. analyze > descriptive statistics >
frequencies > PESEX variable chosen > “ok” > the output
analysis will be run only on PESEX=2
 “Copy selected cases to a new dataset” > “ok” > a new dataset will
appear with only the cases where PESEX=2 (remember to save this
window as it won’t automatically save)
 “Delete unselected cases” (not suggested because those cases will be
deleted and won’t be recovered in any way)
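Pasting the “filter out unselected cases” option produces syntax along these lines (a simplified sketch; SPSS also pastes labels and formats for the filter variable):

```spss
* Filter out unselected cases: cases with PESEX ~= 2 are excluded
* from analyses but remain in the dataset.
USE ALL.
COMPUTE filter_$=(PESEX=2).
FILTER BY filter_$.
EXECUTE.

* Turn the filter off again when done.
FILTER OFF.
```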
- Transforming a variable
o Transform > compute variable
o “Target variable” (to give it a name) = “Numeric expression” (can search in
“function group”, e.g. for the natural log, go to arithmetic > Ln)
 To write in “numeric expression”, either use the blue arrow or type
the function directly (same for selecting the variable), e.g. LN(PTC1Q10) >
then click “ok”
 Output window result + new variable appears on the main grid
 In variable view, you can write a label for the variable
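The pasted syntax for this transformation looks roughly like the following (LN_visits is a placeholder name for the new variable):

```spss
* Create a new variable as the natural log of PTC1Q10.
COMPUTE LN_visits=LN(PTC1Q10).
VARIABLE LABELS LN_visits 'Log of number of visits'.
EXECUTE.
```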
- Recoding the values of a variable
o E.g. PEEDUCA is a categorical variable with many categories, so we might
want to reduce them (have only 3) in order to lower complexity
o Transform > “recode into different variables…” > drag the chosen variable > on the
right, write name and label and press “change”
o Click “old and new values” (a new window appears) > “system-missing” to
“system-missing” and add, “system- or user-missing” to “system-missing”
and add (don’t forget to do this!)
 Then, old “value” e.g. 31, to new “value” e.g. 1 and click add
 Then, old “range” e.g. 32 through 40, to new “value” e.g. 2
 Then, old “range, value through highest” e.g. 41, to new “value” e.g. 3
 Click “continue” (you’ll go back to the previous window) and “ok” >
results will appear in the output window (reporting the syntax) and in the
data view (the new variable will show)
o Useful when creating a dummy variable: identifying that the case belongs to a
category of interest, with value 1 (of interest) or 0 (not of interest)
 Transform > “recode into different variables…” > “reset” to clean up
 Drag chosen variable and write name & label > “old and new values…”
 Do everything as before with the system-missing and user-missing
 Old “value” e.g. 31 to new “value” e.g. 1 (value of interest), and add
 Old “all other values” to new “value” e.g. 0, “add” > “continue” >
“change” > “ok”
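The pasted syntax for both recodings looks roughly like this (it reuses the example values above; the dummy name matches the variables listed later in Module 2):

```spss
* Recode PEEDUCA into a 3-category variable R_EDU.
RECODE PEEDUCA (MISSING=SYSMIS) (31=1) (32 thru 40=2)
  (41 thru Highest=3) INTO R_EDU.
VARIABLE LABELS R_EDU 'Education, 3 categories'.
EXECUTE.

* Dummy variable: 1 for the category of interest, 0 otherwise.
RECODE PEEDUCA (MISSING=SYSMIS) (31=1) (ELSE=0) INTO DU_less_first_grade.
EXECUTE.
```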

Analysis of one variable – descriptive statistics


- Frequency distribution
o Analyze > descriptive statistics > frequencies > drag variable and click “ok”
o In the output window, a table with the results will appear
 At the top, the label of the variable reported will be shown
 In the first column, the values taken by the variable are shown (or
corresponding label)
 In the second column, the absolute frequencies (number of
respondents who gave a certain answer) are shown
 In the third column, the relative frequencies are shown (percentage
computed out of total value)
 In the fourth column, the percentages considering only those who
actually gave an answer are shown (total excluding missing values)
 In the fifth column, cumulative percentages are shown: for a
numerical variable, this means that for example 80% of visits to art
museums are not greater than 4
- Summary measures
o Analyze > descriptive statistics > frequencies > drag variable and click
“statistics”
o In the new window, you can choose the measures for the output file (e.g.
mean, median, and standard deviation) > click “continue” and “ok”
o In the new window, a new “statistics” table will appear
- Plotting a graph
o Analyze > descriptive statistics > frequencies > drag variable and click “charts”
o In the new window, choose the chart type (e.g. histogram for numerical)
 You can also compare the distribution of the variable to a normal
distribution by clicking on the box “show normal curve on histogram”
 Click “continue” and “ok” > in the output window, the plot will appear
o To check for the normal distribution, you can also click “analyze > “descriptive
statistics” > “Q-Q Plots”
 In the dialogue window, drag the variable and click “ok” > the output
window will show the Q-Q Plot
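The frequencies steps above (table, summary measures, and histogram with a normal curve) paste to a single command, roughly:

```spss
* Frequency table, summary measures, and a histogram with a
* normal curve overlaid, for the number-of-visits variable.
FREQUENCIES VARIABLES=PTC1Q10
  /STATISTICS=MEAN MEDIAN STDDEV
  /HISTOGRAM NORMAL
  /ORDER=ANALYSIS.
```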

Analysis of one variable – statistical inference


- Hypothesis testing
o E.g. suppose you want to test the hypothesis that the average number of
visits to art museums or galleries in the past 12 months is greater than 3
o Analyze > compare means > one-sample T test
 In the new window, drag the chosen variable (e.g. number of…) and
set the test value as 3, then click “ok”
o In the output window, a new table will appear, reporting the test statistic, the
number of degrees of freedom of the t distribution of the test statistic, and the
significance (the p-value; in this case we have a one-sided test, so the p-value
is the shown number divided by 2)
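The corresponding syntax is roughly the following (PTC1Q10 stands in for the number-of-visits variable):

```spss
* One-sample t test of H0: mean = 3. SPSS reports a two-sided
* p-value; halve it for the one-sided alternative.
T-TEST
  /TESTVAL=3
  /VARIABLES=PTC1Q10
  /CRITERIA=CI(.95).
```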
- Confidence intervals
o Analyze > descriptive statistics > explore
o Drag variable to “dependent list” and click “statistics” > “descriptives” and
choose confidence level for the mean > “continue” and “ok”
o In the output window, a new table “descriptives” will appear, showing the
confidence intervals (e.g. we are 90% confident that, for those who visited art
galleries and museums, the average number of visits is included within 3.03
and 3.25)
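Pasting the Explore dialog with a 90% confidence level gives syntax roughly like:

```spss
* 90% confidence interval for the mean via Explore.
EXAMINE VARIABLES=PTC1Q10
  /PLOT NONE
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 90
  /MISSING LISTWISE.
```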

Evaluating association
- Crosstabs
o To study the association between two variables that are categorical or
numerical with a small number of values, we can use cross tabs
 E.g. understand if visits to an art museum or gallery are related to
gender
o Analyze > descriptive statistics > crosstabs
 Put one variable in rows and the other in columns
 To evaluate the presence of association, we need the conditional relative
frequencies: click on “cells” > click “row” (relative frequencies of the
variable in column conditioned on the variable in row) and “column”
(for the opposite) > “continue”
 To carry out a test for independence, click on “statistics” and “chi-
square” > “continue” and “ok”
o The output will show a new table “crosstabulation”
 First row (count) = joint absolute frequencies
 Second row shows relative frequencies by row
 Third row shows relative frequencies by column
o For example:
 1260 of the females interviewed answered “yes”, while 3071 males
answered “no”
 Among males, 77.4% answered “no”
 Among those who answered “yes”, 58.4% are females
o The output will also show a new table “chi-square tests”
 In the last column, we have the p-value (exact sig.)
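The crosstab with both conditional percentages and the chi-square test pastes to roughly this command (visited_museum is a placeholder name for the yes/no visits variable):

```spss
* Crosstab with row and column percentages and a chi-square
* test of independence.
CROSSTABS
  /TABLES=PESEX BY visited_museum
  /CELLS=COUNT ROW COLUMN
  /STATISTICS=CHISQ.
```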
- Boxplot and group means
o To understand whether a numerical variable changes its behavior for different
groups/individuals/objects defined by the categories of an ordinal or nominal
variable, we can use a boxplot to compare the distribution of the quantitative
variable in each group
o Graph > legacy dialogs > boxplot > keep default settings and click “define”
 E.g. boxplot of number of visits by gender: drag “number of visits…” to
“variable” and drag “sex” to “category axis” > “ok” and the boxplot will
appear in the output window
o Another method: analyze > descriptive statistics > explore > drag “number of
visits…” to “dependent list” and “sex” to “factor list” > “ok” and the
“descriptives” table will appear in the output window
 There, we can compare the means
 The boxplot will also be shown
- Linear relationship
o To assess the sign and strength of the linear relationship between 2 numerical
variables (e.g. number of visits and age), we could use the Pearson’s
correlation coefficient
o Analyze > correlate > bivariate
 Drag the two variables and select the Pearson correlation coefficient >
“ok” and the “correlations” matrix will show in the output window
 In the example, the Pearson correlation is 0.006 and Sig. (p-value) is 0.77 =
no significant linear relationship between the 2 variables
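The pasted syntax is roughly (PRTAGE is a placeholder for the age variable):

```spss
* Pearson correlation between the two numerical variables.
CORRELATIONS
  /VARIABLES=PTC1Q10 PRTAGE
  /PRINT=TWOTAIL NOSIG.
```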
Module 2

Linear regression command


- New variables (to launch a regression analysis, the relevant categorical
variables to be included as independent variables must first be transformed into
dummy variables)
o R_EDU
o DU_less_first_grade
o DU_more_high_school
o DU_male
o DU_female
- Analyze > regression > linear
o E.g. regress number of visits on age, gender, and level of education
o Drag “number of visits…” to dependent and everything else (age, less than
first grade, more than high school, male) to independent(s) > “ok”
- Three tables will appear in the output window:
o Model summary
o ANOVA
o Coefficients
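The pasted regression command looks roughly like this (dependent and covariate names are placeholders matching the examples above):

```spss
* Regress number of visits on age and the dummy variables.
REGRESSION
  /DEPENDENT PTC1Q10
  /METHOD=ENTER PRTAGE DU_less_first_grade DU_more_high_school DU_male.
```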

Violation of the model assumptions


- To check for violations, analyze > regression > linear
o Drag the relevant variables and click “statistics” > on top of the default options,
select “part and partial correlations” (will add 3 columns to the coefficients
table: zero-order, partial, and part correlations) and “collinearity diagnostics”
(will add 2 columns to the coefficients table: tolerance and VIF)
 Select “Durbin-Watson”: used for detecting serial correlation of the
error terms (another possible violation) > “continue”
o “Continue” will lead us back to the previous window > select “plots” > select
“produce all partial plots” (to assess violation of the linearity
assumption)
o “Scatter 1 of 1”: scatter plot to check for homoscedasticity of error terms
 Drag “ZPRED” to “X” and “ZRESID” to “Y”
o Select “histogram” and “normal probability plot” to see if the error terms are
normally distributed
o Then, click “continue” and “ok”
- The output window will show the tables that we need to check for violations
o Violation of the linearity assumption: check the partial regression plots
o Multicollinearity assumption: check “coefficients” table
o Homoscedasticity of error terms: check the scatterplot of the standardized
residual against the standardized predicted value
o Normality of the error terms: check the histogram and P-P plot
o No correlation of the error terms: check “model summary” table (look at
Durbin-Watson value)
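All of the diagnostic options above can be pasted into one command, roughly (a sketch with placeholder variable names; ZPP requests the zero-order/partial/part correlations, TOL and COLLIN the collinearity columns):

```spss
* Regression with diagnostics for checking the model assumptions.
REGRESSION
  /STATISTICS COEFF R ANOVA ZPP COLLIN TOL
  /DEPENDENT PTC1Q10
  /METHOD=ENTER PRTAGE DU_less_first_grade DU_more_high_school DU_male
  /PARTIALPLOT ALL
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS DURBIN HISTOGRAM(ZRESID) NORMPROB(ZRESID).
```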
Outliers and influential cases
- Identifying outliers
o Analyze > regression > linear > select the variables and click statistics >
casewise diagnostics > continue and ok
o In the output window, there will be a list of outlier cases with their position in
the dataset
- Identifying influential cases
o Analyze > regression > linear > select the variables and click “save” >
select standardized residuals, Cook’s distance, and DfFit > continue and ok
 In the output window, a “residuals statistics” table will appear
o Analyze > descriptive statistics > descriptives > select relevant statistics >
options (make sure that maximum is selected) > continue and ok
 In the output window, a “descriptive statistics” table will appear
 If maximum values of Cook’s distance and DFFIT are < 1, there are no
influential cases
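A sketch of the corresponding syntax (SPSS names the saved variables automatically, e.g. COO_1 for Cook’s distance and DFF_1 for DfFit; variable names are placeholders):

```spss
* Casewise diagnostics plus saved influence measures.
REGRESSION
  /DEPENDENT PTC1Q10
  /METHOD=ENTER PRTAGE DU_male
  /CASEWISE PLOT(ZRESID) OUTLIERS(3)
  /SAVE ZRESID COOK DFFIT.

* Check the maxima of the saved influence measures.
DESCRIPTIVES VARIABLES=COO_1 DFF_1
  /STATISTICS=MAX.
```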

Module 3

Odds ratio
- Binary categorical variable = odds ratio to evaluate association
o Association between response variable (visits to art museums) and variable
sex
- Analyze > descriptive statistics > crosstabs
o Response variable in column and independent variables in rows
o Cells > percentages > row > continue
o Statistics > risk and chi-square > ok
- Output = risk estimate shows the odds ratio of visiting an art museum or gallery,
comparing males against females
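The same dialog, pasted, gives roughly (visited_museum is a placeholder name; RISK requests the risk-estimate / odds-ratio table):

```spss
* Odds ratio (risk estimate) and chi-square test.
CROSSTABS
  /TABLES=PESEX BY visited_museum
  /CELLS=COUNT ROW
  /STATISTICS=CHISQ RISK.
```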

Binary logistic regression


- SPSS can automatically fit a logistic model
- Analyze > regression > binary logistic
o Select dependent variable and covariates
o To signal categorical variables: categorical > select variables and choose
reference category (first or last) > continue
o Options > Hosmer-Lemeshow goodness-of-fit and CI for exp(B) (choose %) >
continue
o For measure of outliers: options > Casewise listing of residuals > continue
o For influential cases: save > Cook’s > continue > ok
- Output tables:
o Case processing summary = how many cases were retained for the analysis
o Dependent variable encoding = how the response variable has been recoded
 E.g. yes = 0 and no = 1 means that the model will describe the log
of the odds of not having visited / having visited
o Categorical variables codings = how the categorical variables were recoded
into dummy variables
 Female is the reference category for sex
 3 is the reference category for education recoded
 Education recoded has 2 dummy variables: the first is 1 when education
recoded is 1 and 0 otherwise; the second is 1 when education recoded
is 2 and 0 otherwise
o Classification table = shows % of correctly classified cases
o Variables in the equation = what we need to know about each covariate (B,
S.E., sig, confidence interval…)
o Casewise list = find out if there are residuals that are higher than 2 standard
deviations
o …
- To analyze influential cases, check out the new variable that will get automatically
created in the variable view
o Analyze > descriptive statistics > frequencies > choose the new variable and
untick “display frequency tables” > statistics > min and max
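The logistic-regression options above paste to roughly this command (a sketch; dependent and covariate names are placeholders, and Indicator(1) takes the first category as reference):

```spss
* Binary logistic regression with a categorical covariate,
* goodness-of-fit test, CIs for exp(B), outlier listing,
* and Cook's distance saved as a new variable.
LOGISTIC REGRESSION VARIABLES visited_museum
  /METHOD=ENTER PRTAGE R_EDU
  /CONTRAST (R_EDU)=Indicator(1)
  /SAVE=COOK
  /CASEWISE OUTLIER(2)
  /PRINT=GOODFIT CI(95).
```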

Multinomial logistic regression


- How to perform a multinomial logistic regression analysis, e.g. we are also
interested in no response, refused, and don’t know as responses to visits to art
museums
o Create a new variable with those responses recoded as 0, not in universe as
missing, and everything else (yes and no) copied
- Analyze > regression > multinomial logistic
o New variable as the dependent variable and select reference category >
choose first, last, or custom > continue
o Select numerical covariates as “covariates” and categorical ones as “factors”
(to automatically create dummy variables for the latter)
o Statistics > classification table and goodness-of-fit > continue > ok
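The pasted multinomial command looks roughly like this (visits_cat is a placeholder name for the new multi-category dependent variable; factors go after BY, covariates after WITH):

```spss
* Multinomial logistic regression: R_EDU as a factor (dummies
* created automatically), age as covariate, last category as base.
NOMREG visits_cat (BASE=LAST ORDER=ASCENDING) BY R_EDU WITH PRTAGE
  /PRINT=CLASSTABLE FIT PARAMETER SUMMARY.
```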

Module 4

Factor analysis
- To create a new variable equal to an existing one, e.g. N_jazz_zero_new: transform >
compute variable > choose and drag “number of live jazz performances” and “ok”
o Then, change the missing values in the new variable: transform >
recode into same variables > choose and drag N_jazz_zero_new > select “old
and new values” (system- or user-missing get new value = 0) > continue
o Select “if” > “include if case satisfies condition” > choose and drag “attended
a live jazz performance” > set it equal to 2 > continue > ok
- To perform a factor analysis, analyze > dimension reduction > factor
o Under variables, select and drag variables to include in the analysis
o Descriptives > under “statistics” select univariate descriptives and initial
solution, under “correlation matrix” select coefficients, significance levels,
inverse (for partial correlations), anti-image (for measure of sampling
adequacy for each variable), and KMO and Bartlett’s test (to understand if
factor analysis is possible because of strong correlation) > continue
o Extraction > select scree plot as well (can also change “extract” to “fixed
number of factors” and choose a number) > continue
o Rotation > Varimax > continue
o Options > select “sorted by size” and “suppress small coefficients” (can
choose absolute value, e.g. 0.3) > continue
o Scores > select “save as variables” (saving factors created as new variables –
will appear in variable view window) > continue > ok
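The whole dialog sequence pastes to one FACTOR command, roughly (a sketch; var1–var4 are placeholders for the variables included in the analysis, and FACTORS(3) fixes the number of extracted factors):

```spss
* Factor analysis with Varimax rotation; factor scores are
* saved as new variables (FAC1_1, FAC2_1, ...).
FACTOR
  /VARIABLES var1 var2 var3 var4
  /PRINT UNIVARIATE INITIAL CORRELATION SIG INV AIC KMO EXTRACTION ROTATION
  /FORMAT SORT BLANK(.3)
  /PLOT EIGEN
  /CRITERIA FACTORS(3)
  /EXTRACTION PC
  /ROTATION VARIMAX
  /SAVE REG(ALL).
```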

Cluster analysis
- Analyze > classify > K-means cluster
o Can run it on the factor score variables created by the factor analysis
o Choose and drag variables and choose number of clusters
- Output window: if the iteration history reports that the iterations failed to
converge, this could be a problem for the result > increase the maximum number
of iterations for more robust results
o Analyze > classify > K-means cluster
o Iterate > maximum = e.g. 20 > continue
o Options > ANOVA table > continue > ok
- New output: the iteration history reports that convergence was reached = good
(the change also became 0 at the last iteration)
o Number of cases in each cluster table shows uneven concentration in cluster
2 (too much) > need to run analysis again
o Analyze > classify > K-means cluster > change number of clusters to e.g. 5
o Save > cluster membership (to describe clusters) > continue > ok
- New output: iteration history is good + ANOVA p-values are significant + number of
cases in each cluster is still not balanced but it’s the best option
o Under variable view, there will be a new variable (QCL_1): can analyze it, e.g.
analyze > descriptive statistics > frequencies and choose the new variable
o Analyze > descriptive statistics > crosstabs (new variable in row and another
one in column)
 Cells > click row and column percentages
 Statistics > chi-square (to see if there is association)
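The final K-means run pastes to roughly this command (a sketch; FAC1_1–FAC3_1 are the default names SPSS gives to saved factor scores, and cluster membership is saved as QCL_1):

```spss
* K-means with 5 clusters, a higher iteration limit, the ANOVA
* table, and cluster membership saved as a new variable.
QUICK CLUSTER FAC1_1 FAC2_1 FAC3_1
  /CRITERIA=CLUSTER(5) MXITER(20)
  /PRINT ANOVA
  /SAVE CLUSTER.
```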
