
Sustainable Resource Management Program in North Gondar
(SRMP-NG)

Statistical Package for Social Scientists
(SPSS)

By
Minilik Tsega

Ethiopian Institute of Agricultural Research
Ethiopian Agricultural Research Organization

December 10-13, 2009

Gondar

Prerequisites: This document presumes that you are familiar with Windows 98/2000/XP and its commands. It will not review any DOS or Windows concepts such as filenames, paths, booting up, erasing files, using the mouse, or scrolling. If you are not yet comfortable with Windows and must use SPSS, seek advice from computer professionals.

Knowledge is power and data is just data. No matter how much data you have on
hand, if you don’t have a way to make sense of it, you really have nothing at all.
That is where SPSS comes in.

SPSS training manual, Yohannes Tilahun



Introduction

Developments in the field of statistical data analysis often parallel or follow advancements in other fields to which statistical methods are fruitfully applied. Because practitioners of statistical analysis often address particular applied decision problems, methods development is consequently motivated by the search for better decision making under uncertainty.

Decision making under uncertainty is largely based on the application of statistical data analysis for probabilistic risk assessment of your decision. Managers need to understand variation for two key reasons: first, so that they can lead others to apply statistical thinking in day-to-day activities, and second, to apply the concept for the purpose of continuous improvement. This course will provide you with hands-on experience to promote the use of statistical thinking and techniques, and to apply them to make educated decisions whenever there is variation in business data. It is, therefore, a course in statistical thinking via a data-oriented approach.


Statistical models are currently used in various fields of business and science. However, the terminology differs from field to field. For example, the fitting of models to data is variously called calibration, history matching, and data assimilation, all of which are synonymous with parameter estimation.

Your organization's database contains a wealth of information, yet the decision technology group members tap only a fraction of it. Employees waste time scouring multiple sources for data. Decision-makers are frustrated because they cannot get business-critical data exactly when they need it. Therefore, too many decisions are based on guesswork, not facts. Many opportunities are also missed, if they are even noticed at all.

Knowledge is what we know well. Information is the communication of knowledge. In every knowledge exchange, there is a sender and a receiver. The sender makes common what is private, does the informing, the communicating. Information can be classified into explicit and tacit forms. Explicit information can be expressed in structured form, while tacit information is inconsistent and fuzzy to explain. Note that data are only crude information and not knowledge by themselves.


Data are crude information and not knowledge by themselves. The sequence from data to knowledge is: from data to information, from information to facts, and finally, from facts to knowledge. Data become information when they become relevant to your decision problem. Information becomes fact when the data can support it. Facts are what the data reveal. However, decisive instrumental (i.e., applied) knowledge is expressed together with some statistical degree of confidence.

A fact becomes knowledge when it is used in the successful completion of a decision process. Once you have a massive amount of facts integrated as knowledge, your mind will be superhuman in the same sense that mankind with writing is superhuman compared to mankind before writing. The following figure illustrates the statistical thinking process, based on data, in constructing statistical models for decision making under uncertainty.


The above figure depicts the fact that as the exactness of a statistical model increases, the level of improvement in decision-making increases. That is why we need statistical data analysis. Statistical data analysis arose from the need to place knowledge on a systematic evidence base. This required a study of the laws of probability, the development of measures of data properties and relationships, and so on.

Statistical inference aims at determining whether any statistical significance can be attached to results after due allowance is made for any random variation as a source of error. Intelligent and critical inferences cannot be made by those who do not understand the purpose, the conditions, and the applicability of the various techniques for judging significance.


Considering the uncertain environment, the chance that "good decisions" are made increases with the availability of "good information." The chance that "good information" is available increases with the level of structuring in the process of knowledge management.

Knowledge is more than knowing something technical. Knowledge


needs wisdom. Wisdom is the power to put our time and our
knowledge to the proper use. Wisdom comes with age and
experience. Wisdom is the accurate application of accurate
knowledge. Wisdom is about knowing how something technical
can be best used to meet the needs of the decision-maker. Wisdom,
for example, creates statistical software that is useful, rather than
technically brilliant. For example, ever since the Web entered the
popular consciousness, observers have noted that it puts
information at your fingertips but tends to keep wisdom out of
reach.

Almost every professional needs a statistical toolkit. Statistical skills enable you to intelligently collect, analyze, and interpret data relevant to your decision-making. Statistical concepts enable us to solve problems in a diversity of contexts. Statistical thinking enables you to add substance to your decisions.

The appearance of computer software, JavaScript applets, statistical demonstration applets, and online computation are among the most important developments in teaching and learning concepts in model-based statistical decision-making courses. These tools allow you to construct numerical examples to understand the concepts and to find their significance for yourself.

We will apply the basic concepts and methods of statistics you have already learned in a previous statistics course to real-world problems. The course is tailored to meet your needs in statistical business-data analysis using widely available commercial statistical computer packages such as SAS and SPSS. By doing this, you will inevitably find yourself asking questions about the data and the methods proposed, and you will have the means at your disposal to settle these questions to your own satisfaction.

Statistics is a science that assists you in making decisions under uncertainty (based on some numerical and measurable scales). The decision-making process must be based on data, not on personal opinion or belief.


It is already an accepted fact that "Statistical thinking will one day


be as necessary for efficient citizenship as the ability to read and
write." So, let us be ahead of our time.


Chapter One

Introduction to SPSS

What is a statistical package? It is a computer program or set of programs that provides many different statistical procedures within a unified framework. The advantages of such packages are many. They are much easier to use than most programming languages. They allow you to run complex analyses without getting bogged down in the details of computations, and because of their wide use, they are less likely to have unknown "bugs." The principal disadvantage of such packages is that they sometimes make doing statistics too easy. It is possible to apply complex procedures inappropriately, or to properly apply a procedure and then misinterpret the results. They also do such a nice job of presenting output that the unwary user may be lulled into a sense of complacency, leading to a failure to detect errors (such as reading the data incorrectly). A more common problem, however, is the over-analysis of data. Since analyses are so simple to run, it is very easy to generate a huge pile of output, with the numbers you really need lost somewhere in the middle.

1. Overview of SPSS for Windows


SPSS stands for Statistical Package for the Social Sciences. It provides a powerful statistical analysis and data management system in a graphical environment, based on a user interface facility. The program can be used to analyze data from surveys, tests, observations, etc. It can perform a variety of data analysis and presentation functions, including statistical analysis and graphical presentation of data. Among its features are modules for statistical data analysis. These include 1) descriptive statistics such as frequencies, central tendency, plots, charts, and lists; and 2) sophisticated inferential and multivariate statistical procedures, such as analysis of variance (ANOVA), factor analysis, cluster analysis, and categorical data analysis. SPSS is particularly well suited for survey research, though by no means is it limited to this topic of exploration.

1.1 Launching SPSS


There are many ways to launch SPSS. The easiest is to start it from the Start button at the bottom of the Windows desktop. Click the button, then click Programs, and finally the SPSS 10.0 icon. Another method is to double-click My Computer on the Windows desktop, then the (C:) drive icon, then Programs, and finally the SPSS icon.


The SPSS Data Editor opens, looking approximately like the picture displayed below:

1.2 Windows in SPSS


In running SPSS, you will encounter several windows. The four
most common windows in SPSS are:
Data Editor. This window displays the contents of the current
(working) data file. You can create new data files or modify
existing ones with the Data Editor. The Data Editor window opens
automatically when you start an SPSS session. You can have only
one data file open at a time.

Viewer. This window displays the results of any statistical procedures you run and other text. In particular, tables, statistics, and charts are displayed in the Viewer window. A Viewer window opens automatically the first time you run a procedure that generates output. The window is not accessible until output has been generated.

Chart Editor. This window is used to edit charts and plots. It is


only displayed after SPSS has been requested to produce a plot. You can use the window to change colors, select different fonts or sizes, rotate axes, change the chart type, and the like.

Syntax Editor. Most SPSS commands are accessible from the


SPSS menus and dialog boxes. However, some commands and
options are available only by using the SPSS command language.
In this case the Syntax Window is used. You will also use this
window if you wish to run SPSS commands instead of clicking on
the pull-down menus.
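As a minimal sketch of what you might type in a Syntax window (using the variable age from the example data introduced later in this manual), the following commands produce a frequency table with basic statistics:

```
* Produce a frequency table and summary statistics for one variable.
FREQUENCIES VARIABLES=age
  /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM.
```

Note that each SPSS command ends with a period. To execute the commands, highlight them and click the Run button on the Syntax Editor toolbar.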
Each window in SPSS has its own menu bar with menu selections
appropriate for that window type.

The Analyze and Graphs menus are available on all windows,


making it easy to generate new output without having to switch
windows. Moreover, each SPSS window has its own toolbar that
provides quick, easy access to common tasks.


You can make a window active simply by clicking on any part of the desired window. You can also activate windows by selecting Window from the menu bar in any of the above windows; the bottom of the menu lists all currently open windows. To practice, make the Data Editor window active and click Window in the menu bar. Notice that the Viewer window is not listed in the menu because no statistical procedure has been run yet.

If you want to keep the active cell where it is but view another part of the window, use the scroll arrows along the right and bottom sides of the window. To practice, click the arrow in the direction you want to move in the Data Editor window. Click the down scroll arrow in the vertical scroll bar; the worksheet scrolls down one row. Click the up scroll arrow; the worksheet scrolls up one row. Similarly, the worksheet scrolls left by one column when you click the left scroll arrow in the horizontal scroll bar.

2. Data Editor Window

The Data Editor window opens automatically when you start an


SPSS session. The most important components of the Data Editor
window are menus, toolbar, and status bar. The components are
displayed in the picture below:

1. Data Editor Menus


The menu bar provides easy access to most SPSS features. It
consists of ten drop-down
menus:


2. Data Editor Toolbar


The toolbar provides quick and easy access to many useful features that you may use frequently. SPSS displays the toolbar below the menu bar in the Data Editor window. Clicking any of these buttons performs an action, such as opening a data file or selecting a chart for editing.

In order to determine the function of a tool, place the mouse


pointer over the corresponding button, but don't click the mouse
button. SPSS displays a brief description of the tool in the Status
Bar.


3. Status Bar
The status bar at the bottom of each SPSS window apprises the user of the state of operations. In particular, for each procedure you run, a case counter indicates the number of cases processed so far. There are also messages about the selection of specified subsets of the data set (filter status). The message "Weight On" indicates that a weight variable is being used to weight cases for analysis. When the message "SPSS Processor is ready" appears in the status bar, SPSS is ready to receive your instructions.

4. Dialog Boxes
Most menu selections open dialog boxes. Each dialog box for statistical procedures and charts has several basic components:
- Source variable list: a list of variables in the working data file.
- Target variable list(s): one or more lists indicating the variables you have chosen for analysis, such as dependent and independent variable lists.


- Command push buttons:
o OK – runs the procedure.
o Paste – generates command syntax and pastes it into the Syntax Editor.
o Reset – deselects any selected variables.
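For example, if you open a Descriptives dialog (Analyze, Descriptive Statistics, Descriptives), move the variables age and systolic into the target list, and click Paste instead of OK, SPSS writes syntax similar to the following into the Syntax Editor (the exact subcommands depend on the options you selected):

```
DESCRIPTIVES VARIABLES=age systolic
  /STATISTICS=MEAN STDDEV MIN MAX.
```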

3. Editing User Preferences

SPSS allows you to customize many aspects of the program to suit


your preferences. By selecting Options under the Edit menu, you
can change such features as how your variables are displayed in
output, the format of charts, where page breaks occur in your
output, as well as the font used -- to name just a few.


Some changes from the default that we recommend and that are
used in all of the ITC Public Computing facilities and classrooms
are:

 General Tab: Check the YES button on the "Open Syntax


Window at Startup".
You may also want to change the "Variable Lists" choices. If you know the data well, you may prefer that SPSS show variable names in file order. However, if you are unfamiliar with the data, it is helpful to have SPSS use the variable labels and present the list in alphabetical order. This affects how the variables are displayed in any of the menu windows where you choose variables.


* Notice that SPSS records all of your mouse clicks (as commands) in a journal file. The "Session Journal" section shows you the name of the file into which SPSS writes what you do, and gives you the option either to append to it from session to session or to overwrite it. We recommend that you append to it. You will have to delete it periodically, as it will grow very large (over 10 MB) if you use SPSS often.
 View Tab: Check the YES button on the "Display
Commands in Log". When you're learning SPSS or trying to
debug a problem it's helpful to have SPSS echo your
commands (from your menu choices) into the OUTPUT
window so you can see what you asked and what SPSS did
all in one place. If you're preparing for a presentation or
paper, you may want to turn this back off, so your OUTPUT
window has only your results in it. Notice on this tab, you
can also choose what additional notes and messages SPSS
puts in your OUTPUT.

Below is a brief explanation of some of the features on each of the


Tabs in the dialogue box:


 General: How the journal file is maintained, how variables are displayed in lists, which windows are opened upon start-up, etc.
 Viewer Options: Controls how output is displayed, including page size, alignment, and titles created by the TITLE and SUBTITLE commands.
 Draft Viewer: Controls which items are displayed when you
run procedures, page breaks, and settings for pivot tables.
 Output Label: Controls the display of variable and value
information in the output outline and pivot table, including
the ability to display labels.
 Chart: Controls the height and width of charts, types of fill
patterns used, and grid lines.
 Data Options: Controls transformation and merge options for data, as well as display information for variables created using Recode and Compute.
 Interactive: Controls aspects of interactive charting.
 Pivot Tables: Controls display of tables in output outline.
 Currency: Controls how currency is displayed and allows the
creation of custom formats.
 Script: Allows you to create a script, which is a collection of
subroutines associated with procedures, which can be run
automatically to create certain types of output objects.


4. Reading Data Files

The following are some of the formats that can be read into SPSS, or to which you can save your SPSS data file:

 SPSS (*.sav). SPSS format. Data files saved in SPSS format


cannot be read by versions of the software prior to version
7.5.
 SPSS 7.0 (*.sav). SPSS 7.0 for Windows format. Data files saved in SPSS 7.0 format can be read by SPSS 7.0 and earlier versions of SPSS for Windows, but do not include defined multiple response sets or Data Entry for Windows information.
 SPSS/PC+ (*.sys). SPSS/PC+ format. If the data file contains
more than 500 variables, only the first 500 will be saved. For
variables with more than one defined user-missing value,
additional user-missing values will be recoded into the first
defined user-missing value.
 SPSS portable (*.por). SPSS portable format that can be read
by versions of SPSS on other operating systems (for
example, Macintosh or UNIX).


 SAS data sets (*.sas7bdat, *.saspor). SAS data sets, versions 6 and 8, as well as SAS transport (portable) file formats, can be read into SPSS.
 Tab-delimited (*.dat). ASCII text files with values separated
by tabs.
 Fixed ASCII (*.dat). ASCII text file in fixed format, using
the default write formats for all variables. There are no tabs
or spaces between variable fields.
 Excel (*.xls). Microsoft Excel spreadsheet. The maximum
number of variables is 256.

Note: For spreadsheet and tab-delimited files, SPSS can read variable names contained in the first row of the data file.
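Saving the working data file in one of these formats can also be done from syntax. A minimal sketch, using an illustrative file path:

```
* Save the working data file in SPSS (.sav) format.
* The path below is hypothetical.
SAVE OUTFILE='C:\data\mydata.sav'.
```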

SPSS for Windows can read different types of data files. To read
data files, click on File in the menu bar, and then on Open. The
Open File dialog box is displayed:


4.1 Reading SPSS Data Files


SPSS data files are easily identified: by default, each file name has the ".sav" extension. SPSS data files contain not only the actual data but also information about the data, such as variable names and formats. These files are written in a special code that is read and interpreted by the SPSS program.

To read an SPSS data file, click on File in the menu bar, and then click on Open. This opens the Open File dialog box. Point to the data file you wish to open and click on it. If necessary, use the up and down arrows to scroll through the files until you locate your file. Click OK.

4.2 Reading Other Types Of Data

4.2.1 Reading In A Microsoft Excel File


As with the tab-delimited data, reading in an Excel file is


straightforward.

For spreadsheets, you can read variable names from the first row of the file or the first row of the defined range. If the names are longer than eight characters, they are truncated. If the first eight characters do not form a unique variable name, the name is modified to make it unique.

Once the Excel file is in this format, you can read it into SPSS by going to File, then Open, selecting Excel under the "Files of Type" option box, and locating the file. NB: Since Excel 5 and later files can have multiple worksheets, by default the Data Editor reads the first worksheet.


Once you have located the file, if it is a newer Excel file (Office 98 or later), SPSS will display the following dialog box:

This box asks whether the Excel file has variable names that appear in the first row of the data set. If you do have such variable names, check this box; doing so makes SPSS assign names to each of the new variables. You can also select which worksheet to read if there are multiple worksheets in the file.
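Depending on your SPSS version, Excel files can also be read from syntax. In more recent versions of SPSS, the GET DATA command serves this purpose; in older versions the dialogs described above may be the only route. A sketch with an illustrative path and sheet name:

```
* Read an Excel worksheet; READNAMES=ON takes variable names
* from the first row. Path and sheet name are hypothetical.
GET DATA /TYPE=XLS
  /FILE='C:\data\mydata.xls'
  /SHEET=NAME 'Sheet1'
  /READNAMES=ON.
```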

4.2.2 Reading In A Tab-Delimited Data File

To read in plain ASCII text data delimited by tabs (a common raw data format), go to File, then Open, select Tab-Delimited under the "Files of Type" option box, and locate the file.

As with the Excel file, SPSS will give you the following prompt
about whether you have variable names that appear in the first row
of the data set.


4.2.3 Reading Text Files


SPSS for Windows can also read raw data files that are in text
format. Text data files are usually identified by the ".txt" extension.
These data files do not contain any additional information about
the file. Suppose that the Framingham Heart Study data are given
as a text file displayed below:
Gender Age Systolic
F 59 170
M 35 130
M 46 136
F 43 96
The data for each subject are recorded in the form of three values
separated by tabs. To read a text data file, click on File, then on
Read Text Data. The Open File dialog box opens. Select your file
and click OK. The Text Open Wizard dialog box opens. The
wizard will help you to transfer your data from the text file into the
Data Editor window. The data file is displayed in the preview
window.


The Text Wizard uses six steps to open any text file. In the first step you can apply a predefined format (previously saved in the Text Wizard). As this is not the case for our data file, we check the No button in this step.


In the next step, you are requested to provide information about the
variables in your data file. In particular, you are asked to answer
the question about the arrangement of the variables in the data file.
Fixed width (format) means that each variable is recorded in the
same column for every case. Delimited means that spaces,
commas, tabs, or other characters are used to separate variables.
The variables are recorded in the same order for each case but not
necessarily in the same column locations.
In step 3, you are asked to provide information about cases. In our data file, each subject is a case. Since the top line in the data file introlab.txt contains the variable names (gender, age, systolic), in this step we indicate that the data values start on the second line.
The next three steps are straightforward. Accept all default options provided by the Text Wizard. The data from introlab.txt will be displayed in the Data View window, and the description of all the variables in the Variable View window.
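As an alternative to the wizard, delimited text data can be read or even typed inline with the DATA LIST command. A minimal sketch using the heart-study values shown above (entered inline rather than read from a file):

```
* Define three variables and enter the data inline.
* gender is a one-character string; age and systolic are numeric.
DATA LIST LIST / gender (A1) age systolic.
BEGIN DATA
F 59 170
M 35 130
M 46 136
F 43 96
END DATA.
* Display the data in the Viewer.
LIST.
```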

Chapter Two

1. Working with the Data Editor

The Data Editor window can be displayed in one of two views: Data View or Variable View. The Data View displays the contents of the data file in the form of a spreadsheet. The Variable View defines all variables in the data file. You can switch from one view to the other by clicking the appropriate tab (Data View or Variable View) at the bottom of the Data Editor window (see the picture on page 4).
The Data View window is a grid, whose rows represent subjects
(or cases) and whose columns contain values of the variables
(gender, salary, age etc.) for each subject. Each cell of the grid,
therefore, will usually contain the score of one particular subject
on one particular variable. For example, the salaries of employees
in a company can be presented in a column, and then each
employee is a case.


The cell is the intersection of the case and the variable. Cells
contain only data values. Unlike spreadsheet programs (Excel,
Lotus), cells in the Data Editor cannot contain formulas.

The data file is rectangular. The dimensions of the data file are
determined by the number of cases and variables. Initially, every
column in the Data Editor has the heading var, and all the cells are
empty.

The Variable View contains descriptions of the attributes of each variable in the data file. In the Variable View, rows are variables and columns are variable attributes. In this table you can add or delete variables and modify variable attributes, including the variable name, data type, number of digits, and so on.


We will demonstrate the basic SPSS features using the following


example.

Example: The Blacklion Heart Study followed a cohort of 5209 men and women for over 25 years. The study has been important in identifying risk factors associated with cardiovascular disease. The following table contains data for a random sample of 28 subjects from the study. Below is a description of the variables we have selected from the study for our purpose:
Column Description of Variable
1 Sex (M-Male, F-Female)
2 Age (30-64 years)
3 Systolic blood pressure (82-300 mm Hg)

Now you will enter the data into the Data Editor window. Do not
enter the names of the variables at the top of each column yet.
Follow the instructions below.

1.1. Variables:


The variable name must begin with a letter and cannot end with a period. The length of the name cannot exceed 8 characters. Variable names that end with an underscore should be avoided. Blanks and special characters (such as !, ?, ", and *) cannot be used.

To define a variable, make the Variable View the active window (click the Variable View tab at the bottom of the Data Editor window). This displays the Variable View window.

Enter the new variable name in the Name column in any blank row. For example, enter the name gender in the first row. After entering the name, the default attributes (Type, Width, ...) are automatically assigned. If you then click on the Type column, the Variable Type sub-dialog box appears.
1.2. Variable Type
- Numeric, comma, and dot: you can enter values with any number of decimal positions; the Data Editor displays only the defined number of decimal positions.
- String: all values are right-padded to the maximum width.
- Date: you can use slashes, dashes, spaces, commas, or periods as delimiters between day, month, and year (e.g., dd/mm/yy).
- Time: you can use colons, periods, or spaces.


In our example there are three variables: gender (categorical), age


(numeric), and systolic blood pressure (numeric).

1.3. Variable Label


Although variable names can be only 8 characters long, variable labels can be up to 256 characters long, and these descriptive labels are displayed in output.

1.4. Value Label


You can assign descriptive value labels for each value of a
variable. It can be up to 60 characters long.


To assign a value label, enter the value in the Value text box, enter the label in the Value Label text box, and then click Add.

To define possible values of the variable gender (possible values


M for male and F for female) click the Values cell in the row for
the variable, and then click the button in the cell.

Define the value labels for the variable gender as follows:

In the same way enter the remaining variables Age and Systolic.
Age should be defined as a numeric variable with two digits (to
minimize the chances of transcription error) and Systolic as a
numeric variable with 3 digits.
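These attributes can also be set from syntax. A sketch for the three example variables (the label wordings are illustrative):

```
* Attach descriptive labels to variables and to the codes of gender.
VARIABLE LABELS
  gender   'Sex of subject'
  age      'Age in years'
  systolic 'Systolic blood pressure (mm Hg)'.
VALUE LABELS gender 'M' 'Male' 'F' 'Female'.
```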

1.5 Missing Values


In many situations, data files do not have complete data on all


variables, that is, there are missing values. You need to inform
SPSS when you have missing values so that all computations are
performed correctly. With SPSS, there are two forms of missing values: system-missing and user-defined missing.
System-missing values are those that SPSS automatically treats as
missing. The most common form of this type of value is when
there is a "blank" in the data file. For example, a value for a
variable may not be entered in the data file if the information was
not provided. When SPSS reads this variable, it will read a blank,
and thus treat the value as though it is missing. Any further
computations involving this variable will proceed without the
missing information (computing the average without the missing
value). User-defined missing values are those that the user
specifically informs SPSS to treat as missing. Rather than leaving a
blank in the data file, numbers are often entered that are meant to
represent data. For instance, if the systolic blood pressure for some
subjects in our data set is unknown, we could use the number 9999
to represent those cases that were missing information on the
variable. You need to inform SPSS that 9999 is to be treated as a
missing value; otherwise SPSS will treat it as valid. More precisely, make the Data Editor the active window, select the Variable View, and click the Missing cell for the variable (systolic). A button appears in the cell.

Suppose you define the missing values as displayed in the Missing Values dialog box:

With this definition of the missing values for the variable systolic,
SPSS will treat 9999 as a missing value of the variable and not
include it in any computations involving the systolic blood
pressure variable.
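The effect of declaring 9999 user-missing can be sketched in plain Python with hypothetical data (None plays the role SPSS gives to excluded values):

```python
# Sketch of a user-defined missing value: once 9999 is declared missing,
# it is excluded from computations such as the mean.
USER_MISSING = 9999

def mean_excluding_missing(values, missing=USER_MISSING):
    valid = [v for v in values if v != missing]
    return sum(valid) / len(valid) if valid else None

systolic = [120, 135, 9999, 150]
print(mean_excluding_missing(systolic))  # 135.0 -- the 9999 is ignored
```

Without the declaration, the 9999 would enter the sum and badly distort the average, which is exactly why SPSS must be told which codes are missing.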

To delete a variable, select its row number in the Variable View, click on Edit, and then on Clear. The selected variable will be deleted, and the variables below it will shift up. Alternatively, you can select the row and press the Delete key on your keyboard.

To insert a new variable (row) between existing variables, click the row below the position where you wish to enter the new variable, click on Data on the menu bar, and then click on Insert Variable from the pull-down menu.

1.6 Entering Data

Switch from the Variable View window to the Data View window. The three variables gender, age, and systolic are represented as columns. Now we are ready to enter the values of the three variables from the Blacklion Heart Study. Each row represents a case or an observation; for example, the gender, age, and systolic blood pressure of a particular subject in the Blacklion Heart Study data file are recorded as one row.

Clicking on any cell will highlight it (the active cell), and its contents will appear in the cell editor. You can enter the data in any order. Data values are not recorded until you press Enter or select another cell. Unlike spreadsheet programs, cells in the Data Editor cannot contain formulas.

1.7 Data Value Restrictions


The defined variable type and width determine the type of value
that can be entered in the cell in the data view.
- If you type a character not allowed by the defined variable type, the Data Editor beeps and does not enter the character.
- For string variables, characters beyond the defined width are not allowed.
- For numeric variables, integer values that exceed the defined width can be entered, but the Data Editor displays either scientific notation or asterisks in the cell to indicate that the value is wider than the defined width.

Enter the values for all cases on one variable (column) and then
repeat the procedure for all values in the remaining columns. Enter
the data for our example. You will learn how to save the data in the
next Section.

1. Editing Data
To replace an old value with a new one: click the cell, enter the new value, and press Enter. To modify a data value: click the cell, click the cell editor, edit the data value, and press Enter. To delete the values in a range, select (highlight) the area concerned and press Delete. Use the Undo command on the Edit menu to reverse the last action you performed; for example, use Undo to remove a value you have just entered in the Data Editor window.

2. Adding Cases
To insert a new case (row) between cases that already exist in your data file: click on the row below the position where you wish to enter the new case, click on Data on the menu bar, and then click on Insert Case from the pull-down menu.

3. Deleting Cases
To delete a case, click on the case number that you wish to delete,
click on Edit from the menu, and then on Clear. The selected case
will be deleted and the rows below will shift upward.

4. Inserting new variables


Entering data in an empty column in the Data View, or in an empty row in the Variable View, automatically creates a new variable with a default name (the prefix var and a sequential number). The Data Editor inserts the system-missing value for the new variable. You can also insert new variables between existing variables.


5. Finding variables and cases


- Finding variables: use the Go To push button in the Variables dialog box of the Utilities menu.
- Finding cases: choose Data > Go To Cases.
- Finding a data value: to search for a data value within a variable, select any cell in the column and, from the Edit menu, choose Search for Data.
6. Selecting Cases
There are occasions on which you will want to select a subset of cases from your data file for a particular analysis. You may need to select the subset based on formally defined criteria, or randomly in the case of a very large data file.

To select subset of cases, click on Data in the main menu and then
on Select Cases from the pull-down menu. This opens the Select
Cases dialog box.


Click on the If condition is satisfied radio button, and then on If.... This opens the Select Cases If dialog box displayed below.

When the condition has been completed, click on Continue and then on OK to return to the Data Editor window, where you will notice that a new column labeled filter_$, containing 1s and 0s, has appeared. The 1s and 0s represent the selected and unselected cases, respectively. The row numbers of the unselected cases are also marked with an oblique bar, a useful indicator of case selection status. Any further analyses of the data set will include the selected cases only. The case selection can be cancelled by returning to the Select Cases dialog box, clicking on All cases, and then on OK.
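The filter_$ indicator column can be sketched in plain Python; the data and the gender/age condition here are hypothetical illustrations:

```python
# Sketch of the filter_$ column that Select Cases creates:
# 1 = case selected, 0 = case not selected.
cases = [("M", 50), ("F", 47), ("M", 40)]

filter_flags = [1 if gender == "M" and age > 45 else 0
                for gender, age in cases]
print(filter_flags)  # [1, 0, 0]
```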

1.8 Saving Data Files

To save a new SPSS data file, or to save data in a different format, make the Data Editor the active window and from the menus choose File and then Save As.... The following Save Data As dialog box is displayed:

To save changes to an SPSS data file, make the Data Editor the active window and from the menus choose File and then Save. The modified data file is saved, overwriting the previous version of the file. By default, this will save the data file as an SPSS data file. To save a data file as an ASCII file, you have to use the extension "*.dat". To practice, save the data from the Framingham Heart Study as "introlab.sav".

Chapter Three

1. Data Transformations

After a data set has been entered into SPSS, it may be necessary to
modify it in certain ways. With SPSS, you can perform data
transformations ranging from simple tasks, such as combining
categories for analysis, to more advanced tasks, such as creating
new variables based on complex equations.

1.1 Computing New Variables


To create a new variable, click on Transform in the Data Editor menu, and then on Compute from the pull-down menu. This opens the Compute Variable dialog box.

In the dialog box there are two basic places to focus on.
- Calculator pad: contains numbers, arithmetic operators, relational operators, and logical operators. You can use it just like a calculator.
- Functions: there are over 130 built-in functions, including:
o Arithmetic functions (SQRT, EXP, LG10, LN, SIN, COS, ABS, RND (round))
o Statistical functions (SUM, MEAN, SD, VARIANCE, CFVAR, MIN, MAX)
o Distribution functions
o Logical functions
o Date and time aggregation and extraction functions
o Missing-value functions
o Cross-case functions
o String functions

1. Conditional expressions
You can use conditional expressions to apply transformations to selected cases. A conditional expression returns a value of true, false, or missing for each case.
To specify a conditional expression, click on If... in the Compute Variable dialog box. This opens the If Cases dialog box. You can choose one of the following alternatives:
- Include all cases: values are calculated for all cases, and any conditional expressions are ignored. This is the default.
- Include if case satisfies condition: the expression can include variable names, constants, arithmetic operators, numeric and other functions, logical variables, and relational operators.

2. Variable type and label


By default, new computed variables are numeric. To compute new string variables or to assign descriptive variable labels, click on Type & Label... in the Compute Variable dialog box.

Observe that there is a pound sign (#) icon beside the variable age and the variable systolic. In fact, all numeric variables (age and systolic are numeric) are identified with this icon, while all string variables are identified by an icon with the letter A. Obviously, gender is a string variable. For information about a variable, click the left mouse button on the variable name to select it, then click the right mouse button and choose Variable Information from the pop-up menu. Enter the name of the new variable in the Target Variable box. To build an expression, either paste components into the Expression field or type directly in the Expression field. The If... dialog box allows you to apply data transformations to selected subsets of cases.

For example, to calculate the new variable EXCESS, the excess systolic blood pressure for males, defined as excess = systolic - 125, your Compute Variable dialog box should look like the box on the next page. The excess systolic blood pressure is calculated only for males over 45. When you have completed the expression, click on OK to close the Compute Variable dialog box. You will see the message "Running Execute" at the bottom of the application window, indicating that SPSS is computing the new variable. When the computations are complete, this message will be replaced with "SPSS Processor is Ready" and your new variable will appear in the first empty column in the Data Editor window.
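This conditional Compute can be sketched in plain Python; the rows are hypothetical, and None stands in for SPSS's system-missing value:

```python
# Sketch of a conditional Compute: excess = systolic - 125, calculated
# only for males over 45; all other cases get system-missing (None).
rows = [{"gender": "M", "age": 50, "systolic": 160},
        {"gender": "F", "age": 52, "systolic": 140},
        {"gender": "M", "age": 40, "systolic": 150}]

for row in rows:
    if row["gender"] == "M" and row["age"] > 45:
        row["excess"] = row["systolic"] - 125
    else:
        row["excess"] = None  # system-missing

print([row["excess"] for row in rows])  # [35, None, None]
```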


1.2 Counting Occurrences

To count occurrences of values within cases, from the menus choose:
Transform > Count...

This dialog box creates a variable that counts the occurrences of the same value(s) in a list of variables for each case. For example, a survey might contain a list of magazines with yes/no check boxes to indicate which magazines each respondent reads. You could count the number of yes responses for each respondent to create a new variable that contains the total number of magazines read.
Target variable: the name of the variable that receives the counted value.
Target label: a descriptive variable label for the target variable.
Variables: the numeric or string variables selected from the source list. The list cannot contain both numeric and string variables.
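The magazine-readership example can be sketched in plain Python with hypothetical survey data:

```python
# Sketch of Transform > Count: count "yes" answers across a list of
# variables for each case (magazine readership example).
surveys = [{"mag1": "yes", "mag2": "no", "mag3": "yes"},
           {"mag1": "no",  "mag2": "no", "mag3": "no"}]

variables = ["mag1", "mag2", "mag3"]  # the variables to scan
for case in surveys:
    # nread plays the role of the target variable
    case["nread"] = sum(1 for v in variables if case[v] == "yes")

print([case["nread"] for case in surveys])  # [2, 0]
```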

1.3 Recoding Variables

Some of the analyses to be performed in SPSS will require that a categorical variable be entered into SPSS as a numeric variable. If the variable has already been defined in SPSS as a string, you can easily create a new variable that contains the same information but is numeric, simply by recoding the variable. For example, for the Framingham Heart Study data, you might want to recode the variable gender into a numeric variable, called gendnum, by assigning F=1 and M=2.

You have two options available for recoding variables. You may recode values into the same variable, which eliminates all record of the original values. You also have the option to create a new variable containing the recoded values, which preserves the original values of the variable.

1.3.1 Recoding into the Same Variable


To recode into the same variable, click on Transform from the main menu, then on Recode from the pull-down menu, and finally on Into Same Variables. This opens the Recode Into Same Variables dialog box. For information about a variable, click the right mouse button on the variable name. Then select the name of the variable to be recoded (gender), and move it to the String Variables box with the right arrow button.

Then click on Old and New Values. You will obtain the following
box:


The old value of the variable gender is "F", and the new value is "1". Click on the Add button to record the old value and its new value. Similarly, enter "M" as the old value and "2" as the new value, and click on Add.
When you have indicated all the recode instructions, click on Continue to close the above dialog box. Then click on OK to close the Recode Into Same Variables dialog box. Now gender is no longer expressed as "F" or "M", but as one of the two integers 1 or 2. Close the file without saving the changes you have made and retrieve the original data file introlab.sav.
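The recode just performed amounts to applying a value mapping, which can be sketched in plain Python:

```python
# Sketch of Recode Into Same Variables: F -> 1, M -> 2.
# The original string values are overwritten in place.
gender = ["F", "M", "M", "F"]
recode_map = {"F": 1, "M": 2}

gender = [recode_map[g] for g in gender]
print(gender)  # [1, 2, 2, 1]
```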

1.3.2 Recoding into Different Variables


To recode into a different variable, click on Transform from the main menu, then on Recode from the pull-down menu, and finally on Into Different Variables. This opens the Recode into Different Variables dialog box.


Referring to our example, we might define a new variable agecat expressing two age categories. The following Recode Into Different Variables dialog box is obtained:

Now click on Old and New Values. The old value of the variable age is the range Lowest through 40, and the new value is 1. Then click on the Add button to recode the next range and its new value. Finally, you will obtain the following box:

Now we have two variables expressing the information about the age of the subjects: age and agecat.
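The range recode can be sketched in plain Python (hypothetical ages; note that SPSS ranges include their endpoints, so 40 falls in category 1):

```python
# Sketch of Recode Into Different Variables: collapse age into two
# categories (lowest through 40 -> 1, everything above -> 2), keeping
# the original age variable intact.
ages = [35, 40, 52, 61]

agecat = [1 if age <= 40 else 2 for age in ages]
print(agecat)  # [1, 1, 2, 2]
```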

1.4 Ranking data


To compute ranks, normal and Savage scores, or to classify cases into groups based on percentile values, from the menus choose Transform > Rank Cases.... The Rank Cases dialog box is opened.
Rank Cases creates new variables containing ranks, normal and Savage scores, and percentile values for numeric variables. New variable names and descriptive variable labels are automatically generated, based on the original variable name and the selected measure(s). A summary table lists the original variables, the new variables, and the variable labels.
Optionally, you can:

Rank cases in ascending (smallest value first) or descending (largest value first) order.

Ranking Method
To choose other ranking methods, click on Rank Types... in the Rank Cases dialog box. Available options are: Rank, Savage score, Fractional rank, Fractional rank as %, Sum of case weights, and Ntiles.

Ranking based on normal scores
To create new ranking variables based on proportion estimates and normal scores, click on More >> in the Rank Types dialog box.


You can choose any one or both of the options below appearing in the dialog box.
- Proportion estimates: the estimates of the cumulative proportion (area) of the distribution.
- Normal scores: the new variable contains the Z scores from the standard normal distribution that correspond to the estimated cumulative proportion.

Proportion Estimate Formulas

- Blom. Blom's transformation, defined by the formula (r - 3/8)/(w + 1/4), where w is the number of observations and r is the rank, ranging from 1 to w.
- Tukey. Tukey's transformation, defined by the formula (r - 1/3)/(w + 1/3), where w is the sum of case weights and r is the rank, ranging from 1 to w.
- Rankit. Given by the formula (r - 1/2)/w, where w is the number of observations and r is the rank, ranging from 1 to w.
- Van der Waerden. Given by the formula r/(w + 1), where w is the sum of case weights and r is the rank, ranging from 1 to w.
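The four formulas can be written out directly in plain Python; note that Blom's denominator is conventionally w + 1/4:

```python
# The four proportion-estimate formulas for a value of rank r among
# w observations (w also stands for the sum of case weights when all
# weights are 1).
def blom(r, w):
    return (r - 3/8) / (w + 1/4)

def tukey(r, w):
    return (r - 1/3) / (w + 1/3)

def rankit(r, w):
    return (r - 1/2) / w

def van_der_waerden(r, w):
    return r / (w + 1)

# The middle rank of five falls exactly at the median under Rankit:
print(rankit(3, 5))  # 0.5
```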


1.5 Categorize Variables

Categorize Variables converts continuous numeric data into a discrete number of categories. The procedure creates new variables containing the categorical data. Data are categorized based on percentile groups, with each group containing approximately the same number of cases. For example, a specification of 4 groups would assign a value of 1 to cases below the 25th percentile, 2 to cases between the 25th and 50th percentiles, 3 to cases between the 50th and 75th percentiles, and 4 to cases above the 75th percentile.
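A minimal plain-Python sketch of this percentile grouping, assuming no ties and a sample size divisible by the number of groups:

```python
# Sketch of Categorize Variables with 4 groups: each case is assigned
# its quartile number (1-4) from its position in the sorted data.
values = [7, 1, 9, 3, 5, 11, 2, 8]

def percentile_group(v, data, groups=4):
    below = sum(1 for x in data if x < v)   # cases strictly below v
    return int(below / len(data) * groups) + 1

print([percentile_group(v, values) for v in values])
```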
1.6 Automatic Recode
Automatic Recode converts string and numeric values into
consecutive integers. When category codes are not sequential, the
resulting empty cells reduce performance and increase memory
requirements for many procedures. Additionally, some procedures
cannot use string variables, and some require consecutive integer
values for factor levels.


The new variable(s) created by Automatic Recode retain any defined variable and value labels from the old variable. For any values without a defined value label, the original value is used as the label for the recoded value. A table displays the old and new values and value labels.

String values are recoded in alphabetical order, with uppercase letters preceding their lowercase counterparts. Missing values are recoded into missing values higher than any nonmissing values, with their order preserved. For example, if the original variable has 10 nonmissing values, the lowest missing value would be recoded to 11, and the value 11 would be a missing value for the new variable.
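The core of Automatic Recode, consecutive integers assigned in alphabetical order with uppercase before lowercase, can be sketched in plain Python:

```python
# Sketch of Automatic Recode: distinct string values become consecutive
# integers; ASCII ordering puts uppercase before lowercase, as SPSS does.
values = ["b", "A", "c", "A", "B"]

codes = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
recoded = [codes[v] for v in values]
print(recoded)  # [3, 1, 4, 1, 2] -- "A"=1, "B"=2, "b"=3, "c"=4
```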

To Recode String or Numeric Values into Consecutive Integers

From the menus choose:

Transform > Automatic Recode...

Select one or more variables to recode. For each selected variable, enter a name for the new variable and click New Name.


1.6.1 Recode into Different Variables


Recode into Different Variables reassigns the values of existing variables, or collapses ranges of existing values into new values, for a new variable. For example, you could collapse salaries into a new variable containing salary-range categories.

You can recode numeric and string variables. You can recode
numeric variables into string variables and vice versa. If you
select multiple variables, they must all be the same type. You
cannot recode numeric and string variables together.

You can define values to recode in this dialog box.


Old Value. The value(s) to be recoded. You can recode single
values, ranges of values, and missing values. System-missing
values and ranges cannot be selected for string variables because
neither concept applies to string variables. Old values must be the
same data type (numeric or string) as the original variable. Ranges
include their endpoints and any user-missing values that fall within
the range.
New Value. The single value into which each old value or range of
values is recoded. New values can be numeric or string.


If you want to recode a numeric variable into a string variable, you must also select Output variables are strings. Any old values that are not specified are not included in the new variable, and cases with those values will be assigned the system-missing value for the new variable. To include all old values that do not require recoding, select All other values for the old value and Copy old value(s) for the new value.

1.6.2 Recode into Same Variables


Recode into Same Variables reassigns the values of existing
variables or collapses ranges of existing values into new values.
For example, you could collapse salaries into salary range
categories.
You can recode numeric and string variables. If you select multiple
variables, they must all be the same type. You cannot recode
numeric and string variables together.

1.7 Time Series Data Transformations


Several data transformations that are useful in time series analysis are provided:
- Generate date variables to establish periodicity and to distinguish between historical, validation, and forecasting periods.


- Create new time series variables as functions of existing time series variables.
- Replace system-missing and user-missing values with estimates based on one of several methods.

1.7.1 Define Dates

Define Dates generates date variables that can be used to establish the periodicity of a time series and to label output from time series analysis.

Cases Are. Defines the time interval used to generate dates.

- Not dated removes any previously defined date variables. Any variables with the following names are deleted: YEAR_, QUARTER_, MONTH_, WEEK_, DAY_, HOUR_, MINUTE_, SECOND_, and DATE_.
- Custom indicates the presence of custom date variables created with command syntax (for example, a four-day work week). This item merely reflects the current state of the working data file; selecting it from the list has no effect. (See the Syntax Reference Guide for information on using the DATE command to create custom date variables.) Custom date variables are not available with the Student version.

First Case Is. Defines the starting date value, which is assigned to
the first case. Sequential values, based on the time interval, are
assigned to subsequent cases.

Periodicity at higher level. Indicates the repetitive cyclical variation, such as the number of months in a year or the number of days in a week. The value displayed indicates the maximum value you can enter.
A new numeric variable is created for each component that is used to define the date. The new variable names end with an underscore. A descriptive string variable, DATE_, is also created from the components. For example, if you selected Weeks, days, hours, four new variables are created: WEEK_, DAY_, HOUR_, and DATE_.


If date variables have already been defined, they are replaced when
you define new date variables that will have the same names as the
existing date variables.
To Define Dates for Time Series Data
From the menus choose:

Data > Define Dates...


Select a time interval from the Cases Are list. Enter the value(s)
that define the starting date for First Case Is, which determines the
date assigned to the first case.

Date Variables versus Date Format Variables

Date variables created with Define Dates should not be confused with date format variables, defined with Define Variable. Date variables are used to establish periodicity for time series data, while date format variables represent dates and/or times displayed in various date/time formats. Date variables are simple integers representing the number of days, weeks, hours, etc., from a user-specified starting point. Internally, most date format variables are stored as the number of seconds from October 14, 1582.
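That storage convention, a date as a count of seconds from an origin, can be illustrated in plain Python (using Python's own proleptic Gregorian calendar as an approximation):

```python
# Illustration of storing a date as the number of seconds from an
# origin of October 14, 1582.
from datetime import datetime

origin = datetime(1582, 10, 14)
seconds = (datetime(1582, 10, 15) - origin).total_seconds()
print(seconds)  # 86400.0 -- one day after the origin
```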

1.7.2 Create Time Series


Create Time Series creates new variables based on functions of existing numeric time series variables. These transformed values are useful in many time series analysis procedures. Default new variable names are the first six characters of the existing variable used to create it, followed by an underscore and a sequential number. For example, for the variable PRICE, the new variable name would be PRICE_1. The new variables retain any defined value labels from the original variables.

Available functions for creating time series variables include differences, moving averages, running medians, and lag and lead functions.

To Create a New Time Series Variable. From the menus choose:

Transform > Create Time Series...


Select the time series function you want to use to transform the
original variable(s). Select the variable(s) from which you want to
create new time series variables. Only numeric variables can be
used.

Optionally, you can enter variable names to override the default new variable names, or change the function for a selected variable.

Time Series Transformation Functions

Difference. Nonseasonal difference between successive values in the series. The order is the number of previous values used to calculate the difference. Because one observation is lost for each order of difference, system-missing values appear at the beginning of the series. For example, if the difference order is 2, the first two cases will have the system-missing value for the new variable.
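Differencing can be sketched in plain Python; None stands in for the system-missing values that accumulate at the start of the series:

```python
# Sketch of the Difference function: order-2 differencing leaves the
# first two cases system-missing (None).
series = [10, 12, 15, 19, 24]

def difference(s, order):
    out = list(s)
    for _ in range(order):
        # each pass differences once and prepends one missing value
        out = [None] + [b - a if a is not None else None
                        for a, b in zip(out, out[1:])]
    return out

print(difference(series, 1))  # [None, 2, 3, 4, 5]
print(difference(series, 2))  # [None, None, 1, 1, 1]
```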

Seasonal difference. Difference between series values a constant span apart. The span is based on the currently defined periodicity. To compute seasonal differences, you must have defined date variables (Data menu, Define Dates) that include a periodic component (such as months of the year). The order is the number of seasonal periods used to compute the difference. The number of cases with the system-missing value at the beginning of the series is equal to the periodicity multiplied by the order. For example, if the current periodicity is 12 and the order is 2, the first 24 cases will have the system-missing value for the new variable.

Centered moving average. Average of a span of series values surrounding and including the current value. The span is the number of series values used to compute the average. If the span is even, the moving average is computed by averaging each pair of uncentered means. The number of cases with the system-missing value at the beginning and at the end of the series is n/2 for an even span n and (n - 1)/2 for an odd span n. For example, if the span is 5, the number of cases with the system-missing value at the beginning and at the end of the series is 2.
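The odd-span case can be sketched in plain Python (hypothetical series; the even-span averaging of uncentered means is omitted for brevity):

```python
# Sketch of a centered moving average with an odd span: each value is
# replaced by the mean of the span surrounding and including it, and
# (span - 1) / 2 cases at each end become system-missing (None).
series = [4, 8, 6, 10, 12, 2, 0]

def centered_ma(s, span):
    half = (span - 1) // 2
    out = []
    for i in range(len(s)):
        if i < half or i > len(s) - 1 - half:
            out.append(None)          # not enough neighbours
        else:
            window = s[i - half:i + half + 1]
            out.append(sum(window) / span)
    return out

print(centered_ma(series, 5))  # [None, None, 8.0, 7.6, 6.0, None, None]
```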

Prior moving average. Average of the span of series values preceding the current value. The span is the number of preceding series values used to compute the average. The number of cases with the system-missing value at the beginning of the series is equal to the span value.

Running median. Median of a span of series values surrounding and including the current value. The span is the number of series values used to compute the median. If the span is even, the median is computed by averaging each pair of uncentered medians. The number of cases with the system-missing value at the beginning and at the end of the series is n/2 for an even span n and (n - 1)/2 for an odd span n. For example, if the span is 5, the number of cases with the system-missing value at the beginning and at the end of the series is 2.

Cumulative sum. Cumulative sum of series values up to and including the current value.
Lag. Value of a previous case, based on the specified lag order.
The order is the number of cases prior to the current case from
which the value is obtained. The number of cases with the system-
missing value at the beginning of the series is equal to the order
value.
Lead. Value of a subsequent case, based on the specified lead
order. The order is the number of cases after the current case from
which the value is obtained. The number of cases with the system-
missing value at the end of the series is equal to the order value.
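Lag and lead can be sketched in plain Python; None again stands in for the system-missing values:

```python
# Sketch of Lag and Lead: an order-k lag leaves k system-missing values
# (None) at the start of the series; a lead leaves them at the end.
series = [5, 7, 9, 11]

def lag(s, order=1):
    return [None] * order + list(s[: len(s) - order])

def lead(s, order=1):
    return list(s[order:]) + [None] * order

print(lag(series, 2))   # [None, None, 5, 7]
print(lead(series, 1))  # [7, 9, 11, None]
```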

Smoothing. New series values based on a compound data smoother. The smoother starts with a running median of 4, which is centered by a running median of 2. It then resmoothes these values by applying a running median of 5, a running median of 3, and hanning (running weighted averages). Residuals are computed by subtracting the smoothed series from the original series. This whole process is then repeated on the computed residuals. Finally, the smoothed residuals are computed by subtracting the smoothed values obtained the first time through the process. This is sometimes referred to as T4253H smoothing.

1.8 Replace Missing Values

Missing observations can be problematic in analysis, and some time series measures cannot be computed if there are missing values in the series. Replace Missing Values creates new time series variables from existing ones, replacing missing values with estimates computed by one of several methods.
Default new variable names are the first six characters of the existing variable used to create it, followed by an underscore and a sequential number. For example, for the variable PRICE, the new variable name would be PRICE_1. The new variables retain any defined value labels from the original variables.


To Replace Missing Values for Time Series Variables

From the menus choose:

Transform > Replace Missing Values...

Select the estimation method you want to use to replace missing values, and select the variable(s) for which you want to replace missing values.

Optionally, you can enter variable names to override the default new variable names, or change the estimation method for a selected variable.
1.8.1 Estimation Methods for Replacing Missing Values
Series mean. Replaces missing values with the mean of the entire series.
Mean of nearby points. Replaces missing values with the mean of valid surrounding values. The span of nearby points is the number of valid values above and below the missing value used to compute the mean.
Median of nearby points. Replaces missing values with the median of valid surrounding values. The span of nearby points is the number of valid values above and below the missing value used to compute the median.

Linear interpolation. Replaces missing values using a linear


interpolation. The last valid value before the missing value and the
first valid value after the missing value are used for the
interpolation. If the first or last case in the series has a missing
value, the missing value is not replaced.
Linear trend at point. Replaces missing values with the linear
trend for that point. The existing series is regressed on an index
variable scaled 1 to n. Missing values are replaced with their
predicted values.
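
Two of the estimation methods above are easy to sketch outside SPSS. The following plain-Python sketch (hypothetical series, with None standing for a missing value; an illustration, not SPSS syntax) implements the series-mean and linear-interpolation rules as described:

```python
def replace_with_series_mean(series):
    """Replace missing values (None) with the mean of all valid values."""
    valid = [v for v in series if v is not None]
    mean = sum(valid) / len(valid)
    return [mean if v is None else v for v in series]

def replace_with_linear_interpolation(series):
    """Interpolate each missing value between the last valid value before it
    and the first valid value after it; leading/trailing missing values are
    left unreplaced, as described above."""
    out = list(series)
    for i, v in enumerate(series):
        if v is None:
            before = next((j for j in range(i - 1, -1, -1) if series[j] is not None), None)
            after = next((j for j in range(i + 1, len(series)) if series[j] is not None), None)
            if before is not None and after is not None:
                frac = (i - before) / (after - before)
                out[i] = series[before] + frac * (series[after] - series[before])
    return out

price = [10.0, None, 14.0, 16.0, None, 20.0]
print(replace_with_series_mean(price))           # -> [10.0, 15.0, 14.0, 16.0, 15.0, 20.0]
print(replace_with_linear_interpolation(price))  # -> [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```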

2. Sorting Data
Suppose that we would like to sort the data in the data file
introlab.sav according to the age of the subjects enrolled in the
study. In order to sort the data, from the menus choose Data, and
then Sort Cases. The following dialog box will be displayed:


In order to sort the subjects according to age, select age and
move it to the Sort by box. You can sort cases in ascending and
descending order. If you select multiple sort variables, cases are
sorted by each variable within category of the prior variable on the
Sort list. For example, if you select gender as the first sorting
variable and age as the second sorting variable, cases will be sorted
by age classification within each gender category. For string
variables, uppercase letters precede their lowercase counterparts in
sort order. For example, the string value "F" comes before "f" in
sort order.
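
The multi-variable sort and the case-sensitive string ordering described above can be sketched in plain Python (hypothetical cases; Python's default string comparison is also ASCII-based, so uppercase sorts first):

```python
cases = [
    {"gender": "f", "age": 34},
    {"gender": "F", "age": 29},
    {"gender": "m", "age": 41},
    {"gender": "m", "age": 25},
]
# Sort by gender first, then by age within each gender category.
ordered = sorted(cases, key=lambda c: (c["gender"], c["age"]))
print([(c["gender"], c["age"]) for c in ordered])
# -> [('F', 29), ('f', 34), ('m', 25), ('m', 41)]
```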

3. Transpose

Transpose creates a new data file in which the rows and columns in
the original data file are transposed so that cases (rows) become
variables and variables (columns) become cases. Transpose
automatically creates new variable names and displays a list of the
new variable names.
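
The row/column swap itself is simple to picture with nested lists (hypothetical values; rows are cases, columns are variables):

```python
data = [[1, 2, 3],
        [4, 5, 6]]                              # 2 cases x 3 variables
transposed = [list(col) for col in zip(*data)]  # 3 cases x 2 variables
print(transposed)  # -> [[1, 4], [2, 5], [3, 6]]
```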


Combining Data files


You can combine two files in two different ways:
- Merge files containing the same variables but different cases.
- Merge files containing the same cases but different variables.

3.1 Add Cases


With one of the files open, you can add cases to it from an
external SPSS file. To add cases, from the menus choose:
Data > Merge Files > Add Cases...
This opens the Add Cases: Read File dialog box. After selecting the
file to be included, a dialog box with the list of variables appears.
The boxes that appear are:


Unpaired variables: variables to be excluded from the new,
merged data file. Variables from the working data file are identified
with an asterisk (*). Variables from the external data file are
identified with a plus sign (+).
Variables in the new working data file: variables to be included
in the new, merged data file. By default, all the variables that match
in both name and data type are included on the list.
Indicate case source as variable:
- Working file – 0
- External file – 1
Selecting Variables
If the same information is recorded under different variable names
in the two files, you can create a pair from the unpaired variable
list. Select the two variables on the list and click on Pair. To
include an unpaired variable from one file without pairing it with a
variable from the other file, select the variable on the Unpaired
Variables list and click the arrow button. Any unpaired variables included in
the merged file will contain missing data for cases from the file
that does not contain that variable.
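
The Add Cases behaviour (stacking cases, missing data for unpaired variables, and a 0/1 case-source indicator) can be sketched with lists of dicts; all variable names here are hypothetical:

```python
# Two "files" with the same variables but different cases; "income" is an
# unpaired variable present only in the working file.
working = [{"id": 1, "age": 30, "income": 1200}]
external = [{"id": 2, "age": 45}]

all_vars = ["id", "age", "income"]
merged = []
for source, file_cases in ((0, working), (1, external)):   # 0 = working, 1 = external
    for case in file_cases:
        row = {v: case.get(v) for v in all_vars}           # absent variable -> None
        row["source_"] = source                            # case-source indicator
        merged.append(row)

print(merged[1])  # -> {'id': 2, 'age': 45, 'income': None, 'source_': 1}
```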

3.2 Add Variables


Add Variables merges the working data file with an external data
file that contains the same cases but different variables. For
example, you might want to merge a data file that contains pre-test
results with one that contains post-test results.

Cases must be sorted in the same order in both data files.


If one or more key variables are used to match cases, the two data
files must be sorted by ascending order of the key variable(s).
Variable names in the second data file that duplicate variable
names in the working data file are excluded by default because
Add Variables assumes that these variables contain duplicate
information.

To Merge Files with the Same Cases but Different Variables

Open one of the data files. From the menus choose: Data >
Merge Files > Add Variables...

Select the data file to merge with the open data file.

Select the variables from the external file (+) on the
Excluded Variables list. Select Match cases on key variables in
sorted files. Add the variables to the Key Variables list.

The key variables must exist in both the working data file and the
external data file. Both data files must be sorted by ascending order
of the key variables, and the order of variables on the Key
Variables list must be the same as their sort sequence.

Excluded Variables. Variables to be excluded from the new,
merged data file. By default, this list contains any variable names
from the external data file that duplicate variable names in the
working data file. Variables from the working data file are
identified with an asterisk (*). Variables from the external data file
are identified with a plus sign (+). If you want to include an
excluded variable with a duplicate name in the merged file, you
can rename it and add it to the list of variables to be included.

New Working Data File. Variables to be included in the new,
merged data file. By default, all unique variable names in both data
files are included on the list.

Key Variables. If some cases in one file do not have matching
cases in the other file (that is, some cases are missing in one file),
use key variables to identify and correctly match cases from the
two files. You can also use key variables with table lookup files.

The key variables must have the same names in both data files.
Both data files must be sorted by ascending order of the key
variables, and the order of variables on the Key Variables list must
be the same as their sort sequence.
Cases that do not match on the key variables are included in the
merged file but are not merged with cases from the other file.
Unmatched cases contain values for only the variables in the file
from which they are taken; variables from the other file contain the
system-missing value.
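
The key-variable match, including unmatched cases receiving system-missing values, can be sketched in plain Python (hypothetical pre-test/post-test data, with None standing for system-missing):

```python
# Pre-test and post-test "files" sharing the key variable id.
pretest  = [{"id": 1, "pre": 10}, {"id": 2, "pre": 12}]
posttest = [{"id": 2, "post": 15}, {"id": 3, "post": 11}]

pre_by_id  = {c["id"]: c for c in pretest}
post_by_id = {c["id"]: c for c in posttest}

# Unmatched cases are kept, with None for the other file's variables.
merged = [{"id": i,
           "pre":  pre_by_id.get(i, {}).get("pre"),
           "post": post_by_id.get(i, {}).get("post")}
          for i in sorted(set(pre_by_id) | set(post_by_id))]
print(merged[0])  # -> {'id': 1, 'pre': 10, 'post': None}
```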

External file or Working data file is keyed table. A keyed table,
or table lookup file, is a file in which data for each "case" can be
applied to multiple cases in the other data file. For example, if one
file contains information on individual family members (such as
sex, age, education) and the other file contains overall family
information (such as total income, family size, location), you can
use the file of family data as a table lookup file and apply the
common family data to each individual family member in the
merged data file.
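
The table-lookup idea is a many-to-one match; a minimal Python sketch with hypothetical family data:

```python
# One family record applied to every member of that family.
members  = [{"fam": 1, "name": "A"}, {"fam": 1, "name": "B"}, {"fam": 2, "name": "C"}]
families = {1: {"income": 900}, 2: {"income": 1500}}   # the keyed (lookup) table

merged = [dict(m, income=families[m["fam"]]["income"]) for m in members]
print([m["income"] for m in merged])  # -> [900, 900, 1500]
```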

4. Aggregate Data


Aggregate Data combines groups of cases into single summary
cases and creates a new aggregated data file. Cases are aggregated
based on the value of one or more grouping variables. The new
data file contains one case for each group. For example, you could
aggregate county data by state and create a new data file in which
state is the unit of analysis.

To Aggregate a Data File, from the menus choose: Data >
Aggregate...

Select one or more break variables that define how cases are
grouped to create aggregated data. Select one or more aggregate
variables to include in the new data file. Select an aggregate
function for each aggregate variable.


Optionally, you can override the default aggregate variable names
with new variable names, provide descriptive variable labels, and
create a variable that contains the number of cases in each break
group.

Break Variable(s). Cases are grouped together based on the
values of the break variables.
Aggregate Variable(s). Variables are used with aggregate
functions to create the new variables for the aggregated file.

To create a variable containing the number of cases in each break
group, select the option Save number of cases in break groups
as variable.
To specify the file name and location of the aggregated data file,
choose one of the following:
Create new data file or Replace working data file

Aggregate Functions: Aggregate functions are applied to existing
variables to create new variables in the aggregated file. In the
Aggregate Function dialog box, the following functions are
available:


- Summary functions, including mean, standard deviation, and sum.
- Percentage or fraction of values above or below a specified value.
- Percentage or fraction of values inside or outside a specified range.
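
The break-variable/aggregate-function idea can be sketched in plain Python (hypothetical county records; the break variable is state, the aggregate function is the mean, plus a per-group case count):

```python
from collections import defaultdict

# (break variable "state", aggregate variable "value") per county case.
counties = [("TX", 10.0), ("TX", 20.0), ("NY", 30.0)]

groups = defaultdict(list)
for state, value in counties:
    groups[state].append(value)

# One summary case per break group: the mean plus a case count.
aggregated = {state: {"mean": sum(vals) / len(vals), "n": len(vals)}
              for state, vals in groups.items()}
print(aggregated)  # -> {'TX': {'mean': 15.0, 'n': 2}, 'NY': {'mean': 30.0, 'n': 1}}
```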

5. Split File
Split File splits the data file into separate groups for analysis based
on the values of one or more grouping variables.

To Split a Data File for Analysis, from the menus choose: Data >
Split File...

Select Compare groups or Organize output by groups. Select
one or more grouping variables.


If the data file isn't already sorted by values of the grouping
variables, select Sort file by grouping variables.
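
The effect of Split File (the same analysis repeated separately per group) can be sketched in plain Python with hypothetical data:

```python
import statistics

# Hypothetical cases: (grouping variable, measurement).
data = [("male", 70), ("female", 65), ("male", 74), ("female", 63)]

by_group = {}
for group, value in data:
    by_group.setdefault(group, []).append(value)

# One summary (here, a mean) per group of the split variable.
means = {g: statistics.mean(v) for g, v in sorted(by_group.items())}
print(means)  # -> {'female': 64, 'male': 72}
```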

6. Selecting Subsets of Cases


Select Cases provides several methods for selecting a subgroup of
cases based on criteria that include variables and complex
expressions. You can also select a random sample of cases. The
criteria used to define a subgroup can include:

- Variable values and ranges
- Date and time ranges
- Case (row) numbers
- Arithmetic expressions
- Logical expressions
- Functions

Options available in the Select list of the Select Cases dialog box:
- All cases
- If condition satisfied
- Random sample of cases
- Based on time or case range
- Use filter variable


Unselected Cases. You can filter or delete cases that don’t meet
the selection criteria. Filtered cases remain in the data file but are
excluded from analysis. Select Cases creates a filter variable,
FILTER_$, to indicate filter status. Selected cases have a value of
1; filtered cases have a value of 0. Filtered cases are also indicated
with a slash through the row number in the Data Editor. To turn
filtering off and include all cases in your analysis, select All cases.

Deleted cases are removed from the data file and cannot be
recovered if you save the data file after deleting the cases.
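
The filter mechanism above can be sketched in plain Python (hypothetical cases; the condition age >= 30 plays the role of "If condition satisfied", and FILTER_$ mirrors the SPSS filter variable):

```python
cases = [{"age": 25}, {"age": 40}, {"age": 33}]
for c in cases:
    c["FILTER_$"] = 1 if c["age"] >= 30 else 0        # 1 = selected, 0 = filtered
selected = [c for c in cases if c["FILTER_$"] == 1]   # filtered cases stay in `cases`
print([c["FILTER_$"] for c in cases])  # -> [0, 1, 1]
```

Note that, unlike deletion, filtering keeps every case in the data: clearing the filter restores the full file for analysis.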

7. Weight Cases
Weight Cases gives cases different weights (by simulated
replication) for statistical analysis.


The values of the weighting variable should indicate the number of
observations represented by single cases in your data file.
Cases with zero, negative, or missing values for the weighting
variable are excluded from analysis. Fractional values are valid;
they are used exactly where this is meaningful, and most likely
where cases are tabulated.

Once you apply a weight variable, it remains in effect until you
select another weight variable or turn off weighting. If you save a
weighted data file, weighting information is saved with the data
file. You can turn off weighting at any time, even after the file has
been saved in weighted form.
Weights in Crosstabs. In the Crosstabs procedure, cell counts
based on fractional weights are rounded to the nearest integer. For
example, a cell count of 4.2 based on fractional weights is rounded
to 4.
Weights in scatterplots and histograms. Scatterplots and
histograms have an option for turning case weights on and off, but
this does not affect cases with a zero, negative, or missing value
for the weight variable. These cases remain excluded from the
chart even if you turn weighting off from within the chart.
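
The weighting rules above (excluded zero/negative/missing weights; fractional cell counts rounded in a crosstab) can be sketched in plain Python with hypothetical responses and weights:

```python
cases = [("yes", 2.0), ("yes", 2.2), ("no", 1.4), ("no", 0.0)]  # (response, weight)
counts = {}
for resp, w in cases:
    if w is None or w <= 0:
        continue                  # zero, negative, or missing weight: case excluded
    counts[resp] = counts.get(resp, 0.0) + w

# Crosstab-style display: fractional counts rounded to the nearest integer.
rounded = {k: round(v) for k, v in counts.items()}
print(rounded)  # -> {'yes': 4, 'no': 1}
```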


Chapter Four

1. Working with the Viewer


All statistical results, tables, and charts are displayed in the Viewer.
A Viewer window opens automatically the first time you run a
procedure that generates output. In this window, you can easily
navigate to whichever part of the output you want to see. You can
also use the Viewer to show or hide selected tables or results from
an entire procedure. This is useful when you want to shorten the
amount of visible output in the contents pane.
The Viewer is divided into two panes: the left pane contains an
outline view of the contents (the outline pane), and the right pane
contains statistical tables, charts, and text output (the contents
pane).

The four output components (Title, Notes, Statistics, and Region)
have been obtained with the Frequencies procedure.


An open book icon next to an item in the outline pane (Region)
indicates that it is currently visible in the contents pane. To hide a
table or a chart in the display without deleting it, double-click its
book icon in the outline pane. The open book icon changes to a
closed book icon, indicating that the item is now hidden.
You can click and drag the right border of the outline pane to
change the width of the outline pane. We will demonstrate how to
navigate through your results in the Viewer window later using the
blacklion Heart Study data.

Chapter Five
1. Evaluating Assumptions

Many statistical procedures, such as analysis of variance, require
that all groups come from normal populations with the same
variance. Therefore, before carrying out such an analysis, we
need to test the hypothesis that all the group variances are equal or
that the samples come from normal populations. If it appears that
the assumptions are violated, we may want to determine
appropriate transformations.

The Levene Test


The Levene test is a homogeneity-of-variance test: it is used to see
whether different groups come from populations with the same
variance. It is obtained by computing, for each case, the absolute
difference from its cell mean and performing a one-way analysis of
variance on these differences.
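
That construction (absolute deviations from each group mean, then a one-way ANOVA F on those deviations) fits in a short Python sketch; the data are hypothetical:

```python
def levene_F(*groups):
    """Levene statistic: one-way ANOVA F computed on each case's absolute
    deviation from its own group mean."""
    devs = []
    for g in groups:
        m = sum(g) / len(g)
        devs.append([abs(x - m) for x in g])
    N = sum(len(d) for d in devs)
    k = len(devs)
    grand = sum(sum(d) for d in devs) / N
    ss_between = sum(len(d) * (sum(d) / len(d) - grand) ** 2 for d in devs)
    ss_within = sum(sum((x - sum(d) / len(d)) ** 2 for x in d) for d in devs)
    return (ss_between / (k - 1)) / (ss_within / (N - k))

# Identical spread in both groups -> F is 0; unequal spread -> F grows.
print(levene_F([1, 2, 3], [11, 12, 13]))  # -> 0.0
```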

Tests of Normality
To test whether our data have come from a normal distribution, we
can use the normal probability plot. In a normal probability plot,
each observed value is paired with its expected value from the
normal distribution. If the sample is from a normal distribution, we
expect that the points will fall more or less on a straight line.

(A normal probability plot can be made by plotting the actual values or by plotting the residuals.)

The normal probability plot provides a visual check of normality.
However, there are also formal statistical tests of normality.
Two commonly used tests are the Shapiro-Wilk test and the


Lilliefors test. The Lilliefors test is used when means and
variances are not known but must be estimated from the data.
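
The observed-versus-expected pairing behind a normal probability plot can be sketched with the standard library's NormalDist (Python 3.8+). The plotting position (i - 0.5)/n used here is one common convention, and the sample is hypothetical:

```python
from statistics import NormalDist

def probability_plot_pairs(sample):
    """Pair each sorted observed value with the expected standard-normal
    quantile for its rank; roughly linear pairs suggest normality."""
    xs = sorted(sample)
    n = len(xs)
    expected = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, xs))

pairs = probability_plot_pairs([3.1, 1.4, 2.2])
print([round(e, 2) for e, _ in pairs])  # expected quantiles, symmetric about 0
```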

2. Exploring your data


To run the Explore procedure, from the menus choose:
Analyze > Summarize > Explore...

The Explore procedure provides a variety of descriptive plots and
statistics, including stem-and-leaf plots, boxplots, normal
probability plots, and spread-versus-level plots. The Levene
test for homogeneity of variance, the Shapiro-Wilk and Lilliefors
tests for normality, and several other estimators of location are
also available.

When the Explore dialog box opens, the following options are
available.


- Both. Displays plots and statistics.
- Statistics. Displays statistics only.
- Plots. Displays plots only (suppresses all statistics).
There are also pushbuttons for Statistics..., Plots..., and Options...
To obtain robust estimators or to display outliers, percentiles, or
frequency tables, select Both or Statistics under Display and click
on Statistics... in the Explore dialog box. The Explore: Statistics
dialog box then shows the available options.

The Explore: Options dialog box controls the handling of missing
values. The following alternatives are available:
- Exclude cases listwise. Cases with missing values for any
dependent or factor variable are excluded from all analyses.
- Exclude cases pairwise. Cases with no missing values for
variables in a cell are included in the analysis of that cell. The
case may have missing values for variables used in other
cells.
- Report values. Missing values for factor variables are treated
as a separate category.
Boxplots. These alternatives control the display of boxplots when
you have more than one dependent variable. Factor levels together
generates a separate display for each dependent variable. Within a
display, boxplots are shown for each of the groups defined by a
factor variable. Dependents together generates a separate display
for each group defined by a factor variable. Within a display,
boxplots are shown side by side for each dependent variable. This
display is particularly useful when the different variables represent
a single characteristic measured at different times.

3. Cross Tabulation And Measures Of Association


The Crosstabs procedure forms two-way and multiway tables and
provides a variety of tests and measures of association for two-way
tables. The structure of the table and whether categories are
ordered determine what test or measure to use.
Crosstabs’ statistics and measures of association are computed for
two-way tables only. If you specify a row, a column, and a layer
factor (control variable), the Crosstabs procedure forms one panel
of associated statistics and measures for each value of the layer
factor (or a combination of values for two or more control
variables). To obtain crosstabulations and related statistics, from
the menus choose:
Analyze > Summarize > Crosstabs...
Although examination of the various row and column percentages
in a crosstabulation is a useful first step in studying the relationship
between two variables, row and column percentages do not allow
for quantification or testing of that relationship. For these purposes,


it is useful to consider various indexes that measure the extent of
association, as well as statistical tests of the hypothesis that there is
no association.
3.1 The Chi Square Test of Independence
The chi-square statistic (Pearson chi-square) is used to test the
hypothesis that the row and column variables are independent. The
formula for chi-square is given as follows:

X² = Σ [(Oij – Eij)² / Eij]

where Oij = observed value and Eij = expected value.

The calculated value is compared to the critical points of the
theoretical chi-square distribution to produce an estimate of how
likely (or unlikely) this calculated value is if the variables are
in fact independent. Because the value of chi-square depends on the
number of rows and columns in the table being examined, you
have to know the degrees of freedom for the table: for a table of r
rows and c columns, df = (r – 1) × (c – 1), since once (r – 1)
rows and (c – 1) columns are filled, frequencies in the remaining
row and column cells must be chosen so that marginal totals are
maintained. To produce a chi-square measure, in the Crosstabs


dialog box, click on the Statistics... pushbutton and check the
Chi-square check box.
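
A minimal Python sketch of this computation (hypothetical counts; expected frequencies are formed from the row and column totals in the usual way):

```python
def chi_square(table):
    """Pearson chi-square and degrees of freedom for a two-way table,
    following X² = sum over cells of (O - E)² / E, with E from the margins."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return x2, df

x2, df = chi_square([[10, 20], [20, 10]])   # hypothetical 2 x 2 counts
print(round(x2, 4), df)  # -> 6.6667 1
```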

3.2 Measures of Association


The chi-square measure does not provide much information about the
strength or form of the association existing between variables. The
magnitude of the observed chi-square depends not only on the
goodness of fit of the independence model but also on the sample
size.
There are measures of the degree or form of association. They are
divided into nominal measures, ordinal measures, and interval
measures.
Nominal measures
Nominal measures can provide only some indication of the
strength of association between variables; they cannot indicate
direction or anything about the nature of the relationship. Two
types of measures are provided: those based on chi-square
statistics and those that follow the logic of proportional reduction
in error.

Chi-square-based measures
The phi coefficient, which is a modification of the Pearson
chi-square, is

φ = √(X² / N)


For a 2 × 2 table only, the phi coefficient is equal to the Pearson
correlation coefficient, so the sign of phi matches that of the
correlation coefficient. Since the value of phi may not lie between
0 and 1 for tables in which one dimension is greater than 2,
Pearson suggested the following formula:

 2

C =
 +N
2

This index is known as coefficient of contingency


Again, the formula above has some limitations. Although the
value of C lies between 0 and 1, it cannot generally attain the upper
limit of 1. Cramér introduced the following variant:

 2

N (k-
V= 1)

where k is the smaller of the number of rows and columns. The
statistic is known as Cramér's V and can attain the maximum of 1
for tables of any dimension.
You can include measures of nominal data in your crosstabulation
analysis by checking the appropriate check boxes in the
Crosstabs: Statistics dialog box.
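
The three chi-square-based coefficients above can be computed directly from X², N, and the table dimensions; a small Python sketch with hypothetical inputs:

```python
import math

def nominal_measures(x2, n, rows, cols):
    """Phi, contingency coefficient C, and Cramer's V from a chi-square value."""
    phi = math.sqrt(x2 / n)
    c = math.sqrt(x2 / (x2 + n))
    k = min(rows, cols)                 # smaller of the number of rows and columns
    v = math.sqrt(x2 / (n * (k - 1)))
    return phi, c, v

phi, c, v = nominal_measures(6.0, 60, 2, 2)   # hypothetical X² and N
print(phi == v)  # -> True (phi equals V for a 2 x 2 table)
```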


Proportional Reduction in Error

Alternatives to the chi-square-based measures are those based
on the idea of proportional reduction in error (PRE), introduced
by Goodman and Kruskal (1954). They are all essentially ratios of a
measure of error in predicting the values of one variable based on
knowledge of that variable alone to the same measure of error
applied to predictions based on knowledge of an additional
variable. For measures based on proportional reduction in error, in
the Crosstabs: Statistics dialog box check the Lambda check box.

Ordinal Measures

Since ordinal variables include order (rank), the additional
information from ranking may be reflected in measures of
association. Consideration of the kinds of relationships that may
exist between two ordered variables leads to the notion of direction
of relationship and to the concept of correlation. Variables are
positively correlated if cases with high values on one also tend to
have high values on the other. Negatively correlated variables show
the opposite relationship: the higher the first variable, the lower the
second tends to be.

The Spearman correlation coefficient is a commonly used measure
of correlation between two ordinal variables. For all of the cases,


the values of each of the variables are ranked from smallest to
largest, and the Pearson correlation coefficient is computed on the
ranks.
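
The rank-then-correlate construction can be sketched in a few lines of Python (no-ties case, hypothetical scores):

```python
def ranks(values):
    """Rank values from smallest (rank 1) to largest (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    # Spearman's coefficient = Pearson's r computed on the ranks.
    return pearson(ranks(x), ranks(y))

print(spearman([1, 5, 9], [2, 3, 10]))  # -> 1.0 (perfectly monotone)
```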

Ordinal Measures Based on Pairs

There are several measures of association for a table of two
ordered variables that are based on a comparison of the values of
both variables. Cases are first compared to determine whether they
are concordant, discordant, or tied. The following ordinal measures
all have the same numerator: the number of concordant pairs (P)
minus the number of discordant pairs (Q), calculated for all distinct
pairs of observations. If there are no pairs with ties, this measure
(Kendall's tau-a) is in the range from –1 to +1.

A measure that attempts to normalize P – Q by considering ties on
each variable in a pair separately, but not ties on both variables in a
pair, is tau-b:

tau-b = (P – Q) / √[(P + Q + Tx)(P + Q + Ty)]

where Tx is the number of pairs tied on X but not on Y, and Ty is
the number of pairs tied on Y but not on X. If no marginal
frequency is 0, tau-b can attain +1 or –1 only for a square table.
Another measure that can attain, or nearly attain, +1 or –1 for any
r × c table is tau-c:


tau-c = 2m(P – Q) / [N²(m – 1)]

where m is the smaller of the number of rows and columns. The
coefficients tau-b and tau-c do not differ much in value if each
margin contains approximately equal frequencies.
Goodman and Kruskal's gamma is closely related to the tau
statistics and is calculated as
G = (P – Q) / (P + Q)
Gamma can be thought of as the probability that a random pair of
observations is concordant minus the probability that the pair is
discordant, assuming the absence of ties.
For ordinal measures, mark the corresponding check boxes in the
Crosstabs: Statistics dialog box to generate the measures you want.
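
The concordant/discordant bookkeeping behind these pair-based measures can be sketched as follows (hypothetical ratings; brute force over all distinct pairs):

```python
from itertools import combinations
import math

def pair_counts(x, y):
    """Count concordant (P), discordant (Q), and singly tied pairs."""
    P = Q = Tx = Ty = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x2 - x1, y2 - y1
        if dx * dy > 0:
            P += 1
        elif dx * dy < 0:
            Q += 1
        elif dx == 0 and dy != 0:
            Tx += 1          # tied on X only
        elif dy == 0 and dx != 0:
            Ty += 1          # tied on Y only
    return P, Q, Tx, Ty

def tau_b(x, y):
    P, Q, Tx, Ty = pair_counts(x, y)
    return (P - Q) / math.sqrt((P + Q + Tx) * (P + Q + Ty))

def gamma(x, y):
    P, Q, _, _ = pair_counts(x, y)
    return (P - Q) / (P + Q)

print(gamma([1, 1, 2], [1, 2, 3]))  # -> 1.0 (tied pairs ignored by gamma)
```
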
Measures Involving Interval Data
For two variables measured on an interval scale, various measures
of association are available to express the degree and form of the
relationship that exists between the two variables. The well-known
symmetric coefficient that measures the linear relationship is the
Pearson correlation coefficient (r). Its value ranges from –1 to +1.
The formula for the Pearson coefficient is

r = (ΣXY – N X̄ Ȳ) / √[Σ(X – X̄)² Σ(Y – Ȳ)²]

To generate r for a given pair of variables, in the Crosstabs:
Statistics dialog box check the Correlations check box.
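
The formula translates directly into code (hypothetical interval-scale data):

```python
def pearson_r(x, y):
    """Pearson's r via (sum XY - N·mean(X)·mean(Y)) over the square root of
    the product of the sums of squared deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum(xi * yi for xi, yi in zip(x, y)) - n * mx * my
    den = (sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y)) ** 0.5
    return num / den

print(pearson_r([1, 2, 3], [2, 4, 6]))  # -> 1.0 (exact linear relationship)
```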


Chi-square. For tables with two rows and two columns, select
Chi-square to calculate the Pearson chi-square, the likelihood-ratio
chi-square, Fisher’s exact test, and Yates’ corrected chi-square
(continuity correction). For 2 X 2 tables, Fisher’s exact test is
computed when a table that does not result from missing rows or
columns in a larger table has a cell with an expected frequency of
less than 5. Yates’ corrected chi-square is computed for all other
2 X 2 tables. For tables with any number of rows and columns,
select Chi-square to calculate the Pearson chi-square and the
likelihood-ratio chi-square. When both table variables are
quantitative, Chi-square yields the linear-by-linear association test.

Correlations. For tables in which both rows and columns contain
ordered values, Correlations yields Spearman's correlation
coefficient, rho (numeric data only). Spearman's rho is a measure


of association between rank orders. When both table variables
(factors) are quantitative, Correlations yields the Pearson
correlation coefficient, r, a measure of linear association between
the variables.

Nominal. For nominal data (no intrinsic order, such as Catholic,
Protestant, and Jewish), you can select Phi (coefficient) and
Cramér's V, Contingency coefficient, Lambda (symmetric and
asymmetric lambdas and Goodman and Kruskal's tau), and
Uncertainty coefficient.

Ordinal. For tables in which both rows and columns contain
ordered values, select Gamma (zero-order for 2-way tables and
conditional for 3-way to 10-way tables), Kendall's tau-b, and
Kendall's tau-c. For predicting column categories from row
categories, select Somers' d.

Nominal by Interval. When one variable is categorical and the
other is quantitative, select Eta. The categorical variable must be
coded numerically.

Kappa. For tables that have the same categories in the columns as
in the rows (for example, measuring agreement between two
raters), select Cohen’s Kappa.


Risk. For tables with two rows and two columns, select Risk for
relative risk estimates and the odds ratio.
McNemar. The McNemar test is a nonparametric test for two
related dichotomous variables. It tests for changes in responses
using the chi-square distribution. It is useful for detecting changes
in responses due to experimental intervention in "before and after"
designs.
Cochran’s and Mantel-Haenszel. Cochran’s and Mantel-
Haenszel statistics can be used to test for independence between a
dichotomous factor variable and a dichotomous response variable,
conditional upon covariate patterns defined by one or more layer
(control) variables. The Mantel-Haenszel common odds ratio is
also computed, along with Breslow-Day and Tarone's statistics for
testing the homogeneity of the common odds ratio.
4. Subpopulation Differences

Subpopulation differences can be examined using the Means
procedure. The procedure calculates subgroup means and related
univariate statistics for dependent variables within categories of
one or more independent variables. You can also optionally obtain
a one-way analysis of variance and a test of linearity.

The Means procedure requires the following minimum
specifications: one numeric dependent variable and one numeric or
short string independent variable. To generate subgroup means and
related statistics, choose:

Analyze > Compare Means > Means...


The Means dialog box opens, with places for the dependent list and
independent list to be filled with appropriate variables. When the
Options... pushbutton is clicked, the Means: Options dialog box
appears, containing check boxes and radio buttons for univariate
statistics and analysis-of-variance options.

5. Multiple Response Analysis

The Define Multiple Response Sets procedure groups elementary
variables into multiple dichotomy and multiple category sets, for
which you can obtain frequency tables and crosstabulations. You
can define up to 20 multiple response sets. Each set must have a
unique name. To remove a set, highlight it on the list of multiple
response sets and click Remove. To change a set, highlight it on
the list, modify any set definition characteristics, and click Change.

You can code your elementary variables as dichotomies or
categories. To use dichotomous variables, select Dichotomies to
create a multiple dichotomy set. Enter an integer value for Counted
value. Each variable having at least one occurrence of the counted
value becomes a category of the multiple dichotomy set. Select
Categories to create a multiple category set having the same range
of values as the component variables. Enter integer values for the
minimum and maximum values of the range for categories of the


multiple category set. The procedure totals each distinct integer
value in the inclusive range across all component variables. Empty
categories are not tabulated.
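
The counted-value logic for a multiple dichotomy set can be sketched in plain Python; the variable names below are hypothetical:

```python
# Three yes/no elementary variables, with 1 as the counted value.
cases = [
    {"news_tv": 1, "news_radio": 0, "news_paper": 1},
    {"news_tv": 1, "news_radio": 1, "news_paper": 0},
    {"news_tv": 0, "news_radio": 0, "news_paper": 1},
]
counted_value = 1
set_vars = ["news_tv", "news_radio", "news_paper"]

# Each variable becomes a category; its count is the number of cases
# showing the counted value.
frequencies = {v: sum(1 for c in cases if c[v] == counted_value) for v in set_vars}
print(frequencies)  # -> {'news_tv': 2, 'news_radio': 1, 'news_paper': 2}
```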

Each multiple response set must be assigned a unique name of up
to seven characters. The procedure prefixes a dollar sign ($) to the
name you assign. You cannot use the following reserved names:
CASENUM, SYSMIS, JDATE, DATE, TIME, LENGTH, and
WIDTH. The name of the multiple response set exists only for use
in multiple response procedures. You cannot refer to multiple
response set names in other procedures. Optionally, you can enter a
descriptive variable label for the multiple response set. The label
can be up to 40 characters long.

To analyze multiple response data, there are two procedures: the
Multiple Response Frequencies procedure, which displays
frequency tables, and the Multiple Response Crosstabs procedure,
which displays two- and three-dimensional crosstabulations.

In order to perform a multiple response analysis, you must group
elementary variables into multiple dichotomy and multiple
category sets. The minimum specifications for analysis are:
- Two or more numeric variables
- Values to be counted


- A name for the multiple response set


To define one or more multiple response sets, from the menus
choose:

Analyze > Multiple Response > Define sets

To Crosstabulate Multiple Response Sets


The Multiple Response Crosstabs procedure crosstabulates defined
multiple response sets, elementary variables, or a combination.
You can also obtain cell percentages based on cases or responses,
modify the handling of missing values, or get paired
crosstabulations. You must first define one or more multiple
response sets.


For multiple dichotomy sets, category names shown in the output
come from variable labels defined for elementary variables in the
group. If the variable labels are not defined, variable names are
used as labels. For multiple category sets, category labels come
from the value labels of the first variable in the group. If categories
missing for the first variable are present for other variables in the
group, define a value label for the missing categories. The
procedure displays category labels for columns on three lines, with
up to eight characters per line. To avoid splitting words, you can
reverse row and column items or redefine labels.

Statistics. Crosstabulation with cell, row, column, and total counts,
and cell, row, column, and total percentages. The cell percentages
can be based on cases or responses. To obtain crosstabulations of
multiple responses, from the menus choose:

Analyze > Multiple Response > Crosstabs...


There are options that you may set. When clicked, the Define
Ranges... and Options pushbuttons in the dialog box display
smaller dialog boxes for specifying value ranges and setting
options.

Under Define Ranges..., value ranges must be defined for any
elementary variables in the crosstabulation. To define value ranges
for an elementary variable, highlight the variable on the Row(s) or
Column(s) list and click Define Ranges... in the Multiple
Response Crosstabs dialog box. Enter the integer minimum and
maximum category values that you want to tabulate.

To obtain cell percentages, control the computation of percentages,
modify the handling of missing values, or get a paired
crosstabulation, click Options... in the Multiple Response
Crosstabs dialog box.


6. Testing hypotheses about differences in means


6.1. Samples and populations
The totality of cases about which conclusions are drawn is called
the population, while the cases actually included in the study
constitute the Sample. The field of statistics helps us draw
inferences about populations based on observations obtained from
random samples.

You obtain information from a random sample and the results
obtained from the summary and analysis serve as
estimates/statistics of the unknown population values or
parameters.

6.2. Sampling distributions


The theoretical distribution of all possible values of a statistic
obtained from samples of a population is called the sampling
distribution of the statistic. The mean of the sampling distribution
is called the expected value of the statistic. The standard deviation
is termed the standard error.


Sampling distribution of the Mean


The mean of the theoretical sampling distribution of the means of
samples of size N is μ, the population mean. The standard error,
which is another name for the standard deviation of the sampling
distribution of the mean, is

σ_x̄ = σ / √N

where  is the standard deviation of the population and N is the


sample size.

Since the population standard deviation σ is usually unknown, the
standard error is estimated from a single sample as

S_x̄ = S / √N

6.3. The two-sample T Test

To see whether there is a difference between the means of two
populations, we can apply the t test. Suppose that there are two
samples with means X̄1 and X̄2, variances S1² and S2², and
respective sample sizes N1 and N2, and let the corresponding
population means be μ1 and μ2. To test the hypothesis that the two
population means are equal against the alternative that they are
different, the following statistic can be calculated:


t = (X̄1 − X̄2) / √(S1²/N1 + S2²/N2)

Based on the sampling distribution of the above statistic, you can
calculate the probability that a difference at least as large as the
one observed would occur if the two population means are equal.
The probability is called the observed significance level. If the
observed significance level is small enough, the hypothesis that the
population means are equal is rejected.

Another statistic based on the t distribution can be used to test the
equality-of-means hypothesis. This statistic, known as the pooled-
variance t test, is based on the assumption that the population
variances in the two groups are equal and is obtained using a
pooled estimate of that common variance. The test statistic is
identical to the equation for t given previously except that the
individual group variances are replaced by a pooled estimate, Sp²;
i.e.,

t = (X̄1 − X̄2) / √(Sp²/N1 + Sp²/N2)

Sp² = ((N1 − 1)S1² + (N2 − 1)S2²) / (N1 + N2 − 2)
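The arithmetic above can be sketched in Python with SciPy, which
mirrors the equal- and unequal-variance results SPSS reports (the two
groups below are invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data for two independent groups.
group1 = np.array([23.0, 25.0, 28.0, 30.0, 27.0, 24.0])
group2 = np.array([31.0, 33.0, 29.0, 35.0, 32.0, 30.0])

# Levene's test of the hypothesis that the population variances are equal
lev_stat, lev_p = stats.levene(group1, group2)

# Pooled-variance t test (equal variances assumed) ...
t_pooled, p_pooled = stats.ttest_ind(group1, group2, equal_var=True)
# ... and the separate-variance (Welch) t test
t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)

# The pooled statistic by hand, following the formulas above
n1, n2 = len(group1), len(group2)
sp2 = ((n1 - 1) * group1.var(ddof=1)
       + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (group1.mean() - group2.mean()) / np.sqrt(sp2 / n1 + sp2 / n2)
```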


Levene's test is used to test the hypothesis that the population
variances are equal. This test is less dependent on the assumption
of normality than most tests of equality of variance.
Independent versus Paired Samples
In a paired-samples design, each subject in one group has a
corresponding pair in the other group. In an independent-samples
design, however, there is no pairing of cases; all observations are
independent.

Analysis of Paired Data


The statistic used to test the hypothesis that the mean difference in
the population is 0 is

t = D̄ / (S_D / √N)

where D̄ is the observed mean difference between the paired
observations and S_D is the standard deviation of the differences.
If the differences are normally distributed with a mean of 0, the
sampling distribution of t is Student's t with N − 1 degrees of
freedom, where N is the number of pairs. If the pairing is
effective, the standard error of the difference will be smaller than
the standard error obtained if two independent samples with N
subjects each were chosen.
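The paired statistic can be sketched as follows (hypothetical
before/after measurements on the same subjects; `scipy.stats.ttest_rel`
computes what SPSS reports for a paired-samples test):

```python
import numpy as np
from scipy import stats

# Hypothetical before/after scores for the same 8 subjects.
before = np.array([82.0, 90.0, 76.0, 88.0, 95.0, 71.0, 84.0, 79.0])
after = np.array([78.0, 86.0, 75.0, 83.0, 90.0, 70.0, 80.0, 77.0])

t_stat, p_value = stats.ttest_rel(before, after)

# The same statistic by hand: t = Dbar / (S_D / sqrt(N))
d = before - after
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```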

6.4 What is Hypothesis Testing?


The purpose of hypothesis testing is to help draw conclusions
about population parameters based on results observed in a random
sample. The procedure remains virtually the same for tests of most
hypotheses:

 A hypothesis of no difference (called a null hypothesis) and
its alternative are formulated.
 A test statistic is chosen to evaluate the null hypothesis
 For the sample, the test statistic is calculated.
 The probability, if the null hypothesis is true, of obtaining a
test value at least as extreme as the one observed is
determined.
 If the observed significance level is judged small enough, the
null hypothesis is rejected.
Assumptions must be made in order to perform a statistical
analysis. The particular assumptions depend on the statistical test
being used. For parametric tests, some knowledge about the
distribution from which samples are selected is required.


The assumptions are necessary to define the sampling distribution
of the test statistic. Unless the distribution is defined, correct
significance levels cannot be calculated. For the equal-variance
t test, the assumption is that the observations are random samples
from normal distributions with the same variance.

6.5 One-Sample T Test


The One-Sample T Test procedure tests whether the mean of a
single variable differs from a specified constant.

To Obtain a One-Sample T Test


From the menus choose: Analyze > Compare Means > One-Sample T Test...

Select one or more variables to be tested against the same
hypothesized value. Enter a numeric test value against which each
sample mean is compared.

Statistics. For each test variable: mean, standard deviation, and
standard error of the mean. The average difference between each
data value and the hypothesized test value, a t test that tests that
this difference is 0, and a confidence interval for this difference
(you can specify the confidence level).
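A minimal sketch of the same test (invented data; the test value of 50
is a hypothetical constant):

```python
import numpy as np
from scipy import stats

# Hypothetical measurements tested against a claimed constant of 50.
values = np.array([48.2, 51.0, 49.5, 47.8, 50.3, 46.9, 49.1, 48.6])
test_value = 50.0

t_stat, p_value = stats.ttest_1samp(values, popmean=test_value)
mean_diff = values.mean() - test_value  # the mean difference SPSS reports
```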

6.6 Independent-Samples T Test


The Independent-Samples T Test procedure computes Student's t
statistic for testing the significance of the difference in means for
two independent samples. Both equal- and unequal-variance t
values are provided, as is the Levene test for equality of variances.

The minimum specifications are:
 One or more numeric test variables
 One numeric or short string grouping variable
 Group values for the grouping variable.
To obtain an independent-samples t test, from the menus choose:

Analyze > Compare Means > Independent-Samples T Test...


You must define the two groups for the grouping variable.

Obtaining a Paired-Samples T Test

The Paired-Samples T Test procedure computes Student's t
statistic for testing the significance of a difference in means for
paired samples.
The minimum specification is a pair of numeric variables.
To obtain a paired-samples t test, from the menus choose:

Analyze ► Compare Means ► Paired-Samples T Test…


7. One-Way Analysis of Variance

Analysis of variance (ANOVA) is a statistical technique used
to test the null hypothesis that several population means are equal.
It examines the variability of the observations within each group as
well as the variability between the group means. Based on these
two estimates of variability, you draw conclusions about the
population means.

One-way analysis of variance is required when only one variable
is used to classify cases into the different groups. When two or
more variables are used to form the groups, the Simple Factorial
ANOVA procedure is required.


You can use the One-way ANOVA procedure only when your
groups are independent. If you observe the same person under
several conditions, you cannot use this procedure.
Assumptions Required
 Each of the groups is an independent random sample from a
normal population
 In the population, the variances of the groups are equal
You can test the null hypothesis that the groups come from
populations with the same variance by means of the Levene test,
which can be obtained with the One-Way ANOVA procedure.

Variability Analysis

The observed variability is divided into two parts: variability of the
observations within a group and the variability among the group
means.

There are two estimates of the variability in the population: the
within-groups mean square and the between-groups mean square.
The within-group mean square is based on how much the
observations within each group vary. The between-groups mean
square is based on how much the group means vary among
themselves. If the null hypothesis is true, the numbers should be
close to each other. If you divide one by the other, the ratio should
be close to 1. The statistical test for the null hypothesis that all
groups have the same mean in the population is based on this ratio,
called an F statistic.
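The F ratio can be sketched as follows (three invented groups;
`scipy.stats.f_oneway` computes the same one-way ANOVA F that SPSS
reports):

```python
import numpy as np
from scipy import stats

# Hypothetical data for three independent groups.
g1 = np.array([4.0, 5.0, 6.0, 5.0])
g2 = np.array([7.0, 8.0, 6.0, 7.0])
g3 = np.array([9.0, 10.0, 8.0, 9.0])

f_stat, p_value = stats.f_oneway(g1, g2, g3)

# The same F by hand: between-groups MS over within-groups MS.
groups = [g1, g2, g3]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_between = ss_between / (len(groups) - 1)
ms_within = ss_within / (len(all_data) - len(groups))
f_manual = ms_between / ms_within
```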

7.1 Obtaining a One-way analysis of Variance (ANOVA)

The One-Way ANOVA procedure produces a one-way analysis of
variance for a quantitative dependent variable by a single factor
(independent) variable. Analysis of variance is used to test the
hypothesis that several means are equal. This technique is an
extension of the two-sample t test.
In addition to determining that differences exist among the means,
you may want to know which means differ. There are two types of
tests for comparing means: a priori contrasts and post hoc tests.
Contrasts are tests set up before running the experiment, and post
hoc tests are run after the experiment has been conducted. You can
also test for trends across categories.
Data. Factor variable values should be integers, and the dependent
variable should be quantitative (interval level of measurement).
To obtain a one-way analysis of variance, from the menus choose:
Analyze ► Compare Means ► One-Way ANOVA…


Contrast: You can partition the between-groups sums of squares
into trend components or specify a priori contrasts.
Polynomial. Partitions the between-groups sums of squares into
trend components. You can test for a trend of the dependent
variable across the ordered levels of the factor variable. For
example, you could test for a linear trend (increasing or
decreasing) in salary across the ordered levels of highest degree
earned.

Degree. You can choose a 1st, 2nd, 3rd, 4th, or 5th degree
polynomial.
Coefficients. User-specified a priori contrasts to be tested by the t
statistic. Enter a coefficient for each group (category) of the factor
variable and click Add after each entry. Each new value is added to
the bottom of the coefficient list. To specify additional sets of
contrasts, click Next. Use Next and Previous to move between sets
of contrasts.


The order of the coefficients is important because it corresponds to
the ascending order of the category values of the factor variable.
The first coefficient on the list corresponds to the lowest group
value of the factor variable, and the last coefficient corresponds to
the highest value. For example, if there are six categories of the
factor variable, the coefficients −1, 0, 0, 0, 0.5, and 0.5 contrast the
first group with the fifth and sixth groups. For most applications,
the coefficients should sum to 0. Sets that do not sum to 0 can also
be used, but a warning message is displayed.
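For instance (three groups with invented means), contrast coefficients
that sum to 0 simply weight the group means:

```python
import numpy as np

# Sketch: coefficients -1, 0.5, 0.5 contrast group 1 against the
# average of groups 2 and 3; the means are hypothetical.
group_means = np.array([20.0, 24.0, 26.0])
coefficients = np.array([-1.0, 0.5, 0.5])

contrast_value = coefficients @ group_means  # 0.5*(24 + 26) - 20
```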

Tests. Once you have determined that differences exist among the
means, post hoc range tests and pairwise multiple comparisons can
determine which means differ. Range tests identify homogeneous
subsets of means that are not different from each other. Pairwise
multiple comparisons test the difference between each pair of
means, and yield a matrix where asterisks indicate significantly
different group means at an alpha level of 0.05.

Tukey’s honestly significant difference test, Hochberg’s GT2,
Gabriel’s test, and Scheffé’s test are multiple comparison tests and
range tests. Other available range tests are Tukey’s b, S-N-K
(Student-Newman-Keuls), Duncan, R-E-G-W F (Ryan-Einot-
Gabriel-Welsch F test), R-E-G-W Q (Ryan-Einot-Gabriel-Welsch
range test), and Waller-Duncan. Available multiple comparison
tests are Bonferroni, Tukey’s honestly significant difference test,
Sidak, Gabriel, Hochberg, Dunnett, Scheffé, and LSD (least
significant difference). Multiple comparison tests that do not
assume equal variances are Tamhane’s T2, Dunnett’s T3, Games-
Howell, and Dunnett’s C.


8. General Linear Model (GLM)


GLM Univariate
The GLM Univariate procedure provides regression analysis and
analysis of variance for one dependent variable by one or more
factors and/or variables. The factor variables divide the population
into groups. Using this General Linear Model procedure, you can
test null hypotheses about the effects of other variables on the
means of various groupings of a single dependent variable. You
can investigate interactions between factors as well as the effects
of individual factors, some of which may be random. In addition,
the effects of covariates and covariate interactions with factors can
be included. For regression analysis, the independent (predictor)
variables are specified as covariates.

Both balanced and unbalanced models can be tested. A design is
balanced if each cell in the model contains the same number of
cases. In addition to testing hypotheses, GLM Univariate produces
estimates of parameters.
Commonly used a priori contrasts are available to perform
hypothesis testing. Additionally, after an overall F test has shown
significance, you can use post hoc tests to evaluate differences
among specific means. Estimated marginal means give estimates
of predicted mean values for the cells in the model, and profile
plots (interaction plots) of these means allow you to easily
visualize some of the relationships.

Residuals, predicted values, Cook’s distance, and leverage values
can be saved as new variables in your data file for checking
assumptions.
WLS Weight allows you to specify a variable used to give
observations different weights for a weighted least-squares (WLS)
analysis, perhaps to compensate for a different precision of
measurement.
To Obtain a GLM Univariate Analysis
From the menus choose: Analyze > General Linear Model >
Univariate...


Specify Model. A full factorial model contains all factor main
effects, all covariate main effects, and all factor-by-factor
interactions. It does not contain covariate interactions. Select
Custom to specify only a subset of interactions or to specify factor-
by-covariate interactions. You must indicate all of the terms to be
included in the model.
Contrast
Hypothesis testing is based on the null hypothesis LB=0, where L
is the contrast coefficients matrix and B is the parameter vector.
When a contrast is specified, SPSS creates an L matrix in which
the columns corresponding to the factor match the contrast. The
remaining columns are adjusted so that the L matrix is estimable.
The output includes an F statistic for each set of contrasts. Also
displayed for the contrast differences are Bonferroni-type
simultaneous confidence intervals based on Student's t
distribution.
Deviation. Compares the mean of each level (except a reference
category) to the mean of all of the levels (grand mean). The levels
of the factor can be in any order.
Simple. Compares the mean of each level to the mean of a
specified level. This type of contrast is useful when there is a
control group. You can choose the first or last category as the
reference.
Difference. Compares the mean of each level (except the first) to
the mean of previous levels. (Sometimes called reverse Helmert
contrasts.)
Helmert. Compares the mean of each level of the factor (except
the last) to the mean of subsequent levels.
Repeated. Compares the mean of each level (except the last) to the
mean of the subsequent level.
Polynomial. Compares the linear effect, quadratic effect, cubic
effect, and so on. The first degree of freedom contains the linear
effect across all categories; the second degree of freedom, the
quadratic effect; and so on. These contrasts are often used to
estimate polynomial trends.


Univariate Profile Plots
Profile plots (interaction plots) are useful for comparing marginal
means in your model. A profile plot is a line plot in which each
point indicates the estimated marginal mean of a dependent
variable (adjusted for any covariates) at one level of a factor. The
levels of a second factor can be used to make separate lines. Each
level in a third factor can be used to create a separate plot. All
fixed and random factors, if any, are available for plots. For
multivariate analyses, profile plots are created for each dependent
variable. In a repeated measures analysis, both between-subjects
factors and within-subjects factors can be used in profile plots.
GLM Multivariate and GLM Repeated Measures are available only
if you have the Advanced Models option installed.

A profile plot of one factor shows whether the estimated marginal
means are increasing or decreasing across levels. For two or more
factors, parallel lines indicate that there is no interaction between
factors, which means that you can investigate the levels of only
one factor. Nonparallel lines indicate an interaction.

After a plot is specified by selecting factors for the horizontal axis
and, optionally, factors for separate lines and separate plots, the
plot must be added to the Plots list.


8.2 GLM Multivariate
The GLM Multivariate procedure provides regression analysis and
analysis of variance for multiple dependent variables by one or
more factor variables or covariates. The factor variables divide the
population into groups. Using this general linear model procedure,
you can test null hypotheses about the effects of factor variables on
the means of various groupings of a joint distribution of dependent
variables. You can investigate interactions between factors as well
as the effects of individual factors. In addition, the effects of
covariates and covariate interactions with factors can be included.
For regression analysis, the independent (predictor) variables are
specified as covariates.

9. Measuring Linear Association

The relationship between any two variables can be measured in
different ways. A scatter plot can reveal various types of
association between two variables. However, since a scatter plot
is not a quantitative measure, an alternative is the Pearson
correlation coefficient, denoted by r. It is defined as


R= (X,- X)(Yi-Y)
(N-1) Sx Sy
where N is the number of cases and Sx and Sy are the standard
deviations of the two variables. The absolute value of r indicates
the strength of the linear relationship. The largest possible absolute
value of r is 1, which occurs when all points fall exactly on the line.
To test the hypothesis that the population correlation coefficient is
different from zero, we use the statistic

t = r √((N − 2) / (1 − r²))
If the population correlation coefficient ρ is zero, this test statistic
has a Student's t distribution with N − 2 degrees of freedom. The
assumption required to use this statistic is that independent
random samples are taken from a distribution in which the two
variables together are normally distributed.
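The coefficient and its t test can be sketched as follows (invented
paired data; `scipy.stats.pearsonr` reports the same two-sided p value):

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations with a strong linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.1])

r, p_value = stats.pearsonr(x, y)

# The t statistic above, with N - 2 degrees of freedom
N = len(x)
t_stat = r * np.sqrt((N - 2) / (1 - r ** 2))
p_manual = 2 * stats.t.sf(abs(t_stat), N - 2)
```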

The Pearson product-moment correlation is appropriate only for
data that attain at least interval level of measurement. Normality is
also assumed when testing hypotheses about this correlation
coefficient. For ordinal data or interval data that do not satisfy the
normality assumption, another measure of the linear relationship
between two variables, Spearman’s rank correlation coefficient,
is available.


The rank correlation coefficient is the Pearson correlation
coefficient based on the ranks of the data if there are no ties. If the
original data for each variable have no ties, the data for each
variable are first ranked, and then the Pearson correlation
coefficient between the ranks for the two variables is computed.
Like the Pearson correlation coefficient, this correlation ranges
between −1 and +1, where −1 and +1 indicate a perfect linear
relationship between the ranks of the two variables. The minimum
specification is two or more numeric variables.
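That equivalence can be demonstrated directly (invented tie-free data):

```python
import numpy as np
from scipy import stats

# Hypothetical data with no ties in either variable.
x = np.array([3.0, 1.0, 4.0, 9.0, 2.0, 6.0])
y = np.array([30.0, 12.0, 35.0, 80.0, 15.0, 60.0])

rho, p_value = stats.spearmanr(x, y)

# Same number obtained by ranking first, then computing Pearson's r
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
```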

9.1 Bivariate Correlations

To obtain bivariate correlations, from the menus choose:


Analyze ► Correlate ► Bivariate…

The Bivariate Correlations dialog box appears. The following
options are available:
Correlation Coefficients.
 Pearson.
 Kendall’s tau-b
 Spearman
Test of Significance
 Two-tailed
 One-tailed


9.2 Partial Correlation Analysis


Partial correlation analysis provides a single measure of
linear association between two variables while adjusting for the
linear effects of one or more additional variables. If properly used,
it is a technique for uncovering spurious relationships, identifying
intervening variables, and detecting hidden relationships.

To test the statistical significance of a population partial
correlation coefficient, the assumption that the population is
normally distributed (i.e., multivariate normality) is required. The
test statistic is

t = r √((N − θ − 2) / (1 − r²))

where θ is the order of the coefficient and r is the partial
correlation coefficient. The degrees of freedom for t are N − θ − 2,
where N is the number of cases.
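A first-order partial correlation (θ = 1) can be sketched with the
standard recursion formula; the data below are simulated for
illustration:

```python
import numpy as np

# Association of x and y after removing the linear effect of a
# control variable z (all three series are simulated).
rng = np.random.default_rng(0)
z = rng.normal(size=200)
x = z + rng.normal(size=200)
y = z + rng.normal(size=200)

r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]
r_xy_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# t statistic with N - theta - 2 degrees of freedom
N, theta = len(x), 1
t_stat = r_xy_z * np.sqrt((N - theta - 2) / (1 - r_xy_z ** 2))
```

Equivalently, the partial correlation is the plain correlation between
the residuals of x and y after each is regressed on z.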

Obtaining Partial Correlations.


Analyze > Correlate > Partial...

10. Distances
This procedure calculates any of a wide variety of statistics
measuring either similarities or dissimilarities (distances), either
between pairs of variables or between pairs of cases. These
similarity or distance measures can then be used with other
procedures, such as factor analysis, cluster analysis, or
multidimensional scaling, to help analyze complex data sets.

Dissimilarity (distance) measures for interval data are Euclidean
distance, squared Euclidean distance, Chebychev, block,
Minkowski, or customized; for count data, chi-square or phi-
square; for binary data, Euclidean distance, squared Euclidean
distance, size difference, pattern difference, variance, shape, or
Lance and Williams. Similarity measures for interval data are
Pearson correlation or cosine; for binary data, Russell and Rao,
simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and
Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1,
Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg’s
D, Yule’s Y, Yule’s Q, Ochiai, Sokal and Sneath 5, phi 4-point
correlation, or dispersion.
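A few of the interval-data measures named above can be sketched with
`scipy.spatial.distance` (the two cases are invented):

```python
import numpy as np
from scipy.spatial import distance

# Two hypothetical cases measured on three interval variables.
a = np.array([1.0, 4.0, 2.0])
b = np.array([3.0, 1.0, 5.0])

d_euclidean = distance.euclidean(a, b)    # root of summed squared diffs
d_squared = distance.sqeuclidean(a, b)    # squared Euclidean distance
d_chebychev = distance.chebyshev(a, b)    # largest absolute difference
d_block = distance.cityblock(a, b)        # block (city-block) distance
```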


11. Linear Regression Analysis

11.1 Linear Regression

Regression is a model that describes the relationship between
dependent and independent variables. A linear regression describes
only the linear relationship that exists between variables. Suppose
Y and X are variables, where Y is the dependent variable and X the
independent variable. The fitted regression line is

Y = B0 + B1X

where B0 is the intercept of the line and B1 is the slope, i.e., the
amount of change in Y for a unit change in X.

When making inferences based on the sample results, we have to
make assumptions about the population. The assumptions for
regression are the following.

Normality and Equality of Variance. For any fixed value of the
independent variable X, the distribution of the dependent variable
Y should be normal, with mean μ_Y|X (the mean of Y for a given
X) and a constant variance of σ².


Independence. The Y's are statistically independent of each other;
that is, observations are in no way influenced by other
observations. For example, observations are not independent if
they are based on repeated measurements from the same
experimental unit.

Linearity. The mean values μ_Y|X all lie on a straight line, which is
the population regression line. An alternative way of stating this
assumption is that the linear model is correct. When there is a
single independent variable, the model can be summarized by

Yi = β0 + β1Xi + ei

The population parameters for the slope and intercept are
denoted by β1 and β0. The term ei, usually called the error, is the
difference between the observed value of Yi and the subpopulation
mean at the point Xi. The ei are assumed to be normally
distributed, independent random variables with a mean of 0 and a
variance of σ².

11.2. Estimating Population Parameters


Since β0 and β1 are unknown population parameters, they must be
estimated from the sample. The least-squares coefficients B0 and
B1 are used to estimate the population parameters.

However, the slope and intercept estimated from a single sample
typically differ from the population values and vary from sample to
sample. To use these estimates for inference about the population
values, the sampling distributions of the two statistics are needed.
When the assumptions of linear regression are met, the sampling
distributions of B0 and B1 are normal, with means β0 and β1. The
slope estimate is

B1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
The standard error of B0 is

σ_B0 = σ √(1/N + X̄² / ((N − 1)Sx²))

where Sx² is the sample variance of the independent variable. The
standard error of B1 is

σ_B1 = σ / √((N − 1)Sx²)

Since the population variance of the errors, σ², is not known, it
must be estimated. The usual estimate of σ² is


S² = Σ(Yi − B0 − B1Xi)² / (N − 2)

S is termed the standard error of the estimate.
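The estimates and their standard errors can be computed by hand on
invented data, following the formulas above:

```python
import numpy as np

# Hypothetical (x, y) pairs for a single-predictor regression.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 4.1, 5.8, 8.3, 9.9, 12.2])
N = len(x)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
s2 = (residuals ** 2).sum() / (N - 2)  # estimate of the error variance
s = np.sqrt(s2)                        # standard error of the estimate

sx2 = x.var(ddof=1)                    # sample variance of x
se_b1 = s / np.sqrt((N - 1) * sx2)
se_b0 = s * np.sqrt(1.0 / N + x.mean() ** 2 / ((N - 1) * sx2))
```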

11.3 Testing Hypotheses

A frequently tested hypothesis is that there is no linear relationship
between X and Y, that is, that the slope of the population regression
line is 0. The statistic used to test this hypothesis is

t = B1 / S_B1
When the assumptions are met and the hypothesis of no linear
relationship is true, this statistic has a Student's t distribution with
N − 2 degrees of freedom. The statistic for testing the hypothesis
that the intercept is 0 is

t = B0 / S_B0
Its distribution is also Student’s t with N-2 degrees of freedom.

11.4 Goodness of Fit

from data is establishing how well the model actually fits, or its
goodness of fit. This includes the detection of possible violations
of the required assumptions in the data being analyzed.


The R-squared Coefficient
A commonly used measure of the goodness of fit of a linear model
is R2, or the coefficient of determination. Besides being the square
of the correlation coefficient between variables X and Y, it is the
square of correlation coefficient between Y (the observed value of
the dependent variable) and Ŷ (the predicted value of Y from the
fitted line). If all the observations fall on the regression line, R 2 is
1. If there is no linear relationship between the variables R2 is 0.

Note that R² is a measure of the goodness of fit of a particular
model and that an R² of 0 does not necessarily mean that there is
no association between the variables. Instead, it indicates that there
is no linear relationship.

The sample R² tends to be an optimistic estimate of how well the
model fits the population. The model usually does not fit the
population as well as it fits the sample from which it is derived.
The statistic adjusted R² attempts to correct R² to more closely
reflect the goodness of fit of the model in the population. Adjusted
R² is given by

adjusted R² = R² − p(1 − R²) / (N − p − 1)

where p is the number of independent variables in the equation.


To test the hypothesis that there is no linear relationship between X
and Y, several equivalent statistics can be computed. When there is
a single independent variable, the hypothesis that the population R2
is 0 is identical to the hypothesis that the population slope is 0. The
test for R2(pop) = 0 is usually obtained from the analysis of
variance table.

The total variation in a regression analysis can be broken down
into two components: the residual sum of squares and the
regression sum of squares. Mathematically,

Σ(Yi - Ȳ)2 = Σ(Yi - Ŷi)2 + Σ(Ŷi - Ȳ)2

If the regression assumptions are met, the ratio of the mean square
regression to the mean square residual is distributed as an F
statistic with P and N-P-1 degrees of freedom. F serves to test how
well the regression model fits the data. If the probability associated
with the F statistic is small, the hypothesis that R2(pop) = 0 is
rejected. F is mathematically written as

F = mean square regression / mean square residual
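The decomposition can be verified numerically. The sketch below (plain Python, illustrative data) computes the two sums of squares and the F ratio from observed and predicted values:

```python
def regression_f(y, y_hat, p):
    # Total SS about the mean splits into regression SS and residual SS.
    n = len(y)
    y_bar = sum(y) / n
    ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)           # regression SS
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual SS
    ms_reg = ss_reg / p               # p = number of independent variables
    ms_res = ss_res / (n - p - 1)
    return ms_reg / ms_res            # F with p and N-p-1 df

# Fitted values from a hypothetical simple regression (p = 1):
f = regression_f([1, 2, 2, 3], [1.1, 1.7, 2.3, 2.9], 1)  # F = 18
```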

11.5. Multiple Regression Models


Multiple regression extends the concept of bivariate regression by
incorporating multiple independent variables. The model can be
expressed as

Yi = β0 + β1X1i + β2X2i + … + βpXpi + εi

The notation Xpi indicates the value of the pth independent
variable for case i. Again, the β terms are unknown parameters and
the εi terms are independent random variables that are normally
distributed with mean 0 and constant variance σ2.
In the analysis of multiple linear regression, the first step is usually
to calculate the correlation matrix between variables. The matrix
indicates the correlations between the independent variables and
the dependent variable as well as between the independent
variables.
The F statistic under the multiple regression model indicates
whether there is a linear relationship between the dependent
variable and the independent variables. The null hypothesis can be
stated as

β1 = β2 = … = βp = 0
11.6 Determining Important Variables.

The importance of variables in a regression equation may be
viewed from different perspectives. A number of ways are
available to judge whether a variable is


important or not. The following methods are useful in determining


important variables.

Beta Coefficients
Use of the actual regression coefficients may not give a true
picture of the importance of variables. Instead, regression
coefficients are standardized using the following equation:

Betak = Bk (Sk / SY)

where Sk is the standard deviation of the kth independent variable
and Betak is the standardized (beta) coefficient.
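In code, the standardization is just a rescaling by the two standard deviations (a generic sketch with invented numbers; statistics.stdev is the sample standard deviation):

```python
import statistics

def beta_coefficient(b_k, x_k, y):
    # Beta_k = B_k * (S_k / S_Y), with S_k and S_Y sample standard deviations
    return b_k * statistics.stdev(x_k) / statistics.stdev(y)

# Hypothetical coefficient B = 2.0 for a predictor whose spread is half that of Y:
beta = beta_coefficient(2.0, [1, 2, 3, 4], [2, 4, 6, 8])  # beta = 1.0
```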
Part and Partial Coefficients
Another way of assessing the relative importance of independent
variables is to consider the increase in R2 when a variable is
entered into an equation that already contains the other
independent variables. This increase is

R2change = R2 - R2(i)

where R2(i) is the square of the multiple correlation coefficient
when all independent variables except the ith are in the equation. A
large change in R2 indicates that a variable provides unique
information about the dependent variable that is not available from
the other independent variables in the equation. The signed square


root of the increase is called the part correlation coefficient. It is
the correlation between Y and Xi when the linear effects of the
other independent variables have been removed from Xi. If all
independent variables are uncorrelated, the change in R2 when a
variable is entered into the equation is simply the square of the
correlation coefficient between that variable and the dependent
variable.

11.7 Building a Model


You can construct a variety of regression models from the same set
of variables. Among the widely used procedures for computing all
possible regression equations, under this section forward selection,
backward elimination, and stepwise selection are discussed.

Forward Selection
In forward selection, the first variable considered for entry into the
equation is the one with the largest positive or negative correlation
with the dependent variable. The F test for the hypothesis that the
coefficient of the entered variable is 0 is then calculated. To determine
whether this variable (and each succeeding variable) is entered, the
F value is compared to an established criterion. You can specify
one of two criteria in SPSS. One criterion is the minimum value of


the F statistic that a variable must achieve in order to enter, called


F-to-enter (FIN), with a default value of 3.84. The other criterion
you can specify is the probability associated with the F statistic,
called probability of F-to-enter (PIN), with a default of 0.05. In this
case, a variable enters into the equation only if the probability
associated with the F test is less than or equal to the default 0.05 or
the value you specify. By default, PIN is the criterion used.

Backward Elimination
While forward selection starts with no independent variables in the
equation and sequentially enters them, backward elimination
starts with all variables in the equation and sequentially removes
them. Instead of entry criteria, removal criteria are used.

Two removal criteria are available in SPSS. The first is the


minimum F value that a variable must have in order to remain in
the equation. Variables with F values less than this F-to-remove
(FOUT) are eligible for removal. The second criterion available is
the maximum probability of F-to-remove (POUT) that a variable
can have. The default FOUT value is 2.71 and the default POUT
value is 0.10. The default criterion is probability of F-to-remove.

Stepwise Selection


Stepwise selection of independent variables is really a combination


of backward and forward procedures and is probably the most
commonly used method. The first variable is selected in the same
manner as in forward selection. If the variable fails to meet entry
requirements (either FIN or PIN), the procedure terminates with no
independent variables in the equation. If it passes the criterion, the
second variable is selected based on the highest partial correlation.
If it passes entry criteria, it also enters the equation.

After the first variable is entered, stepwise selection differs from
forward selection: the first variable is examined to see whether it
should be removed according to the removal criterion (FOUT or
POUT), as in backward elimination. In the next step, variables not
in the equation are examined for entry. After each step, variables
already in the equation are examined for removal. Variables are
removed until none remain that meet the removal criterion. To
prevent the same variable from being repeatedly entered and
removed, PIN must be less than POUT (or FIN greater than
FOUT). Variable selection terminates when no more variables
meet entry and removal criteria.
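The entry logic of forward selection can be sketched independently of SPSS. In the sketch below, f_to_enter is a caller-supplied function (an assumption for illustration) that returns the F-to-enter value of a candidate variable given the variables already selected; 3.84 mirrors SPSS's default FIN:

```python
def forward_selection(candidates, f_to_enter, fin=3.84):
    # Repeatedly enter the candidate with the largest F-to-enter,
    # stopping when no remaining candidate reaches FIN.
    selected = []
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=lambda v: f_to_enter(v, selected))
        if f_to_enter(best, selected) < fin:
            break  # entry criterion not met by any remaining variable
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy scoring: fixed F values regardless of what is already selected.
scores = {"x1": 10.0, "x2": 5.0, "x3": 1.0}
chosen = forward_selection(["x1", "x2", "x3"], lambda v, s: scores[v])
```

With these toy scores, x1 and x2 clear the FIN threshold and x3 does not, so selection stops with two variables entered.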

How to Obtain a Linear Regression Analysis


The Linear Regression procedure provides five equation-building
methods: forward selection, backward elimination, stepwise


selection, forced entry, and forced removal. It can produce residual
analyses to help


detect influential data points, outliers, and violations of regression
model assumptions. You can also save predicted values, residuals,
and related measures.

The minimum specifications are:


 One numeric dependent variable.
 One or more numeric independent variables.
To obtain a linear regression analysis, from the menus choose:

Analyze ► Regression ► Linear…

12 Curve Estimation
In a situation where you want to fit a curve that you think is
appropriate to data you have, you can do it in SPSS. The Curve
Estimation procedure produces curve estimation regression
statistics and related plots for 11 different curve estimation
regression models. You can also save predicted values, residuals,
and prediction intervals as new variables.

The minimum specifications are:


 One or more dependent variables


 An independent variable that can be either a variable in the


working data file or time.
To obtain curve estimation regression analysis and related plots,
from the menus choose:

Analyze ► Regression ► Curve Estimation….

Curve Estimation Models


You can choose one or more curve estimation regression models.
To determine which model to use, plot your data. If your variables
appear to be related linearly, use a simple linear regression model.
When your variables are not linearly related, try transforming your
data. When a transformation does not help, you may need a more
complicated model. View a scatterplot of your data; if the plot
resembles a mathematical function you recognize, fit your data to
that type of model. For example, if your data resemble an
exponential function, use an exponential model. The following
models are available in the Curve Estimation procedure: linear,
logarithmic, inverse, quadratic, cubic, power, compound, S-curve,
logistic, growth, and exponential. If you are unsure which model
best fits your data, try several models and select among them.

In the Curve Estimation dialog box, click your right mouse button
on a model to obtain the equation of the model.


13 Distribution-Free or Nonparametric Tests

As the name indicates, nonparametric tests do not require
assumptions about the shape of the underlying distribution, or
require only limited distributional assumptions. Of course, they are
less powerful than their parametric counterparts. They are most
useful in situations where parametric procedures are not
appropriate; for example, when the data are nominal or ordinal, or
when interval data are from markedly non-normal distributions.
Significance levels for certain nonparametric tests can be
determined regardless of the shape of the population distribution,
since they are based on ranks. In the ensuing subsections, various
nonparametric tests will be introduced.

13.1. One-Sample Tests


Various one-sample nonparametric procedures are available for
testing hypotheses about the parameters of a population. These
include procedures for examining differences in paired samples.
The Sign Test

The sign test is a nonparametric procedure used with two related


samples to test the hypothesis that the distributions of two


variables are the same. This test makes no assumptions about the
shape of these distributions.

To compute the sign test, the difference between the two paired
scores (for example, the buying scores of husbands and wives) is
calculated for each case. Next, the numbers of positive and
negative differences are obtained. If the distributions of the two
variables are the same, the numbers of positive and negative
differences should be similar.
To obtain a sign test, from the menus choose:

Analyze ► Nonparametric Tests ► 2 Related Samples…

and check the Sign box under Test Type.

The Wald-Wolfowitz Runs Test


The runs test is a test of randomness. That is, given a sequence of
observations, the runs test examines whether the value of one
observation influences the values of later observations. If there is
no influence (the observations are independent), the sequence is
considered random.

The Binomial Test


With data that are binomially distributed, the hypothesis that the
probability P of a particular outcome is equal to some number is


often of interest. For example, you might want to find out if a


tossed coin was unbiased. To check this, you could test to see
whether the probability of heads was equal to 0.5. The binomial
test compares the observed frequencies in each category of a
binomial distribution to the frequencies expected under a binomial
distribution with the probability parameter p.

To obtain the binomial test, from the menus choose:

Analyze ► Nonparametric Tests ► Binomial…

The Kolmogorov-Smirnov One-Sample Test


The Kolmogorov-Smirnov test is used to determine how well a
random sample of data fits a particular distribution (uniform,
normal, or Poisson). It is based on a comparison of the sample
cumulative distribution function to the hypothetical cumulative
distribution function.
To obtain the one-Sample Kolmogorov-Smirnov test, from the
menus choose:
Analyze ► Nonparametric Tests► 1-Sample K-S…

The One-Sample Chi-Square Test


To calculate the one-sample chi-square statistic, the data are first
classified into mutually exclusive categories of interest, and then
expected frequencies for these categories are computed. Expected
frequencies are the frequencies that would be expected if the null
hypothesis were true. Once the expected frequencies are obtained,
the chi-square statistic is computed as

X2 = Σ (Oi - Ei)2 / Ei

where Oi is the observed frequency for the ith category, Ei is the
expected frequency for the ith category, and the sum is taken over
the k categories.
To obtain the chi-Square test, from the menus choose:

Analyze ► Nonparametric Tests ► Chi-Square…
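The statistic is a one-liner in any language. Here is a Python sketch with hypothetical die-roll counts (not data from the manual):

```python
def chi_square(observed, expected):
    # Sum of (O_i - E_i)^2 / E_i over the k categories
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 60 hypothetical rolls of a die; a fair die expects 10 per face.
x2 = chi_square([8, 12, 10, 9, 11, 10], [10] * 6)  # 1.0, on k-1 = 5 df
```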

The Friedman Test


The Friedman test is used to compare two or more related samples.
(This is an extension of the tests for paired data.) The k variables
to be compared are ranked from 1 to k for each case, and the mean


ranks for the variables are calculated and compared, resulting in a


test statistic with approximately a chi-square distribution.

14. Logistic Regression


A variety of multivariate statistical techniques can be used to
predict a binary dependent variable from a set of independent
variables. Multiple regression analysis and discriminant analysis
are two related techniques that come to mind. However, both
methods require assumptions to be met.
To estimate the probability that an event occurs for new cases, we
use the binary logistic regression model. This model requires far
fewer assumptions, and it is effective even when group
membership is entirely categorical. For the case of a single
independent variable, the logistic regression model is written as

Prob(event) = e^(B0 + B1X) / (1 + e^(B0 + B1X)) = 1 / (1 + e^-(B0 + B1X))

where B0 and B1 are coefficients estimated from the data.


For more than one independent variable, the model can be written
as


Prob(event) = 1 / (1 + e^-Z)

where Z is the linear combination Z = B0 + B1X1 + B2X2 + B3X3
+ … + BpXp.
The probability of an event not occurring is estimated as

Prob (no event) = 1-Prob(event)
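These two formulas translate directly into code (a generic sketch; the coefficient values are placeholders, not estimates from any real data):

```python
import math

def logistic_prob(b, x):
    # Prob(event) = 1 / (1 + e^-Z), with Z = B0 + B1*X1 + ... + Bp*Xp
    z = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

p_event = logistic_prob([0.5, 1.2], [2.0])  # Z = 0.5 + 1.2*2.0 = 2.9
p_no_event = 1.0 - p_event                  # Prob(no event)
```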

Logistic regression is useful for situations in which you want to be


able to predict the presence or absence of a characteristic or
outcome based on values of a set of predictor variables. It is
similar to a linear regression model but is suited to models where
the dependent variable is dichotomous. Logistic regression
coefficients can be used to estimate odds ratios for each of the
independent variables in the model. Logistic regression is
applicable to a broader range of research situations than
discriminant analysis.

In logistic regression analysis, the data should meet the following
requirements: the dependent variable should be dichotomous, and
independent variables can be interval level or categorical; if
categorical, they should be dummy


or indicator coded (there is an option in the procedure to recode


categorical variables automatically).

To Obtain a Logistic Regression Analysis


From the menus choose:
Analyze > Regression > Binary Logistic…

Select one dichotomous dependent variable. This variable may be


numeric or short string. Select one or more covariates. To include
interaction terms, select all of the variables involved in the
interaction and then select >a*b>.


14.1 Multinomial Logistic Regression

If you have a categorical dependent variable with more than two
possible values, you can use an extension of binary logistic
regression, called multinomial logistic regression, to examine the
relationship between the dependent variable and the independent
variables. The models are called multinomial since, for each
combination of values of the independent variables, the counts of
the dependent variable are assumed to have a multinomial
distribution. The counts at the different combinations are also
assumed to be independent, with a fixed total.
The Logit Model
When you have two groups, one that has experienced the event of
interest and one that has not, you can write the logistic regression
model as

Log [ P(event) / (1 - P(event)) ] = B0 + B1X1 + B2X2 + … + BpXp

The quantity on the left side of the equal sign is called a logit. It is
the natural log of the odds that the event will occur.


To enter variables in groups (blocks), select the covariates for a


block, and click Next to specify a new block. Repeat until all
blocks have been specified. Optionally, you can select cases for
analysis. Click Select, choose a selection variable, and click Rule.
You can specify the following statistics for your Multinomial
Logistic Regression:

Summary statistics. Prints the Cox and Snell, Nagelkerke, and


McFadden R2 statistics. Likelihood ratio test. Prints likelihood-
ratio tests for the model partial effects. The test for the overall
model is printed automatically. Parameter estimates. Prints
estimates of the model effects, with a user-specified level of
confidence. Asymptotic correlation of parameter estimates. Prints
matrix of parameter estimate correlations. Asymptotic covariance
of parameter estimates. Prints matrix of parameter estimate
covariances.
Cell probabilities. Prints a table of the observed and expected
frequencies (with residual) and proportions by covariate pattern
and response category. Classification table. Prints a table of the
observed versus predicted responses. Goodness of fit chi-square
statistics. Prints Pearson and likelihood-ratio chi-square statistics.
Statistics are computed for the covariate patterns determined by all
factors and covariates or by a user-defined subset of the factors and


covariates. Define Subpopulations. Allows you to select a subset of


the factors and covariates in order to define the covariate patterns
used by cell probabilities and the goodness-of-fit tests.

Multinomial Logistic Regression Criteria

You can specify the following criteria for your Multinomial


Logistic Regression: Iterations. Allows you to specify the
maximum number of times you want to cycle through the
algorithm, the maximum number of steps in the step-halving, the
convergence tolerances for changes in the log-likelihood and
parameters, how often the progress of the iterative algorithm is
printed, and at what iteration the procedure should begin checking
for complete or quasi-complete separation of the data.


Delta. Allows you to specify a non-negative value less than 1. This


value is added to each empty cell of the crosstabulation of response
category by covariate pattern. This helps to stabilize the algorithm
and prevent bias in the estimates.
Singularity tolerance. Allows you to specify the tolerance used in
checking for singularities.

Output
The logistic regression analysis gives us estimates of B, the
standard error, the Wald statistic, and the significance (probability)
level. From this output we can write the logistic regression
equation for the probability that an event occurs:

Prob(event) = 1 / (1 + e^-Z)

Assuming the estimated parameters give Z = -3.346, the
probability that the event occurs is then estimated as

Prob(event) = 1 / (1 + e^-(-3.346)) = 0.034

Based on this estimate, we could say the event will not occur,
because the probability is < 0.5.


Test
We should not stop at getting the output; we have to examine the
significance of the coefficients. For large sample sizes, the test that
a coefficient is 0 can be based on the Wald statistic, which has a
chi-square distribution. When a variable has a single degree of
freedom, the Wald statistic is just the square of the ratio of the
coefficient to its standard error.

However, the Wald statistic has a very undesirable property. When
the absolute value of the regression coefficient becomes large, the
estimated standard error is too large. This produces a Wald statistic
that is too small, leading you to fail to reject the null hypothesis
that the coefficient is 0, when in fact you should. So when you
have a large coefficient, you should not rely on the Wald statistic
for hypothesis testing. Instead, you should fit the model with and
without that variable and base your hypothesis test on the change
in the log-likelihood.
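For a single-degree-of-freedom coefficient, the Wald statistic is just the squared ratio of estimate to standard error (illustrative numbers, not output from any real model):

```python
def wald_statistic(coef, se):
    # Wald = (B / SE)^2; compare with a chi-square distribution on 1 df
    return (coef / se) ** 2

w = wald_statistic(2.0, 0.5)  # 16.0
```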

15. Probit Analysis


This procedure measures the relationship between the strength of a
stimulus and the proportion of cases exhibiting a certain response
to the stimulus. It is useful for situations where you have a


dichotomous output that is thought to be influenced or caused by


levels of some independent variable(s) and is particularly well
suited to experimental data. This procedure will allow you to
estimate the strength of a stimulus required to induce a certain
proportion of responses, such as the median effective dose. How
much insecticide is needed to kill a pest? The response we are
interested in is all or none: the insect is either dead or alive. There
are different mathematical models that can be used to express the
relationship between the proportion responding and the dose of
one or more stimuli.

Probit and Logit Response Models

In Probit and logit models, instead of regressing the actual


proportion responding on the values of the stimuli, we transform
the proportion responding using either a logit or probit
transformation. For probit transformation, we replace each of the
observed proportions with the value of the standard normal curve
below which the observed proportion of the area is found.

E.g., if half (0.5) of the subjects respond at a particular dose, the


corresponding probit value is 0, since half of the area in a standard


normal curve falls below a Z score of 0. If the observed proportion


is 0.95, the corresponding probit value is 1.64

If the logit transformation is used, the observed proportion P is
replaced by

logit(P) = (1/2) Ln [ P / (1 - P) ]

This quantity is called a logit. If the observed proportion is 0.5, the
logit-transformed value is 0, and if the observed proportion is
0.95, the logit-transformed value is 1.47. In most situations,
analyses based on logits and probits give very similar results.
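Both transformations are easy to reproduce with Python's standard library. NormalDist().inv_cdf gives the probit (the inverse standard normal CDF), and the logit here includes the factor of 1/2 used in the worked values quoted above:

```python
import math
from statistics import NormalDist

def probit(p):
    # z value below which a proportion p of the standard normal curve lies
    return NormalDist().inv_cdf(p)

def logit(p):
    # (1/2) * ln(p / (1 - p)), matching the transformed values in the text
    return 0.5 * math.log(p / (1.0 - p))

probit(0.95)  # about 1.64
logit(0.95)   # about 1.47
```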

The regression model for the transformed response can be written
as

Transformed(Pi) = A + B·Xi

where Pi is the observed proportion responding at dose Xi (usually
the log of the dose is used).

Assumptions. Observations should be independent. If you have a


large number of values for the independent variables relative to the


number of observations, as you might in an observational study,


the chi-square and goodness-of-fit statistics may not be valid.
Related procedures. Probit analysis is closely related to logistic
regression; in fact, if you choose the logit transformation, this
procedure will essentially compute a logistic regression. In general,
probit analysis is appropriate for designed experiments, whereas
logistic regression is more appropriate for observational studies.
The differences in output reflect these different emphases. The
probit analysis procedure reports estimates of effective values for
various rates of response (including median effective dose), while
the logistic regression procedure reports estimates of odds ratios
for independent variables.

To Obtain a Probit Analysis


From the menus choose:
Analyze > Regression > Probit…


 Select a response frequency variable. This variable indicates


the number of cases exhibiting a response to the test stimulus.
The values of this variable cannot be negative.
 Select a total observed variable. This variable indicates the
number of cases to which the stimulus was applied. The
values of this variable cannot be negative and cannot be less
than the values of the response frequency variable for each
case. Optionally, you can select a factor variable. If you do,
click Define Range to define the groups.
 Select one or more covariate(s). This variable contains the
level of the stimulus applied to each observation. If you want


to transform the covariate, select a transformation from the


Transform drop-down list. If no transformation is applied,
and there is a control group, then the control group is
included in the analysis.
 Select either Probit or Logit model.

Probit Analysis Define Range


This allows you to specify the levels of the factor variable that will
be analyzed. The factor levels must be coded as consecutive
integers, and all levels in the range you specify will be analyzed.

Probit Analysis Options


You can specify options for your probit analysis:


Statistics. Allows you to request the following optional statistics:
Frequencies, Relative median potency, Parallelism test, and
Fiducial confidence intervals. Fiducial confidence intervals and
Relative median potency are unavailable if you have selected more
than one covariate. Relative median potency and Parallelism test
are available only if you have selected a factor variable.
Natural Response Rate. Allows you to indicate a natural
response rate even in the absence of the stimulus. Available
alternatives are None, Calculate from data, or Value.


Criteria. Allows you to control parameters of the iterative


parameter-estimation algorithm. You can override the defaults for
maximum iterations, step limit, and optimality tolerance.

The table below shows the concentration (dose), the proportion of
insects dead, and the probit transformation of each observed
proportion.

Dose               10.2    7.7    5.1    3.8    2.6
Number observed      50     49     46     43     50
Number dead          44     42     24     16      6
Proportion dead    0.88   0.86   0.52   0.33   0.12
Probit             1.18   1.08   0.05  -0.44  -1.18

Taking the log of the dose in probit analysis, first look at the plot
of observed probits against the dose. If the plot looks linear, go
ahead. If not, change the transformation to another until the
relationship looks linear.

The regression equation from this data looks like


Probit(Pi)= -2.86+4.17 (log10 (dose i))
You may be interested to know what concentration of an agent
must be used in order to achieve a certain proportion of response.
For example:


What concentration is needed in order to kill half the insects? This
is known as the median lethal dose. It can be obtained from the
above equation by solving for the concentration that corresponds
to a probit value of 0.

log10(median lethal dose) = 2.86 / 4.17 = 0.686

Median lethal dose = 10^0.686 = 4.85

This means that in order to kill half the insects, a dose of 4.85 of
the insecticide is needed.
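Solving the fitted equation for the dose at which the probit is 0 can be checked in code, using the coefficients estimated above (A = -2.86, B = 4.17):

```python
def median_effective_dose(a, b):
    # Probit(P) = A + B*log10(dose) = 0  =>  log10(dose) = -A / B
    return 10 ** (-a / b)

ld50 = median_effective_dose(-2.86, 4.17)  # about 4.85
```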

16. Nonlinear Regression


Nonlinear regression is a method of finding a nonlinear model of
the relationship between the dependent variable and a set of
independent variables. Unlike traditional linear regression, which
is restricted to estimating linear models, nonlinear regression can
estimate models with arbitrary relationships between independent
and dependent variables. This is accomplished using iterative
estimation algorithms. Note that this procedure is not necessary
for simple polynomial models of the form Y = B0 + B1X^2. By
defining W = X^2, we get a simple linear model, Y = B0 + B1W,
which can be estimated using traditional methods such as the
Linear Regression procedure.
"Linear" in the regression model context does not mean that the
relationship is a straight line or a quadratic. It refers to the


functional form of the equation. That is, can the dependent


variable be expressed as a linear combination of parameters and
values of the independent variables?

E.g., Y = B0 + B1X^2
This can be rewritten as Y = B0 + B1W, where W = X^2.
Consider the model
Y = e^(B0 + B1X1 + E)
This model is not linear in form, but if we take the natural
logarithm of both sides,
Ln(Y) = B0 + B1X1 + E
This model is linear in its parameters, so we can use the usual
techniques to estimate them. Models that seem nonlinear but can
be transformed to linear form are sometimes called intrinsically
linear models. It is a good idea always to look for a way to make
the model linear.

Fitting the Logistic Population Growth Model


Let us consider a model for population growth, which is often
modeled using a logistic population growth model:

Yi = C / (1 + e^(A + B·ti)) + Ei

where Yi is the population size at time ti.


In order to start the nonlinear estimation algorithm, we must have


initial values for the parameters. Good initial values are important
and may provide a better solution in fewer iterations.

There are a number of ways to determine initial values for
nonlinear models. If you don't have starting values, don't just set
them all to 0. Use values in the neighborhood of what you expect
to see.

If you ignore the error term, sometimes a linear form of the model
can be derived. Linear regression can then be used to obtain initial
values.

E.g., let us see this model:

Y = e^(A + BX) + E

If we ignore the error term and take the natural log of both sides,

Ln(Y) = A + BX

We can then use linear regression to estimate A and B and specify
these values as starting values in nonlinear regression.
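The linearization trick is easy to carry out by hand. This sketch (plain Python, hypothetical noise-free data) regresses Ln(Y) on X by ordinary least squares to get starting values for A and B:

```python
import math

def log_linear_start_values(x, y):
    # OLS of ln(y) on x; the intercept and slope serve as starting
    # values for the nonlinear model Y = e^(A + B*X) + E.
    ly = [math.log(v) for v in y]
    n = len(x)
    mx, my = sum(x) / n, sum(ly) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, ly))
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Data generated (hypothetically) from Y = e^(1 + 0.5*X) with no noise:
a0, b0 = log_linear_start_values(
    [0, 1, 2, 3],
    [math.exp(1 + 0.5 * xi) for xi in [0, 1, 2, 3]])
# a0 is about 1.0 and b0 about 0.5
```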


Example. Can population be predicted based on time? A


scatterplot shows that there seems to be a strong relationship
between population and time, but the relationship is nonlinear, so it
requires the special estimation methods of the Nonlinear
Regression procedure. By setting up an appropriate equation, such
as a logistic population growth model, we can get a good estimate
of the model, allowing us to make predictions about
population for times that were not actually measured.
Statistics. For each iteration: parameter estimates and residual
sum of squares. For each model: sum of squares for regression,
residual, uncorrected total and corrected total, parameter estimates,
asymptotic standard errors, and asymptotic correlation matrix of
parameter estimates.
Data. The dependent and independent variables should be
quantitative. Categorical variables such as religion, major, or
region of residence need to be recoded to binary (dummy) variables
or other types of contrast variables.
Related procedures. Many models that appear nonlinear at first
can be transformed to a linear model, which can be analyzed using
the Linear Regression procedure. If you are uncertain what the
proper model should be, the Curve Estimation procedure can help
to identify useful functional relations in your data.


To Obtain a Nonlinear Regression Analysis


From the menus choose:
Analyze > Regression > Nonlinear…

Select one numeric dependent variable from the list of variables in
your working data file. To build a model expression, enter the
expression in the Model field or paste components (variables,
parameters, functions) into the field. Identify parameters in your
model by clicking Parameters. A segmented model (one that takes
different forms in different parts of its domain) must be specified
by using conditional logic within the single model statement.

16.1 Conditional Logic (Nonlinear Regression)


You can specify a segmented model using conditional logic. To use
conditional logic within a model expression or a loss function, you
form the sum of a series of terms, one for each condition. Each term
consists of a logical expression (in parentheses) multiplied by the
expression that should result when that logical expression is true.

For example, consider a segmented model that equals 0 for X<=0, X
for 0<X<1, and 1 for X>=1. The expression for this is:

(X<=0)*0 + (X>0 & X<1)*X + (X>=1)*1

The logical expressions in parentheses all evaluate to 1 (true) or 0
(false). Therefore:

If X<=0, the above reduces to 1*0 + 0*X + 0*1 = 0.
If 0<X<1, it reduces to 0*0 + 1*X + 0*1 = X.
If X>=1, it reduces to 0*0 + 0*X + 1*1 = 1.

More complicated examples can easily be built by substituting
different logical expressions and outcome expressions. Remember that
double inequalities, such as 0<X<1, must be written as compound
expressions, such as (X>0 & X<1).
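The same switching logic can be sketched in Python, where comparisons likewise evaluate to true (1) or false (0):

```python
def segmented(x):
    # Each parenthesized comparison is 1 (true) or 0 (false), so exactly
    # one term of the sum is "switched on" for any value of x.
    return (x <= 0) * 0 + ((x > 0) & (x < 1)) * x + (x >= 1) * 1
```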

Nonlinear Regression Parameters


Parameters are the parts of your model that the Nonlinear
Regression procedure estimates. Parameters can be additive
constants, multiplicative coefficients, exponents, or values used in


evaluating functions. All parameters that you have defined will


appear (with their initial values) on the Parameters list in the main
dialog box.

Starting Value. Allows you to specify a starting value for the


parameter, preferably as close as possible to the expected final
solution. Poor starting values can result in failure to converge or in
convergence on a solution that is local (rather than global) or is
physically impossible.
Use starting values from previous analysis. If you have
already run a nonlinear regression from this dialog box, you can
select this option to obtain the initial values of parameters from
their values in the previous run. This permits you to continue
searching when the algorithm is converging slowly. (The initial
starting values will still appear on the Parameters list in the main


dialog box.) Note: This selection persists in this dialog box for the
rest of your session. If you change the model, be sure to deselect it.

Nonlinear Regression Common Models


The table below provides example model syntax for many
published nonlinear regression models. A model selected at
random is not likely to fit your data well. Appropriate starting
values for the parameters are necessary, and some models require
constraints in order to converge.

Table 5.1 Example model syntax

Name                                     Model expression
Asymptotic Regression                    b1 + b2*exp(b3*x)
Asymptotic Regression                    b1 - (b2*(b3**x))
Density                                  (b1 + b2*x)**(-1/b3)
Gauss                                    b1*(1 - b3*exp(-b2*x**2))
Gompertz                                 b1*exp(-b2*exp(-b3*x))
Johnson-Schumacher                       b1*exp(-b2/(x + b3))
Log-Modified                             (b1 + b3*x)**b2
Log-Logistic                             b1 - ln(1 + b2*exp(-b3*x))
Metcherlich Law of Diminishing Returns   b1 + b2*exp(-b3*x)
Michaelis Menten                         b1*x/(x + b2)
Morgan-Mercer-Florin                     (b1*b2 + b3*x**b4)/(b2 + x**b4)
Peal-Reed                                b1/(1 + b2*exp(-(b3*x + b4*x**2 + b5*x**3)))
Ratio of Cubics                          (b1 + b2*x + b3*x**2 + b4*x**3)/(b5*x**3)
Ratio of Quadratics                      (b1 + b2*x + b3*x**2)/(b4*x**2)
Richards                                 b1/((1 + b3*exp(-b2*x))**(1/b4))
Verhulst                                 b1/(1 + b3*exp(-b2*x))
Von Bertalanffy                          (b1**(1 - b4) - b2*exp(-b3*x))**(1/(1 - b4))
Weibull                                  b1 - b2*exp(-b3*x**b4)
Yield Density                            (b1 + b2*x + b3*x**2)**(-1)
16.2 Nonlinear Regression Loss Function
The loss function in nonlinear regression is the function that is
minimized by the algorithm. Select either Sum of squared residuals
to minimize the sum of the squared residuals or User-defined loss
function to minimize a different function. If you select User-defined
loss function, you must define the loss function whose sum (across
all cases) should be minimized by the choice of parameter values.

• Most loss functions involve the special variable RESID_, which
represents the residual. (The default Sum of squared residuals loss
function could be entered explicitly as RESID_**2.) If you need to
use the predicted value in your loss function, it is equal to the
dependent variable minus the residual.
• It is possible to specify a conditional loss function using
conditional logic. You can either type an expression in the
User-defined loss function field or paste components of the
expression into the field. String constants must be enclosed in
quotation marks or apostrophes, and numeric constants must be typed
in American format, with the dot as a decimal delimiter.
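As a sketch of the idea outside SPSS, a user-defined loss (here the sum of absolute residuals instead of the default RESID_**2) can be minimized with a general-purpose optimizer; the data and starting values are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data from Y = exp(A + B*x) + error.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
y = np.exp(0.5 + 0.3 * x) + rng.normal(0, 0.1, x.size)

def loss(params):
    A, B = params
    resid = y - np.exp(A + B * x)   # the residual (RESID_ in SPSS terms)
    return np.sum(np.abs(resid))    # user-defined loss, summed over cases

res = minimize(loss, x0=[0.4, 0.25], method="Nelder-Mead")
A_hat, B_hat = res.x
```

A least-absolute-residuals loss like this one is less sensitive to outlying cases than the default sum of squared residuals.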

Nonlinear Regression Parameter Constraints


A constraint is a restriction on the allowable values for a
parameter during the iterative search for a solution. Linear
expressions are evaluated before a step is taken, so you can use
linear constraints to prevent steps that might result in overflows.
Nonlinear expressions are evaluated after a step is taken. Each
equation or inequality requires the following elements:
• An expression involving at least one parameter in the model. Type
the expression or use the keypad, which allows you to paste numbers,
operators, or parentheses into the expression. You can either type
in the required parameter(s) along with the rest of the expression
or paste from the Parameters list at the left. You cannot use
ordinary variables in a constraint.
• One of the three logical operators <=, =, or >=.
• A numeric constant, to which the expression is compared using
the logical operator.
Type the constant. Numeric constants must be typed in American
format, with the dot
as a decimal delimiter.
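By analogy, a parameter constraint can be sketched with a bounded optimizer; everything below, including the data and the bound B <= 0, is hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data from the logistic model Y = C / (1 + exp(A + B*t)).
t = np.arange(0, 10, dtype=float)
y = 50 / (1 + np.exp(1.5 - 0.4 * t))

def sse(params):
    C, A, B = params
    return np.sum((y - C / (1 + np.exp(A + B * t))) ** 2)

# Constraints expressed as bounds: C >= 1 and B <= 0.
bounds = [(1, None), (None, None), (None, 0)]
res = minimize(sse, x0=[55.0, 1.0, -0.3], bounds=bounds, method="L-BFGS-B")
C_hat, A_hat, B_hat = res.x
```

Here the bound on B keeps the search away from physically implausible (decreasing) growth curves, much as a linear constraint in SPSS prevents steps into bad regions of the parameter space.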

You can save a number of new variables to your active data file.
Available options are Predicted values, Residuals, Derivatives, and
Loss function values. These variables can be used in subsequent
analyses to test the fit of the model or to identify problem cases.

Nonlinear Regression Options


Options allow you to control various aspects of your nonlinear
regression analysis:
Bootstrap estimates of standard error. Requests bootstrap
estimates of the standard errors for parameters. This requires the
sequential quadratic programming algorithm.
Estimation Method. Allows you to select an estimation method, if
possible. (Certain choices in this or other dialog boxes require the
sequential quadratic programming algorithm.) Available
alternatives include Sequential quadratic programming and
Levenberg-Marquardt.
Sequential Quadratic Programming. Allows you to specify
options for this estimation method. You can enter new values for
Maximum iterations and Step limit, and you can change the
selection in the drop-down lists for Optimality tolerance, Function
precision, and Infinite step size.
Levenberg-Marquardt. Allows you to specify options for this
estimation process. You can enter new values for Maximum
iterations, and you can change the selection in the dropdown lists
for Sum-of-squares convergence and Parameter convergence.


Interpreting Nonlinear Regression Results
Nonlinear regression problems often present computational
difficulties:
• The choice of initial values for the parameters influences
convergence. Try to choose initial values that are reasonable and, if
possible, close to the expected final solution.
• Sometimes one algorithm performs better than the other on a
particular problem. In the Options dialog box, select the other
algorithm if it is available. (If you specify a loss function or certain
types of constraints, you cannot use the Levenberg-Marquardt
algorithm.)
• When iteration stops only because the maximum number of
iterations has occurred, the "final" model is probably not a good
solution. Select Use starting values from previous analysis in the
Parameters dialog box to continue the iteration or, better yet,
choose different initial values.
• Models that require exponentiation of or by large data values can
cause overflows or underflows (numbers too large or too small for
the computer to represent). Sometimes you can avoid these by
suitable choice of initial values or by imposing constraints on the
parameters.

Chapter Six

Factor Analysis

Factor analysis attempts to identify underlying variables, or
factors, that explain the pattern of correlations within a set of
observed variables. Factor analysis is often used in data reduction
to identify a small number of factors that explain most of the
variance observed in a much larger number of manifest variables.
Factor analysis can also be used to generate hypotheses regarding
causal mechanisms or to screen variables for subsequent analysis
(for example, to identify collinearity prior to performing a linear
regression analysis).

The factor analysis procedure offers a high degree of flexibility:

 Seven methods of factor extraction are available.


 Five methods of rotation are available, including direct oblimin
and promax for nonorthogonal rotations.
 Three methods of computing factor scores are available, and
scores can be saved as variables for further analysis.

Example. What underlying attitudes lead people to respond to the


questions on a political survey as they do? Examining the
correlations among the survey items reveals that there is significant
overlap among various subgroups of items--questions about taxes
tend to correlate with each other, questions about military issues
correlate with each other, and so on. With factor analysis, you can
investigate the number of underlying factors and, in many cases,
you can identify what the factors represent conceptually.
Additionally, you can compute factor scores for each respondent,
which can then be used in subsequent analyses. For example, you
might build a logistic regression model to predict voting behavior
based on factor scores.

Statistics. For each variable: number of valid cases, mean, and


standard deviation. For each factor analysis: correlation matrix of
variables, including significance levels, determinant, and inverse;
reproduced correlation matrix, including anti-image; initial
solution (communalities, eigenvalues, and percentage of variance
explained); Kaiser-Meyer-Olkin measure of sampling adequacy and
Bartlett's test of sphericity; unrotated solution, including factor
loadings, communalities, and eigenvalues; rotated solution,
including rotated pattern matrix and transformation matrix; for
oblique rotations: rotated pattern and structure matrices; factor
score coefficient matrix and factor covariance matrix. Plots: scree
plot of eigenvalues and loading plot of first two or three factors.

To obtain a factor analysis, from the menus choose:

Analyze >Data Reduction > Factor...

Select the variables for the factor analysis.

Descriptives

Statistics. Univariate statistics include the mean, standard
deviation, and number of valid cases for each variable. Initial
solution displays initial communalities, eigenvalues, and the
percentage of variance explained.

Correlation Matrix. The available options are coefficients,


significance levels, determinant, KMO and Bartlett's test of
sphericity, inverse, reproduced, and anti-image.

 KMO and Bartlett's Test of Sphericity. The Kaiser-Meyer-


Olkin measure of sampling adequacy tests whether the partial
correlations among variables are small. Bartlett's test of
sphericity tests whether the correlation matrix is an identity
matrix, which would indicate that the factor model is
inappropriate.
 Reproduced. The estimated correlation matrix from the factor
solution. Residuals (difference between estimated and
observed correlations) are also displayed.
 Anti-image. The anti-image correlation matrix contains the
negatives of the partial correlation coefficients, and the
anti-image covariance matrix contains the negatives of the partial
covariances. In a good factor model, most of the off-diagonal
elements will be small. The measure of sampling adequacy for a
variable is displayed on the diagonal of the anti-image correlation
matrix.

Extraction

Method. Allows you to specify the method of factor extraction.


Available methods are principal components, unweighted least
squares, generalized least squares, maximum likelihood, principal
axis factoring, alpha factoring, and image factoring.

 Principal Components Analysis. A factor extraction method


used to form uncorrelated linear combinations of the
observed variables. The first component has maximum
variance. Successive components explain progressively
smaller portions of the variance and are all uncorrelated with
each other. Principal components analysis is used to obtain
the initial factor solution. It can be used when a correlation
matrix is singular.
 Unweighted Least-Squares Method. A factor extraction method that
minimizes the sum of the squared differences between the observed
and reproduced correlation matrices, ignoring the diagonals.
 Generalized Least-Squares Method. A factor extraction
method that minimizes the sum of the squared differences
between the observed and reproduced correlation matrices.
Correlations are weighted by the inverse of their uniqueness,
so that variables with high uniqueness are given less weight
than those with low uniqueness.
 Maximum-Likelihood Method. A factor extraction method
that produces parameter estimates that are most likely to have
produced the observed correlation matrix if the sample is
from a multivariate normal distribution. The correlations are
weighted by the inverse of the uniqueness of the variables,
and an iterative algorithm is employed.
 Principal Axis Factoring. A method of extracting factors
from the original correlation matrix with squared multiple
correlation coefficients placed in the diagonal as initial
estimates of the communalities. These factor loadings are
used to estimate new communalities that replace the old
communality estimates in the diagonal. Iterations continue
until the changes in the communalities from one iteration to
the next satisfy the convergence criterion for extraction.


 Alpha. A factor extraction method that considers the variables in
the analysis to be a sample from the universe of potential
variables. It maximizes the alpha reliability of the factors.
 Image Factoring. A factor extraction method developed by
Guttman and based on image theory. The common part of the
variable, called the partial image, is defined as its linear
regression on remaining variables, rather than a function of
hypothetical factors.
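To make extraction concrete, here is a minimal numerical sketch (outside SPSS, with simulated data) of the principal-components step: eigenvalues of the correlation matrix give the variance explained by each component, and the unrotated loadings are the eigenvectors scaled by the square roots of the eigenvalues:

```python
import numpy as np

# Simulate four observed variables driven by one underlying factor.
rng = np.random.default_rng(0)
f = rng.normal(size=200)
X = np.column_stack([f + rng.normal(0, 0.5, 200) for _ in range(4)])

R = np.corrcoef(X, rowvar=False)          # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)      # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pct_variance = 100 * eigvals / eigvals.sum()   # % variance per component
loadings = eigvecs * np.sqrt(eigvals)          # unrotated factor loadings
```

Because all four variables share one underlying factor, the first component accounts for most of the variance, which is exactly the pattern a scree plot would reveal.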

Rotation

Method. Allows you to select the method of factor rotation.
Available methods are varimax, direct oblimin, quartimax, equamax,
or promax.


 Varimax Method. An orthogonal rotation method that minimizes the
number of variables that have high loadings on each factor. It
simplifies the interpretation of the factors.
 Direct Oblimin Method. A method for oblique
(nonorthogonal) rotation. When delta equals 0 (the default),
solutions are most oblique. As delta becomes more negative,
the factors become less oblique. To override the default delta
of 0, enter a number less than or equal to 0.8.
 Quartimax Method. A rotation method that minimizes the
number of factors needed to explain each variable. It
simplifies the interpretation of the observed variables.
 Equamax Method. A rotation method that is a combination
of the varimax method, which simplifies the factors, and the
quartimax method, which simplifies the variables. The
number of variables that load highly on a factor and the
number of factors needed to explain a variable are
minimized.
 Promax Rotation. An oblique rotation, which allows factors
to be correlated. It can be calculated more quickly than a
direct oblimin rotation, so it is useful for large datasets.


Options

Missing Values. Allows you to specify how missing values are
handled. The available alternatives are to exclude cases listwise,
exclude cases pairwise, or replace with mean.

Coefficient Display Format. Allows you to control aspects of the
output matrices. You can sort coefficients by size and suppress
coefficients with absolute values less than the specified value.


Chapter Seven

Discriminant Analysis

Discriminant function analysis, a.k.a. discriminant analysis or DA,
is used to classify cases into the values of a categorical
dependent, usually a dichotomy. If discriminant function analysis is
effective for a set of data, the classification table of correct and
incorrect estimates will yield a high percentage correct. Multiple
discriminant function analysis (MDA) is used when the dependent has
three or more categories.

There are several purposes for DA:


 To classify cases into groups using a discriminant prediction
equation.
 To investigate independent variable mean differences
between groups formed by the dependent variable.
 To determine the percent of variance in the dependent
variable explained by the independents.
 To determine the percent of variance in the dependent
variable explained by the independents over and above the
variance accounted for by control variables, using sequential
discriminant analysis.
 To assess the relative importance of the independent
variables in classifying the dependent variable.
 To discard variables which are little related to group
distinctions.
 To test theory by observing whether cases are classified as
predicted.

There are additional purposes of MDA:

 To determine the most parsimonious way (the fewest dimensions) to
distinguish between groups.
 To infer the meaning of the dimensions which distinguish
groups, based on discriminant loadings.


with measurements for the predictor variables but unknown group
membership.

Note: The grouping variable can have more than two values. The
codes for the grouping variable must be integers, however, and you
need to specify their minimum and maximum values. Cases with
values outside of these bounds are excluded from the analysis.

Example. On average, people in temperate zone countries consume


more calories per day than those in the tropics, and a greater
proportion of the people in the temperate zones are city dwellers. A
researcher wants to combine this information in a function to
determine how well an individual can discriminate between the
two groups of countries. The researcher thinks that population size
and economic information may also be important. Discriminant
analysis allows you to estimate coefficients of the linear
discriminant function, which looks like the right side of a multiple
linear regression equation. That is, using coefficients a, b, c, and d,
the function is:

D = a * climate + b * urban + c * population + d * gross domestic
product per capita

If these variables are useful for discriminating between the two
climate zones, the values of D will differ for the temperate and
tropic countries. If you use a stepwise variable selection method,
you may find that you do not need to include all four variables in
the function.
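This example can be sketched with simulated data (the calorie and urbanization figures are hypothetical, and scikit-learn's LinearDiscriminantAnalysis stands in for the SPSS procedure):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two hypothetical groups of countries: 0 = tropical, 1 = temperate.
rng = np.random.default_rng(2)
calories = np.concatenate([rng.normal(2200, 150, 50), rng.normal(2900, 150, 50)])
urban = np.concatenate([rng.normal(35, 10, 50), rng.normal(65, 10, 50)])
X = np.column_stack([calories, urban])
group = np.repeat([0, 1], 50)

# Fit the linear discriminant function D and check how many cases
# are classified into the correct climate zone.
lda = LinearDiscriminantAnalysis().fit(X, group)
accuracy = lda.score(X, group)
```

The fitted coefficients play the role of a and b in the function D above, and the classification accuracy corresponds to the percentage-correct entry of the classification table.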

Statistics. For each variable: means, standard deviations, univariate


ANOVA. For each analysis: Box's M, within-groups correlation
matrix, within-groups covariance matrix, separate-groups
covariance matrix, total covariance matrix. For each canonical
discriminant function: eigenvalue, percentage of variance,
canonical correlation, Wilks' lambda, chi-square. For each step:
prior probabilities, Fisher's function coefficients, unstandardized
function coefficients, Wilks' lambda for each canonical function.

From the menus choose:

Analyze > Classify > Discriminant...


Select an integer-valued grouping variable and click Define Range to
specify the categories of interest. Select the independent, or
predictor, variables. (If your grouping variable does not have
integer values, Automatic Recode on the Transform menu will create
one that does.)

Select the method for entering the independent variables.

 Enter independents together. Forced-entry method. All


independent variables that satisfy tolerance criteria are
entered simultaneously.
 Use stepwise method. Uses stepwise analysis to control
variable entry and removal.

Optionally, you can select cases with a selection variable.


Define Range

Specify the minimum and maximum value of the grouping variable for
the analysis. Cases with values outside of this range are not used
in the discriminant analysis but are classified into one of the
existing groups based on the results of the analysis. The minimum
and maximum values must be integers.

Select Cases

To select cases for your analysis, in the main dialog box click
Select, choose a selection variable, and click Value to enter an
integer as the selection value. Only cases with that value for the
selection variable are used to derive the discriminant functions.

Statistics and classification results are generated for both
selected and unselected cases. This provides a mechanism for
classifying new cases based on previously existing data or for
partitioning your data into training and testing subsets to perform
validation on the model generated.

Statistics

Descriptives. Available options are means (including standard


deviations), univariate ANOVAs, and Box's M test.

 Means. Displays total and group means, and standard


deviations for the independent variables.
 Univariate ANOVAs. Performs a one-way analysis of
variance test for equality of group means for each
independent variable.
 Box's M. A test for the equality of the group covariance
matrices. For sufficiently large samples, a nonsignificant p
value means there is insufficient evidence that the matrices
differ. The test is sensitive to departures from multivariate
normality.


 Function Coefficients. Available options are Fisher's
classification coefficients and unstandardized coefficients.
 Fisher's. Displays Fisher's classification function coefficients
that can be used directly for classification. A set of
coefficients is obtained for each group, and a case is assigned
to the group for which it has the largest discriminant score.
 Unstandardized. Displays the unstandardized discriminant
function coefficients.
 Matrices. Available matrices of coefficients for independent
variables are within-groups correlation matrix, within-groups
covariance matrix, separate-groups covariance matrix, and
total covariance matrix.
 Within-groups correlation. Displays a pooled within-groups
correlation matrix that is obtained by averaging the separate
covariance matrices for all groups before computing the
correlations.
 Within-groups covariance. Displays a pooled within-groups
covariance matrix, which may differ from the total
covariance matrix. The matrix is obtained by averaging the
separate covariance matrices for all groups.
 Separate-groups covariance. Displays separate covariance
matrices for each group.


 Total covariance. Displays a covariance matrix from all cases as if
they were from a single sample.

Stepwise Method

Method. Select the statistic to be used for entering or removing


new variables. Available alternatives are Wilks' lambda,
unexplained variance, Mahalanobis' distance, smallest F ratio, and
Rao's V. With Rao's V, you can specify the minimum increase in V
for a variable to enter.

 Wilks' lambda. A variable selection method for stepwise
discriminant analysis that chooses variables for entry into the
equation on the basis of how much they lower Wilks' lambda. At each
step, the variable that minimizes the overall Wilks' lambda is
entered.
 Unexplained variance. At each step, the variable that
minimizes the sum of the unexplained variation between
groups is entered.
 Mahalanobis distance. A measure of how much a case's
values on the independent variables differ from the average
of all cases. A large Mahalanobis distance identifies a case as
having extreme values on one or more of the independent
variables.
 Smallest F ratio. A method of variable selection in stepwise
analysis based on maximizing an F ratio computed from the
Mahalanobis distance between groups.
 Rao's V. A measure of the differences between group means.
Also called the Lawley-Hotelling trace. At each step, the
variable that maximizes the increase in Rao's V is entered.
After selecting this option, enter the minimum value a
variable must have to enter the analysis.

Criteria. Available alternatives are Use F value and Use probability
of F. Enter values for entering and removing variables.

 Use F value. A variable is entered into the model if its F value is
greater than the Entry value, and is removed if the F value is less
than the Removal value. Entry must be greater than Removal, and both
values must be positive. To enter more variables into the model,
lower the Entry value. To remove more variables from the model,
increase the Removal value.
 Use probability of F. A variable is entered into the model if
the significance level of its F value is less than the Entry
value, and is removed if the significance level is greater than
the Removal value. Entry must be less than Removal and
both values must be positive. To enter more variables into the
model, increase the Entry value. To remove more variables
from the model, lower the Removal value.


Chapter Eight

Cluster Analysis

Cluster analysis classifies a set of observations into two or more
mutually exclusive, unknown groups based on combinations of interval
variables. The purpose of cluster analysis is to discover a system
for organizing observations, usually people, into groups whose
members share properties in common. It is cognitively easier for
people to predict the behavior or properties of people or objects
based on membership in a group whose members share similar
properties. It is generally cognitively difficult to deal with
individuals and to predict behavior or properties from observations
of other behaviors or properties.

For example, a person might wish to predict how an animal would
respond to an invitation to go for a walk. He or she could be given
information about the size and weight of the animal, its top speed,
the average number of hours it spends sleeping per day, and so
forth, and then combine that information into a prediction of
behavior. Alternatively, the person could be told that the animal is
either a cat or a dog. The latter information allows a much broader
range of behaviors to be predicted. The trick in cluster analysis is
to collect information and combine it in ways that allow
classification into useful groups, such as dog or cat.

Cluster analysis classifies unknown groups, while discriminant
function analysis classifies known groups. The procedure for doing a
discriminant function analysis is well established; there are few
options, other than the type of output, that need to be specified.
Cluster analysis, on the other hand, allows many choices about the
nature of the algorithm for combining groups. Each choice may result
in a different grouping structure.

Cluster analyses can be performed using the TwoStep, Hierarchical,
or K-Means Cluster Analysis procedures. Each procedure employs a
different algorithm for creating clusters, and each has options not
available in the others.

TwoStep Cluster Analysis. For many applications, the TwoStep
Cluster Analysis procedure will be the method of choice. It provides
the following unique features:

 Automatic selection of the best number of clusters, in addition to
measures for choosing between cluster models.
 Ability to create cluster models simultaneously based on
categorical and continuous variables.


 Ability to save the cluster model to an external XML file, then
read that file and update the cluster model using newer data.

Additionally, the TwoStep Cluster Analysis procedure can analyze


large data files.

Hierarchical Cluster Analysis. The Hierarchical Cluster Analysis


procedure is limited to smaller data files (hundreds of objects to be
clustered), but has the following unique features:

 Ability to cluster cases or variables.


 Ability to compute a range of possible solutions and save
cluster memberships for each of those solutions.
 Several methods for cluster formation, variable
transformation, and measuring the dissimilarity between
clusters.

As long as all the variables are of the same type, the Hierarchical
Cluster Analysis procedure can analyze interval (continuous),
count, or binary variables.

K-Means Cluster Analysis. The K-Means Cluster Analysis


procedure is limited to continuous data and requires you to specify

SPSS trainning manual, Yohannes Tilahun 194


Ethiopian Agricultural Research Organization

the number of clusters in advance, but it has the following unique


features:

 Ability to save distances from cluster centers for each object.


 Ability to read initial cluster centers from and save final
cluster centers to an external SPSS file.

Additionally, the K-Means Cluster Analysis procedure can analyze


large data files.

1. K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.
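The iterate-and-classify method can be made concrete outside SPSS. The following is a minimal pure-Python sketch of the k-means loop (Lloyd's algorithm), not SPSS's exact implementation; the data points and the starting centers are made up for illustration.

```python
import math

def euclidean(a, b):
    # Simple Euclidean distance, as the K-Means procedure uses.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(cases, centers, max_iter=10):
    """Minimal iterate-and-classify loop: assign each case to the
    nearest center, then recompute each center as the mean of its
    cluster, until the centers stop changing."""
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for case in cases:
            i = min(range(len(centers)), key=lambda j: euclidean(case, centers[j]))
            clusters[i].append(case)
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Two obvious groups in two dimensions (hypothetical data), k = 2.
cases = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (7.8, 8.2), (8.1, 7.9)]
centers, clusters = kmeans(cases, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

Each pass assigns cases to the nearest center and moves each center to the mean of its cluster, which corresponds to the Iterate and classify option; Classify only corresponds to running the assignment step once without updating the centers.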

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

From the menus choose:

Analyze > Classify > K-Means Cluster...

Select the variables to be used in the cluster analysis. Specify the number of clusters. The number of clusters must be at least two and must not be greater than the number of cases in the data file. Select either Iterate and classify or Classify only. Optionally, you can select an identification variable to label cases.

Data. Variables should be quantitative at the interval or ratio level. If your variables are binary or counts, use the Hierarchical Cluster Analysis procedure.

Assumptions. Distances are computed using simple Euclidean distance. If you want to use another distance or similarity measure, use the Hierarchical Cluster Analysis procedure. Scaling of variables is an important consideration--if your variables are measured on different scales (for example, one variable is expressed in dollars and another is expressed in years), your results may be misleading. In such cases, you should consider standardizing your variables before you perform the k-means cluster analysis (this can be done in the Descriptives procedure). The procedure assumes that you have selected the appropriate number of clusters and that you have included all relevant variables. If you have chosen an inappropriate number of clusters or omitted important variables, your results may be misleading.
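The z-score standardization that the Descriptives procedure performs is simple enough to show directly. A sketch, assuming at least two cases; the variable name and values are hypothetical.

```python
import math

def z_scores(values):
    """Standardize a variable to mean 0 and standard deviation 1,
    as Descriptives does when you save standardized values."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 denominator), as SPSS uses.
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

income_dollars = [25000.0, 40000.0, 55000.0]   # hypothetical variable
print(z_scores(income_dollars))                # → [-1.0, 0.0, 1.0]
```

After this transformation a variable measured in dollars and one measured in years contribute on the same scale, so neither dominates the Euclidean distances.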

Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

 Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.
 ANOVA table. Displays an analysis-of-variance table that includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.
 Cluster information for each case. Displays for each case the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

 Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.
 Exclude cases pairwise. Assigns cases to clusters based on distances computed from all variables with nonmissing values.

2. TwoStep Cluster Analysis

The TwoStep Cluster Analysis procedure is an exploratory tool designed to reveal natural groupings (or clusters) within a data set that would otherwise not be apparent. The algorithm employed by this procedure has several desirable features that differentiate it from traditional clustering techniques:

 Handling of categorical and continuous variables. By assuming variables to be independent, a joint multinomial-normal distribution can be placed on categorical and continuous variables. See the TwoStep Cluster Assumptions for more information.
 Automatic selection of number of clusters. By comparing the values of a model-choice criterion across different clustering solutions, the procedure can automatically determine the optimal number of clusters.
 Scalability. By constructing a cluster features (CF) tree that summarizes the records, the TwoStep algorithm allows you to analyze large data files.

Example. Retail and consumer product companies regularly apply clustering techniques to data that describe their customers' buying habits, gender, age, income level, etc. These companies tailor their marketing and product development strategies to each consumer group to increase sales and build brand loyalty.

Statistics. The procedure produces information criteria (AIC or BIC) by number of clusters in the solution, cluster frequencies for the final clustering, and descriptive statistics by cluster for the final clustering.

Plots. The procedure produces bar charts of cluster frequencies, pie charts of cluster frequencies, and variable importance charts.

Distance Measure. This selection determines how the similarity between two clusters is computed.

 Log-likelihood. The likelihood measure places a probability distribution on the variables. Continuous variables are assumed to be normally distributed, while categorical variables are assumed to be multinomial. All variables are assumed to be independent.
 Euclidean. The Euclidean measure is the "straight line" distance between two clusters. It can be used only when all of the variables are continuous.

Number of Clusters. This selection allows you to specify how the number of clusters is to be determined.

 Determine automatically. The procedure will automatically determine the "best" number of clusters, using the criterion specified in the Clustering Criterion group. Optionally, enter a positive integer specifying the maximum number of clusters that the procedure should consider.
 Specify fixed. Allows you to fix the number of clusters in the solution. Enter a positive integer.

Count of Continuous Variables. This group provides a summary of the continuous variable standardization specifications made in the Options dialog box.

Clustering Criterion. This selection determines how the automatic clustering algorithm determines the number of clusters. Either the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) can be specified.

From the menus choose:

Analyze > Classify > TwoStep Cluster...

Select one or more categorical or continuous variables.

Optionally, you can:

 Adjust the criteria by which clusters are constructed.
 Select settings for noise handling, memory allocation, variable standardization, and cluster model input.
 Request optional tables and plots.
 Save model results to the working file or to an external XML file.

Data. This procedure works with both continuous and categorical variables. Cases represent objects to be clustered, and the variables represent attributes upon which the clustering is based.

Assumptions. The likelihood distance measure assumes that variables in the cluster model are independent. Further, each continuous variable is assumed to have a normal (Gaussian) distribution, and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but you should try to be aware of how well these assumptions are met.

Use the Bivariate Correlations procedure to test the independence of two continuous variables. Use the Crosstabs procedure to test the independence of two categorical variables. Use the Means procedure to test the independence between a continuous variable and a categorical variable. Use the Explore procedure to test the normality of a continuous variable. Use the Chi-Square Test procedure to test whether a categorical variable has a specified multinomial distribution.

Options

Outlier Treatment. This group allows you to treat outliers specially during clustering if the cluster features (CF) tree fills. The CF tree is full if it cannot accept any more cases in a leaf node and no leaf node can be split.

 If you select noise handling and the CF tree fills, it will be regrown after placing cases in sparse leaves into a "noise" leaf. A leaf is considered sparse if it contains fewer than the specified percentage of cases of the maximum leaf size. After the tree is regrown, the outliers will be placed in the CF tree if possible. If not, the outliers are discarded.
 If you do not select noise handling and the CF tree fills, it will be regrown using a larger distance change threshold. After final clustering, values that cannot be assigned to a cluster are labeled outliers. The outlier cluster is given an identification number of –1 and is not included in the count of the number of clusters.

Memory Allocation. This group allows you to specify the maximum amount of memory in megabytes (MB) that the cluster algorithm should use. If the procedure exceeds this maximum, it will use the disk to store information that will not fit in memory. Specify a number greater than or equal to 4.

 Consult your system administrator for the largest value that you can specify on your system.
 The algorithm may fail to find the correct or desired number of clusters if this value is too low.

Variable standardization. The clustering algorithm works with standardized continuous variables. Any continuous variables that are not standardized should be left as variables "To be Standardized." To save some time and computational effort, you can select any continuous variables that you have already standardized as variables "Assumed Standardized."

Plots

Within cluster percentage chart. Displays charts showing the within-cluster variation of each variable. For each categorical variable, a clustered bar chart is produced, showing the category frequency by cluster ID. For each continuous variable, an error bar chart is produced, showing error bars by cluster ID.

Cluster pie chart. Displays a pie chart showing the percentage and counts of observations within each cluster.

Variable Importance Plot. Displays several different charts showing the importance of each variable within each cluster. The output is sorted by the importance rank of each variable.

 Rank Variables. This option determines whether plots will be created for each cluster (By cluster) or for each variable (By variable).
 Importance Measure. This option allows you to select which measure of variable importance to plot. Chi-square or t-test of significance reports a Pearson chi-square statistic as the importance of a categorical variable and a t statistic as the importance of a continuous variable. Significance reports one minus the p value for the test of equality of means for a continuous variable, and of the expected frequency with the overall data set for a categorical variable.
 Confidence level. This option allows you to set the confidence level for the test of equality of a variable's distribution within a cluster versus the variable's overall distribution. Specify a number less than 100 and greater than or equal to 50. The value of the confidence level is shown as a vertical line in the variable importance plots if the plots are created by variable or if the significance measure is plotted.
 Omit insignificant variables. Variables that are not significant at the specified confidence level are not displayed in the variable importance plots.

3. Hierarchical Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.

To obtain, from the menus choose:

Analyze > Classify > Hierarchical Cluster...

If you are clustering cases, select at least one numeric variable. If you are clustering variables, select at least three numeric variables. Optionally, you can select an identification variable to label cases.

Data. The variables can be quantitative, binary, or count data. Scaling of variables is an important issue--differences in scaling may affect your cluster solution(s). If your variables have large differences in scaling (for example, one variable is measured in dollars and the other is measured in years), you should consider standardizing them (this can be done automatically by the Hierarchical Cluster Analysis procedure).

Assumptions. The distance or similarity measures used should be appropriate for the data analyzed (see the Proximities procedure for more information on choices of distance and similarity measures). Also, you should include all relevant variables in your analysis. Omission of influential variables can result in a misleading solution. Because hierarchical cluster analysis is an exploratory method, results should be treated as tentative until they are confirmed with an independent sample.

Method

Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

 Interval data. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.
 Count data. Available alternatives are chi-square measure and phi-square measure.
 Binary data. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or values before computing proximities (not available for binary data). Available standardization methods are z scores, range –1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

Measures for Interval Data

The following dissimilarity measures are available for interval data:

 Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.
 Squared Euclidean distance. The sum of the squared differences between the values for the items.
 Pearson correlation. The product-moment correlation between two vectors of values.
 Cosine. The cosine of the angle between two vectors of values.
 Chebychev. The maximum absolute difference between the values for the items.
 Block. The sum of the absolute differences between the values for the items. Also known as Manhattan distance.
 Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.
 Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
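The first few of these definitions translate directly into code. A plain-Python sketch (the two example vectors are invented; note that Minkowski with p = 2 reduces to Euclidean distance and with p = 1 to block distance):

```python
def euclidean(x, y):
    # Square root of the sum of squared differences.
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def squared_euclidean(x, y):
    # Sum of squared differences.
    return sum((a - b) ** 2 for a, b in zip(x, y))

def chebychev(x, y):
    # Maximum absolute difference.
    return max(abs(a - b) for a, b in zip(x, y))

def block(x, y):
    # Sum of absolute differences (Manhattan distance).
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # pth root of the sum of absolute differences to the pth power.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(euclidean(x, y), squared_euclidean(x, y), chebychev(x, y), block(x, y))
```

With the vectors above, the componentwise differences are (3, 4, 0), so Euclidean distance is 5, squared Euclidean is 25, Chebychev is 4, and block is 7.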

Measures for Count Data

The following dissimilarity measures are available for count data:

 Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.
 Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
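A sketch of the two count-data measures, reading the chi-square statistic off the 2 x k table formed by two frequency profiles. This follows the standard chi-square test of equality for two sets of frequencies rather than any SPSS-internal code, and the example counts are invented.

```python
import math

def chi_square_measure(x, y):
    """Square root of the chi-square statistic from the 2 x k table
    whose rows are the two frequency vectors x and y."""
    n = sum(x) + sum(y)
    chi2 = 0.0
    for xi, yi in zip(x, y):
        col = xi + yi
        ex = sum(x) * col / n   # expected frequency, row x
        ey = sum(y) * col / n   # expected frequency, row y
        chi2 += (xi - ex) ** 2 / ex + (yi - ey) ** 2 / ey
    return math.sqrt(chi2)

def phi_square_measure(x, y):
    # Chi-square measure normalized by the square root of the
    # combined frequency.
    return chi_square_measure(x, y) / math.sqrt(sum(x) + sum(y))

x, y = [10, 20, 30], [30, 20, 10]  # made-up count profiles
print(chi_square_measure(x, y), phi_square_measure(x, y))
```

The normalization in the phi-square measure makes profiles comparable across data sets of different total size.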

Measures for Binary Data

The following dissimilarity measures are available for binary data:

 Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.
 Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.
 Size difference. An index of asymmetry. It ranges from 0 to 1.
 Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/(n**2), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.
 Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.
 Dispersion. This similarity index has a range of -1 to 1.
 Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

 Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.
 Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of -1 to 1.
 Lambda. This index is Goodman and Kruskal's lambda. Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.
 Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.
 Dice. This is an index in which joint absences are excluded from consideration, and matches are weighted double. Also known as the Czekanowski or Sorensen measure.
 Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.
 Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

 Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.
 Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.
 Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)
 Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.
 Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

 Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.
 Sokal and Sneath 1. This is an index in which double weight is given to matches.
 Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.
 Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.
 Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.
 Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

 Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.
 Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
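All of the binary measures above are functions of the fourfold table: for a pair of items, a counts cases where both are present, b and c where one is present and the other absent, and d where both are absent. A sketch of the table and three of the measures; the two example vectors are invented.

```python
def fourfold(u, v, present=1):
    """Cross-tabulate two binary vectors into the counts
    (a, b, c, d) that all the binary measures are built from."""
    a = sum(1 for x, y in zip(u, v) if x == present and y == present)
    b = sum(1 for x, y in zip(u, v) if x == present and y != present)
    c = sum(1 for x, y in zip(u, v) if x != present and y == present)
    d = sum(1 for x, y in zip(u, v) if x != present and y != present)
    return a, b, c, d

def binary_euclidean(u, v):
    a, b, c, d = fourfold(u, v)
    return (b + c) ** 0.5              # SQRT(b+c)

def simple_matching(u, v):
    a, b, c, d = fourfold(u, v)
    return (a + d) / (a + b + c + d)   # matches / total

def jaccard(u, v):
    a, b, c, d = fourfold(u, v)
    return a / (a + b + c)             # joint absences (d) excluded

u = [1, 1, 0, 0, 1, 0]
v = [1, 0, 0, 1, 1, 0]
print(fourfold(u, v), simple_matching(u, v), jaccard(u, v))
```

Comparing simple matching with Jaccard on the same pair shows the practical effect of excluding joint absences: items that are both rare look more similar under simple matching than under Jaccard.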

Transform Values

The following alternatives are available for transforming values:

 Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.
 Range -1 to 1. Each value for the item being standardized is divided by the range of the values.
 Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.
 Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.
 Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.
 Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
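Three of these transformations, sketched in Python. The data values are invented, and the sketch assumes a nonzero range, maximum, and mean so the divisions are defined.

```python
def range_0_to_1(values):
    # Subtract the minimum, then divide by the range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def max_magnitude_1(values):
    # Divide each value by the maximum of the values.
    m = max(values)
    return [v / m for v in values]

def mean_of_1(values):
    # Divide each value by the mean of the values.
    mean = sum(values) / len(values)
    return [v / mean for v in values]

data = [2.0, 4.0, 6.0, 8.0]
print(range_0_to_1(data))     # smallest value maps to 0.0, largest to 1.0
print(max_magnitude_1(data))  # → [0.25, 0.5, 0.75, 1.0]
print(mean_of_1(data))        # → [0.4, 0.8, 1.2, 1.6]
```

Which transformation is appropriate depends on the measure: ratio-style rescalings such as mean of 1 preserve relative magnitudes, while range 0 to 1 forces every variable onto the same interval.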

Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

Chapter Nine

Reliability Analysis

Reliability analysis allows you to study the properties of measurement scales and the items that make them up. The Reliability Analysis procedure calculates a number of commonly used measures of scale reliability and also provides information about the relationships between individual items in the scale. Intraclass correlation coefficients can be used to compute interrater reliability estimates.

Example. Does my questionnaire measure customer satisfaction in a useful way? Using reliability analysis, you can determine the extent to which the items in your questionnaire are related to each other, you can get an overall index of the repeatability or internal consistency of the scale as a whole, and you can identify problem items that should be excluded from the scale.

Statistics. Descriptives for each variable and for the scale, summary statistics across items, inter-item correlations and covariances, reliability estimates, ANOVA table, intraclass correlation coefficients, Hotelling's T-square, and Tukey's test of additivity.

Models. The following models of reliability are available:

 Alpha (Cronbach). This is a model of internal consistency, based on the average inter-item correlation.
 Split-half. This model splits the scale into two parts and examines the correlation between the parts.
 Guttman. This model computes Guttman's lower bounds for true reliability.
 Parallel. This model assumes that all items have equal variances and equal error variances across replications.
 Strict parallel. This model makes the assumptions of the parallel model and also assumes equal means across items.
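Coefficient alpha itself is a short formula: for k items, alpha = k/(k-1) * (1 - (sum of the item variances) / (variance of the total score)). A sketch with a made-up questionnaire of three items scored by five respondents; the numbers are purely illustrative.

```python
def variance(xs):
    # Sample variance (n - 1 denominator).
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    """items: one list of scores per scale item, aligned by case.
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - sum(variance(i) for i in items) / variance(totals))

# Hypothetical questionnaire: 3 items, 5 respondents.
items = [
    [3, 4, 5, 2, 4],
    [3, 5, 5, 1, 4],
    [2, 4, 4, 2, 3],
]
alpha = cronbach_alpha(items)
print(round(alpha, 3))  # prints 0.942
```

An alpha close to 1, as here, indicates that the items rise and fall together across respondents, i.e., high internal consistency.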

Data. Data can be dichotomous, ordinal, or interval, but they should be coded numerically.

Assumptions. Observations should be independent, and errors should be uncorrelated between items. Each pair of items should have a bivariate normal distribution. Scales should be additive, so that each item is linearly related to the total score.

Related procedures. If you want to explore the dimensionality of your scale items (to see if more than one construct is needed to account for the pattern of item scores), use Factor Analysis or Multidimensional Scaling. To identify homogeneous groups of variables, you can use Hierarchical Cluster Analysis to cluster variables.

From the menus choose:

Analyze > Scale > Reliability Analysis...

Select two or more variables as potential components of an additive scale.

Choose a model from the Model drop-down list.

Statistics

You can select various statistics describing your scale and items.
Statistics reported by default include the number of cases, the
number of items, and reliability estimates as follows:

 Alpha models: Coefficient alpha. For dichotomous data, this is equivalent to the Kuder-Richardson 20 (KR20) coefficient.
 Split-half models: Correlation between forms, Guttman split-half reliability, Spearman-Brown reliability (equal and unequal length), and coefficient alpha for each half.
 Guttman models: Reliability coefficients lambda 1 through lambda 6.
 Parallel and Strictly parallel models: Test for goodness of fit of the model; estimates of error variance, common variance, and true variance; estimated common inter-item correlation; estimated reliability; and unbiased estimate of reliability.
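The split-half statistics are related by a simple formula: given the correlation r between the two half-scales, the Spearman-Brown (equal-length) estimate of the full-scale reliability is 2r/(1+r). A one-function sketch; the correlation value is made up.

```python
def spearman_brown(r):
    # Step up a half-test correlation to an estimate of
    # full-test reliability (equal-length form).
    return 2 * r / (1 + r)

print(spearman_brown(0.6))  # ≈ 0.75
```

The step-up reflects that a full-length scale is more reliable than either of its halves, which is why the raw half-scale correlation alone understates the scale's reliability.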

SPSS trainning manual, Yohannes Tilahun 225


Ethiopian Agricultural Research Organization

Descriptives for. Produces descriptive statistics for scales or items across cases. Available options are Item, Scale, and Scale if item deleted.

Scale if item deleted. Displays summary statistics comparing each item to the scale composed of the other items. Statistics include scale mean and variance if the item were deleted from the scale, correlation between the item and the scale composed of other items, and Cronbach's alpha if the item were deleted from the scale.
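The "scale if item deleted" statistics answer the question: would the scale be more reliable without this item? A minimal numpy sketch (the Likert-style scores are made up, with the fourth item deliberately disagreeing with the rest):

```python
import numpy as np

def alpha(items):
    """Cronbach's alpha for an n_cases x n_items score matrix."""
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

scores = np.array([[4, 5, 4, 1],
                   [3, 4, 3, 5],
                   [5, 5, 4, 2],
                   [2, 3, 2, 4],
                   [4, 4, 5, 1]], dtype=float)

print(f"alpha with all items: {alpha(scores):.3f}")
for i in range(scores.shape[1]):
    rest = np.delete(scores, i, axis=1)     # the scale without item i
    print(f"alpha if item {i + 1} deleted: {alpha(rest):.3f}")
```

Here dropping item 4 raises alpha sharply (from a negative value to about 0.904), which is exactly the pattern this part of the SPSS output is designed to reveal.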

Summaries. Provides descriptive statistics of item distributions across all items in the scale. Available options are Means, Variances, Covariances, and Correlations.

Inter-Item. Produces matrices of correlations or covariances between items.

ANOVA Table. Produces tests of equal means. Available alternatives are None, F test, Friedman chi-square, or Cochran chi-square.

• F Test. Displays a repeated measures analysis-of-variance table.
• Friedman chi-square. Displays Friedman's chi-square and Kendall's coefficient of concordance. This option is appropriate for data that are in the form of ranks. The chi-square test replaces the usual F test in the ANOVA table.
• Cochran chi-square. Displays Cochran's Q. This option is appropriate for data that are dichotomous. The Q statistic replaces the usual F statistic in the ANOVA table.
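For the dichotomous case, Cochran's Q is simple enough to check by hand. A numpy sketch of the statistic (the pass/fail data are invented):

```python
import numpy as np

def cochran_q(x):
    """Cochran's Q for an n_cases x k_items matrix of dichotomous (0/1) scores."""
    x = np.asarray(x)
    k = x.shape[1]
    col = x.sum(axis=0)            # per-item totals
    row = x.sum(axis=1)            # per-case totals
    n = x.sum()                    # grand total
    return (k - 1) * (k * (col ** 2).sum() - n ** 2) / (k * n - (row ** 2).sum())

# Six respondents passing/failing three items (hypothetical data)
trials = [[1, 1, 0],
          [1, 0, 0],
          [1, 1, 1],
          [1, 1, 0],
          [0, 1, 0],
          [1, 1, 1]]
q = cochran_q(trials)
print(round(q, 3))   # 4.5
```

A large Q relative to a chi-square distribution with k − 1 degrees of freedom suggests the items do not have equal proportions of "passes".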

Hotelling's T-square. Produces a multivariate test of the null hypothesis that all items on the scale have the same mean.

Tukey's test of additivity. Produces a test of the assumption that there is no multiplicative interaction among the items.

Intraclass correlation coefficient. Produces measures of consistency or agreement of values within cases.

• Model. Select the model for calculating the intraclass correlation coefficient. Available models are Two-way mixed, Two-way random, and One-way random. Select Two-way mixed when people effects are random and the item effects are fixed, Two-way random when people effects and the item effects are random, and One-way random when people effects are random.


• Type. Select the type of index. Available types are Consistency and Absolute Agreement.
• Confidence interval. Specify the level for the confidence interval. The default is 95%.
• Test value. Specify the hypothesized value of the coefficient for the hypothesis test. This is the value to which the observed value is compared. The default value is 0.
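As a concrete illustration of the simplest case, here is a numpy sketch of the single-measure intraclass correlation under the one-way random model (the ratings are invented for the example):

```python
import numpy as np

def icc_one_way(x):
    """Single-measure ICC under the one-way random model, ICC(1,1).

    Rows are cases (targets), columns are items/raters.
    """
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    row_means = x.mean(axis=1)
    msb = k * ((row_means - x.mean()) ** 2).sum() / (n - 1)        # between cases
    msw = ((x - row_means[:, None]) ** 2).sum() / (n * (k - 1))    # within cases
    return (msb - msw) / (msb + (k - 1) * msw)

# Three cases rated on two items (hypothetical ratings)
ratings = [[1, 2], [3, 4], [5, 6]]
print(round(icc_one_way(ratings), 3))   # 0.882
```

The two-way models SPSS also offers use a different partition of the variance, so their coefficients will generally differ from this one-way value.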

Multidimensional Scaling. Given respondents' similarity ratings between different makes and models of cars, for example, multidimensional scaling can be used to identify dimensions that describe consumers' perceptions. You might find, for example, that the price and size of a vehicle define a two-dimensional space, which accounts for the similarities reported by your respondents.

Statistics. For each model: data matrix, optimally scaled data matrix, S-stress (Young's), stress (Kruskal's), RSQ, stimulus coordinates, average stress and RSQ for each stimulus (RMDS models). For individual difference (INDSCAL) models: subject weights and weirdness index for each subject. For each matrix in replicated multidimensional scaling models: stress and RSQ for each stimulus. Plots: stimulus coordinates (two- or three-dimensional), scatterplot of disparities versus distances.


Chapter Ten
Graphs
We produced examples of pie charts and bar charts as part of the Frequencies and Crosstabs output. In this section we look at the range of SPSS graphs and charts in more detail, and at how to customize them to your own taste, using some of the more popular types of graph.

There are two sorts of graphs available from the Graphs menu. Down the main part of the menu is access to the old-style graphs, available in SPSS for a long time. The Interactive submenu contains a similar list of graphs, available in the new interactive graphics format.


You may recognize some of the graph types and have used them before. Some of the graph types have specialized applications, such as the Pareto... and Control... options for quality control work, Sequence... and Time Series for business and econometric work, and the ROC curve, first available in SPSS 9.

1. Graphs Gallery
If you want to know more about the different types of graph, select Gallery from the Graphs menu to get information about how to construct them. This will open the SPSS help system at the Main Chart Gallery.

2. Histogram for a single scale variable

The histogram gives a visual representation of a scale variable (sometimes called a continuous or interval variable). A scale variable's values are measured on a scale. The histogram bars represent the number of cases whose values fall in each interval of the measure.
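The binning a histogram performs can be mimicked with numpy (the heights in inches are made up): each bar simply counts the cases falling in one interval.

```python
import numpy as np

heights = np.array([62, 63, 65, 66, 66, 68, 68, 69, 70, 72, 74, 77])
counts, edges = np.histogram(heights, bins=6, range=(60, 78))   # 3-inch bins

# A crude text rendering of the bars
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:2.0f}-{hi:2.0f}: {'#' * c}")
```

Every case lands in exactly one bin, so the bar counts always sum to the number of cases.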


There are various parts to the dialog box: -

Variable: box will contain the variable to be summarised in a histogram. This is the only box that has to have a variable name in it.
Titles... button leads to a dialog box to add a title to the chart.
Template section File button is used to give the new chart a similar appearance to the saved chart in a specified file.
Display normal curve if selected, superimposes a normal curve on the graph.

2.1 Opening a chart window

The chart is part of the output window and can be saved as part of the results. The chart can be opened in a window of its own either by double clicking on it, choosing Edit/SPSS Chart Object/Open from the menus, or using the context-sensitive menus, which will be illustrated later.

Above the histogram you should see the chart menus and a variety
of buttons on the Toolbar which can be used to change different
aspects of the graph. The Toolbar looks different now that a chart
is open. The menus have also changed to provide facilities for
changing the content and appearance of graphs.

The top section of the toolbar contains the same buttons you would see when a data or output window is active. On the second row are the chart editing buttons, each of which changes a different aspect of the chart. Some of the buttons can only be used with particular types of graph, so for a histogram some will be grayed out on the toolbar.

2.2 Changing the axes

SPSS will allow you to rescale and change quantitative axes in a variety of ways. This can be done by selecting Chart/Axis... from the menus and choosing which axis you want to change. Or you can get to the same place by double clicking on the axis you wish to change. As an example, let's change the look of the labels on the interval axis (in this case, the horizontal axis). We will change it by getting rid of the decimal point and displaying only every second label. The axis will also be rescaled so it starts from 60" (five feet) and finishes at 78" (six feet six inches).

You will then be presented with a box to choose which axis, either the Interval or Scale axis. The Interval axis corresponds to the range of values of the variable and the Scale axis corresponds to the number of cases in each height interval (i.e. the horizontal and vertical axes respectively for this histogram).

Both scale and interval axes have the same general form - although
the dialog boxes and thus the things about them you can change are
slightly different. The interval axis, along the base of the
histogram, corresponds to the range of the variable values - the
scale for the weight variable.

Changing the intervals

The intervals between each tick on an axis and the axis range can be customised rather than letting SPSS automatically set the scales. The Custom and Define... buttons are used to make changes to the axis intervals.

2.3 The scale axis

The scale axis for the histogram measures the number of cases in each interval of the histogram. The height of the bar indicates the number of people in that interval.

In the dialog box there is: -

Display axis line box switches the scale axis line on or off.
Axis Title: box for typing in a title for the axis.
Title Justification: menu changes the place of the title relative to the axis.

Scale changes between the ordinary Linear scale and a Log scale, which is not really appropriate for this kind of graph, but may be useful later on.
Range changes the overall range of the axis by entering new values in the Displayed: Minimum and Maximum boxes.

2.4 Using Chart Options for a histogram

All of the charts have a set of options available to tailor the output. There are few options available under Histogram Options, but using them we can add one final feature to the histogram before moving on to another chart. We will add a normal curve to the histogram.

The normal curve is a representation of what the underlying normal distribution should look like, using the mean and standard deviation estimated from the data. In other words, the histogram should follow the line of the normal curve; if it does not, the data may not be suitable for the standard statistical tests.
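A rough numpy sketch of what the overlaid normal curve represents: scaled to the count axis, the fitted curve gives the bar height you would expect in each interval if the data were exactly normal. All numbers here are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=68.0, scale=3.0, size=500)     # simulated heights

counts, edges = np.histogram(data, bins=12)
mu, sigma = data.mean(), data.std(ddof=1)            # estimated from the data
width = edges[1] - edges[0]
mids = (edges[:-1] + edges[1:]) / 2

# Expected bar height under the fitted normal: N * bin width * pdf(midpoint)
expected = (len(data) * width
            * np.exp(-((mids - mu) ** 2) / (2 * sigma ** 2))
            / (sigma * np.sqrt(2 * np.pi)))
```

Large gaps between `counts` and `expected`, especially systematic ones such as skew, are the visual warning sign the overlay is meant to provide.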

3. Saving a chart


Often it is enough to save the charts you create along with the rest of the output in an output window (with a .SPO extension). However, it is possible to export charts individually into a graphics file format using Export Chart... from the File menu. If you can't see Export Chart... in the File menu, double click on the chart to make sure it is open. This opens the Export Chart dialog box, so the chart can be saved to a file. The default file type is JPEG (.jpg), a format used a lot on the web. There is a variety of possible formats available; what you choose will depend on what you want to do with the chart.
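The same idea can be sketched in matplotlib terms (this is an analogy, not SPSS itself): draw a chart off-screen and export it to a graphics file, where the file extension chooses the format. The output filename is made up for the example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")                      # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
fig, ax = plt.subplots()
ax.hist(rng.normal(68.0, 3.0, size=200), bins=10)
ax.set_xlabel("Height (inches)")

# PNG here; a .jpg extension would additionally need the Pillow package
out = os.path.join(tempfile.gettempdir(), "histogram.png")
fig.savefig(out, dpi=150)
```

As in SPSS, the chart object and the exported file are separate things: re-exporting after editing simply overwrites the file.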

4. Scatter plot
So far we have dealt with charts that display a single scale or category variable. A scatter plot shows the relationship between two scale variables.

Simple plots the relationship between two scale variables.
Overlay overlays the plots for pairs of scale variables in the same space.

Matrix produces plots for each combination in a list of scale variables.
3-D plots the relationship between three scale variables.

Y-axis: is for a scale variable to describe the vertical scale axis.
X-axis: is for the scale variable to describe the horizontal axis.
Set Markers by: optional; the plot markers can be set to different colors or shapes depending on a category variable.
Label cases by: optional; plot points will be labeled using the values of any variable put in here.

Once the plot has appeared in the viewer, it can be edited, saved with the output window, or exported to another format as we have seen earlier with other graphs. Double-click on the scatterplot: the Chart Editor window is similar to the earlier ones and most facilities will work the same, but a few more buttons are now active on the toolbar.

4.1 Fitting a line

There are two options available for fitting a line to a scatter plot: either Format/Interpolation, which is best for a sequence of cases, or Fit Line under the Scatterplot options in Chart/Options.
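One of the usual Fit Line choices is a linear (least-squares) regression line. The fit itself can be sketched with numpy; the height/weight pairs below are invented for illustration.

```python
import numpy as np

height = np.array([62, 64, 66, 68, 70, 72, 74], dtype=float)          # inches
weight = np.array([120, 131, 142, 155, 163, 178, 190], dtype=float)   # pounds

# Ordinary least-squares line through the scatter of points
slope, intercept = np.polyfit(height, weight, deg=1)
print(f"weight ≈ {slope:.2f}*height {intercept:+.2f}")
```

The fitted line summarises the trend in the cloud of points, which is what the Fit Line option draws over the scatter plot.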


5. Interactive Charts
So far all the charts we have seen have been available since version 6 of SPSS, but a newer interactive method of producing charts has been available since version 8. The new interactive chart type makes displaying charts and graphs of subgroups much easier, and the editing facilities are nicer to use, although they can be quite slow on older computers. Interactive is second from the top in the Graphs menu and leads to a sub-menu of chart types similar to the ones we have seen already. The system for editing the charts is better, and their dimensions and contents can be changed interactively, as the name would suggest.


You can see the interactive Create Bar Chart dialog box has a lot more elements. Along the top of the dialog box there are tabs leading to other aspects of the bar chart. One of the most noticeable differences is an axis diagram showing which variables have been assigned to which dimension of the bar chart. There are no arrow buttons in this dialog box to direct the variables into the appropriate space; you simply drag and drop a variable from the variable list into the desired part of the chart diagram. Three different icons in the variable list indicate how SPSS will deal with that variable in a chart.


The Group icon indicates a variable defined as nominal or ordinal; SPSS will use its individual values to define groups of cases as bars, markers, or separate plots. The Calculated icon indicates quantities calculated by SPSS, such as the number or percentage of cases. Finally, the Scale icon indicates that SPSS will plot the variable's values on a scale where it can, or calculate summary statistics from them in, for example, a bar chart.

5.1 Interactive Chart Editor

Interactive charts are edited by double clicking on them, which adds vertical and horizontal toolbars to the chart "in situ". Unlike the older charts, these cannot be opened as separate windows. This chart editor looks different from the older one, but editing functions are still accessed either through toolbar buttons or through the menus.


It is possible to customise the position of the buttons on the toolbar by clicking and dragging sections of it. There are four toolbar sections: utility, text, cursor and style.

Utility tools
Utility buttons are usually to be found to the left on the horizontal toolbar. These buttons change different structural aspects of the chart, including reassigning the variables in the chart, opening the chart manager, and inserting new elements.

The undo and redo buttons can be used to undo the last thing done to the chart, and to redo the editing instruction if the last action has been "undone". There are two buttons to swap axes, like the swap axis button in the old-style charts.

Text tools
On the right of the horizontal toolbar are pop-up menus to change
the style of selected text, including the font face, size, bold and
italic.

You will only see the font type and size options when a text label is selected in the graph.

Cursor tools
Cursor buttons change the editing mode. The arrow tool is for selecting objects. Text can be changed using the text tool to select the text and then changing the font using the menus. The Point Id tool looks like a target sight and is for identifying points in scatterplots.

Style tools
The vertical toolbar contains buttons to change the appearance of the charts in various ways. Some of the items can only be used with particular types of chart.

The style buttons are normally on the left-hand vertical toolbar below the cursor tools. In descending order they are fill color, border color, fill pattern, plot symbols for scatterplots, symbol size for scatterplots, line style, line width, and connector style.


Most of these buttons had equivalents in the older-style charts and can be used in a similar way. The attributes of any graph object can be changed by selecting it and then choosing the appropriate attribute button/tool. The style tools produce pop-up menus, mostly in the form of palettes, to change the attribute.

Color
Placing a variable in the Color: box will produce a clustered bar
chart with a different colored bar for each category of the colour
variable.

Notice the Cluster heading beside the Style and Color fields. This
can be used to change the graph style from a clustered to a stacked
bar chart.
Size

For this type of chart size really isn't important! The Size: option will only be useful for scatter plots, where the plot point size will vary with the category of the size variable. In fact, if you look back at the original Create Bar Chart dialog box, the size option is not available. If you wish to try out this option, choose Graphs/Interactive/Scatterplot... from the menus and reproduce the height vs. weight scatter plot we did earlier, this time with cigarette consumption defining the plot point size.

5.2 3-D
Instead of producing a clustered or stacked bar chart, it is possible to arrange the bars in a 3-D formation. There is a drop-down menu with three choices in the Assign Variables dialog box.

Choosing 3-D Coordinate from the menu will create a third dimension. The other menu item, 3-D effect, gives a 3-D effect to the bars in a 2-D chart. Try it to see what effect it has on the bar chart.


There are also a couple of relevant buttons among the utility tools (see the Utility tools section above). The 3-D tool will also give you access to the 3-D and 3-D Light palettes. The 3-D Light palette can change the direction and strength of the lighting on the chart.

The Co-ordinate systems tool does the same job as the dimension
menus in the Create Bar chart dialog box and the Assign Variables
dialog box. If we changed from 3-D to 2-D then the chart would be
reduced to a simple bar chart again.


Panel variables
Extra dimensions can be added using panel variables. Each
category of a panel variable defines a separate graph. If there is
more than one panel variable, one graph is produced for each
combination of the variable categories.

Insert Elements
Elements can be added to the graph or chart using the Insert Elements button on the toolbar. Clicking on the button shows the list of elements which can be added. Some are only suitable for certain types of chart, so may not be available.


Chart Looks
You can also change the default look of the interactive charts you
produce (in Edit/Options... under the interactive tab) in
ChartLook:. This can be set so the charts you produce for reports
or presentations will have a uniform image - you could even
specify your own look by saving a suitable chart as a chartlook
while editing it.

These chart looks are also available from the menus while creating or editing an interactive chart. While creating an interactive chart, you can choose another chart look under the Option tab in the Create dialog box. While editing the chart, choose Format/ChartLooks… to apply, edit or add a new chart look.

For example, it can be quite difficult to see the second category's plot symbols when their colour is very close to the background one. In the ChartLook dialog box you can browse, edit and apply stored looks or create new ones. Selecting a look and clicking the Edit Look… button will open the Chart Properties dialog box. There, under the Colors tab, you can change the associated colours. There are a lot of other settings attached to each look which you can change – explore the settings and, if you like, save your own chart properties for later use. To change the colours, select the category, then its new colour from the palette.


Index
SPSS 10 for Windows Keystrokes
New Data Editor Sheet = Alt + F + N + A
New Syntax Window = Alt + F + N + S
New Output Window = Alt + F + N + O
New Draft Output Window = Alt + F + N + R
New Script Window = Alt + F + N + C
Open Datafile = Alt + F + O + A
Open Text Data = Alt + F + R
Save = Alt + F + S or Ctrl + S
Save As = Alt + F + A
Print Data = Alt + F + P
Print Preview = Alt + F + V
Undo = Alt + E + U or Ctrl + Z
Redo = Alt + E + R or Ctrl + R
Cut = Alt + E + T or Ctrl + X
Copy = Alt + E + C or Ctrl + C
Paste = Alt + E + P or Ctrl + V
Paste Variables = Alt + E + V
Clear = Alt + E + E or DEL
Define Dates = Alt + D + E
Insert Variable at Pointer = Alt + D + V
Insert Case at Pointer = Alt + D + I
Go to Case = Alt + D + S
Sort Cases = Alt + D + O
Transpose Variables = Alt + D + N
Merge Files - Add Cases = Alt + D + G + C
Merge Files - Add Variables = Alt + D + G + V
Aggregate Data = Alt + D + A
Split File = Alt + D + F
Select Cases = Alt + D + C
Weight Cases = Alt + D + W

Dialog Box Keystrokes


Activate a control = Alt+[underlined letter]
Paste syntax = Alt + P
Next control = Tab
Previous control = Shift+Tab
"Click" the active control = Spacebar
Cancel dialog box = Esc
Select several items in list box = Ctrl-click
Select all items in list box = Ctrl+A


Open a drop-down list = Alt+Down Arrow
Select item in a drop-down list = Alt+Up Arrow
