SPSS


SPSS is very similar to Microsoft Excel in layout.

There is a menu and a toolbar at the top of every window, and some of its functions work just like Excel's. It is very easy to use and best learned by doing. I suggest working through the Tutorial, which can be found in the Help menu. The following notes are some preliminary comments and instructions to
help you learn the basics of SPSS.

Windows

The SPSS interface uses two windows: the Data Editor and the Viewer. The Data Editor is where data files are viewed and manipulated. The Viewer is where the output of the statistical analyses is viewed and manipulated.
Use the Window menu to switch back and forth between the Viewer and the Data Editor windows.

Data Files

There are two types of basic files. The first is the data file (.sav). This is where all the data for your analyses
live. When you open a data file, it appears in the Data Editor window. The format is the same as a spreadsheet,
with a grid of rows and columns. Columns represent variables and rows represent observations. You can place
the cursor on a column heading to see a longer description of that variable. To get complete information on a
variable, go to the Utilities menu and click Variables.

The data can be entered manually or imported from a database, spreadsheet, or text file. For the Marketing
Management class, data files already exist, so it is just a matter of opening them.

One significant distinction between ordinary spreadsheet files and SPSS data files is that formulas are not
entered directly into the sheet. To enter a formula in SPSS, go to the Transform menu and choose
Compute. This lets you compute a new variable.
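
As a rough parallel outside SPSS, here is a minimal Python sketch (assuming the pyreadstat and pandas packages and a hypothetical file survey.sav with items item1, item2, and item3) that does the equivalent of Transform > Compute:

```python
# Sketch: the equivalent of SPSS Transform > Compute, done in Python.
# Assumes the pyreadstat/pandas packages and a hypothetical file "survey.sav".
import pyreadstat

# Read the SPSS data file into a pandas DataFrame (plus variable metadata).
df, meta = pyreadstat.read_sav("survey.sav")

# Compute a new variable from existing ones (hypothetical column names).
df["total_score"] = df["item1"] + df["item2"] + df["item3"]

# Write the result back out as a .sav file, keeping the original file untouched.
pyreadstat.write_sav(df, "survey_with_totals.sav")
```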

Output files

The second type of file is the output file (.spo). When a statistical procedure is run, output is created and the
Viewer window opens automatically to display it. The left pane shows an outline view of the output. The
right pane shows the contents of the output, including charts, tables, and text. There are book icons in the
outline view next to the various output objects. An open book indicates that the output object is visible; a
closed book indicates that it is hidden. There are many ways to manipulate objects in the output (moving them,
modifying charts, hiding text, and so on). I suggest going to "Working with output" in the Tutorial to become familiar with this.

Statistical Procedures

There are several statistical procedures that can be run against the data. The basic approach is to choose a
procedure from the Analyze menu, pick the variables you wish to analyze, run the procedure, and examine
the results. Every procedure has default settings for its statistics and output, which can be modified by
choosing Options in the dialog box.
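
For readers who want to script the same workflow instead of using the menus, here is a hedged Python sketch (pandas and scipy assumed; the file name and variable names are invented) that mirrors "pick the variables, run the procedure, examine the results":

```python
# Sketch: choose variables, run a procedure, examine the results.
# Assumes pandas/scipy and hypothetical variables "satisfaction" and "region".
import pandas as pd
from scipy import stats

df = pd.read_csv("marketing.csv")          # hypothetical data file

# Descriptive statistics (the analogue of Analyze > Descriptive Statistics).
print(df["satisfaction"].describe())

# A simple procedure with an "option" changed: an independent-samples t-test
# comparing two regions, with equal variances not assumed.
north = df.loc[df["region"] == "north", "satisfaction"]
south = df.loc[df["region"] == "south", "satisfaction"]
t, p = stats.ttest_ind(north, south, equal_var=False)
print(f"t = {t:.3f}, p = {p:.4f}")
```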
Other Statistical Software

OpenStat -- a general stats package for Win 95/98/NT, developed by Bill Miller of Iowa State U, with a very broad range of data manipulation and analysis capabilities and an SPSS-like user interface. Bill also has provided an excellent downloadable textbook in the form of Adobe Acrobat files.

ViSta -- a Visual Statistics program for Win3.1, Win 95/NT, Mac and Unix, featuring a Structured Desktop, with features designed to structure and assist the statistical analyst.

PSPP -- a free replacement for SPSS (although at this time it implements only a small fraction of SPSS's analyses). But it's free, and will never "expire". It replicates the "look and feel" of SPSS very closely, and even reads native SPSS syntax and files! Some other features...

 Supports over 1 billion cases and over 1 billion variables.
 Choice of terminal or graphical user interface; choice of text, postscript or html output formats.
 Inter-operates with Gnumeric, OpenOffice.Org and other free software.
 Easy data import from spreadsheets, text files and database sources.
 Fast statistical procedures, even on very large data sets.
 No license fees; no expiration period; no unethical "end user license agreements".
 Fully indexed user manual.
 Cross platform; runs on many different computers and many different operating systems.

Note: For Windows installer, click here.

OpenEpi Version 2.3 -- OpenEpi is a free, web-based, open source, operating-system-independent series of programs for use in public health and medicine, providing a number of epidemiologic and statistical tools. Version 2 (4/25/2007) has a new interface that presents results without using pop-up windows, and has better installation methods so that it can be run without an internet connection. Version 2.2 (2007/11/09) lets users run the software in English, French, Spanish, or Italian.

SYSTAT 12 -- powerful statistical software ranging from the most elementary descriptive statistics to very advanced statistical methodology. Novices can work with its friendly and simple menu-dialog; statistically-savvy users can use its intuitive command language. Carry out very comprehensive analysis of univariate and multivariate data based on linear, general linear, and mixed linear models; carry out different types of robust regression analysis when your data are not suitable for conventional multiple regression analysis; compute partial least-squares regression; design experiments, carry out power analysis, do probability calculations on many distributions and fit them to data; perform matrix computations. Provides Time Series, Survival Analysis, Response Surface Optimization, Spatial Statistics, Test Item Analysis, Cluster Analysis, Classification and Regression Trees, Correspondence Analysis, Multidimensional Scaling, Conjoint Analysis, Quality Analysis, Path Analysis, etc. A 30-day evaluation version is available for free download.

Statlets -- a 100% Pure Java statistics program. Should run on any platform (PC, Mac, Unix) that supports Java. The free Academic Version is limited to 100 cases by 10 variables.

WINKS (Windows KWIKSTAT) -- a full-featured, easy-to-use stats package with statistics (means, standard deviations, medians, etc.), histograms, t-tests, correlation, chi-square, regression, nonparametrics, analysis of variance (ANOVA), probability, QC plots, cpk, graphs, life tables, time series, crosstabs, and more. Works on Windows XP (as well as Windows 2000, NT, 98, ME and 95). Comes in Basic and Professional editions. Evaluation version available for download.

Statext -- Provides a nice assortment of basic statistical tests, with text output (and text-based graphics). Capabilities include: rearrange, transpose, tabulate and count data; random sample; basic descriptives; text-plots for dot, box-and-whiskers, stem-and-leaf, histogram, scatterplot; find z-values, confidence interval for means; t-tests (one and two group, and paired); one- and two-way ANOVA; Pearson, Spearman and Kendall correlation; linear regression; Chi-square goodness-of-fit and independence tests; sign test, Mann-Whitney U and Kruskal-Wallis H tests; probability tables (z, t, Chi-square, F, U); random number generator; Central Limit Theorem, Chi-square distribution.

MicrOsiris -- a comprehensive statistical and data management package for Windows, derived from the OSIRIS IV package developed at the University of Michigan. It was developed for serious survey analysis using moderate to large data sets. Main features: handles any size data set; has Excel data entry; imports/exports SPSS, SAS, and Stata datasets; reads ICPSR (OSIRIS) and UNESCO (IDAMS) datasets; data mining techniques for market analysis (SEARCH -- very fast for large datasets); interactive decision tree for selecting appropriate tests; database manipulation (dictionaries, sorting, merging, consistency checking, recoding, transforming); extensive statistics (univariate, scatterplot, cross-tabs, ANOVA/MANOVA, log-linear, correlation/regression, MCA, MNA, binary segmentation, cluster, factor, MINISSA, item analysis, survival analysis, internal consistency); online, web-enabled users manual; requires only 6MB RAM; uses 12MB disk, including manual. Fully-functional version is free; the authors would appreciate a small donation to support ongoing development and distribution.

Gnumeric -- a high-powered spreadsheet with better statistical features than Excel. Has 60 extra functions, basic support for financial derivatives (Black Scholes) and telecommunication engineering, advanced statistical analysis, extensive random number generation, linear and non-linear solvers, implicit intersection, implicit iteration, goal seek, and Monte Carlo simulation tools.

Statist -- a compact, portable program that provides most basic statistical capabilities: data manipulation (recoding, transforming, selecting), descriptive stats (including histograms, box&whisker plots), correlation & regression, and the common significance tests (chi-square, t-test, etc.). Written in C (source available); runs on Unix/Linux, Windows, Mac, among others.

Tanagra -- a free (open-source) data-mining package, which supports the standard "stream diagram" paradigm used by most data-mining systems. Contains components for Data source (tab-delimited text), Visualization (grid, scatterplots), Descriptive statistics (cross-tab, ANOVA, correlation), Instance selection (sampling, stratified), Feature selection and construction, Regression (multiple linear), Factorial analysis (principal components, multiple correspondence), Clustering (kMeans, SOM, LVQ, HAC), Supervised learning (logistic regr., k-NN, multi-layer perceptron, prototype-NN, ID3, discriminant analysis, naive Bayes, radial basis function), Meta-spv learning (instance Spv, arcing, boosting, bagging), Learning assessment (train-test, cross-validation), and Association (Agrawal a-priori). (French-language page here)

Dap -- a statistics and graphics package developed by Susan Bassein for Unix and Linux systems, with commonly-needed data management, analysis, and graphics (univariate statistics, correlations and regression, ANOVA, categorical data analysis, logistic regression, and nonparametric analyses). Provides some of the core functionality of SAS, and is able to read and run many (but not all) SAS program files. Dap is freely distributed under a GNU-style "copyleft".

PAST -- an easy-to-use data analysis package aimed at paleontology, including a large selection of common statistical, plotting and modelling functions: a spreadsheet-type data entry form, graphing, curve fitting, significance tests (F, t, permutation t, Chi-squared w. permutation test, Kolmogorov-Smirnov, Mann-Whitney, Shapiro-Wilk, Spearman's Rho and Kendall's Tau tests, correlation, covariance, contingency tables, one-way ANOVA, Kruskal-Wallis test), diversity and similarity indices & profiles, abundance model fitting, multivariate statistics, time series analysis, geometrical analysis, parsimony analysis (cladistics), and biostratigraphy.

StudyResult -- (30-day free trial) General statistics package for: paired & unpaired t-test, one-way ANOVA, Fisher's exact, McNemar's, Chi2, Chi2 homogeneity, life table & survival analysis, Wilcoxon rank-sum & signed-rank, sign test, bioequivalence testing, correlation & regression coefficient tests. Special features for interpreting summary data found in publications (p-values & conf. intervals from summary statistics, converts p-values to CI's & vice versa, what observed results are needed to get a significant result, estimates from publications needed for sample size calculations). Includes equivalence- and non-inferiority testing for most tests.

STATGRAPHICS Plus v5.0 (for Windows) -- over 250 statistical analyses: regression, probit, enhanced logistic, factor effects plots, automatic forecasting, matrix plots, outlier identification, general linear models (random and mixed), multiple regression with automatic Cochrane-Orcutt and Box-Cox procedures, Levene's, Friedman's, Dixon's and Grubb's tests, Durbin-Watson p-values and 1-variable bootstrap estimates, enhanced 3D charts. For Six Sigma work: gage linearity and accuracy analysis, multi-vari charts, life data regression for reliability analysis and accelerated life-testing, long-term and short-term capability assessment estimates. Two free downloads are available: full-function but limited-time (30 days), and unlimited-time but limited-function (no Save, no Print, not all analyses).

NCSS-2007 (Statistical Analysis System), PASS-2008 (Power and Sample Size), and GESS (Gene Expression software for Micro-arrays) for Windows. Free 7-day evaluation versions.

MiniTab -- a powerful, full-featured MS Windows package, with good coverage of industrial / quality control analyses. The free Version 12 Demo expires after 30 days.

InStat (Instant Statistics), a full-featured statistics package from GraphPad Software. Demo version disables printing, saving and exporting capabilities. Demo available for Windows only; commercial version available for Windows and Mac.

Prism -- from GraphPad Software. Performs basic biostatistics, fits curves and creates publication quality scientific graphs in one complete package (Mac and Windows). Windows demo is fully-functional for 30 days, then disables printing, saving and exporting; Mac demo always disables these functions.

CoStat 6.2 -- an easy-to-use program for data manipulation and statistical analysis, from CoHort Software. Use a spreadsheet with any number of columns and rows of data: floating point, integer, date, time, degrees, text, etc. Import ASCII, Excel, MatLab, S+, SAS, Genstat, Fortran, and others. Has ANOVA, multiple comparisons of means, correlation, descriptive statistics, analysis of frequency data, miscellaneous tests of hypotheses, nonparametric tests, regression (curve fitting), statistical tables, and utilities. Has an auto-recorder and macro programming language. Callable from the command line, batch files, shell scripts, pipes, and other programs; can be used as the statistics engine for web applications. Free time-limited demo available.

AM -- a free package for analyzing data from complex samples, especially large-scale assessments, as well as
non-assessment survey data. Has sophisticated stats,
easy drag & drop interface, and integrated help system
that explains the statistics as well as how to use the
system. Can estimate models via marginal maximum
likelihood (MML), which defines a probability
distribution over the proficiency scale. Also analyzes
"plausible values" used in programs like NAEP.
Automatically provides appropriate standard errors for
complex samples via Taylor-series approximation,
jackknife & other replication techniques.

Instat Plus -- from the University of Reading, in the UK. (Not to be confused with Instat from GraphPad
Software.) An interactive statistics package for
Windows or DOS.

WinIDAMS -- from UNESCO -- for numerical information processing and statistical analysis. Provides
data manipulation and validation facilities, classical and
advanced statistical techniques, including interactive
construction of multidimensional tables, graphical
exploration of data (3D scattergram spinning, etc.),
time series analysis, and a large number of multivariate
techniques.

SSP (Smith's Statistical Package) -- a simple, user-friendly package for Mac and Windows that can
enter/edit/transform/import/export data, calculate basic
summaries, prepare charts, evaluate distribution
function probabilities, perform simulations, compare
means & proportions, do ANOVA's, Chi Square tests,
simple & multiple regressions.
Also, check out R and Ox, described in the
Programming Languages section below.

Dataplot -- (Unix, Linux, PC-DOS, Windows) for scientific visualization, statistical analysis, and non-
linear modeling. Has extensive mathematical and
graphical capabilities. Closely integrated with the
NIST/SEMATECH Engineering Statistics Handbook.

WebStat -- A Java-based statistical computing environment for the World Wide Web. Needs a
browser, but can be downloaded and run offline.

Regress+ -- A professional package (Macintosh only) for univariate mathematical modeling (equations and
distributions). The most powerful software of its kind
available anywhere, with state-of-the-art functionality
and user-friendliness. Too many features to even begin
to list here.

SISA -- Simple Interactive Statistical Analysis for PC (DOS) from Daan Uitenbroek. An excellent collection
of individual DOS modules for several statistical
calculations, including some analyses not readily
available elsewhere.

Statistical Software by Paul W. Mielke Jr. -- a large collection of executable DOS programs (and Fortran
source). Includes: Matrix occupancy, exact g-sample
empirical coverage test, interactions of exact analyses,
spectral decomposition analysis, exact mrbp
(randomized block) analyses, exact multi-response
permutation procedure, Fisher's Exact for cross-
classification and goodness-of-fit, Fisher's combined p-
values (meta analysis), largest part's proportion,
Pearson-Zelterman, Greenwood-Moran and Kendall-
Sherman goodness-of-fit, runs tests, multivariate
Hotelling's test, least-absolute-deviation regression,
sequential permutation procedures, LAD regression,
principal component analysis, matched pair
permutation, r by c contingency tables, r-way
contingency tables, and Jonkheere-Terpstra.

IRRISTAT -- for data management and basic statistical analysis of experimental data (Windows). Primarily for
analysis of agricultural field trials, but many features
can be used for analysis of data from other sources.
Includes: Data management with a spreadsheet, Text
editor, Analysis of variance, Regression, Genotype x
environment interaction analysis, Quantitative trait
analysis, Single site analysis, Pattern analysis,
Graphics, Utilities for randomization and layout,
general factorial EMS, and orthogonal polynomial.

Tabulation of Data

The process of placing classified data into tabular form is known as tabulation. A table is a symmetric arrangement of statistical data in rows and
columns. Rows are horizontal arrangements whereas columns are vertical arrangements. It may be simple, double or complex depending upon the
type of classification.

Types of Tabulation:

(1) Simple Tabulation or One-way Tabulation:


          When the data are tabulated according to one characteristic, it is said to be simple tabulation or one-way tabulation.
For Example: Tabulation of data on the population of the world classified by one characteristic, like Religion, is an example of simple
tabulation.

(2) Double Tabulation or Two-way Tabulation:


          When the data are tabulated according to two characteristics at a time, it is said to be double tabulation or two-way tabulation.
For Example: Tabulation of data on the population of the world classified by two characteristics, like Religion and Sex, is an example of double
tabulation.

(3) Complex Tabulation:


          When the data are tabulated according to many characteristics, it is said to be complex tabulation.
For Example: Tabulation of data on the population of the world classified by three or more characteristics, like Religion, Sex, and Literacy, is an
example of complex tabulation.
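
The three types of tabulation are easy to reproduce in code. Below is a small Python sketch (pandas assumed; the data are invented for illustration) showing one-way, two-way, and complex tabulation:

```python
# Sketch: one-way, two-way, and complex tabulation with pandas (made-up data).
import pandas as pd

people = pd.DataFrame({
    "religion": ["A", "B", "A", "C", "B", "A"],
    "sex":      ["M", "F", "F", "M", "M", "F"],
    "literate": ["yes", "yes", "no", "yes", "no", "yes"],
})

# Simple (one-way) tabulation: counts by one characteristic.
print(people["religion"].value_counts())

# Double (two-way) tabulation: counts by two characteristics at a time.
print(pd.crosstab(people["religion"], people["sex"]))

# Complex tabulation: counts by three characteristics.
print(pd.crosstab(people["religion"], [people["sex"], people["literate"]]))
```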

Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the
goal of highlighting useful information, suggesting conclusions, and supporting decision making.
Data analysis has multiple facets and approaches, encompassing diverse techniques under a
variety of names, in different business, science, and social science domains.

Data mining is a particular data analysis technique that focuses on modeling and knowledge
discovery for predictive rather than purely descriptive purposes. Business intelligence covers
data analysis that relies heavily on aggregation, focusing on business information. In statistical
applications, some people divide data analysis into descriptive statistics, exploratory data
analysis, and confirmatory data analysis. EDA focuses on discovering new features in the data
and CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on
application of statistical or structural models for predictive forecasting or classification, while
text analytics applies statistical, linguistic, and structural techniques to extract and classify
information from textual sources, a species of unstructured data. All are varieties of data
analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data
visualization and data dissemination. The term data analysis is sometimes used as a synonym for
data modeling, which is unrelated to the subject of this article.

Nuclear and particle physics


In nuclear and particle physics the data usually originate from the experimental apparatus via a
data acquisition system. It is then processed, in a step usually called data reduction, to apply
calibrations and to extract physically significant information. Data reduction is most often,
especially in large particle physics experiments, an automatic, batch-mode operation carried out
by software written ad-hoc. The resulting data n-tuples are then scrutinized by the physicists,
using specialized software tools like ROOT or PAW, comparing the results of the experiment
with theory.

The theoretical models are often difficult to compare directly with the results of the experiments,
so they are used instead as input for Monte Carlo simulation software like Geant4, which predicts the
response of the detector to a given theoretical event, producing simulated events which are then
compared to experimental data.

See also: Computational physics.

Qualitative data analysis


Qualitative research uses qualitative data analysis (QDA) to analyze text, interview transcripts,
photographs, art, field notes of (ethnographic) observations, et cetera.

The process of data analysis


Data analysis is a process, within which several phases can be distinguished:[1]

 Data cleaning
 Initial data analysis (assessment of data quality)
 Main data analysis (answer the original research question)
 Final data analysis (necessary additional analyses and report)

Data cleaning

Data cleaning is an important procedure during which the data are inspected, and erroneous data
are corrected where necessary, preferable, and possible. Data cleaning can be done during the stage
of data entry. If this is done, it is important that no subjective decisions are made. The guiding
principle provided by Adèr (ref) is: during subsequent manipulations of the data, information
should always be cumulatively retrievable. In other words, it should always be possible to undo
any data set alterations. Therefore, it is important not to throw information away at any stage in
the data cleaning phase. All information should be saved (i.e., when altering variables, both the
original values and the new values should be kept, either in a duplicate dataset or under a
different variable name), and all alterations to the data set should be carefully and clearly
documented, for instance in a syntax file or a log.[2]
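
As one possible illustration of the "cumulatively retrievable" principle, the following Python sketch (pandas assumed; file and column names are hypothetical) keeps the original values, stores corrections under a new variable name, and logs every alteration:

```python
# Sketch of the "cumulatively retrievable" principle: keep the original values,
# store corrections under a new variable name, and log every alteration.
# (pandas assumed; file and column names are hypothetical.)
import pandas as pd

df = pd.read_csv("raw_data.csv")
log = []

# Correct an implausible value without overwriting the original column.
df["age_clean"] = df["age"]
bad = df["age_clean"] > 120
df.loc[bad, "age_clean"] = pd.NA
log.append(f"Set {int(bad.sum())} implausible ages (>120) to missing in age_clean")

# Save both the altered data set and the alteration log.
df.to_csv("cleaned_data.csv", index=False)
with open("cleaning_log.txt", "w") as fh:
    fh.write("\n".join(log))
```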

Initial data analysis

The most important distinction between the initial data analysis phase and the main analysis
phase is that during initial data analysis one refrains from any analyses that are aimed at
answering the original research question. The initial data analysis phase is guided by the
following four questions:[3]

Quality of data

The quality of the data should be checked as early as possible. Data quality can be assessed in
several ways, using different types of analyses: frequency counts, descriptive statistics (mean,
standard deviation, median), normality (skewness, kurtosis, frequency histograms, normal
probability plots), associations (correlations, scatter plots).
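As a rough sketch of these checks, the following Python fragment (pandas assumed; the column names are hypothetical, and the scatter plot additionally needs matplotlib) computes frequency counts, descriptive statistics, normality indicators, and associations:

```python
# Sketch of the data-quality checks listed above, using pandas
# (hypothetical numeric columns "score" and "age", categorical "group").
import pandas as pd

df = pd.read_csv("study_data.csv")

print(df["group"].value_counts(dropna=False))   # frequency counts
print(df[["score", "age"]].describe())          # mean, std, median (50%)
print(df[["score", "age"]].skew())              # skewness
print(df[["score", "age"]].kurtosis())          # kurtosis
print(df[["score", "age"]].corr())              # associations (correlations)
df.plot.scatter(x="age", y="score")             # scatter plot (needs matplotlib)
```
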
Other initial data quality checks are:

 Checks on data cleaning: have decisions influenced the distribution of the variables? The
distribution of the variables before data cleaning is compared to the distribution of the
variables after data cleaning to see whether data cleaning has had unwanted effects on the
data.
 Analysis of missing observations: are there many missing values, and are the values
missing at random? The missing observations in the data are analyzed to see whether
more than 25% of the values are missing, whether they are missing at random (MAR),
and whether some form of imputation (statistics) is needed.
 Analysis of extreme observations: outlying observations in the data are analyzed to see if
they seem to disturb the distribution.
 Comparison and correction of differences in coding schemes: variables are compared
with coding schemes of variables external to the data set, and possibly corrected if coding
schemes are not comparable.

The choice of analyses to assess the data quality during the initial data analysis phase depends on
the analyses that will be conducted in the main analysis phase.[4]

Quality of measurements

The quality of the measurement instruments should be checked during the initial data
analysis phase only when this is not the focus or research question of the study. One should check
whether the structure of the measurement instruments corresponds to the structure reported in the literature.
There are two ways to assess measurement quality:
 Confirmatory factor analysis
 Analysis of homogeneity (internal consistency), which gives an indication of the
reliability of a measurement instrument, i.e., whether all items fit into a unidimensional
scale. During this analysis, one inspects the variances of the items and the scales, the
Cronbach's α of the scales, and the change in the Cronbach's alpha when an item would
be deleted from a scale.[5]
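
A minimal sketch of the second approach, assuming numpy/pandas and a hypothetical file of item responses, computes Cronbach's alpha for the scale and the alpha obtained when each item is dropped:

```python
# Sketch: Cronbach's alpha for a set of items forming one scale.
# (numpy/pandas assumed; "items" is a DataFrame with one column per item.)
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the scale total
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

items = pd.read_csv("scale_items.csv")           # hypothetical item responses
print("alpha:", cronbach_alpha(items))
print("alpha if item deleted:")
for col in items.columns:
    print(col, cronbach_alpha(items.drop(columns=col)))
```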

Initial transformations

After assessing the quality of the data and of the measurements, one might decide to impute
missing data, or to perform initial transformations of one or more variables, although this can
also be done during the main analysis phase.[6]
Possible transformations of variables are (a short sketch applying them follows the list):[7]

 Square root transformation (if the distribution differs moderately from normal)
 Log-transformation (if the distribution differs substantially from normal)
 Inverse transformation (if the distribution differs severely from normal)
 Make categorical (ordinal / dichotomous) (if the distribution differs severely from
normal, and no transformations help)
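
Here is the promised sketch (numpy/pandas assumed; the positive, right-skewed variable income is hypothetical), applying the transformations above:

```python
# Sketch: the transformations listed above, applied to a positive, right-skewed
# variable "income" (numpy/pandas assumed; column name hypothetical).
import numpy as np
import pandas as pd

df = pd.read_csv("study_data.csv")

df["income_sqrt"] = np.sqrt(df["income"])        # moderate departure from normal
df["income_log"]  = np.log(df["income"])         # substantial departure
df["income_inv"]  = 1.0 / df["income"]           # severe departure
# Last resort: make the variable categorical (here: quartiles).
df["income_cat"]  = pd.qcut(df["income"], q=4, labels=False)
```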

Did the implementation of the study fulfill the intentions of the research design?

One should check the success of the randomization procedure, for instance by checking whether
background and substantive variables are equally distributed within and across groups.
If the study did not need and/or use a randomization procedure, one should check the success of
the non-random sampling, for instance by checking whether all subgroups of the population of
interest are represented in the sample.
Other possible data distortions that should be checked are:

 dropout (this should be identified during the initial data analysis phase)
 Item nonresponse (whether this is random or not should be assessed during the initial data
analysis phase)
 Treatment quality (using manipulation checks).[8]

Characteristics of data sample

In any report or article, the structure of the sample must be accurately described. It is especially
important to exactly determine the structure of the sample (and specifically the size of the
subgroups) when subgroup analyses will be performed during the main analysis phase.
The characteristics of the data sample can be assessed by looking at:

 Basic statistics of important variables
 Scatter plots
 Correlations
 Cross-tabulations[9]

Final stage of the initial data analysis

During the final stage, the findings of the initial data analysis are documented, and necessary,
preferable, and possible corrective actions are taken.
Also, the original plan for the main data analyses can and should be specified in more detail
and/or rewritten.
In order to do this, several decisions about the main data analyses can and should be made:

 In the case of non-normals: should one transform variables; make variables categorical
(ordinal/dichotomous); adapt the analysis method?
 In the case of missing data: should one neglect or impute the missing data; which
imputation technique should be used?
 In the case of outliers: should one use robust analysis techniques?
 In case items do not fit the scale: should one adapt the measurement instrument by
omitting items, or rather ensure comparability with other (uses of the) measurement
instrument(s)?
 In the case of (too) small subgroups: should one drop the hypothesis about inter-group
differences, or use small sample techniques, like exact tests or bootstrapping?
 In case the randomization procedure seems to be defective: can and should one calculate
propensity scores and include them as covariates in the main analyses?[10]

Analyses

Several analyses can be used during the initial data analysis phase:[11]

 Univariate statistics
 Bivariate associations (correlations)
 Graphical techniques (scatter plots)

It is important to take the measurement levels of the variables into account for the analyses, as
special statistical techniques are available for each level:[12]

 Nominal and ordinal variables
o Frequency counts (numbers and percentages)
o Associations
 crosstabulations (contingency tables)
 hierarchical loglinear analysis (restricted to a maximum of 8 variables)
 loglinear analysis (to identify relevant/important variables and possible
confounders)
o Exact tests or bootstrapping (in case subgroups are small)
o Computation of new variables

 Continuous variables
o Distribution
 Statistics (M, SD, variance, skewness, kurtosis)
 Stem-and-leaf displays
 Box plots

ANOVA:
ANalysis Of VAriance between groups

 "Analysis of Variance." A statistical test for heterogeneity of means by analysis of group variances. ANOVA is

implemented as ANOVA[data] in the Mathematica package ANOVA`) .

 To apply the test, assume random sampling of a variate with equal variances, independent errors, and a normal distribution. Let n be the number of replicates (sets of identical observations) within each of the k factor levels (treatment groups), and let y_ij be the j-th observation within factor level i. Also assume that the ANOVA is "balanced" by restricting n to be the same for each factor level.

 Now define the sum-of-squares terms

SST = sum over i and j of (y_ij - ybar)^2                   (total)
SSA = n * sum over i of (ybar_i - ybar)^2                   (treatment)
SSE = sum over i and j of (y_ij - ybar_i)^2 = SST - SSA     (error)

which are the total, treatment, and error sums of squares. Here, ybar_i is the mean of the observations within factor level i, and ybar is the "group" mean (i.e., the mean of means). Compute the entries in the following table, obtaining the P-value corresponding to the calculated F-ratio of the mean squared values:

F = MSA / MSE = [SSA / (k - 1)] / [SSE / (k(n - 1))]

category     degrees of freedom     sum of squares     mean squared             F-ratio
treatment    k - 1                  SSA                MSA = SSA / (k - 1)      MSA / MSE
error        k(n - 1)               SSE                MSE = SSE / (k(n - 1))
total        kn - 1                 SST
 If the P-value is small, reject the null hypothesis that all means are the same for the different groups.
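
As a quick illustration, a balanced one-way ANOVA can be computed with scipy; the sketch below uses made-up data for three treatment groups and prints the F-ratio and P-value:

```python
# Sketch: a balanced one-way ANOVA with n replicates in each of k treatment
# groups, using scipy (the data are made up for illustration).
from scipy import stats

group_a = [23, 25, 27, 22, 26]
group_b = [30, 31, 29, 32, 28]
group_c = [24, 23, 26, 25, 27]

f_ratio, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_ratio:.3f}, P = {p_value:.4f}")
# If the P-value is small, reject the null hypothesis that all group means are equal.
```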

Discriminant Analysis
Introduction
 
Discriminant analysis is a technique for classifying a set of observations into predefined classes. The
purpose is to determine the class of an observation based on a set of variables known as predictors or
input variables. The model is built based on a set of observations for which the classes are known. This
set of observations is sometimes referred to as the training set. Based on the training set, the technique
constructs a set of linear functions of the predictors, known as discriminant functions, such that
L = b1x1 + b2x2 + ... + bnxn + c, where the b's are discriminant coefficients, the x's are the input
variables or predictors, and c is a constant.
 
These discriminant functions are used to predict the class of a new observation with unknown class. For a
k-class problem, k discriminant functions are constructed. Given a new observation, all k discriminant
functions are evaluated and the observation is assigned to class i if the i-th discriminant function has the
highest value.
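
A minimal sketch of this idea, assuming scikit-learn and a made-up training set, builds the discriminant functions from labelled observations and assigns a new observation to the class with the highest score:

```python
# Sketch: building discriminant functions from a labelled training set and
# classifying a new observation (scikit-learn assumed; data are invented).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Training set: rows are observations, columns are predictors; y holds the classes.
X_train = np.array([[1.0, 2.1], [0.9, 1.8],
                    [3.2, 4.0], [3.0, 4.2],
                    [5.1, 0.9], [4.8, 1.1]])
y_train = np.array(["a", "a", "b", "b", "c", "c"])

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# A new observation of unknown class is assigned to the class whose
# discriminant function scores highest.
print(lda.predict([[3.1, 3.9]]))
print(lda.coef_, lda.intercept_)   # the b's and c of the linear functions
```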
 
 
 

Discriminant Analysis
Discriminant Analysis may be used for two objectives: either we want to assess the adequacy of
classification, given the group memberships of the objects under study; or we wish to assign objects to
one of a number of (known) groups of objects. Discriminant Analysis may thus have a descriptive or a
predictive objective.

In both cases, some group assignments must be known before carrying out the Discriminant
Analysis. Such group assignments, or labelling, may be arrived at in any way. Hence
Discriminant Analysis can be employed as a useful complement to Cluster Analysis (in order to
judge the results of the latter) or Principal Components Analysis. Alternatively, in star-galaxy
separation, for instance, using digitised images, the analyst may define group (stars, galaxies)
membership visually for a conveniently small training set or design set.    

Methods implemented in this area are Multiple Discriminant Analysis, Fisher's Linear
Discriminant Analysis, and K-Nearest Neighbours Discriminant Analysis.

Multiple Discriminant Analysis

(MDA) is also termed Discriminant Factor Analysis and Canonical Discriminant Analysis. It
adopts a similar perspective to PCA: the rows of the data matrix to be examined constitute
points in a multidimensional space, as also do the group mean vectors. Discriminating axes are
determined in this space, in such a way that optimal separation of the predefined groups is
attained. As with PCA, the problem becomes mathematically the eigenreduction of a real,
symmetric matrix. The eigenvalues represent the discriminating power of the associated
eigenvectors. The n_y groups lie in a space of dimension at most n_y - 1. This will be the number of
discriminant axes or factors obtainable in the most common practical case when n > m > n_y
(where n is the number of rows, and m the number of columns of the input data matrix).
Linear Discriminant Analysis

is the 2-group case of MDA.   It optimally separates two groups, using the Mahalanobis metric or
generalized distance.     It also gives the same linear separating decision surface as Bayesian
maximum likelihood discrimination in the case of equal class covariance matrices.

K-NNs Discriminant Analysis

Non-parametric (distribution-free) methods dispense with the need for assumptions regarding
the probability density function. They have become very popular especially in the image
processing area. The K-NNs method assigns an object of unknown affiliation to the group to
which the majority of its K nearest neighbours belongs.
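
A short sketch of the K-NN rule, assuming scikit-learn and invented star/galaxy training data, assigns a new object by majority vote among its K nearest neighbours:

```python
# Sketch: the K-nearest-neighbours rule -- assign an object to the group to
# which the majority of its K nearest neighbours belong (scikit-learn assumed).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.1, 0.2], [0.2, 0.1],
                    [0.9, 1.0], [1.1, 0.9], [0.95, 1.05]])
y_train = np.array(["stars", "stars", "galaxies", "galaxies", "galaxies"])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[0.8, 0.9]]))   # majority vote among the 3 nearest neighbours
```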

There is no best discrimination method. A few remarks concerning the advantages and
disadvantages of the methods studied are as follows.

 Analytical simplicity or computational reasons may lead to initial consideration of linear
discriminant analysis or the NN-rule.
 Linear discrimination is the most widely used in practice. Often the 2-group method is used
repeatedly for the analysis of pairs of multigroup data (yielding decision surfaces for
k groups).
 To estimate the parameters required in quadratic discrimination, more computation and data are
required than in the case of linear discrimination. If there is not a great difference in the group
covariance matrices, then the latter will perform as well as quadratic discrimination.
 The k-NN rule is simply defined and implemented, especially if there is insufficient data to
adequately define sample means and covariance matrices.
 MDA is most appropriately used for feature selection. As in   the case of PCA, we may want to
focus on the variables used in order to investigate the differences between groups; to create
synthetic variables which improve the grouping ability of the data; to arrive at a similar objective
by discarding irrelevant variables; or to determine the most parsimonious variables for graphical
representational purposes.

Factor Analysis

Factor analysis includes both component analysis and common factor analysis. More than other
statistical techniques, factor analysis has suffered from confusion concerning its very purpose.
This affects my presentation in two ways. First, I devote a long section to describing what factor
analysis does before examining in later sections how it does it. Second, I have decided to reverse
the usual order of presentation. Component analysis is simpler, and most discussions present it
first. However, I believe common factor analysis comes closer to solving the problems most
researchers actually want to solve. Thus learning component analysis first may actually interfere
with understanding what those problems are. Therefore component analysis is introduced only
quite late in this chapter.
What Factor Analysis Can and Can't Do
I assume you have scores on a number of variables-- anywhere from 3 to several hundred variables, but
most often between 10 and 100. Actually you need only the correlation or covariance matrix--not the
actual scores. The purpose of factor analysis is to discover simple patterns in the pattern of relationships
among the variables. In particular, it seeks to discover if the observed variables can be explained largely
or entirely in terms of a much smaller number of variables called factors.
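
As a rough illustration, assuming scikit-learn, the sketch below generates six observed variables that share one underlying factor and fits a factor analysis to recover the loadings:

```python
# Sketch: looking for a small number of factors behind a set of observed
# variables (scikit-learn assumed; the data are random and purely illustrative).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
g = rng.normal(size=(200, 1))                    # one underlying factor
loadings = rng.normal(size=(1, 6))               # how each variable depends on it
X = g @ loadings + 0.5 * rng.normal(size=(200, 6))   # six observed variables + noise

fa = FactorAnalysis(n_components=2)
fa.fit(X)
print(fa.components_)        # estimated loadings of each variable on each factor
print(fa.noise_variance_)    # unique (unexplained) variance of each variable
```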

Some Examples of Factor-Analysis Problems

1. Factor analysis was invented nearly 100 years ago by psychologist Charles Spearman, who
hypothesized that the enormous variety of tests of mental ability--measures of mathematical skill,
vocabulary, other verbal skills, artistic skills, logical reasoning ability, etc.--could all be explained by one
underlying "factor" of general intelligence that he called g. He hypothesized that if g could be measured
and you could select a subpopulation of people with the same score on g, in that subpopulation you
would find no correlations among any tests of mental ability. In other words, he hypothesized that g was
the only factor common to all those measures.

It was an interesting idea, but it turned out to be wrong. Today the College Board testing service
operates a system based on the idea that there are at least three important factors of mental
ability--verbal, mathematical, and logical abilities--and most psychologists agree that many other
factors could be identified as well.

2. Consider various measures of the activity of the autonomic nervous system--heart rate, blood
pressure, etc. Psychologists have wanted to know whether, except for random fluctuation, all
those measures move up and down together--the "activation" hypothesis. Or do groups of
autonomic measures move up and down together, but separate from other groups? Or are all the
measures largely independent? An unpublished analysis of mine found that in one data set, at any
rate, the data fitted the activation hypothesis quite well.

3. Suppose many species of animal (rats, mice, birds, frogs, etc.) are trained that food will appear
at a certain spot whenever a noise--any kind of noise--comes from that spot. You could then tell
whether they could detect a particular sound by seeing whether they turn in that direction when
the sound appears. Then if you studied many sounds and many species, you might want to know
on how many different dimensions of hearing acuity the species vary. One hypothesis would be
that they vary on just three dimensions--the ability to detect high-frequency sounds, ability to
detect low-frequency sounds, and ability to detect intermediate sounds. On the other hand,
species might differ in their auditory capabilities on more than just these three dimensions. For
instance, some species might be better at detecting sharp click-like sounds while others are better
at detecting continuous hiss-like sounds.

4. Suppose each of 500 people, who are all familiar with different kinds of automobiles, rates
each of 20 automobile models on the question, "How much would you like to own that kind of
automobile?" We could usefully ask about the number of dimensions on which the ratings differ.
A one-factor theory would posit that people simply give the highest ratings to the most expensive
models. A two-factor theory would posit that some people are most attracted to sporty models
while others are most attracted to luxurious models. Three-factor and four-factor theories might
add safety and reliability. Or instead of automobiles you might choose to study attitudes
concerning foods, political policies, political candidates, or many other kinds of objects.

5. Rubenstein (1986) studied the nature of curiosity by analyzing the agreements of junior-high-
school students with a large battery of statements such as "I like to figure out how machinery
works" or "I like to try new kinds of food." A factor analysis identified seven factors: three
measuring enjoyment of problem-solving, learning, and reading; three measuring interests in
natural sciences, art and music, and new experiences in general; and one indicating a relatively
low interest in money.

Conjoint analysis
Conjoint Analysis is an advanced market research technique that gets under the skin of how
people make decisions and what they really value in products and services (see demonstration).
Conjoint analysis is perfect for answering questions such as "Which should we do, build in more
features, or bring our prices down?" or "Which of these changes will hurt our competitors
most?" Also see our conjoint analysis model demonstration, or for more detail see our paper on
conjoint analysis (.rtf) and case study examples.

"We were looking for an agency that could understand our solutions and complex customer base in order to
transfer this understanding into a comprehensive customer survey.

dobney.com quickly gained deep insight into the specificities of our business and designed an excellent,
state-of-the-art conjoint survey. They delivered professional and individual service of a quality we had never
experienced before. It was great working with dobney.com and the findings derived from the survey are
invaluable for us."

Marketing Manager, Leica Microsystems 2009

Every customer making choices between products and services is faced with trade-offs. Is high
quality more important than a low price and quick delivery for instance? Or is good service more
important than design and looks?

For businesses, understanding precisely how markets value different elements of the product and
service mix means product development can be optimised and aspects such as pricing tuned to
customers' willingness to pay for specific features.

Conjoint Analysis is a technique developed since the 1970s that allows you to work out the
hidden rules people use to make trade-offs between different products and services and the
values they place on different features. By understanding precisely how people make decisions
and what they value in your products and services, you can work out the optimum level of
features and services that balance value to the customer against cost to the company.

The principle behind conjoint analysis is to break a product or service down into its constituent
parts (see conjoint design) and then to test combinations of these parts to look at what customers
prefer. By designing the study appropriately, it is then possible to use statistical analysis to work
out the value of each part in driving the customer's decision. See a fully worked up conjoint
analysis example using Excel.

For example a computer may be described in terms of attributes such as processor type, hard
disk size and amount of memory. Each of these attributes is broken down into levels - for
instance levels of the attribute for memory size might be 1GB, 2GB, 3GB and 4GB.

These attributes and levels can be used to define different products or product profiles. The first
stage in conjoint analysis is to create a set of product profiles which customers or respondents
are then asked to compare and choose from. Obviously, the number of potential profiles
increases rapidly with every new attribute, so there are techniques to simplify both the number of
profiles to be tested and the way in which preferences are discovered. Different flavours of
conjoint analysis have different approaches, strengths, and weaknesses.

By analysing which items are chosen or preferred from the product profiles offered to the
customer, it is possible to work out statistically what is driving the preference from the
attributes and levels shown and, more importantly, to give an implicit numerical valuation for each
attribute and level.
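
A toy sketch of this process, assuming pandas and scikit-learn and using invented attributes, levels, and ratings, generates the full set of product profiles and recovers part-worths with a dummy-coded regression:

```python
# Sketch: generating product profiles from attributes/levels and recovering
# part-worths with a dummy-coded regression (pandas/scikit-learn assumed;
# attributes, levels, and ratings are invented for illustration).
import itertools
import pandas as pd
from sklearn.linear_model import LinearRegression

attributes = {
    "processor": ["standard", "fast"],
    "memory": ["1GB", "2GB", "4GB"],
    "price": ["low", "high"],
}

# Full set of product profiles (every combination of levels: 2 x 3 x 2 = 12).
profiles = pd.DataFrame(
    list(itertools.product(*attributes.values())), columns=list(attributes)
)

# Hypothetical preference ratings collected for each profile.
profiles["rating"] = [3, 5, 4, 6, 6, 8, 2, 4, 3, 5, 5, 7]

X = pd.get_dummies(profiles[list(attributes)], drop_first=True)
model = LinearRegression().fit(X, profiles["rating"])

# The coefficients are the implicit value (part-worth) of each level,
# relative to the attribute's dropped baseline level.
print(dict(zip(X.columns, model.coef_.round(2))))
```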

The result is a detailed picture of how customers make decisions (see the demonstration to see it
at work), a picture that can be used to build market models which can predict market share under
new market conditions and test the impact of product or service changes on the market, to see
where and how you can gain the greatest improvements over your competitors. Not surprisingly,
conjoint analysis has become a key tool in building and developing market strategies.

By combining these market models with internal project costings, companies can evaluate
decisions in terms of Return on Investment (ROI) before going to market, for example when
determining what resources to put into New Product Development and in what areas. Conjoint
analysis also forms the basis of much pricing research and powerful needs-based segmentation.

To help you understand more about what Conjoint Analysis tells you and how it works, there is a
more detailed overview of conjoint analysis: click here (Word .rtf document, 90k). At the heart of
conjoint analysis is breaking a product or service down into attributes and levels, which provides
an extremely powerful way of looking at what you offer.

MULTIDIMENSIONAL SCALING

Forrest W. Young, University of North Carolina

This paper originally appeared in Kotz-Johnson (Ed.) Encyclopedia of Statistical


Sciences, Volume 5, Copyright (c) 1985 by John Wiley & Sons, Inc. This HTML
document derives from the original as follows: The original was scanned and converted
to text by optical-character-recognition software, which was then edited using MS-
WORD. This in turn was converted to HTML automatically. The figures also derive from
scanning the originals. I apologize for poor quality and mistakes in processing the text.
The software and hardware for doing this is far from perfect.
 

ABSTRACT

In this entry we summarize the major types of multidimensional scaling (MDS), the
distance models used by MDS, the similarity data analyzed by MDS, and the computer
programs that implement MDS. We also present three brief examples. We do not
discuss experimental design, interpretation, or the mathematics of the algorithms. The
entry should be helpful to those who are curious about what MDS is and to those who
wish to know more about the types of data and models relevant to MDS. It should help
the researcher, the statistical consultant, or the data analyst who needs to decide if
MDS is appropriate for a particular set of data and what computer program should be
used.

For a more complete, but still brief, introduction to MDS, the reader should turn to
Kruskal and Wish [6]. A complete discussion of the topics covered here, as well as of
experimental design, data analysis, and interpretive procedures, can be found in
Schiffman et al. [14]. An intermediate-level mathematical treatment of some MDS
algorithms is given in Davison (1983). An advanced treatment of the theory of MDS,
illustrated with innovative applications, is presented by Young and Hamer [21].
Reviews of the current state of the art are presented by Young (1984a; 1984b).
Multidimensional scaling is related to principal component analysis, factor analysis,
cluster analysis, and numerical taxonomy: the reader is referred to the appropriate
entries in this encyclopedia, along with the SCALING and PROXIMITY DATA entries.

1. OVERVIEW OF MULTIDIMENSIONAL SCALING

Multidimensional scaling (MDS) is a set of data analysis techniques that display


the structure of distance-like data as a geometrical picture. It is an extension of
the procedure discussed in SCALING.

MDS has its origins in psychometrics, where it was proposed to help understand
people's judgments of the similarity of members of a set of objects. Torgerson
[18] proposed the first MDS method and coined the term, his work evolving from
that of Richardson [11]. MDS has now become a general data analysis technique
used in a wide variety of fields [14]. For example, the book on theory and
applications of MDS by Young and Hamer [21] presents applications of MDS in
such diverse fields as marketing, sociology, physics, political science, and
biology. However, we limit our examples here to the field with which the author is
most familiar, psychology.

MDS pictures the structure of a set of objects from data that approximate the
distances between pairs of the objects. The data, which are called similarities,
dissimilarities, distances, or proximities, must reflect the amount of (dis)similarity
between pairs of objects. In this article we use the term similarity generically to refer
to both similarities (where large numbers refer to great similarity) and to
dissimilarities (where large numbers refer to great dissimilarity).

In addition to the traditional human similarity judgment, the data can be an


"objective" similarity measure (the driving time between pairs of cities) or an
index calculated from multivariate data (the proportion of agreement in the votes
cast by pairs of senators). However, the data must always represent the degree
of similarity of pairs of objects (or events).

Each object or event is represented by a point in a multidimensional space. The


points are arranged in this space so that the distances between pairs of points
have the strongest possible relation to the similarities among the pairs of objects.
That is, two similar objects are represented by two points that are close together,
and two dissimilar objects are represented by two points that are far apart. The
space is usually a two- or three-dimensional Euclidean space, but may be non-
Euclidean and may have more dimensions.
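
A small sketch of this, assuming scikit-learn and a made-up matrix of driving times between four cities, arranges one point per city in two dimensions so that inter-point distances approximate the driving times:

```python
# Sketch: metric MDS on an "objective" similarity measure -- here a small,
# invented matrix of driving times between four cities (scikit-learn assumed).
import numpy as np
from sklearn.manifold import MDS

cities = ["A", "B", "C", "D"]
driving_time = np.array([
    [0, 2, 6, 7],
    [2, 0, 5, 6],
    [6, 5, 0, 2],
    [7, 6, 2, 0],
], dtype=float)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
points = mds.fit_transform(driving_time)   # one 2-D point per city
for name, (x, y) in zip(cities, points):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```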

MDS is a generic term that includes many different specific types. These types
can be classified according to whether the similarities data are qualitative (called
nonmetric MDS) or quantitative (metric MDS). The number of similarity matrices
and the nature of the MDS model can also classify MDS types. This classification
yields classical MDS (one matrix, unweighted model), replicated MDS (several
matrices, unweighted model), and weighted MDS (several matrices, weighted
model). We discuss the nonmetric-metric and the classical-replicated-weighted
classifications in the following sub-sections.
General Purpose
The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping
objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize
observed data into meaningful structures, that is, to develop taxonomies. In other words cluster analysis is an exploratory data
analysis tool which aims at sorting different objects into groups in a way that the degree of association between two objects is
maximal if they belong to the same group and minimal otherwise. Given the above, cluster analysis can be used to discover
structures in data without providing an explanation/interpretation. In other words, cluster analysis simply discovers structures in data
without explaining why they exist.

We deal with clustering in almost every aspect of daily life. For example, a group of diners sharing the same table in a restaurant
may be regarded as a cluster of people. In food stores items of similar nature, such as different types of meat or vegetables are
displayed in the same or nearby locations. There is a countless number of examples in which clustering plays an important role. For
instance, biologists have to organize the different species of animals before a meaningful description of the differences between
animals is possible. According to the modern system employed in biology, man belongs to the primates, the mammals, the
amniotes, the vertebrates, and the animals. Note how in this classification, the higher the level of aggregation the less similar are the
members in the respective class. Man has more in common with all other primates (e.g., apes) than he does with the more "distant"
members of the mammals (e.g., dogs), etc. For a review of the general categories of cluster analysis methods, see Joining (Tree
Clustering), Two-way Joining (Block Clustering), and k-Means Clustering. In short, whatever the nature of your
business is, sooner or later you will run into a clustering problem of one form or another.
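
As a concrete sketch of one such method, the fragment below (scikit-learn assumed; the data are invented) runs k-means clustering on six objects that form two natural groups:

```python
# Sketch: k-means clustering -- sorting objects into groups so that objects in
# the same group are as similar as possible (scikit-learn assumed; toy data).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],     # one natural group
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])    # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)          # cluster membership of each object
print(kmeans.cluster_centers_) # the "centre" of each discovered group
```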

Statistical Significance Testing


Note that the above discussions refer to clustering algorithms and do not mention anything about statistical significance testing. In
fact, cluster analysis is not as much a typical statistical test as it is a "collection" of different algorithms that "put objects into
clusters according to well defined similarity rules." The point here is that, unlike many other statistical procedures, cluster analysis
methods are mostly used when we do not have any a priori hypotheses, but are still in the exploratory phase of our research. In a
sense, cluster analysis finds the "most significant solution possible." Therefore, statistical significance testing is really not
appropriate here, even in cases when p-levels are reported (as in k-means clustering).

Area of Application
Clustering techniques have been applied to a wide variety of research problems. Hartigan (1975) provides an excellent summary of
the many published studies reporting the results of cluster analyses. For example, in the field of medicine, clustering diseases, cures
for diseases, or symptoms of diseases can lead to very useful taxonomies. In the field of psychiatry, the correct diagnosis of clusters
of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy. In archeology, researchers have attempted to
establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques. In general, whenever we need to
classify a "mountain" of information into manageable meaningful piles, cluster analysis is of great utility.
