
Klaus Backhaus

Bernd Erichson
Sonja Gensler
Rolf Weiber
Thomas Weiber

Multivariate
Analysis
An Application-Oriented
Introduction
Second Edition

Your bonus with the purchase of this book


With the purchase of this book, you can use our “SN Flashcards” app to access
questions free of charge in order to test your learning and check your
understanding of the contents of the book. To use the app, please follow the
instructions below:

1. Go to https://flashcards.springernature.com/login
2. Create a user account by entering your e-mail address and
assigning a password.
3. Use the link provided in one of the first chapters to access your
SN Flashcards set.

Your personal SN Flashcards link is provided in one of the first chapters.

If the link is missing or does not work, please send an e-mail with the subject
“SN Flashcards” and the book title to customerservice@springernature.com.
Klaus Backhaus · Bernd Erichson ·
Sonja Gensler · Rolf Weiber · Thomas Weiber

Multivariate Analysis
An Application-Oriented Introduction
Second Edition
Klaus Backhaus
University of Münster
Münster, Nordrhein-Westfalen, Germany

Bernd Erichson
Otto-von-Guericke-University Magdeburg
Magdeburg, Sachsen-Anhalt, Germany

Sonja Gensler
University of Münster
Münster, Nordrhein-Westfalen, Germany

Rolf Weiber
University of Trier
Trier, Rheinland-Pfalz, Germany

Thomas Weiber
Munich, Bayern, Germany

ISBN 978-3-658-40410-9    ISBN 978-3-658-40411-6 (eBook)
https://doi.org/10.1007/978-3-658-40411-6

English Translation of the 17th original German edition published by Springer Fachmedien Wiesbaden,
Wiesbaden, 2023
© Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2021, 2023
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage
and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors
or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

This Springer Gabler imprint is published by the registered company Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature.
The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany
Preface 2nd Edition

The new edition of our book “Multivariate Analysis—An Application-Oriented Introduction” has been very well received on the market. This success has motivated us
to quickly work on the 2nd edition in English and the 17th edition in German. As far as
the content of the book is concerned, the German and English versions are identical. The
2nd English edition differs from the 1st edition in the following aspects:

• The latest version of SPSS (version 29) was used to create the figures.
• Errors in the first edition, which occurred despite extremely careful editing, have been
corrected. We are confident that we have now fixed (almost) all major mistakes. A
big thank you goes to retired math and physics teacher Rainer Obst, Supervisor of
the Freiherr vom Stein Graduate School in Gladenbach, Germany, who read both the
English and German versions very meticulously and uncovered quite a few inconsist-
encies and errors.
• We made one major change in the chapter on cluster analysis: the application example was adapted to improve the plausibility of the results.

Also for the 2nd edition, research assistants have supported us energetically. Special
thanks go to the research assistants at the University of Trier, Mi Nguyen, Lorenz
Gabriel and Julian Morgen. They updated the literature, helped to correct errors and
adjusted all SPSS figures. They were supported by the student Sonja Güllich, who was
instrumental in correcting figures and creating SPSS screenshots. The coordination of
the work among the authors as well as with the publisher was again taken over by the
research assistant Julian Morgen, who again tirelessly and patiently accepted change
requests and again implemented them quickly. Last but not least, we would like to thank
Barbara Roscher and Birgit Borstelmann from Springer Verlag for their competent
support.
SpringerGabler also provides a set of electronic learning cards (so-called “flash-
cards”) to help readers test their knowledge. Readers may individualize their own learn-
ing environment via an app and add their own questions and answers. Access to the
flashcards is provided via a code printed at the end of the first chapter of the book.


We are pleased to present this second edition, which is based on the current version
of IBM SPSS Statistics 29 and has been thoroughly revised. Nevertheless, any remaining
mistakes are of course the responsibility of the authors.

Münster      Klaus Backhaus
Magdeburg    Bernd Erichson
Münster      Sonja Gensler
Trier        Rolf Weiber
Munich       Thomas Weiber
November 2022

Note on Excel Operation

The book Multivariate Analysis provides Excel formulas with semicolon separators. If
the regional language setting of your operating system is set to a country that uses peri-
ods as decimal separators, we ask you to use comma separators instead of semicolon
separators in Excel formulas.
Preface

This is the first English edition of a German textbook that covers methods of multivari-
ate analysis. It addresses readers who are looking for a reliable and application-oriented
source of knowledge in order to apply the discussed methods in a competent manner.
However, this English edition is not just a simple translation of the German version;
rather, we used our accumulated experience gained through publishing 15 editions of the
German textbook to prepare a completely new version that translates academic statistical
knowledge into an easy-to-read introduction to the most relevant methods of multivariate
analysis and is targeted at readers with comparatively little knowledge of mathematics
and statistics. The new version of the German textbook with exactly the same content is
now available as the 16th edition.
For all methods of multivariate analysis covered in the book, we provide case studies
which are solved with IBM’s software package SPSS (version 27). We follow a step-by-
step approach to illustrate each method in detail. All examples and case studies use the
chocolate market as an example because we assume that every reader will have some
affinity to this market and a basic idea of the factors involved in it.
This book constitutes the centerpiece of a comprehensive service offering that is cur-
rently being realized. On the website www.multivariate-methods.info, which accompa-
nies this book and offers supplementary materials, we provide a wide range of support
services for our readers:

• For each method discussed in this book, we created Microsoft Excel files that allow
the reader to conduct the analyses with the help of Excel. Additionally, we explain
how to use Excel for many of the equations mentioned in this book. By using Excel,
the reader may gain an improved understanding of the different methods.
• The book’s various data sets, SPSS jobs, and figures can also be requested via the
website.
• While in the book we use SPSS (version 27) for all case studies, R code (www.r-pro-
ject.org) is also provided on the website.
• In addition to the SPSS syntax provided in the book, we explain how to handle SPSS
in general and how to perform the analyses.


• In order to improve the learning experience, videos with step-by-step explanations of selected problems will be successively published on the website.
• SpringerGabler also provides a set of electronic learning cards (so-called “flash-
cards”) to help readers test their knowledge. Readers may individualize their own
learning environment via an app and add their own questions and answers. Access to
the flashcards is provided via a code printed in the book.

We hope that these initiatives will improve the learning experience for our readers. Apart
from the offered materials, we will also use the website to inform about updates and, if
necessary, to point out necessary corrections.
The preparation of this book would not have been possible without the support of our
staff and a large number of research assistants. On the staff side, we would like to thank
above all Mi Nguyen (MSc. BA), Lorenz Gabriel (MSc. BA), and Julian Morgen (M.
Eng.) of the University of Trier, who supported us with great meticulousness. For creat-
ing, editing and proofreading the figures, tables and screenshots, we would like to thank
Nele Jacobs (BSc. BA). Daniela Platz (BA), student at the University of Trier, provided
helpful hints for various chapters, thus contributing to the comprehensibility of the text.
We would also like to say thank you to Britta Weiguny, Phil Werner, and Kaja Banach
of the University of Münster. They provided us with help whenever needed. Heartfelt
thanks to Theresa Wild and Frederike Biskupski, both students at the University of
Münster, who provided feedback to improve the readability of this book.
Special thanks go to Julian Morgen (M. Eng.) who was responsible for the entire pro-
cess of coordination between the authors and SpringerGabler. Not only did he tirelessly
and patiently implement the requested changes, he also provided assistance with ques-
tions concerning the structure and layout of the chapters.
Finally, we would like to thank Renate Schilling for proofreading the English text and
making extensive suggestions for improvements and adaptations. Our thanks also go to
Barbara Roscher and Birgit Borstelmann of SpringerGabler who supported us continu-
ously with great commitment. Of course, the authors are responsible for all errors that
may still exist.

Münster      Klaus Backhaus
Magdeburg    Bernd Erichson
Münster      Sonja Gensler
Trier        Rolf Weiber
München      Thomas Weiber
April 2021
www.multivariate-methods.info

On our website www.multivariate-methods.info, we provide additional and supplementary material, publish updates and offer a platform for exchange among readers.
The website offers the following core services:

Methods
We provide supplementary and additional material (e.g., examples in Excel) for each
method discussed in the book.

FAQ
On this page, we post frequently asked questions and the answers.

Forum
The forum offers the opportunity to interact with the authors and other readers of the
book. We invite you to make suggestions and ask questions. We will make sure that you
get an answer or reaction.

Service
Here you may order all tables and figures published in the book as well as the SPSS data
and syntax files. Lecturers may use the material in their classes if the source of the mate-
rial is appropriately acknowledged.

Corrections
On this page, we inform the readers about any mistakes detected after the publication
of the book. We invite all readers to report any mistakes they may find on the Feedback
page.

Feedback
Here the authors invite the readers to share their comments on the book and to report any
ambiguities or errors by sending a message directly to the authors.


Professur für Marketing und Innovation
Univ.-Prof. Dr. Rolf Weiber
Universitätsring, Trier (Germany)
Contact: mva@uni-trier.de, phone: …

Sender:

____________________________
____________________________
____________________________
____________________________

E-Mail: ___________________________  Phone: __________________

Subject: Multivariate Analysis

Herewith I order:

• all data sets and SPSS syntax files for all methods covered in this book at a price of EUR …
• the complete set of figures for all methods covered in this book at a price of EUR …
• the set of figures as read-only PowerPoint files at a price of EUR … each, for the following chapters:

  1 Introduction to statistical data analysis    6 Contingency analysis
  2 Regression analysis                          7 Factor analysis
  3 Analysis of variance                         8 Cluster analysis
  4 Discriminant analysis                        9 Conjoint analysis
  5 Logistic regression

The documents will be sent by e-mail. If desired, other means of delivery (e.g., memory stick) are also possible. Please contact the authors at the above address for further information.

_____________________          ________________________
Date                           Signature

You may also place your order on our websites www.multivariate-methods.info or www.multivariate.de, or per mail to mva@uni-trier.de.
Contents

1 Introduction to Empirical Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


2 Regression Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3 Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4 Discriminant Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
5 Logistic Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6 Contingency Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
7 Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
8 Cluster Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
9 Conjoint Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533

Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601

1 Introduction to Empirical Data Analysis

Contents

1.1 Multivariate Data Analysis: Overview and Basics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


1.1.1 Empirical Studies and Quantitative Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Types of Data and Special Types of Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2.1 Scales of Empirical Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.2.2 Binary and Dummy Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.3 Classification of Methods of Multivariate Analysis. . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.3.1 Structure-testing Methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.3.2 Structure-Discovering Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.3.3 Summary and Typical Research Questions of the Different Methods. . . 14
1.2 Basic Statistical Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.1 Basic Statistical Measures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.2 Covariance and Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3 Statistical Testing and Interval Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3.1 Conducting a Test for the Mean. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.3.1.1 Critical Value Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.3.1.2 Using the p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.3.1.3 Type I and Type II Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.3.1.4 Conducting a One-tailed t-test for the Mean. . . . . . . . . . . . . . . . . . . . . . . 34
1.3.2 Conducting a Test for a Proportion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.3.3 Interval Estimation (Confidence Interval). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.4 Causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
1.4.1 Causality and Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
1.4.2 Testing for Causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.5 Outliers and Missing Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.5.1 Outliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.5.1.1 Detecting Outliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.5.1.2 Dealing with Outliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47


1.5.2 Missing Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


1.6 How to Use IBM SPSS, Excel, and R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

With the purchase of this book, you can use our “SN Flashcards” app to access
questions free of charge in order to test your learning and check your understanding
of the contents of the book. To use the app, please follow the instructions below:

1. Go to https://flashcards.springernature.com/login
2. Create a user account by entering your e-mail address and assigning a password.
3. Use the following link to access your SN Flashcards set: https://sn.pub/hhdvdz

If the code is missing or does not work, please send an e-mail with the subject “SN
Flashcards” and the book title to customerservice@springernature.com.

This introductory chapter introduces, characterizes and classifies the methods of multi-
variate data analysis covered in this book. It also presents the fundamentals of empirical
data analysis that are relevant to all these methods. Most readers will be familiar with the
fundamentals of statistical data analysis, so this is primarily intended as a repetition of
important aspects of quantitative data analysis.
This chapter comprises six sections:
Section 1.1 elaborates on the objectives of this book and summarizes the fundamen-
tals of empirical research. Data are the ‘raw material’ for multivariate methods. Thus, we
first describe the various types of data and their measurement levels. Furthermore, the
methods discussed in this book are characterized and classified.
Section 1.2 explains the basics of statistics (i.e., mean, variance, standard deviation,
covariance, and correlation) and illustrates how to compute some central statistical
measures.
Section 1.3 elaborates on the principles of statistical testing, the choice of the signif-
icance level and the use of the so-called p-value, thus laying the foundation for under-
standing the different statistical tests used in this book.
Section 1.4 explains the concept of causality in the context of multivariate analysis
which aims to describe, explain and predict real-life phenomena. While causality is not a
statistical concept, it is crucial for the interpretation of statistical results.
Section 1.5 deals with the problem of outliers and missing values in empirical data.
We discuss how to detect outliers and what influence they may have on the results of
empirical studies. Moreover, we illustrate the options provided by IBM SPSS for han-
dling missing values.
Section 1.6 introduces the IBM SPSS statistical software package. The SPSS proce-
dures used throughout this book are briefly presented. Users of R are referred to www.
multivariate-methods.info.

1.1 Multivariate Data Analysis: Overview and Basics

This book deals with methods of statistical data analysis that examine several variables
simultaneously and quantitatively analyze their relationships with the aim of describing
and explaining these relationships or predicting future developments. These methods are
collectively referred to as multivariate analysis techniques. Sometimes, only methods
of analysis that consider several dependent variables are explicitly referred to as mul-
tivariate. This applies, for example, to multivariate regression analysis (cf. Sect. 2.4.3)
or multivariate analysis of variance (cf. Sect. 3.4.1). In this book we follow the broad
understanding of the term, which regards bivariate analyses (that consider only two var-
iables at a time) as the simplest form of multivariate data analysis. However, the user
should be aware that in practice the interrelationships are usually much more complex
and require the consideration of more than just two variables. A more detailed differenti-
ation of the various methods of multivariate analysis is provided in Sect. 1.1.3.
Today, methods of multivariate analysis are one of the foundations of empirical
research in science. So, not surprisingly, the methods are still undergoing rapid devel-
opment. New methodological advancements are constantly being developed, new areas
of application are being opened up and new or improved software packages are being
developed. Despite their importance, beginners often hesitate to apply methods of multi-
variate analysis, mostly for the following reasons:

• General lack of knowledge of statistical methods and how to use them.


• Reluctance to use software packages for statistical analysis they do not completely
understand (‘black box’).
• Doubts and little knowledge about mathematics and statistics in general.

This book aims to address all these issues. To this end, we followed the following princi-
ples when writing this book:

1. We took great care to present the various methods in an easy and straightforward
manner. Instead of providing many methodological details, we focused on easy com-
prehensibility. The guiding principle for each chapter was to focus on the fundamen-
tals of each method and its application. We discuss statistical details only if necessary
to facilitate the understanding. We explain the computation in detail, so readers will
get a good idea of how the methods work as well as their possible applications and
limitations.
2. Short examples are used to support our explanations and to facilitate understanding. A
more extensive case study is presented at the end of each chapter to discuss the imple-
mentation of the different methods with the help of IBM SPSS statistical software.
3. We use simple examples from the chocolate market that are related to management
questions but are also easy to follow for non-business students or readers from other
disciplines.

4. For the case studies, we use IBM SPSS Statistics (SPSS for short). SPSS has become
very widely used, especially in higher education, but also in practice and research,
because of its ease of use. For demonstration we use the graphical user interface (GUI)
of SPSS. But we also offer the SPSS syntax commands at the end of each chapter so
readers can easily repeat the analysis (maybe with changed data or testing different
model specifications). All case study data may be ordered and downloaded from our
website. We show and explain the IBM SPSS output files in detail for each method.
5. For the case studies in Chaps. 4, 5, 7, and 8, we use the same data set to demonstrate
the similarities and differences between the different methods. In cases where we use a
different data set, we still stick to the chocolate market and related research questions.
6. For the calculation of smaller examples, there will be Excel files provided on our
website www.multivariate-methods.info.
7. For readers using R, we provide additional material for the case studies on our web-
site (www.multivariate-methods.info).

Overall, this book is targeted at novices and amateur users who would like to apply
the methods of multivariate analysis more knowledgeably. Therefore, all methods are
explained independently of each other, i.e., the different chapters may be read individu-
ally and in any order.

1.1.1 Empirical Studies and Quantitative Data Analysis

Empirical research involves the collection of data and their evaluation using qualitative
or quantitative methods. The primary objectives of empirical studies are:

• to describe reality (descriptive data analysis),


• to test statements or hypotheses developed logically or theoretically on the basis of
real data (confirmatory analysis),
• to discover (previously unknown) relationships in real data (exploratory analysis).

In most cases, empirical studies are based on some part of a population (sample) that is
used to draw conclusions about the entire population. The sample may be described by
certain statistical parameters, but the question is: What are the ‘true’ parameters in the
population? Using the sample data, central parameters can be estimated for the popula-
tion, which is explained in more detail in Sects. 1.2 and 1.3.
The data collected in the course of empirical studies usually refer to various attributes
and their manifestations. The numerically coded attributes of objects are called variables
and are usually denoted by letters (e.g., X, Y, Z). Their values express certain character-
istics of subjects or objects. Therefore, a variable varies across the considered subjects
(objects) and possibly also over time.

Table 1.1  Dependent and independent variables

Independent variables: X1, X2, X3, … (also called x-variables, explanatory variables, predictors, or covariates)
Dependent variables: Y1, Y2, Y3, … (also called y-variables, explained variables, response variables, or outcome variables)

Concerning variables, the following distinctions are important:

• manifest versus latent variables,


• dependent versus independent variables,
• scale or level of measurement of the variables.

Manifest variables are variables that may be directly observed or measured (e.g., weight,
height, hair color, sex, age, price, quantities, and income). In contrast, latent variables
are variables that cannot be directly observed, but are assumed to be related to manifest
(observable) variables. Hypothetical constructs such as trust, motivation, intelligence,
illness, or brand image are regarded as latent variables. To measure latent variables, suit-
able operationalizations using manifest variables are required. This means the value of a
latent variable is derived from various observed variables (compositional approach). Let
us take the construct “intelligence”, for example, which is measured by a linear combi-
nation of various observed variables. The measurement of observable variables can be
performed at different levels (so-called scales) (cf. Sect. 1.1.2.1).
In the majority of methods, a distinction is made between dependent and independent
variables (Table 1.1). Such a division is necessary if changes in some observed variable
(y-variable) are supposed to be explained by changes in another variable (x-variable). It
is assumed that a y-variable is dependent on the x-variables. In this case, we speak of
analysis of dependence or structure-testing methods (cf. Sect. 1.1.3.1).
The various dependency analyses may be distinguished according to the measurement
levels of the dependent and independent variables. A categorization of the methods along
these lines is discussed in Sect. 1.1.2.1. In so-called interdependency analyses, the vari-
ables are not subdivided a priori. In this book, we discuss the following interdependency
analyses: factor analysis and cluster analysis. Sect. 1.1.3.2 briefly explains the basic idea
of these two methods.

1.1.2 Types of Data and Special Types of Variables

Data are the ‘raw material’ of multivariate data analysis. In empirical research, we distin-
guish between different types of data

• cross-sectional data and time series data,


• observational data and experimental data.

Cross-sectional data are collected by observing many different subjects or objects at a single point or period in time. Time series data (longitudinal data) are collected at regular
time intervals (e.g., monthly sales of chocolate). They measure how a variable changes
over time. Time series data have a natural temporal ordering and are of special impor-
tance for making predictions. We can also collect cross-sectional time series data. Thus,
a combination of both types of data is possible.
Most data are observational data. Observational data also comprise survey data that
are obtained by questioning respondents. The term ‘observational’ indicates that the
researcher does not (or should not) have any influence on the data generation. For exam-
ple, the way of questioning should not affect a respondent’s answer to a question.
This is different for experimental data. In an experiment, the researcher actively
manipulates one or more independent variables X and observes the changes in a depend-
ent variable Y. When doing so, the researcher takes great care to cancel out influences
from other variables to be able to make statements on the effect of X on Y (usually to find
out if a causal relationship exists).

1.1.2.1 Scales of Empirical Data


Data quality is, among others, determined by the method of measurement. Measurement
means that attributes (characteristics) of objects (persons) are expressed by numbers
according to certain rules (e.g., a higher number means a greater weight of an object or
a person). How well certain variables can be measured differs. For example, a person’s
weight or height can easily be expressed in numbers, while intelligence or motivation are
more difficult to capture. Measurements can also be distinguished by different amounts
of information, which is reflected by their scale.
The most common distinction of scales was developed by Stevens (1946) who dis-
tinguished between nominal, ordinal, interval, and ratio scales. The nominal and ordinal
scales are also grouped together as non-metric or categorical scales. The interval scale
and the ratio scale are grouped together as metric or cardinal scales.
The scale determines both the information content of the data and permissible sta-
tistics. Table 1.2 summarizes the different scales and their properties. Among the
four scales, the nominal scale is considered the lowest and the ratio scale the highest.
Permissible arithmetic operations or statistics that are permissible on a lower scale are
also permissible on a higher scale, but not vice versa.

Scales According to Stevens


The nominal scale is the simplest type of measurement, where the assignment of numeri-
cal values serves only to identify categories. Examples of a nominal scale are:

• gender (male—female—diverse),
• religion (catholic—protestant—…),
• color (red—yellow—green—blue—…),
• advertising channel (television—newspapers—billboards—…),
• sales area (north—south—east—west).

Table 1.2  Scales (measurement levels) of variables

Non-metric (categorical) scales:
• Nominal scale: classification of qualitative attributes. Permissible statistics: frequency, mode.
• Ordinal scale: ranking with ordinal numbers. Permissible statistics: median, quantiles.

Metric (cardinal) scales:
• Interval scale: scale with equivalent distances between any two adjacent points, without a natural zero point. Permissible statistics: difference, mean value, standard deviation, correlation, t-test, F-test.
• Ratio scale: scale with equivalent distances between any two adjacent points and a natural zero point. Permissible statistics: sum, ratio, product, geometric mean, harmonic mean, coefficient of variation.

Nominal scales serve only for the identification of qualitative attributes. If the attribute
‘color’, for example, occurs in three variants, it can be coded as follows:

• red = 1,
• yellow = 2,
• green = 3.

The choice of numbers to identify the colors is arbitrary, i.e., other numbers (e.g., 5, 3,
7) could have been chosen just as well. The only condition is that different numbers are
assigned to different attributes and identical numbers to identical attributes (one-to-one
correspondence). For nominal variables, no arithmetic operations (such as addition, sub-
traction, multiplication, or division) are permissible. It is obvious that a mean value of
2.0 for the three colors mentioned above would not be meaningful. However, we can
infer frequencies or percentages by counting how often the values of the different attrib-
utes occur in a data set (e.g., 25% red; 40% yellow; 35% green).
The ordinal scale represents the next higher measurement level. It results from ranking.
Examples: product A is preferred to product B, or person M is more efficient
than person N. The ranking values (1st, 2nd, 3rd etc.) do not say anything about the dif-
ferences between the objects. It is therefore not possible to state from the ordinal scale by
how much product A is preferred to product B. Hence, ordinal data, just like nominal data,
cannot be used for arithmetic operations. But it is possible to calculate the median and quan-
tiles, in addition to counting frequencies. Ordinal variables are often treated as nominal (and
thus qualitative) variables (e.g., in the case of social classes: lower, middle, upper class).
The interval scale represents the next higher measurement level. This scale implies
intervals of equal size. Since the intervals may be infinitely small units (at least theo-
retically), we also speak of a continuous scale. But for simplicity’s sake, usually only
discrete values are used. A typical example is the Celsius scale for measuring temper-
ature, where the distance between the freezing point and the boiling point of water is
divided into 100 equally sized (discrete) intervals. With interval-scaled data, the differ-
ences between the data points also contain information (e.g., large or small temperature
differences), which is not the case in nominal or ordinal data. Since this scale has no
natural zero point, we cannot state that 20 °C is twice as warm as 10 °C. But we can
compute mean values, standard deviations, or correlations for interval-scaled data.
The ratio scale represents the highest measurement level. It differs from the interval
scale in that there is a natural zero point that can be interpreted as “not present” for the
attribute in question. This applies to most physical characteristics (e.g., length, weight,
speed) and most economic characteristics (e.g., income, costs, prices). In the case
of ratio-scaled data, not only the difference but also the ratio of the data makes sense.
Ratio-scaled data thus allow the application of all arithmetic operations as well as the
calculation of all statistical measures.
In general, we can say that the higher the measurement level, the greater the informa-
tion content of the data and the more arithmetic and statistical operations can be applied
to the data. It is possible to transform data from a higher to a lower scale, but not vice
versa. This can be useful for increasing the clarity of the data or for simplifying the anal-
ysis. For example, in many cases income classes or price classes are formed. This may
involve transforming the originally ratio-scaled data to an interval, ordinal or nominal
scale. Of course, transformation to a lower scale always involves a loss of information.
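
For illustration, such a transformation to a lower scale may be sketched in a few lines of R (the book provides its R material on www.multivariate-methods.info); the income values and class boundaries below are invented for this example:

# hypothetical ratio-scaled incomes (EUR per month)
income <- c(1200, 1850, 2400, 3100, 4700, 5200)

# transformation to ordinal income classes: clearer, but information is lost
income_class <- cut(income,
                    breaks = c(0, 2000, 4000, Inf),
                    labels = c("low", "medium", "high"),
                    ordered_result = TRUE)

table(income_class)   # frequencies remain permissible on the lower (ordinal) scale
mean(income)          # permissible on the ratio scale ...
# mean(income_class)  # ... but no longer meaningful for the ordinal classes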

Use of Rating Scales in Empirical Studies


In empirical research, so-called rating scales are often used, especially in behavioral sci-
ence studies. They are very popular because of their versatility and ease of use. Rating
scales are scales in which the respondents are asked to state their assessment of a certain
object (e.g., a product) on a given numerical scale. Typical examples of rating scales are:

• Assessment scales, which are used to assess, for example, the level of an object’s
quality or performance.
• Importance scales, which are used to assess, for example, the importance of an
attribute.
• Intensity scales, which indicate to what extent (low to high) an attribute is present.
• Agreement scales, which are used to determine the level of agreement or disagreement with a statement (cf. Fig. 1.1).

Developing a rating scale requires some strategic considerations. An important decision regards the number of levels (points) the scale should have. In practice, 5-point scales
are frequently used but even scales from 0 to 100 are possible. One decisive factor for
the number of levels is, among other things, whether the distances between the numer-
ical values are perceived by all respondents as equally large (equidistant). Only in this
case may a rating scale be interpreted as interval-scaled. Otherwise, the values have to be
assumed to be ordinally scaled.

Fig. 1.1  Example of a rating scale: “How much do you agree with the following statement ‘…’?”, answered on a 6-point scale from 1 (“I fully disagree”) to 6 (“I fully agree”)

1.1.2.2 Binary and Dummy Variables


Variables that comprise just two categories are called dichotomous variables. They are a
special case of nominal variables. For example:

• yes/no decisions (e.g., purchase/non-purchase),
• positive/negative test results (e.g., Corona-positive/Corona-negative),
• question of survival (e.g., patient survived/patient died),
• tossing a coin (head/tail).

If dichotomous variables are coded with 0 and 1, they are called binary variables or
dummy variables. Binary variables can be treated like metric variables. The mean value
of a binary variable indicates the proportion with which the attribute value coded with 1
occurs in a data set. If, for example, ‘purchase’ is coded with 1 and ‘no-purchase’ with 0,
the mean value of 0.75 means that the product was purchased by 75% of the respondents.
Nominal variables with more than two categories cannot be treated like metric varia-
bles. But it is possible to replace a nominal variable with several dummy variables that
can then be interpreted like metric variables.

Example
A supplier wants to know if the color of packaging influences consumers’ purchase
decisions. Three colors are considered: red, yellow and green.
The nominal variable ‘color’ may be represented by three dummy variables, each
of which may have the value 1 (color present) or 0 (otherwise). For the color red, for
example, the coding of the dummy variable q1 is as follows:
q_1 = \begin{cases} 1 & \text{if color = red} \\ 0 & \text{otherwise} \end{cases}
Similarly, a dummy variable q2 can be defined for the color yellow and a dummy
variable q3 for the color green. If only these three colors are considered, one of
the three dummies is redundant: if q1 = 0 and q2 = 0, q3 must be equal to 1. The
three colors can therefore be described by two dummy variables q1 and q2. The
third—unused—variable serves as the reference category. The choice of the reference category is arbitrary. In general, a nominal variable with n categories can be replaced
with (n−1) dummy variables. ◄
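
The recoding from this example can be written down directly in R; the observations below are made up for illustration and are not part of the book's case studies:

# hypothetical observations of the nominal variable 'color'
color <- factor(c("red", "yellow", "green", "yellow", "red"),
                levels = c("red", "yellow", "green"))

# manual dummy coding with 'red' as the (arbitrary) reference category
q1 <- as.numeric(color == "yellow")   # 1 if yellow, 0 otherwise
q2 <- as.numeric(color == "green")    # 1 if green,  0 otherwise

# the mean of a 0/1 dummy equals the proportion of cases coded 1
mean(q1)                              # share of 'yellow' observations (0.4 here)

# R creates the same (n-1) dummies automatically from a factor
model.matrix(~ color)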

Dummy variables may be treated like metric variables. The correlation of dummy vari-
ables with metric variables is called a point-biserial correlation and represents a special
case of the Pearson correlation (cf. Sect. 1.2.2).1
A problem arises if dummy variables are used for recoding nominal variables with
many categories. This would lead to an implicit weighting, since the number of variables
which ultimately measure the same property (e.g., color) increases. Caution is therefore
required when transforming a nominal variable with many categories. In empirical stud-
ies, it should always be stated whether dummy variables were used and how they were
assembled.

1.1.3 Classification of Methods of Multivariate Analysis

For this book, we selected methods of multivariate analysis that are frequently used in
science and practice and are of central importance for teaching in higher education.
The following methods are covered:

• Chapter 2: regression analysis (simple and multiple linear regression),


• Chapter 3: analysis of variance (ANOVA),
• Chapter 4: discriminant analysis,
• Chapter 5: logistic regression (binary and multinomial),
• Chapter 6: contingency analysis (cross tabulation),
• Chapter 7: factor analysis,
• Chapter 8: cluster analysis,
• Chapter 9: conjoint analysis (traditional and choice-based).

To classify these methods, we can distinguish between structure-testing and structure-discovering methods:

1. Structure-testing methods are procedures of multivariate analysis with the primary goal
of testing relationships between variables. In doing so, the dependence of a variable on
one or more independent variables (influencing factors) is considered. The user has prop-
ositions about the relationships between the variables based on logical or theoretical con-
siderations and would like to test these with the help of methods of multivariate analysis.

1 Both SPSS and R use the point-biserial calculation of a correlation if one of the variables has
only two calculation-relevant values.

Structure-testing methods are: regression analysis, analysis of variance (ANOVA), discriminant analysis, logistic regression, contingency analysis, and conjoint analysis.
2. Structure-discovering methods are procedures of multivariate analysis with the pri-
mary goal of discovering relationships between variables or between objects (sub-
jects). At the outset, the user has no propositions about the relationships existing
between the variables in a data set. Structure-discovering methods covered in this book
are: factor analysis and cluster analysis. Other methods, which are not covered in this
book, are multidimensional scaling, correspondence analysis, and neural networks.

However, it needs to be stressed that the methods cannot always be assigned exclusively
to the above two categories because sometimes the objectives of the different procedures
may overlap.

1.1.3.1 Structure-testing Methods
Structure-testing procedures are primarily used to carry out causal analyses in order to
establish cause and effect, for example, whether and to what extent the weather, soil con-
ditions, and different fertilizers have an effect on crop yield or how strongly the demand
for a product depends on its quality, price, advertising, and consumer income.
A prerequisite for applying these methods is that the user develops a priori (i.e., in
advance) hypotheses about the causal relationship between the variables. This means
that the user already knows or suspects which variables might affect another variable.
In order to test this hypothesis, the variables are usually categorized into dependent and
independent variables. The methods of multivariate analysis are used to test the proposi-
tions based on collected data. According to the measurement levels of the variables, the
basic structure-testing procedures can be characterized as shown in Table 1.3.

Regression Analysis
Regression analysis is a very flexible method and of great importance for describing and
explaining relationships as well as for making predictions. It is therefore certainly one
of the most important and most frequently used methods of multivariate analysis. In par-
ticular, it can be used to investigate relationships between a dependent variable and one
or more independent variables. With the help of regression analysis, such relationships
can be quantified, hypotheses can be tested, and forecasts can be made.

Table 1.3  Structure-testing methods of multivariate analysis covered in this book

Dependent variable metric, independent variable metric: regression analysis
Dependent variable metric, independent variable nominal: analysis of variance
Dependent variable nominal/ordinal, independent variable metric: discriminant analysis, logistic regression
Dependent variable nominal/ordinal, independent variable nominal: contingency analysis, conjoint analysis

Let us take, for example, the investigation of the relationship between a product’s
sales volume and its price, advertising expenditure, number of sales outlets, and national
income. Once these relationships have been quantified and confirmed with regression
analysis, forecasts (what-if analyses) can be made that predict how the sales volume will
change if, for example, the price or the advertising expenditure or both vary.
In general, regression analysis is applicable if both the dependent and the independ-
ent variables are metric variables. Qualitative (nominally scaled) variables may also be
included in regression analysis, if dummy variables (cf. Sect. 1.1.2.2) are used.
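
As a rough sketch of such a model in R (the data below are simulated for illustration only and are not the book's chocolate case study):

set.seed(1)
# simulated data: sales explained by price and advertising
price       <- runif(50, 1, 3)                       # EUR per bar
advertising <- runif(50, 0, 10)                      # budget in 1,000 EUR
sales       <- 500 - 80 * price + 20 * advertising + rnorm(50, sd = 30)

model <- lm(sales ~ price + advertising)             # estimate the linear model
summary(model)                                       # coefficients, t-tests, R^2

# what-if forecast: predicted sales at a price of 2 EUR and advertising of 5
predict(model, newdata = data.frame(price = 2, advertising = 5))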

Analysis of Variance (ANOVA)


Analysis of variance (ANOVA) may be applied if the independent variables are nomi-
nal and the dependent variable is metric. This method is particularly important for the
analysis of experiments, where the nominal independent variables represent experimental
treatments. ANOVA can be used, for example, if we want to investigate the effect of a
product’s packaging type or store placement on its sales volume—assuming that no other
factors are influencing the sales volume.

Discriminant Analysis
Discriminant analysis is used if the independent variables are metric and the depend-
ent variable is nominally scaled. Discriminant analysis is a method for analyzing differ-
ences between groups, for example, if we want to investigate whether and how the voting
behavior for political parties differs in terms of voters’ socio-demographic and psych-
ographic characteristics. The dependent nominal variable identifies the group member-
ship, that is, the elected political party, and the independent variables describe the group
elements, that is, the voters’ socio-demographic and psychographic characteristics.
Another area of application of discriminant analysis is the classification of elements.
For example, once relationships between group membership and characteristics have
been analyzed for a given set of objects (subjects), we can predict the group membership
of ‘new’ objects (subjects). These kinds of predictions are frequently used in credit scor-
ing (i.e., risk classification of bank customers applying for a loan) or in performance pre-
dictions (e.g., classification of sales representatives according to expected sales success).

Logistic Regression (Binary and Multinomial)


Logistic regression answers questions similar to those handled by discriminant analysis.
In this case, the probability of belonging to a group (i.e., a category of the dependent var-
iable) is determined as a function of one or more independent variables. If the dependent
variable has only two categories, a binary logistic regression is performed. If the depend-
ent variable has three or more categories, a multinomial logistic regression is conducted.
The independent variables can have both nominal and metric scales. In addition to the
analysis of group differences, logistic regression may also be used to predict an event.
For example, we can predict the risk of a heart attack for patients depending on their
age and cholesterol level. Since the (s-shaped) logistic function is used to estimate the
probabilities of different categories of the dependent variable, logistic regression is based on a non-linear model but has a linear systematic component (like the other methods in
this book).
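
A minimal sketch of a binary logistic regression in R, using simulated heart-attack data (the variable names and coefficients are invented for illustration):

set.seed(2)
# simulated data: heart attack (1/0) depending on age and cholesterol
age         <- round(runif(200, 40, 80))
cholesterol <- rnorm(200, 220, 30)
risk        <- plogis(-20 + 0.15 * age + 0.04 * cholesterol)   # true probabilities
attack      <- rbinom(200, 1, risk)

# binary logistic regression: s-shaped logistic link, linear systematic component
fit <- glm(attack ~ age + cholesterol, family = binomial)
summary(fit)

# predicted probability for a 60-year-old patient with a cholesterol level of 250
predict(fit, newdata = data.frame(age = 60, cholesterol = 250), type = "response")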

Contingency Analysis
Contingency analysis examines relations among two or more nominally scaled varia-
bles. For example, we can investigate the relation between smoking (smoker versus non-
smoker) and lung cancer (yes, no). We do so with the help of a cross-table (contingency
table) that maps the occurrence of combinations of levels of the nominal variables.
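
A small sketch in R of such a cross-table and the associated chi-square test of independence; the counts are made up for illustration:

# hypothetical 2x2 contingency table: smoking vs. lung cancer
tab <- matrix(c(80, 20,
                30, 70),
              nrow = 2, byrow = TRUE,
              dimnames = list(smoking = c("smoker", "non-smoker"),
                              cancer  = c("yes", "no")))

tab               # cross-table of observed frequencies
chisq.test(tab)   # tests whether the two nominal variables are independent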

Conjoint Analysis (Traditional and Choice-based)


The methods presented so far only distinguish between metric and nominal scales. One
method in which the dependent variable is often measured at an ordinal scale is con-
joint analysis. In particular, conjoint analysis can be used to analyze ordinally/metrically
measured preferences and nominally measured choice decisions (choice-based conjoint
(CBC) analysis). The aim is to infer the utility contributions of a product’s attributes
(and the levels of these attributes) to the product’s overall utility. In this way, the utility
of not yet existing products may also be evaluated, thus aiding in the designing of new
products.
To carry out a conjoint analysis, we need to determine in advance which attributes
(e.g., price) and which attribute levels (e.g., 1 EUR, 2 EUR) are relevant for consumers.
Then, we design an experimental study and survey to measure consumers’ preferences.
Conjoint analysis is thus a combination of both a survey method and an analysis method.

1.1.3.2 Structure-Discovering Methods
Structure-discovering methods are used to analyze correlations between variables or
between objects, so we do not split the set of variables into dependent and independent
variables in advance, as we do for structure-testing methods.

Factor Analysis (Exploratory)


Factor analysis is used when a large number of metric variables has been collected in a
specific context and the user is interested in reducing or bundling these variables. In this
case, we want to investigate whether a large number of variables can be traced back to a
few central ‘factors’. For example, we might try to reduce the different technical attrib-
utes of vehicles to a few dimensions, such as performance or safety.
Positioning analyses are one important application area of factor analysis. In this case,
we reduce the subjective assessments of attributes of specific objects (e.g., brands, com-
panies, or politicians) to some underlying assessment dimensions. If a reduction to two
or three dimensions is possible, the objects can be displayed graphically. This is called
factorial positioning, as distinguished from other forms of positioning analysis.
If the number of factors is extracted from the correlation matrix of the considered
variables, this is called exploratory factor analysis (EFA). If the number of factors and
their relationship to the underlying variables has been specified a priori, factor analysis
becomes a structure-testing instrument and is called confirmatory factor analysis (CFA).
Chap. 7 focuses on EFA, only briefly discussing CFA.

Cluster Analysis
While factor analysis is used to reduce the number of variables, cluster analysis is used
to bundle objects (subjects). The aim is to combine objects into different groups (i.e., clus-
ters) in such a way that all objects in one group are as similar to each other as possible,
while the groups are as dissimilar to each other as possible. Typical examples of research
questions tackled with cluster analysis are the definition of personality types based on
psychographic characteristics or the definition of market segments based on demand-rel-
evant consumer characteristics.
With a subsequent discriminant analysis, we can check to what extent the variables
that were used for clustering contribute to or explain the differences between the identi-
fied clusters.

1.1.3.3 Summary and Typical Research Questions of the Different Methods
The classification into structure-testing and structure-discovering methods should be
considered a categorization according to the predominant application of the methods.
But factor analysis may also be used to validate assumed relationships, and, in practice,
much too often regression and discriminant analysis are used to explore relationships
between variables. It should be pointed out that the mindless use of methods of mul-
tivariate analysis may easily lead to misinterpretations; thus, a statistically significant
relationship does not justify the assumption of a causal relationship. Therefore, it is gen-
erally recommended to use structure-testing methods for the empirical testing of theo-
retically or factually logical hypotheses (cf. the explanations on causality in Sect. 1.4).
Table 1.4 summarizes the methods of multivariate analysis outlined above, each with an
application example.

1.2 Basic Statistical Concepts

The discussion of the various methods in this book involves as little mathematics as pos-
sible and requires only basic statistical knowledge—equivalent to an introductory course
in statistics. To refresh the reader’s knowledge, we will in the following provide a short
recapitulation of the most relevant statistical measures and concepts:2

2 On www.multivariate-methods.info, the reader will also find an Excel sheet with information on
the calculation of the various statistical parameters using Excel.

Table 1.4  Summary of the methods of multivariate analysis covered in this book

Regression analysis: How does the sales volume of a product depend on price, advertising, and income?
Analysis of variance: Impact of alternative packaging designs on a product’s sales volume
Discriminant analysis: Distinction of voters of different parties based on socio-demographic and psychographic characteristics
Logistic regression: Investigation of the risk of heart attacks depending on a patient’s age and cholesterol level
Contingency analysis: Association between smoking and lung disease
Factor analysis: Consolidation of a multitude of variables into a lower number of factors
Cluster analysis: Identification of personality types based on psychographic characteristics
Conjoint analysis: Determination of the utility contribution of product attributes to a product’s overall utility in order to predict purchase behavior

• arithmetic mean of a variable (in short: mean),


• variance of a variable,
• standard deviation of a variable,
• covariance between two variables,
• correlation between two variables.

Table 1.5 provides the equations for these statistical measures. For simplicity’s sake, we
are using N to denote the sample size as well as the population size. Usually, however,
the latter is unknown.
To clarify whether a statistical parameter is calculated for a sample or denotes the
value of the population, different terms are used. Table 1.6 shows this distinction for the
mean, the variance, and the standard deviation.

1.2.1 Basic Statistical Measures

Arithmetic Mean
Large amounts of empirically collected quantitative data may be characterized very well
by a few descriptive statistics. The most important descriptive statistical measure is the
arithmetic mean x̄_j, which reflects the average value of a variable:3

3 In Excel, the mean of a variable can be calculated by: =AVERAGE(matrix), where (matrix) is the range of cells containing the data of the variable. For example, “=AVERAGE(C6:C55)” calculates the mean of the 50 cells C6 to C55 in column C. Note on Excel operation: The book Multivariate Analysis provides Excel formulas with semicolon separators. If the regional language setting of your operating system is set to a country that uses periods as decimal separators, we ask you to use comma separators instead of semicolon separators in Excel formulas.

Table 1.5  Equations for computing basic statistical measures

Mean (if taken from a sample, it provides an estimator of the true mean μ of the population):
\bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}, where N is the size of the population or sample

Variance (sample; estimator of the true variance σ_x² in the population):
s_x^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2

Variance (population, with N the size of the population):
\sigma_x^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

Standard deviation (sample; estimator of the true standard deviation σ_x in the population):
s_x = \sqrt{s_x^2}

Standard deviation (population):
\sigma_x = \sqrt{\sigma_x^2}

Covariance (sample; estimator of the true population covariance cov(x_1, x_2) or σ_{x_1,x_2}; with s_{xx} = s_x^2):
s_{x_1,x_2} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)

Standardization:
z_i = (x_i - \bar{x}) / s_x

Correlation (range: -1 \le r_{x_1,x_2} \le 1; symmetry: r_{x_1,x_2} = r_{x_2,x_1}):
r_{x_1,x_2} = \frac{\sum_{i=1}^{N} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sqrt{\sum_{i=1}^{N} (x_{1i} - \bar{x}_1)^2 \cdot \sum_{i=1}^{N} (x_{2i} - \bar{x}_2)^2}} = \frac{s_{x_1,x_2}}{s_{x_1} s_{x_2}} = \frac{1}{N-1} \sum_{i=1}^{N} z_{x_1 i}\, z_{x_2 i}

Table 1.6  Abbreviations of statistical measures in the sample and the population

Parameter                         Sample   Population
Mean of variable j                x̄j       µj
Variance of variable j            sj²      σj²
Standard deviation of variable j  sj       σj

x̄j = (1/N) · Σ(i=1..N) xij    (1.1)

Note on Excel operation: this book provides Excel formulas with semicolon separators. If the regional
language setting of your operating system is set to a country that uses periods as decimal separators,
please use comma separators instead of semicolon separators in Excel formulas.

with
xij observed value of variable j for person or object i
N number of cases in the data set
The mean value is a measure of central tendency, also called center or location of a dis-
tribution. It is most useful if the data follow an approximately symmetric distribution. If
this is not the case, the mean cannot be considered the ‘central’ value of the observations.

Variance
In addition to the mean, it is important to measure the dispersion (variability, variation)
of the data, i.e., the deviation of the observed values from the mean. The most impor-
tant measure of dispersion is the variance, which is the average of the squared deviations
from the mean.4 If N denotes the sample size, the following applies:

sj² = (1/(N−1)) · Σ(i=1..N) (xij − x̄j)²    (1.2)

with
xij observed value of variable j for person or object i
xj mean of variable j
N number of cases in the data set (sample size)

Standard Deviation
Since an average of squared deviations is difficult to interpret, the standard deviation,
which is the square root of the variance, is usually considered.5 An advantage of the
standard deviation compared to the variance is that it measures the dispersion in the
same units as the original data. This makes an interpretation easier and facilitates a com-
parison with the mean.
sj = √[ (1/(N−1)) · Σ(i=1..N) (xij − x̄j)² ]    (1.3)

with
xij observed value of variable j for person or object i
xj mean of variable j
N number of cases in the data set (sample size)

4 In Excel, the sample variance can be calculated by: sx2 = VAR.S(matrix). The population variance
can be calculated by: σx2 = VAR.P(matrix).
5 In Excel, the sample standard deviation can be calculated by: s = STDEV.S(matrix). The popula-
x
tion standard deviation is calculated by: σx = STDEV.P(matrix).

While these three statistical measures are very useful for describing even large amounts
of data, they are also of fundamental importance for all methods of multivariate analysis.
In the following, we will briefly illustrate how to calculate these measures.

Application Example
We consider a data set derived from five persons. For each person we collect data on his
or her age (x1), income (x2) and gender (x3). The variable ‘gender’ is a binary (dummy)
variable, with 0 = male and 1 = female. Table 1.7 lists the values collected for the five
persons (N = 5) (i.e., rows 2–6 of the columns “age”, “income”, and “gender”). In addition
to the three statistical measures we also computed the simple deviation from the mean
(x1 − x 1) and the squared deviation from the mean (x1 − x 1 )2 for all values. ◄

The sum of the deviations from the mean is always zero (see columns A, C and E). The
reason for this is the so-called centering property of the arithmetic mean, that is, the
mean is always located at that point in a series of data where positive and negative devia-
tions from the mean are exactly the same:

Σ(i=1..N) (xij − x̄j) = 0  and thus  Σ(i=1..N) xij = N · x̄j    (1.4)

with
xij observed value of variable j for person i
xj mean of variable j
N number of cases in the data set
The simple deviation from the mean is therefore not suitable as a measure of disper-
sion. Instead, the squared deviations from the mean are computed (see columns B, D and
F).6 If we divide the sum of the squared deviations by (N–1) and take the square root, we obtain the
standard deviation. The standard deviation (SD) is easy to interpret: It is a measure of
how much the observed values deviate on average from the mean value.
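These calculations can also be retraced outside of Excel or SPSS. The following minimal Python sketch (not part of the original case study; it assumes the NumPy library is available) computes mean, sample variance, and standard deviation for the age and income data of Table 1.7:

import numpy as np

age = np.array([25, 27, 24, 30, 29])                # variable x1 from Table 1.7
income = np.array([1800, 2000, 1900, 2800, 3500])   # variable x2 from Table 1.7

for name, x in (("age", age), ("income", income)):
    mean = x.mean()              # arithmetic mean, Eq. (1.1)
    var = x.var(ddof=1)          # sample variance with N-1 in the denominator, Eq. (1.2)
    sd = x.std(ddof=1)           # standard deviation, Eq. (1.3)
    print(name, mean, round(var, 2), round(sd, 3))
# age: 27.0, 6.5, 2.55 — income: 2400.0, 535000.0, 731.437 (cf. Table 1.7)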

Degrees of Freedom
Most empirical studies are based on data from random samples. This means, however,
that the true characteristics of the population are usually not known and must therefore
be estimated from the sample. The larger the size of a sample, the greater the probability
that the statistical measures calculated for a sample will also apply to the population (cf.
Table 1.5).

6 Varianceand standard deviation cannot be interpreted meaningfully for the variable “gender”.
However, columns E and F are required for the calculation of covariance and correlations.
Table 1.7  Application example

Case       x1 (Age)   A: x1 − x̄1   B: (x1 − x̄1)²   x2 (Income)   C: x2 − x̄2   D: (x2 − x̄2)²   x3 (Gender)   E: x3 − x̄3   F: (x3 − x̄3)²
1          25         –2           4               1,800         –600         360,000         1             0.6          0.36
2          27         0            0               2,000         –400         160,000         0             –0.4         0.16
3          24         –3           9               1,900         –500         250,000         0             –0.4         0.16
4          30         3            9               2,800         400          160,000         0             –0.4         0.16
5          29         2            4               3,500         1,100        1,210,000       1             0.6          0.36
Sum        135        0            26              12,000        0            2,140,000       2             0            1.20
Mean       x̄1 = 27                                 x̄2 = 2,400                                 x̄3 = 0.4
Variance   s1² = 6.50                              s2² = 535,000                              s3² = 0.30
SD         s1 = 2.55                               s2 = 731.437                               s3 = 0.548

with: x1: age (in years); x2: income (in Euro); x3: gender (0 = male; 1 = female)

To estimate the measures for the population based on information from a sample, the
concept of the degrees of freedom (df) is important. The sample mean x j provides the
best estimator for the unknown population mean µj. But there will always be an error:
Mean of a population:
µj = x j ± error (1.5)
with
xj mean of variable j

The error depends, among other things, on the degrees of freedom (cf. Sect. 1.3). It will
be smaller for larger degrees of freedom.
In general, the number of degrees of freedom is the number of observations in the
computation of a statistical parameter that are free to vary. Let us assume, for example,
that we determine the age of 5 people in a sample: 18, 20, 22, 24, and 26 years. The
sample mean equals 22 years. Knowing that the mean value is 22 years, we can freely
choose 4 observations in the sample and the final one is determined because we need to
make sure that the sample mean is again 22 years. This sample therefore has 5–1 = 4 df.
If several parameters are used in a statistic, the number of the degrees of freedom is
generally the difference between the number of observations in the sample and the num-
ber of estimated parameters in the statistic.
The number of degrees of freedom increases with increasing sample size and
decreases with the number of measures (i.e., parameters) to be estimated. The larger
the number of degrees of freedom, the greater the precision (the smaller the error) of an
estimate.

Standardized Variables
It is often difficult to compare statistical measures across variables if the variables have
different dimensions. In the application example, age was measured in years (two-digit
values), income in Euro (four-digit values), and gender is a binary variable (0/1 values).
As a result, we cannot compare the dispersion of the variables. We first need to standard-
ize them.
To do so, we compute the difference between the observed value of a variable (xij) and
this variable’s mean (x j). Then we divide this difference by the variable’s standard devia-
tion (sj).
zij = (xij − x̄j) / sj    (1.6)
with
xij observed value of variable j for object i
x j mean of variable j
sj standard deviation of variable j

Table 1.8  Application example (standardized variables)

Case       x1 (Age)   A: x1 − x̄1   B: A/sx1   x2 (Income)   C: x2 − x̄2   D: C/sx2   x3 (Gender)   E: x3 − x̄3   F: E/sx3
1          25         –2           –0.784     1,800         –600         –0.820     1             0.6          1.095
2          27         0            0.000      2,000         –400         –0.547     0             –0.4         –0.730
3          24         –3           –1.177     1,900         –500         –0.684     0             –0.4         –0.730
4          30         3            1.177      2,800         400          0.547      0             –0.4         –0.730
5          29         2            0.784      3,500         1,100        1.504      1             0.6          1.095
Mean                  0            z̄1 = 0                   0            z̄2 = 0                   0            z̄3 = 0
Variance                           s²z1 = 1                              s²z2 = 1                              s²z3 = 1

Note: The means and standard deviations of the unstandardized variables were calculated in Table
1.7 (x̄1 = 27, sx1 = 2.55; x̄2 = 2,400, sx2 = 731.437; x̄3 = 0.4, sx3 = 0.548).

Table 1.9  Matrix Z of the standardized variables in the application example

Case             z1        z2        z3
1                –0.784    –0.820    1.095
2                0.000     –0.547    –0.730
3                –1.177    –0.684    –0.730
4                1.177     0.547     –0.730
5                0.784     1.504     1.095
Mean             0         0         0
Variance         1         1         1
Std. deviation   1         1         1

Standardization ensures that the mean of the standardized variable equals 0, and the
variance as well as the standard deviation are equal to 1.
Table 1.8 lists the standardized values for the three variables of the application exam-
ple (columns B, D, and F). The last row of the table displays the means of the standard-
ized variables, which are 0, and the variances or standard deviations, which are 1 for all
three variables.
We can display the standardized variable values in a matrix, the standardized data
matrix Z (Table 1.9).
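A minimal Python sketch (an illustration assuming NumPy; not part of the original text) performs the same standardization for the data of Table 1.7 and reproduces the matrix Z of Table 1.9:

import numpy as np

X = np.array([[25, 1800, 1],
              [27, 2000, 0],
              [24, 1900, 0],
              [30, 2800, 0],
              [29, 3500, 1]], dtype=float)   # age, income, gender from Table 1.7

# Eq. (1.6): subtract the column means and divide by the column standard deviations
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.round(Z, 3))                        # matrix Z, cf. Table 1.9
print(np.round(Z.mean(axis=0), 3))           # means of the standardized variables: 0
print(np.round(Z.std(axis=0, ddof=1), 3))    # standard deviations: 1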

1.2.2 Covariance and Correlation

From a statistical point of view, two variables are independent if they do not vary in the
same way more often than might be expected according to chance. No relationship can
be found between two independent variables, or in other words: information on one vari-
able does not convey any information on the other variable.

Fig. 1.2 Covariance as a function of the variation of the observed values

Whether dependence or independence exists in a statistical sense can be assessed by


the measures covariance and correlation.

Covariance
The covariance is a measure of the joint variability of two random variables. It is referred
to as cov(x1, x2) or sx1,x2. The covariance between two variables x1 and x2 is calculated as
the expected value of the product of their deviations from the respective mean:

sx1,x2 = (1/(N−1)) · Σ(i=1..N) (x1i − x̄1) · (x2i − x̄2)    (1.7)

with
xij observed value of variable j for object i
x j mean of variable j
N number of cases in the data set
As Fig. 1.2 shows, the covariance may take on positive and negative values depending
on which of the four quadrants we consider. A covariance of zero results when the neg-
ative and positive covariations cancel each other out. This is the case if the observed

Table 1.10  Calculation of covariances and correlations in the application example


A C F A*C A*F C*F
Case x1 − x 1 x2 − x 2 x3 − x 3 Covariation Covariation Covariation
1 –2 –600 0.6 1,200 –1.2 –360
2 0 –400 –0.4 0 0 160
3 –3 –500 –0.4 1,500 1.2 200
4 3 400 –0.4 1,200 –1.2 –160
5 2 1,100 0.6 2,200 1.2 660
Sum 0 0 0 6,100 0 500
Covariance 1,525 0 125
Correlation 0.818 0 0.312

values show no tendency of moving in the same direction (becoming higher or lower). In
this case we may conclude that the two variables change independently of each other. It
should be noted, however, that the covariance examines only linear relationships between
two variables. For the values shown in Fig. 1.2, we get a covariance cov(x1, x2) that is
equal to 0 over the four quadrants. The variables have a U-shaped relationship and thus a
non-linear dependence exists.
For the example, Table 1.10 shows that positive covariances exist between variables
x1 and x2 (cov(x1,x2) = 1,525) and between x2 and x3 (cov(x2,x3) =  125), while the covar-
iance between x1 and x3 is zero.7 Thus, a statistical dependence between x1 and x2 and
between x2 and x3 can be inferred, whereas for x1 and x3 there is no linear dependence in
a statistical sense.

Correlation
The covariance has the disadvantage that its value is influenced by the units of meas-
urement of the variables and thus its interpretation is difficult. But we can normalize the
covariance by dividing it by the standard deviations of the two variables under consider-
ation (sx1 and sx2). This results in the so-called Pearson correlation coefficient for metric
variables, rx1 x2:8

rx1x2 = (1/(N−1)) · Σ(i=1..N) [(x1i − x̄1)/sx1] · [(x2i − x̄2)/sx2] = sx1,x2 / (sx1 · sx2)    (1.8)

7 In
Excel, the covariance can be calculated as follows: sxy = COVARIANCE.S(matrix1;matrix2).
8 In
Excel, the correlation between variables can be calculated as follows: rxy =
CORREL(matrix1;matrix2).

Table 1.11  Correlation matrix R for the application example


Var_1 (age) Var_2 (income) Var_3 (gender)
Var_1 (age) 1
Var_2 (income) 0.818 1
Var_3 (gender) 0.000 0.312 1

with
xij observed value of variable j for object i
x j mean of variable j
sxj standard deviation of variable xj
sx1,x2 covariance of variables x1 and x2
N number of cases in data set
For the three variables of the application example, Table 1.10 also shows the calculation
of the correlations. They can be presented as a so-called correlation matrix, as shown in
Table 1.11.9
Unlike the covariance, the correlation coefficient can be compared for different units
of measurement. For example, correlations of prices will lead to the same values of r,
regardless whether they have been calculated in dollars or euros.
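The covariances of Table 1.10 and the correlation matrix R of Table 1.11 can be reproduced with a few lines of Python (a sketch assuming NumPy, not part of the original text):

import numpy as np

age    = np.array([25, 27, 24, 30, 29])
income = np.array([1800, 2000, 1900, 2800, 3500])
gender = np.array([1, 0, 0, 0, 1])

data = np.vstack([age, income, gender])   # one row per variable
print(np.cov(data))                       # sample covariances (denominator N-1), cf. Table 1.10
print(np.round(np.corrcoef(data), 3))     # correlation matrix R, cf. Table 1.11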
Figure 1.3 shows values for two variables X and Y for a set of data with different scat-
ter. Scenario a represents a data set with no relationship between the two variables, i.e.,
the correlation is zero or close to zero. In scenario b we see a tendency that larger values
of one variable occur with larger values of the other variable. Thus, we get a positive
correlation (r > 0). A negative correlation (r < 0) indicates that X and Y tend to change
in opposite directions (scenario c). Scenario d shows a non-linear relationship between
the two variables. Here again the correlation is zero or close to zero. As a non-linear rela-
tionship cannot be captured by r, a visual examination of the data using a scatter plot is
always recommended before performing calculations.
The correlation coefficient r has the following properties:

• The value range of r is normalized and is between –1 and +1.


• The correlation coefficient can only measure the degree of linear relationships.
• The value of r is invariant to linear transformations of the variables (e.g., X∗ = a + b·X
with b > 0).
• The correlation coefficient does not distinguish between dependent and independent
variables. It is therefore a symmetrical measure.

9 Cf. the correlation of binary variables with metrically scaled variables in Sect. 1.1.2.2.

Fig. 1.3 Scatter plots of data sets with different correlations: a) uncorrelated data (r ≈ 0), b) positive correlation (r > 0), c) negative correlation (r < 0), d) nonlinear correlation (r ≈ 0)

Values of –1 or +1 for the correlation coefficient indicate a perfect correlation between


the two variables. In this case, all data points in a scatterplot are on a straight line. The
following values are often cited in the literature for assessing the magnitude of the corre-
lation coefficient:

• | r | ≥ 0.7 : strong correlation


• | r | ≤ 0.3 : weak correlation

However, the correlation coefficient must also be evaluated in the context of the appli-
cation (e.g., individual or aggregated data). For example, in social sciences, where vari-
ables are often influenced by human behavior and many other factors, a lower value may
be regarded as a strong correlation than in natural sciences, where much higher values
are generally expected.
Another way to assess the relevance of a correlation coefficient is to perform a sta-
tistical significance test that takes the sample size into account. The t-statistic or the
F-statistic may be used for this purpose:10

10 For statistical testing, also see Sect. 1.3.



t = r / √[(1 − r²)/(N − 2)]

F = r² / [(1 − r²)/(N − 2)]

with
r correlation coefficient
N number of cases in the data set
and df = N–2.
We can now derive the corresponding p-value (cf. Sect. 1.3.1.2).11
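The following sketch (assuming NumPy and SciPy; the numbers refer to the correlation between age and income from Table 1.11) illustrates how the t-statistic and its p-value for a correlation coefficient can be computed:

import numpy as np
from scipy import stats

r, N = 0.818, 5                          # correlation of age and income, sample size
t = r / np.sqrt((1 - r**2) / (N - 2))    # t-statistic for the correlation coefficient
p = 2 * stats.t.sf(abs(t), df=N - 2)     # two-tailed p-value with df = N - 2
print(round(t, 2), round(p, 3))          # roughly t = 2.46 and p = 0.09 for this tiny sample

# stats.pearsonr returns the correlation and the two-tailed p-value directly:
age    = np.array([25, 27, 24, 30, 29])
income = np.array([1800, 2000, 1900, 2800, 3500])
print(stats.pearsonr(age, income))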

1.3 Statistical Testing and Interval Estimation

Data usually contain sampling and/or measurement errors. We distinguish between ran-
dom errors and systematic errors:

• Random errors change unpredictably between measurements. They scatter around a


true value and often follow a normal distribution (central limit theorem).12
• Systematic errors are constant over repeated measurements. They consistently over-
or underestimate a true value (also called bias). Systematic errors result from deficien-
cies in measuring or non-representative sampling.

Random errors, e.g. in sampling, are not avoidable, but their amount can be calculated
based on the data, and they can be diminished by increasing the sample size. Systematic
errors cannot be calculated and they cannot be diminished by increasing the sample size,
but they are avoidable. For this purpose, they first have to be identified.
As statistical results always contain random errors, it is often not clear if an observed
result is ‘real’ or has just occurred randomly. To check this, we can use statistical testing
(hypothesis testing). The test results may be of great importance for decision-making.
Statistical tests come in many forms, but the basic principle is always the same. We
start with a simple example, the test for the mean value.

11 The p-value may be calculated in Excel as follows: p = TDIST(ABS(t);N−2;2) or
p = 1−F.DIST(F;1;N−2;1).
12 The central limit theorem states that the sum or mean of n independent random variables tends

toward a normal distribution if n is sufficiently large, even if the original variables themselves are
not normally distributed. This is the reason why a normal distribution can be assumed for many
phenomena.

1.3.1 Conducting a Test for the Mean

The considerations in this section are illustrated by the following example:

Example
The chocolate company Choco Chain measures the satisfaction of its customers
once per year. Randomly selected customers are asked to rate their satisfaction on
a 10-point scale, from 1 = “not at all satisfied” to 10 = “completely satisfied”. Over
the last years, the average index was 7.50. This year’s survey yielded a mean value
of 7.30 and the standard deviation was 1.05. The sample size was N = 100. Now the
following question arises: Did the difference of 0.2 only occur because of random
fluctuation or does it indicate a real change in customer satisfaction? To answer this
question, we conduct a statistical test for the mean. ◄

1.3.1.1 Critical Value Approach


The classical procedure of statistical hypothesis testing may be divided into five steps:

1. formulation of hypotheses,
2. computation of a test statistic,
3. choosing an error probability α (significance level),
4. deriving a critical test value,
5. comparing the test statistic with the critical test value.

Step 1: Formulation of Hypotheses


The first step of statistical testing involves the stating of two competing hypotheses, a
null hypothesis H0 and an alternative hypothesis H1:

• null hypothesis: H0 : µ = µ0
• alternative hypothesis: H1 : µ ≠ µ0

where µ0 is an assumed mean value (the status quo) and µ is the unknown true mean
value. For our example with µ0 = 7.50 we get:

• null hypothesis: H0 : µ = 7.50


• alternative hypothesis: H1 : µ ≠ 7.50

The null hypothesis expresses a certain assumption or expectation of the researcher. In


our example, it states: “satisfaction has not changed, the index is still 7.50”. It is also
called the status quo hypothesis and can be interpreted as “nothing has changed” or,
depending on the problem, as “no effect”. Hence the name “null hypothesis”.

The alternative hypothesis states the opposite. In our example, it means: “satisfaction
has changed”, i.e., it has increased or decreased. Usually it is this hypothesis that is of
primary interest to the researcher, because its acceptance often requires some action. It
is also called the research hypothesis. The alternative hypothesis is accepted or “proven”
by rejecting the null hypothesis.

Step 2: Computation of a Test Statistic


For testing our hypotheses regarding customer satisfaction, we calculate a test statistic.
For testing a mean, we calculate the so-called t-statistic. The t-statistic divides the dif-
ference between the observed and the hypothetical mean by the standard error SE of the
mean. The empirical value of the t-statistic is calculated as follows:
temp = (x̄ − µ0) / (sx/√N) = (x̄ − µ0) / SE(x̄)    (1.9)
with
x mean of variable x
µ0 assumed value of the mean
sx standard deviation of variable x
SE(x) standard error of the mean
N number of cases in the data set
For our example we get:
temp = (7.3 − 7.5) / (1.05/√100) = −0.2 / 0.105 = −1.90
Under the assumption of the null hypothesis, the t-statistic follows Student’s t-distribu-
tion with N–1 degrees of freedom. Figure 1.4 shows the density function of the t-dis-
tribution and the value of our test statistic. If the null hypothesis were true, we would
expect temp = 0 or close to 0. Yet we get a value of –1.9 (i.e., 1.9 standard deviations
away from zero), and the probability of such a test result decreases fast with the distance
from zero.
The t-distribution is symmetrically bell-shaped around zero and looks very similar
to the standard normal distribution, but has broader tails for small sample sizes. With
increasing sample size, the tails of the distribution get slimmer and the t-distribution
approaches the standard normal distribution. For our sample size, the t-distribution and
the standard normal distribution are almost identical.

Step 3: Choosing an Error Probability


As statistical results always contain random errors, the null hypothesis cannot be
rejected with certainty. Thus, we have to specify an error probability of rejecting a true

Fig. 1.4 t-distribution and empirical t-value (df = 99)

null hypothesis. This error probability is denoted by α and is also called the level of
significance.
Thus, the error probability α that we choose should be small, but not too small. If α is
too small, the research hypothesis can never be ‘proven’. Common values for α are 5%,
1% or 0.1%, but other values are also possible. An error probability of α = 5% is most
common, and we will use this error probability for our analysis.

Step 4: Deriving a Critical Test Value


With the specified error probability α, we can derive a critical test value that can serve
as a threshold for judging the test result. Since in our example the alternative hypothe-
sis is undirected (i.e., positive and negative deviations are possible), we have to apply
a two-tailed t-test with two critical values: −tα/2 in the left (lower) tail and tα/2 in the
right (upper) tail (see Fig. 1.5). Since the t-distribution is symmetrical, the two values are
equal in size. The area beyond the critical values is α/2 on each side, and it is called the
rejection region. The area between the critical values is called the acceptance region of
the null hypothesis.

Fig. 1.5 t-distribution and critical values for α = 5% (df = 99): rejection regions of α/2 = 0.025 lie in each tail beyond −tα/2 and +tα/2, the acceptance region lies between them

Table 1.12  Extract from the t-table

        Error probability α
df      0.10     0.05     0.01
1       6.314    12.706   63.657
2       2.920    4.303    9.925
3       2.353    3.182    5.841
4       2.132    2.776    4.604
5       2.015    2.571    4.032
10      1.812    2.228    3.169
20      1.725    2.086    2.845
30      1.697    2.042    2.750
40      1.684    2.021    2.704
50      1.676    2.009    2.678
99      1.660    1.984    2.626
∞       1.645    1.960    2.576

The critical value for a given α value and degrees of freedom (df = N–1) may be taken
from a t-table or calculated by using a computer. Table 1.12 shows an extract from the
t-table for different values of α and degrees of freedom.

For our example, we get:13


tα/2 = 1.984 ≈ 2
Step 5: Comparing the Test Statistic with the Critical Test Value
If the test statistic exceeds the critical value, H0 can be rejected at a statistical signifi-
cance of α = 5%. The rules for rejecting the H0 can be formulated as
Reject H0 if
|temp| > tα/2    (1.10)
Do not reject H0 if |temp| ≤ tα/2.

Here we cannot reject H0, since
|temp| = 1.9 < 2

This means that the result of the test is not statistically significant at α = 5%.
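The five steps of the critical value approach can be retraced with the following Python sketch (assuming SciPy; the values are those of the customer satisfaction example):

import math
from scipy import stats

x_bar, mu0, s, N, alpha = 7.30, 7.50, 1.05, 100, 0.05
t_emp = (x_bar - mu0) / (s / math.sqrt(N))      # Eq. (1.9): -1.90
t_crit = stats.t.ppf(1 - alpha / 2, df=N - 1)   # critical value for alpha = 5% and df = 99: about 1.98
reject = abs(t_emp) > t_crit                    # decision rule of Eq. (1.10)
print(round(t_emp, 2), round(t_crit, 2), reject)   # -1.9, 1.98, False -> H0 is not rejected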

Interpretation
It is important to note that accepting the null hypothesis does not mean that its correct-
ness has been proven. H0 usually cannot be proven, nor can we infer a probability for the
correctness of H0.
In a strict sense, H0 is usually “false” in a two-tailed test. If a continuous scale is used
for measurement, the difference between the observed value and µ0 will practically never
be exactly zero. The real question is not whether µ0 will be zero, but how large the dif-
ference is. In our example, it is very unlikely that the present satisfaction index is exactly
7.50, as stated by the null hypothesis. We have to ask whether the difference is suffi-
ciently large to conclude that customer satisfaction has actually changed.
The null hypothesis is just a statement that serves as a reference point for assessing a
statistical result. Every conclusion we draw from the test must be conditional on H0.
Thus, for a test result |temp| > 2, we can conclude:

• Under the condition of H0, the probability that the test result has occurred solely by
chance is less than 5%. Thus, we reject the proposition of H0.

Or, as in our example, for the test result |temp| ≤ 2, we can conclude:

• Under the condition of H0, the probability that this test result has occurred by chance
is larger than 5%. We require a lower error probability. Thus, we do not have suffi-
cient reason to reject H0.

13 InExcel we can calculate the critical value tα/2 for a two-tailed t-test by using the function
T.INV.2T(α,df). We get: T.INV.2T(0.05,99) = 1.98. The values in the last line of the t-table are
identical with the standard normal distribution. With df = 99 the t-distribution comes very close to
the normal distribution.

Fig. 1.6 p-value p = 6% (for a two-sided t-test with df = 99): a probability of 0.03 lies in each tail beyond ±1.9

The aim of a hypothesis test is not to prove the null hypothesis. Proving the null hypoth-
esis would not make sense. If this were the aim, we could prove any null hypothesis by
making the error probability α sufficiently small.
The hypothesis of interest is the research hypothesis. The aim is to “prove” (be able to
accept) the research hypothesis by rejecting the null hypothesis. For this reason, the null
hypothesis has to be chosen as the opposite of the research hypothesis.

1.3.1.2 Using the p-value


The test procedure may be simplified by using a p-value approach instead of the critical
value approach. The p-value (probability value) for our empirical t-statistic is the prob-
ability to observe a t-value more distant from the null hypothesis than our temp if H0 is
true:
p = P(|t| ≥ |temp|)    (1.11)
This is illustrated in Fig. 1.6: p = P(|t| ≥ 1.9) = 0.03 + 0.03 = 0.06 or 6%.14 Since the
t-statistic can assume negative or positive values, the absolute value has to be considered
for the two-sided t-test and we get probabilities in both tails.

14 In Excel we can calculate the p-value by using the function T.DIST.2T(ABS(temp);df). For the
variable in our example we get: T.DIST.2T(ABS(−1.90);99) = 0.0603 or 6.03%.

Table 1.13  Test results and errors

Test result       Reality: H0 is true                          Reality: H0 is false
H0 is accepted    Correct decision (1–α)                       Type II error (β)
H0 is rejected    Type I error (α, the significance level)     Correct decision (1–β, the power)

The p-value is also referred to as the empirical significance level. In SPSS, the
p-value is called “significance” or “sig”. It tells us the exact significance level of a test
statistic, while the classical test only gives us a “black and white” picture for a given α.
A large p-value supports the null hypothesis, but a small p-value indicates that the prob-
ability of the test statistic is low if H0 is true. So probably H0 is not true and we should
reject it.
We can also interpret the p-value as a measure of plausibility. If p is small, the plau-
sibility of H0 is small and it should be rejected. And if p is large, the plausibility of H0 is
large.
By using the p-value, the test procedure is simplified considerably. It is not necessary
to start the test by specifying an error probability (significance level) α. Furthermore,
we do not need a critical value and thus no statistical table. (Before the development of
computers, these tables were necessary because the computing effort for critical values
as well as for p-values was prohibitive.)
Nevertheless, some people like to have a benchmark for judging the p-value. If we use
α as a benchmark for p, the following criterion will give the same result as the classical
t-test according to the rule in Eq. (1.10):
If p < α, reject H0 (1.12)
Since in our example p = 6%, we cannot reject H0. But even if α is used as a benchmark
for p, the problem of choosing the right error probability remains.
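A short sketch (assuming SciPy) shows how the two-tailed p-value for the example can be obtained and compared with α:

from scipy import stats

t_emp, df, alpha = -1.90, 99, 0.05
p = 2 * stats.t.sf(abs(t_emp), df)   # two-tailed p-value, about 0.06 (6%)
print(round(p, 4), p < alpha)        # p is larger than alpha -> H0 cannot be rejected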

1.3.1.3 Type I and Type II Errors


There are two kinds of errors in hypothesis testing. So far, we considered only the error
of rejecting the null hypothesis although it is true. This error is called type I error and its
probability is α (Table 1.13).
A different error occurs if a false null hypothesis is accepted. This refers to the upper
right quadrant in Table 1.13. This error is called type II error and its probability is
denoted by β.
The size of α (i.e., the significance level) is chosen by the researcher. The size of β
depends on the true mean µ, which we do not know, and on α (Fig. 1.7). By decreasing
α, the probability β of the type II error increases.

Fig. 1.7 Type II error β (dependent on α and μ)

The probability (1–β) is the probability that a false null hypothesis is rejected (see
lower right quadrant in Table 1.13), which is what we want. This is called the power of a test and
it is an important property of a test. By decreasing α, the power of the test also decreases.
Thus, there is a tradeoff between α and β. As already mentioned, the error probability α
should not be too small; otherwise the test is losing its power to reject H0 if it is false.
Both α and β can only be reduced by increasing the sample size N.

How to Choose α
The value of α cannot be calculated or statistically justified, it must be determined by the
researcher. For this, the researcher should take into account the consequences (risks and
opportunities) of alternative decisions. If the costs of a type I error are high, α should be
small. Alternatively, if the costs of a type II error are high, α should be larger, and thus β
smaller. This increases the power of the test.
In our example, a type I error would occur if the test falsely concluded that customer
satisfaction has significantly changed although it has not. A type II error would occur if
customer satisfaction had changed, but the test failed to show this (because α was set too
low). In this case, the manager would not receive a warning if satisfaction had decreased,
and he would miss out on taking corrective actions.

1.3.1.4 Conducting a One-tailed t-test for the Mean


As the t-distribution has two tails, there are two forms of a t-test: a two-tailed t-test, as
illustrated above, and a one-tailed t-test. A one-tailed t-test offers greater power and
should be used whenever possible, since smaller deviations from zero are statistically
significant and thus the risk of a type II error (accepting a wrong null hypothesis) is

reduced. However, conducting a one-tailed test requires some more reasoning and/or a
priori knowledge on the side of the researcher.
A one-tailed t-test is appropriate if the test outcome has different consequences
depending on the direction of the deviation. If in our example the satisfaction index has
remained constant or even improved, no action is required. But if the satisfaction index
has decreased, management should be worried. It should investigate the reason and take
action to improve satisfaction.
The research question of the two-tailed test was: “Has satisfaction changed?” For the
one-tailed test, the research question is: “Has satisfaction decreased?”.
Thus, we have to ‘prove’ the alternative hypothesis
H1 : µ < 7.5
by rejecting the null hypothesis
H0 : µ ≥ 7.5
which states the opposite of the research question. The decision criterion is:
reject H0 if temp < tα (1.13)
Note that in our example tα is negative. The rejection region is now only in the lower tail
(left side), and the rejection area under the density function is twice as large (α instead of α/2). The critical value
for α = 5% is tα = –1.66 (Fig. 1.8).15 As this value is closer to H0 than the critical value
tα/2 = 1.98 for the two-tailed test, a smaller deviation from H0 is significant.
The empirical test statistic temp = –1.9 is now in the rejection region on the lower tail.
Thus, H0 can be rejected at the significance level α = 5%. With the more powerful one-
tailed test we can now “prove” that customer satisfaction has decreased.

Using the p-value


When using the p-value, the decision criterion is the same as before in Eq. (1.12):
If p < α, reject H0.
But the one-tailed p-value here is just half the two-tailed p-value from Eq. (1.11). Thus,
if we know the two-tailed p-value, it is easy to calculate the one-tailed p-value. As we
got p = 6% for the two-tailed test, the p-value for the one-tailed test is p = 3%. This is
clearly below α = 5%.16

15 In Excel we can calculate the critical value tα for the lower tail by using the function
T.INV(α;df). We get: T.INV(0.05;99) = –1.66. For the upper tail we have to switch the sign or use
the function T.INV(1–α;df).
16 In Excel we can calculate the p-value for the left tail by using the function T.DIST(temp;df;1).

We get: T.DIST(−1.90;99;1) = 0.0302 or 3%. The p-value for the right tail is obtained by the func-
tion T.DIST.RT(temp;df).
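The one-tailed test of the example can be reproduced as follows (a Python sketch assuming SciPy):

from scipy import stats

t_emp, df, alpha = -1.90, 99, 0.05
t_crit_lower = stats.t.ppf(alpha, df)    # critical value in the lower tail: about -1.66
p_one_tailed = stats.t.cdf(t_emp, df)    # one-tailed p-value: about 0.03
print(round(t_crit_lower, 2), round(p_one_tailed, 3))
print(t_emp < t_crit_lower, p_one_tailed < alpha)   # True, True -> reject H0: satisfaction has decreased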

Fig. 1.8 t-distribution and critical value for a one-tailed test (α = 5%, df = 99)

1.3.2 Conducting a Test for a Proportion

We use proportions or percentages instead of mean values to describe nominally scaled


variables. Testing a hypothesis about a proportion follows the same steps as a test for the
mean.
For a two-sided test of a proportion, we state the hypotheses as follows:

• null hypothesis H0 : π = π0
• alternative hypothesis H1 : π ≠ π0

where π0 is an assumed proportion value and π is the unknown true proportion.


If we denote the empirical proportion by prop, the test statistic is calculated by:
zemp = (prop − π0) / (σ/√N)    (1.14)
with
prop empirical proportion
π0 assumed proportion value
σ standard deviation in the population
N number of cases in the data set

The standard deviation in the population can be calculated as follows:



σ = √[π0 · (1 − π0)]    (1.15)
If the null hypothesis is true, the standard deviation of the proportion can be derived
from π0. For this reason, we can use the standard normal distribution instead of the t-dis-
tribution for calculating critical values and p-values. This simplifies the procedure. But
for N ≥ 100 it makes no difference whether we use the normal distribution or the t-dis-
tribution (see Table 1.12).

Example
The chocolate company ChocoChain knows from regular surveys on attitudes and
lifestyles that 10% of its customers are vegetarians. In this year’s survey x = 52 cus-
tomers stated that they are vegetarians. With a sample size of N = 400 this amounts to
a proportion prop = x/N = 0.13 or 13%. Does this result indicate a real increase or is it
just a random fluctuation?
For π0 = 10% we get σ = 0.30 and, with Eq. (1.14), we get the test statistic:
zemp = (0.13 − 0.10) / (0.30/√400) = 2.00
A rule of thumb says that an absolute value ≥ 2 of the test statistic is significant at
α = 5%. So we can conclude without any calculation that the proportion of vegetari-
ans has changed significantly. The exact critical value for the standard normal distri-
bution is zα/2 = 1.96.
The two-tailed p-value for zemp = 2.0 is 4.55% and thus smaller than 5%. If our
research question is: “Has the proportion of vegetarians increased?”, we can perform
a one-tailed z-test with the hypotheses
H0 : π ≤ π0 = 10%
H1 : π > π0
In this case, the critical value will be 1.64 and the one-tailed p-value will be 2.28%,
which is clearly lower than 5%. Thus, the result is highly significant. ◄
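The calculations of this example can be retraced with the following sketch (assuming SciPy; the standard normal distribution is used, as in the text):

import math
from scipy import stats

prop, pi0, N = 0.13, 0.10, 400
sigma = math.sqrt(pi0 * (1 - pi0))             # Eq. (1.15): 0.30
z_emp = (prop - pi0) / (sigma / math.sqrt(N))  # Eq. (1.14): 2.00
p_two = 2 * stats.norm.sf(abs(z_emp))          # two-tailed p-value: about 4.55%
p_one = stats.norm.sf(z_emp)                   # one-tailed p-value: about 2.28%
print(round(z_emp, 2), round(p_two, 4), round(p_one, 4))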

Accuracy Measures of Binary Classification Tests


Tests with binary outcomes are very frequent, e.g. in medical testing (sick or healthy,
pregnant or not) or quality control (meets specification or not). To judge the accuracy of
such tests, certain proportions (or percentages) of the two test outcomes are used, called
sensitivity and specificity.17 These measures are common in medical research, epidemiol-
ogy, or machine learning, but still widely unknown in other areas.

17 Cf., e.g., Hastie et al. (2011), Pearl and Mackenzie (2018); Gigerenzer (2002).

Table 1.14  Measures of accuracy in medical testing

Test result    No Disease                                          Disease
Negative       Specificity: true negative (1–α)                    1 – Sensitivity: false negative (β)
Positive       1 – Specificity: false positive (α, false alarm)    Sensitivity: true positive (1–β, power)

In medical testing these measures have to be interpreted as follows:

• Sensitivity = percentage of “true positives”, i.e., the test will be positive if the patient
is sick (disease is correctly recognized).
• Specificity = percentage of “true negatives”, i.e., the test will be negative if the patient
is not sick.

For an example, we can look at the accuracy of the swab tests (RT-PCR-tests) used early
in the 2020 corona pandemic for testing people for infection with SARS-CoV-2. The
British Medical Journal (Watson et al. 2020) reported a test specificity of 95%, but a
sensitivity of only 70%. This means that out of 100 persons infected with SARS-CoV-2,
the test was falsely negative for 30 people. Not knowing about their infection, these 30
people contributed to the rapid spreading of the disease.
In Sect. 1.3.1.1 we discussed type I and type II errors (α and β) in statistical test-
ing. These errors can be seen as inverse measures of accuracy. There is a close corre-
spondence to specificity and sensitivity. Assuming “no disease” as the null hypothesis,
Table 1.14 shows the correspondence of these measures of accuracy to the error types in
statistical testing. The test sensitivity of 70% corresponds to the power of the test and the
“falsely negative” rate of 30% corresponds to the β-error (type II error).
Measures of sensitivity and specificity can be used for results of cross tables (in con-
tingency analysis), discriminant analysis, and logistic regression. In Chap. 5 on logis-
tic regression, we will give further examples of the calculation and application of these
measures.
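Both accuracy measures are simple proportions of a 2×2 classification table. The following sketch (not from the original text; the counts are hypothetical and merely chosen to match the reported 70% sensitivity and 95% specificity) illustrates the calculation:

def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from the four cells of a binary confusion matrix."""
    sensitivity = tp / (tp + fn)   # true positive rate: sick persons with a positive test
    specificity = tn / (tn + fp)   # true negative rate: healthy persons with a negative test
    return sensitivity, specificity

# Hypothetical counts per 100 infected and 100 healthy persons:
print(sensitivity_specificity(tp=70, fn=30, tn=95, fp=5))   # (0.7, 0.95)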

1.3.3 Interval Estimation (Confidence Interval)

Interval estimation and statistical testing are part of inferential statistics and are based on
the same principles.

Interval Estimation for a Mean


We come back to ChocoChain’s measurement of satisfaction with the mean value
x = 7.30. This value can be considered a point estimate of the true mean μ, which we

do not know. It is the best estimate we can get for the true value μ. However, since x
is a random variable, we cannot expect it to be equal to μ. But we can state an interval
around x within which we expect the true mean μ with a certain error probability α (or
confidence level 1–α):
µ = x ± error
This interval is called an interval estimate or confidence interval for μ. Again, we can
use the t-distribution to determine this interval:
µ = x̄ ± tα/2 · sx/√N    (1.16)
We use the same values as we used above for testing: tα/2 = 1.98, sx = 1.05 and N = 100,
and we get:
µ = 7.30 ± 1.98 · 1.05/√100 = 7.30 ± 0.21
Thus, with a confidence of 95%, we can expect the true value μ to be in the interval
between 7.09 and 7.51:
7.09 ← µ → 7.51
The smaller the error probability α (or the greater the confidence 1–α), the greater the
interval must be. Thus, for α = 1% (or confidence 1–α = 99%), the confidence interval is
[7.02, 7.58].
We can also use the confidence interval for testing a hypothesis. If our null hypothesis
µ0 = 7.50 falls into the confidence interval, it is equivalent to the test statistic falling into
the acceptance region. This is an alternative way of testing a hypothesis. Again, we can-
not reject H0, just as in the two-tailed test above.
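The confidence interval for the mean can be computed as follows (a sketch assuming SciPy, using the values of the satisfaction example):

import math
from scipy import stats

x_bar, s, N, alpha = 7.30, 1.05, 100, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=N - 1)    # about 1.98
half_width = t_crit * s / math.sqrt(N)           # error term of Eq. (1.16): about 0.21
print(round(x_bar - half_width, 2), round(x_bar + half_width, 2))   # [7.09, 7.51]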

Interval Estimation for a Proportion


Analogously, we can estimate a confidence interval for a proportion. The survey yielded
a proportion of prop = 13%. We can compute the confidence interval for the true value π
as:
π = prop ± zα/2 · σ/√N    (1.17)
Using the same values as above for testing: zα/2 = 1.96, σ = 0.30 and N = 400, we get:
π = 13.0 ± 1.96 · 0.30/√400 = 13.0 ± 2.94
Thus, with a confidence of 95%, we can expect the true value π to be in the interval
between 10.06 and 15.94. As π0 = 10% does not fall into this interval, we can again
reject the null hypothesis, as before.

If we do not know σ, we have to estimate it based on the proportion prop as



s = √[prop · (1 − prop)]    (1.18)
In this case, we have to use the t-distribution and calculate the confidence interval as:
π = prop ± tα/2 · s/√N    (1.19)
We get
π = 13.0 ± 1.97 · 0.336/√400 = 13.0 ± 3.31
The confidence interval increases to [9.69, 16.31].
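Analogously, the confidence interval for the proportion of vegetarians with the estimated standard deviation from Eq. (1.18) can be sketched as follows (assuming SciPy):

import math
from scipy import stats

prop, N, alpha = 0.13, 400, 0.05
s = math.sqrt(prop * (1 - prop))                 # Eq. (1.18): about 0.336
t_crit = stats.t.ppf(1 - alpha / 2, df=N - 1)    # about 1.97
half_width = t_crit * s / math.sqrt(N)           # error term of Eq. (1.19): about 0.033
print(round(prop - half_width, 4), round(prop + half_width, 4))   # about [0.097, 0.163]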

1.4 Causality

A causal relationship is a relationship that has a direction. For two variables X and Y, it
can be formally expressed by
X → Y
cause effect
This means: If X changes, Y changes as well. Thus, changes in Y are caused by changes
in X.
However, this does not mean that changes in X are the only cause of changes in Y. If X
is the only cause of changes in Y, we speak of a mono-causal relationship. But we often
face multi-causal relationships, which makes it difficult to find and prove causal relation-
ships (cf. Freedman, 2002; Pearl & Mackenzie, 2018).

1.4.1 Causality and Correlation

Finding and proving causal relationships is a primary goal of all empirical (natural and
social) sciences. Statistical association or correlation plays an important role in pursuing
this goal. But causality is no statistical construct and concluding causality from an asso-
ciation or correlation can be very misleading. Data contain no information about causal-
ity. Thus, causality cannot be detected or proven by statistical analysis alone.
To infer or prove causality, we need information about the data generation process
and causal reasoning. The latter is something that computers or artificial intelligence are
still lacking. Causality is a conclusion that must be drawn by the researcher. Statistical
methods can only support our conclusions about causality.
There are many examples of significant correlations that do not imply causality. For
instance, high correlations were found for

• number of storks and birth rate (1960–1990),


• reading skills of school children and shoe size,
• crop yield of hops and beer consumption,
• ice cream sales and rate of drowning,
• divorce rate in Maine and per capita consumption of margarine,
• US spending on science, space, and technology versus suicides by hanging, strangula-
tion, and suffocation.

Such non-causal correlations between two variables X and Y are also called spurious cor-
relations. They are often caused by a lurking third variable Z that is simultaneously influ-
encing X and Y. This third variable Z is also called a confounding variable or confounder.
It is causally related to X and Y. But often we cannot observe such a confounding varia-
ble or do not even know about it. Thus, the confounder can cause misinterpretations.
The strong correlation between the number of storks and the birth rate that was
observed in the years from 1966 to 1990 was probably caused by the growing industrial
development combined with prosperity. For the reading skills of school children and their
shoe size, the confounder is age. For crop yield of hops and beer consumption, the con-
founder is probably the number of sunny hours. The same may be true for the relationship
between ice cream sales and the rate of drowning. If it is hot, people eat more ice cream
and more people go swimming. If more people go swimming, more people will drown.

1.4.2 Testing for Causality

To support the hypothesis of a causal relationship, various conditions should be met


(Fig. 1.9).

Fig. 1.9 Testing for causality



Condition 1: Correlation Coefficient


The correlation coefficient can be positive or negative. A positive sign means that Y will
increase if X increases, and a negative sign indicates the opposite: Y will decrease if X
increases. Thus, the researcher should not only hypothesize that a causal relationship
exists, but should also state in advance (before the analysis) if it is a positive or a nega-
tive one.
For example, when analyzing the relationship between chocolate sales and price, we
would expect a negative correlation, and when analyzing the relationship between choc-
olate sales and advertising, we would expect a positive correlation. Of course, a positive
relationship between price and sales is also possible (e.g., for luxury goods or if the price
is used as an indicator for quality), but these are rare exceptions and usually do not apply
to most fast-moving consumer goods (FMCG). Also, a negative effect of advertising
(wear-out effect) has rarely been observed. Therefore, an unexpected sign of the correla-
tion coefficient should make us skeptical.
If there is a causal relationship between the two variables X and Y, a substantial cor-
relation is expected. If there is no correlation or the correlation coefficient is very small
(close to zero), there is probably no causality, or the causality is weak and irrelevant.
In assessing the correlation coefficient, one must also consider the number of obser-
vations (sample size). This can be done by performing a statistical test of significance,
either a t-test or an F-test.

Condition 2: Temporal Ordering


A causal relationship between two variables X and Y can always have two different
directions:

• X is a cause of Y: X → Y
• Y is a cause of X: Y → X

For the correlation coefficient it makes no difference whether we have situation a) or b).
Thus, a significant correlation is no sufficient proof of the hypothesized causal relation-
ship a).
A cause must precede the effect, and thus changes in X must precede corresponding
changes in Y. If this is not the case, the above hypothesis is wrong. In an experiment, this
may be easily verified. The researcher changes X and checks for changes in Y. Yet if one
has only observational data, it is often difficult or impossible to check the temporal order.
We can do so if we have time series data and the observation periods are shorter than
the lapse of time between cause and effect (time lag). Referring to our example, the time
lag between advertising and sales depends on the type of product and the type of media
used for advertising. The time lag will be shorter for FMCG like chocolate or toothpaste
and longer for more expensive and durable goods (e.g., TV set, car). Also, the time lag
will be shorter for TV or radio advertising than for advertising in magazines. For adver-
tising, the effects are often dispersed over several periods (i.e., distributed lags).

In case of a sufficiently large time lag (or sufficiently short observation periods) the
direction of causation can be detected by a lagged correlation (or lagged regression).
Under hypothesis X → Y, the following has to be true (Campbell & Stanley, 1966, p. 69):
r(Xt−r, Yt) > r(Xt, Yt−r)

where t is the time period and r the length of the lag in periods (r = 1, 2, 3 …).
Otherwise, it indicates that the hypothesis is wrong and causality has the opposite
direction.
A time lag can also obscure a causal relationship. Thus, r(Xt, Yt) might not be significant,
but r(Xt−r, Yt) is. This should be considered from the outset if there are reasons to suspect a
lagged relationship. The relationship between sales and advertising is an example where
time lags frequently occur. Regression analysis (see Chap. 2) can cope with this by
including lagged variables.
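The lagged correlations used in this condition can be computed with a small helper function. The sketch below (assuming NumPy) uses an invented monthly advertising and sales series, constructed so that sales react to advertising with a lag of one period; the data are purely illustrative:

import numpy as np

def lagged_corr(x, y, lag):
    """Correlation between x shifted back by `lag` periods and y, i.e. r(x[t-lag], y[t])."""
    if lag == 0:
        return np.corrcoef(x, y)[0, 1]
    return np.corrcoef(x[:-lag], y[lag:])[0, 1]

# Invented series: sales y depend on advertising x of the previous month (y[t] = 90 + 2*x[t-1]).
x = np.array([10, 12, 9, 14, 13, 15, 11, 16, 14, 17], dtype=float)
y = np.array([105, 110, 114, 108, 118, 116, 120, 112, 122, 118], dtype=float)

r_xy = lagged_corr(x, y, 1)   # r(x[t-1], y[t]): lagged advertising with current sales
r_yx = lagged_corr(y, x, 1)   # r(y[t-1], x[t]) = r(x[t], y[t-1])
print(round(r_xy, 2), round(r_yx, 2))   # here r_xy exceeds r_yx, supporting the direction x -> y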

Condition 3: Exclusion of Other Causes


As stressed above, there can be a significant correlation between X and Y without a
causal relationship, because it is caused by a third variable Z. In this case, we speak of
non-causal or spurious correlations.
Thus, it should be made sure that there are no third variables that are causing a spuri-
ous correlation between X and Y. It has also been argued in the literature that the absence
of plausible rival hypotheses increases the plausibility of a hypothesis (Campbell &
Stanley, 1966, p. 65).
The world is complex and usually numerous factors are influencing an empirical var-
iable Y. To account for such multi-causal relationships, we can use multivariate methods
like regression analysis, variance analysis, logistic regression, or discriminant analysis.
All these methods are described in the following chapters. Usually, however, not all
influencing factors can be observed and included in a model. The art of model build-
ing requires an identification of the relevant variables. As Albert Einstein said: “A model
should be simple, but not too simple.”
With the help of statistics, we can measure the correlation between two variables, but
this does not prove that a causal relationship exists. A correlation between variables is a
necessary but not a sufficient condition for causality. The other two conditions must be
fulfilled as well. The most reliable evidence for a causal relationship is provided by a
controlled experiment (Campbell & Stanley, 1966; Green et al., 1988).

1.5 Outliers and Missing Values

The results of empirical analyses can be distorted by observations with extreme values
that do not correspond to the values “normally” expected. Likewise, missing data can
lead to distortions, especially if they are not treated properly when analyzing the data.

1.5.1 Outliers

Empirical data often contain one or more outliers, i.e., observations that deviate substan-
tially from the other data. Such outliers can have a strong influence on the result of the
analysis.
Outliers can arise for different reasons. They can be due to

• chance (random),
• a mistake in measurement or data entry,
• an unusual event.

1.5.1.1 Detecting Outliers
When faced with a great number of numerical values, it can be tedious to find unusual
ones. Even for a small data set as the one in Table 1.15 it is not easy to detect possi-
ble outliers just by looking at the raw data. Numerical and/or graphical methods may
be used for detecting outliers, with graphical methods such as histograms, boxplots, and

Table 1.15  Example data: observed and standardized values


Observed data Standardized data
Observation X1 X2 Z1 Z2
1 26 26 0.41 0.35
2 34 30 1.27 0.84
3 19 29 –0.35 0.71
4 20 24 –0.24 0.10
5 19 14 –0.35 –1.12
6 23 30 0.08 0.84
7 20 27 –0.24 0.47
8 32 33 1.05 1.20
9 12 7 –1.11 –1.97
10 6 9 –1.76 –1.73
11 11 17 –1.22 –0.75
12 29 22 0.73 –0.14
13 15 15 –0.78 –0.99
14 16 26 –0.68 0.35
15 24 18 0.19 –0.63
16 46 39 2.56 1.93
17 30 26 0.84 0.35
18 15 21 –0.78 –0.26
19 20 19 –0.24 –0.51
20 28 31 0.62 0.96
Mean 22.3 23.2 0.00 0.00
Std.deviation 9.26 8.20 1.00 1.00

Fig. 1.10 Histogram of variable X1 (frequencies of the value classes of X1)

scatterplots usually being more convenient and efficient (du Toit et al., 1986; Tukey,
1977). A simple numerical method for detecting outliers is the standardization of data.

Standardization of Data
Table 1.15 shows the observed values of two variables, X1 and X2, and their standardized
values, called z-values. We can see that only one z-value exceeds 2 (observation 16 of
variable X1). If we assume that the data follow a normal distribution, a value >2 has a
probability of less than 5%. The occurrence of a value of 2.56, as observed here, has
a probability of less than 1%. Thus, this value is unusual and we can identify it as an
outlier.
The effect of an outlier on a statistical result can be easily quantified by repeating the
computations after discarding the outlier. Table 1.15 shows that the mean of variable X1
is 22.3. After discarding observation 16, we get a mean of 21.0. Thus, the mean value
changes by 1.3. The effect will be smaller for larger sample sizes. Especially for small
sample sizes, outliers can cause substantial distortions.
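Standardization and outlier flagging can be sketched in Python as follows (assuming NumPy; the data are the 20 observations of X1 from Table 1.15):

import numpy as np

x1 = np.array([26, 34, 19, 20, 19, 23, 20, 32, 12, 6,
               11, 29, 15, 16, 24, 46, 30, 15, 20, 28], dtype=float)   # X1 from Table 1.15

z = (x1 - x1.mean()) / x1.std(ddof=1)         # standardized values (z-values)
flagged = np.where(np.abs(z) > 2)[0]          # flag |z| > 2 as potential outliers
print(flagged + 1, np.round(z[flagged], 2))   # observation 16 with z = 2.56
print(round(x1.mean(), 1), round(np.delete(x1, flagged).mean(), 1))   # mean 22.3 vs. 21.0 without the outlier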

Histograms
Figure 1.10 shows a histogram of variable X1, with the outlier at the far right of the
figure.18

Boxplots
A more convenient graphical means is the boxplot. Figure 1.11 shows the boxplots of
variables X1 and X2, with the outlier showing up above the boxplot of X1.
A boxplot (also called box-and-whisker plot) is based on the percentiles of data. It is
determined by five statistics of a variable:

18 The histogram was created with Excel by selecting “Data/Data Analysis/Histogram”. In SPSS,
histograms are created by selecting “Analyze/Descriptive Statistics/Explore”.

Fig. 1.11 Boxplots of the variables X1 and X2

• maximum,
• 75th percentile,
• 50th percentile (median),
• 25th percentile, and
• minimum.

The bold horizontal line in the middle of each box represents the median, i.e., 50% of the
values are above this line and 50% are below. The upper rim of the box represents the
75th percentile and the lower rim represents the 25th percentile. Since these three per-
centiles, the 25th, 50th, and 75th percentiles, divide the data into four equal parts, they
are also called quartiles. The height of the box represents 50% of the data and indicates
the dispersion (spread, variation) and skewness of the data.
The whiskers extending above and below the boxes represent the complete range of
the data, from the smallest to the largest value (but without outliers). Outliers are defined
as points that are more than 1.5 box lengths away from the rim of the box.19

19 In SPSS we can create boxplots (just like histograms) by selecting “Analyze/Descriptive
Statistics/Explore”. But don't be surprised if observation 16 with value 46 is not flagged as an
outlier. For our data, the rule of 1.5 box lengths above the rim of the box gives a cutoff value
of 47. But this rule, too, is not entirely free of arbitrariness. Here we want to demonstrate how an
outlier is represented in the boxplot.
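The 1.5-box-length rule behind the boxplot can be retraced numerically (a sketch assuming NumPy; note that the exact quartile definition differs slightly between programs):

import numpy as np

x1 = np.array([26, 34, 19, 20, 19, 23, 20, 32, 12, 6,
               11, 29, 15, 16, 24, 46, 30, 15, 20, 28], dtype=float)

q1, q3 = np.percentile(x1, [25, 75])            # lower and upper rim of the box
iqr = q3 - q1                                   # box length (interquartile range)
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # boxplot cutoffs for outliers
print(q1, q3, lower, upper)                     # with this quartile definition the upper cutoff is 47
print(x1[(x1 < lower) | (x1 > upper)])          # empty: the value 46 stays just below the cutoff (cf. footnote 19)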

Fig. 1.12 Scatterplot of the variables X1 and X2

Scatterplots
Histograms and boxplots are univariate methods, which means that we are looking for
outliers for each variable separately. The situation is different if we analyze the relation-
ship between two or more variables. If we are interested in the relationship between two
variables, we can display the data by using a scatterplot (Fig. 1.12). Each dot represents
an observation of the two variables X1 and X2.
The relationship between X1 and X2 can be represented by a linear regression line
(dashed line in Fig. 1.12). We can see that observation 16 (at the right end of the regres-
sion line), which we identified as an outlier from a univariate perspective, fits the linear
regression model quite well. The slope of the line will not substantially be affected if we
eliminate the outlier. However, it is also possible that an outlier impacts the slope of the
regression line and biases the results.

1.5.1.2 Dealing with Outliers


One should always investigate what caused an outlier. Sometimes it is possible to correct a mistake made during data collection or entry. An observation should only be removed if we have reason to believe that a mistake has occurred or if we find proof that the outlier was caused by an unusual event outside the research context (e.g., a labor strike or a power outage).
In all other cases, outliers should be retained in the data set. If the outlier is due to chance, it does not pose a problem and does not have to be eliminated. Indeed, removing outliers can amount to manipulating the results. If outliers are removed for a good reason, this should always be documented in any report or publication.

Some methods discussed in this book (e.g., regression analysis or factor analysis)
are rather sensitive to outliers. We will therefore discuss the issue of outliers in detail in
those chapters.

1.5.2 Missing Values

Missing values are an unavoidable problem when conducting empirical studies and fre-
quently occur in practice. The reasons for missing values are manifold. Some examples
are:

• Respondents forgot to answer a question.


• Respondents cannot or do not want to answer a question.
• Respondents answered outside the defined answer interval.

The problem with missing values is that they can lead to distorted results. Moreover, many methods require complete data sets, so cases with a missing value have to be deleted (listwise deletion). Finally, missing values represent a loss of information, which reduces the validity of the results compared to analyses with complete data sets.
Statistical software packages offer the possibility of taking missing values into
account in statistical analyses. Since all case studies in this book are using IBM SPSS,
the following is a brief description of how this statistical software package can identify
missing values. There are two options:

• System missing values:


Absent values (empty cells) in a data set are automatically identified by SPSS as
so-called ‘system missing values’ and shown as dots (.) in data view.
• User missing values:
Missing values can also be coded by the user. For this purpose, the ‘Variable View’
must be called up in the data editor (button at the bottom left). There, missing data
may be indicated by entering them in ‘Missing’ (see Fig. 1.13). Any value can be used
as an indicator of a missing value if it is outside the range of the valid values of a
variable (e.g., “9999” or “0000”). Different codes may be used for different speci-
fications of missing values (e.g., 0 for “I don’t know” and 9 for “Response denied”).
These “user missing values” are then excluded from the following analyses.

Dealing with Missing Values in SPSS


SPSS provides the following three basic options for dealing with missing values:

Fig. 1.13 Definition of user missing values in the data editor of IBM SPSS

• The values are excluded “case by case” (“Exclude cases listwise”), i.e., as soon as a
missing value occurs, the whole case (observation) is excluded from further analy-
sis. This often reduces the number of cases considerably. The “listwise” option is the
default setting in SPSS.
• The values are excluded pairwise (“Exclude cases pairwise”), i.e., if a value is missing, only the pairs involving this value are excluded. If, for example, a value is missing for variable j, only the correlations with variable j are affected in the calculation of a correlation matrix. In this way, the coefficients in the matrix may be based on different numbers of cases. This may result in an imbalance of the variables.
• There is no exclusion at all. Average values (“Replace with mean”) are inserted for
the missing values. This may lead to a reduced variance if many missing values occur
and to a distortion of the results.

For option 3, SPSS offers an extra procedure, which can be called up by the menu
sequence Transform/Replace Missing Values (cf. Fig. 1.14). With this procedure, the user
can decide per variable which information should replace missing values in a data set.
The following options are available:

• Series mean,
• Mean of nearby points (the number of nearby points may be defined as 2 to all),
• Median of nearby points (number of nearby points: 2 to all),
• Linear interpolation,
• Linear trend at point.

Fig. 1.14 SPSS procedure ‘Replace Missing Values’

For cross-sectional data, only the first two options make sense, since the missing values
of a variable are replaced with the mean or median (nearby points: all) of the entire data
series. The remaining options are primarily aimed at time series data, in which case the
order of the cases in the data set is important: With the options “Mean of nearby points”
and “Median of nearby points”, the user can decide how many observations before and
after the missing value are used to calculate the mean or the median for a missing value.
In the case of “Linear interpolation”, the mean is derived from the immediate predeces-
sor and successor of the missing value. “Linear trend at point” calculates a regression
(see Chap. 2) on an index variable scaled 1 to N. Missing values are then replaced with
the estimated value from the regression.
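For readers who want to reproduce these replacement strategies outside of SPSS, the following Python/pandas sketch illustrates the same ideas; the series values and the user-missing code 9999 are made up for illustration and are not SPSS output:

```python
import numpy as np
import pandas as pd

# Hypothetical series with an empty cell (NaN) and a user-missing code (9999)
sales = pd.Series([2596.0, 2709.0, 9999.0, 3004.0, np.nan, 2513.0, 2626.0, 3120.0])

# Treat the user-missing code as missing (comparable to declaring it under
# 'Missing' in the SPSS Variable View)
sales = sales.replace(9999.0, np.nan)

# Replace with the series mean
filled_mean = sales.fillna(sales.mean())

# Linear interpolation: mean of the immediate predecessor and successor
filled_interp = sales.interpolate(method="linear")

# Linear trend at point: regression of the series on an index variable 1..N
idx = np.arange(1, len(sales) + 1)
mask = sales.notna().to_numpy()
slope, intercept = np.polyfit(idx[mask], sales.to_numpy()[mask], deg=1)
filled_trend = sales.fillna(pd.Series(intercept + slope * idx, index=sales.index))

print(pd.DataFrame({"original": sales, "mean": filled_mean,
                    "interpolation": filled_interp, "trend": filled_trend}))
```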
With the menu sequence Analyze/Multiple Imputation, SPSS offers a good way of replacing missing values with realistic estimated values. SPSS also offers the possibility to analyze missing values under the menu sequence Analyze/Missing Value Analysis.
In addition to the general options for handling missing values described above, some
of the analytical procedures of SPSS also offer options for handling missing values.
Table 1.16 summarizes these options for the methods discussed in this book.

How the User Can Deal With Missing Values


The designation System missing values is automatically assigned by SPSS if values are
missing in a case (default setting). These cells are then ignored in any calculations, e.g.,
of the statistical parameters (see Sect. 1.2). However, this leads to the problem that varia-
bles with very different numbers of valid cases are included in the calculations (pairwise
exclusion of missing values). Such distortions can be avoided if all cases with an invalid

Table 1.16  Procedure-specific options of missing values

Method                     Options
Regression analysis        • Exclude cases listwise
                           • Exclude cases pairwise
                           • Replace with mean
Analysis of variance       • Exclude cases listwise
(ANOVA)                    • Exclude cases analysis by analysis
Discriminant analysis      Dialog box ‘Classification’: Replace missing values with mean
Logistic regression        No separate missing value options in the procedure
Contingency analysis       No separate missing value options in the procedure
Factor analysis            • Exclude cases listwise
                           • Exclude cases pairwise
                           • Replace with mean
Cluster analysis           No separate missing value options in the procedure
Conjoint analysis          No separate missing value options in the procedure

value are completely excluded (listwise exclusion of missing values). But this may result
in a greatly reduced number of cases. Replacing missing values with other values is
therefore a good way to counteract this effect and avoid unequal weighting.
In addition, the option user missing values offers the advantage that the user can differ-
entiate missing values in terms of content. Missing values that should not be included
in the calculation of statistical parameters may still provide specific information, i.e.,
whether a respondent is unable to answer (does not know) or does not want to answer
(no information). If the option to differentiate “missing values” in such a way is inte-
grated into the design of a survey from the start, important information can be derived
from it.
Finally, it should be emphasized again that missing values must be marked as such in SPSS so that they are not included in calculations and do not distort the results.

1.6 How to Use IBM SPSS, Excel, and R

In this book, we primarily use the software IBM SPSS Statistics (or SPSS) for the differ-
ent methods of multivariate analysis, because SPSS is widely used in science and prac-
tice. The name SPSS originally was an acronym for Statistical Package for the Social
Sciences. Over time, the scope of SPSS has been expanded to cover almost all areas of
data analysis.
IBM SPSS Statistics runs on the operating systems Windows, macOS, and
Linux. It includes a base module and several extension modules. Apart from the full

Table 1.17  Analysis methods and SPSS procedures

Method of analysis       SPSS procedure           SPSS module
Regression analysis      REGRESSION               Statistics Base
Variance Analysis        UNIANOVA                 Statistics Base
                         ONEWAY
                         GLM
Discriminant analysis    DISCRIMINANT             Statistics Base
Logistic analysis        LOGISTIC REGRESSION      Advanced Statistics
                         NOMREG                   or SPSS Regression
Contingency analysis /   CROSSTABS                Statistics Base
Cross tabulation         LOGLINEAR                Advanced Statistics
                         HILOGLINEAR              Advanced Statistics
Factor analysis          FACTOR                   Statistics Base
Cluster analysis         CLUSTER                  Statistics Base
                         QUICK CLUSTER            Statistics Base
Conjoint Analysis        CONJOINT                 SPSS Conjoint
                         ORTHOPLAN
                         PLANCARDS

version of IBM SPSS Statistics Base, a lower-cost student version is available for educa-
tional purposes. This has some limitations that are unlikely to be relevant to the majority
of users: Data files can contain a maximum of 50 variables and 1500 cases, and the SPSS
command syntax (command language) and extension modules are not available.
To use SPSS, the basic IBM SPSS Statistics Base package must be purchased, con-
taining basic statistical analysis. This basic module is also a prerequisite for purchas-
ing additional packages or modules, which usually focus on specific analysis procedures
such as SPSS Regression (regression analysis), SPSS Conjoint (conjoint analysis), or
SPSS Neural Networks.
An alternative option is to use the IBM SPSS Statistics Premium package which
includes all the procedures of the Basic and Advanced packages and is available to stu-
dents at most universities.
Table 1.17 provides an overview of the analytical methods covered in this book and
the associated SPSS procedures, all of which are included in the SPSS premium pack-
age. They run under the common user interface of SPSS Statistics. For readers without
the SPSS Premium package, the column “SPSS module” lists those SPSS modules or pack-
ages that contain the corresponding procedures.
The various data analysis methods can be selected in SPSS via a graphical user inter-
face. This user interface is constantly being improved and extended. Using the available
menus and dialog boxes, even complex analyses can be performed in a very convenient
way. Thus, the command language (command syntax) previously required to control the

program is hardly used any more, but it still has some advantages for the user, such as
the customization of analyses. All chapters in this book therefore contain the command
sequences required to carry out the analyses.
There are several books on how to use IBM SPSS, all of which provide a very good
introduction to the package:

• George, D. & Mallery, P. (2019). IBM SPSS Statistics 26 Step by Step (16th ed.).
London: Taylor & Francis Ltd.
• Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). London:
Sage Publication Ltd.
• Härdle, W. K., & Simar, L. (2015). Applied Multivariate Statistical Analysis, (5th
ed.). Heidelberg: Springer.

IBM SPSS also provides several manuals under the link https://www.ibm.com/support/
pages/ibm-spss-statistics-29-documentation, which are regularly updated.
Users who work with the programming language R will find notes on how to use it
for data analysis under the link www.multivariate-methods.info.
In addition, a series of Excel files for each analysis method is also provided on the
website www.multivariate-methods.info, which should help the readers familiarize them-
selves more easily with the various methods.


Further Reading

Anderson, D. R., Sweeney, D. J., & Williams, T. A. (2007). Essentials of modern business statistics
with Microsoft Excel. Thomson.
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage.
Fisher, R. A. (1990). Statistical methods, experimental design, and scientific inference. Oxford
University Press.
Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). Norton.
George, D., & Mallery, P. (2021). IBM SPSS statistics 27 step by step: A simple guide and refer-
ence (17th ed.). Routledge.
Sarstedt, M., & Mooi, E. (2019). A concise guide to market research: The process, data, and meth-
ods using IBM SPSS statistics (3rd ed.). Springer.
Wonnacott, T. H., & Wonnacott, R. J. (1977). Introductory statistics for business and economics
(2nd ed.). Wiley.
2 Regression Analysis

Contents

2.1 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.2 Procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.2.1 Model Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.2.2 Estimating the Regression Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.2.2.1 Simple Regression Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.2.2.2 Multiple Regression Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.2.3 Checking the Regression Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.2.3.1 Standard Error of the Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.2.3.2 Coefficient of Determination (R-square) . . . . . . . . . . . . . . . . . . . . . . . . 78
2.2.3.3 Stochastic Model and F-test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
2.2.3.4 Overfitting and Adjusted R-Square. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
2.2.4 Checking the Regression Coefficients. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
2.2.4.1 Precision of the Regression Coefficient. . . . . . . . . . . . . . . . . . . . . . . . . 85
2.2.4.2 t-test of the Regression Coefficient. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
2.2.4.3 Confidence Interval of the Regression Coefficient. . . . . . . . . . . . . . . . . 89
2.2.5 Checking the Underlying Assumptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.2.5.1 Non-linearity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.2.5.2 Omission of Relevant Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.2.5.3 Random Errors in the Independent Variables. . . . . . . . . . . . . . . . . . . . . 101
2.2.5.4 Heteroscedasticity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.2.5.5 Autocorrelation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.2.5.6 Non-normality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
2.2.5.7 Multicollinearity and Precision. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
2.2.5.8 Influential Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
2.3 Case Study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.3.1 Problem Definition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.3.2 Conducting a Regression Analysis With SPSS . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
2.3.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
2.3.3.1 Results of the First Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124


2.3.3.2 Results of the Second Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127


2.3.3.3 Checking the Assumptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
2.3.3.4 Stepwise Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
2.3.4 SPSS Commands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
2.4 Modifications and Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
2.4.1 Regression With Dummy Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
2.4.2 Regression Analysis With Time-Series Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
2.4.3 Multivariate Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
2.5 Recommendations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

2.1 Problem

Regression analysis is one of the most useful and thus most frequently used methods for
statistical data analysis. With the help of regression analysis, one can analyze the rela-
tionships between variables. For example, one can find out if a certain variable is influ-
enced by another variable, and if so, how strong this effect is.
In this way, one can learn how the world works. Regression analysis can be used in the search for truth, which can be very exciting. It is very useful when searching for explanations, making decisions, or making predictions. Thus, regression analysis
is of eminent importance for all empirical sciences as well as for solving practical prob-
lems. Table 2.1 lists examples for the application of regression analysis.
Regression analysis takes a special position among the methods of multivariate data
analysis. The invention of regression analysis by Sir Francis Galton (1822–1911) in con-
nection with his studies on heredity1 can be considered as the birth of multivariate data
analysis. Stigler (1997, p. 107) calls it “one of the grand triumphs of the history of sci-
ence”. Moreover, regression analysis provides a basis for numerous other methods used today in big data analysis and machine learning. For an understanding of these other, often more complex methods of multivariate data analysis, a profound knowledge of regression analysis is indispensable.
While regression analysis is a relatively simple method within the field of multivariate
data analysis, it is still prone to mistakes and misunderstandings. Thus, wrong results
or wrong interpretations of the results of regression analysis are frequent. This concerns
above all the underlying assumptions of the regression model. We will come back to this
later but will add a word of caution here. Regression analysis can be very helpful for
finding causal relationships, and this is the main reason for its application. But neither
regression analysis nor any other statistical method can prove causality. For this purpose,

1 Galton (1886) investigated the relationship between the body heights of parents and their adult
children. He “regressed the height of children on the height of parents”.

Table 2.1  Application examples of regression analysis in different disciplines

Discipline     Exemplary research questions
Agriculture    How does crop yield depend on the amounts of rainfall, sunshine, and fertilizers?
Biology        How does bodyweight change with the quantity of food intake?
Business       What revenue and profit can we expect for the next year?
Economics      How does national income depend on government expenditures?
Engineering    How does production time depend on the type of construction, technology, and labor force?
Healthcare     How does health depend on diet, physical activity, and social factors?
Marketing      What are the effects of price, advertising, and distribution on sales?
Medicine       How is lung cancer affected by smoking and air pollution?
Meteorology    How does the probability of rainfall change with variables like temperature, humidity, air pressure, etc.?
Psychology     How important are income, health, and social relations for happiness?
Sociology      What is the relationship between income, age, and education?

reasoning beyond statistics and information about the generation of the data may be
needed.
First, we want to show here how regression analysis works. For the application of
regression analysis, the user (researcher) must decide which one is the dependent vari-
able that is influenced by one or more other variables, so-called independent variables.
The dependent variable must be on a metric (quantitative) scale. The researcher further
needs empirical data on the variables. These may be derived from observations or exper-
iments and may be cross-sectional or time-series data. Somewhat bewildering for the
novice are the different terms that are used interchangeably in the literature for the varia-
bles of regression analysis and that vary by the author and the context of the application
(see Table 2.2).

Table 2.2  Regression analysis and terminology

Regression analysis (RA) is used to
• Describe and explain relationships between variables,
• Estimate or predict the values of a dependent variable.

Dependent variable (output): Y, explained variable, regressand, response variable, y-variable
  Example: Sales of a product

Independent variable(s) (input): X1, X2, …, Xj, …, XJ, explanatory variables, regressors, predictors, covariates, x-variables
  Example: Price, advertising, quality, etc.

In linear regression, the variables are assumed to be quantitative. By using binary variables (dummy variable technique), qualitative regressors can also be analyzed.

Example
When analyzing the relationship between the sales volume of a product and its
price, sales will usually be the dependent variable, also called the response variable,
explained variable or regressand, because sales volume usually responds to changes
in price. The price will be the independent variable, also called predictor, explana-
tory variable, or regressor. So, an increase in price may explain why the sales volume
has declined. And the price may be a good predictor of future sales. With the help of
regression analysis, one can predict the expected sales volume if the price is changed
by a certain amount. ◄

Simple Linear Regression


The relationship between sales volume and advertising expenditures poses one of the
big problems in business since a lot of money is spent without knowing much about the
effects. Many efforts have been made to learn more about this relationship with the help
of regression analysis. Various elaborate models have been developed (e.g. Leeflang
et al., 2000, pp. 66–99). We will start here with a very simple one.
In simple regression, we are looking for a regression function of Y on X. For example,
we assume that the sales volume is influenced by advertising and write in a very general
form
sales = f (advertising)
or Y = f (X) (2.1)
f (·) is an unknown function that we want to estimate:
estimated sales = f̂(advertising)

or Ŷ = f̂(X). (2.2)
Of course, the estimated values are not identical with the real (observed) values. That is
why the variable for estimated sales is denoted by Ŷ (Y with a hat). To get a quantitative
estimate for the relationship in Eq. (2.2) we must specify its structure. In linear regres-
sion, we assume:

Ŷ = a + b X (2.3)
With given data of Y and X, regression analysis can find values for the parameters a and
b. Parameters are numerical constants in a model, whose values we want to estimate.
Parameters that accompany (multiply) a variable (such as b) are also called coefficients.
Let us assume that the estimation yields the following result:

Ŷ = 500 + 3 X (2.4)
Figure 2.1 illustrates this function. Parameter b (the coefficient of X) is an indicator of
the strength of the effect of advertising on sales. Geometrically, b is the slope of the
Fig. 2.1 Estimated regression line

regression line. If advertising increases by 1 Euro, in this example sales will increase by
3 units. Parameter a (the regression intercept) reflects the basic level of sales if there is
no advertising (X = 0).
With the help of the estimated regression function a manager can answer questions
like:

• How will sales change if advertising expenditures are changed?


• What sales can be expected for a certain advertising budget?

So, for example, if the advertising budget is 100 Euros, we will expect sales to be

Ŷ = 500 + 3 · 100 = 800 units. (2.5)


And if advertising is increased to 120 Euros, sales will increase to 860 units.
Furthermore, if the variable costs per unit of the product are known, we can find out
whether an increase in advertising is profitable or not. Thus, regression analysis can be
used as a powerful tool to support decision making.
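As a small illustration of such what-if questions, the estimated function from Eq. (2.4) can be written as a short Python function (supplementary sketch, not part of the original text):

```python
def expected_sales(advertising_budget: float) -> float:
    """Predicted sales (in units) for the estimated function Y-hat = 500 + 3*X."""
    return 500 + 3 * advertising_budget

print(expected_sales(100))  # 800 units, as in Eq. (2.5)
print(expected_sales(120))  # 860 units
```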
The above regression function is an example of a so-called simple linear regression
or bivariate regression. Unfortunately, the relationship between sales volume and adver-
tising is usually not linear. But a linear function can be a good approximation in a lim-
ited interval around the current advertising budget (within the range of observed values).
Another problem is that sales are not solely influenced by advertising. Besides advertis-
ing, sales volumes also depend on the price of the product, its quality, its distribution,

and many other influences.2 So, with simple linear regression, we usually can get only
very rough estimates of sales volumes.

Multiple Regression
With multiple regression analysis one can take into account more than one influencing
variable by including them all into the regression function. So, Eq. (2.1) can be extended
to a function with several independent variables:
Y = f (X1 , X2 , . . . , Xj , . . . , XJ ) (2.6)
Choosing again a linear structure, we get:

Ŷ = a + b1 X1 + b2 X2 + . . . + bj Xj + . . . + bJ XJ (2.7)
By including more explanatory variables, the predictions of Y can become more precise.
However, there are limitations to extending the model. Often, not all influencing vari-
ables are known to the researcher. Or some observations are not available. Also, with
an increasing number of variables the estimation of the parameters can become more
difficult.

2.2 Procedure

In this section, we will show how regression analysis works. The procedure can be
structured into five steps that are shown in Fig. 2.2. The steps of regression analysis
are demonstrated using a small example with three independent variables and 12 cases
(observations) as shown in Table 2.3.3

Example
The manager of a chocolate manufacturer is not satisfied with the sales volume of
chocolate bars. He would like to find out how he can influence the sales volume. To
this end, he collected quarterly sales data from the last three years. In particular, he
took data on sales volume, retail price, and expenditures for advertising and sales pro-
motion. Data on retail sales and prices were acquired from a retail panel (Table 2.3).

2 Sales can also depend on environmental factors like competition, social-economic influences, or
weather. Another difficulty is that advertising itself is a complex bundle of factors that cannot sim-
ply be reduced to expenditures. The impact of advertising depends on its quality, which is difficult
to measure, and it also depends on the media that are used (e.g., print, radio, television, internet).
These and other reasons make it very difficult to measure the effect of advertising.
3 On the website www.multivariate-methods.info we provide supplementary material (e.g., Excel

files) to deepen the reader’s understanding of the methodology.



Fig. 2.2 The five-step procedure of regression analysis:
1 Model formulation
2 Estimating the regression function
3 Checking the regression function
4 Checking the regression coefficients
5 Checking the underlying assumptions

Table 2.3  Data of the application example


Period i   Sales [1000 units]   Advertising [1000 EUR]   Price [EUR/unit]   Promotion [1000 EUR]
1 2596 203 1.42 150
2 2709 216 1.41 120
3 2552 207 1.95 146
4 3004 250 1.99 270
5 3076 240 1.63 200
6 2513 226 1.82 93
7 2626 246 1.69 70
8 3120 250 1.65 230
9 2751 235 1.99 166
10 2965 256 1.53 116
11 2818 242 1.69 100
12 3171 251 1.72 216
Mean 2825.1 235.2 1.71 156.43
Std-deviation 234.38 18.07 0.20 61.53

2.2.1 Model Formulation

1 Model formulation

2 Estimating the regression function

3 Checking the regression function

4 Checking the regression coefficients

5 Checking the underlying assumptions

The first step in performing a regression analysis is the formulation of a model. A model
is a simplified representation of a real-world phenomenon. It should have some structural
or functional similarity with reality. A city map, for example, is a simplified visual model
of a city that shows its streets and their courses. A globe is a three-dimensional model of
Earth.

In regression analysis, we deal with mathematical models. The specification of


regression models comprises:

• choosing and defining the variables,


• specifying the functional form,
• assumptions about errors (random influences).4

A model should always be as simple as possible (principle of parsimony) and as com-


plex as necessary. Thus, modeling is always a balancing act between simplicity and com-
plexity (completeness). A model must be able to capture one or more relevant aspects
of interest to the user. However, the more complete a model represents reality, the more
complex it becomes, and its handling becomes increasingly difficult or even impossible.5
The appropriate level of detail depends on the intended use, but also on the user’s expe-
rience and the available data. An evolutionary approach is often useful, starting with a
simple model, which is then extended with increasing experience and expertise (Little,
1970).
A model becomes more complex with the number of variables. For explaining sales,
there exists a great number of candidate explanatory variables. Our manager starts with a
simple model and chooses only one variable for explaining sales. He assumes that sales
volume is mainly influenced by advertising expenditures. Thus, he chooses sales as the
dependent variable and advertising as the independent variable and formulates the fol-
lowing model:

sales = f (advertising)
or
Y = f (X)

The manager further assumes that the effect of advertising is positive, i.e. that the sales
volume increases with increasing advertising expenditures. To check this hypothesis, he
inspects the data in Table 2.3. It is always useful to visualize the data by a scatterplot
(dot diagram), as shown in Fig. 2.3. This should be the first step of an analysis.
Each observation of sales and advertising in Table 2.3 is represented by a point in
Fig. 2.3. The first point at the left is the point (x1 , y1 ), i.e. the first observation with the
values 203 and 2596. Using Excel or SPSS, such scatter diagrams can be easily created,
even for large amounts of data.

4 See Sects. 2.2.3.3 and 2.2.5.


5 In
regression analysis we encounter the problem of multicollinearity. We will deal with this prob-
lem in Sect. 2.2.5.7.
Fig. 2.3 Scatterplot of the observed values for sales and advertising

The scatterplot shows that the sales volume tends to increase with advertising. We can
see some linear association between sales and advertising.6 This confirms the hypothesis
of the manager that there is a positive relationship between sales and advertising. For
the correlation (Pearson’s r) the manager calculates rxy = 0.74. Moreover, the manager
assumes that the relationship between sales and advertising can be approximately repre-
sented by a linear regression line, as shown in Fig. 2.4.
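The reported correlation can be reproduced with a few lines of Python using the sales and advertising figures from Table 2.3 (a supplementary illustration, not part of the original text):

```python
import numpy as np

advertising = np.array([203, 216, 207, 250, 240, 226, 246, 250, 235, 256, 242, 251])
sales = np.array([2596, 2709, 2552, 3004, 3076, 2513, 2626, 3120, 2751, 2965, 2818, 3171])

# Pearson's r for sales and advertising
r = np.corrcoef(advertising, sales)[0, 1]
print(round(r, 3))  # expected: approximately 0.742
```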
The situation would be different if we had a scatterplot as shown in Fig. 2.5. This
indicates a non-linear relationship. Advertising response is always non-linear. Linear
models are in almost all cases a simplification of reality. But they can provide good
approximations and are much easier to handle than non-linear models. So, for the
data in Fig. 2.5, a linear model could be appropriate for a limited range of advertising

6 The terms association and correlation are widely and often interchangeably used in data analysis.
But there are differences. Association of variables refers to any kind of relation between variables.
Two variables are said to be associated if the values of one variable tend to change in some sys-
tematic way along with the values of the other variable. A scatterplot of the variables will show
a systematic pattern. Correlation is a more specific term. It refers to associations in the form of a
linear trend. And it is a measure of the strength of this association. Pearson’s correlation coeffi-
cient measures the strength of a linear trend, i.e. how close the points are lying on a straight line.
Spearman’s rank correlation can also be used for non-linear trends.

Fig. 2.4 Scatterplot with a linear regression line

Fig. 2.5 Scatterplot with a non-linear association

expenditures, e.g. from zero to 200. For modeling advertising response over the com-
plete range of expenditures, a non-linear formulation would be necessary (for handling
non-linear relations see Sect. 2.2.5.1).
The regression line in Fig. 2.4 can be mathematically represented by the linear
function:

Ŷ = a + b X (2.8)
Fig. 2.6 The linear regression function

with
Ŷ estimated sales
X advertising expenditures
a constant term (intercept)
b regression coefficient

The meaning of the regression parameters a and b is illustrated in Fig. 2.6.


Parameter a (intercept) indicates the intersection of the regression line with the y-axis
(the vertical axis or ordinate) of the coordinate system. This is the value of the regression
line for X = 0 or no advertising.
Parameter b indicates the slope of the regression line. It holds that

b = ΔŶ / ΔX (2.9)
Parameter b tells us how much Y will probably increase if X is increased by one unit.

2.2.2 Estimating the Regression Function

1 Model formulation

2 Estimating the regression function

3 Checking the regression function

4 Checking the regression coefficients

5 Checking the underlying assumptions



Table 2.4  Data for sales and advertising with basic statistics

Year i   Sales Y   Advertising X
1 2596 203
2 2709 216
3 2552 207
4 3004 250
5 3076 240
6 2513 226
7 2626 246
8 3120 250
9 2751 235
10 2965 256
11 2818 242
12 3171 251
Mean y, x 2825 235.2
Std-deviation sy , sx 234.38 18.07
Correlation rxy 0.742

A mathematical model, like the regression function in Eq. (2.3), must be adapted to real-
ity. The parameters of the model must be estimated based on a data set (observations of
the variables). This process is called model estimation or calibration. We will demon-
strate this with the data from Table 2.3, first for simple regression, and then for multiple
regression.

2.2.2.1 Simple Regression Models


The procedure of estimation is based on the method of least squares (LS) that we will
explain in the following. Table 2.4 shows the data for sales and advertising (transferred
from Table 2.3) and the values of some basic statistics.
The regression coefficient b can be calculated as:
b = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² = 34,587 / 3592 = 9.63 (2.10)

With the statistics for the standard deviations and the correlation of the two variables
given in Table 2.4, we can calculate the regression coefficient more easily as7
b = rxy · sy/sx = 0.742 · 234.38/18.07 = 9.63 (2.11)

7 Thesebasic statistics can be easily calculated with the Excel functions AVERAGE(range) for
mean, STDEV.S(range) for std-deviation, and CORREL(range1;range2) for correlation.
Fig. 2.7 Regression line and SD line

With the value of coefficient b, we get the constant term through

a = ȳ − b·x̄ = 2825 − 9.63 · 235.2 = 560 (2.12)
The resulting regression function as shown in Fig. 2.4 is:

Ŷ = 560 + 9.63 X (2.13)


The value of b   = 9.63 states that if advertising expenditures are increased by 1 EUR,
sales may be expected to rise by 9.63 units.
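The calculations in Eqs. (2.10) to (2.12) can be verified with a short Python sketch based on the data in Table 2.4 (supplementary illustration; np.polyfit is used only as a cross-check and is not part of the original text):

```python
import numpy as np

advertising = np.array([203, 216, 207, 250, 240, 226, 246, 250, 235, 256, 242, 251])
sales = np.array([2596, 2709, 2552, 3004, 3076, 2513, 2626, 3120, 2751, 2965, 2818, 3171])

# Regression coefficient b according to Eq. (2.10)
b = (np.sum((advertising - advertising.mean()) * (sales - sales.mean()))
     / np.sum((advertising - advertising.mean()) ** 2))

# Constant term a according to Eq. (2.12)
a = sales.mean() - b * advertising.mean()
print(round(a, 1), round(b, 2))  # expected: approximately 560 and 9.63

# Cross-check with numpy's built-in least-squares line fit
b_check, a_check = np.polyfit(advertising, sales, deg=1)
```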
The value of the regression coefficient can provide important information for the
manager. If we assume that the contribution margin per chocolate bar is 0.20 EUR,
spending 1 EUR more on advertising will increase the total contribution margin by
0.20 EUR · 9.63 = 1.93 EUR. Thus, by spending 1.00 EUR, net profit will increase by
0.93 EUR. If the profit per unit were only 1.00/9.63 = 0.10 EUR or less, an increase in
advertising expenditures would not be profitable.

Understanding Regression
A regression line for given data must always pass through the centroid of the data (the
point of means or point of averages). This is a consequence of the least-squares esti-
mation. In our case, the point of means is [x̄, ȳ] = [235, 2825] (marked by a bullet in Fig. 2.7).
For standardized variables with x̄ = ȳ = 0 and sx = sy = 1, the regression line will pass through the origin of the coordinate system, the point [x̄, ȳ] = [0, 0]. For the con-
stant term, we get a = 0, as we can see from Eq. (2.12). And from Eq. (2.11) we can see
that the regression coefficient (the slope of the regression line) is simply the same as the
correlation coefficient. In our case, after standardization we would get:

b = rxy = 0.74.

For the original variables X and Y the slope b also depends on the standard deviations
of X and Y, sx and sy. Only for sx = sy the values of b and rxy are identical. If sy > sx,
then the slope of the regression line will be larger than the correlation (b > rxy ) and vice
versa. The greater sy, the greater the coefficient b will be, and the greater sx, the smaller b
will be.
As the standard deviation of any variable changes with its scale, the regression coeffi-
cient b also depends on the scaling of the variables. If, e.g., the advertising expenditures
are given in cents instead of EUR, b will be diminished by the factor 100. The effect of a
change by one cent is just 1/100 of the effect of a change by one EUR.
By changing the scale of the variables, the researcher can arbitrarily change the stand-
ard deviations and thus the regression coefficient. But he cannot change the value of the
correlation coefficient rxy since its value is independent of differences in scale.
The line through the point of means with the slope sy/sx (with the same sign as the correlation coefficient) is called the standard deviation line (SD line; Freedman et al. 2007, pp. 130–131). This line is known before performing a regression analysis. For our
data, we get sy /sx = 13. In Fig. 2.7 the SD line is represented by the dashed line.
For rxy = 1, the regression line is identical with the SD line. But for empirical data,
we will always get rxy < 1. Thus, it follows that the regression line will always be flatter
than the SD line, i.e. |b| < sy /sx. This effect is called the regression effect, from which
regression analysis got its name.
For our data we get b = 9.63 < 13. The estimated regression line will always lie
between the SD line and a horizontal line through the point of means.

Correlation and Regression


As can be seen from Eq. (2.11), there is a very close connection between correlation
and regression. Karl Pearson derived the correlation coefficient from regression analy-
sis when he was working on a mathematical formulation of regression analysis. Thus,
he denoted it by the letter “r”, like regression. Both correlation and regression are used
to measure the strength of a relationship between two variables. But regression analysis
can do more than correlation.8 It can also measure the effect that the independent varia-
ble has on the dependent variable. Plus, regression analysis can yield predictions for the
dependent variable.
The correlation coefficient, in contrast, does not distinguish between dependent and
independent variables. Thus, the correlation coefficient is symmetrical:
rxy = ryx (2.14)

8 Blalock (1964, p. 51) writes: “A large correlation merely means a low degree of scatter …. It is
the regression coefficients which give us the laws of science.”

This is different in regression analysis. There are two forms of regression functions for
two variables X and Y:
Y = f (X) and X = f (Y ) (2.15)
with the regression coefficients
b = rxy · sy/sx (for Y = f(X))   and   b = rxy · sx/sy (for X = f(Y)) (2.16)

These two regression coefficients are only identical if sx = sy.


In regression analysis, we usually assume a causal relationship between the “depend-
ent” and the “independent” variable, as the names of the variables suggest. A causal rela-
tionship is a relationship that has a direction. For just two variables X and Y, it can be
formally expressed by
X → Y (cause → effect) (2.17)
This means: If X is changed, then there will be a change in Y. Or in other words: X pro-
duces a change in Y. When analyzing this relationship by regression analysis, Y has to be
the dependent variable.
But the causal relationship assumed in regression analysis is often only a hypothesis,
i.e., an assumption by the researcher. Correlation does not necessarily imply causation.
And if there is causality, the correlation will not be affected by the direction of the causa-
tion. In regression analysis, the researcher has to decide which variable is the dependent
one and which is the independent one.

Residuals
Due to random influences, the estimated values ŷi and the observed values yi will not
be identical. The differences between the observed and the estimated y-values are called
residuals and they are usually denoted by “e” (for “error”):
ei = yi − ŷi (i = 1, . . . , N) (2.18)
with
yi observed value of the dependent variable Y
ŷi estimated value of Y for xi
N number of observations
Table 2.5 shows the sales estimated with the regression function in Eq. (2.13) for the
given values of X. The last two columns list the residuals and the squared residuals.
The residuals are caused by influences on sales that are not considered in the model.
There are two types of such influences:

Table 2.5  Data and residuals


Year i   Sales Y   Advertising X   Estimated sales Ŷ   Residuals e = Y − Ŷ   Squared residuals e²
1 2596 203 2515 80.7 6508
2 2709 216 2641 68.5 4690
3 2552 207 2554 −1.8 3
4 3004 250 2968 36.1 1301
5 3076 240 2872 204.4 41768
6 2513 226 2737 −223.8 50091
7 2626 246 2929 −303.4 92055
8 3120 250 2968 152.1 23127
9 2751 235 2823 −72.5 5253
10 2965 256 3026 −60.7 3685
11 2818 242 2891 −72.9 5312
12 3171 251 2978 193.4 37421
Mean: 2825.1 235.2 2825.1 0 22601
Sum: 0 271217

• Systematic influences, such as price, promotion, or actions of competitors,


• Random influences caused by the behavior of consumers or unobservable errors in
measurement.

A single observation yi can thus be expressed by a systematic component (the regression


line) and a residual quantity that cannot be explained:
yi = a + b xi + ei (2.19)
Geometrically, the residual ei is the vertical distance or deviation of an observation point
i from the regression line (see Fig. 2.8). If an observation point lies below the regression
line, the residual assumes a negative value. By squaring the residuals all values become
positive. Otherwise, negative residuals would offset positive residuals.

The Method of Least Squares (LS)


The residuals play a key role in regression analysis. If data for X and Y are given, regres-
sion analysis finds values for the parameters a and b that make the “sum of squared
residuals” (SSR) as small as possible:

SSR = e1² + e2² + … + eN² = Σi ei² → min! (2.20)

With Eq. (2.19) we can write:


Fig. 2.8 Regression line and residual

SSR = Σi (yi − a − b·xi)² → min! (minimized over a and b) (2.21)

The sum of squared residuals is a function of the unknown regression parameters a and
b. The resulting optimization problem can be solved by differential calculus, i.e. by tak-
ing partial derivatives with respect to a and b. In this way, we can derive the formulas
a = ȳ − b·x̄,   b = rxy · sy/sx
that we used above for calculation. The minimum value of SSR (see Table 2.5, bottom
right) is given by the values a = 560 and b = 9.63. No other values for a and b can make
this sum smaller.9
This method of estimation is called the “method of least squares” (LS). In the context of linear regression, it is also called ordinary least squares (OLS). It was developed by the great mathematician Carl Friedrich Gauß and is the most widely used statistical method for the estimation of parameters. Gauß was able to show that the least-squares criterion will, under certain assumptions (see Sect. 2.2.5), yield best linear unbiased estimators (BLUE).10

9 With the optimization tool Solver of MS Excel it is easy to find this solution without differential
calculus or knowing any formulas. One chooses the cell that contains the value of SSR (the sum at
the bottom of the rightmost column in Table 2.5) as the target cell (objective). The cells that con-
tain the parameters a and b are chosen as the changing cells. Then minimizing the objective will
yield the least-squares estimates of the parameters within the changing cells.
10 Carl Friedrich Gauß (1777–1855) used the method in 1795 at the age of only 18 years for calcu-

lating the orbits of celestial bodies. This method was also developed independently by the French
mathematician Adrien-Marie Legendre (1752–1833). G. Udny Yule (1871–1951) first applied it to
regression analysis.

For a better understanding of OLS, it is useful to look at some alternatives. Instead


of minimizing the vertical deviations between observations and the regression line, one
could also minimize the horizontal deviations or the squared Euclidean distances. We
choose the vertical deviations yi − ŷi because we want to minimize the error for predict-
ing the dependent variable Y.
To obtain a good optimization criterion, it is necessary to transform the negative devi-
ations into positive ones. While in LS, this is achieved by squaring the residuals, it could
also be done by using absolute values. This yields the least-absolute deviations criterion
(LAD) (see Greene 2012, p. 243; Wooldridge 2016, p. 321):

Σi |ei| → min! (2.22)

An advantage of LAD is that it is more robust than OLS. It is less sensitive to outliers,
i.e., observations with unusually large deviations from the regression line. Because these deviations are squared, they have a stronger effect on the estimation results in OLS. This is
especially a problem for small sample sizes and can be seen as a disadvantage of OLS.
The LAD method looks simpler than OLS, but it is computationally more difficult to
handle since we cannot use differential calculus for solving the optimization problem.
Instead, iterative numerical methods are necessary. Before the invention of the computer,
this made the application of LAD prohibitive, while today this is not such a problem any
more. But one cannot analytically derive such nice formulas for the estimation of the
parameters as Eqs. (2.11 and 2.12). Another problem of LAD is that it does not always
give a unique solution. Multiple solutions are possible. But in most cases, OLS and LAD
will yield very similar results.
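As a supplementary sketch in the spirit of the Solver approach mentioned in footnote 9 (not part of the original text), the following Python code minimizes both criteria numerically with scipy; for the advertising data, the OLS solution should come out close to a = 560 and b = 9.63, while the LAD solution may differ slightly:

```python
import numpy as np
from scipy.optimize import minimize

advertising = np.array([203, 216, 207, 250, 240, 226, 246, 250, 235, 256, 242, 251])
sales = np.array([2596, 2709, 2552, 3004, 3076, 2513, 2626, 3120, 2751, 2965, 2818, 3171])

def ssr(params):
    """Sum of squared residuals, cf. Eq. (2.21)."""
    a, b = params
    return np.sum((sales - a - b * advertising) ** 2)

def sad(params):
    """Sum of absolute deviations, cf. Eq. (2.22)."""
    a, b = params
    return np.sum(np.abs(sales - a - b * advertising))

# OLS: the SSR criterion is smooth, so the default gradient-based method works well
ols = minimize(ssr, x0=np.array([0.0, 0.0]))

# LAD: the criterion is not differentiable everywhere, so we use Nelder-Mead,
# starting from the OLS solution
lad = minimize(sad, x0=ols.x, method="Nelder-Mead")

print("OLS estimates (a, b):", ols.x)  # should be close to 560 and 9.63
print("LAD estimates (a, b):", lad.x)  # may differ slightly from the OLS values
```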

Separate Simple Regressions


In the same way we have analyzed the effect of advertising on sales, we can also analyze
the effects of the other two marketing variables in Table 1.15, i.e., price and promotion,
by performing a simple regression for each variable. The three resulting regression func-
tions would be:

Ŷ = 560 + 9.63 advertising


Ŷ = 2920 − 55.5 price
Ŷ = 2400 + 2.72 promotion

This approach of estimating a separate regression function for each independent variable
poses some problems:

• If the manager fixes his marketing plan for the coming period, each equation will give
a different value for the expected sales.
• The parameters of each of the three equations will probably be biased by having
neglected the other two variables.

Table 2.6  Correlation matrix

               Sales    Advertising   Price     Promotion
Sales          1        0.742         −0.048    0.713
Advertising             1             0.155     0.290
Price                                 1         0.299
Promotion                                       1

So, instead of having three separate regression functions that are all not very accurate, it
would be better to have one regression function with all three independent variables that
will give more accurate results. This possibility is offered by multiple regression. The
use of simple regression should be restricted to problems where we have only one inde-
pendent variable.

2.2.2.2 Multiple Regression Models


In general, the multiple regression function has the following form:

Ŷ = b0 + b1 X1 + b2 X2 + . . . + bj Xj + . . . + bJ XJ (2.23)
where J denotes the number of independent variables. For technical reasons, we will
now denote the constant term a by b0.11 Unfortunately, multiple regression does not yield
such nice formulas for the estimation of the parameters as simple regression.

Regression With Two Independent Variables


We will start multiple regression with just two independent variables (J = 2). Assume
that our manager wants to analyze the combined effects of advertising and price:
sales = f (advertising, price) (2.24)
Before we start with a multiple regression, we should have a look at the correlation
matrix for our variables, which is displayed in Table 2.6. It shows that there is a low
correlation between price and advertising, and an even lower one between price and sales
volume. We specify the following regression function:

Ŷ = b0 + b1 X1 + b2 X2 (2.25)
with X1 = advertising and X2 = price.
Note that this function does not represent a line, as in a simple regression. It now
defines a plane in a three-dimensional space spanned by the three variables. For more
than two regressors the regression function becomes a hyperplane.

11 When using matrix algebra for calculation, the constant term is treated as the coefficient of a
fictive variable with all values equal to 1. By this, it can be computed in the same way as the other
coefficients and the calculation becomes easier.

When using the LS criterion, we have to minimize the following sum of squared
residuals:

SSR = Σi ei² = Σi (yi − b0 − b1 x1i − b2 x2i)² → min! (minimized over b0, b1, b2) (2.26)

Minimizing this sum by taking partial derivatives with respect to the three unknown
parameters b0 , b1 , b2 yields the value SSR = 254,816 (in contrast to SSR = 271,217 in
Table 2.5). Thus, by taking into account the influence of price on sales, we can reduce
the sum of squared residuals by 16,401 or 6%. This is not very much. But because of the
low correlation between sales and price, we could not expect more.
The resulting regression function is:

Ŷ = 814 + 9.97 X1 − 194.6 X2 (2.27)


In Eq. (2.27) the coefficient for price shows a negative sign, which is logically correct.
It indicates that sales will decrease with a rising price, which is what we would expect.
Checking the signs of the coefficients should be the first step when examining a regres-
sion function.
The slope coefficient b1 = 9.97 denotes the change in Ŷ per unit of change in X1
(advertising) when X2 (price) is held constant. And similarly, b2 = −194.6 denotes the
change in Ŷ per unit of change in X2 (price) when X1 (advertising) is held constant.
In the context of multiple regression, the regression coefficients are also called partial
regression coefficients. The partial regression coefficient for advertising is now b1 = 9.97
and differs from the coefficient b = 9.63 obtained by simple regression. The reason is
that in simple regression the effect of price on sales affected the estimation of b. In b1,
this effect has been removed by the inclusion of price into the regression function. That
is why b1 is called a partial regression coefficient.
If all effects are positive, the partial regression coefficients will be smaller than the
simple regression coefficients. But when looking at the correlation matrix in Table 2.6
we can see that price is negatively correlated with sales. This negative effect has now
been removed in b1 and so the coefficient has become larger.
As in our example the correlations of price with sales and also with advertising are
small, the change in the regression coefficient of advertising is also small. For zero cor-
relation of price the following would hold: b1 = b, i.e., there would be no difference
between the simple and the multiple regression coefficient.

Standardized Regression Coefficients (Beta Coefficients)


When comparing the regression coefficients of advertising and price in Eq. (2.27), one
can get the impression that price is much more important than advertising for explaining
the variation of sales because its absolute size is much larger. But this is wrong.

For the simple regression we can see from Eq. (2.11) that the value of the regression
coefficient b is determined by the value of the correlation r between the two variables
and the ratio of the standard deviations:
b = rxy · sy/sx
Thus, the greater the standard deviation of the independent variable, the smaller b will
be. For our data the following applies:
sx1 = 18.07 for advertising and
sx2 = 0.201 for price.
The standard deviation of advertising is much greater than that of price. Thus, for
an equal importance of advertising and price, a much smaller regression coefficient for
advertising is to be expected.
A way to make the regression coefficients comparable is to standardize them. The
standardized regression coefficients are usually called beta coefficients. They are calcu-
lated as follows:
betaj = bj · sxj/sy (2.28)

When comparing this formula with Eq. (2.11), we can see that the scaling of the varia-
bles X and Y is eliminated in the beta coefficients. Thus, the beta coefficients are inde-
pendent of any linear transformations of the variables and can thus be used as a measure
of importance. We get for

• advertising: beta1 = 9.97 · 18.07/234.38 = 0.768
• price: beta2 = −194.6 · 0.201/234.38 = −0.167

From this, we can see that in our example advertising has a much greater influence on
sales variation than price. One may wonder about the comparatively small importance
of price in this case. This can have several reasons. We have seen above that price has
a very low correlation with sales. This may be due to the fact that the prices in our data
set were taken from a retail panel and are average values over many sales outlets. Due to
this, the variation in price is probably diminished and the correlation somewhat blurred.
We could also get the beta coefficients by standardizing the variables. In this case,
OLS would yield regression coefficients that are identical with the beta coefficients and
the constant term would be zero. For simple regression, the beta coefficient will be iden-
tical with the correlation coefficient. But in general, the beta coefficients cannot be inter-
preted as correlation coefficients.
Note: For estimating the effects of changes in the independent variables or mak-
ing predictions of the dependent variable, we need the non-standardized regression
coefficients.
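To make Eq. (2.28) concrete, the following short sketch (Python, added here for illustration; it is not part of the original text) recomputes the beta coefficients from the unstandardized coefficients and the standard deviations reported above, with sy = 234.38 as the standard deviation of sales.

# Minimal sketch: beta coefficients via Eq. (2.28), beta_j = b_j * s_xj / s_y.
# All numbers are the values reported in the text for Model 2.
b = {"advertising": 9.97, "price": -194.6}     # unstandardized coefficients
s_x = {"advertising": 18.07, "price": 0.201}   # std. dev. of the regressors
s_y = 234.38                                   # std. dev. of sales

beta = {name: b[name] * s_x[name] / s_y for name in b}
print(beta)   # advertising: ~0.768, price: ~-0.167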

Regression With Three or More Regressors


In the same way as above, we can extend the regression function to more regressors. For
example, by including all three marketing variables from Table 2.3 into our model we
get:
sales = f (advertising, price, promotion)

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3 (2.29)
With the data in Table 2.3, OLS-estimation yields the following regression function:

Ŷ = 1248 + 7.91X1 − 387.6 X2 + 2.42 X3


As mentioned above, extending the model does not change the regression coefficients, and they remain equal to the corresponding simple regression coefficients, if the independent variables are uncorrelated. Here, however, we see greater changes in the regression coefficients for advertising and price after including the variable promotion in the regression function.
The partial regression coefficient for advertising is now b1 = 7.91 and thus signifi-
cantly smaller than the coefficient b = 9.63 obtained by a simple regression. The positive
effect of promotion has now been removed from the coefficient. As you can see from the
correlation matrix, this effect is much stronger than the negative effect of price.
The values of the beta coefficients now are:
beta1 = 0.610, beta2 = −0.332, beta3 = 0.636
While the coefficient of the price had the greatest absolute value, the beta coefficient
tells us that price has the lowest importance. It is the factor promotion that has the great-
est influence on sales, slightly greater even than advertising. By including promotion in
the regression function the sum of squared residuals diminishes from SSR = 254,816 to
SSR = 47,166. This confirms the great importance of promotion for explaining the varia-
tion in sales.
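As an illustration of how such a multiple regression can be estimated in practice, here is a minimal sketch using Python and the statsmodels package. Since the data of Table 2.3 are not reproduced at this point, the arrays below are randomly generated placeholders, and the coefficients used to generate them are arbitrary illustration values, not the estimates reported above.

# Sketch: OLS estimation of a three-regressor model as in Eq. (2.29).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
N = 12
advertising = rng.uniform(100, 200, N)    # X1 (hypothetical)
price = rng.uniform(2.0, 2.6, N)          # X2 (hypothetical)
promotion = rng.uniform(0, 400, N)        # X3 (hypothetical)
# arbitrary generating values, plus random noise:
sales = 1000 + 8 * advertising - 300 * price + 2.5 * promotion + rng.normal(0, 80, N)

X = sm.add_constant(np.column_stack([advertising, price, promotion]))
result = sm.OLS(sales, X).fit()
print(result.params)   # estimated b0, b1, b2, b3
print(result.ssr)      # sum of squared residuals (SSR)

For a fitted model, result.summary() reports, among other things, the standard errors, t-values, p-values, R-square, adjusted R-square, and the F-test that are discussed in the following sections.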

2.2.3 Checking the Regression Function

1 Model formulation

2 Estimating the regression function

3 Checking the regression function

4 Checking the regression coefficients

5 Checking the underlying assumptions

Once we have estimated a regression function, we need to assess its goodness or quality.
Nobody wants to rely on a bad model. We need to know how our model fits the empirical
data and whether it is suitable as a model of reality. For this reason, we need measures
for evaluating the goodness-of-fit.

A natural basis for evaluating the goodness-of-fit is provided by the fitting criterion
of regression, the sum of squared residuals (SSR). We have already compared the three
models above by their SSR. With each regressor that was added, SSR became smaller
and thus the fit to the data improved.
However, the absolute value of SSR has no meaning, because it depends not only on
the goodness-of-fit but also on the number of observations and the scaling of Y. So, by
using SSR we can only compare models for the same data set. And SSR does not tell us
whether a model is good or bad, or how good or bad it is.
Suitable measures for assessing the goodness-of-fit of a model are:

• standard error of the regression,


• coefficient of determination (R-square),
• F-statistic and corresponding p-value,
• adjusted coefficient of determination.

2.2.3.1 Standard Error of the Regression


The standard error of the regression (standard error of the estimate) (SE) measures how
closely the observations scatter (vertically) around the estimated regression function. It
is used as an inverse measure of statistical precision. Precision increases if SE decreases.
SE is calculated as the standard deviation of the residuals:

SE = √(SSR / (N − J − 1))  (2.30)
with N = number of observations and J = number of regressors.
N – J – 1 is the number of the degrees of freedom (df) for the estimation, i.e. the num-
ber of observations minus the number of parameters in the regression function that have
to be estimated from our observations.
The units for the standard error are the same as for Y. For our simple regression
“sales = f (advertising)” we get

SE = √(271,217 / (12 − 1 − 1)) = 164.7
Thus, 165 [sales units] is the error we make on average when using the estimated regression function to predict sales. It is often useful to express the standard error as a percentage, which is independent of scale. Relative to average sales (ȳ = 2825), this gives 5.8%.
For the multiple regression model “sales = f (advertising, price, promotion)” we get

SE = √(47,166 / (12 − 3 − 1)) = 76.8 or 2.7%
This is clearly below 5% and seems to be quite acceptable. The precision has been
increased considerably by extending the model.
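A minimal sketch of Eq. (2.30), using the SSR values reported in the text for Models 1 and 3 (Python, for illustration only):

# Standard error of the regression: SE = sqrt(SSR / (N - J - 1))
import math

def standard_error(ssr, n, j):
    return math.sqrt(ssr / (n - j - 1))

print(standard_error(271_217, 12, 1))   # ~164.7 (Model 1)
print(standard_error(47_166, 12, 3))    # ~76.8  (Model 3)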

2.2.3.2 Coefficient of Determination (R-square)


The most common goodness-of-fit measure is the coefficient of determination, denoted
by R2 (R-square). For a simple regression it is calculated by squaring the correlation
between Y and X:

R² = ryx²  (2.31)

with 0 ≤ R² ≤ 1. For multiple regression, R-square is calculated by squaring the correlation between Y and Ŷ:

R² = ryŷ²  (2.32)

The correlation coefficient ryŷ between the observed and the fitted y-values is called mul-
tiple correlation because Ŷ is a linear combination of the x-variables.
R-square can be interpreted as the proportion of total variation in Y that is explained
by the independent variables. The higher R-square, the better the fit. This intuitively easy
interpretation is the reason for the popularity of R-square. In our example we get the fol-
lowing values:

• Model 1: sales = f (advertising) R2 = 0.551


• Model 2: sales = f (advertising, price) R2 = 0.578
• Model 3: sales = f (advertising, price, promotion) R2 = 0.922

So, while with advertising alone we can explain only 55% of the variation in sales, with
the full model we can explain 92%. With each additional regressor R-square increases.

Decomposition of the Variation of Y


To explain the interpretation of R-square as the explained proportion of variation, we
first go back to simple regression. We consider just a single observation of advertising
and sales (xi , yi ). The corresponding point and its vertical deviation yi − y from the mean
value y are shown in Fig. 2.9. This deviation is called the total deviation and it can be
split into two components:

• explained deviation ŷi − y that can be explained by the regression line.


For a given xi, the regression line yields ŷi − y = b (xi − x).
• residual ei = yi − ŷi that cannot be explained.

So, the following holds:


total deviation = explained deviation + residual
yi − y = (ŷi − y) + (yi − ŷi ) (2.33)

Fig. 2.9 Decomposition of the deviation from the mean

This is quite trivial and can be easily confirmed in Fig. 2.9. However, it is not trivial
that this equation is still valid if the elements are squared and summed over the observa-
tions.12 This results in the principle of the decomposition of the sample variation of Y.

total variation = explained variation + unexplained variation

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²  (with sums over i = 1, …, N)  (2.34)

SST = SSE + SSR
SST stands for total sum of squares and measures the sample variation of Y. It can be
calculated before doing a regression analysis by summing up the squared deviations of
the observed values of Y from their mean value. After having performed the regression,
the same can be done with the estimated values of Y. This yields the explained sum of
squares (SSE). More easily we can calculate SSE = SST – SSR.
Based on the decomposition of variation, R-square can also be calculated—alterna-
tively to Eq. (2.32)—as:
R² = 1 − SSR/SST = SSE/SST = explained variation / total variation  (2.35)
The decomposition of variation is valid for simple and multiple regression. Minimizing
SSR (the least-squares criterion) is identical to maximizing R2.
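The decomposition and Eq. (2.35) translate directly into a few lines of code. The sketch below (Python, for illustration) assumes arrays y and y_hat of observed and fitted values, for instance from an OLS fit as sketched above.

# Decomposition of variation (Eq. 2.34) and R-square (Eq. 2.35)
import numpy as np

def r_square(y, y_hat):
    sst = np.sum((y - np.mean(y)) ** 2)   # total variation (SST)
    ssr = np.sum((y - y_hat) ** 2)        # unexplained variation (SSR)
    sse = sst - ssr                       # explained variation (SSE)
    return sse / sst                      # identical to 1 - SSR/SST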
R-square is easy to interpret intuitively. But it is harder to say what is “a good
R-square”. This depends also on the area of application and the kind of data. In areas like
engineering, physics, or astronomy one usually gets much higher values than in areas

12 This holds only for linear models and LS estimation. The principle is also of central importance
for the analysis of variance (ANOVA, cf. Chap. 3) and for discriminant analysis (cf. Chap. 4).

like economics, sociology, or psychology, where human behavior is involved. Here one
has to cope with many influencing variables and a great amount of randomness. The kind
of data also makes a difference. For experimental data, we can expect higher values of
R-square than for observational data. And for aggregated data, we can expect higher
values than for individual data. So, to judge the value of R-square, it is necessary to
compare it with values from similar applications. Ultimately, the interpretation of all
measurements depends on comparisons.

2.2.3.3 Stochastic Model and F-test


The primary goal in applying regression analysis is not to achieve a maximum fit with
the data, but a good representation of reality. The data on which regression analysis is
based are usually sample data. If we repeat the sampling, we will get other data and
regression analysis will yield other estimates for the same problem. The data of repeated
samples will vary randomly and thus the results of repeated regression analyses will also
vary randomly. Therefore, the primary aim of regression analysis cannot be to give a
description of the data in the sample but to draw conclusions from the sample about the
population (part of reality) from which the sample was drawn.13
For this reason, regression analysis uses a stochastic model that covers the random-
ness inherent in sample data. According to this model, all output is seen as random, the
estimated parameters as well as the predictions.

Stochastic Model of Regression Analysis


The stochastic model of regression analysis comprises two components, a systematic
component and a stochastic component ε. In its generic form, the regression model can
be expressed by
Y = β0 + β1 X1 + β2 X2 + . . . + βJ XJ + ε (2.36)
with
Y dependent variable
Xj independent variables (j = 1, 2, …, J)
β0 constant term (intercept)
βj regression coefficient (j = 1, 2, …, J)
ε error term (disturbance)
The parameters of the model (in the systematic component) are assumed to be true val-
ues that we do not know and want to estimate. They are denoted by Greek letters to dis-
tinguish them from the estimated values.

13 This is called inferential statistics and has to be distinguished from descriptive statistics.
Inferential statistics makes inferences and predictions about a population based on a sample drawn
from the studied population.

The error term ε is the stochastic component. It represents all influences on Y that are
not explicitly contained in the systematic component. These may be errors in the meas-
urement of Y or influences that are unknown or cannot be measured. The error term is
also called disturbance, because it disturbs the estimation of the systematic component
that we are interested in. The errors are not observable, but they become manifest in the
residuals ei. We assume that the random errors are independent of the x-values and have
the mean value zero.
Regarding sales data, for example, there are countless influences by competitors,
retailers, and buyers. The behavior of humans always has some degree of randomness.
Besides, there are various macroeconomic, social, and other environmental influences.
Usually, the data are obtained by sampling and a random sampling error is unavoidable.
It is therefore justified to regard the error term as a random variable.
Since the dependent variable Y contains the error ε, it also is a random variable. Thus,
the estimated regression parameters bj that are obtained from observations of Y are real-
izations of random variables. In the case of repeated random samples, these estimates
fluctuate around the true values βj.
Based on this reasoning we can check the statistical significance of a model. The
research question is: Can the model, or at least one of the independent variables, contrib-
ute to explaining the variation of the dependent variable Y? To answer this question, we
test the null hypothesis
H0 : β1 = β2 = . . . = βJ = 0 (2.37)
versus the alternative hypothesis
H1: at least one βj is non-zero
To prove this, we have to reject the null hypothesis. This can be done by an F-test. For
this purpose, it is useful to organize the data in an ANOVA table as used in the analysis
of variance (see Chap. 3).

ANOVA Table
By dividing sums of squares (SS) by their corresponding degrees of freedom (df) we get
mean squares (MS) or sample variances. In Table 2.7 we do this with the values of our
Model 3 for the explained, residual and total sums of squares.

Table 2.7  Analysis of variance table (ANOVA) for Model 3

Source of variation | Sum of squares SS          | Degrees of freedom df | Mean squares MS
Explained           | SSE = Σ(ŷi − ȳ)² = 557,113 | df1 = J = 3           | MSE = SSE/J = 185,704
Residual            | SSR = Σ(yi − ŷi)² = 47,166 | df2 = N − J − 1 = 8   | MSR = SSR/(N − J − 1) = 5,896
Total               | SST = Σ(yi − ȳ)² = 604,279 | df3 = N − 1 = 11      | MST = SST/(N − 1) = 54,934

The degrees of freedom for the explained variation (df1) are given by the number of
independent variables in the regression model. The degrees of freedom for the residual
variation (df2) are given by the number of observations minus the number of parame-
ters in the regression model: df2 = N – (J + 1). For a model without an intercept, we get
df2 = N – J.
From Eq. (2.34) we know that the explained variation and the residual variation sum up
to the total variation: SST = SSE + SSR. The same holds for the corresponding degrees of
freedom: df3 = df1 + df2. But this is not valid for the variances: MST ≠ MSE + MSR.

F-test
For performing an F-test we must compute an empirical value of the F-statistic.14 With
the values in Table 2.7 we get for our Model 3:
Femp = MSE/MSR = explained variance / unexplained variance = 185,704/5,896 = 31.50  (2.38)
Under the null hypothesis, the F-statistic follows an F-distribution. Its density function
for the degrees of freedom in our example is displayed in Fig. 2.10 (the lower line).
Taking the degrees of freedom into account, we can write the F-statistic as a function of
R-square. The F-statistic in R-square form is
Femp = (R²/J) / ((1 − R²)/(N − J − 1))  (2.39)
Using this form, it is easy to test an R-square or a simple coefficient of correlation.15
From the empirical F-value, we can derive an empirical significance level, the
p-value. In SPSS, the p-value is referred to as “Significance” or “Sig”. Figure 2.10 shows
the p-value as a function of Femp. The greater Femp, the smaller is p. For Femp = 31.50
with df1 = J and df2 = N – J – 1 (in the numerator and the denominator) we get
p = 0.009%.16
We reject H0 if p < α and conclude that the estimated regression function is statisti-
cally significant with the probability α (alpha). α is the probability that H0 will be falsely
rejected, if it is true (type I error), and is also called the significance level. Commonly the
value α = 0.05 or 5% is chosen.17

14 Sect. 1.3 briefly discusses basics of statistical testing.


15 For a simple coefficient of correlation r, we get Femp = r² / ((1 − r²)/(N − 2)), as J = 1.
16 With Excel we can calculate the p-value by using the function F.DIST.RT(Femp;df1;df2). We get: F.DIST.RT(31.50;3;8) = 0.00009 or 0.009%.
17 The reader should be aware that other values for α are also possible. α = 5% is a kind of

“gold” standard in statistics that goes back to Sir R. A. Fisher (1890–1962) who also created the
F-distribution. But the researcher must also consider the consequences (costs) of making a wrong
decision.

Fig. 2.10 F-distribution and p-values: density f(F, 3, 8), p = Prob(F > Femp), and the critical F-value for α = 0.05

Figure 2.10 shows that our empirical F-value Femp = 31.50 is much larger than
the critical F-value for α = 5%. So our p is almost zero and it is practically impossible
that H0 is true. Our estimated regression function for Model 3 is highly significant. Our
Models 1 and 2 are also statistically significant, as can be checked with the given values
of R-square.
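The F-test can be reproduced with a few lines of code. The sketch below (Python, using scipy's F-distribution) computes the F-statistic in its R-square form, Eq. (2.39), and the corresponding p-value; the inputs are the Model 3 values given above.

# F-statistic in R-square form (Eq. 2.39) and its p-value
from scipy import stats

def f_test(r2, n, j):
    f_emp = (r2 / j) / ((1 - r2) / (n - j - 1))
    p = stats.f.sf(f_emp, j, n - j - 1)    # p = Prob(F > F_emp)
    return f_emp, p

print(f_test(0.922, 12, 3))   # roughly 31.5 and p near 0.0001 (0.009% in the text)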

2.2.3.4 Overfitting and Adjusted R-Square


R-square is the most common goodness-of-fit measure, but it has limitations. Everybody strives for a model with a high R-square, but the model with the higher R-square is not necessarily the better model.

• R-square does not take into account the number of observations (sample size N) on
which the regression is based. But we will have more trust in an estimation that is
based on 50 observations than in one that is based on only 5 observations. In the
extreme case with only two observations, a simple regression would always yield
R2 = 1 since a straight line can always be laid through two points without deviations.
But for this, we would not need a regression analysis.

• R-square does not consider the number of independent variables contained in the
regression model and thus the complexity of the model. We mentioned the principle
of parsimony for model building. Making a model more complex by adding variables
(increasing J) will always increase R-square, but not necessarily increase the good-
ness of the model.

The amount of “explanation” added by a new variable may only be a random effect.
Moreover, with an increasing number of variables, the precision of the estimates can
decrease due to multicollinearity between the variables (see Sect. 2.2.5.7).
Also, with too much fitting, called “overfitting”, “the model adapts itself too closely
to the data, and will not generalize well” (cf. Hastie et al. 2011, p. 38). This especially
concerns predictions: We are not interested in predicting a value yi that we used already
for estimating the model. We are more interested in predicting a value yN+i that we have
not yet observed. And for this, a simpler model may be better than a more complex
model, because every parameter in the model contains some error.
On the other hand, if the model is omitting relevant variables and is not complex
enough, called “underfitting”, the estimates of the model parameters will be biased, i.e.
contain systematic errors (see Sect. 2.2.5.2). Again, large prediction errors will result.
Remember: Modeling is a balancing act between simplicity and complexity, or
between underfitting and overfitting.
The inclusion of a variable in the regression model should always be based on log-
ical or theoretical reasoning. It is bad scientific style to haphazardly include several or
all available variables into the regression model in the hope of finding some independ-
ent variables with a statistically significant influence. This procedure is sometimes called
“kitchen sink regression”. With today’s software and computing power, the calculation is
very easy and such a procedure is tempting. As R-square cannot decrease by adding vari-
ables to a regression model, it cannot indicate the “badness” caused by overfitting.
For these reasons, in addition to R-square, an adjusted coefficient of determination
(adjusted R-square) should also be calculated. With the values in Table 2.7 we get:

R²adj = 1 − (SSR/(N − J − 1)) / (SST/(N − 1)) = 1 − MSR/MST = 1 − 5,896/54,934 = 0.893  (2.40)

with R²adj < R².

The adjusted R-square uses the same information as the F-statistic. Both statistics con-
sider the sample size and the number of parameters. To compare the adjusted R-square
with R-square, we can write:

R²adj = 1 − ((N − 1)/(N − J − 1)) · (1 − R²)  (2.41)

The adjusted R-square becomes smaller when the number of regressors increases (other
things being equal) and can also become negative. Thus, it penalizes increasing model
complexity or overfitting.18 In our example we get the following values:

• Model 1: sales = f (advertising)  R²adj = 0.506
• Model 2: sales = f (advertising, price)  R²adj = 0.485
• Model 3: sales = f (advertising, price, promotion)  R²adj = 0.893

By including price into the model, the adjusted R-square decreases. Price contributes
only little to explaining the sales volume and its contribution cannot compensate for
the penalty for increasing model complexity. With the inclusion of promotion, we get
another picture. Promotion strongly boosts the explained variation. Here the increase of
model complexity plays only a minor role.
The term adjusted R-square may be misunderstood, because R²adj is not the square of any correlation. Another name, corrected R-square, is also misleading, because it suggests that R² is false, which is not the case.
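A minimal sketch of Eq. (2.41), applied to the three models with the rounded R-square values reported above (Python, for illustration only):

# Adjusted R-square via Eq. (2.41), N = 12 observations
def adjusted_r2(r2, n, j):
    return 1 - (n - 1) / (n - j - 1) * (1 - r2)

for name, r2, j in [("Model 1", 0.551, 1), ("Model 2", 0.578, 2), ("Model 3", 0.922, 3)]:
    print(name, round(adjusted_r2(r2, 12, j), 3))
# approx. 0.506, 0.484, 0.893 (small deviations stem from the rounded R-square inputs)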

2.2.4 Checking the Regression Coefficients

1 Model formulation

2 Estimating the regression function

3 Checking the regression function

4 Checking the regression coefficients

5 Checking the underlying assumptions

2.2.4.1 Precision of the Regression Coefficient


If the check of the regression function (by F-test or p-value) has shown that our model
is statistically significant, the regression coefficients must now be checked individually.
By this, we want to get information about their precision (due to random error) and the
importance of the corresponding variables.
As explained above, the estimated regression parameters bj are realizations of random
variables. Thus, the standard deviation of bj, called the standard error of the coefficient,
can be used as an inverse measure of precision.

18 Other criteria for model assessment and selection are the Akaike information criterion (AIC) and
the Bayesian information criterion (BIC). See, e.g., Agresti (2013, p. 212); Greene (2012, p. 179);
Hastie et al. (2011, pp. 219–257).

For simple regression, the standard error of b can be calculated as:


SE(b) = SE / (s(x) · √(N − 1))  (2.42)
with
SE standard error of the regression
s(x) standard deviation of the independent variable
For our Model 1, sales = f (advertising), we get:
SE(b) = 164.7 / (18.07 · √(12 − 1)) = 2.75
We estimated b = 9.63, so the relative standard error is 2.75/9.63 = 0.29 or 29%.
It is instructive to examine the formula for SE(b) more closely. To achieve high
precision for an estimated coefficient it is not sufficient to get a good model fit, here
expressed by the standard error of the regression. Furthermore, precision increases, i.e.,
the standard error of b gets smaller, when the

• standard deviation s(x) of the regressor increases,


• sample size N increases.

Variation of the x-values and sufficient sample size are essential factors for getting reli-
able results in regression analyses. To make a comparison, you cannot get a stable posi-
tion by balancing on one leg. So, if the variance of the x-values and/or the sample size is
small, the regression analysis will be a shaky affair. In an experiment, the researcher can
control these two conditions. He can manipulate the independent variable(s) and deter-
mine the sample size. But mostly we have to cope with observational data. Experiments
are not always possible, and a larger sample size takes more time and leads to
higher costs.
For multiple regressions, the formula for the standard error of an estimated coefficient
extends to:

SE(bj) = SE / (s(xj) · √(N − 1) · √(1 − Rj²))  (2.43)

where Rj2 denotes the R-square for a regression of the regressor j on all other independent
variables. Rj2 is a measure of multicollinearity (see Sect. 2.2.5.7). It refers to the rela-
tionships among the x-variables. The precision of an estimated coefficient increases
(other things being equal) with a smaller Rj2, i.e. with less correlation of xj with the other
x-variables.

For our Model 3 and variable j = 1 (advertising) we get the following standard error
for b1:
SE(b1) = 76.8 / (18.07 · √(12 − 1) · √(1 − 0.089)) = 1.34
We estimated b1 = 7.91. So, the relative standard error for the coefficient of advertising
has now decreased to 0.17 or 17%. This is due to a substantial reduction of the standard
error of the regression in Model 3.
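A minimal sketch of Eq. (2.43), using the values reported above for the advertising coefficient in Model 3 (Python, for illustration only):

# Standard error of a coefficient in multiple regression
import math

def se_coefficient(se_regression, s_xj, n, r2_j):
    return se_regression / (s_xj * math.sqrt(n - 1) * math.sqrt(1 - r2_j))

print(se_coefficient(76.8, 18.07, 12, 0.089))   # ~1.34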

2.2.4.2 t-test of the Regression Coefficient


To test whether a variable Xj has an influence on Y we have to check whether the regres-
sion coefficient βj differs sufficiently from zero. For this, we must test the null hypothesis
H0 : βj = 0 versus the alternative hypothesis H1 : βj � = 0.
Again, an F-test may be applied. But a t-test is easier and thus more common in this
context. While the F-test can be used for testing a group of variables, the t-test is only
suitable for testing a single variable.19 For a single variable (df = 1), it holds: F = t 2.
Thus, both tests will give the same results.
The t-statistic (empirical t-value) of an independent variable j is calculated very sim-
ply by dividing the regression coefficient by its standard error:
temp = bj / SE(bj)  (2.44)
Under the null hypothesis, the t-statistic follows a Student’s t-distribution with N – J – 1
degrees of freedom. In our Model 3 we have 12 – 3 – 1 = 8 df. Figure 2.11 shows the
density function of the t-distribution for 8 df with quantiles (critical values) −tα/2 and tα/2
for a two-tailed t-test with an error probability α = 5%. For 8 df we get tα/2 = ±2.306.20
For our Model 3, we get the empirical t-values shown in Table 2.8. All these t-values
are clearly outside the region [−2.306, 2.306] and thus statistically significant at α = 5%.
The p-values are clearly below 5%.21 So we can conclude that all three marketing varia-
bles influence sales.
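The t-test can be reproduced with a few lines of code. The sketch below (Python, using scipy's t-distribution) computes the empirical t-value of Eq. (2.44) and its two-tailed p-value for the price coefficient of Model 3, with the values from Table 2.8.

# Empirical t-value (Eq. 2.44) and two-tailed p-value
from scipy import stats

b, se_b, df = -387.6, 121.1, 8
t_emp = b / se_b                                # ~ -3.20
p_two_tailed = 2 * stats.t.sf(abs(t_emp), df)   # ~ 0.0126
print(t_emp, p_two_tailed)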

One-tailed t-test—Rejection Region in the Upper Tail


An advantage of the t-test over the F-test is that it allows the application of a one-tailed
test, as the t-distribution has two tails. While the two-tailed t-test is the standard in

19 For a brief summary of the basics of statistical testing see Sect. 1.3.
20 With Excel we can calculate the critical value tα/2 for a two-tailed t-test by using the function T.INV.2T(α;df). We get: T.INV.2T(0.05;8) = 2.306.
21 The p-values can be calculated with Excel by using the function T.DIST.2T(ABS(temp);df). For the variable price we get: T.DIST.2T(3.20;8) = 0.0126 or 1.3%.

Fig. 2.11 t-distribution and critical values for error probability α = 5% (two-tailed t-test)

Table 2.8  Regression coefficients and statistics of Model 3

j | Regressor   | bj     | Std. error | t-value | p-value
1 | Advertising | 7.91   | 1.342      | 5.89    | 0.0004
2 | Price       | −387.6 | 121.1      | −3.20   | 0.0126
3 | Promotion   | 2.42   | 0.408      | 5.93    | 0.0003

regression analysis, a one-tailed t-test offers greater power since smaller deviations from
zero are now statistically significant and thus the danger of a type II error (accepting a
wrong null hypothesis) is reduced. But a one-tailed test requires more reasoning and a
priori knowledge on the researcher’s side.
A one-tailed t-test is appropriate if the test outcome has different consequences
depending on the direction of the deviation. Our manager will spend money on adver-
tising only if advertising has a positive effect on sales. He will not spend any money if
the effect is zero or if it is negative (while the size of the negative effect does not matter).
Thus, he wants to prove the alternative hypothesis
H1 : βj > 0 versus the null hypothesis H0 : βj ≤ 0 (2.45)
H0 states the opposite of the research question. The decision criterion is:
If temp > tα , then reject H0 (2.46)

Now the critical value for a one-tailed t-test at α = 5% is only tα = 1.86.22 This value
is much smaller than the critical value tα/2 = 2.306 for the two-tailed test. As the rejec-
tion region is only in the upper tail (right side), the test is also called an upper tail test.
The rejection region in the upper tail now has double the size (α instead of α/2). Thus, a lower
value of temp is significant.
Using the p-value, the decision criterion is the same as before: We reject H0 if p < α.
But the one-tailed p-value is just half the two-tailed p-value. Thus, if we know the two-
tailed p-value, it is easy to calculate the one-tailed p-value by dividing it by 2. Table 2.8
gives the p-value p = 0.0004 or 0.04% for the variable advertising. Thus, the one-tailed
p-value is p = 0.02%.23

One-tailed t-test—Rejection Region in the Lower Tail


The one-tailed t-test can be performed in an analogous way as above if the rejec-
tion region is in the lower tail (left side). In this case, we want to prove the alternative
hypothesis
H1 : βj < 0 versus the null hypothesis H0 : βj ≥ 0
The decision criterion is:
If temp < −tα, then reject H0.
For the variable price, we have the empirical t-value –3.20. As this value is lower than
the critical value −tα = –1.86, the effect of price is statistically significant at α = 5%.24
The p-value is 1.26/2 = 0.6%, clearly lower than 5%.

2.2.4.3 Confidence Interval of the Regression Coefficient


We used the t-test to check whether an unknown true regression coefficient βj differs
from zero. Another, quite similar question is now, in which range the value of the true
regression coefficient βj lies with a certain probability (confidence level). Regression
analysis yielded a point estimation bj. Now we ask for an interval estimation. For this
question we can use the same statistics that we have used before in the t-test.
bj is the best estimate we can get for βj. But βj is a constant and the estimate bj is a
random variable that will take another value for another sample. Thus both values will be
equal only by chance. In general, the estimate will contain an error e. So we can write:
bj − e ≤ βj ≤ bj + e  (2.47)

22 With Excel we can calculate the critical value tα for a one-tailed t-test by using the function
T.INV(1 – α;df). We get: T.INV(0.95;8) = 1.860.
23 With Excel we can calculate the p-value for the right tail by the function T.DIST.RT(temp;df).

For the variable advertising we get: T.DIST.RT(5.89;8) = 0.00018 or 0.018%.


24 Using Excel, we can calculate the critical value for a lower-tail t-test by T.INV(α;df).

We get: T.INV(0.05;8) = –1.860.



Table 2.9  95% confidence intervals for the parameters of Model 3

j | Regressor   | bj     | Std. error | Lower bound | Upper bound
1 | Advertising | 7.91   | 1.342      | 4.81        | 11.0
2 | Price       | −387.6 | 121.1      | −666.9      | −108.3
3 | Promotion   | 2.42   | 0.408      | 1.48        | 3.36

The confidence interval therefore is a range around bj in which the unknown value βj can
be found with a certain probability. Its size depends on a specified error probability α (or
confidence level 1 – α). We can calculate it as
bj − tα/2 · SE(bj ) ≤ βj ≤ bj + tα/2 · SE(bj ) (2.48)
with
βj true regression coefficient (unknown)
bj estimated regression coefficient
tα/2 theoretical t-value for error probability α and df = N–J–1
SE(bj) standard error of bj

With a probability (confidence level) of (1–α) the true value βj is located in the given
interval around the estimate bj. With an error probability of α it is located outside of the
confidence interval. The lower α, the larger the interval.
All values needed for the calculation were already used above for the t-test (see
Table 2.8). For the variable advertising and Model 3, we get
7.91 − 2.306 · 1.342 ≤ βj ≤ 7.91 + 2.306 · 1.342

4.81 ≤ βj ≤ 11.00
This is the interval for the error probability α = 5% or confidence level 95%. Thus, with a
probability of 95% the true regression coefficient of the variable ‘advertising’ is between
4.81 and 11.00. If we increase the confidence level, the interval will increase accord-
ingly. Table 2.9 shows the confidence intervals for all regression coefficients of Model 3.

2.2.5 Checking the Underlying Assumptions

1 Model formulation

2 Estimating the regression function

3 Checking the regression function

4 Checking the regression coefficients

5 Checking the underlying assumptions



Regression analysis is a powerful and relatively easy-to-use method, especially with


today’s computer power and software. But we also mentioned at the beginning that
regression analysis is prone to mistakes and misunderstandings. This concerns especially
the underlying assumptions of the regression model. Their violation can severely distort
the results or lead to wrong interpretations. And detecting violations is not as easy as
calculating a regression function with the help of a computer.
The data must be valid for the investigated problem. If the data are lacking validity or
representativity for the problem under investigation, the analysis is worthless (“garbage
in, garbage out”). This concerns every analysis and is not specific to regression.
The linear regression model can be formulated as follows:
yi = β0 + β1 x1i + . . . + βJ xJi + εi (i = 1, 2, . . . , N) (2.49)
The expected value of yi is
E(yi ) = β0 + β1 x1i + . . . + βJ xJi + E(εi ) (2.50)
The model is correctly specified when the expected value of any yi is equal to the sys-
tematic component:
E(yi ) = β0 + β1 x1i + . . . + βJ xJi (2.51)
This is only possible if
E(εi |x1i , x2i , . . . , xJi ) = 0 (2.52)
This is the central assumption of the linear regression model. From Eq. (2.52) we can
infer the assumptions 1, 2, 4, and 5 stated below, which concern the error term. A further
assumption concerns the distribution of the error term. Moreover, there are two assump-
tions concerning the statistical properties of the independent variables.25

Assumptions of the Linear Regression Model


• A1: Linearity in parameters
• A2: No relevant independent variables are missing: Cov(εi , xji ) = 0
• A3: The independent variables are measured without error
• A4: Homoscedasticity: The error terms have a constant variance: Var(εi ) = σ 2
• A5: No autocorrelation: The error terms are uncorrelated: Cov(εi , εi+r ) = 0
• A6: The error terms εi are normally distributed: εi ∼ N(0, σ 2 )
• A7: No perfect multicollinearity

Table 2.10 gives an overview on the violations of these assumptions and their conse-
quences, which we will discuss in the following. The first three assumptions are the most

25 Cf. e.g., Kmenta (1997, p. 392); Fox (2008, p. 105); Greene (2012, p. 92); Wooldridge (2016,
p. 79); Gelman and Hill (2018, p. 45). You will find slight differences between the formulations of
the different authors.

Table 2.10  Violations of the assumptions of regression analysis and their consequences

Assumption                        | Violation                 | Consequence
1. Linearity and additivity       | Nonlinearity              | Biased coefficients
2. No relevant variables missing  | Omitted variables         | Biased coefficients
3. Xs are observed without error  | Errors in Xs              | Biased coefficients
4. Homoscedasticity               | Heteroscedasticity        | Less precision
5. No autocorrelation             | Autocorrelation           | Less precision
6. Normal distribution of errors  | Non-normal errors         | Significance not valid
7. No perfect multicollinearity   | Perfect multicollinearity | No solution possible
                                  | Strong multicollinearity  | Low precision

important ones because they concern the validity of the results. Together with assump-
tions 4 and 5, the method of least squares yields unbiased and efficient linear estimations
of the parameters. This characteristic is called BLUE (best linear unbiased estimators),
where “best” stands for the smallest possible variance.26
Assumption 6 is needed for significance tests and confidence intervals. This assump-
tion is supported by the central limit theorem of statistics.27 Perfect multicollinearity
should not occur. If it does, there is a mistake in modeling. But strong multicollinearity is
a frequent problem.
In general, the effect or harm caused by a violation of these assumptions depends on
the degree of the violation. Thus, any violation can be harmful. But the good news is that
minor violations will not do any harm.
It should be emphasized that meeting the stated assumptions of the linear regres-
sion model is a necessary but not a sufficient condition for getting good estimates. For
achieving high precision of the estimates, regression analysis also requires a sufficient
variation (spread) in the independent variables, sufficiently large sample sizes, and low
multicollinearity.

2.2.5.1 Non-linearity
The world is non-linear. In almost all instances, linear models are a simplification of
reality. But after all, “all science is dominated by the idea of approximation” (Bertrand
Russell). For certain ranges, depending on the data, linear models can provide good
approximations, and they are much easier to handle than non-linear models. But if we

26 This follows from the Gauss-Markov theorem. See e.g. Fox (2008, p. 103); Kmenta (1997,
p. 216).
27 The central limit theorem plays an important role in statistical theory. It states that the sum or

mean of n independent random variables tends toward a normal distribution if n is sufficiently


large, even if the original variables themselves are not normally distributed. This is the reason why
a normal distribution can be assumed for many phenomena.

Fig. 2.12 Models of advertising response: concave, concave with wear-out, concave with saturation, S-shape with saturation

extend the range, a linear model may become inappropriate. If there is a strong non-lin-
ear relationship between Y and any x-variable, the expected value E(εi) cannot be zero for all values of X and it cannot be independent of X.
A good example of this phenomenon can be found in advertising. If we double the
budget, the effect will usually not double. The more money we spend, the less will
be the marginal gains. Figure 2.12 shows possible curves of non-linear advertising
response. The same models are also used in other areas (e.g., epidemiology, diffusion of
innovations).
By transforming variables we can handle many non-linear problems within the linear
regression model. Assumption A1 of the regression model only postulates that the model
is linear in the parameters. Thus, a variable in the model can be a non-linear function of
an observed variable. To model a concave advertising response, we can transform adver-
tising expenditures X by a square root

X′ = √X  (2.53)
and estimate the model

Y = α + β · X′ + ε (2.54)
by linear regression.
In general, any variable X in a regression model can be replaced by a variable X′ = f(X),

Table 2.11  Nonlinear transformations

No | Name        | Definition    | Range
1  | Square      | X²            | Unlimited
2  | Square root | √X            | X ≥ 0
3  | Power       | X^c           | X > 0
4  | Reciprocal  | 1/X           | X ≠ 0
5  | Logarithmic | ln(X)         | X > 0
6  | Exponential | exp(X)        | Unlimited
7  | Logit       | ln(X/(1 − X)) | 0 < X < 1
8  | Arcsine     | sin⁻¹(X)      | |X| ≤ 1
9  | Arctangent  | tan⁻¹(X)      | Unlimited

where f denotes a non-linear function (transformation) of the observed variable.


Table 2.11 shows examples of suitable nonlinear transformations. The permissible value
range is specified in each case.
Thus, the linear regression model can also be written in the following form:
f (Y ) = β0 + β1 · f1 (X1 ) + . . . + βJ · fJ (XJ ) + ε (2.55)
In multiple regression, the effects of the variables are always additive. A multiplicative
model can be linearized by taking logarithms on both sides:

Y = α · X^β · ε  (2.56)

ln Y = α′ + β · ln X + ε′  (2.57)

with α′ = ln α and ε′ = ln ε. This can be extended to multiple regression.
Another very flexible form of non-linear transformation is offered by polynomials. A
polynomial regression of the jth degree is given by

Y = β0 + β1 X + β2 X² + β3 X³ + . . . + βj X^j + ε  (2.58)
The regression line in Fig. 2.13 shows a polynomial of the 2nd degree. With a polyno-
mial of the 3rd degree, we can create S-shaped functions.
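The following sketch (Python with statsmodels, for illustration only) shows how transformed regressors keep the model linear in its parameters, as in Eqs. (2.53), (2.54) and (2.58). The arrays x and y are hypothetical placeholders, not the data of the running example.

# Non-linear relationships handled with transformed regressors
import numpy as np
import statsmodels.api as sm

x = np.array([10, 40, 90, 160, 250, 360], dtype=float)           # hypothetical
y = np.array([300, 610, 880, 1180, 1450, 1700], dtype=float)     # hypothetical

# Concave response: regress y on the square root of x (Eqs. 2.53-2.54)
sqrt_fit = sm.OLS(y, sm.add_constant(np.sqrt(x))).fit()

# Polynomial of 2nd degree: regress y on x and x squared (Eq. 2.58)
poly_fit = sm.OLS(y, sm.add_constant(np.column_stack([x, x ** 2]))).fit()
print(sqrt_fit.params, poly_fit.params)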

Interaction Effects
Another form of non-linearity can be caused by interaction effects if the joint effect of
two independent variables is greater or smaller than the sum of the individual effects.
Such effects can occur, for example, between price and promotions. A price reduction
will often not be noticed by consumers if it is not accompanied by a promotion. And the
effect of a promotion will be increased by a price reduction. That is why they often go
together.

Fig. 2.13 Scatterplot with non-linear advertising response (sales plotted against advertising)

Such interactions can be modeled by including the product of the two variables in the
model:
Y = β0 + β1 · A + β2 · B + β3 · A · B + ε (2.59)
with A for price and B for promotion. The product A × B is called an interaction term.
One of two interacting variables can also be a moderating variable (see Fig. 2.17).
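A minimal sketch of Eq. (2.59) in Python with statsmodels; the arrays a, b, and y are hypothetical placeholders for price, promotion, and sales, not the data of the running example.

# Interaction modeled by adding the product of two regressors (Eq. 2.59)
import numpy as np
import statsmodels.api as sm

a = np.array([2.1, 2.3, 2.0, 2.4, 2.2, 1.9, 2.5, 2.3])             # price (hypothetical)
b = np.array([0.0, 100, 50, 0, 200, 150, 80, 120])                 # promotion (hypothetical)
y = np.array([2500.0, 2700, 2600, 2400, 3000, 2900, 2550, 2750])   # sales (hypothetical)

X = sm.add_constant(np.column_stack([a, b, a * b]))   # A, B and the interaction term A*B
print(sm.OLS(y, X).fit().params)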

Detection of Non-linearity
Undiscovered non-linearity will lead to a bias (systematic distortion) in the estimated
parameters. Therefore, the detection and correct handling of non-linearities are impor-
tant. Often researchers are aware of non-linearities in their problems due to prior experi-
ence. If not, statistical tools can be used to check for non-linearity. A visual inspection of
a scatterplot of the data usually works best.
The scatterplot in Fig. 2.13 shows a non-linear association between sales and advertis-
ing expenditures. It can be seen that the mean of the error terms εi varies over the range
of the x-values. In the medium range, it will be above zero and for low or high expendi-
tures it will be below zero. Thus, E(εi |x1i , x2i , . . . , xJi ) = 0 is violated by non-linearity.
To detect non-linearities in multiple regressions we can plot the y-values against each
independent variable. Another possibility is the use of a Tukey-Anscombe plot that we
will discuss in the following section.

2.2.5.2 Omission of Relevant Variables


The omission of relevant variables (underfitting) is a very frequent specification error
that can cause biased estimates. In economics and social sciences, we often have very
many influencing variables. If one thinks of sales data, there are countless influences by
competitors, retailers, and buyers. It will not be possible to include all of them in the

regression model. Besides, it is not necessary. According to assumption 2, only relevant


variables should not be missing. So the question arises, what is a relevant variable?
From Eq. (2.52) E(εi |x1i , x2i , . . . , xJi ) = 0 we can infer:
A2: Cov(εi , xji ) = 0,
i.e., there is no correlation between the independent variables and the error term.
Now let us assume we have a correct model:
Y = β0 + β1 X 1 + β2 X2 + ε
and we falsely specify

Y = β̃0 + β̃1 X 1 + ε̃
with
ε̃ = ε + β2 X2
In the falsely specified model, the effect of X2 is absorbed by the error ε̃. If X1 and X2 are
correlated, then X1 and ε̃ in the second model are also correlated and A2 is violated (cf.
Kmenta 1997, p. 443; Fox 2008, p. 111).
For the two models above we estimate the regression functions:

Ŷ = a + b1 X 1 + b2 X2

Ŷ = ã + b̃1 X 1
The estimator b̃1 will be biased because it takes the effect of X2.
We will demonstrate this effect with our Models 1 and 2. Looking at the correla-
tion matrix in Table 2.6, we can see that price is positively correlated with advertising
(r = 0.155). Thus, the coefficient of advertising is biased in Model 1, where the price is
omitted. We estimated above:
Model 2: Ŷ = 814 + 9.97 X1 − 194.6 X2
Model 1: Ŷ = 560 + 9.63 X1
The estimate b̃1 = 9.63 for advertising in Model 1 shows a downward bias because it
contains the negative effect of price. By subtracting the coefficients of the “correct”
model (Model 2) from those of the biased model (Model 1) we get:

bias = 9.63 − 9.97 = −0.34


This is a negligible effect, but it may serve for illustration purposes.
We can calculate:

b̃1 = b1 + bias (2.60)


where
bias = b2 · r12 · s2/s1  (2.61)

From this formula we can learn: the bias increases with b2 and the correlation r12.
With the values in Tables 2.3 and 2.6 we get:
bias = −194.6 · 0.155 · 0.201/18.07 = −0.34
The bias is small here because the variable price has only a small influence on Y and is
only weakly correlated with advertising. The bias of omitting promotion is much larger
(>2) and positive. For training purposes, the reader may calculate this bias in Model 1.
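A minimal sketch of Eq. (2.61) with the values used above (Python, for illustration only):

# Omitted-variable bias via Eq. (2.61)
def omitted_variable_bias(b2, r12, s2, s1):
    return b2 * r12 * s2 / s1

print(omitted_variable_bias(-194.6, 0.155, 0.201, 18.07))   # ~ -0.34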
We summarize: An omitted variable is relevant if

• it has a significant influence on Y


• and is significantly correlated with the independent variables in the model.

An omitted variable does not cause bias if it is not correlated with the independent varia-
ble(s) in the model.

Detection of Omitted Variables


If relevant variables are omitted, E(εi ) and Corr(εi , xji ) will not be equal to zero.
To check this, we have to look at the residuals ei = yi − ŷi. The residuals can be ana-
lyzed by numerical or graphical methods. In the present case, a numerical analysis of the
residuals is problematic. By construction of the OLS method, the mean value of all the
residuals is always zero. Also, the correlations between the residuals and the x-variables
will all be zero. So, these statistics are of no help.
Thus, we need graphical methods to check the assumptions. Graphical methods are
often more powerful and easier to understand. An important kind of plot is the Tukey-
Anscombe plot, which involves plotting the residuals against the fitted y-values (on
the horizontal x-axis).28 For simple regression, it is equivalent to plotting the residuals
against the x-variable since the fitted y-values are linear combinations of the x-values.
According to the assumptions of the regression model, the residuals should scatter
randomly and evenly around the x-axis, without any structure or systematic pattern. For
an impression, Fig. 2.14 shows a residual plot with purely random scatter (for N = 75
observations). Deviations of the residual scatter from this ideal look would indicate that
the model is not correctly specified.
In our Model 1, sales = f (advertising), the variables price and promotion are omit-
ted. Figure 2.15 shows the Tukey-Anscombe plot for this Model. The scatterplot deviates
from the ideal shape in Fig. 2.14, and the difference would become even more pro-
nounced if we had more observations.
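Such a residual plot is easy to produce with standard tools. The sketch below (Python with matplotlib, for illustration) assumes a fitted statsmodels OLS result object and plots its residuals against its fitted values.

# Tukey-Anscombe plot: residuals against fitted y-values
import matplotlib.pyplot as plt

def tukey_anscombe_plot(result):
    plt.scatter(result.fittedvalues, result.resid)
    plt.axhline(0, linewidth=1)       # reference line at zero
    plt.xlabel("Fitted y-values")
    plt.ylabel("Residuals")
    plt.show()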

28 Anscombe and Tukey (1963) demonstrated the power of graphical techniques in data analysis.

Fig. 2.14 Scatterplot with purely random residuals (N = 75); residuals plotted against fitted y-values

Fig. 2.15 Scatterplot of the residuals for Model 1 (sales = f(advertising)); residuals plotted against estimated sales

For Model 3, which includes the variables price and promotion, we get the scatterplot
in Fig. 2.16. Now the suspicious scatter on the right-hand side of Fig. 2.15 has vanished.

Lurking Variables and Confounding


Choosing the variables in a regression model is the most challenging task on the side of
the researcher. It requires knowledge of the problem and logical reasoning. From Eq.
(2.60) we infer that a bias can

Fig. 2.16 Scatterplot of the residuals for Model 3 (sales = f(adv., price, promotion)); residuals plotted against estimated sales

• obscure a true effect,


• exaggerate a true effect,
• give the illusion of a positive effect when the true effect is zero or even negative.

Thus, great care has to be taken when concluding causality from a regression coefficient
(cf. Freedman 2002). Causality will be evident if we have experimental data.29 But most
data are observational data. To conclude causality from an association or a significant
correlation can be very misleading.
“Correlation is not causation” is a mantra that is repeated again and again in statistics.
The same applies to a regression coefficient. If we want to predict the effects of changes
in the independent variables on Y, we have to assume that a causal relationship exists.
But regression is blind for causality. Mathematically we can regress an effect Y on its
cause X (correct) but also a cause X on its effect Y. Data contain no information about
causality so it is the task of the researcher to interpret a regression coefficient as a causal
effect.
A danger is posed by the existence of lurking variables, that influence both the
dependent variable and the independent variable(s), but are not seen or known and thus
are omitted in the regression equation. Such variables are also called confounders (see
Fig. 2.17). They are confounding (confusing) the relationship between two variables X
and Y.

29 In an experiment the researcher actively changes the independent variable X and observes
changes of the dependent variable Y. And, as far as possible, he tries to keep out any other influ-
ences on Y. For the design of experiments see e.g. Campbell and Stanley (1966); Green et al.
(1988).

Fig. 2.17 Causal diagrams of confounding and mediation: (a) confounder Z, (b) confounder Z, (c) mediator Z

Example
A lot of surprise and confusion were created by a study on the relation between
chocolate consumption and the number of Nobel Prize winners in various countries
(R-square = 63%).30 It is claimed that the flavanols in dark chocolate (also contained
in green tea and red wine) have a positive effect on the cognitive functions. But one
should not expect to win a Nobel Prize just by eating enough chocolate. The con-
founding variable is probably the wealth or the standard of living in the observed
countries. ◄

Causal Diagrams
Confounding can be illustrated by the causal diagrams (a) and (b) in Fig. 2.17. In dia-
gram (a) there is no causal relationship between X and Y. The correlation between X and
Y, caused by the lurking variable Z, is a non-causal or spurious correlation. If the con-
founding variable Z is omitted, the estimated regression coefficient is equal to the bias in
Eq. (2.61).
In diagram (b) the correlation between X and Y has a causal and a non-causal part.
The regression coefficient of X will be biased by the non-causal part if the confounder Z
is omitted. The bias in the regression is given by Eq. (2.61).31
Another frequent problem in causal analysis is mediation, illustrated in diagram (c).
Diagrams (b) and (c) look similar and the dataset of (c) might be the same as in (b),
but the causal interpretation is completely different. A classic example of mediation is
the placebo effect in medicine: a drug can have a biophysical effect on the body of the
patient (direct effect), but it can also act via the patient’s belief in its benefits (indirect

30 Switzerland was the top performer in chocolate consumption and number of Nobel Prizes. See
Messerli, F. H. (2012). Chocolate consumption, Cognitive Function and Nobel Laureates. The New
England Journal of Medicine, 367(16), 1562–1564.
31 For causal inference in regression see Freedman (2012); Pearl and Mackenzie (2018, p. 72).

Problems like this one are covered by path analysis, originally developed by Sewall Wright (1889–
1988), and structural equation modeling (SEM), cf. e.g. Kline (2016); Hair et al. (2014).

effect). We will give an example of mediation in the case study (Sect. 2.3). Thus, one
must clearly distinguish between a confounder and a mediator.32

Inclusion of Irrelevant Variables


In contrast to the omission of relevant variables (underfitting), a model may also con-
tain too many independent variables (overfitting). This may be a consequence of incom-
plete theoretical knowledge and the resulting uncertainty. In this case, the researcher may
include all available variables into the model so as not to overlook any relevant ones. As
discussed in Sect. 2.2.3.4, such models are known as “kitchen sink models” and should
be avoided. As in many things it applies also here: more is not necessarily better.

2.2.5.3 Random Errors in the Independent Variables


A crucial assumption of the linear regression model is the assumption A3: The independ-
ent variables are measured without any error. As stated in the beginning, an analysis is
worthless if the data are wrong. But with regard to measurements, we have to distin-
guish between systematic errors (validity) and random errors (reliability). We permitted
random errors for Y. They are absorbed in the error term, which plays a central role in
regression analysis.
In practical applications of regression analysis, we also encounter random errors in
the independent variables. Such errors in measurement may be substantial if the vari-
ables are collected by sampling and/or surveys, especially in the social sciences.
Examples from marketing are constructs like image, attitude, trust, satisfaction, or brand
knowledge that can all influence sales. Such variables can never be measured with per-
fect reliability. Thus, it is important to know something about the consequences of ran-
dom errors in the independent variables.
We will illustrate this by a small simulation. We choose a very simple model:

Y ∗ = X∗
which forms a diagonal line.
Now we assume that we can observe Y and X with the random errors εx and εy:
Y = Y ∗ + εy and X = X ∗ + εx
We assume the errors are normally distributed with means of zero and standard devia-
tions σεx and σεy .
Based on these observations of Y and X we estimate as usual:

Ŷ = a + b · X

32 “Mistaking
a mediator for a confounder is one of the deadliest sins in causal inference.” (Pearl
and Mackenzie 2018, p. 276).

Fig. 2.18 Scenarios with different error sizes (N = 300); each panel shows the observed values, the estimated regression line, the diagonal (true model), the SD line, and the means of X and Y

What is important now is that the two similar errors εx and εy have quite different effects
on the regression line. We will demonstrate this with the following four scenarios that are
illustrated in Fig. 2.18:

1) σεx = σεy = 0
No error. All observations are lying on the diagonal, the true model. By regression we
get correctly a = 0 and b = 1. The regression line is identical to the diagonal.
2) σεx = 0, σεy = 50
We induce an error in Y. This is the normal case in regression analysis. Despite con-
siderable random scatter of the observations, the estimated regression line (solid line)
shows no visible change.

Table 2.12  Effects of error size on standard deviations, correlation, and estimation

s(ex) | s(ey) | s(x) | s(y) | r(x,y) | a     | b
0     | 0     | 87   | 87   | 1.00   | 0.0   | 1.000
0     | 50    | 87   | 100  | 0.87   | −0.1  | 0.999
50    | 50    | 103  | 100  | 0.74   | 67.2  | 0.724
100   | 50    | 137  | 100  | 0.57   | 144.5 | 0.415

The slope of the SD line (dashed line) has slightly increased because the standard
deviation of Y has been increased by the random error in Y.
3) σεx = 50, σεy = 50.
We now induce an error in X that is equal to the error in Y. The regression line moves
clockwise. The estimated coefficient b < 0.75 is now biased downward (toward zero).
The slope of the SD line has also slightly decreased because the standard deviation
of X has been increased by the random error in X. The deviation between the SD line
and the regression line has increased because the correlation between X and Y has
decreased (random regression effect).
4) σεx = 100, σεy = 50
We now double the error in X. The effects are the same as in 3), but stronger. The
coefficient b < 0.5 is now less than half of the true value.

Table 2.12 shows the numerical changes in the four different scenarios.
The effect of the measurement error in X can be expressed by
b = β · reliability (2.62)
where β is the true regression coefficient (here β = 1) and reliability expresses the
amount of random error in the measurement of X. We can state:
reliability = σ²(X*) / (σ²(X*) + σ²(εx)) ≤ 1  (2.63)
Reliability is 1 if the variance of the random error in X is zero. The greater the random
error, the lower the reliability of the measurement.
Diminishing reliability affects the correlation coefficient as well as the regression
coefficient. But the effect on the regression coefficient is stronger, as the random error in
X also increases the standard deviation of X.
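The attenuation effect can be reproduced by a small simulation in the spirit of scenarios 2 to 4 above. The sketch below (Python, for illustration; the random numbers are placeholders, not the data behind Table 2.12) shows how a growing random error in X shrinks the estimated slope toward zero, roughly in line with b = β · reliability.

# Simulated attenuation: true model y* = x*, i.e. beta = 1
import numpy as np

rng = np.random.default_rng(1)
n = 300
x_true = rng.uniform(0, 500, n)
y = x_true + rng.normal(0, 50, n)                 # random error in Y only

for sd_ex in (0, 50, 100):                        # growing random error in X
    x_obs = x_true + rng.normal(0, sd_ex, n)
    b = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)      # OLS slope estimate
    rel = x_true.var(ddof=1) / (x_true.var(ddof=1) + sd_ex ** 2)
    print(sd_ex, round(b, 3), round(rel, 3))      # b roughly tracks beta * reliability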
The effect of biasing the regression coefficient toward zero is called regression to the
mean (moving back to the average), whence regression got its name.33 It is important

33 The expression goes back to Francis Galton (1886), who called it “regression towards medioc-
rity”. Galton wrongly interpreted it as a causal effect in human heredity. It is ironic that the first
and most important method of multivariate data analysis got its name from something that means
the opposite of what regression analysis actually intends to do. Cf. Kahneman (2011, p. 175); Pearl
and Mackenzie (2018, p. 53).


Fig. 2.19 Heteroscedasticity

to note that this is a purely random effect. To mistake it for a causal effect is called the
regression fallacy (regression trap).34
In practice, it is difficult to quantify this effect, because we usually do not know the
error variances.35 But it is important to know about its existence so as to avoid the regres-
sion fallacy. If there are considerable measurement errors in X, the regression coefficient
tends to be underestimated (attenuated). This causes non-significant p-values and type II
errors in hypothesis testing.
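The attenuation effect is easy to reproduce in a small simulation. The following is a minimal Python/NumPy sketch (simulated data that roughly mirror the scenarios of Table 2.12; the variable names are chosen here for illustration and are not part of the SPSS workflow). It shows how the estimated slope approaches b = β · reliability as the measurement error in X grows, cf. Eqs. (2.62) and (2.63):

import numpy as np

# Simulation sketch: random measurement error in X attenuates the OLS slope
# towards b = beta * reliability, cf. Eqs. (2.62) and (2.63).
rng = np.random.default_rng(0)
N, beta = 100_000, 1.0
x_true = rng.normal(250, 87, N)               # error-free regressor X*
y = beta * x_true + rng.normal(0, 50, N)      # true model plus random error in Y

for sd_ex in (0, 50, 100):                    # standard deviation of the error in X
    x_obs = x_true + rng.normal(0, sd_ex, N)  # observed (error-prone) X
    b = np.polyfit(x_obs, y, 1)[0]            # OLS slope of Y on the observed X
    reliability = 87**2 / (87**2 + sd_ex**2)
    print(f"sd(ex)={sd_ex:3}  b={b:.3f}  beta*reliability={beta * reliability:.3f}")

With a large simulated sample the estimated slope comes very close to β · reliability, whereas the small-sample estimates in Table 2.12 additionally contain random scatter.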

2.2.5.4 Heteroscedasticity
Assumption 3 of the regression model states that the error terms should have a constant
variance. This is called homoscedasticity, and non-constant error variance is called het-
eroscedasticity. Scedasticity means statistical dispersion or variability and can be meas-
ured by variance or standard deviation.
As the error term cannot be observed, we again have to look at the residuals.
Figure 2.19 shows examples of increasing and decreasing dispersion of the residuals in a
Tukey-Anscombe plot.

34 Cf. Freedman et al. (2007, p. 169). In econometric analysis this effect is called least squares
attenuation or attenuation bias. Cf., e.g., Kmenta (1997, p. 346); Greene (2012, p. 280);
Wooldridge (2016, p. 306).
35 In psychology great efforts have been undertaken, beginning with Charles Spearman in 1904, to

measure empirically the reliability of measurement methods and thus derive corrections for attenu-
ation. Cf., e.g., Hair et al. (2014, p. 96); Charles (2005).

Heteroscedasticity does not lead to biased estimators, but the precision of least-
squares estimation is impaired. Also, the standard errors of the regression coefficients,
their p-values, and the estimation of the confidence intervals become inaccurate.
To detect heteroscedasticity, a visual inspection of the residuals by plotting them
against the predicted (estimated) values of Y is recommended. If heteroscedasticity is
present, a triangular pattern is usually obtained, as shown in Fig. 2.19. Numerical testing
methods are provided by the Goldfeld-Quandt test and the method of Glejser.36

Goldfeld-Quandt test
A well-known test to detect heteroscedasticity is the Goldfeld-Quandt test, in which the
sample is split into two sub-samples, e.g. the first and second half of a time series, and
the respective variances of the residuals are compared. If perfect homoscedasticity exists,
the variances must be identical:

s1² = s2²,
i.e. the ratio of the two variances of the subgroups will be 1. The further the ratio devi-
ates from 1, the more uncertain the assumption of equal variance becomes. If the errors
are normally distributed and the assumption of homoscedasticity is correct, the ratio
of the variances follows an F-distribution and can, therefore, be tested against the null
hypothesis of equal variance:

H0: σ1² = σ2².
The F-test statistic is calculated as follows:
Femp = s1² / s2²

with s1² = Σ ei² / (N1 − J − 1) and s2² = Σ ei² / (N2 − J − 1)   (2.64)

where the first sum runs over the N1 cases of subgroup 1 and the second over the N2 cases of subgroup 2.
N1 and N2 are the numbers of cases in the two subgroups and J is the number of independent variables in the regression. The groups are to be arranged in such a way that s1² ≥ s2² applies. The empirical F-value is to be tested at a given significance level against the theoretical F-value for (N1 − J − 1, N2 − J − 1) degrees of freedom.
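For readers who want to see the mechanics outside of SPSS, the following is a minimal Python/NumPy sketch (hypothetical data and function names). It computes the Goldfeld-Quandt F-value of Eq. (2.64) by fitting a separate regression in each half of the ordered sample and comparing the residual variances; the comparison with the critical F-value is omitted here:

import numpy as np

# Goldfeld-Quandt F value, Eq. (2.64): separate regressions in the two halves
# of the ordered sample, then ratio of the residual variances (s1^2 >= s2^2).
def goldfeld_quandt_F(X, y):
    """X: (N, J) regressor matrix, y: (N,) response, rows ordered as in the data."""
    N, J = X.shape
    half = N // 2
    variances = []
    for Xg, yg in ((X[:half], y[:half]), (X[half:], y[half:])):
        Xg1 = np.column_stack([np.ones(len(yg)), Xg])      # add the constant term
        coeff, *_ = np.linalg.lstsq(Xg1, yg, rcond=None)
        e = yg - Xg1 @ coeff
        variances.append(e @ e / (len(yg) - J - 1))
    return max(variances) / min(variances)                 # F_emp

rng = np.random.default_rng(1)
x = np.linspace(1, 100, 60)
y = 2 + 0.5 * x + rng.normal(0, 0.1 * x)                   # error grows with x
print(round(goldfeld_quandt_F(x.reshape(-1, 1), y), 2))    # clearly larger than 1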

Method of Glejser
An easier way of detecting heteroscedasticity is the method of Glejser, in which the absolute residuals are regressed on the regressors:

36 An overview of this test and other tests is given by Kmenta (1997, p. 292); Maddala and Lahiri

(2009, p. 214).


Fig. 2.20 Positive and negative autocorrelation

|ei| = β0 + Σj βj xji   (sum over j = 1, …, J)   (2.65)

In the case of homoscedasticity, the null hypothesis H0: βj = 0 (j = 1, 2, …, J) applies. If significant non-zero coefficients result, the assumption of homoscedasticity must be rejected.
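A minimal Python/NumPy sketch of this auxiliary regression (the function name is chosen here for illustration; the significance test of the slopes is not shown) could look as follows:

import numpy as np

# Glejser's auxiliary regression: the absolute residuals are regressed on the
# regressors, Eq. (2.65); slopes clearly different from zero indicate
# heteroscedasticity (the significance test itself is omitted in this sketch).
def glejser_slopes(X, e):
    """X: (N, J) matrix of regressors, e: (N,) residuals of the original regression."""
    X1 = np.column_stack([np.ones(len(e)), X])
    coeff, *_ = np.linalg.lstsq(X1, np.abs(e), rcond=None)
    return coeff[1:]                    # beta_1, ..., beta_J (constant term dropped)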

Coping with Heteroscedasticity


Heteroscedasticity can be an indication of nonlinearity or the omission of some relevant
influence. Thus, the test for heteroscedasticity can also be understood as a test for nonlin-
earity and we should check for this. In the case of nonlinearity, transforming the depend-
ent variable and/or the independent variables (e.g., to logs) will often help.

2.2.5.5 Autocorrelation
Assumption 4 of the regression model states that the error terms are uncorrelated. If this
condition is not met, we speak of autocorrelation. Autocorrelation occurs mainly in time
series but can also occur in cross-sectional data (e.g., due to non-linearity). The devia-
tions from the regression line are then no longer random but depend on the deviations of
previous values. This dependency can be positive (successive residual values are close
to each other) or negative (successive values fluctuate strongly and change sign). This is
illustrated by the Tukey-Anscombe plot in Fig. 2.20.
Like heteroscedasticity, autocorrelation usually does not lead to biased estimators,
but the efficiency of least-squares estimation is diminished. The standard errors of the

regression coefficients, their p-values, and the estimation of the confidence intervals
become inaccurate.

Detection of Autocorrelation
To detect autocorrelation, again a visual inspection of the residuals is recommended by
plotting them against the predicted (estimated) values of Y.
A computational method for testing for autocorrelation is the Durbin-Watson test. The
Durbin-Watson test checks the hypothesis H0 that the errors are not autocorrelated:
Cov(εi , εi+r ) = 0
with r ≠ 0. To test this hypothesis, a Durbin-Watson statistic DW is calculated from the
residuals.
DW = Σ(ei − ei−1)² / Σ ei² ≈ 2 · [1 − Cov(εi, εi−1)]   (2.66)

where the sum in the numerator runs over i = 2, …, N and the sum in the denominator over i = 1, …, N.

The formula considers only a first-order autoregression. Values of DW close to 0 or close


to 4 indicate autocorrelation, whereas values close to 2 indicate that there is no autocor-
relation. The following holds:

• DW → 0 in the case of positive autocorrelation: Cov(ei, ei−1) = 1,
• DW → 4 in the case of negative autocorrelation: Cov(ei, ei−1) = −1,
• DW → 2 if there is no autocorrelation: Cov(ei, ei−1) = 0.

For sample sizes around N = 50, the Durbin-Watson statistic should roughly be between
1.5 and 2.5 if there is no autocorrelation.
More exact results can be achieved by using the critical values dL (lower limit) and
dU (upper limit) from a Durbin-Watson table. The critical values for a given significance
level (e.g., α = 5%) vary with the number of regressors J and the number of observations
N.
Figure 2.21 illustrates this situation. It shows the acceptance region for the null
hypothesis (that there is no autocorrelation) and the rejection regions. And it also shows
that there are two regions of inconclusiveness.

Decision Rules for the (Two-sided) Durbin-Watson Test (Test of H0: d = 2):
1. Reject H0 if: DW < dL or DW > 4 − dL (autocorrelation).
2. Do not reject H0 if: dU < DW < 4 − dU (no autocorrelation).
3. The test is inconclusive in all other cases.

[Regions on the DW scale from 0 to 4: positive autocorrelation below dL, inconclusive between dL and dU, no autocorrelation between dU and 4 − dU, inconclusive between 4 − dU and 4 − dL, negative autocorrelation above 4 − dL]
Fig. 2.21 Regions of the Durbin-Watson statistic

For our data (Model 1) we get DW = 2.04. This is very close to 2, so the hypothesis of no autocorrelation cannot be rejected.37 There is no reason to suspect autocorrelation.
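Since Eq. (2.66) only involves the residuals, the Durbin-Watson statistic is easy to compute directly. The following minimal Python/NumPy sketch (artificial residuals; names chosen here for illustration) returns a value near 2 for independent residuals and a value near 0 for strongly positively autocorrelated residuals:

import numpy as np

# Durbin-Watson statistic computed directly from the residuals, Eq. (2.66).
def durbin_watson(e):
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(2)
white = rng.normal(size=200)            # independent residuals       -> DW close to 2
trended = np.cumsum(white)              # positively autocorrelated   -> DW close to 0
print(round(durbin_watson(white), 2), round(durbin_watson(trended), 2))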

Coping with Autocorrelation


Autocorrelation, like heteroscedasticity, can be an indication of nonlinearity or the
omission of some relevant variable(s). Thus, the test for autocorrelation can also be
understood as a test for nonlinearity, and we should check for this. Often nonlinear trans-
formations can help. In the case of time-series data the inclusion of dummy variables can
often solve the problem (see Sect. 2.4.2).

2.2.5.6 Non-normality
The last assumption concerning the error terms states that the errors are normally dis-
tributed. This assumption is not necessary for getting unbiased and efficient estimations
of the parameters. But it is necessary for the validity of significance tests and confidence
intervals. For these, it is assumed that the estimated values of the regression parameters
are normally distributed.38 If this is not the case, the tests are not valid.
As the errors cannot be observed, we again have to look at the residuals to check the
normality assumption. Graphical methods are best suited for doing so.39 A simple way is
to inspect the distribution of the residuals via a histogram (Fig. 2.22). Since the normal
distribution is symmetric, this should also apply to the distribution of the residuals. But
for small sample sizes, this might not be conclusive.

37 From a Durbin-Watson table we derive the values dL = 0.97 and dU = 1.33 and thus
1.33 < DW < 2.67 (no autocorrelation).
38 If the errors are normally distributed, the y-values, which contain the errors as additive elements,

are also normally distributed. And since the least-squares estimators form linear combinations of
the y-values, the parameter estimates are normally distributed, too.
39 Numerical significance tests of normality: the Kolmogorov-Smirnov test and the Shapiro-Wilk

test.

Fig. 2.22 Histogram of the residuals

Fig. 2.23 Q-Q plot and P-P plot, based on standardized residuals

Better instruments for checking normality are specialized probability plots such as the
Q-Q plot and the P-P plot. Both are based on the same information and they give similar
results (see Fig. 2.23). They look at the same thing from different sides.

• Q-Q plot: the standardized residuals, sorted in ascending order, are plotted along the
x-axis and the corresponding quantiles of the standard normal distribution are plotted
along the y-axis.
• P-P plot: The expected cumulative probabilities of the (sorted) standardized residuals
are plotted along the y-axis against the cumulative proportions (probabilities) of the
observations on the x-axis.

Under the normality assumption, the points should scatter randomly along the diagonal
(x = y line). This is the case here. Slight deviations at the ends are frequently encountered
and pose no problem.
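For readers who want to reconstruct such plots outside of SPSS, the following minimal Python sketch (using NumPy and SciPy; the plotting positions (i − 0.5)/N are a common convention and the residuals are simulated here) computes the coordinates of both plots from sorted standardized residuals:

import numpy as np
from scipy.stats import norm

# Coordinates of a Q-Q and a P-P plot from standardized residuals.
rng = np.random.default_rng(3)
z = np.sort(rng.standard_normal(40))            # sorted standardized residuals
p = (np.arange(1, len(z) + 1) - 0.5) / len(z)   # cumulative proportions

qq_x, qq_y = z, norm.ppf(p)                     # Q-Q: residuals vs. normal quantiles
pp_x, pp_y = p, norm.cdf(z)                     # P-P: proportions vs. expected cum. probabilities
# Under normality both point clouds scatter closely around the diagonal x = y.
print(np.round(qq_y[:3], 2), np.round(pp_y[:3], 2))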

If the normality assumption is violated, one should not worry too much. For large
samples (N > 40) the estimated parameters will be normally distributed, even if the errors
are not normally distributed. This follows from the central limit theorem of statistical
theory. So, the significance tests and confidence intervals will be approximately correct.
But with small samples, one should be careful. Significance levels and confidence inter-
vals cannot be interpreted in the usual way.
A violation of the normality assumption is often the consequence of some other vio-
lations, e.g., missing variables, non-linearities, or outliers. After fixing these problems,
non-normality often disappears.

2.2.5.7 Multicollinearity and Precision


With empirical data, there is always a certain amount of multicollinearity between the
independent variables. Otherwise, we would not need multiple regression analysis and
could instead perform a simple regression for each independent variable. Assumption 7
of the regression model states that there must not be perfect multicollinearity, i.e. there
must not be a linear relationship between the regressors. In this case, the matrix of the
x-values would have no full rank and regression analysis would mathematically not be
feasible.40
Perfect multicollinearity seldom occurs, and if it does, it is mostly a result of misspec-
ifications. So, the problem can usually be solved easily.
More important is the case of high multicollinearity. The question is: What is high or
harmful multicollinearity? And what is the consequence? As multicollinearity does not
concern the stochastic model of regression, the least-squares method provides unbiased
and efficient estimators (best linear unbiased estimators or BLUE) even in the presence
of multicollinearity. But with increasing multicollinearity the estimates of the regression
parameters become less reliable (the precision decreases). And multicollinearity usually
increases with the number of variables in the model. For this reason, parsimony in model
building is important.
If two regressors contain similar information, it is difficult to separate their effects.
This can be illustrated graphically using a Venn diagram (Fig. 2.24). The scatters of the
dependent variable Y and the two regressors X1 and X2 are each represented by circles.41
The overlapping of the two independent variables, areas C and D, represents the colline-
arity between these two variables.
For the estimation of coefficient b1 only the information in area A can be used, and for
the estimation of b2 the information in area B. The information in area C, on the other
hand, cannot be assigned individually to the regressors and therefore cannot be used to
estimate their coefficients. Thus, the standard errors of the coefficients become larger.

40 The matrix X'X is singular and cannot be inverted.


41 Numerically, these areas can be expressed by their sums of squares: SSY = Σ(yk − ȳ)² and SSXj = Σ(xjk − x̄j)².

Fig. 2.24 Venn diagram
[Three overlapping circles representing SSY, SSX1, and SSX2; the overlap areas are labeled A, B, C, and D]

However, the overlapping information is not completely lost. It reduces the standard
error of the regression and thus increases R-square and also the accuracy of forecasts.
As a result of multicollinearity, R-square may be significant although none of the
coefficients in the regression function is significant. Another consequence of multicol-
linearity may be that the regression coefficients change significantly if another variable is
included in the model or if a variable is removed. Thus, the estimates become unreliable.

Detection of Multicollinearity
To prevent the problem of multicollinearity, it is first necessary to detect it, i.e. to deter-
mine which variables are affected and how strong the extent of multicollinearity is. The
collinearity between two variables can be measured by the correlation coefficient. Thus,
a first clue may be provided by looking at the correlation matrix. High correlation coef-
ficients between the independent variables can point to a collinearity problem. However,
the correlation coefficient measures only pairwise relations. Therefore, high-grade multi-
collinearity can also exist despite consistently low values for the correlation coefficients
of the independent variables.42
To detect multicollinearity, it is therefore recommended to regress each independent
variable Xj on the other independent variables to determine their multiple relationships.
A measure for this is the corresponding squared multiple correlation coefficient denoted by Rj². A large value of Rj² means that the variable Xj may be approximately generated by a linear combination of the other independent variables and is therefore redundant. Rj² can thus be used as a measure of redundancy of the variable Xj. The complementary value Tj = 1 − Rj² is called the tolerance of variable j.

42 See Belsley et al. (1980, p. 93).



Table 2.13  Collinearity statistics for Model 3

             Advertising    Price    Promotion
 Tolerance      0.911       0.906      0.850
 VIF            1.098       1.104      1.177

The reciprocal of the tolerance value is the variance inflation factor (VIF) of variable
Xj, which is currently the most common measure of multicollinearity:
VIFj = 1 / (1 − Rj²)   (2.67)
In statistical software for regression analysis, both tolerance and VIF can usually be used for checking multicollinearity. Exact cut-off values, however, cannot be given: for Tj = 0.2 you get VIFj = 5, and for Tj = 0.1 you get VIFj = 10. Such critical values can be found in the literature.43
In our data, we do not find considerable multicollinearity, as the statistics in
Table 2.13 show.
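The auxiliary regressions behind tolerance and VIF can be written down in a few lines. The following minimal Python/NumPy sketch (the function name is chosen here for illustration) regresses each regressor on the remaining ones and converts Rj² into Tj and VIFj according to Eq. (2.67):

import numpy as np

# Tolerance and VIF via auxiliary regressions: each X_j is regressed on the
# remaining regressors; R_j^2 yields T_j = 1 - R_j^2 and VIF_j = 1/T_j, Eq. (2.67).
def tolerance_and_vif(X):
    """X: (N, J) matrix of the independent variables (without the constant)."""
    N, J = X.shape
    results = []
    for j in range(J):
        y_j = X[:, j]
        X_rest = np.column_stack([np.ones(N), np.delete(X, j, axis=1)])
        coeff, *_ = np.linalg.lstsq(X_rest, y_j, rcond=None)
        resid = y_j - X_rest @ coeff
        r2_j = 1 - resid.var() / y_j.var()
        results.append((1 - r2_j, 1 / (1 - r2_j)))        # (tolerance, VIF)
    return results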

Precision of the Estimated Regression Coefficients


Equation (2.43) gave a measure of precision for the estimated regression parameters bj.
Now we can write for the standard error of the coefficient bj:
SE(bj) = SE / (s(xj) · √(N − 1)) · √VIFj   (2.68)
For our variable promotion, for example, we get again (cf. Table 2.8):
SE(b3) = 76.8 / (61.53 · √(12 − 1)) · √1.177 = 0.408
In Eq. (2.68) we can identify four issues for the precision of an estimated regression
coefficient.
The precision increases (the standard error decreases) with

a) the variation of the regressor s(xj ),


b) sample size N.

The precision decreases (the standard error increases) with

c) the standard error of the regression SE


d) multicollinearity.

43 Very small tolerance values can lead to computational problems. By default, SPSS will not allow
variables with Tj < 0.0001 to enter the model.

Once the data are given, factors a) and b) cannot be changed anymore. Factors c) and
d) can be changed by the researcher by changing the model. A simple way to prevent
high multicollinearity is to remove variables with large VIF. But by removing variables
from the model, the model fit usually decreases and the standard error of the regression
increases. So, this is a balancing act. It becomes problematic if a variable with large VIF
is of primary interest to the researcher. In this case he is possibly faced with the dilemma
of either removing the variable and thus possibly compromising the purpose of the inves-
tigation or keeping the variable and accepting the consequences of multicollinearity.
Factor analysis (see Chap. 7) can be very useful for coping with multicollinearity. It
helps to analyze the interrelationships among the independent variables because it may
help to select variables with low correlation or to create composite variables (indicators)
by combining two or more variables into a new variable (e.g. by summation or averag-
ing) and thus diminishing multicollinearity (cf. Hair et al. 2010, pp. 123–126). Or one
can also do a regression on the factors, which are always uncorrelated (they are compos-
ites of all variables). However, if the regressors are replaced by factors, this may jeopard-
ize the actual purpose of the investigation, as the factors cannot be observed.
The simplest way to prevent multicollinearity is by increasing the sample size. But
this usually costs time and money and is not always possible.44

2.2.5.8 Influential Outliers
Empirical data often contain one or more outliers, i.e. observations that deviate substan-
tially from the other data. Regression analysis is susceptible to outliers because the resid-
uals are squared in the OLS method. Therefore, an outlier can have a strong influence on
the result of the analysis. In this case, the outlier is called influential.45
To find out whether an outlier is influential, we first have to detect the outlier(s).
And if an outlier is influential, we have to check whether this leads to a violation of the
assumptions.
Outliers can arise for different reasons. They can be due to

• chance (a random addition of influences),


• a mistake in measuring or data entry,
• an unusual event outside the research context (e.g., sales can go down in one period
because of a delay in supply due to a strike by the union of railway workers),
• an unusual event within the research context (e.g., sales of chocolate go up before
Xmas or Easter. In this case, the outlier contains valuable information for the man-
ager, and the cause should be included in the model).

44 Another method to counter multicollinearity, which is beyond the scope of this text, is ridge
regression. By this method one trades a small amount of bias in the estimators for a large reduction
in variance. See Fox (2008, p. 325); Kmenta (1997, p. 440); Belsley et al. (1980, p. 219).
45 Excellent treatments of this topic may be found in Belsley et al. (1980); Fox (2008, p. 246).

SPSS provides numerous statistics.




Fig. 2.25 Regression line (solid line) for data with a high-leverage outlier, while the dashed line shows
the correct regression line

To demonstrate the effect of outliers, we will use a small simulation. We go back to our
Model 1 for simple regression. Table 2.5 shows the data and residuals. Now imagine
that a “small mistake” happened. For sales in the first period (i = 1) a wrong digit was
entered: instead of the correct value 2596, the value 2996 was entered, which is an
increase of 400 units.
If the expected value of the error in period 1 was E(ε1 ) = 0 before the mistake, it will
now be E(ε1 ) = 400. This is a violation of condition (2.51). It is instructive to look for
the effects of this mistake on the regression results.
Figure 2.25 shows the scatterplot with the outlier, point (203, 2996) on the left side
(represented by the bullet). And it also shows the corresponding regression line (solid
line). The dashed line represents the regression line from Fig. 2.4 with the correct data.
Usually we do not know this line when dealing with outliers. We inserted it here to illus-
trate the effect of the “small mistake”.
Table 2.14 shows the numerical results of regression analysis with the wrong and the
correct y-value in period 1. The bottom row shows the changes caused by the outlier.

• By increasing the observed y-value in period 1 by 400, the estimated value increases
by 149.
• The regression coefficient diminishes from 9.63 to 6.05 [chocolate bars/€]. The effect
on the slope of the regression line can be seen in Fig. 2.25. We see how the outlier
pulls on the regression line.

Table 2.14  Results of regression analysis with wrong and correct data

 Data      Observed y-value   Estimated y-value   Coeff. b     R²    Residual r1
 Wrong           2996               2664             6.05     0.23      332
 Correct         2596               2515             9.63     0.55       81
 Change           400                149            −3.58    −0.32      251

• R-square shrinks dramatically, from 0.55 to 0.23.


• The residual in period 1 increases from 81 to 332.

The example demonstrates clearly that the change of only one data point can have very
strong effects on the results of regression analysis, especially for small sample sizes.

Detecting Outliers
When we encounter an outlier, we do not know the correct value, as we did in the simu-
lation above. So if we create a scatterplot, like the one in Fig. 2.25, the correct regression
line (dashed line) will be missing. And in a table like Table 2.14, we will only have the
values in the first row.
For detecting outliers one can use graphical and/or numerical methods. Graphical
methods are easier to understand, and they are quicker and more efficient.46 It can be
tedious to find unusual values in a possibly great number of numerical values. But when
looking at a scatterplot like the one in Fig. 2.25, unusual points47 such as the high marker
on the left side, can be easily detected.
We will now take a closer look at this point with numerical methods. To judge the size
of a residual, it is advantageous to standardize its value by dividing it through the stand-
ard error of the regression. In this way we get standardized residuals:
zi = ri /SE (2.69)
The residual of observation 1 is r1 = 332, and for SE we calculate 209. So, for observa-
tion 1 we get the standardized value
z1 = 332/209 = 1.59
The bar chart in the upper left-hand panel of Fig. 2.26 shows the standardized residuals
for our data. We can see that observation 1 has the largest residual. The value 1.59 can

46 Before doing a regression analysis one can use exploratory techniques of data analysis, like
box plots (box-and-whisker plots), for checking the data and detecting possible outliers. But these
methods do not show the effects on regression.
47 This may be different when the number of variables is large. In this case the detection of multi-

variate outliers by scatterplots can be difficult (see Belsley et al. 1980, p. 17).
[Four bar charts by period: standardized residuals, leverages, studentized deleted residuals, and Cook's distances]
Fig. 2.26 Outlier diagnostics



be seen as a realization of the standard normal distribution (according to assumption 6).


Thus, we can infer from this value the probability of its occurrence, its p-value. For
z = ± 2 it would be p = 5% (2-sigma rule). But here we have a smaller value of z = 1.59
with a larger p-value of 11%.48 Usually, we judge p-values below 5% as significant.
Thus, p = 11% is not alarming.
A problem of residuals is that they can partly mask their true sizes. Before the erro-
neous entry, the residual of observation 1 was 81 (see Table 2.5). After the mistake it
increased to 81 + 400 = 481. This represents the vertical distance between the bullet and
the dashed line in Fig. 2.25. The true (erroneous) residual 481 is much larger than the
observed residual 332. For r = 481 we get z = 2.30 with p = 2.1%. This value is highly
significant. But usually we do not know the true size of residuals.
By pulling on the regression line, the outlier diminished its residual to 332 units. This
effect can be increased by a group of outliers. This poses a difficulty in the detection
of outliers that must be overcome. The reason that the outlier here could influence the
regression so strongly is its leverage.

Influence and Leverage


The influence of an outlier on the regression line depends not only on its size (y-value)
but also on its location on the x-axis. The further away an observation is located on
the x-axis from the mean x̄, the greater is its influence on the slope of the regression line. This effect is called the leverage effect. In least-squares estimation, the leverage effect increases with (xi − x̄)², i.e. with the squared distance from the mean. Observations far
away from the mean are called high-leverage points. Our outlier here is such a high-lev-
erage point because observation 1 is located far to the left.
The influence of an outlier depends on both x and y-values. Roughly we can state:
influence = size · leverage.
Size is a function of the y-values and leverage is a function of the x-values. Observations
with a large influence are called influential observations because they have a strong
impact on the estimated regression coefficients. Influential observations are bad for a
regression analysis. Nobody wants coefficients whose values are dominated by one or a
few outliers. Our outlier here is an influential observation because it contains a substan-
tial error in y and has a strong leverage.
In regression analysis, the leverage for an observation i is usually measured by the hat
value hi ≡ hii (i = 1, 2, …, N). These values are the diagonal elements of the hat matrix
H (which we encounter in matrix algebra for multiple regression analysis). For simple
regression, the leverage can be calculated by

48 With Excel we can calculate: p(abs(z) ≥ 1.59) = 2*(1-NORM.S.DIST(1.59;1)) = 0.112.



hi = 1/N + 1/(N − 1) · ((xi − x̄)/sx)²   (with 1/N ≤ hi ≤ 1)   (2.70)

From this formula, we can see that the leverage

• increases with the squared distance (xi − x̄)², which is the source of the leverage,49
• decreases with the standard deviation of the independent variable,
• decreases with the sample size.

The bar chart in the upper right-hand panel of Fig. 2.26 shows the leverage values for our
data. The mean of the N leverage values hi is h̄ = (J + 1)/N. For our example, we get 0.1667. Leverages with hi > 2h̄ are considered high leverages.
For the leverage of the outlying observation 1 we get:
h1 = 1/12 + 1/(12 − 1) · ((203 − 235.2)/18.07)² = 0.371

This is clearly greater than 2h̄ = 0.333.
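Equation (2.70) is simple enough to verify with a few lines of code. The following minimal Python sketch (the function name is chosen here for illustration) reproduces the leverage of observation 1 from the sample statistics quoted above; the slight difference to 0.371 stems from the rounded mean and standard deviation:

# Leverage (hat value) for simple regression, Eq. (2.70)
def leverage(x_i, x_mean, s_x, N):
    return 1 / N + ((x_i - x_mean) / s_x) ** 2 / (N - 1)

print(round(leverage(203, 235.2, 18.07, 12), 3))   # ~0.372, i.e. clearly > 2*h_bar = 0.333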

Outliers with Low Leverage


By contrast, imagine that the false data entry happened in observation 9. For sales in
period 9, the value 2351 was entered instead of 2751, which is now 400 units too low.
Figure 2.27 shows the scatterplot with the outlier (marked by a bullet) and the resulting
regression line (solid line). The dashed line again shows the correct regression line.
This time the slope of the regression line has not visibly changed although the mis-
take has the same size. The regression line has only been shifted slightly downwards
because the y-value for observation 9 is too low. The reason for the different effects of
errors in observation 1 and observation 9 is that the latter one is located close to the
mean and thus only has a small leverage (see upper right-hand panel in Fig. 2.26).
For the leverage of observation 9 we get:
h9 = 1/12 + 1/(12 − 1) · ((235 − 235.2)/18.07)² = 0.083

This leverage is distinctly smaller than the leverage h1 = 0.371 for observation 1.

Types of Residuals
Because it is difficult to detect an outlier by its residual, different types of residuals are used.
So far, we have introduced the normal (unstandardized) and the standardized residual. Two
further types of residuals are the studentized residuals and studentized deleted residuals.

49 A modified measure of this distance is the centered leverage hi′ = hi − (J + 1)/N with −1 ≤ hi′ ≤ 1.


Fig. 2.27 Regression line (solid line) for data with low leverage outlier, while the dashed line shows the
correct regression line

Above we described how an outlier pulls on the regression line and thus diminishes
its residual. The effect of this mechanism is related to the leverage of the outlier. The cal-
culation of studentized residuals and studentized deleted residuals includes the leverage
of an observation. Compare the formulas of the four types of residuals:
Normal residuals: ei = yi − ŷi

Standardized residuals: zi = ei / SE   (2.71)

Studentized residuals: ti = ei / (SE · √(1 − hi))   (2.72)

Studentized deleted residuals: ti* = ei / (SE(−i) · √(1 − hi))   (2.73)

Table 2.15 shows the values for these four different types of residuals for observation
1, first for the incorrect sales value (after the mistake) and then for the correct sales
value (before the mistake), to identify the effect of the mistake in observation 1. We also
include the abbreviations used in SPSS.
The calculation of the studentized deleted residual ti* uses the standard error of
the regression after deleting observation i. We have denoted this standard error of the

Table 2.15  Values of different types of residuals for observation 1 (after and before the mistake)

 Data        Residual r   Standardized    Studentized    Student. Deleted
                          Residual z      Residual t     Residual t*
             (RES)        (ZRE)           (SRE)          (SDR)
 Incorrect      332          1.59            2.01             2.46
 Correct         81          0.49            0.62             0.60

regression by SE(–i).50 Here we have SE = 208.9 and SE(–i) = 170.2. For observation 1
we get:
t1* = 332 / (170.2 · √(1 − 0.37)) = 2.46

The lower left-hand panel in Fig. 2.26 shows the bar chart of the studentized deleted
residuals. They follow a t-distribution with N − J − 2 degrees of freedom.51 Thus, for
the p-value of the studentized deleted residual t1* = 2.46 we can infer p = 3.6%.52 This value is considerably smaller than the p-value of 11.2% that we got for the standardized
residual of observation 1. And the p-value is below 5%. Thus, the studentized-deleted
residual responds more sensitively to residuals with high leverage and marks observation
1 as an outlier.
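The three scaled residuals of Eqs. (2.71) to (2.73) can be reproduced directly from the quantities quoted in the text (e1 = 332, SE = 208.9, SE(−1) = 170.2, h1 = 0.371). A minimal Python sketch (the function name is chosen here for illustration):

import math

def scaled_residuals(e_i, SE, SE_minus_i, h_i):
    z = e_i / SE                                      # standardized residual, Eq. (2.71)
    t = e_i / (SE * math.sqrt(1 - h_i))               # studentized residual, Eq. (2.72)
    t_star = e_i / (SE_minus_i * math.sqrt(1 - h_i))  # studentized deleted residual, Eq. (2.73)
    return z, t, t_star

print([round(v, 2) for v in scaled_residuals(332, 208.9, 170.2, 0.371)])
# -> roughly [1.59, 2.0, 2.46], cf. Table 2.15 (small deviations are due to rounding)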

Cook’s Distance
Above, we defined the influence of a residual roughly as the product of two factors:
influence = size · leverage.
A specification of this formula is Cook’s distance that can be calculated as:

Di = (ti² / (J + 1)) · (hi / (1 − hi))   (2.74)
This statistic by Cook (1977) is currently the most frequently used measure of influence.
Its calculation is based on the studentized residual for an observation i and its hat value.
For observation 1 we get:

D1 = (2.01² / (1 + 1)) · (0.37 / (1 − 0.37)) = 2.02 · 0.587 = 1.19

50 By using s(−i) instead of standard error s, the numerator and the denominator in the formula
for the studentized deleted residuals become stochastically independent. See Belsley et al. (1980,
p. 14).
51 See Fox (2008, p. 246); Belsley et al. (1980, p. 20).

52 With Excel we can calculate: p(abs(t) ≥ 2.46) = T.DIST.2T(2.46;9) = 0.036.




Fig. 2.28 Regression line (solid line) after discarding observation 1 (outlier); the dotted line shows the
regression with outlier and the dashed line the correct regression line

The lower right-hand panel of Fig. 2.26 shows the bar chart of Cook’s distances for all
observations. We can see that the bar for observation 1 clearly stands out from the other
observations. Thus, Cook’s distance gives a clear indication that observation 1 is a highly
influential outlier.
In the literature, one finds different opinions concerning a cut-off value for the detec-
tion of influential outliers (e.g. 4/N = 0.333, or 4/(N–J–1) = 0.4 or just 0.5). Values
greater than 1 are significant. Our value of Cook’s distance here exceeds these possible
cut-off values by far. This also indicates that observation 1 is a highly influential outlier.
But in general, the best way of detecting influential outliers is to check a diagram
(such as Fig. 2.26) for values (dots or bars) that stand out from the others.

Outlier Deletion
Above, we showed the effect of a simulated outlier on regression results by comparing
the results with the correct data. But usually we do not know the correct data. Another
way to show the effect of an outlier is to repeat the analysis after deleting the observation
with the outlier. This is illustrated in Fig. 2.28. Table 2.16 shows the numerical results.
We can see that after deleting observation 1 (the outlier) the regression is close to the
regression with the (usually unknown) correct data. Thus, in this case, the deletion of the
outlier yields good results.

Table 2.16  Regression results: a) after discarding observation 1 (outlier), b) before discarding the outlier, c) correct data

 Data                     Coeff. b     R²       SE
 a) Outlier discarded       10.8      0.52    170.2
 b) With outlier             6.05     0.23    208.9
 c) Correct data             9.63     0.55    164.7

How to Deal With Outliers


“In or out, that is the question.” This is indeed a difficult question. We only have to take
action if an observation is influential. In the above example, we got good results after
deleting the outlying data. But we knew that the outlier was caused by a mistake. The
automatic deletion of an outlier is not acceptable. If the outlier is due to chance, it does
not pose a violation of the assumptions and must not be eliminated. By dropping outliers
one can possibly manipulate the regression results. And if one does so for good reasons,
it should be documented in a report or publication. If possible, one should present the
results with and without the outlier, as we did in Fig. 2.28.
In any case, one should investigate the reason(s) for an outlier (see above). Sometimes
it is possible to correct a mistake in measurement or data entry. We should only drop the
observation if we have reason to believe a mistake has occurred (e.g. the data gives 5 for
the age or weight of a respondent). The same applies, if there is proof that the outlier was
caused by an unusual event outside the research context (e.g., a strike by the union or a
shutdown of electricity).
In all other cases, outliers should be kept. Sometimes a change in the model specifica-
tion can help, e.g. the inclusion of an omitted variable or a non-linear transformation of
some variable(s). By using dummy variables, unusual events within the research context
can be included in the model (cf. Sect. 2.4.1).

2.3 Case Study

2.3.1 Problem Definition

We will now use another sample related to the chocolate market to demonstrate how to
conduct a regression analysis with the help of SPSS.
The marketing manager of a chocolate company wants to analyze the influence of
demographic variables on the shopping behavior of his customers. He wants to find out if
and how age and gender influence the shopping frequency in his outlet stores. His model
is:
Shopping frequency = f (age, gender).

and he specifies the following regression function:

Ŷ = b0 + b1 X1 + b2 X2
with
Ŷ estimated shopping frequency (shoppings)
X1 age
X2 gender (coded 0 for females and 1 for males)
A variable coded 0 or 1 is called a dummy variable. It can be treated like a metric var-
iable. Dummy variables can be used to incorporate qualitative predictors into a linear
model (cf. Sect. 2.4.1).
For the estimation, the manager started with a small sample of 40 customers, ran-
domly selected from the company’s database.
As the manager was not content with the results of his analysis, he additionally col-
lected data on the income of his customers by a separate survey. This was omitted in the
first survey, as people usually do not like to report their income. For this, the respondents
had to be ensured that their data are kept confidential and anonymous. The income data
from the second survey are also contained in the data file.

2.3.2 Conducting a Regression Analysis With SPSS

To conduct a regression analysis with SPSS, we can use the graphical user interface
(GUI). After loading the data file into SPSS we can see the data in the SPSS data editor.
To select the procedure for regression analysis we have to click on ‘Analyze’. A pull-
down menu opens with submenus for groups of procedures (see Fig. 2.29). The group
‘Regression’ contains (among other forms of regression analysis) the procedure of linear
regression (‘Linear’).
After selecting ‘Analyze/Regression/Linear’, the dialog box ‘Linear Regression’
opens, as shown in Fig. 2.30. The left field shows the list of variables. Our dependent
variable ‘shopping frequency’ has to be moved into the field ‘Dependent’. The independ-
ent variables ‘age’ and ‘gender’ have to be moved into the field ‘Independent(s)’.
SPSS offers various model-building methods. We here choose the method ‘Enter’,
which is the default option. This means that all selected independent variables will be
included into the model as they were entered in the field ‘Independent(s)’. This is called
blockwise regression.
The dialog box Regression contains several buttons that lead to further submenus.
If we click the button ‘Statistics’, the dialog box ‘Linear Regression: Statistics’ opens
(Fig. 2.31). Here we can request various statistical outputs. ‘Estimates’ and ‘Model fit’’
are the default settings.

Fig. 2.29 Data editor with a selection of the procedure ‘Linear Regression’

If the data set contains missing values, which often occurs in practice, this can be
taken into account with the options under ‘Missing Values’.53 The regression analysis in
SPSS offers the possibility of excluding missing values listwise or pairwise. Missing val-
ues can also be replaced by means (cf. Fig. 2.32).

2.3.3 Results

2.3.3.1 Results of the First Analysis


After clicking on ‘Continue’ and then ‘OK’, SPSS provides the results shown in
Fig. 2.33.
The output comprises three sections: ‘Model Summary’, ‘ANOVA’, and ‘Coefficients’.
The last section gives the estimated regression parameters in the column marked ‘B’:

53 Missing values are a frequent and unfortunately unavoidable problem when conducting surveys
(e.g., because people cannot or do not want to answer some question, or as a result of mistakes by
the interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.

Fig. 2.30 Dialog box: Linear Regression

Fig. 2.31 Dialog box: Statistics

the constant term and the two coefficients. With these values we can write the estimated
regression function:
Shoppings = 3.832 + 0.101 · age − 0.015 · gender (2.75)

Fig. 2.32 Dialog box: Options

This result indicates that the shopping frequency increases slightly with age and decreases slightly with the gender dummy. As gender was coded with 0 for females and 1 for males, the negative sign means that the shopping frequency is somewhat lower for men than for women.
The ‘Model Summary’ contains global measures for evaluating the goodness-of-fit
of the estimated regression function. The coefficient of determination, R-square, tells us
that only 4.8% of the total variation of the dependent variable Y = ‘shopping frequency’
can be explained by the two predictors ‘age’ and ‘gender’. This is a very disappointing
result.
The second part with the heading ‘ANOVA’ (analysis of variance) shows that, with an
empirical F-value of only 0.923, the regression function has no statistical significance (cf. Sect. 2.2.3.3). The critical F-value for J = 2 and N – J – 1 = 37 df is 3.25. Thus, we get a p-value of 40.6%, much larger than the usual limit at α = 5% for statistical
significance.
The first column, with the heading ‘Sum of Squares’, shows the decomposition of
variation according to Eq. (2.34): SSE + SSR = SST. The next column gives the corre-
sponding degrees of freedom (df) (cf. Table 2.7).

Fig. 2.33 SPSS output for regression analysis 1

The last section with the estimated regression parameters gives the

• standard error, see Eq. (2.42),


• beta coefficient, see Eq. (2.28),
• t-value, see Eq. (2.44), with the corresponding p-value

for each parameter.


The critical t-value for 37 df is 2.03 (≈2). All t-values are smaller here. And the p-val-
ues are much larger than α = 5%. Thus, none of the estimated parameters is statistically
significant.

2.3.3.2 Results of the Second Analysis


Because of these disappointing results, the manager had collected also data on the
income of his customers (as mentioned above). After including the variable ‘income’ in
the model and running a second regression analysis, SPSS provides the results shown in
Fig. 2.34.

Fig. 2.34 SPSS output for regression analysis 2

The estimated regression function now reads:


Shoppings = 1.401 − 0.120 · Age − 0.438 · Gender + 3.116 · Income (2.76)
This regression function is highly significant. R-square indicates that 65.8% of the total
variation is explained by the three predictors. The F-value is now 23.1, with a p-value of
< 0.001. Also, the coefficient of ‘age’ is statistically significant now, but it changed its
sign. The coefficient of ‘gender’ has increased in size but has not become significant.
It is noteworthy that the coefficient of ‘age’ has changed its sign. If we take a look at
the correlation matrix in Fig. 2.35, we see that ‘shopping frequency’ and ‘age’ are posi-
tively correlated. But the coefficient of age has become negative. This requires a causal
analysis.

Checking for Causality


If there is a causal relationship between age and shopping frequency, shopping frequency
has to be the dependent variable. Chocolate may possibly increase the life span, but

Fig. 2.35 Correlation matrix

nothing can change age. Thus, if there is a causal relationship, age must be the cause
of changes in shopping frequency, and not vice versa. But why has the regression coef-
ficient become negative, while the correlation between age and shopping frequency is
positive?
The reason is that age is directly and indirectly causally related to shopping fre-
quency, and income functions as a mediator (cf. Fig. 2.17c). Age has a direct effect on
shopping frequency, which is negative. And it has an indirect effect via income, which is
positive, and which is larger than the direct effect.
In Eq. (2.75) income is omitted. Thus, part of the effect of income on shopping fre-
quency is erroneously assigned to the coefficient of age because age and income are pos-
itively correlated. The coefficient b1 = 0.101 in Eq. (2.75) comprises the direct and the
indirect effect of age on shopping. frequency. And as the positive indirect effect is larger
than the negative direct effect, we get a wrong positive value for b1.
By including income in the regression Eq. (2.76), the direct and indirect effects of
age are separated. Thus, the coefficient b1 of age in the 2nd analysis reflects only the
direct effect, and this is negative. This means that within a group of customers with equal
income, the shopping frequency will decrease with higher age. Intuitively the chocolate
manager had felt this and thus undertook the 2nd regression analysis.

2.3.3.3 Checking the Assumptions

A1: Linearity
To check for non-linearity, we can plot the dependent variable ‘shopping frequency’
against any of the independent variables. For example, by plotting ‘shopping frequency’
versus ‘income’ we get the scatterplot in Fig. 2.36. In SPSS we can do this by select-
ing ‘Graphs / Chart Builder / Scatter/Dot’ and then moving ‘shopping frequency’ to the
y-Axis and ‘income’ to the x-Axis. The scatter in Fig. 2.36 does not indicate a violation
of the linearity assumption. Additionally, we can fit a linear line to the scatter by select-
ing ‘Total’ under the option ‘Linear Fit Lines’. In the same way, we can create scatter-
plots with the other independent variables.

[Scatterplot of shopping frequency (y-axis) against income (x-axis) with linear fit line y = −2.04 + 2.65 · x, R² linear = 0.607]

Fig. 2.36 Scatterplot of the dependent variable ‘shopping frequency’ against ‘income’

A more efficient plot is the Tukey-Anscombe plot, which involves plotting the resid-
uals against the fitted y-values on the x-axis (cf. Sect. 2.2.5.2) as the fitted y-values are
linear combinations of all x-values. We can easily create such a plot with the dialog box
‘Linear Regression: Plots’ of SPSS Regression (see Fig. 2.37). This box offers stand-
ardized values of the fitted y-values and the residuals. We put the standardized residuals
*ZRESID on the y-axis and the standardized predicted values *ZPRED on the x-axis (as
shown in Fig. 2.37), and receive the diagram in Fig. 2.38. The scatterplot does not show
any outliers. The residuals seem to scatter randomly without any structure and do not
show any suspicious pattern. This is what we want to see.

A2: No Relevant Variables Omitted


The Tukey-Anscombe plot can also be used for the detection of omitted variables (cf.
Sect. 2.2.5.2). As the residuals in Fig. 2.38 scatter randomly around the x-axis, we have
no reason to suspect that relevant variables (that are correlated with the independent vari-
ables in the model) have been omitted.

A3: The Independent Variables are Measured Without Error


Concerning ‘age’ and ‘gender’, we can assume that these variables are measured with-
out error. This is different from reported ‘income’, which usually contains random errors

Fig. 2.37 Dialog box: Plots

and also bias. Due to this, the estimated regression coefficient will be underestimated
(attenuated).
But in the present case, this does not pose a problem. The estimated regression coef-
ficient of the variable ‘income’ is very large here, despite possible attenuation, with a
p-value of practically zero.

A4 + A5: Homoscedasticity and No Autocorrelation


The Tukey-Anscombe plot in Fig. 2.38 indicates neither heteroscedasticity (nonconstant
error variance) nor autocorrelation (cf. Figs. 2.19 and 2.20).
Autocorrelation is of no relevance here as we have no time-series data. However,
we will show how to check for autocorrelation in SPSS: Via the dialog box ‘Statistics’
(Fig. 2.31) we can request the value of the Durbin-Watson statistic. We get DW = 1.94.
This is close to the ideal value 2.

A6: Normal Distribution


To check the residuals for normal distribution we can select ‘Histogram’ and ‘Normal
probability plot’ in the dialog box ‘Linear Regression: Plots’ (cf. Fig. 2.37), which leads
to the diagrams in Figs. 2.39 and 2.40.
The distribution of the residuals seems to be symmetric. The probabilities in the P-P
plot (Fig. 2.40) scatter randomly along the diagonal line. Thus, we can assume that the
normality assumption is not violated.

A7: No Strong Multicollinearity


In the dialog box ‘Linear Regression: Statistics’ (Fig. 2.31), we can select the option
‘Collinearity diagnostics’. This selection gives us for each independent variable the tol-
erance value Tj = 1 − Rj² and the value of the variance inflation factor (VIFj) according


Fig. 2.38 Plot of residuals against predicted values

[Histogram of the standardized residuals with superimposed normal curve; Mean = 1.19E−15, Std. Dev. = 0.961, N = 40]

Fig. 2.39 Histogram of residuals with normal curve




Fig. 2.40 P-P Plot

to Eq. (2.67). The lowest tolerance value here results for the variable ‘income’:
T1 = 0.726. Thus, for income we get the largest VIF-value with VIF1 = 1.377. This is a
very moderate value. Thus, we do not have a collinearity problem here.

2.3.3.4 Stepwise Regression
Besides the blockwise regression (method ‘Enter’) used above, SPSS also offers a step-
wise regression. This method can be chosen via the dialog box ‘Linear Regression’ (see
Fig. 2.30).
If we choose stepwise regression, an algorithm of SPSS will build the model. It will
include the independent variables sequentially into the model, one by one, based on their
statistical significance (this process is called forward selection). Non-significant varia-
bles will be omitted. By this procedure, the algorithm tries to find a good model. This
method can be useful if we have a great number of independent variables.

Fig. 2.41 Stepwise regression: Variables Entered

In our case study with only three independent variables, a total of seven different
models (regression equations) can be formed: three with one independent variable, three
with two independent variables, and one with three independent variables. If we had 10
independent variables, 1023 models could be formed. The number of possible combina-
tions increases exponentially with the number of variables.
For this reason, it can be very tempting to let the computer find a good model. But
there is also a risk. The computer can only select variables according to statistical crite-
ria, but it cannot recognize whether a model is meaningful in terms of content. We have
shown that nonsense correlations can also be statistically significant. To recognize this is
the task of the researcher. The computer does not know anything about causality because
the data do not contain information about causality. Thus, a computer could possibly
“think” that tax is a good predictor for sales or profit because of a strong correlation.
Anyway, we will now demonstrate the stepwise regression method. Figure 2.41 shows
that ‘income’ is included in the first step and ‘age’ in the second step. The variable ‘gen-
der’ is not included because its p-value exceeds α = 5%.
The target criterion for the successive inclusion of variables is the increase of
R-square. In the first step, the variable ‘income’ is selected because it has the highest cor-
relation with the dependent variable and thus yields the highest R-square (see Fig. 2.42).
In each successive step, the variable is selected which yields the highest increase
of R-square. The process ends if there is no further variable that leads to a significant
increase in R-square. Here in our example the process ends after the second step.
The estimated coefficients are shown in Fig. 2.43. After the first step, the coeffi-
cient of ‘income’ is somewhat biased downwards because it takes the negative effect of
‘age’. This is corrected in the second step by the inclusion of ‘age’. The coefficients of
‘income’ and ‘age’ are almost identical with the coefficients in Eq. (2.76) which addi-
tionally contains the variable ‘gender’.
The selection process of stepwise regression can be controlled via the dialog box
‘Options’ (cf. Fig. 2.32). The default settings of SPSS shown in Fig. 2.32 were used
(PIN: 0.05; POUT = 0.10). The user can change the p-values for PIN (entry) and POUT
(removal) of a variable. An already selected variable can lose importance due to the
inclusion of other variables and thus its p-value can increase. If the p-value exceeds

Fig. 2.42 Stepwise regression: changes in R-square

Fig. 2.43 Stepwise regression: estimated parameters

the “removal” value POUT, then the variable will be removed from the model. The
“removal” value must always be larger than the “entry” value (PIN < POUT), because
otherwise the algorithm may not find an end.
If we set PIN = 0.7 and POUT = 0.8, then the variable ‘gender’ will also be selected
and stepwise regression will yield the same regression function as the blockwise regres-
sion in Eq. (2.76).
If you choose backward elimination, then the algorithm starts with a model that
includes all variables and removes in each step the variable that results in the smallest
change in R-square.

2.3.4 SPSS Commands

Above, we demonstrated how to use the graphical user interface (GUI) of SPSS to
conduct a regression analysis. Alternatively, we can use the SPSS syntax which is a

* MVA: Case Study Chocolate Regression Analysis.


* Defining Data.
DATA LIST FREE / ID Shoppings Age Gender Income.
MISSING VALUES ALL (9999).

BEGIN DATA
1 6 37 0 1.800
2 12 25 0 2.900
3 2 20 1 2.000
--------------------------------
40 9 56 0 4.900
END DATA.
* Enter all data.

* A: Case Study Regression: Method "enter".


* 1st analysis: Method "enter".
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Shoppings
/METHOD=ENTER Age Gender.

* 2nd analysis: Method "enter".


REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Shoppings
/METHOD=ENTER Age Gender Income
/SCATTERPLOT=(*ZRESID ,*ZPRED)
/RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID).

* B: Case Study Regression: Method "stepwise".


REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Shoppings
/METHOD=STEPWISE Age Gender Income.

Fig. 2.44 SPSS syntax for linear regression

programming language unique to SPSS. Each option we activate in SPSS’s GUI is trans-
lated into SPSS syntax. If you click on ‘Paste’ in the dialog box ‘Linear Regression’
shown in Fig. 2.30, a new window opens with the corresponding SPSS syntax. However,
you can also run SPSS by only using the syntax and write the commands yourself. Using
the SPSS syntax can be advantageous if you want to repeat an analysis multiple times
(e.g., maybe with changed data or testing different model specifications). Figure 2.44

shows the SPSS syntax for running the analyses discussed above. The procedure ‘Linear
Regression’ can be requested by the command ‘REGRESSION’ and several subcom-
mands. The syntax shown here does not refer to an existing data file of SPSS (*.sav), but
the data is embedded in the commands between BEGIN DATA and END DATA.
For readers interested in using R (https://www.r-project.org) for data analysis, we pro-
vide the corresponding R-commands on our website www.multivariate-methods.info.

2.4 Modifications and Extensions

2.4.1 Regression With Dummy Variables

The flexibility of the linear regression model can be extended considerably by the use
of dummy variables. In this way, qualitative (nominally scaled) variables can also be
included in a regression model as explaining variables or predictors. Dummy variables
are binary variables (0,1-variables). Mathematically, they can be handled like metric
variables.
We have encountered an example of a dummy variable in the case study, where we
estimated:
Shopping frequency = f (age, gender)
For the variable ‘gender’ we used a dummy variable with the values 0 and 1. If we
denote the dummy variable by d, we can write the estimated regression function in the
following form:

Ŷ = a + b1 · d + b2 · X (2.77)
with

• Ŷ : Estimated shopping frequency of chocolate


• X: Age
• d: Dummy for gender
d = 1 for men, d = 0 for women
With this we get
for men Ŷ = a + b1 · d + b2 · X = (a + b1 ) + b2 · X
for women Ŷ = a + b2 · X
For a qualitative variable with two categories, we need one dummy variable. Here,
women are the baseline category. We do not need a dummy for the baseline category.

This can be generalized to qualitative variables with more than two categories. Let’s
assume that, instead of gender, we want to investigate whether hair color influences the
shopping of chocolate. We distinguish between the colors blond, brown, and black. For
q = 3 categories we now need two dummy variables:
d1 = 1 for blond, 0 else
d2 = 1 for brown, 0 else
Here, black is the baseline category for which we do not need a dummy. We now have to
estimate the regression function

Ŷ = a + b1 · d1 + b2 · d2 + b3 · X (2.78)
So in general, for a qualitative variable with q categories we need q – 1 dummy varia-
bles. By including q dummies in the model, we would cause perfect multicollinearity
between the independent variables and violate assumption A7.
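
As a small illustration (with made-up data and hypothetical variable names, not the case-study data), the following Python sketch codes the three hair colors into q − 1 = 2 dummies, using black as the baseline, and estimates a regression of the form of Eq. (2.78) by ordinary least squares:

import numpy as np

# Made-up observations: hair color (qualitative), age and shopping frequency
hair = np.array(["blond", "brown", "black", "blond", "brown", "black", "blond", "brown"])
age  = np.array([25, 31, 40, 52, 47, 36, 29, 58], dtype=float)
y    = np.array([ 8,  6,  5,  7,  4,  6,  9,  3], dtype=float)

d1 = (hair == "blond").astype(float)   # dummy 1 (baseline category: black)
d2 = (hair == "brown").astype(float)   # dummy 2
X = np.column_stack([np.ones_like(y), d1, d2, age])   # constant, 2 dummies, age

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2, b3 = coef
print(f"Y_hat = {a:.2f} + {b1:.2f}*d1 + {b2:.2f}*d2 + {b3:.2f}*Age")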
An example of using dummy variables will be given in the next section on regression
analysis with time-series data.

2.4.2 Regression Analysis With Time-Series Data

An important application of the linear regression model is the analysis of time-series


data. While in the case study we dealt with cross-sectional data (data collected at one
point in time from different subjects or objects), we already dealt with time-series data
in our introductory example (Table 2.3). But we did not consider the special properties of
time-series data.
Time-series data
• are ordered according to time,
• allow the incorporation of time variables into a model.

In addition to describing and explaining the temporal development of a variable Y, the


time series analysis may also serve to forecast its development, i.e. to estimate the val-
ues of Y for future points or periods in time. We can do this also with other regression
models. But then we first have to predict or assume the future values of the predictor
variables. This can be difficult if the independent variables are not controlled by the
researcher, e.g. actions by competitors or environmental change.
If the predictor variable is ‘time’, we can get a prediction simply by time-­series
extrapolation. Time is a very special variable, unlike any other. It develops in a

Table 2.17  Time-series data for chocolate sales


Period i   Sales [1000 bars]   Time t [quarter]   d1   d2   d3   d4
1 2596 1 1 0 0 0
2 2709 2 0 1 0 0
3 2552 3 0 0 1 0
4 3004 4 0 0 0 1
5 3076 5 1 0 0 0
6 2513 6 0 1 0 0
7 2626 7 0 0 1 0
8 3120 8 0 0 0 1
9 2751 9 1 0 0 0
10 2965 10 0 1 0 0
11 2818 11 0 0 1 0
12 3171 12 0 0 0 1

completely uniform manner and independently of all other events.54 Time has no cause.
It has an ordering function and puts the data in a fixed and unchangeable order. With
cross-section data, on the other hand, the order of the data is irrelevant and can be
changed at will. The variable ‘time’ divides time into equidistant points or periods (e.g.
days, weeks, months, years).
Linear trend model

Example
As a numerical example, we will use the sales data from our introductory example in
Table 2.3. Table 2.17 shows the sales of chocolate without the marketing variables,
but with a time variable t (t = 1, …, 12). The time variable here counts periods of
three months (quarters). The four dummy variables indicate certain quarters within a
year. ◄

Figure 2.45 shows a scatterplot of the sales data. We can recognize a slight increase over
time. The simplest time-series model is the linear trend model:
Y = α + β ·t + ε (2.79)
By simple regression, we get the following estimated model:

Ŷ = a + b · t = 2617 + 32.09 · t (2.80)

54 Since Albert Einstein (1879–1955) we know that this is not quite true. Relativity theory tells us
that time slows down with increasing speed and even comes to a standstill at the speed of light. But
for our problems we can neglect this.


Fig. 2.45 Scatterplot and linear trend

The estimated model is represented by the trend line in Fig. 2.45. By extrapolating this
line, we can get a prediction of sales for any period (N + k) in the future (outside the
range of observations). Thus, for the next period N + 1 = 13 we get:
ŷN+1 = a + b · (N + 1) = 2617 + 32.09 · 13 = 3034
For 10 periods ahead we get:
ŷN+10 = a + b · (N + 10) = 2617 + 32.09 · 22 = 3322
To assess the goodness of the estimated model, we can use the measures we discussed in
Sect. 2.2.3:

• Standard error of the regression: SE = 214


• R-square: R2 = 24.4%

Another common measure used in the time-series analysis is the mean absolute deviation
(MAD):
MAD = (1/N) · Σ |yi − ŷi| = 2060.7 / 12 = 171.7    (2.81)
(sum over i = 1, …, N)
For assessing the predictive quality of a model, the calculation of the MAD should be
based on observations that have not been used for the estimation of the model. A model
might provide a good fit, but not necessarily also a good predictive performance.

Here, all the measures above indicate a poor quality of the estimated model. This can
also be seen in Fig. 2.45, which shows a considerable scatter (unexplained variation) of
the observations around the trend line.
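
Readers who want to retrace these figures outside of SPSS can do so with a few lines of Python (numpy only). The following sketch fits the linear trend model to the sales data of Table 2.17 and reproduces the goodness-of-fit measures reported above:

import numpy as np

sales = np.array([2596, 2709, 2552, 3004, 3076, 2513,
                  2626, 3120, 2751, 2965, 2818, 3171], dtype=float)
t = np.arange(1, 13, dtype=float)
N = len(sales)

X = np.column_stack([np.ones(N), t])
(a, b), *_ = np.linalg.lstsq(X, sales, rcond=None)   # a = 2617, b = 32.09
resid = sales - (a + b * t)

SE  = np.sqrt(np.sum(resid**2) / (N - 2))                        # about 214
R2  = 1 - np.sum(resid**2) / np.sum((sales - sales.mean())**2)   # about 0.244
MAD = np.mean(np.abs(resid))                                     # about 171.7, cf. Eq. (2.81)

print(f"Y_hat = {a:.0f} + {b:.2f}*t | SE = {SE:.0f}, R2 = {R2:.1%}, MAD = {MAD:.1f}")
print("Forecast for t = 13:", round(a + b * 13))                 # about 3034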

Linear Trend Model With Seasonal Dummies


Sales data often show seasonal fluctuations. By using dummy variables we can
include the seasonal effects into the model. As our periods are quarters, we can distin-
guish between 4 seasons per year. They are represented by the 4 dummy variables in
Table 2.17. To avoid perfect multicollinearity we have to choose one season (e.g. season
4) as the baseline and include dummies for seasons 1–3 into the model:

Ŷ = a + b1 · d1 + b2 · d2 + b3 · d3 + b · t (2.82)
An alternative specification leads to a model without a constant term:

Ŷ = b1 · d1 + b2 · d2 + b3 · d3 + b4 · d4 + b · t (2.83)
After removing the constant term, we can include all 4 dummies into the model without
causing perfect multicollinearity. In SPSS we can do this via the dialog box “Options”
by removing the checkmark from the default option “Include constant in the equation”.
Estimating this model, we get

Ŷ = 2676 · d1 + 2571 · d2 + 2481 · d3 + 2887 · d4 + 26.38 · t (2.84)


Figure 2.46 shows the seasonal pattern, which we get after centering the coefficients
of the dummy variables. By including the seasonal pattern in the model, we have


Fig. 2.46 Seasonal pattern




Fig. 2.47 Estimated time-series with seasonal fluctuations

diminished the unexplained variation (see Fig. 2.47). This considerably improves the fit
of the model. We now get:

• Standard error of the regression: SE = 163


• R-square: R2 = 69.2%
• Mean absolute deviation: MAD = 94.1

By extrapolation of Eq. (2.84), we can make predictions for any period in the future. For
the next period (quarter) N + 1 we get:
ŷ13 = 2676 · 1 + 2571 · 0 + 2481 · 0 + 2887 · 0 + 26.38 · 13 = 3019
And for period 20 we get:
ŷ20 = 2676 · 0 + 2571 · 0 + 2481 · 0 + 2887 · 1 + 26.38 · 20 = 3415
The dashed line in Fig. 2.48 shows the predictions for the next eight periods, 13 to 20.
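
The same estimation can be retraced with a short Python sketch (numpy only; the helper function forecast is our own, purely illustrative name). It reproduces the coefficients of Eq. (2.84) and the predictions for periods 13 and 20:

import numpy as np

sales = np.array([2596, 2709, 2552, 3004, 3076, 2513,
                  2626, 3120, 2751, 2965, 2818, 3171], dtype=float)
t = np.arange(1, 13, dtype=float)
season = ((t - 1) % 4).astype(int)        # 0..3 for quarters 1..4

D = np.zeros((12, 4))
D[np.arange(12), season] = 1.0            # four seasonal dummies d1..d4
X = np.column_stack([D, t])               # model without constant term, Eq. (2.83)

coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(np.round(coef, 2))                  # about 2676, 2571, 2481, 2887 and 26.38

def forecast(period):
    q = (period - 1) % 4                  # season of the future period
    return coef[q] + coef[4] * period

print(round(forecast(13)), round(forecast(20)))   # about 3019 and 3415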

Prediction Error
Unfortunately, predictions are always associated with errors (“especially if they concern
the future”, as one may ironically remark). Based on the standard error of
regression according to Eq. (2.30) we can calculate the standard error of the prediction
for a future period N + k as follows:

sp(N + k) = SE · √(1 + 1/N + (N + k − t̄)² / Σt (t − t̄)²)    (2.85)
(the sum runs over t = 1, …, N; t̄ denotes the mean of the time variable)


Fig. 2.48 Estimated time-series and predictions

where N = 12 is the period of the last observation. It is important to note that the pre-
diction error increases with the prediction horizon N + k, i.e. the longer the prediction
reaches into the future, the larger the error.
For the next period we get:

sp(13) = 163 · √(1 + 1/12 + (12 + 1 − 6.5)² / 143) = 191.4
And for period 20 we get:

sp(20) = 163 · √(1 + 1/12 + (12 + 8 − 6.5)² / 143) = 250.4
Interval Predictions
The predictions we did above are called point predictions. With the help of the standard
error of the prediction we can also make interval predictions, i.e. we can specify a confi-
dence interval in which the future value will lie with a certain probability:
ŷN+k − tα/2 · sp(N + k) ≤ yN+k ≤ ŷN+k + tα/2 · sp(N + k)    (2.86)
So, for period 13, we get the interval
3019 − tα/2 · 191.4 ≤ y13 ≤ 3019 + tα/2 · 191.4
tα/2 denotes the quantile of the t-distribution (student distribution) for error probability
α = 5% (confidence level 1 − α = 95%) for a two-tailed test. For sample sizes N > 30 we

can assume tα/2 ≈ 2. Here we have only N – 4 = 8 degrees of freedom and thus we have
to use tα/2 = 2.3. With this, we get the prediction interval
3019 − 440 ≤ y13 ≤ 3019 + 440
The prediction interval here has a span of 880. For period 20 the interval increases to
more than 1100. This explains why predictions so often fail, especially if they reach far
into the future.
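
The prediction standard error and the resulting interval can be computed directly. The following Python sketch (scipy is used only for the t-quantile) reproduces the values given above; the point forecast 3019 for period 13 is taken from the seasonal model:

import numpy as np
from scipy import stats

SE, N = 163.0, 12
t = np.arange(1, N + 1)
S_tt = np.sum((t - t.mean()) ** 2)        # 143, with t_bar = 6.5

def s_pred(k):
    """Standard error of the prediction for period N + k, Eq. (2.85)."""
    return SE * np.sqrt(1 + 1 / N + (N + k - t.mean()) ** 2 / S_tt)

print(round(s_pred(1), 1), round(s_pred(8), 1))   # about 191.4 and 250.4

t_crit = stats.t.ppf(0.975, df=N - 4)             # about 2.3 for 8 degrees of freedom
half = t_crit * s_pred(1)
print(f"95% interval for period 13: {3019 - half:.0f} to {3019 + half:.0f}")   # roughly 3019 +/- 440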

2.4.3 Multivariate Regression Analysis

Multivariate regression is an extension of multiple regression with more than one


dependent variable (cf., e.g., Izenman 2013, p. 159 ff.; Greene 2020, p. 366 ff.). Each of
the M dependent variables is influenced by the same set of independent variables. So,
there are M equations, each of which contains the same set of independent variables.
For example, in a study on the effects of eating habits on health, the independent var-
iables could be the amount of meat, vegetables, cereal food, fruits, and chocolate. The
dependent variables could be the body mass index (BMI), blood pressure, and choles-
terol level. Thus, one can form three equations with the same five independent variables
on eating habits.
In multivariate regression, all parameters are estimated simultaneously. This falls into
the field of simultaneous equation models and is beyond the scope of this book.

2.5 Recommendations

Here are some recommendations for the practical application of regression analysis.

• The problem to be investigated must be precisely defined: Which variable is to be


explained or predicted? The dependent variable must be on a metric scale. If inde-
pendent variables are on a nominal scale, they have to be transformed into dummy
variables.
• Before you start a regression analysis you should visualize the data by scatterplots
and study the correlation matrix.
• Expertise and thought are necessary to identify and define possible influencing varia-
bles. A variable should only be considered if there are logical reasons for doing so.
• The number of observations must be sufficiently large. A model will be more reliable
if it is based on more observations. And we need more observations if we have more
variables. In the literature, you can find different recommendations, ranging from 10
to 30 observations per independent variable. 10 observations per parameter is a simple
and practical rule of thumb. That means at least 20 observations for a simple regres-
sion. But as we have shown, the precision of an estimate does not only depend on the

number of observations but also on the variation of the independent variable. And for
multiple independent variables, the precision also depends on their collinearity (see
Sect. 2.2.5.7).
• After estimating a regression function, the coefficient of determination must first be
checked for significance. If no significant test result can be achieved, the entire regres-
sion approach must be discarded.
• The individual regression coefficients must then be checked logically (for sign) and
statistically (for significance).
• It has to be checked whether the assumptions of the linear regression model are met
(see Sect. 2.2.5). For this, plotting the residuals is an easy and effective instrument
(see Sects. 2.2.5.1 and 2.2.5.2).
• If the model is used for explanation or decision making, the correctness of the causal-
ity assumptions is essential. This requires reasoning beyond statistics and information
on the generation of the data.
• To find a good model, you may need to remove variables from the equation or add
new variables. Modeling is often an iterative process in which the researcher formu-
lates new hypotheses based on empirical results and then tests them again.
• If the regression model has passed all statistical and logical checks, its validity has to
be checked against reality.

References

Agresti, A. (2013). Categorical data analysis. John Wiley.


Anscombe, F. J., & Tukey, J. W. (1963). The examination and analysis of residuals. Technometrics,
5(2), 141–160.
Belsley, D., Kuh, E., & Welsch, R. (1980). Regression diagnostics. Wiley.
Blalock, H. M. (1964). Causal inferences in nonexperimental research. The Norton Library.
Campbell, D. T., & Stanley, J. C. (1966). Experimental and Quasi-experimental designs for
research. Rand McNelly.
Charles, E. P. (2005). The correction for attenuation due to measurement error: Clarifying concepts
and creating confidence sets. American Psychological Association, 10(2), 206–226.
Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19,
15–18.
Fox, J. (2008). Applied regression analysis and generalized linear models. Sage.
Freedman, D. (2002). From association to causation: Some remarks on the history of statistics (p.
521). University of California.
Freedman, D. (2012). Statistical models: Theory and practice. Cambridge University Press.
Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). Norton & Company.
Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the
Anthropological Institute of Great Britain and Ireland, 15, 246–263.
Gelman, A., & Hill, J. (2018). Data analysis using regression and multilevel/hierarchical models.
Cambridge University Press.
Green, P. E., Tull, D. S., & Albaum, G. (1988). Research for Marketing Decisions (5th ed.).
Englewood Cliffs.

Greene, W. H. (2012). Econometric analysis (7th ed.). Pearson.


Greene, W. H. (2020). Econometric analysis (8th ed.). Pearson.
Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2010). Multivariate data analysis (7th
ed.). Pearson.
Hair, J. F., Hult, G.T., Ringle, C. M., & Sarstedt, M. (2014). A primer on Partial Least Squares
Structural Equation Modelling (PLS-SEM). Sage.
Hastie, T., Tibshirani, R., & Friedman, J. (2011). The elements of statistical learning. Springer.
Izenman, A. L. (2013). Modern multivariate statistical techniques. Springer Texts in Statistics.
Kahneman, D. (2011). Thinking, fast and slow. Penguin.
Kline, R. B. (2016). Principles and practice of structural equation modeling. Guilford Press.
Kmenta, J. (1997). Elements of econometrics (2nd ed.). Macmillan.
Leeflang, P., Witting, D., Wedel, M., & Naert, P. (2000). Building models for marketing decisions.
Kluwer Academic Publishers.
Little, J. D. C. (1970). Models and managers: The concept of a decision calculus. Management
Science, 16(8), 466–485.
Maddala, G., & Lahiri, K. (2009). Introduction to econometrics (4th ed.). New York: Wiley.
Pearl, J., & Mackenzie, D. (2018). The book of why—The new science of cause and effect. Basic
Books.
Spearman, C. (1904). The proof and measurement of association between two things. The
American Journal of Psychology, 15(1), 72–101.
Stigler, S. M. (1997). Regression towards the mean, historically considered. Statistical Methods in
Medical Research, 6, 103–114.
Wooldridge, J. (2016). Introductory econometrics: A modern approach (6th ed.). Thomson.

Further Reading

Fahrmeir, L., Kneib, T., Lang, S., & Marx, B. (2009). Regression—models, methods and applications. Springer.
Hanke, J. E., & Wichern, D. (2013). Business forecasting (9th ed.). Prentice-Hall.
Härdle, W., & Simar, L. (2012). Applied multivariate analysis. Springer.
Stigler, S. M. (1986). The history of statistics. Harvard University Press.
3 Analysis of Variance

Contents

3.1 Problem  148
3.2 Procedure  150
  3.2.1 One-way ANOVA  151
    3.2.1.1 Model Formulation  151
    3.2.1.2 Variance Decomposition and Model Quality  155
    3.2.1.3 Statistical Evaluation  158
    3.2.1.4 Interpretation of the Results  163
  3.2.2 Two-way ANOVA  166
    3.2.2.1 Model Formulation  166
    3.2.2.2 Variance Decomposition and Model Quality  172
    3.2.2.3 Statistical Evaluation  176
    3.2.2.4 Interpretation of the Results  178
3.3 Case Study  178
  3.3.1 Problem Definition  178
  3.3.2 Conducting a Two-way ANOVA with SPSS  180
  3.3.3 Results  181
    3.3.3.1 Two-way ANOVA  181
    3.3.3.2 Post-hoc Tests for the Factor Placement  187
    3.3.3.3 Contrast Analysis for the Factor Placement  187
  3.3.4 SPSS Commands  190
3.4 Modifications and Extensions  191
  3.4.1 Extensions of ANOVA  192
  3.4.2 Covariance Analysis (ANCOVA)  193
    3.4.2.1 Extension of the Case Study and Implementation in SPSS  194
    3.4.2.2 Two-way ANCOVA with Covariates in the Case Study  194
  3.4.3 Checking Variance Homogeneity Using the Levene Test  197
3.5 Recommendations  199
References  201


3.1 Problem

Both science and business practice are often confronted with questions regarding the
most suitable actions to achieve a certain goal. A test of the effectiveness of different
measures can be carried out by defining alternative measures (e.g. different advertis-
ing concepts) and then applying them in different groups. If the measures implemented
in the different groups lead to different results for the variable of interest, this can be
seen as an indication that the different measures influence the target variable in different
ways. Of course, this only applies if the different groups are comparable in their struc-
ture and the only difference between the groups is the measure applied to the groups.
The analysis of variance (ANOVA) is the most important statistical procedure for the
analysis and statistical evaluation of such situations. In the simplest case, it examines the
effect of one or more independent variables on a dependent variable. Thus, a presumed
causal relationship is investigated, which can be formally represented as:
 
Y = f(X1, X2, …, Xj, …, XJ)
Here, the independent variables (Xj) are nominally scaled (categorical) variables, which
can occur in different forms, while the dependent variable is always measured at the met-
ric scale level. Table 3.1 shows a number of examples from different fields of application
with the assumed direction of effect.
The example from marketing analyzes whether advertising as an independent varia-
ble (with the three states “internet”, “poster” and “newspaper”) has an influence on the
dependent variable “number of visitors”. In the example from education, it is investi-
gated whether the teaching methodology can change the grades in a school subject. In
all examples, the independent variables always comprise alternative states (factor levels)
that are assumed to influence the metrically measurable dependent variables (e.g. number
of visitors, annual sales, image, recovery time, grades) in different ways.
Another common feature of the examples in Table 3.1 is that they describe experi-
mental situations. Experiments are a classical instrument for the empirical investigation
of causal hypotheses. The researcher actively intervenes in an experiment by systemati-
cally varying (manipulating) the independent variables and then measuring the effects on
the dependent variable. The independent variables (X) are thus subjected to systematic
“treatment” by the user, which is why they are often referred to as the experimental fac-
tor. In contrast, the dependent variable (Y) is often referred to as the measured parameter
or criterion variable. Table 3.2 gives an overview of the different designations of the var-
iables used in ANOVAs that are common in the literature. In this chapter, an independ-
ent variable is consistently referred to as a “factor” and its alternative states as “factor
levels”.
The specific variation of the factors (treatment, manipulation) is based on theoretical
or logical considerations of the user. These are reflected in the so-called experimental
design. The experimental design is intended to ensure that the formed groups are equal

Table 3.1  Application examples of ANOVA in different disciplines


Field of application Exemplary research questions
Agriculture A farm wants to test the effectiveness of three different fertilizers in rela-
tion to soil quality. For this purpose, yield and stalk length for a given crop
are investigated in fields with different soil conditions, each of which has
three different fertilizer segments
Business A company is testing three different brand names in two different sales
channels and wants to determine the effects these two marketing instru-
ments have on annual sales, in isolation and together
Healthcare It is assumed that diets contribute to the reduction of obesity to different
degrees. To test this, five (comparable) groups of overweight people are
formed, each of whom is to follow one of the five tested diets for six
months. If after six months the average weight reduction (in kg) in the five
groups of persons is different, the assumption is considered confirmed
Marketing A cinema manager would like to know whether advertising via the inter-
(Advertisement) net, posters or newspapers leads to different effects regarding the number
of visitors. To check this, only one form of advertising is carried out at a
time and the number of visitors is recorded
Medicine There are various therapeutic methods to treat a disease. The clinic wants
to know whether the different therapies have a different effect on the
recovery time and which therapy leads to the fastest recovery
Pedagogy The aim is to investigate whether the choice of teaching methods can influ-
ence the performance of students. Therefore, the same subject is taught
in several classes of one educational level with three different teaching
methods for six months. Afterwards, the grades of the school classes are
compared
Psychology It is to be investigated whether people with comparable life situations
living in rural areas have a different sense of happiness compared to people
living in big cities
Sociology A sociological institute wants to find out whether the choice of a hobby
influences the reputation of a person. Comparable groups of people pur-
suing different hobbies are therefore investigated and the reputation of the
different groups is compared
Technology Four different production technologies are available. By using the four
technologies in comparable production environments it is examined
whether the technologies influence productivity in different ways

(do not differ systematically). Only under this condition can the differences in test out-
comes be unambiguously attributed to the differences in treatments (combinations of
factor levels). To meet this condition, the application of ANOVA requires that the test
objects are randomly assigned to the different factor levels. The average measured val-
ues of the criterion variable in the different groups are then compared with each other.
If significant differences occur, this is taken as an indicator that the factor levels do have

Table 3.2  Alternative names of the variables in ANOVAs


Dependent variable (Y): explained variable, target variable, research variable, measured parameter, criterion variable. Scale: metric
Independent variable (X): (experimental) factor. Scale: nominal. Different manifestations of a nominal variable: factor levels, categories, groups

an influence on the dependent variable. If, on the other hand, this is not the case, this
is taken as an indicator that the different factor levels influence the dependent variable
differently.
If a factor with only two factor levels is considered, checking whether the depend-
ent variable differs for the two factor levels corresponds to a simple test for differences
between the means. If, however, a factor has three or more factor levels or if two or more
factors are considered, the (simultaneous) testing for differences between the means is
no longer possible and an ANOVA is required. The term “analysis of variance” can be
traced back to the fact that the scatter (= variance) within the different groups as well as
between the groups is included in the test value (cf. Sect. 3.2.1.3).
The term “analysis of variance” is not only used to describe a specific procedure, but
also as a collective term for different variants of ANOVA. The differences between these
procedures are briefly described in Sect. 3.4.1 (see Table 3.15). Here, it may suffice to
point out that univariate (i.e. one dependent variable) analyses of variance (ANOVAs)
can be described as one-, two-, three-way (or one-, two-, three-factorial) etc. ANOVAs,
depending on the number of factors considered. This chapter focuses on the one- and
two-way ANOVA.

3.2 Procedure

In the following, the basic principle of ANOVAs is first explained using the example
of a univariate ANOVA with one dependent and one independent variable (one-way
ANOVA). The considerations are then extended to two-way ANOVAs (two independent
variables). In both cases, the procedure is divided into four steps which are shown in
Fig. 3.1.
In the first step, the model is formulated and central preliminary considerations for
the implementation of ANOVA are presented. The model formulation varies depending
on the chosen type of ANOVA (one-, two-, three-factorial, etc.). The second step encom-
passes the analysis of variation which is the basic principle of ANOVA. The more factors
are considered and the more factor levels they comprise, the more sources of variation
there are. Based on these considerations, the third step shows how to statistically test

Fig. 3.1 Four-step procedure of the ANOVA: (1) Model formulation, (2) Variance decomposition and model quality, (3) Statistical evaluation, (4) Interpretation of the results

whether differences between the mean values of the factor levels are significant and how
to assess the quality of the formulated variance model. Finally, the fourth step serves to
interpret the results and to examine follow-up questions arising from the results.

3.2.1 One-way ANOVA

3.2.1.1 Model Formulation


The ANOVA model formulates how a certain observed value i of the dependent variable
(Y), which originates from a certain group (factor level) g of the independent variable
(X), can be reproduced.

Stochastic model of ANOVA


Any analysis of variance is usually based on sample data. If we repeat the sampling, we
will get other data and another ANOVA will yield other estimates for the same problem.
Therefore, the primary aim of ANOVA cannot be to describe the data in the sample but
to draw conclusions from the sample about the population from which the sample was
drawn.1 Thus, the stochastic model of ANOVA covers the randomness inherent in the
sample data. In its generic form, the model can be expressed by:
ygi = µ + αg + εgi (3.1)

1 This is called inferential statistics and has to be distinguished from descriptive statistics.
Inferential statistics makes inferences and predictions about a population based on a sample drawn
from the studied population.

with
ygi observed value i (i = 1, 2, …, N) of the dependent variable in factor level g (g = 1,
2, …, G)
μ total mean of the population (global expected value)
αg true effect of factor level g (g = 1, 2, …, G)
εgi error term (disturbance)

The following applies:

αg = μg – μ
with μg = mean for factor level g in the population

The effects of the different factor levels are reflected in the deviations (αg) of the mean
y values of these factor levels or groups of factors (μg) from the total expected mean μ.
These deviations reflect whether the factor levels have an effect on the dependent varia-
ble and thus provide an explanation for the measured values of the dependent variable.
This component of the model is called the systematic component.
In contrast, the error term εgi expresses measurement errors and the effect of variables
not considered in the model. It is assumed that all groups have approximately the same
degree of disturbance. This component is called the stochastic component.
The model of the ANOVA is thus composed of a systematic component and a stochas-
tic component, which are linearly linked to each other.

Application example for one-way ANOVA


To illustrate the procedure and its implications, we will examine the following exam-
ple: The manager of a supermarket chain wants to check the effect of different types
of product placement (factor) on chocolate sales. To do this, the manager selects three
types of placement (factor levels):

1. Candy section
2. Special placement
3. Cash register

The manager develops the following experimental design: Out of the largely com-
parable 100 shops of his supermarket chain, he randomly selects 15 supermar-
kets. Subsequently, each one of the three placements is implemented in 5 stores,
also selected at random, for one week. At the end, the chocolate sales (explained
variable) per supermarket are recorded in “kilograms per 1000 checkout transac-
tions”. Table 3.3 shows the three subsamples (g) with the chocolate sales in the five

Table 3.3  Data of the example: chocolate sales in 15 supermarkets (output data ygi)

Placement            Chocolate sales in 15 supermarkets (kg per 1000 transactions)
Candy section        47  39  40  46  45
Special placement    68  65  63  59  67
Cash register        59  50  51  48  53

Table 3.4  Average chocolate sales per type of placement

Type of placement     Average sales per placement
Candy section         ȳ1 = 43.40
Special placement     ȳ2 = 64.40
Cash register         ȳ3 = 52.20
Total mean            ȳ = 53.33

supermarkets (i = 1, …, 5 for each type of placement). As each of the three groups com-
prises the same number of observations, the design is called “balanced”.2 ◄

The means of the three groups and the total mean of the data in the example are listed
in Table 3.4. These values are necessary to perform the ANOVA because this procedure
analyzes the differences between the means. The variances of the observed values around
these mean values play a decisive role in ANOVA since the unknown true mean values
μg of the factor levels can be estimated from the means of the observed values.
An analysis should always start with an illustration of the data.3 Box plots of the out-
put data, as shown in Fig. 3.2, are particularly suitable for comparing several samples.
Each boxplot represents one of the three subsamples (type of placement), describing its
position and the statistical variation. The line inside the box marks the median, and the
box indicates the range in which the middle 50% of the observations are found. The whiskers mark the
span of the data (maximum and minimum value, with the exception of outliers).
Figure 3.2 shows that there are differences between the sales volumes of the various
placement groups. Therefore, one can conclude that the type of placement has an influ-
ence on chocolate sales. For the moment, however, this is considered a hypothesis which
will have to be tested by an ANOVA.

2 In our example, only 5 observations per group and thus a total of 15 observations were chosen in
order to make the subsequent calculations easier to understand. The literature usually recommends
a minimum of 20 observations per group.
3 On the website www.multivariate-methods.info we provide supplementary material (e.g., Excel

files) to deepen the reader’s understanding of the methodology.




Fig. 3.2 Box plots of the average chocolate sales in the three placements

Since the sales volumes differ even for the same placement (variation within a group),
there must be other influencing factors in addition to placement. In general, there are
always many different influences, most of which cannot be observed.
In the model of ANOVA, these influences are represented by random variables (error
variables) ϵgi, which are contained in the observations ygi. The effects of the factor levels
on the sales quantities in our example result from the deviations of the group mean val-
ues from the total mean value. Since the model in Eq. (3.1) has G + 1 unknown param-
eters with only G categories, an auxiliary condition (reparameterization condition) is
required for the purpose of unambiguous determination (identifiability). As one possibil-
ity, it can be assumed that the effects cancel each other out, so that only their scaling is
affected. In this case the following applies:
G

αg = 0 (3.2)
g=1

Alternatively, one of the categories could be chosen as the reference category and its
effect set to zero.
The effects of the three forms of placement can be estimated by using the different
mean values. The “true” effect of placement g in the population (αg) is estimated by the

observed difference between the group mean and the overall mean (ag). Thus, the follow-
ing applies:
ag = (ȳg − ȳ)    (3.3)
with

ȳg = (1/N) · Σi ygi    Group mean (sum over i = 1, …, N)    (3.4)

ȳ = (1/(G · N)) · Σg Σi ygi    Total mean (sums over g = 1, …, G and i = 1, …, N)    (3.5)

Entering the values of Table 3.4 in Eq. (3.3) leads to the following deviations between
the group means and the overall mean:
a1 = (ȳ1 − ȳ) = 43.40 − 53.33 = −9.93   Candy section
a2 = (ȳ2 − ȳ) = 64.40 − 53.33 = 11.07   Special placement
a3 = (ȳ3 − ȳ) = 52.20 − 53.33 = −1.13   Cash register
The sum of the effects is zero (except for rounding errors). The special placement has the
strongest positive effect, while average sales are lowest in the candy section, with a value
of −9.93 (group average of 43.40).
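
These estimates can be retraced with a few lines of Python (numpy only; the dictionary layout is just for illustration):

import numpy as np

sales = {
    "candy section":     np.array([47, 39, 40, 46, 45], dtype=float),
    "special placement": np.array([68, 65, 63, 59, 67], dtype=float),
    "cash register":     np.array([59, 50, 51, 48, 53], dtype=float),
}

grand_mean = np.mean(np.concatenate(list(sales.values())))   # 53.33
for name, values in sales.items():
    effect = values.mean() - grand_mean                      # a_g = group mean - total mean
    print(f"{name}: group mean = {values.mean():.2f}, effect a_g = {effect:+.2f}")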
Despite these differences, the question remains whether the inferred effects were actu-
ally caused by the way the chocolate was placed. Due to the presence of non-observable
influencing variables (error terms), the estimated effects could possibly also have arisen
purely by chance. This question can be answered by a so-called variance decomposition.

3.2.1.2 Variance Decomposition and Model Quality


Variance decomposition as a basic principle of ANOVA


To explain the variation of the observed values of the dependent variable, ANOVA
decomposes the variance of a dependent variable into a part that is explained by the
model of ANOVA (Eq. 3.1) and a part that is not explained. The explained part is attrib-
uted to the effect of the independent variable under consideration.
In the case of one-way ANOVAs, it is assumed that the differences between the group
means are caused by the factor levels and can thus be explained by them (systematic
component). In contrast, the variances within a group (factor level) are not explained by

the formulated model of ANOVA (random component). The basic principle here is the
decomposition of the variation between the observed values ygi and the total mean y (cf.
also the general relationship shown in Eq. 3.6).
total variation = explained variation + unexplained variation
SSt(otal) = SSb(etween) + SSw(ithin) (3.6)
Σg Σi (ygi − ȳ)²  =  Σg N · (ȳg − ȳ)²  +  Σg Σi (ygi − ȳg)²    (g = 1, …, G; i = 1, …, N)
Here, SS stands for ‘sum of squares’ and reflects the various squared deviations from the
mean value. Accordingly, SSt stands for the total variation within a dataset. SSb reflects
the scatter (variance) between groups that can be explained by the formulated model, and
SSw corresponds to the scatter within a group that cannot be explained by the formulated
model. If the scatter decomposition in Eq. (3.6) is applied to the dataset in Table 3.3, the
data in Table 3.5 are obtained.
For our example, Table 3.5 shows that of the total variation of chocolate sales
(SSt = 1287.33) in the experiment, a variation of SSb = 1112.13 can be explained by the
type of placement, while SSw = 175.20 remains unexplained.

Table 3.5  Calculation of the squared deviations (sum of squares)


                      SSt: (ygi − ȳ)²          SSb: (ȳg − ȳ)²            SSw: (ygi − ȳg)²

Candy section         (47 − 53.3)² = 40.11     (43.4 − 53.3)² = 98.67    (47 − 43.4)² = 12.96
                      (39 − 53.3)² = 205.44    (43.4 − 53.3)² = 98.67    (39 − 43.4)² = 19.36
                      (40 − 53.3)² = 177.78    (43.4 − 53.3)² = 98.67    (40 − 43.4)² = 11.56
                      (46 − 53.3)² = 53.78     (43.4 − 53.3)² = 98.67    (46 − 43.4)² = 6.76
                      (45 − 53.3)² = 69.44     (43.4 − 53.3)² = 98.67    (45 − 43.4)² = 2.56
Special placement     (68 − 53.3)² = 215.11    (64.4 − 53.3)² = 122.47   (68 − 64.4)² = 12.96
                      (65 − 53.3)² = 136.11    (64.4 − 53.3)² = 122.47   (65 − 64.4)² = 0.36
                      (63 − 53.3)² = 93.44     (64.4 − 53.3)² = 122.47   (63 − 64.4)² = 1.96
                      (59 − 53.3)² = 32.11     (64.4 − 53.3)² = 122.47   (59 − 64.4)² = 29.16
                      (67 − 53.3)² = 186.78    (64.4 − 53.3)² = 122.47   (67 − 64.4)² = 6.76
Cash register         (59 − 53.3)² = 32.11     (52.2 − 53.3)² = 1.28     (59 − 52.2)² = 46.24
                      (50 − 53.3)² = 11.11     (52.2 − 53.3)² = 1.28     (50 − 52.2)² = 4.84
                      (51 − 53.3)² = 5.44      (52.2 − 53.3)² = 1.28     (51 − 52.2)² = 1.44
                      (48 − 53.3)² = 28.44     (52.2 − 53.3)² = 1.28     (48 − 52.2)² = 17.64
                      (53 − 53.3)² = 0.11      (52.2 − 53.3)² = 1.28     (53 − 52.2)² = 0.64
Sum                   SSt = 1287.33            SSb = 1112.13             SSw = 175.20

As an example, we will here determine the variance decomposition for the first obser-
vation value in supermarket 2 for the special placement (y21). According to Table 3.3,
this value is y21 = 68. Using the mean values in Table 3.4, the following calculations can
now be made:
The deviation from the total mean is
y21 − y = 68 − 53.3 = 14.7
The deviation
y2 − y = a2 = 11.1
can be explained by the effect of different placements, while the deviation of
y21 − y2 = 68 − 64.4 = 3.6
cannot be explained by placement.
Therefore, the relationship shown in Table 3.6 applies. This relationship is presented
graphically in Fig. 3.3 for the observation values y21 = 68 and y13 = 40 (cf. Table 3.3).
The equation in Table 3.6 also applies when the elements are squared and summed
over the observations (SS = sum of squares).

Explanatory power of ANOVA


Based on the variance decomposition, the strength of the effect can now be assessed.
For this purpose, we calculate which part of the total variation is explained by the model
(in our example, this is the placement of the chocolate). This proportion is called eta-
squared and relates the variation explained by the model to the total variation.
Eta-squared = explained variation / total variation = SSb / SSt    (3.7)
The value of eta-squared ranges between zero and one. The higher the proportion of
explained variation relative to the total variation, the larger eta-squared. A high value of
eta-squared indicates that the estimated model explains the sample data well.
Based on the values in Table 3.5, the eta-squared value of the overall model is
(1112.13/1287.33 =) 0.864. This means that 86.4% of the variation in sales volume can
be explained by the placement. Only (1 – 0.864 =) 13.6% remain unexplained and must
be attributed to interferences. Eta-squared corresponds to the coefficient of determination
(R-squared) of regression analysis (cf. Chap. 2), which also indicates the proportion of
the variation explained by the formulated regression model.
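
The variance decomposition of Eq. (3.6) and eta-squared can be reproduced with a short Python sketch (numpy only):

import numpy as np

groups = [np.array([47, 39, 40, 46, 45], dtype=float),   # candy section
          np.array([68, 65, 63, 59, 67], dtype=float),   # special placement
          np.array([59, 50, 51, 48, 53], dtype=float)]   # cash register

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

SSt = np.sum((all_obs - grand_mean) ** 2)                          # about 1287.33
SSb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # about 1112.13
SSw = sum(np.sum((g - g.mean()) ** 2) for g in groups)             # about 175.20

print(round(SSt, 2), round(SSb, 2), round(SSw, 2))
print("eta-squared =", round(SSb / SSt, 3))                        # about 0.864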

Table 3.6  Decomposition of the total deviation in the application example


y21 − ȳ  =  (ȳ2 − ȳ)  +  (y21 − ȳ2)
14.7     =  11.1      +  3.6
Total deviation  =  explained deviation  +  unexplained deviation


Fig. 3.3 Explained and unexplained deviations of two exemplary observation values

3.2.1.3 Statistical Evaluation


A high eta-squared value indicates that the estimated model explains the sample data
well. However, this does not mean that this statement also applies to the total popula-
tion. ANOVA uses F-statistics to test the statistical significance of a model. The research
question here is: Does the factor under consideration have an effect (αg) that helps to
explain the variation of the dependent variable (y)? The answer to this question is pro-
vided by an F-test.
In the following, the steps of a classical F-test are first described in general terms and
then applied to the extended example described in Sect. 3.2.1.1 (Table 3.3). The results
lead to an ANOVA table which is important for the analysis of variance.

F-test
1. State a null hypothesis
2. Calculate the empirical F-value Femp
3. Choose an error probability α (significance level) and make a
decision
4. Interpretation

Fig. 3.4 Steps of an F-test

Since the F-test assumes that the variances within the groups are homogeneous, this
assumption, called variance homogeneity, will be checked at the end of this section
(Fig. 3.4).

F-Test
The classical F-test can be summarized by the following steps (Fig. 3.4).4

Step 1
First, the null hypothesis of the F-test has to be formulated. It should be noted that dif-
ferent formulations of the null hypothesis exist in the literature, but they are identical in
their statements.

Option a:
For the stochastic model formula of the ANOVA in Eq. (3.1), the null hypothesis states
that all factor steps have an effect of zero (αg = 0) on the dependent variable and thus
have no influence on the dependent variable:
H0: α1 = α2 = … = αG = 0
H1: at least two αg ≠ 0    (3.8)

Option b:
A second formulation of the null hypothesis states that all group mean values are identi-
cal. This also means that the factor levels have no effect on the dependent variable, since
they do not lead to differences in the group mean values.
H0: μ1 = μ2 = … = μG
H1: at least two μg differ    (3.9)

4 For a brief summary of the basics of statistical testing see Sect. 1.3.

Regarding variant b, it should be noted that the ANOVA model is also formulated dif-
ferently here compared to Eq. (3.1). The following applies:
ygi = μg + εgi    (3.10)
with
ygi observed value i (i = 1, …, N) in factor level g (g = 1, 2, …, G)
μg mean for factor level g in the population (expected value)
ϵgi error term (disturbance)
Option c:
The null hypotheses formulated in Eqs. (3.8, 3.9) are both equivalent to the statement
that the variation of the examined factor between the groups does not differ from the
variation within the groups. Accordingly, the null hypothesis could also be written in the
following form:
H0: SSb = SSw   or   SSb/SSw = 1
H1: SSb ≠ SSw   or   SSb/SSw > 1    (3.11)
Step 2
In the second step, the F-statistic (Femp) has to be calculated. Due to the relation in
Eq. (3.6) all three variants of the null hypothesis result in the following test statistic:
Femp = explained variance / unexplained variance = [SSb / (G − 1)] / [SSw / (G · (N − 1))] = MSb / MSw    (3.12)
The test variable follows an F-distribution and relates the variance between the groups
to the variance within the groups. The variances are calculated by dividing the variations
(SS) by their respective degrees of freedom (df).5 These variances are called mean square
deviations (MS). The name “analysis of variance” also refers to the scatter decomposi-
tion and the ratio of two variances considered in the F-test.
The stronger the experimental effects, the larger the F-value. If the interferences
are small, even the smallest effects can be proven to be significant (i.e., caused by the
factor). However, the stronger the interferences, the larger the (unexplained) variance
in the denominator and the more difficult it gets to prove significance. To use an anal-
ogy: the louder the environmental noise, the louder you have to shout to be understood.
Communications engineering refers to this as the signal-to-noise ratio.

Step 3
Once the empirical F-value has been calculated, it has to be compared with the theoret-
ical F-value (Fα) found in an F-table. The magnitude of Fα is determined by the error
probability α (significance level) chosen by the user and the degrees of freedom in the

5 For a detailed explanation on degrees of freedom (df) see Sect. 1.2.1.




Fig. 3.5 F-distribution and p-value

numerator and denominator of the test value. The decision to reject the null hypothesis is
determined by the following rules:
Femp > Fα → H0 is rejected → the relationship is significant
Femp ≤ Fα → H0 cannot be rejected
The error probability α is the probability that the null hypothesis is rejected even
though it is correct (type I error). The smaller α, the greater the user’s effort not to
make an error when rejecting the null hypothesis. Usually a value of α = 0.05 or 5% is
chosen.6
When using statistical software packages, the decision to reject the null hypothesis is
usually based on the so-called p-value (probability value). It is derived on the basis of the
empirical F-value and indicates the probability that the rejection of the null hypothesis
is a wrong decision. The greater Femp, the smaller p. Figure 3.5 shows the p-value as a
function of Femp. In SPSS, the p-value is called “significance” or “sig”. When using the
p-value, the null hypothesis is rejected if the following applies: p < α

6 The user can also choose other values for α. However, α = 5% is a kind of “gold” standard in
statistics and goes back to R. A. Fisher (1890–1962) who developed the F-distribution. However,
the user must also consider the consequences (costs) of a wrong decision when making a decision.

Table 3.7  ANOVA table for the one-way ANOVA in our application example

Source                  SS                                  df                MS
Between factor levels   SSb = Σg N · (ȳg − ȳ)² = 1112.13    G − 1 = 2         MSb = SSb / (G − 1) = 556.07
Within factor levels    SSw = Σg Σi (ygi − ȳg)² = 175.20    G · (N − 1) = 12  MSw = SSw / (G · (N − 1)) = 14.60
Total                   SSt = Σg Σi (ygi − ȳ)² = 1287.33    G · N − 1 = 14    SSt / (G · N − 1) = 91.95

Step 4
If the null hypothesis is rejected on the basis of the F-test, it can be concluded that there
are statistically significant differences between the group mean values or between the
variations between and within the groups. This means that the factors considered in the
model do have a significant influence on the dependent variable.

ANOVA table and F-test for our application example


To carry out the F-test in our example, the test value of the F-test can be calculated using
the sum of squares (SS) in Table 3.5. The result is shown in Table 3.7 also referred to as
the “ANOVA table”.
With the values in Table 3.7, the following value is obtained for the F-test Eq. (3.12)
in our example:
Femp = MSb / MSw = 556.07 / 14.60 = 38.09
For an assumed error probability α = 0.05 and df1 = G − 1 = 2 and df2 = G · (N − 1) = 12
degrees of freedom (numerator and denominator), the F-table yields Fα = 3.89. Thus,
Femp = 38.09 > Fα and the null hypothesis has to be rejected.
In our example, Femp = 38.09 results in a value of p = 0.0000064.7 This means that
rejecting the null hypothesis of the F-test is a wrong decision with a probability of
p = 0.0000064. The null hypothesis (stating that the factor levels do not have any influ-
ence on the sales quantity) can therefore be rejected and the validity of the alterna-
tive hypothesis (that the factor levels do have an influence) can be assumed. Since the
p-value is almost zero here, an effect of placement on the sales quantity can definitely be
assumed. This is also confirmed by the box plots in Fig. 3.2.

7 The p-value can also be calculated with Excel by using the function F.DIST.RT(Femp;df1;df2). For
our application example, we get: F.DIST.RT(38.09;2;12) = 0.0000064 or 0.00064%. The reader
will also find a detailed explanation of the p-value in Sect. 1.3.1.2.
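
The F-statistic, the critical value and the p-value can be reproduced with a few lines of Python (scipy.stats for the F-distribution), analogous to the Excel function mentioned in the footnote:

from scipy import stats

SSb, SSw = 1112.13, 175.20
G, N = 3, 5                        # 3 factor levels, 5 observations each

df1, df2 = G - 1, G * (N - 1)      # 2 and 12 degrees of freedom
MSb, MSw = SSb / df1, SSw / df2    # 556.07 and 14.60
F_emp = MSb / MSw                  # about 38.09

p_value = stats.f.sf(F_emp, df1, df2)    # about 0.0000064
F_crit  = stats.f.ppf(0.95, df1, df2)    # about 3.89 for alpha = 0.05
print(round(F_emp, 2), p_value, round(F_crit, 2))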

Variance homogeneity and normal distribution as central assumptions of ANOVA


The F-test assumes that the dependent variable is normally distributed in all factor lev-
els (groups) and that the variances in the groups (SSw) are approximately equal. The lat-
ter premise is also known as variance homogeneity or homoscedasticity. Since the SSw
reflect the disturbance variables (error term) of the ANOVA model, different SSw would
correspond to different magnitudes of the disturbance variables in the groups. However,
this would mean that a comparison of the effects in the groups would not be possible.
The Levene test (cf. Levene, 1960) can be used to test variance homogeneity. It is based
on the null hypothesis that the variances in the groups do not differ or that the error vari-
ance of the dependent variables is the same across the groups.8
Overall, ANOVA is considered relatively robust with respect to estimation errors if
the sample is large overall and the sample sizes of the groups are approximately equal
(Bray & Maxwell, 1985, p. 34; Perreault & Darden, 1975, p. 334). This can be assumed
if the ratio between the sample sizes in the largest and the smallest group is not greater
than 1.5 (Pituch & Stevens, 2016, p. 220). However, higher values (up to 4) are some-
times mentioned in the literature (cf. Moore, 2010, p. 645). If the size of the samples
in the groups were between 20 and 30 cases, for example, this rule would be fulfilled
(30/20 = 1.5), i.e. groups of about the same size could be assumed. However, if the ratio
of sample sizes is larger, cases should be randomly eliminated from the larger groups.
Since heteroscedasticity (i.e. no homoscedasticity) also affects the p-value of the F-test,
the smaller the p-value associated with Femp according to Eq. (3.12), the less dramatic
the violation of variance homogeneity. In the present example, p = 0.0000064, which
means a violation of variance homogeneity is not a problem. With a p-value around 0.05,
however, the situation would be more critical.
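
For illustration, the Levene test for the three placement groups can be run with a few lines of Python (scipy.stats.levene; center="mean" corresponds to the classical Levene test based on group means, whereas scipy's default, the median, is the Brown-Forsythe variant):

from scipy import stats

candy   = [47, 39, 40, 46, 45]
special = [68, 65, 63, 59, 67]
cash    = [59, 50, 51, 48, 53]

stat, p = stats.levene(candy, special, cash, center="mean")
print(f"Levene statistic = {stat:.2f}, p-value = {p:.3f}")
# A p-value above the chosen significance level (e.g., 0.05) means the null
# hypothesis of equal group variances cannot be rejected.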

3.2.1.4 Interpretation of the Results


The central results of the ANOVA are reflected in the ANOVA table (cf. Table 3.7). It
provides information on whether the factor has a significant influence on the dependent
variable and how large the explanatory contribution of the model is. The F-test used here
is based on a so-called omnibus hypothesis, i.e. it tests whether there are fundamental

8 Guidance on testing the assumption of multivariate normal distribution is given in Sect. 3.5.
A detailed description of the testing of variance homogeneity using the Levene test is given in
Sect. 3.4.3.

differences between the groups. However, it is not possible to tell from the test whether
only one, several or even all groups differ from each other and how large these effects
are. This is the case in our example, but the box plots in Fig. 3.2 also show that the dif-
ference between special placement and candy section is particularly large, while the dif-
ference between candy section and cash register placement is smaller. Thus, if the F-test
shows that a factor has a significant influence on the dependent variable, it cannot be
concluded from such a result that all factor level means (i.e., group means) are different
and thus all factor levels considered have a significant influence on the dependent varia-
ble. Rather, it is quite possible that several group means are the same and only one factor
level is different. For the user, however, an exact knowledge of the differences is often of
great interest. To analyze such differences, two situations must be distinguished:

a) Prior to the analysis (a priori), the user already has theoretically or factually logical
hypotheses as to where exactly mean differences in the factor levels exist. Whether
such presumed differences (contrasts) actually occur, can then be checked with the
help of a contrast analysis. Contrast analyses are thus confirmatory, i.e. hypothe-
sis-testing, analyses.
b) The user has no hypotheses about possible differences in the effect of the factor levels
and would therefore like to know, after a significant F-test (a posteriori), where empir-
ically significant differences in means are to be found. For this purpose, he can resort
to so-called post-hoc tests. Their application is therefore exploratory, i.e. hypothe-
sis-generating, and is carried out ad hoc.

Multiple comparison tests for a priori assumptions: contrast analysis


The contrast analysis of the ANOVA provides information on the extent to which the
means of the dependent variables differ at different factor levels of an independent vari-
able. A contrast analysis could be applied in our example if, for example, market studies
had shown a large effect of the special placement on chocolate sales and the supermarket
manager now suspects this effect in his placements as well. To check this, the candy sec-
tion and the cash register placement would be combined into one group, which can be
done by averaging. The chocolate sales in the special placement could then be contrasted
(compared) with this group. For our example, the contrast could be calculated as follows
(using the mean values in Table 3.4):
Contrast = (0.5 · 43.4 + 0.5 · 52.2) – 1 · 64.4 = −16.6
The contrast value indicates that the average chocolate sales in the candy section
and at the checkout are 16.6 kg per 1000 checkout transactions lower than at the spe-
cial placement. In this example, the following contrast coefficients were chosen: Candy
section = 0.5; special placement = −1.0; cash register placement = 0.5. Depending on the
a priori assumptions, other weightings may be defined by selecting the respective con-
trast coefficients. Whether the calculated contrast value is significant can be checked by
a contrast test, which is based on the null hypothesis that no significant differences exist.
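The calculation can be illustrated with a small Python sketch (an illustrative supplement, not part of the book's SPSS workflow). It computes the contrast value from the sales data of the one-way application example (cf. Table 3.3; group means as in Table 3.4) and tests it with a contrast t-test based on the pooled within-group variance; all variable names are chosen freely for this example.

import numpy as np
from scipy import stats

# Sales data of the one-way example (cf. Table 3.3; group means as in Table 3.4)
candy   = np.array([47, 39, 40, 46, 45])    # mean 43.4
special = np.array([68, 65, 63, 59, 67])    # mean 64.4
cash    = np.array([59, 50, 51, 48, 53])    # mean 52.2
groups  = [candy, special, cash]

weights = np.array([0.5, -1.0, 0.5])        # contrast coefficients (sum to zero)
means   = np.array([g.mean() for g in groups])
contrast = float(weights @ means)           # 0.5*43.4 - 1*64.4 + 0.5*52.2 = -16.6

# Pooled within-group variance (MSw) of the one-way ANOVA as error term
n    = np.array([len(g) for g in groups])
ss_w = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_w = int(n.sum()) - len(groups)
ms_w = ss_w / df_w

# t-test for the contrast (assuming variance homogeneity)
se = np.sqrt(ms_w * (weights ** 2 / n).sum())
t  = contrast / se
p  = 2 * stats.t.sf(abs(t), df_w)
print(f"contrast = {contrast:.1f}, t = {t:.2f}, p = {p:.4f}")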

Comparative multiple-group tests based on significant F-tests: post-hoc test


Post-hoc tests are performed whenever the F-test of an ANOVA has led to a significant
result and the user wants to know (ex-post; a posteriori)—without any prior hypothe-
ses—which factor levels are responsible for the differences in the means.
In order to find differences, a simple solution would be to combine two factor lev-
els with the help of a t-test and test for significant differences between the mean values.
However, the problem that arises is an accumulation of type I errors (α error).9 This is
also known as alpha error inflation. If there are, for example, five factor levels, '5 choose 2' = 10 different t-tests would have to be performed to test all pairwise combinations of factor levels for mean differences. This is referred to as multiple testing, since the
same null hypothesis is examined with several tests. With the number of test repetitions,
the probability that a difference appears to be significant increases, even if in reality none
of the differences is significant. With α = 0.05 and 10 test repetitions this probability is
around 40%. The alpha error must therefore be corrected in such a way that the desired
error probability (e.g., 5%) is retained in the result via the comparison tests.
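The figures above can be verified with a short, illustrative Python sketch; it computes the number of pairwise comparisons, the resulting familywise error probability and the per-test level implied by a Bonferroni correction (the Bonferroni test itself is discussed below).

import math

alpha, k_levels = 0.05, 5
m = math.comb(k_levels, 2)            # number of pairwise t-tests: 10
familywise = 1 - (1 - alpha) ** m     # probability of at least one false rejection (~0.40)
alpha_per_test = alpha / m            # per-test level implied by a Bonferroni correction
print(m, round(familywise, 2), alpha_per_test)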
Alpha-error inflation may be avoided by so-called post-hoc tests, which are based on
the null hypothesis that there is no difference between the group means of a factor. Post-
hoc tests are carried out in case an ANOVA leads to a significant F-test (ex post) but no
factually logical hypotheses on specific mean differences were available beforehand. In
this case, pairwise mean value comparisons between the factor levels are carried out and
it is checked whether the mean values differ significantly. Post-hoc tests are only defined
for factors with three or more factor levels, since for two factor levels the classical test
for differences in mean values (t-test) can be performed.
In the literature, many variants of post-hoc tests are discussed, which differ, for exam-
ple, according to whether there is variance homogeneity between the groups or whether
the number of cases in the groups is the same (cf. Shingala & Rajyaguru, 2015, p. 22).10
This assumption can be checked with the help of the Levene test (see Sect. 3.4.3). All
tests first calculate the pairwise mean differences between the factor levels. They then
check whether the differences are significant. First, the p-value is shown and then, a 95%
confidence interval is given for the difference values. In the case of variance homogene-
ity, the following tests are widely used in practice:

• The Bonferroni test carries out the pairwise comparisons between the group mean val-
ues on the basis of t-tests.
• The Scheffé test performs pairwise comparisons simultaneously for all possible pair-
wise combinations of the mean values using the F-distribution.
• The Tukey test performs all possible pairwise comparisons between the groups using
the t-distribution.

9 The alpha error reflects the probability of rejecting the null hypothesis although it is true. For type
I and type II errors, refer to the basics of statistical testing in Sect. 1.3.
10 SPSS offers a total of 18 variants of post-hoc tests. See Fig. 3.16 in Sect. 3.3.3.2.

Table 3.8  Comparison of the group mean values by a Bonferroni post-hoc test

Placement (I)                       Placement (J)       Mean (J)   Mean Difference (I-J)   Sig.
Candy section (ȳ1 = 43.4)           Special placement   64.4       −21.0                   0.000
                                    Cash register       52.2       −8.8                    0.010
Special placement (ȳ2 = 64.4)       Candy section       43.4       21.0                    0.000
                                    Cash register       52.2       12.2                    0.001
Cash register (ȳ3 = 52.2)           Candy section       43.4       8.8                     0.010
                                    Special placement   64.4       −12.2                   0.001

Apart from the assumed distributions, the tests differ mainly with regard to the correction
of alpha error inflation. Since the tests determine the overall error rate in different ways,
their results differ with regard to the confidence intervals shown.
For the application example, Fig. 3.2 already showed that there are differences
between the mean values of the three factor levels (types of placement). In addition,
Table 3.8 shows the mean value differences for the example calculated by the post-hoc
tests. The last column reports the significance level for the Bonferroni test. It can be seen
that all placement combinations differ significantly.
We can conclude that all three forms of placement have a significant impact on choc-
olate sales. The differences between the group means in our example are also apparent
from the box plots in Fig. 3.2.
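As an illustrative sketch (not the SPSS procedure itself), the pairwise comparisons can be reproduced in Python with Bonferroni-adjusted t-tests on the data of the one-way example (cf. Table 3.3). Note that SPSS's Bonferroni post-hoc test uses the pooled within-group variance of all groups as the error term, so the p-values differ slightly from those in Table 3.8.

import numpy as np
from itertools import combinations
from scipy import stats

# Sales data of the one-way example (cf. Table 3.3)
groups = {"candy section":     np.array([47, 39, 40, 46, 45]),
          "special placement": np.array([68, 65, 63, 59, 67]),
          "cash register":     np.array([59, 50, 51, 48, 53])}

pairs = list(combinations(groups, 2))
for a, b in pairs:
    t, p = stats.ttest_ind(groups[a], groups[b], equal_var=True)
    p_adj = min(p * len(pairs), 1.0)   # Bonferroni: multiply the p-value by the number of comparisons
    diff = groups[a].mean() - groups[b].mean()
    print(f"{a} vs. {b}: mean difference = {diff:+.1f}, adjusted p = {p_adj:.3f}")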

3.2.2 Two-way ANOVA

3.2.2.1 Model Formulation

1 Model formulation

2 Variance decomposition and model quality

3 Statistical evaluation

4 Interpretation of the results

In two-way ANOVAs, two factors (independent variables) are considered simultaneously,


each of which can have two or more different factor levels. It is more efficient to exam-
ine two or more factors at the same time, rather than to conduct a separate examination
for each factor. The simultaneous variation of two or more factors is called a factorial
design. Apart from reasons of efficiency, the extension of the ANOVA model to several
factors has further advantages:

• Interactions between the factors can be investigated.


• The unexplained variance can be reduced, thus facilitating the detection of factor
effects.

Stochastic model of two-way ANOVAs


The stochastic model of a two-way ANOVA also consists of a systematic and a sto-
chastic component ε. The systematic component contains the explanatory contributions
resulting from the effectiveness of the two factors. The model of a two-way ANOVA
with interaction effects has the following form:
yghi = µ + αg + βh + (αβ)gh + εghi (3.13)
with
yghi     observed value i (i = 1, …, 5) for placement g and packaging h
µ        total mean of the statistical population
αg       true effect of placement g (g = 1, 2, 3)
βh       true effect of packaging h (h = 1, 2)
(αβ)gh   true interaction effect of placement g and packaging h
εghi     error term

For the purpose of a unique determination (identifiability) of the effects, the effects
should add up to zero. In multi-factor ANOVA, the isolated effects of the factors are
referred to as main effects to distinguish them from interaction effects.

Extended application example for a two-way ANOVA


The manager of a supermarket chain would like to know whether, in addition to
placement (see Sect. 3.2.1.1), the type of packaging (box or paper) also affects sales.
Thus, the experiment is extended accordingly. With three types of placement and
two types of packaging, there are 3 × 2 experimental combinations of factor lev-
els. This is called a 3 × 2 factorial design. In the same 15 randomly selected super-
markets (SM) as in Sect. 3.2.1.1, the chocolate is now offered both in a box and in
paper. Table 3.9 shows the achieved chocolate sales in the 15 supermarkets within
one week as an extended data matrix with six cells (3 types of placement × 2 types of
packaging).11
The manager also wants to find out whether there is any interaction between the
factors packaging and placement, i.e. whether the average chocolate sales in a box
and in paper depend on the placement and vice versa. ◄

11 Please note that the number of 5 observations per group, and thus a total of 30 observations, was
chosen in order to make the subsequent calculations easier to understand. In the literature, at least
20 observations per group are usually recommended for a two-way ANOVA.

Table 3.9  Data of the extended example: chocolate sales in kilograms per 1000 cash transactions, depending on placement and packaging

                                     Packaging
Placement            Supermarket     Box     Paper
Candy section        SM 1            47      40
                     SM 2            39      39
                     SM 3            40      35
                     SM 4            46      36
                     SM 5            45      37
Special placement    SM 6            68      59
                     SM 7            65      57
                     SM 8            63      54
                     SM 9            59      56
                     SM 10           67      53
Cash register        SM 11           59      53
                     SM 12           50      47
                     SM 13           51      48
                     SM 14           48      50
                     SM 15           53      51

Table 3.10  Group mean values and margin mean values in the extended application example

                              Packaging
Placement                     h1: Box   h2: Paper   Margin means (placement)
g1: Candy section             43.4      37.4        40.4
g2: Special placement         64.4      55.8        60.1
g3: Cash register             52.2      49.8        51.0
Margin means (packaging)      53.3      47.7        50.5 (total mean)

Calculation of the main effects


To determine the main effects, first the mean values of all sales figures in Table 3.9 have
to be calculated. In addition to the mean values in the cells of the table, the marginal
mean values for the two factors and the total mean value must also be calculated. The
results are shown in Table 3.10.
The main effects are calculated as the differences between the group means and the
total mean, as in one-way ANOVA. The following calculations need to be made to deter-
mine the main effects of the two factors on the dependent variable:
$\alpha_g = \bar{y}_{g.} - \bar{y}$   (3.14)

$b_h = \bar{y}_{.h} - \bar{y}$   (3.15)

with

$\bar{y}_{g.} = \frac{1}{H \cdot N} \sum_{h=1}^{H} \sum_{i=1}^{N} y_{ghi}$   (group mean of placement g)

$\bar{y}_{.h} = \frac{1}{G \cdot N} \sum_{g=1}^{G} \sum_{i=1}^{N} y_{ghi}$   (group mean of packaging h)

$\bar{y} = \frac{1}{G \cdot H \cdot N} \sum_{g=1}^{G} \sum_{h=1}^{H} \sum_{i=1}^{N} y_{ghi}$   (total mean)

For the effects of the three placement types, the following can be observed:

α1 = (ȳ1. − ȳ) = 40.4 − 50.5 = −10.1   (candy section)
α2 = (ȳ2. − ȳ) = 60.1 − 50.5 = 9.6   (special placement)
α3 = (ȳ3. − ȳ) = 51.0 − 50.5 = 0.5   (cash register placement)

The sum of the three effects is again zero. The same applies to the effects of the two types of packaging:

b1 = (ȳ.1 − ȳ) = 53.33 − 50.5 = 2.83   (box)
b2 = (ȳ.2 − ȳ) = 47.67 − 50.5 = −2.83   (paper)
Different effects of the factors under consideration can be assumed if the factor levels of the two factors each show clear differences in average sales volumes. This is the case if, at each level of one factor, the mean values differ clearly across the levels of the other factor.
Table 3.10 shows that the average sales volumes at the three factor levels of ‘place-
ment’ differ with respect to the two factor levels of ‘packaging’ (box and paper): For
example, in “special placement” the average sales amount of chocolate in box packag-
ing is 64.4, while in paper packaging it is only 55.8. It can also be seen that the average
chocolate sales volume in box packaging shows differences for the three types of place-
ment (43.4 compared to 64.4 and 52.2). If these differences prove to be significant across
all levels and factors, we can assume that the factors have different main effects on the
dependent variable.
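A minimal Python sketch, based on the margin means of Table 3.10, reproduces these main effects; it is an illustrative supplement, and all names are chosen freely for this example.

import numpy as np

# Margin means from Table 3.10
placement_means = np.array([40.4, 60.1, 51.0])   # candy section, special placement, cash register
packaging_means = np.array([53.33, 47.67])       # box, paper
grand_mean = 50.5

alpha_g = placement_means - grand_mean           # -10.1, 9.6, 0.5  (sum to zero)
b_h = packaging_means - grand_mean               #  2.83, -2.83     (sum to zero)
print(alpha_g.round(2), b_h.round(2))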

Checking for interaction effects


Interactions between two factors occur if the levels of one factor have an influence on
the levels of a second factor. Thus, in our example, the difference between the average
sales quantities of the packaging types box and paper is (64.4 – 55.8 =) 8.6 in the special
placement, while it is only (52.2 – 49.8 =) 2.4 in the cash register placement. From these

differences, it can be concluded that the type of placement has an influence on the sales
volumes of the two types of packaging. If these differences were exactly the same, there
would be no interaction between the factors.
The interaction effects are estimated by
$(ab)_{gh} = \bar{y}_{gh} - \hat{y}_{gh}$   (3.16)

with

$\bar{y}_{gh} = \frac{1}{N} \sum_{i=1}^{N} y_{ghi}$   (observed mean in cell (g, h))

$\hat{y}_{gh}$:   estimated value for the mean of cell (g, h) without interaction

The estimated value ŷgh is the value we would expect for the cell (g, h), i.e. placement g
and packaging h, if there is no interaction. This value is derived from the group average
and the total average as follows:
$\hat{y}_{gh} = \bar{y}_{g.} + \bar{y}_{.h} - \bar{y}$   (3.17)
Let us now look at the cell for g = 3 and h = 2 in the extended example. The observed
mean is y32 = 49.8 (Table 3.10). If any interaction exists, this value contains the interac-
tion effect. The estimated value without interaction can be calculated as:
ŷ32 = (51.00 + 47.67) − 50.50 = 48.17
Thus, the interaction effect is:
(ab)32 = 49.8 − 48.17 = 1.63
Due to the interaction, the sales volume of chocolate in paper is higher if offered at the
cash register.
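A short illustrative Python sketch reproduces the interaction effects for all six cells from the means in Table 3.10 (Eqs. 3.16 and 3.17); the variable names are chosen freely for this example.

import numpy as np

# Cell means from Table 3.10 (rows: placement, columns: packaging)
cell_means = np.array([[43.4, 37.4],    # candy section: box, paper
                       [64.4, 55.8],    # special placement
                       [52.2, 49.8]])   # cash register

placement_means = cell_means.mean(axis=1)    # 40.4, 60.1, 51.0
packaging_means = cell_means.mean(axis=0)    # approx. 53.33, 47.67
grand_mean = cell_means.mean()               # 50.5

# Expected cell means without interaction (Eq. 3.17) and interaction effects (Eq. 3.16)
expected = placement_means[:, None] + packaging_means[None, :] - grand_mean
ab_gh = cell_means - expected                # e.g. (ab)_32 = 1.63 for paper at the cash register
print(ab_gh.round(2))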
Interaction effects can also be tested for significance by the following hypotheses:

• H0: The mean values of the factor levels are identical, therefore there is no interaction
between the factors.
• H1: The mean values of the factor levels are not identical, therefore there is an interac-
tion between the factors.

If H0 is rejected, we can assume significant interaction effects, which we should interpret


in a next step.

General types of interaction effects


A simple method to detect interactions is the graphical representation of the factor level mean values. For this purpose, the mean values of the levels of one factor are plotted as a function of the levels of the other factor. For clarification, lines are drawn to connect the group mean values. According to Leigh and Kinnear (1980), three types of interaction effects can be distinguished, as shown in Fig. 3.6.12

Fig. 3.6  Types of interaction effects in ANOVAs (three panels: ordinal interaction, where both main effects can be interpreted; hybrid interaction, where only the main effect of factor B can be interpreted; disordinal interaction, where no main effects can be interpreted)

a) Ordinal interaction effects: In case of an ordinal interaction effect, the ranking


of the levels of one factor is identical to the ranking of the respective levels of the
other factor. This means that the lines of both plots have a common trend and do not
intersect. The main effects of both factors can therefore be interpreted meaningfully.
Ordinal interaction effects tend to have a rather weak effect on the dependent variable.
b) Disordinal interaction effects: Disordinal interaction effects are present if no com-
mon trend of the factors can be observed in both plots, i.e. the ranking and thus the
lines of the levels of both factors run in different directions in both plots. An inter-
pretation of the two main effects is therefore not possible. In case of a disordinal
interaction, the plotted lines may or may not intersect. Disordinal interactions with
non-intersecting lines can only be observed if more than two factor levels are pres-
ent. Disordinal interaction effects tend to have the strongest effect on the dependent
variable.

12 Here, the different types of interaction are illustrated graphically. The interaction effects in
the application example correspond to those in the case study and are shown and explained in
Fig. 3.15.

c) Hybrid interaction effects: If ordinal and disordinal interaction effects occur simul-
taneously, we call this a hybrid interaction effect. While in the case of an ordinal
interaction effect both influencing factors can be interpreted, the interpretation in the
case of hybrid interaction effects is only possible for one of the two influencing fac-
tors. In one of the two plots, the lines do not intersect, i.e. the main effect of this
factor can be interpreted. However, the effect shown in this plot is not reflected in the
counterplot. The trend of this main effect runs in the opposite direction in the other
plot and the lines intersect, which means that the main effect of the other factor can-
not be interpreted.

3.2.2.2 Variance Decomposition and Model Quality

1 Model formulation

2 Variance decomposition and model quality

3 Statistical evaluation

4 Interpretation of the results

Variance decomposition in a two-way ANOVA


To examine the significance of the effects, the total variation of the data needs to be bro-
ken down, just as in a one-way ANOVA. To simplify notations, the factors will be called
A (placement) and B (packaging). The principle of variance decomposition for the two-
way ANOVA is shown in Fig. 3.7.

Fig. 3.7  Distribution of the total variation in a factorial design with 2 factors: the total variation (SSt) is split into the variation between the groups (SSb) and the variation within the groups (SSw); SSb is further split into the variation due to factor A (SSA), the variation due to factor B (SSB), and the variation due to the interaction of A and B (SSAxB)

Just as in the one-way ANOVA, the total variation is split into an explained variation
and an unexplained variation. The explained variation is then further divided into three
components which result from the influence of factor A, the influence of factor B and the
interaction of factors A and B. This results in the following decomposition of the total
variation:
SSt = SSA + SSB + SSAxB + SSw (3.18)
The sum of squares (SS) presented in Eq. (3.18) is now calculated as follows:
SSt: total variation
$SS_t = \sum_{g=1}^{G} \sum_{h=1}^{H} \sum_{i=1}^{N} (y_{ghi} - \bar{y})^2$   (3.19)

SSb: variation between the groups (explained variation)


SSb can be calculated directly from the squared deviations of the cell (group) means from the total mean. The following applies:

$SS_b = N \cdot \sum_{g=1}^{G} \sum_{h=1}^{H} (\bar{y}_{gh} - \bar{y})^2$   (3.20)

Alternatively, the explained variation can be obtained by adding the variations of the individual effects:
SSb = SSA + SSB + SSAxB (3.21)
The variations produced by the isolated effects (main effects) of factor A (placement)
and factor B (packaging) are derived from the deviations of the row or column means
from the total mean. Eqs. (3.22, 3.23) show the general calculation of the variation
explained by the main effects.
$SS_A = H \cdot N \cdot \sum_{g=1}^{G} (\bar{y}_{g.} - \bar{y})^2$   (3.22)

$SS_B = G \cdot N \cdot \sum_{h=1}^{H} (\bar{y}_{.h} - \bar{y})^2$   (3.23)

with
G     Number of factor levels of factor A
H     Number of factor levels of factor B
N     Number of elements in cell (g, h)
ȳg.   Mean of row g (placement g)
ȳ.h   Mean of column h (packaging h)

The variation generated by the interaction effects is obtained by summing the squared deviations of the cell means from the estimated values that would be expected without interaction:

$SS_{A \times B} = N \cdot \sum_{g=1}^{G} \sum_{h=1}^{H} (\bar{y}_{gh} - \hat{y}_{gh})^2$   (3.24)

SSw: variation within the groups (unexplained variation)


The unexplained variation is the variation that can be attributed neither to the two factors
nor to interaction effects, i.e., it is a random influence on the dependent variable. It is
reflected in SSw and is defined analogously to SSw of one-way ANOVA (cf. Eq. 3.6) in
Sect. 3.2.1.2 as follows
$SS_w = \sum_{g=1}^{G} \sum_{h=1}^{H} \sum_{i=1}^{N} (y_{ghi} - \bar{y}_{gh})^2$   (3.25)

Results for the extended example


To calculate the total variation, Eq. (3.19) is applied to the data of the extended example in Table 3.9. The result is as follows:

$SS_t = \sum_{g=1}^{G} \sum_{h=1}^{H} \sum_{i=1}^{N} (y_{ghi} - \bar{y})^2 = 2471.50$

Try to calculate this value using the data in Table 3.9.


Using the data for the means in Table 3.10, the other SS values for the extended
example can be calculated as follows:

SSA = 2 · 5 · [(40.4 − 50.5)² + (60.1 − 50.5)² + (51.0 − 50.5)²] = 1944.20

SSB = 3 · 5 · [(53.33 − 50.5)² + (47.67 − 50.5)²] = 240.83

The means per group (i.e., cell) are shown in Table 3.10.
The following estimated values are obtained for the means to be expected without
interaction:
ŷ11 = 40.4 + 53.33 − 50.5 = 43.23
ŷ12 = 40.4 + 47.67 − 50.5 = 37.57
ŷ21 = 60.1 + 53.33 − 50.5 = 62.93
ŷ22 = 60.1 + 47.67 − 50.5 = 57.27
ŷ31 = 51.0 + 53.33 − 50.5 = 53.83
ŷ32 = 51.0 + 47.67 − 50.5 = 48.17

This results in the variation explained by the interactions:

SSAxB = 5 · [(43.4 − 43.23)² + (37.4 − 37.57)² + (64.4 − 62.93)² + (55.8 − 57.27)² + (52.2 − 53.83)² + (49.8 − 48.17)²] = 48.47
SSb is obtained from the squared deviations of the cell means from the total mean. For our extended example, the result is

SSb = 5 · [(43.4 − 50.5)² + … + (49.8 − 50.5)²] = 2233.5
SSAxB can now be determined as

SSAxB = SSb − SSA − SSB = 2233.5 − 1944.20 − 240.83 = 48.47
The residual variation, which manifests itself as the 'variation within the groups' (analogous to SSw in a one-way ANOVA), is calculated for our example as follows:

SSw = (47 − 43.4)² + … + (45 − 43.4)²
    + (40 − 37.4)² + … + (37 − 37.4)²
    + (68 − 64.4)² + …
    + (53 − 49.8)² + … + (51 − 49.8)²
    = 238
In analogy to Fig. 3.7, the unexplained variation can also be calculated indirectly by decomposing the total variation:

SSw = SSt − SSA − SSB − SSAxB = SSt − SSb = 2471.5 − 2233.5 = 238   (3.26)
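The complete decomposition can be verified with an illustrative Python sketch based on the raw data of Table 3.9; the array layout (placement × packaging × stores) and all variable names are choices made for this example only.

import numpy as np

# Raw data of Table 3.9: axis 0 = placement (candy, special, cash),
# axis 1 = packaging (box, paper), axis 2 = N = 5 supermarkets per cell
sales = np.array([47, 39, 40, 46, 45,  40, 39, 35, 36, 37,
                  68, 65, 63, 59, 67,  59, 57, 54, 56, 53,
                  59, 50, 51, 48, 53,  53, 47, 48, 50, 51],
                 dtype=float).reshape(3, 2, 5)
G, H, N = sales.shape

grand_mean = sales.mean()
cell_means = sales.mean(axis=2)
placement_means = sales.mean(axis=(1, 2))
packaging_means = sales.mean(axis=(0, 2))
expected = placement_means[:, None] + packaging_means[None, :] - grand_mean

ss_t   = ((sales - grand_mean) ** 2).sum()                       # 2471.5  (Eq. 3.19)
ss_a   = H * N * ((placement_means - grand_mean) ** 2).sum()     # 1944.2  (Eq. 3.22)
ss_b   = G * N * ((packaging_means - grand_mean) ** 2).sum()     # 240.83  (Eq. 3.23)
ss_axb = N * ((cell_means - expected) ** 2).sum()                # 48.47   (Eq. 3.24)
ss_w   = ((sales - cell_means[:, :, None]) ** 2).sum()           # 238.0   (Eq. 3.25)
print([round(x, 2) for x in (ss_t, ss_a, ss_b, ss_axb, ss_w)])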
Model quality (eta-squared values)
Similar to one-way ANOVA, the explanatory power of a two-way ANOVA can now be assessed using eta-squared according to Eq. (3.7). With the values calculated for the extended example above, the following result is obtained for eta-squared:

Eta-squared = 2233.5 / 2471.5 = 0.904
When using the extended model, 90.4% of the total variation can be explained (compared to 0.864 for the one-way ANOVA of the original example according to Eq. 3.7). For the present data set, a one-way ANOVA with the factor placement alone can explain 78.7% of the variation. By extending the model, the unexplained variation is thus reduced from 21.3% to 9.6%. Again, it should be noted that eta-squared corresponds to the coefficient of determination (R-squared) of regression analysis (see Chap. 2).

Table 3.11  Calculation of the partial eta-squared in the extended example

Source of explanation      Sum of squares (explained)   Error sum of squares (not explained)   Total      Partial eta-squared
PLACEMENT                  1944.200                     238.000                                2182.200   0.891
PACKAGING                  240.833                      238.000                                478.833    0.503
PLACEMENT*PACKAGING        48.467                       238.000                                286.467    0.169

In a two-way ANOVA, in addition to the eta-squared of the overall model, partial eta-squared values can be calculated for each factor and also for the interaction term. The sum of squares of the respective explanatory variable (factor A, factor B or the interaction term A × B) is divided by the partial total variation, which is obtained by adding the unexplained variation (SSw) to the sum of squares of the respective effect.

Partial eta-squared values for individual effects


$\text{Partial eta-squared}_{\text{Effect}} = \frac{SS_{b(\text{Effect})}}{SS_{b(\text{Effect})} + SS_w}$   (3.27)
For the extended example, Table 3.11 shows the calculation of the partial eta-squared
values for the two main effects (Eqs. 3.22, 3.23), the interaction effect (Eq. 3.24) and the
error term (see Eq. 3.25).
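A brief Python sketch (illustrative only) reproduces the eta-squared and partial eta-squared values from the sums of squares calculated above.

# Sums of squares as calculated above for the extended example
ss_a, ss_b, ss_axb, ss_w, ss_t = 1944.2, 240.833, 48.467, 238.0, 2471.5

eta_sq = (ss_a + ss_b + ss_axb) / ss_t                 # 0.904 (overall model, Eq. 3.7)
partial_eta = {name: ss / (ss + ss_w)                  # Eq. 3.27
               for name, ss in [("placement", ss_a),
                                ("packaging", ss_b),
                                ("placement*packaging", ss_axb)]}
print(round(eta_sq, 3), {k: round(v, 3) for k, v in partial_eta.items()})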

3.2.2.3 Statistical Evaluation

1 Model formulation

2 Variance decomposition and model quality

3 Statistical evaluation

4 Interpretation of the results

In a two-way ANOVA, the statistical assessment for different effects of the two factors is
carried out by comparing the means in all cells. If all means are approximately equal, the
factors have no effect (null hypothesis). The alternative hypothesis is that at least one of
the factors has an influence.
The global significance test for the two-factor model is therefore identical with the
test for the simple model (except for the different number of degrees of freedom; cf.
Eq. 3.12):

Table 3.12  Two-way ANOVA table

Source                     SS         df   MS
Main effects
  Placement                1944.200   2    972.100
  Packaging                240.833    1    240.833
Interaction
  Placement x Packaging    48.467     2    24.233
Residual                   238.000    24   9.917
Total                      2471.500   29   85.224

$F_{emp} = \frac{\text{explained variance}}{\text{unexplained variance}} = \frac{SS_b/(G \cdot H - 1)}{SS_w/(G \cdot H \cdot N - G \cdot H)} = \frac{MS_b}{MS_w}$   (3.28)

With the values above, the following is calculated:

$F_{emp} = \frac{2233.5/5}{238.0/24} = \frac{446.7}{9.917} = 45.05$
For a confidence level of 95%, the F-table displays the value Fα = 2.62. The result is
therefore highly significant and the null hypothesis can be rejected. The corresponding
p-value is almost zero.
This allows us to investigate further questions concerning individual factors and their
interactions. In these cases, the null hypothesis is that the factor under examination has no effect or, in the case of the interaction term, that there is no interaction.
In a two-way ANOVA, the central results are also summarized in an ANOVA table.
In contrast to the one-way ANOVA table (see Table 3.7), the SS for both factors and the
interaction between the factors are now listed as well. Table 3.12 shows the results for
the extended application example (values as calculated in Sect. 3.2.2.2).
The degrees of freedom (df) for the individual elements of Table 3.12 are calculated
as follows:
dfA = G − 1
dfB = H − 1
dfAxB = (G − 1) · (H − 1)
dfw = G · H · (N − 1)
dft = G · H · N − 1

Table 3.13 shows the results of the specific F-tests with a confidence level of 95%. The sum of squares of the unexplained variation is the same in all cases, namely 238.000 (with MSw = 9.917).
The result shows that the alternative hypothesis H1 is accepted for the main effects,
i.e., both packaging and placement have an effect on the sales volume, whereas the inter-
action is not significant. This result does not necessarily mean that in reality there is no

Table 3.13  Specific F-tests in a two-factor design

Source                     df (numerator)   df (denominator)   Ftab   Femp
Packaging                  1                24                 4.26   24.286
Placement                  2                24                 3.40   98.026
Interaction
  Packaging * Placement    2                24                 3.40   2.444

connection, but on the basis of the available results the null hypothesis cannot be rejected
in this case (cf. the graphical analysis of the interactions in Fig. 3.15).
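The specific F-tests can be reproduced with a short, illustrative Python sketch using scipy; it takes the sums of squares and degrees of freedom from Table 3.12 and returns the empirical F-values, the critical values and the p-values.

from scipy import stats

# Sums of squares and degrees of freedom from Table 3.12
ms_w, df_w = 238.0 / 24, 24
effects = {"placement":           (1944.200, 2),
           "packaging":           (240.833, 1),
           "placement*packaging": (48.467, 2)}

for name, (ss, df) in effects.items():
    f_emp = (ss / df) / ms_w
    f_tab = stats.f.ppf(0.95, df, df_w)   # critical value for alpha = 0.05
    p = stats.f.sf(f_emp, df, df_w)
    print(f"{name}: F = {f_emp:.2f} (F_tab = {f_tab:.2f}), p = {p:.4f}")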

3.2.2.4 Interpretation of the Results

1 Model formulation

2 Variance decomposition and model quality

3 Statistical evaluation

4 Interpretation of the results

For the two-way ANOVA, the central results are also included in an ANOVA table (see
Table 3.12). They provide information on whether the factors or their interaction have a
significant effect on the dependent variable. The F-statistics, on the other hand, are omni-
bus tests, i.e. they do not provide information about which levels of one or more factors
have a significant influence on the dependent variable and how large these effects are.
If there are a priori assumptions about possible differences in the two-way ANOVA,
these can of course also be tested using contrast analysis. And if such assumptions can-
not be made in advance, post-hoc tests can also be performed for significant F-values. In
our extended example, however, post-hoc tests for the type of packaging (box or paper)
are not useful, since there are only two factor levels. A post-hoc test can only be carried
out for the factor “placement”, and it is identical with the test in the case of a one-way
ANOVA (see Sect. 3.2.1.4).
Further explanations on contrasts and post-hoc tests for two-way ANOVAs can be
found in the case study (Sects. 3.3.3.2, 3.3.3.3) which uses the dataset of our extended
example, with a focus on the options implemented in SPSS.

3.3 Case Study

3.3.1 Problem Definition

Based on experience, the manager of a supermarket chain assumes that the sales of a
certain type of chocolate can be influenced by the type of packaging and the placement.
To test his assumption, he presents the chocolate in three different places (candy section,

Table 3.14  Chocolate sales in kilograms per 1000 checkout transactions depending on placement and packaging

                                     Packaging
Placement            Supermarket     Box     Paper
Candy section        SM 1            47      40
                     SM 2            39      39
                     SM 3            40      35
                     SM 4            46      36
                     SM 5            45      37
Special placement    SM 6            68      59
                     SM 7            65      57
                     SM 8            63      54
                     SM 9            59      56
                     SM 10           67      53
Cash register        SM 11           59      53
                     SM 12           50      47
                     SM 13           51      48
                     SM 14           48      50
                     SM 15           53      51

special placement, cash register) and in two different types of packaging (box and
paper). This results in 3 × 2 = 6 different ways of presenting the chocolate. As described
in Sect. 3.2.2.1, 15 out of the 100 shops of his supermarket (SM) chain are randomly
selected and the chocolate is presented in a box and in paper in 5 shops per placement,
also selected at random, for one week each. Table 3.14 shows the achieved chocolate
sales in kilograms per 1,000 checkout transactions in the 15 supermarkets for chocolate
in boxes and in paper.13
With the help of the data collected, he now wants to check whether the packaging and
the placement have a significant influence on the sales volume. To answer this question,
the manager conducts a two-way ANOVA. If the influence of the placement turns out
to be significant, the manager wants to know in a second step whether all three place-
ments (candy section, special placement and cash register placement) have an influence
on chocolate sales and how strong these effects are. This is possible via a so-called post
hoc test.

13 For didactic reasons, the data of the extended example are also used in the case study (cf.
Sect. 3.2.2.1; Table 3.9). Note that the case study is thus based on a total of 30 cases only. In the
literature, a number of at least 20 observations per group is usually recommended.

Since the manager knows that both the ANOVA and the post-hoc test require variance
equality in the factor levels (groups), he would like to check this assumption in advance
using the Levene test.

3.3.2 Conducting a Two-way ANOVA with SPSS

To conduct a two-way (and any multi-factorial) ANOVA with SPSS, we can use the
graphical user interface (GUI). After loading the data file into SPSS, the data are avail-
able in the SPSS data editor. Click on ‘Analyze’ to select the procedure for ANOVA. A
pull-down menu opens with submenus for groups of procedures (see Fig. 3.8). The group
‘General Linear Model’ contains the procedure ‘Univariate …’, which means that only
one dependent variable is considered.
In the dialog box ‘Univariate’, select the dependent variable (sales volume of choco-
late) and the two independent, nominally scaled variables (placement and type of pack-
aging) from the list and transfer them to the field ‘Fixed Factors’ (see Fig. 3.9).

Fig. 3.8 Data editor with selection of the analysis method ‘Univariate’

Fig. 3.9 Dialog box: Univariate

Additionally, various statistics and parameters can be selected via the field ‘Options’
(see Fig. 3.10). For the present case study, ‘Descriptive statistics’ and ‘Estimates of effect
sizes’ have to be selected. By clicking on the box ‘Homogeneity test’, the Levene test for
homogeneity of variances is requested.
A plot of the factor level mean values can also be requested to visually check for the presence of interactions. To do this, click on the button 'Plots'. In the dialog box 'Plots', enter the factor placement as the 'Horizontal axis' and the factor packaging under 'Separate lines', and then add both to the field 'Plots' by clicking 'Add' (see Fig. 3.11).

3.3.3 Results

3.3.3.1 Two-way ANOVA
Since ANOVA requires homogeneity of variance between groups, we first check the
result for Levene’s test (see Sect. 3.2.1.3) shown in Fig. 3.12. In the last column, the sig-
nificance is given as Sig. = 0.499. The test score of the Levene test is therefore not signif-
icant. A rejection of the null hypothesis would be a wrong decision with a probability of
0.499. The null hypothesis (“error variance of the dependent variable is the same across
the groups”) can therefore be accepted. This means that there are no significant differ-
ences in the error variances of the three factor levels, i.e. variance homogeneity can be
assumed.

Fig. 3.10 Dialog box: Univariate: Options

Figure 3.13 lists the descriptive statistics, with the average chocolate sales in kilogram
per 1000 checkout transactions for the different placements as well as the standard devia-
tions and the case numbers (N) for the two packaging types.
A review of the descriptive results reveals differences in the mean sales volumes for
the combinations of factor levels of the two independent variables: For example, the
packaging type box consistently shows higher sales figures than paper packaging in all
placements, and the special placement leads to the highest average sales volume (60.1 kg per 1000 checkout transactions).
The mean values thus already suggest the effectiveness of the two marketing measures.
The results of the two-way ANOVA are presented in an ANOVA table (Fig. 3.14).
Since we used the same figures as in Sect. 3.2.2.2, the results correspond to the values
in Table 3.12 and 3.13. In addition to the empirical F-values, Fig. 3.14 shows the corre-
sponding p-values (“significance”).
The structure of the table in Fig. 3.14 clearly reflects the basic principle of the vari-
ance decomposition because according to Eq. (3.6):

SSt(otal) = SSb(etween) + SSw(ithin)


2471.500 = 2233.500 + 238.000
If the intercept estimated by the ANOVA is also included in the considerations, the result
for the explained total variation is as follows:

Fig. 3.11 Dialog box: Univariate: Plots

Fig. 3.12 Result of the Levene test for variance homogeneity



Fig. 3.13 Descriptive statistics for the case study

Fig. 3.14 Results of the two-way ANOVA (ANOVA table)

2471.500 + 76507.500 = 78979.000.


The intercept can also be interpreted as an explanatory variable: It corresponds to the
sum of the deviation squares that would result if the effects of placement and packaging
were zero. This model which is based on the null hypothesis is called the null model

(constant-only model). The null model serves primarily for comparison with other
models in order to clarify their explanatory power. With reference to the case study, the
constant term indicates the sum of the deviation squares that would be generated “on
average” if the supermarket manager had not undertaken any activities regarding place-
ment or packaging.
Column four in Fig. 3.14 shows the mean squares that result if the 'Type III Sum of Squares' is divided by the degrees of freedom (df). Using the mean squares, the test val-
ues of the F-statistics and their significances can be calculated directly (cf. Sects. 3.2.1.3,
3.2.2.3). It should be emphasized that in the case study the F-tests for the two market-
ing instruments (placement and packaging) are significant (cf. column Sig. in Fig. 3.14).
This means that both measures have a significant influence on chocolate sales. In con-
trast, the interaction between placement and packaging is not significant (Sig. = 0.108).

Model quality (eta-squared values)


The last column in Fig. 3.14 shows the eta statistics requested via the option ‘Estimates
of effect size’ (Fig. 3.10). The quality of the overall model (corrected model) is calculated
according to Eq. (3.7) and for the case study follows
$\text{Eta-squared} = \frac{\text{explained variation}}{\text{total variation}} = \frac{SS_b}{SS_t} = \frac{2233.5}{2471.5} = 0.904$
This means that the model (without intercept) can explain 90.4% of the squared variation
of the dependent variable.
In addition, the explanatory power of the factors (placement, packaging) and the inter-
action effect (placement * packaging) can be determined ‘in isolation’ with respect to the
dependent variable (so-called partial eta-squared values). The partial eta-squared values
generated by SPSS correspond to the results in Table 3.11, since the same data set was
used in both cases.
The partial eta-squared values show that the factor placement contributes more to
the explanation of variance (89.1%) than the factor packaging (50.3%). The interaction
term placement * packaging explains 16.9% of the variance of the dependent variable. It
should be noted that the interaction term has a value of 0.108 in the column 'Sig.'. This means that if we claimed that the interaction term influences the sales volume, we would run a 10.8% risk of being wrong. Accordingly, the influence of the interaction term is to be classified as not significant. In addition, SPSS shows the
partial eta squared for the intercept, which is calculated according to Eq. (3.27):
$\text{Partial eta-squared}_{\text{Intercept}} = \frac{76507.5}{76507.5 + 238} = 0.997$
The strong explanatory power of the intercept (null model) suggests that influences
which contribute to the explanation of the scatter of sales volume but were not explic-
itly formulated in the model are reflected in the intercept. This presumption will be con-
firmed by later considerations (see Fig. 3.23 in Sect. 3.4.2.2).

Fig. 3.15 Graphical analysis of interactions in the case study

Interaction between packaging and placement in the case study


The ANOVA results in Fig. 3.14 show that there is no significant interaction between the
factors packaging and placement (Sig. = 0.108; partial eta-squared = 0.169). A visual check for inter-
actions can be performed in SPSS by requesting a profile plot (see Fig. 3.15). Here, the
average sales figures for the three placements and the two packaging types are plotted
against each other. Interaction effects can be recognized by connecting lines between
the mean values that are not parallel to each other.14 In our case study, the packaging
types box and paper show no interaction for candy section and special placement (paral-
lel lines), while the cash register placement does influence the sales volumes of the two
types of packaging (the distance between the two corresponding sales values is smaller because the rightmost point of the lower line lies relatively higher).

14 Forthe types of interaction effects and the calculation of the interaction effect in the case study,
see the explanations in Sect. 3.2.2.1.

Due to the interaction, the sales volume of chocolate in paper is higher if it is offered
at the cash register.

3.3.3.2 Post-hoc Tests for the Factor Placement


In our case study, the factor “placement” has a significant influence on chocolate sales as
proven by the significant F-test in the two-factorial ANOVA. However, the manager does
not yet know whether all three forms of placement (factor levels) have an equally strong
influence or whether there are differences. This question can be answered by means of
post-hoc tests (cf. the explanations in Sect. 3.2.1.4).
SPSS provides a total of 18 different post-hoc tests in dialog box ‘Post hoc’ in the
main menu of the procedure ‘Univariate’ (see Fig. 3.10). Fourteen tests are suitable if
equal variance in all groups can be assumed and four test options are applicable if var-
iance equality cannot be assumed. Figure 3.16 shows the dialog window of the menu
selection ‘Post hoc’.
Since for our case study the Levene test showed that variance homogeneity can be
assumed (see Fig. 3.12), the Tukey and Scheffé tests were selected as post-hoc tests.
They are among the most commonly used post-hoc tests in empirical applications.
The Tukey test is particularly recommended for comparing paired means (as in our
case study) and classified as very robust (cf. Smith, 1971, p. 31). The Scheffé test is also
recommended in the literature and is especially suited if the sample sizes in the groups
may vary (so-called unbalanced design). But this is not the case in our example. For the
sake of comparison, the results of both tests are requested here.
It should be noted that the procedure ‘Univariate’ is only allowed for post-hoc tests
if no covariates are included in the analysis (for analyses with covariates see Sect. 3.4.2
on ANCOVA). Also, no post-hoc test can be performed for the factor packaging, as this
factor has only two factor levels (box and paper) while post-hoc tests are only defined
for factors with three or more factor levels. With two factor levels, a simple test for dif-
ferences in mean values can be carried out, which can be called up in SPSS in the menu
Analyze/Compare means/Means.

Results of the post-hoc tests in the case study


The results of the two post-hoc tests requested in Fig. 3.16 are shown in Fig. 3.17. Both
tests lead to the same result: All pairwise comparisons of the mean values of the three
factor levels lead to significant results, which can be seen directly in the column “Sig.”
in Fig. 3.17. In addition, significance at a level of ≤5% is indicated by the asterisks in
the column “Mean difference (I–J)”. This indicates that all factor levels have different
effects on chocolate sales, with the greatest difference showing up between the factor
levels candy section and special placement: (40.4 – 60.1) = −19.7.

3.3.3.3 Contrast Analysis for the Factor Placement


In our case study, it was assumed that the manager had no idea whether the factor levels
had different effects on the dependent variable. However, quite often the user does have

Fig. 3.16 Dialog box: Post Hoc Multiple Comparisons

Fig. 3.17 Results of the post-hoc tests

a priori ideas about the effects of the factor levels, e.g. due to logical considerations. In
these cases, these assumptions about the differences in the effectiveness of the factor lev-
els can be checked by the so-called contrast analysis (cf. Sect. 3.2.1.3).

Fig. 3.18 Dialog box ‘Contrasts’ of one-way ANOVA

For our case study, it is assumed that prior research has shown that the special place-
ment of chocolate can increase sales. The supermarket manager would now like to know
whether this effect is also valid in his case. The manager is therefore interested in a con-
trast analysis for the factor “placement”. In this case, a one-way ANOVA can be used to
perform the contrast analysis.
In SPSS, the one-way ANOVA is called up by the menu sequence Analyze/Compare
Means/One-Way ANOVA. There, the sales volume can be entered as a dependent variable
and the placement as factor. Pressing the button ‘Contrasts’ opens the corresponding dia-
log box, as shown in Fig. 3.18.
In the case of one-way ANOVAs, contrast analysis compares the contrast variable of
interest (factor level) with the other factor levels, which, for this purpose, are combined
into one group. This is achieved by determining the so-called contrast coefficients, which
are often referred to as lambda coefficients.
In order to contrast special placement with the two other factor levels, the supermar-
ket manager selects a contrast coefficient of −1. Based on logical reasoning, the contrast
coefficients are set to +0.5 for the remaining factor levels, candy section and cash regis-
ter. As a result, special placement is regarded as an independent group, while the factor
levels candy section and cash register are combined into one group. In the dialog box
‘Contrasts’, these values are entered in the field ‘Coefficients’ and transferred to the anal-
ysis by clicking on ‘Add’.
Note that the absolute magnitude of the lambda coefficients is irrelevant. They merely
indicate the weighting ratio of the mean (here 1:1). Also note that the coefficients of the
factor levels to be contrasted must have opposite algebraic signs and that the sum of all
contrast coefficients must equal zero.

Results of the contrast analysis for the factor placement


The result of the contrast analysis within the one-way ANOVA is shown in Fig. 3.19.
The matrix of contrast coefficients at the top again shows the weightings applied
in the case study. The comparison of the means of the factor level ‘special placement’
and the group ‘candy section’ and ‘cash register’ is carried out with a t-test under the

Fig. 3.19 Results of the contrast analysis

assumptions “equal variances” and “no equal variances”. The contrast value reflects the
difference between the two considered group averages and here is calculated as follows:
Contrast = 0.5 · 40.4 + (−1) · 60.1 + 0.5 · 51.0 = −14.4
The group mean values of the factor levels can be found in Fig. 3.13 (they are shown in the respective 'Total' rows).
icant with a p-value of 0.000. This means that the assumption of the supermarket man-
ager can be confirmed, i.e. the special placement significantly increases chocolate sales
compared to chocolate placement in the candy section and at the cash register.
In one-way ANOVAs, contrast analysis refers solely to the differences between the
factor levels of the factor under consideration. If, however, a contrast analysis is per-
formed within the framework of a multifactorial ANOVA, the differences between the
factors are considered across all factor levels (so-called margin mean values).

3.3.4 SPSS Commands

Above, we demonstrated how to use the graphical user interface (GUI) of SPSS to con-
duct an analysis of variance. Alternatively, we can also use the SPSS syntax which is a
programming language unique to SPSS. Each option we activate in SPSS’s GUI is trans-
lated into SPSS syntax. If you click on ‘Paste’ in the main dialog box shown in Fig. 3.9,
a new window opens with the corresponding SPSS syntax. However, you can also use
the SPSS syntax directly and write the commands yourself. Using the SPSS syntax can
be advantageous if you want to repeat an analysis multiple times (e.g., testing different
model specifications). Figure 3.20 shows the SPSS syntax for the two-way ANOVA of
the case study.
Figure 3.21 show the SPSS syntax for running a covariance analysis (ANCOVA),
which is presented in Sect. 3.4.2
For readers interested in using R (https://www.r-project.org) for data analysis, we pro-
vide the corresponding R-commands on our website www.multivariate-methods.info.

* MVA: Case Study Chocolate Analysis of Variance (ANOVA).


* Defining Data.
DATA LIST FREE / ID Placement Packaging Sales Price Temp.
MISSING VALUES ALL (9999).

BEGIN DATA
1 1 1 47 1,89 16
2 1 1 39 1,89 21
3 1 1 40 1,89 19
------------------
30 3 2 51 2,13 18
* Enter all data.
END DATA.

* Case Study ANOVA.


UNIANOVA sales BY placement packaging
/METHOD=SSTYPE(3)
/INTERCEPT=INCLUDE
/POSTHOC=placement(TUKEY SCHEFFE)
/PLOT=PROFILE(placement*packaging) TYPE=LINE ERRORBAR=NO
MEANREFERENCE=NO YAXIS=AUTO
/PRINT ETASQ DESCRIPTIVE HOMOGENEITY
/CRITERIA=ALPHA(.05)
/DESIGN=placement packaging placement*packaging.

Fig. 3.20 SPSS syntax for the two-way ANOVA of the case study

* MVA: Case Study Analysis of Variance: Method with covariates (ANCOVA).


UNIANOVA sales BY placement packaging WITH price temp
/METHOD=SSTYPE(3)
/INTERCEPT=INCLUDE
/PLOT=PROFILE(placement*packaging) TYPE=LINE ERRORBAR=NO
MEANREFERENCE=NO YAXIS=AUTO
/PRINT ETASQ DESCRIPTIVE HOMOGENEITY
/CRITERIA=ALPHA(.05)
/DESIGN=price temp placement packaging placement*packaging.

Fig. 3.21 SPSS syntax for ANCOVA
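For readers who prefer Python to SPSS or R, the following sketch shows how the same two-way ANOVA could be run with the statsmodels package; it is an illustrative supplement, and the file name chocolate.csv and its column names are assumptions for this example, not part of the book's material.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# 'chocolate.csv' is a hypothetical file holding the case-study data
# with the columns sales, placement and packaging (cf. Table 3.14).
data = pd.read_csv("chocolate.csv")

# Two-way ANOVA with interaction; for this balanced design the Type II sums of
# squares reported here coincide with SPSS's Type III sums of squares.
model = smf.ols("sales ~ C(placement) * C(packaging)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))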

3.4 Modifications and Extensions

This section presents different extensions of ANOVA. Also, the covariance analysis
(ANCOVA) is considered more closely and explained for the case study in Sect. 3.3.1.
For this purpose, the data of the case study will be extended by two covariates (metri-
cally scaled independent variables). Finally, in Sect. 3.4.3 the Levene test for checking
the assumption of variance homogeneity is described in more detail.

Table 3.15  Variations of the analysis of variance

Dependent variable(s)   Independent variable(s)                                 Description                                                                                                          Abbreviation
One, metric             One nominal variable with different levels              Univariate analysis of variance, one-way (one-factorial)                                                             ANOVA
One, metric             Two or more nominal variables with different levels     Univariate analysis of variance, multiple or multi-factorial (two-, three-way etc. or two-, three-factorial etc.)    ANOVA
One, metric             One or more nominal and one or more metric variables    Univariate analysis of variance with covariate(s)                                                                    ANCOVA
Two or more, metric     One or more nominal variables                           Multivariate analysis of variance                                                                                    MANOVA
Two or more, metric     One or more nominal and one or more metric variables    Multivariate analysis of variance with covariate(s)                                                                  MANCOVA

3.4.1 Extensions of ANOVA

The term “analysis of variance” comprises various forms of ANOVA, with the extensions
resulting from the inclusion of additional variables (see Table 3.15). Since these exten-
sions lead to changes in the procedure, they are referred to differently in the literature
(see last column in Table 3.15).
All variants of ANOVA follow the principle of the variance decomposition. Regarding
multifactorial analyses of variance, it should be noted that with an increasing number of
factors the possibilities of interaction relationships also increase.
As an example, Fig. 3.22 shows a three-way ANOVA including the different levels of
interactions: the interaction between all possible combinations of two factors and, addi-
tionally, the interaction between all three factors. If more than three factors are included
in the analysis, again, all the interactions of the factors have to be considered, although
in these cases the interactions can hardly be interpreted in terms of content.
In practical applications, ANCOVA has become much more important because it
allows the simultaneous consideration of nominally and metrically scaled independent
variables. Therefore, we will describe ANCOVA in more detail in the following section.
MANOVA allows a design with more than one dependent variable and several fac-
tors. MANCOVA also considers several dependent variables and includes metrically
scaled covariates in addition to nominally scaled factors. MANCOVAs can be per-
formed in SPSS via the menu sequence ‘Analyze/General Linear Model/Multivariate’.
MANOVA and MANCOVA result in a general linear model approach (for the general
linear model see Christensen, 1996, pp. 427–431 and for the multivariate ANOVA see
Christensen, 1996, pp. 367–374 as well as Haase & Ellis, 1987, pp. 404–413 or Warne, 2014, pp. 1–10).

Fig. 3.22  Distribution of the total variation in a three-factor design: the total variation (SSt) is split into the variation between the groups (SSb) and the variation within the groups (SSw); SSb comprises the main effects of factors A, B and C (SSA, SSB, SSC) and the interactions (SSAxB, SSAxC, SSBxC, SSAxBxC)

3.4.2 Covariance Analysis (ANCOVA)

For practical applications, analyses of variance that do not only consider nominally
scaled but also metrically scaled independent variables are of great importance. The
metrically scaled independent variables are called covariates. Analyses of variance with

covariates are therefore called covariance analyses (ANCOVA). Due to the great impor-
tance of ANCOVAs, we will examine this type of analysis in more detail below, using
the case study in Sect. 3.3.

3.4.2.1 Extension of the Case Study and Implementation in SPSS


For ANCOVAs, the first step is usually to determine the proportion of the variance that
can be attributed to the covariates. In general, this corresponds to a previous regression
analysis (see Chap. 2). The observed values of the dependent variables are corrected for
the influence determined by the regression analysis and then subjected to ANOVA (see
Christensen, 1996, pp. 281–298). In this way the dependent variable is mathematically
adjusted for the influence of the covariates.
If in our case study price, for example, also varied, the residual spread would not only
contain random but also systematic influences. In order to avoid biased estimates, the
investigator should keep the prices constant to simplify the analysis. If the user experi-
mentally varies the prices to determine their effects, the variations should be independent
of (uncorrelated to) the other experimental variables. By introducing price as a covariate,
a part of the total variance may be attributed to the variation of the price, which, if not
recorded, would result in an increased residual variation (SSw).

Extension of the case study


The manager of the supermarket chain now assumes that the explanation of the choc-
olate sales volumes achieved by the ANOVA can be further improved by additionally
considering the variables "price level" and "outside temperature" as covariates. He therefore supple-
ments the data collected on the two factors (see Table 3.14) with information on the
price level and the outside temperature in the considered 15 supermarkets. These data
are metrically scaled variables and are included in Table 3.16 as price and temp. ◄

In order to perform the two-way ANOVA with covariates, the metrically scaled varia-
bles “price” and “temp” have to be inserted in the field ‘Covariates’ in the dialog box
‘Univariate’ (see Fig. 3.9). After the transfer, the sub-item ‘Post-hoc’ is automatically
hidden, as post-hoc tests are only defined for analyses without covariates. To execute the
procedure ‘Univariate’ with covariates, click on ‘OK’ again.

3.4.2.2 Two-way ANCOVA with Covariates in the Case Study


For a two-way ANCOVA, Fig. 3.23 shows the results of the table of variance taking into
account the two covariates.
Again, the decomposition of the total variation into the explained variation (corrected
model) and the residual variation (error) is apparent in the second column of the table.
In the second column, rows 3 to 7 show a breakdown of the variation which is explained
by the covariates and by the factors’ (corrected model) individual contributions (price,
temp, placement, packaging, placement * packaging). The remaining columns contain,

Table 3.16  Data matrix of the case study with covariates

                                     Box                        Paper
Placement            Supermarket     Sales   Price   Temp       Sales   Price   Temp
Candy section        SM 1            47      1.89    16         40      2.13    22
                     SM 2            39      1.89    21         39      2.13    24
                     SM 3            40      1.89    19         35      2.13    21
                     SM 4            46      1.84    24         36      2.09    21
                     SM 5            45      1.84    25         37      2.09    20
Special placement    SM 6            68      2.09    18         59      2.09    18
                     SM 7            65      2.09    19         57      1.99    19
                     SM 8            63      1.99    21         54      1.99    18
                     SM 9            59      1.99    21         56      2.09    18
                     SM 10           67      1.99    19         53      2.09    18
Cash register        SM 11           59      1.99    20         53      2.19    19
                     SM 12           50      1.98    21         47      2.19    20
                     SM 13           51      1.98    23         48      2.19    17
                     SM 14           48      1.89    24         50      2.13    18
                     SM 15           53      1.89    20         51      2.13    18

Fig. 3.23 Two-way ANCOVA using the univariate procedure

as above, the degrees of freedom (df), the variances (mean squares), the empirical
F-values (F), the significance level of the F-statistics (significance, Sig.) and the partial
eta-squared.

As the results show, the covariates price and temperature (with partial eta-squared val-
ues of 0.022 and 0.021, respectively) do not have any significant explanatory power with
regard to the dependent variable. The supermarket manager’s assumption that the quan-
tity sold can be additionally explained by these covariates can therefore not be confirmed.
Mathematically, the sales quantity is ‘corrected’ for the influence of the covariates.
This correction is expressed by the fact that the two-way ANOVA now relates to the total
variation minus the influence of the covariates. This has a direct effect on the results, as a
comparison of the ANOVA tables in Figs. 3.14, 3.23 shows:

• The variation explained by the factor placement decreased in absolute terms from
1944.2 to 1207.881. The explanation by the packaging factor has also decreased from
240.833 to 82.605.
• The change in the constant term is also quite obvious: the explained variation in abso-
lute terms dropped from 76507.500 to 8.815. Thus, the constant term is no longer
significant and now only has a partial eta-squared of 0.038 (previously: 0.997). This
means that the original “explanatory power” of the constant term has apparently been
absorbed by the two covariates.

Recommendations for carrying out an ANCOVA


For ANCOVAs, metrically scaled covariates are included in the analysis in addition to
the nominally scaled factors. ANCOVAs follow a linear model approach. Factors and
covariates must therefore be independent of each other, i.e., there must be no interactions
between the two. When creating the experimental factor design, it is important to make
sure that the design does not influence the covariates (so-called balanced experimental
design). In contrast, a correlation between the covariates and the dependent variable(s) is
mandatory, which must also be checked in advance.
In ANCOVAs, the contribution of covariates for explaining the dependent variable(s)
is determined in advance by regression analyses. Since these analyses are carried out for
each factor level (group), the preceding regressions in the individual groups should be
homogeneous (approximately equal regression coefficients). Simulation studies by Levy
(1980, p. 835) have shown that a covariance analysis should not be performed if the
regressions in the groups are not homogeneous, the assumption of a multinormal distri-
bution is violated and the groups are unequal in size (see the explanations on the assump-
tions of regression analysis in Sect. 2.2.5).
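The following Python sketch using statsmodels (illustrative only; the file name chocolate_cov.csv and its column names are assumptions, not part of the book's material) shows how an ANCOVA and a check of homogeneous regression slopes could be set up; the covariate-by-factor interactions should be non-significant if the slopes are homogeneous.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# 'chocolate_cov.csv' is a hypothetical file with the columns of Table 3.16
# (sales, placement, packaging, price, temp).
data = pd.read_csv("chocolate_cov.csv")

# ANCOVA: two covariates plus the two factors and their interaction
ancova = smf.ols("sales ~ price + temp + C(placement) * C(packaging)", data=data).fit()
print(sm.stats.anova_lm(ancova, typ=2))

# Check of homogeneous regression slopes: the covariate-by-factor interactions
# should not be significant if the slopes are homogeneous across the groups.
slopes = smf.ols("sales ~ price * C(placement) + temp * C(placement) + C(packaging)",
                 data=data).fit()
print(sm.stats.anova_lm(slopes, typ=2))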
When covariates are considered, ANOVAs can no longer be supplemented by post-
hoc tests. However, contrasts can also be taken into account in ANCOVA. The dialogue
box ‘Contrasts’ for the ‘Univariate’ procedure is shown in Fig. 3.24. Six types of con-
trasts can be selected under ‘Contrast’ (deviation, simple, difference, Helmert, repeated,
polynomial). For simple and deviation contrasts, you have the choice of using the ‘Last’
or ‘First’ factor level as ‘Reference Category’. The contrast type can be varied per factor.
Unlike in the contrast options of one-way ANOVAs (see Fig. 3.18), contrast coeffi-
cients cannot be defined for two-way ANOVAs. Thus, for the case study, no identical
results are obtained with both method variants. The reader is encouraged to explore the
differences in the results when choosing different contrast options for the case study.

Fig. 3.24 Dialog box: Contrasts of the procedure ‘Univariate’

3.4.3 Checking Variance Homogeneity Using the Levene Test

As discussed in Sect. 3.2.1.3, the Levene test (cf. Levene, 1960) is a widely used tool for
statistically testing the assumption of variance homogeneity. The test is based on the null
hypothesis that the variances in the groups do not differ (or that the error variance of the
dependent variable is the same across all groups).

$$H_0: \sigma_1^2 = \sigma_2^2 = \ldots = \sigma_G^2$$

$$H_1: \text{at least two } \sigma_g^2 \text{ are different}$$

The decision on the null hypothesis is based on the following rule:
Lemp > Fα → H0 is rejected, i.e. there is no variance homogeneity
Lemp ≤ Fα → H0 is not rejected, i.e. variance homogeneity can be assumed
In order to be able to assume variance homogeneity, Lemp should be as small as possi-
ble or the corresponding p-value as large as possible (at least p > 0.05). The Levene test
is relatively robust to a violation of the assumption of normal distribution. The following
test variable is used to test the null hypothesis:
$$L_{emp} = \frac{L_1}{L_2} \qquad (3.29)$$

with

$$L_1 = \frac{1}{G-1} \cdot \sum_{g=1}^{G} N_g \cdot \left(\bar{l}_{g} - \bar{l}\,\right)^2 \qquad (3.30)$$

Table 3.17  Absolute deviations (l-values) and their mean values in the application example
above for the factor ‘placement’ (packaging type “box”)
Absolute deviations of the sample values from the respective sample mean
Candy section Special placement Cash register
|47 − 43.4| = 3.6 |68 − 64.4| = 3.6 |59 − 52.2| = 6.8
|39 − 43.4| = 4.4 |65 − 64.4| = 0.6 |50 − 52.2| = 2.2
|40 − 43.4| = 3.4 |63 − 64.4| = 1.4 |51 − 52.2| = 1.2
|46 − 43.4| = 2.6 |59 − 64.4| = 5.4 |48 − 52.2| = 4.2
|45 − 43.4| = 1.6 |67 − 64.4| = 2.6 |53 − 52.2| = 0.8
Group mean values of the deviations l̄g:
Candy section: (3.6 + 4.4 + 3.4 + 2.6 + 1.6)/5 = 3.12
Special placement: (3.6 + 0.6 + 1.4 + 5.4 + 2.6)/5 = 2.72
Cash register: (6.8 + 2.2 + 1.2 + 4.2 + 0.8)/5 = 3.04
Total mean value of the deviations l̄: (3.12 + 2.72 + 3.04)/3 = 2.96

$$L_2 = \frac{1}{G \cdot (N-1)} \cdot \sum_{g=1}^{G} \sum_{i=1}^{N} \left(l_{gi} - \bar{l}_{g}\right)^2 \qquad (3.31)$$

L follows an F distribution with df1 = G − 1 and df2 = G · (N − 1). To calculate L, the absolute deviations $l_{gi}$ between the observed values $y_{gi}$ and the sample mean values within the groups ($\bar{y}_{g}$) need to be considered first:

$$l_{gi} = \left|y_{gi} - \bar{y}_{g}\right| \quad \text{with } g = 1, \ldots, G \text{ and } i = 1, \ldots, N$$
Then the mean values for each group ($\bar{l}_g$) and the total mean value ($\bar{l}$) must be determined for the l values:

$$\bar{l}_g = \frac{1}{N} \cdot \sum_{i=1}^{N} l_{gi} \qquad (3.32)$$

$$\bar{l} = \frac{1}{G} \cdot \sum_{g=1}^{G} \bar{l}_g \qquad (3.33)$$

For our application example of one-way ANOVA above, Table 3.17 shows the calcula-
tions for the three factor levels of the factor placement. The calculations for the example
are based on the initial data in Table 3.3 and the average chocolate sales in the three
supermarkets in Table 3.4.
Using the results in Table 3.17, the weighted variances L1 and L2 of the l-values can
now also be determined according to Eqs. (3.30, 3.31):

$$L_1 = \frac{1}{2} \cdot \left[ 5 \cdot (3.12 - 2.96)^2 + \ldots + 5 \cdot (3.04 - 2.96)^2 \right] = \frac{1}{2} \cdot 0.448 = 0.224$$

$$L_2 = \frac{1}{12} \cdot \left[ (3.6 - 3.12)^2 + (4.4 - 3.12)^2 + \ldots + (0.8 - 3.04)^2 \right] = \frac{1}{12} \cdot 43.328 = 3.611$$

$$L_{emp} = \frac{L_1}{L_2} = \frac{0.224}{3.611} = 0.062$$
With an assumed error probability of α = 0.05, the F-table yields a critical value of Fα = 3.89
for df1 = G − 1 = 2 and df2 = G · (N − 1) = 12. Since Lemp < Fα, the null hypothesis cannot
be rejected, and the assumption of variance homogeneity can be regarded as fulfilled. For
Lemp = 0.062, a p-value of p = 0.9402 follows.15 Since p > 0.05, the null hypothesis cannot be
rejected on this basis either.
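The manual computation above can also be verified with a few lines of Python. The following sketch (our own illustration) uses scipy's implementation of the Levene test with deviations from the group means (center='mean'), which corresponds to the test statistic used here.

```python
from scipy import stats

# Chocolate sales (packaging type 'box') for the three placements, cf. Tables 3.16 and 3.17
candy_section     = [47, 39, 40, 46, 45]
special_placement = [68, 65, 63, 59, 67]
cash_register     = [59, 50, 51, 48, 53]

# Classic Levene test based on deviations from the group means (center='mean')
L_emp, p_value = stats.levene(candy_section, special_placement, cash_register,
                              center='mean')
print(f"L_emp = {L_emp:.3f}, p = {p_value:.4f}")   # expected: L_emp ≈ 0.062, p ≈ 0.9402
```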

3.5 Recommendations

For using ANOVA, the following prerequisites, which relate both to the characteristics of
the data collected and to the evaluation of the data, must be fulfilled:

(A) Model formulation and assumptions of ANOVA

Model foundation
ANOVA is a confirmatory (structure-checking) analysis. Accordingly, factual logic,
expert knowledge and theory are decisive for formulating a well-founded model and for
identifying the possible influencing variable(s) on the dependent variable.

Scale level of the variables


The dependent variable must be on a continuous (metric) scale while the factors con-
sidered (independent variables) are categorical. Each category should represent observa-
tions that are expected to lead to differences in the dependent variable. The observations
may only belong to one factor level (group) at a time.

Independence of the factors


In two- and multi-factorial variance analyses, the selected factors must represent dif-
ferent sources of influence on the dependent variable. If two supposedly different fac-
tors were based on the same relationship, the variation of the dependent variable could
no longer be clearly attributed to one of the two factors (problem of multicollinearity).

15 The p-value can also be calculated in Excel by using the function F.DIST.RT(Femp;df1;df2).
For the example in Sect. 3.2.1.1, we obtain: F.DIST.RT(0.062;2;12) = 0.9402. A detailed explana-
tion of the p-value may be found in Sect. 1.3.1.2.

This would be the case, for example, if “packaging” and “branding” were chosen as fac-
tors but the customer perceived both as inseparable.

Error terms
Error terms must not contain any influencing variables. If other influencing factors
(extraneous and confounding variables) are present, they are automatically included in
the error terms. This problem exists mainly in single-factor ANOVAs. A solution here is
to extend the model (e.g. to a multifactorial ANOVA, with the inclusion of covariates).

Assumption of variance homogeneity


Both the ANOVA and the post-hoc tests assume variance homogeneity in the fac-
tor levels (groups). This assumption may be confirmed by using the Levene test (see
Sects. 3.2.1.3, 3.4.3).

Assumption of multivariate normal distribution:


The test statistics relevant for the ANOVA (F-test, Levene test, post-hoc test) assume
normal distribution of the observations of the dependent variable in each group (factor
level). A quantile-quantile plot (Q-Q diagram) and the Kolmogorov-Smirnov test can be
used to check for normal distribution. SPSS offers these analyses under the menu item
“Descriptive Statistics”. Overall, however, the test statistics are relatively robust against
violations of the normal distribution assumption, and the larger the sample size, the less
important the normality assumption becomes (cf. Bray & Maxwell, 1985, p. 32 ff.).
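As an illustration of such a check outside of SPSS, the following sketch (our own; it uses the Shapiro–Wilk test as an alternative to the Kolmogorov–Smirnov test, and the variable names are our own) examines the normality assumption for one group of the case-study data:

```python
import numpy as np
from scipy import stats

# Sales in the 'candy section' group (packaging type 'box'), cf. Table 3.16
sales = np.array([47, 39, 40, 46, 45])

# Shapiro-Wilk test of normality (H0: the observations come from a normal distribution)
w_stat, p_value = stats.shapiro(sales)
print(f"W = {w_stat:.3f}, p = {p_value:.3f}")   # large p-values do not contradict normality

# Coordinates for a Q-Q plot against the normal distribution (can be passed to a plotting library)
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(sales, dist="norm")
```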

(B) Variable selection and survey planning

Manipulation check
Prior to an empirical investigation, it must be ensured that changes in the observations
of the dependent variable are definitely due to different factor levels of the selected fac-
tors. The targeted variation (manipulation) of the independent variable must be carried
out in advance by the user based on theoretical or logical considerations. These consider-
ations are then reflected in the so-called experimental design (cf. Kahn, 2011, pp. 687 ff.;
Perdue & Summers, 1986, pp. 317 ff.).

Number of factors
An ANOVA only makes sense if the effect of at least one factor with three or more factor
levels is examined. If there is one factor with only two factor levels, a simple comparison
of mean values should be used.

Number of observations
A model becomes more reliable if it is based on a large number of observations. The
more factors and factor levels are considered, the more observations are needed. A rule
of thumb is that a group should contain at least 20 observations. In addition, each cell

should be occupied by about the same number of cases (cf. Perreault & Darden, 1975,
p. 334 ff.). In order to counteract violations of the assumptions, the cases should be ran-
domly assigned to the groups in advance, if possible.

(C) Recommendations for practical implementation

Model complexity
Getting started with ANOVA is easier if the beginner does not include too many factors
(and possibly covariates) in the investigation so the interpretation of the results does not
become too difficult (e.g. increasing number of interaction effects).

Outlier analysis
Outliers can strongly distort the variances within the groups and thus affect the assumptions
of variance homogeneity and normal distribution. They should therefore be identified and
excluded from the analysis.

Incomplete experimental designs


Here, only complete experimental designs (experimental plans) were considered, in
which all possible combinations of factor levels were represented by corresponding
observations. In practice, missing data (e.g. due to impossible or costly observations) can
lead to incomplete designs. In these cases, so-called reduced designs have to be created
(cf. Brown et al., 1990).

References

Bray, J. H., & Maxwell, S. E. (1985). Multivariate analysis of variance. Sage.


Brown, S. R., Collins, R. L., & Schmidt, G. W. (1990). Experimental design and analysis. Sage.
Christensen, R. (1996). Analysis of variance, design, and regression: Applied statistical methods.
CRC Press.
Haase, R. F., & Ellis, M. V. (1987). Multivariate analysis of variance. Journal of Counseling
Psychology, 34(4), 404–413.
Kahn, J. (2011). Validation in marketing experiments revisited. Journal of Business Research,
64(7), 687–692.
Leigh, J. H., & Kinnear, T. C. (1980). On interaction classification. Educational and Psychological
Measurement, 40(4), 841–843.
Levene, H. (1960). Robust tests for equality of variances. In I. Olkin (Ed.), Contributions to prob-
ability and statistics. Essays in honor of Harold Hotelling (pp. 278–292). Stanford University
Press.
Levy, K. I. (1980). A Monte Carlo study of analysis of covariance under violations of the assump-
tions of normality and equal regression slopes. Educational and Psychological Measurement,
40(4), 835–840.
Moore, D. S. (2010). The basic practice of statistics (5th ed.). Freeman.
Perdue, B., & Summers, J. (1986). Checking the success of manipulations in marketing experi-
ments. Journal of Marketing Research, 23(4), 317–326.

Perreault, W. D., & Darden, W. R. (1975). Unequal cell sizes in marketing experiments: Use of the
general linear hypothesis. Journal of Marketing Research, 12(3), 333–342.
Pituch, K. A., & Stevens, J. P. (2016). Applied multivariate statistics for the social sciences (6th
ed.). Routledge.
Shingala, M. C., & Rajyaguru, A. (2015). Comparison of post hoc tests for unequal variance.
Journal of New Technologies in Science and Engineering, 2(5), 22–33.
Smith, R. A. (1971). The effect of unequal group size on Tukey’s HSD procedure. Psychometrika,
36(1), 31–34.
Warne, R. T. (2014). A primer on multivariate analysis of variance (MANOVA) for behavioral sci-
entists. Practical Assessment, Research & Evaluation, 19(17), 1–10.

Further Reading

Gelman, A. (2005). Analysis of variance—Why it is more important than ever. The Annals of
Statistics, 33(1), 1–53.
Ho, R. (2006). Handbook of univariate and multivariate data analysis and interpretation with
SPSS. CRC Press.
Sawyer, S. F. (2009). Analysis of variance: The fundamental concepts. Journal of Manual &
Manipulative Therapy, 17(2), 27–38.
Scheffe, H. (1999). The analysis of variance. Wiley.
Turner, J. R., & Thayer, J. (2001). Introduction to analysis of variance: Design, analysis & inter-
pretation. Sage Publications.
4 Discriminant Analysis

Contents

4.1 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204


4.2 Procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
4.2.1 Definition of Groups and Specification of the Discriminant Function. . . . . . . . 207
4.2.2 Estimation of the Discriminant Function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
4.2.2.1 Discriminant Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
4.2.2.2 Standardization of the Discriminant Coefficients. . . . . . . . . . . . . . . . . . 216
4.2.2.3 Stepwise Estimation Procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
4.2.2.4 Multi-group Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
4.2.3 Assessment of the Discriminant Function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
4.2.3.1 Assessment Based on the Discriminant Criterion. . . . . . . . . . . . . . . . . . 219
4.2.3.2 Comparing Estimated and Actual Group Membership. . . . . . . . . . . . . . 222
4.2.4 Testing the Describing Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
4.2.5 Classification of New Observations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
4.2.5.1 Distance Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
4.2.5.2 Classification Functions Concept. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
4.2.5.3 Probability Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
4.2.6 Checking the Assumptions of Discriminant Analysis. . . . . . . . . . . . . . . . . . . . . . 235
4.3 Case Study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
4.3.1 Problem Definition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
4.3.2 Conducting a Discriminant Analysis with SPSS. . . . . . . . . . . . . . . . . . . . . . . . . . 239
4.3.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
4.3.3.1 Results of the Blockwise Estimation Procedure. . . . . . . . . . . . . . . . . . . 242
4.3.3.2 Results of a Stepwise Estimation Procedure. . . . . . . . . . . . . . . . . . . . . . 256
4.3.4 SPSS Commands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
4.4 Recommendations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263


Table 4.1  Application examples of discriminant analysis in different disciplines


Field of application Exemplary research questions
Biology Do environmental and social factors explain the abundance (decrease,
no change, and increase) of specific species?
Business & Economics What organizational and contextual factors differentiate innovative
from non-innovative companies?
Educational Science Does physical training influence the motor skills of primary school
pupils?
Geography Can soil properties be used to identify the geographic origin of
Chinese green tea?
Medicine What differentiates benign and malignant micro-calcifications in
mammograms?
Political science Who is voting for what political party?
Psychology Can competencies and personality traits explain career success?

4.1 Problem

Imagine you want to find out what differentiates voters of different political parties (e.g.,
Democrats, Republicans, Libertarians, Greens). To do so, you draw a random sample of
voters of the various parties of interest, and collect socio-demographic, psychographic,
and attitudinal data. The variable indicating the party a person votes for is a categorical
(nominal) variable. Its values represent different categories that are mutually exclusive,
that is, each person can be assigned to one specific group (i.e., supporter of a particular
party). The variables considered to describe the voters might be age, income, consump-
tion orientation, or attitude towards technology. These variables are metrically scaled or
can be interpreted as metrically scaled. With the help of discriminant analysis, we can
examine which of the variables that describe the voters discriminate the different groups
of supporters. Table 4.1 shows some more exemplary research questions from various
disciplines that can be answered with the help of discriminant analysis.
Discriminant analysis is a multivariate method to analyze the relationship between
a single categorical dependent variable and a set of metric (normally distributed) inde-
pendent variables.1 The categorical variable is called grouping variable and reflects the
group an observation (i.e., object or subject) belongs to, such as, for example, buyers of
different brands, voters of different parties, patients with different symptoms, or firms
with different performances. If we consider just two groups, the technique is called

1 If we are interested in the question of whether two groups differ significantly with respect to just
one variable, we can use an independent samples t-test. For more than two groups, we can use the
univariate analysis of variance (see Sect. 3.2.1).

Fig. 4.1 Process steps of discriminant analysis:
1. Definition of groups and specification of the discriminant function
2. Estimation of the discriminant function
3. Assessment of the discriminant function
4. Testing the describing variables
5. Classification of new observations
6. Checking the assumptions of discriminant analysis

two-group discriminant analysis. If we take three or more groups into account, we call
the technique multi-group discriminant analysis (cf. Sect. 4.2.2.4).
The members of each group are described along a set of observed variables (describ-
ing variables). For example, we might observe socio-demographic or psychographic
variables related to the buyers of different brands, historical health data of patients with
different symptoms, or characteristics of firms. The researcher has to decide which
describing variables are considered, and theoretical considerations should guide the
selection process to ensure non-spurious relationships.
Overall, discriminant analysis may be used to pursue two different aims. First, we can
use a discriminant analysis to identify describing variables that discriminate between dif-
ferent groups and to assess the discriminatory power of the describing variables (discrimi-
nation task). Second, we can use a discriminant analysis to predict the group membership
of new observations based on the describing variables—once we know which describ-
ing variables distinguish the members of the groups (classification task). An example of
the latter task is credit scoring: customers of a bank who have a loan can be divided into
‘good’ and ‘bad’ customers according to their payment behavior. With the help of discrimi-
nant analysis, we can examine what variables (e.g. age, marital status, income, duration of
current employment, or number of existing loans) differ between the two groups. By doing
so, we identify the set of discriminatory variables. If a new customer applies for a loan, the
bank can thus predict the creditworthiness of this customer based on his characteristics.

4.2 Procedure

Generally, discriminant analysis follows a six-step procedure as illustrated in Fig. 4.1.


We first define the groups and, thus, the categories of the dependent variable. Further,
we specify one or more discriminant functions, which are linear combinations of the
independent variables that best discriminate between the groups. If the grouping var-
iable represents just two groups, we specify one discriminant function. If we observe
three or more groups, we use more than one discriminant function (cf. Sect. 4.2.2.4).
Second, we estimate the discriminant function(s). Third, we assess the discriminatory

Table 4.2  Perceptions of buyers of the focal brand and the main competitor brand (example case)
Buyers of focal brand Buyers of main competitor
(group = 1) (group = 2)
Buyer Price Delicious Buyer Price Delicious
1 2 3 13 5 4
2 3 4 14 4 3
3 6 5 15 7 5
4 4 4 16 3 3
5 3 2 17 4 4
6 4 7 18 5 2
7 3 5 19 4 2
8 2 4 20 5 5
9 5 6 21 6 7
10 3 6 22 5 3
11 3 3 23 6 4
12 4 5 24 6 6

power of the estimated discriminant function(s). Fourth, we examine the discriminatory
power of the describing (independent) variables and assess which describing
variables contribute the most to explaining the differences between the groups. Fifth,
we describe how (new) observations are classified to one of the groups based on the
observed values of the describing variables. Finally, we test the assumptions of the dis-
criminant analysis.
In the following, we illustrate how to conduct a discriminant analysis using an exam-
ple with just two groups and two describing variables (two-group discriminant analysis).
We use a small example related to consumers’ choices of chocolate brands.

Application Example
A manager of a chocolate company wants to know whether the buyers of its own
brand (here: focal brand) perceive its chocolates differently compared to the buyers
of its main competitor brand. The manager considers two describing variables to be
relevant: ‘price’ and ‘delicious’. We examine 12 buyers of the focal brand and 12 buy-
ers of the main competitor brand. All 24 respondents provided information about their
perceptions using a 7-point scale (from 1 = ‘low’ to 7 = ‘high’). Table 4.2 shows the
collected data.2 ◄

2 On the website www.multivariate-methods.info, we provide supplementary material (e.g., Excel
files) to deepen the reader’s understanding of the methodology.

4.2.1 Definition of Groups and Specification of the Discriminant Function

(Process steps of discriminant analysis, cf. Fig. 4.1 – current step: 1 Definition of groups and specification of the discriminant function)

The grouping variable has to be a categorical variable and to reflect different mutually
exclusive and collectively exhaustive groups. The (two or more) groups can either be
determined by the research question at hand or be the result of, for example, a cluster
analysis (cf. Chap. 8). We can also convert a metric variable (e.g., a firm’s profit) into
a categorical variable (e.g., low vs. high performance) to form the grouping variable.
However, we need to be aware that we lose information if we do so.
Generally, the number of groups should not be larger than the number of describing
variables. In the example above, we have two describing variables, namely ‘price’ and
‘delicious’. Thus, we restrict the example to two groups: ‘buyers of the focal brand’ and
‘buyers of the competing brand’. The groups are identified by a group index g (g = 1, 2,
…, G), where G is the total number of groups (here: G = 2 and g = 1, 2).

Recommendations related to the minimum number of observations


It is important to note that the reliability of the discriminant analysis depends on the
number of observations. We should have at least five observations per describing vari-
able; 20 observations per describing variable are recommended. Moreover, each group
should have at least 20 observations to warrant statistically significant and reliable
results. Finally, the relative sizes of the groups should be comparable. Large differences
in the relative group sizes will influence the estimation of the discriminant function(s)
and the classification of the observations. Regarding these considerations, the groups in
our example are actually rather small (only 12 observations for each group), but they are
just meant as an illustrative example.

Discriminant function
The discriminant function is a linear combination of the describing variables:
$$Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_J X_J \qquad (4.1)$$

with

• Y: dependent variable (grouping variable)


• b0: constant term
• bj: discriminant coefficient of the independent (describing) variable j
• Xj: independent (describing) variable j

The discriminant function is also called canonical discriminant function, and the discri-
minant variable Y is called canonical variable, with the term canonical indicating that
the variables are combined linearly. It is thus assumed that the relationship between the
independent and dependent variables is linear.
For each observation that is described by some variables, the discriminant function
predicts a value for the discriminant variable Y. The coefficients b0 and bj (j = 1, 2, …, J)
are estimated based on the observed data in such a way that the groups differ as much
as possible with respect to the values of the discriminant variable Y. Since the describ-
ing variables are metric, the resulting discriminant variable Y is also metric and does not
directly indicate the group membership. In Sect. 4.2.2.2 we discuss how estimates of
group membership are derived from the estimated value of the discriminant variable.

4.2.2 Estimation of the Discriminant Function

(Process steps of discriminant analysis, cf. Fig. 4.1 – current step: 2 Estimation of the discriminant function)

When estimating the unknown coefficients b0 and bj of the discriminant function,


the objective is to identify all bj in such a way that they separate the groups well. This
requires an objective function that maximizes the separation of the groups (discriminant
criterion).

4.2.2.1 Discriminant Criterion
Discriminant analysis aims to identify the describing variables that discriminate the
groups. This implies that the groups differ with respect to the describing variables. If
they do so, we should observe different values for the describing variables in the two
groups of the example presented above. Figure 4.2 displays a scatterplot of the observed
values of the describing variables. The buyers of the focal brand are represented by red
squares and the buyers of the main competitor brand by black asterisks.

Fig. 4.2 Scatterplot of the observed data (x-axis: price, y-axis: delicious; red squares: buyers of the focal brand, black asterisks: buyers of the main competitor brand)

Figure 4.2 also shows the frequency distributions (histograms) of the values of the
describing variables (i.e. ‘price’ and ‘delicious’) below (‘price’) or beside (‘delicious’)
the scatterplot. The frequency distribution for each group is displayed separately and the
axes correspond to the original x- and y-axes. We learn that buyers of the competing
brand tend to rate ‘price’ higher than buyers of the focal brand (cf. histogram below the
scatterplot). The respective mean values of ‘price’ are 5.0 and 3.5. In contrast, buyers
of the focal brand rate, on average, ‘delicious’ higher than the buyers of the main com-
petitor brand (mean values: 4.5 vs. 4.0) (cf. histogram beside the scatterplot). However,
due to the significant overlaps of the two distributions, neither variable seems to separate
the two groups very well. Visual inspection of the histograms suggests that ‘price’ may
separate the groups better because the groups seem to differ more with respect to this
variable.
While the scatterplot in Fig. 4.2 provides a first glance at the data, it considers each
describing variable in isolation. With the help of the discriminant function, we can
consider both describing variables jointly. Since the two groups seem to differ in their
perceptions regarding ‘price’ (i.e., higher mean for buyers of the main competitor) and

Table 4.3  Individual discriminant values (b0 = 0, b1 = 0.5 and b2 = –0.5)

Group Buyer Price Delicious Yi
1 1 2 3 –0.50
1 2 3 4 –0.50
1 3 6 5 0.50
1 4 4 4 0.00
1 5 3 2 0.50
1 6 4 7 –1.50
1 7 3 5 –1.00
1 8 2 4 –1.00
1 9 5 6 –0.50
1 10 3 6 –1.50
1 11 3 3 0.00
1 12 4 5 –0.50
2 13 5 4 0.50
2 14 4 3 0.50
2 15 7 5 1.00
2 16 3 3 0.00
2 17 4 4 0.00
2 18 5 2 1.50
2 19 4 2 1.00
2 20 5 5 0.00
2 21 6 7 –0.50
2 22 5 3 1.00
2 23 6 4 1.00
2 24 6 6 0.00

‘delicious’ (i.e., higher mean for buyers of the focal brand), we expect that the coeffi-
cients for the two describing variables are contrary—one having a positive effect and the
other one having a negative effect. For the moment, let us assume the following: b0 = 0,
b1 = 0.5 and b2 = –0.5:
Y = 0.5 · X1 − 0.5 · X2
Based on this discriminant function, we can compute the value of the discriminant varia-
ble for each observation (Table 4.3). For example, buyer i = 1 has an estimated value for
the discriminant variable of –0.5 (y1 = 0.5 · 2 − 0.5 · 3).
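The discriminant values in Table 4.3 and the two group centroids can be reproduced with a short numpy sketch (our own illustration; the coefficients are the assumed ones from above, not estimated values):

```python
import numpy as np

# Perception data from Table 4.2: columns = (price, delicious)
focal      = np.array([[2,3],[3,4],[6,5],[4,4],[3,2],[4,7],[3,5],[2,4],[5,6],[3,6],[3,3],[4,5]])
competitor = np.array([[5,4],[4,3],[7,5],[3,3],[4,4],[5,2],[4,2],[5,5],[6,7],[5,3],[6,4],[6,6]])

# Assumed discriminant function Y = 0.5*price - 0.5*delicious (with b0 = 0)
b = np.array([0.5, -0.5])
y_focal      = focal @ b        # discriminant values, group g = 1
y_competitor = competitor @ b   # discriminant values, group g = 2

print(y_focal.mean(), y_competitor.mean())                # group centroids: -0.5 and 0.5
print(y_focal.std(ddof=1), y_competitor.std(ddof=1))      # within-group standard deviations
```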

Fig. 4.3 Discriminant axis (schematic representation of the group centroids ȲA and ȲB and the critical discriminant value Y* on the discriminant axis)
If the describing variables are able to separate the groups well, the resulting values for
the discriminant variable Y should differ between the two groups. Thus, we can describe
each group g by its mean value for the discriminant variable. This group mean is called
centroid:
$$\bar{Y}_g = \frac{1}{I_g} \sum_{i=1}^{I_g} Y_{ig} \qquad (4.2)$$

If the two groups are well separated, the difference between the centroids is large. For
the assumed coefficients of b1 = 0.5 and b2 = –0.5, we get a group centroid of –0.5 for
the buyers of the focal brand (g = 1) and of 0.5 for the buyers of the main competitor’s
brand (g = 2).
We can display the values of the group centroids on a so-called discriminant axis.
Figure 4.3 illustrates a discriminant axis in general terms. The difference in group cen-
troids can now be expressed in terms of a distance. Besides the values of the group
centroids for the discriminant variable, the discriminant axis also shows the critical
discriminant value (Y*). Knowledge about the critical discriminant value allows us to
assign new observations to one of the groups. In Fig. 4.3, observations with a value for
the discriminant variable lower than the critical value (Yi’ < Y*) are assigned to group
A, while observations with a value for the discriminant variable higher than the critical
value (Yi’ > Y*) are assigned to group B.

Discriminant Criterion Based on the Variation Between and Within Groups


As Fig. 4.4 illustrates, it is not sufficient to just consider the difference between cen-
troids. The upper and lower parts of Fig. 4.4 show the distribution of discriminant val-
ues for two groups, A and B, along the discriminant axis. The difference in centroids
equals D in both scenarios. However, the distributions in the lower part overlap substan-
tially, while in the upper part the two groups only overlap at the tails of their distribu-
tions. Consequently, the two groups in the lower part are not as well separated as the two
groups displayed in the upper part of Fig. 4.4. This is because these distributions have a
higher dispersion.
Thus, we should also consider the dispersion (i.e., standard deviation) of discrimi-
nant values within groups, not just the difference in centroids, when assessing how well
the groups are separated. Thus, large differences between centroids and little dispersion
within each group are preferable outcomes when estimating the discriminant function.

Fig. 4.4 Distribution of discriminant values in two different scenarios (upper part: little overlap of the distributions of discriminant values of groups A and B; lower part: large overlap of the distributions of discriminant values of groups A and B; in both scenarios the difference D between the group centroids is the same)

In the example, the standard deviations in the two groups are 0.67 (g = 1) and 0.60
(g = 2), respectively. We can use the information about the group centroids and the stand-
ard deviations within each group to plot the distribution of the discriminant values for
each group. Since we assume that the independent variables are normally distributed,
the discriminant values also follow a normal distribution. Figure 4.5 shows the distribu-
tions of the discriminant values for the buyers of the focal brand and the main competitor
brand based on the discriminant function Y = 0.5 · X1 − 0.5 · X2.
We can see that the distributions of the discriminant values overlap substantially.
There might be two reasons for this rather dissatisfying result. First, the groups of buyers
of the focal and the main competitor brand might not differ much with respect to their
perceptions of ‘price’ and ‘delicious’. Second, the assumed coefficients for the describ-
ing variables may not be ‘optimal’, that is, they are not able to separate the groups well.
Since we have not formally derived the coefficients but simply made an assumption, the
latter reason requires some attention.
The distributions in Fig. 4.5 are not given by the observed data but depend on the
coefficients bj which determine the estimated values for the discriminant variables, and
thus the centroids and the variations around them. Our aim is to separate the groups as
much as possible, that is, the centroids should be ‘far away’ from each other and the
variation in each group should be as small as possible. We can formally express this idea
with the following so-called discriminant criterion Γ:
$$\Gamma = \frac{\text{variation between groups}}{\text{variation within groups}} = \frac{\sum_{g=1}^{G} I_g \cdot (\bar{Y}_g - \bar{Y})^2}{\sum_{g=1}^{G} \sum_{i=1}^{I_g} (Y_{ig} - \bar{Y}_g)^2} = \frac{SS_b}{SS_w} \qquad (4.3)$$

Fig. 4.5 Distribution of discriminant values for buyers of the focal and the main competitor brand (b0 = 0, b1 = 0.5 and b2 = –0.5)

The numerator of the discriminant criterion Γ in Eq. (4.3) represents the difference
between the centroids and thus the variation between groups (SSb). It is calculated as the
squared difference between a group’s centroid and the total mean. In order to account for
different group sizes, the differences are weighted by the respective group size Ig. Thus,
the larger the numerator, the larger the difference between the centroids.
The denominator in Eq. (4.3) represents the variation within groups (SSw), that is, the
squared difference between each discriminant value and the respective group centroid.
We here assume approximately equal dispersion matrices for the different groups. The
smaller the denominator, the smaller the dispersion, and the more likely we can observe
well separated groups. Thus, the larger SSb and the smaller SSw, the larger the value for
the discriminant criterion Γ, and the better the groups are separated.
Since the centroids are defined by the describing variables, the variation between
groups is also called explained variation. Yet, the variation within groups is not
explained by the describing variables and thus is called unexplained variation.3

3 See Sect. 3.2.1.2 for a more detailed discussion of the explained and unexplained variation.

Table 4.4  Individual discriminant values based on the estimated discriminant function (b0 = –1.982, b1 = 1.031 and b2 = –0.565)

Group Buyer Yi Group Buyer Yi
1 1 –1.614 2 13 0.914
1 2 –1.148 2 14 0.448
1 3 1.381 2 15 2.412
1 4 –0.117 2 16 –0.583
1 5 –0.018 2 17 –0.117
1 6 –1.810 2 18 2.044
1 7 –1.712 2 19 1.013
1 8 –2.179 2 20 0.350
1 9 –0.215 2 21 0.252
1 10 –2.277 2 22 1.479
1 11 –0.583 2 23 1.946
1 12 –0.681 2 24 0.816

We aim to estimate the coefficients bj in such a way that Γ is maximized. The con-
stant term b0 merely shifts the scale of the discriminant values, but does not influence the
value of the discriminant criterion Γ. Thus, it does not play an active role in the estima-
tion procedure. For our exemplary data, we get the following discriminant function when
maximizing Γ:
Y = −1.982 + 1.031 · X1 − 0.565 · X2
Based on this discriminant function, we compute again the discriminant values for each
observation (Table 4.4). The centroids of the two groups are now –0.914 for the focal
brand (g = 1) and 0.914 for the main competitor brand (g = 2). The respective standard
deviations are 1.079 for g = 1 and 0.915 for g = 2, which are in fact similar.
We can use the information about the centroids together with the individually esti-
mated values of the discriminant variables to compute SSb and SSw, which are equal to
20.07 and 22.0, respectively. Accordingly, the resulting value for the discriminant crite-
rion Γ is 0.912, which is the maximum value in this example. However, from Fig. 4.6 we
learn that there is still a substantial overlap, although the distributions in Fig. 4.6 overlap
less than the ones in Fig. 4.5.
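SPSS performs this maximization internally. For readers who want to retrace the result, the following sketch (our own illustration, not the SPSS algorithm) obtains the coefficients from the eigenvalue problem of the within-group and between-group sums-of-squares-and-cross-products (SSCP) matrices and then applies the standardization described in Sect. 4.2.2.2:

```python
import numpy as np

# Data from Table 4.2: columns = (price, delicious); rows 0-11 group 1, rows 12-23 group 2
X = np.array([[2,3],[3,4],[6,5],[4,4],[3,2],[4,7],[3,5],[2,4],[5,6],[3,6],[3,3],[4,5],
              [5,4],[4,3],[7,5],[3,3],[4,4],[5,2],[4,2],[5,5],[6,7],[5,3],[6,4],[6,6]], float)
g = np.repeat([1, 2], 12)

grand_mean = X.mean(axis=0)
B = np.zeros((2, 2))   # between-group SSCP matrix
W = np.zeros((2, 2))   # within-group SSCP matrix
for group in (1, 2):
    Xg = X[g == group]
    diff = Xg.mean(axis=0) - grand_mean
    B += len(Xg) * np.outer(diff, diff)
    W += (Xg - Xg.mean(axis=0)).T @ (Xg - Xg.mean(axis=0))

# Maximizing Gamma = b'Bb / b'Wb leads to the eigenvalue problem (W^-1 B) b = Gamma * b
eigenvalues, eigenvectors = np.linalg.eig(np.linalg.inv(W) @ B)
k = np.argmax(eigenvalues.real)
gamma = eigenvalues.real[k]          # eigenvalue (maximum of the discriminant criterion)
b = eigenvectors.real[:, k]

# Standardization: pooled within-group variance of the discriminant values equals one (Eq. 4.4)
N, G = len(X), 2
b = b / np.sqrt(b @ W @ b / (N - G))
b = b * np.sign(b[0])                # resolve the arbitrary sign ('price' gets the positive weight)
b0 = -(grand_mean @ b)               # constant term: total mean of the discriminant values is zero
print(gamma, b0, b)                  # ~0.912, ~-1.982, [~1.031, ~-0.565]
```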
Table 4.5 shows the resulting values of the discriminant criterion for various values
of the coefficients b1 and b2. Since b0 does not affect the discriminant criterion, we have
set its value to zero. For the values b1 = 1 and b2 = 0, the discriminant variable Y is equal
to X1 (i.e., price) and for the values b1 = 0 and b2 = 1, it is equal to X2 (i.e., delicious).
Consequently, the resulting value of the discriminant criterion reflects the discriminatory

Fig. 4.6 Distributions of discriminant values of the two groups (b0 = –1.982, b1 = 1.031 and b2 = –0.565)

Table 4.5  Resulting values of the discriminant criterion for different values of the discriminant coefficients (here: |b1| + |b2| = 1)

b1 b2 Γ (discriminant criterion)
1.000 0.000 0.466
0.000 1.000 0.031
0.500 0.500 0.050
0.500 –0.500 0.667
0.600 –0.400 0.885
0.646 –0.354 0.912
0.700 –0.300 0.882
0.800 –0.200 0.735
0.900 –0.100 0.582

power of the respective describing variable (Table 4.5). As Fig. 4.2 already suggested,
‘price’ has a greater discriminatory power (Γ = 0.466) than ‘delicious’ (Γ = 0.031). The
difference between the means of the describing variable ‘delicious’ is smaller than that
of ‘price’ and at the same time, the standard deviation is larger. This results in a lower

variation between groups (SSb) and a higher variation within groups (SSw) for ‘deli-
cious’, ultimately leading to a much lower value of the discriminant criterion. However,
we also learn that considering only the describing variable ‘price’ results in a lower value
of the discriminant criterion than considering both variables jointly (Γ = 0.466 compared
to Γ = 0.912). This result indicates that both variables actually contribute to the separa-
tion of the two groups.
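The values in Table 4.5 can be retraced with a small helper function that evaluates the discriminant criterion of Eq. (4.3) for arbitrary coefficients (again a sketch of our own based on the data in Table 4.2):

```python
import numpy as np

# Data from Table 4.2: (price, delicious); group 1 = focal brand, group 2 = main competitor
focal      = np.array([[2,3],[3,4],[6,5],[4,4],[3,2],[4,7],[3,5],[2,4],[5,6],[3,6],[3,3],[4,5]], float)
competitor = np.array([[5,4],[4,3],[7,5],[3,3],[4,4],[5,2],[4,2],[5,5],[6,7],[5,3],[6,4],[6,6]], float)

def discriminant_criterion(b1, b2):
    """Evaluate Gamma = SS_b / SS_w of Eq. (4.3) for the coefficients (b1, b2)."""
    b = np.array([b1, b2])
    y1, y2 = focal @ b, competitor @ b            # discriminant values per group
    y_all = np.concatenate([y1, y2])
    ss_b = len(y1) * (y1.mean() - y_all.mean())**2 + len(y2) * (y2.mean() - y_all.mean())**2
    ss_w = ((y1 - y1.mean())**2).sum() + ((y2 - y2.mean())**2).sum()
    return ss_b / ss_w

for b1, b2 in [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.5, -0.5), (0.646, -0.354)]:
    print(b1, b2, round(discriminant_criterion(b1, b2), 3))
# expected (cf. Table 4.5): 0.466, 0.031, 0.050, 0.667, 0.912
```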

Characteristic of the ‘optimal’ Discriminant Coefficients


In Table 4.5, the coefficients b1 = 0.646 and b2 = –0.354 also result in the maximum
value of 0.912 for the discriminant criterion. The reason is that these coefficients are pro-
portional to the estimated ones:

$$\frac{-0.565}{1.031} = \frac{-0.354}{0.646} = -0.55$$
Any set of coefficients that meets the requirement of b2 = –0.55 ⋅ b1 leads to a value of
the discriminant criterion equal to 0.912. Yet no other combination of coefficients will
result in a higher value of the discriminant criterion. Thus, when maximizing the discri-
minant criterion, only the ratio of the discriminant coefficients b2/b1 is clearly defined
(here: –0.55). While the discriminant values change depending on the specifically chosen
coefficients, the value of the discriminant criterion does not change.

4.2.2.2 Standardization of the Discriminant Coefficients


In order to obtain clearly defined values for the discriminant coefficients, we can standardize
them. The most common way is to scale the discriminant coefficients in such a way that the
pooled within-group variance of the discriminant values equals one, where this variance is
defined as:

$$s^2_{pooled} = \frac{SS_w}{N - G} \qquad (4.4)$$
with

• N: number of observations
• G: number of groups

The constant term b0 is then determined in such a way that the total mean of all discrimi-
nant values is equal to zero. When using SPSS to conduct a discriminant analysis, SPSS
carries out this standardization by default.
When the discriminant coefficients are standardized, the critical discriminant value Y*
is zero. In a two-group discriminant analysis, observations with an estimated discrimi-
nant value larger than zero belong to one group, and observations with an estimated dis-
criminant value smaller than zero are assigned to the other group.

Fig. 4.7 Presentation of the optimal discriminant axis (x-axis: price, y-axis: delicious; red squares: buyers of the focal brand, black asterisks: buyers of the main competitor brand)

Figure 4.7 is based on Fig. 4.2 and presents the mapping of the original observations
onto the discriminant axis for the discriminant function Y = –1.982 + 1.031 · X1 – 0.565 · X2.
The discriminant axis has a slope of –0.55, and at the origin of the coordinate system the
discriminant value is Y = –1.982. The critical discriminant value Y* of zero is indicated by
the dashed line that crosses the discriminant axis. We recognize that one observation of
group g = 1 (focal brand, red squares) and two observations of group g = 2 (main competitor,
black asterisks) are not correctly classified (cf. Table 4.4). But overall, the two describing
variables ‘price’ and ‘delicious’ in combination seem to separate the two groups quite well.

4.2.2.3 Stepwise Estimation Procedure


In the example above, we considered the two describing variables simultaneously
when estimating the discriminant function. Alternatively, we can use a stepwise esti-
mation procedure. In a stepwise estimation process the describing variables are entered

consecutively according to their discriminatory power. If variables are not able to dis-
criminate the groups, they are not included in the discriminant function.
The stepwise estimation procedure follows a sequential process of adding or deleting
describing variables:

1. Select the single best discriminating variable, i.e., the describing variable leading to
the highest value for the discriminant criterion when considered alone.
2. Combine the initial variable with each of the other describing variables, one at a time,
and choose the describing variable that is best able to improve the discriminating
power of the function (i.e., leads to the largest increase in the discriminant criterion).
3. Continue adding describing variables to the discriminant function that improve the
discriminant criterion. Note that as additional describing variables are included, some
previously selected describing variables may be removed if the information they con-
tain on group differences is provided by some combination of the other variables
included at later stages.
4. The procedure stops when no further improvement of the discriminant criterion is
possible.

On the one hand, the stepwise estimation procedure can be useful when we have a large
number of describing variables that potentially may discriminate the groups. By sequen-
tially selecting the variables that separate the groups, we may be able to reduce the set
of describing variables and develop a more parsimonious model. On the other hand, we
should only consider describing variables that are supported by theoretical considera-
tions or a priori knowledge. This means that there should be a good reason to include
the describing variables initially. Thus, the stepwise estimation procedure should not be
used to mine the data. Moreover, a non-significant result for a describing variable is also
a relevant finding.

4.2.2.4 Multi-group Discriminant Analysis


As mentioned earlier, discriminant analysis can also be used to differentiate more than
two groups. This procedure is called multi-group discriminant analysis. In multi-group
discriminant analysis, G–1 discriminant functions can be estimated if the number of
describing variables is larger than this number. In general, with G groups and J describ-
ing variables, it is possible to estimate up to min(G–1, J) discriminant functions. Multi-
group discriminant analysis creates multiple equations representing dimensions of
discrimination among the groups—each dimension separate and distinct from the other.
Each discriminant function k results in a separate value for the discriminant variable
Yk. In the case of three groups and K = 2 discriminant functions, each object has a separate
discriminant value for each of the functions k = 1 and k = 2. Thus, in addition to improving the explanation of group
membership, these additional discriminant functions add insights into the various com-
binations of independent variables that discriminate between the groups. We present a
multi-group discriminant analysis with three groups in the case study in Sect. 4.3.

4.2.3 Assessment of the Discriminant Function

(Process steps of discriminant analysis, cf. Fig. 4.1 – current step: 3 Assessment of the discriminant function)

To assess the quality of a discriminant function, we can use the discriminant criterion.
Moreover, we can compare the estimated and actual group membership of observations.

4.2.3.1 Assessment Based on the Discriminant Criterion

Eigenvalue and Share of Explained Variation


The maximum value of the discriminant criterion for a specific discriminant function is
a measure of the discriminatory power of the discriminant function. The maximum value
of the discriminant criterion is also called eigenvalue. The (theoretical) lower boundary
of the eigenvalue is zero. Its upper boundary can be greater than one, which makes it dif-
ficult to compare eigenvalues from different analyses. To address this issue, we can use
the following formula:

$$\frac{\Gamma}{1+\Gamma} = \frac{SS_b / SS_w}{1 + SS_b / SS_w} = \frac{SS_b / SS_w}{(SS_w + SS_b)/SS_w} = \frac{SS_b}{SS_b + SS_w} = \frac{\text{explained variation}}{\text{total variation}} \qquad (4.5)$$

The denominator is now the sum of SSb and SSw and actually equal to the total variation.
Thus, the result of Eq. (4.5) is the share of explained variation.4 In our example, only
47.7% (= 0.912/(1 + 0.912)) of the variation in the dependent variable is explained by the
discriminant function, which is a rather disappointing result.

Relative Eigenvalue as a Measure in Multi-group Analyses


In multi-group discriminant analysis, we also maximize the discriminant criterion,
with the first function having the highest eigenvalue. The second discriminant func-
tion—uncorrelated with the first one—has the second highest eigenvalue, and so forth.
The second discriminant function is determined in such a way that it explains a maxi-
mum proportion of the variation that is unexplained by the first discriminant function.

4 In the two-group case, the result of Eq. (4.5) corresponds to the coefficient of determination R² in
regression analysis (cf. Sect. 2.2.3.2).

Since the first discriminant function was determined so that the eigenvalue and, thus, the
explained variation is maximized, the variation explained by the second discriminant
function cannot be any higher. Accordingly, each further discriminant function is deter-
mined in such a way that it explains a maximum proportion of the residual (unexplained)
variation.
We thus get: $\Gamma_1 \ge \Gamma_2 \ge \Gamma_3 \ge \ldots \ge \Gamma_K$.
As a measure of the relative importance of a discriminant function, the relative eigen-
value can be used, which reflects the share of explained variance:
$$\text{share of explained variance}_k = \frac{\Gamma_k}{\Gamma_1 + \Gamma_2 + \ldots + \Gamma_K} = \frac{\text{explained variance}_k}{\sum_{k=1}^{K} \text{explained variance}_k} \qquad (4.6)$$

The importance of successively determined discriminant functions usually decreases rapidly. Empirical evidence shows that even with a large number of groups and describing
variables, two discriminant functions are often sufficient (cf. Cooley & Lohnes, 1971,
p. 244). The advantage of just two discriminant functions is that we can represent the
observations in a two-dimensional space whose axes are the discriminant values of the
respective discriminant functions. Similarly, the describing variables can be represented as
vectors at the discriminant level. Using such graphical presentations makes discriminant
analysis an alternative to factor analysis (cf. Sect. 7.3.3.3) or multidimensional scaling.

Canonical Correlation
Besides the share of explained variation, we can also use the square root of Eq. (4.5) as a
measure of the discriminatory power of the discriminant function. This is called canoni-
cal correlation, and it measures the extent of association between the discriminant value
and the groups (Tatsuoka, 1988, p. 235).

$$c = \sqrt{\frac{\Gamma}{1+\Gamma}} \qquad (4.7)$$

In our example, the canonical correlation is:

$$c = \sqrt{\frac{0.912}{1 + 0.912}} = 0.691$$
The maximum (and best) value that can be achieved for the canonical correlation is 1.

Wilks’ lambda for the Discriminant Function


Additionally, we can assess the significance of the estimated discriminant function with
the help of Wilks’ lambda (also known as U-statistic):

$$\Lambda = \frac{1}{1+\Gamma} = \frac{1}{1 + SS_b / SS_w} = \frac{SS_w}{SS_b + SS_w} = \frac{\text{unexplained variation}}{\text{total variation}} \qquad (4.8)$$

Wilks’ lambda is an inverse quality measure, i.e., the smaller the value, the better the discriminatory power of the discriminant function ($\Lambda = 1 - c^2$). In our example, we get

$$\Lambda = \frac{1}{1 + 0.912} = 0.523 \quad \text{or} \quad \Lambda = 1 - 0.477 = 0.523$$
We can transform Wilks’ lambda into a test statistic that approximately follows a chi-square
distribution with J × (G – 1) degrees of freedom:

$$\chi^2_{emp} = -\left[N - \frac{J+G}{2} - 1\right] \cdot \ln(\Lambda) \qquad (4.9)$$
with

• N: number of observations
• J: number of describing variables
• G: number of groups

The chi-square value increases with smaller values for Wilks’ lambda. Higher values
therefore indicate a better separation of the groups. For our example, we get:
 
2 2+2
χemp = − 24 − − 1 ln (0.523) = 13.614
2
The corresponding null hypothesis H0 states that the mean discriminant value is equal
across all groups. With two degrees of freedom (df = 2), the theoretical chi-square value
at a 5% significance level is 5.99.5 Since the empirical chi-square value of 13.6 is larger
than the theoretical one, we reject H0 and accept H1 that the two groups differ with
respect to the mean value of the discriminant variable (p = 0.001). Consequently, the dis-
criminant function is significant.
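The quality measures discussed so far can be reproduced in a few lines, starting from the eigenvalue Γ = 0.912 reported above (a sketch of our own; scipy is used only for the p-value of the chi-square statistic):

```python
import numpy as np
from scipy import stats

gamma = 0.912        # eigenvalue (maximum of the discriminant criterion), see above
N, J, G = 24, 2, 2   # observations, describing variables, groups

explained_share = gamma / (1 + gamma)        # Eq. (4.5): ~0.477
canonical_corr  = np.sqrt(explained_share)   # Eq. (4.7): ~0.691
wilks_lambda    = 1 / (1 + gamma)            # Eq. (4.8): ~0.523

chi2_emp = -(N - (J + G) / 2 - 1) * np.log(wilks_lambda)   # Eq. (4.9): ~13.6
df = J * (G - 1)
p_value = stats.chi2.sf(chi2_emp, df)                      # ~0.001
print(explained_share, canonical_corr, wilks_lambda, chi2_emp, p_value)
```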

Multivariate Wilks’ lambda


To assess in a multi-group discriminant analysis whether all K discriminant functions
are significant, we can use multivariate Wilks’ lambda that we obtain by multiplying the
Wilks’ lambdas of each function.

$$\Lambda = \prod_{k=1}^{K} \Lambda_k = \prod_{k=1}^{K} \frac{1}{1 + \Gamma_k} \qquad (4.10)$$

Again we use a chi-square test with J × (G – 1) degrees of freedom. Multivariate Wilks’
lambda provides information on whether all discriminant functions together are significant.
Yet, not all estimated discriminant functions might be significant.

5 See Sect. 1.3 for a brief introduction to the basics of statistical testing.

Wilks’ lambda Testing for Residual Discrimination


To decide in a multi-group analysis whether, after determining the first k discriminant
functions, the remaining K–k discriminant functions can still contribute significantly to
the discrimination of the groups, it is useful to calculate Wilks’ lambda in the following
way (Wilks’ lambda testing for residual discrimination):
$$\Lambda_k = \prod_{q=k+1}^{K} \frac{1}{1 + \Gamma_q} \qquad (4.11)$$

The associated chi-square value has (J – k) × (G – k – 1) degrees of freedom.


If Wilks’ lambda that tests for residual discrimination becomes insignificant, we can
stop the identification of further discriminant functions. If Wilks’ lambda that tests for
residual discrimination is already insignificant for k = 0, there is no empirical evidence
for a significant difference between the groups.
Be aware that statistical significance of a discriminant function does not imply that it
separates the groups well but only that the groups differ significantly with regard to this
discriminant function. As with all statistical tests, a significant difference does not have
to be relevant. If the sample size is sufficiently large, even very small differences become
statistically significant.

4.2.3.2 Comparing Estimated and Actual Group Membership


Table 4.6 shows the estimated discriminant values for all 24 respondents of our above
example. The group centroids are –0.914 for g = 1 (i.e., buyers of the focal brand),
and + 0.914 for g = 2 (i.e., buyers of the main competitor brand). The total mean is equal
to zero, which is also the critical discriminant value. Respondents for whom we estimate
a discriminant value smaller than zero are assigned to group g = 1, and those for whom
we estimate a discriminant value larger than zero are assigned to group g = 2.
We learn from Table 4.6 that one respondent of group g = 1 (i = 3) and two respond-
ents of group g = 2 (i = 16 and i = 17) are assigned wrongly based on the estimated
discriminant values. Thus, in total, 21 out of 24 group assignments are correct, which
corresponds to a hit rate of 87.5% (= 21/24 · 100).

Classification Matrix
Table 4.7 shows the so-called classification matrix that is a 2 × 2 table displaying the
number of correctly and incorrectly assigned observations. The diagonal cells contain
the number of correctly classified observations for each group and the off-diagonal cells
contain the number of incorrectly classified observations (relative frequencies related to
the actual group membership are reported in brackets).
At first glance, an overall hit rate of 87.5% sounds decent. However, we need to com-
pare the hit rate of the estimated discriminant function with the hit rate that would be

Table 4.6  Comparison of estimated and actual group membership


Group   Respondent   yi   Estimated group membership   Hit?
1 1 –1.614 1 1
1 2 –1.148 1 1
1 3 1.381 2 0
1 4 –0.117 1 1
1 5 –0.018 1 1
1 6 –1.810 1 1
1 7 –1.712 1 1
1 8 –2.179 1 1
1 9 –0.215 1 1
1 10 –2.277 1 1
1 11 –0.583 1 1
1 12 –0.681 1 1
2 13 0.914 2 1
2 14 0.448 2 1
2 15 2.412 2 1
2 16 –0.583 1 0
2 17 –0.117 1 0
2 18 2.044 2 1
2 19 1.013 2 1
2 20 0.350 2 1
2 21 0.252 2 1
2 22 1.479 2 1
2 23 1.946 2 1
2 24 0.816 2 1

Table 4.7  Classification matrix (rows: actual group membership, columns: estimated group membership; relative frequencies in brackets)

                                   Buyer of focal brand    Buyer of main competitor brand
Buyer of focal brand               11 (91.7%)              1 (8.3%)
Buyer of main competitor brand     2 (16.7%)               10 (83.3%)

achieved by a purely random assignment of the observations (e.g., by tossing a coin).


In the case of two groups of the same size, we expect a hit rate of 50%. Yet, the hit rate
achieved by random assignment can also be higher if the groups have unequal sizes. Let
us assume a ratio of 80:20. The best guess we can make in this case is to assign all obser-
vations to the largest group, which results in a hit rate of 80%. Generally, a discriminant
function only has discriminatory power if it achieves a hit rate that is substantially higher
than the hit rate expected from random assignment. In our example, the benchmark is
50% and the hit rate based on the estimated discriminant function of 87.5% is well above
this critical value.
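The classification matrix in Table 4.7 and the hit rate follow directly from the estimated discriminant values in Table 4.6, as the following sketch (our own) shows:

```python
import numpy as np

# Estimated discriminant values from Table 4.6 (first 12: focal brand, last 12: main competitor)
y = np.array([-1.614, -1.148,  1.381, -0.117, -0.018, -1.810, -1.712, -2.179,
              -0.215, -2.277, -0.583, -0.681,  0.914,  0.448,  2.412, -0.583,
              -0.117,  2.044,  1.013,  0.350,  0.252,  1.479,  1.946,  0.816])
actual = np.repeat([1, 2], 12)

# Critical discriminant value Y* = 0: negative values -> group 1, positive values -> group 2
predicted = np.where(y < 0, 1, 2)

classification_matrix = np.array([[np.sum((actual == a) & (predicted == p)) for p in (1, 2)]
                                  for a in (1, 2)])
hit_rate = np.mean(actual == predicted)
print(classification_matrix)   # expected: [[11, 1], [2, 10]]
print(hit_rate)                # expected: 0.875
```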
Be aware that we generally expect a rather high hit rate if it is calculated based on the
sample used to estimate the discriminant function (internal validity). Since the discri-
minant function is always determined in such a way that the in-sample hit rate is max-
imized, we expect a lower hit rate if we apply the discriminant function to a hold-out
sample (external validity, see below). However, this effect decreases with sample size.

Assessing the External Validity of the Results of a Discriminant Analysis


To assess the external validity, we can randomly split the sample in half (split-half
method). We then use one half of the sample to estimate the discriminant function (esti-
mation sample; also called training sample). Afterwards, we apply the discriminant func-
tion to the other half (hold-out sample) and compute the out-of-sample hit rate. We can
repeat this analysis by changing the role of the two samples. Splitting the sample in half
results in a loss of information when estimating the discriminant function. Thus, we need
to make sure that the estimation sample is sufficiently large (cf. Sect. 4.4). Moreover, the
distribution of the variables in the estimation and hold-out samples should be the same as
in the complete sample. Because of our very small sample, we do not assess the external
validity with the help of the split-half method.
For small samples, we can use the leave-one-out method. According to this method
one observation is removed from the sample and the discriminant function is estimated
based on the remaining N–1 observations. Then, we classify the removed observation
and assess whether the estimated and actual group membership classifications corre-
spond. We repeat the procedure N times and finally derive a classification table that is
used to compute the hit rate. When we compare the hit rate based on the total sample to
the hit rate of the leave-one-out method, we get a sense of the sensitivity of the estimated
coefficients to a loss of information. Ideally, the hit rates are very similar. If this is not
the case, the results are sensitive to the considered observations and we should carefully
investigate the reasons of this sensitivity (e.g., by checking for outliers). In the example,
the hit rate for the leave-one-out method is equal to 79.2% and thus much lower than the
one for the complete sample. A reason can be the rather small sample size. However, the
value is still far above the reference value of 50%.
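Outside of SPSS, the leave-one-out hit rate can be obtained, for example, with scikit-learn. The Python sketch below is purely illustrative: X and y are random stand-ins for the two describing variables and the group labels of the example; with the real data, the in-sample and leave-one-out hit rates could be compared as described above.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(24, 2))      # stand-in for 'price' and 'delicious'
y = np.repeat([1, 2], 12)         # two groups with 12 buyers each

lda = LinearDiscriminantAnalysis()
in_sample_hit_rate = lda.fit(X, y).score(X, y)                      # hit rate on the estimation sample
loo_hit_rate = cross_val_score(lda, X, y, cv=LeaveOneOut()).mean()  # leave-one-out hit rate

print(in_sample_hit_rate, loo_hit_rate)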

Table 4.8  Assessment of the discriminatory power of the describing variables


Variable Discriminant criterion F-value p-value Wilks’ lambda
Price 0.466 10.241 0.004 0.682
Delicious 0.031 0.673 0.421 0.970

4.2.4 Testing the Describing Variables

1 Definition of groups and specification of the discriminant function

2 Estimation of the discriminant function

3 Assessment of the discriminant function

4 Testing the describing variables

5 Classification of new observations

6 Checking the assumptions of discriminant analysis

The value of a particular discriminant coefficient depends on the other describing var-
iables considered in the discriminant function. Moreover, the signs of the coefficients
are arbitrary and do not provide insights into the discriminatory power of a particular
coefficient.
Yet, what is of primary interest is the discriminatory power of the describing vari-
ables. Describing variables with high discriminatory power differ significantly across
groups. In our example, we observe mean values of 3.5 and 5.0 for the describing var-
iable ‘price’ for the groups of buyers of the focal (g = 1) and the main competitor brand
(g = 2), respectively. For the describing variable ‘delicious’, the means are 4.5 for group
g = 1 and 4.0 for group g = 2. Thus, we expect that in our example ‘price’ has more dis-
criminatory power than ‘delicious’.

Discriminant Criterion for Single Variables and F-test


We can use the value of the discriminant criterion when we just consider one describ-
ing variable to assess the discriminatory power of a single variable (Table 4.8; see also
Table 4.5, lines 1 and 2). Based on the discriminant criterion, we can conduct an F-test:

Femp = [SSb / (G − 1)] / [SSw / (N − G)] = Γ · (N − G) / (G − 1)        (4.12)

The result of the F-test corresponds to the results of a univariate ANOVA and assesses
whether the groups differ with respect to the describing variable (cf. Sect. 3.2.1.3). In the
two-group case, the results of the F-test are equal to the results of an independent sam-
ples t-test.

Table 4.8 shows the result of the F-test for the example. The theoretical F-value for
df1 = (G–1) = 1 and df2 = (N–G) = 22 degrees of freedom is 4.30. For the describing var-
iable ‘price’ the empirical F-value is larger than the theoretical one, and, thus, ‘price’ has
significant discriminatory power (p = 0.004). In contrast, the describing variable ‘deli-
cious’ is not significant (p = 0.421) and does not separate the groups significantly.
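Numerically, the F-test of Eq. (4.12) is straightforward to reproduce. The Python sketch below (with SciPy; not the book's own code) uses the single-variable discriminant criteria from Table 4.8; small deviations from the reported F-values are due to the criteria being rounded to three decimals.

from scipy.stats import f

def f_test(gamma, N, G):
    """F-value according to Eq. (4.12) and the corresponding p-value."""
    F_emp = gamma * (N - G) / (G - 1)
    p = f.sf(F_emp, G - 1, N - G)
    return F_emp, p

print(f_test(0.466, N=24, G=2))   # 'price':     F ≈ 10.25, p ≈ 0.004
print(f_test(0.031, N=24, G=2))   # 'delicious': F ≈ 0.68,  p ≈ 0.42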
Although ‘delicious’ alone has no significant discriminatory power, in combination
with ‘price’ it contributes to an increase in the discriminant criterion (cf. Table 4.5). The
discriminant criterion of the discriminant function when considering both variables is
0.912 compared to 0.466 if only ‘price’ is taken into account.

Wilks’ lambda for Single Variables


Table 4.8 also shows the values for Wilks’ lambda for each describing variable. Since
Wilks’ lambda is an inverse quality measure, the smaller value for ‘price’ (Λ = 0.682)
indicates a higher discriminatory power of this variable than the value for ‘delicious’
(Λ = 0.970).

Standardized Discriminant Coefficient


Since scaling effects might influence the value of the discriminant coefficients, we
consider standardized coefficients to assess the relative importance of the describing
variables in discriminating between the groups.6 The coefficients are standardized by
multiplying them with the pooled standard deviation of the respective describing variable
(cf. Sect. 4.2.2.2). The standardized discriminant coefficient is:

bjstd. = bj · sj (4.13)

with
bj discriminant coefficient of describing variable j
sj standard deviation of describing variable j
  
sj = √(SSw / (N − G))

For our example, we get:

s1 = √(29 / (24 − 2)) = 1.148 and s2 = √(49 / (24 − 2)) = 1.492.

6 For example, if you had a describing variable ‘price’ and changed its unit of measurement from
EUR to Cent, the corresponding discriminant coefficient would decrease by a factor of 100. Yet the
transformation of the scale has no influence on the discriminatory power of the variable.

This yields the following standardized discriminant coefficients:

b1std. = b1 · s1 = 1.031 · 1.148 = 1.184


b2std. = b2 · s2 = −0.565 · 1.492 = −0.843

The sign of the standardized coefficients is not relevant when assessing the relative dis-
criminatory power, and we thus consider the absolute values of the standardized coef-
ficients. After standardization, the variable ‘price’ still has a larger absolute coefficient
than ‘delicious’ but the difference is less evident. This means that the describing variable
‘delicious’ does have some discriminatory power, although it is not significant.
It is important to note that the standardized coefficients cannot be used to compute the
discriminant values.
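As a quick numerical check, the standardization of Eq. (4.13) can be scripted as follows (illustrative Python sketch; the within-group sums of squares 29 and 49 are the values used above).

import numpy as np

b = np.array([1.031, -0.565])   # unstandardized coefficients for 'price' and 'delicious'
SSw = np.array([29.0, 49.0])    # within-group sums of squares of the two variables
N, G = 24, 2

s = np.sqrt(SSw / (N - G))      # pooled standard deviations: 1.148, 1.492
b_std = b * s                   # standardized coefficients:  1.184, -0.843
print(np.round(b_std, 3))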

Mean Discriminant Coefficient


In the case of multi-group discriminant functions, several discriminant coefficients are
estimated for each describing variable. In order to assess the discriminatory power of a
describing variable with respect to all discriminant functions, the absolute values of the
standardized coefficients weighted by the share of explained variance of a discriminant
function k can be computed. We obtain the mean discriminant coefficient:

b̄j = Σ(k=1..K) |bjkstd| · [explained variancek / Σ(k=1..K) explained variancek]        (4.14)

with

bjkstd: standardized discriminant coefficient for describing variable j in discriminant function k
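A small helper function illustrates Eq. (4.14). This is an illustrative Python sketch, not the book's own code; the usage line anticipates the variable 'fruity' from the case study in Sect. 4.3 (standardized coefficients 0.461 and 0.336, explained variance shares 89.8% and 10.2%).

import numpy as np

def mean_discriminant_coefficient(b_std_jk, explained_variance):
    """Weighted mean of the absolute standardized coefficients of variable j
    across the K discriminant functions (Eq. 4.14)."""
    weights = np.asarray(explained_variance) / np.sum(explained_variance)
    return np.sum(np.abs(b_std_jk) * weights)

print(mean_discriminant_coefficient([0.461, 0.336], [0.898, 0.102]))   # ≈ 0.448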

4.2.5 Classification of New Observations

1 Definition of groups and specification of the discriminant function

2 Estimation of the discriminant function

3 Assessment of the discriminant function

4 Testing describing variables

5 Classification of new observations

6 Checking the assumptions of discriminant analysis

Based on the estimated discriminant function, we can assign new observations to one
of the considered groups. We distinguish between three concepts to classify new

Table 4.9  Comparison of classification concepts


                                                              Distance Concept   Classification Functions Concept   Probability Concept
Explicit consideration of a priori probabilities              no                 yes                                yes
Costs of misclassification                                    no                 no                                 yes
Unequal variance-covariance matrices                          yes                no                                 yes
Possibility to consider only relevant discriminant functions  yes                no                                 yes

observations: the distance concept, the classification functions concept, and the probabil-
ity concept. Before discussing the different concepts in detail, we will outline conceptual
similarities and differences between them (see Table 4.9).

Comparison of the Different Classification Concepts


First, the concepts differ with respect to the explicit consideration of a priori probabili-
ties. The a priori probability expresses the probability that an observation i belongs to a
specific group g. The concepts either implicitly assume equal a priori probabilities (i.e.,
equal group sizes) or allow us to take different a priori probabilities into account which
may be based on group sizes or a priori knowledge.
Second, it often happens that the costs of misclassification differ across groups.
In medical diagnostics, for example, the costs of not detecting a malignant disease
are certainly higher than the costs of an erroneous diagnosis of a malignant disease.
Considering the costs of misclassification allows for making decisions based on the
expected value theory.
Third, discriminant analysis is based on the assumption of equal variance-covari-
ance matrices across the groups (cf. Sect. 4.2.6). The classification concepts differ as to
whether it is possible to consider unequal variance-covariance matrices for classification
purposes.
Finally, the classification concepts can be distinguished with respect to the possibility
to take only relevant discriminant functions into account when more than one discrimi-
nant function can be estimated.

4.2.5.1 Distance Concept
Based on the discriminant function, we can compute the discriminant value for a new
observation by using the observed values of the describing variables for this observation
and the estimated unstandardized discriminant coefficients.
Let us assume that we observe a consumer with the values 5.0 and 5.0 for the describ-
ing variables ‘price’ and ‘delicious’, respectively. For this consumer, we get a discrimi-
nant value of 0.350 (= –1.982 + 5 · 1.031 + 5 · (–0.565)).

According to the distance concept, an observation i is assigned to group g to which it


is closest, i.e., it has the smallest distance to the group centroid:

dig² = (Yi − Ȳg)²        (4.15)

In our example, the squared distances to the group centroids are 1.598 for g = 1 and
0.319 for g = 2. Therefore, the consumer is assigned to group g = 2.
The distance concept just uses the distance of the discriminant value to the group cen-
troid to classify observations. A priori probabilities are not taken into account. Moreover,
the costs of misclassification are not considered.
If more than one discriminant function is considered, the squared Euclidean distance
and the Mahalanobis distance can be used to classify the observations. Originally, the
distance measures assume equal variance-covariance matrices in the groups but the
measures can be extended to capture unequal group variance-covariance matrices (cf.
Tatsuoka, 1988, p. 350). If more than one discriminant function is estimated, it is, how-
ever, not necessary to consider all possible discriminant functions. In this case, we can
focus on those discriminant functions that are significant without losing critical informa-
tion, which simplifies the computation of the distances.
Overall, the advantage of the distance concept is that it is easy to implement and intu-
itively appealing. The disadvantage is that it is a deterministic approach: observations are
assigned to one of the groups with a probability of 100%.
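For the two-group example, the distance concept can be written down in a few lines. The Python sketch below is illustrative; the group centroids ±0.914 follow from the discriminant values in Table 4.6, and small deviations from the reported distances are due to rounding.

b0, b1, b2 = -1.982, 1.031, -0.565        # estimated discriminant coefficients
centroids = {1: -0.914, 2: 0.914}         # group centroids on the discriminant axis

x_price, x_delicious = 5.0, 5.0
Y_i = b0 + b1 * x_price + b2 * x_delicious                  # ≈ 0.350

d_sq = {g: (Y_i - c) ** 2 for g, c in centroids.items()}    # ≈ {1: 1.60, 2: 0.32}
assigned_group = min(d_sq, key=d_sq.get)                    # group 2
print(Y_i, d_sq, assigned_group)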

4.2.5.2 Classification Functions Concept


The classification functions concept is a classification method in which a linear function is
defined for each group. For each observation, a score on each group’s classification func-
tion is computed and the observation is assigned to the group with the highest score. The
classification functions are also known as Fisher’s linear discriminant functions.
Since the value of the classification function is different from the discriminant value
based on the (unstandardized) discriminant coefficients (cf. Sect. 4.2.2), we will use the
term classification score in the following when we refer to the value of the classification
function.
It is important to note that the classification functions method requires equal with-
in-group variance-covariance matrices (cf. Sect. 4.3). If this requirement is met, we can
define separate classification functions for each group g that have the following form:
F1 = B01 + B11 X1 + B21 X2 + . . . + BJ1 XJ
F2 = B02 + B12 X1 + B22 X2 + . . . + BJ2 XJ
⋮        (4.16)
FG = B0G + B1G X1 + B2G X2 + . . . + BJG XJ
The coefficients Bjg depend on the observed mean values and the variation of the describ-
ing variables but are not a result of an optimization procedure. Generally, the coefficients

are larger for higher mean values and smaller variation of the describing variables within a group.
In our example, the mean values for ‘price’ are 3.5 for g = 1 and 5.0 for g = 2. The
corresponding standard deviations are 1.168 for g = 1 and 1.128 for g = 2. Group g = 2
thus has a higher mean value and a lower standard deviation for ‘price’ compared to
group g = 1. Consequently, the coefficient for ‘price’ in the classification functions is
larger for group g = 2 compared to group g = 1. More specifically, we get:
F1 = −6.597 + 1.729 · X1 + 1.280 · X2
F2 = −10.22 + 3.614 · X1 + 0.247 · X2
The values Fg themselves have no interpretative meaning and are just used for classifi-
cation purposes. For a new observation with the values 5.0 for ‘price’ and 5.0 for ‘deli-
cious’, we get a classification score of 8.444 for group g = 1 and of 9.083 for group
g = 2. Since we observe the highest value of the classification score for group g = 2, we
classify the consumer into group g = 2.
The classification functions concept relies on the assumption of equal variance-covar-
iance matrices within groups. The costs of misclassification are not taken into account.
Yet it is possible to explicitly consider a priori probabilities. If we do so, the results
based on the classification functions concept may differ from the results of the distance
concept; otherwise, both concepts lead to the same results.

Explicitly Considering a Priori Probabilities in Classification Functions


The a priori probabilities can be given or may be estimated a priori, and they can
take into account that the groups differ in size and may not encompass the total pop-
ulation. For example, the market shares in our example might be 15% for the focal
brand and 10% for the main competitor brand. Since the a priori probabilities must
add up to 1 across groups, we get the following a priori probabilities for our example:
P(g = 1) = 60% and P(g = 2) = 40%. Besides objectively available information about the
group sizes, the researcher can also use a priori probabilities based on subjective judge-
ments or knowledge. Moreover, it is possible to consider individual a priori probabilities
Pi(g) if such information is available.
To take a priori probabilities P(g) into account, the classification functions are modi-
fied as follows:
 
Fg = B0g + ln P(g) + B1g X1 + B2g X2 + . . . + BJg XJ (4.17)
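The classification scores of the example, with and without a priori probabilities (Eqs. 4.16 and 4.17), can be reproduced as follows. This is an illustrative Python sketch; the priors 0.6 and 0.4 are the market-share-based probabilities discussed above.

import numpy as np

B = {1: np.array([-6.597, 1.729, 1.280]),    # B0, B_price, B_delicious for group g = 1
     2: np.array([-10.22, 3.614, 0.247])}    # ... for group g = 2
priors = {1: 0.6, 2: 0.4}

x = np.array([1.0, 5.0, 5.0])                # constant term, price, delicious

scores = {g: float(Bg @ x) for g, Bg in B.items()}                               # ≈ 8.44 and 9.08
scores_with_priors = {g: float(Bg @ x + np.log(priors[g])) for g, Bg in B.items()}

print(scores, max(scores, key=scores.get))                                       # group 2
print(scores_with_priors, max(scores_with_priors, key=scores_with_priors.get))   # still group 2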

4.2.5.3 Probability Concept
The probability concept is the most sophisticated approach. It explicitly considers
a priori probabilities and allows for taking the costs of misclassification into account.
Furthermore, we can focus on relevant discriminant functions only if several discri-
minant functions are estimated. Finally, it is also possible to account for unequal vari-
ance-covariance matrices. The sophistication of this concept comes at the price that it is
more difficult to comprehend.

Classification Rule
The probability concept allows considering a priori probabilities—similar to the classi-
fication functions approach. Additionally, it permits taking the costs of misclassification into account. If neither a priori probabilities nor the costs of misclassification are considered, the
probability concept leads to the same classification results as the distance concept.
Generally, the probability concept uses the following classification rule:
Assign observation i to the group for which the probability P(g|Yi) is maximum, where
P(g|Yi) is the probability of the membership of observation i in group g given its discri-
minant value Yi.
In decision theory, the classification probability P(g|Yi) is called a posteriori proba-
bility of group membership. We calculate the a posteriori probability with the help of the
Bayes theorem:

P(g|Yi) = [P(Yi|g) · Pi(g)] / [Σ(g=1..G) P(Yi|g) · Pi(g)]        (4.18)

with

• P(g|Yi): a posteriori (classification) probability of observation i to belong to group g


given the discriminant value Yi
• P(Yi|g): conditional probability of observing discriminant value Yi for observation i
given that i belongs to group g (assuming a normal distribution)
• Pi(g): a priori probability of observation i to belong to g

A Posteriori (Classification) Probability


The a posteriori (classification) probabilities are computed with the help of the distances
between the discriminant values and the group centroids (for more details, see Tatsuoka,
1988):

P(g|Yi) = [exp(−dig²/2) · Pi(g)] / [Σ(g=1..G) exp(−dig²/2) · Pi(g)]        (4.19)

with
dig distance between the discriminant value of observation i and the centroid of group g

For a new observation with the values 5.0 for ‘price’ and 5.0 for ‘delicious’, we get:
Yi = −1.98 + 1.031 · 5 − 0.565 · 5 = 0.350
The squared distances to the group centroids are di12 = 1.598 and di22 = 0.319, respec-
tively (cf. Sect. 4.2.6). Transforming the distances results in the following densities:

f(Yi|g) = exp(−dig²/2)

f(Yi|g = 1) = 0.450
f(Yi|g = 2) = 0.853
If we assume equal a priori probabilities Pi (g = 1) = 0.5 and Pi (g = 2) = 0.5, we obtain
the following classification (a posteriori) probabilities:
P(g|Yi) = f(Yi|g) · Pi(g) / [f(Yi|g = 1) · Pi(g = 1) + f(Yi|g = 2) · Pi(g = 2)]

P(g = 1|Yi) = (0.450 · 0.5) / (0.450 · 0.5 + 0.853 · 0.5) = 0.345
P(g = 2|Yi) = (0.853 · 0.5) / (0.450 · 0.5 + 0.853 · 0.5) = 0.655
The observation is assigned to group g = 2 because of the higher a posteriori probability
for this group.
The classification probabilities P(g|Yi) always add up to 1. Thus, each observation
belongs to one of the predefined groups. Consequently, the classification probabilities do
not allow a statement on how likely it is that an observation belongs to one of the groups
at all.
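The a posteriori probabilities of Eq. (4.19) for the example observation can be verified with a short script (illustrative Python sketch; equal a priori probabilities are assumed, as in the text).

import numpy as np

d_sq = np.array([1.598, 0.319])     # squared distances to the centroids of g = 1 and g = 2
priors = np.array([0.5, 0.5])       # equal a priori probabilities

densities = np.exp(-d_sq / 2)                                   # ≈ 0.450, 0.853
posterior = densities * priors / np.sum(densities * priors)     # ≈ 0.345, 0.655
print(np.round(posterior, 3))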

Conditional Probability
The conditional probability represents the likelihood of observing discriminant value Y
for observation i given that i belongs to group g. In contrast to the a priori and a posteri-
ori probabilities, conditional probabilities do not have to add up to 1. Conditional prob-
abilities may be arbitrarily small with respect to all groups. Therefore, the conditional
probability is used for assessing how likely it is that an observation belongs to a group at
all.
The conditional probability can be determined based on the standard normal distribu-
tion (cf. Tatsuoka, 1988). For the observation with a value of 5.0 for ‘price’ and 5.0 for
‘delicious’, the discriminant value is 0.350 and it is closest to group g = 2:
|di2 | = |0.350 − 0.914| = 0.564
The resulting conditional probability is:7
P(Yi |g = 2) = 0.572

7 Visit
www.multivariate-methods.info for more information on how to compute the conditional
probability with Excel.

Fig. 4.8 Representation of the conditional probability (red area) under the density function of the standard normal distribution

About 60% of all observations in group g = 2 are further away from the centroid than
observation i. Observation i is therefore a rather good representative of group g = 2.
In contrast, for an observation i with the values 6.0 for ‘price’ and 1.0 for ‘delicious’,
we get a discriminant value of 3.639 and the following classification (a posteriori)
probabilities:
P(g = 1|Yi ) = 0.001
P(g = 2|Yi ) = 0.999
Observation i would, therefore, also be classified into group g = 2 with a very high prob-
ability. However, the distance to the group centroid is relatively large with di2 = 2.725.
This results in a conditional probability of:
P(Yi |g = 2) = 0.006
The probability that an observation within group g = 2 has a distance larger than obser-
vation i is extremely low—more specifically, about 0.6%. The conditional probability
for group g = 1 would, however, even be lower. Thus, it appears rather unlikely that the
observation actually belongs to any of the two groups.
Figure 4.8 shows the relationship between the distance dig and the conditional proba-
bility. The greater the distance dig, the less likely it is that an observation within group g
will be observed at an equal or larger distance, and the less likely becomes the hypothe-
sis that observation i belongs to group g. The conditional probability P(Yi|g) corresponds
to the probability or significance level of the hypothesis.
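Footnote 7 points to an Excel sheet for computing the conditional probability; a Python counterpart is sketched below. Reading P(Yi|g) as the two-sided tail probability of the standard normal distribution at |dig| reproduces the values 0.572 and 0.006 reported above; treat this two-sided reading as our assumption rather than as the book's formal definition.

from scipy.stats import norm

def conditional_probability(d):
    """Probability that an observation of group g lies further from the centroid than |d|."""
    return 2 * (1 - norm.cdf(abs(d)))

print(conditional_probability(0.564))   # ≈ 0.572: a rather typical member of group g = 2
print(conditional_probability(2.725))   # ≈ 0.006: very unlikely to belong to group g = 2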

Table 4.10  Costs of a misclassification (loan example)


Classification into group…        True group membership
                                  Loan repayment (g = 1)   Loan default (g = 2)
1: Grant loan                     –100 EUR                 1000 EUR
2: Reject loan application        100 EUR                  0 EUR

Cost of Misclassification
The Bayesian decision rule, which is based on the expected value theory, can be
extended by considering various costs of misclassification, no matter whether the
expected value of a cost or loss criterion is minimized or whether the expected value of a
profit or gain criterion is maximized.
If we consider the cost of misclassification, we assign an observation i to the group
for which the expected value of the costs is minimal:
Eh(Cost) = Σ(g=1..G) Costgh · P(g|Yi)        (4.20)

with

Costgh cost of misclassification if observation i actually belongs to group g but is


assigned to group h
The following example illustrates the application of the Bayesian rule: Customer i of a
bank would like to take out a loan of 1000 EUR for one year at an interest rate of 10%.
The challenge for the bank is to weigh the potential interest gain against the risk of loan
default. The following a posteriori probabilities were determined for customer i:
P(g = 1|Yi) = 0.8 (loan repayment)
P(g = 2|Yi) = 0.2 (loan default)
If classification into group g = 1 is linked to granting the loan and classification into
group g = 2 is linked to rejecting the loan application, the costs of misclassification are
calculated according to Table 4.10.
If the bank grants the loan, it obtains a profit of 100 EUR if the customer repays the
loan, while it incurs a loss of 1000 EUR if the customer is not able to pay back the loan.
If the bank does not grant the loan, it may incur opportunity costs (due to a profit loss) of
100 EUR.
The expected values of the costs for the two alternatives are as follows:
Grant: E1(Cost) = –100 · 0.8 + 1000 · 0.2 = 120
Reject: E2(Cost) = 100 · 0.8 + 0 · 0.2 = 80

Since rejecting the loan request results in overall lower costs, the expected profit is
higher when rejecting the loan. Thus, the bank should decide not to grant the loan to the
customer.
Compared to the distance concept and the classification functions approach, the prob-
ability concept is more elaborate in the sense that it is possible to consider costs of mis-
classification. If unequal costs of misclassification exist, it is recommended to use the
probability concept for classifying new observations.
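The expected-cost comparison of the loan example (Eq. 4.20 with Table 4.10) can be reproduced as follows (illustrative Python sketch).

import numpy as np

posterior = np.array([0.8, 0.2])        # P(repayment | Yi), P(default | Yi)
cost = np.array([[-100.0, 1000.0],      # decision 'grant loan':  cost if repayment / default
                 [ 100.0,    0.0]])     # decision 'reject loan': cost if repayment / default

expected_cost = cost @ posterior                            # [120., 80.]
decision = ["grant loan", "reject loan"][int(np.argmin(expected_cost))]
print(expected_cost, decision)                              # reject loan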

4.2.6 Checking the Assumptions of Discriminant Analysis

1 Definition of groups and specification of the discriminant function

2 Estimation of the discriminant function

3 Assessment of the discriminant function

4 Testing the describing variables

5 Classification of new observations

6 Checking the assumptions of discriminant analysis

We have already mentioned some assumptions of discriminant analysis in the pre-


vious sections. At this point, we provide an overview and a detailed discussion of the
assumptions:

• multivariate normality of independent variables,


• equal within-group variance-covariance matrices,
• no multicollinearity.

Multivariate Normality of Independent Variables


Discriminant analysis assumes metrically scaled independent variables that are normally
distributed. If the assumption of univariate normality of individual variables is supported,
usually the assumption of multivariate normality is also supported. To test for univariate
normality, we can, for example, use the Kolmogorov-Smirnov test. Testing for multivar-
iate normality can be done with the help of Q-Q or P-P plots. If the data do not support
the multivariate normality assumption, the test significance might be biased. A potential
remedy is to transform the variables to reduce the discrepancies (e.g., ln-transformation).
If a transformation turns out to be ineffective, we may consider alternative methods such
as logistic regression (cf. Chap. 5), which does not rely on the assumption of normal-
ity. However, discriminant analysis appears to be rather robust against violations of this
assumption.

Table 4.11  Chocolate flavors and perceived attributes examined in the case study
Chocolate flavors Perceived attributes
1 Milk 1 Price
2 Espresso 2 Refreshing
3 Biscuit 3 Delicious
4 Orange 4 Healthy
5 Strawberry 5 Bitter
6 Mango 6 Light
7 Cappuccino 7 Crunchy
8 Mousse 8 Exotic
9 Caramel 9 Sweet
10 Nougat 10 Fruity
11 Nut

Equal Within-group Variance-covariance Matrices


Discriminant analysis assumes that the variance-covariance matrices across groups are
equal. The Box’s M test is the most common test to assess whether this assumption is
fulfilled. It tests the null hypothesis that the within-group variance-covariance matrices
are equal. Yet, Box’s M test is sensitive to deviations from multivariate normality. A vio-
lation of the assumption of equal variance-covariance matrices may affect the test for
significance as well as the classification. If the assumption is violated, we can again con-
sider a transformation of the variables. Alternatively, we may consider using a quadratic
discriminant analysis (QDA; cf. Hastie et al., 2009, p. 110). However, discriminant anal-
ysis appears to be rather robust against violations of this assumption, too.

Multicollinearity
Multicollinearity among the describing variables (i.e., correlation between the describ-
ing variables; cf. Sect. 2.2.5.7) may influence the final specification of the discriminant
function if stepwise procedures are used. If variables are highly correlated, one variable
can be explained by the other variable(s) and thus it adds little to the explanatory power
of the entire set. For this reason it may be excluded or not included in the first place.
Moreover, multicollinearity can influence the test of significance. To assess the degree of
multicollinearity, we can compute the tolerance value or variance inflation factor (VIF)
for each describing variable (see Sect. 2.2.5.7 for suggested remedies if multicollinearity
is critical).
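Outside of SPSS, two of these checks can be scripted quickly, for example as follows. The Python sketch is purely illustrative: the simulated DataFrame X merely stands in for the real describing variables, and the VIF cut-off of 10 is a common rule of thumb rather than a prescription from the text.

import numpy as np
import pandas as pd
from scipy.stats import kstest, zscore
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(116, 3)), columns=["price", "exotic", "fruity"])

# Kolmogorov-Smirnov test of each (standardized) variable against the standard normal
for col in X.columns:
    stat, p = kstest(zscore(X[col]), "norm")
    print(f"{col}: KS p-value = {p:.3f}")

# Variance inflation factors (values far above 10 signal critical multicollinearity)
Xc = np.column_stack([np.ones(len(X)), X.values])    # add a constant term
for j, col in enumerate(X.columns, start=1):
    print(f"{col}: VIF = {variance_inflation_factor(Xc, j):.2f}")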

Table 4.12  Definition of groups for discriminant analysis and number of observations


Group Chocolate flavors
g = 1|Seg_1 Classic (n = 65) Milk, Biscuit, Mousse, Caramel, Nougat, Nut
g = 2|Seg_2 Fruit (n = 28) Orange, Strawberry, Mango
g = 3|Seg_3 Coffee (n = 23) Espresso, Cappuccino

4.3 Case Study

4.3.1 Problem Definition

We now use a larger sample to demonstrate how to conduct a multi-group discriminant


analysis with the help of SPSS.8
A manager of a chocolate company wants to know how consumers evaluate different
chocolate flavors with respect to 10 subjectively perceived attributes. For this purpose,
the manager has identified 11 flavors and has selected 10 attributes that appear to be rele-
vant for the evaluation of these flavors.
First, a small pretest with 18 test persons is carried out. The persons are asked to eval-
uate the 11 flavors (chocolate types) with respect to the 10 attributes (see Table 4.11). A
seven-point rating scale (1 = low, 7 = high) is used for each attribute. Thus, the independ-
ent variables are perceived attributes of the chocolate types.
However, not all persons evaluated all 11 flavors. Thus, the data set contains only 127
evaluations instead of the complete number of 198 evaluations (18 persons × 11 flavors).
Any evaluation comprises the scale values of the 10 attributes for a certain flavor of one
respondent. It reflects the subjective assessment of a specific chocolate flavor by a par-
ticular test person. Since each test person assessed more than just one flavor, the obser-
vations are not independent. Yet, in the following, we will treat the observations as if they were independent.
Of the 127 evaluations, only 116 are complete, while 11 evaluations contain miss-
ing values.9 We exclude all incomplete evaluations from the analysis. Consequently, the
number of cases is reduced to 116.
To investigate differences between the various flavors, 11 groups could be consid-
ered, with each group representing one chocolate flavor. For the sake of simplicity, the
11 products were grouped into three market segments. This was done by applying a
cluster analysis (see the results of the cluster analysis in Chap. 8). Table 4.12 shows the

8 We use the same data set as for logistic regression (cf. Sect. 5.4) in order to better illustrate simi-
larities and differences between the two methods.
9 Missing values are a frequent and unfortunately unavoidable problem when conducting surveys

(e.g. because people cannot or do not want to answer some question(s), or as a result of mistakes
by the interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.

Fig. 4.9 How to start a ‘Discriminant Analysis’ in SPSS

Fig. 4.10 Dialog box: Discriminant Analysis



Fig. 4.11 Dialog box: Statistics

composition of the three segments and their sizes. The size of the segment ‘Classic’ is
more than twice the size of the segments ‘Fruit’ and ‘Coffee’. The manager now wants to
study which product attributes discriminate between the groups.

4.3.2 Conducting a Discriminant Analysis with SPSS

To conduct a discriminant analysis with SPSS, we go to ‘Analyze/Classify’ and choose


‘Discriminant’… (Fig. 4.9). A dialog box opens, and we first select the grouping variable
and describing variables (‘Independents’; Fig. 4.10).
In the present case study, we consider three groups. The variable ‘segment’ contains
the group membership of each chocolate flavor. We use the option ‘Define Range’ to
select all three groups for the subsequent analyses (here: 1 to 3; cf. Fig. 4.10).
For estimating the discriminant function(s), you can choose between ‘Enter independ-
ents together’ (i.e., blockwise method) and ‘Use stepwise method’ (i.e., stepwise estima-
tion procedure). If you choose the latter, the describing variables are added successively
to the model (cf. Sect. 4.2.2.3). For now, we use the blockwise method, which is also the
default option. Consequently, the dialog box ‘Method’ is not activated.

Dialog box ‘Statistics’


We go to ‘Statistics’ and select the descriptive statistics that are going to be displayed
in the output, the coefficient matrices that are reported, and the coefficients that are dis-
played (Fig. 4.11). We select ‘Means’ to get the mean value for each describing variable

Fig. 4.12 Dialog box: Classification

in each group. This descriptive statistic provides us with a first indication of which
describing variables might be able to discriminate between the groups. Further, we select
‘Univariate ANOVAs’ to examine the discriminatory power of the describing variables
(cf. Sect. 4.2.4). If we choose this option, SPSS performs a one-way ANOVA test for
equality of group means for each describing variable (cf. Chap. 3). Additionally, we
select ‘Box’s M’ to get the results of the test whether the within-group variance-covari-
ance matrices are equal (cf. Sect. 4.2.6).
SPSS displays the standardized discriminant coefficients by default (cf. Sect. 4.2.2.2).
We thus select ‘Unstandardized’ to also retrieve the unstandardized discriminant
coefficients.
To obtain Fisher’s classification functions and the according coefficients, we further
select ‘Fisher’s’. A separate set of classification function coefficients is obtained for each
group.

Dialog box ‘Classification’


With the help of the dialog box ‘Classification’ (Fig. 4.12), it is possible to define a pri-
ori probabilities (‘Prior probabilities’). The default option is ‘All groups equal’, which
assumes equal a priori probabilities for all groups. If you choose ‘Compute from group
sizes’, the observed group sizes determine the a priori probabilities of group member-
ship. In our example, 72 out of the total of 127 observations belong to group g = 1. Thus,
the group size of group g = 1—given the data—corresponds to 56.7%. The groups g = 2
and g = 3 are relatively smaller, with 24.4% and 18.9%, respectively. We therefore use
the option ‘Compute from group sizes’—also because the segment ‘Classic’ contains
more flavors than the others.

Fig. 4.13 Dialog box: Save

Moreover, we retain the default option of pooled within-group variance-covariance


matrices (‘Within-groups’) instead of separate-groups variance-covariance matrices
(‘Separate-groups’) (Fig. 4.12). We thus assume equal within-group variance-covari-
ance matrices. If it later turns out that this assumption is violated, we can use ‘Separate-
groups’ to account for covariance matrices that differ across the groups.
Additionally, you can choose what information about the classification is displayed
in the SPSS output window (Fig. 4.12). Available display options are ‘Casewise results’,
‘Summary table’, and ‘Leave-one-out classification’. The option ‘Casewise results’ cre-
ates an output that encompasses actual group membership, predicted group member-
ship, a posteriori probabilities, and discriminant values. We can also choose to obtain
this information for a subset of the data only. In our example, we want to retrieve this
information only for the first 15 observations. The option ‘Summary table’ creates a table
that contains information about the number of cases correctly and incorrectly classified
into each of the groups based on the discriminant analysis (i.e., classification matrix). We
activate this option to get information about the model’s hit rate. The option ‘Leave-one-
out classification’ allows for assessing the robustness of the model (cf. Sect. 4.2.3.2). We
leave it up to you to activate this option to test the robustness of the results. If you do so,
you will learn that the hit rate decreases, as is to be expected, but only slightly.
With the option ‘Plots’, we can create different maps to evaluate the classification
(Fig. 4.12). If we select ‘Combined-groups’, SPSS creates an all-groups scatterplot of the
first two discriminant function values. If there is only one function, a histogram is dis-
played instead. To create separate-group scatterplots of the first two discriminant func-
tion values, choose ‘Separate-groups’. Finally, we can select the creation of a territorial
map that is a plot of the boundaries used to classify cases into groups. In this case study,
we select ‘Combined-groups’ and ‘Territorial map’ to visually inspect the results of the
discriminant analysis.

You also have the option to substitute missing values with the respective mean value
(i.e., option ‘Replace missing values with mean’). If you activate this option, a missing
observation of a describing variable will be replaced with the mean value of this variable.
In our example, we have six missing values only, and we decide to ‘lose’ these observa-
tions instead of replacing them with the mean value.

Dialog box ‘Save’


The dialog box ‘Save’ allows us to save additional information to the SPSS data file by
adding new variables to it (Fig. 4.13). The available options are ‘Predicted group mem-
bership’, ‘Discriminant scores’, and ‘Probabilities of group membership’. We select all
three options and will discuss the created variables later (cf. Sect. 4.3.3.1).
Additionally, SPSS offers the dialog box ‘Bootstrap’. Generally, bootstrapping is
a method for deriving robust estimates of standard errors and confidence intervals for
coefficients. If the confidence interval contains zero, the estimated coefficients are not
significant. When bootstrapping is used with a discriminant analysis, SPSS performs
bootstrapping for the standardized coefficients of the discriminant functions. We will
refer to the results of the bootstrapping methodology below but will not discuss the
results in detail.

4.3.3 Results

4.3.3.1 Results of the Blockwise Estimation Procedure


SPSS first provides the group statistics that allow assessing whether the sample has a
sufficient size—not only overall but also for each group. In our example, the smallest
group is group g = 3 with 23 observations (Fig. 4.14). Since the minimum requirement
is 20 observations, we continue with the analysis. Moreover, the group statistics provide
information about the mean values of the describing variables for each group.
The mean values provide a first impression of the potential discriminatory power of
the describing variables. We observe quite large differences across (at least two) groups
for the following describing variables: price, crunchy, exotic, and fruity. Actually, it
seems that group g = 2 differs from the other two groups with respect to these variables.
Before presenting the specific results of the discriminant analysis, SPSS provides the
tests of equality of group means for each describing variable (Fig. 4.15). Each test dis-
plays the results of a one-way ANOVA for the describing variable that uses the group-
ing variable as the factor (i.e., F-test). If the significance value is greater than 0.10, the
variable probably does not contribute to the model. According to the results in Fig. 4.15,
the describing variables ‘healthy’, ‘bitter’, and ‘sweet’ are not significant at p = 0.1.
Therefore, we expect that those variables will not significantly contribute to the dis-
crimination between the groups. We further observe that the variables ‘price’, ‘crunchy’,
‘exotic’, and ‘fruity’ are significant at p = 0.000. Thus, these variables may be good

Fig. 4.14 Group statistics

candidates for separating the groups, which is in line with our conclusions based on the
mean values.
Besides the result of the one-way ANOVAs, SPSS provides Wilks’ lambda (Fig.
4.15). Remember that smaller values of Wilks’ lambda indicate that a variable is better at

Fig. 4.15 Univariate discriminatory power of the describing variables

Fig. 4.16 Eigenvalues and canonical correlation

Fig. 4.17 Wilks’ lambda and the χ2-test

discriminating between the groups. In our example, the values for Wilks’ lambda do not
differ much. The describing variable ‘exotic’ has the smallest value for Wilks’ lambda
(0.800), followed by a value of 0.803 for ‘fruity’. The values of Wilks’ lambda for the

Fig. 4.18 Standardized canonical discriminant function coefficients and structure matrix

describing variables ‘healthy’, ‘bitter’, and ‘sweet’ are the largest, supporting the result
of the F-test that these variables are less suitable for discriminating between the groups.

Assessment of the Discriminant Functions


Figure 4.16 shows the maximum values of the discriminant criterion (i.e., eigenvalue)
of the two discriminant functions, as well as the share of explained variance and the
canonical correlation (cf. Sect. 4.2.3.1). The eigenvalue and, consequently, the share
of explained variance for the first discriminant function (eigenvalue = 1.043, % of

Table 4.13  Discriminatory power of the describing variables regarding all discriminant functions

Describing variable   Discriminatory power (overall)
Fruity 0.448
Crunchy 0.435
Price 0.395
Delicious 0.383
Exotic 0.369
Light 0.366
Refreshing 0.330
Sweet 0.222
Healthy 0.195
Bitter 0.173

variance = 89.8%) is much higher than for the second one (eigenvalue = 0.118, % of
variance = 10.2%).
The canonical correlation as a measure for the extent of association between the dis-
criminant value and the groups equals 0.714 for the first discriminant function, whereas
it is only 0.325 for the second one. Overall, we can conclude that discriminant function 1
has a much higher discriminatory power than discriminant function 2.
Figure 4.17 shows the multivariate Wilks’ lambda including the chi-square test.
Multivariate Wilks’ lambda indicates that both discriminant functions combined are
significant (p = 0.000). However, Wilks’ lambda testing for residual discrimination for
function 2 is not significant (p = 0.205). Thus, the second discriminant function does not
contribute significantly to the separation of the groups and we may consider just the first
discriminant function for interpretation.

Discriminatory Power of the Describing Variables


Next, SPSS presents the standardized discriminant coefficients that allow for comparing
the discriminatory power of the describing variables (Fig. 4.18). Coefficients with large
absolute values correspond to variables with greater discriminatory power. In the exam-
ple, the describing variable ‘fruity’ has the greatest discriminatory power for discrimi-
nant function 1, and the describing variable ‘bitter’ has the greatest discriminatory power
for discriminant function 2. Yet, we need to be aware that discriminant function 2 is not
able to separate between the groups (Fig. 4.17). Additionally, when testing whether there
are significant differences across the groups with the help of an ANOVA (Fig. 4.15), we
found no significant result for the variable ‘bitter’ (p = 0.169). Thus, it is not surprising
that the standardized discriminant coefficient for the variable ‘bitter’ is not significant
when we use bootstrapping.
Moreover, Fig. 4.18 displays the so-called ‘structure matrix’. The structure matrix
shows the correlation of each describing variable with the discriminant functions.

Fig. 4.19 Discriminant function coefficients and discriminant values at the group centroids

The asterisks mark each variable’s largest absolute correlation with one of the discrimi-
nant functions. Within each function, these marked describing variables are ordered by
the size of the correlation. Thus, the ordering is different from that in the standardized
coefficients table.
The describing variables ‘fruity’, ‘exotic’, ‘price’, ‘crunchy’, ‘light’, and ‘refresh-
ing’ are most strongly correlated with the first discriminant function, although ‘light’
(correlation = 0.344) and ‘refreshing’ (correlation = 0.243) are rather weakly correlated
with discriminant function 1. Remember that the variables ‘fruity’, ‘exotic’, ‘price’, and
‘crunchy’ have already been identified as good potential discriminating variables when
we considered the group statistics (Fig. 4.14).
The describing variables ‘bitter’, ‘delicious’, ‘sweet’, and ‘healthy’ are most strongly
correlated with the second discriminant function. However, the variables ‘delicious’,
‘sweet’, and ‘healthy’ have rather low correlations with discriminant function 2 com-
pared to ‘bitter’ and are not significant when we use bootstrapping (we do not display the
results here).
In order to assess the discriminatory power of the describing variables with respect to
all discriminant functions, we compute the mean (standardized) discriminant coefficients
according to Eq. (4.14). Table 4.13 shows that the variable ‘fruity’ has the highest value
with 0.448 (= 0.461 · 0.898 + 0.336 · 0.102), while ‘bitter’ has the lowest one (0.173).

Fig. 4.20 Scatterplot of the discriminant values for each observation

Thus, the variable ‘bitter’ has the least and ‘fruity’ has the greatest discriminatory power
overall. This result supports the previous notion that the variable ‘bitter’ is not a good
candidate for separating between the groups and that discriminant function 2 is not able
to separate the groups.
Figure 4.19 shows the estimated unstandardized discriminant coefficients for the two
discriminant functions. The unstandardized discriminant coefficients are used to com-
pute the values for the discriminant variable for each observation (cf. Fig. 4.22). Next to
the coefficients, the (unstandardized) group centroids are displayed. The group centroids
indicate that discriminant function 1 is able to separate group g = 2 from the groups g = 1
and g = 3. In contrast, discriminant function 2 seems to be able to separate group g = 3
from the groups g = 1 and g = 2. Considering that group g = 3 represents the coffee fla-
vors, the relevance of the variable ‘bitter’ is reasonable.
The scatterplot in Fig. 4.20 is based on the two discriminant functions and shows the
discriminant values for each observation. The observations are represented as points in
a two-dimensional space that is spanned by the two discriminant functions. The dis-
criminant values determine the location of each observation. We realize that especially
the groups g = 1 (i.e., classic flavors) and g = 3 (i.e., coffee flavors) have a substantial

Fig. 4.21 Coefficients of the classification functions

overlap and that the group centroids are rather close to each other. Discriminant function
1 seems to separate group g = 2 (i.e., fruit flavors) from the two other groups.

Classification Results
The next part of the SPSS output concerns the classification results (cf. Sect. 4.2.6).
SPSS reports the a priori probabilities (SPSS output not presented here). In our example,
we used the option ‘Compute from group sizes’, and thus the a priori probabilities corre-
spond to the group sizes in the observed data.
SPSS also provides information about the classification functions (Fisher’s linear dis-
criminant functions; Fig. 4.21). The coefficients of the classification functions are used
to classify the observations into one of the three groups (cf. Sect. 4.2.5).
Previously, we selected the option ‘Casewise results’ for the first 15 observations and
Fig. 4.22 presents the corresponding results. More specifically, we get information about:

• Actual group membership of an observation (Actual Group) that is retrieved from the
data set (variable ‘segment’).
• Estimated group membership (Predicted Group), with asterisks indicating an incor-
rect classification.
• Conditional probability P(D > d|G = g) that an observation of group g has a distance
greater than d to the centroid of group g.

Fig. 4.22 Individual classification results



Fig. 4.23 Classification matrix

• A posteriori (classification) probability P(G = g|D = d) that an observation with dis-


tance d belongs to group g. The classification probability shows the confidence level
regarding the classification of an observation into a certain group. For example, the
first observation is assigned to group g = 1 with a probability of 86.2%, which is actu-
ally a rather high a posteriori probability.
• Mahalanobis distance to the group centroid of the predicted group.
• The probabilities for, and distances to, the group with the second-highest classifica-
tion probability. For the first observation, the probability of classification into group
g = 3 is 13.2% (vs. 86.2% for classification into group g = 1).
• The last two columns show the estimated discriminant values for the two discriminant
functions based on the unstandardized coefficients (Discriminant Scores). To obtain
these values you have to multiply the observed data with the unstandardized coeffi-
cients for discriminant functions 1 and 2 considering a constant term (cf. Fig. 4.19).

Moreover, SPSS provides the classification matrix (Fig. 4.23). The classification matrix
shows the frequencies of actual and estimated group membership for each group based
on the classification (a posteriori) probabilities. The hit rates are 95.4% for group g = 1,
71.4% for group g = 2, and 13.0% for group g = 3. Overall, 85 (= 62 + 20 + 3) out of 116
observations are correctly classified, and the overall hit rate is therefore 73.3%. We note
that the observations belonging to group g = 3 are not well predicted (see also Fig. 4.22).
We have already learnt that discriminant function 1 is able to separate group g = 2 from
the two other groups but that there is a substantial overlap between groups g = 1 and
g = 3. Since we estimated the a priori probabilities from the actual group sizes and g = 1
is the largest group, the a priori probability for group g = 1 is much higher than for g = 3.
Therefore, most observations belonging to group g = 3 are assigned to group g = 1. If we
assume that the groups have equal sizes and do another discriminant analysis (results are
not reported), the hit rate equals 69.8% which is actually lower. In this case, however, 16

Fig. 4.24 Territorial map



Fig. 4.25 New variables saved to the SPSS data set

out of the 23 observations belonging to group g = 3 are correctly classified. In contrast,


only 42 of the observations belonging to group g = 1 are correctly classified (compared
to 62 observations in the first case; Fig. 4.23). Both model hit rates (69.8% and 73.3%)
are well above the hit rate of 33.3% if we assume a pure random assignment, though.
We also requested a territorial map to illustrate the classification of the observations.
The territorial map helps to examine the relationships between the groups and the dis-
criminant functions (Fig. 4.24). The first discriminant function, shown on the horizontal
axis, separates group g = 2 (i.e., fruit flavors) from the other two groups. Since ‘fruity’ is
most strongly correlated with discriminant function 1, this suggests that group g = 2 is,
in general, the one perceived as most ‘fruity’—which intuitively makes sense.
The second discriminant function separates groups g = 1 (i.e., classic flavors) and
g = 3 (i.e., coffee flavors). The describing variable ‘bitter’ is the one correlated most
strongly with discriminant function 2. Group g = 1 (i.e., classic flavors) has on average a
lower value for ‘bitter’ than group g = 3 (i.e., coffee flavors). Test persons thus perceive
the classic flavors as less bitter than the coffee flavors. Again, the closeness of the group
centroids—marked by asterisks with red circles—to the territorial lines suggests that the
separation between the two groups is not very strong.

New Variables Saved to the SPSS Data File


We requested to save the information about the estimated group membership (new var-
iable ‘Dis_1′), the values of the discriminant functions (new variables ‘Dis1_1’ and
‘Dis2_1’; cf. Fig. 4.22 ‘Discriminant Scores’), and the a posteriori probabilities (new
variables ‘Dis1_2’, ‘Dis2_2′, ‘Dis3_2′) to the SPSS data set (Fig. 4.25). For example,
the values for the discriminant functions based on the unstandardized coefficients are
–1.12 and –0.622 for observation i = 1. The classification (a posteriori) probabilities are
86.19% (for g = 1), 0.01% (g = 2), and 13.24% (g = 3).

Fig. 4.26 Box’s test of equality of variance-covariance matrices

Fig. 4.27 Classification matrix using separate-group covariances

Testing the Assumption of Equal Within-group Variance-covariance Matrices


Remember that discriminant analysis relies on the assumption of equal within-group var-
iance-covariance matrices (cf. Sect. 4.2.6). We thus requested the Box’s M test to test this
assumption (Fig. 4.11). Although we discuss the result of this test only here, at the end,
SPSS actually presents the result very early in the output window. Figure 4.26 shows the
result.
The log determinants are a measure of the variation within groups. Larger log deter-
minants correspond to more variation within a group. Consequently, large differences in
log determinants indicate groups that have different variance-covariance matrices. In our
example, the log determinant of group g = 2 (4.078) is larger than the one for the groups
g = 1 (3.384) and g = 3 (3.444). This results in a significant Box’s M test (p = 0.002).
That is, the assumption of equal variance-covariance matrices is not fulfilled.
To address the issue of unequal covariance matrices, we request separate variance-co-
variance matrices and assess whether this leads to substantially different results (option
‘Use Covariance Matrix/Separate-groups’ in the dialog box ‘Classify’).

Fig. 4.28 Territorial map using separate-group covariances

It should be noted that the classification functions are always calculated based on the
pooled variance-covariance matrices. Thus, unlike the classification probabilities, they
do not change. In our case study, the classification results change slightly. The hit rate for

Fig. 4.29 Dialog box: Stepwise Method

group g = 1 decreases marginally while it increases for group g = 3 (Fig. 4.27). Overall,
the hit rate is slightly higher, with 75.9% compared to 73.3%.
The territorial map presented in Fig. 4.28, however, looks rather different from the
one in Fig. 4.24. The territorial boundaries are no longer linear.

4.3.3.2 Results of a Stepwise Estimation Procedure


In Sect. 4.3.3.1, we recognized that not all describing variables contribute to the discrim-
ination of the groups. In the following, we present the results of the stepwise estimation
procedure (cf. Fig. 4.10). When we run a stepwise discriminant analysis, we can choose
between different methods regarding the procedure how the describing variables are
entered to and removed from the model (Fig. 4.29). The available alternatives are:

• Wilks’ lambda. The describing variables are entered into the discriminant function
based on how much they lower Wilks’ lambda. At each step, the variable that mini-
mizes the overall Wilks’ lambda is entered.
• Unexplained variance. At each step, the variable that minimizes the sum of the unex-
plained variation between groups is entered.
• Mahalanobis distance. Describing variables are selected based on their potential to
increase the distance between groups.
• Smallest F ratio. The describing variables are selected based on maximizing an F
ratio computed from the Mahalanobis distance between groups.
• Rao's V. This method (also called Lawley–Hotelling trace) measures the differences
between group means. At each step, the variable that maximizes the increase in Rao’s
V is entered.

Fig. 4.30 Describing variables included in the discriminant functions (stepwise procedure)

There is no superior method, and we can use multi-group methods to test the robustness
of the results. In this case study, we use Wilks’ lambda which is also the default option in
SPSS.
Further, you can choose the criteria that decide when to stop the entry or removal of
describing variables. Available alternatives are ‘Use F value’ or ‘Use probability of F’.
For the former alternative, a describing variable is entered into the model if its
F-value is greater than the entry value and is removed if its F-value is lower than the
removal value. The entry value must be greater than the removal value, and both val-
ues must be positive. To enter more variables into the model, lower the entry value. To
remove more variables from the model, increase the removal value.
The latter approach enters a variable into the model if the significance level of its
F-value is lower than the entry value and is removed if the significance level is greater
than the removal value. Again, the entry value must be smaller than the removal value,
and both values must be positive and less than 1. To enter more variables into the model,
increase the entry value. To remove more variables from the model, lower the removal
value.
For both approaches, SPSS provides some default values for entering and remov-
ing describing variables. We use the default criterion related to ‘Use F value’ (Entry:
3.84, Removal: 2.71). Finally, we request to display a summary of the different steps
(‘Summary of steps’).
Figure 4.30 shows the variables that are entered into the discriminant functions. In
step 1, the variable ‘exotic’ is entered. Actually, when using the blockwise estimation
procedure, the variable ‘exotic’ was the one with the smallest value for Wilks’ lambda,
followed by the variable ‘fruity’. Thus, it intuitively makes sense that the variable ‘fruity’
is entered in step 2. After step 4, the procedure stops since no more variables can be
entered or removed based on the set criteria. The discriminant functions then consider
the variables ‘exotic’, ‘fruity’, ‘price’, and ‘refreshing’.

Fig. 4.31 Eigenvalues for the two discriminant functions (stepwise procedure)

Figure 4.31 shows the eigenvalues for the two discriminant functions as well as the
share of explained variance (% of variance) and the canonical correlation. Since the
eigenvalues of different analyses are hard to compare, we focus on the share of explained
variance. Discriminant function 1 accounts for 95.5% of the explained variance, while discriminant function 2 accounts for only the remaining 4.5%. The
canonical correlation of discriminant function 2 is also rather low. Thus, the results sug-
gest that—given the data at hand—one discriminant function is sufficient to explain the
differences between the groups (Fig. 4.31).
Figure 4.32 shows the standardized discriminant coefficients that allow for comparing
the discriminatory power of the describing variables. Here, the variable ‘price’ has the
greatest discriminatory power for discriminant function 1 and the variable ‘exotic’ has
the greatest discriminatory power for discriminant function 2. When you use the step-
wise estimation procedure in SPSS, bootstrapping is not available.
Moreover, the structure matrix is reported (Fig. 4.32). The structure matrix again
shows the correlation of each describing variable with the discriminant functions. The
asterisk marks each variable’s largest absolute correlation with one of the discriminant
functions. The variables with a superscript ‘b’ are actually not included in the final for-
mulation of the discriminant function but SPSS reports the values anyway. The describing
variables ‘fruity’ and ‘price level’ are most strongly correlated with the first discriminant
function, and the variable ‘exotic’ has the highest correlation with discriminant function 2.
The group centroids suggest that while discriminant function 1 separates group g = 2
from the groups g = 1 and g = 3 (Fig. 4.33), discriminant function 2 is not really able to
separate among the groups. This result is to be expected since discriminant function 2 is
again not significant (SPSS result not reported here).
Finally, we take a look at the classification results (cf. Fig. 4.34). The hit rate for the
model with just four variables is 70.0% and, thus, slightly lower than the hit rate of the
model considering ten variables, which is 73.3%.
Generally, the stepwise estimation procedure should be applied with caution. We rec-
ommend using the blockwise estimation procedure unless you have many describing
variables.

Fig. 4.32 Standardized canonical discriminant function coefficients and structure matrix (stepwise esti-
mation)

4.3.4 SPSS Commands

Above, we demonstrated how to use the graphical user interface (GUI) of SPSS to con-
duct a discriminant analysis. Alternatively, we can also use the SPSS syntax which is
a programming language unique to SPSS. Each option we activate in SPSS’s GUI is

Fig. 4.33 Group centroids (stepwise estimation)

Fig. 4.34 Classification matrix (stepwise estimation)

translated into SPSS syntax. If you click on ‘Paste’ in the main dialog box shown in Fig.
4.10, a new window opens with the corresponding SPSS syntax. However, you can also
use the SPSS syntax directly and write the commands yourself. Using the SPSS syntax
can be advantageous if you want to repeat an analysis multiple times (e.g., testing dif-
ferent model specifications). Figures 4.35 and 4.36 show the SPSS syntax for running the
analyses discussed above.
For readers interested in using R (https://www.r-project.org) for data analysis, we pro-
vide the corresponding R-commands on our website (www.multivariate-methods.info).
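As a further illustration, not part of the original materials, the following minimal Python sketch shows how a comparable blockwise discriminant analysis could be run with scikit-learn; the file name 'chocolate.csv' and the pandas/scikit-learn workflow are our assumptions.

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical file holding the case-study data with the variables used above
df = pd.read_csv("chocolate.csv")
predictors = ["price", "refreshing", "delicious", "healthy", "bitter",
              "light", "crunchy", "exotic", "sweet", "fruity"]
X, y = df[predictors], df["segment"]

lda = LinearDiscriminantAnalysis()        # blockwise: all describing variables at once
lda.fit(X, y)
print(lda.explained_variance_ratio_)      # share of variance per discriminant function
print(lda.score(X, y))                    # hit rate on the analysis sample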

4.4 Recommendations

We close this chapter with some prerequisites and recommendations for conducting a
discriminant analysis.

* MVA: Case Study Chocolate Discriminant Analysis.


* Defining Data.
DATA LIST FREE / price refreshing delicious healthy bitter light crunchy
exotic sweet fruity segment.
VALUE LABELS
/segment 1 'classic' 2 'fruity' 3 'coffee'.

BEGIN DATA
3 3 5 4 1 2 3 1 3 4 1
6 6 5 2 2 5 2 1 6 7 1
2 3 3 3 2 3 5 1 3 2 1
---------------------
5 4 4 1 4 4 1 1 1 4 1
* Enter all data.
END DATA.

* Case Study Discriminant Analysis : Method "Enter independent together".


DISCRIMINANT
/GROUPS=segment(1 3)
/VARIABLES=price refreshing delicious healthy bitter light crunchy
exotic sweet fruity
/ANALYSIS ALL
/SAVE=CLASS SCORES PROBS
/PRIORS SIZE
/STATISTICS=MEAN STDDEV UNIVF BOXM COEFF RAW TABLE
/PLOT=COMBINED MAP
/PLOT=CASES(15)
/CLASSIFY=NONMISSING POOLED.

Fig. 4.35 SPSS syntax for blockwise estimation with pooled covariance matrices

* MVA: Case Study Chocolate Discriminant Analysis: Method "Stepwise".


DISCRIMINANT
/GROUPS=segment(1 3)
/VARIABLES=price refreshing delicious healthy bitter light crunchy
exotic sweet fruity
/ANALYSIS ALL
/SAVE=CLASS SCORES PROBS
/METHOD=WILKS
/FIN=3.84
/FOUT=2.71
/PRIORS SIZE
/HISTORY
/STATISTICS=MEAN STDDEV UNIVF BOXM COEFF RAW TABLE
/PLOT=COMBINED MAP
/PLOT=CASES(15)
/CLASSIFY=NONMISSING POOLED.

Fig. 4.36 SPSS syntax for stepwise estimation with pooled covariance matrices

Collection of Data and Specification of the Discriminant Function


• The dependent variable must be categorical, representing groups of observations
that are expected to differ on the basis of the independent variables. All observations
need to belong to one group only. This requires that group membership is mutually
exclusive.
• The describing (independent) variables must discriminate between at least two groups
to be of any use in discriminant analysis. The number of describing variables should
be greater than the number of groups.
• You should have a minimum of 20 observations per describing variable, and each
group should encompass at least 20 observations.

Estimation of the Discriminant Function and Classification


• A sample should be large enough to test for external validity (split-half analysis) and
each half should meet the above-mentioned requirements.
• Check the equality of within-group variance-covariance matrices with Box’s M test.
If necessary, use separate-group variance-covariance matrices instead of pooled
matrices.
• In multi-group discriminant analysis, not all possible but only significant discriminant
functions should be considered.
• In the case of unequal costs of misclassification, use the probability concept.

Alternative Methods
As an alternative to discriminant analysis, we can use logistic regression to discriminate
between groups and classify observations based on their characteristics/attributes if just
two groups are observed. The main difference between logistic regression and discrimi-
nant analysis is that logistic regression provides probabilities for the occurrence of alter-
native events or the classification into separate groups. In contrast, discriminant analysis
provides discriminant values from which probabilities can be derived in a separate step.
One advantage of logistic regression is that it is based on fewer assumptions about the
data than discriminant analysis. For example, discriminant analysis assumes normally
distributed describing variables with equal within-group variance-covariance matrices.
Logistic regression just assumes a multinomial distribution of the grouping variable.
Logistic regression is, therefore, more flexible and less sensitive than discriminant analy-
sis. If, however, the assumptions of discriminant analysis are fulfilled, discriminant anal-
ysis uses more information derived from the data and delivers more efficient estimates
(i.e., with smaller variance) than logistic regression (cf. Hastie et al., 2009, p. 128). This
is particularly advantageous for small sample sizes (N < 50). Empirical evidence suggests
that with large sample sizes both methods produce similar results, even if the assump-
tions of discriminant analysis are not fulfilled (cf. Michie et al., 1994, p. 214; Hastie
et al., 2009, p. 128; Lim et al., 2000, p. 216.).
If the assumption of equal within-group variance-covariance matrices is violated,
you may consider using a quadratic discriminant analysis (QDA). We do not cover this

variation of linear discriminant analysis in this book and refer the interested reader to
Hastie et al. (2009).
If the main objective of the analysis is to classify new observations, machine learn-
ing methods such as decision trees and neural networks are alternative methods. In a
large study by Lim et al. (2000), 33 algorithms for classification were tested on 32 data
sets. Discriminant analysis and logistic regression were among the five best methods.
Consequently, if you aim to classify observations, it is advisable to start your analysis with a discriminant analysis (or logistic regression) before applying more sophisticated methods from computer science. The researchers themselves found it noteworthy that the ‘old’ method of discriminant analysis performed just as well as ‘newer’ methods.

References

Cooley, W., & Lohnes, P. (1971). Multivariate data analysis. Wiley.


Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.).
Springer.
Lim, T., Loh, W., & Shih, Y. (2000). A comparison of predicting accuracy, complexity, and training
time of thirty-three old and new classification algorithms. Machine Learning, 44(3), 203–229.
Michie, D., Spiegelhalter, D., & Taylor, C. (1994). Machine learning, neural and statistical classi-
fication. Ellis Horwood.
Tatsuoka, M. (1988). Multivariate analysis—techniques for educational and psychological
research (2nd ed.). Macmillan.

Further reading

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees.
Chapman & Hall.
Fisher, R. A. (1936). The use of multiple measurement in taxonomic problems. Annals of
Eugenics, 7(2), 179–188.
Green, P., Tull, D., & Albaum, G. (1988). Research for marketing decisions (5th ed.). Prentice
Hall.
Huberty, C. J., & Olejnik, S. (2006). Applied MANOVA and discriminant analysis (2nd ed.).
Wiley-Interscience.
IBM SPSS Inc. (2022). IBM SPSS Statistics 29 documentation. https://www.ibm.com/support/
pages/ibm-spss-statistics-29-documentation. Accessed November 4, 2022.
Klecka, W. (1993). Discriminant analysis (15th ed.). Sage.
Lachenbruch, P. (1975). Discriminant analysis. Springer.
5  Logistic Regression

Contents

5.1 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266


5.2 Procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
5.2.1 Model Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
5.2.1.1 The Linear Probability Model (Model 1). . . . . . . . . . . . . . . . . . . . . . . . 275
5.2.1.2 Logit Model with Grouped Data (Model 2). . . . . . . . . . . . . . . . . . . . . . 277
5.2.1.3 Logistic Regression (Model 3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
5.2.1.4 Classification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
5.2.1.5 Multiple Logistic Regression (Model 4). . . . . . . . . . . . . . . . . . . . . . . . . 287
5.2.2 Estimation of the Logistic Regression Function. . . . . . . . . . . . . . . . . . . . . . . . . . 288
5.2.3 Interpretation of the Regression Coefficients. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
5.2.4 Checking the Overall Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
5.2.4.1 Likelihood Ratio Statistic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
5.2.4.2 Pseudo-R-Square Statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
5.2.4.3 Assessment of the Classification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
5.2.4.4 Checking for Outliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
5.2.5 Checking the Estimated Coefficients. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
5.2.6 Conducting a Binary Logistic Regression with SPSS. . . . . . . . . . . . . . . . . . . . . . 310
5.3 Multinomial Logistic Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
5.3.1 The Multinomial Logistic Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
5.3.2 Example and Interpretation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
5.3.3 The Baseline Logit Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
5.3.4 Measures of Goodness-of-Fit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
5.3.4.1 Pearson Goodness-of-Fit Measure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
5.3.4.2 Deviance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
5.3.4.3 Information Criteria for Model Selection. . . . . . . . . . . . . . . . . . . . . . . . 327
5.4 Case Study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
5.4.1 Problem Definition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
5.4.2 Conducting a Multinomial Logistic Regression with SPSS . . . . . . . . . . . . . . . . . 330
5.4.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335


5.4.3.1 Blockwise Logistic Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335


5.4.3.2 Stepwise Logistic Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
5.4.4 SPSS Commands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
5.5 Modifications and Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
5.6 Recommendations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351

5.1 Problem

With many problems in science and practice, the following questions arise:

• Which of two or more alternatives exists or which event will occur?


• Which factors influence an event and what is their effect?

Often there are only two alternatives, e.g.: Does a patient have a certain disease or not?
Will he survive or not? Will a borrower repay his loan or not? Will a consumer buy a
product or not? In other cases, there are more than two alternatives, e.g.: Which brand
will a potential buyer choose? Which party will a voter vote for?
Logistic regression analysis can be used to answer such questions. As its name indi-
cates, logistic regression is a variant of regression analysis. In general, logistic regression
deals with problems of the form
Y = f (X1 , X2 , . . . , XJ ),
where the dependent variable (response variable) Y is categorical. The independent vari-
ables (predictors) can be metric or categorical variables. Today, logistic regression is the
most important method for analyzing problems with categorical responses.
We usually denote the values of the dependent variable by g = 1, …, G (or by 1 and 0
for just two categories). They indicate alternative events (or groups, response categories,
etc.). Since the occurrence of events is usually subject to uncertainty, Y is regarded as a
random variable. The aim of logistic regression is then to estimate probabilities for pre-
dicting events:
π = f (X1 , X2 , . . . , XJ )
Here are some practical examples for logistic regression with just two alternatives:

• Prediction of customers buying a new product. Observations: buying or not buying of


consumers in a test market. Independent variables (predictors): age, gender, income,
lifestyle, etc.
• Prediction of the risk of heart disease. Observations: the occurrence or non-occurrence
of myocardial infarction. Predictors: age, obesity, smoking habit, diet, and clinical
variables.

Table 5.1  Examples for applications of logistic regression


Field of application Exemplary research question of regression analysis
Business Judging the creditworthiness of a customer based on certain character-
(Banking) istics, e.g., age, family size, income, number of credit cards, duration of
employment. This is called credit scoring
Economics Detection or prediction of turning points in business cycles (recession or
expansion)
Engineering What are the critical factors in the production process for achieving a
certain specification of the product?
Management What are the critical factors for the success or failure of an innovation?
Marketing What are the differences (e.g., regarding age, gender, income, family size,
education) between light and heavy users in a certain product category?
Which personal characteristics influence the choice between certain car
brands?
Medicine Diagnosing a certain disease given various signs and symptoms of a
patient
Predicting the chance of survival for a person with certain clinical signs
and characteristics (e.g., age, gender, smoking, obesity)
Finding risk factors for osteoporosis in women (e.g., age, body mass
index, smoking)
Psychology Which factors (e.g., gender, social class, ethnicity, activities) determine
the likelihood of graduating from college?
Which factors determine the loyalty of members or customers?

• Design of an automatic detector for spam emails (junk emails). Observations: emails
that were spam emails or valid emails. Predictors: frequencies of certain words or
character strings (57 variables).1

For G = 2 alternatives, Y is a binary (dichotomous) variable; thus, it is called binary


logistic regression. For G ≥ 3 it is called multinomial logistic regression. Table 5.1 lists
further examples for the application of logistic regression analysis.
In conventional regression analysis, the dependent variable is always metric (quan-
titative), the observations (input) as well as the estimates (output). This is different in
logistic regression. While the observations are categorical, the estimates (probabilities)
are quantitative (values between zero and one).

1 Cf. Hastie et al. (2011, pp. 2, 300). The data set “Spambase” contains information on 4601 emails

and is publicly available at https://archive.ics.uci.edu.



The Binary Logistic Regression Model


In binary logistic regression, we denote the two alternative events by 1 and 0 (e.g. success
or failure, buying or not buying). The dependent variable Y is then a 0,1-variable. It is
assumed that Y is a random variable.2 The probability to observe event 1 is denoted by
π. Thus, the following applies:
π ≡ P(Y = 1) and 1 − π = P(Y = 0) (5.1)
Consecutive events are assumed to be independent of each other.
In the simplest case of a logistic regression, we want to investigate how Y depends on
an independent variable X by modeling the conditional probability
π(x) ≡ P(Y = 1|x). (5.2)
π(x) is the probability of the occurrence of event 1 for a given value x of the predictor X.
The logistic regression model is composed of two components, a linear systematic
component and a nonlinear logistic function:

• The systematic component is a linear function of the predictor X. For a given value x
the systematic component takes the value
z(x) = α + β x (5.3)
• The logistic function, from which logistic regression got its name, has the form
π = e^z / (1 + e^z) = 1 / (1 + e^(−z))    (5.4)
and is shown in Fig. 5.1.
The systematic component is a real-valued function that can take any value between
−∞ and +∞. What we need for modeling the conditional probability π(x) is a function
that transforms the systematic component into a probability, i.e. into a range between 0
and 1. The logistic function is such a function.

2 Such a variable is called a Bernoulli variable and the events can be seen as outcomes of a
Bernoulli trial. The resulting probability distribution is called Bernoulli distribution. The name
goes back to Jacob Bernoulli (1656–1705). The simplest example of a Bernoulli trial is the tossing
of a coin with the expected value E(Y) = π = 0.5 and the variance V(Y) = π(1 – π). The Bernoulli
distribution is a special case of the binomial distribution for N = 1 trials. The
binomial distribution results from a sequence of N Bernoulli trials. Correspondingly, the buying
frequency (sum of buyers) is binomially distributed with sample size N. With increasing N the
binomial distribution converges to the normal distribution.

Fig. 5.1 Logistic function

The logistic function has an S shape, similar to the distribution function (cumula-
tive probability function) of the normal distribution.3 It can thus be used to transform a
real-valued variable Z (range [−∞, +∞]) into a probability (range [0,1]).
By inserting Eq. (5.3) into Eq. (5.4) we get the simple logistic regression model:

π(x) = e^(α+βx) / (1 + e^(α+βx)) = 1 / (1 + e^(−(α+βx)))    (5.5)
where α and β are unknown parameters that have to be estimated on the basis of obser-
vations (yi, xi) of Y and X. The larger z(x), the greater π(x) = P(Y = 1|x). Accordingly, the
greater z(x), the smaller P(Y = 0|x).
For multiple logistic regression, the systematic component can be extended to
z(x) = α + β1 x1 + · · · + βJ xJ (5.6)
where x = (x1, …, xJ) is a vector of predictors. Thus, in the model of logistic regression,
the predictors are combined linearly, just as in multiple linear regression analysis.

3 This is the reason for the broad usage and the importance of the logistic function, since it is
much easier to handle than the distribution function of the normal distribution, which can only be
expressed as an integral and is therefore difficult to calculate. The logistic function was developed
by the Belgian mathematician Pierre-Francois Verhulst (1804–1849) to describe and predict popu-
lation growth as an improved alternative to the exponential function. The constant e = 2.71828 is
Euler’s number, which also serves as the basis of the natural logarithm.

By inserting Eq. (5.6) into Eq. (5.4) we get the binary logistic regression model:
π(x) = 1 / (1 + e^(−z(x))) = 1 / (1 + e^(−(α+β1x1+···+βJxJ)))    (5.7)
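To make the interplay of Eqs. (5.4), (5.6), and (5.7) concrete, the following minimal Python sketch (our own illustration; Python and NumPy are assumptions, not part of the original text) computes the probability π(x) from the linear systematic component:

import numpy as np

def logistic(z):
    # Logistic function, Eq. (5.4): maps any real value z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def pi(x, alpha, beta):
    # Binary logistic model, Eq. (5.7); x and beta may be scalars or vectors
    z = alpha + np.dot(x, beta)      # linear systematic component, Eq. (5.6)
    return logistic(z)

# With the coefficients estimated later for Model 3 (a = -3.67, b = 1.83),
# an income of 2530 EUR yields a buying probability of about 0.72:
print(pi(2.53, -3.67, 1.83))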
Terminology
As in other methods, logistic regression uses different names for the variables in differ-
ent contexts:

• The dependent variable is also denoted as the response variable, grouping variable,
indicator variable, outcome variable, or y-variable.
• The independent variables are also called predictors, explanatory variables, covari-
ates, or just x-variables.

In some cases, the independent variables are referred to as covariates when they are met-
ric variables and factors when they are categorical variables.4

Odds and Logits


Two numerical (statistical) constructs that are closely related to logistic regression are
odds and logits. They can be used to facilitate the interpretation and calculation of the
logistic model. A probability π has the range 0 ≤ π ≤ 1. The odds and logits are functions
(transformations) of probability that extend the range:
odds(π) = π / (1 − π)           0 ≤ odds < ∞    (5.8)

logit(π) ≡ ln[π / (1 − π)]      −∞ < logit < ∞    (5.9)
The odds can be interpreted as the ratio of “success” (π) to “failure” (1−π). We will
discuss odds in Sect. 5.2.3. By transforming the probability π with the range [0,1] into
odds, the range of values is extended to [0, +∞] (see the left-hand panel of Fig. 5.2).
The logarithm of the odds is called logit. By transforming the probability π or the odds
into logits, the range is extended to [−∞, +∞] (see the right-hand panel of Fig. 5.2).
With the transformation of probabilities into odds and logits, we can express the
logistic regression model
Probability: π(x) = 1 / (1 + e^(−(α+βx)))    (5.10)

in simpler forms:

Odds: odds[π(x)] = e^(α+βx)    (5.11)

4 Categorical independent variables with more than two categories must be decomposed into binary
variables, as in linear regression analysis.

Fig. 5.2 Odds and logit as a function of the probability π

Logit: logit[π(x)] = α + β x (5.12)


The logit of the dependent probability π is equal to the systematic component of the
logistic model. Thus, by using the logit transformation, the logistic regression model
can be linearized and in this way simplified (for an example see Sect. 5.2.1.2). The logit
transformation is therefore of central importance for logistic regression and related
methods.5
Even more simply, we can write:
logit(π ) = z (5.13)
The logit-transformation is the inverse function of the logistic function. This can also be
seen by comparing Fig. 5.1 with the right-hand panel of Fig. 5.2. The logistic function,
with its S-shaped form, transforms a real-valued variable Z (in the range [−∞, +∞])
into a probability (range [0,1]). The logit function does exactly the opposite of what the
logistic function does: it transforms a variable with a limited range between 0 and 1 into
values with an infinite range (without upper or lower limits).
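The transformations in Eqs. (5.8), (5.9), and (5.13) can be retraced with a few lines of Python (our own sketch, not part of the original text), which also show that the logit and logistic functions undo each other:

import numpy as np

def odds(p):
    return p / (1.0 - p)              # Eq. (5.8), range [0, +inf)

def logit(p):
    return np.log(p / (1.0 - p))      # Eq. (5.9), range (-inf, +inf)

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))   # the logistic function again

p = 0.8
print(odds(p))                # 4.0: "success" is four times as likely as "failure"
print(logit(p))               # about 1.386
print(inv_logit(logit(p)))    # 0.8: the logit transformation is undone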
Both functions, logistic and logit, are symmetrical around π = 0.5. For z = 0 we get
the probability π = 0.5:
π = 1 / (1 + e^(−z)) = 1 / (1 + e^0) = 1 / (1 + 1) = 0.5

5 Within the framework of generalized linear models (GLM), logit(π) forms a so-called link func-
tion by means of which a linear relationship is established between the expected value of a depend-
ent variable and the systematic component of the model. The logit link is used in particular when
a binomial distribution of the dependent variable is assumed (cf. Agresti, 2013, pp. 112–122; Fox,
2015, pp. 418 ff.).

Fig. 5.3 Process steps of logistic regression:
1 Model formulation
2 Estimation of the logistic regression function
3 Interpretation of the regression coefficients
4 Checking the overall model
5 Checking the estimated coefficients

and for π = 0.5 the logit becomes 0:


 
logit(0.5) = ln[0.5 / (1 − 0.5)] = ln(1) = 0
If we have no information about which of two events will occur, we have to assign a
probability of 0.5 to each event.

5.2 Procedure

In this section, we will show how logistic regression works. The procedure can be struc-
tured into five steps that are shown in Fig. 5.3.
We will demonstrate the steps of logistic regression with a simple example.

Application Example
The product manager of a chocolate company wants to evaluate the market opportuni-
ties of a new product, extra dark chocolate in a gift box, which is to be positioned in the
premium segment. Because of its bitter taste and premium price, the manager wants to
investigate whether and how the demand for this new gourmet chocolate depends on the
income of the consumers and whether it is more preferred by women or men.
For this purpose, he carries out a product test in which the test persons are asked,
after the presentation and tasting of the product, whether they will buy this new type
of chocolate. The test persons can choose between the following answer categories:
“yes”, “maybe”, “rather not”, “no”. For simplicity’s sake, we summarize the last
three answers into one category and refer to the alternative results as “Buy” and “No
buy”. Table 5.2 encompasses the demographic characteristics of N = 30 respondents
and their responses. Income is given in 1000 EUR. Gender is coded as 0 = female and
1 = male. A variable coded 0 or 1 is called a dummy variable. It can be treated like a
metric variable. Dummy variables can be used to incorporate qualitative predictors
into a linear model.6 ◄

6 On the website www.multivariate-methods.info, we provide supplementary material (e.g., Excel


files) to deepen the reader’s understanding of the methodology.

Table 5.2  Exemplary dataset

Person i   Income [1000 EUR]   Gender (0 = f, 1 = m)   Buy (1 = yes, 0 = no)
1 2.53 0 1
2 2.37 1 0
3 2.72 1 1
4 2.54 0 0
5 3.20 1 1
6 2.94 0 1
7 3.20 0 1
8 2.72 1 1
9 2.93 0 1
10 2.37 0 0
11 2.24 1 1
12 1.91 1 1
13 2.12 0 1
14 1.83 1 1
15 1.92 1 1
16 2.01 0 0
17 2.01 0 0
18 2.23 1 0
19 1.82 0 0
20 2.11 0 0
21 1.75 1 1
22 1.46 1 0
23 1.61 0 1
24 1.57 1 0
25 1.37 0 0
26 1.41 1 0
27 1.51 0 0
28 1.75 1 1
29 1.68 1 1
30 1.62 0 0

5.2.1 Model Formulation

1 Model formulation

2 Estimation of the logistic regression function

3 Interpretation of the regression coefficients

4 Checking the overall model

5 Checking the estimated coefficients

In the first step, the user has to decide which events should be considered as possible
categories of the dependent variable and which variables should hypothetically be con-
sidered and investigated as independent (influencing) variables.
If there is a large number of categories, it may be necessary to combine several cate-
gories so as to reduce the number. Here we already combined the three answer categories
“maybe”, “rather not” and “no” into one category (“No buy”). Similarly, households may
be classified according to whether children are present or not, without further differen-
tiating according to the number of children. It would be different if it was a question of
brand choice, e.g., between Mercedes, BMW, and Audi. In this case, it would not make
sense to combine two of the three categories.
First, only Income shall be considered as an influencing variable. The product man-
ager assumes that income will have a positive influence on purchasing behavior. Thus, he
formulates the following model:
probability of buying = f (Income)
To estimate this model, it must be specified in more detail. The probabilities are not
directly observable, but are manifested in the respondents’ statements whether they will
buy the new chocolate or not. This is expressed by the variable Y with values yi (i = 1, …,
N), with yi = 1 for “Buy”, and 0 for “No buy”.
It is always useful to visualize the data to be analyzed at the outset. For this purpose,
we can use a scatterplot (see Fig. 5.4). Each observation of the variables X = Income and
Y = Buying is represented by a point (xi, yi). The scatter of the data points has a peculiar
look here. The points are arranged in two parallel lines. The upper row of points repre-
sents the “Buyers” and the lower row the “No buys”.
It is evident that the two clusters are overlapping on the x-axis. That is, for medium
incomes we find both Buyers and Non-buyers. However, the Buyers are shifted slightly
to the right side, towards higher incomes. This indicates that income has a positive influ-
ence on purchasing behavior, as already assumed by the product manager. We now want
to quantify this result of the visual inspection of the data by a numerical analysis.
We will now analyze the above data by using four different models and then compare
the results:

Fig. 5.4 Scatterplot for buying (Y) versus income (X)

a) linear probability model (Model 1),


b) logit model with grouped data (Model 2),
c) logistic regression (Model 3),
d) multiple logistic regression (Model 4).

The first two models are simple linear regression models which can be estimated by using
the method of ordinary least-squares (OLS). They are easy to handle and can provide
good approximations. Thus, they are of high practical relevance. Besides, it is instructive
to compare these simpler models with the logistic regression models whose estimation
requires the application of the more complicated maximum likelihood method (ML).

5.2.1.1 The Linear Probability Model (Model 1)


The simple linear regression model has the form
Y = α + β x + ε, (5.14)
where the dependent variable is metric and the model contains an error term ε whose dis-
tribution must meet certain assumptions (see Chap. 2 on regression analysis). Its expected value must be zero and thus:
E(Y ) = α + β x

Fig. 5.5 Estimated regression function for the linear probability model (Model 1)

In binary logistic regression, the dependent variable Y is not metric but can only take
the values 0 and 1. Thus we do not have an error term as in linear regression analysis.
But according to Eq. (5.2) the expected value of the binary variable Y is a conditional
probability and thus a metric variable. With this, we get the linear probability model7
π(x) = α + β x (5.15)
In our example, π(x) is the probability of buying given a certain income x. With the data
for the variables Buying and Income in Table 5.2 and by using the least-squares method,
we get:

ŷ = a + b·x = −0.28 + 0.386·x     (R² = 16.6%)


If we interpret the estimated values as probabilities ( p ≡ ŷ), we can write:
p(x) = −0.28 + 0.386 x (5.16)
Figure 5.5 shows the estimated regression line for the linear probability model with the
scatter of the observed data.

7 More information on the linear probability model can be found in Agresti (2013, p. 117; 1996,
p. 74); Hosmer and Lemeshow (2000, p. 5).

Table 5.3  Data grouped by income classes

Group k   Mean income x̄   Buyers   Mean ȳ   logit(ȳ)
1         2.717           4        0.667    0.69
2         2.562           5        0.833    1.61
3         2.020           3        0.500    0.00
4         1.720           2        0.333    −0.69
5         1.557           2        0.333    −0.69

The positive sign of the regression coefficient b confirms the product manager’s
assumption that income has a positive influence on the buying probability. The coeffi-
cient of determination (R-square) is only 16.6%, but this is not unusual for individual
data.
However, the model is not logically consistent because it can provide probabilities
that lie outside the interval from 0 to 1. For incomes below 734 EUR, we would get neg-
ative “probabilities”, and for incomes above 3324, we would get “probabilities” greater
than one. Despite these shortcomings the model offers useful approximations within the
range of observed incomes as we will later see when comparing the linear probability
model with the other models.
The advantage of the model is that it is easy to calculate and easy to interpret, as the
buying probability changes linearly with the income. For an income of 1500 EUR, the
expected buying probability is about p = 30%, as can easily be calculated with the esti-
mated function (Eq. 5.16). If the income increases from 1500 EUR to 1600 EUR and,
thus, x from 1.5 to 1.6, the buying probability increases by b/10 = 0.039 to p = 33.9%.

5.2.1.2 Logit Model with Grouped Data (Model 2)


An alternative way of analysis is the grouping of the data, in this case into income
classes. For each income class, we then can calculate the mean y value which is equal to
the proportion of Buyers in this group. Thus we can transform the binary data of buying
into quantitative data (frequencies) at the cost of diminishing the sample size. Table 5.3
shows the result.
We grouped the N = 30 observations into k = 5 income classes with 6 persons each.8
For each group k, we calculated the mean income x and the mean y (the proportion of
Buyers). The data in Table 5.3 are already ordered by income from largest to smallest.
So, the first 6 observations make up the first group with the mean values
x̄1 = 2.717 and ȳ1 = 4/6 = 0.67

8 These groups (classes) must be distinguished from the category groups of the dependent variable
Y.

Fig. 5.6 Logistic regression function for grouped data (Model 2)

The scatter of the 5 points (x̄k, ȳk) is shown in Fig. 5.6. Now we have only k = 5 observa-
tions instead of N = 30. Of course, this method works better for larger sample sizes when
we can build more and larger predictor classes.
With Eq. (5.12) we can formulate the simple linear regression model Eq. (5.14) with
grouped data in logit form:
logit(y) = α + β x + ε (5.17)
With the data
 
(x̄k, logit(ȳk)),  k = 1, …, K

given in Table 5.3 and by using the least-squares method we get:

logit(p) = a + b·x = −3.48 + 1.73·x     (R² = 80.5%)


p again denotes the estimated probability. With Eq. (5.10) we get an estimate of the
logistic regression model:
p(x) = 1 / (1 + e^(3.48−1.73x))
Figure 5.6 shows this estimated logistic regression function. In contrast to the linear
probability model, the curve flattens out with the distance from the average income.
Thus, only values between 0 and 1 can arise for the probability p.

This function, which we estimated on an aggregated basis, we can now apply to indi-
vidual income values and thus obtain estimates for individual probabilities. For the first
person with an income of 2530 EUR, we get:
p1 = 1 / (1 + e^(3.48−1.73·x1)) = 1 / (1 + e^(3.48−1.73·2.53)) = 0.71
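The grouped-data estimation can be retraced directly from the values in Table 5.3 with a short Python sketch (our own illustration, not part of the original text):

import numpy as np

# Group means from Table 5.3 (5 income classes with 6 persons each)
mean_income = np.array([2.717, 2.562, 2.020, 1.720, 1.557])
share_buy   = np.array([4/6, 5/6, 3/6, 2/6, 2/6])       # proportion of Buyers per class

logit_y = np.log(share_buy / (1 - share_buy))           # logit transformation, Eq. (5.12)
b, a = np.polyfit(mean_income, logit_y, 1)              # OLS on the logits, Eq. (5.17)
print(a, b)                                             # approx. -3.48 and 1.73

# Apply the estimated function to an individual income, e.g. person 1 (2530 EUR):
p1 = 1 / (1 + np.exp(-(a + b * 2.53)))
print(p1)                                               # approx. 0.71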

5.2.1.3 Logistic Regression (Model 3)


While in the preceding section we derived a logistic function after grouping the data, we
will now analyze the individual data by using logistic regression. This is what the SPSS
function ‘Logistic regression’ is for.
For individual data (casewise data), a linearization of the logistic regression function,
as carried out for grouped data using the logit transformation in Eq. (5.12), is not possi-
ble. Instead, we have to estimate the parameters of the non-linear function in Eq. (5.10).
Therefore, a different estimation method must be used, the maximum likelihood method,
which will be described in detail in Sect. 5.2.2.
Here we anticipate the result of the estimation procedure. We have to estimate the
parameters α and β of the logistic model according to Eq. (5.10):
π = 1 / (1 + e^(−(α+βx)))
With the data (xi, yi), i = 1, …, N, we get the values a = −3.67 and b = 1.83. With these values, we get the estimated logistic function

p(x) = 1 / (1 + e^(3.67−1.83x)),
which is shown in Fig. 5.7. This function is very similar to the one we got in the previous
section with grouped data.
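Outside SPSS, the same maximum likelihood estimates can be obtained, for instance, with the Python package statsmodels (our own sketch; the package choice is an assumption, the data are those of Table 5.2):

import numpy as np
import statsmodels.api as sm

income = np.array([2.53, 2.37, 2.72, 2.54, 3.20, 2.94, 3.20, 2.72, 2.93, 2.37,
                   2.24, 1.91, 2.12, 1.83, 1.92, 2.01, 2.01, 2.23, 1.82, 2.11,
                   1.75, 1.46, 1.61, 1.57, 1.37, 1.41, 1.51, 1.75, 1.68, 1.62])
buy    = np.array([1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
                   1, 0, 1, 0, 0, 0, 0, 1, 1, 0])

X = sm.add_constant(income)            # design matrix with a constant term
fit = sm.Logit(buy, X).fit()           # maximum likelihood estimation
print(fit.params)                      # approx. [-3.67, 1.83]
print(fit.predict(X))                  # individual buying probabilities p(x_i)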

Comparison of the Models


Table 5.4 compares the estimated probabilities of the three models for three selected per-
sons (cases 1, 15, and 30). These probabilities are quite close together for the different
models, especially for person 15 with an average income.
Figure 5.8 shows a diagram of the three models. The function of the logit model with
grouped data (Model 2) is represented by the dashed line which is hardly different from
the logistic regression with individual data (Model 3). This is surprising in view of the
small sample size. Although we formed only 5 rather small groups whose means showed
considerable scatter (Fig. 5.6), this was smoothed by the regression.
In the medium income area the linear model also shows a good approximation to the
other two models. But for incomes further away from the mean, the linear probability
model deviates more strongly from the two logistic functions and eventually produces
probabilities outside the range of 0 to 1.

Fig. 5.7 Estimated logistic regression function (Model 3)

Table 5.4  Comparison of estimated probabilities


Person   Income (1000 €)   Buy (1 = yes, 0 = no)   Model 1: linear prob. model   Model 2: LogReg grouped   Model 3: LogReg individual
1 2.53 1 0.694 0.711 0.722
15 1.92 1 0.458 0.462 0.459
30 1.62 0 0.342 0.338 0.329

5.2.1.4 Classification
The estimated probabilities can be used to predict purchasing behavior or, in the termi-
nology of classification, to assign persons to categories (groups).
Our sample comprises two categories: “Buy” and “No buy”. Now we want to find out
whether our model can correctly predict into which group a person belongs if the income
is known. If this works, we can apply the model to other persons in the population who
were not used for the analysis. So, for each person, we

• have an observed group membership,


• will predict a group membership based on our estimated model.

For predicting the group membership (classification), we use the estimated probabili-
ties. To transform a probability into a prediction of group membership, a cutoff value is
necessary. This value (threshold) we denote by p*. A case with an estimated probability

Fig. 5.8 Comparison of the results of Models 1 to 3

greater than p* will be “predicted” (classified) as “Buy”, otherwise as “No buy”. The follow-
ing applies:

ŷi = 1, if pi > p*;   ŷi = 0, if pi ≤ p*    (5.18)

The cutoff value for just two alternatives is usually the probability p* = 0.5. Using this
value, all three models give the same predictions (see Table 5.4). For person 1 they cor-
rectly predict “Buyer”, for person 15 they falsely predict “No buy”, and for person 30
they correctly predict “No buy”.
Predictions are usually directed towards the future. Strictly speaking, one can there-
fore only speak of predictions when it comes to future behavior. Here we use a retrospec-
tive design (“predicting into the past”) for checking the predictive power of a model.
Table 5.5 shows the estimated probabilities of all 30 persons, which were derived by
logistic regression with individual data (Model 3), with the observed group membership
(Buy or No buy) and the predicted group membership.
Note that the mean of the estimated buying probabilities is equal to the proportion of
observed Buyers (mean of the y-values). This corresponds to the least-squares method in
linear regression, where the mean values of the estimated and observed y-values are also
always identical.

Table 5.5  Estimated probabilities and predicted buyers for p* = 0.5 (Model 3)

Person   Income [1000 €]   Buy observed (1 = yes, 0 = no)   Estimated probability p   Buy predicted (1 = yes, 0 = no)
1 2.53 1 0.722 1
2 2.37 0 0.659 1
3 2.72 1 0.786 1
4 2.54 0 0.725 1
5 3.20 1 0.898 1
6 2.94 1 0.846 1
7 3.20 1 0.898 1
8 2.72 1 0.786 1
9 2.93 1 0.843 1
10 2.37 0 0.659 1
11 2.24 1 0.604 1
12 1.91 1 0.455 0
13 2.12 1 0.551 1
14 1.83 1 0.419 0
15 1.92 1 0.459 0
16 2.01 0 0.500 1
17 2.01 0 0.500 1
18 2.23 0 0.600 1
19 1.82 0 0.415 0
20 2.11 0 0.546 1
21 1.75 1 0.384 0
22 1.46 0 0.268 0
23 1.61 1 0.325 0
24 1.57 0 0.310 0
25 1.37 0 0.237 0
26 1.41 0 0.251 0
27 1.51 0 0.287 0
28 1.75 1 0.384 0
29 1.68 1 0.354 0
30 1.62 0 0.329 0
Mean 2.115 0.533 0.533 0.533
Sum 16 16 16

Table 5.6  Classification table for the logistic model (p* = 0.5)


Prediction Accuracy
Group 1 = Buy 0 = No buy Sum Proportion correct
1 = Buy 9 7 16 0.563 Sensitivity
0 = No buy 7 7 14 0.500 Specificity
Total 16 14 30 0.533 Hit rate

Classification Tables
The total set of observations and predictions can be summarized in a classification
table (confusion matrix). Table 5.6 represents the classification table for the results in
Table 5.5.
In the diagonal of the four fields (under “Prediction”) are the case numbers of the
correct predictions: 9 “Buy” and 7 “No buy” (bold numbers). The remaining two fields
contain the numbers of incorrect predictions. The column “sum” presents the case num-
bers of the two category groups (Buy and No buy) and the total number of cases (here
the number of all test persons). These numbers are given by the data and do not have to
be calculated. They must match the sum of the cells in the same row.
The right side of the classification table shows three different measures of predictive
accuracy:
Sensitivity: proportion of correctly predicted Buyers in relation to the total number of Buyers; 9/16 = 0.563 = “correctly predicted Buyers”
Specificity: proportion of correctly predicted Non-buyers in relation to the total number of Non-buyers; 7/14 = 0.500 = “correctly predicted Non-buyers”
Hit rate: proportion of correct predictions in relation to the number of all cases; (9 + 7)/30 = 0.533 = “correctly predicted”

Table 5.7 shows how to calculate these three measures of accuracy. The hit rate is a
weighted average of sensitivity and specificity.
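The following Python sketch (our own illustration) computes the three measures from observed group memberships and estimated probabilities; only the first three cases of Table 5.5 are typed in, the remaining values would be added analogously:

import numpy as np

y_obs = np.array([1, 0, 1])                 # observed groups ... all 30 values from Table 5.5
p_hat = np.array([0.722, 0.659, 0.786])     # estimated probabilities ... all 30 values

y_pred = (p_hat > 0.5).astype(int)          # classification rule of Eq. (5.18) with p* = 0.5

n11 = np.sum((y_obs == 1) & (y_pred == 1))  # correctly predicted Buyers
n00 = np.sum((y_obs == 0) & (y_pred == 0))  # correctly predicted Non-buyers
sensitivity = n11 / np.sum(y_obs == 1)
specificity = n00 / np.sum(y_obs == 0)
hit_rate    = (n11 + n00) / len(y_obs)
# With all 30 cases this should reproduce 0.563, 0.500 and 0.533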
The hit rate of 53.3% achieved here is very modest and only slightly above what we
would expect from tossing a coin. For comparison, see the classification table for the lin-
ear probability model in Table 5.8. This model yields a hit rate of 60%, which might lead
to the conclusion that this model predicts or classifies better.

Table 5.7  Calculation of the measures of accuracy


Prediction Accuracy
Group 1 = Buy 0 = No buy Sum Proportion correct
1 = Buy n11 n10 n1 n11 /n1 Sensitivity
0 = No buy n01 n00 n0 n00 /n0 Specificity
Total n (n00 + n11 )/n Hit rate

Table 5.8  Classification table for the linear probability model (p* = 0.5)
Prediction Accuracy
Group 1 = Buy 0 = No buy Sum Proportion correct
1 = Buy 9 7 16 0.563 Sensitivity
0 = No buy 5 9 14 0.643 Specificity
Total 14 16 30 0.600 Hit rate

Table 5.9  Classification table for the logistic model (p* = 0.3)


Prediction Accuracy
Group 1 = Buy 0 = No buy Sum Proportion correct
1 = Buy 16 0 16 1.000 Sensitivity
0 = No buy 10 4 14 0.286 Specificity
Total 26 4 30 0.667 Hit rate

Table 5.10  Classification table for the logistic model (p* = 0.7)


Prediction Accuracy
Group 1 = Buy 0 = No buy Sum Proportion correct
1 = Buy 7 9 16 0.438 Sensitivity
0 = No buy 1 13 14 0.929 Specificity
Total 8 22 30 0.667 Hit rate

This, however, is a deception. The hit rate is only conditionally suitable as a meas-
ure of predictive accuracy since it also depends on the selected cutoff value p*. If we
perform the classification with modified cutoff values, e.g. p* = 0.3 or p* = 0.7, using
the data in Table 5.5, we obtain the classifications shown in Tables 5.9 and 5.10. In both
cases we get an increased hit rate. This shows the influence of the cutoff value on the hit
rate. It is unusual, however, for the hit rate to rise both for higher and lower cutoff values
as is the case here.

Fig. 5.9 ROC curve for logistic regression (AUC = 0.723)

ROC Curve
A generalized concept for assessing a classification table is the ROC curve (receiver
operating characteristic). While a classification table is always valid for a specific cutoff
value p*, the ROC curve gives a summary of the classifications for all possible values of
p*.
Figure 5.9 shows the ROC curve for the logistic model with the values from Table
5.5. A point on the ROC curve is valid for a certain cutoff value and therefore also for a
certain classification table.9 The ROC curve is obtained by plotting the sensitivity over
1 – specificity for different cutoff values p*. The classification table above for p* = 0.5
(Table 5.6) yields: 1 – specificity = 0.500 and sensitivity = 0.563. This point (0.500,
0.563) on the ROC curve in Fig. 5.9 is indicated by an arrow and is close to the diagonal
line.

9 The concept of the ROC curve originates from communications engineering. It was originally
developed during the Second World War for the detection of radar signals or enemy objects and
is used today in many scientific fields (see, e.g., Agresti, 2013, pp. 224 ff.; Hastie et al., 2011,
pp. 313 ff.; Hosmer et al., 2013, pp. 173 ff.). SPSS offers a procedure for creating ROC curves for
given classification probabilities or discriminant values. The above ROC curve was created with
Excel.

The diagonal line would be expected if the prediction were purely random, e.g. by
tossing a coin. It does not allow for any discrimination. The area under the ROC curve,
known as area under curve (AUC), is a measure of the overall predictive accuracy (clas-
sification performance) of the model. Its maximum is 1. Hosmer et al. (2013, p. 177)
give the following rule for judging the accuracy expressed by the ROC curve:
AUC < 0.7: not sufficient
0.7 ≤ AUC < 0.8: acceptable
0.8 ≤ AUC < 0.9: excellent
AUC ≥ 0.9: outstanding
For our model, we obtain AUC = 0.723. This value is obtained for all three mod-
els above, even though they lead to different classification tables for individual cutoff
values.10
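The ROC curve and the AUC can be computed by sweeping the cutoff value, as sketched below in Python (our own illustration; with all 30 probabilities of Table 5.5 the area should come close to the reported 0.723):

import numpy as np

y_obs = np.array([1, 0, 1])                 # ... all 30 observed groups from Table 5.5
p_hat = np.array([0.722, 0.659, 0.786])     # ... all 30 estimated probabilities

cutoffs = np.sort(np.unique(np.concatenate(([0.0, 1.0], p_hat))))[::-1]
tpr, fpr = [], []                           # sensitivity and 1 - specificity per cutoff
for c in cutoffs:
    y_pred = (p_hat > c).astype(int)
    tpr.append(np.mean(y_pred[y_obs == 1] == 1))
    fpr.append(np.mean(y_pred[y_obs == 0] == 1))

auc = 0.0                                   # area under the ROC curve (trapezoid rule)
for i in range(1, len(cutoffs)):
    auc += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
print(auc)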

Choice of the Cutoff Value


The choice of the cutoff value p* always involves a trade-off between sensitivity and
specificity, as the classification tables above (Tables 5.9 and 5.10) illustrate. The over-
all hit rate is identical, but sensitivity and specificity change diametrically. Therefore,
the consequences regarding the predictions or classifications must be taken into account
when choosing the cutoff value.
These consequences can be very serious, for instance in medical testing when it
comes to diagnosing diseases instead of buying behavior. Here the choice of the cutoff
value is of eminent importance. A clinical test for a particular disease should be positive
if the patient has the disease and negative if the patient does not have the disease. Here,
the terms sensitivity and specificity have the following meanings:

• Sensitivity = “true positive”: The test will be positive if the patient is sick (disease is
correctly recognized).
• Specificity = “true negative”: The test will be negative if the patient is not sick.

Imagine that the disease is dangerous but curable if treated quickly. In this case, the
damage of a “false positive” (an unnecessary treatment) is smaller than the damage
of a “false negative” (the patient may die because the disease was not recognized).11

10 We also get the same value for AUC if we apply discriminant analysis to our data. Alternatively,
one can create the ROC curve based on discriminant values or classification probabilities.
11 Another danger of “false negatives” is the risk that sick persons may spread an infectious dis-

ease. The high rate of “false negative” test results contributed to the rapid spread of the corona
virus at the beginning of the pandemic in 2020 (cf. Watson et al., 2020).

In this case it would be useful to increase the sensitivity by lowering the cutoff value. In
Table 5.9, the sensitivity is increased to 100% by reducing the cutoff value to p* = 0.3.
If, on the other hand, the disease is not curable, a false diagnosis of disease (“false
positive”) might cause considerable harm by incurring severe anxiety and depression in
the patient, and, in the extreme, ruining his life (cf. Gigerenzer, 2002, pp. 3 ff.; Pearl,
2018, pp. 104 ff.). In this case, it would be appropriate to increase specificity by raising
the cutoff value to avoid “false positives”. In Table 5.10, the specificity is increased to
92.9% by increasing the cutoff value to p* = 0.7, thus reducing the percentage of “false
positives” to (1 – specificity) × 100 = 7.1%. For a medical test, this probability of “false
positives” would still be very high.
A similar problem is encountered when designing a spam filter for e-mails. If we set
“1 = Spam” and “0 = No spam” (corresponding to Buy and No buy in the above exam-
ple), sensitivity represents the ability to recognize spam correctly. A low cutoff value
leads to high sensitivity. But high sensitivity increases the risk that a valid e-mail (No
spam) is lost in the spam filter (“false positive”). Since it is unpleasant if an important
e-mail is lost in this way, the cutoff value will need to be set higher. As a result, sensitiv-
ity is reduced and we continue to receive spam.
The opposite situation is encountered in airport security. Here a “false positive” would
have no serious consequences, leading only to a more extensive checking of passengers.
But a “false negative” means that a terrorist is not detected, potentially leading to hun-
dreds of deaths. Thus, in this case the sensitivity has to be very high (and a low cutoff
value is necessary).

5.2.1.5 Multiple Logistic Regression (Model 4)


As stated above, the binary logistic regression model can be extended to more than one
independent variable. According to Eq. (5.6), we get the systematic component
z(x) = α + β1 x1 + · · · + βJ xJ
where x is a vector of J predictors. This leads to the multiple logistic regression model:

π(x) = e^(α+β1x1+···+βJxJ) / (1 + e^(α+β1x1+···+βJxJ)) = 1 / (1 + e^(−(α+β1x1+···+βJxJ)))    (5.19)

with
π(x) = P(Y = 1|x1 , x2 , · · · xJ ) (5.20)
The predictors are combined linearly, as in multiple regression analysis.
Alternatively, the following formulas are used:


π(x) = e^(α+Σj βj xj) / (1 + e^(α+Σj βj xj)) = e^(α+xβ′) / (1 + e^(α+xβ′)) = 1 / (1 + e^(−(α+xβ′)))
where x and β denote row vectors.

Continuing our example with the data in Table 5.2 we will now include the variable
Gender and estimate the model
probability of buying = f (Income, Gender)
According to Eq. (5.19), we formulate the following logistic model:
π(x) = 1 / (1 + e^(−(α+β1x1+β2x2)))    (5.21)
Using the maximum likelihood method provides the following estimates for the
parameters:
a = −5.635, b1 = 2.351, b2 = 1.751
With these values we get the following logistic regression function:
p(x) = 1 / (1 + e^(−(−5.635+2.351x1+1.751x2)))    (5.22)
Example
For the first person in Table 5.2, a woman with an income of 2530 EUR, the systematic
component (the logit) has the value:

z = −5.635 + 2.351 · 2.530 + 1.751 · 0 = 0.313


and we get the probability:
p = 1 / (1 + e^(−z)) = 1 / (1 + e^(−0.313)) = 0.578
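The estimated function (5.22) is easy to evaluate for any combination of income and gender, for example with a small Python sketch (our own illustration):

import numpy as np

def p_buy(income, gender):
    # Estimated multiple logistic model, Eq. (5.22); gender: 0 = female, 1 = male
    z = -5.635 + 2.351 * income + 1.751 * gender
    return 1.0 / (1.0 + np.exp(-z))

print(p_buy(2.53, 0))   # person 1 (woman, 2530 EUR): approx. 0.578
print(p_buy(2.53, 1))   # a man with the same income: approx. 0.89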
The diagram in Fig. 5.10 shows the estimated logistic regression functions for men and
women.
Prediction with the estimated multiple logistic model yields the classification in
Table 5.11. As we can see, inclusion of the variable Gender has considerably improved
the predictive power of the model.
The ROC curve in Fig. 5.11 shows that for the cutoff value p* = 0.5 the hit rate
reaches its maximum. With AUC = 0.813 (area under the ROC curve), the predictive
accuracy of the model can now be judged as “excellent”.

5.2.2 Estimation of the Logistic Regression Function

1 Model formulation

2 Estimation of the logistic regression function

3 Interpretation of the regression coefficients

4 Checking the overall model

5 Checking the estimated coefficients



Fig. 5.10 Logistic regression functions for men and women

Table 5.11  Classification table for the multiple logistic model (p* = 0.5)
Prediction Accuracy
Group 1 = Buy 0 = No buy Sum Proportion
correct
1 = Buy 14 2 16 0.875 Sensitivity
0 = No buy 3 11 14 0.786 Specificity
Total 17 13 30 0.833 Hit rate

Due to its non-linearity, the logistic regression function needs to be estimated with the
maximum likelihood method, instead of the least-squares method.
The maximum likelihood (ML) principle states: Determine the estimated values for
the unknown parameters in such a way that the realized data attain maximum plausibility
(likelihood).12 Or, in other words, maximize the probability of obtaining the observed data.
For the estimation of the logistic regression model this means that for a person i
the probability p(xi) should be as large as possible if yi = 1, and as small as possible if
yi = 0. This can be summarized by the following expression, which should be as large as
possible:

12 Theprinciple of the ML method goes back to Daniel Bernoulli (1700–1782), a nephew of Jakob
Bernoulli. Ronald A. Fisher (1890–1962) analyzed the statistical properties of the ML method and
paved the way for its practical application and dissemination. Besides the Least-Squares method,
the ML method is the most important statistical estimation method.

Fig. 5.11 ROC curve for multiple logistic regression (AUC = 0.813)

p(xi)^yi · [1 − p(xi)]^(1−yi)    (5.23)


Since the logistic model assumes that the yi over the persons i (i = 1,…, N) are dis-
tributed independently of each other, the common probability over all persons can be
expressed as the product of the individual probabilities. For the logistic regression with a
single predictor and parameters a and b (Model 3), this will result in the following likeli-
hood function to be maximized:

L(a, b) = ∏(i=1…N)  p(xi)^yi · [1 − p(xi)]^(1−yi)  →  Max!    (5.24)

with yi = 1 for Buy and 0 for No buy.


The parameters a and b are to be determined such that the likelihood becomes max-
imal. The calculation is easier if we logarithmize the probabilities and thus convert the
product into a sum. We get the so-called log-likelihood function:

LL(a, b) = Σ(i=1…N)  ( ln[p(xi)] · yi + ln[1 − p(xi)] · (1 − yi) )  →  Max!    (5.25)

Since the logarithm is a strictly monotonously increasing function, maximizing both


functions leads to the same result.

Fig. 5.12 Maximizing the LL function

The log-likelihood function LL can only assume negative values, because the loga-
rithm of a probability is negative. The maximization of LL, therefore, means that the
value of LL comes as close as possible to the value 0. LL = 0 would result if the proba-
bilities of the chosen alternatives were all 1 and thus the probabilities for the non-chosen
alternatives were all 0.
Figure 5.12 illustrates the course of LL with the variation of the coefficient b. For
b = 1, the value for LL is −28. The maximum is LL = −18.027. It is achieved with
b = 1.83, our estimation for Model 3.
The solution to this optimization problem, i.e. maximizing the log-likelihood func-
tion, requires the application of iterative algorithms. This can be done by using qua-
si-Newton or gradient methods.13 These methods need a lot of processing capacity, but
with today’s computing power, this is of little importance. The fact that iterative algo-
rithms may not always converge or can get stuck in a local optimum is usually more of a

13 For logistic regression, quasi-Newton methods are primarily used, which converge quite quickly.
These methods are based on Newton's method for finding the zero of a function. They use the first
and second partial derivatives of the LL function with respect to the unknown parameters to find the
optimum. The derivatives are approximated differently depending on the method. Well-known variants
are the Newton–Raphson method and the related Gauss–Newton method. Nowadays, the method of
iteratively reweighted least squares (IRLS) is also widely used. Cf. e.g.
Agresti (2013, pp. 149 ff.); Fox (2015, pp. 431 ff.); Press et al. (2007, pp. 521 ff.).

problem. This danger, however, does not exist here since the LL function is globally concave
(equivalently, −2LL is convex), and therefore only one global optimum exists.14
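As an illustration (a minimal sketch of our own, not part of the original text), the maximization of the log-likelihood in Eq. (5.25) for Model 3 can be carried out with a general-purpose numerical optimizer. The sketch assumes Python with the NumPy and SciPy packages; the variable names and the choice of the BFGS method are ours, and the data are taken from Table 5.14. The estimates should come out close to a = −3.67, b = 1.83 and LL = −18.03.

import numpy as np
from scipy.optimize import minimize

# Income (in 1000 EUR) and buying decision of the 30 persons (cf. Table 5.14)
income = np.array([2.53, 2.37, 2.72, 2.54, 3.20, 2.94, 3.20, 2.72, 2.93, 2.37,
                   2.24, 1.91, 2.12, 1.83, 1.92, 2.01, 2.01, 2.23, 1.82, 2.11,
                   1.75, 1.46, 1.61, 1.57, 1.37, 1.41, 1.51, 1.75, 1.68, 1.62])
buy = np.array([1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
                0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0])

def neg_log_likelihood(params):
    a, b = params
    p = 1.0 / (1.0 + np.exp(-(a + b * income)))                  # Eq. (5.10)
    return -np.sum(buy * np.log(p) + (1 - buy) * np.log(1 - p))  # -LL, Eq. (5.25)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="BFGS")
a_hat, b_hat = result.x
print(a_hat, b_hat, -result.fun)   # expected: roughly -3.67, 1.83 and LL = -18.03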

5.2.3 Interpretation of the Regression Coefficients

1 Model formulation

2 Estimation of the logistic regression function

3 Interpretation of the regression coefficients

4 Checking the overall model

5 Checking the estimated coefficients

Due to the non-linearity of the logistic model, the interpretation of the model parameters
is more difficult than in other methods for analyzing dependencies (e.g., linear regres-
sion, analysis of variance). Figure 5.13 shows three diagrams (a–c) that illustrate the
effects of changes in the parameters of the logistic model with a single predictor.

a) Change of the constant term a (Fig. 5.13a)


A change of the constant term a causes a horizontal shift of the logistic function,
while in linear regression it causes a vertical shift of the regression line. When param-
eter a increases, the curve shifts to the left and the probability at a given value x
increases. The decrease of a has the opposite effect.
b) Change of the coefficient b
An increase of the coefficient b increases the curvature of the logistic function. In the
middle range, the slope of the curve gets steeper (like the regression line). But in the
outer areas, the curve gets flatter, due to its S-shape.
For b = 0 the whole curve flattens to a horizontal line (as in linear regression).
c) Change of the sign of b
A negative sign of the coefficient b leads to a decreasing curve (as in linear
regression).

The coefficient b determines how the independent variable x affects the dependent var-
iable p. A difficulty of interpretation of the logistic model results from the fact that the
effects on the dependent variable are not constant for equal changes in x:

14 McFadden (1974) has shown that with a linear systematic component of the logistic model, the
LL function is globally concave, which makes maximization much easier.
Fig. 5.13 Curves of the logistic function for different values of the parameters a and b: (a) variation of the constant term a (a = 2, a = 0, a = −2), (b) variation of the coefficient b (b = 2, b = 1, b = 0.5), (c) positive versus negative coefficient (b = 1, b = −1)

• In linear regression, each change of x by one unit causes a constant change b in the
dependent variable.
• In logistic regression, the effect of a change in x on the dependent variable p also
depends on the value of p. The effect is greatest when p = 0.5, and the more p deviates
from 0.5, the smaller becomes the change in p.

At position p, the slope of the logistic function is p(1 − p) b, and thus for p = 0.5 the
slope will be 0.25 b. At p = 0.01 or p = 0.99, however, the slope will only be 0.01 b.
Because of the curvature of the logistic function, this describes only approximately the
change of p due to a change in x by one unit. The smaller the changes in x, the better the
approximation.

Example
We will illustrate the effects of a change in x with a numerical example. For a logistic
regression with a single predictor (Model 3), we had estimated the following function:
$p(x) = \frac{1}{1 + e^{-(a + b\,x)}} = \frac{1}{1 + e^{3.67 - 1.827x}}$   (5.26)
With an income of 2000 EUR (x = 2) we get approximately p = 0.5. For this value,
a change in income has the greatest effect on p. If the income increases in steps of 1
unit (1000 EUR), then p increases with decreasing rates.
Column 1a in Table 5.12 shows how p increases with increasing x. Column 2a
shows the increments of change (difference to the previous value). One can see that
the increments become smaller. This follows from the curvature of the logistic func-
tion. ◄
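The values in columns 1a and 2a of Table 5.12 can be reproduced directly from the estimated coefficients of Model 3. The following lines are a small sketch of our own in Python (not the book's code); only the standard math module is needed.

import math

a, b = -3.67, 1.827                        # estimated coefficients of Model 3
p_prev = None
for x in (2, 3, 4, 5):                     # income in units of 1000 EUR
    p = 1 / (1 + math.exp(-(a + b * x)))   # Eq. (5.26)
    diff = "" if p_prev is None else round(p - p_prev, 3)
    print(x, round(p, 3), diff)            # expected: 0.496; 0.859 (+0.364); 0.974 (+0.115); 0.996 (+0.021)
    p_prev = p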

Odds
Besides probability, the “odds” are another concept to express the occurrence of random
events. Both concepts originate from gambling, with odds probably being older than
probability, and some analysts prefer odds to probabilities.15 Thus, when rolling dice, the odds to roll a
six are defined as follows:
$\text{odds} \equiv \frac{\text{favorable events}}{\text{unfavorable events}} = \frac{1}{5}$
We state: “The odds to roll a six are 1 to 5.”

15 The term "odds" is used only in the plural. The concept of odds and its usefulness was described
by the Italian mathematician and physician Gerolamo Cardano (1501–1576), who supported himself
by gambling. In his "Book on Games of Chance" he wrote the first treatise on probability. The theory
of probability emerged only later, in the seventeenth century, with the works of Pierre de Fermat
(1601–1665), Blaise Pascal (1623–1662) and Jakob Bernoulli (1655–1705).
Table 5.12  Numerical example

Income        Values of the logistic function       Difference to the previous value
[1000 €]      (1a)      (1b)      (1c)              (2a)      (2b)       (2c)
x             p(x)      Odds      Logits            p(x)      Odds       Logits
2             0.496     0.984     −0.016
3             0.859     6.115     1.811             0.364     5.132      1.827
4             0.974     38.01     3.638             0.115     31.899     1.827
5             0.996     236.3     5.465             0.021     198.291    1.827

In contrast, the probability to roll a six is defined differently:


$\text{probability} \equiv \frac{\text{favorable events}}{\text{possible events}} = \frac{1}{6}$
In statistics, the odds are used as an expression of relative probability, e.g., for the ratio
of the two probabilities of a binary variable (Bernoulli trial). If we define q as the prob-
ability “not to roll a six”, then q = 5/6 = 1 − p. With this we get the odds “to roll a six”
(versus “not to roll a six”) as the ratio of the two probabilities p and q:
$\text{odds} = \frac{p}{1-p} = \frac{p}{q} = \frac{1/6}{5/6} = 1/5$   (5.27)
The odds are always positive and, unlike probabilities, have no upper limit (cf. Fig. 5.2
or column 1b in Table 5.12).
Inversely, the probability p is determined by the odds:

$p = \frac{\text{odds}}{\text{odds} + 1}$   (5.28)

For instance, if the odds are 1 to 3, the probability will be $p = \frac{1/3}{1/3 + 1} = \frac{1}{4} = 0.25$, and if the odds are 3 to 1, the probability will be $p = \frac{3/1}{3/1 + 1} = \frac{3}{4} = 0.75$.
With Eqs. (5.11) and (5.26) we can also write our estimated Model 3 with regard to odds:

$\text{odds}(x) = e^{a + bx} = e^{-3.67 + 1.827x}$   (5.29)

with $\text{odds}(x) \equiv \text{odds}\left(p(x)\right)$.

To illustrate the effect of a change of x on the odds, we insert x + 1 into this function
and get:

$\text{odds}(x+1) = e^{a + b(x+1)} = e^{a + bx + b} = e^{a + bx} \cdot e^{b} = \text{odds}(x) \cdot e^{b}$   (5.30)


If we divide the odds on the left side by the odds on the right side of the equation, we get
$\frac{\text{odds}(x+1)}{\text{odds}(x)} = e^{b}$   (5.31)
From this follows the simple rule: the odds increase by the factor $e^{b}$ if x is increased by one unit. The factor $e^{b}$ is referred to as the effect coefficient. For our example we get:

$e^{b} = e^{1.827} = 6.216$   (5.32)


This value is usually displayed in the output of statistics programs for logistic regression
(e.g. SPSS). In multiple logistic regression it is displayed for each predictor. The ratio in
Eq. (5.31) is also called an odds ratio (OR). The OR is of special importance for binary
predictors (see below).
The value of the effect coefficient can also be obtained (apart from rounding errors)
by dividing the values in columns 1b or 2b of Table 5.12 by the respective preceding
value, e.g.:
$\frac{\text{odds}(3)}{\text{odds}(2)} = \frac{6.115}{0.984} = 6.214$
In our example, this states that the chances of buying increase by a factor of 6 if a per-
son’s income increases by 1 unit [1000 EUR]. When the income changes in steps of one
unit, the odds do not increase by a constant amount each time but by a constant factor. If
the logistic regression coefficient is negative, the following is obtained:
$e^{-b} = \frac{1}{e^{b}} \le 1$
i.e., the odds decrease by this factor if x increases by one unit. For b = 0, the factor is
1 and a change of x has no effect. These interpretations also apply to multiple logistic
regression when one of the predictors is changed and the others are kept constant.

Logit
The logit of a probability p is defined by:
 
$\text{logit}(p) \equiv \ln\left(\frac{p}{1-p}\right)$   (5.33)

Table 5.13  Effects of an increase in x by one unit (with positive and negative regression coefficients)

              An increase of x to x + 1 has the following effects
              b > 0                                  b < 0
p             An increase by roughly p(1 − p)b       A reduction by roughly p(1 − p)|b|
Odds          An increase by the factor e^b          A reduction by the factor e^−|b| = 1/e^|b|
Logit         An increase by the value of b          A reduction by the value of |b|
Odds ratio    e^b > 1                                e^b < 1

“Logit” is a short form for logarithmic odds (also log-odds) and can also be defined as16 :
 
$\text{logit}(p) \equiv \ln\left[\text{odds}(p)\right]$   (5.34)
By transforming the odds into logits, the range is extended to [−∞, +∞] (see Fig. 5.2,
right panel).
Following Eq. (5.31), the odds increase by the factor $e^{b}$ if x increases by one unit. Thus, the logits (log-odds) increase by

$\ln(e^{b}) = b$   (5.35)
So in our Model 3, the logits increase by b = 1.827 units if x increases by one unit (cf.
column 2c in Table 5.12).
This makes it easy to calculate with logits. The coefficient b represents the marginal
effect of x on logits, just as b is the marginal effect of x on Y in linear regression. If we
know the logits, we can calculate the corresponding probabilities, e.g. with Eq. (5.10):
$p(x) = \frac{1}{1 + e^{-z(x)}}$
Thus, logits are usually not computed from probabilities,17 as Eq. (5.33) might suggest.
Instead, logits are used for computing probabilities, e.g. with Eq. (5.10). Table 5.13 sum-
marizes the effects described above.

Odds Ratio and Relative Risk


The ratio of the two odds, as in Eq. (5.31), is called the odds ratio (OR). The odds ratio
is an important measure in statistics. It is usually formed by the odds of two separate

16 The name "logit" was introduced by Joseph Berkson in 1944, who used it as an abbreviation for
"logistic unit", in analogy to the abbreviation "probit" for "probability unit". Berkson contributed
strongly to the development and popularization of logistic regression.
17 That is the reason why we used the “equal by definition” sign in Eqs. (5.33) and (5.34).

groups (populations), e.g. men and women (or test group and control group), thus indi-
cating the difference between the two groups.
If we calculate the odds ratio for two values of a metric variable (as we have done
for Income above), its size depends on the unit of measurement of that variable and is
therefore not very meaningful. In the example above, we get OR ≈ 6 for an increase of
income x by one unit. The odds ratio is large because the unit of x is [1000 EUR].
The situation is different with binary variables, which can take only the values 0 and
1 and thus have no unit. In Model 4, we included the gender of the persons as a predictor
and estimated the following function:
$p = \frac{1}{1 + e^{-(a + b_1 x_1 + b_2 x_2)}} = \frac{1}{1 + e^{-(-5.635 + 2.351 x_1 + 1.751 x_2)}}$
The binary variable Gender indicates two groups, men and women.
With an average income of 2 [1000 EUR], the following probabilities may be calcu-
lated for men and women:
Men: $p_m = \frac{1}{1 + e^{-(-5.635 + 2.351 \cdot 2 + 1.751 \cdot 1)}} = 0.694$

Women: $p_w = \frac{1}{1 + e^{-(-5.635 + 2.351 \cdot 2 + 1.751 \cdot 0)}} = 0.283$
This results in the corresponding odds ratios:
$OR_m = \frac{\text{odds}_m}{\text{odds}_w} = \frac{p_m/(1-p_m)}{p_w/(1-p_w)} = \frac{2.267}{0.393} = 5.8$

$OR_w = \frac{\text{odds}_w}{\text{odds}_m} = \frac{p_w/(1-p_w)}{p_m/(1-p_m)} = \frac{0.393}{2.267} = 0.17$
A man’s odds for buying are roughly six times higher than a woman’s. A woman’s odds
are less than 20% of a man’s odds.18 This seems to be a very large difference between
men and women.
Another, similar measure for the difference of two groups is the relative risk (RR),
which is the ratio of two probabilities.19 Analogously to the odds ratios we obtain here:
$RR_m = \frac{p_m}{p_w} = \frac{0.694}{0.283} = 2.5$

$RR_w = \frac{p_w}{p_m} = \frac{0.283}{0.694} = 0.41$

18 Alternatively, we may calculate the odds ratios with Eq. (5.32): $OR_m = e^{b_2} = e^{1.751} = 5.76$
and $OR_w = e^{-b_2} = e^{-1.751} = 0.174$.
19 In common language, the term risk is associated with negative events, such as accidents, illness

or death. Here the term risk refers to the probability of any uncertain event.

According to this measure, a man is 2.5 times more likely to buy the chocolate than a
woman at the given income. The values of RR are significantly smaller (or, more gener-
ally, closer to 1) than the values of the odds ratio OR and often come closer to what we
would intuitively assume. However, OR can also be used in situations where calculating
RR is not possible.20 The odds ratio, therefore, has a broader range of applications than
the relative risk.
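The probabilities, odds ratios and relative risks above can be reproduced with a few lines of Python. This is a sketch of our own (not the book's code); the coefficients are those of Model 4, and the function and variable names are chosen freely.

import math

a, b1, b2 = -5.635, 2.351, 1.751          # estimated coefficients of Model 4

def p(x1, x2):                            # buying probability for income x1 and gender x2
    return 1.0 / (1.0 + math.exp(-(a + b1 * x1 + b2 * x2)))

p_m, p_w = p(2, 1), p(2, 0)               # men and women at an income of 2 [1000 EUR]
odds_m = p_m / (1 - p_m)                  # approx. 2.27
odds_w = p_w / (1 - p_w)                  # approx. 0.39
print(p_m, p_w)                           # approx. 0.694 and 0.283
print(odds_m / odds_w, math.exp(b2))      # odds ratio OR_m: both roughly 5.8
print(p_m / p_w)                          # relative risk RR_m: roughly 2.5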

5.2.4 Checking the Overall Model

1 Model formulation

2 Estimation of the logistic regression function

3 Interpretation of the regression coefficients

4 Checking the overall model

5 Checking the estimated coefficients

Once we have estimated a logistic model, we need to assess its quality or goodness-of-fit
since no one wants to rely on a bad model. We need to know how well our model fits the
empirical data and whether it is suitable as a model of reality. For this purpose, we need
measures for evaluating the goodness-of-fit. Such measures are:

• likelihood ratio statistic,


• pseudo-R-square statistics,
• hit rate of the classification and the ROC curve.

In linear regression, the coefficient of determination R2 indicates the proportion of


explained variation of the dependent variable. Thus, it is a measure for the goodness-of-
fit that is easy to calculate and to interpret. Unfortunately, such a measure does not exist
for logistic regression, since the dependent variable is not metric. In logistic regression
there are several measures for the goodness-of-fit, which can be confusing.
Since the maximum likelihood method (ML method) was used to estimate the param-
eters of the logistic regression model, it seems natural to use the value of the maximized
likelihood or the log-likelihood LL (see Fig. 5.12) as a basis for assessing the goodness-
of-fit. And indeed, this is the basis for various measures of quality.

20 This can be the case in so-called case-control studies where groups are not formed by random
sampling. Thus the size of the groups cannot be used for the estimation of probabilities. Such stud-
ies are often carried out for the analysis of rare events, e.g. in epidemiology, medicine or biology
(cf. Agresti, 2013, pp. 42–43; Hosmer et al., 2013, pp. 229–230).

A simple measure that is used quite often is the value –2LL (= −2 · LL). Since LL is
always negative, −2LL is positive. A small value for −2LL thus indicates a good fit of
the model for the available data. The factor 2 is used because it yields a test statistic that is
approximately chi-square distributed (see Sect. 5.2.4.1).
For Model 4 with the systematic component z = a + b1 x1 + b2 x2 we get:

−2LL = 2 · 16.053 = 32.11 (5.36)


The absolute size of this value says little since LL is a sum according to Eq. (5.25). The
value of LL and thus −2LL therefore depends on the sample size N. Both values would,
therefore, double if the number of observations were doubled, without changing the esti-
mated values. The size of −2LL is comparable to the sum of squared residuals (SSR) in
linear regression, which is minimized by the OLS method (ordinary least squares). For
a perfect fit, both values have to be zero. The ML estimation can be performed by either
maximizing LL or minimizing −2LL.
The −2LL statistic can be used to compare a model with other models (for the
same data set). For Model 3, i.e., the simple logistic regression with only one predictor
(Income), the systematic component is reduced to z = a + b x and we get:

−2LL = 2 · 18.027 = 36.05


So if variable 2 (Gender) is omitted, the value of −2 LL increases from 32.11 to 36.05,
and thus the model fit is reduced.
An even simpler model results with the systematic component

z = a = 0.134
It yields: −2LL = 2 · 20.728 = 41.46.
This primitive model is called the null model (constant-only model, 0-model) and it
has no meaning by itself. But it serves to construct the most important statistic for testing
the fit of a logistic model, the likelihood ratio statistic.

5.2.4.1 Likelihood Ratio Statistic


To evaluate the overall quality of the model under investigation (the fitted or full model),
we can compare its likelihood with the likelihood of the corresponding 0-model. This
leads to the likelihood ratio statistic (the logarithm of the likelihood ratio):
 
$LLR = -2 \cdot \ln\left(\frac{\text{Likelihood of the 0-model}}{\text{Likelihood of the fitted model}}\right) = -2 \cdot \ln\left(\frac{L_0}{L_f}\right) = -2 \cdot (LL_0 - LL_f)$   (5.37)
with
LL0 maximized log-likelihood for the 0-model (constant-only model)
LLf maximized log-likelihood for the fitted model
Fig. 5.14 Log-likelihood values in the LR test: on the LL scale, 0 is the maximum attainable LL value, LLf is the maximum LL value considering all predictors, and LL0 is the maximum LL value of the 0-model for the given data set; the greater the distance between LLf and LL0, the better the model

The logarithm of the ratio of likelihoods is thus equal to the difference in the log-like-
lihoods. With the above values from our example for multiple logistic regression
(Model 4) we get:
LLR = −2 · (LL0 − LLf ) = −2 · (−20.728 + 16.053) = 9.35
Under the null hypothesis H0: β1 = β2 = … = βJ = 0, the LR statistic is approximately
chi-square distributed with J degrees of freedom (df).21 Thus, we can use LLR to test the
statistical significance of a fitted model. This is called the likelihood ratio test (LR test),
which is comparable to the F-test in linear regression analysis.22
The tabulated chi-square value for α = 0.05 and 2 degrees of freedom is 5.99. Since
LLR = 9.35 > 5.99, the null hypothesis can be rejected and the model is considered to be
statistically significant. The p-value (empirical significance level) is only 0.009 and the
model can be regarded as highly significant.23 Figure 5.14 illustrates the log-likelihood
values used in the LR test.

Comparison of Different Models


Modeling should always be concerned with parsimony. The LR test can also be used to
check whether a more complex model provides a significant improvement versus a sim-
pler model. In our example, we can examine whether the inclusion of further predic-
tors (e.g. age or weight) is justified because they would yield a better fit of the model.
Conversely, we can examine whether Model 4 has led to significant improvements com-
pared to Model 3 by including the variable Gender. To check this, we use the following
likelihood ratio statistic:

21 Thus, in SPSS the LLR statistic is denoted as chi-square. For the likelihood ratio test statistic see,
e.g., Agresti (2013, p. 11); Fox (2015, pp. 346–348).
22 For a brief summary of the basics of statistical testing see Sect. 1.3.

23 We can calculate the p-value with Excel by using the function CHISQ.DIST.RT(x;df). Here, we

get CHISQ.DIST.RT(9.35;2) = 0.009.


$LLR = -2 \cdot \ln\left(\frac{L_r}{L_f}\right) = -2 \cdot (LL_r - LL_f)$   (5.38)

with
LLr maximized log-likelihood for the reduced model (Model 3)
LLf maximized log-likelihood for the full model (Model 4)
With the above values we get:
LLR = −2 · (LLr − LLf ) = −2(−18.027 + 16.053) = 3.949
The LLR statistic is again approximately chi-square distributed, with the degrees of free-
dom resulting from the difference in the number of parameters between the two models.
In this case, with df = 1, we get a p-value of 0.047. Thus, the improvement of Model 4
compared to Model 3 is statistically significant for α = 0.05. A prerequisite for applying
the chi-square distribution is that the models are nested, i.e., the variables of one model
must be a subset of the variables of the other model.
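Both likelihood ratio tests can be reproduced with a short Python sketch of our own (not the book's code), assuming the SciPy package for the chi-square distribution; the log-likelihood values are those reported in the text.

from scipy.stats import chi2

LL0, LL3, LL4 = -20.728, -18.027, -16.053      # 0-model, Model 3, Model 4 (fitted)

# Test of Model 4 against the 0-model, Eq. (5.37): df = 2 estimated coefficients
LLR = -2 * (LL0 - LL4)
print(LLR, chi2.sf(LLR, df=2))                 # approx. 9.35 and p = 0.009

# Test of Model 4 against the nested Model 3, Eq. (5.38): df = 1 additional parameter
LLR_nested = -2 * (LL3 - LL4)
print(LLR_nested, chi2.sf(LLR_nested, df=1))   # approx. 3.95 and p = 0.047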

5.2.4.2 Pseudo-R-Square Statistics
There have been many efforts to create a measure of goodness-of-fit for logistic regression that is
similar to the coefficient of determination R2 in linear regression. These efforts
resulted in the so-called pseudo-R-square statistics. They resemble R2 insofar as

• they can only assume values between 0 and 1,


• a higher value means a better fit.

However, the pseudo-R2 statistics do not measure a proportion of explained variation. They are based on the
ratio of two likelihoods, like the likelihood ratio statistic.
There are three common versions of pseudo-R-square statistics.
(a) McFadden’s R2
 
$McF\text{-}R^2 = 1 - \frac{LL_f}{LL_0} = 1 - \frac{-16.053}{-20.728} = 0.226$   (5.39)
In contrast to the LLR statistic, which uses the logarithm of the ratio of likelihoods,
McFadden uses the ratio of the log-likelihoods. In case of a small difference between the
two log-likelihoods (of the fitted model and the null model), the ratio will be close to 1,
and McF − R2 thus close to 0. This means the estimated model is not much better than
the 0-model. Or, in other words, the estimated model is of no value.
If there is a big difference between the two log-likelihoods, it is exactly the other way
round. But with McFadden’s R2 it is almost impossible to reach values close to 1 with
empirical data. For a value of 1 (perfect fit), the likelihood would have to be 1, and thus
the log-likelihood, 0. The values are therefore in practice much lower than for R2. As
a rule of thumb, values from 0.2 to 0.4 can be considered to indicate a good model fit
(Louviere et al., 2000, p. 54).

(b) Cox & Snell R2

$R_{CS}^2 = 1 - \left(\frac{L_0}{L_f}\right)^{2/N} = 1 - \left(\frac{\exp(-20.728)}{\exp(-16.053)}\right)^{2/30} = 0.268$   (5.40)

The Cox & Snell R2 can take only values <1, as L0 will always be >0. Thus, it will
deliver values <1 even for a perfect fit.
(c) Nagelkerke’s R2

$R_{Na}^2 = \frac{R_{CS}^2}{1 - L_0^{\,2/N}} = \frac{0.268}{1 - \exp(-20.728)^{2/30}} = 0.358$   (5.41)

The pseudo-R2 according to Nagelkerke is based on the statistics of Cox and Snell. It
modifies it in such a way that a maximum value of 1 can be reached.
For our model, all three pseudo-R2 statistics provide rather low values, although the
model achieves quite a good fit and high significance. The values are below what would
be expected for R2 used in linear regression.
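The three pseudo-R-square statistics follow directly from the two log-likelihood values. The following lines are a sketch of our own in Python, using the values LL0 = −20.728, LLf = −16.053 and N = 30 from the text.

import math

LL0, LLf, N = -20.728, -16.053, 30

mcfadden = 1 - LLf / LL0                              # Eq. (5.39): approx. 0.226
cox_snell = 1 - math.exp(2 * (LL0 - LLf) / N)         # Eq. (5.40): approx. 0.268
nagelkerke = cox_snell / (1 - math.exp(2 * LL0 / N))  # Eq. (5.41): approx. 0.358
print(mcfadden, cox_snell, nagelkerke)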

5.2.4.3 Assessment of the Classification


The creation of classification tables, which we discussed above, is a further and particu-
larly easily interpretable way of assessing the goodness-of-fit of a model. Unfortunately,
this alternative approach does not always lead to consistent results. A model can show a
good fit but still provide poor predictions (classifications).
When assessing the hit rate of the classification achieved, the hit rate one would
expect if the elements were assigned purely randomly always has to be taken into
account. With two groups or events, a hit rate of 50% could also be expected by tossing
a coin or, for equal-sized groups, by just assigning all elements to one of the two groups.
Thus, for two groups a hit rate of 50% is of no value.
With this naive kind of classification an even higher hit rate is achieved if the size of
the groups is unequal. If, for example, the ratio of the two groups is 80 to 20, you would
achieve a hit rate of 80% if you assigned all elements to the larger group.
It should also be borne in mind that the hit rate is always too large if, as is common
practice, it is calculated based on the same sample that was used to estimate the logistic
regression function. Since the logistic regression function is estimated in such a way that it
fits the estimation sample as well as possible, a lower hit rate is to be expected when
it is applied to a different sample. This sampling effect decreases with the size of the
sample and increases with the number of variables in the model. That is why parsimony
is an important criterion in modeling.
An adjusted hit rate can be obtained by randomly dividing the available sample into
two sub-samples or sub-sets, a “training set” and a (usually smaller) “test set” (valida-
tion set, holdout set). The training set is used to estimate the logistic regression func-
tion, which is then used to classify the elements of the test set and calculate the hit rate.

However, this approach is only useful if a sufficiently large sample is available since the
reliability of the estimated logistic regression function decreases along with the size of
the training set. Besides, in this case the existing information is used only incompletely.
Better ways to achieve undistorted hit rates are provided by cross-validation meth-
ods (cf. Hastie et al., 2011, p. 241; James et al., 2014, p. 175) such as the leave-one-out
method (LOO). An element of the sample is singled out and classified using the logis-
tic regression function whose estimation is based on the other elements. This is then
repeated for all elements of the sample. In this way, an undistorted classification table
can be obtained by making full use of the available information. However, the method
is quite complex and therefore only practicable with small sample sizes. In our example,
using the LOO method yields 3 hits less, reducing the hit rate from 83.3% to 73.3%.
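A leave-one-out hit rate can be sketched as follows. This is our own illustration (not the book's code) and assumes Python with NumPy and the statsmodels package; X is expected to be a NumPy array with the predictor columns (e.g. income and gender from Table 5.14) and y the observed choices. Applied to the book's data, the result should be close to the reported 73.3%, although details of the numerical ML estimation may cause small deviations.

import numpy as np
import statsmodels.api as sm

def loo_hit_rate(X, y, cutoff=0.5):
    hits, n = 0, len(y)
    for i in range(n):
        keep = np.arange(n) != i                         # leave observation i out
        fit = sm.Logit(y[keep], sm.add_constant(X[keep])).fit(disp=0)
        row = np.r_[1.0, X[i]].reshape(1, -1)            # constant plus predictors of case i
        p_i = fit.predict(row)[0]                        # predicted probability for case i
        hits += int((p_i > cutoff) == bool(y[i]))
    return hits / n

# Usage (with the data of Table 5.14): loo_hit_rate(np.column_stack([income, gender]), buy)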
With regard to assessing the quality of the underlying model, the problem arises that
the classification table and thus also the hit rate can change with a change in the selected
cutoff value. Therefore, we used the ROC curve (receiver operating characteristic) above,
which summarizes the classification tables for the possible cutoff values, as a generalized
concept. The area under the ROC curve, known as AUC, is a measure of the model’s pre-
dictive power (see Sect. 5.2.1.4).
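The AUC can also be computed as a rank statistic: it equals the proportion of all buyer/non-buyer pairs in which the buyer receives the higher predicted probability (ties counting one half). The following Python sketch is our own; the five observations at the end are made-up toy values for illustration only. Applied to the probabilities in Table 5.14, the function should give a value close to the reported AUC = 0.813.

def auc(y, p):
    pos = [pi for yi, pi in zip(y, p) if yi == 1]     # predicted probabilities of the buyers
    neg = [pi for yi, pi in zip(y, p) if yi == 0]     # predicted probabilities of the non-buyers
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 0, 1, 0, 1], [0.9, 0.3, 0.6, 0.5, 0.4]))   # toy data only: 0.833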

5.2.4.4 Checking for Outliers


Empirical data often contain one or more outliers, i.e., observations that deviate mark-
edly from the other data. Such outliers can change the fit of the model or the estimates
of the coefficients. Although logistic regression is considered to be relatively insensitive
to outliers (in contrast to regression analysis), it is always useful to control the data in
this respect. For this purpose, we first have to detect outliers and then have to investi-
gate whether the outliers are influential. If they are influential, further analysis becomes
necessary.
To detect outliers, we have to look at the residuals. In linear regression, the residuals
are calculated by ei = yi − ŷi (observed minus estimated values). Analogously, the residu-
als in logistic regression are calculated according to
ei = yi − pi

(observed values minus estimated probabilities). Since y can only take the values 0 or 1
and p can take values in the range from 0 to 1, the residuals will take values from −1 to
+1. As with linear regression, the sum of the residuals is equal to zero.
To judge the size of a residual, it is advantageous to standardize its value by dividing
it by its standard deviation. Since in logistic regression the observations yi follow a Bernoulli
distribution with variance pi(1 − pi), we get:
$z_i = \frac{y_i - p_i}{\sqrt{p_i (1 - p_i)}}$   (5.42)

With a large sample size N, the standardized residuals are approximately normally dis-
tributed with a mean value of 0 and a standard deviation of 1. They are also referred to

as Pearson residuals. Table 5.14 presents the calculated residuals and Fig. 5.15 shows a
scatterplot of the standardized residuals z (std. resid.).
The sum of the squared Pearson residuals gives the Pearson chi-square statistic:

$X^2 = \sum_{i=1}^{N} \frac{(y_i - p_i)^2}{p_i (1 - p_i)} = 30.039$   (5.43)

The Pearson chi-square statistic is usually calculated based on frequencies of grouped


data, for example, when contingency tables are evaluated. In the context of logistic
regression, it is also used as a measure of the goodness-of-fit (see Sect. 5.3.4). The value
of X2 here is close to the value 32.105 that we received for −2LL. Both −2LL and X2 are
comparable to the sum of squared residuals (SSR) in linear regression.
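The standardized (Pearson) residuals of Eq. (5.42) and the 2-sigma screening rule discussed below can be computed with a few lines of Python. This is a sketch of our own, assuming NumPy; y and p denote the observed outcomes and the estimated probabilities (e.g. the columns of Table 5.14).

import numpy as np

def pearson_residuals(y, p):
    y, p = np.asarray(y, float), np.asarray(p, float)
    return (y - p) / np.sqrt(p * (1 - p))              # Eq. (5.42)

def flag_outliers(y, p, threshold=2.0):
    z = pearson_residuals(y, p)
    return np.where(np.abs(z) > threshold)[0]          # indices of conspicuous observations

# Persons 2 and 23 from Table 5.14 (y = 0 with p = 0.844, y = 1 with p = 0.136):
print(pearson_residuals([0, 1], [0.844, 0.136]))       # approx. -2.33 and 2.52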

Detecting Outliers
Outliers can be identified by visual inspection of a scatterplot (like the one in Fig. 5.15)
or automatically by setting a cutoff value. In SPSS the default cutoff value is 2 [standard
deviations]. Observations with a standardized residual that is larger than 2 (in absolute
values) have a probability of occurrence below 5% (2-sigma rule). This can be seen as
a signal indicating the need for further investigation. Table 5.14 shows two such obser-
vations, namely persons 2 and 23, which can also be detected in Fig. 5.15 (the markers
outside of the two red lines).
We will take a closer look at observation 23 with a residual z = 2.522 [std. deviations].
This is a woman with an income of 1610 EUR, which is clearly below average. She
bought the gourmet chocolate although the results of our analysis indicate that the tested
chocolate is more likely to be bought by men with a higher income. In Fig. 5.15 this
observation is represented by the marker above the upper red line.

Influential Observations
Observations that have a strong influence on the estimated parameters are called influen-
tial observations. We want to investigate if person 23 is such an influential observation.
The easiest way to do so is to eliminate the conspicuous person from the data set and
repeat the estimation of the model.
Equation (5.22) above gives the estimated regression function for all data. Now we
compare this with the regression function we get after removing observation 23. The
results (expressed in logits) are:

a) Complete sample logit[p(x)] = –5.635 + 2.351 x1 + 1.751 x2


b) Person 23 removed logit[p(x)] = –7.998 + 3.203 x1 + 2.551 x2
As we can see, the effect is quite strong. All parameters increase considerably (in abso-
lute values). This shows that in a small data set an outlier can have a strong influence on
the result of the analysis.
Table 5.14  Residual analysis

Person     Income X1      Gender     Buy     Probability     Residual     Std. resid.
i          [1000 EUR]     X2         Y       p               y − p        z
1 2.53 0 1 0.578 0.422 0.855
2 2.37 1 0 0.844 −0.844 −2.326
3 2.72 1 1 0.925 0.075 0.285
4 2.54 0 0 0.583 −0.583 −1.183
5 3.20 1 1 0.974 0.026 0.162
6 2.94 0 1 0.782 0.218 0.528
7 3.20 0 1 0.869 0.131 0.389
8 2.72 1 1 0.925 0.075 0.285
9 2.93 0 1 0.778 0.222 0.534
10 2.37 0 0 0.484 −0.484 −0.969
11 2.24 1 1 0.799 0.201 0.501
12 1.91 1 1 0.647 0.353 0.738
13 2.12 0 1 0.343 0.657 1.385
14 1.83 1 1 0.603 0.397 0.811
15 1.92 1 1 0.653 0.347 0.730
16 2.01 0 0 0.287 −0.287 −0.635
17 2.01 0 0 0.287 −0.287 −0.635
18 2.23 1 0 0.796 −0.796 −1.973
19 1.82 0 0 0.205 −0.205 −0.508
20 2.11 0 0 0.338 −0.338 −0.714
21 1.75 1 1 0.557 0.443 0.891
22 1.46 1 0 0.389 −0.389 −0.798
23 1.61 0 1 0.136 0.864 2.522
24 1.57 1 0 0.452 −0.452 −0.908
25 1.37 0 0 0.082 −0.082 −0.299
26 1.41 1 0 0.362 −0.362 −0.752
27 1.51 0 0 0.111 −0.111 –0.353
28 1.75 1 1 0.557 0.443 0.891
29 1.68 1 1 0.516 0.484 0.968
30 1.62 0 0 0.139 −0.139 −0.401
Sum 16 16 0.0 0.0

Fig. 5.15 Using standardized residuals for identifying outliers

The second outlier here is observation 2 with a residual z = −2.326. This is a man
with an income above average who did not buy the chocolate. In Fig. 5.15 he is repre-
sented by the marker below the lower red line. After eliminating this observation we get:

c) Person 2 removed logit[p(x)] = −7.099 + 3.002 x1 + 2.386 x2

Here the effect is not quite so strong. The parameter values are somewhere between the
previous results. There are two reasons for this:

• the size of the residual is somewhat smaller,


• the observation has a lower leverage.

For calculating the influence of an outlier we can thus use the simplified formula:
influence = size × leverage
Size is a function of the dependent variable (here: y or p), and leverage is a function
of the independent variable(s) X (here: Income). Or more precisely, the leverage effect
depends on the distance of the x-value from the center. In Fig. 5.16 the center (mean
income x = 2115 EUR) is marked by the dashed line. The income of the woman (person
23) is clearly further away from the center than the income of the man (person 2). Thus,
the woman here has greater leverage than the man.
If influential outliers have been detected, further analysis is necessary. Values can be
incorrect due to mistakes in measurement or data entry. If this is the case, they should
be corrected (if possible) or eliminated. Other reasons may be unusual events inside or

Fig. 5.16 Scatter of standardized residuals against Income

outside the research context. In the first case, we should try to change the model specifi-
cation, in the second case, the outliers should be eliminated.
If we cannot identify any cause for the outliers, we have to assume that they are due
to random fluctuation. In this case, the outliers must be kept in the analysis. Dropping
outliers without good reason would constitute a manipulation of the regression results. A
value should only be eliminated if we know that it is incorrect and cannot be corrected.

5.2.5 Checking the Estimated Coefficients

1 Model formulation

2 Estimation of the logistic regression function

3 Interpretation of the regression coefficients

4 Checking the overall model

5 Checking the estimated coefficients

Besides the quality of a model, information on the influence and importance of individ-
ual predictors is usually of interest. For this purpose, we have to examine the estimated
coefficients.
In linear regression analysis, the t-test is commonly used for testing if a coefficient
differs significantly from 0 and is thus of importance. Alternatively, one can also use the
F-test. Both tests will provide identical results.

Two tests commonly used in logistic regression are the Wald test and the likelihood
ratio test.24 The latter one we already used to evaluate the overall quality of the model.
However, these two tests do not always provide the same results.

(a) Wald Test


The Wald test is similar to the t-test.25 The Wald statistic is
$W = \left(\frac{b_j}{SE(b_j)}\right)^2$   (5.44)

with SE(bj ) = standard error of bj (j = 0, 1, 2, …, J).


Formally, the Wald statistic is identical to the square of the t statistic. Using the null
hypothesis H0: βj = 0, it is asymptotically chi-square distributed with one degree of free-
dom (in contrast to the t statistic, which follows Student’s t-distribution). Table 5.15
shows the values of the Wald statistic and the corresponding p-values for our example
(Model 4). The coefficient of the variable Gender is not significant at α = 0.05.

(b) Likelihood Ratio Test


Analogous to the use of the likelihood ratio test for checking the overall quality of the
model, as shown above, the LR test can also be used to check an estimated regression
coefficient bj for significance. For this purpose, the log-likelihood of the fitted model LLf
is compared with the log-likelihood of the reduced model LL0j that we get after setting
bj = 0. So, LL0j is the maximum log-likelihood that we get after dropping predictor j from
the model and maximizing for the remaining parameters.
For checking the coefficient bj we use the likelihood ratio statistic
LLRj = −2 · (LL0j − LLf ) (5.45)
LLf is the value for the log-likelihood of the fitted model, which we already used for
checking the overall quality of the model.
Under the hypothesis H0: bj = 0, the likelihood ratio statistic LLRj is asymptotically
chi-square distributed with one degree of freedom. Table 5.16 shows the results for our
example.
A comparison of the results shows that the p-values in the likelihood ratio test are
generally lower than in the Wald test. At α = 0.05, all coefficients are significant accord-
ing to the likelihood ratio test, whereas the variable Gender is not significant in the Wald
test.

24 Both tests are used in SPSS, but the LR test is only used in the NOMREG procedure for multi-
nomial logistic regression, not in binary Logistic Regression.
25 Named after the Hungarian mathematician Abraham Wald (1902–1950). For the Wald test see

Agresti (2013, p. 10); Hosmer et al. (2013, pp. 42–44).



The likelihood ratio test is much more computationally expensive than the Wald test
since a separate ML estimation must be carried out for each of the coefficients. For this
reason, the Wald test is often preferred. It can, however, be misleading because it sys-
tematically provides larger p-values than the likelihood ratio test.26 Consequently, it may
fail to indicate the significance of a coefficient, i.e., fail to reject a false null hypothesis, as is the
case here for the variable Gender.
test. The Wald test should only be used for large samples, as in this case the results of the
two tests converge.
It should also be noted that the likelihood ratio test we used here for testing the sig-
nificance of b2 (coefficient of the variable Gender) is identical to the test we performed
according to Eq. (5.38) for the comparison of Model 4 with Model 3. The significance of
the improvement of Model 4 versus Model 3 by including the variable Gender is equal to
the significance of the coefficient of the variable Gender.

5.2.6 Conducting a Binary Logistic Regression with SPSS

In SPSS, as already mentioned, there are two procedures for performing logistic regres-
sion analyses, the procedure ‘Logistic Regression’ for binary logistic regression and the
procedure ‘Multinomial Logistic Regression’ (called NOMREG in SPSS). Both proce-
dures can be accessed via the menu item ‘Analyze/Regression’. Since the NOMREG pro-
cedure will be used for the case study in Sect. 5.4, we will here show extracts from the
output of the procedure ‘Logistic Regression’ for our exemplary dataset (the results above
were derived by MS Excel) and point out some differences between the procedures.27
The procedure ‘Logistic Regression’ can be reached via the menu items ‘Analyze/
Regression/Binary Logistic …’. When clicking on the latter item, the dialog box for
‘Logistic Regression’ shown in Fig. 5.17 opens. The binary variable ‘Buy’ is to be
entered as ‘Dependent’ and the variables ‘Income’ and ‘Gender’ are to be entered as
‘Covariates’.28
In the dialog box ‘Options’ (Fig. 5.18) a cutoff value for a classification table can be
specified. The default ‘Classification cutoff’ is 0.5. For obtaining a list of outliers, select
‘Casewise listing of residuals’. If you click on ‘Continue’ and ‘OK’, the data of Table 5.2
will result in the output shown in Fig. 5.19.

26 The reason is that the standard error becomes too large, especially if the absolute value of the
coefficient is large. This makes the Wald statistics too small and the p-value too large (as found by
Hauck & Donner, 1977). Agresti (2013, p. 169), points out that the likelihood ratio test uses more
information than the Wald test and is therefore preferable.
27 The user can find all Excel files used in this chapter on the website www.multivariate-methods.info.
28 Binary logistic regression can also be performed using the SPSS syntax as shown in Sect. 5.4.4

(Fig. 5.42).

Fig. 5.17 Dialog box: Logistic Regression

Fig. 5.18 Dialog box: Logistic Regression Options



Fig. 5.19 Global quality measures

Fig. 5.20 Estimated regression coefficients, results of the Wald test and effect coefficients (Exp (B))

In the upper part, below the heading “Omnibus Tests of Model Coefficients”,
Fig. 5.19 shows the result of the likelihood ratio test according to Eq. (5.37), namely the
value of LLR (in SPSS denoted as ‘Chi-square’) and the corresponding p-value (denoted
as ‘Sig.’ for significance level).
Below the heading “Model Summary”, three further measures of quality are given:

• the value of −2LL according to Eq. (5.36),


• the values of two pseudo-R2 statistics: Cox & Snell R2 and Nagelkerke’s R2 according
to the Eqs. (5.40) and (5.41).

McFadden’s R2 is only obtained by the procedure NOMREG. It is also stated in the out-
put that the ML estimate required 5 iterations.
Figure 5.20 shows the estimated regression coefficients and the results of the Wald
test and corresponds to Table 5.15. Besides, the effect coefficients calculated according
to Eq. (5.31) are listed in the rightmost column. In SPSS, the likelihood ratio test of the
Table 5.15  Testing the regression coefficients with the Wald test

             bj         SE        Wald      p-value
Constant     −5.635     2.417     5.436     0.020
Income       2.351      1.040     5.114     0.024
Gender       1.751      0.953     3.380     0.066

Table 5.16  Checking the regression coefficients with the likelihood ratio test
bj LL0j LLf LLRj p-value
Constant −5.635 −19.944 −16.053 7.783 0.005
Income 2.351 −19.643 −16.053 7.181 0.007
Gender 1.751 −18.027 −16.053 3.949 0.047

regression coefficients, as shown in Table 5.16, can only be obtained with the procedure
NOMREG.
Figure 5.21 shows the classification table and is consistent with the classification in
Table 5.11. But rows and columns are exchanged in SPSS according to the coding 0 and 1.
The cutoff value can be changed by the user. This is not possible in the procedure
NOMREG. Here, the default value is 0.5, which we also used above.
If you selected ‘Casewise listing of residuals’, you will now get the table in Fig. 5.22
which shows the two outliers (persons 2 and 23). For each person the following details
are indicated:

• ‘Observed’ response: Buy or No buy


• ‘Predicted’ probability p
• ‘Predicted group’: Buy or No buy (based on p and the specified cutoff value)
• Residuals: normal, standardized, and studentized residuals (the SResid follow a
Student’s t-distribution).

Fig. 5.21 Classification table



Fig. 5.22 List of outliers

To generate the ROC curve for assessing the classification table, the estimated proba-
bilities (as given in Table 5.14) must first be generated and saved in the work file. To
do this, select the option ‘Save’ in the dialog Box ‘Logistic Regression’ and then select
‘Probabilities’. This causes a variable ‘PRE_1’ to be created after the analysis has been
performed, which contains the estimated probabilities pi = P(Y = 1). The variable PRE_1
will appear in the work file.
Under the menu item ‘Analyze/ROC curve’ you reach the procedure ROC. There you
have to choose the variable ‘PRE_1’ as the ‘Test Variable’ and the variable ‘Buy’ as the
‘State Variable’. Besides, for ‘The Value of State Variable’ we must indicate the value of
Y to which the probabilities apply. In our case, this is 1 (= Buyer). For the display you
should select ‘With diagonal reference line’ and ‘Standard error and confidence interval’.
The output is shown in Figs. 5.23 and 5.24.

5.3 Multinomial Logistic Regression

If logistic regression is extended to more than two categories (groups, events), it is called
multinomial logistic regression. The dependent variable Y can now take the values g = 1,
…, G. As before, x = (x1, x2, …, xJ) is a set of values (vector) of the J independent varia-
bles (predictors).
Y now designates a multinomial random variable. Analogous to Eq. (5.3), we denote
the conditional probability P(Y = g|x) for the occurrence of event g, given values x by
πg (x) = P(Y = g|x ) (g = 1, . . . , G) (5.46)
and the following applies:
$\sum_{g=1}^{G} \pi_g(x) = 1$

For a better understanding of the model for multinomial logistic regression, we will write
the model for binary logistic regression in a somewhat different form. In Eq. (5.7) we
expressed the binary logistic regression model for the probability P(Y = 1|x) by

Fig. 5.23 ROC curve for Model 4

Fig. 5.24 Area under the ROC curve (AUC) with p-value and confidence interval
$\pi(x) = \frac{1}{1 + e^{-z(x)}} \quad \text{with} \quad z(x) = \alpha + \beta_1 x_1 + \cdots + \beta_J x_J$

Now we express the binary logistic regression model by two formulas, one for each
category:

$P(Y = 1|x): \quad \pi_1(x) = \frac{e^{z_1(x)}}{e^{z_1(x)} + e^{z_0(x)}}$   (5.47)

$P(Y = 0|x): \quad \pi_0(x) = \frac{e^{z_0(x)}}{e^{z_1(x)} + e^{z_0(x)}}$   (5.48)
Since the two probabilities must add up to 1, the parameters in one of the two equations
are redundant. Thus, to identify the parameters, we have to fix them in one equation.
We choose the second equation for this purpose. Setting the parameters in z0(x) to zero gives
z0(x) = 0 for all x.29 Then we can write:

$\pi_1(x) = \frac{e^{z_1(x)}}{e^{z_1(x)} + e^{z_0(x)}} = \frac{e^{z_1(x)}}{e^{z_1(x)} + e^{0}} = \frac{e^{z_1(x)}}{e^{z_1(x)} + 1} = \frac{1}{1 + e^{-z_1(x)}} \equiv \frac{1}{1 + e^{-z(x)}}$

$\pi_0(x) = \frac{e^{z_0(x)}}{e^{z_1(x)} + e^{z_0(x)}} = \frac{1}{1 + e^{z_1(x)}} \equiv \frac{1}{1 + e^{z(x)}}$
For π1 (x) we now get the same formula as above for π(x).
With the value z(x) = 0.313, which holds for person 1 in Table 5.14, we get

$\pi_1(x) = \frac{1}{1 + e^{-z(x)}} = \frac{1}{1 + e^{-0.313}} = 0.578$

$\pi_0(x) = \frac{1}{1 + e^{z(x)}} = \frac{1}{1 + e^{0.313}} = 0.422$

and thus π1 (x) + π0 (x) = 1.


The 0-category (for Y = 0), whose parameters are set to zero, is the reference category
(baseline category). For this category the calculation is not necessary, as we can get the
probability by:
π0 (x) = 1 − π1 (x).
Therefore we had not explicitly modeled the 0-category of the binary model.
In the binary logistic model the 0-category (for Y = 0) is always automatically chosen
as the reference category. In the multinomial logistic model, any of the G categories can
be chosen as the reference category.

29 An alternative is to center the parameters so that their sum across the two categories is zero.

5.3.1 The Multinomial Logistic Model

Now we can, analogous to Eq. (5.47), formulate the model of multinomial logistic
regression as follows:
$\pi_g(x) = \frac{e^{z_g(x)}}{\sum_{h=1}^{G} e^{z_h(x)}} \quad (g = 1, \ldots, G)$   (5.49)

or, in a simplified form:


$\pi_g = \frac{e^{z_g}}{e^{z_1} + e^{z_2} + \ldots + e^{z_G}}$

with $z_g = \alpha_g + \beta_{g1} \cdot x_1 + \ldots + \beta_{gJ} \cdot x_J$.
One of the G categories must be selected as the reference category (baseline cate-
gory). Usually, the last category G is chosen for this purpose. If we set the parameters of
category G to zero, we get:
$\pi_g(x) = \frac{e^{z_g(x)}}{1 + \sum_{h=1}^{G-1} e^{z_h(x)}} \quad (g = 1, \ldots, G-1)$   (5.50)

The probability of the reference category G is given by:

$\pi_G(x) = \frac{1}{1 + \sum_{h=1}^{G-1} e^{z_h(x)}} = 1 - \sum_{h=1}^{G-1} \pi_h(x)$   (5.51)

By inserting the systematic component into Eq. (5.50) we get the model of multinomial
logistic regression in the following form:
$\pi_g(x) = \frac{e^{\alpha_g + \beta_{g1} x_1 + \ldots + \beta_{gJ} x_J}}{1 + \sum_{h=1}^{G-1} e^{\alpha_h + \beta_{h1} x_1 + \ldots + \beta_{hJ} x_J}} \quad (g = 1, \ldots, G-1)$   (5.52)

The parameters of the categories g = 1 to G − 1 express the relative effect with respect
to the reference category G. If, for instance, the categories correspond to the chocolate
brands Alpia, Lindt, and Milka, then the parameters for Alpia and Lindt would express
the relative importance with respect to Milka. But, of course, we can change the order of
the brands and choose Alpia or Lindt as the reference category.
For each category of the multinomial logistic model (except the reference category
G), a logistic regression function must be formed according to Eq. (5.50). Each of these
G − 1 functions includes all parameters.

In total, (J + 1) × (G − 1) parameters have to be estimated; with two predictors and


three categories, this would be 3 × 2 = 6 parameters, and with 10 predictors this would
be 22 parameters. All parameters are estimated simultaneously.

Maximum Likelihood Method for Multinomial Logistic Regression


For the estimation of the parameters of the multinomial logistic model, the log-likeli-
hood function again has to be maximized over the observations i = 1, …, N:
$LL = \sum_{i=1}^{N} \sum_{g=1}^{G} \ln\left[p_g(x_i)\right] \cdot y_{gi} \;\rightarrow\; \text{Max!}$   (5.53)

with ygi = 1 if category g was observed in case i, and ygi = 0 in all other cases. The proba-
bilities are calculated as follows:
$p_g(x_i) = \frac{e^{a_g + b_{g1} x_{1i} + \ldots + b_{gJ} x_{Ji}}}{1 + \sum_{h=1}^{G-1} e^{a_h + b_{h1} x_{1i} + \ldots + b_{hJ} x_{Ji}}} \quad (g = 1, \ldots, G)$   (5.54)

with aG = bG1 = … = bGJ = 0.

5.3.2 Example and Interpretation

To illustrate the multinomial logistic regression, we will modify our example above.

Example
Two chocolate types A and B will now be offered to the test persons, resulting in a
total of three response categories30 :

• g = 1: Buying A
• g = 2: Buying B
• g = 3: No buy

Only the variable ‘Income’ will be considered as a predictor. ◄

According to Eq. (5.50) we want to estimate the probabilities for a certain income x:

30 For this example we used a second data set with 50 observations.


Buying A: $p_1(x) = \frac{e^{a_1 + b_1 x}}{1 + \sum_{h=1}^{2} e^{a_h + b_h x}} = \frac{e^{z_1(x)}}{1 + \sum_{h=1}^{2} e^{z_h(x)}}$

Buying B: $p_2(x) = \frac{e^{a_2 + b_2 x}}{1 + \sum_{h=1}^{2} e^{a_h + b_h x}} = \frac{e^{z_2(x)}}{1 + \sum_{h=1}^{2} e^{z_h(x)}}$

No buy: $p_3(x) = \frac{1}{1 + \sum_{h=1}^{2} e^{a_h + b_h x}} = 1 - (p_1 + p_2)$

(J + 1) × (G − 1) = 2 × 2 = 4 parameters are to be estimated. Each logistic regression


function encompasses all four parameters.
The estimated values of the four parameters are:
Buying A : a1 = −22.418, b1 = 6.697
Buying B: a2 = −7.929, b2 = 2.772
For the reference category g = 3, we set:
No buy: a3 = 0, b3 = 0
The negative signs of the constants a1 and a2 indicate that people with no or only a
low income will choose the No buy alternative. People will buy chocolate only if their
income exceeds a certain threshold. At first they will buy only chocolate B; with higher
incomes they will increasingly also buy chocolate A.
The positive coefficients b1 and b2 tell us that the buying probabilities for both choco-
late types will increase with rising income. The influence of income on buying chocolate
A is stronger than for chocolate B.
With the estimated parameters we can now calculate the probabilities for any income.
For, e.g., an income of 3000 EUR we get the following logits and the correspond-
ing exponential values (for numerical reasons we have to enter the income in units of
1000 EUR):
Buying A: z1 = −22.42 + 6.70 · 3 = −2.329 → exp(z1) = 0.097
Buying B: z2 = −7.93 + 2.77 · 3 = 0.387 → exp(z2) = 1.473
No buy: z3 = 0 → exp(z3) = 1
This yields the probabilities

Buying A: $p_1 = \frac{0.097}{1 + 0.097 + 1.473} = 0.04$

Buying B: $p_2 = \frac{1.473}{1 + 0.097 + 1.473} = 0.57$

No buy: $p_3 = \frac{1}{1 + 0.097 + 1.473} = 0.39$

In Table 5.17 the probabilities of buying are compared for three different incomes. With
a low income of 2000 EUR, the probability is highest for No buy, with a medium income
of 3000 EUR for Chocolate B, and with a high income of 4000 EUR for chocolate A.
The sum of the three probabilities must always be 1. The diagram in Fig. 5.25 shows the
Table 5.17  Estimated probabilities for different incomes

            Probabilities
Income      Buying A     Buying B     No buy     Sum
2000        0.00         0.08         0.92       1.00
3000        0.04         0.57         0.39       1.00
4000        0.76         0.23         0.01       1.00


Fig. 5.25 Probabilities of buying as a function of income

probabilities for incomes between 1500 and 5000 EUR and illustrates the nonlinearity of
logistic regression.
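The probabilities in Table 5.17 (and the curves in Fig. 5.25) can be reproduced from the estimated parameters with a small Python sketch of our own; the dictionaries and function names below are chosen freely.

import math

a = {1: -22.418, 2: -7.929, 3: 0.0}       # constants: Buying A, Buying B, No buy (reference)
b = {1: 6.697, 2: 2.772, 3: 0.0}          # income coefficients

def probabilities(x):                      # x = income in units of 1000 EUR, Eq. (5.49)
    expz = {g: math.exp(a[g] + b[g] * x) for g in (1, 2, 3)}
    total = sum(expz.values())
    return {g: expz[g] / total for g in (1, 2, 3)}

for income in (2.0, 3.0, 4.0):
    print(income, {g: round(p, 2) for g, p in probabilities(income).items()})
# rounded values as in Table 5.17: (0.00, 0.08, 0.92), (0.04, 0.57, 0.39), (0.76, 0.23, 0.01)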

5.3.3 The Baseline Logit Model

We now come back to the simple binary logistic regression with the two categories
1 = Buy and 0 = No buy. According to Eq. (5.12) we can express the logistic function for
probability P(Y = 1|x) in logit form:
 
$\text{logit}\left[p(x)\right] \equiv \ln\left(\frac{p(x)}{1 - p(x)}\right) = a + b\,x$
For obtaining the logit we only have to determine the systematic component z for a given
value x.
If we now change the coding of the second category to 2 = No buy and choose cat-
egory 2 explicitly as the baseline (reference category), the binary model can be formu-
lated as a so-called baseline logit model:
$\ln\left(\frac{P(Y = 1|x)}{P(Y = 2|x)}\right) = \ln\left(\frac{p_1(x)}{p_2(x)}\right) = a + b\,x$   (5.55)
Although practically nothing changes, this formulation can easily be extended to a multi-
nomial baseline logit model.
The multinomial logistic model can be represented by a set of binary logistic equa-
tions for pairs of categories. The multinomial baseline logit model comprises a set of
G − 1 logit equations:
 
$\ln\left(\frac{p_g(x)}{p_G(x)}\right) = z_g \quad (g = 1, \ldots, G-1)$   (5.56)

with the systematic components $z_g = a_g + b_{g1} x_1 + \ldots + b_{gJ} x_J$.
Each equation describes the effects of the predictors on the dependent variable rel-
ative to the baseline category. While the calculation of a probability always requires
all parameters (for all categories), a baseline logit requires only the parameters for the
respective category.
For G categories there are $\binom{G}{2} = \frac{(G-1) \cdot G}{2}$ pairs of categories.
A subset of G − 1 baseline logits contains all information of the multinomial logistic
model. The rest is redundant. For G = 3 there are 3 possible pairs (baseline logits), of
which one is redundant.
In our example with G = 3 categories and choosing G as the baseline category, the fol-
lowing two equations result:
 
$\ln\left(\frac{p_1(x)}{p_3(x)}\right) = a_1 + b_1 x$

$\ln\left(\frac{p_2(x)}{p_3(x)}\right) = a_2 + b_2 x$

For a person with an income of 3000 EUR, the following logits are obtained with the
above parameters for the two chocolate types A and B:
 
$\ln\left(\frac{p_1(x)}{p_3(x)}\right) = -22.42 + 6.70 \cdot 3 = -2.33$

$\ln\left(\frac{p_2(x)}{p_3(x)}\right) = -7.93 + 2.77 \cdot 3 = 0.387$

From this we can derive the odds with Eq. (5.11). The chances that a person with an
income of 3000 EUR will buy chocolate A (g = 1) compared to not buying (g = 3) are:

$\text{odds}_1 = e^{z_1} = e^{-2.33} = 0.10$


i.e., the odds are 1 to 10.

The chances of buying chocolate B compared to not buying are:

$\text{odds}_2 = e^{z_2} = e^{0.387} = 1.47$


i.e., the odds are about 3 to 2.
Given the logit (systematic component zg) for a category g, it is thus quite easy to cal-
culate the odds relative to the reference category. One just has to evaluate the exponential
function exp(zg).
If we want to know the chances of someone with a higher income, e.g. 3500 EUR,
buying chocolate A, we calculate:
z1 = −22.42 + 6.70 · 3.5 = 1.02 and exp(1.02) = 2.77.
This means the odds are almost 3 to 1. Thus, in our example, the odds increase consider-
ably with the income rising from 3000 to 3500 as can also be seen in Fig. 5.25.

Comparing Categories
For each other pair of categories (none of which is the reference category), we get the
logit by calculating:
     
$\ln\left(\frac{p_g(x)}{p_h(x)}\right) = \ln\left(\frac{p_g(x)}{p_G(x)}\right) - \ln\left(\frac{p_h(x)}{p_G(x)}\right) = z_g - z_h$   (5.57)
And we get the odds by calculating:
$\text{odds}_{gh} = e^{z_g - z_h}$   (5.58)
So, for a person with an income of 3000 EUR we get:
z1 − z2 = −2.33 − 0.387 = −2.72
and:

$\text{odds}_{12} = e^{z_1 - z_2} = e^{-2.72} = 0.066 \approx 1/15$


The odds that this person will buy chocolate A instead of chocolate B are only 1 to 15.
The same value can also be obtained by using the probabilities p1/p2 = 0.038/0.573.
However, calculating probabilities takes much more computing effort than calculating
the odds. Switching the two categories (events) reverses the sign of the z-values and inverts
the odds.31

31 InSPSS (procedure NOMREG) the user can choose any category as the reference category and
thus determine the odds using the baseline logit model. This is done in the dialog box by choosing
the option “Reference Category” and “Custom”. By default, the last category G is chosen. The
category with the lowest coding is chosen if the user chooses the category order “Descending” (the
default setting is “Ascending”).
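The odds derived from the baseline logits can be sketched in a few lines of Python (our own illustration, using the estimated parameters for an income of 3 [1000 EUR]).

import math

a1, b1 = -22.418, 6.697      # Buying A (relative to the reference category No buy)
a2, b2 = -7.929, 2.772       # Buying B (relative to the reference category No buy)
x = 3.0                      # income in units of 1000 EUR

z1, z2 = a1 + b1 * x, a2 + b2 * x
print(math.exp(z1))          # approx. 0.10: odds of Buying A vs. No buy (1 to 10)
print(math.exp(z2))          # approx. 1.47: odds of Buying B vs. No buy (about 3 to 2)
print(math.exp(z1 - z2))     # approx. 0.066: odds of Buying A vs. Buying B, Eq. (5.58)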
Table 5.18  Comparison of models

Comparison of models     LLR chi-square     p [%]     Hit rate [%]     McF       C&S       Na
Model 3, N = 30          9.350              0.9       83.3             0.226     0.268     0.358
Model 4, N = 50          67.852             0.0       88.0             0.742     0.743     0.885

5.3.4 Measures of Goodness-of-Fit

To check the goodness-of-fit of a binary logistic model we used (see Sect. 5.2.4):

• likelihood ratio statistic LLR (called chi-square in SPSS)


• hit rate
• pseudo-R-square statistics (which are based on the likelihood ratio)
a) McFadden’s R2 (McF)
b) Cox & Snell-R2 (C&S)
c) Nagelkerke’s R2 (Na)

These statistics and corresponding tests can also be used for the multinomial logistic
model. In Table 5.18 we compare the binary logistic model (Model 3) with the covariates
Income and Gender for N = 30 and the corresponding multinomial logistic model (Model
4) for N = 50 with respect to these measures. With the multinomial model, we get a much
better fit.
Besides these measures, SPSS provides two further measures for the “goodness-of-
fit” for multinomial logistic regression:

• Pearson goodness-of-fit measure,


• deviance.

And additionally, SPSS also provides information criteria for model selection.
Unlike the LLR statistic and the pseudo-R square statistics, which become larger
with a better fit, the opposite is true for the Pearson statistic and the deviance. They are
“inverse” measures or actually “badness-of-fit measures” since they measure the lack of
fit. With better fit, they become smaller and, in extreme cases, even zero. In hypothesis
tests, therefore, the acceptance of the null hypothesis is desired, not the rejection. And a
larger p-value is better.

5.3.4.1 Pearson Goodness-of-Fit Measure


The Pearson goodness-of-fit measure is an application of Pearson’s chi-square statistic in
logistic regression. We already mentioned Pearson’s chi-square statistic in Sect. 5.2.4.4
on residual analysis. The measure according to Eq. (5.43) differs from the Pearson
goodness-of-fit measure. It is not truly chi-square distributed (cf. Agresti, 2013, p. 138;
Hosmer & Lemeshow, 2000, p. 146). The approximate chi-square distribution is only
obtained with frequency data as used, e.g., in contingency analysis (see Chap. 6). For
frequency data, Pearson's chi-square statistic X^2 is calculated by

X^2 = \sum_{\text{cells}} \frac{(\text{observed frequencies} - \text{expected frequencies})^2}{\text{expected frequencies}}

In the SPSS procedure NOMREG for multinomial logistic regression, Pearson’s chi-
square statistic is calculated by:
X^2 = \sum_{i=1}^{I} \sum_{g=1}^{G} \frac{(m_{ig} - e_{ig})^2}{e_{ig}} = \sum_{i=1}^{I} \sum_{g=1}^{G} r_{ig}^2 \qquad (5.59)

where
m_{ig}: observed frequency, i.e. the number of events (e.g. buying) in cell ig
e_{ig}: expected frequency
The values rig are called Pearson residuals. X2 is approximately chi-square distributed for
sufficiently large observed frequencies mig with df = I (G − 1) − (J + 1).
In logistic regression the expected frequencies are calculated by using the derived
probabilities:
e_{ig} = n_i \cdot p_{ig} \quad \text{with} \quad p_{ig} = \frac{e^{z_{ig}}}{\sum_{h=1}^{G} e^{z_{ih}}} \qquad (5.60)

where ni is the number of cases in cell i. This is different from contingency analysis.

• In contingency analysis, X2 serves as a measure of differences (deviations of the


observed from the expected frequencies). Thus, the user usually wants a large value.
• In logistic regression, X2 serves as a measure for the goodness-of-fit. For this pur-
pose, it uses probabilities pig to calculate expected frequencies that come close to the
observed frequencies. The user usually wants a small value.

Example
As a simple example, we analyze the relationship between Buying and Gender, using
our data from Table 5.2. By counting, we get the frequencies listed in Table 5.19. ◄

Table 5.19 Counted frequencies

                 Buy (g = 1)   No buy (g = 2)   Sum
  Man: i = 1     10            5                15
  Woman: i = 2   6             9                15
  Sum            16            14               30

Table 5.20 Calculation of Pearson's chi-square in logistic regression

  Gender i   Group g      Cases n_i   Observed m_ig   Prob p_ig   Expected e_ig = n_i · p_ig   r_ig²
  1          1 (Buy)      15          10              0.667       10.00                        0.0
  2          1 (Buy)      15          6               0.400       6.00                         0.0
  1          2 (No buy)   15          5               0.333       5.00                         0.0
  2          2 (No buy)   15          9               0.600       9.00                         0.0
                                                                  Chi-square:                  0.0

With the derived probabilities according to Eq. (5.60) we calculate the expected frequen-
cies shown in Table 5.20. With X2 = 0, Pearson’s goodness-of-fit measure indicates a per-
fect fit between the observed and the expected frequencies.32
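This calculation can be reproduced with a few lines of Python (a minimal sketch based on the frequencies of Table 5.19; as noted in footnote 32, the predicted probabilities here equal the observed relative frequencies):

import numpy as np

# Observed frequencies m_ig from Table 5.19 (rows: men, women; columns: buy, no buy)
m = np.array([[10.0, 5.0],
              [6.0, 9.0]])
n = m.sum(axis=1, keepdims=True)       # cases per covariate pattern: 15 men, 15 women

p = m / n                              # predicted probabilities (0.667, 0.333, 0.400, 0.600)
e = n * p                              # expected frequencies e_ig = n_i * p_ig, Eq. (5.60)

pearson_x2 = ((m - e) ** 2 / e).sum()  # Pearson chi-square, Eq. (5.59)
print(pearson_x2)                      # 0.0, i.e. a perfect fit as in Table 5.20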
Each row in Table 5.20 corresponds to a cell of the 2 × 2 contingency table. By add-
ing rows, the table can be extended to any number of cells.
A cell is defined by

a) a covariate pattern (subpopulation) i (i = 1, …, I)


b) a response category (group) g (g = 1, …, G).

A covariate pattern is an observed combination of values of the independent variables:


xi = (x1i , x2i , . . . , xJi ) (i = 1, . . . , I ≤ N)
Each covariate pattern defines a subpopulation of the total sample. In our simple exam-
ple above we have only two such subpopulations, men and women. And we have two
response categories (Buy and No buy). So, the number of cells is: I × G = 2 · 2 = 4.
If we have metric (continuous) covariates, we get many different covariate patterns. In
the extreme, any covariate pattern is distinct from the others and the number of covariate
patterns equals the sample size (I = N). All case numbers will be 1 and most cells won’t
contain any events (buying). I × (G − 1) cells will be empty with mig = 0.

32 ForX2 = 0, the p-value has to be 1.0, but it cannot be calculated as there are no degrees of free-
dom for this model. It just serves to demonstrate the principle of the calculation. The predicted
(expected) probabilities here are equal to the relative frequencies of the observed values in the
respective subpopulation, i.e. for men or for women.

Thus, in case of metric covariates (predictors), the calculation of Pearson’s chi-square


statistic is usually not possible or makes no sense. X2 will only follow a chi-square dis-
tribution if multiple events are observed for each covariate pattern (i.e., if the mig are
sufficiently large).
If a model contains metric (continuous) predictors and the ‘Goodness-of-fit’ statistics
are requested in NOMREG, SPSS usually issues a warning informing the user that there
are ‘cells with zero frequencies’.

5.3.4.2 Deviance
The calculation of the deviance is based on the same cells as the calculation of Pearson
goodness-of-fit measure. The values of these two measures are usually very similar. In
SPSS, deviance is calculated by (cf. IBM Corporation, 2022, pp. 768):
X^2 = 2 \sum_{i=1}^{I} \sum_{g=1}^{G} m_{ig} \cdot \ln\left(\frac{m_{ig}}{n_i \cdot p_{ig}}\right) \qquad (5.61)

This shows the similarity to Pearson’s goodness-of-fit measure (see Eq. 5.59). Both
measures are approximately chi-square distributed with df = I (G − 1) − (J + 1) for suffi-
ciently large observed frequencies mig.
Both measures face the same problems if the number of cells becomes large and the
number of observed events mig within the cells becomes small. For empty cells (with
mig = 0) a calculation is not possible, as the logarithm of zero is not defined. Thus, for
models with metric predictors, we have the same objection against the use of deviance as
for Pearson’s goodness-of-fit measure.

The Meaning of Deviance


In general, deviance plays an important role as a goodness-of-fit measure in logistic
regression and related methods (cf. Agresti, 2013, pp. 116, 136 ff.; Hosmer et al., 2013,
pp. 13, 145 ff.) Thus, we will give a brief description. Deviance can be defined as a like-
lihood ratio:
 
D = -2 \cdot \ln\left(\frac{\text{maximum likelihood of the fitted model}}{\text{maximum likelihood of the saturated model}}\right) \qquad (5.62)
Shortened, we can write
 
D = -2 \cdot \ln\left(\frac{L_f}{L_s}\right) = -2 \cdot (LL_f - LL_s)

Deviance compares the fitted model with a so-called saturated model, measuring the
deviation from this model. That is where the name comes from.
The saturated model is the “best possible” model concerning fit. This model has a
separate parameter for each observation and therefore achieves a perfect fit. But this is
not a good model with regard to parsimony, as it does not simplify reality. Thus, the sat-
urated model is not a useful model. It just serves as a baseline for a comparison with the
fitted model. In linear regression, a saturated model for N observations would be a model
with J = N − 1 independent variables, e.g. a simple regression with two observations.
There is a similarity between deviance and the likelihood ratio statistic LLR. In Eq.
(5.37) we defined LLR as the difference of the log-likelihood of the fitted model and the
log-likelihood of a null model:
 
LLR = -2 \cdot \ln\left(\frac{L_0}{L_f}\right) = -2 \cdot (LL_0 - LL_f)
Thus, LLR measures the deviation from the “worst possible” model, the 0-model. The
greater the deviation, the greater the goodness-of-fit.
Deviance measures the deviation from the “best possible” model. The greater
the deviation, the worse the model. Thus, deviance measures the lack of fit, just like
Pearson’s goodness-of-fit measure. They are both inverse goodness-of-fit measures and
give similar results.
For models with individual (casewise) data, the likelihood of the saturated model is 1
for each observation and thus the sum of the logarithms will be zero. The deviance then
degenerates to:
D = −2 · LLf
This is the −2 log-likelihood statistic (−2LL) that we find as a summary statistic in
SPSS outputs.
The deviance or the −2LL statistic play the same role in logistic regression as the
sum of squared residuals (SSR) in linear regression. In linear regression, the parameters
are estimated by minimizing SSR, in logistic regression the parameters are estimated by
minimizing −2LL.
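The analogy can be made concrete with a minimal sketch that estimates a binary logistic model by numerically minimizing −2LL; the few data points below are purely illustrative and are not taken from the example data set:

import numpy as np
from scipy.optimize import minimize

# Illustrative data: income (in 1000 EUR) and buy (1) / no buy (0)
x = np.array([2.5, 2.4, 2.7, 3.2, 1.6, 3.5, 2.9, 1.8])
y = np.array([1,   1,   0,   1,   0,   1,   0,   0])

def neg2ll(params):
    a, b = params
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))   # logistic response probabilities
    p = np.clip(p, 1e-10, 1 - 1e-10)         # guard against log(0)
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg2ll, x0=[0.0, 0.0], method="Nelder-Mead")
print(res.x)    # ML estimates of intercept and slope
print(res.fun)  # -2LL of the fitted model (equals the deviance for casewise data)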

5.3.4.3 Information Criteria for Model Selection


If a model is extended by the inclusion of additional predictors, −2LL decreases (for
given data), just as the sum of the squared residuals (SSR) decreases in linear regression.
With more parameters the model will better fit the data of the sample. However, this does
not mean that the model will be better since any model should actually reflect the reality
(population) and not just the data of the sample.
A simpler model with an acceptable fit will therefore usually provide better pre-
dictions for cases outside of the sample than a more complex model that makes better
predictions for cases within the sample. Therefore, parsimony is an important criterion
for modeling.
Thus, in addition to significance tests (e.g., the likelihood ratio test), further criteria
have been developed that are useful for comparing and selecting models with different
numbers of variables. These include the Akaike information criterion (AIC) and the
Bayesian information criterion (BIC). As with the adjusted coefficient of determination
in linear regression, increasing model complexity is penalized by a correction term that
is added within the information criteria. A model with a smaller value of the information
criterion is the better model.
Akaike information criterion (AIC)
AIC = −2LL + 2 · number of parameters (5.63)
Bayesian information criterion (BIC)
BIC = −2LL + ln(N) · number of parameters (5.64)
with

N: sample size
number of parameters: [(G − 1) · (J + 1)] (= degrees of freedom)
J: Number of predictors
G: number of categories of the dependent variable

In the example for multinomial logistic regression (Sect. 5.3.1) with G = 3 and N = 50,
we get – with only one predictor, Income – the value 45.5 for −2LL. The number of
parameters (including an intercept) is:
[(G − 1) · (J + 1)] = [(3 − 1) · (1 + 1)] = 4
We get:
AIC = 45.5 + 2 · 4 = 45.5 + 8 = 53.5
BIC = 45.5 + ln(50) · 4 = 45.5 + 15.6 = 61.1
If we now include the variable Gender, the value of −2LL decreases to 23.6. But the
number of parameters increases to 6 and, with this, the size of the penalization:
AIC = 23.6 + 2 · 6 = 23.6 + 12 = 35.6
BIC = 23.6 + ln(50) · 6 = 23.6 + 23.5 = 47.1
Both measures are reduced by the inclusion of the additional variable Gender. Here the
reduction in −2LL overcompensates the penalty effect. This model extension is therefore
advantageous.
AIC and BIC are only suitable for comparing models based on the same data set. The
measures do not make a statement about the absolute quality of the compared models,
they just indicate which one is better (the one with the lowest value).
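Since only −2LL, the sample size, and the number of parameters are needed, the two criteria are easy to compute; the following short sketch reproduces the numbers above:

import numpy as np

def aic_bic(neg2ll, n_obs, n_groups, n_predictors):
    # number of estimated parameters according to Eqs. (5.63) and (5.64)
    k = (n_groups - 1) * (n_predictors + 1)
    return neg2ll + 2 * k, neg2ll + np.log(n_obs) * k

print(aic_bic(45.5, 50, 3, 1))   # Income only:        approx. (53.5, 61.1)
print(aic_bic(23.6, 50, 3, 2))   # Income and Gender:  approx. (35.6, 47.1)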

However, AIC and BIC do not always lead to the same results. As can be seen, the
penalty effect is greater for BIC than for AIC. Therefore, BIC favors more parsimo-
nious models. Which one of the two criteria is “more correct” cannot be decided une-
quivocally. The larger the sample, the more likely the best model will be selected with
BIC. For small samples, on the other hand, there is a risk that by using BIC a too simple
model will be selected (cf. Hastie et al., 2011, p. 235).

5.4 Case Study

5.4.1 Problem Definition

We will now use a larger sample related to the chocolate market to demonstrate how to
conduct a multinomial logistic regression with the help of SPSS.33
The manager of a chocolate company wants to know how consumers evaluate dif-
ferent chocolate flavors with respect to 10 subjectively perceived attributes since he
assumes this information to be strategically important for differentiating his offerings
from those of his competitors and for positioning new products. For this purpose, the
manager has identified 11 flavors and has selected 10 attributes that appear to be relevant
for the evaluation of these flavors.
First, a small pretest with 18 test persons is carried out. The persons are asked to eval-
uate the 11 flavors (chocolate types) with respect to the 10 attributes (see Table 5.21).34
A seven-point rating scale (1 = low, 7 = high) is used for each attribute. Thus, the inde-
pendent variables are perceived attributes of the chocolate types.
However, not all persons evaluated all 11 flavors. Thus, the data set contains only 127
evaluations instead of the complete number of 198 evaluations (18 persons × 11 flavors).
Any evaluation comprises the scale values of the 10 attributes for a certain flavor of one
respondent. It reflects the subjective assessment of a specific chocolate flavor by a par-
ticular test person. Since each test person assessed more than just one flavor, the obser-
vations are not independent. Yet in the following we will treat the observations as such.
Of the 127 evaluations, only 116 are complete, while 11 evaluations have missing val-
ues.35 We exclude all incomplete evaluations from the analysis. Consequently, the num-
ber of cases is reduced to 116.

33 We will use the dataset introduced in Chap. 4 for discriminant analysis in order to better illus-
trate similarities and differences between the two methods.
34 On the website www.multivariate-methods.info, we provide supplementary material (e.g., Excel

files) to deepen the reader's understanding of the methodology.


35 Missing values are a frequent and unfortunately unavoidable problem when conducting surveys

(e.g. because people cannot or do not want to answer some question(s), or as a result of mistakes
by the interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.

Table 5.21 Chocolate flavors and perceived attributes examined in the case study

  Chocolate flavor    Perceived attribute
  1  Milk             1  Price
  2  Espresso         2  Refreshing
  3  Biscuit          3  Delicious
  4  Orange           4  Healthy
  5  Strawberry       5  Bitter
  6  Mango            6  Light
  7  Cappuccino       7  Crunchy
  8  Mousse           8  Exotic
  9  Caramel          9  Sweet
  10 Nougat           10 Fruity
  11 Nut

Table 5.22 Definition of the segments (groups) for multinomial logistic regression

  Group (segment)         Chocolate flavors in the segment               Cases (n)
  g = 1 | Seg_1 Classic   Milk, Biscuit, Mousse, Caramel, Nougat, Nut    65
  g = 2 | Seg_2 Fruit     Orange, Strawberry, Mango                      28
  g = 3 | Seg_3 Coffee    Espresso, Cappuccino                           23

To investigate the differences between the various flavors, 11 groups could be con-
sidered, with each group representing one chocolate flavor. For the sake of simplicity,
the 11 products were grouped into three market segments. This was done by applying a
cluster analysis (see the results of the cluster analysis in Sect. 8.3). Table 5.22 shows the
composition of the three segments and their sizes. The size of the segment ‘Classic’ is
more than twice the size of the segments ‘Fruit’ and ‘Coffee’.
The manager now wants to use logistic regression to answer the following questions:

• How do the market segments differ concerning the selected attributes?


• Which attributes discriminate best between the market segments?
• How can we assign new products to the market segments?
• Can we find a more parsimonious set of attributes for discriminating between the
segments?

5.4.2 Conducting a Multinomial Logistic Regression with SPSS

Here we show how to run the procedure for multinomial logistic regression (NOMREG)
of SPSS via the graphical user interface (GUI). Figure 5.26 shows the Data editor with the

Fig. 5.26 Data editor with the selection of the procedure ‘Multinomial Logistic Regression’ (NOMREG)

work file that contains our data, and the pulldown menus for ‘Analyze’ and ‘Regression’.
The 127 rows of the table contain the evaluations (cases) and the columns represent the 10
attributes (our independent variables). The three last columns represent the ident numbers
of the 18 respondents, the type numbers of the 11 flavors (product types), and the numbers
of the three segments (our dependent variable, not visible in Fig. 5.26).
For selecting an analysis procedure in SPSS, you have to click on ‘Analyze’ to
open the pulldown menu with the various groups of procedures. The procedure
‘Multinomial Logistic Regression’ is included in the group ‘Regression’. This group also
contains the procedure ‘Binary Logistic Regression’, which was discussed in Sect. 5.2.6.
After selecting ‘Analyze/Regression/Multinomial logistic …’, the dialog box shown
in Fig. 5.27 opens, with the left field showing the list of variables. Our dependent varia-
ble ‘Segment’ has to be moved to the field ‘Dependent’ using the left mouse button (this

Fig. 5.27 Dialog box: Multinomial Logistic Regression

has already been done in Fig. 5.27). The appendix ‘Last’ indicates that the last segment
(G = 3) has by default been chosen as the reference category. As mentioned above, the
user can select any category (segment) as the reference category. Here, we choose the
first segment as the baseline category because this segment is the largest one. For this,
we have to click on ‘Reference Category/Custom’ and enter 1 for ‘Value’.
We mentioned at the beginning that in logistic regression independent variables are
usually called

• covariates if they are metric variables,


• factors if they are categorical variables.

Accordingly, the dialog box contains two fields for specifying the independent variables.
Since in our case all the 10 independent variables are metric, they were moved to the
field ‘Covariate(s)’.
Further dialog boxes can be accessed via the buttons on the right side.
The top dialog box, ‘Model’ (Fig. 5.28), will not be required in the beginning. As
the first item, ‘Main effects’, shows, the Multinomial Logistic Regression procedure
estimates the coefficients for all selected predictors (factors and/or covariates) and the
intercept. This is the default option and is called “blockwise” regression. There are two

Fig. 5.28 Dialog box: Model

further items in this dialog box. If ‘Full factorial’ is chosen, all possible interaction
effects between the selected factorial (categorical) variables are also included in the
model. Covariate interactions are not estimated. The ‘Custom/Stepwise’ option allows
the user to specify interaction effects (factor or covariate interactions). Here you can also
choose whether a stepwise regression should be performed. In this case, the independent
variables are selected successively in the order of their significance by an algorithm. We
will use this option in Sect. 5.4.4. Finally, the user can choose a model without an inter-
cept. Click the ‘Continue’ button to return to the main menu.
In the second dialog box, ‘Statistics’, you can specify settings for the output.
Figure 5.29 shows the default settings.
The ‘Case processing summary’ provides information about the specified categorical
variables (e.g., the number of cases by segments, missing values).

• In the field ‘Model’ you can request statistics for assessing the quality of the model,
such as the pseudo-R2 statistics (McFadden, Cox & Snell, Nagelkerke), the likelihood
ratio test, or the classification table. If you choose the option ‘Classification table’,
you will get a table with observed versus predicted responses. Pearson’s chi-square
statistic and the deviance can be requested with the option ‘Goodness-of-fit’. If you

Fig. 5.29 Dialog box: Statistics

choose this option, in our case you will get a warning in the output because the inde-
pendent variables are metric and therefore almost every case has a distinct covariate
pattern. For our total of 116 cases, we would get 113 distinct covariate patterns (i.e.
only three pairs with equal covariate patterns), resulting in 339 cells of which 226
would be empty.
• In the field ‘Parameters’ you can request a table with the estimated coefficients
including standard error, Wald test, and odds ratio (analogous to Table 5.15).
Selecting ‘Likelihood ratio tests’ will result in printouts of the likelihood ratio tests of
the coefficients (analogous to Table 5.16). Confidence intervals for the odds ratios are
also provided, and the user can specify the confidence probability of these intervals.


The dialog box ‘Criteria’ offers controls on the iterative algorithm to perform the
maximum likelihood estimate (e.g. the maximum number of iterations) and printing of
the iteration history. The dialog box ‘Options’ may be used to set parameters for per-
forming a stepwise regression. The dialog box ‘Save’ may be used to save individual
results, such as the estimated probabilities or the predicted category, in the work file by
appending them as new variables.

5.4.3 Results

5.4.3.1 Blockwise Logistic Regression


The output of the blockwise multinomial logistic regression first gives an overview of the
case numbers of the data set (Fig. 5.30). Of the 127 cases entered, 11 cases have miss-
ing values and are therefore discarded. The remaining 116 cases are categorized into the
three specified segments with the percentages 56%, 24%, and 20%.
The remark below the table refers to the above-mentioned splitting up of cases by
subpopulations (covariate patterns, see Sect. 5.3.4.1), which, however, makes no sense
for metric independent variables.

Quality of the Estimated Model


The upper part of Fig. 5.31 shows the likelihood ratio test described in Sect. 5.2.4.1.
The first column presents the values of –2LL0 and –2LLf. The difference results in the
likelihood ratio statistics LLR = 229.326 − 142.016 = 87.310 (‘Chi-Square’). With
J · (G − 1) = 20 degrees of freedom, we get a p-value of practically zero. Thus, the model

Fig. 5.30 Case processing summary



Fig. 5.31 Overall quality of the model

can be regarded as highly statistically significant, i.e. the predictors discriminate between
the three segments.
The values of the three pseudo R square statistics in the lower part of Fig. 5.31 also
indicate an acceptable model fit. McFadden’s R2 results from the above log-likelihood
values:
 
\text{McF-}R^2 = 1 - \frac{LL_f}{LL_0} = 1 - \frac{-142.016}{-229.326} = 0.381
However, this is not very surprising as the segments were formed by a cluster analysis
of the same 10 attributes that we now use as predictors. It shows that the cluster analysis
probably worked correctly.
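If you want to recalculate these statistics from the log-likelihood values, the following sketch may be helpful; the Cox & Snell and Nagelkerke values are derived from their standard formulas and are not read off the SPSS output:

import numpy as np

# Log-likelihood values from Fig. 5.31 (-2LL0 = 229.326, -2LLf = 142.016), N = 116
ll0, llf, n = -229.326 / 2, -142.016 / 2, 116

mcfadden = 1 - llf / ll0                             # 0.381, as computed above
cox_snell = 1 - np.exp(2 * (ll0 - llf) / n)          # 1 - (L0/Lf)^(2/N)
nagelkerke = cox_snell / (1 - np.exp(2 * ll0 / n))   # Cox & Snell rescaled to a maximum of 1
print(round(mcfadden, 3), round(cox_snell, 3), round(nagelkerke, 3))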

Estimating the Parameters


Since we chose segment g = 1 as the reference category (baseline), the parameters for
segments 2 and 3 have to be estimated using the ML method according to Eq. (5.53).
Including the intercepts, a total of (G − 1)· (J + 1) = 22 parameters have to be estimated.
They are shown in Fig. 5.32 together with their standard errors and further statistics.
To facilitate their interpretation, we visualize the estimated coefficients using column
charts. In Fig. 5.33 we can see that the segment ‘Fruit’ is perceived as less delicious, but
more light and fruity than the segment ‘Classic’. This seems plausible. Less plausible is
the result that the segment ‘Fruit’ is perceived as more crunchy, which is what we would
expect from flavors with nuts and biscuits. That the segment ‘Fruit’ is perceived as less
healthy also seems questionable.

Fig. 5.32 Estimated parameters of the regression functions for segments 2 and 3


Fig. 5.33 Estimated coefficients for the segments ‘Fruit’ versus ‘Classic’


Fig. 5.34 Estimated coefficients for the segments ‘Coffee’ versus ‘Classic’

Figure 5.34 shows that the segments ‘Coffee’ and ‘Classic’ differ much less than the
segments ‘Fruit’ and ‘Classic’. The greatest difference concerns the attribute ‘bitter’. But
that ‘Coffee’ is perceived as less bitter than ‘Classic’ also seems questionable.
According to Eq. (5.57) we can determine the coefficients for other pairs of categories
by the differences of the respective logits. Figure 5.35 shows the results for ‘Fruit’ versus


Fig. 5.35 Estimated coefficients for the segments ‘Fruit’ versus ‘Coffee’

Table 5.23 Significant coefficients for α = 5%

  Segments             Positively significant    Negatively significant
  Fruit vs. Classic    Crunchy, fruity, light    Delicious, healthy
  Coffee vs. Classic   Exotic                    Bitter
  Fruit vs. Coffee     Fruity                    Price

‘Coffee’. For ‘Coffee’ versus ‘Fruit’ (by switching the baseline), we would get identical
coefficients with opposite signs.
The segment ‘Fruit’ is perceived as more fruity than the segment ‘Coffee’, which is
not surprising. It is also perceived as lighter, more refreshing, and crunchier. And it is
perceived as less delicious, less expensive, and less healthy. Not all of these findings
match our intuitive expectations, but human attitudes are sometimes unpredictable.
Figure 5.32 also shows the values of the Wald statistics according to Eq. (5.44) and
the corresponding p-values. For the segments ‘Fruit’ vs. ‘Classic’, five coefficients are
significant for α = 5%, but for the other pairs of segments, most coefficients are not sig-
nificant. Table 5.23 shows the significant coefficients in the order of their significance
level. As can be seen, the largest coefficients do not always have the highest significance.
Finally, in the rightmost column of Fig. 5.32, the effect coefficients (odds ratios,
Exp(B)) of the covariates are shown (specific for the respective reference category). They
are all positive, but <1 for negative values of the regression coefficients and >1 for pos-
itive values (see Sect. 5.2.4.3). SPSS also indicates the confidence intervals of the odds
ratios.

Fig. 5.36 Testing the covariates with the likelihood ratio test

Likelihood Ratio Tests of the Covariates


To check the importance of any attribute for the segmentation, we can use the likelihood
ratio test. It is calculated by successively removing each covariate from the model.
In contrast to the Wald-test, all coefficients of a covariate are tested, not just a single
coefficient. In our case, there are two coefficients for each covariate. The result is shown
in Fig. 5.36. This result is independent of whatever baseline category was chosen.
Figure 5.36 shows, for each covariate, the −2LL value of the reduced model, i.e.
when the two coefficients of the respective covariate are set to zero and the likelihood is
maximized for the remaining parameters. If LL0j is the maximum log-likelihood value of
the reduced model (with the coefficients b1j and b2j of covariate j set to zero), according
to Eq. (5.45) the likelihood ratio statistic for covariate j is obtained by:
LLRj = −2 · (LL0j − LLf )
LLf denotes the log-likelihood value for the fitted model as given in Fig. 5.31. For covari-
ate 1 (‘price’), for example, this results in:

Fig. 5.37 Estimated probabilities (part of the working file)

LLRj = 146.531 − 142.016 = 4.51


Under hypothesis H0: b1j = b2j = 0, LLRj is asymptotically chi-square distributed with
2 degrees of freedom. The resulting p-value is 0.105, so the influence of the attribute
‘price’ cannot be regarded as significant.
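This test is easily recalculated from the two −2LL values; a small sketch for the attribute ‘price’:

from scipy.stats import chi2

neg2ll_reduced = 146.531                 # model without 'price' (Fig. 5.36)
neg2ll_fitted = 142.016                  # fitted model with all covariates (Fig. 5.31)

llr = neg2ll_reduced - neg2ll_fitted     # approx. 4.5
p_value = chi2.sf(llr, df=2)             # the two coefficients of 'price' are set to zero
print(round(llr, 2), round(p_value, 3))  # p-value approx. 0.105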
Surprisingly, the attribute ‘crunchy’ has the strongest effect, followed by ‘delicious’,
‘bitter’ and ‘fruity’. The attribute ‘sweet’ has the smallest effect, with a p-value of 0.55
(maybe because the products do not differ much with respect to sweetness).

Classification Results
An important characteristic of logistic regression is that it provides probabilities for the
categories (groups). The probabilities may be used for the prediction or the assignment
of cases to categories. In SPSS we get the probabilities for each test person and flavor for
the three segments. But we can also derive these probabilities for persons that have not
taken part in the analysis. For the calculation see Eqs. (5.47) and (5.48) in Sect. 5.3. In
SPSS, the estimated probabilities can be requested with the dialog box ‘Save’. They will
then be appended to the working file as new variables. The probabilities are independent
of the chosen baseline category.
Figure 5.37 shows part of the working file in the data editor with the created varia-
bles EST1_1, EST2_1 and EST3_1 for the estimated probabilities of the three segments.

Fig. 5.38 Classification table for the case example

The variable PRE_1 indicates the predicted category. This is the category with the high-
est probability.
For case 1 (respondent 1 and flavor 1 = Milk) we get the highest probability 0.84
for segment 1 (Classic). This prediction is correct. For case 10 (respondent 4 and flavor
2 = Espresso), segment 2 (Fruit) is predicted, which is false.
A summary of all observed and predicted segments is given by the classification table
in Fig. 5.38, which now contains 9 cells. The hits are shown in the diagonal cells. Of the
65 cases in segment 1 (Classic), 62 are predicted correctly (95.4%), and of the 28 cases
in segment 2 (Fruit), 23 are predicted correctly (82.1%). These are very good results. But
of the 23 cases in segment 3 (Coffee), only 6 are predicted correctly (26.1%). This shows
that logistic regression yields higher hit rates for larger segments. The overall hit rate is
78.4%.
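The hit rates follow directly from the diagonal of the classification table; a short sketch:

import numpy as np

hits = np.array([62, 23, 6])       # correctly predicted cases per segment (Fig. 5.38)
observed = np.array([65, 28, 23])  # observed cases per segment

print((hits / observed * 100).round(1))             # segment hit rates: 95.4, 82.1, 26.1
print(round(hits.sum() / observed.sum() * 100, 1))  # overall hit rate: 78.4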
To check the classification, a ROC curve can be created for each segment. We will
do this for segment 1 (Classic). After selecting the item ‘Analyze/ROC curve’, we have
to choose the variable ‘EST1_1’ as the ‘Test Variable’ and the variable ‘Segment’ as the
‘State Variable’. For ‘The Value of State Variable’ we enter 1 for segment 1 (Classic).
Additionally, we select ‘With diagonal reference line’ and ‘Standard error and confidence
interval’. The output is shown in Fig. 5.39. The area under the curve (AUC) is 82.4%,
which is excellent.
We can do the same for the other two segments. For segment 2 (‘Fruit’) we have to
choose the variable ‘EST2_1’ as the ‘Test Variable’ and enter the value 2 for the ‘State
Variable’. We will get AUC = 91.5%. For segment 3 (Coffee) we get AUC = 79.5%.
These are excellent results. The cluster analysis that was used for the segmentation
apparently did a good job.
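Outside of SPSS, the AUC can also be computed directly from the saved probabilities, e.g. with scikit-learn. The following sketch uses a few hypothetical values in place of the 116 saved values of EST1_1:

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical illustration: y_true marks membership in segment 1 (Classic),
# est1 contains the corresponding estimated probabilities (EST1_1)
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
est1 = np.array([0.84, 0.61, 0.35, 0.72, 0.48, 0.22, 0.55, 0.66])

print(roc_auc_score(y_true, est1))   # area under the ROC curve (AUC)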

5.4.3.2 Stepwise Logistic Regression


A question of great practical importance is: Can we find a more parsimonious set of
attributes for discriminating between the segments? By performing the main study with
a smaller set of attributes than the one used in the pretest, we can save time and money.
And often the quality of the data can be improved in this way because the burden on the
respondents is diminished.

Fig. 5.39 ROC curve for segment 1 (Classic), AUC = 0.824

To construct a more parsimonious model, we can use the results above since the like-
lihood ratio test told us that the attributes ‘crunchy’, ‘delicious’, ‘bitter’ and ‘fruity’ have
the greatest importance for assigning cases to segments.
The option ‘Stepwise Regression’ offers another possibility. With this procedure, it
is left to the computer program to find a good model. An algorithm successively adds
variables to a model that in the beginning contains only the intercept. The variables
are selected in the order of their contribution to the likelihood of the model (forward
selection). Or the algorithm removes variables from a model that contains all independ-
ent variables (backward selection). This can be controlled via the dialog box ‘Model’
(Fig. 5.28). The statistical criteria for the selection process can be controlled via the
dialog box ‘Options’ (Fig. 5.40). The default threshold for including a variable into the
model is a p-value ≤5% of the likelihood ratio.

Fig. 5.40 Dialog box: Options

Figure 5.41 shows the results of the stepwise entry of covariates. The procedure
selects five variables that meet the default selection criterion (‘Entry Probability’ ≤5%).
These are ‘fruity’, ‘price’, ‘exotic’, ‘crunchy’, and ‘bitter’. The likelihood statistics are
not identical to the values in Fig. 5.36 because of multicollinearity. With these variables,
we achieve a hit rate of 69%.
If we increase the ‘Entry Probability’ to ≤10%, the attribute ‘refreshing’ is added to
the model and we achieve a hit rate of 71.6%. If we now discard the attribute ‘bitter’, we
can increase the hit rate to 73.1%. This demonstrates that we should not put too much
confidence in the automatic selection by stepwise regression.

5.4.4 SPSS Commands

In Sect. 5.2.6 we demonstrated how to use the GUI (graphical user interface) of SPSS
to conduct a binary logistic regression. Alternatively, we can use the SPSS syntax which
is a programming language unique to SPSS. Each option we activate in SPSS’s GUI is
translated into SPSS syntax. If you click on ‘Paste’ in the main dialog box shown in

Fig. 5.41 Summary of the stepwise logistic regression (forward selection)

* MVA: Binary Logistic Regression for the example.


* Defining Data.
DATA LIST FREE / Person Income Gender Buy.

BEGIN DATA
1 2530 0 1
2 2370 1 0
3 2720 1 1
-----------
30 1620 0 0
* Enter all data.
END DATA.

* Logistic Regression: Method "Enter".


LOGISTIC REGRESSION VARIABLES Buy
/METHOD=ENTER Income Gender
/SAVE=PRED
/CASEWISE OUTLIER(2)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

Fig. 5.42 SPSS syntax for binary logistic regression (Sect. 5.2.6)

Fig. 5.17, a new window opens with the corresponding SPSS syntax. However, you can
also use the SPSS syntax and write the commands yourself. Using the SPSS syntax can
be advantageous if you want to repeat an analysis multiple times (e.g., testing differ-
ent model specifications). Figure 5.42 shows the SPSS syntax for binary regression as
described in Sect. 5.2.6.

* MVA: Case Study Chocolate Multinomial Logistic Regression.


* Defining Data.
DATA LIST FREE / price refreshing delicious healthy bitter light crunchy
exotic sweet fruity segment.
VALUE LABELS
/segment 1 'Classic' 2 'Fruit' 3 'Coffee'.

BEGIN DATA
3 3 5 4 1 2 3 1 3 4 1
6 6 5 2 2 5 2 1 6 7 1
2 3 3 3 2 3 5 1 3 2 1
---------------------
5 4 4 1 4 4 1 1 1 4 1
* Enter all data.
END DATA.

* Case Study Multinomial Logistic Regression: Blockwise estimation (main effects).


NOMREG segment (BASE=FIRST ORDER=ASCENDING) WITH price refreshing
delicious healthy bitter light crunchy exotic sweet fruity
/CRITERIA CIN(95) DELTA(0) MXITER(100) MXSTEP(5) CHKSEP(20)
LCONVERGE(0) PCONVERGE(0.000001) SINGULAR(0.00000001)
/MODEL
/STEPWISE=PIN(.05) POUT(0.1) MINEFFECT(0) RULE(SINGLE) ENTRYMETHOD(LR)
REMOVALMETHOD(LR)
/INTERCEPT=INCLUDE
/PRINT=CLASSTABLE PARAMETER SUMMARY LRT CPS STEP MFI
/SAVE ESTPROB PREDCAT.

ROC EST1_1 BY segment (1)


/PLOT=CURVE(REFERENCE)
/PRINT=SE
/CRITERIA=CUTOFF(INCLUDE) TESTPOS(LARGE) DISTRIBUTION(FREE) CI(95)
/MISSING=EXCLUDE.

ROC EST2_1 BY segment (2)


/PLOT=CURVE(REFERENCE)
/PRINT=SE
/CRITERIA=CUTOFF(INCLUDE) TESTPOS(LARGE) DISTRIBUTION(FREE) CI(95)
/MISSING=EXCLUDE.

ROC EST3_1 BY segment (3)


/PLOT=CURVE(REFERENCE)
/PRINT=SE
/CRITERIA=CUTOFF(INCLUDE) TESTPOS(LARGE) DISTRIBUTION(FREE) CI(95)
/MISSING=EXCLUDE.

Fig. 5.43 SPSS syntax for blockwise estimation and ROC curve (Sect. 5.4.3.1)

Figures 5.43 and 5.44 show the SPSS syntax for running some analyses for the case
study presented here.
For readers interested in using R (https://www.r-project.org) for data analysis, we pro-
vide the corresponding R-commands on our website www.multivariate-methods.info.

* MVA: Case Study Chocolate Multinomial Logistic Regression: Method


"Stepwise".
NOMREG segment (BASE=FIRST ORDER=ASCENDING) WITH price refreshing
delicious healthy bitter light crunchy exotic sweet fruity
/CRITERIA CIN(95) DELTA(0) MXITER(100) MXSTEP(5) CHKSEP(20)
LCONVERGE(0) PCONVERGE(0.000001) SINGULAR(0.00000001)
/MODEL=| FORWARD=price refreshing delicious healthy bitter light
crunchy exotic sweet fruity
/STEPWISE=PIN(.05) POUT(0.1) MINEFFECT(0) RULE(SINGLE) ENTRYMETHOD(LR)
REMOVALMETHOD(LR)
/INTERCEPT=INCLUDE
/PRINT=CLASSTABLE PARAMETER SUMMARY LRT CPS STEP MFI.

Fig. 5.44 SPSS syntax for stepwise estimation (Sect. 5.4.3.2)

5.5 Modifications and Extensions

Logistic regression is of great importance for the estimation of discrete choice models,
i.e. for models that describe, explain, predict, and support choices between two or more
discrete alternatives. As examples we used various instances of buying chocolate (e.g.
buying or not buying, buying type A or type B). For these kinds of models we have to
distinguish between two groups of independent variables (predictors):

• characteristics of the persons (chooser): e.g. socio-demographic variables like gender,


age, income, household size, education, interests, lifestyle, etc.
• attributes of the alternatives (choices): e.g. brand, price, size, color, ingredients, fea-
tures, quality, advertising, etc.

The predictors used in logistic regression are predominantly characteristics of the


persons. But for making choices, attributes of the alternatives are also of eminent
relevance.36
Based on the predominant use of the two types of predictors we can distinguish
between two types of models for discrete choices:

• logistic regression models (using characteristics of the persons),


• logit choice models (using attributes of the alternatives).37

36 There is a third group of variables that vary over persons and alternatives, e.g. the perceived
attributes that we encountered in the case study.
37 Logit choice models became popular through the work of Daniel McFadden (1974), who laid the
foundations for these models and their applications. In 2000 he won the Nobel Prize in economics.
More information on these models can be found in the books of Ben-Akiva & Lerman, 1985;
Hensher et al., 2015; Train, 2009. Examples of their application are the use of transport alterna-
tives (e.g. car, tram, bus, bicycle, walking; McFadden, 1984) or the interpretation of market data
derived from scanner panels (e.g. Guadagni & Little, 1983; Jain et al., 1994).

A problem with logistic regression models is the large number of parameters that have to
be estimated and interpreted, especially if the number of categories is large. Wikipedia
lists more than 200 chocolate brands. If we select 10 brands and use 10 predictors, we
have to estimate 99 parameters (9 intercepts and 90 coefficients) for a logistic regression
model.
For a logit choice model, the number of parameters is reduced to 19 (9 intercepts and
10 coefficients) because the logit choice model (in its basic form) uses generic coeffi-
cients, while the logistic regression model uses alternative-specific coefficients. For
example, there will be just one coefficient for the prices of the alternatives, and it is
assumed that price has the same effect on all alternatives. The possibility of specifying
generic coefficients that are constant over the alternatives makes it possible to formulate
very efficient and parsimonious models. Logistic regression does not allow the estima-
tion of generic coefficients. Neither can generic coefficients that do not vary over the
alternatives be estimated for characteristics of the persons. We will demonstrate this with
a small example.

Example
As an example, we simplify the model in Eq. (5.52). We assume G = 3 choice alter-
natives and only one predictor. For a person with an income x, we want to predict
the buying probabilities for the three alternatives. This yields the following logistic
regression model:

\pi_g(x) = \frac{e^{\alpha_g + \beta_g x}}{e^{\alpha_1 + \beta_1 x} + e^{\alpha_2 + \beta_2 x} + e^{\alpha_3 + \beta_3 x}} \quad (g = 1, 2, 3) \qquad (5.65)
Setting the parameters for the baseline category g = 3 to zero, we have to estimate
(J + 1) · (G − 1) = 4 parameters. ◄

Now we contrast the model in the example with the corresponding logit choice model.
We assume the predictor will be the price p of the choice alternatives. This leads to the
following formula:

\pi_g(p) = \frac{e^{\alpha_g + \beta p_g}}{e^{\alpha_1 + \beta p_1} + e^{\alpha_2 + \beta p_2} + e^{\alpha_3 + \beta p_3}} \quad (g = 1, 2, 3) \qquad (5.66)
Now the price coefficient is a generic parameter, while the prices are specific to the alter-
natives. We expect a negative value for the price coefficient β. Setting α3 to zero, we
have to estimate (J + G–1) = 3 parameters. For alternative 1 we can express the probabil-
ity by:
\pi_1(p) = \frac{1}{1 + e^{(\alpha_2 - \alpha_1) + \beta (p_2 - p_1)} + e^{-\alpha_1 + \beta (p_3 - p_1)}} \qquad (5.67)
The intercepts (constant terms) of the model can here be interpreted as utility values of
the alternatives relative to the baseline category. The model states that the probability of
buying alternative 1 depends on the differences in utility and price between choice 1 and
the other two alternatives.
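To make this structure concrete, the following sketch evaluates the choice probabilities of Eq. (5.66); the parameter values and prices are purely hypothetical illustrations:

import numpy as np

alpha = np.array([0.8, 0.3, 0.0])   # alternative-specific constants (alpha3 fixed to 0)
beta = -1.2                         # generic price coefficient (expected to be negative)
prices = np.array([2.0, 1.8, 1.5])  # prices of the three alternatives

v = alpha + beta * prices           # systematic components (utilities)
p = np.exp(v) / np.exp(v).sum()     # choice probabilities, Eq. (5.66)
print(p.round(3))                   # the three probabilities sum to 1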
If we now substitute in the logit choice model the price of the alternatives with the
income of the chooser, we get the following formula:
\pi_1(x) = \frac{1}{1 + e^{(\alpha_2 - \alpha_1) + \beta (x - x)} + e^{-\alpha_1 + \beta (x - x)}} = \frac{1}{1 + e^{\alpha_2 - \alpha_1} + e^{-\alpha_1}}
Because the income does not vary across the alternatives, it is eliminated from the logit
choice model.
As mentioned above, logistic regression does not allow the estimation of generic coef-
ficients. So, if we include an attribute of the alternatives into a logistic regression model,
we have to estimate a specific coefficient for each alternative. This can be useful in cer-
tain situations. For example, a strong brand might be less affected by an increase in price
than a weak brand and it would be interesting to measure this effect of the brand. But
logistic regression forces us to estimate all parameters specific to alternatives. So, in gen-
eral, we have to estimate many parameters.
The logit choice model can be extended to include characteristics of the persons by
using alternative-specific coefficients besides generic coefficients. So, the logistic regres-
sion model can be seen as a special case of an extended logit choice model.38

5.6 Recommendations

In conclusion, we will give some recommendations for the implementation of logistic


regression.

Requirements Regarding the Data


• Logistic regression imposes relatively low statistical requirements on the data. The
main assumption of the logistic model is that the categorical dependent variable is
a random variable (Bernoulli- or multinomially distributed). The observations of the
dependent variable should be statistically independent. No assumptions are made con-
cerning the independent variables.
• Logistic regression should, therefore, be preferred to discriminant analysis if there is
uncertainty about the distribution of the independent variables, especially if categori-
cal independent variables are present.
• Logistic regression requires a larger number of cases than, for example, linear
regression or discriminant analysis. The number of cases per group (category of

38 SPSS has no special procedure for logit choice analysis but the procedure COXREG for Cox-
Regression may be used for this calculation.
the dependent variable) should not be lower than 25, and the total should therefore
amount to at least 50.
• With a larger number of independent variables, larger numbers of cases per group are
required. There should be at least 10 cases per parameter to be estimated.
• The independent variables should be largely free of multicollinearity (no linear
dependencies).

Estimation of the Regression Coefficients


• To test the regression coefficients, the likelihood ratio test should be preferred to
the Wald test, as the Wald test provides too high p-values for small samples. This
means that possibly relevant influencing variables are not recognized as significant.
However, in SPSS the LR test for the coefficients is only provided by the multinomial
logistic regression procedure.
• It should be noted that when coding the values of the dependent variables with zero
and one, the binary logistic regression procedure selects group zero as the reference
category, while the multinomial logistic regression procedure selects by default the
group with the highest coding—here group one—as the reference category. The esti-
mated parameters thus differ in their sign, though not in their amount. But the user
can select the reference category.

Global Quality Measures


• The likelihood ratio test is the best available test to assess the significance of the over-
all model and is always applicable. It is comparable to the F-test of linear regression
analysis. Other global measures of quality, such as the Pearson statistic or the devi-
ance, should be considered skeptically in the case of metric independent variables (if
many covariate patterns occur) as they are usually not chi-square distributed.
• Low values of the pseudo-R-square statistics should not cause disappointment
since their values are regularly lower than values expected from R-square in linear
regression.
• When testing the model using a classification table, it should be noted that the hit
rates depend on the selection of the cutoff value. An independent quality measure for
the predictive power is the area below the ROC curve (AUC), which can take on val-
ues between 0 and 1.
• In general, an outlier diagnosis based on the Pearson residuals per observation is rec-
ommended. In the case of a multinomial model, they are not calculated per observa-
tion, but cell by cell (option ‘Cell probabilities’, see Fig. 5.29).

Alternative Methods
An alternative method to binary logistic regression is linear regression as demonstrated
in Sect. 5.2.1 with the linear probability model (Model 1, Sect. 5.2.1.1) and with group-
ing the data and applying a logit transformation (Model 2, Sect. 5.2.1.2). These models
can provide good approximations and are computationally much easier. But the linear
probability model has only restricted validity, and any grouping of data leads to a loss of
information.
Another alternative to logistic regression analysis (LRA) is discriminant analysis
(DA) (cf. Chap. 4). Both methods can also be used for multinomial problems, and both
methods are based on a linear model (linear systematic component). While in LRA we
deal with states or events, in the context of DA we deal with the classification of ele-
ments into predefined groups, which is historically conditioned by the original areas of
application. But as we have shown, LRA can also be used for problems of classification.
The main difference between the two methods is that LRA provides probabilities for
the occurrence of alternative events or the classification into separate groups. In contrast,
DA provides discriminant values from which probabilities can be derived in a separate
step. A nice feature of DA is that it provides territorial maps for classification problems,
i.e. mappings of groups and boundaries between groups.
LRA is computationally more complex since the estimation of the parameters requires
the application of the maximum likelihood method and thus an iterative procedure.
Concerning the statistical properties of the methods, one advantage of LRA is that it is
based on fewer assumptions about the data than DA. Thus, LRA is more robust con-
cerning the data and, in particular, less sensitive to outliers.39 Experience shows, how-
ever, that both methods provide similarly good results, especially for large sample sizes
and even in cases where the assumptions of DA are not fulfilled (cf. Michie et al., 1994,
p. 214; Hastie et al., 2011, p. 128; Lim et al., 2000, p. 216).

References

Agresti, A. (2013). Categorical data analysis. John Wiley.


Ben-Akiva, M., & Lerman, S. (1985). Discrete choice analysis. MIT Press.
Fox, J. (2015). Applied regression analysis and generalized linear models. SAGE.
Gigerenzer, G. (2002). Calculated Risks. How to know when numbers deceive you. Simon &
Schuster.
Guadagni, P., & Little, J. (1983). A logit model of brand choice calibrated on scanner data.
Marketing Science, 2(3), 203–238.
Hastie, T., Tibshirani, R., & Friedman, J. (2011). The elements of statistical learning. Springer.
Hauck, W., & Donner, A. (1977). Wald’s test as applied to hypotheses in logit analysis. Journal of
the American Statistical Association, 72, 851–853.
Hensher, D., Rose, J., & Greene, W. (2015). Applied choice analysis. Cambridge University Press.
Hosmer, D., & Lemeshow, S. (2000). Applied logistic regression. Wiley.
Hosmer, D., Lemeshow, S., & Sturdivant, R. (2013). Applied logistic regression. Wiley.
IBM Corporation. (2022). IBM SPSS Statistics Algorithms. https://www.ibm.com/docs/en/
SSLVMB_29.0.0/pdf/IBM_SPSS_Statistics_Algorithms.pdf. Accessed November 4, 2022.

39 DAassumes that the independent variables follow a multivariate normal distribution, whereas
LRA assumes that the dependent variable follows a binomial or multinomial distribution.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction to Statistical Learning.
Springer.
Jain, D., Vilcassim, N., & Chintagunta, P. (1994). A random-coefficients logit brand-choice model
applied to panel data. Journal of Business & Economic Statistics, 13(3), 317–326.
Lim, T., Loh, W., & Shih, Y. (2000). A comparison of predicting accuracy, complexity, and training
time of thirty-three old and new classification algorithms. Machine Learning, 40(3), 203–229.
Louviere, J., Hensher, D., & Swait, J. (2000). Stated choice methods. Cambridge University Press.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka
(Ed.), Frontiers in econometrics, 40 (pp. 105–142). Academic Press.
McFadden, D. (1984). Econometric analysis of qualitative response models. In Handbook of
econometrics (Vol. II, Chap. 24). Elsevier Science Publishers BV.
Michie, D., Spiegelhalter, D., & Taylor, C. (1994). Machine learning, neural and statistical classi-
fication. Ellis Horwood Limited.
Pearl, J. (2018). The Book of Why. The new science of cause and effect. Basic Books.
Press, W., Flannery, B., Teukolsky, S., & Vetterling, W. (2007). Numerical recipes – The art of sci-
entific computing. Cambridge University Press.
Train, K. (2009). Discrete choice methods with simulation. Cambridge University Press.
Watson, J., Whiting, P., & Brush, J. (2020). Practice pointer: Interpreting a covid-19 test result.
British Medical Journal, 369, m1808.

Further Reading

IBM Corporation. (2017). IBM SPSS Regression 25. Armonk.


Hair, J., Black, W., Babin, B., & Anderson, R. (2010). Multivariate data analysis. Pearson.
Maddala, G. (1983). Limited-dependent and qualitative variables in econometrics. Cambridge
University Press.
McCullagh, P., & Nelder, J. (1989). Generalized linear models. Chapman and Hall.
6 Contingency Analysis

Contents

6.1 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353


6.2 Procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
6.2.1 Creating a Cross Table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
6.2.2 Interpretation of Cross Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
6.2.3 Testing for Associations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
6.2.3.1 Testing for Statistical Independence. . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
6.2.3.2 Assessing the Strength of Associations. . . . . . . . . . . . . . . . . . . . . . . . . . 361
6.2.3.3 Role of Confounding Variables in Contingency Analysis. . . . . . . . . . . . 365
6.3 Case Study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
6.3.1 Problem Definition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
6.3.2 Conducting a Contingency Analysis with SPSS. . . . . . . . . . . . . . . . . . . . . . . . . . 367
6.3.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
6.3.4 SPSS Commands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
6.4 Recommendations and Relations to Other Multivariate Methods. . . . . . . . . . . . . . . . . . . 377
6.4.1 Recommendations on How to Implement Contingency Analysis. . . . . . . . . . . . . 377
6.4.2 Relation of Contingency Analysis to Other Multivariate Methods. . . . . . . . . . . . 378
Further Reading. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379

6.1 Problem

Imagine you want to know whether there is an association between gender and being
vegetarian. To answer the question, you draw a random sample of consumers and meas-
ure their gender and diet (vegetarian/omnivore). The two variables ‘gender’ and ‘diet’ are
categorical (nominal) variables. The values of the variables represent different categories


Table 6.1 Examples of research questions appropriate for contingency analysis

  Educational Studies
    Asking for independence: Is there a relation between school type and skipping school?
    Asking for homogeneity: Is gender equally distributed across highly talented pupils?
  Biology
    Asking for independence: Are weather conditions (rainy vs. sunny) associated with the occurrence of ticks?
    Asking for homogeneity: Is the number of albinos among species living in rural and urban areas equally distributed?
  Business & Economics
    Asking for independence: Is firm performance related to a firm's customer centricity?
    Asking for homogeneity: Is the market share of fair trade FMCG products equal in Europe compared to the US?
  Medicine
    Asking for independence: Is there an association between smoking and dying of lung cancer?
    Asking for homogeneity: Are death rates related to Covid-19 equally distributed in Europe and North America?
  Psychology
    Asking for independence: Are self-regulation in childhood and subsequent levels of achievement related?
    Asking for homogeneity: Are depression and anxiety equally distributed across younger and older women?

that are mutually exclusive, that is, each consumer can be assigned to exactly one
combination of the variables (e.g., female vegetarian, male omnivore).1
To analyze relations between two or more categorical (nominal) variables, we can
use contingency analysis (also called cross-tabulation). When we are interested in asso-
ciations between categorical variables—like in the example above—, we actually want
to assess whether the variables are independent. Thus, we want to test for independence.
However, sometimes we want to know whether a variable is equally distributed in two
or more samples, as in the following example: you are wondering whether coronary
artery abnormalities are equally distributed in patients who are treated with aspirin or
with gamma globulin. Thus, you draw a random sample of patients with coronary artery
abnormalities and you examine whether the patients were treated with aspirin or with
gamma globulin. Both variables are categorical variables. However, in this example we
would like to know whether the distribution of coronary artery abnormalities is equal in
the two groups (i.e., aspirin vs. gamma globulin). We compare the probabilities of devel-
oping coronary artery abnormalities given treatment with aspirin or gamma globulin.
In this example, we want to test for homogeneity. Such questions can also be addressed
with the help of contingency analysis.
Table 6.1 lists further examples of research questions from different disciplines that
can be answered with contingency analysis. It is further indicated whether the ques-
tions ask for a test for independence or homogeneity. Actually, the procedures to test for

1 We are aware that both variables have more than just two categories. We use this simplification to
illustrate the basic idea of the contingency analysis.

Fig. 6.1 Process steps of contingency analysis: (1) creating a cross table, (2) interpretation of cross tables, (3) testing for associations

independence or homogeneity are alike. In the following, we will again use the choco-
late example and address a question about the association of two variables (i.e., test for
independence).

6.2 Procedure

Generally, contingency analysis follows a three-step procedure as illustrated in Fig. 6.1.


First, we create a cross table that displays the joint distribution of two categorical var-
iables. Second, we discuss how to interpret cross tables and the insights we gain. Third,
we assess whether an association between the categorical variables exists in a statistical
sense and how strong this association actually is.

6.2.1 Creating a Cross Table


The starting point of a contingency analysis is a cross table that reflects the joint distri-
bution of two variables. In general, theoretical considerations and reasonable arguments
should support the hypothesis of an association between the two variables. Otherwise,
we may erroneously infer dependencies.

Application Example
We collected data from 181 respondents about their age and preferred type of choc-
olate. The variable ‘age’ has just two distinct categories: ‘18 to 44 years’ (younger)
and ‘45 years and older’ (older). If the variable ‘age’ was measured as a metric varia-
ble, we could transform it into a categorical variable after the data has been collected.
The variable ‘preferred chocolate type’ also has two categories, namely ‘milk’ and
‘dark’. Each observation (i.e., respondent) can uniquely be assigned to a combination
of these categories. For example, we observe a younger person who loves milk choco-
late. Moreover, the individual observations are independent of each other, that is, we
have one single observation from each respondent. ◄

Table 6.2  Excerpt of collected data

Respondent | Type of chocolate (1 = milk, 2 = dark) | Age (1 = 18 to 44 years, 2 = 45 years and older)
1 | 1 | 1
2 | 2 | 2
3 | 1 | 1
4 | 2 | 1
5 | 1 | 2
… | … | …

Table 6.3  I × J cross table (rows: categories of variable 1; columns: categories of variable 2)

 | Category 1 | Category 2 | Category 3 | … | Category J | Sum
Category 1 | n11 | n12 | n13 | … | n1J | n1.
Category 2 | n21 | n22 | n23 | … | n2J | n2.
… | … | … | … | … | … | …
Category I | nI1 | nI2 | nI3 | … | nIJ | nI.
Sum | n.1 | n.2 | n.3 | … | n.J | n

Based on the above example, we want to examine whether there is an association


between ‘age’ and ‘preferred type of chocolate’. To do so, we can use contingency analysis. More specifically, we use contingency analysis to test for independence.
Although we restrict our example to the case of two categorical variables, contin-
gency analysis can also be used for more than two categorical variables. In such cases,
multidimensional cross tables are created that keep the value of one variable constant
within a table.2 Then we may realize that one variable is independent of two others that
are, however, associated with each other. Alternatively, the association between two var-
iables may differ according to different values of the third variable. We will discuss such
an example in Sect. 6.2.3.3.

2 Another possibility is the computation of mean values or ratios (cf. Zeisel, 1985). In addition,
methods for the analysis of multidimensional tables such as log-linear models have been developed
(cf. Fahrmeir & Tutz, 2001 for a literature overview).

Yet in the following, we focus on two categorical variables to ease the understanding
of the basic idea of contingency analysis. We want to answer the question whether there
is an association between age and respondents’ preference for either milk chocolate or
dark chocolate. Table 6.2 shows an excerpt of the collected data.
In a first step, we aggregate the collected data to create a cross table. To do so, we
compute the number of observations nij with a specific combination of the different cat-
egories of the variables, where i indicates the category of the variable displayed in the
rows (i = 1, …, I) and j represents the category of the variable displayed in the columns
(j = 1, …, J). The I · J possible combinations of the categories of the two variables represent the cells of the cross table (Table 6.3). Apart from the number of observations in each cell of the cross
table (nij), we list information about the total number of observations in each row (ni.)
and column (n.j) as well as the total number of observations (n).
Table 6.4 shows the cross table for the chocolate example with 181 observations.

6.2.2 Interpretation of Cross Tables


From Table 6.4 we learn that 75 out of the 181 surveyed consumers are younger (18–
44 years), and that 68 consumers prefer milk chocolate over dark chocolate. Out of those
68 consumers, 45 are between 18 and 44 years old. Yet, to facilitate the interpretation of
cross tables, we often consider relative frequencies (i.e., percentages) instead of abso-
lute frequencies. For this purpose, we can use three alternative ways to compute relative
frequencies:

a) Percentages based on the total number of observations in a row (ni.)


b) Percentages based on the total number of observations in a column (n.j)
c) Percentages based on the total number of observations (n)

Table 6.4  Cross table for the preferred type of chocolate and age
Age
Preferred type of chocolate 18 to 44 years 45 years and older Sum
Milk 45 23 68
Dark 30 83 113
Sum 75 106 181

Tables 6.5 to 6.7 show the respective cross tables. Each of these tables provides slightly
different representations of the same information. Hence, the selection of the appropriate
presentation of the cross table depends on the specific interest of the researcher.
For example, Table 6.5 shows that 66.2% of all consumers who prefer milk chocolate
are younger than 45 years, while 73.5% of all consumers who prefer dark chocolate are
45 years or older.
Table 6.6 illustrates that 60.0% of the younger consumers prefer milk chocolate,
while 78.3% of the older consumers prefer dark chocolate.
When considering the complete sample of 181 consumers (Table 6.7), we learn that
62.4% of the respondents prefer dark chocolate. Thus, the cross table suggests that there
is a larger market for dark than for milk chocolate. Further taking into account that
older consumers tend to prefer dark chocolate while younger consumers tend to prefer
milk chocolate (Tables 6.5 and 6.6), we can derive some managerial implications. For
example, a retail manager may use these insights for assortment decisions across differ-
ent retail stores. Since the majority of people seems to prefer dark chocolate, the retail

Table 6.5  Cross table—percentages based on total number of observations in a row


Age
Preferred type of chocolate 18 to 44 years (%) 45 years and older (%) Sum (%)
Milk 66.2 33.8 100
Dark 26.5 73.5 100

Table 6.6  Cross table—percentages based on total number of observations in a column


Age
Preferred type of chocolate 18 to 44 years (%) 45 years and older (%)
Milk 60.0 21.7
Dark 40.0 78.3
Sum 100 100

Table 6.7  Cross table—percentages based on total number of observations


Age
Preferred type of chocolate 18 to 44 years (%) 45 years and older (%) Sum (%)
Milk 24.9 12.7 37.6
Dark 16.6 45.9 62.4
Sum 41.4 58.6 100

manager may decide to offer generally more kinds of dark chocolate compared to milk
chocolate in the stores. If older consumers prevail in the trade area of a specific retail
store, the retail manager may offer an even greater variety of dark chocolates.
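If the raw counts are available, the three percentage bases can also be computed with a few lines of code. The following minimal sketch (our own illustration, not part of the original example; the use of Python with numpy and the variable names are assumptions) reproduces the row, column, and total percentages of Tables 6.5 to 6.7 from the absolute frequencies in Table 6.4:

import numpy as np

# absolute frequencies from Table 6.4 (rows: milk, dark; columns: 18-44, 45+)
obs = np.array([[45, 23],
                [30, 83]])

row_totals = obs.sum(axis=1, keepdims=True)   # n_i. = 68, 113
col_totals = obs.sum(axis=0, keepdims=True)   # n_.j = 75, 106
n = obs.sum()                                 # n = 181

print(obs / row_totals * 100)   # row percentages    (Table 6.5)
print(obs / col_totals * 100)   # column percentages (Table 6.6)
print(obs / n * 100)            # total percentages  (Table 6.7)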
At first glance, the different cross tables presented in Tables 6.5 to 6.7 suggest an
association between the respondents’ age and preferred type of chocolate. Yet, the cross
tables do not provide sufficient evidence for assuming an association between the varia-
bles in a statistical sense. In the following, we will thus formally test whether there is a
significant association.

6.2.3 Testing for Associations


To test whether there is indeed an association between the categorical variables, the expected number of observations should be 5 or more in at least 80% of the cells; that is, no more than 20% of the cells should have expected counts below 5 (cf. Everitt, 1992, p. 39). In our example, all expected cell counts are larger than 5, and we can thus continue
with the analysis. First, we assess whether the categorical variables are associated with
the help of the chi-square test (here: test for independence). Afterwards, we evaluate the
strength of the association between the variables.

6.2.3.1 Testing for Statistical Independence


Table 6.4 shows that 68 out of 181 respondents prefer milk chocolate—that is 37.6%
of all respondents (Table 6.7). If the variables ‘age’ and ‘preferred chocolate type’ were
independent of each other (not associated), we would expect that about 37.6% of the
younger and older respondents prefer milk chocolate over dark chocolate. Thus, we
would expect to observe 28.2 (= 75 · 0.376) younger and 39.8 (= 106 · 0.376) older
consumers in the respective cells of the cross table (i.e., n11 and n12). However, the
observed number of younger consumers who prefer milk chocolate equals 45, and the
observed number of older consumers who prefer milk chocolate is 23 (cf. Tables 6.4
and 6.8).
Generally, we can use the following equation to compute the expected number of
observations—assuming that the two variables are independent:
e_{ij} = \frac{n_{i.} \cdot n_{.j}}{n} \qquad (6.1)
Thus, the expected number of observations in a cell equals the number of observations in
the respective row (ni.) multiplied with the number of observations in the respective col-
umn (n.j) divided by the total number of observations (n).
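As a brief illustration of Eq. (6.1), the matrix of expected counts is simply the outer product of the row and column totals divided by the total number of observations. The following sketch (ours, not the book's; Python/numpy is an assumption) reproduces the expected counts of Table 6.8:

import numpy as np

obs = np.array([[45, 23],
                [30, 83]])
# e_ij = n_i. * n_.j / n for every cell at once
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
print(expected.round(1))   # [[28.2, 39.8], [46.8, 66.2]], cf. Table 6.8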

Table 6.8  Cross table contrasting observed and expected number of observations


Age
Preferred type of chocolate 18 to 44 years 45 years and older
Milk Observed 45.0 23.0
Expected 28.2 39.8
Dark Observed 30.0 83.0
Expected 46.8 66.2

If the two variables are independent, the deviations between the observed (nij) and
expected (eij) number of observations should be small. Large deviations, in contrast, provide
evidence for an association (dependence) between the two variables. In our example, the
deviations equal 16.8 in each cell (cf. Table 6.8). Given the sample size, the deviations
seem to be substantial and the variables ‘age’ and ‘preferred type of chocolate’ are prob-
ably associated.

Chi-square Test
We use the information about the deviations to formally assess whether an associ-
ation between the two variables exists in a statistical sense. We test the hypothesis of
independence:
H0: X and Y are independent of each other (not associated).
H1: X and Y are dependent on each other (associated).
We use the chi-square test statistic to test the null hypothesis. The chi-square test sta-
tistic takes all deviations of observed and expected number of observations (nij − eij)
into account. We divide each deviation by its expected count to normalize the deviations.
Moreover, we consider the squared deviations, so that positive and negative deviations
do not cancel each other out.
\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \qquad (6.2)

For our example, we get the following value for chi-square:


\chi^2 = (45 - 28.2)^2/28.2 + (30 - 46.8)^2/46.8 + (23 - 39.8)^2/39.8 + (83 - 66.2)^2/66.2 = 27.47
The chi-square test statistic is approximately chi-square distributed, with (I – 1) · (J –
1) degrees of freedom. If the test statistic exceeds the theoretical value that corresponds
to a specific significance level, we reject the null hypothesis and accept the alternative
hypothesis H1 that the variables are dependent on each other (cf. Chap. 1). Thus, they are
associated.
Assuming a 5% significance level and having 1 (= (2–1) · (2–1)) degree of freedom,
the theoretical chi-square value equals 3.84. The empirical chi-square value in our exam-
ple is 27.47, which is larger than the theoretical one. Consequently, we reject the null

hypothesis, and accept the alternative hypothesis H1 that the variables ‘age’ and ‘pre-
ferred type of chocolate’ are dependent (associated). Based on Table 6.8, we conclude
that older people have a stronger preference for dark chocolate while younger people
rather prefer milk chocolate.
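The chi-square statistic of Eq. (6.2) can be verified with a few lines of code. The sketch below (our illustration; the use of Python with scipy is an assumption, not part of the book) computes the statistic by hand and compares it with scipy.stats.chi2_contingency:

import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[45, 23],
                [30, 83]])
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
chi2_manual = ((obs - expected) ** 2 / expected).sum()
print(round(chi2_manual, 2))              # 27.47, as computed above

# correction=False suppresses the Yates correction, i.e. plain Pearson chi-square
chi2, p, dof, _ = chi2_contingency(obs, correction=False)
print(round(chi2, 2), dof)                # 27.47 with df = 1; p is far below 0.05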

Statistical Tests for Small Samples


As mentioned above, the chi-square test statistic is only approximately chi-square dis-
tributed. It actually has a discrete distribution (multinomial distribution). For sample
sizes between 20 and 60 observations, the Yates’ corrected chi-square test statistic is
therefore preferably used (cf. Fleiss et al., 2003, p. 27). In case of two variables with two
categories each, Yates’ corrected chi-square test corresponds to:
\chi^2_{corr} = \frac{n \cdot \left(|n_{11} \cdot n_{22} - n_{12} \cdot n_{21}| - \frac{n}{2}\right)^2}{n_{1.} \cdot n_{.1} \cdot n_{2.} \cdot n_{.2}} \qquad (6.3)

In our example, the Yates correction results in the following value for chi-square:
\chi^2_{corr} = \frac{181 \cdot \left(|45 \cdot 83 - 23 \cdot 30| - \frac{181}{2}\right)^2}{68 \cdot 75 \cdot 113 \cdot 106} = 25.86
When we compare the chi-square value of the Yates correction with the theoretical chi-
square value of 3.84, we again reject the null hypothesis. Generally, the Yates correction
results in smaller values for chi-square. Yet with an increasing sample size, the difference
between the chi-square values in Eqs. (6.2) and (6.3) diminishes (cf. Fleiss et al., 2003,
pp. 57–58; Everitt, 1992, p. 13).
For samples with less than 20 observations or strongly asymmetric marginal distributions (large differences between the ni. and/or n.j), we can use Fisher's exact test to assess whether the variables are independent (cf. Agresti, 2019, pp. 45–47). Fisher's exact test computes the probability of obtaining the observed data under the null hypothesis that the proportions are the same. It does not impose any requirements on the sample size. The original test is designed for 2 × 2 cross tables, and the probability can be computed manually as3:
p = \frac{n_{1.}! \cdot n_{2.}! \cdot n_{.1}! \cdot n_{.2}!}{n_{11}! \cdot n_{12}! \cdot n_{21}! \cdot n_{22}! \cdot n!} \qquad (6.4)
The smaller the computed probability, the more likely it is that the variables are associated.
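Both small-sample variants are also available in standard statistics libraries. The following sketch (ours, not part of the book) uses scipy: correction=True applies the Yates continuity correction of Eq. (6.3), and fisher_exact implements Fisher's exact test for 2 × 2 tables as in Eq. (6.4):

from scipy.stats import chi2_contingency, fisher_exact

obs = [[45, 23],
       [30, 83]]
chi2_yates, p_yates, dof, _ = chi2_contingency(obs, correction=True)
print(round(chi2_yates, 2))        # 25.86, the Yates-corrected value computed above

odds_ratio, p_fisher = fisher_exact(obs)   # two-sided by default
print(p_fisher)                    # very small probability: reject independence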

6.2.3.2 Assessing the Strength of Associations


Often we want to assess not only the significance of an association but also its strength. Yet, the chi-square value is a function of the sample size, and we therefore

3 Visit www.multivariate-methods.info for more information.



cannot use it to assess the strength of associations. To illustrate this issue, let us just
duplicate the sample from our example. We would then observe 90 observations in cell
n11, and would expect 56.4 (= 150 · 136/362) observations in this cell, which is twice
the number of 28.2. The resulting chi-square value then equals 54.94 (= 27.47 · 2). But
in fact, the strength of the association between the variables ‘age’ and ‘preferred type of
chocolate’ is still the same, and should not be affected by the sample size. We thus need
to consider the sample size when evaluating the strength of association.
To assess the strength of association, we generally distinguish between measures
for the strength of associations that rely on chi-square and measures that are based on
probability considerations. We first discuss the measures that are based on chi-square—
namely, the phi coefficient, contingency coefficient, and Cramer’s V.

Measuring the Strength of Associations Based on Chi-square


The phi coefficient (φ) corrects for the sample size and is equal to:

\phi = \sqrt{\frac{\chi^2}{n}} \qquad (6.5)
The larger the phi coefficient, the stronger the association. A rule-of-thumb states that
a phi coefficient greater than 0.3 indicates an association that is ‘more than trivial’ (cf.
Fleiss et al., 2003, p. 99). In our example, the phi coefficient equals:

\phi = \sqrt{\frac{27.4}{181}} = 0.389
Therefore, we conclude that the association between ‘age’ and ‘preferred type of choco-
late’ is more than trivial.
Yet, the phi coefficient has several limitations. First, the phi coefficient has no fixed upper bound and, thus, phi coefficients of different studies cannot be compared.
tinuous variables are divided into categories (like ‘age’ in our example), the cutoff value
(here: 45 years) might have a strong influence on the value of phi (cf. Fleiss et al., 2003,
p. 99). Third, when variables with more than two categories are considered, the phi coef-
ficient can be larger than 1.
To address the latter issue, we can use the contingency coefficient:

CC = \sqrt{\frac{\chi^2}{\chi^2 + n}} \qquad (6.6)

The contingency coefficient ranges between 0 and 1. Yet, it can rarely reach the maximum
value of 1; rather the upper limit is a function of the number of columns and rows in the
cross table. Therefore, the respective maximum value should be taken into account when
assessing the strength of an association. The upper limit of the contingency coefficient is:

(6.7)

CCmax = (R − 1) R with R = min (I, J)

For our example, we get:

CC = 0.362 \quad \text{and} \quad CC_{max} = \sqrt{1/2} = 0.707

Since the maximum value equals 0.707, the value of 0.362 for the contingency coeffi-
cient seems to indicate a rather strong association.
Alternatively, Cramer’s V is a measure for the strength of an association which ranges
between 0 and 1 and reaches 1 if each variable is completely determined by the other:

\text{Cramer's } V = \sqrt{\frac{\chi^2}{n(R - 1)}} \quad \text{with } R = \min(I, J) \qquad (6.8)

If one of the variables is binary, the phi coefficient and Cramer’s V result in the same
value. Thus, in our example Cramer’s V also equals 0.389.
In Sect. 6.3.3, we will report approximate significance levels for the phi coefficient,
contingency coefficient and Cramer’s V that help to assess whether the association
between the two categorical variables is substantial or not in a statistical and not only
subjective sense.
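Since all three measures are simple functions of the chi-square value, they are easy to compute by hand. The following short sketch (our illustration; variable names are assumptions) evaluates Eqs. (6.5) to (6.8) for the chocolate example:

from math import sqrt

chi2, n, I, J = 27.47, 181, 2, 2
R = min(I, J)

phi = sqrt(chi2 / n)                    # Eq. (6.5): approx. 0.39
cc = sqrt(chi2 / (chi2 + n))            # Eq. (6.6): approx. 0.36
cc_max = sqrt((R - 1) / R)              # Eq. (6.7): 0.707
cramers_v = sqrt(chi2 / (n * (R - 1)))  # Eq. (6.8): equals phi for a 2 x 2 table
print(round(phi, 3), round(cc, 3), round(cc_max, 3), round(cramers_v, 3))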

Measuring the Strength of Associations Based on Probability Considerations


Next to the measures that rely on chi-square, Goodman and Kruskal (1954) suggested
two alternative measures that are based on probability considerations: Goodman and
Kruskal’s lambda (λ) and tau (τ).
The lambda coefficient measures the percentage of improvement in our ability to pre-
dict the value of one categorical variable when we have the knowledge about the dis-
tribution pattern of the other variable. It is assumed that the best strategy for prediction
is to select the category with most observations (modal category) on the logic that this
will minimize the number of wrong guesses (i.e., ‘proportional reduction in error’). More
specifically, the lambda coefficient compares the probability of making a wrong predic-
tion of the category of one variable if we had no knowledge about the category of the
other variable to the probability of making a wrong prediction of the category of one var-
iable if we knew the category of the other variable.
For our example, let us assume that we have no knowledge about age. All we know
is that of the total of 181 respondents, 68 respondents prefer milk chocolate and 113
respondents prefer dark chocolate (Table 6.9). In this setting, the variable ‘preferred type
of chocolate’ is the target variable—also called dependent variable. The best guess we
can make is to predict ‘dark’ for the preferred type of chocolate since this is the modal
category. Thus, we are right for 113 respondents and wrong for 68 respondents. Now
assume that you know the age category a respondent belongs to but you do not know the
preferred chocolate type. If we know that a respondent belongs to the age category ‘18
to 44 years’, our best guess related to the preferred type of chocolate is ‘milk’ (modal
category). Then, we are right for 45 respondents and wrong for 30 respondents. If we

Table 6.9  Cross table containing the relevant information to compute Goodman and Kruskal’s
lambda
Age
Preferred type of chocolate 18 to 44 years 45 years and older Sum
Milk 45 23 68
Dark 30 83 113
Sum 75 106 181

know that a respondent belongs to the age category ‘45 years and older’, our best guess
related to the preferred type of chocolate is ‘dark’. Now, we are right for 83 respond-
ents and wrong for 23 respondents. Thus, we make overall 53 (= 30 + 23) wrong predic-
tions when we know the age category a respondent belongs to. Compared to the situation
where we do not know a respondent’s age, the number of wrong predictions reduces by
15 (= 68 – 53). Using the number of wrong predictions without knowledge about age as
a base, we get the following coefficient:

\lambda_{type} = \frac{68 - 53}{68} = 0.221
We can apply the same logic to the situation where we want to make a prediction about
age without having any information about the preferred type of chocolate (here: the vari-
able ‘age’ is the target variable). The lambda coefficient then equals:

\lambda_{age} = \frac{75 - 53}{75} = 0.293
The lambda coefficient is asymmetric since its value depends on the target variable. In
general terms, the lambda coefficients are computed in the following way:

\lambda_{\text{row variable}} = \frac{\sum_j \max_i n_{ij} - \max_i n_{i.}}{n - \max_i n_{i.}} \quad \text{or} \quad \lambda_{\text{column variable}} = \frac{\sum_i \max_j n_{ij} - \max_j n_{.j}}{n - \max_j n_{.j}} \qquad (6.9)

The lambda coefficients always range between 0 and 1. A value of 0 indicates that know-
ing the category of the second variable does not reduce the probability of making a
wrong guess when predicting the first (target) variable. In contrast, a value of 1 indicates
that knowing the category of the second variable results in an error-free prediction of
the first (target) variable. Thus, the size of the lambda coefficient is an indicator for the
strength of association. Moreover, we can test whether the lambda coefficients are statis-
tically significant (cf. Sect. 6.3.3).
As an alternative to computing lambda coefficients for each variable separately, we can also use the so-called symmetric lambda coefficient:
 
\lambda_{sym} = \frac{\frac{1}{2}\left(\sum_i \max_j n_{ij} + \sum_j \max_i n_{ij}\right) - \frac{1}{2}\left(\max_j n_{.j} + \max_i n_{i.}\right)}{n - \frac{1}{2}\left(\max_j n_{.j} + \max_i n_{i.}\right)} \qquad (6.10)

In our example, the symmetric lambda coefficient equals 0.259. We recognize that all lambda
coefficients are smaller than the chi-square based measures, which is usually the case.
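For readers who want to recompute the lambda coefficients, the following sketch (ours, not the book's; Python/numpy is an assumption) translates Eqs. (6.9) and (6.10) into code using the counts of Table 6.9:

import numpy as np

obs = np.array([[45, 23],      # rows: milk, dark
                [30, 83]])     # columns: 18-44, 45+
n = obs.sum()

hits_by_col = obs.max(axis=0).sum()    # sum over columns of the column-wise maxima
hits_by_row = obs.max(axis=1).sum()    # sum over rows of the row-wise maxima
max_row_total = obs.sum(axis=1).max()  # largest n_i. (modal row category)
max_col_total = obs.sum(axis=0).max()  # largest n_.j (modal column category)

lambda_type = (hits_by_col - max_row_total) / (n - max_row_total)  # 0.221
lambda_age = (hits_by_row - max_col_total) / (n - max_col_total)   # 0.293
avg_hits = (hits_by_col + hits_by_row) / 2
avg_modal = (max_row_total + max_col_total) / 2
lambda_sym = (avg_hits - avg_modal) / (n - avg_modal)              # 0.259
print(round(lambda_type, 3), round(lambda_age, 3), round(lambda_sym, 3))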
Another measure to assess the strength of the association is Goodman and Kruskal’s
tau (τ). Whereas the lambda coefficient is based on modal category assignment,
Goodman and Kruskal’s tau considers all marginal probabilities. The probability of mak-
ing a wrong prediction is based on random category assignment with probabilities spec-
ified by marginal probabilities. We illustrate the computation of the tau coefficient with
the help of Table 6.10.
Without knowing the age of a respondent, we would assign 37.6% of respondents into
the category ‘milk’ and 62.4% into the category ‘dark’. We make a correct prediction in
37.6% and 62.4% of all cases for the first and second category. Consequently, in total
53.1% (= 0.376² + 0.624²) of all cases are assigned correctly and 46.9% are assigned incorrectly. If we knew the age of a respondent, the prediction improves: among the younger respondents, 60% prefer milk chocolate and 40% prefer dark chocolate. For the older respondents, the corresponding values are 21.7% and 78.3%. Hence, we predict 52% (= 0.60² + 0.40²) of all younger and 66% (= 0.217² + 0.783²) of all older respondents correctly. Overall, we now predict 60.2% (= 0.52 · 0.414 + 0.66 · 0.586) correctly and 39.8% incorrectly. This is an improvement of 7.1 percentage points. The tau coefficient again sets the improvement in relation to the probability of making a wrong prediction without knowledge about age. Therefore, we get:
\tau_{type} = \frac{0.469 - 0.398}{0.469} = \frac{0.071}{0.469} = 0.152
The tau coefficient is also an asymmetric measure. Yet, in our specific example τage also
equals 0.152. The tau coefficient is smaller than the lambda coefficient and the measures
based on chi-square. We can use significance tests to evaluate the tau coefficients (cf. Sect. 6.3.3).
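The tau coefficient can be computed along the same lines. The sketch below (our illustration, following the probability logic of Table 6.10; Python/numpy is an assumption) reproduces τtype:

import numpy as np

obs = np.array([[45, 23],
                [30, 83]])
n = obs.sum()

p_row = obs.sum(axis=1) / n                  # marginal shares of 'type': 0.376, 0.624
error_without = 1 - (p_row ** 2).sum()       # 0.469

p_col = obs.sum(axis=0) / n                  # marginal shares of 'age'
cond = obs / obs.sum(axis=0)                 # P(type | age) within each age column
error_with = (p_col * (1 - (cond ** 2).sum(axis=0))).sum()   # 0.398

tau_type = (error_without - error_with) / error_without      # 0.152
print(round(tau_type, 3))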

6.2.3.3 Role of Confounding Variables in Contingency Analysis


Confounding variables (also called extraneous variables) are variables that influence the
association or in more general terms the relationships between variables. As such, con-
founding variables represent alternative explanations of results. Thus, they pose a seri-
ous threat to the internal and external validity of analyses. Unless we control for the

Table 6.10  Cross table containing the relevant information to compute Goodman and Kruskal’s
tau (target variable: type)
Age
Preferred type of chocolate 18 to 44 years (%) 45 years and older (%) Sum (%)
Milk 60.0 21.7 37.6
Dark 40.0 78.3 62.4
Sum 41.4 58.6

confounding variables, they may affect the association between the focal variables and
may, therefore, blur the results.
To illustrate the potential effect of a confounding variable on the results of a contin-
gency analysis, let us assume the following:

Example
We surveyed 132 young and 158 elderly people regarding the question of whether or not they purchase diet chocolate. Table 6.11 shows the results. The question to be answered is whether the purchase of diet chocolate is related to age. ◄

Table 6.11 indicates that there is an association, i.e., older consumers buy relatively more diet chocolate than younger consumers. However, imagine that we also asked respondents to state their body weight and height, and we computed the BMI (i.e., body mass index) to assess whether a respondent is obese or not. We divided the sample into two subgroups according to obesity (3rd (confounding) variable). Tables 6.12 and 6.13 show the results for the two subgroups. We find that the share of younger and older consumers who buy (or do not buy) diet chocolate is similar. Around 80% of obese consumers buy diet chocolate independent of their age (Table 6.12), whereas only around 20% of non-obese consumers buy diet chocolate independent of their age (Table 6.13).

Table 6.11  Relationship between age category and diet chocolate (n = 290)


Buying diet chocolate Sum
Age category Yes No
Young people (up to 30) 30 (23%) 102 (77%) 132 (100%)
Elderly people (60 and above) 100 (63%) 58 (37%) 158 (100%)

Table 6.12  Subgroup of obese respondents (n = 123)


Buying diet chocolate Sum
Age category Yes No
Young people (up to 30) 10 (83%) 2 (17%) 12 (100%)
Elderly people (60 and above) 90 (81%) 21 (19%) 111 (100%)

Table 6.13  Subgroup of non-obese respondents (n = 167)


Buying diet chocolate Sum
Age category Yes No
Young people (up to 30) 20 (17%) 100 (83%) 120 (100%)
Elderly people (60 and above) 10 (21%) 37 (79%) 47 (100%)

Thus, obesity influences the buying of diet chocolate in our example. Obese consum-
ers buy relatively more diet chocolate than non-obese consumers. Since age is usually
related to obesity, it seems at first glance that age is associated with the buying of diet
chocolate.
Considering a third variable may also lead to a refinement of the association between
the focal variables, or may result in an association that was originally not observed (sup-
pressed association). Theoretical considerations may guide the researcher to determine
what additional variables may be worthwhile to consider.
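The confounding effect can be made visible by running the chi-square test once on the aggregate data and once within each subgroup. The following sketch (ours; scipy is an assumption) uses the counts of Tables 6.11 to 6.13: the aggregate test is clearly significant, whereas the tests within the obesity subgroups are not.

from scipy.stats import chi2_contingency

# rows: young, elderly; columns: buy diet chocolate yes, no
total = [[30, 102], [100, 58]]        # Table 6.11
obese = [[10, 2], [90, 21]]           # Table 6.12
non_obese = [[20, 100], [10, 37]]     # Table 6.13

for name, table in [("total", total), ("obese", obese), ("non-obese", non_obese)]:
    chi2, p, dof, _ = chi2_contingency(table)
    print(name, round(chi2, 2), round(p, 3))   # only 'total' yields p < 0.05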

6.3 Case Study

6.3.1 Problem Definition

The manager of a chocolate company wants to find out whether the sales of certain types
of chocolate are associated with the time of the year. To find an answer to the ques-
tion, the manager asks a manager of a large supermarket chain for help. The manager
of the chocolate company would like to use scanner data to answer the question. The
manager of the supermarket chain provides a dataset with 220 purchases of three types
of chocolate (i.e., milk, dark, yoghurt) over the course of the year. Each purchase can be assigned
to a season (spring = March to May, summer = June to August, fall = September to
November, and winter = December to February). In a first step, the manager creates a
cross table (Table 6.14).
The cross table suggests that there might be an association between the variables ‘sea-
son’ and ‘type of chocolate purchased’ but to test whether the two variables are really
associated the manager conducts a contingency analysis with the help of SPSS.

6.3.2 Conducting a Contingency Analysis with SPSS

A contingency analysis can be conducted with the help of CROSSTABS in SPSS.


Figure 6.2 shows the data view in SPSS. The data may be available in two different

Table 6.14  Cross table for the variables ‘season’ and ‘type of chocolate purchased’

Season | Milk | Dark | Yoghurt | Sum
Spring | 19 | 8 | 14 | 41
Summer | 14 | 6 | 16 | 36
Fall | 27 | 27 | 10 | 64
Winter | 35 | 39 | 5 | 79
Sum | 95 | 80 | 45 | 220

Fig. 6.2 Data editor (left: Case wise, right: frequency)

formats: case wise (i.e., each row represents an observation) or frequency (i.e., each row
represents the cell counts of a cross table) data. Figure 6.2 shows the data editor for both
formats.
For the case wise data, the first column is the ID of the observation (i.e., purchase),
the second column indicates the season (1 = spring, 2 = summer, 3 = fall, and 4 = win-
ter), and the third column represents the type of chocolate purchased (1 = milk, 2 = dark,
and 3 = yoghurt).
If only frequency data are available, the columns again represent the variables but the
rows represent the different combinations of the categories. The third column contains
the cell counts (i.e., frequencies). To indicate the cell counts, we use the option ‘Weight
Cases’ which is available in the menu ‘Data’ in SPSS (cf. Fig. 6.3). We weight the cases
by the ‘frequency’ variable.
Yet in the following, we will use the case wise data to conduct the contingency anal-
ysis. Go to ‘Analyze/Descriptive Statistics/Crosstabs’ (cf. Fig. 6.4) and enter the variable
‘season’ in the field ‘Row(s)’ and the variable ‘type’ in the field ‘Column(s)’. The deci-
sion which variable is placed in the rows and which in the columns is arbitrary and has
no consequences for the subsequent analyses (cf. Fig. 6.5).

Fig. 6.3 Dialog box: Weight cases

Fig. 6.4 Starting a contingency analysis in SPSS via ‘Descriptive Statistics’



Fig. 6.5 Dialog box: Crosstabs

Click on ‘Statistics’ to activate ‘Chi-Square’ as well as the ‘contingency coefficient’,


‘phi coefficient’, ‘Cramer’s V’ and (Goodman and Kruskal’s) ‘Lambda’ (cf. Fig. 6.6). If
you activate Lambda, Goodman and Kruskal’s tau will also be provided.
The ‘uncertainty coefficient’ (also called Theil’s U) is an alternative measure of
association that indicates the proportional reduction in error if values of one variable are
used to predict values of the other variable. Since the results of the uncertainty coef-
ficient and Goodman and Kruskal’s tau are typically similar, we focus on Goodman
and Kruskal’s tau in our case study. Click on ‘Continue’ to get back to the dialog box
‘Crosstabs’.
Click on ‘Cells’ to determine what values are displayed in the cross table (cf.
Fig. 6.7). You can define what frequencies (Observed, Expected) and percentages (Row,
Column, Total) will be displayed as well as how residuals are computed. Residuals
are the differences between observed and expected counts. Previously, we computed
unstandardized residuals, and we also activate those in this example. Click on ‘Continue’
to get back to the ‘Crosstabs’ dialog box.
The ‘Style’ option defines the table style (cf. Fig. 6.8). We keep the default style. The
option ‘Bootstrap’ (cf. Fig. 6.9) allows resampling, which is especially useful for small
sample sizes (for more details about bootstrapping, we refer the reader to Efron, 2003;
Efron & Tibshirani, 1994). Clicking the ‘OK’ button in the ‘Crosstabs’ dialog box starts
the data analysis.

Fig. 6.6 Dialog box: Statistics

6.3.3 Results

First, the cross table is shown in the SPSS output window. In addition to the number of
observations in each cell (Count), the row (season), column (type), and total percentages
are displayed. The information on the cell counts is identical to the information given in
Table 6.14. The expected counts for each cell eij (Expected Count) and the differences
between the observed and expected count (Residual) are also displayed (cf. Fig. 6.10).
Moreover, the results of the chi-square test statistic (Pearson Chi-Square) are shown
(cf. Fig. 6.11). Next to the degrees of freedom (df), the significance is displayed. The chi-
square test is significant at the 5% level. We reject the null hypothesis of independence.
Thus, the chi-square test statistic shows that the variables ‘season’ and ‘type of chocolate
purchased’ are associated, and thus, dependent. The footnote in the SPSS table further
states that the minimum expected cell count is equal to 7.36, indicating that each cell has an expected count of more than 5.

Fig. 6.7 Dialog box: Cell Display

Fig. 6.8 Dialog box: Table Style

Furthermore, the so called Likelihood Ratio and Linear-by-Linear Association are dis-
played. For large sample sizes the likelihood ratio test is similar to the chi-square test
and will not be discussed in this context. The Linear-by-Linear Association test is not
applicable to nominal variables and will be ignored (cf. Bishop et al., 2007). Since we

Fig. 6.9 Dialog box: Bootstrap

observe a 4 × 3 cross table, the Yates corrected chi-square (Continuity Correction), and
Fisher’s Exact Test are not computed.
Figure 6.12 shows the values for the measures that assess the strength of association
(phi, Cramer’s V, and Contingency Coefficient). Since the variables under consideration
have four and three levels, the values for phi and Cramer’s V are not identical. Assuming
a 5% significance level, the (approximate) significance for all three measures suggests
that we have to reject the null hypothesis which states that a specific measure equals
zero. This is not surprising since we already know that there is an association between
the two variables. The rule of thumb regarding the value of phi further suggests that the
association is more than trivial.
Figure 6.13 shows Goodman and Kruskal’s lambda and tau coefficients. First, the
three lambda coefficients are reported (symmetric lambda coefficient, lambda coefficient
for ‘season’ and ‘type’). Based on the asymptotic standard error of the statistics, we

Fig. 6.10 Contingency analysis in SPSS: Cross table

can compute the approximate t-value and error probability (approximate significance).
Assuming a 5% significance level, the asymmetric lambda coefficient for ‘type’ is not
significant—indicating that knowledge about the season does not significantly improve
the prediction of the type of chocolate purchased. Moreover, the symmetric lambda
coefficient is not significant. The next few rows in Fig. 6.13 contain information about
Goodman and Kruskal’s tau for ‘season’ and ‘type’, and both coefficients are significant.

Fig. 6.11 Contingency analysis in SPSS—chi-square test

Fig. 6.12 Contingency analysis in SPSS: Strength of association based on chi-square

Overall, we have learnt that the season and type of chocolate purchased are associated
(i.e., dependent). The strength of association is significant and ‘more than trivial’ when
we consider the chi-square related measures. Yet, the measures related to probability
considerations are partly insignificant, leading to mixed findings.

6.3.4 SPSS Commands

In the case study, the contingency analysis was performed using the SPSS graphical user
interface (GUI). Alternatively, we can use the SPSS syntax which is a programming lan-
guage unique to SPSS. Each option we activate in SPSS’s GUI is translated into SPSS
syntax. If you click on ‘Paste’ in the main dialog box shown in Fig. 6.5, a new window
opens with the corresponding SPSS syntax. However, you can also use the SPSS syntax
and write the commands yourself. Using the SPSS syntax can be advantageous if you
want to repeat an analysis multiple times (e.g., testing different model specifications).
Fig. 6.13 Contingency analysis in SPSS: Strength of association based on probability considerations

* MVA: Case Study Chocolate Contingency Analysis.


* Defining Data.
DATA LIST FREE / season type frequency.
VARIABLE LABELS season "season"
/type "type of chocolate"
/frequency "frequency".
VALUE LABELS
/season 1 "spring" 2 "summer" 3 "fall" 4 "winter"
/type 1 "milk" 2 "dark" 3 "yoghurt".

BEGIN DATA
1 1 19
1 2 8
1 3 14
------
4 3 5
* Enter all data.
END DATA.

* Case Study Contingency Analysis.


WEIGHT BY frequency.
CROSSTABS
/TABLES=season BY type
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ CC PHI LAMBDA
/CELLS=COUNT EXPECTED ROW COLUMN TOTAL
/COUNT ROUND CELL.

Fig. 6.14 SPSS syntax for conducting a contingency analysis based on frequency data

Figure 6.14 shows the syntax for the chocolate example when using information about
frequency data. The syntax does not refer to an existing data file of SPSS (*.sav); rather
we enter the data with the help of the syntax editor (BEGIN DATA… END DATA).
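As a cross-check outside SPSS (our sketch, not part of the book's workflow; Python with scipy is an assumption), the frequency data of Table 6.14 can be analyzed directly, which mirrors the effect of WEIGHT BY frequency before CROSSTABS:

import numpy as np
from scipy.stats import chi2_contingency

# rows: spring, summer, fall, winter; columns: milk, dark, yoghurt (Table 6.14)
counts = np.array([[19, 8, 14],
                   [14, 6, 16],
                   [27, 27, 10],
                   [35, 39, 5]])
chi2, p, dof, expected = chi2_contingency(counts)
print(round(chi2, 2), dof, round(p, 4))  # significant at the 5% level, df = 6
print(round(expected.min(), 2))          # 7.36, matching the footnote in the SPSS output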

6.4 Recommendations and Relations to Other Multivariate


Methods

6.4.1 Recommendations on How to Implement Contingency


Analysis

In the following, we offer guidelines and recommendations for conducting a contingency


analysis:

1. The individual observations must be independent of each other.


2. Each observation can be uniquely assigned to a combination of categories.

3. The proportion of cells with expected counts fewer than five should not exceed 20%
(rule-of-thumb). None of the expected counts should be less than one. Combining sev-
eral categories of a variable to meet these requirements should be carefully considered.
4. The chi-square test is valid if there are more than 60 observations. For samples with
20 to 60 observations, use the Yates correction. For sample sizes smaller than 20,
Fisher’s Exact test is appropriate. Yet, in such cases it is recommended to increase the
sample size to improve the robustness of the analysis.
5. The phi coefficient, contingency coefficient, Cramer’s V, Goodman and Kruskal’s
lambda and tau coefficients can be used to assess the strength of association. A mean-
ingful interpretation of the different coefficients requires knowledge about the mini-
mum and maximum value of these measures.

6.4.2 Relation of Contingency Analysis to Other Multivariate


Methods

Contingency analysis examines the relation among two or more categorical varia-
bles (i.e., multidimensional contingency tables). In this chapter, we focused on the sim-
ple case of two variables. Although contingency analysis can also be used for more than
two variables, the application becomes a bit cumbersome. As an alternative to a mul-
tidimensional contingency analysis, we can use log-linear models (cf. Agresti, 2019,
pp. 204–243; Everitt, 1992, pp. 80–107; see also the SPSS procedures HILOGLINEAR;
LOGLINEAR). Log-linear models use a model representation similar to a regression
model (cf. Chap. 2) and assess the influence of several nominal independent variables on a
nominal dependent variable. Ultimately, log-linear models provide information on whether
there is a significant relation between the independent and dependent variables. Log-linear
models assess the strength of the relationships based on the estimated coefficients.
Another method for the analysis of more than two categorical variables is correspond-
ence analysis. Correspondence analysis is a method for the visualization of cross tables.
Correspondence analysis thus serves to illustrate complex relations and can be classified
as a structure-discovering methodology. It is related to factor analysis (cf. Chap. 7).
Log-linear models and correspondence analysis are more complex than contingency
analysis but provide additional insights such as considering explicitly the multivariate
structure of the data and visualization of relations. Yet, contingency analysis as discussed
in this chapter is rather popular in practice. This is due to two reasons: it is easy to con-
duct, and cross tables are easy to interpret and understand. Moreover, several cross tables
provide a more comprehensive picture than more complex analyses which require a good
understanding of the methodology.

References

Agresti, A. (2019). An introduction to categorical data analysis (3rd ed.). Wiley-Interscience.


Bishop, Y., Fienberg, S., & Holland, P. (2007). Discrete multivariate analysis. Theory and practice.
Springer.
Efron, B. (2003). Second thoughts on the bootstrap. Statistical Science, 18(2), 135–140.
Efron, B., & Tibshirani, R. (1994). An introduction to the bootstrap. Springer.
Everitt, B. (1992). The analysis of contingency tables (2nd ed.). Chapman & Hall.
Fahrmeir, L., & Tutz, G. (2001). Multivariate statistical modelling based on generalized linear
models (2nd ed.). Springer.
Fleiss, J., Levin, B., & Paik, M. (2003). Statistical methods for rates and proportions (3rd ed.).
Wiley-Interscience.
Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications part I.
Journal of the American Statistical Association, 49(December), 732–764.
Zeisel, H. (1985). Say it with figures (6th ed.). Harper Collins.

Further Reading

Fienberg, S. (2007). The analysis of cross-classified categorical data (2nd ed.). Springer.
Kateri, M. (2014). Contingency table analysis – methods and implementation using R. Birkhäuser.
Sirkin, R. M. (2005). Statistics for the social science (3rd ed.). SAGE Publications.
Wickens, T. (1989). Multiway contingency tables analysis for the social sciences. Psychology
Press.
7 Factor Analysis

Contents

7.1 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382


7.2 Procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
7.2.1 Evaluating the Suitability of Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
7.2.2 Extracting the Factors and Determining their Number. . . . . . . . . . . . . . . . . . . . . 393
7.2.2.1 Graphical Illustration of Correlations. . . . . . . . . . . . . . . . . . . . . . . . . . . 394
7.2.2.2 Fundamental Theorem of Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . 397
7.2.2.3 Graphical Factor Extraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
7.2.2.4 Mathematical Methods of Factor Extraction . . . . . . . . . . . . . . . . . . . . . 402
7.2.2.4.1 Principal component analysis (PCA). . . . . . . . . . . . . . . . . . . . . 403
7.2.2.4.2 Factor-Analytical Approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
7.2.2.5 Number of Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
7.2.2.6 Checking the Quality of a Factor Solution. . . . . . . . . . . . . . . . . . . . . . . 415
7.2.3 Interpreting the Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
7.2.4 Determining the Factor Scores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
7.2.5 Summary: The Essence of Factor Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
7.3 Case Study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
7.3.1 Problem Definition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
7.3.2 Conducting a Factor Analysis with SPSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
7.3.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
7.3.3.1 Prerequisite: Suitability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
7.3.3.2 Results of PAF with Nine Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
7.3.3.3 Product Positioning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
7.3.3.4 Differences Between PAF and PCA. . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
7.3.4 SPSS Commands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
7.4 Extension: Confirmatory Factor Analysis (CFA). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
7.5 Recommendations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
Further Reading. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452


Table 7.1  Application examples of factor analysis in different disciplines

Discipline | Exemplary research questions
Environmental | A broad factor analysis assesses and summarizes the 4 micro-environmental factors (political, economic, social and technological) which have a significant impact on the business's operating environment
Marketing | A chocolate company is interested in its image. 26 variables were reduced to 3 factors: reputation, competence, brand
Medicine | Factor analysis can improve medical diagnostics. Compared to several other attempts to solve the diagnostic problem, this concept has the advantage that it does not depend on the condition of mutual independence of symptoms
Psychology | The Big Five of the OCEAN project (openness, conscientiousness, extraversion, agreeableness, neuroticism) are factors that describe the construct 'personality'. These 5 factors were derived from 18,000 variables
Operations | In many modern manufacturing processes, large quantities of multivariate data are available through automated in-process sensing. Factor analysis is suitable for extracting and interpreting information from these data for the purpose of diagnosing root causes of process reliability

7.1 Problem

Factor analysis is a multivariate method that can be used for analyzing large data sets
with two main goals:

1. to reduce a large number of correlating variables to a fewer number of factors,


2. to structure the data with the aim of identifying dependencies between correlating
variables and examining them for common causes (factors) in order to generate a new
construct (factor) on this basis.

Basic idea of factor analysis


For both objectives it is assumed that the data set to be analyzed consists mainly of
highly correlated (dependent) variables.1 Factor analysis therefore primarily analyzes
the correlation matrix (and sometimes the variance-covariance matrix) of the data, which
represents the interrelations between the variables.

1 Correlations form the basis of factor analysis. For readers who are not sufficiently familiar with
the term, the concept of correlations is explained in detail in Sect. 1.2.2.

Fig. 7.1 Process for testing the suitability of a correlation matrix R for factor analysis (flowchart: the correlation matrix must meet both a conceptual requirement, which is assumed and not tested, and an empirical requirement, which is tested but not assumed; only if both are fulfilled is the factor analysis run, i.e. extract factors, determine the number of factors, interpret the factors, and compute the factor scores; otherwise no factor analysis is performed)

Factor analysis is used in many disciplines such as psychology, sociology, medicine,


economics, management, finance, and marketing. Table 7.1 provides some examples.
To be suitable for factor analysis, the correlation matrix should have certain proper-
ties; mainly there should be sufficient interrelation (correlation) between the variables. In
other words: not every correlation matrix is suitable for factor analysis.
Therefore, the procedure starts with checking whether the data of an empirical survey
are suitable for factor analysis. This test refers to the correlation matrix of the variables; the empirical test is performed after a conceptual test. Fig. 7.1 shows the test logic of the first step.
The conceptual suitability refers to a specific causal interpretation of an empiri-
cal correlation. In most cases, the correlation between two variables x1 and x2 (rx1,x2) is
interpreted as an indication that x1 is the cause of x2 or vice versa (cf. Fig. 7.2). Factor
analysis, on the other hand, assumes that the empirical correlation rx1,x2 is caused by an
underlying variable F (hypothetical variable), which is called a factor. The following
example may clarify this.

Example

A high correlation (e.g., rx1,x2 = 0.9) was found between the advertising expenditure
for a product (x1) and its sales volume (x2). This may be interpreted in three ways:

Fig. 7.2 Different conceptual interpretations of a correlation rx1,x2: in case 1, x1 influences x2, and in case 2, x2 influences x1 (the perspective of regression analysis); in case 3, x1 and x2 are jointly caused by a common factor F (the perspective of factor analysis)

• Case 1: Advertising expenditure influences the sales volume (conventional


interpretation).
• Case 2: The amount of sales influences the advertising budget (percentage-of-sales
rule).
• Case 3: Advertising expenditure and turnover are not directly linked, but the
observed correlation can be traced back to a common factor F (e.g. general price
increase, inflation).

While causal relations (cases 1 and 2) are assumed by the dependency methods of
multivariate data analysis (e.g. regression analysis, variance analysis, or discriminant
analysis), factor analysis is based on case 3. Data from an empirical survey are there-
fore only suitable for factor analysis if the user has reliable information that case 3
represents the relevant interpretation in this specific application case. Only if this is true should a factor analysis be performed.
Therefore, to perform a factor analysis, the first step is to evaluate the suitability
of the data. The empirical suitability is given if the variables in a data set show suf-
ficiently high correlations. High correlations are the necessary prerequisite for factor
analysis, since without correlations no explanation of correlations is possible. Several
instruments are available for testing this prerequisite.
In practical applications, a large number of variables is usually considered, but not
all of them need to be highly correlated with each other. In the second step, the user
must therefore decide how many factors are to be extracted. To facilitate this decision,
factor analysis provides various statistical criteria. There are also various algorithms
for the subsequent extraction of the factors (e.g. principal component analysis (PCA),
principal axis analysis (PAA)). In addition to the number of factors, the user must also
decide on an extraction method.
The extraction procedures result in an assignment of the originally considered var-
iables to the extracted factors. The strength of these assignments is measured by the

correlations between the original variables and the factors, called factor loadings. The
factor loadings are an important reference point for the user to interpret the factors
logically. This interpretation is not trivial and is of fundamental importance, since it
is here that the user decides which substantive reason is represented by a factor. The
interpretation of the factors is therefore to be regarded as a separate third step in fac-
tor analysis, since it is of great importance for deriving consequences from the results
of a factor analysis.
The fourth and final step concerns the question of how the factors are assessed by the persons who originally rated the investigated attributes. These (fictitious) assess-
ment values per person are called factor scores. They are derived from the assess-
ments of the initial variables. ◄

Explorative versus confirmatory factor analysis


So far, we have only addressed explorative factor analysis (EFA). EFA serves to dis-
cover structures in a data set. This means that the user has no idea of how the metric
original variables should be combined into factors. Likewise, the number of factors to be
extracted is unknown at the starting point.
There are many applications, however, where the user has very precise ideas about
the number, interpretation and assignment of the original variables to the factors. In such
cases, this information can be included in the calculations. As a result, fewer calcula-
tion steps are required and the steps ‘Extracting the factors and determining the number
of factors’ and ‘Interpreting the factors’ are eliminated, while the extraction algorithms
remain the same. This, however, changes the character of the factor analysis and makes it
a confirmatory, i.e. structure-testing, method. The core principles of this important vari-
ant of factor analysis (so-called confirmatory factor analysis; CFA) are discussed in more
detail in Sect. 7.4.

7.2 Procedure

As described above, factor analysis comprises the four steps shown in Fig. 7.3: The first
step when conducting a factor analysis is to evaluate the suitability of the data. We ulti-
mately combine variables that are highly correlated, and therefore, suitability of the data
is evaluated by the correlations among the variables. In the second step, we extract the

Fig. 7.3 Process steps of factor analysis: (1) evaluating the suitability of data, (2) extracting the factors and determining their number, (3) interpreting the factors, (4) determining the factor scores



factors and decide on the number of factors to extract. In the third step, we interpret the
derived factors. Finally, we compute factor scores, i.e., evaluation values for the factors
instead of the attributes.

7.2.1 Evaluating the Suitability of Data


In the following, the four steps of a factor analysis are explained in detail using the fol-
lowing example.

Application example (example 1)

The manager of a chocolate company wants to know how the various chocolate fla-
vors are perceived by his customers. For this purpose, 30 test persons are asked to
evaluate the chocolate flavors according to five attributes (milky, melting, artificial,
fruity and refreshing) on a 7-point scale (from 1 = ‘low’ to 7 = ‘high’). The results of
the survey are shown in Table 7.2.
With the help of this data set, the manager now wants to check whether there are
any correlations between the different perceptions of the five attributes and whether
these can be condensed to common causes (factors). ◄

A first glance at the data matrix of the application example shows that the ratings of
the variables ‘fruity’ and ‘refreshing’ show a similar pattern, i.e. higher (lower) values
of ‘fruity’ always occur with higher (lower) values of ‘refreshing’. This pattern indicates
that these two variables are probably positively correlated.2
To get more insights into the interrelations among the variables, we compute the cor-
relation matrix (Table 7.3). In the application example, the correlation matrix is a 5 × 5
symmetric matrix and displays the correlations between the variables. We observe
the highest correlation of 0.983 between the variables ‘fruity’ and ‘refreshing’, which
confirms the assumption derived from Table 7.2 that these two variables are positively
correlated. Thus, these two variables seem to ‘belong together’, and we may com-
bine them into one factor since they encompass similar information. Moreover, we
learn that the variables ‘milky’ and ‘artificial’ are also highly and positively correlated

2 On the website www.multivariate-methods.info, we provide supplementary material (e.g., Excel


files) to deepen the reader’s understanding of the methodology.

Table 7.2  Initial data matrix (X) of the application example


Perceptions
Test person Milky Melting Artificial Fruity Refreshing
1 1 1 2 1 2
2 2 6 3 3 4
3 4 5 4 4 5
4 5 6 6 2 3
5 2 3 3 5 7
6 3 4 4 6 7
7 2 3 3 5 7
8 2 6 3 3 4
9 2 3 3 5 7
10 3 4 4 6 7
11 3 4 4 6 7
12 2 6 3 3 4
13 3 4 4 6 7
14 4 5 4 4 5
15 1 1 2 1 2
16 5 6 6 2 3
17 5 6 6 2 3
18 4 5 4 4 5
19 2 3 3 5 7
20 2 6 3 3 4
21 2 3 3 5 7
22 1 1 2 1 2
23 5 6 6 2 3
24 2 6 3 3 4
25 1 1 2 1 2
26 1 1 2 1 2
27 5 6 6 2 3
28 4 5 4 4 5
29 4 5 4 4 5
30 3 4 4 6 7

(rmilky,artificial = 0.961). Additionally, the variables ‘milky’ and ‘melting’ (rmilky,melting =


0.712) as well as ‘artificial’ and ‘melting’ (rartificial,melting = 0.704) have rather high and
positive correlations. Therefore, these three variables seem to be related to each other.

Table 7.3  Correlation matrix (R) for the application example


Milky Melting Artificial Fruity Refreshing
Milky 1
Melting 0.712 1
Artificial 0.961 0.704 1
Fruity 0.109 0.138 0.078 1
Refreshing 0.044 0.067 0.024 0.983 1

Fig. 7.4 Underlying structure of the 5 variables of the application example: x1 (milky), x2 (artificial), and x3 (melting) relate to factor 1; x4 (fruity) and x5 (refreshing) relate to factor 2

The low correlations in Table 7.3 suggest that not all variables can be combined and that
the underlying structure is more complex.
At this point, we would like to mention that high negative correlations can also indi-
cate variables that ‘belong together’.
In a factor analysis, we interpret the correlations between two variables as being
dependent on a third (unobserved) common factor (Child, 2006, p. 21). In our example,
we seem to have two groups of variables that can be combined to two factors, F1 and F2
(Fig. 7.4). How these factors can be interpreted will be discussed in Sect. 7.2.3.
The calculation of the correlation matrix can be simplified if the variables in a data set
are standardized beforehand.3 Standardization has the advantage that variables that are
measured on different dimensions are comparable (see Sect. 7.2.4). The standardization
of the variables has no influence on the correlation matrix itself (Table 7.3), it just sim-
plifies the calculation of the correlation matrix R, and the following applies:

3 Standardized variables have an average of 0 and a variance of 1. For the standardization of varia-
bles see also Sect. 1.2.1.

R = 1/(N − 1) · Z′ · Z    (7.1)

with

R  empirical correlation matrix
Z  standardized data matrix
Z′ transposed standardized data matrix Z
N  number of cases (observations)

If standardized variables are used, the variance-covariance matrix and the correlation
matrix are identical, since the variance of standardized variables is 1 and is reported on
the main diagonal of the variance-covariance matrix. The lower triangular matrix shows
the covariances (cov) that correspond to the correlations for standardized variables. For
the standardized variables, the following applies:
cov(x1 , x2 ) = rx1,x2 .
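To make Eq. (7.1) concrete, the following Python/numpy sketch standardizes a data matrix and computes R = 1/(N − 1) · Z′ · Z. The small array X only contains the first five test persons of Table 7.2 for illustration; to reproduce Table 7.3, the full 30 × 5 data matrix would have to be used.

import numpy as np

# First five test persons from Table 7.2 (illustration only);
# replace with the full 30 x 5 data matrix to reproduce Table 7.3.
X = np.array([[1, 1, 2, 1, 2],
              [2, 6, 3, 3, 4],
              [4, 5, 4, 4, 5],
              [5, 6, 6, 2, 3],
              [2, 3, 3, 5, 7]], dtype=float)

N = X.shape[0]
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardized variables (mean 0, variance 1)
R = Z.T @ Z / (N - 1)                             # Eq. (7.1): empirical correlation matrix
print(np.round(R, 3))                             # same result as np.corrcoef(X, rowvar=False)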
While high correlations among variables are a prerequisite for factor analysis, the ques-
tion arises whether the correlations observed in our application example are actually
‘high enough’. Several measures, all of them based on the correlation matrix, exist to
further evaluate the suitability of the data.

Bartlett test of sphericity

The Bartlett test of sphericity tests the hypothesis that the sample originates from a
population in which the variables are uncorrelated (Dziuban & Shirkey, 1974, p. 358).

H0: The variables in the sample are uncorrelated.
H1: The variables in the sample are correlated.

If the variables are uncorrelated, the correlation matrix is an identity matrix (R = I). In
this case, the data are unsuitable for a factor analysis. The null hypothesis of the Bartlett
test is, therefore, equivalent to the question whether the correlation matrix deviates from
an identity matrix only by chance.
Factor analysis requires metric variables; the Bartlett test additionally assumes that the
variables in the sample follow a normal distribution. The following test variable is used:4
 
Chi-Square = −(N − 1 − (2 · J + 5)/6) · ln(|R|)    (7.2)

4 For a brief summary of the basics of statistical testing see Sect. 1.3.

with

N number of cases (per variable)


J number of variables
|R| determinant of the correlation matrix

The test variable is approximately chi-square distributed with J(J − 1)/2 df (degrees
of freedom). Note that the value of the test variable increases with the sample size, since
N enters Eq. (7.2) directly. In the application example, the Bartlett test results in a chi-square value of5
 
Chi-Square = −(30 − 1 − (2 · 5 + 5)/6) · ln(0.00096) = 184.13
The degrees of freedom in the example are (5 · (5 − 1))/2 = 10, leading to a sig-
nificance level of p = 0.000. Thus, with an error probability of almost zero we can
assume that the correlation matrix is different from an identity matrix and thus suita-
ble for factor analysis. ◄
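The Bartlett statistic of Eq. (7.2) can be computed directly from the determinant of the correlation matrix. The following plain-Python sketch reproduces the value reported above for the application example, using the determinant 0.00096 from footnote 5:

import math

N, J = 30, 5                  # number of cases and number of variables
det_R = 0.00096               # determinant of the correlation matrix (footnote 5)

chi_square = -(N - 1 - (2 * J + 5) / 6) * math.log(det_R)  # Eq. (7.2)
df = J * (J - 1) // 2                                      # degrees of freedom
print(round(chi_square, 2), df)                            # approx. 184.13 with 10 df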

Kaiser–Meyer–Olkin (KMO) criterion

Another criterion for assessing the suitability of a correlation matrix is the Kaiser–
Meyer–Olkin (KMO) criterion. The KMO criterion considers the bivariate and partial
correlation and is computed in the following way:
KMO = Σj Σj′≠j r²xj xj′ / (Σj Σj′≠j r²xj xj′ + Σj Σj′≠j partial_r²xj xj′)    (7.3)

with the sums running over j = 1, …, J and j′ = 1, …, J (j′ ≠ j)

with

r²xj xj′  squared correlation between the variables xj and xj′
partial_r²xj xj′  squared partial correlation between the variables xj and xj′

A partial correlation is the degree of association between two variables after exclud-
ing the influences of all other variables. If partial correlations are rather small, the
variables share a common variance which may be caused by an underlying com-
mon factor. Thus, small partial correlations result in rather high values for KMO

5 The determinant of the correlation matrix in the application example is det = 0.001. Furthermore,
ln (0.00096)  = −6.9485.

Table 7.4  Evaluation of the KMO criterion
KMO value Evaluation
≥ 0.9 Marvelous
≥ 0.8 Meritorious
≥ 0.7 Middling
≥ 0.6 Mediocre
≥ 0.5 Miserable
< 0.5 Unacceptable

(cf. Eq. 7.3); the KMO criterion approaches a value close to 1 if all partial correla-
tions are close to zero. The KMO criterion equals 1 if all partial correlations are 0.
Thus, a value of 1 is the maximum value this criterion may reach. The KMO criterion
is equal to 0.5 if the correlation matrix equals the partial correlation matrix, which is
not desirable (Cureton & D’Agostino, 1993, p. 389). Thus, the value of KMO should
be larger than 0.5, and we would like to observe a value close to 1. Table 7.4 provides
indications for the interpretation of the KMO value according to Kaiser and Rice
(1974, p. 111).
In our example, KMO equals 0.576, which is above the threshold of 0.5 but rather
small (‘miserable’ according to Table 7.4). ◄

Measure of sampling adequacy (MSA)

While the KMO criterion evaluates the suitability of all variables together, the
so-called measure of sampling adequacy (MSA) assesses the suitability of single
variables:
MSAj = Σj′≠j r²xj xj′ / (Σj′≠j r²xj xj′ + Σj′≠j partial_r²xj xj′)    (7.4)

with the sums running over j′ = 1, …, J (j′ ≠ j)

with

r²xj xj′  squared correlation between variables xj and xj′
partial_r²xj xj′  squared partial correlation between variables xj and xj′

The values for the MSA range between 0 and 1. Again, we would like to observe a
value that is greater than 0.5 and close to 1. The assessment in Table 7.4 is also valid
for evaluating MSA values.
Table 7.5 shows the MSA values for the five variables in our application example.
The variables ‘fruity’ and ‘refreshing’ have MSA values below 0.5. We may consider

Table 7.5  MSA values of the variables in the application example
MSA
Milky 0.597
Melting 0.878
Artificial 0.598
Fruity 0.471
Refreshing 0.467

eliminating these two variables from further analyses since they do not seem to be
sufficiently correlated with the other variables. Yet for illustrative purposes, we decide
to keep the two variables and continue with all 5 variables in our example. ◄
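Both the KMO criterion (Eq. 7.3) and the variable-specific MSA values (Eq. 7.4) require only the bivariate and the partial correlations. A common way to obtain the partial correlations is from the inverse of the correlation matrix; the numpy sketch below follows this route (a generic sketch, not the SPSS routine). For the correlation matrix of Table 7.3 it should yield a KMO of about 0.58 and MSA values close to those in Table 7.5.

import numpy as np

def kmo_and_msa(R):
    """KMO (Eq. 7.3) and per-variable MSA values (Eq. 7.4) from a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    R_inv = np.linalg.inv(R)
    # partial correlation of xj and xj': -R_inv[j, j'] / sqrt(R_inv[j, j] * R_inv[j', j'])
    d = np.sqrt(np.outer(np.diag(R_inv), np.diag(R_inv)))
    partial = -R_inv / d
    off = ~np.eye(R.shape[0], dtype=bool)       # mask selecting the off-diagonal elements (j' != j)
    r2 = (R**2) * off                           # squared correlations
    p2 = (partial**2) * off                     # squared partial correlations
    kmo = r2.sum() / (r2.sum() + p2.sum())                      # Eq. (7.3)
    msa = r2.sum(axis=1) / (r2.sum(axis=1) + p2.sum(axis=1))    # Eq. (7.4)
    return kmo, msa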

Anti-image covariance matrix


Besides the already discussed criteria, we can also consider the image and anti-image of
the variables. Following Guttman (1953), the image of a variable describes the propor-
tion of variance that can be explained by the remaining variables using multiple regres-
sion analysis (cf. Chap. 2), while the anti-image represents the part that is independent
of the other variables. Since it is a prerequisite of factor analysis that the variables are
correlated and share a common variance, the anti-image should be close to zero.
Apart from the anti-image, we can also consider the partial covariance between two
variables (Dziuban & Shirkey, 1974). The partial covariance is conceptually similar to
the partial correlation. Therefore, the value for the negative partial covariance should be
rather small (i.e., close to zero). A rule of thumb is that data are not suited for factor anal-
ysis if 25% or more of the negative partial covariances are different from zero (> |0.09|).
The so-called anti-image covariance matrix (Table 7.6) contains information about a
variable’s anti-image (i.e., the main diagonal values) as well as the partial covariances
(i.e., the off-diagonal values).
In our example, none of the off-diagonal values and, thus, partial covariances is sig-
nificantly different from zero (> |0.09|). But the anti-image of the variable ‘melting’ is
rather large (0.459), indicating that this variable cannot be explained by the other varia-
bles and is thus weakly correlated with the other variables.

Table 7.6  Anti-image covariance matrix in the application example


Milky Melting Artificial Fruity Refreshing
Milky 0.069 −0.019 −0.065 −0.010 0.010
Melting −0.019 0.459 −0.027 −0.026 0.025
Artificial −0.065 −0.027 0.071 0.009 −0.008
Fruity −0.010 −0.026 0.009 0.026 −0.026
Refreshing 0.010 0.025 −0.008 −0.026 0.027
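For completeness, one common way to compute the anti-image covariance matrix (a generic sketch, not taken from the text) is to scale the inverse of the correlation matrix by the reciprocal of its diagonal. The diagonal of the result equals 1 minus the squared multiple correlation of each variable (its anti-image), and the off-diagonal elements are the partial covariances that are evaluated against the |0.09| rule of thumb.

import numpy as np

def anti_image_covariance(R):
    """Anti-image covariance matrix as S . R^-1 . S with S = [diag(R^-1)]^-1 (one common definition)."""
    R_inv = np.linalg.inv(np.asarray(R, dtype=float))
    S = np.diag(1.0 / np.diag(R_inv))
    return S @ R_inv @ S  # diagonal: 1 - SMC (anti-image); off-diagonal: partial covariances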

Table 7.7  Criteria to assess the suitability of the data


Criterion When is the criterion met? Is the criterion met for the
application example?
Bartlett test The null hypothesis can be rejected, Fulfilled at 5% significance
i.e.: R ≠ I level
Kaiser–Meyer–Olkin KMO should be larger than 0.5, a value KMO of 0.576 is only slightly
criterion above 0.8 is recommended above the critical value
Measure of sampling MSA should be larger than 0.5 for each The MSA for the variables
adequacy (MSA) variable ‘fruity’ and ‘refreshing’ is
below the threshold of 0.5
Anti-image covariance The off-diagonal elements (i.e., Data meet the requirements
matrix negative partial covariances) of the
anti-image covariance matrix should
be close to zero. Data are not suited
for factor analysis if 25% or more of
the off-diagonal elements are different
from zero (>|0.09|)

Conclusion
To assess the suitability of the data for factor analysis, different criteria can be used but
none of them is superior (Table 7.7). This is due to the fact that all criteria use the same
information to assess the suitability of the data. Therefore, we have to carefully evaluate
the different criteria to get a good understanding of the data.
For our example, it can be concluded that the initial data are only ‘moderately’ suita-
ble for a factor analysis. We will now continue with the extraction of factors to illustrate
the basic idea of factor analysis.

7.2.2 Extracting the Factors and Determining their Number

(Process overview, cf. Fig. 7.3; current step 2: Extracting the factors and determining their number)

While the previous considerations referred to the suitability of initial data for a factor
analysis, in the following we will explore the question of how factors can actually be
extracted from a data set with highly correlated variables. To illustrate the correlations,
we first show how correlations between variables can also be visualized graphically by
vectors. The graphical interpretation of correlations helps to illustrate the fundamen-
tal theorem of factor analysis and thus the basic principle of factor extraction. Building

on these considerations, various mathematical methods for factor extraction are then
presented, with an emphasis on principal component analysis and the factor-analyti-
cal procedure of principal axis analysis. These considerations then lead to the question
of starting points for determining the number of factors to be extracted in a concrete
application.

7.2.2.1 Graphical Illustration of Correlations


In general, correlations can also be displayed in a vector diagram where the correlations
are represented by the angles between the vectors. Two vectors are called linearly inde-
pendent if they are orthogonal to each other (angle  = 90°). If two vectors are correlated,
the correlation is not equal to 0, and thus, the angle is not equal to 90°. For example, a
correlation of 0.5 can be represented graphically by an angle of 60° between two vectors.
This can be explained as follows:
In Fig. 7.5, the vectors AB and AC represent two variables. The length of the vectors
is equal to 1 because we use standardized data. Now imagine the correlation between the

Fig. 7.5 Graphical representation of the correlation coefficients (vectors AB and AC of standardized length 1; D is the foot of the perpendicular from C onto AB)



two variables equals 0.5. With an angle of 60°, the length of AD is equal to 0.5 which is the
cosine of a 60° angle. The cosine is the quotient of the adjacent leg and the hypotenuse (i.e.,
AD/AC ). Since AC is equal to 1, the correlation coefficient is equal to the distance AD.

Example 2: correlation matrix with three variables

The above relationship is illustrated by a second example with three variables and the
following correlation matrix R.
   ◦ 
1 0
R =  0.8660 1  which is equal to R = 30◦ 0◦ ◄
◦ ◦ ◦
0.1736 0.6428 1 80 50 0

In example 2, we have chosen the correlations in such a way that a graphical illustration
in a two-dimensional space is possible. Figure 7.6 graphically illustrates the relationships
between the three variables in the present example. Generally we can state: the smaller
the angle, the higher the correlation between two variables.
The more variables we consider, the more dimensions we need to position the vectors
with their corresponding angles to each other.
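The correspondence between correlations and angles used in example 2 can be checked in a single line: the cosine of each angle gives the correlation, and the arccosine recovers the angle. A minimal numpy check:

import numpy as np

angles = np.array([30, 50, 80])                 # angles from example 2, in degrees
print(np.round(np.cos(np.deg2rad(angles)), 4))  # [0.866  0.6428 0.1736] = the correlations
print(np.degrees(np.arccos(0.5)))               # a correlation of 0.5 corresponds to 60 degrees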

Fig. 7.6 Graphical representation of the correlation matrix with three variables (vectors x1, x2, and x3 at angles of 30°, 50°, and 80° to each other)

Fig. 7.7 Factor extraction for two variables with a correlation of 0.5 (vectors OA (x1) and OB (x2) at an angle of 60°; the resultant OC represents the factor, at 30° to each vector)

Graphical factor extraction


Factor analysis strives to reproduce the associations between the variables measured
by the correlations with the smallest possible number of factors. The number of axes
(=dimension) required to reproduce the associations among the variables then indicates
the number of factors. The question now is: how are the axes (i.e., factors) determined in
their positions with respect to the relevant vectors (i.e., variables)? The best way to do
this is to imagine a half-open umbrella. The struts of the umbrella frame, all pointing in
a certain direction and representing the variables, can also be represented approximately
by the umbrella stick.
Figure 7.7 illustrates the case of two variables. The correlation between the two var-
iables is 0.5, which corresponds to an angle of 60° between the two vectors OA and
OB. The vector OC is a good representation of the two vectors OA and OB and thus
represents the factor (i.e., the resultant or the total when two or more vectors are added).
The angles of 30° between OA and OC as well as OB and OC indicate the correlations
between the variables and the factor. This correlation is called factor loading and here
equals cos 30° = 0.866.

7.2.2.2 Fundamental Theorem of Factor Analysis


For a general explanation of the fundamental theorem of factor analysis we start with the
assumption that the initial data have been standardized.6 Factor analysis now assumes
that each observed value (zij) of a standardized variable j in person i can be represented
as a linear combination of several (unobserved) factors. We can express this idea with the
following equation:
zij = aj1 · pi1 + aj2 · pi2 + … + ajQ · piQ = Σq=1..Q ajq · piq    (7.5)

with

zij standardized value of observation i for variable j


ajq weight of factor q for variable j (i.e., factor loading of variable j on factor q)
piq value of factor q for observation i

The factor loadings ajq indicate how strongly a factor is related to an initial variable.
Statistically, factor loadings therefore correspond to the correlation between an observed
variable and the extracted factor, which was not observed. As such, factor loadings are a
measure of the relationship between a variable and a factor.
We can express Eq. (7.5) in matrix notation:

Z = P · A′ (7.6)
The matrix of the standardized data Z has the dimension (N × J), where N is the number
of observations (cases) and J equals the number of variables. We observe the standardized
data matrix Z, while the matrices P and A are unknown and need to be determined. Here,
P reflects the matrix of the factor scores and A is the factor loading matrix.
In Eq. (7.1) we showed that the correlation matrix R can be derived from the stand-
ardized variables. When we substitute Z by Eq. (7.6), we get:

R = 1/(N − 1) · Z′ · Z = 1/(N − 1) · (P · A′)′ · (P · A′)
  = 1/(N − 1) · A · P′ · P · A′ = A · (1/(N − 1) · P′ · P) · A′    (7.7)

Since we use standardized data, 1/(N − 1) · P′ · P in Eq. (7.7) is again a correlation matrix.
More specifically, it is the correlation matrix of the factors, and we label it C. Thus, we can write:
R = A · C · A′ (7.8)

6 If a variable xj is transformed into a standardized variable zj, the mean value of zj = 0 and the var-
iance of zj = 1. This results in a considerable simplification in the representation of the following
relationships. See the explanations on standardization in Sect. 1.2.1.

Table 7.8  Correlation matrix including corresponding angles in example 3
x1 x2 x3 x4 x5
x1 1 10° 70° 90° 100°
x2 0.985 1 60° 80° 90°
x3 0.342 0.500 1 20° 30°
x4 0.000 0.174 0.940 1 10°
x5 −0.174 0.000 0.866 0.985 1

The relationship expressed in Eq. (7.8) is called the fundamental theorem of factor anal-
ysis, which states that the correlation matrix of the initial data can be reproduced by the
factor loading matrix A and the correlation matrix of the factors C.
Generally, factor analysis assumes that the extracted factors are uncorrelated. Thus, C
corresponds to an identity matrix. The multiplication of a matrix with an identity matrix
results in the initial matrix, and therefore Eq. (7.8) may be simplified to:

R = A · A′ (7.9)
Assuming independent (uncorrelated) factors, the empirical correlation matrix can be
reproduced by the factor loadings matrix A.

7.2.2.3 Graphical Factor Extraction


In the following, a further example is used to show graphically how factors can be
extracted when three or more variables have been collected. Again, we illustrate the cor-
relations graphically and choose the vector representation for the correlations between
variables and factors.

Example 3: correlation matrix of five variables

We now observe five variables, and the correlations are chosen in such a way that
we can depict the interrelations in two-dimensional space—which will hardly be the
case in reality.7 Table 7.8 shows the correlation matrix for the example, with the upper
triangular matrix containing the angle specifications belonging to the correlations. ◄

Figure 7.8 visualizes the correlations of example 3 in a two-dimensional space.


To extract the first factor, we search for the center of gravity (centroid) of the five vec-
tors which is actually the resultant of the five vectors. If the five vectors represented five
ropes tied to a weight in 0 and five persons were pulling with equal strength on each one
of the ends of the ropes, the weight would move in a certain direction. This direction is
indicated by the dashed line in Fig. 7.9, which is the graphical representation of the first
factor (factor 1).

7 Please note that example 3 does not correspond to the application example in Sect. 7.2.1.

Fig. 7.8 Graphical representation of the correlations in example 3 (vectors x1 to x5 positioned according to the angles in Table 7.8)

Fig. 7.9 Graphical representation of the center of gravity (the resultant F1 of the five vectors and the orthogonal second factor F2; angles of 45°12′ are marked)



We can derive the factor loadings with the help of the angles between the variables
and the vector of the first factor. For example, the angle between the first factor and x1
equals 55°12′ (= 45°12′ + 10°), which corresponds to a factor loading of 0.571. Table 7.9
shows the factor loadings for all five variables.
Since factor analysis searches for factors that are independent (uncorrelated), a sec-
ond factor should be orthogonal to the first factor (Fig. 7.9). Table 7.10 shows the fac-
tor loadings for the corresponding second factor (factor 2). The negative factor loadings
of x1 and x2 indicate that the respective factor 2 is negatively correlated with the corre-
sponding variables.
If the extracted factors fully explained the variance of the observed variables, the sum
of the squared factor loadings for each variable would be equal to 1 (so-called unit vari-
ance). This relationship can be explained as follows:

1. By standardizing the initial variables, we end up with a mean of 0 and a standard devia-
tion of 1. Since the variance is the squared standard deviation, the variance also equals 1:
sj² = 1

2. The variance of each standardized variable j is the main diagonal element of the correla-
tion matrix (variance-covariance-matrix) and it is the correlation of a variable with itself:

sj² = 1 = rjj.

3. If the factors completely reproduce the variance of the initial standardized variables,
the sum of the squared factor loadings will be 1.

Table 7.9  Factor loadings for the one-factor solution in example 3
Variable Factor 1
x1 0.571
x2 0.705
x3 0.967
x4 0.821
x5 0.710

Table 7.10  Factor loadings for the two-factor solution in example 3
Factor 1 Factor 2
x1 0.571 −0.821
x2 0.705 −0.710
x3 0.967 0.255
x4 0.821 0.571
x5 0.710 0.705

Fig. 7.10 Graphical presentation of the case in which all variances in the variables are explained (vectors x1 (OA) and x2 (OB) at 60°; resultant 1 (factor 1, OC) at 30° to each vector; resultant 2 (factor 2, OD) orthogonal to factor 1)

To illustrate this, let us take an example where two variables are reproduced by two fac-
tors (Fig. 7.10). The factor loadings are the cosine of the angles between the vectors
reflecting the variables and factors. For x1, the factor loadings are 0.866 (=cos 30°) for
factor 1 and 0.5 (=cos 60°) for factor 2. The sum of the squared factor loadings is 1
(= 0.866² + 0.5²). According to Fig. 7.10, we can express the factor loadings of x1 on the
factors 1 and 2 as follows:

OC/OA for factor 1 and OD/OA for factor 2. If the two factors completely reproduce the
standardized variance of the initial variables, the following relation has to be true:

(OC/OA)² + (OD/OA)² = 1,

which is actually the case.


Thus, the variance of a standardized output variable can also be calculated using fac-
tor loadings as follows:
sj² = rjj = aj1² + aj2² + … + ajQ² = Σq=1..Q ajq²    (7.10)

with

ajq factor loading of variable j on factor q

The factor loadings represent the model parameters of the factor-analytical model which
can be used to calculate the so-called model-theoretical (reproduced) correlation matrix
(R̂). The parameters (factor loadings ajq) must now be determined in such a way that the
difference between the empirical correlation matrix (R) and the model-theoretical cor-
relation matrix (R̂), which is calculated with the derived factor loadings, is as small as
possible (cf. Loehlin, 2004, p. 160). The objective function is therefore:

F = R − R̂ → Min.!

7.2.2.4 Mathematical Methods of Factor Extraction


In Sect. 7.1 it was pointed out that factor analysis can be used to pursue two different
objectives:

1. to reduce a large number of correlated variables to a smaller set of factors (dimensions).


2. to reveal the causes (factors) responsible for the correlations between variables.

For objective 1 we use principal component analysis (PCA). We look for a small number
of factors (principal components) which preserve a maximum of the variance (informa-
tion) contained in the variables. Of course, this requires a trade-off between the smallest
possible number of factors and a minimal loss of information. If we extract all possible
components, the fundamental theorem in its simplified form of Eq. (7.9) applies:

R = A · A′
with

R correlation matrix
A factor-loading matrix
A′ transposed factor-loading matrix

For objective 2, we use factor analysis (FA) defined in a narrower sense. The factors are
interpreted as the causes of the observed variables and their correlations. In this case,

it is assumed that the factors do not explain all the variance in the variables. Thus, the
correlation matrix cannot be completely reproduced by the factor loadings and the funda-
mental theorem is transformed to:

R = A · A′ + U (7.11)
where U is a diagonal matrix that contains unique variances of the variables that cannot
be explained by the factors.8
While principal component analysis (objective 1) pursues a more pragmatic purpose
(data reduction), factor analysis (objective 2) is used in a more theoretical context (find-
ing and investigating hypotheses). So, many researchers strictly separate between princi-
pal component analysis and factor analysis and treat PCA as a procedure independent of
FA, and indeed, PCA and FA are based on fundamentally different theoretical models.
But both approaches follow the same steps (cf. Fig. 7.3) and use the same mathematical
methods. Besides, they usually also provide very similar results. For these reasons, PCA
is listed in many statistical programs as the default extraction procedure in the context of
factor analysis (as it is in SPSS).

7.2.2.4.1 Principal component analysis (PCA)


The basic principle of principal component analysis (PCA) is illustrated by an example
with 300 observations of two variables, which were first standardized. The variance of a
variable is a measure of the information that is contained in the variable. If a variable has
a variance of zero, it does not contain any information. Otherwise, after standardization,
each variable has a variance of 1.
Figure 7.11 shows the scatterplot of the two standardized variables Z1 and Z2. Each
point represents an observation. Furthermore, the straight line (solid line) represents
the first principal component (PC_1) of these two variables. It minimizes the distances
between the observations and the straight line. This line accounts for the maximum var-
iance (information) contained in the two variables. The variance of the projections of
the observed points on the solid line is equal to s² = 1.596. Since the variance of each
standardized variable is 1, the total variance of the data is 2, so the line explains 80%
(1.596/2 = 0.80) of the total variance.
Figure 7.11 also contains the second principal component (PC_2) (dashed line),
which was determined perpendicular (orthogonal) to the first principal component. The
second principal component can explain the remaining 20% of the total information (var-
iance). Thus, PC_2 represents a significantly lower share of the total information in the
data set compared to PC_1. In favor of a more parsimonious presentation of the data,
PC_2 could also be omitted. The example clearly shows that PCA tries to reproduce a
large part of the variance in a data set with only one or a few components.

8 Standardized variables with a unit variance of 1 are assumed. For the decomposition of the vari-
ance of an output variable, see also the explanations in Sect. 7.2.2.4.2 and especially Fig. 7.13.

Fig. 7.11 Scatterplot of 300 observations and two principal components (axes: standardized variables Z1 and Z2; solid line: first principal component PC_1, dashed line: second principal component PC_2)

This simple graphical illustration is also reflected in the fundamental theorem of


factor analysis, which, according to Eq. (7.9), can reproduce the empirical correlation
matrix (R) completely via the factor loading matrix (A) if as many principal components
as variables are extracted.

Example
Let us return to our example in Sect. 7.2.1 with the correlation matrix shown in Table 7.3.9
The upper part of Table 7.11 shows the factor loading matrix resulting from these data
if all five principal components (as many as there are variables) are extracted with PCA.
With these loadings, the correlation matrix can be reproduced according to Eq. (7.9).

9 Remember that we use standardized variables and therefore the variance of each variable is 1 and
the total variance in the data set is 5.

The lower part of Table 7.11 presents the squares of the component loadings (factor
loadings) (ajq²). If these are summed up over the rows (over the components), we get the
variance of a variable that is covered by the extracted components according to Eq. (7.9).
As the variance of a standardized variable is 1, and as all possible 5 components were
extracted (Q = J), the sum of the squared loadings for each variable equals 1. This sum is
called the communality of a variable j:
Communality of variable j:
hj² = Σq=1..Q ajq²    (7.12)

The communality is a measure of the variance (information) of a variable j that can be


explained by the components.
Of course, nothing would be gained if we extracted as many components as there are
variables, as our objective is a reduction of the data or dimensions. But the communality
tells us how much variance (information) of any variable is explained by the smaller set
of components (Q < J).

Table 7.11  Component loadings and squared loadings of five principal components


Component loadings
1 2 3 4 5
Milky 0.937 −0.229 −0.223 −0.138 0.017
Melting 0.843 −0.160 0.514 0.004 0.004
Artificial 0.929 −0.254 −0.233 0.137 −0.015
Fruity 0.342 0.936 −0.001 −0.026 −0.079
Refreshing 0.277 0.957 −0.028 0.029 0.078
Squared component loadings
1 2 3 4 5 Communalities
Milky 0.879 0.053 0.050 0.019 0.000 1.000
Melting 0.710 0.026 0.264 0.000 0.000 1.000
Artificial 0.862 0.064 0.054 0.019 0.000 1.000
Fruity 0.117 0.876 0.000 0.001 0.006 1.000
Refreshing 0.077 0.915 0.001 0.001 0.006 1.000
Eigenvalues 2.645 1.934 0.369 0.039 0.013 5.000
EV shares 52.9% 38.7% 7.4% 0.8% 0.3% 100%
Cumulated 52.9% 91.6% 99.0% 99.7% 100%

Fig. 7.12 Scree plot of PCA

The sum of the squared loadings over the variables yields the eigenvalue of a compo-
nent q. The eigenvalue can be interpreted as a measure of the information contained in a
factor.10 The following applies:
Eigenvalue of component q:
λq = Σj=1..J ajq²    (7.13)

The eigenvalue divided by the number of variables gives the eigenvalue share of a com-
ponent. Cumulated over the components, it tells us how much information is explained
by the extracted components.
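Communalities (Eq. 7.12) are the row sums and eigenvalues (Eq. 7.13) the column sums of the squared loadings. The following sketch applies this to the loadings of the first two principal components from Table 7.11; because the loadings are rounded to three decimals, the results match Tables 7.11 and 7.12 only up to rounding.

import numpy as np

# Loadings of the first two principal components (Table 7.11)
A = np.array([[0.937, -0.229],   # milky
              [0.843, -0.160],   # melting
              [0.929, -0.254],   # artificial
              [0.342,  0.936],   # fruity
              [0.277,  0.957]])  # refreshing

communalities = (A**2).sum(axis=1)        # Eq. (7.12): approx. [0.930, 0.736, 0.928, 0.993, 0.993]
eigenvalues   = (A**2).sum(axis=0)        # Eq. (7.13): approx. [2.645, 1.934]
ev_shares     = eigenvalues / A.shape[0]  # eigenvalue shares: approx. [0.529, 0.387]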
Here we see that 91.6% of the information is preserved if only two principal com-
ponents are extracted. With three components we can even preserve 99% of the infor-
mation. But the third component has an eigenvalue of only 0.369 (7.4%). Thus it seems
justified to drop it and restrict the procedure to only two factors for parsimony.
This can be visualized by a scree plot as shown in Fig. 7.12. A scree plot is a line plot
of the eigenvalues over the number of extracted components or factors. The third eigen-
value—and all the following eigenvalues—account only for small amounts of information.
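The eigenvalues and component loadings discussed here can be derived directly from the correlation matrix R of Table 7.3: in PCA the eigenvalues of R are computed first, and the loadings are then the eigenvectors scaled by the square roots of their eigenvalues (the signs of the columns are arbitrary). A minimal numpy sketch:

import numpy as np

R = np.array([[1.000, 0.712, 0.961, 0.109, 0.044],
              [0.712, 1.000, 0.704, 0.138, 0.067],
              [0.961, 0.704, 1.000, 0.078, 0.024],
              [0.109, 0.138, 0.078, 1.000, 0.983],
              [0.044, 0.067, 0.024, 0.983, 1.000]])  # Table 7.3

eigval, eigvec = np.linalg.eigh(R)        # eigenvalues and eigenvectors of R
order = np.argsort(eigval)[::-1]          # sort in descending order
eigval, eigvec = eigval[order], eigvec[:, order]

loadings = eigvec * np.sqrt(eigval)       # component loadings (cf. Table 7.11, up to column signs)
print(np.round(eigval, 3))                # approx. [2.645 1.934 0.369 0.039 0.013]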
PCA chooses the first principal component in such a way as to account for the max-
imum possible amount of information contained in the variables. Then the second
principal component is chosen so that it accounts for the maximum proportion of the

10 Mathematically, the eigenvalues are calculated first (this is a standard problem of mathematics)


and then the components or factors are derived.

Table 7.12  Initial and extracted communalities of the first two principal components (Q = 2) in
the application example
Principal component analysis
Initial communalities Extracted communalities
Milky 1.000 0.931
Melting 1.000 0.736
Artificial 1.000 0.927
Fruity 1.000 0.993
Refreshing 1.000 0.992

remaining information. This second principal component must be perpendicular (orthog-


onal) to the first principal component.
Each further principal component is extracted so that it again takes the maximum
amount of the remaining information and is perpendicular to the preceding principal
components. So each successive principal component can only account for an increas-
ingly smaller amount of information and will, therefore, be of minor importance. The
same applies to the factors in factor analysis.
Table 7.12 shows the communalities of the variables if only two components are
extracted. We learn that with only two principal components more than 90% of the vari-
ance of each variable can be explained, except for the variable ‘melting’.
Thus, the objective of PCA becomes clear: Finding a small number of components
(reduction of the data) while preserving a maximum of information or minimizing the
loss of information. In summary, it can be said that PCA aims at

• representing a large number of variables by a much smaller number of principal components
• that are uncorrelated (in contrast to the observed variables) and thus contain no redun-
dant information,
• while preserving a maximum of information contained in the variables.

For interpreting the principal components, the following question should be answered:
Which collective term can be found for the variables that load highly on a principal
component?11
PCA is also often used to ensure the independence of the variables used for expla-
nation, which is usually required in linear models. The assumptions of linearity and
independence are especially important for the dependency-analytical procedures of
multivariate data analysis such as regression analysis, discriminant analysis or logistic

11 On the problem of interpreting factors or main components, also compare Sect. 7.2.3 and the
case study in Sect. 7.3.3.4.

regression. If the independent variables considered in these procedures are correlated,


PCA can be used to form factors that are independent.

7.2.2.4.2 Factor-Analytical Approach
In contrast to PCA, factor analysis aims at a more theoretical objective and is interested
in uncovering the causes (factors) of the observed variables and their correlations. As
the variables usually cannot be measured without any error, they cannot be completely
explained by the factors.
Moreover, it is assumed in factor analysis that each observed variable also contains
a specific variance that cannot be explained by the common factors. Error variance and
specific variance together make up the unique variance (also called residual variance) of
a variable j that cannot be explained by the common factors. For a standardized variable,
the unique variance equals 1 minus the communality.
Thus, in factor analysis the communalities of the variables cannot be 1, but are always
smaller than 1, even if all possible factors are extracted. Figure 7.13 illustrates how the
variance of a variable j is decomposed in the factor-analytical model.
According to Eq. (7.11), for the factor-analytical model the following applies:

R = A · A′ + U
The matrix U contains the unique variances which cannot be explained by the extracted
factors. It follows that, in contrast to PCA, the empirical correlation matrix R cannot be
completely reproduced by the factor loadings.
The empirical correlation matrix contains values of 1 on its diagonal (see Table 7.3).
These are the variances of the standardized variables. As factor analysis assumes that
the total variance cannot be explained by the common factors, the diagonal has to be
substituted by the communalities of the variables, which must be less than 1. This has to
be done before the extraction of factors. But the communalities are only known after the
extraction of factors. This is the communality problem in factor analysis.

Fig. 7.13 Decomposition of the unit variance of a variable j: the total variance splits into the common variance (communality a²j1 + a²j2 + … + a²jQ, the portion explained by all extracted factors) and the unique variance (specific variance spec²j plus error variance e²j, the portion that cannot be explained by the extracted factors, i.e. 1 − communality)



To overcome this problem we have to estimate initial communalities as starting val-


ues that are iteratively improved by the algorithm of factor analysis. There are different
methods to determine these initial communalities. The most common one is to use the
multiple squared correlation coefficient of each observed variable j. This corresponds to
the coefficient of determination (R-square), which results from considering each variable
j as a dependent variable and then performing a regression (see Chap. 2) with all other
variables. The estimation of the initial communalities via regressions is also a very plau-
sible approach, because the coefficient of determination is a measure for the variance
which a variable j has in common with the other variables.

Communalities in the application example


If in the application example a regression analysis is estimated with the variable ‘milky’
as the dependent variable and the other four variables as independent variables, the fol-
lowing regression equation results:
milky = b0 + b1 · melting + b2 · artificial + b3 · fruity + b4 · refreshing
This regression results in a coefficient of determination of 0.931 (see Table 7.13), which
means that 93.1% of the variance of the variable ‘milky’ can be explained by the other
variables. Using this method, we get the initial communalities listed in the fourth col-
umn of Table 7.13. For comparison, the second and third columns list the values for PCA
from Table 7.12.
As mentioned, in factor analysis the loadings and the communalities are calculated
by an iterative procedure. Different extraction methods are used for this purpose. A basic
method is principal axis factoring (PAF). When extracting two factors using PAF, we
receive the final communalities shown in the last column of Table 7.13.
Table 7.14 shows the squared factor loadings along with the eigenvalues and eigen-
value shares for the two-factor solution when using PAF. Comparing these values with
the corresponding values derived by PCA in Table 7.11, we see that smaller values will
result for eigenvalues and eigenvalue shares. Thus, a smaller percentage of the total var-
iance is explained by the factors compared to the variance explained by the principal

Table 7.13  Comparison of the communalities derived by PCA and PAF for Q = 2


Principal component analysis Principal axis factoring
Initial Extracted Initial Extracted
communalities communalities communalities communalities
Milky 1.000 0.931 0.931 0.968
Melting 1.000 0.736 0.541 0.526
Artificial 1.000 0.927 0.929 0.953
Fruity 1.000 0.993 0.974 0.991
Refreshing 1.000 0.992 0.973 0.981

Table 7.14  Squared loadings of the two-factor-solution when using PAF


Squared Component loadings Communalities
1 2
Milky 0.890 0.079 0.968
Melting 0.499 0.026 0.526
Artificial 0.862 0.091 0.953
Fruity 0.152 0.839 0.991
Refreshing 0.104 0.876 0.981
Eigenvalues 2.51 1.91 4.42
EV shares 50.148% 38.225% 88.363%
Cumulated 50.148% 88.363%

components. The cumulative explained percentage of the first two components is 91.6%,
while the first two factors explain only 88.363%.
As Table 7.13 shows, in the present example the results of PCA and PAF are rather
similar. Also, a scree plot of PAF will show no visible difference to the scree plot of
PCA.12 The only exception is the lower loading of the variable ‘melting’ on factor 1,
which results in a lower communality for ‘melting’ in factor analysis.

Extraction methods in factor analysis


Different extraction methods can be used to perform a factor analysis, but all of them
require starting values for the iterative estimation of the communalities. Usually all
methods determine the initial values for the communalities according to the coefficient
of determination mentioned above. Table 7.15 provides an overview of the common
extraction methods.
Principal axis factoring (PAF), which can be described as the basic method of fac-
tor analysis, has gained particular importance. PAF extracts the factor loadings from the
empirical correlation matrix and uses them to estimate the communalities of the varia-
bles. This process is repeated until the estimates of the communalities converge or until a
specified maximum number of iterations has been performed.13

12 In contrast, the case study shows major differences between PCA and PAF (cf. Sect. 7.3.3.4).
13 In contrast to PAF, the aim of the ML, GLS and ULS methods is to determine the factor loadings
in such a way that the difference between the empirical correlation matrix (R) and the model-the-
oretical correlation matrix (R̂) is minimal. In alpha factorization, Cronbach’s alpha is maximized,
and image factorization is based on the image of a variable. The listed procedures are all imple-
mented in SPSS, with PCA included as a further extraction procedure (see Fig. 7.21).
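The iterative logic of PAF can be written down in a few lines of numpy: start from the squared multiple correlations as initial communalities, place them on the diagonal of R, extract Q factors from this reduced matrix, recompute the communalities from the loadings, and repeat until they converge. This is only a sketch of the principle, not the exact routine implemented in SPSS; applied to the correlation matrix of Table 7.3 with two factors, it should yield communalities close to those in Table 7.13.

import numpy as np

def principal_axis_factoring(R, n_factors, max_iter=100, tol=1e-6):
    """Minimal sketch of principal axis factoring (PAF) on a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    # initial communalities: squared multiple correlations (R-square of each variable)
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(max_iter):
        R_reduced = R.copy()
        np.fill_diagonal(R_reduced, h2)               # reduced correlation matrix
        eigval, eigvec = np.linalg.eigh(R_reduced)
        top = np.argsort(eigval)[::-1][:n_factors]    # the Q largest eigenvalues
        loadings = eigvec[:, top] * np.sqrt(np.clip(eigval[top], 0, None))
        h2_new = (loadings**2).sum(axis=1)            # updated communalities
        if np.max(np.abs(h2_new - h2)) < tol:         # convergence criterion
            return loadings, h2_new
        h2 = h2_new
    return loadings, h2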

Table 7.15  Common extraction methods for factor analysis


Principal axis factoring A method of extracting factors from the original correlation matrix
(PAF) with squared multiple correlation coefficients placed in the diago-
nal as initial estimates of the communalities. These factor load-
ings are used to estimate new communalities that replace the old
communality estimates in the diagonal. Iterations continue until the
changes in the communalities from one iteration to the next satisfy
the convergence criterion for extraction
Maximum likelihood (ML) A factor extraction method that produces parameter estimates that
are most likely to have produced the observed correlation matrix if
the sample is from a multivariate normal distribution. The correla-
tions are weighted by the inverse of the uniqueness of the variables,
and an iterative algorithm is employed
Unweighted least-squares A factor extraction method that minimizes the sum of the squared
method (ULS) differences between the observed and the reproduced correlation
matrices (ignoring the diagonals)
Generalized least-squares A factor extraction method that minimizes the sum of the squared
method (GLS) differences between the observed and the reproduced correlation
matrices. Correlations are weighted by the inverse of their unique-
ness, so that variables with high uniqueness are given less weight
than those with low uniqueness
Alpha factoring A factor extraction method that considers the variables in the anal-
ysis to be a sample from the universe of potential variables. This
method maximizes the alpha reliability of the factors
Image factoring A factor extraction method developed by Guttman (1953) and
based on image theory. The common part of the variable, called
the partial image, is defined as its linear regression on remaining
variables, rather than a function of hypothetical factors

In contrast to PCA, all other extraction procedures listed in Table 7.15 lead to a result
of communalities < 1, since all methods always take into account a residual variance per
variable. This means that a maximum of J – 1 factors can be extracted.
For the example in Sect. 7.2.2, Table 7.16 shows the initial values of the communalities
according to the above coefficient of determination of the variables as well as the result of
the iterative estimation after extracting the maximum number of factors using PAF.
In contrast to PCA (see Table 7.11), PAF only leads to communalities < 1, even when
the maximum number of factors is extracted, and the maximum number of extractable
factors is not J (in the example: J = 5 variables), but only J − 1 (in the example: 4),
since the individual residual factors (U) must be taken into account. For the application
example, PAF leads to an explained variance of 89.567% when four factors are extracted.
In contrast, PCA could explain 100% of the variance of the standardized initial variables
with five principal components.

Table 7.16  Communalities in the application example using PAF and the extraction of four
factors
Principal axis factoring
Initial communalities Extracted communalities
Milky 0.931 0.966
Melting 0.541 0.556
Artificial 0.929 0.961
Fruity 0.974 0.999
Refreshing 0.973 0.996

In summary, the estimation procedures of factor analysis search for factors that can be
regarded as the cause of the correlation between correlated variables. However, this means
that the correlation between two variables becomes zero when this cause (factor) is found.
Thus, the interpretation of an empirical correlation shown in Sect. 7.1 (Fig. 7.2), which is
assumed by factor analysis, becomes apparent: If, for example, the advertising expenditure
of a product (x1) and its sales (x2) are strongly correlated, factor analysis assumes that this
correlation cannot be taken as an indication of a dependency between variables x1 and x2 but
is due, for example, to a general price increase as the cause (factor) of the correlation. When
interpreting the factors, it should therefore be possible to answer the following question:
How can we describe the effect that causes the high loadings of variables on a factor?14

Decision criteria for using PCA or PAF


Ultimately, PCA and factor-analytical approaches (especially PAF) represent different
worlds. Therefore, the key question when conducting a factor analysis is: Which concept
should the decision-maker prefer? As always, the answer is: It depends. The decision is
influenced by

• the objective pursued with the factor analysis,


• the extent of the user’s a priori knowledge.

PCA should be preferred if the goal of factor analysis is data aggregation. If users do
not have specific a priori knowledge, they will not be able to divide the total variance
of a variable into a common and a single residual variance. In this case, the information
required for PAF cannot be provided. If, on the other hand, knowledge of the structure

14 Note that PCA, in contrast, requires a collective term for correlating variables. Cf. the presenta-
tion on the problem of the interpretation of factors in Sect. 7.2.3 and the notes on the case study in
Sect. 7.3.3.4.

of the total variance is available (e.g. information on the separation into common vari-
ance, specific variance and error variance), a PAF can be performed. Formally, there will
always be a difference in the main diagonal correlation matrix: in PCA, the main diago-
nal can reach a value of 1, while in PAF only values of less than 1 are possible.

7.2.2.5 Number of Factors
In the above procedures, it was assumed that the user knows the number of factors to be
extracted and therefore knows

• how the conflict between the objectives of the PCA can be resolved or
• how many causes exist which, in the case of PAF, explain the correlations between the
variables.

The number of factors required can also be derived from the user’s idea regarding the
percentage of variance in a data set that should be explained by a factor analysis (e.g. at
least 90%).
If this is known, the user can specify the number of factors (main components) to be
extracted. The factor loadings are then estimated according to the specified number of
factors. In all extraction procedures, the factors are always extracted in such a way that
the first factor combines a maximum of the variance of all variables. Accordingly, the
eigenvalue of the first factor is always the highest, followed by the eigenvalue of the sec-
ond factor, and so on. Thus, the eigenvalues always decrease successively.
If a user does not have a clear idea regarding the number of factors to be extracted,
the following auxiliary criteria can be used for decision-making.

Consideration of the initial eigenvalues as the starting point


In a first step, it is useful to consider the eigenvalues that arise when the entire variance
of a set of variables is to be explained. The question of the single residual factors (matrix
U) is deliberately neglected and therefore Q = J factors are extracted with the help of
PCA. Table 7.17 shows the results for our application example. It becomes clear that the
first two factors have eigenvalues greater than 1, while the eigenvalues of factors 3 to 5
are clearly below 1.

Table 7.17  Explanation of variance when extracting J = Q factors in the application example
Total % of Variance Cumulative %
Factor 1 2.645 52.903 52.903
Factor 2 1.934 38.678 91.581
Factor 3 0.369 7.374 98.955
Factor 4 0.039 0.786 99.741
Factor 5 0.013 0.259 100.000

Eigenvalue or Kaiser criterion


Since the variance of a single standardized variable is equal to 1 (the total variance set
in this case is thus 5), Kaiser (1970) recommends that only factors with an eigenvalue
greater than 1 should be extracted. Only in this case is a factor able to combine more var-
iance than a single initial variable. Based on this recommendation, which is also called
the eigenvalue criterion, two factors are to be extracted in our application example.
These two factors are able to explain (2.645 + 1.934)/5 = 91.58% of the variance of the
data set. If, however, the user wanted to explain at least 95% of the variance in the ini-
tial data set, according to Table 7.17 three factors would have to be extracted. In practice,
the eigenvalue criterion is usually the default setting of software programs for factor
analysis.
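Applying the Kaiser criterion programmatically is a one-liner once the initial eigenvalues are known; the small sketch below uses the eigenvalues from Table 7.17.

eigenvalues = [2.645, 1.934, 0.369, 0.039, 0.013]      # initial eigenvalues (Table 7.17)
n_factors = sum(ev > 1 for ev in eigenvalues)          # Kaiser criterion: eigenvalue > 1
explained = sum(eigenvalues[:n_factors]) / len(eigenvalues)
print(n_factors, round(explained, 4))                  # 2 factors, approx. 0.9158 explained variance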

Scree test
The so-called scree test also uses the initial eigenvalues as shown in Table 7.17 and dis-
plays them in a coordinate system. If a ‘kink’ or ‘elbow’ shows up in the sequence of
eigenvalues, this means that the difference between two eigenvalues is greatest there. It
is recommended to extract the number of factors to the left of the ‘elbow’. For the appli-
cation example, Fig. 7.14 shows that the largest difference in eigenvalues occurs between
numbers of factors 2 and 3. Therefore, two factors should be extracted according to the
scree test.
The basic idea of this approach is that the factors with the smallest eigenvalues are
considered ‘scree’ (i.e. unsuitable) and are therefore not extracted. However, the scree

Kaiser Criterion

Scree test

Fig. 7.14 Scree test for the application example



test does not always provide a clear solution and leaves room for subjective judgement.
For this reason, the Kaiser criterion is usually preferred and frequently used in empirical
studies.

7.2.2.6 Checking the Quality of a Factor Solution


The number of extracted factors allows a direct statement about how much variance can
be explained by a chosen factor solution. In the application example, for example, the
two-factor solution can explain 91.58% and the three-factor solution can explain 98.96%
of the total variance of the five original variables (see Table 7.17).
Furthermore, the differences between the empirical correlations and the correlations
reproduced with the help of the factor loadings can also be considered for assessment.
These difference values (residuals) should not be greater than 0.05 (see also Fig. 7.31 in
the case study).
We obtain the reproduced correlation matrix with the help of the fundamental the-
orem of factor analysis according to Eq. (7.9). Since the extracted communalities are
smaller than 1, the factor loading matrix cannot completely reproduce the original cor-
relation matrix R. For the correlation matrix R̂ reproduced by the model parameters,
R̂ = AA′ applies.
Table 7.18 shows the unrotated factor loading matrix A, which we obtain when two
factors are extracted in the application example using PAF.
The reproduced correlation matrix R̂ can be calculated by multiplying the factor load-
ing matrix (A) with its transposed matrix (A′), as shown for the application example in
Table 7.19. The main diagonal of Table 7.19 presents the communalities of the variables
(in bold script), which result at the end when two factors are extracted with the help of
PAF. For example, the communality of the variable ‘milky’ is 96.8%, i.e. 96.8% of the
variance of the variable ‘milky’ is explained by the two factors extracted. The non-diago-
nal elements reflect the correlations reproduced by the factor loadings.
The reproduced correlation matrix R̂ in Table 7.19 can now be compared with the
empirical correlation matrix R in Table 7.3. The result of this comparison is shown in
Table 7.20. We find that the differences between the original and the reproduced cor-
relations are very small. We thus conclude that the two-factor solution reproduces the

Table 7.18  Unrotated factor loadings of the two-factor solution in the application example (PAF)
Factor loadings
F1 F2
Milky 0.943 −0.280
Melting 0.707 −0.162
Artificial 0.928 −0.302
Fruity 0.389 0.916
Refreshing 0.323 0.936

Table 7.19  Reproduced correlation matrix (R̂) based on the factor loadings matrix
Milky Melting Artificial Fruity Refreshing
Milky 0.968*
Melting 0.712 0.526*
Artificial 0.960 0.705 0.953*
Fruity 0.110 0.127 0.085 0.991*
Refreshing 0.042 0.077 0.017 0.983 0.981*
*: Communalities of variables after extraction of two factors using PAF

Table 7.20  Differences between reproduced and original correlations


Milky Melting Artificial Fruity Refreshing
Milky
Melting 0.000
Artificial 0.001 −0.001
Fruity −0.001 0.011 −0.006
Refreshing 0.002 −0.010 0.006 0.000

original correlation matrix in the application example very well. The two-factor solu-
tion is therefore suitable for describing the five initial variables without a large loss of
information.
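The comparison in Tables 7.19 and 7.20 follows directly from the fundamental theorem: R̂ = A · A′ with the unrotated loadings of Table 7.18, and the residuals are the off-diagonal elements of R − R̂. A short numpy sketch:

import numpy as np

A = np.array([[0.943, -0.280],   # milky       (unrotated PAF loadings, Table 7.18)
              [0.707, -0.162],   # melting
              [0.928, -0.302],   # artificial
              [0.389,  0.916],   # fruity
              [0.323,  0.936]])  # refreshing

R = np.array([[1.000, 0.712, 0.961, 0.109, 0.044],
              [0.712, 1.000, 0.704, 0.138, 0.067],
              [0.961, 0.704, 1.000, 0.078, 0.024],
              [0.109, 0.138, 0.078, 1.000, 0.983],
              [0.044, 0.067, 0.024, 0.983, 1.000]])  # empirical correlations (Table 7.3)

R_hat = A @ A.T            # reproduced correlation matrix; diagonal = communalities (Table 7.19)
residuals = R - R_hat      # differences between original and reproduced correlations (Table 7.20)
print(np.round(R_hat, 3))
print(np.round(residuals, 3))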

7.2.3 Interpreting the Factors

(Process overview, cf. Fig. 7.3; current step 3: Interpreting the factors)

Once we have decided on the number of factors, we need to interpret the extracted fac-
tors. We use the factor loadings matrix to do so since high factor loadings indicate that a
variable is strongly correlated with a factor. In the application example, factor 1 is related
to the variables ‘milky’, ‘melting’, and ‘artificial’. Since the factor loadings are the result
of principal axis factoring, we need to answer the question how to label the effect that
causes the high factor loadings.
In this case, all three variables seem to relate to a factor that could be called ‘texture’,
but also ‘unhealthy aspects’. The second factor is related to the variables ‘fruity’ and

‘refreshing’ and might thus be labeled as ‘taste experience’ or ‘aroma’. At this point, it
becomes clear that the interpretation of the factors requires a high level of expertise and
some creativity on the part of the user.

Cross-loadings
Interpreting the factor solution is also difficult if the factor loadings do not clearly
indicate to which factor a certain variable belongs. This is the case if a variable loads
highly (in absolute terms) on more than one factor. High correlations on multiple fac-
tors are called cross-loadings. Cross-loadings make factor interpretation more difficult. If
cross-loadings occur, we have to decide which factor the variable should be assigned to.
There is a rule of thumb that can help us make this decision: The absolute value
of a factor loading should be greater than 0.5 to be relevant for a factor. If a variable
has absolute factor loadings greater than 0.5 for several factors, the variable should be
assigned to each one of these factors. In this case, however, a meaningful interpretation
of the factors may not be possible.

Factor rotation
If the assignment of variables to factors is ambiguous, factor rotation may be employed,
i.e., the factor vectors are rotated. If the coordinate system in Fig. 7.9 that is repre-
sented by the factor vectors is rotated around its origin, we get, for example, Fig. 7.15.
The angles between x1 and F1 as well as x2 and F1 are substantially smaller, and, thus,
the factor loadings are increased. The same applies to the variables x3, x4, and x5 and

Fig. 7.15 Example of a rotated factor solution (after rotation, factor F1 lies close to the vectors x1 and x2, and factor F2 lies close to x3, x4, and x5)



their angles with factor 2 (F2), and consequently for the factor loadings. Ultimately, the
interpretation of the factors is considerably easier.
There are basically two different options for rotating the coordinate system and
hence the factor vectors:

1. We can assume that the factors remain uncorrelated with each other. The factor
vectors maintain a 90° angle to each other. We call this method orthogonal (rec-
tangular) rotation. The most popular orthogonal rotation method is the so-called
varimax method.
2. If, however, a correlation between the rotated axes or factors is assumed, the vectors
of the factors are rotated at an oblique angle (<90°) to each other. Such rotation meth-
ods are called oblique rotation. Statistical software packages such as SPSS offer vari-
ous oblique rotation methods (cf. Fig. 7.22).

Table 7.21 shows the unrotated and rotated factor loadings for the application example,
with the rotated factor loadings obtained via the varimax method. Now the factor load-
ings for the variables ‘milky’, ‘melting’, and ‘artificial’ on factor 1 (F1) have increased.
The same applies to the variables ‘fruity’ and ‘refreshing’ and their factor loadings on
factor 2 (F2). Despite a clear loading structure, factor interpretation is not easy. Readers
should try to find their own interpretation. We offer the following labels: ‘texture’ for F1
and ‘aroma’ for F2.
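Numerically, an orthogonal rotation simply post-multiplies the loading matrix by an orthogonal transformation matrix T, which leaves the communalities (the row sums of squared loadings) unchanged. The following sketch (Python/NumPy, with an arbitrarily chosen rotation angle that does not reproduce the varimax optimum) illustrates this invariance for the unrotated loadings of Table 7.21:

import numpy as np

# Unrotated loadings from Table 7.21
A = np.array([[0.943, -0.280], [0.707, -0.162], [0.928, -0.302],
              [0.389,  0.916], [0.323,  0.936]])

# Orthogonal rotation by an angle phi: A_rot = A @ T with T'T = I
phi = np.deg2rad(17.0)          # illustrative angle, not the varimax optimum
T = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])
A_rot = A @ T

# The communalities are invariant under orthogonal rotation
print(np.round((A ** 2).sum(axis=1), 3))      # before rotation
print(np.round((A_rot ** 2).sum(axis=1), 3))  # after rotation: identical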

7.2.4 Determining the Factor Scores

1 Evaluating the suitability of data

2 Extracting the factors and determining their number

3 Interpreting the factors

4 Determining the factor scores

Table 7.21  Unrotated and rotated factor loadings for the application example (PAF)
Unrotated factor loadings Rotated factor loadings
F1 F2 F1 F2
Milky 0.943 −0.280 0.984 0.032
Melting 0.707 −0.162 0.722 0.070
Artificial 0.928 −0.302 0.976 0.007
Fruity 0.389 0.916 0.080 0.992
Refreshing 0.323 0.936 0.011 0.990

For a multitude of questions it is of great interest not only to reduce the variables to a
smaller number of factors and to find labels for the factors, but also to determine how the
objects score on the factors.
Factor analysis aims to present the standardized initial data matrix Z as a linear
combination of factors: Z = P · A′. So far, we have focused on determining A, the fac-
tor loadings matrix. Since Z is given, we still need to determine the matrix of factor
scores, P. P contains the estimated ratings of the respondents with regard to the factors
found. P thus answers the following question: How would the respondents have rated the
factors if they had had the opportunity to do so?
To retrieve P, we proceed with

Z · (A′)−1 = P · A′ · (A′)−1 (7.14)

As A′ · (A′)−1 by definition equals an identity matrix E, we get:

Z · (A′)−1 = P · E (7.15)

Because of P · E = P,

P = Z · (A′)−1 (7.16)

However, since A′ is usually not square (we extract fewer factors than there are variables), the inverse (A′)−1 cannot be computed. Therefore, we instead multiply the fundamental equation Z = P · A′ by A:

Z · A = P · A′ · A (7.17)

The matrix (A′ · A) is by definition square and therefore invertible:

Z · A · (A′ · A)−1 = P · (A′ · A) · (A′ · A)−1 (7.18)

As (A′ · A) · (A′ · A)−1 equals an identity matrix, we can write:

P = Z · A · (A′ · A)−1 (7.19)


The term A·(A′ · A)−1 describes the translation process. However, difficulties may arise
when solving Eq. (7.19), and we may thus use simplified approaches to determine P. In
the following, we will consider three different approaches: surrogates, summated scales
and regression analysis. All three approaches rely on the factor loadings but view these
values differently.

Surrogates
Using surrogates is the simplest (but in most cases not the best) way of determining fac-
tor scores. When using this approach, we take the highest loading of a variable on a fac-
tor as a surrogate for its factor score. Yet, surrogates are only rough proxies for factor
scores since we assume that a single variable represents the latent factor. In the appli-
cation example, using surrogates would mean that the highest loading on each factor is
taken as a surrogate (cf. Table 7.21). Consequently, ‘milky’ would represent factor 1 (F1)

and ‘fruity’ would represent factor 2 (F2), with the rotated factor loadings of 0.984 and
0.992, respectively. However, surrogates are only appropriate if the highest factor load-
ing is a dominant value, i.e., the factor loadings of all other variables should be sub-
stantially lower than the surrogate’s factor loading. For factor 1, we observe the second
highest loading for ‘artificial’ with a value of 0.928. For factor 2, the second highest
loading is 0.990 for ‘refreshing’ (cf. Table 7.21). Both these factor loadings are close to
the highest factor loading, and thus using surrogates does not seem appropriate in this
case.

Summated scales
Alternatively, we can use summated scales. Summated scales combine different varia-
bles measuring the same concept into a single construct. They are calculated by taking
the mean of the high-loading variables of each factor. In our case, with two factors, we
expect two summed scales. The first scale can be derived directly from Table 7.21.

F1 = (0.984 + 0.976 + 0.722)/3 = 2.682/3 = 0.894

F2 = (0.992 + 0.990)/2 = 1.982/2 = 0.991

Summed scales, like surrogates, are based on information contained in the factor load-
ings. While surrogates focus on the highest loadings per factor, summed scales are based
on multiple loadings. However, neither makes use of the full information provided by the
factor loadings. This is the subject of regression analysis.

Regression analysis
Another approach for deriving factor scores is to use regression analysis (see Chap. 2).
The basic idea of this approach is illustrated in Fig. 7.16.
If we multiply the matrix of the standardized initial data (Z) with the matrix of the
regression coefficients (so-called factor score coefficients), we obtain the matrix of the
factor scores P. For the application example, we obtain the factor score matrix P (size
30 × 2) by multiplying the standardized data matrix Z (size 30 × 5) with the regression
coefficients listed in Table 7.22 (size 5 × 2).
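The mechanics of Fig. 7.16 can be written down in a few lines. The sketch below (Python/NumPy, illustrative only; Z is filled with random numbers here solely so that the example runs, since the standardized ratings are not repeated) multiplies a standardized 30 × 5 data matrix by the 5 × 2 coefficient matrix of Table 7.22 and yields the 30 × 2 factor score matrix P:

import numpy as np

# Factor score coefficients from Table 7.22 (rows: milky, melting,
# artificial, fruity, refreshing; columns: F1, F2)
W = np.array([
    [ 0.551, -0.049],
    [ 0.015, -0.010],
    [ 0.422,  0.001],
    [ 0.261,  0.673],
    [-0.281,  0.331],
])

# Z should hold the standardized ratings (30 persons x 5 variables);
# a random placeholder is used here only so that the sketch runs.
rng = np.random.default_rng(0)
Z = rng.standard_normal((30, 5))

P = Z @ W        # factor score matrix, size 30 x 2 (cf. Fig. 7.16)
print(P.shape)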
For our application example, Table 7.23 shows the factor scores of the two-factor
solution for the first three persons and the last person.
When interpreting the factor scores, we need to be aware that they represent standard-
ized values, due to the standardization of the original data matrix. For the interpretation
of the factor scores this implies that

• A negative factor score means that an object (product) is perceived by a respondent to


be below average in terms of this factor.
• A factor score of 0 means that the rating of an object (product) corresponds exactly to
the average rating of this object (product).
• A positive factor score means that an object (product) is rated by a respondent to be
above average in terms of this factor.

Fig. 7.16 The basic idea of using regression analysis for determining factor scores: initial standardized matrix Z (30 × 5) × regression coefficients A · (A′ · A)−1 (5 × 2) = factor scores P (30 × 2)

Table 7.22  Regression coefficients for determining factor scores

             F1      F2
Milky        0.551   –0.049
Melting      0.015   –0.010
Artificial   0.422   0.001
Fruity       0.261   0.673
Refreshing   –0.281  0.331

Table 7.23  Matrix P of person-related factor scores in the application example (excerpt)

             F1      F2
Person 1     –1.305  –1.347
Person 2     –0.520  –0.289
Person 3     0.614   0.205
…            …       …
Person 30    0.210   1.367
Mean         0.000   0.000
Variance     1.000   1.000

An indication for the “average” evaluation per variable (factor score 0) can be derived
from the means of the variables in the initial data set of the collected data which load on
a factor.

7.2.5 Summary: The Essence of Factor Analysis

Figure 7.17 summarizes the necessary steps when conducting a factor analysis and illus-
trates how to get from the original data matrix to the factor score matrix. Please note that
the size (length and width) of the boxes representing the various matrices reflects the size
of the matrix, that is, the number of rows and columns. Thus, the aim of factor analysis
is to transform the original data matrix into a matrix with fewer columns but the same
number of rows. In the application example, the original data matrix has 30 rows and 5
columns, while the factor score matrix has 30 rows and 2 columns.

7.3 Case Study

7.3.1 Problem Definition

We now use a larger sample related to the chocolate market to demonstrate how to con-
duct a factor analysis with the help of SPSS.15
A manager of a chocolate company wants to know how consumers evaluate different
chocolate flavors with respect to various attributes. For this purpose, the manager iden-
tified 11 flavors and selected 10 attributes that appear to be relevant for the evaluation of
these flavors.
A small pretest with 18 test persons was carried out. The persons were asked to eval-
uate the 11 flavors (chocolate types) with regard to the 10 attributes (see Table 7.24). A
seven-point rating scale (1 = low, 7 = high) was used for each attribute. Thus, the varia-
bles are perceived attributes of the chocolate types.
However, not all persons were able to evaluate all 11 flavors. Thus, the data set con-
tains only 127 evaluations instead of the complete number of 198 evaluations (18 per-
sons × 11 flavors). Every evaluation reflects the subjective assessment of all the 10
attributes with regard to a specific chocolate flavor by a particular test person. Since each
test person assessed more than just one flavor, the observations are not independent. Yet
for convenience, we will here treat the observations as such.

15 In the case study, the same data set is used as for discriminant analysis (Chap. 4), logistic regression (Chap. 5) and cluster analysis (Chap. 8). This is to better illustrate the similarities and differences between the different methods.
(Fig. 7.17 is a flow diagram: raw data matrix X → standardized data matrix Z → correlation matrix R → reduced correlation matrix R (with the estimated communalities on the principal diagonal) → factor loadings matrix A ("factor structures") → rotated factor loadings matrix A* → factor score matrix P. The transitions correspond to the communality problem, the extraction problem, the rotation problem and the estimation of the factor scores. X and Z contain the persons in the rows and the attributes in the columns; R is square, with as many rows and columns as there are attributes; A and A* contain the correlations between variables and factors and are generally not square, since the number of factors should be smaller than the number of attributes; P no longer contains the values of the persons on the original variables but their values on the extracted factors.)

Fig. 7.17 How to get from the original data matrix to the factor score matrix

Table 7.24  Chocolate flavors and attributes examined in the case study


Chocolate flavor Perceived attributes
1 Milk 1 Price
2 Espresso 2 Refreshing
3 Biscuit 3 Delicious
4 Orange 4 Healthy
5 Strawberry 5 Bitter
6 Mango 6 Light
7 Cappuccino 7 Crunchy
8 Mousse 8 Exotic
9 Caramel 9 Sweet
10 Nougat 10 Fruity
11 Nut

Of the 127 evaluations, only 116 are complete, while 11 evaluations contain missing
values (i.e., not all attributes of a flavor were evaluated).16 We exclude all incomplete
evaluations from the analysis. Consequently, the number of cases is reduced to 116.17
The manager of the chocolate company would like the survey to answer the following
three central questions:

• Do the 10 product attributes independently influence the judgement of a chocolate


variety or are there dependencies between the attributes?
• In case there are dependencies between the assessments of the 10 attributes, to which
central causes can these dependencies be traced back, and what are the independent
assessment factors?
• How can the 11 chocolate varieties be positioned against the background of the
assessment factors found in the pretest?

To answer his questions, the manager carries out a factor analysis. According to the
considerations in Sect. 7.3.1, PAF is chosen as the extraction method. For the final

16 Missing values are a frequent and unfortunately unavoidable problem when conducting surveys
(e.g. because people cannot or do not want to answer some of the questions, or as a result of mis-
takes by the interviewer). The handling of missing values in empirical studies is discussed in Sect.
1.5.2.
17 In the following the same data set is used as in the case study of discriminant analysis (Chap. 4),

logistic regression (Chap. 5) and cluster analysis (Chap. 8). Thus, it is easier to demonstrate the
similarities and differences between the methods.

positioning of the chocolate varieties, he uses the factor scores, averaged over the inter-
viewed persons per chocolate variety.

7.3.2 Conducting a Factor Analysis with SPSS

To conduct a factor analysis with SPSS, we go to ‘Analyze/Dimension Reduction’ and


choose ‘Factor’ (Fig. 7.18).
A dialog box opens, and we first select the variables that are supposed to be interre-
lated (‘Variables’; cf. Fig. 7.19).
Now we go to ‘Descriptives’ and select the descriptive statistics and the criteria to
assess the suitability of the data (Fig. 7.20). The default option ‘Initial solution’ provides
the basic output of a factor analysis. Since we want to assess the suitability of the data in

Fig. 7.18 Data editor with a selection of the procedure ‘Factor analysis’

Fig. 7.19 Dialog box: Factor analysis

Fig. 7.20 Dialog box: Descriptives

a first step, we further activate ‘Coefficients’ and ‘Significance levels’ to obtain the corre-
lation matrix and the significance levels of the correlations. Moreover, we activate ‘KMO
and Bartlett’s test of sphericity’ and we also request the anti-image matrix (‘Anti-image’)
and the reproduced correlation matrix (‘Reproduced’).

Next, we go to ‘Extraction’ to choose the method for the extraction of the factors. We
discussed that there are two distinct methods to extract the factors: principal components
analysis and principal axis factoring (cf. Sect. 7.2.2.3). Here we assume that specific var-
iances and measurement errors are relevant, and we thus choose ‘Principal axis factor-
ing’ (Fig. 7.21). SPSS offers several extraction methods, which were briefly described in
Table 7.15. Principal axis factoring (PAF) is chosen for the case study because the man-
ager is interested in the causes of the correlations (see Sect. 7.2.2.4.2). Additionally, we
select the option ‘Scree plot’, which helps to make the decision on the number of factors
to be extracted.
To obtain the rotated factor matrix, we open the dialog box ‘Factor Analysis:
Rotation’ and select ‘Varimax’ (Fig. 7.22). The varimax rotation results in uncorrelated
factors and helps with identifying the variables that belong to a specific factor.
Finally, we go to the dialog box ‘Factor Analysis: Factor Scores’ and select the
default option ‘Regression’ (Fig. 7.23). When the option ‘Regression’ is used, SPSS pro-
duces factor scores that have means of 0 and variances equal to the squared multiple cor-
relation between the estimated factor scores and the true factor scores. The scores may
be correlated even if factors are orthogonal. In this case, SPSS saves the estimated factor
scores as new variables to the SPSS data file. We will use the factor scores to position the
different chocolate flavors along the factors.

Fig. 7.21 Dialog box: Extraction



Fig. 7.22 Dialog box: Rotation

Fig. 7.23 Dialog box: Factor Scores

We further activate the option ‘Display factor score coefficient matrix’. By doing so,
we obtain the regression coefficients that are used to compute the factor scores in the
SPSS output window.
As an alternative, the option ‘Bartlett’ results in factor scores that have means of 0,
and the sum of squares of the unique factors over the range of variables is minimized.
Finally, the method ‘Anderson-Rubin’ is a modification of the Bartlett method which

ensures orthogonality of the estimated factors. The factor scores that are derived have
means of 0, standard deviations of 1, and are uncorrelated.

7.3.3 Results

In the following, the results of the factor analysis are presented according to the steps
presented in Sect. 7.2 (see Fig. 7.3). PAF was chosen as the extraction procedure because
the manager is interested in the causes of the correlations in his dataset. We will divide
the presentation of the results into two parts:

1. Checking whether the correlation matrix in the case study is suitable for a factor anal-
ysis (process step 1),
2. Conducting the PAF for the case study (steps 2 to 4).

Subsequently, in Sect. 7.3.3.3, we will deal with the manager’s question regarding the
positioning of the eleven chocolate types. The answer is based on the factor scores of the
respondents.
Because of the great importance of PCA in practical applications and the fundamental
difference to PAF (cf. the discussion in Sect. 7.2.2.4), we will again highlight the central
differences between using PCA and PAF in the case study (Sect. 7.3.3.4).

7.3.3.1 Prerequisite: Suitability
In the first step, the data matrix is standardized and the correlation matrix is computed
(Fig. 7.24).
The upper part of Fig. 7.24 shows that there are only a few high correlations in the
case study. The highest correlations exist between the variables ‘light’ and ‘sweet’
(r = 0.537) and between ‘light’ and ‘fruity’ (r = 0.549). There are many correlations below
0.2, and the lowest correlations exist between ‘healthy’ and ‘refreshing’ (r = 0.009) and
‘healthy’ and ‘delicious’ (r = –0.019). If we consider the significance levels of the correla-
tions (displayed in the lower matrix), only 25 out of 45 correlations are significant at the
5% level. We conclude that the data are probably not well suited for a factor analysis.
To gain a more detailed insight into the suitability of the data for factor analysis, the
results for KMO and Bartlett’s test for sphericity are also considered (Fig. 7.25). The
KMO criterion is 0.701, which indicates that the data matrix (see Table 7.4) is only mod-
erately suitable for factor analysis. Nevertheless, KMO is above the critical value of 0.5.
The Bartlett test for sphericity is significant (p < 0.001), which leads to the conclu-
sion that the correlation matrix is not an identity matrix and therefore correlations exist
between the initial data.
Figure 7.26 shows the anti-image correlation matrix with the variable-specific MSA
at the main diagonal. The variables ‘healthy’ (MSA = 0.492) and ‘price’ (MSA = 0.491)
have MSA values below the critical value of 0.5; thus we might consider to ignore these

Fig. 7.24 Correlation matrix and significance of correlations



Fig. 7.25 KMO and Bartlett’s test in the case study with 10 variables

variables in the further analyses. The remaining variables have MSA values that score
‘mediocre’ or even better. We decide to keep the variables ‘healthy’ and ‘price’ at this
point.
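The diagnostic measures reported in Figs. 7.25 and 7.26 can also be computed directly from the correlation matrix. The following sketch (Python with NumPy/SciPy, for illustration; the function name is our own) uses the usual formulas: the anti-image (partial) correlations are obtained from the inverse of R, the KMO and variable-specific MSA values relate the squared correlations to the squared partial correlations, and Bartlett's test is based on the determinant of R.

import numpy as np
from scipy import stats

def kmo_msa_bartlett(R, n):
    # R: empirical correlation matrix, n: number of observations
    R = np.asarray(R, dtype=float)
    p = R.shape[0]
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(R_inv), np.diag(R_inv)))
    Q = -R_inv / d                       # anti-image (partial) correlations
    np.fill_diagonal(Q, 0.0)
    R_off = R - np.eye(p)                # off-diagonal correlations
    kmo = (R_off ** 2).sum() / ((R_off ** 2).sum() + (Q ** 2).sum())
    msa = (R_off ** 2).sum(axis=0) / ((R_off ** 2).sum(axis=0) + (Q ** 2).sum(axis=0))
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return kmo, msa, chi2, stats.chi2.sf(chi2, df)

# Usage (R taken from Fig. 7.24, n = 116):
# kmo, msa, chi2, p_value = kmo_msa_bartlett(R, n=116)

Applied to the 10 × 10 correlation matrix of Fig. 7.24, this should reproduce the SPSS values up to rounding.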
Generally speaking, the assessment methods for deciding whether the data are suited
for factor analysis or not may be compared to a ‘bunch of flowers’. Although all evalu-
ation criteria rely on the correlations among the variables, the conclusions based on the
criteria may be ambiguous. Therefore it does not make much sense to pick out a spe-
cial criterion because it is necessary to see the whole picture. It is the complete portfolio
(bunch) of criteria that the decision-maker has to rely on.
In a final step, the communalities of the initial variables are checked, as these can pro-
vide information on whether any variables should be excluded from the analysis. This is
always the case if the communalities have very small values and the factors therefore can-
not reproduce the variance of a variable. For this purpose, we consider the matrix of ini-
tial and final communalities when PAF is used. Figure 7.27 shows the results for the case
study.
In theory, all variables with very small communalities should be deleted, i.e., ‘refresh-
ing’ (0.182), ‘healthy’ (0.239) and ‘crunchy’ (0.237). Follow-up questions are: Should
we delete one, two or all three variables? And what will happen if the deletion of the
three variables leads to new critical candidates for deletion? And what does it mean if, in the end, PCA and PAF yield completely different results? Does that mean variable selection becomes a mere trial-and-error process?
In the following we will analyze the process if—as an example—we delete the varia-
ble that has the lowest extracted communality, that is, ‘refreshing’ (Fig. 7.27). For doing
so, we have to run a completely new factor analysis. In our case KMO increases slightly
from 0.701 to 0.723, and the MSA value also increases marginally. The Bartlett test is
still significant.
Figure 7.28 shows the results for the communalities for 9 variables if PAF is applied.
After the deletion of ‘refreshing’ the critical variables in the 9-variable solution remain
the same (‘healthy’: 0.188; ‘crunchy’: 0.207). To see how the results change if other var-
iables with weak communalities are eliminated, the researcher can systematically check
the results by running a complete analysis for each combination. The results may vary
widely.

Fig. 7.26 Anti-image correlation and covariance matrix



Fig. 7.27 Communalities in the case study when using PAF (10 variables)

Fig. 7.28 Communalities in the case study after eliminating the variable ‘refreshing’ and using PAF

Does that mean the decision-maker has to return to ‘gut feeling’ decisions? The
answer is no. What is needed is a sound theory that guides the researcher through this set of
critical questions. That is why we recommend looking for a stable, well-founded theory
before starting any far-reaching but potentially misleading empirical designs. For didactic reasons
we decide to keep the variables ‘healthy’ and ‘crunchy’ and to continue with determining
the number of factors (based on nine variables and PAF).

7.3.3.2 Results of PAF with Nine Variables


In the following, we present the results for the case study if the variable ‘refreshing’ is
eliminated at the beginning and the remaining nine variables are examined using PAF.
The illustrations follow the steps 2–4 in Fig. 7.3.18

Number of factors
By default, SPSS uses the Kaiser (eigenvalue) criterion to determine the number of
factors.
Figure 7.29 shows the eigenvalues of the factors if we consider just nine variables.
The column ‘Initial Eigenvalues/Total’ shows that according to the Kaiser criterion
(eigenvalues > 1) three factors have to be extracted. A separate factor analysis on the
basis of the three-factor solution results in 45.17% cumulative variance explained. The
values related to ‘Extraction Sums of Squared Loadings’ consider the common variance
instead of the total variance and are thus lower than the values reported in the columns
‘Initial Eigenvalues’. The columns related to ‘Rotation Sums of Squared Loadings’ rep-
resent the distribution of the variance after the VARIMAX rotation.
Next, we have a look at the scree test. Figure 7.30 shows the scree plot. The result is
ambiguous. We could justify either a three-factor or a two-factor solution.
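Both criteria rest on the eigenvalues of the correlation matrix, which can be computed directly. A minimal sketch (Python/NumPy, for illustration; the function name is our own):

import numpy as np

def kaiser_and_scree(R):
    # Eigenvalues of the correlation matrix in descending order; the Kaiser
    # criterion retains all factors with an eigenvalue greater than 1.
    eigenvalues = np.sort(np.linalg.eigvalsh(np.asarray(R, dtype=float)))[::-1]
    n_factors = int((eigenvalues > 1).sum())
    explained_share = eigenvalues / eigenvalues.sum()
    return eigenvalues, n_factors, explained_share

# Applied to the 9 x 9 correlation matrix of the case study, the eigenvalues
# correspond to the 'Initial Eigenvalues' column in Fig. 7.29 and provide the
# values plotted in the scree plot (Fig. 7.30).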

Quality of the three-factor solution


To assess the quality of the factor solutions, we first consider the explained variance.
According to Fig. 7.29, the three-factor solution can explain 45.17% of the total vari-
ance. From the figure, it is evident that five factors would have to be extracted if, for
example, at least 80% of the total variance should be explained.
Further criteria for the evaluation of the quality are the reproduced correlation matrix
(R̂) as well as the deviations between the empirical (R) and the reproduced correlations.
Both matrices are shown in Fig. 7.31.
The matrix of the residuals (R − R̂) shows that only five residuals, i.e., 13% of all correlations, have values > 0.05. The largest residuals occur for the variable pairs (crunchy, bitter) with +0.091, (fruity, light) with +0.078, and (fruity, exotic) with −0.070. If the communalities listed on

18 By reducing the number of variables to 9, the number of valid cases in the case study changes to
117.

Fig. 7.29 Eigenvalues and total variance explained (based on 9 variables)



Fig. 7.30 Scree test and Kaiser criterion

the main diagonal in the reproduced correlation matrix are taken into account (see also
Fig. 7.28), it can be seen that poor reproductions are mainly related to the variables with
small communalities. These are the variables ‘bitter’ (communality 0.378) and ‘crunchy’
(communality 0.207). In contrast, the variable ‘healthy’ with the smallest communality
(0.188) is not affected. The reason for this is the fact that, in principle, it forms a factor
of its own in factor extraction and dominates factor 3 (see Fig. 7.33). In summary, it can
be said that although the overall goodness (45.17% explained total variance) is rather
low, the reproductions of the correlations are relatively good considering the consistently
rather low communalities.

Factor interpretation
To interpret the three-factor solution, we first use the unrotated factor matrix in Fig. 7.32.
The variables ‘light’, ‘sweet’, ‘bitter’, and ‘fruity’ load highly on factor 1. The varia-
ble ‘delicious’ has a cross-loading with factors 1 and 2. Besides ‘delicious’, the varia-
bles ‘price’ and ‘exotic’ are correlated with factor 2. The variable ‘healthy’ is the only
one that loads rather highly on factor 3 (0.406). This might be a first hint that there will
be incompatibilities with the conceptual requirements of factor analysis. The variable
‘crunchy’ correlates rather weakly with all three factors. Should we delete this variable?
For further interpretation of the factor solution, we look at the factor loadings
obtained after applying the VARIMAX rotation. In the two-dimensional (as well as in
the three-dimensional) case, we can perform the rotation graphically by trying to rotate
the coordination system in such a way that the angles between the variable and the factor

Fig. 7.31 Reproduced correlations and residuals



Fig. 7.32 Unrotated factor matrix for the three-factor solution

vectors decrease. In the case of more than three factors, however, it is necessary to per-
form the rotation purely mathematically.
The VARIMAX rotation is an orthogonal rotation method, thus maintaining the
assumption that the factors should be independent (i.e., uncorrelated). Since the rotation
of the factors changes the factor loadings but not the communalities of the model, the
unrotated solution is primarily suitable for the selection of the number of factors and for
the quality assessment of the factor solutions. However, an interpretation of the deter-
mined factors on the basis of an unrotated model is not recommended, since the applica-
tion of a rotation method changes the distribution of the explained variance portion of a
variable among the factors.
Figure 7.33 shows the analytical solution of the rotated factor loading matrix for the
case study. Compared to Fig. 7.32, noticeable changes in the factor loadings have occurred, which ultimately facilitate the factor interpretation and make it less ambiguous.
Now the variables ‘light’, ‘sweet’, ‘fruity’, ‘bitter’, ‘exotic’, and ‘crunchy’ are more
clearly related to factor 1. Factor 2 correlates with ‘price’ and ‘delicious’, while factor 3
is only correlated with the variable ‘healthy’.
Overall, it becomes clear that the results of a factor analysis often raise many ques-
tions and rarely provide clear answers. However, this is precisely what makes factor
analysis so “flexible” and leaves room for interpretation. In particular, the interpreta-
tion of the factors is always subjective and ultimately requires a great deal of expertise
on the part of the user in the field of investigation. This often makes it difficult to find

Fig. 7.33 Varimax-rotated factor matrix in the case study

the “right” labels for the abstract, unobserved factors. This is precisely why information
about the variables that load strongly on a factor is used to interpret the factors. The fol-
lowing interpretations are proposed here for the case study:

• Factor 1: taste experience


• Factor 2: value for money
• Factor 3: health aspects

We encourage the reader to challenge these propositions and to find alternative terms for
the factor labels. Moreover, we would like to motivate the reader to carefully compare
and inspect the different alternative options when conducting a factor analysis to gain
confidence about the robustness of the results.

Factor scores
After extracting the factors, the question of how the interviewed persons would assess
the (fictitious) factors is often still of interest. These assessments, the so-called factor
scores, can be estimated from the matrix of the initial data (Z). We use the regression
method to calculate the factor scores (see Fig. 7.23 and Sect. 7.2.4). SPSS provides the
Factor Score Coefficient Matrix (Fig. 7.34) to calculate the factor scores from the output
data.

Fig. 7.34 Regression coefficients for calculating the factor scores

The regression coefficients serve as weights for the standardized initial values to com-
pute the factor scores. From Fig. 7.34 we learn that the variables ‘light’, ‘sweet’, ‘fruity’,
‘bitter’, ‘exotic’, and ‘crunchy’ have the highest weights for factor 1, corresponding to the previous
results. As expected, the variables ‘price’ and ‘delicious’ received the highest weights for
factor 2, and the variable ‘healthy’ has the highest relevance when the factor score for fac-
tor 3 is computed. Since all variables are considered when computing the factor score for
one factor, the factor scores are correlated.
SPSS computes the factor scores and saves these values as new variables in the SPSS
data file. In SPSS, the factor scores of the individual cases are appended to the data
matrix as new variables: FAC1_1, FAC2_1 and FAC3_1 (Fig. 7.35).
As generally discussed in Sect. 7.2.4, when interpreting factor scores, it must be kept
in mind that they represent standardized values with a mean of zero and a variance
of 1. The resulting interpretation is illustrated here for persons 1, 2 and 10:

• The factor scores for person 1 are –1.093, –0.321, and –0.520.
This means that person 1 rates all three factors below average compared to the aver-
age of all respondents. With the rating scale used (1 = low, 7 = high), this means that
person 1 rates all three factors low compared to the average.
• The factor scores for person 2 are 0.319, 0.690, and 0.427.
Person 2 shows the opposite phenomenon, i.e. the assessments are above average for
all three factors. Person 2 therefore rates all three factors highly compared to the aver-
age values.

Fig. 7.35 SPSS data editor with the factor scores of the first 28 persons in the case study

• The factor scores for person 10 are –0.002, –0.650, and –0.086.
Person 10 has values close to zero for factors 1 and 3, i.e. the assessment of these fac-
tors corresponds to the average assessment of the persons interviewed.

An indication of the “average” evaluation of a variable (factor score 0) can be derived


from the mean of the variables in the initial data that load on a factor. For this purpose,
the field ‘Univariate descriptives’ must be additionally selected in the menu ‘Descriptives’
in Fig. 7.20. This statistic will then show the mean assessment values per initial variable
of a survey. For the case study, this mean is 4.73 for the variable ‘price’ and 4.31 for
the variable ‘delicious’. Since these two variables dominate factor 2, it may be concluded
that person 2 rates factor 2 as particularly high (factor score +0.690), while person
10 rates the factor considerably lower (factor score –0.650). A factor score of zero
would indicate that a person’s assessment corresponds exactly to these average levels (4.73 and 4.31, respectively).

7.3.3.3 Product Positioning
To position the eleven types of chocolate selected for the case study, the manager of the
chocolate company uses the (fictitious) evaluations of the three factors as represented by
the factor scores of the 116 interviewed persons (see Fig. 7.35). For product positioning, he
needs the average evaluations of the three factors for each type of chocolate. These can be
calculated using the SPSS procedure ‘Means’, which is called up via the following menu
sequence: ‘Analyze/Compare Means and Proportions/Means’. In the procedure ‘Means’,
the three variables with the factor scores (“FAC1_1”; “FAC2_1”; “FAC3_1”) must be
added to the ‘Dependent List’, and the chocolate type must be specified in the ‘Independent
List’. Afterwards, the means can be calculated via the menu ‘Options’ (see Fig. 7.36).
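Readers working outside of SPSS can reproduce this aggregation with a few lines, e.g. in Python with pandas; the variable names FAC1_1, FAC2_1, FAC3_1 and type correspond to those used above, and the file name is only a placeholder:

import pandas as pd

# df is assumed to contain the saved factor scores and the chocolate type,
# e.g. exported from the SPSS data file (hypothetical file name):
# df = pd.read_spss("chocolate_with_factor_scores.sav")

def mean_factor_scores(df):
    # Average factor scores per chocolate flavor (cf. Fig. 7.37)
    return df.groupby("type")[["FAC1_1", "FAC2_1", "FAC3_1"]].mean()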

Fig. 7.36 SPSS dialog boxes: Means and Means: Options



The data matrix for positioning is thus an 11 × 3 matrix, with the 11 chocolate flavors
as cases and the three average factor assessments as variables. It should be noted that
averaging per chocolate means that the information about the variations in assessment
between the various individuals is lost. Depending on the heterogeneity among the test
persons, this loss of information can be large and may not be acceptable. Figure 7.37
shows the average factor ratings for the 11 chocolate flavors.
Figure 7.37 shows that only the fruit varieties (orange, strawberry, mango) have high
positive values in the taste experience (factor 1). This means that they are perceived
above average compared to all other varieties. On the other hand, the fruit varieties show
high negative values for the other two factors, ‘value-for-money’ and the ‘health aspect’;
here they are perceived as below average. As shown above, the scale used in a survey
(here: 1 to 7) determines what "average" means. The means of the initial variables that
load on a factor indicate which rating level corresponds to a standardized factor score of
zero for factor j. A good overall impression is obtained by the graphical representation of
the values in Fig. 7.37. The resulting three-dimensional factor space (perception space) is
shown in Fig. 7.38.
Figure 7.38 shows that the perception of the three types of fruit chocolate is very dif-
ferent from that of the other types of chocolate (in all three dimensions). In contrast, the
remaining eight types of chocolate are positioned relatively close to each other in the

Fig. 7.37 Means of the factor scores for each chocolate flavor

Fig. 7.38 Three-dimensional representation of the chocolate flavors in the factor space (vertical axis: factor 1 ‘taste experience’, EV: 2.397)

consumers’ perceptual space.19 For the manager of the chocolate company, this means
that advertising for the fruit varieties should emphasize the taste dimension.
However, caution is also recommended when interpreting the factor space in
Fig. 7.38: Due to the different eigenvalues of the factors (see also Fig. 7.29), the factors
have a different significance for explaining the variances in the data set. Especially factor
3 has only a very small explanatory power.

7.3.3.4 Differences Between PAF and PCA


In Sect. 7.2.2.4 we pointed out that the objective of PCA is fundamentally different from
the objective of the extraction methods of factor analysis. It was emphasized that the

19 Cluster analysis is the central methodological instrument for identifying similarly perceived
objects. The cluster analysis presented in this book (cf. Chap. 8) is also based on the data set used
in this case study (Table 7.24) and confirms the result of a two-cluster solution that is emerging
here.

Fig. 7.39 Initial and extracted communalities: PCA vs. PAF (10 variables)

decision for one or the other of the two approaches must be made on logical grounds, as
they have completely different objectives.20
In our case study, the application of PAF was determined by the manager’s question.
But in the following we will briefly describe the results of the case study if we use PCA
instead of PAF.

Estimating the communalities


Figure 7.39 shows the different estimates of the communalities for the PCA and the PAF.
The two methods arrive at very different communalities for the output variables. It is
particularly noticeable that the variable ‘refreshing’ has the highest communality after
extraction in PCA, whereas this variable was excluded in the case study when applying
PAF. The same applies to the variable ‘healthy’, which has the second highest commu-
nality in PCA, while it has the second lowest value in PAF. The different estimates of the
communalities in the case study make it very clear that both extraction methods lead to
completely different subsequent decisions. While ‘refreshing’ would be excluded when
using PAF, the variable ‘crunchy’ with a final communality of 0.429 would be excluded
when using PCA. Ultimately, the case study illustrates very well the different theoretical
fundamentals of the two extraction methods which will be elaborated in the following
section. The picture does not change even if only 9 variables are analyzed instead of the
initial 10 variables.
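The different starting points of the two methods can be made explicit: PCA starts from communalities of 1, whereas PAF (as implemented in SPSS) starts from the squared multiple correlation of each variable with all remaining variables. A minimal sketch (Python/NumPy, for illustration; the function name is our own):

import numpy as np

def initial_communalities(R):
    # Initial communality estimates: 1.0 for PCA, squared multiple
    # correlations (SMC) for principal axis factoring.
    R = np.asarray(R, dtype=float)
    pca_initial = np.ones(R.shape[0])
    smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    return pca_initial, smc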

20 See the general comments in Sect. 7.2.2.4.



Central differences of PCA and PAF in the case study


The main difference between PCA and PAF is reflected in the estimation of communal-
ities (see Sect. 7.2.2.4). If as many factors are extracted as there are variables, PCA can
reproduce 100% of the variance with five factors. In contrast, PAF can only extract (J –
1 = 4) four factors that explain only 57.87% of the variance.
As a consequence, there are further differences in the current case study, if a PCA is
performed with the same 9 variables as the PAF. The following aspects are highlighted:

• In PCA, three principal components are extracted when applying the Kaiser crite-
rion. While PAF could explain only 45.17% of the variance (see Fig. 7.29), PCA can
explain 63.15% of the total variance.
• A comparison between the rotated factor loading matrix and the rotated component
matrix does not lead to major differences, neither in terms of the assignment of the
variables to the factors nor in terms of the loadings. It is remarkable, however, that
the variable ‘healthy’ (with a loading of 0.900) loads on the third principal component
only and thus can be explained best by PCA (communality of 0.823). In PAF, this fac-
tor loading is only 0.421 and the communality is 0.188 (see Fig. 7.31 and 7.33).
• Obviously, the rotated loading matrices show only slight differences in the two extrac-
tion methods. The difference is rather that the factors have to be interpreted against
different backgrounds. In PCA, the aim is not to find the cause of the correlations, but
a collective term for the correlating variables.
• However, there are also differences with regard to the factor scores. This means that
the evaluation behavior of the respondents with regard to the factors (main compo-
nents) is represented differently in the two extraction methods.

7.3.4 SPSS Commands

Above we demonstrated how to use the graphical user interface (GUI) of SPSS to con-
duct a factor analysis. Alternatively, we can use the SPSS syntax which is a programming
language unique to SPSS. Each option we activate in SPSS’s GUI is translated into SPSS
syntax. If you click on ‘Paste’ in the main dialog box shown in Fig. 7.19, a new window
opens with the corresponding SPSS syntax.
However, you can also use the SPSS syntax and write the commands yourself. Using
the SPSS syntax can be advantageous if you want to repeat an analysis multiple times
(e.g., testing different model specifications). Figure 7.40 shows the SPSS syntax for
running the factor analysis in the case study, applying the principal axis factoring on 9
variables (without ‘refreshing’). The second part of the syntax contains the procedure
‘Means’, which was used to calculate the means of the factor scores per factor for the 11
chocolate flavors (Fig. 7.37) and to create the factor space in Fig. 7.38. The syntax does
not refer to an existing data file of SPSS (*.sav); rather, we enter the data with the help of
the syntax editor (BEGIN DATA… END DATA).

* MVA: Case Study Chocolate Factor Analysis.


* Defining Data.
DATA LIST FREE / price refreshing delicious healthy bitter light crunchy
exotic sweet fruity respondent type.

BEGIN DATA
3 3 5 4 1 2 3 1 3 4 1 1
6 6 5 2 2 5 2 1 6 7 3 1
2 3 3 3 2 3 5 1 3 2 4 1
-------------------------
5 4 4 1 4 4 1 1 1 4 18 11
* Enter all data.
END DATA.

* Case Study Factor Analysis with 9 variables: Method "Principal axis factoring".
FACTOR
/VARIABLES price delicious healthy bitter light crunchy exotic sweet
fruity
/MISSING LISTWISE
/ANALYSIS price delicious healthy bitter light crunchy exotic sweet
fruity
/PRINT INITIAL CORRELATION SIG KMO REPR AIC EXTRACTION ROTATION FSCORE
/PLOT EIGEN
/CRITERIA MINEIGEN(1) ITERATE(25)
/EXTRACTION PAF
/CRITERIA ITERATE(25)
/ROTATION VARIMAX
/SAVE REG(ALL)
/METHOD=CORRELATION.

* Calculation of the means of the factor scores per chocolate flavor.


MEANS TABLES=FAC1_1 FAC2_1 FAC3_1 BY type
/CELLS=MEAN.

Fig. 7.40 SPSS syntax for conducting the factor analysis in the case study

For readers interested in using R (https://www.r-project.org) for data analysis, we


provide the corresponding R-commands on our website www.multivariate-methods.info.

7.4 Extension: Confirmatory Factor Analysis (CFA)

So far, the term factor analysis has been used to describe the method of exploratory fac-
tor analysis (EFA). In addition to EFA, as presented above, confirmatory factor analysis
(CFA) is also very important for practical applications. In the following, we will high-
light the conceptual differences between the two approaches.

Theoretical foundation
The aim of EFA is to identify those variables that are highly correlated and can be com-
bined to form factors. There are no a priori defined relationships between the variables;

rather, we use EFA to explore the data and search for structures in a set of variables (i.e.,
we aim to discover structures). We often perform EFA to reduce the number of variables
to be considered in further analyses.
In contrast, CFA aims to measure given, so-called hypothetical constructs via empir-
ically collected variables (measurement variables or indicators). The hypothetical con-
structs correspond to the factors. Accordingly, CFA is used exclusively to test whether a
factor (construct) is reflected in the variables specified by the user and measured empiri-
cally. CFA is thus an instrument for operationalizing hypothetical constructs, also known
as latent variables.
Let us use the variables of our case study (Sect. 7.3) to exemplify CFA. We assume
that a user wants to operationalize the hypothetical constructs ‘taste experience’ (factor
1) and ‘value for money’ (factor 2) by means of certain indicators (measurement vari-
ables). Since the factors are latent variables that cannot be directly measured, the user
looks for indicators (variables) that reflect the two factors in reality. For this reason, we
also speak of reflective measurement models when referring to hypothetical constructs.
Each measurement variable should represent as good a reflection of a construct as possi-
ble. For this to be true, the variables assigned to a construct must have high correlations.
The measurement variables of different constructs, however, should not be correlated if
the constructs (factors) under consideration are independent. Figure 7.41 illustrates the
correlations and shows that each indicator is generated by the postulated construct, but
each measurement is also subject to a measurement error (δ).
Due to this basic idea of CFA, the process steps presented in Fig. 7.3 for EFA have
to be changed slightly: in principle, step 1 is not necessary, since the variables should be
selected in advance in such a way that all indicators (variables) that reflect a construct
(factor) show high correlations. Thus, the first step is not used to check whether a data
set is suitable for factor analysis, but to check whether the variables assigned to a con-
struct in advance do have the expected high correlations.
Furthermore, CFA is based on the PAF model. According to Eq. (7.10) the following
applies: R = A · A′ + U. As Fig. 7.41 shows, each measurement is reflected in a regres-
sion equation, which is represented for the seven measurement variables (indicators) as
follows:

x1 = λ11 · F1 + δ1    x5 = λ52 · F2 + δ5
x2 = λ21 · F1 + δ2    x6 = λ62 · F2 + δ6
x3 = λ31 · F1 + δ3    x7 = λ72 · F2 + δ7
x4 = λ41 · F1 + δ4
with

xi measurement variable i (indicator variable)


λiq factor loading variable i to factor q
δi measurement error of variable i
Fq factor q

(Fig. 7.41 depicts the reflective measurement model: the indicators x1 ‘light’, x2 ‘sweet’, x3 ‘fruity’ and x4 ‘bitter’ (with measurement errors δ1 to δ4) reflect the construct ‘taste experience’ (F1); the indicators x5 ‘expensive’, x6 ‘saturating’ and x7 ‘quantity’ (with measurement errors δ5 to δ7) reflect the construct ‘value-for-money’ (F2); the correlation between the two factors is fixed at rF1,F2 = 0.)
Fig. 7.41 Example of a reflective measurement model of the CFA

The quantities λiq form the regression coefficients to be estimated from the empirical
measurements, which correspond to the factor loadings in the case of standardized meas-
urement variables.
The example above shows that the following information is required before carrying
out a CFA and has to be determined a priori by the user on the basis of factual logical or
theoretical considerations:

• number of factors (constructs considered),


• naming of the factors in terms of content,
• finding and assigning measurement variables that reflect the constructs under consid-
eration in their entirety (reflective measurements).

Thus, step 2 (extracting the factors and determining their number) has a completely differ-
ent meaning in CFA, since the number of factors to be extracted is determined a priori by
the user and only one of the factor-analytical approaches (usually PAF or the ML method)
can be used as an extraction method. In contrast, PCA has no significance for CFA.
If two or more factors are considered simultaneously in a model, the assignment of
the measurement variable to the factors is reflected in the factor loading matrix. Whereas
in EFA all factor loadings must be estimated, CFA only requires an estimate of the factor
loadings that have definitely been assigned to a factor. All other factor loadings can a
priori be set to zero. The estimation of the factor loading matrix is also carried out within
the framework of CFA on the basis of the fundamental theorem of factor analysis (see
Sect. 7.2.2.2). Here, the goal of the estimation is to minimize the difference between the
(model-theoretical) correlation matrix calculated with the help of the parameter estimates
and the empirical correlation matrix.
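This estimation target can be illustrated with a small numerical sketch (Python/NumPy; the loading and error values are purely illustrative and are not estimates from the case study). For a given parameter set, the model-implied correlation matrix follows from R = ΛΛ′ + Θ, and a CFA algorithm searches for the parameter values that bring this matrix as close as possible to the empirical correlation matrix:

import numpy as np

# Hypothetical loading pattern of the model in Fig. 7.41: x1-x4 load only on
# F1, x5-x7 only on F2 (the zeros are fixed a priori); values are invented.
lam = np.array([
    [0.8, 0.0], [0.7, 0.0], [0.9, 0.0], [0.6, 0.0],   # taste experience (F1)
    [0.0, 0.7], [0.0, 0.8], [0.0, 0.6],               # value-for-money (F2)
])
theta = np.diag(1.0 - (lam ** 2).sum(axis=1))   # error variances of delta_i

# Model-implied correlation matrix (orthogonal factors): R = Lambda Lambda' + Theta
R_implied = lam @ lam.T + theta
print(np.round(R_implied, 2))

# An estimation algorithm varies the free loadings and error variances until
# R_implied is as close as possible to the empirical correlation matrix.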

              Exploratory factor analysis        Confirmatory factor analysis
Variable      Factor 1 "???"   Factor 2 "???"    Factor 1              Factor 2
                                                 "taste experience"    "value-for-money"
X1:           λ11              λ12               λ11                   0
X2:           λ21              λ22               λ21                   0
X3:           λ31              λ32               λ31                   0
X4:           λ41              λ42               λ41                   0
X5:           λ51              λ52               0                     λ52
X6:           λ61              λ62               0                     λ62
X7:           λ71              λ72               0                     λ72

Fig. 7.42 Estimation of the factor loading matrix for EFA and CFA

Since the factors of CFA are already defined in terms of content prior to the examina-
tion, step 3 of the process (interpreting the factors), which is very important for EFA, is
omitted.
Figure 7.42 illustrates the central differences in the basic ideas of EFA and CFA using
the factor loading matrix.
After performing a CFA, it is essential to assess the validity of the model in order to con-
firm the measurement model. CFA provides a large number of quality criteria for this pur-
pose (cf. Harrington, 2009, pp. 5–7). However, it is beyond the scope of this book to discuss
CFA in detail. The central differences between EFA and CFA are summarized in Table 7.25.
For further information see the bibliographic information at the end of this chapter.
SPSS provides an independent software package called AMOS (Analysis of Moment
Structures) for performing a CFA.

7.5 Recommendations

We close this chapter with some requirements and recommendations for conducting a
factor analysis.
The considerations in this chapter have shown that an exploratory factor analysis
(EFA) can lead to different results for the same initial data, depending on how the pro-
cess options are determined (see Fig. 7.17). The recommendations listed in Table 7.26
are based on what has proven to be effective and may serve as a beginner’s guide for
defining the parameters. Anyone wanting to dive deeper into the topic is referred to the
specialized literature (see references at the end of this chapter).

Table 7.25  Exploratory versus confirmatory factor analysis

                           Exploratory factor analysis              Confirmatory factor analysis
Objectives                 Compression of highly correlating        Testing the relationships between
                           variables to as few independent          variables and hypothetical factors
                           factors as possible                      based on a priori theoretical
                                                                    considerations
Assignment of variables    The algorithm assigns the variables      The user carries out the assignment
to factors                 to factors (structure of the factor      of variables to factors (constructs)
                           loading matrix)                          a priori from a theoretical
                                                                    perspective
Number of factors          Determined on the basis of statistical   Specified a priori by the user
                           criteria (e.g. eigenvalue criterion)
Estimation of the factor   All elements of the factor loading       As a rule, multiple loadings of
loading matrix             matrix are estimated                     variables on factors are not
                                                                    permitted (zeroing of variables)
Interpretation of factors  • Rotation of the factor loading         • A factor rotation is irrelevant
                           matrix to facilitate the                 due to the given factor structure
                           interpretation of the results            • Not applicable since the
                           • Performed a posteriori by the user     interpretation is specified a priori
                           on the basis of the estimated factor     by the user
                           loading matrix

Table 7.26  Recommendations for conducting an exploratory factor analysis

Necessary steps of factor analysis   Recommendations or requirements
Initial survey                       • Variables must be scaled metrically
                                     • The number of observations should be at least three times
                                       the number of variables, and at least 50 observations
Creating the initial matrix          • Standardize the original variables (not relevant when using
                                       SPSS for conducting the factor analysis)
Estimating communalities and         • Decide whether to perform a principal component analysis
factor extraction                      or a principal axis factoring
                                     • Besides principal axis factoring, the maximum likelihood
                                       method is frequently used to estimate the communalities
                                       and factor loadings
Determining the number of factors    • Use the Kaiser criterion and scree test
Rotation                             • The VARIMAX rotation is frequently used since it maintains
                                       the assumption of orthogonal (uncorrelated) factors
Interpretation                       • Variables that belong to a factor should have a factor
                                       loading above 0.5
Determining factor scores            • Summated scales are frequently used in academic literature
                                       since the original scale is kept
                                     • The results of the regression method are affected by the
                                       data at hand and result in correlated factors

References

Child, D. (2006). The essentials of factor analysis (3rd ed.). Bloomsbury Academic.
Cureton, E. E., & D’Agostino, R. B. (1993). Factor analysis: An applied approach. Erlbaum.
Dziuban, C. D. & Shirkey, E. C. (1974). When is a correlation matrix appropriate for factor analy-
sis? Some decision rules. Psychological bulletin, 81(6), 358.
Guttman, L. (1953). Image theory for the structure of quantitative variates. Psychometrika, 18(4),
277–296.
Harrington, D. (2009). Confirmatory factor analysis. Oxford University Press.
Kaiser, H. F., & Rice, J. (1974). Little jiffy, mark IV. Educational and psychological measurement,
34(1), 111–117.
Loehlin, J. (2004). Latent variable models: An introduction to factor, path, and structural equation
analysis (4th ed.). Psychology Press.

Further Reading

Bartholomew, D. J., Knott, M., & Moustaki, I. (2011). Latent variable models and factor analysis:
A unified approach (Vol. 904). Wiley.
Costello, A. B., & Osborne, J. W. (2005). Best practices in exploratory factor analysis: Four rec-
ommendations for getting the most from your analysis. Practical Assessment, Research and
Evaluation, 10(7), 1–9.
Harman, H. (1976). Modern factor analysis (3rd ed.). The University of Chicago Press.
Kaiser, H. F. (1970). A second generation little jiffy. Psychometrika, 35(4), 401–415.
Kim, J. O., & Mueller, J. (1978). Introduction to factor analysis: What it is and how to do it.
SAGE.
Stewart, D. (1981). The application and misapplication of factor analysis. Journal of Marketing
Research, 18(1), 51–62.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Allyn & Bacon.
Thompson, B. (2004). Exploratory and confirmatory factor analysis – Understanding concepts
and applications. American Psychological Association.
Yong, A. G., & Pearce, S. (2013). A beginner’s guide to factor analysis: Focusing on exploratory
factor analysis. Tutorials in Quantitative Methods for Psychology, 9(2), 79–94.
8 Cluster Analysis

Contents

8.1 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454


8.2 Procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
8.2.1 Selection of Cluster Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
8.2.2 Determination of Similarities or Distances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
8.2.2.1 Overview of Proximity Measures in Cluster Analysis. . . . . . . . . . . . . . 460
8.2.2.2 Proximity Measures for Metric Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
8.2.2.2.1 Simple and Squared Euclidean Distance Metric (L2 Norm). . . 462
8.2.2.2.2 City Block Metric (L1 Norm) . . . . . . . . . . . . . . . . . . . . . . . . . . 464
8.2.2.2.3 Minkowski Metric (Generalization of the L Norms). . . . . . . . . 466
8.2.2.2.4 Pearson Correlation as a Measure of Similarity. . . . . . . . . . . . . 467
8.2.3 Selection of the Clustering Method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
8.2.3.1 Hierarchical Agglomerative Procedures. . . . . . . . . . . . . . . . . . . . . . . . . 470
8.2.3.2 Single Linkage, Complete Linkage and the Ward Procedure. . . . . . . . . 474
8.2.3.2.1 Single-Linkage Clustering (Nearest Neighbor). . . . . . . . . . . . . 474
8.2.3.2.2 Complete Linkage (Furthest Neighbor). . . . . . . . . . . . . . . . . . . 477
8.2.3.2.3 Ward’s Method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
8.2.3.3 Clustering Properties of Selected Clustering Methods. . . . . . . . . . . . . . 481
8.2.3.4 Illustration of the Clustering Properties with an Extended Example. . . 482
8.2.4 Determination of the Number of Clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
8.2.4.1 Analysis of the Scree Plot and Elbow Criterion. . . . . . . . . . . . . . . . . . . 488
8.2.4.2 Cluster Stopping Rules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
8.2.4.3 Evaluation of the Robustness and Quality of a Clustering Solution. . . . 491
8.2.5 Interpretation of a Clustering Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
8.2.6 Recommendations for Hierarchical Agglomerative Cluster Analyses. . . . . . . . . . 493
8.3 Case Study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
8.3.1 Problem Definition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
8.3.2 Conducting a Cluster Analysis with SPSS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
8.3.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
8.3.3.1 Outlier Analysis Using the Single-Linkage Method. . . . . . . . . . . . . . . . 498


8.3.3.2 Clustering Process using Ward’s Method. . . . . . . . . . . . . . . . . . . . . . . . 498


8.3.3.3 Interpretation of the Two-Cluster-Solution. . . . . . . . . . . . . . . . . . . . . . . 505
8.3.4 SPSS Commands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
8.4 Modifications and Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
8.4.1 Proximity Measures for Non-Metric Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
8.4.1.1 Proximity Measures for Nominally Scaled Variables. . . . . . . . . . . . . . . 511
8.4.1.2 Proximity Measures for Binary Variables. . . . . . . . . . . . . . . . . . . . . . . . 515
8.4.1.2.1 Overview and Output Data for a Calculation Example. . . . . . . 515
8.4.1.2.2 Simple Matching, Jaccard and RR Similarity Coefficients. . . . 516
8.4.1.2.3 Comparison of the Proximity Measures. . . . . . . . . . . . . . . . . . . 519
8.4.1.3 Proximity Measures for Mixed Variables. . . . . . . . . . . . . . . . . . . . . . . . 520
8.4.2 Partitioning clustering methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
8.4.2.1 K-means clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
8.4.2.1.1 Procedure of KM-CA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
8.4.2.1.2 Conducting KM-CA with SPSS. . . . . . . . . . . . . . . . . . . . . . . . . 526
8.4.2.2 Two-Step Cluster Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
8.4.2.2.1 Procedure of TS-CA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
8.4.2.2.2 Conducting a TS-CA with SPSS. . . . . . . . . . . . . . . . . . . . . . . . 528
8.4.2.3 Comparing between KM-CA and TS-CA. . . . . . . . . . . . . . . . . . . . . . . . 529
8.5 Recommendations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531

8.1 Problem

Empirical studies often deal with large data sets that show great variety with regard to
certain attributes, i.e. the elements of the data set have a high degree of heterogeneity. If
highly heterogeneous data are described by a mean value, for example, the variance or
standard deviation is high. This indicates that the mean value is of little statistical sig-
nificance as an attribute measure of the overall data set. The lower the heterogeneity, the
more reliable the mean value and the smaller the associated standard deviation.
One way to solve the problem of data heterogeneity is to merge persons (or objects)
into comparable (i.e., homogenous) groups. This means that the survey population is
broken down into groups in which the persons (or objects) show a high degree of homo-
geneity (intra-group homogeneity), while there is a high degree of heterogeneity between
groups (intergroup heterogeneity). In this way, statistical analyses can be carried out for
each group separately, providing significantly more reliable results per group.
Cluster analysis is the methodological instrument for breaking down heterogeneous
survey results into homogenous groups. It can be applied in many disciplines such as
medicine, sociology, biology or economics and is used to determine similarities, e.g. of
patients, buyers, plant species, companies or products. Table 8.1 shows selected research
questions in different fields of application, all of which aim to form groups of objects or
persons.

Table 8.1  Application examples of cluster analysis in different disciplines


Discipline Exemplary research questions
Agriculture Which plants show a similar growth behavior and should therefore be cultured in
a similar manner?
Biology Are there unexplored genetic relationships between certain animal species?
Finance Which creditworthiness levels can be distinguished based on the payment behav-
ior of bank customers?
Marketing How can an overall market be broken down into homogenous market segments
on the basis of consumer behavior?
Medicine How can patients be divided into different groups on the basis of laboratory
values in order to develop more tailored therapies?
Meteorology Can regions with similar climatic conditions be identified to develop an early
warning system for each group of regions?
Pharmacy Can drugs with similar side effects be identified in order to derive recommenda-
tions for the best possible therapy?

The questions make it clear that cluster analysis is related to exploratory data analysis
procedures, because it leads to suggestions for grouping surveyed objects, thus generat-
ing “new findings” or discovering structures in data sets.
To visualize the procedure of cluster analysis, two examples are shown in Fig. 8.1. In
the first case (diagram A), age and income were recorded for 30 persons. The average
age of the survey population is 31.4 years and the average income is 2595 €. However,
as Fig. 8.1A shows, these two averages are not very meaningful for characterizing the
30 persons since the standard deviation (s) of age is s_age = ±8.1 years and the standard
deviation for income is s_income = ±1151 €. Clustering the data into two groups, as
indicated in diagram A, leads to more meaningful results. For the group of younger
persons (g = 1), the average age is 24.7 years (s_1,age = ±2.4 years), with an average
income of 1550 € (s_1,income = ±225 €). In contrast, the group of older persons (g = 2)
is on average 38.2 years old (s_2,age = ±4.1 years) and has an average income of 3640 €
(s_2,income = ±475 €).1
Another example is shown in Fig. 8.1B. Ten products were examined with regard to
their price and quality levels as perceived by consumers. The graphic representation sug-
gests three segments that show a high degree of similarity in the assessments. This result
might, for example, allow a supplier to set up segment-specific marketing campaigns.

1 In diagram A, the two characteristics “income” and “age” are not independent. This means that the
two-cluster solution could have been achieved on the basis of only one of the two characteristics.
On the independence of cluster variables, see Sect. 8.2.1.

Fig. 8.1 Exemplary results of cluster analysis. Diagram A: income (vertical axis) plotted against age (horizontal axis) for the 30 persons, with group 1 (averages 1550 €; 24.7 years), group 2 (averages 3640 €; 38.2 years) and the total average (2595 €; 31.4 years). Diagram B: perceived price level plotted against perceived quality level for the ten products P1 to P10, suggesting the segments 1 to 3

Since in both examples the objects were described by just two variables, they can
be graphically represented in a diagram or scatterplot. In many applications, however,
the objects are described by considerably more variables. In these cases, the results of
a cluster analysis can no longer be visualized. For example, all 20,000 enrolled students
of a university are listed as cases (objects), and age, gender, number of semesters, and
high school graduation grade are collected as attributes. If a clustering is carried out on
the basis of these data, the university management may use the results, for example, to
develop group-specific offers. Again, the number of groups should be chosen in such a
way that the students within a group are as similar to each other as possible while there
are only minor similarities between the student groups.
If a graphical illustration (in two- or three-dimensional space) is to be used in a mul-
ti-variable case, the set of variables can be aggregated in advance, e.g. with the help of
factor analysis (see Chap. 7). Using, for example, the first two factors, a visualization is
possible. If the user is interested in more detailed knowledge about the differences of a
clustering solution, discriminant analysis (see Chap. 4) can be used for this purpose. In
discriminant analysis, the result of the cluster analysis (number of groups) is used as a
dependent variable and the differences between the groups are examined on the basis of
the independent variables.

8.2 Procedure

The first step in performing a cluster analysis is to decide which variables should be used
to cluster a set of objects. The choice of the cluster variables determines how the resulting
homogeneous groups can be described later on. In the second step, we have to decide how the similarity

Fig. 8.2 Process steps of hierarchical cluster analysis: (1) selection of cluster variables, (2) determination of similarities or distances, (3) selection of the clustering method, (4) determination of the number of clusters, (5) interpretation of a cluster solution

or dissimilarity between objects is to be determined. In cluster analysis, we can choose


among a variety of criteria (so-called proximity measures) to determine the similarity or
dissimilarity (distance) between objects based on the cluster variable. Proximity meas-
ures express the similarity or dissimilarity between two objects by a numerical value.
Once the similarity between the objects has been determined, the third step is to
choose a clustering method. There is a large number of so-called clustering algorithms
that can be used to combine similar objects into a cluster.
When clustering objects, the main questions are how many clusters to use and
whether there is an “optimal number” of clusters. Although this decision is ultimately
up to the user, there are a number of criteria that can be used to derive indications for the
“best possible” number of clusters. The decision for a certain number of clusters is often
a compromise between manageability (small number of clusters) and homogeneity (large
number of clusters). Once the decision on the number of clusters has been made, the last
step is to interpret the content of the resulting clusters.
Thus, the procedure of hierarchical cluster analysis comprises the process steps
shown in Fig. 8.2.

8.2.1 Selection of Cluster Variables

(Process steps of cluster analysis, cf. Fig. 8.2; current step 1: Selection of cluster variables)

A central goal of cluster analysis is to group objects together that are homogenous with
respect to certain criteria in order to subject them to different measures. Take, for exam-
ple, a market segmentation analysis, in which groups of customers with similar purchas-
ing behavior (so-called market segments) are identified and subsequently subjected to
segment-specific marketing concepts in order to avoid the scattering losses typical of
undifferentiated marketing. Segmentation (clustering) typically leads to more effective
and efficient marketing concepts.

The example of market segmentation shows that the homogeneity of a cluster (seg-
ment) is defined by the variables used for cluster formation. The selection of the cluster
variable is fundamental in preparing a cluster analysis, even though this content-driven
question is not important for the clustering algorithm itself. The “correct” definition of
the cluster variable determines how well the results of a cluster analysis can be used later
on. Based on the example of market segmentation (cf. Wedel & Kamakura, 2000; Wind,
1978, pp. 318 ff.), the following characteristics can be derived that should be fulfilled by
cluster variables in general:

• relevance for the grouping,


• independence,
• measurability,
• comparability of the measurement dimensions,
• controllability (may be influenced),
• high separating power,
• representativeness,
• cluster stability.

Cluster variables must be highly relevant to the content-related objectives of a cluster


analysis. For example, if the goal of a cluster analysis is to find groups of people with
similar behavior, the cluster variables must be of behavioral relevance. If, for example,
political parties are to be clustered, the cluster variables must meaningfully describe the
parties’ programs. The user must therefore make sure that only those
characteristics are considered in the grouping process which can be regarded as relevant
to the subject matter from a theoretical or logical point of view. Characteristics that are
irrelevant to the context of the investigation must be excluded from the cluster analysis.
Since a data set is usually described by several variables, which are then included in
the cluster formation, the cluster variables should be independent. In particular, we need
to ensure that the cluster variables do not show high correlations. If correlations are pres-
ent, there is an “implicit weighting” of certain aspects in the clustering process, which
can then lead to a distortion of the results. The following options are available to the user
if cluster variables are correlated:

• Exclusion of variables: Information provided by a highly correlated variable is largely


captured by the other variable and can therefore be considered redundant. The exclu-
sion of correlated characteristics from the initial data matrix is therefore a way to
ensure that the data are equally weighted.
• Conduction of a factor analysis (principal component analysis) prior to cluster anal-
ysis: With the help of a factor analysis (see Chap. 7) highly correlated variables can
be reduced to independent factors. Principal component analysis should be used as
the extraction procedure and the number of factors extracted (principal components)
should equal the number of variables. In this way, no information is lost in the data

set, but the principal components are independent of each other. The cluster analysis
should then be performed on the basis of the factor values. It should be noted, how-
ever, that it might be difficult to interpret the factors and thus the factor values if only
the central factors and not all factors are used. If fewer principal components than
variables are extracted, part of the initial information is lost.
• Mahalanobis distance as proximity measure: If the Mahalanobis distance is used
to determine the differences between the objects, any correlations between the var-
iables can be excluded (in the distance calculation between objects). However, the
Mahalanobis distance imposes certain requirements on the data (e.g. uniform mean
values of the variables in all groups), which are often not fulfilled, especially in clus-
ter analysis (Kline, 2011, p. 54).

Cluster variables should, if possible, be manifest variables that are also observable and
measurable in reality. If, on the other hand, cluster variables are hypothetical variables
(latent variables), suitable operationalizations must be found.
If cluster variables are measured in different measurement dimensions (scales), this
may lead to an increase in distances between objects. In order to establish comparability
between the variables, a standardization procedure should be carried out in advance for
metrically scaled cluster variables. As a consequence, all (standardized) variables have a
mean value of zero and a variance of 1.2
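The standardization step can be illustrated with a small sketch. The following Python/numpy example is purely illustrative and not part of the book's SPSS workflow; the data matrix and variable scales are invented for the illustration. It z-standardizes the cluster variables and then checks them for high correlations, as required above.

```python
import numpy as np

# Hypothetical raw data: five objects described by three cluster variables
# measured on very different scales (e.g., age in years, income in EUR, a 7-point rating).
X = np.array([
    [25, 1500, 2.0],
    [27, 1800, 3.0],
    [38, 3500, 5.0],
    [41, 3900, 6.0],
    [23, 1400, 1.0],
])

# z-standardization: each variable gets a mean of 0 and a variance of 1,
# so that no variable dominates the distance calculation merely because of its scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Check the cluster variables for high correlations ("implicit weighting").
print(np.round(np.corrcoef(Z, rowvar=False), 2))
```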
Typically, following a cluster analysis the user would like to subject the resulting clusters
to specific measures. These measures should be tailored to the properties of the clusters. It is
therefore important to ensure that the cluster variable can be influenced by the user.
Since the various clusters should be as heterogeneous as possible, the cluster varia-
ble(s) must have a high separating power to distinguish clusters. Cluster variables that
are very similar for all objects (so-called constant characteristics) lead to a levelling of
the differences between the objects and thus cause distortions when fusing objects. Since
constant characteristics are not separable, they should be excluded from the analysis in
advance. This applies in particular to characteristics with a high share of zero values.
If a cluster analysis is carried out on the basis of a sample with the goal of drawing
conclusions about the population, it must be ensured that the individual groups contain
enough elements to represent the corresponding subgroups of the population. Since it
is usually not known in advance which groups are represented in a population—since
finding such groups is precisely the aim of cluster analysis—outliers in a data set should
be eliminated. Outliers influence the fusion process, make it more difficult to recognize
relations between objects, and thus result in distortions.3

2 Cf. the explanations on the standardization of variables in the statistical principles in Sect. 1.2.
3 For the analysis of outliers, see also the explanations on the statistical principles in Sect. 1.5.1 as
well as the description of the single-linkage method in Sect. 8.2.3.2, which is particularly suitable
for identifying outliers in cluster analyses.

Table 8.2 Structure of the raw data matrix

           Variable 1   Variable 2   …   Variable J
Object 1
Object 2
…
Object N

Cluster characteristics may change over time. However, it is important for the pro-
cessing of clusters that their characteristics remain stable at least for a certain period
of time, since measures developed on the basis of clustering will usually need a certain
amount of time to show effects.

8.2.2 Determination of Similarities or Distances

(Process steps of cluster analysis, cf. Fig. 8.2; current step 2: Determination of similarities or distances)

The starting point of cluster analysis is a raw data matrix with N objects (e.g. persons,
companies, products) which are described along J variables. The general structure of this
raw data matrix is illustrated in Table 8.2.
This matrix contains the values of the variables for each object. They can be met-
ric and/or non-metric. The first step is about determining the similarities between the
objects by using a statistical measure. For this purpose, the raw data matrix is converted
into a distance or similarity matrix (Table 8.3) that is always a square (N × N) matrix.
This matrix contains the similarity or dissimilarity values (distance values) between
the objects, which are calculated using the object-related variable values from the raw
data matrix. Measures that quantify similarities or differences between objects are gener-
ally referred to as proximity measures.

8.2.2.1 Overview of Proximity Measures in Cluster Analysis


Two types of proximity measures can be distinguished:

• Similarity measures reflect the similarity of two objects: the greater the value of a
similarity measure, the more similar two objects are to each other.
• Distance measures reflect the dissimilarity between two objects: the greater the value
of a distance measure, the more dissimilar two objects are to each other. If two objects
are completely identical, the distance measure equals zero.

Table 8.3 Structure of a distance or similarity matrix

           Object 1   Object 2   …   Object N
Object 1
Object 2
…
Object N

Measures of similarity and distance are complementary, i.e. similarity = 1 − dissimilarity.


A large number of proximity measures exists depending on the measurement level of the
variables. We distinguish proximity measures for metric data, binary data (0/1 variable)
and count data (discrete frequencies). Count data are integer values that result, for exam-
ple, from counting nominally scaled attributes. Table 8.4 gives an overview of common
proximity measures used in practice.4
In the following, the considerations concentrate on cluster analysis based on met-
rically scaled variables. If the variables are nominally scaled or if binary variables
(0/1-variable) are present, the measures that can be used to determine the proximity will
change accordingly (see Sect. 8.4.1).5

8.2.2.2 Proximity Measures for Metric Data


To explain the various metrics, we use a small numerical example.

Application example
In a survey, 30 persons were asked about their perceptions of five chocolate flavors.
The test persons assessed the flavors ‘Biscuit’, ‘Nut’, ‘Nougat’, ‘Cappuccino’ and
‘Espresso’ regarding the attributes (i.e., variables) ‘price’, ‘bitter’ and ‘refreshing’ on
a 7-point scale from high (=7) to low (=1). Table 8.5 shows the mean subjective per-
ception values of the 30 persons interviewed about chocolate flavors.6 ◄

For all metrics considered below, we need to ensure that comparable measures are used.
This is fulfilled in our application example, as all three attributes were assessed on a

4 The selection of the proximity measures shown in Table 8.4 is based on the proximity measures
provided in the SPSS procedure “Hierarchical Cluster Analysis”.
5 On the website www.multivariate-methods.info, we provide supplementary material (e.g., Excel

files) to deepen the reader’s understanding of the methodology.


6 To simplify the following calculations, only integer values were included in the initial data

matrix.

Table 8.4 Selected proximity measures for hierarchical cluster analysis (by scale level of the attributes)

Similarity measures
– Metric data (interval): Cosine; Pearson correlation
– Binary data (0/1): Simple matching (M-coefficient); Phi 4-point correlation; Lambda (Goodman & Kruskal); Dice; Jaccard; Rogers and Tanimoto; Russel and Rao

Distance measures
– Metric data (interval): Euclidean distance; Squared Euclidean distance; Chebychev; City block metric; Minkowski
– Binary data (0/1): Euclidean distance; Squared Euclidean distance; Size difference; Pattern difference; Variance; Dispersion; Lance and Williams
– Count data (frequency): Chi-square measure; Phi-square measure

Table 8.5 Application example with five products and three metrically scaled attributes

Flavors          Price   Bitter   Refreshing
1: Biscuit         1       2         1
2: Nut             2       3         3
3: Nougat          3       2         1
4: Cappuccino      5       4         7
5: Espresso        6       7         6

7-point rating scale. If this requirement is not fulfilled, the initial data must first be made
comparable, e.g. with the help of a standardization procedure.7

8.2.2.2.1 Simple and Squared Euclidean Distance Metric (L2 Norm)


The Euclidean distance (L2 norm) is one of the most widely used distance measures
in empirical applications. It measures the shortest distance between two objects (cf.
Fig. 8.3). To compute the squared Euclidean distance, the differences between variables
are first squared and then summed up. The Euclidean distance is obtained by taking the
square root of this sum.

7 On the standardization of variables, see the comments on statistical basics in Sect. 1.2.1.

Fig. 8.3 Visualization of the Euclidean distance in the two-variable case: the distance between objects k and l corresponds to the length of the direct connection, i.e. the hypotenuse of the right-angled triangle with the legs (xk1 − xl1) and (xk2 − xl2)


d_{k,l} = \sqrt{ \sum_{j=1}^{J} \left( x_{kj} - x_{lj} \right)^2 }        (8.1)

with
dk,l distance between objects k and l
xkj,xlj value of variable j for objects k, l (j = 1,2,…,J)

For a two-variable case, Fig. 8.3 illustrates the Euclidean distance as the direct (shortest)
connection between the objects k and l, which corresponds to the hypotenuse of the
right-angled triangle.
In the two-variable case, the Euclidean distance between the points k = ‘Biscuit’ with
the coordinates (6,1) and l = ‘Nut’ with the coordinates (2,5) is calculated as follows:
d_{Biscuit,Nut} = \sqrt{(6 - 2)^2 + (1 - 5)^2} = \sqrt{16 + 16} = 5.656

Table 8.6  Distance matrix according to the squared Euclidean distance in the application
example
Biscuit Nut Nougat Cappuccino Espresso
Biscuit 0
Nut 6 0
Nougat 4 6 0
Cappuccino 56 26 44 0
Espresso 75 41 59 11 0

For the multi-variable case (example in Table 8.5), the squared Euclidean distance for
the product pair ‘Biscuit’ and ‘Nut’ is calculated as follows:

d_{Biscuit,Nut} = (1 - 2)^2 + (2 - 3)^2 + (1 - 3)^2 = 1 + 1 + 4 = 6

By squaring the values, large differences have a higher impact on the distance measure
while small difference values have a lower impact. Moreover, positive and negative dif-
ferences can no longer cancel each other out. The Euclidean distance is then obtained by
calculating the square root of the squared Euclidean distance; in the example above, the
squared Euclidean distance is 6, so the Euclidean distance is √6 ≈ 2.449.
Both the squared Euclidean distance and the Euclidean distance can be used for
measuring the dissimilarity of objects. Since simulation studies have shown that many
algorithms give the best results when using the squared Euclidean distance, the consider-
ations below focus on the squared Euclidean distance.
Table 8.6 summarizes the squared Euclidean distances for our numerical exam-
ple with five products. Since the distance of an object to itself is always zero, the main
diagonal of a distance matrix contains zeros. The distance matrix already shows that
the smallest distance is observed between ‘Biscuit’ and ‘Nougat’, while ‘Biscuit’ and
‘Espresso’ are the most dissimilar flavors (indicated by bold values).
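For readers who want to retrace Table 8.6 outside of SPSS, the following is a small illustrative Python/numpy sketch (the variable names are our own); it computes the squared Euclidean distances for the ratings in Table 8.5.

```python
import numpy as np

# Ratings from Table 8.5 (columns: price, bitter, refreshing).
flavors = ["Biscuit", "Nut", "Nougat", "Cappuccino", "Espresso"]
X = np.array([
    [1, 2, 1],   # Biscuit
    [2, 3, 3],   # Nut
    [3, 2, 1],   # Nougat
    [5, 4, 7],   # Cappuccino
    [6, 7, 6],   # Espresso
])

# Squared Euclidean distance: sum of squared differences over all variables
# (Eq. 8.1 without taking the square root).
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
print(D)   # reproduces Table 8.6, e.g. Biscuit-Nut = 6, Biscuit-Espresso = 75
```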

8.2.2.2.2 City Block Metric (L1 Norm)


In the city block metric, the distance between two objects is calculated as the sum of the
absolute distances between the objects. This distance measure is calculated as follows:
d_{k,l} = \sum_{j=1}^{J} \left| x_{kj} - x_{lj} \right|        (8.2)

The idea of the city block metric is derived from a city like Manhattan (with streets in a
checkerboard pattern), where the distance between two locations is determined by driv-
ing along the blocks of streets from one of the locations to the other one. It is therefore

Fig. 8.4 Visualization of the city block metric in a two-variable case: the distance between objects k and l is the sum of the absolute differences along the two axes, |xk1 − xl1| + |xk2 − xl2|

also called Manhattan metric or taxi driver metric. It plays an important role in certain
practical applications, such as the grouping of locations, and is derived by calculating the
difference between two objects for each variable and adding the resulting absolute differ-
ence values. Figure 8.4 illustrates this for a two-variable case.
The distance of the objects k = ‘Biscuit’ with the coordinates (6,1) and l = ‘Nut’ with
the coordinates (2,5) is:
dBiscuit,Nut = |6 − 2| + |1 − 5| = 8,
which results from inserting the values in Eq. (8.2).
For the multi-variable case (example in Table 8.5), the city block metric for the prod-
uct pair ‘Biscuit’ and ‘Nut’ is calculated as follows, with the first number in the differ-
ence calculation representing the property value of ‘Biscuit’:
dBiscuit, Nut = |1 − 2| + |2 − 3| + |1 − 3|
=1+1+2
=4

Table 8.7  Distance matrix according to the city block metric (L1 norm)
Biscuit Nut Nougat Cappuccino Espresso
Biscuit 0
Nut 4 0
Nougat 2 4 0
Cappuccino 12 8 10 0
Espresso 15 11 13 5 0

Table 8.8 Sequence of similarities according to the city block metric and the squared Euclidean distance (values in brackets)

              Biscuit   Nut     Nougat   Cappuccino
Nut            2 (2)
Nougat         1 (1)    2 (2)
Cappuccino     7 (7)    4 (4)   5 (6)
Espresso       9 (9)    6 (5)   8 (8)    3 (3)

This means that the distance between the products ‘Biscuit’ and ‘Nut’ is 4 according to
the city block metric. Based on the city block metric, the distances for all other pairs of
objects are determined in the same way, with the results shown in Table 8.7.
As above (Table 8.7), the products ‘Nougat’ and ‘Biscuit’ have the greatest similar-
ity with a distance value of 2, while the least similarity exists between ‘Espresso’ and
‘Biscuit’ with a distance of 15.
Regarding the most similar and the most dissimilar pair, the squared Euclidean dis-
tance leads to the same results as the city block metric. A complete comparison of the
results of the city block metric (L1 norm) and the Euclidean distance (L2 norm) (cf.
Table 8.8) shows a shift in the order for the product pairs ‘Cappuccino’ and ‘Nougat’ as
well as ‘Espresso’ and ‘Nut’. Thus, it is obvious that the choice of the distance measure
influences the sequence of the test objects. This means that the proximity measure should
not be chosen arbitrarily, but according to its suitability for the application.
The results are different due to the way the calculated differences are taken into
account: While in the city block metric all difference values are weighted equally, large
differences have a stronger effect in the Euclidean distance metric due to the effect of
squaring.
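The city block distances in Table 8.7 can be retraced in the same way. The following minimal numpy sketch is again only an illustration (not the book's SPSS procedure) and applies Eq. (8.2) to the data of Table 8.5.

```python
import numpy as np

# Ratings from Table 8.5 (Biscuit, Nut, Nougat, Cappuccino, Espresso).
X = np.array([[1, 2, 1], [2, 3, 3], [3, 2, 1], [5, 4, 7], [6, 7, 6]])

# City block (L1) distance: sum of absolute differences over all variables (Eq. 8.2).
D_city = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
print(D_city)   # reproduces Table 8.7, e.g. Biscuit-Nougat = 2, Biscuit-Espresso = 15
```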

8.2.2.2.3 Minkowski Metric (Generalization of the L Norms)


A generalization of the Euclidean distance and the city block metric is provided by the
Minkowski metric (L norm). For two objects k and l, the absolute differences of the
variable values are calculated over all dimensions. These differences are raised to the
power of a constant factor r and then summed up. By raising the total sum to the power
of 1/r, the distance d(k,l) is obtained as follows:

Table 8.9  Similarity matrix according to the Pearson correlation coefficient


Biscuit Nut Nougat Cappuccino Espresso
Biscuit 1.000
Nut 0.500 1.000
Nougat 0.000 −0.866 1.000
Cappuccino −0.756 0.189 −0.655 1.000
Espresso 1.000 0.500 0.000 −0.756 1.000

d_{k,l} = \left[ \sum_{j=1}^{J} \left| x_{kj} - x_{lj} \right|^{r} \right]^{1/r}        (8.3)

with
dk,l distance between objects k and l
xkj,xlj value of variable j for objects k, l (j = 1,2,…,J)
r ≥ 1 Minkowski constant

The Minkowski constant r may take any value with r ≥ 1. For r = 1, the city block metric (L1
norm) results, and for r = 2, the Euclidean distance (L2 norm) follows.
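A small helper function makes the relationship between the L norms explicit. The following Python sketch is illustrative only; the function name `minkowski` is our own and not taken from any library or from the book.

```python
import numpy as np

def minkowski(x, y, r=2.0):
    """Minkowski distance (Eq. 8.3); r = 1 yields the city block metric,
    r = 2 the Euclidean distance."""
    return float((np.abs(np.asarray(x) - np.asarray(y)) ** r).sum() ** (1.0 / r))

biscuit, nut = [1, 2, 1], [2, 3, 3]
print(minkowski(biscuit, nut, r=1))   # 4.0 (city block metric, cf. Table 8.7)
print(minkowski(biscuit, nut, r=2))   # 2.449... (square root of the squared Euclidean distance 6)
```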

8.2.2.2.4 Pearson Correlation as a Measure of Similarity


If the similarity between objects with a metric variable structure is not to be determined
by a distance measure but by a similarity measure, the correlation coefficient is a com-
monly used measure, which can be calculated as follows:
r_{k,l} = \frac{ \sum_{j=1}^{J} (x_{jk} - \bar{x}_{k}) \cdot (x_{jl} - \bar{x}_{l}) }{ \left[ \sum_{j=1}^{J} (x_{jk} - \bar{x}_{k})^2 \cdot \sum_{j=1}^{J} (x_{jl} - \bar{x}_{l})^2 \right]^{1/2} }        (8.4)

with
x_{jk}, x_{jl}   value of variable j (j = 1, 2, …, J) for objects k and l
\bar{x}_{k}, \bar{x}_{l}   mean value of the J variables for object k and for object l, respectively

The Pearson correlation between two objects k and l takes all variables of an object into
account.8 For the application example, the similarity matrix shown in Table 8.9 is based
on the Pearson correlation coefficient.

8 A detailed description of the calculation of the correlation coefficient may be found in Sect. 1.2.2.

Fig. 8.5 Profile curves of ‘Biscuit’ and ‘Espresso’ across the attributes price, bitter and refreshing (7-point scale): the two profiles have the same shape but lie at different levels

When comparing these similarity values with the distance values in Table 8.6, it
becomes clear that the relation between the objects has changed significantly. According
to the squared Euclidean distance, ‘Espresso’ and ‘Biscuit’ are most dissimilar, while
they are defined as the most similar product pair according to the correlation coefficient.
Similarly, according to the squared Euclidean distance, ‘Nougat’ and ‘Biscuit’ are very similar
(with a distance of 4), while the pair is considered completely dissimilar according to the
correlation coefficient (with a correlation of 0 in Table 8.9).
These comparisons show that similarity or distance measures need to be chosen
depending on the target of the researcher. To illustrate this, let us take a look at the pro-
file curves of ‘Biscuit’ and ‘Espresso’ shown in Fig. 8.5 according to the initial data in
our example.
The profiles show that although the values for ‘Biscuit’ and ‘Espresso’ are distant
from each other, their profiles are the same. This explains why the products are dissim-
ilar when using a distance measurement while they are found to be similar when using
the correlation coefficient. In general, the following can be concluded:

• Distance measures consider the absolute distance between objects, and the dissimilar-
ity is greater if two objects are further away from each other according to the consid-
ered variables.
• Similarity measures based on correlation values consider how similar the profiles of
two objects are, regardless of the specific values of the objects on the considered
variables.

Let us consider an example: for a number of companies, product sales have been
recorded over a period of five years (each year representing one variable). With the help
of a cluster analysis, these companies are to be grouped according to one of the following two criteria:

1. they have achieved a similar sales level with their product;


2. they have shown a similar sales trend over the five years.

In the first case, clustering is based on the sales level, which means that the proxim-
ity between the companies must be determined using a distance measure. In the second
case, similar sales trends are of interest. Therefore, a similarity measure (e.g. the Pearson
correlation coefficient) is a feasible proximity measure.
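As an illustration (a numpy sketch, not the book's SPSS output), the similarity matrix in Table 8.9 can be reproduced by correlating the rows of the data matrix in Table 8.5, i.e. the object profiles rather than the variables.

```python
import numpy as np

# Ratings from Table 8.5; rows are the objects, columns the variables price/bitter/refreshing.
X = np.array([[1, 2, 1], [2, 3, 3], [3, 2, 1], [5, 4, 7], [6, 7, 6]], dtype=float)

# Pearson correlation between OBJECTS (Eq. 8.4): correlate the rows (profiles), not the columns.
S = np.corrcoef(X)
print(np.round(S, 3))   # reproduces Table 8.9, e.g. Biscuit-Espresso = 1.0, Biscuit-Nougat = 0.0
```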

8.2.3 Selection of the Clustering Method

(Process steps of cluster analysis, cf. Fig. 8.2; current step 3: Selection of the clustering method)

The previous section has shown how to derive a distance or similarity matrix from a data
set using proximity measures. The obtained distance or similarity matrix is the starting
point for the subsequent use of clustering methods, which aim to assign similar objects
to the same cluster. Cluster analysis offers the user a wide range of algorithms for group-
ing a given set of objects. The great advantage of cluster analysis is that a large number
of variables can be used simultaneously to group the objects. Clustering methods can be
classified according to the type of clustering procedure. Figure 8.6 provides an overview
of different clustering methods (clustering algorithms).
The following explanations focus on hierarchical cluster procedures since they are of
great importance for practical applications. Explanations of the partitioning procedures
are given in Sect. 8.4. The hierarchical procedures may be classified into agglomerative
and divisive methods. Agglomerative methods start from the finest partition, in which
each object forms its own cluster, whereas divisive methods start from the coarsest
partition, in which all objects belong to one group. Agglomerative methods therefore
successively merge objects into groups, while divisive clustering methods successively
divide the full sample into groups.
Due to their great practical importance, we first focus on hierarchical, agglomerative
clustering methods. The case study in Sect. 8.3 describes a hierarchical cluster analysis
using Ward’s method. Sect. 8.4.2 gives a comparatively short description of partitioning

9 Due to their rather minor practical importance, divisive cluster procedures will not be discussed
here. If you consider applying a divisive clustering algorithm, you can do this in SPSS by clicking
on ‘Analyze/Classify/Tree‘.

Cluster methods

hierarchical partitioning
methods methods

agglomerative divisive exchange minimum-distance-


methods methods methods methods (K-Means)

• Between-groups linkage • Centroid clustering • optimal methods • sequential method


• Within-groups linkage • Median clustering • parallel methods • parallel method
• Nearest neighbor • Ward‘s method
(single linkage)
• Furthest neigbor
(complete linkage)
metric scale level
any scale level
(preferred: Squared Euclid)

Fig. 8.6 Overview of selected clustering methods

clustering methods, concentrating on k-means and two-step cluster analysis since these
methods play a central role in the analysis of large data sets.

8.2.3.1 Hierarchical Agglomerative Procedures


The agglomerative cluster procedures shown in Fig. 8.6 are frequently used in practice.
The procedure of these methods can be described by the following general process steps:10

• Starting point: Each object represents a cluster. With N objects, there are N single-ob-
ject clusters.
• Step 1: For the objects (clusters) contained on a fusion stage, the pairwise distances or
similarities between the objects (clusters) are calculated.
• Step 2: The two objects (clusters) with the smallest distance (or greatest similar-
ity) to each other are combined into a cluster. The number of objects or groups thus
decreases by 1.
• Step 3: The distances between the new cluster and the remaining objects or groups are
calculated, resulting in the so-called reduced distance matrix.
• Step 4: Steps 2 and 3 are repeated until all objects are contained in only one cluster
(so-called single-cluster solution). With N objects, a total of N–1 fusion steps are car-
ried out.

10 The course of a fusion process is usually illustrated by a table (so-called agglomeration sched-
ule) and by a dendrogram or icicle diagrams. Both options are explained in detail for the sin-
gle-linkage method in Sect. 8.2.3.2.1.

Differences in distance calculation


The agglomerative methods may be differentiated according to the way the distance
between an object (cluster) R and the new cluster (P + Q) is calculated in step 3. If two
objects (clusters) P and Q are to be combined, the distance D(R;P + Q) between any clus-
ter R and the new cluster (P + Q) is a result of the following transformation (cf. Kaufman
& Rousseeuw, 2005, p. 225):
D(R; P + Q) = A \cdot D(R, P) + B \cdot D(R, Q) + E \cdot D(P, Q) + G \cdot \left| D(R, P) - D(R, Q) \right|        (8.5)

with
D(R, P)   distance between clusters R and P
D(R, Q)   distance between clusters R and Q
D(P, Q)   distance between clusters P and Q

The values A, B, E and G are constants that vary depending on the algorithm used. The
agglomerative methods listed in Table 8.10 are characterized by assigning certain values
to the constants in Eq. (8.5). Table 8.10 shows the respective values and the resulting
distance calculations for selected agglomerative processes (cf. Kaufman & Rousseeuw,
2005, pp. 225 ff.).
While for the first four methods all available proximity measures may be used, the
application of the methods “Centroid”, “Median” and “Ward” only makes sense if a dis-
tance measure is used. Regarding the measurement level of the data, the procedures can
be applied to both metric and non-metric data. The only decisive factor here is that the
proximity measures used are matched to the measurement level (metric or non-metric).
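The general recurrence in Eq. (8.5) together with the constants in Table 8.10 can be written as a small helper function. The following Python sketch is purely illustrative; the function name `lw_update` is our own, the constants for single and complete linkage are taken from Table 8.10, and the example distances are those of Table 8.6.

```python
def lw_update(d_RP, d_RQ, d_PQ, A, B, E, G):
    """General distance update of Eq. (8.5): distance of cluster R to the merged cluster (P + Q)."""
    return A * d_RP + B * d_RQ + E * d_PQ + G * abs(d_RP - d_RQ)

# Constants of Table 8.10 for two of the methods.
SINGLE   = dict(A=0.5, B=0.5, E=0.0, G=-0.5)
COMPLETE = dict(A=0.5, B=0.5, E=0.0, G=0.5)

# Distance of 'Cappuccino' (R) to the merged cluster 'Biscuit, Nougat' (P + Q), using the
# squared Euclidean distances of Table 8.6: D(R,P) = 56, D(R,Q) = 44, D(P,Q) = 4.
print(lw_update(56, 44, 4, **SINGLE))    # 44.0 -> smallest of the old distances (nearest neighbor)
print(lw_update(56, 44, 4, **COMPLETE))  # 56.0 -> largest of the old distances (furthest neighbor)
```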

Illustration of the clustering process via an agglomeration schedule


The fusion process of a clustering method is usually illustrated by a so-called agglom-
eration schedule. For each fusion step, the table shows which two objects or clusters are
combined at which heterogeneity level. Furthermore, it is indicated at which level the
objects or clusters were first included in the fusion process and at which level the formed
cluster is considered next. Sect. 8.2.3.2.1 gives an example of an agglomeration schedule
in connection with the single-linkage method (see Table 8.14).

Illustration of the clustering process via a dendrogram


The clustering process of a clustering method can also be illustrated by a so-called den-
drogram. A dendrogram indicates the degree of heterogeneity associated with a given
number of clusters. Therefore, in hierarchical cluster procedures all objects considered
in an investigation are usually listed on the vertical axis. The objects form the start-
ing points for the agglomerative methods and each object represents its own “cluster”.
Accordingly, at the beginning the clusters always have a heterogeneity of “0”. As the
clustering progresses, heterogeneity increases, and the dendrogram graphically connects
Table 8.10 Calculation of distances for selected agglomerative processes

Each process assigns specific values to the constants A, B, E and G of Eq. (8.5); the resulting distance D(R; P + Q) is listed after the constants.

Single linkage (A = 0.5; B = 0.5; E = 0; G = −0.5):
  0.5 · {D(R, P) + D(R, Q) − |D(R, P) − D(R, Q)|}
Complete linkage (A = 0.5; B = 0.5; E = 0; G = 0.5):
  0.5 · {D(R, P) + D(R, Q) + |D(R, P) − D(R, Q)|}
Average linkage, unweighted (A = 0.5; B = 0.5; E = 0; G = 0):
  0.5 · {D(R, P) + D(R, Q)}
Average linkage, weighted (A = NP/(NP + NQ); B = NQ/(NP + NQ); E = 0; G = 0):
  1/(NP + NQ) · {NP · D(R, P) + NQ · D(R, Q)}
Centroid (A = NP/(NP + NQ); B = NQ/(NP + NQ); E = −NP · NQ/(NP + NQ)²; G = 0):
  1/(NP + NQ) · {NP · D(R, P) + NQ · D(R, Q)} − NP · NQ/(NP + NQ)² · D(P, Q)
Median (A = 0.5; B = 0.5; E = −0.25; G = 0):
  0.5 · {D(R, P) + D(R, Q)} − 0.25 · D(P, Q)
Ward (A = (NR + NP)/(NR + NP + NQ); B = (NR + NQ)/(NR + NP + NQ); E = −NR/(NR + NP + NQ); G = 0):
  1/(NR + NP + NQ) · {(NR + NP) · D(R, P) + (NR + NQ) · D(R, Q) − NR · D(P, Q)}

NR: number of objects in group R; NP: number of objects in group P; NQ: number of objects in group Q

Fig. 8.7 Dendrogram of the single-linkage method, shown for the four stages of fusion: ‘Biscuit’ and ‘Nougat’ are merged at a distance coefficient of 4, ‘Nut’ joins this cluster at 6, ‘Cappuccino’ and ‘Espresso’ are merged at 11, and the two remaining clusters are combined at 26

Fig. 8.8 Dendrogram of the complete-linkage process (stages of fusion 1 to 4 at distances 4, 6, 11 and 75)

those objects that are merged at a certain clustering level. For the three clustering meth-
ods presented in Sect. 8.2.3.2, the corresponding dendrograms are shown in Figs. 8.7, 8.8
and 8.9, respectively.

Fig. 8.9 Dendrogram of the Ward process (stages of fusion 1 to 4; horizontal axis: variance criterion/error sum of squares, which reaches 65.6 when all objects are combined in one cluster)

8.2.3.2 Single Linkage, Complete Linkage and the Ward Procedure


In the following, three cluster procedures frequently used in practice are explained in
detail. We will use the data of the application example in Table 8.5 and the squared
Euclidean distance matrix in Table 8.6.
Single linkage, complete linkage and Ward’s method may be differentiated according
to how the distance between a single object and a cluster (or later in the fusion process
between two clusters) is formed and how new cluster centers are calculated. Table 8.11
shows the different types of distance formation for the three methods.
In the following, the above three procedures are presented and their respective clustering
process is explained. In Sects. 8.2.3.3 and 8.2.3.4, the clustering properties of the three
methods are characterized and illustrated using an extended example.

8.2.3.2.1 Single-Linkage Clustering (Nearest Neighbor)


In our application example (Table 8.5), the single-linkage method performs the following
clustering stages when starting from the distance matrix in Table 8.6.

Table 8.11  Distance formation between objects and clusters


Algorithm Distance formation
Single-linkage clustering Smallest distance between the members of two groups/objects
(Nearest neighbor)
Complete-linkage clustering Largest distance between the members of two groups/objects
(Furthest neighbor)
Ward’s method Smallest increase in the error sum of squares
(Minimum variance linkage) (variance criterion)

Stage 1
The objects that have the smallest distance, i.e. the objects that are most similar, are
combined. Thus, in the first iteration the objects ‘Biscuit’ and ‘Nougat’ are combined
with a distance of 4 (see stage 1 in Fig. 8.7).

Stage 2
Since ‘Biscuit’ and ‘Nougat’ now form a separate cluster, the distance of this cluster to
all other objects must be determined next. The new distance between the new cluster
‘Biscuit, Nougat’ and an object R is determined according to Eq. (8.5) as follows (see
Table 8.10).
D(R; P + Q) = 0.5{D(R, P) + D(R, Q) − |D(R, P) − D(R, Q)|} (8.6)
Thus, the distance sought is simply the smallest value of the individual distances:
D(R; P + Q) = min{D(R, P); D(R, Q)}
The single-linkage method thus assigns to a newly formed group the smallest distance
resulting from the old distances of the objects combined in the group to a specific other
object. Therefore, this method is also known as “nearest neighbor method”.
Let us illustrate this approach, for the example, of the distance between the cluster
‘Biscuit, Nougat’ and ‘Cappuccino’. To calculate the new distance, the distances between
‘Biscuit’ and ‘Cappuccino’ and between ‘Nougat’ and ‘Cappuccino’ are calculated. The
initial distance matrix (Table 8.6) shows that the first distance is 56 and the second dis-
tance is 44. Thus, in the second iteration of the single-linkage procedure, the distance
between the ‘Biscuit, Nougat’ group and ‘Cappuccino’ is 44.
Formally, these distances can also be determined using Eq. (8.6). P + Q represents the
group ‘Nougat’ (P) and ‘Biscuit’ (Q), and R represents a remaining object. In our example,
the new distances between ‘Nougat, Biscuit’ and the other objects result as follows
(see the values in Table 8.6):

D(Nut; Nougat + Biscuit) = 0.5 · {(6 + 6) − |6 − 6|} = 6
D(Cappuccino; Nougat + Biscuit) = 0.5 · {(44 + 56) − |44 − 56|} = 44
D(Espresso; Nougat + Biscuit) = 0.5 · {(59 + 75) − |59 − 75|} = 59

The reduced distance matrix is obtained by removing the rows and columns of the
merged objects from the distance matrix and inserting a new column and row for the
newly built cluster. At the end of the first iteration, a reduced distance matrix is generated
(Table 8.12), which is used as starting point for the second iteration.
Next, the objects (clusters) with the smallest distance (according to the reduced
distance matrix) are combined again. This means that ‘Nut’ is included in the cluster
‘Nougat, Biscuit’, because it has the smallest distance d = 6 (see stage 2 in Fig. 8.7).

Table 8.12 Reduced distance matrix after the first clustering step in the single-linkage process

              Nougat, Biscuit   Nut   Cappuccino
Nut                  6
Cappuccino          44           26
Espresso            59           41       11

Table 8.13 Reduced distance matrix after the second clustering step in the single-linkage process

              Nougat, Biscuit, Nut   Cappuccino
Cappuccino            26
Espresso              41                 11

Stage 3
For the reduced distance matrix in the second iteration, the distances between the group
‘Nougat, Biscuit, Nut’ and ‘Cappuccino’ or ‘Espresso’ are calculated as follows:

D(Cappuccino; Nougat + Biscuit + Nut) = 0.5 · {(44 + 26) − |44 − 26|} = 26
D(Espresso; Nougat + Biscuit + Nut) = 0.5 · {(59 + 41) − |59 − 41|} = 41

This results in the reduced distance matrix shown in Table 8.13.


According to the values in Table 8.13, the objects ‘Espresso’ and ‘Cappuccino’ are
merged into a separate cluster with a distance of d = 11 (see stage 3 in Fig. 8.7).

Stage 4
The distance between the remaining clusters ‘Nut, Nougat, Biscuit’ and ‘Espresso,
Cappuccino’ is calculated on the basis of Table 8.13 as follows:
D(Nougat, Biscuit, Nut; Cappuccino, Espresso) = 0.5 · {(26 + 41) − |26 − 41|} = 26
This means that the two clusters ‘Nut, Nougat, Biscuit’ and ‘Espresso, Cappuccino’ are
combined in step 4 at a distance of d = 26 (cf. stage 4 in Fig. 8.7). After this step, all five
objects are combined in one cluster.

Summary
The clustering steps may be summarized in an agglomeration schedule. Table 8.14 shows
the agglomeration schedule for the single-linkage method for the application example
(cf. Table 8.5) and the corresponding distance matrix (Table 8.6). In the first step, objects
1 (Biscuit) and 3 (Nougat) are fused at a heterogeneity coefficient of 4.0. This corre-
sponds to the squared Euclidean distance between the two objects. In Table 8.14, these
objects are identified as single (“0” in the column “Stage Cluster First Appears”). In con-
trast, a newly formed cluster is always identified by the smallest number of the fused
objects (here: 1). This cluster 1 is then fused with object 2 (‘Nut’) in stage 2. The group

Table 8.14 Agglomeration schedule for the single-linkage method

Stage   Cluster Combined        Coefficients   Stage Cluster First Appears   Next Stage
        Cluster 1   Cluster 2                  Cluster 1   Cluster 2
1          1           3           4.000          0           0                  2
2          1           2           6.000          1           0                  4
3          4           5          11.000          0           0                  4
4          1           4          26.000          2           3                  0

formed in this way is again identified as “1”. Only in stage 4 is it merged with group “4”,
which consists of objects 4 (‘Cappuccino’) and 5 (‘Espresso’).
The steps of the clustering process shown in Table 8.14 can also be illustrated graph-
ically. Figure 8.7 shows the development of the dendrogram in the four clustering steps.
In software programs, however, only the overall process is shown (cf. the dendrogram for
stage 4).
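The complete single-linkage fusion process can also be retraced with a short script. The following Python sketch is an illustration only (not the SPSS procedure used in the book); it starts from the distance matrix in Table 8.6 and reproduces the coefficients of the agglomeration schedule in Table 8.14.

```python
import numpy as np

labels = ["Biscuit", "Nut", "Nougat", "Cappuccino", "Espresso"]
# Squared Euclidean distances from Table 8.6.
D = np.array([
    [ 0,  6,  4, 56, 75],
    [ 6,  0,  6, 26, 41],
    [ 4,  6,  0, 44, 59],
    [56, 26, 44,  0, 11],
    [75, 41, 59, 11,  0],
], dtype=float)

clusters = {i: [labels[i]] for i in range(len(labels))}
dist = {(i, j): D[i, j] for i in range(len(labels)) for j in range(i + 1, len(labels))}

while len(clusters) > 1:
    # Merge the two clusters with the smallest distance (step 2 of the general procedure).
    (p, q), d_min = min(dist.items(), key=lambda kv: kv[1])
    clusters[p] = clusters[p] + clusters.pop(q)
    print(f"merged at distance {d_min}: {clusters[p]}")
    # Single-linkage update (step 3): the new distance is the minimum of the old distances.
    for r in clusters:
        if r == p:
            continue
        key_rp = (min(r, p), max(r, p))
        key_rq = (min(r, q), max(r, q))
        dist[key_rp] = min(dist[key_rp], dist[key_rq])
    dist = {k: v for k, v in dist.items() if q not in k}
# Output: merges at distances 4, 6, 11 and 26 - the coefficients of Table 8.14.
```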

8.2.3.2.2 Complete Linkage (Furthest Neighbor)

Stage 1
The complete-linkage method also merges the objects ‘Biscuit’ and ‘Nougat’ in the first
step since they have the smallest distance (dk,l = 4) according to Table 8.6. The sin-
gle-linkage and the complete-linkage processes differ, however, in how the next dis-
tances are calculated.
Complete-linkage clustering calculates distances according to Eq. (8.5) (see
Table 8.10):
D(R; P + Q) = 0.5 · {D(R, P) + D(R, Q) + |D(R, P) − D(R, Q)|}        (8.7)
Equivalently, the distance can also be determined as follows:
D(R; P + Q) = max{D(R, P); D(R, Q)}
Therefore, this process is also referred to as “furthest neighbor method”. So, starting
with the distance matrix in Table 8.6, the objects ‘Biscuit’ and ‘Nougat’ are combined in
the first step. However, the distance of this group to the others, e.g. ‘Cappuccino’, is now
determined by the largest individual distance in the reduced distance matrix. Formally,
the individual distances according to Eq. (8.7) are as follows:

D(Nut; Nougat + Biscuit) = 0.5 · {(6 + 6) + |6 − 6|} = 6


D(Cappuccino; Nougat + Biscuit) = 0.5 · {(44 + 56) + |44 − 56|} = 56
D(Espresso; Nougat + Biscuit) = 0.5 · {(59 + 75) + |59 − 75|} = 75

This leads to the reduced distance matrix shown in Table 8.15.



Table 8.15 Reduced distance matrix after the first clustering step in the complete-linkage process

              Nougat, Biscuit   Nut   Cappuccino
Nut                  6
Cappuccino          56           26
Espresso            75           41       11

Stage 2
In stage 2, ‘Nut’ is included in the group ‘Biscuit, Nougat’ as it has the smallest distance
with d = 6 (Table 8.15). The process now continues in the same way as with the sin-
gle-linkage method (cf. the agglomeration schedule in Table 8.14), with the respective
distances always determined according to Eq. (8.7). Figure 8.8 shows the final result as a
dendrogram.
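If scipy is available, both procedures can be cross-checked directly from the condensed form of the distance matrix in Table 8.6. This is only an illustrative verification of the results derived above, not the book's SPSS workflow.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Squared Euclidean distances from Table 8.6.
D = np.array([
    [ 0,  6,  4, 56, 75],
    [ 6,  0,  6, 26, 41],
    [ 4,  6,  0, 44, 59],
    [56, 26, 44,  0, 11],
    [75, 41, 59, 11,  0],
], dtype=float)
y = squareform(D)   # condensed (vector) form of the distance matrix

# Column 2 of the linkage matrix holds the heterogeneity level of each fusion step.
print(linkage(y, method="single")[:, 2])    # [ 4.  6. 11. 26.] - cf. Table 8.14
print(linkage(y, method="complete")[:, 2])  # [ 4.  6. 11. 75.] - cf. Fig. 8.8
```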

8.2.3.2.3 Ward’s Method
Ward’s method is widely used in practice. It differs from the previous ones not only in
how the new distances are calculated, but also in its clustering approach. The distance
between the last cluster formed and the remaining groups is calculated as follows (see
Table 8.10):

D(R; P + Q) = \frac{1}{N_R + N_P + N_Q} \cdot \left\{ (N_R + N_P) \cdot D(R, P) + (N_R + N_Q) \cdot D(R, Q) - N_R \cdot D(P, Q) \right\}        (8.8)

According to Ward’s method, it is not the groups with the smallest distance that are combined,
but rather those objects (groups) whose combination increases the variance within the
resulting group the least. The variance s_g^2 of a group g is calculated as follows:

s_g^2 = \sum_{i=1}^{N_g} \sum_{j=1}^{J} \left( x_{ijg} - \bar{x}_{jg} \right)^2        (8.9)

with
x_{ijg}   observed value of variable j (j = 1, …, J) for object i (i = 1, …, N_g) in group g
\bar{x}_{jg}   mean value of variable j in group g, i.e. \bar{x}_{jg} = \frac{1}{N_g} \sum_{i=1}^{N_g} x_{ijg}

Equation (8.9) is also called variance criterion or error sum of squares. If the Ward procedure
is based on the squared Euclidean distance as a proximity measure, in the first
step we can use the squared Euclidean distances in Table 8.6 for the five-product example.
Accordingly, the error sum of squares has a value of zero in the first step. This means

that each object forms an “independent group” and no variance occurs in the variable
values of the objects. The target criterion for the Ward procedure for grouping objects
(groups) is:
“Fuse those objects (groups) that increase the error sum of squares the least.”
This error sum of squares is the heterogeneity measure in the Ward procedure. It
can be shown that, when two objects (groups) are merged, the values of the distance
matrix in Table 8.6 (squared Euclidean distances) and the distances calculated with the
help of Eq. (8.8) correspond exactly to twice the resulting increase in the error sum of
squares according to Eq. (8.9).
In the first stage, Ward’s method also combines the two objects with the smallest
squared Euclidean distance. In our application example, these are the products ‘Biscuit’
and ‘Nougat’, which have a squared Euclidean distance of 4 (see Table 8.6). Taking
into account that the mean value of the variable ‘Price’ is (1 + 3)/2 = 2, the error sum of
squares in this case is 2.

s²(Nougat, Biscuit) = (1 − 2)² + (3 − 2)² = 2

The variables ‘Bitter’ and ‘Refreshing’ have identical values for both products and therefore
contribute nothing to this sum. The result is exactly half the squared Euclidean distance.
In the second stage, the distances between the group ‘Nougat, Biscuit’ and the
remaining objects have to be determined according to Eq. (8.8). Again, the squared
Euclidean distances in Table 8.6 are used for calculation:

D(Nut; Nougat + Biscuit) = 1/3 · {(1 + 1) · 6 + (1 + 1) · 6 − 1 · 4} = 6.667
D(Cappuccino; Nougat + Biscuit) = 1/3 · {(1 + 1) · 56 + (1 + 1) · 44 − 1 · 4} = 65.333
D(Espresso; Nougat + Biscuit) = 1/3 · {(1 + 1) · 75 + (1 + 1) · 59 − 1 · 4} = 88.000
The result is the reduced distance matrix in Table 8.16, which also shows the double
increase in the error sum of squares when two objects (groups) are merged.
The double increase of the error sum of squares is smallest with the addition of ‘Nut’
to the group ‘Biscuit, Nougat’. In this case, the error sum of squares is only increased by
1/2 of 6.667 = 3.333. After this step, the total error sum of squares is:

Sg2 = 2 + 3.333 = 5.333

where the value 2 represents the increase in the error sum of squares of the first step.
Upon completion of this fusion, the products ‘Biscuit, Nougat, Nut’ are in one group,
and the error sum of squares is 5.333.
In stage 3, the distances between the group ‘Biscuit, Nougat, Nut’ and the remaining
products have to be determined. For this purpose, we use Eq. (8.8) and the results of the
first run presented in Table 8.16:

Table 8.16 Reduced distance matrix after the second stage of the Ward process

              Biscuit, Nougat   Nut   Cappuccino
Nut               6.667
Cappuccino       65.333          26
Espresso         88.000          41       11

D(Cappuccino; Biscuit + Nougat + Nut) = 1/4 · {(1 + 2) · 65.333 + (1 + 1) · 26 − 1 · 6.667} = 60.333
D(Espresso; Biscuit + Nougat + Nut) = 1/4 · {(1 + 2) · 88.000 + (1 + 1) · 41 − 1 · 6.667} = 84.833

The result of this step is shown in Table 8.17. It becomes clear that the double increase
in the error sum of squares is smallest when we combine the objects ‘Cappuccino’ and
‘Espresso’ in this step. In this case, the error sum of squares only increases by 1/2 of
11 = 5.5, i.e. the result after merging is as follows:

sg2 = 5.333 + 5.5 = 10.833

The value 10.833 reflects the amount of the error sum of squares after completion of the
third step. According to Eq. (8.9), the total value is correctly split into the following two
individual values: s2 (Biscuit, Nougat, Nut) = 5.333 and s2 (Cappuccino, Espresso) = 5.5.
In the last step, the groups ‘Biscuit, Nougat, Nut’ and ‘Cappuccino, Espresso’ are
merged, leading to a double increase of the error sum of squares:
D(Biscuit, Nougat, Nut; Cappuccino, Espresso) = 1/5 · {(3 + 1) · 84.833 + (3 + 1) · 60.333 − 3 · 11} = 109.533

After this step, all objects are merged into one cluster, with the variance criterion
increasing by ½ of 109.533 = 54.767. The total error sum of squares in the final state is
therefore 10.833 + 54.767 = 65.6.
The clustering process according to Ward’s method can also be summarized by a
dendrogram, with the error sum of squares (variance criterion) listed after each stage of
fusion (Fig. 8.9).

Table 8.17  Matrix of double increases in heterogeneity after the second clustering step of the
Ward process
Biscuit, Nougat, Nut Cappuccino
Cappuccino 60.333
Espresso 84.833 11
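The distance updates of Ward's method can be retraced with a small helper implementing Eq. (8.8). The following Python sketch is illustrative only (the function name `ward_update` is our own); it reproduces the values calculated above from the squared Euclidean distances in Table 8.6.

```python
def ward_update(d_RP, d_RQ, d_PQ, nR, nP, nQ):
    """Ward distance update according to Eq. (8.8)."""
    return ((nR + nP) * d_RP + (nR + nQ) * d_RQ - nR * d_PQ) / (nR + nP + nQ)

# Distances of the remaining objects R to the new cluster P + Q = 'Biscuit, Nougat' (nP = nQ = 1).
print(ward_update(6, 6, 4, nR=1, nP=1, nQ=1))     # ~6.667  (Nut)
print(ward_update(56, 44, 4, nR=1, nP=1, nQ=1))   # ~65.333 (Cappuccino)
print(ward_update(75, 59, 4, nR=1, nP=1, nQ=1))   # 88.0    (Espresso)

# Distances to the cluster 'Biscuit, Nougat, Nut' (P = 'Biscuit, Nougat' with nP = 2, Q = 'Nut').
print(ward_update(65.333, 26, 6.667, nR=1, nP=2, nQ=1))   # ~60.333 (Cappuccino)
print(ward_update(88.0, 41, 6.667, nR=1, nP=2, nQ=1))     # ~84.833 (Espresso)
```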

8.2.3.3 Clustering Properties of Selected Clustering Methods


The clustering methods considered so far can be generally characterized with regard to
their clustering attributes as follows (cf. Lance & Williams, 1966, pp. 373 ff.):

• Dilating procedures tend to group the objects into individual groups of approximately
equal size.
• Contracting algorithms tend to form a few large groups with many small ones “left
over”. Contracting algorithms are thus especially suitable for identifying outliers in an
object space.
• Conservative procedures show no tendency to dilate or contract.

In addition, an evaluation can be made according to whether a process tends to form


chains or not. Chain formation means that individual objects are successively added, one
at a time, to an existing large group.
According to the criteria mentioned above, agglomerative cluster processes can be
characterized as shown in Table 8.18.
When comparing the clustering methods considered in Sect. 8.2.3.2, the following
general differences may be highlighted:

1. Since the single-linkage method tends to form many small and a few large groups
(contractive method), it forms a good basis for the identification of outliers in a set
of objects. However, the disadvantage of this procedure is that it tends to form chains
(because of the large groups), which means that poorly separated groups are not
detected.

Table 8.18  Characterization of agglomerative cluster processes

Method                                  Attributes      Proximity measure    Comment
Between-groups (average linkage)        Conservative    All measures         –
Within-groups (average linkage)         Conservative    All measures         –
Nearest neighbor (single linkage)       Contractive     All measures         Tends to form chains
Furthest neighbor (complete linkage)    Dilating        All measures         Tends to form small groups
Centroid clustering                     Conservative    Distance measure     –
Median clustering                       Conservative    Distance measure     –
Ward’s method                           Conservative    Distance measure     Tends to form equal-sized groups

2. While the fusion processes in the application example are identical for the single- and
complete-linkage procedures, the complete-linkage procedure tends to form smaller
groups. This is due to the fact that the largest value of the individual distances is used
as the new distance. Therefore, the complete-linkage method is not suitable for detect-
ing outliers in a population of objects. Instead, these lead to a distortion of the group-
ing in the complete-linkage process and should therefore be eliminated beforehand
(e.g. with the help of the single-linkage process).
3. Compared to other algorithms, Ward’s method mostly finds very good partitionings
and usually correctly assigns the elements to groups. Ward’s method is a very good
clustering algorithm, particularly in the following cases (Milligan, 1980; Punj &
Stewart, 1983):
– The use of a distance measure is a useful criterion (in terms of content) for deter-
mining similarity.
– All variables are metric.
– No outliers are contained in a set of objects or were previously eliminated.
– The variables are uncorrelated.
– The number of elements in each group is approximately the same.
– The groups have approximately the same extension.

The latter three conditions relate to the applicability of the variance criterion used in the
Ward procedure. However, the Ward process tends to build clusters of equal size and is
not able to detect elongated groups or groups with a small number of elements.

8.2.3.4 Illustration of the Clustering Properties with an Extended Example
The central clustering properties of the methods “single linkage”, “complete linkage”
and “Ward’s process” are illustrated below using an extended example with 56 cases.
The different clustering processes of the three methods are illustrated with the help of
dendrograms generated by SPSS. Since the final heterogeneity measures (i.e. once all
objects are in one cluster) can take on very different values, the heterogeneity develop-
ment in the SPSS dendrograms is always normalized to a scale from 0 to 25. The end of
the clustering process is therefore always associated with a value of 25.11

Extended example with 56 cases:


In this additional example, 56 cases are considered, each described by two variables
as illustrated in Fig. 8.10. The example data were selected in such a way that three
groups can be visually identified: group A with 15 cases, group B with 20 cases and
group C with 15 cases. In addition, there are six outliers, each marked by a star. ◄

11 For the extended example, the dendrograms were created using the procedure CLUSTER in SPSS (see Sect. 8.3.2).

Fig. 8.10 Example data for illustrating the clustering properties (scatter plot of the 56 cases on two variables; the groups A, B and C as well as the six outliers are marked)

If we apply the single-linkage method to the example data in Fig. 8.10, the corresponding
dendrogram in Fig. 8.11 shows that the method tends to form chains. While the objects
of the three different groups are merged more or less on the same level, the objects
marked as outliers are only fused at the end of the process. This demonstrates that the
single-linkage method is particularly suitable for detecting outliers in a set of objects.
If, on the other hand, the complete-linkage and Ward methods are applied to the
data, they lead to clearly different clustering processes. Figure 8.12 shows that the com-
plete-linkage process does recognize a three-cluster solution, but only group C is defi-
nitely isolated, while group B is only partially separated and the majority of the elements
in B are combined with the objects in group A. The complete-linkage method is there-
fore not able to reproduce the “true grouping” (as given by the example data) according
to Fig. 8.10.
The dendrogram of Ward’s method in Fig. 8.13 shows that a four-cluster solution is
formed at a normalized heterogeneity measure of about 4 and a three-cluster solution
at a heterogeneity measure of about 7. The two-cluster solution only emerges at a clus-
ter distance of approx. 8. Ward’s method is thus able to reproduce the “true grouping”
according to the example data in Fig. 8.10. In the three-cluster solution the six outliers
are distributed over the three groups.
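These clustering properties can also be explored outside SPSS. The following Python sketch is not part of the book’s SPSS workflow and uses synthetic data that only loosely mimic Fig. 8.10; it compares the three-cluster solutions of single linkage, complete linkage and Ward’s method with SciPy:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# three compact groups (A, B, C) and six outliers, loosely in the spirit of Fig. 8.10
A = rng.normal(loc=[-4.0,  2.0], scale=0.5, size=(15, 2))
B = rng.normal(loc=[ 0.0,  1.0], scale=0.6, size=(20, 2))
C = rng.normal(loc=[-2.0, -2.5], scale=0.5, size=(15, 2))
outliers = np.array([[-6, 6], [3, 6], [-6, -6], [3, -6], [5, 0], [-7, 0]])
X = np.vstack([A, B, C, outliers])

for method in ["single", "complete", "ward"]:
    Z = linkage(X, method=method)                     # hierarchical agglomerative clustering
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram at 3 clusters
    sizes = np.bincount(labels)[1:]
    # single linkage typically isolates the outliers, i.e. produces very unequal cluster sizes
    print(f"{method:>8}: cluster sizes {sorted(sizes, reverse=True)}")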

Fig. 8.11 Dendrogram of the single-linkage method for the extended example

Fig. 8.12 Dendrogram of the complete-linkage method for the extended example

Fig. 8.13 Dendrogram of Ward’s method for the extended example (the 2-, 3- and 4-cluster solutions are marked)

Fig. 8.14 Agglomeration schedule of the Ward method for the extended example

For Ward’s method, the fusion process illustrated by the dendrogram in Fig. 8.13 is also represented by an agglomeration schedule (Fig. 8.14).12 We can see that only single objects are merged in the first five stages (all numbers are “0” in the column “Stage Cluster First Appears”), while in stages 50 and 53 to 55, the clusters formed on previous levels are merged. We can also see that the outlier objects 53 and 54 (identifiers 0) are merged at stages 50 and 52. The column “Coefficients” lists the error sum of squares (variance criterion) after each fusion stage, which is used by Ward’s method as a measure of heterogeneity in the fusion process.

8.2.4 Determination of the Number of Clusters

1 Selection of cluster variables

2 Determination of similarities or distances

3 Selection of the clustering method

4 Determination of the number of clusters

5 Interpretation of a cluster solution

12 The agglomeration schedule was also created using the procedure CLUSTER in SPSS.

In the previous sections we described different methods for merging individual objects
into groups. In this process, all agglomerative processes start with the finest partitioning
(all objects form separate clusters) and end with grouping all objects into one large clus-
ter. In a third step, we now need to decide which number of groups represents the “best”
solution. Since the user does not know the optimal number of clusters beforehand, this
number needs to be determined on the basis of statistical criteria. Usually, the user does
not have any factually justifiable ideas about the grouping of the objects of investigation
and therefore tries to uncover a grouping inherent in the data with the help of the cluster
analysis. Against this background, the determination of the number of clusters should
also be oriented towards statistical criteria and not be justified factually (with regard to
the cases assigned to the groups).
In deciding on the number of clusters, there is always a trade-off between the “homo-
geneity” and the “manageability” of a clustering solution. Factual logic considerations
can also be used to resolve this conflict, but these should only relate to the number of
clusters to be chosen and not to the cases grouped together in the clusters.
The following sections describe different options for determining the number of
clusters.13

8.2.4.1 Analysis of the Scree Plot and Elbow Criterion


A first clue in the determination of the cluster number is provided by the identification
of a “leap” (elbow) in the values of the heterogeneity measure during the fusion pro-
cess. For this purpose, the development of the heterogeneity measure as a function of
the number of clusters shown in the agglomeration schedule can be plotted in a dia-
gram (so-called scree plot). For the extended example, Fig. 8.15 shows the result for
Ward’s method based on the agglomeration schedule in Fig. 8.14. If the scree plot shows an
“elbow” in the development of the heterogeneity measure, this may be used as a decision
criterion for the number of clusters to be selected.
This criterion is therefore called the elbow criterion. Figure 8.15 shows the scree plot resulting for the 56 cases of the example data, with a four-cluster solution recommended by the elbow criterion. Since the elbow criterion provides visual support for the cluster decision, the one-cluster solution should not be included when constructing the corresponding diagram. The reason for this recommendation is that the transition from the two-cluster to the one-cluster solution always involves the largest “jump” in heterogeneity, and when it is included, an elbow emerges in almost all use cases.
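If the coefficient values of the agglomeration schedule are exported (e.g. to a spreadsheet or a text file), the scree plot can also be drawn with a few lines of Python. The coefficients below are hypothetical placeholders; in an actual analysis they are taken from the column ‘Coefficients’ of the agglomeration schedule:

import matplotlib.pyplot as plt

# hypothetical error sums of squares from an agglomeration schedule (stages 1, ..., n-1)
coefficients = [1.0, 2.1, 3.5, 5.2, 7.4, 10.3, 15.0, 21.0, 30.0, 80.0]
n_objects = len(coefficients) + 1

# stage i of the schedule corresponds to a solution with n_objects - i clusters;
# following the recommendation above, the one-cluster point (last stage) may be omitted
n_clusters = [n_objects - (i + 1) for i in range(len(coefficients))]

plt.plot(n_clusters, coefficients, marker="o")
plt.gca().invert_xaxis()                  # read from many clusters down to few
plt.xlabel("Number of clusters")
plt.ylabel("Error sum of squares")
plt.title("Scree plot for the elbow criterion")
plt.show()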

13 Since there are no criteria available in SPSS for determining the optimal number of clusters, it is recommended to use alternative programs such as S-Plus, R or SAS and the cubic clustering criterion (CCC) if available.


Fig. 8.15 Scree plot for determining the number of clusters according to the elbow criterion

8.2.4.2 Cluster Stopping Rules


The elbow criterion depends somewhat on the subjective assessment of the user.
Therefore, a large number of statistical criteria (so-called stopping rules) have been
developed to provide objective rules for determining the optimal number of clusters in
hierarchical cluster analysis. Milligan and Cooper (1985, p. 163), for example, tested a
total of 30 stopping rules as part of an extensive simulation study. The authors speci-
fied different clustering solutions (with 2 to 5 clusters). Afterward, they tested to which
extent the individual methods were able to identify the “true” group number when using
single, complete, and average linkage as well as Ward’s method.

Rule of Calinski/Harabasz
Calinski/Harabasz's criterion was identified as the best stopping rule, as it was able to
reveal the “true group structure” in over 90% of the cases examined.
“True” here means that the grouping carried out in a simulation study could be
uncovered.
The criterion according to Calinski and Harabasz (1974), which is suitable for met-
ric attributes, considers the ratio of the deviation between groups (SSb) and the devia-
tion within a group (SSw) as a test statistic, in analogy to the analysis of variance (cf.
Chap. 3). This test statistic is calculated for all clustering solutions. If its value decreases
monotonically with an increasing number of groups, this indicates that there is no group
structure in the data set. If, however, the test statistic increases with the number of clus-
ters, a hierarchically structured data set can be assumed. The cluster number k at which
the test statistic reaches a maximum is taken as the number of groups that exist within a
data set.
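Outside SPSS, the Calinski/Harabasz statistic is readily available, for example in the Python library scikit-learn. The following sketch assumes that scikit-learn and SciPy are installed and uses synthetic data; it evaluates the statistic for hierarchical (Ward) solutions with k = 2 to 6 clusters and selects the k with the maximum value:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
# synthetic data with three groups, only to demonstrate the stopping rule
X = np.vstack([rng.normal(m, 0.4, size=(30, 2)) for m in ([0, 0], [3, 3], [0, 4])])

Z = linkage(X, method="ward")
scores = {}
for k in range(2, 7):
    labels = fcluster(Z, t=k, criterion="maxclust")
    scores[k] = calinski_harabasz_score(X, labels)   # ratio of between- to within-group deviation

best_k = max(scores, key=scores.get)
print(scores)
print("Number of clusters suggested by Calinski/Harabasz:", best_k)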

Test of Mojena
The test of Mojena was also identified as one of the ten best methods for determining
the number of clusters. In the following, we will have a look at this test, because it can
be easily carried out with the help of a spreadsheet. As a starting point, this test uses the
standardized clustering coefficients (α̃) per clustering step, which are calculated as fol-
lows (Mojena, 1977, p. 359):14

ᾱ = 1/(n−1) · Σ(i=1…n−1) αi;    sα = [1/(n−2) · Σ(i=1…n−1) (αi − ᾱ)²]^0.5;    α̃i = (αi − ᾱ)/sα        (8.10)
The optimum number of clusters is indicated by the group number for which a given
threshold value of the standardized clustering coefficient is exceeded for the first time.
The literature mentions different requirements for defining this threshold value. In his
simulation study, Mojena (1977) achieved the best results with a threshold value of 2.75.
The studies of Milligan and Cooper (1985) mention a value of 1.25, with the quality of
the result only slightly varying for threshold values between 1 and 2. However, the opti-
mal parameter strongly depends on the data structure. We recommend selecting a value
between 1.8 and 2.7, since this seems well suited for most data sets, based on various
studies carried out in the past.
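Instead of a spreadsheet, the test of Mojena can also be scripted. The following Python sketch implements Eq. (8.10) directly; the fusion coefficients are hypothetical placeholders, and the threshold of 2.0 lies within the range recommended above:

from statistics import mean, stdev

def mojena_standardized(coefficients):
    # Standardized fusion coefficients according to Eq. (8.10); with m = n - 1
    # coefficients, the 1/(n-2) term corresponds to the sample standard deviation.
    a_bar = mean(coefficients)
    s_a = stdev(coefficients)
    return [(a - a_bar) / s_a for a in coefficients]

# hypothetical fusion coefficients from an agglomeration schedule (stages 1, ..., n-1)
coeffs = [0.8, 1.8, 3.2, 4.9, 7.0, 10.4, 14.0, 18.1, 22.4, 58.1]
n_objects = len(coeffs) + 1
threshold = 2.0

std_coeffs = mojena_standardized(coeffs)
flagged = [stage for stage, a in enumerate(std_coeffs, start=1) if a > threshold]
print("standardized coefficients:", [round(a, 3) for a in std_coeffs])
if flagged:
    first = flagged[0]
    # following the rule described above, the first stage exceeding the threshold
    # indicates the cluster number of that stage (here: n_objects - first clusters)
    print("first stage exceeding the threshold:", first,
          "-> suggested number of clusters:", n_objects - first)
else:
    print("no stage exceeds the threshold")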

Optimizing a clustering solution with k-means


Once a clustering solution has been found, we can check whether this solution may be
improved by moving objects between the clusters. Improvement here means that the
within-group deviation should be as small as possible and the between-group deviation
as large as possible. K-means cluster analysis (KM-CA) is often used for this purpose.
In KM-CA, a number of k clusters (partitions) is formed, to which the set of data
points or cases (objects, persons) is then assigned on the basis of the variables under con-
sideration. If k-means is used as an optimization procedure for a clustering solution that
was found with a hierarchical agglomerative method, the number of clusters found and
the corresponding objects are first specified for KM-CA. KM-CA then checks whether
this clustering solution may be improved by exchanging objects between clusters or by
forming new clusters. KM-CA uses the variance criterion (see Eq. 8.9) as the target crite-
rion for distributing a set of objects (X) among the clusters.15
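A common way to implement this refinement in Python is to seed k-means with the centroids of the hierarchical solution, so that k-means only reallocates objects between the given clusters. The following sketch assumes scikit-learn and SciPy and uses synthetic data for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(m, 0.6, size=(25, 2)) for m in ([0, 0], [4, 0], [2, 3])])

# Step 1: hierarchical (Ward) solution with k = 3
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=3, criterion="maxclust")

# Step 2: use the centroids of the hierarchical clusters as the starting partition
# for k-means, which then tries to further reduce the within-cluster variance
centroids = np.array([X[hier_labels == g].mean(axis=0) for g in np.unique(hier_labels)])
km = KMeans(n_clusters=3, init=centroids, n_init=1, random_state=0).fit(X)

# the label correspondence relies on the order of the initial centroids (cluster j starts
# at centroids[j], i.e. hierarchical cluster j + 1)
changed = np.sum(km.labels_ + 1 != hier_labels)
print("objects reassigned by k-means:", changed)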

14 For a brief summary of the basics of statistical testing, see Sect. 1.3.
15 In addition to KM-CA, two-step cluster analysis may also be used to optimize a clustering solution found by another procedure. Both methods belong to the partitioning clustering methods described in detail in Sect. 8.4.2.

8.2.4.3 Evaluation of the Robustness and Quality of a Clustering Solution
After deciding on the number of clusters to be used (clustering solution), the final ques-
tion is how stable (robust) this solution is. In general, “robustness” can be defined as the
insensitivity of statistical results to the violation of theoretical assumptions in modelling,
methodological assumptions (e.g. assumptions of statistical tests; premises of an analy-
sis), the influence of outliers, etc. Since cluster analysis is an exploratory analysis with
the goal to discover structures in a data set, there are no objective criteria that can be
used for robustness testing. Instead, the following procedures should be followed.
First, outliers in a data set should be eliminated, since they influence the results of a
cluster analysis in a special way.16 Particularly when using hierarchical clustering meth-
ods, outliers can cause “chaining effects”. The single-linkage method (cf. Sect. 8.2.3.2.1)
is well suited for identifying outliers.
In a second step, the sensitivity should be examined by comparing the results of dif-
ferent clustering methods. Care should be taken to ensure that only methods of the same
category (e.g. centroid clustering, median clustering and Ward's method) are compared
since alternative clustering methods consider different fusion properties (cf. Table 8.18)
and will lead to a different structuring of the data. If the results of different clustering
methods do not differ at all or only slightly, a certain robustness of the solution can be
concluded. The so-called split-half method is also commonly used. Here, the data are
randomly divided into two samples. Then a cluster analysis is performed for each group,
using the same clustering method. If the same or similar cluster structures are found in
both samples, this indicates robustness of this clustering solution.17
Finally, a discriminant analysis (cf. Chap. 4) can also be used to assess the quality
of a cluster analysis. In this case, the clusters found through the cluster analysis form
the dependent, nominally scaled variable. The metrically scaled variables already used
for clustering can be used as the independent variables of the discriminant analysis. The
results of the discriminant analysis then allow statements on which variables the clus-
ters show differences in a particular way. A statement about the quality of a clustering
solution can be made by discriminant analysis, for example, by looking at the correctly
classified objects. If discriminant analysis is used in this way, it should be noted, how-
ever, that the quality is usually assessed as relatively high. The reason for this is that the
variables used for clustering and the independent variables of discriminant analysis are
identical.

16 On the problem of outliers, see also the comments in Sect. 1.5.1.


17 For more detailed considerations on the robustness of cluster analyses, see García-Escudero
et al. (2010, p. 89).

8.2.5 Interpretation of a Clustering Solution

1 Selection of cluster variables

2 Determination of similarities or distances

3 Selection of the clustering method

4 Determination of the number of clusters

5 Interpretation of a cluster solution

The interpretation of a clustering solution should be based on the values of the cluster
variables in the clusters identified. It is useful to make a comparison with the survey pop-
ulation by calculating the t-values and F-values. Additionally, a discriminant analysis
may be used for analyzing the differences between the clusters found (cf. Chap. 4).

Calculation of t-values
Similarly to testing for differences in mean values, t-values for each variable j in each
cluster g may be calculated as follows:
tgj = (x̄gj − x̄j) / sj        (8.11)

with
x̄gj   average of variable j in cluster g (g = 1, …, G)
x̄j    average of variable j in the survey population (j = 1, …, J)
sj    standard deviation of variable j in the survey population

The t-values represent normalized values, with

• negative t-values indicating that a variable is underrepresented in the considered clus-


ter compared to the survey population;
• positive t-values indicating that a variable is overrepresented in the considered cluster
compared to the survey population.

Thus, these values do not actually assess the quality of a clustering solution, but can be
used to characterize the clusters.

Calculation of F-values
To assess the homogeneity of a cluster in comparison to the survey population, F-values
for each variable j in each group g can also be calculated, analogous to the F-test:18

18 For a brief summary of the basics of statistical testing, see Sect. 1.3.

Fgj = s²gj / s²j        (8.12)

with
s²gj   variance of variable j in cluster g (g = 1, …, G)
s²j    variance of variable j in the survey population (j = 1, …, J)

The lower the F-value, the smaller the dispersion of this variable within the cluster com-
pared to the survey population. The F value should not exceed 1, because otherwise the
corresponding variable has a greater variation within the cluster than in the survey pop-
ulation. The calculations of t- and F-values are shown in detail in Sect. 8.3.3.3 for the
chocolate case study.
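These calculations can also be automated. The following Python sketch is an illustration only: the data and column names are hypothetical, and the use of population variances (ddof = 0) is an assumption, since the text does not prescribe a particular variance estimator. It computes the t- and F-values per cluster and variable with pandas:

import pandas as pd

def t_and_f_values(df, cluster_col, variables):
    # t-values (Eq. 8.11) and F-values (Eq. 8.12) per cluster and variable
    overall_mean = df[variables].mean()
    overall_std = df[variables].std(ddof=0)     # survey population treated as a whole
    overall_var = df[variables].var(ddof=0)

    t_vals = df.groupby(cluster_col)[variables].mean().sub(overall_mean).div(overall_std)
    f_vals = df.groupby(cluster_col)[variables].var(ddof=0).div(overall_var)
    return t_vals, f_vals

# hypothetical usage: 'cluster' holds the cluster membership, the remaining
# columns hold the (metric) cluster variables
data = pd.DataFrame({
    "cluster": [1, 1, 1, 2, 2, 2],
    "price":   [5.0, 5.5, 4.8, 3.5, 3.7, 3.4],
    "fruity":  [3.6, 3.8, 4.1, 5.2, 5.4, 5.0],
})
t_vals, f_vals = t_and_f_values(data, "cluster", ["price", "fruity"])
print(t_vals.round(3))
print(f_vals.round(3))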

Discriminant analysis
A discriminant analysis (see Chap. 4) can also be used to characterize a clustering solu-
tion. In this case, the clusters found by the cluster analysis form the dependent, nomi-
nally scaled variable. The metrically scaled variables used for clustering can be used as
the independent variables of the discriminant analysis. In this way, it is possible to deter-
mine, for example, which variable is particularly responsible for the separation between
the clusters. In addition, other variables that the user considers useful can also be used in
discriminant analysis. In this way, it is possible to investigate differences with regard to
other variables for the groups identified in the cluster analysis.
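If the cluster memberships and the clustering variables are available in Python, such a discriminant analysis can be sketched with scikit-learn. The data below are synthetic and only illustrate the workflow; as noted in Sect. 8.2.4.3, the in-sample hit rate tends to be high because the same variables are used for clustering and discrimination:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
# synthetic clustering variables and (already determined) cluster labels
X = np.vstack([rng.normal(m, 0.7, size=(30, 3)) for m in ([0, 0, 0], [2, 2, 0], [0, 3, 3])])
labels = np.repeat([1, 2, 3], 30)

lda = LinearDiscriminantAnalysis().fit(X, labels)
print("share of correctly classified objects:", lda.score(X, labels))
print("discriminant coefficients per variable:")
print(lda.coef_)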

8.2.6 Recommendations for Hierarchical Agglomerative Cluster Analyses

From the above considerations, the following four-step procedure may be recommended
for conducting a hierarchical, agglomerative cluster analysis:

1. Apply the single-linkage method (nearest neighbor) to identify outliers.


2. Eliminate the outliers and subsequently apply a further agglomerative process (e.g.
Ward’s method) to the reduced data set. The agglomerative process must be selected
against the background of the respective application situation and the clustering prop-
erties of the clustering methodology.
3. Optimize the clustering solution found in step 2 by using the k-means method.
4. Assess the robustness of the clustering solution and interpret the results.

Further recommendations for conducting a cluster analysis can be found in Sect. 8.5.

8.3 Case Study

8.3.1 Problem Definition

We now use a larger sample to demonstrate how to conduct a cluster analysis with the
help of SPSS.
A manager of a chocolate company wants to know how consumers evaluate different
chocolate flavors with respect to subjectively perceived attributes. For this purpose, the
manager has identified 11 flavors and has selected 10 attributes that appear to be relevant
for the evaluation of these flavors.
A small pretest with 18 test persons was carried out. The persons were asked to eval-
uate the 11 flavors (chocolate types) with regard to the 10 attributes (see Table 8.19).19 A
seven-point rating scale (1 = low, 7 = high) was used for each attribute. Thus, the varia-
bles represent perceived attributes of the chocolate types.
However, not all persons were able to evaluate all 11 flavors. Thus, the data set con-
tains only 127 evaluations instead of the complete number of 198 evaluations (18 per-
sons × 11 flavors). Any evaluation comprises the scale values of the 10 attributes for a
certain flavor as given by a respondent. Thus, it reflects the subjective assessment of a
specific chocolate flavor by a particular test person. Since each test person assessed more
than just one flavor, the observations are not independent. Yet, for simplicity’s sake, we
will treat the observations as such in the following.
Of the 127 evaluations, only 116 are complete, while 11 evaluations contain miss-
ing values.20 We exclude all incomplete evaluations from the analysis. Consequently, the
number of cases is reduced to 116.
The manager of the chocolate company now wants to know which chocolate varieties
are rated similarly in terms of their attributes by his customers. To answer this question,
first the average ratings of the 18 test persons regarding the attributes of the 11 choc-
olate flavors are determined. The mean values are calculated with the SPSS procedure
‘Means’, which is called up via the following menu sequence: ‘Analyze/Compare Means/
Means’. The input matrix for the cluster analysis is thus an 11 × 10 matrix, with the
11 chocolate flavors as cases and the 10 averaged attribute assessments as variables. It
should be noted that averaging per chocolate means that the information about the varia-
tions in assessment between individuals is lost.21
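The same aggregation can also be carried out outside SPSS, for example with the Python library pandas. The following sketch is only an illustration: the column names follow the case study, while the file name is hypothetical:

import pandas as pd

# individual evaluations: one row per (person, flavor) with the 10 attribute ratings;
# the file name 'chocolate_ratings.csv' is hypothetical
ratings = pd.read_csv("chocolate_ratings.csv")

attributes = ["price", "refreshing", "delicious", "healthy", "bitter",
              "light", "crunchy", "exotic", "sweet", "fruity"]

# drop incomplete evaluations and average the ratings per chocolate flavor
complete = ratings.dropna(subset=attributes)
mean_matrix = complete.groupby("type")[attributes].mean()   # 11 x 10 input matrix
print(mean_matrix.round(3))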

19 Supplementary material (e.g. Excel files) is made available on the website www.multivariate.de, with the help of which readers can deepen their understanding of cluster analysis.
20 Missing values are a frequent and unfortunately unavoidable problem when conducting surveys

(e.g. because people cannot or do not want to answer the question, or as a result of mistakes by the
interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.
21 The mean values were calculated on the basis of the data set that was also used in the case study

of discriminant analysis (Chap. 4), logistic regression (Chap. 5) and factor analysis (Chap. 7). Using
the same case study allows us to illustrate the similarities and differences between the methods.

Table 8.19  Chocolate flavors and attributes examined in the case study

Chocolate flavor       Perceived attributes
1  Milk                1  Price
2  Espresso            2  Refreshing
3  Biscuit             3  Delicious
4  Orange              4  Healthy
5  Strawberry          5  Bitter
6  Mango               6  Light
7  Cappuccino          7  Crunchy
8  Mousse              8  Exotic
9  Caramel             9  Sweet
10 Nougat              10 Fruity
11 Nut

To identify groups with similarly perceived chocolate flavors (market segments), the
manager conducts a cluster analysis. Based on the recommendations in Sect. 8.2.3.3, he
uses Ward’s method as a fusion algorithm and squared Euclidean distances as a proxim-
ity measure.

8.3.2 Conducting a Cluster Analysis with SPSS

To conduct a hierarchical cluster analysis with SPSS, go to ‘Analyze/Classify’ and


choose ‘Hierarchical Cluster’ (Fig. 8.16).
In the dialog box ‘Hierarchical Cluster Analysis’ that appears, select the 10 varia-
bles for describing the chocolate flavors and transfer them to the field ‘Variable(s)’ (see
Fig. 8.17). The variable ‘Type’ is used to describe the 11 types of chocolate and is trans-
ferred to the field ‘Label Cases by’. In addition, this dialog box (below the item ‘Cluster’)
offers the possibility to cluster cases (so-called Q-analysis) or variables (so-called
R-analysis). Here, only the Q-analysis is considered, i.e. the option ‘Cases’ has to be
selected. For clustering variables, factor analysis is typically used (see Chap. 7).
In the sub-menu ‘Statistics’, an assignment overview and the similarity matrix of the
11 objects based on the selected proximity measure can be requested. Furthermore, a
range of solutions may be specified (e.g. 2–5 cluster solutions) for which the respective
cluster membership of the objects is then output.
The sub-menu ‘Plots’ may be used to request graphical representations of the cluster-
ing process. We select the option ‘Dendrogram’ for this case study. The option ‘Icicle’
displays information about how cases are combined into clusters at each iteration of the
analysis. A vertical or horizontal plot can be selected (see Fig. 8.18).

Fig. 8.16 Data editor: how to select a ‘Hierarchical Cluster Analysis’

Fig. 8.17 Dialog box: Hierarchical Cluster Analysis



Fig. 8.18 Dialog boxes: Statistics (left) and Plots (right)

The sub-menu ‘Method’ is used to determine the clustering method (clustering algo-
rithm) and the proximity measure (‘Measure’). Various measures are available for met-
rically scaled (‘Interval’), nominally scaled (‘Counts’) and binary-coded (‘Binary’)
variables (cf. Fig. 8.19).
Overall, the proximity measures listed in Table 8.4 are available, comprising measures
for metric (interval-scaled) data, count data and binary data. Seven different clustering
methods can be selected from the drop-down list ‘Cluster Method’. In the case study,
the single-linkage method (nearest neighbor) is initially selected to identify outliers.
Then the 11 chocolate varieties are analyzed using Ward’s method and with the squared
Euclidean distance as a proximity measure. Once the settings have been entered in the
sub-menus, the ‘Continue’ button takes us back to the starting point of the ‘Hierarchical
Cluster Analysis’ and the analysis can be started by pressing ‘OK’.

8.3.3 Results

In the following, the results of the cluster analysis generated with SPSS are presented
first. Then criteria for determining the cluster number in the case study are shown. The
considerations conclude with a characterization of the cluster solution found, based on t- and F-values.

Fig. 8.19 Dialog box: Method

8.3.3.1 Outlier Analysis Using the Single-Linkage Method


In principle, it is advisable to examine the data to be analyzed with regard to outliers. As pointed out in Sect. 8.2.3.3, the single-linkage method is particularly suitable for this purpose. Figure 8.20 shows the dendrogram for the single-linkage method.
The dendrogram shows that in the last step ‘Mango’ is combined with the other objects. Due to the relatively low degree of heterogeneity and the overall low number of cases, ‘Mango’ is not declared an outlier and is included in the following analysis.

8.3.3.2 Clustering Process using Ward’s Method


The distance matrix used as the starting point of the Ward procedure is shown in Fig. 8.21. In the SPSS output, the matrix is called ‘Proximity Matrix’ and the footnote indicates “This is a dissimilarity matrix”.
In this matrix, the greatest dissimilarity exists between ‘Espresso’ and ‘Mango’, with
a squared Euclidean distance of 34.904. The smallest distance (1.544) can be observed
for ‘Caramel’ and ‘Nut’. The steps of the clustering process are reflected in the so-called
agglomeration schedule which is shown in Fig. 8.22.
The column “Stage” specifies the clustering steps. The number of clustering steps is
always equal to the number of objects minus 1. The column “Cluster combined” indi-
cates the number of objects or clusters merged in the respective step. In the column
“Coefficients” we can see the respective value of the heterogeneity measure (here:

Fig. 8.20 Dendrogram for the single-linkage method

variance criterion) which is calculated at the end of a clustering step. The objects or
clusters merged into a new cluster are assigned the number of the first object (cluster)
as identifier. The column “Stage Cluster First Appears” indicates the clustering step
in which the respective object (cluster) has been merged for the first time. The column
“Next Stage” indicates the next clustering in which the cluster will be used. For exam-
ple, in stage 7, cluster 1 (which was formed in stage 4) is combined with object 7 with a
heterogeneity measure of 13.969. The resulting cluster is assigned the identifier “1” and
used again in step 8. The value “0” in the column “Stage Cluster First Appears” indicates
that a single object is included in the clustering. Overall, it is clear that in the first four
fusion steps the chocolate varieties ‘Caramel’ (9), ‘Nut’ (11), ‘Nougat’ (10) and ‘Biscuit’
(3) are combined, with the error sum of squares after the fourth step being 4.906. This
means that the variance of the variable values in this group is still relatively small.
The dendrogram in Fig. 8.23 provides a graphical illustration of the clustering pro-
cess. In the dendrogram, SPSS normalizes the used heterogeneity measure to the interval
[0;25].

Fig. 8.21 Squared Euclidean distance matrix of the eleven chocolate types

Fig. 8.22 Agglomeration schedule of Ward’s method for the case study

Fig. 8.23 Dendrogram of cluster analysis using the Ward process



Determination of the number of clusters


The procedure ‘Hierarchical Cluster Analysis’ in SPSS does not provide any criteria for determining the number of clusters, so the user has to calculate them himself, e.g. with Excel.
The dendrogram in Fig. 8.23 and the development of the heterogeneity measure shown in the agglomeration schedule (Fig. 8.22) provide a first indication of how many clusters are to be used as the final solution. From clustering step 9 to clustering step 10 (i.e. from the two- to the single-cluster solution), the error sum of squares increases significantly (by 58.145 − 22.373 = 35.772). Therefore, a two-cluster solution seems reasonable. In order to validate the decision for a two-cluster solution, we apply the criteria discussed in Sect. 8.2.4 and consider the elbow criterion as well as the test of Mojena.

Elbow criterion
To use the elbow criterion, the error sum of squares is plotted against the corresponding
cluster number in a diagram (see also Sect. 8.2.4.1). Figure 8.24 shows the resulting dia-
gram for the case study based on the coefficient values in the agglomeration schedule
(see Fig. 8.22). The diagram shows a clear “elbow” in the two-cluster solution of the
case study. However, it should be noted that in most analyses an elbow is evident in the

Fig. 8.24 Development of the heterogeneity measure (error sum of squares) as a function of the number of clusters in the case study, with the elbow at the two-cluster solution



two-cluster solution, since the increase in heterogeneity is always greatest for the step
from the two- to the single-cluster solution. In practical applications, therefore, a second
elbow in the diagram is always necessary for an unequivocal decision. However, a sec-
ond elbow is not visible in this case.

Test of Mojena
For the implementation of the test of Mojena, we transfer the clustering coefficients from the column ‘Coefficients’ in Fig. 8.22 to a spreadsheet and calculate the standardized clustering coefficients (α̃) per clustering stage according to Eq. (8.10) (see Sect. 8.2.4.2). This leads to an average clustering coefficient of ᾱ = 14.066 and a standard deviation of the coefficients of sα = 18.096. The results are shown in Table 8.20. With 2 as the critical threshold value, a single-cluster solution is suggested (α̃10 = 2.436).
Both a two-cluster and a three-cluster solution will be discussed below. Figure 8.25
shows which object is located in which cluster. For our example, the cluster assign-
ments for the 2-, 3-, 4- and 5-cluster solution are given. In a two-cluster solution, the
types ‘Milk’, ‘Espresso’, ‘Biscuit’, ‘Cappuccino’, ‘Mousse’, ‘Caramel’, ‘Nougat’ and
‘Nut’ (classic cluster) are grouped together in a first cluster and the types ‘Strawberry’,
‘Mango’ and ‘Orange’ (fruit cluster) are grouped in a second cluster.
In order to compare the agglomerative processes, the case study was analyzed with
the “complete-linkage”, “average-linkage”, “centroid” and “median” processes. The
main difference in comparison to Ward’s method is that these methods do not con-
tain the development of the error sum of squares in the column “Coefficients” of the
agglomeration schedule (see Fig. 8.22), but the distances or similarities of the objects
or groups that were merged. However, all procedures lead to identical solutions in the
two-cluster case, i.e. a “fruit cluster” and a “classic cluster” are identified.

Optimization of the clustering solution using k-means


First of all, it should be emphasized that for a data set with 11 cases an optimization by
k-means makes little sense since it will not lead to any improvement. This is confirmed
by the fact that the k-means method does not lead to any change when applied to the
two-cluster solution in our case study (cf. cluster membership in Fig. 8.26).
However, k-means provides more information about the cluster variables used: The
cluster centers of the final (two-cluster) solution represent the means of the variables in
both clusters. It becomes clear that all variables (except the variables ‘price’, ‘delicious’

Table 8.20  Results of the test of Mojena for the case study

Clustering stage                        1        …    7        8        9        10
Number of clusters                      11       …    4        3        2        1
Clustering coefficient                  0.772    …    13.969   18.066   22.373   58.145
Standardized clustering coefficient     −0.735   …    −0.005   0.221    0.459    2.436

Fig. 8.25 Cluster membership for different solutions (2 to 5 clusters)

Fig. 8.26 Cluster membership and final cluster centers according to the k-means method

and ‘healthy’) are more pronounced in cluster 1 (fruit cluster) than in cluster 2 (clas-
sic chocolate flavors). Furthermore, the reported table of variance (ANOVA) shows the
squared deviations between the two groups per variable (see Fig. 8.27). The greater these
deviations are, the more a variable is responsible for the differences between the two
clusters. It can be seen that especially the variables ‘price’, ‘light’, ‘crunchy’, ‘exotic’
and 'fruity' show clear differences in the two clusters (cf. column “Sig.” in Fig. 8.27).

Fig. 8.27 ANOVA table of the k-means clustering method

The F-tests, however, should only be used for descriptive purposes, as the clusters have
been chosen to maximize the differences among cases in both clusters. The F-values and
the reported significance levels should not be interpreted here as statistical tests of the
hypothesis that the cluster means are equal. The above information can nevertheless be
used for the interpretation of the clusters as described in the following section.

8.3.3.3 Interpretation of the Two-Cluster-Solution


The cluster procedures implemented in SPSS do not provide any criteria for characterizing a cluster solution. However, clues for describing the clusters can be obtained
from the variable mean values per cluster from the K-Means analysis and the variance
table shown there (cf. Figs. 8.26 and 8.27).
Detailed information on the interpretation of an ANOVA table can be found in Chap. 3 (Analysis of Variance, cf. Sect. 3.3.3.1). To carry out a K-Means analysis with SPSS, see
Sect. 8.4.2.1.2.

Description with t- and F-values


To describe a clustering solution, the mean values and variances of the variables in the
groups should first be compared with those in the survey population. Figure 8.28 shows
the means and variances calculated for the survey population and the two groups when
using the menu sequence ‘Analyze/Tables/Custom Tables’.

Fig. 8.28 Mean values and variances of the assessments in the survey population (total) and the two clusters

Table 8.21  t- and F-values of the two-cluster solution in the case study

              t-value               F-value
              Classic    Fruit      Classic    Fruit
Price         0.514      −1.372     0.303      0.058
Refreshing    −0.307     0.818      0.281      2.636
Delicious     0.371      −0.989     0.849      0.014
Healthy       0.274      −0.730     0.164      3.327
Bitter        −0.290     0.773      0.929      0.515
Light         −0.452     1.205      0.543      0.106
Crunchy       −0.456     1.217      0.523      0.114
Exotic        −0.521     1.390      0.201      0.312
Sweet         −0.184     0.491      1.082      0.716
Fruity        −0.527     1.405      0.156      0.383

Using the results from Fig. 8.28, the t- and F-values for the two clusters can now be
calculated as described in Sect. 8.2.5. The results are shown in Table 8.21.
The calculation of t- and F-values for the case study is shown here for the variable
‘price’ in the fruit cluster.
The t-value is calculated according to Eq. (8.11) as follows:

t = (3.581 − 4.675) / √0.636 = −1.372

For the F-value, Eq. (8.12) gives the following result:

F = 0.037 / 0.636 = 0.058
Using the t- and F-values listed in Table 8.21, the two clusters can now be described as
follows:

• In the cluster ‘Classic’, the variables mostly have lower values than in the survey pop-
ulation (t-value < 0), while they have higher values in the cluster ‘Fruit’ (t-value > 0).
This means that in the cluster ‘Fruit’ these attributes are perceived as much stronger, whereas in the cluster ‘Classic’ they are perceived as significantly weaker. Only the values of the variables ‘price’, ‘delicious’ and ‘healthy’ are above
average in the ‘Classic’ cluster, whereas they are below average in the ‘Fruit’ cluster.
The largest differences in the mean values are found for the variables ‘price’, ‘exotic’
and ‘fruity’.
• With regard to the homogeneity of the variables in the two clusters, the F-values
predominantly indicate significantly lower variances than in the survey population
(F-value < 1). In the ‘Classic’ cluster only the variable ‘sweet’ has a slightly higher

Table 8.22  Definition of the three groups for discriminant analysis and logistic regression

Group (segment)           Chocolate flavors in the segment                 Cases (n)
g = 1 | Seg_1  Classic    Milk, Biscuit, Mousse, Caramel, Nougat, Nut      65
g = 2 | Seg_2  Fruit      Orange, Strawberry, Mango                        28
g = 3 | Seg_3  Coffee     Espresso, Cappuccino                             23

variance than in the survey population. In the ‘Fruit’ cluster, this applies to the var-
iables ‘refreshing’ and ‘healthy’. Overall, however, with regard to the homogeneity
in both clusters, it can be stated that the variables almost always have a significantly
lower variance than in the survey population.

Description of the two-cluster solution using discriminant analysis


A discriminant analysis with specification of the two clusters ‘Classic’ (8 types)
and ‘Fruit’ (3 types) confirms the correct classifications of the two-cluster solution.
Furthermore, stepwise discriminant analysis shows that the variables ‘fruity’, ‘healthy’,
‘exotic’, ‘delicious’ and ‘light’ are particularly strong in separating the flavors. This
means that the mean scores of these variables differ significantly in the two clusters—as
indicated by the t-values. Only the separating power of the variable ‘price’ appears to be
“absorbed” by the other variables since ‘price’ is not identified as strongly separating by
the stepwise discriminant analysis.
The case studies for discriminant analysis (Chap. 4) and multinomial logistic regres-
sion (Chap. 5) use the results of cluster analysis but assume three groups (segments)22
since this is helpful for clarifying the procedures of these two methods. For this reason,
the chocolate types ‘Espresso’ and ‘Cappuccino’ are moved from the cluster ‘Classic’ to
a separate third cluster called ‘Coffee’. A solution with three clusters can be logically
justified and is also supported by the dendrogram in Fig. 8.23. The affiliation of the 11
chocolate flavors to the three clusters is summarized in Table 8.22.

8.3.4 SPSS Commands

Above, we demonstrated how to use the graphical user interface (GUI) of SPSS to con-
duct a cluster analysis. Alternatively, we can use the SPSS syntax which is a program-
ming language unique to SPSS. Each option that is activated in SPSS’s GUI is translated

22 Multinomial logistic regression requires at least three groups. In case of a two-cluster solution a
binary logistic regression would have to be performed.

* MVA: Reduction of the data set with 127 cases.


* Defining Data.
DATA LIST FREE / price refreshing delicious healthy bitter light crunchy
exotic sweet fruity person type.

BEGIN DATA
3 3 5 4 1 2 3 1 3 4
6 6 5 2 2 5 2 1 6 7
2 3 3 3 2 3 5 1 3 2
-------------------
5 4 4 1 4 4 1 1 1 4
* Enter all data.
END DATA.

* Calculation of the means per chocolate flavor and output in own data
set.
DATASET DECLARE DATACluster.
AGGREGATE
/OUTFILE='DATACluster'
/BREAK=type
/price=MEAN(price)
/refreshing=MEAN(refreshing)
/delicious=MEAN(delicious)
/healthy=MEAN(healthy)
/bitter=MEAN(bitter)
/light=MEAN(light)
/crunchy=MEAN(crunchy)
/exotic=MEAN(exotic)
/sweet=MEAN(sweet)
/fruity=MEAN(fruity).

Fig. 8.29 SPSS syntax for calculating the means of the attribute assessments in the case study

into SPSS syntax. If you click on ‘Paste’ in the main dialog box shown in Fig. 8.17, a
new window opens with the corresponding SPSS syntax.
However, you can also use the SPSS syntax directly and write the commands your-
self. Using the SPSS syntax can be advantageous if you want to repeat an analysis mul-
tiple times (e.g., testing different model specifications). The syntax does not refer to an
existing data file of SPSS (*.sav); rather, we enter the data with the help of the syntax
editor (BEGIN DATA … END DATA).
Figure 8.29 shows the syntax file used to calculate the mean assessments for the
10 attributes of the 11 chocolate flavors from the 127 individual assessments. With
these mean values, a data matrix is generated which forms the basis for carrying out
a cluster analysis. Figure 8.30 shows the SPSS syntax for running various clustering
methods.
For readers interested in using R (https://www.r-project.org) for data analysis, we provide the corresponding R-commands on our website www.multivariate-methods.info.

* MVA: Case Study Chocolate Cluster Analysis.


* Defining Data.
DATA LIST FREE / type price refreshing delicious healthy bitter light crunchy
exotic sweet fruity.

BEGIN DATA
Milk 4.5000 4.0000 4.3750 3.8750 3.2500 3.7500 4.0000 2.3750 4.6250 4.1250
Espresso 5.1667 4.2500 3.8333 3.8333 2.1667 3.7500 3.2727 2.3333 3.7500 3.4167
Biscuit 5.0588 3.8235 4.7647 3.4375 4.2353 4.4706 3.7647 2.7059 3.5294 3.5294
Orange 3.8000 5.4000 3.8000 2.4000 5.0000 5.0000 5.0000 4.4000 4.0000 4.6000
Strawberry 3.4444 5.0556 3.7778 3.7647 3.9444 5.3889 5.0556 4.9444 4.2222 5.2778
Mango 3.5000 3.5000 3.8750 4.0000 4.6250 5.2500 5.5000 6.0000 4.7500 5.3750
Cappuccino 5.2500 3.4167 4.5833 3.9167 4.3333 4.4167 4.6667 3.6667 4.5000 3.5833
Mousse 5.8571 4.4286 4.9286 3.8571 4.0714 5.0714 2.9286 2.0909 4.5714 3.7857
Caramel 5.0833 4.0833 4.6667 4.0000 4.0000 4.2500 3.8182 1.5455 3.7500 4.1667
Nougat 5.2727 3.6000 3.9091 4.0909 4.0909 4.0909 4.5455 1.7273 3.9091 3.8182
Nut 4.5000 4.0000 4.2000 3.9000 3.7000 3.9000 3.6000 2.2000 3.5000 3.7000
END DATA.

* Case Study Cluster Analysis: Method "Single Linkage"(Analysis of outliers).


CLUSTER price refreshing delicious healthy bitter light crunchy exotic sweet
fruity
/METHOD SINGLE
/MEASURE=SEUCLID
/ID=type
/PRINT SCHEDULE CLUSTER(2,5)
/PRINT DISTANCE
/PLOT DENDROGRAM.

* Case Study Cluster Analysis: Method "Ward" (hierarchical procedure).


CLUSTER price refreshing delicious healthy bitter light crunchy exotic sweet
fruity
/METHOD WARD
/MEASURE=SEUCLID
/ID=type
/PRINT SCHEDULE CLUSTER(2,5)
/PRINT DISTANCE
/PLOT DENDROGRAM.

* Case Study Cluster Analysis: Method "K-Means" (partitioning procedure).


QUICK CLUSTER price refreshing delicious healthy bitter light crunchy exotic sweet
fruity
/MISSING=LISTWISE
/CRITERIA=CLUSTER(2) MXITER(10) CONVERGE(0)
/METHOD=KMEANS(NOUPDATE)
/PRINT ID(type) INITIAL ANOVA CLUSTER DISTAN.

* Case Study Cluster Analysis: Method "Two-Step" (partitioning procedure).


TWOSTEP CLUSTER
/CONTINUOUS VARIABLES=price refreshing delicious healthy bitter light crunchy
exotic sweet fruity
/DISTANCE LIKELIHOOD
/NUMCLUSTERS FIXED=2
/HANDLENOISE 0
/MEMALLOCATE 64
/CRITERIA INITHRESHOLD(0) MXBRANCH(8) MXLEVEL(3)
/VIEWMODEL DISPLAY=YES.

Fig. 8.30 SPSS syntax for running the cluster analyses of the case study

8.4 Modifications and Extensions

The previous explanations concentrated on cases in which objects are described by met-
ric data and a hierarchical, agglomerative clustering method is used. But cluster analysis
is also capable of processing non-metric data. In these cases, however, different proxim-
ity measures have to be used, as already indicated in Sect. 8.2.2.1. In Sect. 8.4.1, we will
describe common proximity measures for:

• nominally scaled variables,


• binary variables (0/1 variable),
• mixed variables.

In Sect. 8.4.2, k-means cluster analysis and two-step cluster analysis, two frequently
used partitioning cluster procedures (see Fig. 8.6), will be presented and compared with
each other.

8.4.1 Proximity Measures for Non-Metric Data

8.4.1.1 Proximity Measures for Nominally Scaled Variables


With nominally scaled variables, there are basically two ways of processing them in
cluster analysis:

A. transformation into binary variables,


B. analysis of frequency data.

A. Transformation into binary variables


When transforming a nominal variable into a binary variable, all categories of a nominal
variable are regarded as independent binary variables and coded with 0/1. The value 1
means “attribute value exists” and the value 0 means “attribute value does not exist”.

Example
The nominal variable ‘delivery complaints’ has four categories as attribute values;
thus, it is coded by four binary variables. Table 8.23 shows the corresponding trans-
formation of the complaint categories into binary variables. Each position in the row
stands for a complaint type, which is coded with 1 (= attribute value exists) if it is
valid. ◄

The example shows that the number of categories (values) of a nominal variable deter-
mines the length of the binary variables consisting of zeros and ones. In the above exam-
ple, the number of possible complaint types determines the length of the binary variable

Table 8.23  Binary decomposition of the nominal variable ‘complaints’

Complaint category    Type of complaint        Transformation into binary variables
A                     Defective product        1000
B                     Incomplete delivery      0100
C                     Packaging damage         0010
D                     Delayed delivery         0001

consisting of zeros and ones (see 3rd column). If there are no complaints, the coding in
the example is ‘0000’.
To calculate the similarity or dissimilarity between objects with nominally scaled variables, the proximity measures discussed in Sect. 8.4.1.2 can be used. It should be noted, however, that similarity coefficients which count common non-possession as a match should not be used. The reason is that with such similarity coefficients (e.g. the SM-coefficient; see Sect. 8.4.1.2.2), a large and, in particular, very different number of attribute categories per variable leads to distortions in the similarity measure.

B. Analysis of the frequencies (counts) of attribute values


Since no arithmetic operations can be carried out with nominally scaled variables, we
usually analyze the frequencies with which the values of a nominal variable occur in
a survey. The frequencies can then be used to check whether there are any statistical
dependencies between the nominal variables (see also Chap. 6 on contingency analysis).

Example
In a survey, 100 people were asked about their assessment of five chocolate types
(Espresso, Cappuccino, Biscuit, Nut and Nougat) with regard to the preferred packag-
ing. The answer categories ‘paper’, ‘tin can’ and ‘gift box’ were specified as possible
types of packaging.
Table 8.24 shows the frequencies per type of packaging for the corresponding
chocolate types, with multiple mentions being possible. In total, N = 606 answers
were given. ◄

The data in Table 8.24 form a cross table of the two nominally scaled variables ‘Chocolate type’ and ‘Packaging type’. The test statistic of the chi-square homogeneity test can now be used as the distance measure between two objects k and l. This test checks the null hypothesis that the two objects come from the same distribution (population). In the procedure ‘Cluster’ of SPSS, the chi-square statistic used to determine distances for frequency data is calculated as follows:

Table 8.24  Frequency data for the preferred packaging type

                 Type of packaging
Flavors          Paper    Tin can    Gift box    Row total
Espresso         24       65         12          101
Cappuccino       35       55         21          111
Biscuit          20       40         75          135
Nut              83       30         21          134
Nougat           75       28         22          125
Column total     237      218        151         606

Table 8.25  Distance matrix of the frequency data according to the chi-square statistic
Chi-square value between frequency sets
Espresso Cappuccino Biscuit Nut Nougat
Espresso 0.000
Cappuccino 2.209 0.000
Biscuit 6.931 5.901 0.000
Nut 6.642 4.994 8.387 0.000
Nougat 6.470 4.754 7.941 0.430 0.000

Chi-square = [ Σ(i=k,l) Σ(j=1…J) (nij − eij)² / eij ]^0.5        (8.13)

with
nij   number of entries of variable j for object i (i = k, l; j = 1, …, J) (cell frequency)
eij   expected number of entries of variable j for object i when the attributes are independent [(row sum × column sum)/total sum]

The larger the chi-square, the greater the probability that the two objects do not come
from the same population and should therefore be classified as dissimilar. The distance
matrix according to the chi-square for the frequency data in the above example (see
Table 8.24) is shown in Table 8.25.
The distances for all five objects according to the chi-square statistic in Table 8.25
show that the frequency data of ‘Nut’ and ‘Nougat’ (with a value of the chi-square sta-
tistic of 0.430) have the smallest distance (greatest similarity) and would therefore be
merged at the first stage. Accordingly, ‘Espresso’ and ‘Cappuccino’ would be fused in
the next step (with a value of the chi-square statistic of 2.209).

Table 8.26  Frequencies of the preferred packaging types in the example

                 Type of packaging
Flavors          Paper    Tin can    Gift box    Row total
Espresso         24       65         12          101
Cappuccino       35       55         21          111
Column total     59       120        33          212

Table 8.27  Expected frequencies of the preferred packaging types in the example

                 Type of packaging
Flavors          Paper      Tin can     Gift box
Espresso         28.108     57.170      15.722
Cappuccino       30.892     62.830      17.278

If the absolute frequencies show large differences between the individual pair comparisons, the phi-square statistic should be used to determine the distance. It is based on the chi-square statistic, but additionally normalizes the data by dividing by the total number of cases of the two objects under consideration.
For the sake of clarity, we will show the calculation of the chi-square statistic of 2.209
for the objects ‘Espresso’ and ‘Cappuccino’. For this purpose, only the first two rows of
Table 8.24 are considered in Table 8.26.
In addition, the expected frequencies eij under independence of the two types of chocolate must be calculated according to Eq. (8.14):

eij = (row sum × column sum) / total sum        (8.14)
(Example: the expected frequency for the cell ‘Espresso’—‘paper’ equals: (101 ⋅ 59) /
212 = 28.108).
The resulting expected frequencies (under independence of the two flavors) are given in Table 8.27.
Using Tables 8.26 and 8.27, the chi-square for the flavors ‘Espresso’ and ‘Cappuccino’
can now be calculated according to Eq. (8.13):

Chi-square = [ (24 − 28.108)²/28.108 + (35 − 30.892)²/30.892 + (65 − 57.170)²/57.170
             + (55 − 62.830)²/62.830 + (12 − 15.722)²/15.722 + (21 − 17.278)²/17.278 ]^0.5
           = (0.6005 + 0.5464 + 1.0725 + 0.9758 + 0.8810 + 0.8016)^0.5
           = 4.878^0.5 = 2.209

Table 8.28  Possible combinations of binary variables

                                        Object 2
Object 1                    Property present (1)    Property not present (0)    Row sum
Property present (1)        a                       c                           a + c
Property not present (0)    b                       d                           b + d
Column sum                  a + b                   c + d                       M

For the example, the following phi-square measure results between the flavors ‘Espresso’ and ‘Cappuccino’, with 212 representing the total number of cases of the two objects considered:

φ(Espresso, Cappuccino) = (4.878 / 212)^0.5 = 0.152
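Both calculations can be reproduced with a short Python sketch based on the observed frequencies in Table 8.26 (the function names are our own):

def chi_square_distance(row_k, row_l):
    # Chi-square distance between two frequency profiles (cf. Eq. 8.13)
    col_totals = [a + b for a, b in zip(row_k, row_l)]
    grand_total = sum(col_totals)
    chi2 = 0.0
    for obs_row in (row_k, row_l):
        row_total = sum(obs_row)
        for n_obs, col_total in zip(obs_row, col_totals):
            expected = row_total * col_total / grand_total   # Eq. (8.14)
            chi2 += (n_obs - expected) ** 2 / expected
    return chi2 ** 0.5

def phi_square_distance(row_k, row_l):
    # Phi-square distance: chi-square normalized by the total number of cases
    chi2 = chi_square_distance(row_k, row_l) ** 2
    return (chi2 / (sum(row_k) + sum(row_l))) ** 0.5

espresso   = [24, 65, 12]   # paper, tin can, gift box (Table 8.26)
cappuccino = [35, 55, 21]
print(round(chi_square_distance(espresso, cappuccino), 3))   # 2.209
print(round(phi_square_distance(espresso, cappuccino), 3))   # 0.152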

8.4.1.2 Proximity Measures for Binary Variables


8.4.1.2.1 Overview and Output Data for a Calculation Example
A binary variable structure exists if all attribute variables have the values 0 or 1, with
the value 1 usually meaning “property exists” and the value 0, “property does not exist”.
When determining the similarity between two objects, cluster analysis always starts with
a paired comparison, i.e. for each of the two objects, all property features are compared
with each other. As shown in Table 8.28, four cases can be distinguished in the case of
binary features when comparing two objects with respect to one property:

• property exists for both objects (cell a).


• only object 2 has the property (cell b)
• only object 1 has the property (cell c)
• property does not exist for both objects (cell d).

For the determination of similarities between objects with binary variable structures, a large
number of measures have been developed in the literature. Most of them can be traced back
to the following general similarity function (Kaufman & Rousseeuw, 2005, p. 22):
Sij = (a + δ · d) / (a + δ · d + λ · (b + c))        (8.15)
with
Sij similarity between objects i and j
δ, λ (constant) weighting factors

Variables a, b, c and d correspond to the identifiers in Table 8.28, where, for example,
variable a corresponds to the number of properties present in both objects (1 and 2).

Table 8.29  Selected similarity coefficients for binary variables

Name of the similarity coefficient    Weighting factors      Definition
                                      δ          λ
Simple matching (SM-coefficient)      1          1            (a + d) / M
Dice                                  0          ½            2a / (2a + (b + c))
Jaccard                               0          1            a / (a + b + c)
Rogers & Tanimoto                     1          2            (a + d) / (a + d + 2(b + c))
Russel & Rao (RR)                     –          –            a / M

Depending on the choice of weighting factors δ and λ, we will get different similarity
measures for objects with binary variables. Table 8.29 gives an overview, where M is the
number of features (cf. Kaufman and Rousseeuw 2005, p. 24).
The procedure ‘Hierarchical Cluster Analysis’ in SPSS offers a total of 7 distance
measures and 20 similarity measures for calculating the proximity of objects with a
binary variable structure. The choice of the proximity measure should be based on logi-
cal considerations and depends on the specific situation.
In the following, the similarity coefficients simple matching (SM), Jaccard and Russel
& Rao (RR), which are frequently used in practical applications in the case of binary
variables, are examined in more detail. We will use the example in Table 8.30.

Example: Description of five chocolate varieties by 10 binary characteristics


The five types of chocolate considered so far are described by 10 binary attributes,
which are shown in Table 8.30. For all characteristics it is indicated whether a product
does (1) or does not (0) have the relevant feature. ◄

8.4.1.2.2 Simple Matching, Jaccard and RR Similarity Coefficients

Simple matching similarity coefficient (SM)


The SM coefficient (also called “Rand similarity coefficient”) counts all matching components of both objects in the numerator. When comparing ‘Espresso’ and ‘Nut’ according to Table 8.30, these are 5 of the 10 attributes: ‘Storing time’, ‘National ads’, ‘Paper packaging’ and ‘Special display’ (= attributes present in both flavors) as well as ‘Storage problems’ (= attribute not present in both flavors). The SM coefficient is calculated as (a + d)/M and results in a value of 0.5 for this product pair (with M = a + b + c + d). The SM values for the other object pairs are shown in Table 8.31.

Jaccard similarity coefficient


The Jaccard similarity coefficient measures the relative proportion of common proper-
ties in relation to the number of properties that apply to at least one of the objects under
Table 8.30  Initial data matrix for the calculation of the Jaccard, SM and RR similarity coefficients

Attributes (1–10): Storing time > 1 year, Seasonal packaging, National ads, Paper packaging,
XL size, Sales promotion, Special display, Brand product, Margin > 20%, Storage problems

Flavor        1  2  3  4  5  6  7  8  9  10
Espresso      1  1  1  1  0  0  1  0  0  0
Cappuccino    1  1  1  1  1  0  1  0  1  0
Biscuit       1  1  0  1  0  1  0  1  0  1
Nut           1  0  1  1  1  1  1  1  1  0
Nougat        1  1  0  1  1  1  0  1  1  0

Table 8.31  Simple matching coefficient (simple match)


Espresso Cappuccino Biscuit Nut Nougat
Espresso 1
Cappuccino 0.8 1
Biscuit 0.5 0.3 1
Nut 0.5 0.7 0.4 1
Nougat 0.4 0.6 0.7 0.7 1

Table 8.32  Similarities according to the Jaccard coefficient


Espresso Cappuccino Biscuit Nut Nougat
Espresso 1
Cappuccino 0.714 1
Biscuit 0.375 0.3 1
Nut 0.444 0.667 0.4 1
Nougat 0.333 0.556 0.625 0.667 1

consideration. The first step is to determine how many properties both products have
in common. In our example, the chocolate bars ‘Espresso’ and ‘Cappuccino’ have five
components in common (‘Storing time > 1 year’, ‘Seasonal packaging’, ‘National ads’,
‘Paper packaging’ and ‘Special display’). The components that are only present in one
product are counted next. In our example, two attributes can be found in this category
(‘XL size’ and ‘Margin > 20%’). If we put the number of attributes present in both prod-
ucts in the numerator (a = 5) and add to the denominator the number of attributes present
in both products (a = 5) and those present in only one product (b + c = 2), the Jaccard
coefficient for the products ‘Espresso’ and ‘Cappuccino’ is 5/7 = 0.714.
In the same way, the corresponding similarities are calculated for all other object
pairs. Table 8.32 shows the results. With regard to this matrix, two things should be
noted:

• The similarity of two objects is not influenced by their order in the comparison, i.e. it
is irrelevant whether the similarity between ‘Espresso’ and ‘Cappuccino’ or between
‘Cappuccino’ and ‘Espresso’ is measured (symmetry property). This also explains
why the similarity of the products in Table 8.32 is represented only by the lower trian-
gular matrix.
• The values of the similarity measurement are between 0 (“total dissimilarity”, a = d
= 0) and 1 (“total similarity”, b = c = 0). If the conformity between the characteris-
tics of a single product and itself is checked, one naturally finds complete conformity.
Thus, it is also understandable that only the number 1 can be found in the diagonal of
the matrix.

These considerations now put us in a position to determine the most similar and the most
dissimilar pair. The chocolate bars ‘Espresso’ and ‘Cappuccino’ have the greatest simi-
larity (Jaccard coefficient = 0.714). The slightest similarity exists between ‘Cappuccino’
and ‘Biscuit’ with a Jaccard coefficient of 0.3.

Russel and Rao similarity coefficient


In a slightly different way, the Russel and Rao coefficient (RR coefficient) measures the
similarity of pairs of objects. In contrast to the Jaccard coefficient, cases in which both of
the considered objects do not have certain attributes (d) are also included in the denomi-
nator. This means that the number of properties common to both objects is put in relation
to the number of all properties examined. Like the other coefficients, the RR coefficient
ranges between the extreme values 0 and 1. In our example, the denominator of the RR
coefficient is always 10, because a total of 10 features per object is considered. If the
pairwise comparison shows that at least one property
does not exist in both objects, the RR coefficient has a smaller value than the Jaccard
coefficient. This is the case for the product pair ‘Espresso’ and ‘Nut’. Both of these choc-
olate types do not have the feature ‘Storage problems’. Thus, their value of similarity in
comparison to the Jaccard coefficient drops from 0.444 to 0.4. If no property is missing
(d = 0), both similarity measures lead to the same result. The RR coefficient values are
shown in Table 8.33. It should be noted that the main diagonal of the similarity matrix
according to Russel & Rao as generated by SPSS does not show the “1”, but the propor-
tion of the attributes present for each test object (coding with 1). The values of the main
diagonal can be easily calculated with the help of the initial data matrix in Table 8.30.
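The pairwise similarity matrices in Tables 8.31, 8.32 and 8.33 can be verified with a few lines of code. The following self-contained Python sketch (again our own illustration, not SPSS output) computes the SM, Jaccard and RR coefficients for all pairs of flavors from the data matrix in Table 8.30; for ‘Espresso’ and ‘Cappuccino’ it reproduces SM = 0.8, Jaccard = 0.714 and RR = 0.5.

from itertools import combinations

profiles = {   # rows of Table 8.30
    "Espresso":   [1, 1, 1, 1, 0, 0, 1, 0, 0, 0],
    "Cappuccino": [1, 1, 1, 1, 1, 0, 1, 0, 1, 0],
    "Biscuit":    [1, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "Nut":        [1, 0, 1, 1, 1, 1, 1, 1, 1, 0],
    "Nougat":     [1, 1, 0, 1, 1, 1, 0, 1, 1, 0],
}

def counts(x, y):
    a = sum(xi and yi for xi, yi in zip(x, y))               # both present
    d = sum((not xi) and (not yi) for xi, yi in zip(x, y))   # both absent
    m = len(x)
    return a, d, m

def sm(x, y):       # simple matching: (a + d) / M
    a, d, m = counts(x, y)
    return (a + d) / m

def jaccard(x, y):  # a / (a + b + c) = a / (M - d)
    a, d, m = counts(x, y)
    return a / (m - d)

def rr(x, y):       # Russel & Rao: a / M
    a, d, m = counts(x, y)
    return a / m

for name1, name2 in combinations(profiles, 2):
    x, y = profiles[name1], profiles[name2]
    print(f"{name1}-{name2}: SM={sm(x, y):.3f}, "
          f"Jaccard={jaccard(x, y):.3f}, RR={rr(x, y):.3f}")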

8.4.1.2.3 Comparison of the Proximity Measures


All three measures of similarity arrive at the same result if no property is missing in the
pairwise comparisons, i.e. if d = 0. If at least one feature is not present in both of the two
compared objects, the RR coefficient shows the lowest value of similarity and the SM
coefficient the highest. The Jaccard similarity measure occupies a middle position. The
Jaccard and SM coefficients, however, arrive at the same result if only cases (a) and (d)
exist, i.e. if attributes are either simultaneously present or absent.
We will not go into detail about the differences in the similarity rankings of the three
coefficients, but only note the following observations for our example:

Table 8.33  Similarity coefficient according to Russel & Rao (RR coefficient)


Espresso Cappuccino Biscuit Nut Nougat
Espresso 0.5
Cappuccino 0.5 0.7
Biscuit 0.3 0.3 0.6
Nut 0.4 0.6 0.4 0.8
Nougat 0.3 0.5 0.5 0.6 0.7

• The object pair ‘Espresso’ and ‘Cappuccino’, for example, takes third place in the
order of similarity according to the RR coefficient. According to the other two simi-
larity measures, however, these two products are most similar and therefore rank first.
• Whereas ‘Biscuit’ – ‘Espresso’ and ‘Biscuit’ – ‘Cappuccino’ show only low similar-
ity (below 0.375) according to the Jaccard and RR coefficients, the pair ‘Biscuit’ –
‘Espresso’ achieves a value of similarity of 0.5 according to the SM-coefficient, while
‘Biscuit’ – ‘Cappuccino’ has a similarity of 0.3.

There is no general recommendation as to which similarity measure should be preferred.


The decision should be made on a case-by-case basis, depending on how important the
absence of a component in both objects is compared to its presence in both objects.

Example
In the case of the variable ‘gender’, for example, the existence of the characteristic
‘male’ has the same significance as its absence. This does not apply to the attribute
‘nationality’ with the expressions ‘American’ and ‘non-American’, because the exact
nationality that may be of interest cannot be determined by the statement ‘non-Amer-
ican’. Thus, if the presence of a component has the same significance for the grouping
as its absence, similarity measures which take into account all equal characteristics in
the numerator should be preferred (e.g., the SM coefficient). Conversely, it is advisa-
ble to use the Jaccard coefficient or related proximity measures. ◄

In case of unequally distributed characteristics (e.g., cases of suffering from a very rare
disease), the application of proximity measures leads to distortions (in this example, the
highly probable case that two persons do not suffer from this rare disease would be inter-
preted as a similarity).

8.4.1.3 Proximity Measures for Mixed Variables


The previous sections showed that cluster analytical methods do not require a specific
scale level of the attributes. This advantage also applies to the problem of treating mixed
variables. Empirical studies very often record both metric and non-metric properties of
the objects to be classified.
In this case, the question arises how attributes with different scale levels can be con-
sidered together. The following approaches are possible:

A. separate calculation for metric and non-metric variables,
B. transformation to a lower measurement level (dichotomization),
C. grouping into classes.

A. Separate calculation for metric and non-metric variables


One option is to separately calculate similarity coefficients or distance measures for
the metric and the non-metric variables. The overall similarity is then determined as

the unweighted or weighted mean of the values calculated in the previous step. Let us
assume, for example, that the similarity of the products ‘Nut’ and ‘Nougat’ is determined
on the basis of nominal and metric properties. The SM coefficient for these two products
is 0.7 (see Table 8.31). The resulting distance between the two types of chocolate bars is
0.3, which is obtained by subtracting the similarity value from 1.
For the metric properties, we calculated a squared Euclidean distance of 6 (Table 8.6)
for these two products. If the unweighted arithmetic mean is now used as the common
distance measure, we obtain a value of [(0.3 + 6)/2 = ] 3.15 in our example. Alternatively,
the distance can be obtained by using the weighted arithmetic mean. Therefore, exter-
nal weights for the metric and non-metric distances need to be specified. For example,
the respective share of variables in the total number of variables could be used as a
weighting factor. With ten nominal and ten metric features, both groups receive the same
weight, so the result in our example is identical to the unweighted arithmetic mean.
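A minimal sketch of approach A for the pair ‘Nut’ and ‘Nougat’, using the values stated in the text (SM coefficient 0.7, squared Euclidean distance 6 from Table 8.6) and purely illustrative weights:

# Values taken from the text: SM similarity 0.7 for 'Nut' vs. 'Nougat',
# squared Euclidean distance 6 for the metric variables (Table 8.6)
d_nonmetric = 1 - 0.7                 # distance derived from the SM coefficient
d_metric = 6.0                        # squared Euclidean distance

# Unweighted arithmetic mean of both distances
d_unweighted = (d_nonmetric + d_metric) / 2                  # 3.15

# Weighted mean with weights proportional to the number of variables
# (illustrative assumption: 10 binary and 10 metric variables)
n_binary, n_metric = 10, 10
w_binary = n_binary / (n_binary + n_metric)
w_metric = n_metric / (n_binary + n_metric)
d_weighted = w_binary * d_nonmetric + w_metric * d_metric    # 3.15 as well

print(d_unweighted, d_weighted)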

B. Transformation to a lower measurement level (dichotomization)


Another option of handling mixed variables is to transform them from a higher to a
lower measurement level. The potential variants in this case are illustrated by the follow-
ing example of the feature ‘price’:

Example: measurement transformation of the feature ‘price’


For the 5 chocolate types considered in the ‘metric case’, average sales prices are
given in Table 8.34. ◄

One possibility for converting the ratio-scaled price into a binary variable is dichotomization.
In this case, a threshold is established to separate the low- and high-priced chocolate
bars. If this threshold is set at 1.60 €, for example, prices up to 1.59 € are coded 0 and
prices above this value are coded 1. The advantage of this approach is its simplicity and
quick application. On the other hand, the high loss of information is problematic, since
‘Biscuit’ ends up on the same price level as ‘Espresso’, although the latter is 0.40 € more
expensive. Another difficulty is the

Table 8.34  Average sales prices

Flavor        Price
Espresso      2.05 €
Cappuccino    1.75 €
Biscuit       1.65 €
Nougat        1.59 €
Nut           1.35 €

Table 8.35  Coding of price classes

Price classes     Binary attribute
                  1    2    3
Up to 1.40 €      0    0    0
1.41–1.69 €       1    0    0
1.70–1.99 €       1    1    0
2.00–2.30 €       1    1    1

Table 8.36  Binary coding of all chocolate flavors

Product       Binary value
Espresso      1 1 1
Cappuccino    1 1 0
Biscuit       1 0 0
Nougat        1 0 0
Nut           0 0 0

threshold definition. Arbitrarily establishing a threshold can easily distort the actual con-
ditions and thus the grouping result.
The loss of information can be reduced if price intervals (classes) are formed and each
product's price is coded by means of several binary attributes, as described below.

C. Grouping into classes


In our example, we create four price classes (price intervals) (see Table 8.35). For the
coding, we then need three binary attributes. Whether “0” or “1” is assigned depends on
the answers to the following questions:

• Attribute 1: Price equal to or greater than 1.41 €? (yes = 1; no = 0)


• Attribute 2: Price equal to or greater than 1.70 €? (yes = 1; no = 0)
• Attribute 3: Price equal to or greater than 2.00 €? (yes = 1; no = 0)

The first price class is coded with three zeros, because each question is answered with
no. If the other classes are treated in the same way, the coding shown in Table 8.35
is obtained. If this binary coding is applied to ‘Nut’, for example, we obtain the number
sequence “0 0 0” for this product. Table 8.36 lists the binary codings for all chocolate bars.
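The threshold coding of Tables 8.35 and 8.36 can be generated mechanically from the prices in Table 8.34. The following short Python sketch (our own illustration; the threshold values correspond to the class limits defined above) produces the binary codes:

# Average sales prices from Table 8.34 and the class thresholds of Table 8.35
prices = {
    "Espresso": 2.05, "Cappuccino": 1.75, "Biscuit": 1.65,
    "Nougat": 1.59, "Nut": 1.35,
}
thresholds = [1.41, 1.70, 2.00]   # attribute j = 1 if price >= threshold j

for flavor, price in prices.items():
    code = [1 if price >= t else 0 for t in thresholds]
    print(flavor, code)
# Espresso   [1, 1, 1]
# Cappuccino [1, 1, 0]
# Biscuit    [1, 0, 0]
# Nougat     [1, 0, 0]
# Nut        [0, 0, 0]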
The particular advantage of this method is its low loss of information, which is even
smaller with reduced sizes of the price class intervals. For example, seven price class
intervals reduce the span by half and thus better reflect the actual price differences.
However, the disadvantage of this method is that the importance of the considered

attribute (‘price’ in the above example) increases with the number of intervals created for
this attribute. If we assume, for example, that in a study only properties with two compo-
nents exist in addition to ‘price’, it can be seen that, in case of four price class intervals,
the price is three times as important as each one of the other attributes. Reducing the
price interval spans by half will result in an importance five times as high as the other
features. The extent to which a higher importance of an individual feature is desirable
must be decided on a case-by-case basis.

8.4.2 Partitioning clustering methods

The hierarchical agglomerative cluster techniques are of great practical importance.


However, even today these methods quickly reach their limits if large data sets are to
be analyzed. This is increasingly the case in the age of Big Data. Due to the increasing
multitude of electronic systems (e.g. social networks, Internet of Things, cyber-physical
systems, cheaper processors and increasing data usage), a growing amount of data is col-
lected almost in real-time for almost all cases (persons and objects). In hierarchical clus-
ter analyses, as presented in Sect. 8.2.3, the pairwise distances or similarities between all
cases at each clustering level need to be calculated. However, for high caseloads even the
currently available computer capacities are no longer sufficient. A way to solve this prob-
lem is offered by partitioning clustering methods (see Fig. 8.6).
Partitioning cluster techniques start with a predefined grouping of objects (start par-
tition) and then gradually improve this first partition with respect to a specific target cri-
terion. The single objects are re-sorted between the groups, using a so-called exchange
algorithm until the specified target criterion is met. Partitioning cluster procedures have
the advantage that they are quite dynamic because elements can be exchanged between
different groups even during the clustering process. This gives them greater flexibility.
This is not possible when applying hierarchical clustering methods in which a group,
once formed, cannot be dissolved in the clustering process.
In the following, we describe k-means cluster analysis (KM-CA) and two-step cluster
analysis (TS-CA) which provide efficient methods for identifying groups in large data
sets. At the end of each section, helpful information is provided on how to carry out the
procedures in SPSS.

8.4.2.1 K-means clustering
KM-CA starts with the assumption that a data set is partitioned into k clusters. A cluster i
is represented by its centroid, which is calculated by averaging the cases assigned to clus-
ter i. This is the reason why the method is called “k-means”. It is also known as cluster
center analysis. The target criterion (Z) of KM-CA is formally represented as follows:
Z = Σi=1..k Σxj∈Si ‖xj − µi‖² → Min!                                      (8.16)

Fig. 8.31  Example of the k-means algorithm with 21 cases and 3 cluster centers (CC).
Panels: A: Starting point; B: Initial clusters; C: New cluster centers; D: Final solution

Equation (8.16) shows that a given number of objects (x) is to be split into k partitions.
This is done in such a way that the sum of the squared deviations between the data
points (xj) and the center of gravity of a cluster (mean value μi) results in a minimum.
According to this target criterion, KM-CA performs clustering by minimizing the vari-
ance criterion (see Eq. (8.9)).

8.4.2.1.1 Procedure of KM-CA
The KM-CA procedure will be explained in the following on the basis of Fig. 8.31.

Example of the k-means algorithm


21 data points (cases, objects) are considered, and each case is described by two var-
iables. It is important to note that this limitation is solely made to allow for a graphic
representation. In real applications, objects are usually described by a large number of
variables. ◄

Step 1: Random determination of k initial cluster centers


Box A in Fig. 8.31 shows that we start with 21 objects represented in a two-dimensional
space. In the first step, KM-CA requires the specification of a number of clusters to

which the objects will be assigned. This number (k) may either be specified by the user
based on logical reasoning or may be determined automatically by SPSS. The example
assumes that we want to form three clusters (k = 3). The corresponding cluster centers
are (randomly) placed in the coordinate system and are shown as rectangles in Fig. 8.31B
(Initial Clusters).

Step 2: Allocation of cases (data points) to the (initial) cluster centers depending on
the variance of the clusters
In order to be able to assign the 21 data points to the three given cluster centers, the
Euclidean distances (cf. Eq. (8.1)) between all data points (xj) and the three cluster
centers (μi) are calculated. A given data point is then assigned to that one of the three
clusters in which the so-called variance criterion (cf. Eq. (8.9)) is increased least. In our
example, 7 cases are classified into cluster 1 (CC_1), 8 cases are classified into cluster 2
(CC_2) and 6 cases are classified into cluster 3 (CC_3) (see Fig. 8.31B).

Step 3: Recalculation of the cluster centers


After all cases have been assigned to the initial cluster centers in step 2 (Fig. 8.31B), the
cluster centers have to be recalculated in step 3. Each newly calculated cluster center is
defined as the mean value (mi) of the data points (xj) belonging to this cluster (i):

mi(t+1) = (1 / |Si(t)|) · Σxj∈Si(t) xj                                    (8.17)

Figure 8.31C shows that each of the cluster centers of the three groups is now located
in the center of the objects belonging to the corresponding cluster. The variance of the
clusters is now calculated with the new cluster centers, analogously to step 2. Thus, we
can check whether a reduction of the cluster variance can be achieved and whether data
points should be reassigned to another cluster. A comparison between panels C and D in
Fig. 8.31 shows that case xn is reassigned from cluster 2 to cluster 1 and case xm is reas-
signed from cluster 1 to cluster 3.

Step 4: Analyzing the convergence criterion


Steps 2 and 3 are repeated until the variance in each of the three clusters can no longer
be reduced by reassigning cases. As soon as the variance criterion cannot be reduced
anymore, the final cluster centers are determined by the group averages. These values are
regarded as “representatives” of the clusters. In the example of Fig. 8.31, only two itera-
tions were necessary to find the “optimal” solution.
If the sums of the deviation squares are within a satisfactory range, the computed
result can be evaluated in a next step. In order to interpret the clusters, the so-called
ANOVA table needs to be requested in SPSS. This table shows the F-value and the cor-
responding significance level (“Sig.”) for each variable. It is important to note that the
observed significance levels are not corrected, i.e. they cannot be interpreted as statistical

tests for the hypothesis of equality of cluster mean values. The F-values only have a
descriptive value and only provide an indication as to whether the features of the var-
iables differ in the clusters. If the user wants to check the stability of the solution, the
KM-CA can be performed several times, with the cases assigned in a different, randomly
selected order. It should also be noted that the calculation of the cluster centers depends
on the subjective initial classification of the clusters (step 1). If the user initially decides
to propose a different classification, it is possible that the calculation results in a different
solution. That is why there is always a risk of finding a local optimum, but not a global
optimum. It is therefore advisable to try different numbers of initial cluster centers and to
compare the results with each other.
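For readers who want to retrace the four steps outside SPSS, the following NumPy sketch implements the basic k-means loop described above: random initial centers (step 1), assignment to the nearest center (step 2), recomputation of the centers as cluster means (step 3), and repetition until the assignment is stable (step 4). It is a simplified illustration under our own assumptions, not the SPSS implementation, and, like any k-means run, it may end in a local optimum.

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    labels = None
    for _ in range(max_iter):
        # step 2: Euclidean distances of all cases to all centers
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                             # step 4: stable
        labels = new_labels
        # step 3: new center = mean of the cases assigned to the cluster
        centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                            else centers[i] for i in range(k)])
    return labels, centers

# 21 artificial cases with two variables, loosely mirroring Fig. 8.31
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, size=(7, 2))
               for loc in ([0, 0], [4, 0], [2, 3])])
labels, centers = k_means(X, k=3)
print(labels)
print(centers)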

8.4.2.1.2 Conducting KM-CA with SPSS


KM-CA is implemented in SPSS and can be opened with the menu ‘Analyze/Classify/K-
Means Cluster‘. Figure 8.32 shows the starting window of KM-CA, where we can enter
the number of clusters (here: 2) and choose between the options ‘Iterate and classify’
or ‘Classify only’. After opening ‘Options’, the initial cluster centers, the ANOVA table
and cluster information for each case can be requested. With regard to missing values, a
choice can be made between listwise and pairwise case exclusion. The KM-CA proce-
dure only allows the processing of metrically scaled variables.

Fig. 8.32 Dialog box: “K-means Cluster Analysis”



8.4.2.2 Two-Step Cluster Analysis


Two-step cluster analysis (TS-CA) is also suitable for analyzing large amounts of data.
This method is able to process variables with different measurement levels. The algo-
rithm is also able to detect outliers in the data set and to determine the optimum number
of clusters. TS-CA is a robust clustering method that is not specifically vulnerable to
violations of assumptions. A further advantage is that the method produces easily inter-
pretable results.

8.4.2.2.1 Procedure of TS-CA

As the name suggests, TS-CA forms clusters in two steps:


In the first step, the cases in the initial data set are assigned to the nodes of a deci-
sion tree (so-called cluster feature tree; CF tree). The number of nodes can be specified
directly by the user or generated automatically by SPSS. In the CF-tree (see Fig. 8.33),
all cases are initially contained in one node. The cases in this first node are then subdi-
vided into further nodes according to a decision tree pattern at the next level, to which
the original cases are assigned based on their similarities. The division into further levels
is continued successively, and a maximum of eight nodes can be formed per level. With a
maximum of three levels, 8 x 8 x 8 = 512 sub-nodes can be created, to which the individ-
ual data points (cases) of the initial data set are allocated.
On the third level of the CF tree, the data points contained in a final node are each
combined into a case by averaging. At the first step of TS-CA, outliers can also be identi-
fied, which are then grouped into a separate node at the end.
In the second step, a hierarchical cluster analysis (cf. Sect. 8.2.2) is applied to the
final nodes of the first step (without outliers). If all variables used in the cluster analysis

Fig. 8.33  Example of a CF tree with three levels (starting point with all cases at the root;
nodes at levels 1 and 2; final nodes at level 3)



Fig. 8.34 Dialog box: TwoStep Cluster Analysis

are metrically scaled, the Euclidean distance (see Eq. (8.1)) can be used to calculate the
distances. If, however, the variables have different scale levels, the user must use the
log-likelihood distance, which is based on a probability model. The log-likelihood
criterion can also be used for purely metric variables. To determine the final number of
clusters, the criteria listed in Sect. 8.2.4 can be used, for example.
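The TwoStep procedure itself is only available in SPSS. Its two-step logic (pre-clustering the cases into subclusters via a CF tree, then clustering the subclusters hierarchically) is, however, closely related to the BIRCH algorithm, so a rough stand-in for purely metric variables can be sketched with scikit-learn's Birch class. The parameter values below are arbitrary assumptions, and the result is not identical to SPSS output.

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.4, size=(500, 2))      # a larger data set
               for loc in ([0, 0], [3, 0], [1.5, 2.5])])

# Step 1: CF-tree-style pre-clustering into many small subclusters
# Step 2: hierarchical clustering of the subcluster centroids into 3 clusters
model = Birch(threshold=0.5, branching_factor=8, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))   # sizes of the three clusters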

8.4.2.2.2 Conducting a TS-CA with SPSS


TS-CA is implemented in SPSS with the procedure ‘TwoStep Cluster Analysis’ and is
opened via the menu ‘Analyze/Classify/TwoStep Cluster‘. Figure 8.34 shows the open-
ing window of TS-CA. In contrast to KM-CA, TS-CA can process metric (continuous)
and nominally scaled (categorical) variables. Euclidean distances and log-likelihood
may be used as the distance measure for determining the similarity between two clusters.
With regard to the number of clusters, we can define the number of clusters ourselves
(by specifying a fixed number) or have it determined automatically by SPSS. When
determining the number of clusters by SPSS, the user can choose between Schwarz’s
Bayesian information criterion (BIC) and Akaike’s information criterion (AIC). For
details on BIC and AIC see Chap. 5.

Table 8.37  Comparison between KM-CA and TS-CA

Criteria                        KM-CA                                TS-CA
Scale level of the variables    Only metrical scale level            Metrical and nominal scale level
Initial cluster number/         Specified by the researcher,         Specified by the researcher; optimum number
subclusters                     randomly specified by SPSS           of sub-clusters can be specified in SPSS
Outliers                        Need to be identified by the user    Can be identified and processed automatically
Sequence of input data          Can influence the results            Has no effect on the results
Violation of assumptions        Relatively robust against            Relatively robust against
                                violations of assumptions            violations of assumptions

A standardization of variables can be requested by opening the box ‘Options’. In this


section, various options are also offered for the creation of the CF tree.

8.4.2.3 Comparison of KM-CA and TS-CA


Although both KM-CA and TS-CA are partitioning methods, there are clear differences
in their approach. Table 8.37 summarizes the essential differences between
the two methods.
Despite the increased importance of partitioning cluster techniques for the clustering
of large amounts of data, the following problems should be carefully considered:

• The results of the partitioning procedures are influenced by the target function under-
lying the “reassignment” of the objects.
• The initial partition is often selected based on subjective judgement and can influence
the results of the clustering process. If the initial partition is created randomly, the
clustering solutions may vary and the results may not be comparable.
• With partitioning procedures, local optima rather than global optima may be
determined.

8.5 Recommendations

After determining the cluster variable, the first step of a cluster analysis is to decide
which proximity measure and which fusion algorithm should be used. Ultimately, these
decisions should be based on the specific application and the properties of the different
agglomerative cluster procedures discussed in Sect. 8.2.3.3. In general, the Ward method
leads to fairly good partitions and, at the same time, indicates the correct number of
clusters. To validate the results of the Ward method, other algorithms can be applied,
but the properties of the different algorithms should always be taken into account (cf.
Table 8.18).

Concretization of the problem
• What is the purpose of the investigation?
• Which hypotheses should be tested?

Determination of the objects to be classified
• How can the objects be described?
• How many objects should be considered?

Selection of the variables
• Should qualitative and/or quantitative characteristics be used?
• How large should the number of variables be?
• Is standardization reasonable?

Definition of a similarity or distance measure
• Which similarity or distance measure should be selected?
• How should mixed variables be treated?

Selecting an appropriate clustering algorithm
• Should a hierarchical or partitioning method be selected?
• What are the effects of changing the algorithm?

Determination of the number of groups
• How many groups should be formed?
• How do the results change with different numbers of groups?

Execution of the grouping operation

Analysis and interpretation of the results
• How do the clusters differ?
• Can the results be interpreted meaningfully?

Fig. 8.35  Process steps and decision problems of cluster analysis

However, agglomerative procedures may lead to calculation problems especially


for large case numbers, because the procedures require calculating a distance matrix
between all cases in each clustering step. For a large number of cases it is therefore rec-
ommended to use a partitioning clustering algorithm, such as the k-means cluster analy-
sis and the two-step clustering method offered by SPSS.
Figure 8.35 summarizes the steps to be carried out within the framework of a cluster
analysis. The process on the left-hand side shows the eight main steps of a grouping pro-
cess. The individual steps require no further explanation, but it should be noted that for a
complete analysis and meaningful interpretation of the results several iterations may be
required. This will always be the case if the results do not allow for a meaningful inter-
pretation. Further reasons for repeating the process are given on the right-hand side of
Fig. 8.35. For each step in the process, the boxes list exemplary questions that need to be
answered when planning a study. Checking the effects of a different answer alternative
on the grouping results can thus also lead to a repeated run through of individual stages.
It should be pointed out that the questions only concern the central decision-making
problems of a cluster analysis and that there are more than two alternative answers to

many of these questions. Against this background, it becomes clear that the user has a
wide range of interpretation in cluster analysis. On the one hand, this is an advantage of
cluster analysis, since it opens up a wide field of applications. On the other hand, it bears
the risk that data may be manipulated to obtain the desired results. Therefore, we recom-
mend to always answer the following questions when presenting the results of a cluster
analysis:

1. Which similarity measure and which algorithm were chosen?


2. What were the reasons for selecting these methods?
3. How stable are the results if
– the similarity measure is changed,
– the algorithm is changed,
– the number of groups is changed?

References

Calinski, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in
statistics—Theory and methods, 3(1), 1–27.
García-Escudero, L., Gordaliza, A., Matrán, C., & Mayo-Iscar, A. (2010). A review of robust clus-
tering methods. Advances in Data Analysis and Classification, 4, 89–109.
Kline, R. (2011). Principles and practice of structural equation modeling (3rd ed.). Guilford Press.
Lance, G. H., & Williams, W. T. (1966). A general theory of classification sorting strategies I.
Hierarchical systems. Computer Journal, 9, 373–380.
Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen
clustering algorithms. Psychometrika, 45(3), 325–342.
Milligan, G. W., & Cooper, M. (1985). An examination of procedures for determining the number
of clusters in a data set. Psychometrika, 50(2), 159–179.
Mojena, R. (1977). Hierarchical clustering methods and stopping rules: An evaluation. The
Computer Journal, 20(4), 359–363.
Punj, G., & Stewart, D. (1983). Cluster analysis in marketing research: Review and suggestions for
application. Journal of Marketing Research, 20(2), 134–148.
Wedel, M., & Kamakura, W. A. (2000). Market segmentation: Conceptual and methodological
foundations (2nd ed.). Springer.
Wind, Y. (1978). Issues and advances in segmentation research. Journal of Marketing Research,
15(3), 317–337.

Further reading

Anderberg, M. R. (2014). Cluster analysis for applications: Probability and mathematical statis-
tics: A series of monographs and textbooks (Vol. 19). Academic press.
Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display of
genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25),
14863–14868.

Everitt, B., Landau, S., Leese, M., & Stahl, D. (2011). Cluster analysis (5th ed.). Wiley.
Hennig, C., Meila, M., Murtagh, F., & Rocci, R. (Eds.). (2015). Handbook of cluster analysis.
Chapman & Hall/CRC.
Kaufman, L., & Rousseeuw, P. (2005). Finding groups in data: An introduction to cluster analysis
(2nd ed.). Wiley.
Romesberg, C. (2004). Cluster analysis for researchers. Lulu.com.
Wierzchoń, S., & Kłopotek, M. (2018). Modern Algorithms of Cluster Analysis. Springer Nature.
9 Conjoint Analysis

Contents

9.1 Problem  534
9.2 Procedure  536
  9.2.1 Selection of Attributes and Attribute Levels  537
  9.2.2 Design of the Experimental Study  539
    9.2.2.1 Definition of Stimuli  540
    9.2.2.2 Number of Stimuli  543
  9.2.3 Evaluation of the Stimuli  545
  9.2.4 Estimation of the Utility Function  547
    9.2.4.1 Specification of the Utility Function  548
    9.2.4.2 Estimation of Utility Parameters  553
    9.2.4.3 Assessment of the Estimated Utility Function  555
  9.2.5 Interpretation of the Utility Parameters  557
    9.2.5.1 Preference Structure and Relative Importance of an Attribute  557
    9.2.5.2 Standardization of Utility Parameters  558
    9.2.5.3 Aggregated Utility Parameters  559
    9.2.5.4 Simulations Based on Utility Parameters  561
9.3 Case Study  563
  9.3.1 Problem Definition  563
  9.3.2 Conducting a Conjoint Analysis with SPSS  563
  9.3.3 Results  570
    9.3.3.1 Individual-level Results  570
    9.3.3.2 Results of Joint Estimation  573
  9.3.4 SPSS Commands  576
9.4 Choice-based Conjoint Analysis  576
  9.4.1 Selection of Attributes and Attribute Levels  580
  9.4.2 Design of the Experimental Study  581
    9.4.2.1 Definition of Stimuli and Choice Sets  581
    9.4.2.2 Number of Choice Sets  582
  9.4.3 Evaluation of the Stimuli  584


  9.4.4 Estimation of the Utility Function  584
    9.4.4.1 Specification of the Utility Function  585
    9.4.4.2 Specification of the Choice Model  586
    9.4.4.3 Estimation of the Utility Parameters  588
    9.4.4.4 Assessment of the Estimated Utility Function  593
  9.4.5 Interpretation of the Utility Parameters  594
    9.4.5.1 Preference Structure and Relative Importance of an Attribute  595
    9.4.5.2 Disaggregated Utility Parameters  595
    9.4.5.3 Simulations Based on Utility Parameters  596
    9.4.5.4 Conclusion  596
9.5 Recommendations  596
  9.5.1 Recommendations for Conducting a (Traditional) Conjoint Analysis  596
  9.5.2 Alternatives to Conjoint Analysis  597
References  598

9.1 Problem

The term conjoint analysis—also called conjoint measurement, conjunctive analysis, or


compound measurement—refers to a procedure for measuring and analyzing consumers’
preferences for specific objects (e.g., products, services).
Let us assume, for example, that a manager of a chocolate manufacturer considers
introducing a new type of chocolate bar to the market. Yet, the question is what type of
chocolate do consumers like? Do they favor chocolate with a high cocoa content? Or is
the price of a chocolate bar more decisive for their purchase decision? To answer these
questions, the manager needs to learn more about consumers’ preferences when it comes
to buying chocolate. These questions can be answered with the help of conjoint analysis.
The process of a conjoint analysis can be described as follows:
In a first step, conjoint analysis involves developing a set of different objects (e.g.,
chocolates) that are described along different attributes. For instance, a bar of choco-
late can be described based on attributes such as ‘cocoa content’, ‘packaging’, ‘price’,
or ‘brand’. These attributes can have various levels such as ‘30% cocoa content’, ‘50%
cocoa content’ for the attribute ‘cocoa content’, or ’1.00 EUR’ and ‘1.50 EUR’ for the
attribute ‘price’. We combine the different attribute levels in various ways to get a set of
chocolates that differ along the attribute levels. The chocolate bars can be existing but
also non-existing products. To indicate that the objects (e.g. products) in conjoint studies
can be fictitious, we call them stimuli.
In a second step, consumers state their preferences regarding the stimuli. Conjoint
analysis relies on the assumption that consumers consider all presented attributes jointly
to form their preferences (CONsidered JOINTly). This implicitly indicates that conjoint
analysis focusses on decisions that involve some consideration (i.e., trade-off decisions)
on the consumers’ side. In our chocolate example, the required cognitive effort is rather
limited but the example serves our purpose.
To state their preferences, we can ask consumers to rank, rate, or choose certain stim-
uli. Depending on how consumers indicate their preferences, we distinguish between

(traditional) conjoint analysis and choice-based conjoint (CBC) analysis. The former
asks consumers to evaluate all stimuli using ordinal or metric measurement scales (e.g.,
by ranking or rating). For example, consumers may state their preferences by ranking 10
different stimuli, giving the lowest number to the most preferred stimulus (i.e., rank = 1)
and the highest number to the least preferred stimulus (i.e., rank = 10). In contrast, in CBC
analyses consumers choose a stimulus (e.g., object, product) out of a small set of stimuli
and do so multiple times. For instance, we present three different chocolate bars to a con-
sumer and ask her which one she would buy. Then we present another set of chocolate
bars to her and ask the same question again. Since consumers just select one product, the
observed evaluations are nominal (i.e., 1 if an object is chosen, and 0 otherwise).
In the following, we use the term conjoint analysis whenever consumers evaluate the
stimuli with the help of ordinal or metric measurement scales. If consumers evaluate the
stimuli by making choices, we refer to CBC analysis.
In a final step, we analyze the collected preference data. Traditional conjoint as well
as CBC analyses assume that stated preferences reflect the stimulus’ total utility values.
The stimulus (i.e., object, product) with the highest total utility is the most preferred one.
We presume that an object’s (total) utility equals the sum of the utility contributions of
each of its attribute levels. The stated preference for an object serves as a proxy for an
object’s total utility value. For example, if a consumer evaluated a specific bar of choco-
late with ‘8’ on a rating-scale from 1 to 10 (1 = ‘not attractive at all’ to 10 = ‘very attrac-
tive’), we assume that ‘8’ reflects the chocolate’s total utility value.
The total utility value is the result of the utility contributions of the chocolate’s spe-
cific attribute levels. We call these utility contributions partworths. Figure 9.1 illustrates
this idea. The chocolate bar has a cocoa content of 75%, is wrapped in paper, is offered
for 1.20 EUR and produced by the brand ‘ChocoMania’. Each attribute level contributes
to the chocolate’s total utility value. For instance, the cocoa content of 75% has a part-
worth of 5, and the sum of the partworths across all attribute levels equals 8.
It is the aim of conjoint analysis to identify the utility contribution of each attribute
level. Such knowledge allows the researcher to elicit consumers’ most preferred product,
which is the product with the highest utility and encompasses the most preferred attrib-
ute levels. The key idea of conjoint analysis thus is to decompose consumers’ overall
preferences for objects into preferences for attribute levels. Consequently, conjoint anal-
ysis is a decompositional procedure.
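The additive utility model behind this reasoning can be written down in a few lines. The following Python sketch uses the partworths shown in Fig. 9.1; note that the price partworth enters with a negative sign (a higher price reduces utility), so that the partworths add up to the total utility of 8 stated above.

# Partworths of the attribute levels shown in Fig. 9.1 (price enters negatively)
partworths = {
    ("cocoa content", "75%"): 5.0,
    ("packaging", "paper"): 3.5,
    ("price", "1.20 EUR"): -1.5,
    ("brand", "ChocoMania"): 1.0,
}

# A stimulus is a combination of attribute levels; its total utility is the
# sum of the corresponding partworths (additive model)
chocolate_bar = [("cocoa content", "75%"), ("packaging", "paper"),
                 ("price", "1.20 EUR"), ("brand", "ChocoMania")]
total_utility = sum(partworths[level] for level in chocolate_bar)
print(total_utility)   # 8.0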
Frequently, we do not only want to study consumers’ preferences regarding the stimuli
that have been evaluated, but we also want to use the results of conjoint analyses to predict
consumers’ preferences for objects that have not been evaluated. In this case, we use the
results of conjoint analyses for simulation purposes. In our example, we are interested in
the potentially ‘best’ chocolate bar. Based on the results of the conjoint analysis, we can
identify the ‘best’ chocolate bar that will probably be most successful in the market.
As stated above, conjoint as well as CBC analyses aim to derive each attribute level’s
partworth from stated preferences. When using conjoint analysis, we usually have suf-
ficient information to estimate the partworths for each individual consumer. Yet in CBC
studies, we just observe consumers’ choices that contain less information than ordinal

Attribute        Level          Partworth
Cocoa content    75%            5.0
Packaging        Paper          3.5
Price            1.20 EUR       −1.5
Brand            ChocoMania     1.0

Total utility of the chocolate bar: 8.0

Fig. 9.1  Illustration of the relation between partworths and total utility

or metric data in conjoint studies. For this reason, in CBC studies we typically cannot
estimate individual-level partworths but only partworths for the complete study sample.
This difference is critical if we want to use the results for simulation purposes (cf. Sects.
9.2.5.4 and 9.4.5.3).
Conjoint analyses are frequently applied in marketing. Yet, conjoint analyses can also
be used in other disciplines and Table 9.1 lists some examples.
We will now proceed with discussing (traditional) conjoint analysis in detail (cf. Sect.
9.2). After demonstrating how to conduct a conjoint analysis with the help of SPSS (cf.
Sect. 9.3), we describe the CBC analysis in more detail (cf. Sect. 9.4). CBC analysis
has been developed to address shortcomings of (traditional) conjoint analysis and has
become very popular in research and practice. Various other variations of traditional
conjoint analysis have been developed to address its specific limitations. We will briefly
introduce some important further developments in Sect. 9.5.2.

9.2 Procedure

Conjoint analysis generally follows a five-step procedure. In the following, we will pres-
ent and discuss the various steps of a conjoint analysis (Fig. 9.2).
A conjoint analysis starts with selecting the attributes and attribute levels that describe
the stimuli. In a second step, we generate an experimental design that represents the
stimuli based on the considered attributes and attribute levels. In step 3, we discuss alter-
native methods to evaluate the stimuli. After the preferences have been collected, we
use the stated preferences to estimate the partworths of the attribute levels to map the
respondents’ preferences (step 4). In a final step, we interpret the estimated partworths

Table 9.1  Application examples of conjoint analysis in different disciplines

Educational science
Exemplary research question: What are the drivers of students’ preferences when choosing an institution for higher education?
Considered attributes: Future job prospects; Teaching quality; Teachers’ expertise; Course content; Location

Marketing
Exemplary research question: How should a smart home be designed?
Considered attributes: Focus of applications (energy, safety, communication); Handling; Installation; Customer service; Degree of innovativeness

Medicine
Exemplary research question: What are women’s preferences for miscarriage management?
Considered attributes: Time spent at the hospital receiving treatment; Level of pain experienced; Number of days of bleeding after treatment; Time taken to return to normal activities after treatment; Cost of treatment to women; Chance of complications requiring more time or readmission to hospital

Political science
Exemplary research question: Does electoral violence affect voter choice and willingness to vote?
Considered attributes: Use of electoral violence; Availability and quality of performance record

Fig. 9.2  Process steps of conjoint analysis: 1 Selection of attributes and attribute levels;
2 Design of the experimental study; 3 Evaluation of the stimuli; 4 Estimation of the utility
function; 5 Interpretation of the utility parameters

and discuss how the results can be used to support the decision-making of managers,
policymakers, and others.

9.2.1 Selection of Attributes and Attribute Levels




The first step when conducting a conjoint analysis is to decide on the attributes and
attribute levels that are used to describe the stimuli. In the following, we will again use
an example related to the chocolate market.

Example of Traditional Conjoint Analysis

A manager of a chocolate manufacturer wants to know whether engaging in a sustainabil-


ity initiative will be positively evaluated by the company’s target group. More specifically,
the manager considers applying for the UTZ label (www.utz.org), which would require a
change in the supply chain. The UTZ label certifies that products have been sourced in a
sustainable manner—from farms to shop shelves. To become certified and eligible to print
the UTZ label on a product’s packaging, all suppliers have to follow a Code of Conduct
with expert guidance on farming methods, working conditions, and care for nature.
Several focus groups have demonstrated that the attributes ‘cocoa content’ and
‘price’ are also relevant for consumers buying chocolate. Thus, the manager considers
these attributes besides the ‘UTZ label’ (Table 9.2). To define the attribute levels, the
manager relies on actual prices of 100 g chocolate bars and common levels of cocoa
content. The example is kept very simple. In actual conjoint analyses, more attributes
and attribute levels are usually taken into account. ◄

In general, several aspects need to be considered when selecting the attributes and attrib-
ute levels for conjoint analyses.

1. The attributes have to be relevant for the respondents’ decisions: Only attributes that
respondents take into account when making decisions should be considered. Focus
groups can be used to identify relevant attributes. If consumers substantially dif-
fer concerning the attributes that are relevant for their decision-making, alternative
approaches to (traditional) conjoint analysis have been developed, such as the adap-
tive conjoint analysis (ACA) (cf. Sect. 9.5).
2. The selected attributes have to be independent of each other: Conjoint analysis
assumes that each attribute level contributes to the total utility independently of the
other attribute levels. A violation of this assumption contradicts the additive model
of conjoint analysis. Moreover, it should be ensured that the attribute levels are

Table 9.2  Attributes and attribute levels considered in the example

Attributes       Attribute levels
Cocoa content    30% of cocoa; 50% of cocoa; 70% of cocoa
UTZ label        Yes; No
Price            0.80 EUR; 1.00 EUR; 1.20 EUR

empirically independent. This means any combination of attribute levels can actually
occur and is not perceived as dependent by the respondent. Especially when consider-
ing characteristics such as brand and price, it is important to ensure that no implausi-
ble stimuli are considered in the survey design.
3. Managers, policymakers or others have to be able to adapt the attributes: To be able
to act on the results of a conjoint analysis, we need to be able to adapt the attributes.
For example, considering brand as an attribute is in conflict with this requirement.
Yet, brand is sometimes included to assess whether products are simply preferred
because of the brand name.
4. The attribute levels have to be realistic and feasible: To be able to act on the results of
the conjoint analysis, managers, policymakers or others need to be able to change the
product design regarding the preferred attribute levels.
5. The individual attribute levels need to be compensatory: Conjoint analysis assumes
that a poor attribute level of one attribute can be compensated by a certain level of
another attribute. For example, an increase in price that usually reduces total utility
can be compensated by an improvement in another, desirable attribute. This require-
ment implicitly assumes a decision-making process in which respondents simultane-
ously evaluate all attributes.
6. The considered attributes and attribute levels are no exclusion criteria: Exclusion cri-
teria exist when certain attribute levels must be present from the perspective of the
consumer. If exclusion criteria occur, the requirement of a compensatory relationship
between the attribute levels is not met.
7. The number of attributes and attribute levels needs to be limited: The effort to evalu-
ate the stimuli grows exponentially with the number of attributes and attribute levels.
It is suggested to consider not more than six attributes with three to four levels each.

When we apply the different requirements to the example, we conclude that we meet all of
them (Table 9.3). Actually, the number of attributes and attribute levels is rather small. In an
actual study, we would probably want to consider additional attributes such as packaging
size or packaging material. However, we keep this example small for illustrative purposes.
Note that once you have decided on the attributes and attribute levels, you may no
longer change or adapt them. Thus, it is of critical importance to carefully select the
attributes and attribute levels.

9.2.2 Design of the Experimental Study




Table 9.3  Do the attributes and attribute levels in the example meet the requirements?

Relevance: The attributes and attribute levels were identified with the help of focus groups.
Independence: The UTZ label is not correlated with price, and there is only a weak correlation between price and cocoa content to be found when inspecting market prices.
Adaptability: A chocolate manufacturer can manipulate all three attributes and attribute levels—at least after making some investment in R&D or changes in the supply chain.
Realistic and feasible: All attribute levels reflect realistic and feasible attribute levels.
Compensatory: Consumers probably accept a higher price for the cocoa content they prefer. The same most likely applies if they have to pay more for a chocolate with a sustainability label.
No exclusion criteria: None of the attributes and attribute levels represents an exclusion criterion.
Limited number of attributes and levels: We consider just 3 attributes with 2 or 3 levels each.

Conjoint analysis is an experimental method. This means we have to develop a survey


design before collecting and analyzing consumers’ preferences. The researcher has to
make two decisions when designing the experimental study:

1. Definition of stimuli: The researcher has to decide how the different stimuli are pre-
sented to the respondents. The fundamental decision is whether the stimuli are pre-
sented with all attributes simultaneously or whether just two attributes are presented
at a time (cf. Sect. 9.2.2.1).
2. Number of stimuli: The number of possible stimuli increases exponentially with the
number of attributes and attribute levels. For example, when considering 3 attributes
with 3 levels each, 27 (= 3³) different combinations of attribute levels (i.e., stimuli)
are possible. If we consider 6 attributes with 3 levels each, 729 (= 3⁶) possible stimuli
exist. To avoid information overload and resulting fatigue effects, it is advisable to
use a reduced design that only takes a subset of all possible stimuli into account (cf.
Sect. 9.2.2.2).

9.2.2.1 Definition of Stimuli
When conducting a conjoint analysis, respondents state their preferences regarding
different combinations of attribute levels (i.e., stimuli). The stimuli can be designed
in two alternative ways: either they are described based on all considered attributes

simultaneously or they are composed of just two attributes. The former approach is
called full-profile method and the latter trade-off method.

Full-profile Method
Table 9.4 shows three exemplary stimuli according to the full-profile method. Each stim-
ulus is described based on all three attributes and the stimuli differ concerning the attrib-
ute levels. Overall, we can describe 18 (=3 · 2 · 3) different stimuli.

Trade-off Method
When using the trade-off method, we need to develop a trade-off matrix for each possible
pair of attributes. With J attributes, we get a total of J·(J − 1)/2 (i.e., “J choose 2”)
trade-off matrices. In our example with J = 3 attributes, this results in three trade-off
matrices (Table 9.5). Each cell of a trade-off matrix forms a stimulus. For example, the
combination A1B1 describes the stimulus with a UTZ label (A1) and 30% of cocoa (B1).
Yet, this stimulus contains no information about the price.

Table 9.4  Definition of stimuli based on the full-profile method (here: three exemplary stimuli)

                Stimulus 1      Stimulus 2      Stimulus 3
Cocoa content   30% of cocoa    70% of cocoa    50% of cocoa
UTZ label       UTZ label       UTZ label       No UTZ label
Price           0.80 EUR        1.00 EUR        1.20 EUR

Table 9.5  Definition of stimuli based on the trade-off method

Trade-off matrix 1                   UTZ label (A)
Cocoa content (B)         UTZ label (A1)    No UTZ label (A2)
30% of cocoa (B1)         A1B1              A2B1
50% of cocoa (B2)         A1B2              A2B2
70% of cocoa (B3)         A1B3              A2B3

Trade-off matrix 2                   UTZ label (A)
Price (C)                 UTZ label (A1)    No UTZ label (A2)
0.80 EUR (C1)             A1C1              A2C1
1.00 EUR (C2)             A1C2              A2C2
1.20 EUR (C3)             A1C3              A2C3

Trade-off matrix 3                   Cocoa content (B)
Price (C)                 30% of cocoa (B1)   50% of cocoa (B2)   70% of cocoa (B3)
0.80 EUR (C1)             B1C1                B2C1                B3C1
1.00 EUR (C2)             B1C2                B2C2                B3C2
1.20 EUR (C3)             B1C3                B2C3                B3C3

Both approaches have advantages and disadvantages related to the following aspects:

• required cognitive effort and time on the respondents’ side,


• realism of the evaluation task,
• occurrence of position effects.

Required Cognitive Effort and Time on the Respondents’ Side


The full-profile method requires that respondents evaluate all considered attributes and
attribute levels at a time. If we consider, for example, six attributes, the simultaneous
evaluation can be cumbersome for the respondents. In contrast, the trade-off method
involves weighing only two attributes against each other. Thus, the cognitive effort
required is relatively low. Yet, the trade-off method involves the evaluation of many
paired comparisons to gather sufficient information about the respondents’ preferences.
Additionally, respondents need to rank the stimuli of each trade-off matrix. The full-pro-
file method instead involves fewer evaluations, thus requiring less time.

Realism of the Evaluation Task


The full-profile method is more realistic than the trade-off method since consumers
usually do not evaluate just two attributes at a time when making decisions. In the first
trade-off matrix, all stimuli are described regarding the UTZ label and cocoa content
only, ignoring the price. In real life, however, consumers likely use price information
for forming their preferences. Thus, it is questionable whether the evaluations of the six
stimuli represented in trade-off matrix 1 are realistic and thus reliable and valid.

Occurrence of Position Effects


When using the full-profile method, a so-called position effect may occur. Consider
Table 9.4, where the attribute ‘cocoa content’ is always mentioned first in the descrip-
tion of the stimuli. Respondents may believe that this attribute is more important and
consequently assign a higher relevance to it when forming their preferences. Thus, the
order in which the attributes are presented may influence their relevance for respond-
ents’ evaluations without reflecting their actual preferences (cf. Kumar & Gaeth, 1991).
Similarly, if respondents perceive the evaluation task as challenging, they may focus on
the attribute(s) mentioned first to reduce their cognitive effort. Thus, the position effect
has a psychological background. One way to counter this effect is to vary the order of the
attributes when presenting the stimuli. Yet, changing the order of the attributes increases
the required cognitive effort on the respondents’ side. With the trade-off method, the
position effect does not occur.

Despite the potential limitations of the full-profile method, it has prevailed due to the
greater realism of the evaluation task. Moreover, pictures of the stimuli may be used

to further enhance the realism of the evaluation task. Because of its great relevance in
research and practice, we use the full-profile method in the following.

9.2.2.2 Number of Stimuli

Full Factorial Design


In our example, 18 (= 3 · 2 · 3) different stimuli are possible. If respondents are asked to
evaluate all 18 stimuli, we are using a so-called full factorial design. However, due to
fatigue effects it is often not feasible to present all possible stimuli to the respondents.
Research shows that consumers can evaluate up to 30 stimuli before fatigue effects and
information overload occur (cf. Green & Srinivasan, 1978). Practical experience shows
that up to 20 stimuli are manageable for most respondents. When considering 3 attributes with 3 levels each, 3³ = 27 stimuli are possible. Frequently, more than just 3 attrib-
utes and 3 attribute levels are considered. Consequently, often a reduced design has to be
developed that considers only a subset of all possible stimuli.

Reduced Design
Drawing a random sample from all possible stimuli is the simplest approach to preparing
a reduced design. However, experimental research suggests several other approaches to
develop reduced designs that are prevalent in research and practice (cf. Kuhfeld et al.,
1994). These approaches systematically select certain stimuli from the set of all possible
stimuli. The basic idea of all these approaches is to find a subset of stimuli that allows
estimating all utility contributions (partworths) unambiguously.
If all attributes have the same number of levels, we refer to symmetric designs while
we refer to asymmetric designs if the number of levels varies across attributes. An advan-
tage of symmetric designs is that reduced designs are relatively easy to develop. A spe-
cial case of a reduced symmetric design is the Latin square, which is briefly described
to illustrate the basic idea of reduced designs. Its application is limited to the case of
exactly three attributes. If each attribute has three levels, the full factorial design comprises 27 (= 3³) stimuli (Table 9.6).
Of those 27 stimuli, nine are selected in such a way that each attribute level is considered exactly once with each level of another attribute (the combinations marked with an asterisk in Table 9.6). Thus, each attribute level is represented exactly three times instead of nine times.
Table 9.7 shows the corresponding Latin square design.
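The selection rule behind such a Latin square can also be written down algorithmically: for every combination of the first two attributes, the level of the third attribute is fixed by the sum of the other two level indices modulo 3. The following short Python sketch (our illustration, not part of the text's SPSS workflow) reproduces the nine combinations of Table 9.7:

# Latin square for 3 attributes (A, B, C) with 3 levels each:
# for every combination of A and B, the level of C is fixed by (a + b) mod 3,
# so each level of C meets each level of A and of B exactly once.
levels = [1, 2, 3]
design = []
for a in levels:
    for b in levels:
        c = ((a - 1) + (b - 1)) % 3 + 1
        design.append((a, b, c))

for a, b, c in design:
    print(f"A{a}B{b}C{c}")   # the nine stimuli of Table 9.7 (listed in a different order)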
As mentioned above, developing a reduced asymmetric design is more challenging
(cf. Addelman, 1962a, b). Fortunately, software packages such as SPSS offer methods
to generate reduced asymmetric designs, thus decreasing the effort for the researcher to
develop such designs manually. For our example, a reduced design with nine stimuli was
generated with the help of SPSS (Table 9.8; cf. Sect. 9.3.2).
Generally, reduced designs should be orthogonal, which means that the attribute lev-
els are independent from each other (i.e., no multicollinearity). Table 9.8 shows that, for

Table 9.6  Full factorial design for the case of 3 attributes with 3 attribute levels each

A1B1C1*   A2B1C1    A3B1C1
A1B1C2    A2B1C2*   A3B1C2
A1B1C3    A2B1C3    A3B1C3*
A1B2C1    A2B2C1    A3B2C1*
A1B2C2*   A2B2C2    A3B2C2
A1B2C3    A2B2C3*   A3B2C3
A1B3C1    A2B3C1*   A3B3C1
A1B3C2    A2B3C2    A3B3C2*
A1B3C3*   A2B3C3    A3B3C3
* Stimuli selected for the Latin square design (cf. Table 9.7)

Table 9.7  Latin square design for the case of 3 attributes with 3 levels each

      A1        A2        A3
B1    A1B1C1    A2B1C2    A3B1C3
B2    A1B2C2    A2B2C3    A3B2C1
B3    A1B3C3    A2B3C1    A3B3C2

Table 9.8  Reduced asymmetric design for the example

Stimulus   Cocoa content (%)   UTZ label   Price (EUR)
1          70                  1           0.80
2          30                  0           1.20
3          70                  1           1.20
4          30                  1           1.00
5          50                  1           1.20
6          70                  0           1.00
7          50                  0           0.80
8          50                  1           1.00
9          30                  1           0.80

example, the level ‘70%’ for the attribute ‘cocoa content’ occurs with and without an
UTZ label and with all three price levels. The same applies to the other levels of the
attribute ‘cocoa content’. Consequently, the reduced design presented in Table 9.8 is
orthogonal. Orthogonal designs ensure that we can calculate the partworths of the differ-
ent attribute levels in the later analysis.
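Whether a given reduced design is orthogonal in this sense can be checked by cross-tabulating its columns. A minimal sketch, assuming Python with pandas (our tooling choice, not prescribed by the text), for the design of Table 9.8:

import pandas as pd

# Reduced design from Table 9.8
design = pd.DataFrame({
    "cocoa": [70, 30, 70, 30, 50, 70, 50, 50, 30],
    "utz":   [1, 0, 1, 1, 1, 0, 0, 1, 1],
    "price": [0.80, 1.20, 1.20, 1.00, 1.20, 1.00, 0.80, 1.00, 0.80],
})

# Every level of one attribute should occur together with every level of the
# other attributes (no empty cells, roughly proportional frequencies).
for col1, col2 in [("cocoa", "utz"), ("cocoa", "price"), ("utz", "price")]:
    print(pd.crosstab(design[col1], design[col2]), "\n")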
However, reduced designs often do not allow for evaluating interaction effects since
reduced designs lead to a loss of information. Thus, reduced designs can only be used
meaningfully if interaction effects are negligible.

9.2.3 Evaluation of the Stimuli

1 Selection of attributes and attribute levels

2 Design of the experimental study

3 Evaluation of the stimuli

4 Estimation of the utility function

5 Interpretation of the utility parameters

After developing the experimental design, we ask respondents to evaluate the selected
stimuli. We can use either metric (rating scale, dollar metric, and constant sum scale)
or ordinal (ranking and paired comparisons) evaluation methods to collect information
about respondents’ preferences (Fig. 9.3). In the following, we describe the different
evaluation methods.

Evaluation Methods Leading to Ordinal Preference Data


When using a ranking scale, respondents simultaneously evaluate all stimuli by ranking
them according to their preferences. The most preferred stimulus is assigned the lowest
number (rank ‘1’), the least preferred stimulus the highest number (rank = K where K is
the number of stimuli).
In contrast, only two stimuli are presented and evaluated by the consumer at a time in
a paired comparison. The respondents compare various pairs so that in the end an order
(i.e., ranking) is derived.
Both evaluation methods lead to ordinal preference data.

Fig. 9.3  Alternative evaluation methods used in conjoint analysis: ordinal methods (ranking, paired comparison) and metric methods (rating, Dollar metric, constant sum)

Evaluation Methods Leading to Metric Preference Data

When using a rating scale, respondents evaluate the stimuli using a numerical scale, with a high value corresponding to a high preference. For example, they rate the attractiveness of diverse stimuli on a scale from 1 (= ‘not attractive at all’) to 10 (= ‘very attractive’). Strictly speaking, the resulting preference values are ordinally scaled, but they are often interpreted as metric.
With the help of the Dollar metric, respondents indicate a ‘$ amount’ either for the mon-
etary difference between stimuli or for the specific value of a stimulus. Naturally, the Dollar
metric scale can only be used if price is not considered an attribute in the conjoint design.
The constant sum scale asks respondents to allocate a constant sum of points (e.g.,
100 points) to the various stimuli. The most preferred stimulus receives the highest num-
ber and the least preferred stimulus receives the lowest number of points.

Assessment of the Different Evaluation Methods


The various evaluation methods can be assessed based on the following aspects:

• information content,
• number of evaluations,
• uniqueness of the evaluations.

Generally, metric evaluation methods (i.e., rating, Dollar metric, constant sum) contain
more information about the respondents’ preferences than ordinal methods (i.e., ranking
or paired comparison), because respondents not only indicate their preferences but also
the strength of their preferences. When using metric evaluation methods, the assigned
preference values can be interpreted as the total utility values of the stimuli. However,
this interpretation is only acceptable if we can assume completeness, reflexivity, and
transitivity of the preferences. A drawback of the metric methods is the relatively low
reliability of the evaluations (cf. Green & Srinivasan, 1978, p. 112), which results from
the fact that respondents are often not able to provide reliable information on the strength
of their preferences.
Mostly, the number of evaluations corresponds to the number of stimuli considered
in the conjoint study. Only the method of paired comparisons requires more evaluations
to ensure sufficient information about the respondents’ preferences. In our example with
nine stimuli, respondents need to make 36 (=9 · (9-1)/2) paired comparisons. These
comparisons need to be consistent to allow the order of the stimuli to be derived (condi-
tion of transitivity). The more evaluations, the higher the chance that respondents are not
able to make consistent assessments. Instead of considering all attributes, respondents
may focus on the attributes that are particularly important to them. One way to address
this issue is to let respondents divide the stimuli into several subgroups according to their
preferences (stacking) and ask for an evaluation among these subgroups.
Ordinal methods result in unique evaluations, whereas metric methods can result in
identical evaluations for several stimuli (so-called ties). On the one hand, metric meth-
ods do not force respondents to place the stimuli in a strict order if they do not perceive
any differences in preference. On the other hand, the assignment of numerous identical
evaluations may indicate that respondents were overburdened with the evaluation task or

Table 9.9  Evaluation of the stimuli (respondent i = 1) based on a rating scale from 1 to 10


(1 = least preferred stimulus, 10 = most preferred stimulus)
Stimulus Cocoa content (%) UTZ label Price (EUR) Rating
1 70 1 0.80 3
2 30 0 1.20 4
3 70 1 1.20 1
4 30 1 1.00 6
5 50 1 1.20 8
6 70 0 1.00 2
7 50 0 0.80 9
8 50 1 1.00 10
9 30 1 0.80 7

did not make any effort to evaluate the stimuli corresponding to their preferences. A high
number of ties may also indicate that the considered attributes and/or attribute levels are
not relevant for a respondent. In any case, the estimation of partworths will be difficult
if many ties occur. If a respondent evaluates all stimuli as equal, it is not possible to esti-
mate the utility contributions at all.
Given the advantages and disadvantages of the different evaluation methods, the rat-
ing and ranking methods prevail in practice because respondents find them easier to use
(cf. Wittink et al., 1994, p. 44). Therefore, we use a rating scale from 1 to 10 for the eval-
uation of the stimuli in our example (Table 9.9).
No ties are observed and the different evaluations provide some evidence for the
strength of preferences. We learn that respondent i = 1 has the highest preference for
chocolate with a cocoa content of 50%, an UTZ label, and a price of 1.00 EUR. Overall,
we recognize that this respondent evaluates stimuli with a cocoa content of 50% most
favorably, followed by stimuli with a cocoa content of 30%. The stimuli with a cocoa
content of 70% are evaluated least favorably. For the attributes ‘UTZ label’ and ‘price’,
the preferences are not so pronounced.

9.2.4 Estimation of the Utility Function

1 Selection of attributes and attribute levels

2 Design of the experimental study

3 Evaluation of the stimuli

4 Estimation of the utility function

5 Interpretation of the utility parameters



After all respondents have evaluated the different stimuli, we can start to determine the
partworths of the attribute levels. In the following, we will first describe the specification
of the utility function that links the utility contributions of the attribute levels to the total
utility value of a stimulus. There are three approaches to do so:

• partworth model,
• vector model and
• ideal point model.

We describe all three approaches before we explain how to derive the utility contribu-
tions of the attribute levels (partworths).

9.2.4.1 Specification of the Utility Function


Conjoint analysis assumes that the total utility of a stimulus results from the utility contributions of the attribute levels. Thus, we assume an additive model:

y_k = \sum_{j=1}^{J} \sum_{m=1}^{M_j} \beta_{jm} \cdot x_{jmk}    (9.1)

with

y_k      total utility of stimulus k
β_jm     utility contribution of level m of attribute j (partworth)
x_jmk    = 1 if level m of attribute j is present for stimulus k, and 0 otherwise

In addition to the utility contributions stemming from the attribute levels, often a constant term is added in the utility function, which reflects a basic utility value:

y_k = \beta_0 + \sum_{j=1}^{J} \sum_{m=1}^{M_j} \beta_{jm} \cdot x_{jmk}    (9.2)

with

β_0      constant term
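To make the notation concrete, the following plain-Python sketch evaluates Eq. (9.2) for one stimulus; the partworth values used here are arbitrary placeholders for illustration, not estimates from the example:

# Additive utility model (Eq. 9.2): total utility = constant term plus the
# partworths of the attribute levels present in the stimulus (x_jmk = 1).
partworths = {                       # placeholder values for illustration only
    ("cocoa", "30%"): 1.0, ("cocoa", "50%"): 2.0, ("cocoa", "70%"): 0.0,
    ("label", "UTZ"): 0.5, ("label", "none"): 0.0,
    ("price", "0.80 EUR"): 1.5, ("price", "1.00 EUR"): 0.8, ("price", "1.20 EUR"): 0.0,
}
beta0 = 0.5                          # constant term (basic utility)

def total_utility(stimulus):
    """stimulus: dict attribute -> level, i.e., the levels with x_jmk = 1."""
    return beta0 + sum(partworths[(attr, level)] for attr, level in stimulus.items())

print(total_utility({"cocoa": "50%", "label": "UTZ", "price": "0.80 EUR"}))  # 0.5 + 2.0 + 0.5 + 1.5 = 4.5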

Partworth Model
The presentation of the utility function in Eq. (9.1) requires an estimation of the util-
ity contribution of each attribute level. As mentioned above, the utility contribution of
an attribute level is called partworth, and thus, the specification of the utility function
in Eq. (9.1) is called partworth model. The partworth model does not assume any rela-
tion between the attribute levels and their utility contributions (Fig. 9.4). Hence, the

partworth model is very flexible. Moreover, it only requires nominally scaled attribute levels.

Fig. 9.4  Illustration of the partworth model (partworths plotted against discrete attribute levels 1, 2, 3)
For metric attributes—such as price—the attribute levels are converted into binary var-
iables. Generally, we have to use M_j – 1 binary (dummy) variables to represent the differ-
ent levels of an attribute. For example, for the attribute ‘price’ with 3 levels—‘0.80 EUR’,
‘1.00 EUR’, and ‘1.20 EUR’, we need two binary variables to represent the different lev-
els. Variable 1 takes on the value 1 if the stimulus has a price of 0.80 EUR, otherwise it
is 0. Variable 2 takes on the value 1 if the stimulus has a price of 1.00 EUR, otherwise
it is 0. If both variables are equal to 0, we know that the stimulus has a price of 1.20
EUR. The price level of 1.20 EUR serves as the reference level. However, it is up to the
researcher to decide what value is used as reference value. In our example, the lowest or
the average price can also serve as reference value. The same logic applies to the attribute
‘cocoa content’. Again, here the highest level (i.e., 70%) serves as the reference level.
Thus, the utility function for our example can be formulated as follows:

y_k = \beta_0 + \underbrace{\beta_{11} \cdot x_{11k} + \beta_{12} \cdot x_{12k}}_{\text{cocoa content}} + \underbrace{\beta_{21} \cdot x_{21k}}_{\text{UTZ label}} + \underbrace{\beta_{31} \cdot x_{31k} + \beta_{32} \cdot x_{32k}}_{\text{price}}

where j = 1 represents the attribute ‘cocoa content’, j = 2 denotes the attribute ‘UTZ
label’, and j = 3 represents the attribute ‘price’.
For stimulus k = 1 in our example, which has a cocoa content of 70%, an UTZ label,
and a price of 0.80 EUR (cf. Table 9.9), we get the following formulation of the utility
function:

y1 = β0 + β11 · 0 + β12 · 0 + β21 · 1 + β31 · 1 + β32 · 0

We aim to estimate the constant term β0 and the partworths βjm in such a way that the
resulting total utility values yk correspond ‘as well as possible’ to the empirically col-
lected evaluations of the stimuli. In the example, respondent i = 1 evaluated stimulus
k = 1 with ‘3’ (on a 10-point scale). We use this rating (stated preference) as a proxy for
the stimulus’ total utility (i.e., y1 = 3).
In total, we need to estimate 6 (= 1 + 2 + 1 + 2) parameters based on 9 observations.
Thus, we have 3 degrees of freedom, which is sufficient to estimate the utility parameters
for each individual respondent.

Vector Model
For metrically scaled attributes, we can alternatively assume a linear relationship
between the attribute levels’ utility contributions and the total utility value (Fig. 9.5). For
the attribute ‘price’, we may assume that the total utility value decreases with increasing
price levels. If we accept such a linear relationship for all attributes, we obtain the fol-
lowing general form of the utility function:
y_k = \beta_0 + \sum_{j=1}^{J} \beta_j \cdot x_{jk}    (9.3)

with

β_j      utility parameter of attribute j
x_jk     value of attribute j for stimulus k

Fig. 9.5  Illustration of the vector model (partworth as a linear function of the attribute level; here: price levels 0.80, 1.00, 1.20 EUR)

If we assume a linear relationship for all attributes, we have to estimate J + 1 parameters including the constant term. Because of its linear form, this model is also called vector model. Yet, when using the vector model, we do not estimate the partworths of the attribute levels straight away. To obtain the utility contribution of an attribute level (i.e., the partworth), we have to multiply the utility parameter β_j with the respective value x_jk of the attribute.
of the attribute.
In our example, a linear relationship between the price levels and the total utility
seems likely. We thus use the vector model for the attribute ‘price’, but the partworth
model for the attributes ‘UTZ label’ and ‘cocoa content’. This leads to the following for-
mulation of the utility function:
yk = β0 + β11 · x11k + β12 · x12k + β21 · x21k + β3 · x3k
  
price

or, more specifically for stimulus k = 1:


y1 = β0 + β11 · 0 + β12 · 0 + β21 · 1 + β3 · 0.80
Now we need to estimate 5 parameters for all attributes, which is 1 parameter less than
in the partworth model. If we assumed the vector model for the attribute ‘cocoa content’
as well, the number of parameters that need to be estimated would decrease to 4, and the
number of the degrees of freedom would increase to 5.

Ideal Point Model


For metrically scaled attributes, we can also assume an ideal point (Fig. 9.6). For example,
we may assume that there is an ideal level of cocoa content—chocolate should be choc-
olaty but not too chocolaty. The ideal attribute level is comparable to a saturation point.
Exceeding or falling short of this ideal attribute level leads to a reduction of the utility
contribution of the attribute level. If we assume that the evaluation function is symmetrical
around the ideal point, we can use a quadratic function to define the ideal point model.
In our example, the attribute ‘cocoa content’ could have an ideal level for consumers.
If we reformulate the utility function accordingly, we get:
y_k = \beta_0 + \underbrace{\beta_1 \cdot \left(x_{1k} - x_{1}^{ideal}\right)^2}_{\text{cocoa content}} + \beta_{21} \cdot x_{21k} + \beta_3 \cdot x_{3k}

In this equation, we use the ideal point model for the attribute ‘cocoa content’, the part-
worth model for the attribute ‘UTZ label’, and the vector model for the attribute ‘price’.

Fig. 9.6  Illustration of the ideal point model (partworth as an inverted U-shaped function of the attribute level; here: cocoa content levels 30, 50, 70)

Implementing the ideal point model requires knowledge about respondents’ ideal
level of an attribute. This information can be difficult to obtain. To address this chal-
lenge, we can alternatively use the partworth model that is also able to capture the idea
that an ideal point exists for consumers.
We have learned from Table 9.9 that respondent i = 1 seems to prefer a cocoa content
of 50% compared to 30% and 70%. It appears that there is an ideal point for this specific
respondent. However, we do not know whether it is actually 50% or some value around
50%. Therefore, we keep the partworth model for the attribute ‘cocoa content’ and pro-
ceed with the following specification of the utility function:
yk = β0 + β11 · x11k + β12 · x12k + β21 · x21k + β3 · x3k ,
Generally, there is a trade-off between the flexibility of the different approaches to spec-
ify the utility function and the number of parameters that need to be estimated. The part-
worth model is the most flexible approach since it does not assume a specific functional
relationship between the partworths and the total utility value. However, the number of
parameters to be estimated for the partworth model increases with the number of levels
considered per attribute in comparison to the other two approaches. Yet, due to its great
flexibility and its ability to take into account attributes measured at different scales, the
partworth model is used quite often (cf. Green et al., 2001, p. 59). Thus, we may want to
start with a specification of the utility function that is as flexible as possible (i.e., part-
worth model for all attributes). If we learn from later analyses that a linear relationship

can be assumed for some (or even all) attributes, we can still change the specification of
the utility function and re-run the analysis. Note that we usually assume the same specifi-
cation of the utility function for all respondents.

9.2.4.2 Estimation of Utility Parameters


After we have specified the link between the utility contribution of the attribute levels
and a stimulus’ total utility value, we can estimate the utility parameters. Commonly, a
regression analysis is used to estimate the parameters of the utility function. We will not
elaborate on the estimation procedure here, but refer the reader to Chap. 2.2.2, where the
estimation of a regression function is explained in detail.
In the following, we discuss the results of the partworth model and a mixed model
(i.e., partworth model for the attributes ‘cocoa content’ and ‘UTZ label’ and vector
model for the attribute ‘price’). The evaluations of the different stimuli serve as depend-
ent variables yk. The independent variables are the variables that reflect the attribute
levels. If we use the partworth model, the independent variables are binary (dummy
variables). Table 9.10 shows the corresponding coding of the independent variables for
that model. Since all independent variables are binary variables, we actually conduct a
dummy regression. In the example, we need to estimate 6 parameters (including a con-
stant term).
An estimation of the utility parameters using regression analysis requires that the
empirically collected evaluations of the stimuli are interpreted as the total utility values
of the stimuli. This implies that the evaluations of the stimuli are considered as met-
ric. Although we do not get metric evaluations of the stimuli when using non-metric
approaches, it is possible to transform them. Thus, the estimation of utility parameters
via regression has also prevailed for the ranking scale (cf. Wittink et al., 1994, p. 46). If

Table 9.10  Coding of the variables to estimate the utility parameters (partworth model)

Stimulus   Cocoa content = 30%   Cocoa content = 50%   UTZ label   Price = 0.80 EUR   Price = 1.00 EUR   Rating
           (x11k)                (x12k)                (x21k)      (x31k)             (x32k)             (yk)
1          0                     0                     1           1                  0                  3
2          1                     0                     0           0                  0                  4
3          0                     0                     1           0                  0                  1
4          1                     0                     1           0                  1                  6
5          0                     1                     1           0                  0                  8
6          0                     0                     0           0                  1                  2
7          0                     1                     0           1                  0                  9
8          0                     1                     1           0                  1                  10
9          1                     0                     1           1                  0                  7

Table 9.11  Estimated utility parameters (partworths; partworth model)

                        Parameter   p-value (significance)
Constant term           0.22        0.53
Cocoa content = 30%     3.67        0.00
Cocoa content = 50%     7.00        0.00
UTZ label               0.83        0.05
Price = 0.80 EUR        2.00        0.01
Price = 1.00 EUR        1.67        0.01

consumers ranked the stimuli and the most preferred stimulus has rank 1, we first have
to recode the ranking in such a way that the most preferred stimulus receives the highest
value.1 This is achieved by subtracting the observed rank from the maximum value for
the rank + 1. For example, if nine stimuli have been ranked, then the most preferred stim-
ulus receives a value of 9 (= 9 – 1 + 1), and the least preferred stimulus receives a value
of 1 (= 9 – 9 + 1).
In our example, we measured the preferences using a rating scale and we can thus
use the ratings as dependent variables. Table 9.11 shows the estimated utility parame-
ters (partworths) based on a regression analysis. The parameters for the attribute levels
‘cocoa content = 30%’ and ‘cocoa content = 50%’ are both positive. The attribute level
‘cocoa content = 70%’ served as the reference level, and thus, has a utility parameter
(partworth) of 0. Accordingly, the level of ‘50%’ has the highest partworth and is the
preferred level of the attribute ‘cocoa content’ for respondent i = 1. This result sug-
gests that respondent i = 1 likes some bitterness but not too much bitterness. This result
is in line with our previous conjecture. Moreover, we learn that the respondent prefers
an UTZ label compared to no UTZ label. Additionally, the respondent prefers a lower
price to a higher price. The level ‘1.20 EUR’ is the benchmark and has a parameter of 0.
Consequently, the respondent prefers a price of 0.80 EUR over a price of 1.00 EUR and
a price of 1.20 EUR.
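The figures in Table 9.11 can be reproduced with any OLS routine. Below is a minimal sketch using Python with numpy (our tooling choice; the text itself works with Excel and SPSS), applied to the dummy-coded data of Table 9.10:

import numpy as np

# Dummy-coded design matrix from Table 9.10; columns: constant, cocoa = 30%,
# cocoa = 50%, UTZ label, price = 0.80 EUR, price = 1.00 EUR
X = np.array([
    [1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 0],
])
y = np.array([3, 4, 1, 6, 8, 2, 9, 10, 7])       # ratings of respondent i = 1 (Table 9.9)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # ordinary least squares
names = ["constant", "cocoa=30%", "cocoa=50%", "UTZ label", "price=0.80", "price=1.00"]
for name, b in zip(names, beta):
    print(f"{name:<12s} {b:6.2f}")               # should be close to the values in Table 9.11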
For illustrative purposes, we also estimate the utility parameters for the mixed model.
We use the partworth model for the attributes ‘cocoa content’ and ‘UTZ label’, and the
vector model for the attribute ‘price’. Table 9.12 shows the corresponding coding of the
independent variables. Now we have to estimate 5 parameters including the constant
term.
Table 9.13 shows the results of the regression analysis for the mixed model. When
using the vector model for ‘price’, we estimate a negative parameter that indicates that
the respondent prefers a lower price to a higher price. The partworths for the different

1 This comment applies to an estimation of the utility function in, for example, Excel (see www.multivariate-methods.info). If you use SPSS for your analysis, SPSS does the recoding for you.

Table 9.12  Coding of the variables to estimate the utility parameters (mixed model)

Stimulus   Cocoa content = 30%   Cocoa content = 50%   UTZ label   Price   Rating
1          0                     0                     1           0.80    3
2          1                     0                     0           1.20    4
3          0                     0                     1           1.20    1
4          1                     0                     1           1.00    6
5          0                     1                     1           1.20    8
6          0                     0                     0           1.00    2
7          0                     1                     0           0.80    9
8          0                     1                     1           1.00    10
9          1                     0                     1           0.80    7

Table 9.13  Estimated utility parameters (mixed model)

                        Parameter   p-value (significance)
Constant term           6.44        0.01
Cocoa content = 30%     3.67        0.00
Cocoa content = 50%     7.00        0.00
UTZ label               0.83        0.11
Price                   –5.00       0.01

price levels are now – 4 (= – 5·0.80) for a price of 0.80 EUR, – 5 for a price of 1.00 EUR,
and – 6 for a price of 1.20 EUR. Overall, the findings are the same.

9.2.4.3 Assessment of the Estimated Utility Function


To decide which specification of the utility function is better suited to map respondent’s
(i = 1) preferences, we assess the validity of the partworth and the mixed model. We can
assess the validity of the estimated utility functions based on the following criteria:

• plausibility of the estimated utility parameters,


• goodness-of-fit of the model,
• predictive validity.

Plausibility of the Estimated Utility Parameters


To assess the plausibility of the estimated utility parameters, we can assess whether the
sign of the estimated parameters is in line with a priori expectations or not (face valid-
ity). In the example, the estimated utility parameters for price correspond with a priori
expectations since we expected that respondents prefer low prices. Moreover, it seems

likely that a sustainability label is evaluated positively. For the attribute ‘cocoa content’,
we can think of all kinds of relationships (decreasing, increasing, ideal point). Yet, exam-
ining the evaluations suggested that the respondent has an ideal point related to cocoa
content. Thus, the results of the partworth model seem plausible.
Additionally, the significance of the utility parameters indicates whether the attributes
and levels have been relevant for the respondents. Yet, the small number of the degrees
of freedom often leads to large standard errors, and thus, insignificant parameters.
Therefore, insignificant parameters have to be treated with care. In the example, all
parameters are significant when using the partworth model (Table 9.11), while the util-
ity parameter for ‘UTZ label’ is not significant at the 5% level (p = 0.11) when imple-
menting the mixed model. Thus, the two different utility functions lead to contradictory
results regarding the relevance of the sustainability label.

Goodness-of-fit of the Model


When assessing the goodness-of-fit of an estimated utility function, we check whether
the observed total utility values correspond to the estimated total utility values.
Correlation-based measures such as Pearson’s correlation or Kendall’s tau are used for
this assessment. Pearson’s correlation is appropriate if the observed evaluations are
metrically scaled (i.e., ratings). In contrast, Kendall's tau is designed for ranking data.
Kendall's tau compares the estimated order to the observed order of the stimuli. The sign
of Kendall's tau indicates the direction of the relationship, whereas the absolute value
shows the strength of the relationship. As for Pearson’s correlation coefficient, the value
ranges between – 1 and + 1, and we aim for a value close to + 1.
Since we examine rating data, we use Pearson’s correlation to assess the goodness-
of-fit of the two estimated models. For the partworth model, Pearson’s correlation equals
0.998. For the mixed model, Pearson’s correlation is slightly lower with a value of 0.992.
Yet, both values are very high and indicate a good fit of the two models. In the following,
we focus on the mixed model since we would like to elaborate on the specificity of the
interpretation of the vector model for price.
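The reported correlation can be recomputed from the observed ratings and the fitted total utilities. A sketch with numpy (it uses the rounded partworths of Table 9.11, so the result may deviate slightly from the reported 0.998):

import numpy as np

# Observed ratings (Table 9.9) and fitted total utilities of the partworth model
# (constant 0.22; cocoa 30% = 3.67, 50% = 7.00; UTZ = 0.83; price 0.80 = 2.00, 1.00 = 1.67)
observed = np.array([3, 4, 1, 6, 8, 2, 9, 10, 7])
fitted = np.array([
    0.22 + 0.00 + 0.83 + 2.00,   # stimulus 1: 70% cocoa, UTZ label, 0.80 EUR
    0.22 + 3.67 + 0.00 + 0.00,   # stimulus 2: 30% cocoa, no label, 1.20 EUR
    0.22 + 0.00 + 0.83 + 0.00,   # stimulus 3
    0.22 + 3.67 + 0.83 + 1.67,   # stimulus 4
    0.22 + 7.00 + 0.83 + 0.00,   # stimulus 5
    0.22 + 0.00 + 0.00 + 1.67,   # stimulus 6
    0.22 + 7.00 + 0.00 + 2.00,   # stimulus 7
    0.22 + 7.00 + 0.83 + 1.67,   # stimulus 8
    0.22 + 3.67 + 0.83 + 2.00,   # stimulus 9
])

print(round(np.corrcoef(observed, fitted)[0, 1], 3))   # Pearson's r, close to 0.998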

Predictive Validity
For assessing the predictive validity, we can use a holdout sample. A holdout sam-
ple consists of stimuli that are evaluated by the respondents, but are not included in the
estimation of the parameters. While considering holdout stimuli allows us to assess the
predictive validity, one reason not to use holdout stimuli is that they increase the num-
ber of evaluations a respondent has to make. Thus, the cognitive effort increases for the
respondents, which can negatively affect the reliability and validity of the evaluations.
If we use holdout stimuli, we predict the total utility values for these stimuli using
the estimated utility parameters (partworths). These values are then compared to the
observed total utility values. We can again use a correlation measure to assess the pre-
dictive validity. Alternatively, we can predict which stimulus in the holdout sample is
the most preferred one (first choice) and assess whether we also observe the highest total

utility value for this specific stimulus. If this is the case, we observe a so-called ‘hit’. The
percentage of first-choice hits in the holdout sample can serve as a measure of predictive
validity. In our example, we did not consider any holdout stimuli, and we refer the reader
to Sect. 9.3.3 for a more detailed discussion.

9.2.5 Interpretation of the Utility Parameters

1 Selection of attributes and attribute levels

2 Design of the experimental study

3 Evaluation of the stimuli

4 Estimation of the utility function

5 Interpretation of the utility parameters

In a final step, we discuss what insights and implications we can derive from the esti-
mated utility parameters. First, we elaborate on the insights regarding the preference
structure of individual respondents. We also discuss how to obtain the relative impor-
tance of attributes. Second, we discuss how to compare the results across respondents
and how to use the findings to predict consumer behavior to support decisions by manag-
ers, policymakers, etc.

9.2.5.1 Preference Structure and Relative Importance of an Attribute


The estimated utility parameters allow us to derive the ‘optimal’ product for an indi-
vidual respondent. Independent of the specification of the utility function (i.e., part-
worth model or mixed model), chocolate with 50% cocoa, an UTZ label, and a price
of 0.80 EUR is the product with the highest utility for respondent i = 1. The esti-
mated utility equals 10.05 (= 0.22 + 7.00 + 0.83 + 2.00) for the partworth model and
10.27 (= 6.44 + 7.00 + 0.83 + (−5 ⋅ 0.80)) for the mixed model. A cocoa content of 50%
delivers the highest contribution to the estimated utility with a partworth of 7.00 in both
models.
Yet, it should be noted that the absolute level of the utility parameters does not indi-
cate the relative importance of an attribute. For example, if an attribute has consistently
high partworths for all levels compared to another attribute, it cannot be concluded that
this attribute is more important for the respondent than another attribute with lower
partworths. Rather we need to inspect the change in total utility when we change the
attribute level (cf. Vriens et al., 1998). Thus, the range of partworths for an attribute is
decisive for its importance. The range is the difference between the highest and the low-
est partworth of an attribute. For example, the range for the attribute ‘cocoa content’ is 7
(= 7.00 – 0). If the range is large, a significant change in total utility can be achieved by
varying the focal attribute. If the range of an attribute is set in relation to the sum of the
ranges across all attributes, we obtain the relative importance of an individual attribute:

w_j = \frac{\max_m \left(b_{jm}\right) - \min_m \left(b_{jm}\right)}{\sum_{j=1}^{J} \left[\max_m \left(b_{jm}\right) - \min_m \left(b_{jm}\right)\right]}    (9.4)

In the example, the attribute ‘cocoa content’ is the most important attribute for the
respondent i = 1 (Table 9.14). The relative importance equals 71.2% (or 0.712 =
7.00/9.83) when considering the mixed model. A change in cocoa content leads to a sub-
stantial change in the total utility value. In contrast, the attribute ‘UTZ label’ is the least
important one (relative importance = 8.5% or 0.085).
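Eq. (9.4) is easy to evaluate once the partworths of all levels are available (for the vector model, the price partworths follow from multiplying the price parameter with the price levels: –4, –5, –6). A plain-Python sketch for the mixed model:

# Relative importance (Eq. 9.4): range of an attribute's partworths divided by
# the sum of the ranges over all attributes (mixed model, respondent i = 1).
partworths = {
    "cocoa content": [0.00, 3.67, 7.00],     # 70% (reference), 30%, 50%
    "UTZ label":     [0.00, 0.83],           # no label (reference), UTZ label
    "price":         [-6.00, -5.00, -4.00],  # 1.20, 1.00, 0.80 EUR
}

ranges = {attr: max(values) - min(values) for attr, values in partworths.items()}
total_range = sum(ranges.values())
for attr, rng in ranges.items():
    print(f"{attr:<14s} range {rng:4.2f}  importance {rng / total_range:.3f}")
# roughly 0.71, 0.08 and 0.20, as in Table 9.14 (small deviations due to rounding)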
Note that the relative importance of an attribute may depend on the number of levels
of an attribute (number-of-levels effect) and the range of that attribute (bandwidth effect)
(Verlegh et al., 2002).
The so-called number-of-levels effect occurs when an increase in the number of levels
of an attribute—while holding the range of levels constant—results in a higher attrib-
ute importance. For instance, we considered three levels for the attribute ‘price’ rang-
ing from 0.80 EUR to 1.20 EUR. If we increased the number of levels to five but kept
the range constant (i.e., 0.80 EUR, 0.90 EUR, 1.00 EUR, 1.10 EUR, 1.20 EUR), the
attribute ‘price’ becomes more important for respondents, simply because there are more
levels and respondents pay more attention to changes in price. Moreover, a larger range
(bandwidth) of attribute levels can also lead to a higher importance of that attribute (e.g.,
if we increased the price range from 0.80 EUR to 1.50 EUR). Thus, the researcher needs
to be careful when deciding on the number of levels and attribute ranges right from the
start of a conjoint analysis (cf. Sect. 9.2.1).

9.2.5.2 Standardization of Utility Parameters


So far, we have focused on the analysis of individual preference data. If we analyze the
preference of a sample of respondents, it is not possible to compare the estimated utility
parameters directly since the respondents may have used different subjective scales to
evaluate the stimuli. To make inter-individual comparisons, we need to transform and
standardize the individually estimated utility parameters.

Table 9.14  Relative importance of attributes (mixed model)

                Minimum estimated   Maximum estimated   Range   Relative importance
                partworth           partworth
Cocoa content   0.00                7.00                7.00    0.712
UTZ label       0.00                0.83                0.83    0.085
Price           –6.00               –4.00               2.00    0.203
Sum                                                     9.83    1.000

We can transform and standardize the individual utility parameters in such a way that
the estimated utility parameters for all respondents share the same ‘zero point’. Usually,
we set the lowest partworth of an attribute equal to zero. Thus, in a first step the differ-
ence between the individual partworths and the lowest partworth of the corresponding
attribute is computed:

b^{*}_{jm} = b_{jm} - b_{j}^{min}    (9.5)

with

b_jm      estimated partworth of level m of attribute j
b_j^min   estimated minimum partworth across all levels of attribute j

Equation (9.5) applies to the partworth model. If a vector model was used, we first need
to compute the partworths by multiplying the utility parameters with the respective val-
ues for the attribute. For the mixed model, we thus obtain for the attribute ‘price’ the
following values for b*_jm: 2 for a price of 0.80 EUR, 1 for a price of 1.00 EUR, and 0 for a price of 1.20 EUR.
In a second step, we consider the maximum total utility for an individual respondent.
The maximum total utility value is the sum of the most preferred attribute levels. In the
example, the respondent prefers a cocoa content of 50%, an UTZ label, and a price of
0.80 EUR. Such a chocolate has a total utility of 16.27 (= 6.44 + 7.00 + 0.83 + 2.00) when
considering the mixed model. We use the maximum total utility to standardize the scales
by setting the total utility of the most preferred stimulus to 1. Doing so results in the
standardized partworths:


b^{std.}_{jm} = \frac{b^{*}_{jm}}{b_0 + \sum_{j=1}^{J} \max_m \left(b^{*}_{jm}\right)}    (9.6)

Table 9.15 shows the resulting standardized partworths for respondent i = 1. If we sum
up the partworths of the most preferred attribute levels and consider the constant term,
we get a value of 1. If we standardize the utility parameters (partworths) according to
Eq. (9.6), we can compare the results from various individual analyses.
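The two transformation steps can be scripted directly. The sketch below (plain Python, mixed-model estimates of respondent i = 1) reproduces the values of Table 9.15 up to rounding:

# Standardization of partworths (Eqs. 9.5 and 9.6), mixed model, respondent i = 1
b0 = 6.44
partworths = {
    "cocoa content": {"30%": 3.67, "50%": 7.00, "70%": 0.00},
    "UTZ label":     {"yes": 0.83, "no": 0.00},
    "price":         {"0.80": -4.00, "1.00": -5.00, "1.20": -6.00},  # -5.00 * price level
}

# Eq. (9.5): shift each attribute so that its lowest partworth becomes zero
shifted = {attr: {lvl: b - min(lv.values()) for lvl, b in lv.items()}
           for attr, lv in partworths.items()}

# Eq. (9.6): divide by the maximum attainable total utility
max_total = b0 + sum(max(lv.values()) for lv in shifted.values())
standardized = {attr: {lvl: round(b / max_total, 3) for lvl, b in lv.items()}
                for attr, lv in shifted.items()}

print(round(max_total, 2))    # 16.27, as in the text
print(standardized)           # e.g., cocoa 50% -> 0.430, UTZ label -> 0.051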

9.2.5.3 Aggregated Utility Parameters


The researcher is often interested in aggregated utility parameters across all respondents
or groups (segments) of respondents. Thus, we may want to derive some aggregated util-
ity parameters. There are two basic approaches to obtain aggregated results from con-
joint analyses:

Table 9.15  Standardized partworths (mixed model)

                       Estimated partworth   Transformed partworth   Standardized partworth
                       b_jm                  b*_jm                   b^std._jm
Constant term          6.44                                          0.396
Cocoa content = 30%    3.67                  3.67                    0.226
Cocoa content = 50%    7.00                  7.00                    0.430
UTZ label              0.83                  0.83                    0.051
Price = 0.80 EUR       –4.00                 2.00                    0.123
Price = 1.00 EUR       –5.00                 1.00                    0.061

• Using the evaluations of individual respondents to conduct individual analyses. The


estimated individual utility parameters are subsequently aggregated to the desired
level (across all or groups of respondents).
• Using the evaluations of several respondents and analyzing these evaluations jointly.
In this case, we do not estimate a regression function for each respondent but for all
or a group of respondents jointly.

Aggregating the Results of Individual Analyses


The first approach conducts N individual analyses to derive the utility parameters for
each individual respondent. Then, the individual utility parameters are standardized (cf.
Sect. 9.2.5.2). Finally, we compute the mean value of each utility parameter across the
respondents. If we assume, for example, that women and men differ in their preferences
when it comes to chocolate, we can split the sample according to gender and compute
the mean utility parameters for men and women separately. Thus, we can use predefined
groups and compute the utility parameters for each of the groups based on the individual
partworths. Alternatively, we can use the standardized utility parameters as segmentation
variables for a cluster analysis (cf. Chap. 8 ). In this case, we identify groups with similar
preferences. Later we can use describing variables such as demographics, psychograph-
ics, etc. to describe the groups (clusters).

Joint Conjoint Analysis


We can also use all evaluations together and estimate a regression function based on all
observations jointly. Let us assume that we observe a sample of 100 respondents and
each respondent evaluated the nine stimuli described in Table 9.8. Thus, we have 900
instead of only nine observations to estimate the utility parameters.
Yet, an estimation of the utility parameters across all respondents usually results in less
accurate estimates for an individual respondent. This is because we ignore heterogeneity
in respondents’ preferences. We lose information by jointly estimating the utility parame-
ters and we might get biased utility parameters. However, we have now more observations

to estimate the same number of utility parameters. Thus, we have more degrees of free-
dom, which results in more efficient estimates of the utility parameters. Therefore, we
have to check carefully whether a joint estimation leads to valid results. For example, we
can first conduct individual analyses to test whether the standardized utility parameters
vary substantially across respondents. If this is the case, there is a high degree of hetero-
geneity and we should not run a joint estimation but rather use a cluster analysis to iden-
tify homogeneous groups, that is, groups with similar preference structures (cf. Chap. 8).

9.2.5.4 Simulations Based on Utility Parameters


Often the researcher is interested in using the results of a conjoint analysis to predict the
total utility for objects that were not part of the experimental design. We can compute the
total utility for any possible combination of attribute levels based on the estimated utility
parameters.
Assume we want to know whether consumers prefer chocolate with no UTZ label for
1.00 EUR or chocolate with a UTZ label for 1.20 EUR—both with a cocoa content of
50%. The first alternative was not part of the experimental design, while the second alter-
native was evaluated by respondent i = 1 (k = 5; rating = 8).
Using the estimated utility parameters from the mixed model leads to the following
predicted total utility values:

• Utility_1 (50% cocoa, no UTZ label, 1.00 EUR) = 8.44


• Utility_2 (50% cocoa, UTZ label, 1.20 EUR) = 8.27

The respondent prefers the first alternative with no UTZ label for a price of 1.00 EUR
compared to the second alternative with an UTZ label for 1.20 EUR. Yet, the difference
in estimated utilities is rather small.
Based on these results, we can predict consumers’ buying behavior. There are three
basic approaches to predict consumer behavior:

• First-choice rule (also called maximum-utility rule),


• Probabilistic choice rule (also called Bradley-Terry-Luce (BTL) rule), and
• Logit rule.

Since conjoint analysis makes no assumptions about how the total utility values are
linked to actual choice behavior, we need to specify a choice rule.

First-choice Rule
The first-choice rule predicts that consumers will ‘for sure’ choose the product with
the highest utility. Thus, we assign a choice probability of 100% to the alternative with
the highest total utility; the other alternative receives a choice probability of 0%. If two
alternatives have exactly the same total utility value, the choice probability is distributed

equally across these stimuli (i.e., 50%). In our example, we predict that the consumer
chooses alternative 1 with a choice probability of 100%, although we may wonder
whether the world is really only black or white (0/1).

Probabilistic Choice Rule


Alternatively, we can assume a probabilistic choice (Bradley-Terry-Luce (BTL) rule). The probabilistic choice rule takes into account that consumers choose a product only with a certain probability. The respective total utility value of the product is divided by the sum of the total utility values of all considered alternatives:

P_{ik^*} = \frac{u_{ik^*}}{\sum_{k^*=1}^{K^*} u_{ik^*}}    (9.7)

with

Pik* probability that consumer i chooses alternative k*


uik* total utility of alternative k* for consumer i

In the example, the choice probability for the first alternative equals 0.505 (=
8.44/16.71) or 50.5%, while the choice probability for the second alternative is 0.495 or
49.5%.

Logit Rule
A variation of the probabilistic choice rule is the logit rule that relies on the following
equation to compute the choice probability:
P_{ik^*} = \frac{\exp\left(u_{ik^*}\right)}{\sum_{k^*=1}^{K^*} \exp\left(u_{ik^*}\right)}    (9.8)

The logit rule implies an s-shaped relationship between the utility and the choice prob-
ability (cf. Fig. 9.25 ). In our example, the choice probabilities are 54% (= exp(8.44)/
(exp(8.44) + exp(8.27))) for the first alternative and 46% for the second alternative. The
choice probabilities do not differ substantially since the total utility values are rather sim-
ilar. If we observe a large difference in the total utility values, the logit rule converges to
the first-choice rule.
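The three rules can be compared side by side for the two alternatives above (total utilities 8.44 and 8.27). A sketch using only Python's standard library:

import math

utilities = [8.44, 8.27]   # alternative 1 (no UTZ label, 1.00 EUR), alternative 2 (UTZ label, 1.20 EUR)

# First-choice rule: all probability mass on the alternative(s) with the highest utility
best = max(utilities)
first_choice = [1.0 / utilities.count(best) if u == best else 0.0 for u in utilities]

# BTL rule (Eq. 9.7): utility share
btl = [u / sum(utilities) for u in utilities]

# Logit rule (Eq. 9.8): share of the exponentiated total utilities
denom = sum(math.exp(u) for u in utilities)
logit = [math.exp(u) / denom for u in utilities]

print(first_choice)                      # [1.0, 0.0]
print([round(p, 3) for p in btl])        # about [0.505, 0.495]
print([round(p, 3) for p in logit])      # about [0.542, 0.458]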
Remember that conjoint analysis does not imply a specific choice rule. The researcher
has to decide which rule to apply to predict consumer behavior. Since consumers usu-
ally make choices, conjoint analysis has been criticized for being far from reality. As
an answer to this criticism, the choice-based conjoint (CBC) analysis was developed.
Because of its relevance in research and practice, we discuss this variant of the tradi-
tional conjoint analysis separately in Sect. 9.4.

9.3 Case Study

9.3.1 Problem Definition

We now use a larger sample and a slightly different research question to demonstrate
how to conduct a conjoint analysis with the help of SPSS. The manager of the chocolate
manufacturer actually knows a lot about consumer preferences when it comes to choc-
olate bars. Yet, the chocolate company considers introducing a range of chocolate truf-
fles. The manager visited a confectionary trade fair and organized several focus groups.
Table 9.16 shows the attributes and attribute levels that seem to be relevant for consum-
ers when buying chocolate truffles.
In total, we could create 324 (=3·3·2·2·3·3) stimuli. In the following, we describe
how to use SPSS to generate a reduced design and analyze the collected data.

9.3.2 Conducting a Conjoint Analysis with SPSS

To conduct a conjoint analysis with SPSS, several steps are necessary. First, we generate
the experimental design. Second, we create a data file that contains the evaluations of the
different stimuli. Third, we estimate the utility parameters for each individual respondent
and the sample.

Table 9.16  Attributes and attribute levels considered in the case study


Attribute Attribute level
Types of flavor • Fruity
• Nutty
• Mixed
Size of truffles •5g
• 10 g
• 15 g
Superfoods • Containing superfoods (carob and lucuma)
• Not containing superfoods
Filling • Creamy
• Liquid
Packaging • Individually wrapped in foil
• Individually wrapped in paper
• Not individually wrapped
Price (150 g box) • 5.99 EUR
• 6.99 EUR
• 7.99 EUR

Generating a Reduced Design with the Procedure ORTHOPLAN


To create a reduced design with SPSS, we go to ‘Data/Orthogonal Design/Generate’ to
start the procedure ORTHOPLAN (Fig. 9.7).
A dialog box opens (Fig. 9.8) that asks us to enter the name and label of the first
attribute (here: ‘flavor’). The ‘Factor Name’ has to correspond to the SPSS naming con-
ventions. That is, the name must start with a letter and it may have up to 8 characters
without any blanks. The ‘Factor Label’ can be chosen freely. In our example, the first
attribute is ‘flavor’. Once the factor name and label are entered, click on ‘Add’.
Click on the attribute to define the attribute levels (cf. Fig. 9.9; ‘Define Values’). A
new dialog box opens and we can enter the different values for the attribute levels and
the corresponding labels. We assign the values 1 to 3 to the three attribute levels for the
attribute ‘flavor’. The labels are ‘fruity’, ‘nutty’, and ‘mixed’. We click on ‘Continue’
and enter the other attributes and their levels following the same procedure.
If the attributes are metric in nature such as the attributes ‘size’ and ‘price’, you can
also use labels that represent the actual levels (e.g., 5 = 5 g or 5.99 = 5.99 EUR).
After we have entered all attributes and attribute levels, we can define a set of holdout
stimuli. To do so, we go to ‘Options’ (Fig. 9.9, left side) and enter the number of holdout
stimuli (Fig. 9.10). In our example, we consider four holdout stimuli.
To generate the reduced design, we click on ‘Create new data file’ to save the gener-
ated design (Fig. 9.9). SPSS tells us that 16 cards (i.e., stimuli) are generated. This is the
minimum number of stimuli necessary to estimate the utility contributions of the attrib-
ute levels. That means, instead of the 324 possible stimuli, the reduced orthogonal design
consists of 16 stimuli only, which is a substantial reduction. Moreover, four stimuli are
generated that represent the holdout stimuli.
We open the newly generated data file and go to ‘View’ to activate ‘Value Labels’
(Fig. 9.11). Instead of the numerical value for the attribute levels, the level labels will
be displayed (Fig. 9.12). SPSS uses the full-profile method. Hence, each stimulus is
described based on all considered attributes.
Figure 9.12 shows the generated reduced orthogonal design. The column ‘STATUS’
indicates whether a stimulus is part of the reduced design (label: Design, value: 0) or a
holdout stimulus (label: Holdout, value: 1).
The researcher can also define stimuli for simulation purposes. These stimuli are indi-
cated by a value of 2 in the column ‘STATUS’. The respondents do not evaluate the stim-
uli that are used for simulation (numbering starts again at 1). SPSS computes the total
utility of the simulation stimuli using the estimated utility parameters. In our example,
we define two alternatives for simulation:

• Alternative 1: mixed mini truffles (5 g) that contain superfoods and have a creamy filling.
The truffles are not wrapped individually and will be offered for a price of 6.99 EUR.
• Alternative 2: fruity truffles with a medium size (10 g) that contain a liquid filling but
no superfoods, are individually wrapped in paper and will be offered for 7.99 EUR.

Fig. 9.7 Generating a reduced design in SPSS

These two alternatives have been added to the file and appear in rows 21 and 22 in Fig. 9.12.

Creating a Data File Containing the Evaluations of the Stimuli


We can now use the generated stimuli in a survey to gather information about consumer
preferences. In our example, we collected data from 41 respondents. The respondents’
evaluations are entered into a new SPSS file. Each row contains the evaluations of one
respondent (Fig. 9.13). There are three options for entering the evaluations.

Fig. 9.8 Entering the attributes for an orthogonal design

Fig. 9.9 Entering the attribute levels for an orthogonal design

1. We can ask the respondents to order the stimuli from most preferred to least preferred.
In this case, the first column in the SPSS data file contains the respondent’s ID; the
second column contains the number of the most preferred stimulus, and so forth.
2. We can ask the respondents to rank the stimuli. A lower rank implies greater prefer-
ence. In this case, the second column of the SPSS data file contains the rank of stimu-
lus k = 1, and so forth.

Fig. 9.10 Dialog box: Options

Fig. 9.11 Displaying the labels of the attribute levels



Fig. 9.12 Reduced orthogonal design

Fig. 9.13 Excerpt of the data file showing the respondents’ preferences (ranking)

3. We can ask the respondents to use a metric scale to evaluate the stimuli. A higher
score implies greater preference, so the second column of the SPSS data file contains
the value assigned to stimulus k = 1.

In our example, we used the ranking method for evaluating the stimuli (option 2). The
respondents were asked to rank the stimuli according to their preferences. The most pre-
ferred stimulus received the lowest rank (i.e., rank = 1). Each person evaluated 20 stimuli
(including four holdout stimuli). Figure 9.13 shows an excerpt of the SPSS data file.

Estimating the Utility Parameters with the Procedure CONJOINT


The procedure CONJOINT is not integrated into the graphical menu structure of SPSS
so that we need to use a syntax file to start the procedure (‘File/New/Syntax’). At this
point, we first explain the most important options of the SPSS command CONJOINT.
We provide the specific command structure of the CONJOINT procedure used for the
example in Sect. 9.3.4.
When using the CONJOINT procedure, we first need to indicate which files con-
tain the experimental design and the stimuli evaluations (cf. Sect. 9.3.4). Then we have
to define the relation between the partworths and total utility values (cf. Sect. 9.2.4.1).
In our example, we use the partworth model for the attributes ‘flavor’, ‘size of truffle’,
‘superfoods’, ‘filling’, and ‘packaging’ (SPSS command ‘DISCRETE’). For the attrib-
ute ‘price’, we assume a linear relationship and thus use the vector model. In SPSS, we
further need to specify whether we assume a positive (SPSS command ‘LINEAR MORE’)
or negative (SPSS command ‘LINEAR LESS’) linear relationship. Since we expect that
a lower price is preferred over a higher price, we here specify a negative linear relation-
ship. To specify the link between the utility parameters and the total utility value, we use
the subcommand ‘FACTORS’:

/FACTORS = flavor (DISCRETE) size (DISCRETE) superfoods (DISCRETE)
           filling (DISCRETE) packaging (DISCRETE) price (LINEAR LESS)

Besides the partworth and vector models, we can also specify the ideal point model
with the SPSS subcommands ‘IDEAL’ (i.e., inverted u-shaped quadratic relationship) or
‘ANTIIDEAL’ (i.e., u-shaped quadratic relationship).
After indicating the link between the utility parameters and the total utility values, we
need to tell SPSS how the data have been collected. We can choose between the subcom-
mands ‘SEQUENCE’, ‘RANK’, or ‘SCORE’. These options correspond to the above-de-
scribed options on how to enter the data into the SPSS data file. In our example, we
have ranking data and we hence use the subcommand ‘RANK’. SPSS now automatically
recodes the ranking to values that reflect total utility values (cf. Sect. 9.2.4.2). We further
indicate the name of the stimuli that are considered in the later estimations (here: the 16
stimuli of the experimental design and the 4 holdout stimuli):

/RANK = stimulus01 to stimulus16 holdout01 to holdout04

The ‘PRINT’ subcommand allows us to define which results will be displayed in the
SPSS output file. If we use the option ‘ANALYSIS’, only the results of the experimental
data analysis are included. The estimated utility parameters for each respondent as well

as the overall results of a joint analysis are presented. The option ‘SIMULATION’ leads
to the reporting of the simulation data only. The results of the first-choice rule, proba-
bilistic choice (BTL) rule, and logit rule are displayed. The option ‘SUMMARYONLY’
reports the result of the joint estimation but not the individual results. Finally, the option
‘ALL’ reports the results of both the experimental and simulation data analyses. This
option is the default option. If you choose the option ‘NONE’, no results are reported
in the SPSS output file. We choose the option ‘ALL’ in our example because we want to
explore whether there is heterogeneity among the respondents, and we want to know the
results for the simulation.

/PRINT = all

The subcommand ‘UTILITY’ can be used to save the estimated utility parameters to a
new SPSS data file. If you want to do so, you have to indicate the file where the results
should be saved (cf. Sect. 9.3.4).
The subcommand ‘PLOT’ produces plots in addition to the numerical output. The fol-
lowing three options are available for this subcommand:

1. SUMMARY. Produces bar charts of the relative importance for all attributes and of the
utility contributions for each attribute.
2. SUBJECT. Produces bar charts of the relative importance for all attributes and of the
utility contributions for each attribute clustered by respondents.
3. ALL. Plots both summary and subject charts.

Since we do not require any plots for our case study, we ignore this subcommand.

9.3.3 Results

Before the results of the individual and joint estimations are presented, SPSS reports any
reversals. A reversal occurs if an assumed link between the utility parameters and the
total utility value is not confirmed, for example, if we assume a negative linear relation
between price and total utility but we find a positive relation for an individual respond-
ent. In our example, no reversals occur, i.e., the assumed relationship between price and
total utility is valid for all respondents.

9.3.3.1 Individual-level Results
In the following, we present the SPSS results of the individual analyses. For respondent
i = 1, we get the utility parameters displayed in Fig. 9.14.

Fig. 9.14 Estimated utility parameters for respondent i = 1

Note that SPSS presents the partworths for all attribute levels, although we used the
vector model for the attribute ‘price’. The estimated parameter for the attribute ‘price’ is
displayed later in the SPSS output file. In our example, the price parameter for respondent i = 1 equals –1.545. To derive the partworths for the different price levels, we multiply the prices by the price parameter (e.g., –9.257 = 5.99 · (–1.545); deviations are due to rounding).
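As a quick illustration of this computation, the level-specific partworths for ‘price’ can be obtained by multiplying each price level with the estimated linear coefficient. The following lines use Python purely for the arithmetic; they are not part of the SPSS workflow, and only the value for 5.99 EUR is reported in the SPSS output.

# Illustrative arithmetic only: partworths of the price levels for respondent i = 1,
# derived from the (rounded) linear price coefficient reported by SPSS.
price_coefficient = -1.545
for price in (5.99, 6.99, 7.99):
    print(price, "EUR ->", round(price * price_coefficient, 3))
# 5.99 EUR -> -9.255 (SPSS reports -9.257; deviations are due to rounding)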
Moreover, SPSS uses effect coding for the attribute levels of the partworth model.
That is, the partworth for the reference level is not 0, as for dummy coding, but the part-
worths belonging to one attribute add up to 0. Effect coding uses the values + 1, 0 and –1
to represent categorical variables. Table 9.17 shows exemplarily the coding of the differ-
ent attribute levels for dummy and effect coding for the attribute ‘flavor’.
The general interpretation of the effect-coded partworths is similar to the interpretation of the dummy-coded partworths. The attribute level with the highest partworth is the most preferred level (e.g., ‘fruity’), and the attribute level with the smallest partworth is the least preferred level (e.g., ‘nutty’). From Fig. 9.14, we learn that respondent i = 1 has a preference for fruity truffles that have a size of 5 g, contain superfoods, have a creamy filling, are not individually wrapped, and are offered for 5.99 EUR.

Table 9.17  Dummy and effect coding (attribute: ‘flavor’; reference level: ‘mixed’)

                   ‘Fruity’   ‘Nutty’
Dummy coding
Level = fruity         1          0
Level = nutty          0          1
Level = mixed          0          0
Effect coding
Level = fruity         1          0
Level = nutty          0          1
Level = mixed         –1         –1

Fig. 9.15 Relative attribute importance for respondent i = 1
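The coding schemes in Table 9.17 can be reproduced with a few lines of code. The sketch below assumes pandas is available; it only illustrates the difference between dummy and effect coding and is not part of the SPSS procedure.

# Sketch: dummy coding vs. effect coding for the three-level attribute 'flavor'
# (reference level: 'mixed'), cf. Table 9.17.
import pandas as pd

flavor = pd.Series(["fruity", "nutty", "mixed"], name="flavor")
dummy = pd.get_dummies(flavor).astype(int)[["fruity", "nutty"]]  # reference level 'mixed' -> (0, 0)
effect = dummy.copy()
effect.loc[flavor == "mixed", :] = -1                            # reference level 'mixed' -> (-1, -1)
print(pd.concat([flavor, dummy.add_prefix("dummy_"), effect.add_prefix("effect_")], axis=1))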
Next, SPSS presents the relative importance of each attribute (Fig. 9.15). For respond-
ent i = 1, packaging is the most important attribute (29.057%) followed by truffle size
(20.755%) and type of flavor (17.642%). The attribute ‘filling’ is the least important
attribute for this particular respondent.
As goodness-of-fit measures, SPSS uses Pearson’s correlation (‘Pearson’s R’) and
Kendall’s tau for the estimation (‘Kendall’s tau’) and holdout samples (‘Kendall’s tau for
Holdouts’) (Fig. 9.16). Since we used ranking data, we focus on Kendall’s tau that equals
1 for the estimation and holdout samples. That is, we are able to completely reproduce
the original ranking based on the estimated utility function.
Fig. 9.16 Goodness-of-fit of the estimated model for respondent i = 1

Finally, the estimated total utility values for the two alternatives considered for simulation (‘Preference Scores of Simulations’) are displayed (Fig. 9.17). Respondent i = 1 has a strong preference for the first alternative, i.e., mixed mini truffles (5 g) that contain superfoods, have a creamy filling, are not wrapped individually, and are offered for a price of 6.99 EUR.

Fig. 9.17 Estimated total utility values of the alternatives considered for simulation for respondent i = 1

9.3.3.2 Results of Joint Estimation


After presenting the results of all individual analyses, SPSS displays the results of a joint
estimation across all 41 respondents. Figure 9.18 shows the estimated utility parameters
(partworths). In comparison to the results presented in Fig. 9.14, the sample has on aver-
age a slight preference for nutty flavors and a truffle size of 15 g.
The values for the relative attribute importance are more balanced when we jointly
estimate the utility parameters. Consequently, the results suggest that there is heteroge-
neity among respondents (Fig. 9.19). Hence, a joint estimation may lead to biased utility
parameters and, ultimately, to wrong managerial implications since the preference struc-
tures are not well reflected in the results of the joint estimation. When taking a closer
look at the original data (Fig. 9.13), we can see that there are actually two groups of
respondents that differ substantially in their preferences (Table 9.18). The first group is
similar to respondent i = 1, while the second group is similar to respondent i = 2 in the
data set.2

2 We leave it up to the reader to inspect the data in more detail and to examine the heterogeneity in
the preference structures. Visit www.multivariate-methods.info to obtain the dataset.

Fig. 9.18 Estimated utility parameters of the joint estimation

Fig. 9.19 Relative attribute importance for the joint estimation

Table 9.18  Preferred attribute levels of the two groups in the data set
Group 1 Group 2
Flavor Fruity Nutty
Size of truffles 5g 15 g
Superfood Containing superfoods Not containing superfoods
Filling Creamy Creamy
Packaging No individual packaging No individual packaging
Price Negative Negative

At the very end of the SPSS output file, the results for the simulation are reported.
Remember that the manager considered two alternative chocolate truffles for the simula-
tion process.

• Alternative 1: mixed mini truffles (5 g) that contain superfoods and have a creamy fill-
ing. The truffles are not wrapped individually and will be sold for a price of 6.99 EUR.
• Alternative 2: fruity truffles with a medium size (10 g) that contain a liquid filling but
no superfoods, are individually wrapped in paper and will be offered for 7.99 EUR.

SPSS presents the derived choice probabilities according to the different choice rules
(Fig. 9.20). When using the first-choice rule (‘Maximum Utility’), we predict that
Alternative 1 is chosen by all 41 respondents. The probabilistic choice rule (‘Bradley-
Terry-Luce’) and the logit rule (‘Logit’) provide a more nuanced picture. According to
the probabilistic choice rule, Alternative 1 is chosen with a probability of 65.0%. This
probability is substantially smaller than the probability of 91.3% that is derived when
using the logit rule. The reason for this difference is due to the large difference in
estimated total utility values (Preferences Scores of Simulations). If large differences
in total utility values are predicted, the logit rule converges with the result of the first-
choice rule.

Fig. 9.20 Estimated total utility values and preference probabilities for the two alternatives considered
for simulation (joint estimation)

The manager of the chocolate company can now use these results to make a deci-
sion about which kind of truffles to launch in the market. However, it might be advis-
able to take into account the heterogeneity among respondents. If there are two groups
(segments) of consumers that differ concerning their preferences, and both segments are
interesting as target groups, the manager might want to consider introducing two differ-
ent kinds of truffles to the market.
We further requested SPSS to save the individually estimated utility parameters in
a new data file. Figure 9.21 shows an excerpt of this data set. For the attribute ‘price’,
the estimated utility parameters are saved, not the partworths for each level. Besides the
individually estimated utility parameters (‘CONSTANT’, ‘flavor1’ etc.), the data file con-
tains the estimated total utility values for the 20 stimuli including the holdout stimuli
(‘SCORE1’ to ‘SCORE20’) and the alternatives considered for simulation (‘SIMUL01’
and ‘SIMUL02’). A closer look at the estimated total utility values for the two alter-
natives considered for simulation shows that there is some heterogeneity among the
respondents. While some respondents have a very strong preference for Alternative 1,
other respondents show more balanced preferences.
In a next step, we can use the individually estimated utility parameters to compute the
standardized utility parameters as described in Sect. 9.2.5.2. Subsequently, we could use
these standardized utility parameters to conduct a cluster analysis to identify groups of
respondents with similar preferences (cf. Chap. 8 ).

9.3.4 SPSS Commands

Figures 9.22 and 9.23 show the SPSS syntax for the chocolate truffle example. Figure 9.22
illustrates how to generate a reduced design with the SPSS procedure ORTHOPLAN. The
subcommand ‘MIXHOLD’ indicates whether the holdout stimuli are mixed with the stim-
uli of the experimental design (YES) or whether they are presented at the end of the data
file (NO). We decided to present them at the end of the data file (cf. Fig. 9.10).
Figure 9.23 shows the SPSS syntax for the CONJOINT procedure to estimate the util-
ity parameters (partworths).
For readers interested in using R (https://www.r-project.org) for conducting a conjoint analysis, we provide the corresponding R commands on our website (www.multivariate-methods.info).

9.4 Choice-based Conjoint Analysis

Traditional conjoint analysis (as described in Sect. 9.2) has been criticized because of
the non-realistic evaluation of the stimuli. Usually, respondents evaluate a set of objects
jointly and choose the object they prefer the most. This is also the basic idea of discrete
choice analysis that is frequently used in economics (cf. McFadden, 1974). Louviere and

Fig. 9.21 Estimated utility parameters for each individual respondent



* MVA: Case Study Chocolate Conjoint Analysis.


* Defining Data.
ORTHOPLAN
/FACTORS = flavor 'flavor' (1 'fruity' 2 'nutty' 3 'mixed')
size 'size of truffle' (1 '5g' 2 '10g' 3 '15g')
superfoods (1 'containing superfoods' 0 'not containing
superfoods')
filling 'filling' (1 'creamy' 2 'liquid')
packaging (1 'individually wrapped in foil' 2 'individually
wrapped in paper' 3 'not individually wrapped (box)')
price 'price' (5.99 '5.99EUR' 6.99 '6.99EUR' 7.99 '7.99EUR')
/OUTFILE='C:\...\conjoint_design_truffles.sav'
/HOLDOUT 4
/MIXHOLD NO.

Fig. 9.22 SPSS syntax for generating a reduced design with the procedure ORTHOPLAN

* MVA: Case Study Chocolate Conjoint Analysis.


CONJOINT
plan = 'C:\...\conjoint_design_truffles.sav'
/data = 'C:\...\conjoint_data_truffles.sav'
/factors = flavor (DISCRETE) size (DISCRETE) superfoods
(DISCRETE) filling (DISCRETE) packaging
(DISCRETE) price (LINEAR LESS)
/subject = person
/rank = stimulus01 to stimulus16 holdout01 to holdout04
/print = all
/utility = 'C:\...\conjoint_results_truffles.sav'.

Fig. 9.23 SPSS syntax for estimating the utility parameters with the procedure CONJOINT

Woodworth (1983) introduced the basic idea of discrete choice analysis into business
(esp. marketing) research and developed the so-called choice-based conjoint (CBC) anal-
ysis that asks consumers to choose a stimulus from a set of stimuli.
Consumer preferences are thus measured on a nominal scale (1 if the stimulus is cho-
sen and 0 otherwise). The nominal variable that reflects consumer preferences contains
less information than evaluations of stimuli using an ordinal or metric scale (cf. Sect.
9.2.3). Therefore, it is usually not possible to estimate the utility parameters for each
individual; rather, a homogeneous preference structure is assumed and we estimate one
set of utility parameters for all respondents. In principle, however, an individual estima-
tion of the utility parameters is possible if consumers make a sufficient number of choice
decisions (i.e., 50 or more choices). Yet in practice, it is hardly possible to ask consumers
to make 50 or more choice decisions. Instead, sophisticated statistical models such as
latent class models and Bayesian models are used to derive utility parameters at a group
or individual level. As this section serves as an introduction to CBC analysis, we focus on the aggregated estimation of the utility parameters.

Table 9.19  Differences between traditional conjoint analysis and CBC analysis

Design of experimental study
  Conjoint analysis: respondents evaluate a subset of stimuli
  CBC analysis: respondents evaluate stimuli that are part of a choice set and do so repeatedly
Evaluation of stimuli
  Conjoint analysis: preferences are measured with the help of metric or ordinal scales
  CBC analysis: preferences are collected with the help of choice tasks, i.e., respondents indicate what stimulus they would choose from a set of stimuli
Measurement scale of preferences
  Conjoint analysis: ordinal or metric
  CBC analysis: nominal
Model
  Conjoint analysis: utility model
  CBC analysis: utility model and choice model
Estimation of utility function
  Conjoint analysis: regression analysis (cf. Chap. 2)
  CBC analysis: maximum likelihood (ML) method—iterative optimization (cf. Chap. 5)
Results
  Conjoint analysis: partworths (individual or aggregate), relative attribute importance
  CBC analysis: partworths (mostly aggregate), relative attribute importance, choice probabilities
Prediction of choices
  Conjoint analysis: requires a decision about a model that links preferences to choice behavior
  CBC analysis: choice probabilities are inherent in the underlying model of CBC analysis

Fig. 9.24 Process steps of CBC analysis: (1) selection of attributes and attribute levels, (2) design of the experimental study, (3) evaluation of the stimuli, (4) estimation of the utility function, (5) interpretation of the utility parameters
Table 9.19 illustrates the main differences between traditional conjoint analysis and
CBC analysis beyond the evaluation of stimuli.
In the following, we present and discuss the different steps taken when conducting
a CBC analysis, with the different steps corresponding to the process steps of conjoint
analysis (Fig. 9.24). We discuss every step but elaborate in detail only those steps that
substantially differ between conjoint and CBC analysis. Specifically, we discuss the
design of the experimental study (step 2), the evaluation of stimuli (step 3), and the esti-
mation of the utility function (step 4).

9.4.1 Selection of Attributes and Attribute Levels


The first step when conducting a CBC analysis is to decide on the attributes and attribute
levels that are used to describe the stimuli. In general, the same aspects need to be con-
sidered when selecting the attributes and attribute levels for a CBC analysis as would be
considered for a conjoint analysis.
The attributes must be relevant and independent of each other. Moreover, we need to
be able to adapt the attributes. The attribute levels have to be realistic and feasible, and
the individual attribute levels need to be compensatory. Moreover, the considered attrib-
utes and attribute levels must not be exclusion criteria. Finally, the number of attributes
and attribute levels needs to be limited. Again, it is recommended to use not more than 6
attributes with 3 to 4 levels each. As in conjoint analysis, we need to be aware that once
we have decided on the attributes and attribute levels, we can no longer change or adapt
them.
In the following, we use a small example to illustrate the CBC analysis. We use a
rather small example to facilitate understanding—especially of the estimation procedure
(cf. Sect. 9.4.4). CBC analysis is not part of the IBM SPSS software package but we can
use the SPSS procedure COXREG to analyze data that we collected with a CBC design.
Yet, there are many commercial software tools for conducting CBC analyses, such as
tools by Sawtooth Software or Conjoint.ly. However, you can also carry out a CBC anal-
ysis with the help of R. Since not every reader is familiar with R, we use Microsoft Excel
for our example.

Example CBC Analysis

Let us imagine the manager of a chocolate company wants to measure and analyze
the preferences for dark chocolate. More specifically, the manager wants to learn
more about the cocoa content and the price consumers prefer. Thus, the manager
considers the attributes ‘cocoa content’ and ‘price’ in the study. Table 9.20 shows
the considered attributes and attribute levels. For the sake of simplicity, we consider
just two attributes with two levels each. The two different levels of cocoa content
are different from the ones in our previous example since here we focus on dark
chocolate only. ◄

Table 9.20  Example of a CBC analysis with two attributes and two levels each

Attribute        Attribute levels
Cocoa content    60% of cocoa
                 78% of cocoa
Price            1.50 EUR
                 2.00 EUR

9.4.2 Design of the Experimental Study


For the study, the researcher has to define the stimuli and choice sets and has to decide
how many choices the respondents have to make.

9.4.2.1 Definition of Stimuli and Choice Sets


CBC analysis requires using the full-profile method to present the stimuli to the respond-
ents. The trade-off method is not appropriate since it contradicts the underlying idea of
CBC analysis that consumers evaluate the stimuli considering all attributes jointly. When
using the full-profile method, we can also present the stimuli visually to further enhance
the realism of the later evaluation task.
Besides the stimuli that represent different combinations of attribute levels, we may
consider a no-choice option (‘I would not buy any of these products.’). The consideration
of a no-choice option is recommended if the results of a CBC analysis are to be used for
simulating consumer behavior. The no-choice option tells the researcher whether a con-
sumer is actually ‘in the market’ or not.
Including a no-choice option may, however, have a negative influence on the results.
First, respondents may use the no-choice option to bypass difficult decisions, which has
a negative impact on the validity of the utility parameters estimated later (cf. Haaijer
et al., 2001; Dhar, 1997). In addition, the no-choice option provides limited informa-
tion about respondents’ preferences, and the reasons for choosing the no-choice option
may vary across consumers (cf. Dhar, 1997). One reason may be that none of the stimuli
meets a respondent’s expectations, i.e. the total utility values of all stimuli do not exceed
a certain threshold value. Another reason may be that the respondents expect to find
stimuli that provide a higher total utility than those presented. The latter reason, however,
appears to be less important in the context of a CBC analysis, since the consumers are in
an experimental situation. In practice, the use of a no-choice option is therefore common.
In our example, we consider a no-choice option.

Table 9.21  Exemplary choice set for the chocolate example

Stimulus 1      Stimulus 2      No-choice
60% of cocoa    78% of cocoa    I would not buy any of these chocolates
1.50 EUR        2.00 EUR

Apart from deciding whether to include a no-choice option or not, we need to define
the number of stimuli presented in a choice set. The more stimuli are presented, the more
difficult the task of the respondent becomes. Moreover, considering a larger number of
stimuli in a choice set may affect the relative attribute importance. Respondents seem to
pay more attention to the attributes if fewer stimuli are considered in a choice set. This
indicates that respondents are overwhelmed with the choice task if choice sets consist of
many stimuli. Due to these arguments, we often observe a small number of stimuli in a
choice set in CBC studies.
The use of a small number of stimuli in a choice set is also driven by the fact that
we aim for minimal overlap of the attribute levels. There is minimal overlap between
the attribute levels if the stimuli in each choice set have non-overlapping attribute levels,
which means that an attribute level only appears once in a choice set. Minimal overlap
of attribute levels forces respondents to make trade-off decisions between the different
stimuli in a choice set. Keeping in mind that many CBC designs consider 3 or 4 levels
for each attribute (cf. Sect. 9.4.1), the maximum number of stimuli in a choice set is 4 if
we want to achieve minimal overlap. Since we only consider 2 levels for each attribute
in our example, each choice set consists of just 2 stimuli (Table 9.21). Consequently, we
have no overlap. Yet it is important to note that some overlap in attribute levels is not
critical and may facilitate the estimation of interaction effects. Besides the 2 stimuli, we
also consider a no-choice option in our example.

9.4.2.2 Number of Choice Sets


The researcher has to decide on the stimuli that are presented in a choice set as well as
on the number of choice sets a consumer has to choose from. In our example, we have
4 (= 2 · 2) possible stimuli. Yet, we can generate more than 4 choice sets. In general, the number of possible choice sets depends on the number of stimuli in a choice set and equals the binomial coefficient \binom{K}{R}, where K is the number of possible stimuli and R is the number of stimuli in a choice set. With 4 possible stimuli and 2 stimuli in a choice set, we can generate 6 possible choice
sets (Table 9.22). If we want to avoid overlapping attribute levels, there are only 2 choice
sets that fulfill this aim (i.e., choice sets 5 and 6 in Table 9.22).
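For this small example, the possible choice sets and the minimal-overlap condition can be checked with a few lines of code. The sketch below uses Python purely for illustration and is not part of the book's workflow.

# Sketch: enumerate all choice sets of two stimuli for the 2x2 example and keep
# those with minimal overlap (no attribute level appears twice within a set).
from itertools import combinations, product

stimuli = list(product(["60% of cocoa", "78% of cocoa"], ["1.50 EUR", "2.00 EUR"]))  # 4 stimuli
choice_sets = list(combinations(stimuli, 2))
print(len(choice_sets))        # 6 possible choice sets, cf. Table 9.22
minimal_overlap = [(s1, s2) for s1, s2 in choice_sets if s1[0] != s2[0] and s1[1] != s2[1]]
print(minimal_overlap)         # the two sets without overlapping levels (choice sets 5 and 6)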
For our chocolate example introduced in Sect. 9.2.1, we considered 3 attributes with
2 or 3 levels each. There are 18 (= 3 · 2 · 3) possible stimuli. If we have 18 possible stim-
uli and consider 2 stimuli in each choice set, we get 153 possible choice sets—plus a

no-choice option if we decide on presenting it. If we considered 3 stimuli in each choice


set, we would get 816 different options. Obviously, we need to develop a reduced design
that represents just a subset of all possible sets. Research shows that respondents are
able to make up to 20 choices without any decrease in response quality (cf. Johnson &
Orme, 1996). A larger number of choice sets will lead to fatigue effects and reduces data
quality.
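The numbers of possible choice sets quoted above follow directly from the binomial coefficient; a minimal check (Python, for illustration only):

from math import comb

print(comb(4, 2))    #   6 choice sets: 4 possible stimuli, 2 per choice set
print(comb(18, 2))   # 153 choice sets: 18 possible stimuli, 2 per choice set
print(comb(18, 3))   # 816 choice sets: 18 possible stimuli, 3 per choice set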
In the literature, different approaches are discussed for arriving at a reasonable num-
ber of choice sets in CBC studies. All these approaches lead to reduced designs that
are efficient. The efficiency of an experimental design is described by the variance and
covariance of the estimated utility parameters. The smaller the variance and covariance
of the estimated utility parameters, the more efficient an experimental design is. Yet, we
face the challenge that the design needs to be generated before the preference data are
collected. Thus, we can only estimate the efficiency of a reduced design.
One requirement for an efficient design is utility balance, i.e., the balance of the total
utility values of the stimuli in a choice set. This requirement is fulfilled if the total utility
values of the stimuli in a choice set are similar according to a priori expectations. If the
requirement of utility balance is fulfilled, respondents have to make trade-off decisions.
On the negative side, utility balance may encourage the choice of the no-choice option as
consumers may avoid difficult choices (cf. Dhar, 1997).
Due to the complexity of generating efficient designs in CBC analysis, such designs
are usually developed with software packages (e.g., Sawtooth Software or R). Typically,
the choice sets vary across respondents to improve the efficiency of the overall design.
In our example, we continue with two choice sets with no overlapping attribute levels
(choice sets 5 and 6 in Table 9.22).

Table 9.22  Possible choice sets in a CBC analysis with two attributes and two levels each

Choice set   Stimulus 1                Stimulus 2
1            60% of cocoa, 1.50 EUR    60% of cocoa, 2.00 EUR
2            78% of cocoa, 1.50 EUR    78% of cocoa, 2.00 EUR
3            60% of cocoa, 1.50 EUR    78% of cocoa, 1.50 EUR
4            60% of cocoa, 2.00 EUR    78% of cocoa, 2.00 EUR
5            60% of cocoa, 1.50 EUR    78% of cocoa, 2.00 EUR
6            60% of cocoa, 2.00 EUR    78% of cocoa, 1.50 EUR

9.4.3 Evaluation of the Stimuli


In contrast to conjoint analysis, in a CBC study respondents are asked to indicate which
stimulus in a choice set they would choose. CBC analysis assumes that respondents
select the most preferred stimulus in a choice set. By gathering respondents’ prefer-
ences with the help of choices, we can avoid the problem of a subjective use of scales.
Additionally, making choices instead of evaluating stimuli on a scale is considered more
realistic.
Table 9.23 represents the choice of one exemplary respondent i = 1 for choice set
r = 1. Respondent i = 1 chooses stimulus k = 2, so it is the most preferred one. However,
we neither learn anything about the strength of the preference nor whether stimulus k = 1
is preferred over the no-choice option or not. That is because respondents’ choice deci-
sions contain less information than evaluations on a metric or ordinal scale, as only a
nominally scaled variable is recorded.

9.4.4 Estimation of the Utility Function


Table 9.23  Typical question in a CBC study


Imagine you are in a supermarket and you are considering buying a bar of dark chocolate. There
are two products available. Which product would you buy? Please mark the product you would
buy. You also have the option to state that you would not buy any of the two products
Chocolate 1               Chocolate 2               No-choice
60% of cocoa, 1.50 EUR    78% of cocoa, 2.00 EUR    I would not buy any of these chocolates
Your answer:                         X

After respondents have made their choices, we determine the partworths of the attribute
levels. In the following, we discuss the specification of the utility function, and how to
derive the partworths.

9.4.4.1 Specification of the Utility Function


CBC analysis assumes—similar to conjoint analysis—that the total utility value of a stim-
ulus results from the utility contributions of the attribute levels. Hence, we again assume
an additive model and, as in conjoint analysis, we can use a partworth, vector, or ideal
point model to express the relation between the partworths and the total utility value.
In our example, we use the partworth model for the attributes ‘cocoa content’ (j = 1)
and ‘price’ (j = 2). We do not consider a constant term, and the utility function becomes:
y_k = \sum_{j=1}^{2} \sum_{m=1}^{2} \beta_{jm} \cdot x_{jmk}

where x11k indicates whether stimulus k has a cocoa content of 60% (1) or not (0), and
x21k indicates whether stimulus k has a price of 1.50 EUR (1) or not (0).
Although a stimulus has a total utility value independent of the choice set(s) it is part
of, we consider the choice set in the general formulation of the utility function. If we
include a no-choice option, we represent this option as an attribute of the stimulus. Thus,
we get the following general formulation of the utility function if we use the partworth
model for all attributes:
y_k = \sum_{j=1}^{J} \sum_{m=1}^{M_j} \beta_{jm} \cdot x_{jmk} \quad (k \in K_r)        (9.9)

with

yk      total utility of stimulus k (k ∈ Kr)
βjm     utility contribution of level m of attribute j (partworth)
xjmk    1 if level m of attribute j is present for stimulus k (k ∈ Kr), 0 otherwise

For our example, the no-choice option is represented by attribute j = 3, and x31k equals 1
if the stimulus is the no-choice option and 0 otherwise.
If the vector model is employed for all attributes, we get the following formulation:
y_k = \sum_{j=1}^{J} \sum_{m=1}^{M_j} \beta_{jm} \cdot x_{jk} \quad (k \in K_r)        (9.10)

with

yk      total utility of stimulus k (k ∈ Kr)
βjm     utility contribution of level m of attribute j (partworth)
xjk     value of attribute j for stimulus k (k ∈ Kr)

Since the variable for the no-choice option is a binary variable, we can either use the for-
mulation xjmk or xjk. For our example, we can write:
y_k = \underbrace{\beta_{11} \cdot x_{11k}}_{\text{cocoa content}} + \underbrace{\beta_{21} \cdot x_{21k}}_{\text{price}} + \underbrace{\beta_{3} \cdot x_{3k}}_{\text{no-choice}}

In total, we need to estimate three parameters including the parameter for the no-choice
option.

9.4.4.2 Specification of the Choice Model


CBC analysis relies on the logit model to map individual choice behavior, and thus, on
the logit rule discussed in Sect. 9.2.5.4. If there are two stimuli in a choice set, the logit
model corresponds to the binary logit model. If there are more than two alternatives to
choose from, the model is called multinomial logit (MNL) model.
The probability that respondent i chooses stimulus k from a set of stimuli presented in
choice set r is:
\text{Prob}(k \in K_r) = \frac{\exp(y_k)}{\sum_{k'=1}^{K_r} \exp(y_{k'})} = \frac{1}{1 + \sum_{k' \ne k} \exp(-1 \cdot (y_k - y_{k'}))}        (9.11)

Equation (9.11) illustrates that the probability of choosing a stimulus depends on its util-
ity and the utility of all other stimuli in a choice set. The choice probability for stimulus
k is reflected in a non-linear relationship between the utility value of stimulus k and the
utility values of the other considered stimuli in a choice set. More specifically, there is
an s-shaped relationship between the choice probability and the estimated utility because
the model relies on the logistic function (Fig. 9.25).
The right-hand side of Eq. (9.11) further shows that the probability of choosing stim-
ulus k among the stimuli in choice set r depends on the difference in utility values. Thus,
the choice probability in the logit model is determined solely by the differences in utility
values, not by their absolute value. Stated differently, it is only the differences in utility
that matter.
If two stimuli have exactly the same utility values and we just consider two stimuli in
a choice set, the choice probability for both stimuli is 0.5 or 50% (Fig. 9.25). For exam-
ple, let us assume two stimuli with a utility value of 2 for both stimuli. Equation (9.11)
then equals for k = 1 (and k = 2):

\text{Prob}(k = 1) = \frac{\exp(2)}{\exp(2) + \exp(2)} = \frac{1}{1 + \exp(-1 \cdot (2 - 2))} = \frac{1}{1 + \exp(0)} = 0.5

Fig. 9.25 Illustration of the logistic function (choice probability as a function of yk − yk′)

A characteristic of the logit model is that if two stimuli have very similar utility val-
ues, small changes in the utility values have a strong effect on the choice probabilities.
For example, if we assume utility values of 2.2 and 1.8 for k = 1 and k = 2, respectively,
the choice probabilities are Prob(k = 1) = 60% and Prob(k = 2) = 40%. But in the case of
large differences in utility values, small changes in these values have only a minor effect
on the choice probabilities.
Another characteristic of the logit model is that the model relies on the assumption
of ‘independence of irrelevant alternatives’ (IIA). The IIA assumption implies that new
alternatives affect the existing alternatives in the same way. For example, if we consider
a no-choice option, the choice probabilities for k = 1 and k = 2 change. If we assume a
utility parameter of 1 for the no-choice option, we get the following equations for k = 1
(k = 2) and the no-choice option:
\text{Prob}(k = 1) = \text{Prob}(k = 2) = \frac{1}{1 + \exp(-1 \cdot (2 - 2)) + \exp(-1 \cdot (2 - 1))} = \frac{1}{1 + \exp(0) + \exp(-1)} = 0.42

\text{Prob}(\text{no-choice}) = \frac{1}{1 + \exp(-1 \cdot (1 - 2)) + \exp(-1 \cdot (1 - 2))} = \frac{1}{1 + \exp(1) + \exp(1)} = 0.16

Although the choice probabilities for k = 1 and k = 2 will change if we consider a


no-choice option, the ratio of the choice probabilities stays the same—it is still equal to 1.

This is the result of the model’s inherent IIA assumption. However, there might be situa-
tions in which we introduce a new alternative that is a close substitute for an existing alter-
native. In this case, we would expect that the substitute suffers more from the introduction
of the new alternative than the other choice option. For that reason, the IIA assumption
is considered a limitation of the logit model, and alternative models such as nested logit
models have been developed to address this limitation (cf. Train, 2009, pp. 77–78).
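The worked examples above can be reproduced with a short function that implements Eq. (9.11). The sketch below assumes numpy is available and is meant only to illustrate the logit choice probabilities and the IIA property; it is not part of the book's workflow.

# Sketch: multinomial logit choice probabilities, Eq. (9.11).
import numpy as np

def choice_probabilities(utilities):
    expu = np.exp(np.asarray(utilities, dtype=float))
    return expu / expu.sum()

print(choice_probabilities([2.0, 2.0]))      # [0.5, 0.5]
print(choice_probabilities([2.2, 1.8]))      # approx. [0.60, 0.40]
p = choice_probabilities([2.0, 2.0, 1.0])    # two stimuli plus a no-choice option with utility 1
print(np.round(p, 2))                        # approx. [0.42, 0.42, 0.16]
print(p[0] / p[1])                           # the ratio for the two stimuli remains 1.0 (IIA)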
So far, we have assumed that we know the total utility values of the stimuli. Yet, we
actually have to estimate the utility parameters to derive the total utility values and the
resulting choices. In the following, we discuss how to estimate the utility parameters.

9.4.4.3 Estimation of the Utility Parameters


To estimate the utility parameters in CBC analyses, we cannot use regression analysis
(cf. Sect. 2.2.5.6) because the assumptions of normality and a metric dependent varia-
ble are not met. Instead, there is a dichotomous (0/1) dependent variable. The so-called
maximum likelihood (ML) method can cope with dichotomous dependent variables.

Maximum Likelihood (ML) Method


The objective function of the ML method is the following:

L = \prod_{r=1}^{R} \prod_{k=1}^{K_r} \text{Prob}(k)^{d_k} \rightarrow \max!        (9.12)

with

L likelihood function
Prob(k) estimated choice probability for stimulus k in choice set r
dk binary variable indicating whether stimulus k in choice set r has been chosen
(1) or not (0)

The value of the likelihood function depends solely on the utility parameters. We aim
to identify those utility parameters that result in a choice probability of 1 for the cho-
sen stimulus in a choice set. If we are able to identify the utility parameters that lead
to a choice probability of 1 for each chosen stimulus across all choice sets, the likeli-
hood function reaches a value of 1, which is its maximum value (minimum value = 0).
However, it is very unlikely that we obtain a value of 1 for the likelihood function.
Usually, the values for the likelihood are rather small since we multiply a number of
probabilities. For example, if we observed 12 choices and were able to predict each
choice with a probability of 0.9, the resulting value for the likelihood function would
be 0.282 (= 0.9^12). Small values for the likelihood make it more difficult to identify
the parameters that maximize the likelihood. To address this issue, we take the natural
logarithm of the likelihood and maximize the ln-likelihood function. We thus transform
Eq. (9.12) into the following form:

\ln L = LL = \sum_{r=1}^{R} \sum_{k=1}^{K_r} d_k \cdot \ln(\text{Prob}(k)) \rightarrow \max!        (9.13)

Taking the natural logarithm does not influence the estimated utility parameters. The
value for the ln-likelihood lies within the interval ]−∞,0]. Thus, maximizing LL corre-
sponds to reaching a value of 0. To identify the utility parameters that maximize LL, we
can use an iterative procedure such as the Newton–Raphson algorithm (cf. Train, 2009,
pp. 187–188).
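The effect of taking the natural logarithm can be seen with the small numerical example from above (a Python sketch, for illustration only):

# Sketch: product of probabilities (likelihood) vs. sum of their logs (ln-likelihood).
import math

probs = [0.9] * 12                                 # 12 choices, each predicted with probability 0.9
likelihood = math.prod(probs)                      # about 0.282, cf. Eq. (9.12)
log_likelihood = sum(math.log(p) for p in probs)   # cf. Eq. (9.13)
print(round(likelihood, 3), round(math.exp(log_likelihood), 3))  # both print 0.282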
The ML estimation procedure is based on the idea that observed choices can be gen-
erated from a set of utility parameters. There is one set of utility parameters that can best
describe the empirically observed choice decisions. These utility parameters are system-
atically searched for and determined with the help of the iterative algorithm. Figure 9.26
illustrates an exemplary progression of the LL function if one parameter is varied. For
our example, we can see that the maximum of the LL function is around –10 and the util-
ity parameter that maximizes the LL function is around 0.5.
To derive robust parameter estimates, a sufficient number of choices must be
observed. That is, we need a sufficient number of degrees of freedom. The literature
refers to at least 60 degrees of freedom for the ML method; conservative voices even
recommend 120 degrees of freedom (cf. Eliason, 1993, p. 83). In our example, we need
to estimate three parameters and thus we would need to observe at least 63 choices.
However, our experimental design considers just two choice sets. Consequently, we need
to observe about 32 respondents to estimate the utility parameters.
Since we can hardly observe a sufficient number of choices from one single respond-
ent, an individual estimation of the parameters is not feasible in CBC studies. We thus

Fig. 9.26 Progression of the LL function when varying a single utility parameter (x-axis: utility parameter βjm; y-axis: LL)

estimate one set of utility parameters for all respondents. If we assume that there is het-
erogeneity among the consumers, we may want to split the sample based on observed
characteristics (e.g., gender, age groups etc.). We can then estimate the utility parameters
for each sub-sample—if there are sufficient observations for each sub-sample. However,
this approach requires the user to know which characteristics are relevant for heterogene-
ity in preferences. Alternatively, other estimation methods such as latent class models or
Bayesian models can be employed to obtain group-specific or individual-specific utility
parameters. Yet, these methods are not discussed in this book.

Starting Values for the ML Method


To illustrate the ML method, we are going to use the observations from six respondents,
which is actually a very small sample but sufficient for our purposes. Table 9.24 shows
the data set used for further analyses.
First, we need to define starting values for the utility parameters. We use these starting
values to determine a value for LL that is subsequently maximized by iteratively adapt-
ing the utility parameters. To identify reasonable starting values, we can, for example,
use the information how many times the attribute levels have been chosen across all
choice sets. Such frequencies offer first insights into the respondents’ preferences regard-
ing the attribute levels. Table 9.25 shows the number of times the different attribute lev-
els have been selected. We learn that the respondents seem to prefer a cocoa content of
78% and a price of 1.50 EUR. To derive starting values for the utility parameters, we can
compute the relative frequency of an attribute level and subtract the relative frequency
of the reference category (here: cocoa content of 78% and price of 2.00 EUR). For the
no-choice option, we use a starting value of 0.
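The starting values in Table 9.25 follow from simple frequency arithmetic, sketched here in Python for illustration only:

# Sketch: starting values = relative frequency of a level minus the relative
# frequency of its reference level (cf. Table 9.25); 12 choices were observed.
n_choices = 12
start_cocoa_60 = 1 / n_choices - 10 / n_choices    # -0.75 (reference level: 78% cocoa)
start_price_150 = 7 / n_choices - 4 / n_choices    #  0.25 (reference level: 2.00 EUR)
start_no_choice = 0.0
print(round(start_cocoa_60, 2), round(start_price_150, 2), start_no_choice)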
As Table 9.24 illustrates, we use dummy coding for the representation of the stim-
uli. Alternatively, we can use effect coding as described in Sect. 9.3.3. If we do so, we
center the starting values on 0. For example, for the attribute ‘cocoa content’, the starting
value for the utility parameter of ‘60% cocoa content’ would be –0.375 (= –0.75/2) and
the starting value for the utility parameter of ‘78% cocoa content’ would be 0.375 (=
0.75/2). Whether we use dummy or effect coding has no influence on the findings later
on.
Based on the starting values, we compute the total utility values of all stimuli consid-
ered in the experimental design. Thereby, we assume homogeneity, that is, a stimulus has
the same utility for all respondents. Table 9.26 shows the derived total utility values for
the 3 stimuli considered in the two choice sets.
We use the total utility values to derive the choice probabilities (Table 9.26). From
Table 9.24, we know that respondent i = 1 has chosen stimulus k = 2 in both choice sets.
The corresponding choice probabilities are 0.384 and 0.466. Thus, the likelihood when considering just respondent i = 1 equals 0.180 (= 0.384 · 0.466) and LL is –1.715. The like-
lihood and LL values taking all observed choices into account are 0.00002 and –10.832,
respectively. The likelihood value is very small and not close to 1. Consequently, the
starting values are probably not the ‘optimal’ utility parameters.

Table 9.24  Choice data collected from 6 respondents


Respondent   Choice set   Stimulus   60% cocoa   78% cocoa   1.50 EUR   2.00 EUR   No-choice option   Observed choice
1 1 1 1 0 1 0 0 0
1 1 2 0 1 0 1 0 1
1 1 3 0 0 0 0 1 0
1 2 1 1 0 0 1 0 0
1 2 2 0 1 1 0 0 1
1 2 3 0 0 0 0 1 0
2 3 1 1 0 0 1 0 0
2 3 2 0 1 1 0 0 1
2 3 3 0 0 0 0 1 0
2 4 1 1 0 1 0 0 1
2 4 2 0 1 0 1 0 0
2 4 3 0 0 0 0 1 0
3 5 1 1 0 1 0 0 0
3 5 2 0 1 0 1 0 1
3 5 3 0 0 0 0 1 0
3 6 1 1 0 0 1 0 0
3 6 2 0 1 1 0 0 1
3 6 3 0 0 0 0 1 0
4 7 1 1 0 0 1 0 0
4 7 2 0 1 1 0 0 1
4 7 3 0 0 0 0 1 0
4 8 1 1 0 1 0 0 0
4 8 2 0 1 0 1 0 1
4 8 3 0 0 0 0 1 0
5 9 1 1 0 1 0 0 0
5 9 2 0 1 0 1 0 0
5 9 3 0 0 0 0 1 1
5 10 1 1 0 0 1 0 0
5 10 2 0 1 1 0 0 1
5 10 3 0 0 0 0 1 0
6 11 1 1 0 0 1 0 0
6 11 2 0 1 1 0 0 1
6 11 3 0 0 0 0 1 0
6 12 1 1 0 1 0 0 0
6 12 2 0 1 0 1 0 1
6 12 3 0 0 0 0 1 0

Table 9.25  Number of times an attribute level has been chosen


Attribute        Level       Number of times chosen    Relative frequency    Starting value
Cocoa content    60%          1                        0.08                  –0.75
                 78%         10                        0.83                   0.00
Price            1.50 EUR     7                        0.58                   0.25
                 2.00 EUR     4                        0.33                   0.00

Table 9.26  Starting values for total utility and choice probabilities


Choice set 1          Stimulus 1                Stimulus 2                No-choice
                      60% of cocoa, 1.50 EUR    78% of cocoa, 2.00 EUR    I would not buy any of these chocolates
Total utility         –0.5 = –0.75 + 0.25       0 = 0 + 0                 0
Choice probability    0.233                     0.384                     0.384

Choice set 2          Stimulus 1                Stimulus 2                No-choice
                      60% of cocoa, 2.00 EUR    78% of cocoa, 1.50 EUR    I would not buy any of these chocolates
Total utility         –0.75 = –0.75 + 0         0.25 = 0 + 0.25           0
Choice probability    0.171                     0.466                     0.363

We use an iterative algorithm to identify those utility parameters that maximize the
likelihood and LL value. On our website www.multivariate-methods.info, we demon-
strate how to use the Microsoft Excel solver to maximize the LL function. The maximum
values for the likelihood and LL are 0.005 and –5.205, respectively. The likelihood value
is still rather small but we have now reached the maximum value, given the observed
choices. One reason for the rather small value might be that there is heterogeneity in the
respondents’ preferences, which makes it difficult to identify one set of utility parameters
that maps all observed choices. The utility parameters that result in the maximum value
for LL are: b11 = –14.419, b21 = 13.033, and b3 = –1.386. In the following, we discuss how to assess the estimated utility function.
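The book demonstrates the maximization of the LL function with the Microsoft Excel solver on the website; the following is a minimal alternative sketch in Python (assuming numpy and scipy are available) that re-estimates the parameters from the choice data in Table 9.24. It should reproduce the starting LL of about –10.83 and a maximum LL of about –5.2; the exact magnitudes of the first two parameters depend on the optimizer's stopping rule because the stimulus with 78% cocoa and 1.50 EUR is chosen in every choice set in which it appears.

# Sketch: ML estimation of the CBC utility parameters for the example in Table 9.24.
import numpy as np
from scipy.optimize import minimize

# Rows: stimulus 1, stimulus 2, no-choice; columns: [60% cocoa, 1.50 EUR, no-choice option]
# (dummy coding; 78% cocoa and 2.00 EUR are the reference levels).
A = np.array([[1.0, 1.0, 0.0],    # 60% cocoa, 1.50 EUR
              [0.0, 0.0, 0.0],    # 78% cocoa, 2.00 EUR
              [0.0, 0.0, 1.0]])   # no-choice
B = np.array([[1.0, 0.0, 0.0],    # 60% cocoa, 2.00 EUR
              [0.0, 1.0, 0.0],    # 78% cocoa, 1.50 EUR
              [0.0, 0.0, 1.0]])   # no-choice
X = [A, B, B, A, A, B, B, A, A, B, B, A]       # design of choice sets 1 to 12 (cf. Table 9.24)
chosen = [1, 1, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1]  # chosen alternative per set (0 = stimulus 1, 1 = stimulus 2, 2 = no-choice)

def neg_ll(beta):
    """Negative ln-likelihood of the multinomial logit model, Eq. (9.13)."""
    ll = 0.0
    for x, c in zip(X, chosen):
        utilities = x @ beta                                 # Eq. (9.9)
        probs = np.exp(utilities) / np.exp(utilities).sum()  # Eq. (9.11)
        ll += np.log(probs[c])
    return -ll

start = np.array([-0.75, 0.25, 0.0])             # starting values from Table 9.25
print("LL at starting values:", -neg_ll(start))  # about -10.83
result = minimize(neg_ll, start, method="BFGS")
print("maximum LL:", -result.fun)                # about -5.2
print("estimated parameters b11, b21, b3:", result.x)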

9.4.4.4 Assessment of the Estimated Utility Function


An assessment of the quality of the estimated utility function should, as in traditional
conjoint analysis, be based on the following criteria:

• plausibility of the estimated utility parameters,


• goodness-of-fit of the model,
• predictive validity.

Plausibility of the Estimated Utility Parameters


The plausibility of the estimated utility parameters can be assessed in the same way as
in conjoint analysis. To assess the plausibility of the estimated utility parameters, we
can assess whether the sign of the estimated parameters is in line with a priori expecta-
tions or not (face validity). In our example, we used the partworth model for both attrib-
utes, and thus, we did not make any a priori assumptions on the relation between utility
parameters (i.e., partworths) and the total utility value. For the attribute level ‘60% cocoa
content’, we get a negative partworth. Consequently, the respondents prefer a higher
to a lower cocoa content. This result reflects the initial finding that the attribute level
‘78% cocoa content’ was chosen much more frequently than the attribute level ‘60%
cocoa content’. For the price level ‘1.50 EUR’, we find a positive partworth, indicating
that the respondents prefer a lower to a higher price, which intuitively makes sense. The
no-choice option has a negative utility parameter. This negative parameter reflects the
respondents’ tendency to rather choose a stimulus than to state ‘I would not buy any of
the chocolates’.

Goodness-of-fit of the Model


Since CBC analysis is similar to logistic regression, we can also evaluate the goodness-
of-fit of the estimated model with the help of the measures that are used in the context of
logistic regression (cf. Chap. 5 ). Since the absolute value of LL depends on the number
of observations, we cannot use the value of LL directly to assess the goodness-of-fit (cf.
Chap. 5 ). Hence, we need to compare the maximum LLb value (i.e., the value of the LL
function after maximization) to a reference value. This is done with the help of the likeli-
hood ratio (LLR) test:
 
LLR = -2 \cdot \ln\left(\frac{L_0}{L_b}\right) = -2 \cdot (LL_0 - LL_b)        (9.14)
We compare the maximum LLb value to the LL0 value of the so-called null model, that
is, the LL value if all utility parameters are set to zero. The choice probabilities of the
null model are 1/K, and all stimuli have the same choice probability. Thus, the likeli-
hood equals (1/K)^R. In our example, we get a value of –13.183 for LL0. The likelihood ratio test yields a value of 15.956 (= –2 · (–13.183 + 5.205)). The LLR test statistic is
chi-square-distributed with 3 degrees of freedom (i.e. the number of estimated utility
parameters). As a result, we receive a p-value of 0.00116 or 0.12%. The estimated model
is statistically highly significant.

Table 9.27  Significance of estimated utility parameters

                       LL0j       LLRj      p-value (%)
Cocoa content = 60%    –10.652    10.894     0.097
Price = 1.50 EUR        –6.793     3.175     7.476
No-choice               –6.169     1.927    16.504

We can also use the LLR test to determine the significance of the estimated utility
parameters. The LLR test for a single utility parameter is obtained by setting the utility
parameter to zero in the utility function and then maximizing the likelihood function for
this reduced model over the remaining parameters:
 
LLR_j = -2 \cdot (LL_{0j} - LL_b)        (9.15)
Table 9.27 shows the significance levels for the estimated utility parameters based on Eq.
(9.15). In our example, only the utility parameter for ‘60% cocoa content’ is significant.
Another common measure of the goodness-of-fit for the complete model is
McFadden’s R-square (also known as the likelihood ratio index):
 
R_M^2 = 1 - \frac{LL_b}{LL_0} \qquad (0 \le R_M^2 \le 1)        (9.16)
For the example, McFadden’s R-square equals 0.605. Given that the value ranges between
0 and 1, the obtained value indicates a high goodness-of-fit of the estimated utility function.
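The goodness-of-fit figures reported above can be verified with a few lines of code (a sketch assuming scipy is available; not part of the book's Excel workflow):

# Sketch: LLR test and McFadden's R-square for the example.
import math
from scipy.stats import chi2

LL_b = -5.205                        # maximized ln-likelihood
LL_0 = 12 * math.log(1 / 3)          # null model: 12 choices, 3 alternatives each -> about -13.183
LLR = -2 * (LL_0 - LL_b)             # Eq. (9.14) -> about 15.956
p_value = chi2.sf(LLR, df=3)         # 3 estimated parameters -> about 0.0012
mcfadden_r2 = 1 - LL_b / LL_0        # Eq. (9.16) -> about 0.605
print(round(LL_0, 3), round(LLR, 3), round(p_value, 5), round(mcfadden_r2, 3))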
Besides the two measures based on LL, we can compute the hit rate for the estimation
sample (i.e., set up a classification matrix, cf. Chap. 5 ). That is, we count the number
of times we predict the chosen stimulus correctly if we always choose the stimulus with
the highest choice probability. In our example, we get a hit rate of 0.833 (= 10/12) or
83.3%. The benchmark is a hit rate of 50% and, thus, the model fits the data quite well,
although the likelihood value is rather small.

Predictive Validity
To assess the predictive validity, we can use a holdout sample and compute the hit rate
for this sample. In our example, we did not consider a holdout sample, and thus we are
not able to assess the predictive validity of the model.

9.4.5 Interpretation of the Utility Parameters




Finally, we briefly discuss the insights and implications we can derive from the results of
a CBC analysis.

9.4.5.1 Preference Structure and Relative Importance of an Attribute


In CBC analyses, we face the problem that the goodness-of-fit of the model influences
the absolute value of the utility parameters (so-called scaling effect). The better the
model can map the observed choice decisions of the respondents, the higher the abso-
lute values of the estimated utility parameters. For this reason, the size of the estimated
utility parameters cannot be interpreted and compared directly (cf. Swait & Louviere,
1993, p. 305).
Yet, we can use the information about the utility parameters to arrive at a conclusion
regarding which product the respondents prefer. In our example, a chocolate bar with a
cocoa content of 78% offered for a price of 1.50 EUR would be the ‘optimal’ product for
the sample.
To make a statement about the attribute that is most relevant for the respondents we
can compute the relative attribute importance according to Eq. (9.4) (ignoring the util-
ity parameter of the no-choice option). When doing so, we get a relative importance
of 52.5% (= 14.419/(14.419 + 13.033)) for the attribute ‘cocoa content’. The attribute
‘price’ has a relative importance of 47.5%.
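A short sketch of this computation (Python, for illustration only), using the ranges of the estimated parameters and ignoring the no-choice parameter:

# Sketch: relative attribute importance from the ranges of the estimated utility parameters.
ranges = {
    "cocoa content": abs(-14.419 - 0.0),   # partworths: -14.419 (60%) vs. 0 (78%, reference)
    "price": abs(13.033 - 0.0),            # partworths: 13.033 (1.50 EUR) vs. 0 (2.00 EUR, reference)
}
total_range = sum(ranges.values())
for attribute, r in ranges.items():
    print(attribute, round(100 * r / total_range, 1), "%")   # 52.5 % and 47.5 %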

9.4.5.2 Disaggregated Utility Parameters


As discussed above, it is hardly feasible to estimate utility parameters for each individ-
ual since the ML estimation procedure requires a large number of degrees of freedom
to obtain robust results. Yet, there are two basic approaches for deriving disaggregate
utility parameters:

• Define sub-groups a priori and estimate the utility parameters for each sub-group sep-
arately by splitting the sample. This approach requires a priori knowledge about het-
erogeneity. For example, if we are confident that gender differentiates the respondents
in their preferences, we may split the sample into women and men. However, often
we do not know what observable variables are associated with heterogeneity in the
respondents’ preferences.
• Use more advanced estimation procedures such as Bayesian methods or latent
class models to derive utility parameters at the individual or segment level (cf.
Gustafsson et al., 2007). In contrast to defining groups a priori, latent class models
use the respondents’ choice decisions to identify respondents with similar preferences
(i.e., choice behavior) and to jointly estimate the utility parameters for those respond-
ents who demonstrate similar choice behavior. This will result in segment-specific
utility parameters.

9.4.5.3 Simulations Based on Utility Parameters


Often we want to use the results of a CBC analysis to predict the choice probabilities for
products that were not part of the experimental design. We can compute the choice prob-
abilities for any possible combination of attribute levels based on the estimated utility
parameters. Since we used a very small example to illustrate CBC analysis, we refrain
from a detailed simulation of choice probabilities.

9.4.5.4 Conclusion
Overall, CBC analysis and conjoint analysis share the same steps. Yet, there are also fun-
damental differences as described in this section. While we discussed CBC analysis in
some detail, it is not the only variation of conjoint analysis that has established itself
in practice. In Sect. 9.5.2, we will briefly elaborate on other developments to give the
reader an idea of the manifold methodologies available for measuring and mapping con-
sumer preferences.

9.5 Recommendations

9.5.1 Recommendations for Conducting a (Traditional) Conjoint Analysis

In conclusion, the following recommendations can be given for using a conjoint analysis.

Attributes and Attribute Levels


The number of attributes and attribute levels should be kept as small as possible. The
attributes and levels need to be independent and relevant. Additionally, the attribute lev-
els must be feasible and realistic. It is recommended not to use more than 6 attributes
with 3 to 4 levels each.

Survey Design
The survey design should not include more than 20 stimuli. If this number is exceeded in
the full factorial design, a reduced design should be created using the full-profile method.

Evaluation of Stimuli
The evaluation method needs to be determined based on the concrete research question.

Specification of the Utility Function


If the reduced design contains only a few stimuli and the partworth model is used to link
the attribute levels to total utility, the number of degrees of freedom for the individual-level
estimation might be very small or even zero. In such a case, the observed preference data
can be mapped perfectly, but only a limited validity of the results can be assumed. Thus, the
researcher needs to consider the degrees of freedom when specifying the utility function.

Aggregation of the Utility Parameters


In conjoint analysis, an aggregation (or joint analysis) of all respondents is only appro-
priate if the utility structure is similar across all respondents. If there is reason to believe
that the consumers may be classified into various groups according to their preferences,
a cluster analysis can be used to obtain segment-specific utility parameters (cf. Chap. 8 ).
In CBC analysis, we can use advanced estimation techniques to derive utility parameters
at the segment or individual level.
All these recommendations need to be carefully evaluated with regard to the specific
setting.

9.5.2 Alternatives to Conjoint Analysis

Conjoint analysis has several limitations that have been addressed by further develop-
ments of the methodology. The two main limitations are the way respondents evaluate
the stimuli (e.g., using ordinal or metric scales) and that only a limited number of attrib-
utes and attribute levels can be considered. The following methods address these limita-
tions next to CBC analysis.

MaxDiff Method
We have already discussed CBC analysis as one alternative to conjoint analy-
sis to address the criticism that the evaluation task in conjoint analyses is unrealistic.
Another approach that addresses this weakness is the MaxDiff method (cf. Louviere
et al., 2015).
MaxDiff is a technique that might be considered a more sophisticated extension of the
method of paired comparisons. With MaxDiff, respondents are shown a set (subset) of
the possible stimuli in the study and are asked to indicate (among this subset, with a min-
imum of three stimuli) the best and worst stimulus. MaxDiff assumes that respondents
evaluate all possible pairs of stimuli within the displayed subset and choose the pair that
reflects the maximum difference in preference.

Adaptive Conjoint Analysis


Another limitation of conjoint analysis is that the number of attributes and attribute
levels needs to be limited. To address this limitation, adaptive conjoint analysis (ACA)
has been developed (cf. Green et al., 1991). The main characteristic of ACA is that the
considered attributes and attribute levels can differ across individuals. In a first step,
the researcher identifies those attribute levels that are relevant for each respondent by
eliciting the most and least preferred levels. This approach allows considering up to 30
attributes and 15 attribute levels. Thus, ACA is useful if there is heterogeneity across
consumers with regard to which attributes are relevant for the considered object. In this
case, the ultimate evaluation task differs across respondents. This makes it necessary to
use an online survey since the experimental design is adapted to a respondent’s answers.

One limitation of ACA that results from the adaptive approach is that the results across
individual respondents may be difficult to compare.

Adaptive CBC Analysis


CBC analysis also suffers from the limitation that only a small number of attributes
and attribute levels can be considered. Moreover, the choice tasks can become repetitive and
boring.
Adaptive CBC (ACBC) analysis addresses these issues by using a more engaging
sequence of tasks. An ACBC analysis involves the following steps:

1. Build-Your-Own (BYO) task: Respondents answer a 'Build-Your-Own' question that
introduces the attributes and attribute levels. They indicate their most preferred
level for each attribute.
2. Screening: Respondents evaluate subsets of stimuli in a series of screening tasks. Instead of
indicating which stimulus they would buy, respondents state which one(s) they would
consider buying (i.e., 'a possibility' or 'not a possibility'). The answers are analyzed
on the fly and potentially 'unacceptable' attribute levels are identified based on the
evaluations. Respondents are then asked whether these attribute levels are actually
unacceptable. Moreover, 'must-have' levels can be identified in the same way. Must-have and
unacceptable levels define the exclusion criteria (cf. Sect. 9.2.2.1).
3. Choice tasks: Finally, respondents evaluate stimuli that are close to their BYO-
specified product, that they marked as 'possibilities', and that strictly conform to
the cutoff ('must-have', 'unacceptable') rules (see the sketch after this list). In each choice set,
respondents choose the most preferred stimulus. The gathered data are then analyzed to
estimate the utility parameters.
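The following minimal Python sketch shows how the cutoff rules from the screening step might be applied when assembling the stimuli for the final choice tasks; the cutoff rules and the candidate stimuli are hypothetical.

# Hypothetical results of the BYO and screening steps
unacceptable = {("brand", "C")}       # attribute levels marked as unacceptable
must_have = {("package", "bottle")}   # attribute levels marked as must-have

# Hypothetical candidate stimuli generated close to the BYO-specified product
candidates = [
    {"brand": "A", "price": "2.99", "package": "bottle"},
    {"brand": "B", "price": "2.49", "package": "bottle"},
    {"brand": "C", "price": "1.99", "package": "bottle"},
    {"brand": "A", "price": "2.49", "package": "can"},
]

def conforms(stimulus):
    # Reject stimuli that contain an unacceptable level or miss a must-have level
    if any((attr, level) in unacceptable for attr, level in stimulus.items()):
        return False
    return all(stimulus.get(attr) == level for attr, level in must_have)

choice_task_stimuli = [s for s in candidates if conforms(s)]
print(choice_task_stimuli)  # only the first two candidates enter the choice tasks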

All three of the methods discussed above (MaxDiff, ACA, and ACBC) have been implemented by
Sawtooth Software in its commercial software packages. Other commercial software packages
are also available, and some R packages exist as well. The wide availability of such further
developments and software packages illustrates the high relevance of the conjoint methodology
for research and practice.

References

Addelman, S. (1962a). Orthogonal main-effect plans for asymmetrical factorial experiments.
Technometrics, 4(1), 21–46.
Addelman, S. (1962b). Symmetrical and asymmetrical fractional factorial plans. Technometrics,
4(1), 47–58.
Dhar, R. (1997). Consumer preference for a no-choice option. Journal of Consumer Research,
24(2), 215–231.
Eliason, S. R. (1993). Maximum likelihood estimation: Logic and practice. Sage Publications.

Green, P. E., & Srinivasan, V. (1978). Conjoint analysis in consumer research: Issues and outlook.
Journal of Consumer Research, 5(2), 103–123.
Green, P. E., Krieger, A. M., & Agarwal, M. K. (1991). Adaptive conjoint analysis: Some caveats
and suggestions. Journal of Marketing Research, 28(2), 215–222.
Green, P. E., Krieger, A. M., & Wind, Y. (2001). Thirty years of conjoint analysis: Reflections and
prospects. Interfaces, 31(3), 56–73.
Gustafsson, A., Herrmann, A., & Huber, F. (Eds.). (2007). Conjoint measurement: Methods and
applications. Springer.
Haaijer, R., Kamakura, W. A., & Wedel, M. (2001). The ‘no-choice’ alternative to conjoint choice
experiments. International Journal of Market Research, 43(1), 93–106.
Johnson, R. M., & Orme, B. K. (1996). How many questions should you ask in choice-based
conjoint studies? Research Paper Series. Sawtooth Software.
https://www.sawtoothsoftware.com/download/techpap/howmanyq.pdf. Accessed 19 Sept 2020.
Kuhfeld, W. F., Tobias, R. D., & Garratt, M. (1994). Efficient experimental design with marketing
research applications. Journal of Marketing Research, 31(4), 545–557.
Kumar, V., & Gaeth, G. J. (1991). Attribute order and product familiarity effects in decision tasks
using conjoint analysis. International Journal of Research in Marketing, 8(2), 113–124.
Louviere, J. J., & Woodworth, G. (1983). Design and analysis of simulated consumer choice or
allocation experiments: An approach based on aggregated data. Journal of Marketing Research,
20(4), 350–367.
Louviere, J. J., Flynn, T., & Marley, A. A. J. (2015). Best-Worst Scaling. Cambridge University
Press.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka
(Ed.), Frontiers in econometrics (pp. 105–142). Academic Press.
Swait, J., & Louviere, J. (1993). The role of the scale parameter in the estimation and comparison
of multinomial logit models. Journal of Marketing Research, 30(3), 305–314.
Train, K. (2009). Discrete choice methods with simulation. Cambridge University Press.
Verlegh, P. W. J., Schifferstein, H. N. J., & Wittink, D. R. (2002). Range and number-of-levels
effects in derived and stated measures of attribute importance. Marketing Letters, 13(1), 41–52.
Vriens, M., Oppewal, H., & Wedel, M. (1998). Ratings-based versus choice-based latent class
conjoint models. International Journal of Market Research, 40(3), 1–11.
Wittink, D. R., Vriens, M., & Burhenne, W. (1994). Commercial use of conjoint analysis in
Europe: Results and critical reflections. International Journal of Research in Marketing, 11(1),
41–52.
Index

A C
Adaptive choice-based-conjoint (ACBC), 598 Calinski&Harabasz rule, 489
Adaptive conjoint analysis (ACA), 597 Causality, 40
Agglomeration schedule. See Cluster analysis causal diagram, 100
Akaike information criterion, 328 regression analysis, 69, 99, 129
Alpha factoring, 411 Central limit theorem, 26
Analysis of variance. See ANOVA Chi-square statistic
ANCOVA, 194 Cluster analysis, 512
ANOVA, 12, 148 Contingency analysis, 360
multivariate, 192 Pearson, 323
one-way, 150 Choice-based conjoint analysis (CBCA), 578
two-way, 166 Choice rules (Conjoint), 561
ANOVA table, 82, 162, 177, 182 City block metric (L1-norm), 464
Anti-image covariance matrix, 392 Classification
Area under curve. See ROC curve a posteriori probability, 231
Autocorrelation, 91, 106 a priori probability, 228
Average (arithmetic mean), 15 conditional probability, 232
Average linkage. See Clustering algorithms function, 229
matrix, 222
score, 229
B table, 283
Bandwidth effect, 558 Cluster analysis, 14, 454
Bartlett test, 389, 429 Agglomeration schedule, 471, 476, 487
Baseline logit model, 320 Hierarchical agglomerative, 470, 474
Bayesian information criterion, 328 Partitioning, 523
Bayes theorem, 231 Cluster center analysis. See K-means clustering
Beta coefficients, 75 Clustering algorithms, 469, 471
Binary variable, 9 Centroid clustering, 471
BLUE characteristics, 92 Complete linkage, 477
Bonferroni test, 165 K-means, 490, 493, 503, 523
Boxplot, 45 Median clustering, 471
Bradley-Terry-Luce rule (BTL rule), 562 Single linkage, 474, 493


Two-step, 527 Cost of misclassification, 228, 234


Ward‘s method, 478, 498 Covariance, 22
Cluster interpretation Covariance analysis. See ANCOVA
F-values, 492, 507 Covariate pattern, 325
t-values, 492, 507 Covariates, 196, 270
Cluster numbers, stopping rules Cox & Snell R2, 303
Calinski&Harabasz, 489 Cramer’s V, 363, 370, 378
Elbow criterion, 488, 502 Critical test value, 29
Test of Mojena, 490, 503 Cross-loadings, 417
Coefficients, 58 Cross-sectional data, 5
Communality, 405, 408 Cross tabulation. See Contingency analysis
Complete linkage. See Clustering algorithms Cutoff value, 280
Confidence interval, 38, 89
Confirmatory factor analysis, 447
Confounding variables, 41, 98, 365 D
Confusion matrix, 283 Decomposition of variation, 78
Conjoint analysis, 534 Degrees of freedom, 20
Adaptive, 597 Dendrogram, 471, 482, 499
Adaptive CBC, 598 Determination coefficient. See R-square
Choice-based, 13, 535, 578 Deviance, 326
MaxDiff method, 597 Dichotomous variable, 9
Traditional, 13, 535 Discrete choice analysis, 576
Conjoint attributes, 536 Discriminant analysis, 12, 204, 508
Conjoint design Discriminant
asymmetric, 543 axis, 211, 217
full factorial, 543 coefficient, mean, 227
Latin square, 543 coefficient, standardized, 226
orthogonal, 543 criterion, 212
reduced, 543 Discriminant function
symmetric, 543 canonical, 208
utility balance, 583 Fisher’s, 229, 249
Conjoint stimuli, 540 Dollar metric, 546
evaluation, 545 Dummy variable, 9, 137, 272
Constant sum scale, 546 Durbin-Watson test, 107
Contingency analysis, 13, 354
Contingency coefficient, 362
Contrast analysis, 164 E
Cook’s distance, 120 Effect coefficient, 296
Correlation, 23 Eigenvalue, 219, 220, 406
Bravais Pearson, 23, 467 Eigenvalue criterion, 414
canonical, 220 Elbow criterion, 488, 502
causal interpretation, 42 Error probability, 29
graphical representation, 394, 398 Error sum of squares. See Variance criterion
multiple, 78, 111 Error term, 80, 91
spurious, 41, 100 Error variance, 408
Correlation matrix, 24, 73 Eta-squared
empirical, 386 of ANOVA, 157, 175, 185
model-theoretical, 402 partial, 176
Correspondence analysis, 378 Euclidean distance (L2-norm), 462

Exact Fisher test, 361 I


Expected number of observations, 359 Ideal point model, 551
Experiment, 6 Image factoring, 411
Experimental design, 6, 152 Interaction effects
balanced, 153, 196 analysis of variance, 169
unbalanced, 187 regression analysis, 94
within ANOVA, 148 Interaction term, 95
within Conjoint analysis, 540 Interval scale, 7
Extraneous variables, 365

J
F Jaccard coefficient, 516
Factor analysis, 382
confirmatory, 14
exploratory, 13 K
Factor extraction methods, 410 Kaiser criterion. See Eigenvalue criterion
Factorial design, 167 Kaiser-Meyer-Olkin (KMO) criterion, 390
Factor levels of ANOVA, 148 K-means clustering. See Clustering algorithms
Factor loading matrix
unrotated, 415
Factor loadings, 396 L
Factor rotation, 418 LAD method, 72
Factor score coefficients, 420, 439 Lambda coefficient, 363, 370, 378
Factor score determination Latent variable, 5
Regression method, 420 Latin square, 543
Summated scales, 420 Least-squares method, 71, 92
Surrogates, 419 Leave-one-out method, 224, 304
Factor scores, 419 Level of significance (sig), 29, 34
First-choice rule, 561 Levene test, 163, 165, 181, 197
F-test, 82, 159, 164, 225 Leverage effect, 117, 307
Full-profile method, 541, 581 Likelihood ratio
statistic, 300
test, 301, 309, 593
G Linear probability model, 276
GLS method, 411 Linear trend model, 139
Goldfeld-Quandt test, 105 Link function, 271
Goodman and Kruskal’s lambda, 363 Logistic function, 268
Goodman and Kruskal’s tau, 363 Logistic regression, 266
Goodness-of-fit, 77, 83, 299, 323 binary, 268
Grouping variable, 204 multinomial, 314
multiple, 287
Logit, 270, 296
H Logit choice model, 348
Heteroscedasticity, 104 Logit rule, 562
Histogram, 45 Log-likelihood function, 290
Hit rate, 222, 283 Log-linear model, 378
Homoscedasticity, 91, 163 Longitudinal data, 5
Lurking variables, 98

M Omnibus hypothesis, 163


Mahalanobis distance, 229 Ordinal scale, 7
MANCOVA, 192 Ordinary least squares (OLS), 71
Manifest variable Outliers, 44
manifest, 4 within cluster analysis, 481
Manipulation check, 200 within discriminant analysis, 304
MANOVA, 192 within regression analysis, 113
MaxDiff method, 597
Maximum likelihood
method, 289, 318, 411, 588 P
principle, 289 Paired comparison, 545
McFadden’s R2, 302 Parameters, 58
Measurement level, 521 Parsimony, 84
Measure of sampling adequacy (MSA), 391 Partworth, 535, 548
Median clustering. See Clustering algorithms Pearson residuals, 304
Method of Glesjer, 105 Pearson’s chi-square statistic. See Chi-square
Metric scale, 6 statistic
Minkowski metric (L-norms), 466 Phi coefficient, 362, 370, 378
Missing values, 48 Phi-square statistic (Cluster analysis), 514
Multiple Imputation, 50 Point-biserial correlation, 10
System, 48 Population, 4
User, 48 Post-hoc test, 165, 187
Multicollinearity, 110, 236 Power of a test, 34
Multidimensional contingency table, 378 P-P plot, 109
Multinomial logit model, 586 Prediction error, 142
Multiple Imputation, 50 Predictive accuracy, 283
Multivariate analysis Principal axis factoring (PAF), 410
Structure discovering methods, 13 Principal component analysis (PCA), 403
Structure testing methods, 11 Probabilistic rule, 562
Term and overview, 10, 14 Proximity measures, 461, 511
Distance measures, 460, 468
Similarity measures, 460, 468
N Pseudo-R-square statistics, 302
Nagelkerke’s R2, 303 P-value, 32, 82
No-choice option, 581
Nominal scale, 7
Non-linearity, 95 Q
Non-metric scale, 6 Q-Q plot, 109
Non-normality, 108
Null model, 184, 300
Number-of-levels effect, 558 R
Random error, 26
Ranking scale, 545
O Rating scale, 8, 545
Oblique factor rotation, 418 Ratio scale, 8
Observational data, 6 Regression analysis, 56
Odds, 270, 294 logistic, 12
ratio, 296, 297 multiple, 60, 73
Omitted variable, 97 multivariate, 144

polynomial, 94 Statistical testing, 26


simple, linear, 11, 59 Stochastic model
stepwise, 133, 343 of ANOVA, 151, 167
with dummies, 137 of regression analysis, 80
with time-series data, 138 Sum of squared residuals (SSR), 70, 77
Regression coefficients Surrogates, 419
partial, 74 Systematic component, 268
standardized, 75 Systematic error, 26
Regression effect, 68 System missing values, 48
Regression to the mean, 103
Relative risk, 298
Residuals, 69, 304 T
standardized, 115 Tau coefficient, 370, 378
studentized, 119 Test for
ROC curve, 285 a proportion, 36
R-square, 78 homogeneity, 354
adjusted, 78, 83 independence, 354
Russel and Rao coefficient, 519 the mean, 27
Test of
causality, 41
S Mojena, 490
Sample data, 4 Theil’s U, 370
Saturated model, 327 Ties, 546
Scale level Time series analysis, 138
interval, 7 Time series data, 5, 138
metric, 6 Tolerance, 111
nominal, 7 Total utility (Conjoint analysis), 535, 548, 585
non-metric, 6 Trade-off method, 541
ordinal, 7 T-Test
ratio, 8 one-tailed, 35
transformation, 521 two-tailed, 27, 29
Scatterplot, 47 within ANOVA, 87
Scheffe test, 165, 187 within logistic regression analysis, 308
Scree plot, 406, 488 within regression analysis, 87
Scree test, 414, 434 Tukey test, 165, 187
SD line, 68 Two-step clustering. See Clustering algorithms
Sensitivity, 283 Type I error, 33
Significance level. See Level of significance Type II error, 33
Simple matching coefficient, 516
Single linkage. See Clustering algorithms
Specificity, 283 U
Specific Variance (factor analysis), 408 ULS method, 411
Split-half method, 224 Uncertainty coefficient, 370
Standard deviation, 17 Unique variance, 408
Standard error (SE), 86, 120, 142 Unit variance, 400, 408
of the coefficient, 85, 112 User missing values, 48
of the regression, 77 U-statistic. See Wilks’ lambda, univariate
Standardized variable, 21

V Vector model, 550


Variable
binary, 9
dichotomous, 9 W
dummy, 9 Wald test, 309
latent, 5 Ward’s method. See Clustering algorithms
standardized, 20 Wilks' lambda, 226
Variance, 17 Wilks’ lambda
decomposition (ANOVA), 155, 172 multivariate, 221
decomposition (factor analysis), 408 univariate, 220
homogeneity, 163
Variance criterion, 478
Variance inflation factor, 112 Y
Variation between groups, 213 Yates’ corrected chi-square test, 361
Variation within groups, 213
Varimax factor rotation, 418
