1. Go to https://flashcards.springernature.com/login
2. Create a user account by entering your e-mail address and assigning a password.
3. Use the link provided in one of the first chapters to access your SN Flashcards set.
If the link is missing or does not work, please send an e-mail with the subject
“SN Flashcards” and the book title to customerservice@springernature.com.
Klaus Backhaus · Bernd Erichson ·
Sonja Gensler · Rolf Weiber · Thomas Weiber
Multivariate Analysis
An Application-Oriented Introduction
Second Edition
Klaus Backhaus
University of Münster
Münster, Nordrhein-Westfalen, Germany

Bernd Erichson
Otto-von-Guericke-University Magdeburg
Magdeburg, Sachsen-Anhalt, Germany

Thomas Weiber
Munich, Bayern, Germany
English Translation of the 17th original German edition published by Springer Fachmedien Wiesbaden,
Wiesbaden, 2023
© Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2021, 2023
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage
and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors
or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
This Springer Gabler imprint is published by the registered company Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature.
The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany
Preface 2nd Edition
• The latest version of SPSS (version 29) was used to create the figures.
• Errors in the first edition, which occurred despite extremely careful editing, have been corrected. We are confident that we have now fixed (almost) all major mistakes. A big thank you goes to retired math and physics teacher Rainer Obst, Supervisor of the Freiherr vom Stein Graduate School in Gladenbach, Germany, who read both the English and German versions very meticulously and uncovered quite a few inconsistencies and errors.
• We made one major change in the chapter on cluster analysis: the application example was adapted to improve the plausibility of the results.
Also for the 2nd edition, research assistants have supported us energetically. Special thanks go to the research assistants at the University of Trier, Mi Nguyen, Lorenz Gabriel and Julian Morgen. They updated the literature, helped to correct errors and adjusted all SPSS figures. They were supported by the student Sonja Güllich, who was instrumental in correcting figures and creating SPSS screenshots. The coordination of the work among the authors as well as with the publisher was again taken over by research assistant Julian Morgen, who tirelessly and patiently accepted change requests and implemented them quickly. Last but not least, we would like to thank Barbara Roscher and Birgit Borstelmann from Springer Verlag for their competent support.
Springer Gabler also provides a set of electronic learning cards (so-called “flashcards”) to help readers test their knowledge. Readers may individualize their own learning environment via an app and add their own questions and answers. Access to the flashcards is provided via a code printed at the end of the first chapter of the book.
We are pleased to present this second edition, which is based on the current version
of IBM SPSS Statistics 29 and has been thoroughly revised. Nevertheless, any remaining
mistakes are of course the responsibility of the authors.
The Excel formulas in this book use semicolons as argument separators, as is customary in regions that use a decimal comma. If the regional language setting of your operating system is set to a country that uses periods as decimal separators, please use commas instead of semicolons in Excel formulas (e.g., =SUM(A1;A2) becomes =SUM(A1,A2)).
Preface
This is the first English edition of a German textbook that covers methods of multivariate analysis. It addresses readers who are looking for a reliable and application-oriented
source of knowledge in order to apply the discussed methods in a competent manner.
However, this English edition is not just a simple translation of the German version;
rather, we used our accumulated experience gained through publishing 15 editions of the
German textbook to prepare a completely new version that translates academic statistical
knowledge into an easy-to-read introduction to the most relevant methods of multivariate
analysis and is targeted at readers with comparatively little knowledge of mathematics
and statistics. The new version of the German textbook with exactly the same content is
now available as the 16th edition.
For all methods of multivariate analysis covered in the book, we provide case studies which are solved with IBM’s software package SPSS (version 27). We follow a step-by-step approach to illustrate each method in detail. All examples and case studies use the chocolate market as an example because we assume that every reader will have some affinity to this market and a basic idea of the factors involved in it.
This book constitutes the centerpiece of a comprehensive service offering that is currently being realized. On the website www.multivariate-methods.info, which accompanies this book and offers supplementary materials, we provide a wide range of support services for our readers:
• For each method discussed in this book, we created Microsoft Excel files that allow
the reader to conduct the analyses with the help of Excel. Additionally, we explain
how to use Excel for many of the equations mentioned in this book. By using Excel,
the reader may gain an improved understanding of the different methods.
• The book’s various data sets, SPSS jobs, and figures can also be requested via the
website.
• While in the book we use SPSS (version 27) for all case studies, R code (www.r-project.org) is also provided on the website.
• In addition to the SPSS syntax provided in the book, we explain how to handle SPSS
in general and how to perform the analyses.
We hope that these initiatives will improve the learning experience for our readers. Apart from the offered materials, we will also use the website to inform about updates and, where necessary, to point out corrections.
The preparation of this book would not have been possible without the support of our
staff and a large number of research assistants. On the staff side, we would like to thank
above all Mi Nguyen (MSc. BA), Lorenz Gabriel (MSc. BA), and Julian Morgen (M. Eng.) of the University of Trier, who supported us with great meticulousness. For creating, editing and proofreading the figures, tables and screenshots, we would like to thank
Nele Jacobs (BSc. BA). Daniela Platz (BA), student at the University of Trier, provided
helpful hints for various chapters, thus contributing to the comprehensibility of the text.
We would also like to say thank you to Britta Weiguny, Phil Werner, and Kaja Banach
of the University of Münster. They provided us with help whenever needed. Heartfelt
thanks to Theresa Wild and Frederike Biskupski, both students at the University of
Münster, who provided feedback to improve the readability of this book.
Special thanks go to Julian Morgen (M. Eng.), who was responsible for the entire process of coordination between the authors and Springer Gabler. Not only did he tirelessly and patiently implement the requested changes, he also provided assistance with questions concerning the structure and layout of the chapters.
Finally, we would like to thank Renate Schilling for proofreading the English text and
making extensive suggestions for improvements and adaptations. Our thanks also go to
Barbara Roscher and Birgit Borstelmann of Springer Gabler who supported us continuously with great commitment. Of course, the authors are responsible for all errors that may still exist.
www.multivariate-methods.info

Methods
We provide supplementary and additional material (e.g., examples in Excel) for each
method discussed in the book.
FAQ
On this page, we post frequently asked questions and the answers.
Forum
The forum offers the opportunity to interact with the authors and other readers of the
book. We invite you to make suggestions and ask questions. We will make sure that you
get an answer or reaction.
Service
Here you may order all tables and figures published in the book as well as the SPSS data and syntax files. Lecturers may use the material in their classes if the source of the material is appropriately acknowledged.
Corrections
On this page, we inform the readers about any mistakes detected after the publication
of the book. We invite all readers to report any mistakes they may find on the Feedback
page.
Feedback
Here the authors invite the readers to share their comments on the book and to report any
ambiguities or errors by sending a message directly to the authors.
Professur für Marketing und Innovation
Univ.-Prof. Dr. Rolf Weiber
Universitätsring
Trier, Germany
Contact: mva@uni-trier.de

Sender:
____________________________
____________________________
____________________________
____________________________
E-Mail: ___________________________ Phone: __________________

Subject: Multivariate Analysis

Herewith I order:
– all datasets and SPSS syntax files for all methods covered in this book, at a price of EUR …
– the complete set of figures for all methods covered in this book, at a price of EUR …
– the set of figures as read-only PowerPoint files, at a price of EUR … each, for the following chapters: _____________________

_____________________ ________________________
Date Signature

You may also place your order on our websites
www.multivariate-methods.info or www.multivariate.de
or per mail to mva@uni-trier.de
1 Introduction to Empirical Data Analysis
With the purchase of this book, you can use our “SN Flashcards” app to access
questions free of charge in order to test your learning and check your understanding
of the contents of the book. To use the app, please follow the instructions below:
1. Go to https://flashcards.springernature.com/login
2. Create a user account by entering your e-mail address and assigning a password.
3. Use the following link to access your SN Flashcards set: https://sn.pub/hhdvdz
If the code is missing or does not work, please send an e-mail with the subject “SN
Flashcards” and the book title to customerservice@springernature.com.
This introductory chapter introduces, characterizes and classifies the methods of multivariate data analysis covered in this book. It also presents the fundamentals of empirical data analysis that are relevant to all these methods. Most readers will be familiar with the fundamentals of statistical data analysis, so this chapter is primarily intended as a recapitulation of important aspects of quantitative data analysis.
This chapter comprises six sections:
Section 1.1 elaborates on the objectives of this book and summarizes the fundamentals of empirical research. Data are the ‘raw material’ for multivariate methods. Thus, we
first describe the various types of data and their measurement levels. Furthermore, the
methods discussed in this book are characterized and classified.
Section 1.2 explains the basics of statistics (i.e., mean, variance, standard deviation,
covariance, and correlation) and illustrates how to compute some central statistical
measures.
Section 1.3 elaborates on the principles of statistical testing, the choice of the significance level and the use of the so-called p-value, thus laying the foundation for understanding the different statistical tests used in this book.
Section 1.4 explains the concept of causality in the context of multivariate analysis
which aims to describe, explain and predict real-life phenomena. While causality is not a
statistical concept, it is crucial for the interpretation of statistical results.
Section 1.5 deals with the problem of outliers and missing values in empirical data.
We discuss how to detect outliers and what influence they may have on the results of
empirical studies. Moreover, we illustrate the options provided by IBM SPSS for handling missing values.
Section 1.6 introduces the IBM SPSS statistical software package. The SPSS procedures used throughout this book are briefly presented. Users of R are referred to www.multivariate-methods.info.
1.1 Multivariate Data Analysis: Overview and Basics
This book deals with methods of statistical data analysis that examine several variables
simultaneously and quantitatively analyze their relationships with the aim of describing
and explaining these relationships or predicting future developments. These methods are
collectively referred to as multivariate analysis techniques. Sometimes, only methods
of analysis that consider several dependent variables are explicitly referred to as multivariate. This applies, for example, to multivariate regression analysis (cf. Sect. 2.4.3) or multivariate analysis of variance (cf. Sect. 3.4.1). In this book we follow the broad understanding of the term, which regards bivariate analyses (that consider only two variables at a time) as the simplest form of multivariate data analysis. However, the user should be aware that in practice the interrelationships are usually much more complex and require the consideration of more than just two variables. A more detailed differentiation of the various methods of multivariate analysis is provided in Sect. 1.1.3.
Today, methods of multivariate analysis are one of the foundations of empirical
research in science. So, not surprisingly, the methods are still undergoing rapid development: new methodological advancements are constantly being made, new areas of application are being opened up, and new or improved software packages are being developed. Despite their importance, beginners often hesitate to apply methods of multivariate analysis, mostly for the following reasons:
This book aims to address all these issues. To this end, we followed the following princi-
ples when writing this book:
1. We took great care to present the various methods in an easy and straightforward manner. Instead of providing many methodological details, we focused on easy comprehensibility. The guiding principle for each chapter was to focus on the fundamentals of each method and its application. We discuss statistical details only if necessary to facilitate understanding. We explain the computations in detail, so readers will get a good idea of how the methods work as well as their possible applications and limitations.
2. Short examples are used to support our explanations and to facilitate understanding. A
more extensive case study is presented at the end of each chapter to discuss the implementation of the different methods with the help of IBM SPSS statistical software.
3. We use simple examples from the chocolate market that are related to management
questions but are also easy to follow for non-business students or readers from other
disciplines.
4. For the case studies, we use IBM SPSS Statistics (SPSS for short). SPSS has become
very widely used, especially in higher education, but also in practice and research,
because of its ease of use. For demonstration we use the graphical user interface (GUI)
of SPSS. But we also offer the SPSS syntax commands at the end of each chapter so
readers can easily repeat the analysis (maybe with changed data or testing different
model specifications). All case study data may be ordered and downloaded from our
website. We show and explain the IBM SPSS output files in detail for each method.
5. For the case studies in Chaps. 4, 5, 7, and 8, we use the same data set to demonstrate
the similarities and differences between the different methods. In cases where we use a
different data set, we still stick to the chocolate market and related research questions.
6. For the calculation of smaller examples, there will be Excel files provided on our
website www.multivariate-methods.info.
7. For readers using R, we provide additional material for the case studies on our website (www.multivariate-methods.info).
Overall, this book is targeted at novices and amateur users who would like to apply the methods of multivariate analysis more knowledgeably. Therefore, all methods are explained independently of each other, i.e., the different chapters may be read individually and in any order.
Empirical research involves the collection of data and their evaluation using qualitative
or quantitative methods. The primary objectives of empirical studies are:
In most cases, empirical studies are based on some part of a population (sample) that is
used to draw conclusions about the entire population. The sample may be described by
certain statistical parameters, but the question is: What are the ‘true’ parameters in the
population? Using the sample data, central parameters can be estimated for the population, which is explained in more detail in Sects. 1.2 and 1.3.
The data collected in the course of empirical studies usually refer to various attributes
and their manifestations. The numerically coded attributes of objects are called variables
and are usually denoted by letters (e.g., X, Y, Z). Their values express certain characteristics of subjects or objects. Therefore, a variable varies across the considered subjects
(objects) and possibly also over time.
Manifest variables are variables that may be directly observed or measured (e.g., weight,
height, hair color, sex, age, price, quantities, and income). In contrast, latent variables
are variables that cannot be directly observed, but are assumed to be related to manifest
(observable) variables. Hypothetical constructs such as trust, motivation, intelligence,
illness, or brand image are regarded as latent variables. To measure latent variables, suitable operationalizations using manifest variables are required. This means the value of a latent variable is derived from various observed variables (compositional approach). Let us take the construct “intelligence”, for example, which is measured by a linear combination of various observed variables. The measurement of observable variables can be performed at different levels (so-called scales) (cf. Sect. 1.1.2.1).
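This compositional approach can be sketched in a few lines of Python (the book itself works with SPSS, Excel and R); the indicator scores and weights below are purely hypothetical and serve only to illustrate the weighted linear combination:

```python
# Illustrative sketch: a latent construct ("intelligence") operationalized as a
# weighted linear combination of manifest indicator scores (compositional approach).
# The indicators and weights are hypothetical, chosen only for illustration.

def latent_score(indicators, weights):
    """Compute a latent-variable score as a weighted sum of observed indicators."""
    if len(indicators) != len(weights):
        raise ValueError("one weight per indicator required")
    return sum(x * w for x, w in zip(indicators, weights))

# Hypothetical test scores for one respondent (verbal, numeric, spatial tasks)
scores = [70.0, 85.0, 60.0]
weights = [0.4, 0.4, 0.2]   # assumed weights summing to 1

print(round(latent_score(scores, weights), 2))  # 74.0
```

How the weights are obtained (e.g., from a factor analysis) is a separate question, taken up in the chapter on factor analysis.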
In the majority of methods, a distinction is made between dependent and independent
variables (Table 1.1). Such a division is necessary if changes in some observed variable
(y-variable) are supposed to be explained by changes in another variable (x-variable). It
is assumed that a y-variable is dependent on the x-variables. In this case, we speak of
analysis of dependence or structure-testing methods (cf. Sect. 1.1.3.1).
The various dependency analyses may be distinguished according to the measurement
levels of the dependent and independent variables. A categorization of the methods along
these lines is discussed in Sect. 1.1.2.1. In so-called interdependency analyses, the variables are not subdivided a priori. In this book, we discuss the following interdependency
analyses: factor analysis and cluster analysis. Sect. 1.1.3.2 briefly explains the basic idea
of these two methods.
Data are the ‘raw material’ of multivariate data analysis. In empirical research, we distinguish between different types of data. Examples of qualitative (nominally scaled) attributes include:
• gender (male—female—diverse),
• religion (catholic—protestant—…),
• color (red—yellow—green—blue—…),
• advertising channel (television—newspapers—billboards—…),
• sales area (north—south—east—west).
Nominal scales serve only for the identification of qualitative attributes. If the attribute
‘color’, for example, occurs in three variants, it can be coded as follows:
• red = 1,
• yellow = 2,
• green = 3.
The choice of numbers to identify the colors is arbitrary, i.e., other numbers (e.g., 5, 3,
7) could have been chosen just as well. The only condition is that different numbers are
assigned to different attributes and identical numbers to identical attributes (one-to-one
correspondence). For nominal variables, no arithmetic operations (such as addition, subtraction, multiplication, or division) are permissible. It is obvious that a mean value of 2.0 for the three colors mentioned above would not be meaningful. However, we can infer frequencies or percentages by counting how often the values of the different attributes occur in a data set (e.g., 25% red; 40% yellow; 35% green).
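The admissible operations on nominal data, counting frequencies and percentages, can be illustrated with a short Python sketch; the coded observations are made up, while the codes follow the color example above:

```python
# Minimal sketch: the only admissible summary of a nominal variable is counting.
# The color codings (1 = red, 2 = yellow, 3 = green) follow the text; the
# observations themselves are invented.
from collections import Counter

colors = [1, 2, 3, 2, 1, 3, 2, 2, 3, 1]          # coded observations
labels = {1: "red", 2: "yellow", 3: "green"}

counts = Counter(colors)
for code, n in sorted(counts.items()):
    print(f"{labels[code]}: {n / len(colors):.0%}")
```

Running the sketch prints the relative frequency of each color; a mean of the codes, by contrast, would be meaningless.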
The ordinal scale represents the next higher measurement level; it results from ranking. Examples: product A is preferred to product B, or person M is more efficient than person N. The ranking values (1st, 2nd, 3rd etc.) do not say anything about the differences between the objects. It is therefore not possible to state from the ordinal scale by how much product A is preferred to product B. Hence, ordinal data, just like nominal data, cannot be used for arithmetic operations. But it is possible to calculate the median and quantiles, in addition to counting frequencies. Ordinal variables are often treated as nominal (and thus qualitative) variables (e.g., in the case of social classes: lower, middle, upper class).
The interval scale represents the next higher measurement level. This scale implies intervals of equal size. Since the intervals may be infinitely small units (at least theoretically), we also speak of a continuous scale. But for simplicity’s sake, usually only discrete values are used. A typical example is the Celsius scale for measuring temperature, where the distance between the freezing point and the boiling point of water is divided into 100 equally sized (discrete) intervals. With interval-scaled data, the differences between the data points also contain information (e.g., large or small temperature differences), which is not the case with nominal or ordinal data. Since this scale has no natural zero point, we cannot state that 20 °C is twice as warm as 10 °C. But we can compute mean values, standard deviations, or correlations for interval-scaled data.
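A quick numeric check in Python shows why ratio statements fail on an interval scale: the ratio of two temperatures changes under the admissible linear transformation from Celsius to Fahrenheit (the temperature values are chosen only for illustration):

```python
# Why interval scales do not support ratio statements: the ratio of two
# temperatures is not invariant under the (admissible) linear transformation
# from Celsius to Fahrenheit, so "twice as warm" is not a meaningful claim.

def to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

a, b = 20.0, 10.0
print(a / b)                                  # 2.0 in Celsius
print(to_fahrenheit(a) / to_fahrenheit(b))    # 1.36 in Fahrenheit (68 / 50)
```

On a ratio scale such as Kelvin, with its natural zero point, the ratio would be meaningful.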
The ratio scale represents the highest measurement level. It differs from the interval
scale in that there is a natural zero point that can be interpreted as “not present” for the
attribute in question. This applies to most physical characteristics (e.g., length, weight,
speed) and most economic characteristics (e.g., income, costs, prices). In the case
of ratio-scaled data, not only the difference but also the ratio of the data makes sense.
Ratio-scaled data thus allow the application of all arithmetic operations as well as the
calculation of all statistical measures.
In general, we can say that the higher the measurement level, the greater the information content of the data and the more arithmetic and statistical operations can be applied to the data. It is possible to transform data from a higher to a lower scale, but not vice versa. This can be useful for increasing the clarity of the data or for simplifying the analysis. For example, in many cases income classes or price classes are formed. This may involve transforming the originally ratio-scaled data to an interval, ordinal or nominal scale. Of course, transformation to a lower scale always involves a loss of information.
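Such a transformation to a lower scale can be sketched in Python; the income values and class boundaries below are hypothetical:

```python
# Sketch of transforming ratio-scaled data to a lower (ordinal) scale by forming
# classes; the income values and class boundaries are hypothetical. Note the
# loss of information: within a class, the original differences disappear.

def income_class(income, boundaries=(2000, 4000)):
    """Map a ratio-scaled income to an ordinal class: 1 = low, 2 = middle, 3 = high."""
    if income < boundaries[0]:
        return 1
    if income < boundaries[1]:
        return 2
    return 3

incomes = [1500, 2500, 3900, 4100, 800]
print([income_class(x) for x in incomes])   # [1, 2, 2, 3, 1]
```

After the transformation, the incomes 2500 and 3900 are indistinguishable, which illustrates the information loss mentioned above.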
• Assessment scales, which are used to assess, for example, the level of an object’s
quality or performance.
• Importance scales, which are used to assess, for example, the importance of an
attribute.
• Intensity scales, which indicate to what extent (low to high) an attribute is present.
• Agreement scales, which are used to determine the level of agreement or disagreement with a statement (cf. Fig. 1.1).
Fig. 1.1 Example of an agreement scale with six levels, from 1 = “I fully disagree” to 6 = “I fully agree”
If dichotomous variables are coded with 0 and 1, they are called binary variables or
dummy variables. Binary variables can be treated like metric variables. The mean value
of a binary variable indicates the proportion with which the attribute value coded with 1
occurs in a data set. If, for example, ‘purchase’ is coded with 1 and ‘no-purchase’ with 0,
the mean value of 0.75 means that the product was purchased by 75% of the respondents.
Nominal variables with more than two categories cannot be treated like metric variables. But it is possible to replace a nominal variable with several dummy variables that can then be interpreted like metric variables.
Example
A supplier wants to know if the color of packaging influences consumers’ purchase
decisions. Three colors are considered: red, yellow and green.
The nominal variable ‘color’ may be represented by three dummy variables, each
of which may have the value 1 (color present) or 0 (otherwise). For the color red, for
example, the coding of the dummy variable q1 is as follows:
q1 = 1, if color = red
q1 = 0, otherwise
Similarly, a dummy variable q2 can be defined for the color yellow and a dummy
variable q3 for the color green. If only these three colors are considered, one of
the three dummies is redundant: if q1 = 0 and q2 = 0, q3 must be equal to 1. The
three colors can therefore be described by two dummy variables q1 and q2.
Dummy variables may be treated like metric variables. The correlation of dummy variables with metric variables is called a point-biserial correlation and represents a special case of the Pearson correlation (cf. Sect. 1.2.2).1
A problem arises if dummy variables are used for recoding nominal variables with
many categories. This would lead to an implicit weighting, since the number of variables
which ultimately measure the same property (e.g., color) increases. Caution is therefore
required when transforming a nominal variable with many categories. In empirical studies, it should always be stated whether dummy variables were used and how they were assembled.
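The dummy coding described above can be sketched in a few lines of Python (the observations are hypothetical; q1 and q2 follow the notation of the example, and the mean of a 0/1 variable equals the proportion of ones, as stated in the text):

```python
# Sketch of dummy coding the nominal variable 'color' with k = 3 categories into
# k - 1 = 2 binary variables (the third category is implied), using invented data.

colors = ["red", "yellow", "green", "red", "yellow", "yellow"]

q1 = [1 if c == "red" else 0 for c in colors]      # q1: red vs. not red
q2 = [1 if c == "yellow" else 0 for c in colors]   # q2: yellow vs. not yellow
# green is implied by q1 = 0 and q2 = 0, so no third dummy is needed

print(q1)                   # [1, 0, 0, 1, 0, 0]
print(q2)                   # [0, 1, 0, 0, 1, 1]
print(sum(q1) / len(q1))    # mean of the binary variable = share of 'red'
```

The same coding is what makes nominal variables usable in, for example, regression analysis.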
For this book, we selected methods of multivariate analysis that are frequently used in
science and practice and are of central importance for teaching in higher education.
The following methods are covered:
1. Structure-testing methods are procedures of multivariate analysis with the primary goal
of testing relationships between variables. In doing so, the dependence of a variable on
one or more independent variables (influencing factors) is considered. The user has propositions about the relationships between the variables based on logical or theoretical considerations and would like to test these with the help of methods of multivariate analysis.
1 Both SPSS and R use the point-biserial calculation of a correlation if one of the variables has
only two calculation-relevant values.
However, it needs to be stressed that the methods cannot always be assigned exclusively
to the above two categories because sometimes the objectives of the different procedures
may overlap.
1.1.3.1 Structure-testing Methods
Structure-testing procedures are primarily used to carry out causal analyses in order to
establish cause and effect, for example, whether and to what extent the weather, soil conditions, and different fertilizers have an effect on crop yield or how strongly the demand
for a product depends on its quality, price, advertising, and consumer income.
A prerequisite for applying these methods is that the user develops a priori (i.e., in
advance) hypotheses about the causal relationship between the variables. This means
that the user already knows or suspects which variables might affect another variable.
In order to test this hypothesis, the variables are usually categorized into dependent and
independent variables. The methods of multivariate analysis are used to test the propositions based on collected data. According to the measurement levels of the variables, the
basic structure-testing procedures can be characterized as shown in Table 1.3.
Regression Analysis
Regression analysis is a very flexible method and of great importance for describing and
explaining relationships as well as for making predictions. It is therefore certainly one
of the most important and most frequently used methods of multivariate analysis. In particular, it can be used to investigate relationships between a dependent variable and one
or more independent variables. With the help of regression analysis, such relationships
can be quantified, hypotheses can be tested, and forecasts can be made.
Let us take, for example, the investigation of the relationship between a product’s
sales volume and its price, advertising expenditure, number of sales outlets, and national
income. Once these relationships have been quantified and confirmed with regression
analysis, forecasts (what-if analyses) can be made that predict how the sales volume will
change if, for example, the price or the advertising expenditure or both vary.
In general, regression analysis is applicable if both the dependent and the independent variables are metric variables. Qualitative (nominally scaled) variables may also be included in regression analysis, if dummy variables (cf. Sect. 1.1.2.2) are used.
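For the bivariate case, the ordinary-least-squares estimates can be computed by hand: the slope is cov(x, y)/var(x) and the intercept is mean(y) minus slope times mean(x). The following Python sketch regresses sales volume on price using invented data points and then makes a simple what-if forecast (the book's case studies use SPSS instead):

```python
# Bivariate OLS sketch: sales volume regressed on price. The five data points
# are invented for illustration; slope = cov(x, y) / var(x), intercept follows.

def ols(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    b = cov / var            # slope
    a = my - b * mx          # intercept
    return a, b

price = [1.0, 1.2, 1.4, 1.6, 1.8]
sales = [100.0, 92.0, 85.0, 78.0, 70.0]

a, b = ols(price, sales)
print(round(a, 2), round(b, 2))   # 136.8 -37.0
print(round(a + b * 1.5, 2))      # what-if forecast at a price of 1.5: 81.3
```

The negative slope quantifies the (invented) price response; the multiple-regressor case, treated in the regression chapter, generalizes this to several independent variables.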
Discriminant Analysis
Discriminant analysis is used if the independent variables are metric and the dependent variable is nominally scaled. Discriminant analysis is a method for analyzing differences between groups, for example, if we want to investigate whether and how the voting behavior for political parties differs in terms of voters’ socio-demographic and psychographic characteristics. The dependent nominal variable identifies the group membership, that is, the elected political party, and the independent variables describe the group elements, that is, the voters’ socio-demographic and psychographic characteristics.
Another area of application of discriminant analysis is the classification of elements. For example, once relationships between group membership and characteristics have been analyzed for a given set of objects (subjects), we can predict the group membership of ‘new’ objects (subjects). These kinds of predictions are frequently used in credit scoring (i.e., risk classification of bank customers applying for a loan) or in performance predictions (e.g., classification of sales representatives according to expected sales success).
Contingency Analysis
Contingency analysis examines relations among two or more nominally scaled variables. For example, we can investigate the relation between smoking (smoker versus non-smoker) and lung cancer (yes, no). We do so with the help of a cross-table (contingency table) that maps the occurrence of combinations of levels of the nominal variables.
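The underlying idea, comparing observed with expected cell frequencies under independence, can be sketched in Python; the 2×2 cell counts for the smoking example are hypothetical:

```python
# Sketch of a contingency analysis on a 2x2 cross-table (smoking vs. lung
# cancer); the cell counts are hypothetical. The chi-squared statistic sums
# (observed - expected)^2 / expected over all cells.

table = [[80, 120],   # smokers:     [cancer yes, cancer no]
         [20, 280]]   # non-smokers: [cancer yes, cancer no]

row_sums = [sum(row) for row in table]
col_sums = [sum(col) for col in zip(*table)]
n = sum(row_sums)

chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        exp = row_sums[i] * col_sums[j] / n   # expected count under independence
        chi2 += (obs - exp) ** 2 / exp

print(round(chi2, 2))   # 83.33
```

With 1 degree of freedom, the 5% critical value of the chi-squared distribution is 3.841, so a statistic this large would indicate a clear association in these invented data.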
1.1.3.2 Structure-Discovering Methods
Structure-discovering methods are used to analyze correlations between variables or
between objects, so we do not split the set of variables into dependent and independent
variables in advance, as we do for structure-testing methods.
their relationship to the underlying variables has been specified a priori, factor analysis
becomes a structure-testing instrument and is called confirmatory factor analysis (CFA).
Chap. 7 focuses on EFA, only briefly discussing CFA.
Cluster Analysis
While factor analysis is used to reduce the number of variables, cluster analysis is used
to bundle objects (subjects). The aim is to combine objects into different groups (i.e., clus-
ters) in such a way that all objects in one group are as similar to each other as possible,
while the groups are as dissimilar to each other as possible. Typical examples of research
questions tackled with cluster analysis are the definition of personality types based on
psychographic characteristics or the definition of market segments based on demand-rel-
evant consumer characteristics.
With a subsequent discriminant analysis, we can check to what extent the variables
that were used for clustering contribute to or explain the differences between the identi-
fied clusters.
The discussion of the various methods in this book involves as little mathematics as pos-
sible and requires only basic statistical knowledge—equivalent to an introductory course
in statistics. To refresh the reader’s knowledge, we will in the following provide a short
recapitulation of the most relevant statistical measures and concepts:2
2 On www.multivariate-methods.info, the reader will also find an Excel sheet with information on
the calculation of the various statistical parameters using Excel.
1.2 Basic Statistical Concepts 15
Table 1.5 provides the equations for these statistical measures. For simplicity’s sake, we
are using N to denote the sample size as well as the population size. Usually, however,
the latter is unknown.
To clarify whether a statistical parameter is calculated for a sample or denotes the
value of the population, different terms are used. Table 1.6 shows this distinction for the
mean, the variance, and the standard deviation.
Arithmetic Mean
Large amounts of empirically collected quantitative data may be characterized very well
by a few descriptive statistics. The most important descriptive statistical measure is the
arithmetic mean x j, which reflects the average value of a variable:3
3 In Excel, the mean of a variable can be calculated by: =AVERAGE(matrix), where (matrix) is the
range of cells containing the data of the variable. For example, “=AVERAGE(C6:C55)” calculates
the mean of the 50 cells C6 to C55 in column C. Note on Excel operation: The book Multivariate
16 1 Introduction to Empirical Data Analysis
x̄_j = (1/N) · Σ_{i=1}^{N} x_ij    (1.1)
Analysis provides Excel formulas with semicolon separators. If the regional language setting of
your operating system is set to a country that uses periods as decimal separators, we ask you to use
comma separators instead of semicolon separators in Excel formulas.
with
xij observed value of variable j for person or object i
N number of cases in the data set
The mean value is a measure of central tendency, also called center or location of a dis-
tribution. It is most useful if the data follow an approximately symmetric distribution. If
this is not the case, the mean cannot be considered the ‘central’ value of the observations.
Variance
In addition to the mean, it is important to measure the dispersion (variability, variation)
of the data, i.e., the deviation of the observed values from the mean. The most impor-
tant measure of dispersion is the variance, which is the average of the squared deviations
from the mean.4 If N denotes the sample size, the following applies:
s_j² = (1/(N−1)) · Σ_{i=1}^{N} (x_ij − x̄_j)²    (1.2)
with
xij observed value of variable j for person or object i
xj mean of variable j
N number of cases in the data set (sample size)
Standard Deviation
Since an average of squared deviations is difficult to interpret, the standard deviation,
which is the square root of the variance, is usually considered.5 An advantage of the
standard deviation compared to the variance is that it measures the dispersion in the
same units as the original data. This makes an interpretation easier and facilitates a com-
parison with the mean.
s_j = √[ (1/(N−1)) · Σ_{i=1}^{N} (x_ij − x̄_j)² ]    (1.3)
with
xij observed value of variable j for person or object i
xj mean of variable j
N number of cases in the data set (sample size)
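These formulas translate directly into code. The following Python sketch (not part of the book's Excel material; the data values are illustrative, borrowed from the five ages used later in the degrees-of-freedom example) mirrors Eqs. (1.1) to (1.3) and Excel's AVERAGE, VAR.S and STDEV.S:

```python
import math

def mean(x):
    # Eq. (1.1): arithmetic mean (Excel: AVERAGE)
    return sum(x) / len(x)

def variance(x):
    # Eq. (1.2): sample variance with divisor N - 1 (Excel: VAR.S)
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

def std_dev(x):
    # Eq. (1.3): standard deviation = square root of the variance (Excel: STDEV.S)
    return math.sqrt(variance(x))

age = [18, 20, 22, 24, 26]   # illustrative values
print(mean(age))             # 22.0
print(variance(age))         # 10.0
print(std_dev(age))          # ≈ 3.162
```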
4 In Excel, the sample variance can be calculated by: s_x² = VAR.S(matrix). The population variance
can be calculated by: σ_x² = VAR.P(matrix).
5 In Excel, the sample standard deviation can be calculated by: s_x = STDEV.S(matrix). The population
standard deviation is calculated by: σ_x = STDEV.P(matrix).
While these three statistical measures are very useful for describing even large amounts
of data, they are also of fundamental importance for all methods of multivariate analysis.
In the following, we will briefly illustrate how to calculate these measures.
Application Example
We consider a data set derived from five persons. For each person we collect data on his
or her age (x1), income (x2) and gender (x3). The variable ‘gender’ is a binary (dummy)
variable, with 0 = male and 1 = female. Table 1.7 lists the values collected for the five
persons (N = 5) (i.e., rows 2–6 of the columns “age”, “income”, and “gender”). In addition
to the three statistical measures we also computed the simple deviation from the mean
(x1 − x 1) and the squared deviation from the mean (x1 − x 1 )2 for all values. ◄
The sum of the deviations from the mean is always zero (see columns A, C and E). The
reason for this is the so-called centering property of the arithmetic mean, that is, the
mean is always located at that point in a series of data where positive and negative devia-
tions from the mean are exactly the same:
Σ_{i=1}^{N} (x_ij − x̄_j) = 0   and thus   Σ_{i=1}^{N} x_ij = N · x̄_j    (1.4)
with
xij observed value of variable j for person i
xj mean of variable j
N number of cases in the data set
The simple deviation from the mean is therefore not suitable as a measure of disper-
sion. Instead, the squared deviations from the mean are computed (see columns B, D and
F).6 If we divide the sum of the squared deviations by (N−1) and take the square root, we obtain the
standard deviation. The standard deviation (SD) is easy to interpret: It is a measure of
how much the observed values deviate on average from the mean value.
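The centering property of Eq. (1.4) is easy to verify numerically; a minimal Python sketch with illustrative values:

```python
age = [18, 20, 22, 24, 26]       # illustrative values
m = sum(age) / len(age)          # arithmetic mean
deviations = [v - m for v in age]

print(sum(deviations))           # 0.0 -> deviations cancel out (Eq. 1.4)
print(sum(age) == len(age) * m)  # True -> sum of values equals N times the mean
```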
Degrees of Freedom
Most empirical studies are based on data from random samples. This means, however,
that the true characteristics of the population are usually not known and must therefore
be estimated from the sample. The larger the size of a sample, the greater the probability
that the statistical measures calculated for a sample will also apply to the population (cf.
Table 1.5).
6 Variance and standard deviation cannot be interpreted meaningfully for the variable “gender”.
However, columns E and F are required for the calculation of covariance and correlations.
To estimate the measures for the population based on information from a sample, the
concept of the degrees of freedom (df) is important. The sample mean x j provides the
best estimator for the unknown population mean µj. But there will always be an error:
Mean of a population:
µ_j = x̄_j ± error    (1.5)
with
xj mean of variable j
The error depends, among other things, on the degrees of freedom (cf. Sect. 1.3). It will
be smaller for larger degrees of freedom.
In general, the number of degrees of freedom is the number of observations in the
computation of a statistical parameter that are free to vary. Let us assume, for example,
that we determine the age of 5 people in a sample: 18, 20, 22, 24, and 26 years. The
sample mean equals 22 years. Knowing that the mean value is 22 years, we can freely
choose 4 observations in the sample and the final one is determined because we need to
make sure that the sample mean is again 22 years. This sample therefore has 5–1 = 4 df.
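This can be made concrete in a short sketch: once the mean of the five ages is known, the fifth value follows from the other four, because Eq. (1.4) forces the values to sum to N · x̄:

```python
known_ages = [18, 20, 22, 24]   # 4 freely chosen observations
mean_value = 22                 # known sample mean of all 5 observations
n = 5

# The sum of all values must equal N times the mean (Eq. 1.4),
# so the last observation is completely determined:
fifth_age = n * mean_value - sum(known_ages)
print(fifth_age)                # 26 -> only 5 - 1 = 4 degrees of freedom
```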
If several parameters are used in a statistic, the number of the degrees of freedom is
generally the difference between the number of observations in the sample and the num-
ber of estimated parameters in the statistic.
The number of degrees of freedom increases with increasing sample size and
decreases with the number of measures (i.e., parameters) to be estimated. The larger
the number of degrees of freedom, the greater the precision (the smaller the error) of an
estimate.
Standardized Variables
It is often difficult to compare statistical measures across variables if the variables have
different dimensions. In the application example, age was measured in years (two-digit
values), income in Euro (four-digit values), and gender is a binary variable (0/1 values).
As a result, we cannot compare the dispersion of the variables. We first need to standard-
ize them.
To do so, we compute the difference between the observed value of a variable (xij) and
this variable’s mean (x j). Then we divide this difference by the variable’s standard devia-
tion (sj).
z_ij = (x_ij − x̄_j) / s_j    (1.6)
with
xij observed value of variable j for object i
x j mean of variable j
sj standard deviation of variable j
Standardization ensures that the mean of the standardized variable equals 0, and the
variance as well as the standard deviation are equal to 1.
Table 1.8 lists the standardized values for the three variables of the application exam-
ple (columns B, D, and F). The last row of the table displays the means of the standard-
ized variables, which are 0, and the variances or standard deviations, which are 1 for all
three variables.
We can display the standardized variable values in a matrix, the standardized data
matrix Z (Table 1.9).
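Standardization according to Eq. (1.6) can be sketched as follows (illustrative values; the sample standard deviation with divisor N − 1 is used, as in Eq. 1.3):

```python
import math

def standardize(x):
    # Eq. (1.6): z = (x - mean) / standard deviation
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
    return [(v - m) / s for v in x]

age = [18, 20, 22, 24, 26]        # illustrative values
z = standardize(age)
print([round(v, 3) for v in z])   # values symmetric around 0
print(round(sum(z), 12))          # sum (and hence mean) of z is 0
```

A quick check confirms that the variance of the standardized values equals 1, as stated above.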
From a statistical point of view, two variables are independent if they do not vary in the
same way more often than would be expected by chance. No relationship can
be found between two independent variables, or in other words: information on one vari-
able does not convey any information on the other variable.
Covariance
The covariance is a measure of the joint variability of two random variables. It is referred
to as cov(x1, x2) or sx1,x2. The covariance between two variables x1 and x2 is calculated as
the expected value of the product of their deviations from the respective mean:
s_x1,x2 = (1/(N−1)) · Σ_{i=1}^{N} (x_1i − x̄_1) · (x_2i − x̄_2)    (1.7)
with
xij observed value of variable j for object i
x j mean of variable j
N number of cases in the data set
As Fig. 1.2 shows, the covariance may take on positive and negative values depending
on which of the four quadrants we consider. A covariance of zero results when the neg-
ative and positive covariations cancel each other out. This is the case if the observed
values show no tendency of moving in the same direction (becoming higher or lower). In
this case we may conclude that the two variables change independently of each other. It
should be noted, however, that the covariance examines only linear relationships between
two variables. For the values shown in Fig. 1.2, we get a covariance cov(x1, x2) that is
equal to 0 over the four quadrants. The variables have a U-shaped relationship and thus a
non-linear dependence exists.
For the example, Table 1.10 shows that positive covariances exist between variables
x1 and x2 (cov(x1,x2) = 1,525) and between x2 and x3 (cov(x2,x3) = 125), while the covar-
iance between x1 and x3 is zero.7 Thus, a statistical dependence between x1 and x2 and
between x2 and x3 can be inferred, whereas for x1 and x3 there is no linear dependence in
a statistical sense.
Correlation
The covariance has the disadvantage that its value is influenced by the units of meas-
urement of the variables and thus its interpretation is difficult. But we can normalize the
covariance by dividing it by the standard deviations of the two variables under consider-
ation (sx1 and sx2). This results in the so-called Pearson correlation coefficient for metric
variables, rx1 x2:8
r_x1x2 = (1/(N−1)) · Σ_{i=1}^{N} [(x_1i − x̄_1)/s_x1] · [(x_2i − x̄_2)/s_x2] = s_x1x2 / (s_x1 · s_x2)    (1.8)
7 In Excel, the covariance can be calculated as follows: s_xy = COVARIANCE.S(matrix1;matrix2).
8 In Excel, the correlation between variables can be calculated as follows: r_xy = CORREL(matrix1;matrix2).
with
x_ij observed value of variable j for object i
x̄_j mean of variable j
s_xj standard deviation of variable x_j
s_x1x2 covariance of variables x1 and x2
N number of cases in the data set
For the three variables of the application example, Table 1.10 also shows the calculation
of the correlations. They can be presented as a so-called correlation matrix, as shown in
Table 1.11.9
Unlike the covariance, the correlation coefficient can be compared for different units
of measurement. For example, correlations of prices will lead to the same values of r,
regardless whether they have been calculated in dollars or euros.
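Both measures can be sketched in Python; the two series below are made-up stand-ins for ‘age’ (x1) and ‘income’ (x2), not the values from Table 1.10:

```python
import math

def mean(x):
    return sum(x) / len(x)

def covariance(x, y):
    # Eq. (1.7): sample covariance (Excel: COVARIANCE.S)
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    # Eq. (1.8): covariance divided by both standard deviations (Excel: CORREL)
    sx = math.sqrt(covariance(x, x))  # covariance of a variable with itself = variance
    sy = math.sqrt(covariance(y, y))
    return covariance(x, y) / (sx * sy)

x1 = [18, 20, 22, 24, 26]             # illustrative 'age'
x2 = [1500, 2000, 2500, 3500, 4000]   # illustrative 'income'
print(covariance(x1, x2))             # 3250.0
print(round(correlation(x1, x2), 4))  # ≈ 0.99 -> strong positive linear relationship
```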
Figure 1.3 shows values for two variables X and Y for a set of data with different scat-
ter. Scenario a represents a data set with no relationship between the two variables, i.e.,
the correlation is zero or close to zero. In scenario b we see a tendency that larger values
of one variable occur with larger values of the other variable. Thus, we get a positive
correlation (r > 0). A negative correlation (r < 0) indicates that X and Y tend to change
in opposite directions (scenario c). Scenario d shows a non-linear relationship between
the two variables. Here again the correlation is zero or close to zero. As a non-linear rela-
tionship cannot be captured by r, a visual examination of the data using a scatter plot is
always recommended before performing calculations.
The correlation coefficient r has the following properties:
9 Cf. the correlation of binary variables with metrically scaled variables in Sect. 1.1.2.2.
(Fig. 1.3: scatter plots of X and Y: a) uncorrelated data (r ≈ 0), b) positive correlation (r > 0), c) negative correlation (r < 0), d) nonlinear correlation (r ≈ 0))
However, the correlation coefficient must also be evaluated in the context of the appli-
cation (e.g., individual or aggregated data). For example, in social sciences, where vari-
ables are often influenced by human behavior and many other factors, a lower value may
be regarded as a strong correlation than in natural sciences, where much higher values
are generally expected.
Another way to assess the relevance of a correlation coefficient is to perform a sta-
tistical significance test that takes the sample size into account. The t-statistic or the
F-statistic may be used for this purpose:10
t = r / √((1 − r²)/(N − 2))

F = r² / ((1 − r²)/(N − 2))
with
r correlation coefficient
N number of cases in the data set
and df = N–2.
We can now derive the corresponding p-value (cf. Sect. 1.3.1.2).11
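The two statistics are straightforward to compute; the values of r and N below are purely illustrative:

```python
import math

def t_statistic(r, n):
    # t = r / sqrt((1 - r^2) / (N - 2)), with df = N - 2
    return r / math.sqrt((1 - r ** 2) / (n - 2))

def f_statistic(r, n):
    # F = r^2 / ((1 - r^2) / (N - 2)); for this test F equals t squared
    return r ** 2 / ((1 - r ** 2) / (n - 2))

r, n = 0.8, 30                      # illustrative correlation and sample size
t = t_statistic(r, n)
print(round(t, 3))                  # compare with a t-table at df = N - 2 = 28
print(round(f_statistic(r, n), 3))  # equals t squared
```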
Data usually contain sampling and/or measurement errors. We distinguish between ran-
dom errors and systematic errors:
Random errors, e.g. in sampling, are not avoidable, but their amount can be calculated
based on the data, and they can be diminished by increasing the sample size. Systematic
errors cannot be calculated and they cannot be diminished by increasing the sample size,
but they are avoidable. For this purpose, they first have to be identified.
As statistical results always contain random errors, it is often not clear if an observed
result is ‘real’ or has just occurred randomly. To check this, we can use statistical testing
(hypothesis testing). The test results may be of great importance for decision-making.
Statistical tests come in many forms, but the basic principle is always the same. We
start with a simple example, the test for the mean value.
The distribution of sample means tends toward a normal distribution if N is sufficiently large, even
if the original variables themselves are not normally distributed. This is the reason why a normal
distribution can be assumed for many phenomena.
1.3 Statistical Testing and Interval Estimation 27
Example
The chocolate company Choco Chain measures the satisfaction of its customers
once per year. Randomly selected customers are asked to rate their satisfaction on
a 10-point scale, from 1 = “not at all satisfied” to 10 = “completely satisfied”. Over
the last years, the average index was 7.50. This year’s survey yielded a mean value
of 7.30 and the standard deviation was 1.05. The sample size was N = 100. Now the
following question arises: Did the difference of 0.2 only occur because of random
fluctuation or does it indicate a real change in customer satisfaction? To answer this
question, we conduct a statistical test for the mean. ◄
1. formulation of hypotheses,
2. computation of a test statistic,
3. choosing an error probability α (significance level),
4. deriving a critical test value,
5. comparing the test statistic with the critical test value.
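For the Choco Chain example, these five steps can be sketched in Python. The test statistic below uses the standard one-sample formula temp = (x̄ − µ0)/(s/√N), an assumption here (it reproduces the temp ≈ −1.9 used in the text); the critical value 1.98 is the one given in the text for df = 99 and α = 5%:

```python
import math

# Step 1: hypotheses H0: mu = 7.50 vs. H1: mu != 7.50 (two-tailed)
mu0, x_bar, s, n = 7.50, 7.30, 1.05, 100

# Step 2: test statistic (standard one-sample t formula, assumed here)
t_emp = (x_bar - mu0) / (s / math.sqrt(n))
print(round(t_emp, 2))      # -1.9

# Steps 3-5: alpha = 5%; critical value 1.98 from the t-table (df = 99)
t_crit = 1.98
print(abs(t_emp) > t_crit)  # False -> H0 cannot be rejected
```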
• null hypothesis: H0 : µ = µ0
• alternative hypothesis: H1: µ ≠ µ0
where µ0 is an assumed mean value (the status quo) and µ is the unknown true mean
value. For our example with µ0 = 7.50, the null hypothesis states: “satisfaction has not changed”.
The alternative hypothesis states the opposite. In our example, it means: “satisfaction
has changed”, i.e., it has increased or decreased. Usually it is this hypothesis that is of
primary interest to the researcher, because its acceptance often requires some action. It
is also called the research hypothesis. The alternative hypothesis is accepted or “proven”
by rejecting the null hypothesis.
(Figure: density f(t) of the t-distribution, with the empirical test statistic temp marked relative to H0)
Rejecting H0 thus always entails a certain probability of falsely rejecting a true null
hypothesis. This error probability is denoted by α and is also called the level of significance.
Thus, the error probability α that we choose should be small, but not too small. If α is
too small, the research hypothesis can never be ‘proven’. Common values for α are 5%,
1% or 0.1%, but other values are also possible. An error probability of α = 5% is most
common, and we will use this error probability for our analysis.
(Figure: t-distribution with two-tailed rejection regions of size α/2 = 0.025 beyond the critical values ±tα/2)
The critical value for a given α value and degrees of freedom (df = N–1) may be taken
from a t-table or calculated by using a computer. Table 1.12 shows an extract from the
t-table for different values of α and degrees of freedom.
This means that the result of the test is not statistically significant at α = 5%.
Interpretation
It is important to note that accepting the null hypothesis does not mean that its correct-
ness has been proven. H0 usually cannot be proven, nor can we infer a probability for the
correctness of H0.
In a strict sense, H0 is usually “false” in a two-tailed test. If a continuous scale is used
for measurement, the difference between the observed value and µ0 will practically never
be exactly zero. The real question is not whether the difference is zero, but how large it
is. In our example, it is very unlikely that the present satisfaction index is exactly
7.50, as stated by the null hypothesis. We have to ask whether the difference is suffi-
ciently large to conclude that customer satisfaction has actually changed.
The null hypothesis is just a statement that serves as a reference point for assessing a
statistical result. Every conclusion we draw from the test must be conditional on H0.
Thus, for a test result |temp| > 2, we can conclude:

• Under the condition of H0, the probability that the test result has occurred solely by
chance is less than 5%. Thus, we reject the proposition of H0.

Or, as in our example, for the test result |temp| ≤ 2, we can conclude:

• Under the condition of H0, the probability that this test result has occurred by chance
is larger than 5%. We require a lower error probability. Thus, we do not have suffi-
cient reason to reject H0.
13 In Excel we can calculate the critical value tα/2 for a two-tailed t-test by using the function
T.INV.2T(α;df). We get: T.INV.2T(0.05;99) = 1.98. The values in the last line of the t-table are
identical with the standard normal distribution. With df = 99 the t-distribution comes very close to
the normal distribution.
(Figure: t-distribution with the test statistic temp = ±1.9 marked relative to H0 and a two-tailed p-value of p/2 ≈ 0.03 in each tail)
The aim of a hypothesis test is not to prove the null hypothesis. Proving the null hypoth-
esis would not make sense. If this were the aim, we could prove any null hypothesis by
making the error probability α sufficiently small.
The hypothesis of interest is the research hypothesis. The aim is to “prove” (be able to
accept) the research hypothesis by rejecting the null hypothesis. For this reason, the null
hypothesis has to be chosen as the opposite of the research hypothesis.
14 In Excel we can calculate the p-value by using the function T.DIST.2T(ABS(temp);df). For the
variable in our example we get: T.DIST.2T(ABS(−1.90);99) = 0.0603 or 6.03%.
The p-value is also referred to as the empirical significance level. In SPSS, the
p-value is called “significance” or “sig”. It tells us the exact significance level of a test
statistic, while the classical test only gives us a “black and white” picture for a given α.
A large p-value supports the null hypothesis, but a small p-value indicates that the prob-
ability of the test statistic is low if H0 is true. So probably H0is not true and we should
reject it.
We can also interpret the p-value as a measure of plausibility. If p is small, the plau-
sibility of H0 is small and it should be rejected. And if p is large, the plausibility of H0 is
large.
By using the p-value, the test procedure is simplified considerably. It is not necessary
to start the test by specifying an error probability (significance level) α. Furthermore,
we do not need a critical value and thus no statistical table. (Before the development of
computers, these tables were necessary because the computing effort for critical values
as well as for p-values was prohibitive.)
Nevertheless, some people like to have a benchmark for judging the p-value. If we use
α as a benchmark for p, the following criterion will give the same result as the classical
t-test according to the rule in Eq. (1.10):
If p < α, reject H0 (1.12)
Since in our example p = 6%, we cannot reject H0. But even if α is used as a benchmark
for p, the problem of choosing the right error probability remains.
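With the p-value at hand, rule (1.12) is a one-liner. Python's standard library has no t-distribution, so this sketch approximates it with the standard normal, which the text notes is very close for df = 99:

```python
from statistics import NormalDist

t_emp, alpha = -1.90, 0.05
# two-tailed p-value, normal approximation to the t-distribution
p = 2 * (1 - NormalDist().cdf(abs(t_emp)))
print(round(p, 4))   # ≈ 0.0574 (the exact t-based value is 0.0603)
print(p < alpha)     # False -> do not reject H0 (rule 1.12)
```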
(Figure: acceptance region of the test)
The probability (1–β) is the probability that a false null hypothesis is rejected (see
lower right quadrant in Table 1.13), which is what we want. This is called the power of a test,
and it is an important property. By decreasing α, the power of the test also decreases.
Thus, there is a tradeoff between α and β. As already mentioned, the error probability α
should not be too small; otherwise the test loses its power to reject H0 if it is false.
Both α and β can only be reduced simultaneously by increasing the sample size N.
How to Choose α
The value of α cannot be calculated or statistically justified, it must be determined by the
researcher. For this, the researcher should take into account the consequences (risks and
opportunities) of alternative decisions. If the costs of a type I error are high, α should be
small. Alternatively, if the costs of a type II error are high, α should be larger, and thus β
smaller. This increases the power of the test.
In our example, a type I error would occur if the test falsely concluded that customer
satisfaction has significantly changed although it has not. A type II error would occur if
customer satisfaction had changed, but the test failed to show this (because α was set too
low). In this case, the manager would not receive a warning if satisfaction had decreased,
and he would miss out on taking corrective actions.
reduced. However, conducting a one-tailed test requires some more reasoning and/or a
priori knowledge on the side of the researcher.
A one-tailed t-test is appropriate if the test outcome has different consequences
depending on the direction of the deviation. If in our example the satisfaction index has
remained constant or even improved, no action is required. But if the satisfaction index
has decreased, management should be worried. It should investigate the reason and take
action to improve satisfaction.
The research question of the two-tailed test was: “Has satisfaction changed?” For the
one-tailed test, the research question is: “Has satisfaction decreased?”.
Thus, we have to ‘prove’ the alternative hypothesis
H1 : µ < 7.5
by rejecting the null hypothesis
H0 : µ ≥ 7.5
which states the opposite of the research question. The decision criterion is:
reject H0 if temp < tα (1.13)
Note that in our example tα is negative. The rejection region is now only in the lower tail
(left side), and its area under the density function is twice as large. The critical value
for α = 5% is tα = –1.66 (Fig. 1.8).15 As this value is closer to H0 than the critical value
tα/2 = 1.98 for the two-tailed test, a smaller deviation from H0 is significant.
The empirical test statistic temp = –1.9 is now in the rejection region on the lower tail.
Thus, H0 can be rejected at the significance level α = 5%. With the more powerful one-
tailed test we can now “prove” that customer satisfaction has decreased.
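The one-tailed decision can be sketched the same way (again with the standard normal as an approximation to the t-distribution; the exact t-based values from the text are tα = −1.66 and p = 0.0302):

```python
from statistics import NormalDist

t_emp, alpha = -1.90, 0.05
nd = NormalDist()

t_crit = nd.inv_cdf(alpha)  # lower-tail critical value
print(round(t_crit, 2))     # -1.64 (exact t value: -1.66)
print(t_emp < t_crit)       # True -> reject H0: satisfaction has decreased

p = nd.cdf(t_emp)           # one-tailed p-value (lower tail)
print(round(p, 4))          # ≈ 0.0287 (exact t-based value: 0.0302)
```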
15 In Excel we can calculate the critical value tα for the lower tail by using the function
T.INV(α;df). We get: T.INV(0.05;99) = –1.66. For the upper tail we have to switch the sign or use
the function T.INV(1–α;df).
16 In Excel we can calculate the p-value for the left tail by using the function T.DIST(temp;df;1).
We get: T.DIST(−1.90;99;1) = 0.0302 or 3%. The p-value for the right tail is obtained by the func-
tion T.DIST.RT(temp;df).
Fig. 1.8 t-distribution and critical value for a one-tailed test (α = 5%, df = 99).
• null hypothesis H0 : π = π0
• alternative hypothesis H1: π ≠ π0
Example
The chocolate company ChocoChain knows from regular surveys on attitudes and
lifestyles that 10% of its customers are vegetarians. In this year’s survey x = 52 cus-
tomers stated that they are vegetarians. With a sample size of N = 400 this amounts to
a proportion prop = x/N = 0.13 or 13%. Does this result indicate a real increase or is it
just a random fluctuation?
For π0 = 10% we get σ = 0.30 and, with Eq. (1.13), we get the test statistic:
z_emp = (0.13 − 0.10) / (0.30/√400) = 2.00
A rule of thumb says that an absolute value ≥ 2 of the test statistic is significant at
α = 5%. So we can conclude without any calculation that the proportion of vegetari-
ans has changed significantly. The exact critical value for the standard normal distri-
bution is zα/2 = 1.96.
The two-tailed p-value for zemp = 2.0 is 4.55% and thus smaller than 5%. If our
research question is: “Has the proportion of vegetarians increased?”, we can perform
a one-tailed test with the hypotheses
H0 : π ≤ π0 = 10%
H1 : π > π0
In this case, the critical value will be 1.64 and the one-tailed p-value will be 2.28%,
which is clearly lower than 5%. Thus, the result is highly significant. ◄
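The vegetarian example can be reproduced with the standard normal distribution from Python's standard library:

```python
import math
from statistics import NormalDist

pi0, prop, n = 0.10, 0.13, 400
sigma = math.sqrt(pi0 * (1 - pi0))        # 0.30
z_emp = (prop - pi0) / (sigma / math.sqrt(n))
print(round(z_emp, 2))                    # 2.0

nd = NormalDist()
print(round(2 * (1 - nd.cdf(z_emp)), 4))  # 0.0455 -> two-tailed p-value
print(round(1 - nd.cdf(z_emp), 4))        # 0.0228 -> one-tailed p-value
print(round(nd.inv_cdf(0.975), 2))        # 1.96 -> two-tailed critical value
```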
17 Cf., e.g., Hastie et al. (2011), Pearl and Mackenzie (2018); Gigerenzer (2002).
• Sensitivity = percentage of “true positives”, i.e., the test will be positive if the patient
is sick (disease is correctly recognized).
• Specificity = percentage of “true negatives”, i.e., the test will be negative if the patient
is not sick.
For an example, we can look at the accuracy of the swab tests (RT-PCR-tests) used early
in the 2020 corona pandemic for testing people for infection with SARS-CoV-2. The
British Medical Journal (Watson et al. 2020) reported a test specificity of 95%, but a
sensitivity of only 70%. This means that out of 100 persons infected with SARS-CoV-2,
the test was falsely negative for 30 people. Not knowing about their infection, these 30
people contributed to the rapid spreading of the disease.
In Sect. 1.3.1.1 we discussed type I and type II errors (α and β) in statistical test-
ing. These errors can be seen as inverse measures of accuracy. There is a close corre-
spondence to specificity and sensitivity. Assuming “no disease” as the null hypothesis,
Table 1.14 shows the correspondence of these measures of accuracy to the error types in
statistical testing. The test sensitivity of 70% corresponds to the power of the test and the
“falsely negative” rate of 30% corresponds to the β-error (type II error).
Measures of sensitivity and specificity can be used for results of cross tables (in con-
tingency analysis), discriminant analysis, and logistic regression. In Chap. 5 on logis-
tic regression, we will give further examples of the calculation and application of these
measures.
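A minimal sketch of these accuracy measures; the confusion-matrix counts are hypothetical, chosen to match the reported 70% sensitivity and 95% specificity:

```python
# Hypothetical counts: 100 infected and 100 healthy persons
true_pos, false_neg = 70, 30  # infected: correctly positive / falsely negative
true_neg, false_pos = 95, 5   # healthy: correctly negative / falsely positive

sensitivity = true_pos / (true_pos + false_neg)  # share of "true positives"
specificity = true_neg / (true_neg + false_pos)  # share of "true negatives"
beta_error = false_neg / (true_pos + false_neg)  # type II error rate = 1 - sensitivity

print(sensitivity)  # 0.7
print(specificity)  # 0.95
print(beta_error)   # 0.3
```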
Interval estimation and statistical testing are part of inferential statistics and are based on
the same principles.
The sample mean x̄ is the best estimate we can get for the true value μ, which we do not
know. However, since x̄ is a random variable, we cannot expect it to be equal to μ. But we
can state an interval around x̄ within which we expect the true mean μ with a certain error
probability α (or confidence level 1–α):

µ = x̄ ± error
This interval is called an interval estimate or confidence interval for μ. Again, we can
use the t-distribution to determine this interval:
µ = x̄ ± t_{α/2} · s_x/√N    (1.16)
We use the same values as we used above for testing: tα/2 = 1.98, sx = 1.05 and N = 100,
and we get:
µ = 7.30 ± 1.98 · 1.05/√100 = 7.30 ± 0.21
Thus, with a confidence of 95%, we can expect the true value μ to be in the interval
between 7.09 and 7.51:
7.09 ← µ → 7.51
The smaller the error probability α (or the greater the confidence 1–α), the greater the
interval must be. Thus, for α = 1% (or confidence 1–α = 99%), the confidence interval is
[7.02, 7.58].
We can also use the confidence interval for testing a hypothesis. If our null hypothesis
µ0 = 7.50 falls into the confidence interval, it is equivalent to the test statistic falling into
the acceptance region. This is an alternative way of testing a hypothesis. Again, we can-
not reject H0, just as in the two-tailed test above.
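The 95% confidence interval from Eq. (1.16), using the critical value tα/2 = 1.98 given in the text:

```python
import math

x_bar, s, n = 7.30, 1.05, 100
t_crit = 1.98                      # two-tailed critical value (alpha = 5%, df = 99)

error = t_crit * s / math.sqrt(n)  # Eq. (1.16)
lower, upper = x_bar - error, x_bar + error
print(round(error, 2))             # 0.21
print(round(lower, 2), round(upper, 2))  # 7.09 7.51

# H0: mu = 7.50 lies inside the interval -> H0 cannot be rejected
print(lower <= 7.50 <= upper)      # True
```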
1.4 Causality
A causal relationship is a relationship that has a direction. For two variables X and Y, it
can be formally expressed by
X → Y (cause → effect)
This means: If X changes, Y changes as well. Thus, changes in Y are caused by changes
in X.
However, this does not mean that changes in X are the only cause of changes in Y. If X
is the only cause of changes in Y, we speak of a mono-causal relationship. But we often
face multi-causal relationships, which makes it difficult to find and prove causal relation-
ships (cf. Freedman, 2002; Pearl & Mackenzie, 2018).
Finding and proving causal relationships is a primary goal of all empirical (natural and
social) sciences. Statistical association or correlation plays an important role in pursuing
this goal. But causality is no statistical construct and concluding causality from an asso-
ciation or correlation can be very misleading. Data contain no information about causal-
ity. Thus, causality cannot be detected or proven by statistical analysis alone.
To infer or prove causality, we need information about the data generation process
and causal reasoning. The latter is something that computers or artificial intelligence are
still lacking. Causality is a conclusion that must be drawn by the researcher. Statistical
methods can only support our conclusions about causality.
There are many examples of significant correlations that do not imply causality. For instance, high correlations were found for
• the number of storks and the birth rate,
• the reading skills of school children and their shoe size,
• the crop yield of hops and beer consumption, and
• ice cream sales and the rate of drowning.
Such non-causal correlations between two variables X and Y are also called spurious correlations. They are often caused by a lurking third variable Z that is simultaneously influencing X and Y. This third variable Z is also called a confounding variable or confounder.
It is causally related to X and Y. But often we cannot observe such a confounding variable or do not even know about it. Thus, the confounder can cause misinterpretations.
The strong correlation between the number of storks and the birth rate that was
observed in the years from 1966 to 1990 was probably caused by the growing industrial
development combined with prosperity. For the reading skills of school children and their
shoe size, the confounder is age. For the crop yield of hops and beer consumption, the confounder is probably the number of sunny hours. The same may be true for the relationship
between ice cream sales and the rate of drowning. If it is hot, people eat more ice cream
and more people go swimming. If more people go swimming, more people will drown.
If we hypothesize a causal relationship between two correlated variables X and Y, there are two possible situations:
a) X is a cause of Y: X → Y
b) Y is a cause of X: Y → X
For the correlation coefficient, it makes no difference whether we have situation a) or b). Thus, a significant correlation is not sufficient proof of the hypothesized causal relationship a).
A cause must precede the effect, and thus changes in X must precede corresponding
changes in Y. If this is not the case, the above hypothesis is wrong. In an experiment, this
may be easily verified. The researcher changes X and checks for changes in Y. Yet if one
has only observational data, it is often difficult or impossible to check the temporal order.
We can do so if we have time series data and the observation periods are shorter than
the lapse of time between cause and effect (time lag). Referring to our example, the time
lag between advertising and sales depends on the type of product and the type of media
used for advertising. The time lag will be shorter for FMCG like chocolate or toothpaste
and longer for more expensive and durable goods (e.g., TV set, car). Also, the time lag
will be shorter for TV or radio advertising than for advertising in magazines. For advertising, the effects are often dispersed over several periods (i.e., distributed lags).
In case of a sufficiently large time lag (or sufficiently short observation periods) the
direction of causation can be detected by a lagged correlation (or lagged regression).
Under hypothesis X → Y, the following has to be true (Campbell & Stanley, 1966, p. 69):
r(Xt−r, Yt) > r(Xt, Yt−r)
where t is the time period and r the length of the lag in periods (r = 1, 2, 3 …).
Otherwise, it indicates that the hypothesis is wrong and causality has the opposite
direction.
A time lag can also obscure a causal relationship. Thus, r(Xt, Yt) might not be significant, but r(Xt−r, Yt) is. This should be considered from the outset if there are reasons to suspect a
lagged relationship. The relationship between sales and advertising is an example where
time lags frequently occur. Regression analysis (see Chap. 2) can cope with this by
including lagged variables.
1.5 Outliers and Missing Values
The results of empirical analyses can be distorted by observations with extreme values that do not correspond to the values “normally” expected. Likewise, missing data can lead to distortions, especially if they are not treated properly when analyzing the data.
1.5.1 Outliers
Empirical data often contain one or more outliers, i.e., observations that deviate substantially from the other data. Such outliers can have a strong influence on the result of the analysis.
Outliers can arise for different reasons. They can be due to
• chance (random),
• a mistake in measurement or data entry,
• an unusual event.
1.5.1.1 Detecting Outliers
When faced with a great number of numerical values, it can be tedious to find unusual ones. Even for a small data set like the one in Table 1.15, it is not easy to detect possible outliers just by looking at the raw data. Numerical and/or graphical methods may be used for detecting outliers, with graphical methods such as histograms, boxplots, and
scatterplots usually being more convenient and efficient (du Toit et al., 1986; Tukey,
1977). A simple numerical method for detecting outliers is the standardization of data.
Standardization of Data
Table 1.15 shows the observed values of two variables, X1 and X2, and their standardized
values, called z-values. We can see that only one z-value exceeds 2 (observation 16 of
variable X1). If we assume that the data follow a normal distribution, a value >2 has a
probability of less than 5%. The occurrence of a value of 2.56, as observed here, has
a probability of less than 1%. Thus, this value is unusual and we can identify it as an
outlier.
The effect of an outlier on a statistical result can be easily quantified by repeating the
computations after discarding the outlier. Table 1.15 shows that the mean of variable X1
is 22.3. After discarding observation 16, we get a mean of 21.0. Thus, the mean value
changes by 1.3. The effect will be smaller for larger sample sizes. Especially for small
sample sizes, outliers can cause substantial distortions.
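Standardization is easy to reproduce. The following sketch uses a small invented data set (the values of Table 1.15 are not repeated here) and flags values with |z| > 2:

```python
from statistics import mean, stdev

x = [18, 21, 19, 23, 20, 22, 17, 24, 19, 45]   # hypothetical data with one extreme value
m, s = mean(x), stdev(x)
z = [(v - m) / s for v in x]
outliers = [v for v, zi in zip(x, z) if abs(zi) > 2]
print(outliers)                                 # [45]

# effect of the outlier: the mean drops after discarding it
cleaned = [v for v in x if v not in outliers]
print(round(m, 1), round(mean(cleaned), 1))     # 22.8 20.3
```

In this small sample the single extreme value shifts the mean by about 2.5 units, illustrating why outliers are especially harmful for small sample sizes.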
Histograms
Figure 1.10 shows a histogram of variable X1, with the outlier at the far right of the
figure.18
Boxplots
A more convenient graphical means is the boxplot. Figure 1.11 shows the boxplots of
variables X1 and X2, with the outlier showing up above the boxplot of X1.
A boxplot (also called box-and-whisker plot) is based on the percentiles of data. It is
determined by five statistics of a variable:
18 The histogram was created with Excel by selecting “Data/Data Analysis/Histogram”. In SPSS,
histograms are created by selecting “Analyze/Descriptive Statistics/Explore”.
• maximum,
• 75th percentile,
• 50th percentile (median),
• 25th percentile, and
• minimum.
The bold horizontal line in the middle of each box represents the median, i.e., 50% of the
values are above this line and 50% are below. The upper rim of the box represents the
75th percentile and the lower rim represents the 25th percentile. Since these three percentiles, the 25th, 50th, and 75th percentiles, divide the data into four equal parts, they are also called quartiles. The height of the box represents 50% of the data and indicates the dispersion (spread, variation) and skewness of the data.
The whiskers extending above and below the boxes represent the complete range of
the data, from the smallest to the largest value (but without outliers). Outliers are defined
as points that are more than 1.5 box lengths away from the rim of the box.19
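The quartiles and the 1.5-box-length rule can be computed directly; the data set is again a small invented one:

```python
from statistics import quantiles

x = [18, 21, 19, 23, 20, 22, 17, 24, 19, 45]   # hypothetical data
q1, q2, q3 = quantiles(x, n=4)                 # 25th, 50th (median), 75th percentile
iqr = q3 - q1                                  # height of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in x if v < lower_fence or v > upper_fence]
print(q1, q2, q3, outliers)                    # 18.75 20.5 23.25 [45]
```

Note that statistics.quantiles uses the “exclusive” percentile method by default; SPSS may compute slightly different quartile values, but the 1.5-box-length logic is the same.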
Fig. 1.12 Scatterplot of the variables X1 and X2
Scatterplots
Histograms and boxplots are univariate methods, which means that we are looking for
outliers for each variable separately. The situation is different if we analyze the relationship between two or more variables. If we are interested in the relationship between two
variables, we can display the data by using a scatterplot (Fig. 1.12). Each dot represents
an observation of the two variables X1 and X2.
The relationship between X1 and X2 can be represented by a linear regression line (dashed line in Fig. 1.12). We can see that observation 16 (at the right end of the regression line), which we identified as an outlier from a univariate perspective, fits the linear regression model quite well. The slope of the line will not be substantially affected if we eliminate the outlier. However, it is also possible that an outlier impacts the slope of the regression line and biases the results.
Some methods discussed in this book (e.g., regression analysis or factor analysis)
are rather sensitive to outliers. We will therefore discuss the issue of outliers in detail in
those chapters.
1.5.2 Missing Values
Missing values are an unavoidable problem when conducting empirical studies and frequently occur in practice. The reasons for missing values are manifold; for example, a respondent may be unable to answer a question (does not know) or may not want to answer it (refuses to give information).
The problem with missing values is that they can lead to distorted results. The validity of
the results can also be limited since many methods require complete data sets and cases
with a missing value have to be deleted (i.e., listwise deletion). Finally, missing values
also represent a loss of information, so the validity of the results is reduced as compared
to analyses with complete data sets.
Statistical software packages offer the possibility of taking missing values into account in statistical analyses. Since all case studies in this book use IBM SPSS, the following is a brief description of how this statistical software package can identify missing values. There are two options: values may either be left empty (system missing values) or be declared as user missing values in the data editor (cf. Fig. 1.13).

Fig. 1.13 Definition of user missing values in the data editor of IBM SPSS

For the treatment of missing values in statistical analyses, there are several options:
• The values are excluded “case by case” (“Exclude cases listwise”), i.e., as soon as a missing value occurs, the whole case (observation) is excluded from further analysis. This often reduces the number of cases considerably. The “listwise” option is the default setting in SPSS.
• The values are excluded pairwise (“Exclude cases pairwise”), i.e., a case is excluded only from those computations in which the missing variable is involved. If, for example, a value is missing for variable j, only the correlations with variable j are affected in the calculation of a correlation matrix. In this way, the coefficients in the matrix may be based on different numbers of cases. This may result in an imbalance between the variables.
• There is no exclusion at all. Average values (“Replace with mean”) are inserted for
the missing values. This may lead to a reduced variance if many missing values occur
and to a distortion of the results.
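The consequence of the listwise and pairwise options can be sketched with a small invented data matrix (None marks a missing value):

```python
data = [
    [1.0, 2.0, 3.0],
    [2.0, None, 4.0],   # missing value for the second variable
    [3.0, 5.0, None],   # missing value for the third variable
    [4.0, 6.0, 5.0],
]

def pairwise_n(data, i, j):
    # number of cases available for the correlation of variables i and j
    return sum(1 for row in data if row[i] is not None and row[j] is not None)

listwise_n = sum(1 for row in data if None not in row)
print(listwise_n)              # 2 complete cases remain under listwise exclusion
print(pairwise_n(data, 0, 1))  # 3 cases are available for this pair of variables
```

Pairwise exclusion keeps more information, but as noted above, different coefficients of a correlation matrix may then rest on different numbers of cases.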
For option 3, SPSS offers an extra procedure, which can be called up by the menu
sequence Transform/Replace Missing Values (cf. Fig. 1.14). With this procedure, the user
can decide per variable which information should replace missing values in a data set.
The following options are available:
• Series mean,
• Mean of nearby points (the number of nearby points may be defined as 2 to all),
• Median of nearby points (number of nearby points: 2 to all),
• Linear interpolation,
• Linear trend at point.
For cross-sectional data, only the first two options make sense, since the missing values
of a variable are replaced with the mean or median (nearby points: all) of the entire data
series. The remaining options are primarily aimed at time series data, in which case the
order of the cases in the data set is important: With the options “Mean of nearby points”
and “Median of nearby points”, the user can decide how many observations before and
after the missing value are used to calculate the mean or the median for a missing value.
In the case of “Linear interpolation”, the mean is derived from the immediate predecessor and successor of the missing value. “Linear trend at point” calculates a regression (see Chap. 2) on an index variable scaled 1 to N. Missing values are then replaced with the estimated value from the regression.
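Two of these replacement strategies, the series mean and linear interpolation from the immediate neighbors, can be sketched as follows; this is a simplified sketch that assumes isolated missing values in the interior of the series:

```python
def replace_with_mean(series):
    """'Series mean': replace each missing value with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in series]

def linear_interpolation(series):
    """'Linear interpolation': mean of the immediate predecessor and successor.

    Simplified sketch: assumes single missing values in the interior of the series.
    """
    out = list(series)
    for i, v in enumerate(out):
        if v is None:
            out[i] = (out[i - 1] + out[i + 1]) / 2
    return out

print(replace_with_mean([10.0, None, 14.0, 18.0]))     # [10.0, 14.0, 14.0, 18.0]
print(linear_interpolation([10.0, None, 14.0, 18.0]))  # [10.0, 12.0, 14.0, 18.0]
```

For time series data, interpolation respects the order of the cases, whereas the series mean ignores it, which is why only the mean and median options are sensible for cross-sectional data.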
With the menu sequence Analyze/Multiple Imputation, SPSS offers a good possibility of replacing missing values with realistic estimated values. SPSS also offers the possibility to analyze missing values under the menu sequence Analyze/Missing Value Analysis.
In addition to the general options for handling missing values described above, some
of the analytical procedures of SPSS also offer options for handling missing values.
Table 1.16 summarizes these options for the methods discussed in this book.
In most procedures, the default is that cases with a missing value are completely excluded (listwise exclusion of missing values). But this may result in a greatly reduced number of cases. Replacing missing values with other values is therefore a good way to counteract this effect and avoid unequal weighting.
Moreover, the option of user missing values offers the advantage that the user can differentiate missing values in terms of content. Missing values that should not be included in the calculation of statistical parameters may still provide specific information, e.g., whether a respondent is unable to answer (does not know) or does not want to answer (no information). If the option to differentiate “missing values” in such a way is integrated into the design of a survey from the start, important information can be derived from it.
Finally, it should be emphasized again that it is important to make sure that missing values are marked as such in SPSS so that they are not included in calculations and do not distort the results.
In this book, we primarily use the software IBM SPSS Statistics (or SPSS) for the different methods of multivariate analysis, because SPSS is widely used in science and practice. The name SPSS originally was an acronym for Statistical Package for the Social Sciences. Over time, the scope of SPSS has been expanded to cover almost all areas of data analysis.
IBM SPSS Statistics may be run on the operating systems Windows, macOS, and Linux. It includes a base module and several extension modules. Apart from the full version of IBM SPSS Statistics Base, a lower-cost student version is available for educational purposes. This has some limitations that are unlikely to be relevant to the majority of users: Data files can contain a maximum of 50 variables and 1500 cases, and the SPSS command syntax (command language) and extension modules are not available.
To use SPSS, the basic IBM SPSS Statistics Base package must be purchased, containing basic statistical analyses. This basic module is also a prerequisite for purchasing additional packages or modules, which usually focus on specific analysis procedures such as SPSS Regression (regression analysis), SPSS Conjoint (conjoint analysis), or SPSS Neural Networks.
An alternative option is to use the IBM SPSS Statistics Premium package, which includes all the procedures of the Base and Advanced packages and is available to students at most universities.
Table 1.17 provides an overview of the analytical methods covered in this book and the associated SPSS procedures, all of which are included in the SPSS Premium package. They run under the common user interface of SPSS Statistics. For readers without the SPSS Premium package, the column “SPSS module” lists those SPSS modules or packages that contain the corresponding procedures.
The various data analysis methods can be selected in SPSS via a graphical user interface. This user interface is constantly being improved and extended. Using the available menus and dialog boxes, even complex analyses can be performed in a very convenient way. Thus, the command language (command syntax) previously required to control the program is hardly used any more, but it still has some advantages for the user, such as the customization of analyses. All chapters in this book therefore contain the command sequences required to carry out the analyses.
There are several books on how to use IBM SPSS, all of which provide a very good
introduction to the package:
• George, D. & Mallery, P. (2019). IBM SPSS Statistics 26 Step by Step (16th ed.).
London: Taylor & Francis Ltd.
• Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). London:
Sage Publication Ltd.
• Härdle, W. K., & Simar, L. (2015). Applied Multivariate Statistical Analysis, (5th
ed.). Heidelberg: Springer.
IBM SPSS also provides several manuals under the link https://www.ibm.com/support/
pages/ibm-spss-statistics-29-documentation, which are regularly updated.
Users who work with the programming language R will find notes on how to use it
for data analysis under the link www.multivariate-methods.info.
In addition, a series of Excel files for each analysis method is also provided on the website www.multivariate-methods.info, which should help the readers familiarize themselves more easily with the various methods.
References
Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for
research. Rand McNally.
du Toit, S. H. C., Steyn, A. G. W., & Stumpf, R. H. (1986). Graphical exploratory data analysis.
Springer.
Freedman, D. (2002). From association to causation: Some remarks on the history of statistics (Technical Report No. 521). University of California.
Gigerenzer, G. (2002). Calculated risks. Simon & Schuster.
Green, P. E., Tull, D. S., & Albaum, G. (1988). Research for marketing decisions (5th ed.).
Prentice Hall.
Hastie, T., Tibshirani, R., & Friedman, J. (2011). The elements of statistical learning. Springer.
Pearl, J., & Mackenzie, D. (2018). The book of why—The new science of cause and effect. Basic
Books.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680.
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
Watson, J., Whiting, P. F., & Brush, J. E. (2020). Interpreting a covid-19 test result. British Medical Journal, 369, m1808.
Further Reading
Anderson, D. R., Sweeney, D. J., & Williams, T. A. (2007). Essentials of modern business statistics
with Microsoft Excel. Thomson.
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage.
Fisher, R. A. (1990). Statistical methods, experimental design, and scientific inference. Oxford
University Press.
Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). Norton.
George, D., & Mallery, P. (2021). IBM SPSS statistics 27 step by step: A simple guide and reference (17th ed.). Routledge.
Sarstedt, M., & Mooi, E. (2019). A concise guide to market research: The process, data, and methods using IBM SPSS statistics (3rd ed.). Springer.
Wonnacott, T. H., & Wonnacott, R. J. (1977). Introductory statistics for business and economics
(2nd ed.). Wiley.
2 Regression Analysis

Contents
2.1 Problem
2.2 Procedure
  2.2.1 Model Formulation
  2.2.2 Estimating the Regression Function
    2.2.2.1 Simple Regression Models
    2.2.2.2 Multiple Regression Models
  2.2.3 Checking the Regression Function
    2.2.3.1 Standard Error of the Regression
    2.2.3.2 Coefficient of Determination (R-square)
    2.2.3.3 Stochastic Model and F-test
    2.2.3.4 Overfitting and Adjusted R-Square
  2.2.4 Checking the Regression Coefficients
    2.2.4.1 Precision of the Regression Coefficient
    2.2.4.2 t-test of the Regression Coefficient
    2.2.4.3 Confidence Interval of the Regression Coefficient
  2.2.5 Checking the Underlying Assumptions
    2.2.5.1 Non-linearity
    2.2.5.2 Omission of Relevant Variables
    2.2.5.3 Random Errors in the Independent Variables
    2.2.5.4 Heteroscedasticity
    2.2.5.5 Autocorrelation
    2.2.5.6 Non-normality
    2.2.5.7 Multicollinearity and Precision
    2.2.5.8 Influential Outliers
2.3 Case Study
  2.3.1 Problem Definition
  2.3.2 Conducting a Regression Analysis With SPSS
  2.3.3 Results
    2.3.3.1 Results of the First Analysis
2.1 Problem
Regression analysis is one of the most useful and thus most frequently used methods for statistical data analysis. With the help of regression analysis, one can analyze the relationships between variables. For example, one can find out if a certain variable is influenced by another variable, and if so, how strong this effect is.
By this, one can learn how the world works. Regression analysis can be used in the search for truth, which can be very exciting. Regression analysis is very useful when searching for explanations or making decisions or predictions. Thus, regression analysis is of eminent importance for all empirical sciences as well as for solving practical problems. Table 2.1 lists examples of the application of regression analysis.
Regression analysis takes a special position among the methods of multivariate data analysis. The invention of regression analysis by Sir Francis Galton (1822–1911) in connection with his studies on heredity1 can be considered the birth of multivariate data analysis. Stigler (1997, p. 107) calls it “one of the grand triumphs of the history of science”. Moreover, regression analysis provides a basis for numerous other methods used today in big data analysis and machine learning. For an understanding of these other, often more complex methods of multivariate data analysis, a profound knowledge of regression analysis is indispensable.
While regression analysis is a relatively simple method within the field of multivariate
data analysis, it is still prone to mistakes and misunderstandings. Thus, wrong results
or wrong interpretations of the results of regression analysis are frequent. This concerns
above all the underlying assumptions of the regression model. We will come back to this
later but will add a word of caution here. Regression analysis can be very helpful for
finding causal relationships, and this is the main reason for its application. But neither
regression analysis nor any other statistical method can prove causality. For this purpose, reasoning beyond statistics and information about the generation of the data may be needed.

1 Galton (1886) investigated the relationship between the body heights of parents and their adult children. He “regressed the height of children on the height of parents”.
First, we want to show here how regression analysis works. For the application of regression analysis, the user (researcher) must decide which one is the dependent variable that is influenced by one or more other variables, the so-called independent variables. The dependent variable must be on a metric (quantitative) scale. The researcher further needs empirical data on the variables. These may be derived from observations or experiments and may be cross-sectional or time-series data. Somewhat bewildering for the novice are the different terms that are used interchangeably in the literature for the variables of regression analysis and that vary by author and the context of the application (see Table 2.2).
Example
When analyzing the relationship between the sales volume of a product and its
price, sales will usually be the dependent variable, also called the response variable,
explained variable or regressand, because sales volume usually responds to changes
in price. The price will be the independent variable, also called predictor, explanatory variable, or regressor. So, an increase in price may explain why the sales volume
has declined. And the price may be a good predictor of future sales. With the help of
regression analysis, one can predict the expected sales volume if the price is changed
by a certain amount. ◄
Y = f (X) (2.1)

or Ŷ = f̂ (X). (2.2)
Of course, the estimated values are not identical with the real (observed) values. That is
why the variable for estimated sales is denoted by Ŷ (Y with a hat). To get a quantitative
estimate for the relationship in Eq. (2.2), we must specify its structure. In linear regression, we assume:
Ŷ = a + b X (2.3)
With given data of Y and X, regression analysis can find values for the parameters a and
b. Parameters are numerical constants in a model, whose values we want to estimate.
Parameters that accompany (multiply) a variable (such as b) are also called coefficients.
Let us assume that the estimation yields the following result:
Ŷ = 500 + 3 X (2.4)
Figure 2.1 illustrates this function. Parameter b (the coefficient of X) is an indicator of the strength of the effect of advertising on sales. Geometrically, b is the slope of the regression line. If advertising increases by 1 Euro, in this example sales will increase by 3 units. Parameter a (the regression intercept) reflects the basic level of sales if there is no advertising (X = 0).
With the help of the estimated regression function, a manager can answer questions like: What sales volume can we expect for a given advertising budget? So, for example, if the advertising budget is 100 Euros, we will expect sales to be

Ŷ = 500 + 3 · 100 = 800 (2.5)

Of course, sales do not depend on advertising alone but on the price, the activities of competitors, and many other influences.2 So, with simple linear regression, we usually can get only very rough estimates of sales volumes.
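Using the estimated function from Eq. (2.4), such a prediction is a one-liner:

```python
def predict_sales(advertising):
    # estimated regression function from Eq. (2.4): Y-hat = 500 + 3 * X
    return 500 + 3 * advertising

print(predict_sales(100))  # 800
```

The intercept is the prediction for zero advertising: predict_sales(0) returns 500.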
Multiple Regression
With multiple regression analysis one can take into account more than one influencing
variable by including them all into the regression function. So, Eq. (2.1) can be extended
to a function with several independent variables:
Y = f (X1 , X2 , . . . , Xj , . . . , XJ ) (2.6)
Choosing again a linear structure, we get:
Ŷ = a + b1 X1 + b2 X2 + . . . + bj Xj + . . . + bJ XJ (2.7)
By including more explanatory variables, the predictions of Y can become more precise. However, there are limitations to extending the model. Often, not all influencing variables are known to the researcher. Or some observations are not available. Also, with an increasing number of variables, the estimation of the parameters can become more difficult.
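The least-squares estimation behind Eq. (2.7) can be sketched in plain Python via the normal equations; this is a didactic sketch with invented data, not the numerically robust algorithm used by statistical packages:

```python
def fit_ols(X, y):
    """Least-squares estimates [a, b1, ..., bJ] for y = a + b1*x1 + ... + bJ*xJ.

    Builds the normal equations (X'X) beta = X'y and solves them by
    Gaussian elimination with partial pivoting.
    """
    rows = [[1.0] + list(x) for x in X]   # prepend a constant for the intercept a
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    for col in range(k):                  # forward elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k                      # back substitution
    for i in reversed(range(k)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

# invented data generated exactly from y = 2 + 3*x1 - 1*x2, so the fit recovers it
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]]
y = [2, 5, 1, 4, 5]
print([round(v, 6) for v in fit_ols(X, y)])  # [2.0, 3.0, -1.0]
```

The same function handles the simple regression of Eq. (2.3) as the special case of a single predictor column.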
2.2 Procedure
In this section, we will show how regression analysis works. The procedure can be
structured into five steps that are shown in Fig. 2.2. The steps of regression analysis
are demonstrated using a small example with three independent variables and 12 cases
(observations) as shown in Table 2.3.3
Example
The manager of a chocolate manufacturer is not satisfied with the sales volume of
chocolate bars. He would like to find out how he can influence the sales volume. To
this end, he collected quarterly sales data from the last three years. In particular, he
took data on sales volume, retail price, and expenditures for advertising and sales promotion. Data on retail sales and prices were acquired from a retail panel (Table 2.3).
◄
2 Sales can also depend on environmental factors like competition, socio-economic influences, or weather. Another difficulty is that advertising itself is a complex bundle of factors that cannot simply be reduced to expenditures. The impact of advertising depends on its quality, which is difficult to measure, and it also depends on the media that are used (e.g., print, radio, television, internet). These and other reasons make it very difficult to measure the effect of advertising.
3 On the website www.multivariate-methods.info we provide supplementary material (e.g., Excel files).
1 Model formulation
The first step in performing a regression analysis is the formulation of a model. A model
is a simplified representation of a real-world phenomenon. It should have some structural
or functional similarity with reality. A city map, for example, is a simplified visual model
of a city that shows its streets and their courses. A globe is a three-dimensional model of
Earth.
sales = f (advertising)
or
Y = f (X)
The manager further assumes that the effect of advertising is positive, i.e. that the sales
volume increases with increasing advertising expenditures. To check this hypothesis, he
inspects the data in Table 2.3. It is always useful to visualize the data by a scatterplot
(dot diagram), as shown in Fig. 2.3. This should be the first step of an analysis.
Each observation of sales and advertising in Table 2.3 is represented by a point in
Fig. 2.3. The first point at the left is the point (x1 , y1 ), i.e. the first observation with the
values 203 and 2596. Using Excel or SPSS, such scatter diagrams can be easily created,
even for large amounts of data.
Fig. 2.3 Scatterplot of the observed values for sales and advertising
The scatterplot shows that the sales volume tends to increase with advertising. We can
see some linear association between sales and advertising.6 This confirms the hypothesis
of the manager that there is a positive relationship between sales and advertising. For
the correlation (Pearson’s r) the manager calculates rxy = 0.74. Moreover, the manager
assumes that the relationship between sales and advertising can be approximately represented by a linear regression line, as shown in Fig. 2.4.
The situation would be different if we had a scatterplot as shown in Fig. 2.5. This
indicates a non-linear relationship. Advertising response is always non-linear. Linear
models are in almost all cases a simplification of reality. But they can provide good
approximations and are much easier to handle than non-linear models. So, for the
data in Fig. 2.5, a linear model could be appropriate for a limited range of advertising
6 The terms association and correlation are widely and often interchangeably used in data analysis. But there are differences. Association of variables refers to any kind of relation between variables. Two variables are said to be associated if the values of one variable tend to change in some systematic way along with the values of the other variable. A scatterplot of the variables will show a systematic pattern. Correlation is a more specific term. It refers to associations in the form of a linear trend. And it is a measure of the strength of this association. Pearson’s correlation coefficient measures the strength of a linear trend, i.e. how close the points lie to a straight line. Spearman’s rank correlation can also be used for non-linear (monotonic) trends.
expenditures, e.g. from zero to 200. For modeling advertising response over the complete range of expenditures, a non-linear formulation would be necessary (for handling non-linear relations see Sect. 2.2.5.1).
The regression line in Fig. 2.4 can be mathematically represented by the linear
function:
Ŷ = a + b X (2.8)
with
Ŷ estimated sales
X advertising expenditures
a constant term (intercept)
b regression coefficient
b = ΔŶ / ΔX (2.9)
Parameter b tells us how much Y will probably increase if X is increased by one unit.
A mathematical model, like the regression function in Eq. (2.3), must be adapted to real-
ity. The parameters of the model must be estimated based on a data set (observations of
the variables). This process is called model estimation or calibration. We will demon-
strate this with the data from Table 2.3, first for simple regression, and then for multiple
regression.
With the statistics for the standard deviations and the correlation of the two variables
given in Table 2.4, we can calculate the regression coefficient more easily as7
b = rxy · (sy / sx) = 0.742 · (234.38 / 18.07) = 9.63 (2.11)
7 These basic statistics can be easily calculated with the Excel functions AVERAGE(range) for the mean, STDEV.S(range) for the standard deviation, and CORREL(range1;range2) for the correlation.
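The shortcut in Eq. (2.11), together with Eq. (2.12) for the constant term, can be sketched in a few lines. The inputs below are the rounded summary statistics reported in the text, so the results match the book's values only up to rounding.

```python
# Slope and intercept of the simple regression from summary statistics
# alone, following Eqs. (2.11) and (2.12). Inputs are the (rounded)
# values given in the text.
r_xy = 0.742                  # correlation of advertising and sales
s_x, s_y = 18.07, 234.38      # standard deviations
x_bar, y_bar = 235.0, 2825.0  # point of means

b = r_xy * s_y / s_x          # Eq. (2.11): regression coefficient
a = y_bar - b * x_bar         # Eq. (2.12): constant term

print(round(b, 2), round(a, 1))
```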
2.2 Procedure 67
(Fig. 2.7: Sales against advertising expenditures with regression line, SD line (dashed), and point of means)
Understanding Regression
A regression line for given data must always pass through the centroid of the data (the point of means or point of averages). This is a consequence of the least-squares estimation. In our case, the point of means is [x̄, ȳ] = [235, 2825] (marked by a bullet in Fig. 2.7).
For standardized variables with x̄ = ȳ = 0 and sx = sy = 1, the regression line will pass through the origin of the coordinate system, the point [x̄, ȳ] = [0, 0]. For the constant term, we get a = 0, as we can see from Eq. (2.12). And from Eq. (2.11) we can see that the regression coefficient (the slope of the regression line) is simply the same as the correlation coefficient. In our case, after standardization we would get:
b = rxy = 0.74.
For the original variables X and Y, the slope b also depends on the standard deviations of X and Y, sx and sy. Only for sx = sy are the values of b and rxy identical. If sy > sx, then the slope of the regression line will be larger than the correlation (b > rxy) and vice versa. The greater sy, the greater the coefficient b will be, and the greater sx, the smaller b will be.
As the standard deviation of any variable changes with its scale, the regression coeffi-
cient b also depends on the scaling of the variables. If, e.g., the advertising expenditures
are given in cents instead of EUR, b will be diminished by the factor 100. The effect of a
change by one cent is just 1/100 of the effect of a change by one EUR.
By changing the scale of the variables, the researcher can arbitrarily change the stand-
ard deviations and thus the regression coefficient. But he cannot change the value of the
correlation coefficient rxy since its value is independent of differences in scale.
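This scale behavior is easy to verify numerically. The data below are hypothetical, chosen only to illustrate the point; expressing advertising in cents instead of EUR amounts to multiplying the x-values by 100.

```python
import numpy as np

# Rescaling X changes the regression coefficient but not the correlation.
# Hypothetical advertising (EUR) and sales data.
x = np.array([200., 210., 220., 230., 240., 250.])
y = np.array([2400., 2600., 2550., 2800., 2900., 3100.])

b_eur, _ = np.polyfit(x, y, 1)         # slope with X in EUR
b_cent, _ = np.polyfit(x * 100, y, 1)  # same X expressed in cents

r_eur = np.corrcoef(x, y)[0, 1]
r_cent = np.corrcoef(x * 100, y)[0, 1]

assert np.isclose(b_cent, b_eur / 100)  # slope shrinks by the scale factor
assert np.isclose(r_eur, r_cent)        # correlation is scale-free
```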
The line through the point of means with the slope sy/sx (with the same sign as the correlation coefficient) is called the standard deviation line (SD line; Freedman et al. 2007, pp. 130–131). This line is known before performing a regression analysis. For our data, we get sy/sx = 13. In Fig. 2.7 the SD line is represented by the dashed line.
For rxy = 1, the regression line is identical with the SD line. But for empirical data, we will always get |rxy| < 1. Thus, it follows that the regression line will always be flatter than the SD line, i.e. |b| < sy/sx. This effect is called the regression effect, from which regression analysis got its name.
For our data we get b = 9.63 < 13. The estimated regression line will always lie
between the SD line and a horizontal line through the point of means.
8 Blalock (1964, p. 51) writes: “A large correlation merely means a low degree of scatter …. It is
the regression coefficients which give us the laws of science.”
This is different in regression analysis. There are two forms of regression functions for
two variables X and Y:
Y = f (X) and X = f (Y ) (2.15)
with the regression coefficients

b = rxy · (sy / sx) for Y = f(X) and b′ = rxy · (sx / sy) for X = f(Y) (2.16)
Residuals
Due to random influences, the estimated values ŷi and the observed values yi will not
be identical. The differences between the observed and the estimated y-values are called
residuals and they are usually denoted by “e” (for “error”):
ei = yi − ŷi (i = 1, . . . , N) (2.18)
with
yi observed value of the dependent variable Y
ŷi estimated value of Y for xi
N number of observations
Table 2.5 shows the sales estimated with the regression function in Eq. (2.13) for the
given values of X. The last two columns list the residuals and the squared residuals.
The residuals are caused by influences on sales that are not considered in the model.
There are two types of such influences:
SSR = e1² + e2² + … + eN² = Σ_{i=1}^{N} ei² → min! (2.20)
SSR = Σ_{i=1}^{N} (yi − a − b·xi)² → min_{a, b}! (2.21)
The sum of squared residuals is a function of the unknown regression parameters a and
b. The resulting optimization problem can be solved by differential calculus, i.e. by tak-
ing partial derivatives with respect to a and b. In this way, we can derive the formulas
a = ȳ − b·x̄,  b = rxy · (sy / sx)
that we used above for calculation. The minimum value of SSR (see Table 2.5, bottom
right) is given by the values a = 560 and b = 9.63. No other values for a and b can make
this sum smaller.9
This method of estimation is called the “method of least squares” (LS). Concerning
linear regression, it is also called ordinary least squares (OLS). It was developed by the
great mathematician Carl Friedrich Gauß and is the most widely used statistical method for the estimation of parameters. Gauß was able to show that the least-squares criterion will, under certain assumptions (see Sect. 2.2.5), yield best linear unbiased estimators (BLUE).10
9 With the optimization tool Solver of MS Excel it is easy to find this solution without differential
calculus or knowing any formulas. One chooses the cell that contains the value of SSR (the sum at
the bottom of the rightmost column in Table 2.5) as the target cell (objective). The cells that con-
tain the parameters a and b are chosen as the changing cells. Then minimizing the objective will
yield the least-squares estimates of the parameters within the changing cells.
10 Carl Friedrich Gauß (1777–1855) used the method in 1795 at the age of only 18 years for calcu-
lating the orbits of celestial bodies. This method was also developed independently by the French
mathematician Adrien-Marie Legendre (1752–1833). G. Udny Yule (1871–1951) first applied it to
regression analysis.
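In the spirit of the Solver footnote, the least-squares property can also be checked numerically without any calculus: the closed-form a and b leave no room for improvement, so every perturbation increases SSR. The data below are hypothetical.

```python
import numpy as np

# Numeric check of the least-squares idea: the closed-form estimates
# a = y̅ − b·x̅ and b = r·sy/sx minimize the sum of squared residuals;
# any perturbation of them makes SSR strictly larger. Hypothetical data.
x = np.array([200., 210., 220., 230., 240., 250.])
y = np.array([2400., 2600., 2550., 2800., 2900., 3100.])

def ssr(a, b):
    return np.sum((y - a - b * x) ** 2)

b = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)  # Eq. (2.11)
a = y.mean() - b * x.mean()                                  # Eq. (2.12)

for da, db in [(50, 0), (-50, 0), (0, 0.5), (0, -0.5), (25, -0.3)]:
    assert ssr(a + da, b + db) > ssr(a, b)  # no perturbation does better
```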
Σ_{i=1}^{N} |ei| → min! (2.22)
An advantage of LAD is that it is more robust than OLS. It is less sensitive to outliers,
i.e., observations with unusually large deviations from the regression line. By squaring
these deviations, they have a stronger effect on the estimation results in OLS. This is
especially a problem for small sample sizes and can be seen as a disadvantage of OLS.
The LAD method looks simpler than OLS, but it is computationally more difficult to
handle since we cannot use differential calculus for solving the optimization problem.
Instead, iterative numerical methods are necessary. Before the invention of the computer, this made the application of LAD practically impossible, while today this is not such a problem any more. But one cannot analytically derive such nice formulas for the estimation of the parameters as Eqs. (2.11) and (2.12). Another problem of LAD is that it does not always give a unique solution; multiple solutions are possible. But in most cases, OLS and LAD will yield very similar results.
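The robustness contrast can be sketched numerically with hypothetical data. LAD is approximated here by iteratively reweighted least squares, one of the iterative methods mentioned above; a single outlier drags the OLS slope far more than the LAD slope.

```python
import numpy as np

# OLS vs. a LAD sketch on hypothetical data with one large outlier.
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = 2.0 * x + 1.0            # true line: slope 2, intercept 1
y[-1] += 40.0                # one large outlier

X = np.column_stack([np.ones_like(x), x])   # intercept column of ones

# OLS: minimize the sum of squared residuals
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# LAD sketch: minimize the sum of absolute residuals via iteratively
# reweighted least squares (weights ~ 1/|residual|)
w = np.ones_like(y)
for _ in range(100):
    W = np.diag(w)
    b_lad = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    resid = np.abs(y - X @ b_lad)
    w = 1.0 / np.maximum(resid, 1e-8)       # downweight large residuals

# The LAD slope stays near the true value 2; OLS is pulled away.
assert abs(b_lad[1] - 2.0) < abs(b_ols[1] - 2.0)
```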
This approach of estimating a separate regression function for each independent variable
poses some problems:
• If the manager fixes his marketing plan for the coming period, each equation will give
a different value for the expected sales.
• The parameters of each of the three equations will probably be biased by having
neglected the other two variables.
So, instead of having three separate regression functions that are all not very accurate, it
would be better to have one regression function with all three independent variables that
will give more accurate results. This possibility is offered by multiple regression. The
use of simple regression should be restricted to problems where we have only one inde-
pendent variable.
Ŷ = b0 + b1 X1 + b2 X2 + . . . + bj Xj + . . . + bJ XJ (2.23)
where J denotes the number of independent variables. For technical reasons, we will
now denote the constant term a by b0.11 Unfortunately, multiple regression does not yield
such nice formulas for the estimation of the parameters as simple regression.
Ŷ = b0 + b1 X1 + b2 X2 (2.25)
with X1 = advertising and X2 = price.
Note that this function does not represent a line, as in a simple regression. It now
defines a plane in a three-dimensional space spanned by the three variables. For more
than two regressors the regression function becomes a hyperplane.
11 When using matrix algebra for calculation, the constant term is treated as the coefficient of a
fictive variable with all values equal to 1. By this, it can be computed in the same way as the other
coefficients and the calculation becomes easier.
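Footnote 11 can be sketched directly: with a column of ones added to the data matrix, all coefficients including b0 come out of one matrix computation, e.g. the normal equations b = (X′X)⁻¹X′y. The data below are hypothetical (two regressors, as in Eq. 2.25).

```python
import numpy as np

# Constant term as the coefficient of a fictive all-ones regressor.
x1 = np.array([200., 210., 220., 230., 240., 250.])   # advertising
x2 = np.array([3.0, 2.9, 3.1, 2.8, 3.0, 2.7])         # price
y = np.array([2400., 2600., 2550., 2800., 2900., 3100.])

X = np.column_stack([np.ones_like(y), x1, x2])        # ones column -> b0
b = np.linalg.solve(X.T @ X, X.T @ y)                 # b0, b1, b2

# Same result as the numerically more stable least-squares routine:
b_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(b, b_lstsq)
```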
When using the LS criterion, we have to minimize the following sum of squared
residuals:
SSR = Σ_{i=1}^{N} ei² = Σ_{i=1}^{N} (yi − b0 − b1·x1i − b2·x2i)² → min_{b0, b1, b2}! (2.26)
Minimizing this sum by taking partial derivatives with respect to the three unknown parameters b0, b1, b2 yields the estimates and with them the value SSR = 254,816 (in contrast to SSR = 271,217 in Table 2.5). Thus, by taking into account the influence of price on sales, we can reduce the sum of squared residuals by 16,401 or 6%. This is not very much, but because of the low correlation between sales and price, we could not expect more.
The resulting regression function is:
For the simple regression we can see from Eq. (2.11) that the value of the regression
coefficient b is determined by the value of the correlation r between the two variables
and the ratio of the standard deviations:
b = rxy · (sy / sx)
Thus, the greater the standard deviation of the independent variable, the smaller b will
be. For our data the following applies:
sx1 = 18.07 for advertising and
sx2 = 0.201 for price.
The standard deviation of advertising is much greater than that of price. Thus, for
an equal importance of advertising and price, a much smaller regression coefficient for
advertising is to be expected.
A way to make the regression coefficients comparable is to standardize them. The
standardized regression coefficients are usually called beta coefficients. They are calcu-
lated as follows:
betaj = bj · (sxj / sy) (2.28)
When comparing this formula with Eq. (2.11), we can see that the scaling of the varia-
bles X and Y is eliminated in the beta coefficients. Thus, the beta coefficients are inde-
pendent of any linear transformations of the variables and can thus be used as a measure
of importance. We get for
From this, we can see that in our example advertising has a much greater influence on
sales variation than price. One may wonder about the comparatively small importance
of price in this case. This can have several reasons. We have seen above that price has
a very low correlation with sales. This may be due to the fact that the prices in our data
set were taken from a retail panel and are average values over many sales outlets. Due to
this, the variation in price is probably diminished and the correlation somewhat blurred.
We could also get the beta coefficients by standardizing the variables. In this case,
OLS would yield regression coefficients that are identical with the beta coefficients and
the constant term would be zero. For simple regression, the beta coefficient will be iden-
tical with the correlation coefficient. But in general, the beta coefficients cannot be inter-
preted as correlation coefficients.
Note: For estimating the effects of changes in the independent variables or mak-
ing predictions of the dependent variable, we need the non-standardized regression
coefficients.
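The two routes to beta coefficients described above can be sketched side by side with hypothetical data: rescaling the unstandardized bj by sxj/sy (Eq. 2.28), or z-standardizing all variables and rerunning OLS, in which case the constant term vanishes.

```python
import numpy as np

# Beta coefficients two ways, on hypothetical data.
x1 = np.array([200., 210., 220., 230., 240., 250.])
x2 = np.array([3.0, 2.9, 3.1, 2.8, 3.0, 2.7])
y = np.array([2400., 2600., 2550., 2800., 2900., 3100.])

X = np.column_stack([np.ones_like(y), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]             # b0, b1, b2

beta1 = b[1] * x1.std(ddof=1) / y.std(ddof=1)        # Eq. (2.28)
beta2 = b[2] * x2.std(ddof=1) / y.std(ddof=1)

z = lambda v: (v - v.mean()) / v.std(ddof=1)         # z-standardization
Zx = np.column_stack([np.ones_like(y), z(x1), z(x2)])
bz = np.linalg.lstsq(Zx, z(y), rcond=None)[0]

assert np.isclose(bz[0], 0.0)                        # constant term is zero
assert np.allclose(bz[1:], [beta1, beta2])           # same beta coefficients
```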
Ŷ = b0 + b1 X1 + b2 X2 + b3 X3 (2.29)
With the data in Table 2.3, OLS-estimation yields the following regression function:
Once we have estimated a regression function, we need to assess its goodness or quality.
Nobody wants to rely on a bad model. We need to know how our model fits the empirical
data and whether it is suitable as a model of reality. For this reason, we need measures
for evaluating the goodness-of-fit.
A natural basis for evaluating the goodness-of-fit is provided by the fitting criterion
of regression, the sum of squared residuals (SSR). We have already compared the three
models above by their SSR. With each regressor that was added, SSR became smaller
and thus the fit to the data improved.
However, the absolute value of SSR has no meaning, because it depends not only on
the goodness-of-fit but also on the number of observations and the scaling of Y. So, by
using SSR we can only compare models for the same data set. And SSR does not tell us
whether a model is good or bad, or how good or bad it is.
Suitable measures for assessing the goodness-of-fit of a model are:
R² = r²yx (2.31)

R² = r²yŷ (2.32)
The correlation coefficient ryŷ between the observed and the fitted y-values is called mul-
tiple correlation because Ŷ is a linear combination of the x-variables.
R-square can be interpreted as the proportion of total variation in Y that is explained
by the independent variables. The higher R-square, the better the fit. This intuitively easy
interpretation is the reason for the popularity of R-square. In our example we get the fol-
lowing values:
So, while with advertising alone we can explain only 55% of the variation in sales, with
the full model we can explain 92%. With each additional regressor R-square increases.
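These two views of R-square can be checked numerically on hypothetical simple-regression data: the explained share of variation, 1 − SSR/SST, equals the squared correlation between the observed and the fitted y-values.

```python
import numpy as np

# R-square two ways, following Eqs. (2.31)/(2.32). Hypothetical data.
x = np.array([200., 210., 220., 230., 240., 250.])
y = np.array([2400., 2600., 2550., 2800., 2900., 3100.])

b, a = np.polyfit(x, y, 1)              # OLS slope and intercept
y_hat = a + b * x                       # fitted values

ssr = np.sum((y - y_hat) ** 2)          # residual variation
sst = np.sum((y - y.mean()) ** 2)       # total variation
r2_share = 1 - ssr / sst                # explained share of variation

r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2   # squared multiple correlation

assert np.isclose(r2_share, r2_corr)
```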
(Fig. 2.9: Decomposition of the variation of Y)
This decomposition of each deviation, (yi − ȳ) = (ŷi − ȳ) + (yi − ŷi), is quite trivial and can be easily confirmed in Fig. 2.9. However, it is not trivial that this equation is still valid if the elements are squared and summed over the observations.12 This results in the principle of the decomposition of the sample variation of Y.
12 This holds only for linear models and LS estimation. The principle is also of central importance
for the analysis of variance (ANOVA, cf. Chap. 3) and for discriminant analysis (cf. Chap. 4).
like economics, sociology, or psychology, where human behavior is involved. Here one
has to cope with many influencing variables and a great amount of randomness. The kind
of data also makes a difference. For experimental data, we can expect higher values of
R-square than for observational data. And for aggregated data, we can expect higher
values than for individual data. So, to judge the value of R-square, it is necessary to
compare it with values from similar applications. Ultimately, the interpretation of all
measurements depends on comparisons.
13 This is called inferential statistics and has to be distinguished from descriptive statistics.
Inferential statistics makes inferences and predictions about a population based on a sample drawn
from the studied population.
The error term ε is the stochastic component. It represents all influences on Y that are
not explicitly contained in the systematic component. These may be errors in the meas-
urement of Y or influences that are unknown or cannot be measured. The error term is
also called disturbance, because it disturbs the estimation of the systematic component
that we are interested in. The errors are not observable, but they become manifest in the
residuals ei. We assume that the random errors are independent of the x-values and have
the mean value zero.
Regarding sales data, for example, there are countless influences by competitors,
retailers, and buyers. The behavior of humans always has some degree of randomness.
Besides, there are various macroeconomic, social, and other environmental influences.
Usually, the data are obtained by sampling and a random sampling error is unavoidable.
It is therefore justified to regard the error term as a random variable.
Since the dependent variable Y contains the error ε, it also is a random variable. Thus,
the estimated regression parameters bj that are obtained from observations of Y are real-
izations of random variables. In the case of repeated random samples, these estimates
fluctuate around the true values βj.
Based on this reasoning we can check the statistical significance of a model. The
research question is: Can the model, or at least one of the independent variables, contrib-
ute to explaining the variation of the dependent variable Y? To answer this question, we
test the null hypothesis
H0 : β1 = β2 = . . . = βJ = 0 (2.37)
versus the alternative hypothesis
H1: at least one βj is non-zero
To prove this, we have to reject the null hypothesis. This can be done by an F-test. For
this purpose, it is useful to organize the data in an ANOVA table as used in the analysis
of variance (see Chap. 3).
ANOVA Table
By dividing sums of squares (SS) by their corresponding degrees of freedom (df) we get
mean squares (MS) or sample variances. In Table 2.7 we do this with the values of our
Model 3 for the explained, residual and total sums of squares.
The degrees of freedom for the explained variation (df1) are given by the number of
independent variables in the regression model. The degrees of freedom for the residual
variation (df2) are given by the number of observations minus the number of parame-
ters in the regression model: df2 = N – (J + 1). For a model without an intercept, we get
df2 = N – J.
From Eq. (2.34) we know that the explained variation and the residual variation sum up
to the total variation: SST = SSE + SSR. The same holds for the corresponding degrees of
freedom: df3 = df1 + df2. But this is not valid for the variances: MST ≠ MSE + MSR.
F-test
For performing an F-test we must compute an empirical value of the F-statistic.14 With
the values in Table 2.7 we get for our Model 3:
Femp = MSE / MSR = explained variance / unexplained variance = 185,704 / 5,896 = 31.50 (2.38)
Under the null hypothesis, the F-statistic follows an F-distribution. Its density function
for the degrees of freedom in our example is displayed in Fig. 2.10 (the lower line).
Taking the degrees of freedom into account, we can write the F-statistic as a function of
R-square. The F-statistic in R-square form is
Femp = (R² / J) / ((1 − R²) / (N − J − 1)) (2.39)
Using this form, it is easy to test an R-square or a simple coefficient of correlation.15
From the empirical F-value, we can derive an empirical significance level, the
p-value. In SPSS, the p-value is referred to as “Significance” or “Sig”. Figure 2.10 shows
the p-value as a function of Femp. The greater Femp, the smaller is p. For Femp = 31.50
with df1 = J and df2 = N – J – 1 (in the numerator and the denominator) we get
p = 0.009%.16
We reject H0 if p < α and conclude that the estimated regression function is statistically significant at the significance level α (alpha). α is the probability that H0 will be falsely rejected if it is true (type I error). Commonly the value α = 0.05 or 5% is chosen.17
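That Eqs. (2.38) and (2.39) are two forms of the same statistic can be verified with the book's Model 3 numbers (N = 12 observations, J = 3 regressors, MSE and MSR from Table 2.7):

```python
# The two forms of the F-statistic give the same value.
N, J = 12, 3
mse, msr = 185_704, 5_896          # explained / unexplained variance
f_from_ms = mse / msr              # Eq. (2.38)

sse, ssr = mse * J, msr * (N - J - 1)   # back out the sums of squares
r2 = sse / (sse + ssr)                  # R-square of Model 3 (~0.92)
f_from_r2 = (r2 / J) / ((1 - r2) / (N - J - 1))   # Eq. (2.39)

assert abs(f_from_ms - f_from_r2) < 1e-9
assert round(f_from_ms, 1) == 31.5
```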
“gold” standard in statistics that goes back to Sir R. A. Fisher (1890–1962) who also created the
F-distribution. But the researcher must also consider the consequences (costs) of making a wrong
decision.
(Fig. 2.10: Density f(F, 3, 8) of the F-distribution with the critical value for α = 5%, Femp, and the p-value)
Figure 2.10 shows that our empirical F-value Femp = 31.50 is much larger than the critical F-value for α = 5%. So our p is almost zero, and a value this large would be practically impossible if H0 were true. Our estimated regression function for Model 3 is highly significant. Our Models 1 and 2 are also statistically significant, as can be checked with the given values of R-square.
• R-square does not take into account the number of observations (sample size N) on
which the regression is based. But we will have more trust in an estimation that is
based on 50 observations than in one that is based on only 5 observations. In the
extreme case with only two observations, a simple regression would always yield
R2 = 1 since a straight line can always be laid through two points without deviations.
But for this, we would not need a regression analysis.
• R-square does not consider the number of independent variables contained in the
regression model and thus the complexity of the model. We mentioned the principle
of parsimony for model building. Making a model more complex by adding variables
(increasing J) will always increase R-square, but not necessarily increase the good-
ness of the model.
The amount of “explanation” added by a new variable may only be a random effect.
Moreover, with an increasing number of variables, the precision of the estimates can
decrease due to multicollinearity between the variables (see Sect. 2.2.5.7).
Also, with too much fitting, called “overfitting”, “the model adapts itself too closely
to the data, and will not generalize well” (cf. Hastie et al. 2011, p. 38). This especially
concerns predictions: We are not interested in predicting a value yi that we used already
for estimating the model. We are more interested in predicting a value yN+i that we have
not yet observed. And for this, a simpler model may be better than a more complex
model, because every parameter in the model contains some error.
On the other hand, if the model is omitting relevant variables and is not complex
enough, called “underfitting”, the estimates of the model parameters will be biased, i.e.
contain systematic errors (see Sect. 2.2.5.2). Again, large prediction errors will result.
Remember: Modeling is a balancing act between simplicity and complexity, or
between underfitting and overfitting.
The inclusion of a variable in the regression model should always be based on log-
ical or theoretical reasoning. It is bad scientific style to haphazardly include several or
all available variables into the regression model in the hope of finding some independ-
ent variables with a statistically significant influence. This procedure is sometimes called
“kitchen sink regression”. With today’s software and computing power, the calculation is
very easy and such a procedure is tempting. As R-square cannot decrease by adding vari-
ables to a regression model, it cannot indicate the “badness” caused by overfitting.
For these reasons, in addition to R-square, an adjusted coefficient of determination
(adjusted R-square) should also be calculated. With the values in Table 2.7 we get:
with R²adj < R².
The adjusted R-square uses the same information as the F-statistic. Both statistics con-
sider the sample size and the number of parameters. To compare the adjusted R-square
with R-square, we can write:
R²adj = 1 − (1 − R²) · (N − 1) / (N − J − 1) (2.41)
The adjusted R-square becomes smaller when the number of regressors increases (other
things being equal) and can also become negative. Thus, it penalizes increasing model
complexity or overfitting.18 In our example we get the following values:
By including price into the model, the adjusted R-square decreases. Price contributes
only little to explaining the sales volume and its contribution cannot compensate for
the penalty for increasing model complexity. With the inclusion of promotion, we get
another picture. Promotion strongly boosts the explained variation. Here the increase of
model complexity plays only a minor role.
The term adjusted R-square may be misunderstood, because R²adj is not the square of any correlation. Another name, corrected R-square, is also misleading, because it suggests that R² is false, which is not the case.
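The penalty mechanism of Eq. (2.41) is easy to see numerically. With N = 12 as in our example and an R-square of roughly 0.92 (about the book's Model 3 value), adding regressors lowers the adjusted value:

```python
# Adjusted R-square (Eq. 2.41) penalizes added regressors.
N = 12

def r2_adj(r2, j):
    """Eq. (2.41): adjust R-square for J = j regressors and N observations."""
    return 1 - (1 - r2) * (N - 1) / (N - j - 1)

r2 = 0.92                               # roughly Model 3's R-square
assert r2_adj(r2, 3) < r2               # always below R-square
assert r2_adj(r2, 5) < r2_adj(r2, 3)    # more regressors, bigger penalty
```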
18 Other criteria for model assessment and selection are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). See, e.g., Agresti (2013, p. 212); Greene (2012, p. 179); Hastie et al. (2011, pp. 219–257).
Variation of the x-values and a sufficient sample size are essential for getting reliable results in regression analyses. By way of comparison: you cannot keep a stable position balancing on one leg. So, if the variance of the x-values and/or the sample size is small, the regression analysis will be a shaky affair. In an experiment, the researcher can control these two conditions: he can manipulate the independent variable(s) and determine the sample size. But mostly we have to cope with observational data. Experiments are not always possible, and a larger sample size takes more time and leads to higher costs.
For multiple regressions, the formula for the standard error of an estimated coefficient
extends to:
SE(bj) = SE / (s(xj) · √(N − 1) · √(1 − R²j)) (2.43)
where R²j denotes the R-square for a regression of the regressor j on all other independent variables. R²j is a measure of multicollinearity (see Sect. 2.2.5.7). It refers to the relationships among the x-variables. The precision of an estimated coefficient increases (other things being equal) with a smaller R²j, i.e. with less correlation of xj with the other x-variables.
For our Model 3 and variable j = 1 (advertising) we get the following standard error
for b1:
SE(b1) = 76.8 / (18.07 · √(12 − 1) · √(1 − 0.089)) = 1.34
We estimated b1 = 7.91. So, the relative standard error for the coefficient of advertising
has now decreased to 0.17 or 17%. This is due to a substantial reduction of the standard
error of the regression in Model 3.
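The calculation above can be reproduced directly from the values given in the text (SE of the regression = 76.8, s(x1) = 18.07, N = 12, R²1 = 0.089):

```python
import math

# Recomputing SE(b1) for Model 3 via Eq. (2.43), with the text's values.
se_reg, s_x1, N, r2_j = 76.8, 18.07, 12, 0.089

se_b1 = se_reg / (s_x1 * math.sqrt(N - 1) * math.sqrt(1 - r2_j))
assert round(se_b1, 2) == 1.34
```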
19 Fora brief summary of the basics of statistical testing see Sect. 1.3.
20 With Excel we can calculate the critical value tα/2 for a two-tailed t-test by using the function T.INV.2T(α;df). We get: T.INV.2T(0.05;8) = 2.306.
21 The p-values can be calculated with Excel by using the function T.DIST.2T(ABS(temp);df). For the variable price we get: T.DIST.2T(3.20;8) = 0.0126 or 1.3%.
Fig. 2.11 t-distribution and critical values for error probability α = 5% (two-tailed t-test)
In regression analysis, a one-tailed t-test offers greater power, since smaller deviations from zero are then statistically significant and the danger of a type II error (accepting a wrong null hypothesis) is reduced. But a one-tailed test requires more reasoning and a priori knowledge on the researcher's side.
A one-tailed t-test is appropriate if the test outcome has different consequences
depending on the direction of the deviation. Our manager will spend money on adver-
tising only if advertising has a positive effect on sales. He will not spend any money if
the effect is zero or if it is negative (while the size of the negative effect does not matter).
Thus, he wants to prove the alternative hypothesis
H1 : βj > 0 versus the null hypothesis H0 : βj ≤ 0 (2.45)
H0 states the opposite of the research question. The decision criterion is:
If temp > tα , then reject H0 (2.46)
Now the critical value for a one-tailed t-test at α = 5% is only tα = 1.86.22 This value
is much smaller than the critical value tα/2 = 2.306 for the two-tailed test. As the rejec-
tion region is only in the upper tail (right side), the test is also called an upper tail test.
The rejection region on the upper tail now has double the size (α instead of α/2). Thus, a lower value of temp is significant.
Using the p-value, the decision criterion is the same as before: We reject H0 if p < α.
But the one-tailed p-value is just half the two-tailed p-value. Thus, if we know the two-
tailed p-value, it is easy to calculate the one-tailed p-value by dividing it by 2. Table 2.8
gives the p-value p = 0.0004 or 0.04% for the variable advertising. Thus, the one-tailed
p-value is p = 0.02%.23
22 With Excel we can calculate the critical value tα for a one-tailed t-test by using the function
T.INV(1 – α;df). We get: T.INV(0.95;8) = 1.860.
23 With Excel we can calculate the p-value for the right tail by the function T.DIST.RT(temp;df).
The confidence interval therefore is a range around bj in which the unknown value βj can
be found with a certain probability. Its size depends on a specified error probability α (or
confidence level 1 – α). We can calculate it as
bj − tα/2 · SE(bj ) ≤ βj ≤ bj + tα/2 · SE(bj ) (2.48)
with
βj true regression coefficient (unknown)
bj estimated regression coefficient
tα/2 theoretical t-value for error probability α and df = N–J–1
SE(bj) standard error of bj
With a probability (confidence level) of (1–α) the true value βj is located in the given
interval around the estimate bj. With an error probability of α it is located outside of the
confidence interval. The lower α, the larger the interval.
All values needed for the calculation were already used above for the t-test (see
Table 2.8). For the variable advertising and Model 3, we get
7.91 − 2.306 · 1.342 ≤ βj ≤ 7.91 + 2.306 · 1.342
4.81 ≤ βj ≤ 11.00
This is the interval for the error probability α = 5% or confidence level 95%. Thus, with a
probability of 95% the true regression coefficient of the variable ‘advertising’ is between
4.81 and 11.00. If we increase the confidence level, the interval will increase accord-
ingly. Table 2.9 shows the confidence intervals for all regression coefficients of Model 3.
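The interval for the advertising coefficient can be reproduced from the values used above (b = 7.91, SE = 1.342, tα/2 = 2.306); small deviations from the book's bounds are rounding effects.

```python
# 95% confidence interval for the advertising coefficient, Eq. (2.48),
# with the values from the text.
b, se, t_crit = 7.91, 1.342, 2.306

lower = b - t_crit * se
upper = b + t_crit * se

# Matches the text's interval [4.81, 11.00] up to rounding.
assert abs(lower - 4.81) < 0.01
assert abs(upper - 11.00) < 0.01
```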
Table 2.10 gives an overview on the violations of these assumptions and their conse-
quences, which we will discuss in the following. The first three assumptions are the most
25 Cf. e.g., Kmenta (1997, p. 392); Fox (2008, p. 105); Greene (2012, p. 92); Wooldridge (2016, p. 79); Gelman and Hill (2018, p. 45). You will find slight differences between the formulations of the different authors.
important ones because they concern the validity of the results. Together with assumptions 4 and 5, the method of least squares yields unbiased and efficient linear estimators of the parameters. This characteristic is called BLUE (best linear unbiased estimators), where "best" stands for the smallest possible variance.26
Assumption 6 is needed for significance tests and confidence intervals. This assump-
tion is supported by the central limit theorem of statistics.27 Perfect multicollinearity
should not occur. If it does, there is a mistake in modeling. But strong multicollinearity is
a frequent problem.
In general, the effect or harm caused by a violation of these assumptions depends on
the degree of the violation. Thus, any violation can be harmful. But the good news is that
minor violations will not do any harm.
It should be emphasized that meeting the stated assumptions of the linear regres-
sion model is a necessary but not a sufficient condition for getting good estimates. For
achieving high precision of the estimates, regression analysis also requires a sufficient
variation (spread) in the independent variables, sufficiently large sample sizes, and low
multicollinearity.
2.2.5.1 Non-linearity
The world is non-linear. In almost all instances, linear models are a simplification of
reality. But after all, “all science is dominated by the idea of approximation” (Bertrand
Russell). For certain ranges, depending on the data, linear models can provide good
approximations, and they are much easier to handle than non-linear models. But if we
26 This follows from the Gauss-Markov theorem. See e.g. Fox (2008, p. 103); Kmenta (1997,
p. 216).
27 The central limit theorem plays an important role in statistical theory. It states that the sum or mean of a large number of independent random variables is approximately normally distributed, regardless of their individual distributions.

(Fig. 2.12: Possible shapes of non-linear advertising response: concave, concave with wear-out, concave with saturation, S-shape with saturation)
extend the range, a linear model may become inappropriate. If there is a strong non-linear relationship between Y and any x-variable, the expected value E(εi) cannot be zero for every value of X, nor can it be independent of X.
A good example of this phenomenon can be found in advertising. If we double the
budget, the effect will usually not double. The more money we spend, the less will
be the marginal gains. Figure 2.12 shows possible curves of non-linear advertising
response. The same models are also used in other areas (e.g., epidemiology, diffusion of
innovations).
By transforming variables we can handle many non-linear problems within the linear
regression model. Assumption A1 of the regression model only postulates that the model
is linear in the parameters. Thus, a variable in the model can be a non-linear function of
an observed variable. To model a concave advertising response, we can transform adver-
tising expenditures X by a square root
X′ = √X (2.53)
and estimate the model
Y = α + β · X′ + ε (2.54)
by linear regression.
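The transformation in Eqs. (2.53) and (2.54) can be sketched in a few lines. The data here are made up so that Y = 10 + 20·√X holds exactly, and the helper `simple_ols` is ours, not part of the book:

```python
import math

def simple_ols(x, y):
    """Ordinary least squares for y = a + b*x with a single regressor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

# Made-up concave data that satisfy Y = 10 + 20 * sqrt(X) exactly
x = [1.0, 4.0, 9.0, 16.0, 25.0]
y = [10 + 20 * math.sqrt(v) for v in x]

x_t = [math.sqrt(v) for v in x]   # transformed regressor X' = sqrt(X), Eq. (2.53)
a, b = simple_ols(x_t, y)         # fit Y = a + b * X', Eq. (2.54)
print(a, b)                       # recovers a = 10, b = 20
```

The regression itself stays linear; only the regressor is transformed before estimation.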
In general, any variable X in a regression model can be replaced by a transformed variable

X′ = f(X) (2.55)
94 2 Regression Analysis
A multiplicative model

Y = α · X^β · ε (2.56)

can be linearized by taking logarithms:

ln Y = α′ + β · ln X + ε′ (2.57)

with α′ = ln α and ε′ = ln ε. This can be extended to multiple regression.
Another very flexible form of non-linear transformation is offered by polynomials. A polynomial regression of the jth degree is given by

Y = β0 + β1·X + β2·X² + β3·X³ + … + βj·X^j + ε (2.58)
The regression line in Fig. 2.13 shows a polynomial of the 2nd degree. With a polyno-
mial of the 3rd degree, we can create S-shaped functions.
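A polynomial regression such as Eq. (2.58) is still linear in its parameters, so it can be estimated by ordinary least squares on the powers of X. A minimal sketch with made-up data lying on an exact 2nd-degree polynomial; the plain normal-equations solver below is ours:

```python
def polyfit_ols(x, y, degree):
    """Least-squares polynomial fit via the normal equations."""
    n = degree + 1
    # Build X^T X and X^T y for the design matrix with columns x^0 .. x^degree
    A = [[sum(xi ** (i + j) for xi in x) for j in range(n)] for i in range(n)]
    c = [sum((xi ** i) * yi for xi, yi in zip(x, y)) for i in range(n)]
    # Solve A * beta = c by Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for k in range(col, n):
                A[r][k] -= f * A[col][k]
            c[r] -= f * c[col]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):
        beta[i] = (c[i] - sum(A[i][j] * beta[j] for j in range(i + 1, n))) / A[i][i]
    return beta  # [beta0, beta1, ..., beta_degree]

x = [0, 1, 2, 3, 4, 5]
y = [2 + 3 * v - 0.5 * v ** 2 for v in x]   # data on an exact 2nd-degree polynomial
beta = polyfit_ols(x, y, 2)
print(beta)                                  # recovers beta ~ (2, 3, -0.5)
```

With a 3rd-degree polynomial, the same solver produces the S-shaped functions mentioned above.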
Interaction Effects
Another form of non-linearity can be caused by interaction effects if the joint effect of
two independent variables is greater or smaller than the sum of the individual effects.
Such effects can occur, for example, between price and promotions. A price reduction
will often not be noticed by consumers if it is not accompanied by a promotion. And the
effect of a promotion will be increased by a price reduction. That is why they often go
together.
Fig. 2.13 Scatterplot of sales versus advertising expenditures with a 2nd-degree polynomial regression line
Such interactions can be modeled by including the product of the two variables in the
model:
Y = β0 + β1 · A + β2 · B + β3 · A · B + ε (2.59)
with A for price and B for promotion. The product A × B is called an interaction term.
One of two interacting variables can also be a moderating variable (see Fig. 2.17).
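Estimating Eq. (2.59) amounts to adding the product column A·B to the design matrix. The following sketch uses simulated data with assumed coefficients (100, −20, 5, 15); the helper `ols` and all values are ours:

```python
import random

def ols(X, y):
    """OLS coefficients via the normal equations; X holds rows without the intercept."""
    rows = [[1.0] + list(r) for r in X]              # prepend the intercept column
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    for col in range(k):                              # Gaussian elimination with pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv], c[col], c[piv] = A[piv], A[col], c[piv], c[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            A[r] = [arc - f * acc for arc, acc in zip(A[r], A[col])]
            c[r] -= f * c[col]
    beta = [0.0] * k
    for i in range(k - 1, -1, -1):
        beta[i] = (c[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

random.seed(1)
# Hypothetical data: Y depends on price A, promotion B, and their interaction A*B
ab = [(random.uniform(1, 3), random.choice([0.0, 1.0])) for _ in range(50)]
y = [100 - 20 * a + 5 * b + 15 * a * b + random.gauss(0, 0.5) for a, b in ab]
X = [(a, b, a * b) for a, b in ab]                   # interaction term as an extra column

beta = ols(X, y)
print([round(v, 1) for v in beta])                   # close to [100, -20, 5, 15]
```

The interaction coefficient β3 is recovered alongside the main effects; no new estimation method is needed.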
Detection of Non-linearity
Undiscovered non-linearity will lead to a bias (systematic distortion) in the estimated
parameters. Therefore, the detection and correct handling of non-linearities are impor-
tant. Often researchers are aware of non-linearities in their problems due to prior experi-
ence. If not, statistical tools can be used to check for non-linearity. A visual inspection of
a scatterplot of the data usually works best.
The scatterplot in Fig. 2.13 shows a non-linear association between sales and advertis-
ing expenditures. It can be seen that the mean of the error terms εi varies over the range of the x-values: in the medium range it will be above zero, and for low or high expenditures it will be below zero. Thus, E(εi | x1i, x2i, …, xJi) = 0 is violated by non-linearity.
To detect non-linearities in multiple regressions we can plot the y-values against each
independent variable. Another possibility is the use of a Tukey-Anscombe plot that we
will discuss in the following section.
2.2.5.2 Omitted Variables
If a relevant variable X2 is omitted from the true model Y = β0 + β1 X1 + β2 X2 + ε, we actually estimate the misspecified model

Y = β̃0 + β̃1 X1 + ε̃

with

ε̃ = ε + β2 X2
In the falsely specified model, the effect of X2 is absorbed by the error ε̃. If X1 and X2 are
correlated, then X1 and ε̃ in the second model are also correlated and A2 is violated (cf.
Kmenta 1997, p. 443; Fox 2008, p. 111).
For the two models above we estimate the regression functions:
Ŷ = a + b1 X 1 + b2 X2
Ŷ = ã + b̃1 X 1
The estimator b̃1 will be biased because it absorbs part of the effect of X2.
We will demonstrate this effect with our Models 1 and 2. Looking at the correla-
tion matrix in Table 2.6, we can see that price is positively correlated with advertising
(r = 0.155). Thus, the coefficient of advertising is biased in Model 1, where the price is
omitted. We estimated above:
Model 2: Ŷ = 814 + 9.97 X1 − 194.6 X2
Model 1: Ŷ = 560 + 9.63 X1
The estimate b̃1 = 9.63 for advertising in Model 1 shows a downward bias because it
contains the negative effect of price. By subtracting the coefficients of the “correct”
model (Model 2) from those of the biased model (Model 1), we get:

bias = b̃1 − b1 = b2 · r12 · (s2/s1) (2.61)

where s1 and s2 are the standard deviations of X1 and X2 and r12 is their correlation. From this formula we can learn: the bias increases with b2 and the correlation r12.
With the values in Tables 2.3 and 2.6 we get:
bias = −194.6 · 0.155 · (0.201/18.07) = −0.34
The bias is small here because the variable price has only a small influence on Y and is
only weakly correlated with advertising. The bias of omitting promotion is much larger
(>2) and positive. For training purposes, the reader may calculate this bias in Model 1.
We summarize: an omitted variable causes bias if it influences the dependent variable Y and is correlated with the independent variable(s) in the model. Conversely, an omitted variable does not cause bias if it is uncorrelated with the independent variable(s) in the model.
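The omitted-variable mechanism can be checked numerically: for simulated data, the difference between the misspecified slope b̃1 and the full-model slope b1 reproduces the in-sample identity bias = b2 · r12 · (s2/s1) exactly. All names and values below are illustrative:

```python
import random

def cov(a, b):
    """Sample covariance of two lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)

random.seed(7)
x1 = [random.gauss(10, 2) for _ in range(200)]
x2 = [0.3 * v + random.gauss(0, 1) for v in x1]          # correlated with x1
y = [5 + 2.0 * a - 3.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Full model: two-regressor OLS slopes in closed form (normal equations)
s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
sy1, sy2 = cov(y, x1), cov(y, x2)
det = s11 * s22 - s12 ** 2
b1 = (sy1 * s22 - sy2 * s12) / det
b2 = (sy2 * s11 - sy1 * s12) / det

# Misspecified model omitting x2: the slope absorbs part of x2's effect
b1_tilde = sy1 / s11

# bias = b2 * r12 * (s2 / s1); this identity holds exactly in-sample
r12 = s12 / (s11 * s22) ** 0.5
bias = b2 * r12 * (s22 ** 0.5 / s11 ** 0.5)
print(round(b1_tilde - b1, 6), round(bias, 6))   # the two values coincide
```

Setting the correlation between x1 and x2 to zero in the simulation makes the bias vanish, as stated above.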
28 Anscombe and Tukey (1963) demonstrated the power of graphical techniques in data analysis.
Tukey-Anscombe plots: residuals plotted against the estimated sales (fitted y-values); cf. Fig. 2.15
For Model 3, which includes the variables price and promotion, we get the scatterplot
in Fig. 2.16. Now the suspicious scatter on the right-hand side of Fig. 2.15 has vanished.
Fig. 2.16 Tukey-Anscombe plot for Model 3: residuals plotted against the estimated sales
Thus, great care has to be taken when concluding causality from a regression coefficient
(cf. Freedman 2002). Causality will be evident if we have experimental data.29 But most
data are observational data. To conclude causality from an association or a significant
correlation can be very misleading.
“Correlation is not causation” is a mantra that is repeated again and again in statistics.
The same applies to a regression coefficient. If we want to predict the effects of changes
in the independent variables on Y, we have to assume that a causal relationship exists.
But regression is blind to causality. Mathematically, we can regress an effect Y on its cause X (correct), but also a cause X on its effect Y. Data contain no information about causality; it is therefore the task of the researcher to justify interpreting a regression coefficient as a causal effect.
A danger is posed by the existence of lurking variables, which influence both the dependent variable and the independent variable(s) but are unobserved or unknown and thus omitted from the regression equation. Such variables are also called confounders (see
Fig. 2.17). They are confounding (confusing) the relationship between two variables X
and Y.
29 Inan experiment the researcher actively changes the independent variable X and observes
changes of the dependent variable Y. And, as far as possible, he tries to keep out any other influ-
ences on Y. For the design of experiments see e.g. Campbell and Stanley (1966); Green et al.
(1988).
Fig. 2.17 Causal diagrams with a third variable Z linked to X and Y: (a) spurious correlation, (b) confounding, (c) mediation
Example
A lot of surprise and confusion were created by a study on the relation between
chocolate consumption and the number of Nobel Prize winners in various countries
(R-square = 63%).30 It is claimed that the flavanols in dark chocolate (also contained
in green tea and red wine) have a positive effect on the cognitive functions. But one
should not expect to win a Nobel Prize simply by eating enough chocolate. The con-
founding variable is probably the wealth or the standard of living in the observed
countries. ◄
Causal Diagrams
Confounding can be illustrated by the causal diagrams (a) and (b) in Fig. 2.17. In dia-
gram (a) there is no causal relationship between X and Y. The correlation between X and
Y, caused by the lurking variable Z, is a non-causal or spurious correlation. If the con-
founding variable Z is omitted, the estimated regression coefficient is equal to the bias in
Eq. (2.61).
In diagram (b) the correlation between X and Y has a causal and a non-causal part.
The regression coefficient of X will be biased by the non-causal part if the confounder Z
is omitted. The bias in the regression is given by Eq. (2.61).31
Another frequent problem in causal analysis is mediation, illustrated in diagram (c).
Diagrams (b) and (c) look similar and the dataset of (c) might be the same as in (b),
but the causal interpretation is completely different. A classic example of mediation is
the placebo effect in medicine: a drug can have a biophysical effect on the body of the
patient (direct effect), but it can also act via the patient’s belief in its benefits (indirect
30 Switzerland was the top performer in chocolate consumption and number of Nobel Prizes. See
Messerli, F. H. (2012). Chocolate consumption, Cognitive Function and Nobel Laureates. The New
England Journal of Medicine, 367(16), 1562–1564.
31 For causal inference in regression see Freedman (2012); Pearl and Mackenzie (2018, p. 72).
Problems like this one are covered by path analysis, originally developed by Sewall Wright (1889–
1988), and structural equation modeling (SEM), cf. e.g. Kline (2016); Hair et al. (2014).
effect). We will give an example of mediation in the case study (Sect. 2.3). Thus, one
must clearly distinguish between a confounder and a mediator.32
2.2.5.3 Random Errors in the Variables
To illustrate the effect of random measurement errors, assume that the true values X∗ and Y∗ are linked by the simple relationship

Y∗ = X∗

which forms a diagonal line.
Now we assume that we can observe Y and X with the random errors εx and εy:
Y = Y ∗ + εy and X = X ∗ + εx
We assume the errors are normally distributed with means of zero and standard devia-
tions σεx and σεy .
Based on these observations of Y and X we estimate as usual:
Ŷ = a + b · X
32 “Mistaking a mediator for a confounder is one of the deadliest sins in causal inference.” (Pearl and Mackenzie 2018, p. 276).
Fig. 2.18 Scatterplots with regression line (solid) and SD line (dashed) for the four error scenarios
What is important now is that the two similar errors εx and εy have quite different effects
on the regression line. We will demonstrate this with the following four scenarios that are
illustrated in Fig. 2.18:
1) σεx = σεy = 0
No error. All observations are lying on the diagonal, the true model. By regression we
get correctly a = 0 and b = 1. The regression line is identical to the diagonal.
2) σεx = 0, σεy = 50
We induce an error in Y. This is the normal case in regression analysis. Despite con-
siderable random scatter of the observations, the estimated regression line (solid line)
shows no visible change.
The slope of the SD line (dashed line) has slightly increased because the standard
deviation of Y has been increased by the random error in Y.
3) σεx = 50, σεy = 50.
We now induce an error in X that is equal to the error in Y. The regression line moves
clockwise. The estimated coefficient b < 0.75 is now biased downward (toward zero).
The slope of the SD line has also slightly decreased because the standard deviation
of X has been increased by the random error in X. The deviation between the SD line
and the regression line has increased because the correlation between X and Y has
decreased (random regression effect).
4) σεx = 100, σεy = 50
We now double the error in X. The effects are the same as in 3), but stronger. The
coefficient b < 0.5 is now less than half of the true value.
Table 2.12 shows the numerical changes in the four different scenarios.
The effect of the measurement error in X can be expressed by
b = β · reliability (2.62)
where β is the true regression coefficient (here β = 1) and reliability expresses the
amount of random error in the measurement of X. We can state:
reliability = σ²(X∗) / (σ²(X∗) + σ²(εx)) ≤ 1 (2.63)
Reliability is 1 if the variance of the random error in X is zero. The greater the random
error, the lower the reliability of the measurement.
Diminishing reliability affects the correlation coefficient as well as the regression
coefficient. But the effect on the regression coefficient is stronger, as the random error in
X also increases the standard deviation of X.
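The attenuation described by Eqs. (2.62) and (2.63) can be reproduced by simulation. The chosen numbers are made up: with a true-X spread of 50 and an error spread of 50, the reliability is 0.5, so the estimated slope should be pulled toward half the true value:

```python
import random

random.seed(42)
N = 100_000
beta = 1.0
sigma_true, sigma_ex = 50.0, 50.0      # made-up spreads of true X and of the error in X

x_true = [random.gauss(300, sigma_true) for _ in range(N)]
y = [beta * v + random.gauss(0, 50) for v in x_true]       # random error in Y
x_obs = [v + random.gauss(0, sigma_ex) for v in x_true]    # random error in X

# OLS slope of y on the error-contaminated regressor
mx, my = sum(x_obs) / N, sum(y) / N
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x_obs, y)) / \
    sum((xi - mx) ** 2 for xi in x_obs)

reliability = sigma_true ** 2 / (sigma_true ** 2 + sigma_ex ** 2)   # Eq. (2.63)
print(round(b, 2), reliability)    # b is attenuated toward beta * reliability = 0.5
```

Doubling the error in X in this simulation lowers the reliability further and attenuates the slope even more, mirroring scenario 4 above.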
The effect of biasing the regression coefficient toward zero is called regression to the
mean (moving back to the average), whence regression got its name.33 It is important
33 The expression goes back to Francis Galton (1886), who called it “regression towards medioc-
rity”. Galton wrongly interpreted it as a causal effect in human heredity. It is ironic that the first
and most important method of multivariate data analysis got its name from something that means
the opposite of what regression analysis actually intends to do. Cf. Kahneman (2011, p. 175); Pearl
and Mackenzie (2018, p. 53).
to note that this is a purely random effect. To mistake it for a causal effect is called the
regression fallacy (regression trap).34
In practice, it is difficult to quantify this effect, because we usually do not know the
error variances.35 But it is important to know about its existence so as to avoid the regres-
sion fallacy. If there are considerable measurement errors in X, the regression coefficient
tends to be underestimated (attenuated). This causes non-significant p-values and type II
errors in hypothesis testing.
2.2.5.4 Heteroscedasticity
Assumption 3 of the regression model states that the error terms should have a constant
variance. This is called homoscedasticity, and non-constant error variance is called het-
eroscedasticity. Scedasticity means statistical dispersion or variability and can be meas-
ured by variance or standard deviation.
As the error term cannot be observed, we again have to look at the residuals.
Figure 2.19 shows examples of increasing and decreasing dispersion of the residuals in a
Tukey-Anscombe plot.
34 Cf. Freedman et al. (2007, p. 169). In econometric analysis this effect is called least squares
attenuation or attenuation bias. Cf., e.g., Kmenta (1997, p. 346); Greene (2012, p. 280);
Wooldridge (2016, p. 306).
35 In psychology great efforts have been undertaken, beginning with Charles Spearman in 1904, to
measure empirically the reliability of measurement methods and thus derive corrections for attenu-
ation. Cf., e.g., Hair et al. (2014, p. 96); Charles (2005).
Heteroscedasticity does not lead to biased estimators, but the precision of least-
squares estimation is impaired. Also, the standard errors of the regression coefficients,
their p-values, and the estimation of the confidence intervals become inaccurate.
To detect heteroscedasticity, a visual inspection of the residuals by plotting them
against the predicted (estimated) values of Y is recommended. If heteroscedasticity is
present, a triangular pattern is usually obtained, as shown in Fig. 2.19. Numerical testing
methods are provided by the Goldfeld-Quandt test and the method of Glejser.36
Goldfeld-Quandt test
A well-known test to detect heteroscedasticity is the Goldfeld-Quandt test, in which the
sample is split into two sub-samples, e.g. the first and second half of a time series, and
the respective variances of the residuals are compared. If perfect homoscedasticity exists,
the variances must be identical:
s1² = s2²,
i.e. the ratio of the two variances of the subgroups will be 1. The further the ratio devi-
ates from 1, the more uncertain the assumption of equal variance becomes. If the errors
are normally distributed and the assumption of homoscedasticity is correct, the ratio
of the variances follows an F-distribution and can, therefore, be tested against the null
hypothesis of equal variance:
H0: σ1² = σ2².
The F-test statistic is calculated as follows:

Femp = s1² / s2² (2.64)

with s1² = Σ(i=1..N1) ei² / (N1 − J − 1) and s2² = Σ(i=1..N2) ei² / (N2 − J − 1).
N1 and N2 are the numbers of cases in the two subgroups and J is the number of inde-
pendent variables in the regression. The groups are to be arranged in such a way that
s1² ≥ s2² applies. The empirical F-value is to be tested at a given significance level against the theoretical F-value with (N1 − J − 1, N2 − J − 1) degrees of freedom.
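The Goldfeld-Quandt computation can be sketched on simulated heteroscedastic data; the error spread deliberately grows with x, and all values are made up:

```python
import random

def residual_variance(x, y, J=1):
    """Fit y = a + b*x by OLS and return s^2 = sum(e^2) / (N - J - 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return sse / (n - J - 1)

random.seed(3)
# Simulated heteroscedastic data: the error spread grows with x
x = [float(i) for i in range(1, 101)]
y = [2 + 0.5 * xi + random.gauss(0, 0.1 * xi) for xi in x]

# Split the sample in half and compare the residual variances
s2_a = residual_variance(x[:50], y[:50])
s2_b = residual_variance(x[50:], y[50:])
F_emp = max(s2_a, s2_b) / min(s2_a, s2_b)   # arrange so that s1^2 >= s2^2
print(round(F_emp, 1))                       # a ratio far above 1 suggests heteroscedasticity
```

For the formal test, this F_emp would be compared with the critical F-value for the degrees of freedom given above.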
Method of Glejser
An easier way of detecting heteroscedasticity is the method of Glejser, in which the
absolute residuals are regressed on the regressors:
36 An overview of this test and other tests is given by Kmenta (1997, p. 292); Maddala and Lahiri
(2009, p. 214).
|ei| = β0 + Σ(j=1..J) βj · xji (2.65)
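This auxiliary regression can be sketched as follows on simulated heteroscedastic data (all values made up; a clearly positive slope on the absolute residuals points to increasing dispersion):

```python
import random

def ols_line(x, y):
    """Intercept and slope of an OLS line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

random.seed(5)
x = [float(i) for i in range(1, 101)]
y = [10 + 1.0 * xi + random.gauss(0, 0.2 * xi) for xi in x]   # spread grows with x

a, b = ols_line(x, y)
abs_res = [abs(yi - a - b * xi) for xi, yi in zip(x, y)]

# Regress the absolute residuals on the regressor, Eq. (2.65)
a2, b2 = ols_line(x, abs_res)
print(round(b2, 3))    # a clearly positive slope signals increasing error dispersion
```

In a full application, the significance of b2 would be judged with its usual t-test.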
2.2.5.5 Autocorrelation
Assumption 4 of the regression model states that the error terms are uncorrelated. If this
condition is not met, we speak of autocorrelation. Autocorrelation occurs mainly in time
series but can also occur in cross-sectional data (e.g., due to non-linearity). The devia-
tions from the regression line are then no longer random but depend on the deviations of
previous values. This dependency can be positive (successive residual values are close
to each other) or negative (successive values fluctuate strongly and change sign). This is
illustrated by the Tukey-Anscombe plot in Fig. 2.20.
Like heteroscedasticity, autocorrelation usually does not lead to biased estimators,
but the efficiency of least-squares estimation is diminished. The standard errors of the
regression coefficients, their p-values, and the estimation of the confidence intervals
become inaccurate.
Detection of Autocorrelation
To detect autocorrelation, again a visual inspection of the residuals is recommended by
plotting them against the predicted (estimated) values of Y.
A computational method for testing for autocorrelation is the Durbin-Watson test. The
Durbin-Watson test checks the hypothesis H0 that the errors are not autocorrelated:
Cov(εi , εi+r ) = 0
with r ≠ 0. To test this hypothesis, a Durbin-Watson statistic DW is calculated from the
residuals.
DW = Σ(i=2..N) (ei − ei−1)² / Σ(i=1..N) ei² ≈ 2 · [1 − Cov(εi, εi−1)] (2.66)
For sample sizes around N = 50, the Durbin-Watson statistic should roughly be between
1.5 and 2.5 if there is no autocorrelation.
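Eq. (2.66) is easily computed from a residual series. The two simulated series below (independent versus AR(1) with ρ = 0.8; all values made up) illustrate the rule of thumb:

```python
import random

def durbin_watson(e):
    """Durbin-Watson statistic from a list of residuals, Eq. (2.66)."""
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    den = sum(v ** 2 for v in e)
    return num / den

random.seed(0)
# Independent residuals: DW should come out close to 2
indep = [random.gauss(0, 1) for _ in range(500)]
# Positively autocorrelated residuals (AR(1) with rho = 0.8): DW falls well below 2
auto = [random.gauss(0, 1)]
for _ in range(499):
    auto.append(0.8 * auto[-1] + random.gauss(0, 1))

print(round(durbin_watson(indep), 2), round(durbin_watson(auto), 2))
```

For ρ = 0.8, the approximation 2·(1 − ρ) suggests a DW near 0.4, deep in the rejection region for positive autocorrelation.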
More exact results can be achieved by using the critical values dL (lower limit) and
dU (upper limit) from a Durbin-Watson table. The critical values for a given significance
level (e.g., α = 5%) vary with the number of regressors J and the number of observations
N.
Figure 2.21 illustrates this situation. It shows the acceptance region for the null
hypothesis (that there is no autocorrelation) and the rejection regions. And it also shows
that there are two regions of inconclusiveness.
Decision Rules for the (Two-sided) Durbin-Watson Test (Test of H0: d = 2):
1. Reject H0 if: DW < dL or DW > 4 − dL (autocorrelation).
2. Do not reject H0 if: dU < DW < 4 − dU (no autocorrelation).
3. The test is inconclusive in all other cases.
Fig. 2.21 Regions of the Durbin-Watson test: positive autocorrelation for DW < dL, inconclusive between dL and dU, no autocorrelation between dU and 4 − dU, inconclusive between 4 − dU and 4 − dL, negative autocorrelation for DW > 4 − dL
For our data (Model 1) we get DW = 2.04. This is very close to 2, so the hypothesis of no autocorrelation cannot be rejected.37 There is no reason to suspect autocorrelation.
2.2.5.6 Non-normality
The last assumption concerning the error terms states that the errors are normally dis-
tributed. This assumption is not necessary for getting unbiased and efficient estimations
of the parameters. But it is necessary for the validity of significance tests and confidence
intervals. For these, it is assumed that the estimated values of the regression parameters
are normally distributed.38 If this is not the case, the tests are not valid.
As the errors cannot be observed, we again have to look at the residuals to check the
normality assumption. Graphical methods are best suited for doing so.39 A simple way is
to inspect the distribution of the residuals via a histogram (Fig. 2.22). Since the normal
distribution is symmetric, this should also apply to the distribution of the residuals. But
for small sample sizes, this might not be conclusive.
37 From a Durbin-Watson table we derive the values dL = 0.97 and dU = 1.33 and thus
1.33 < DW < 2.67 (no autocorrelation).
38 If the errors are normally distributed, the y-values, which contain the errors as additive elements,
are also normally distributed. And since the least-squares estimators form linear combinations of
the y-values, the parameter estimates are normally distributed, too.
39 Numerical significance tests of normality: the Kolmogorov-Smirnov test and the Shapiro-Wilk
test.
Fig. 2.22 Histogram of the residuals
Fig. 2.23 Q-Q plot and P-P plot, based on standardized residuals
Better instruments for checking normality are specialized probability plots such as the
Q-Q plot and the P-P plot. Both are based on the same information and they give similar
results (see Fig. 2.23). They look at the same thing from different sides.
• Q-Q plot: the standardized residuals, sorted in ascending order, are plotted along the
x-axis and the corresponding quantiles of the standard normal distribution are plotted
along the y-axis.
• P-P plot: The expected cumulative probabilities of the (sorted) standardized residuals
are plotted along the y-axis against the cumulative proportions (probabilities) of the
observations on the x-axis.
Under the normality assumption, the points should scatter randomly along the diagonal
(x = y line). This is the case here. Slight deviations at the ends are frequently encountered
and pose no problem.
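The coordinates of a Q-Q plot can be computed without any plotting library. The sketch below uses simulated normal residuals and the plotting position (i + 0.5)/N, one common convention:

```python
from statistics import NormalDist
import random

random.seed(9)
# Standardized residuals drawn from a standard normal, sorted in ascending order
z = sorted(random.gauss(0, 1) for _ in range(200))

# Theoretical quantiles; (i + 0.5) / n is one common plotting-position choice
n = len(z)
theo = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Under normality the points (z[i], theo[i]) scatter along the line x = y;
# the largest deviations typically occur at the tails
max_dev = max(abs(a - b) for a, b in zip(z, theo))
print(round(max_dev, 2))
```

Plotting the pairs (z[i], theo[i]) yields the Q-Q plot; for a P-P plot, one would compare cumulative probabilities instead of quantiles.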
If the normality assumption is violated, one should not worry too much. For large
samples (N > 40) the estimated parameters will be normally distributed, even if the errors
are not normally distributed. This follows from the central limit theorem of statistical
theory. So, the significance tests and confidence intervals will be approximately correct.
But with small samples, one should be careful. Significance levels and confidence inter-
vals cannot be interpreted in the usual way.
A violation of the normality assumption is often the consequence of some other vio-
lations, e.g., missing variables, non-linearities, or outliers. After fixing these problems,
non-normality often disappears.
2.2.5.7 Multicollinearity
Fig. 2.24 Venn diagram of the variances SSy, SSx1, and SSx2 with overlapping regions A, B, C, and D, illustrating the shared variance of correlated regressors
However, the overlapping information is not completely lost. It reduces the standard
error of the regression and thus increases R-square and also the accuracy of forecasts.
As a result of multicollinearity, R-square may be significant although none of the
coefficients in the regression function is significant. Another consequence of multicol-
linearity may be that the regression coefficients change significantly if another variable is
included in the model or if a variable is removed. Thus, the estimates become unreliable.
Detection of Multicollinearity
To prevent the problem of multicollinearity, it is first necessary to detect it, i.e. to deter-
mine which variables are affected and how strong the extent of multicollinearity is. The
collinearity between two variables can be measured by the correlation coefficient. Thus,
a first clue may be provided by looking at the correlation matrix. High correlation coef-
ficients between the independent variables can point to a collinearity problem. However,
the correlation coefficient measures only pairwise relations. Therefore, high-grade multi-
collinearity can also exist despite consistently low values for the correlation coefficients
of the independent variables.42
To detect multicollinearity, it is therefore recommended to regress each independent
variable Xj on the other independent variables to determine their multiple relationships.
A measure for this is the corresponding squared multiple correlation coefficient denoted
by Rj2. A large value of Rj2 means that the variable Xj may be approximately generated
by a linear combination of the other independent variables and is therefore redundant.
Rj2 can thus be used as a measure of redundancy of the variable Xj. The complementary
value Tj = 1 − Rj2 is called the tolerance of variable j.
The reciprocal of the tolerance value is the variance inflation factor (VIF) of variable
Xj, which is currently the most common measure of multicollinearity:
VIFj = 1 / (1 − Rj²) (2.67)
In statistical software for regression analysis, both tolerance and VIF can usually be used for checking multicollinearity. Exact cut-off values, however, cannot be given: for Tj = 0.2 you get VIFj = 5, and for Tj = 0.1 you get VIFj = 10. Such critical values can be
found in the literature.43
In our data, we do not find considerable multicollinearity, as the statistics in
Table 2.13 show.
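For three regressors, the auxiliary Rj² needed in Eq. (2.67) can be obtained from pairwise correlations. In the simulation below (all names and values ours), x3 is deliberately a near-linear combination of x1 and x2:

```python
import random, math

def corr(a, b):
    """Pearson correlation of two lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def vif(target, other1, other2):
    """VIF of one regressor given two others, via the auxiliary R^2 and Eq. (2.67)."""
    r1, r2 = corr(target, other1), corr(target, other2)
    r12 = corr(other1, other2)
    r_sq = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return 1.0 / (1.0 - r_sq)

random.seed(11)
x1 = [random.gauss(0, 1) for _ in range(300)]
x2 = [random.gauss(0, 1) for _ in range(300)]
x3 = [a + b + random.gauss(0, 0.2) for a, b in zip(x1, x2)]   # nearly redundant

print(round(vif(x3, x1, x2)))    # a large VIF flags x3 as (multi)collinear
```

Note that the pairwise correlations of x3 with x1 and x2 alone understate the problem; only the auxiliary R² reveals the full redundancy.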
43 Verysmall tolerance values can lead to computational problems. By default, SPSS will not allow
variables with Tj < 0.0001 to enter the model.
Once the data are given, factors a) and b) cannot be changed anymore. Factors c) and
d) can be changed by the researcher by changing the model. A simple way to prevent
high multicollinearity is to remove variables with large VIF. But by removing variables
from the model, the model fit usually decreases and the standard error of the regression
increases. So, this is a balancing act. It becomes problematic if a variable with large VIF
is of primary interest to the researcher. In this case, the researcher may face the dilemma of either removing the variable, and thus possibly compromising the purpose of the investigation, or keeping the variable and accepting the consequences of multicollinearity.
Factor analysis (see Chap. 7) can be very useful for coping with multicollinearity. It
helps to analyze the interrelationships among the independent variables because it may
help to select variables with low correlation or to create composite variables (indicators)
by combining two or more variables into a new variable (e.g. by summation or averag-
ing) and thus diminishing multicollinearity (cf. Hair et al. 2010, pp. 123–126). Or one
can also do a regression on the factors, which are always uncorrelated (they are compos-
ites of all variables). However, if the regressors are replaced by factors, this may jeopard-
ize the actual purpose of the investigation, as the factors cannot be observed.
The simplest way to prevent multicollinearity is by increasing the sample size. But
this usually costs time and money and is not always possible.44
2.2.5.8 Influential Outliers
Empirical data often contain one or more outliers, i.e. observations that deviate substan-
tially from the other data. Regression analysis is susceptible to outliers because the resid-
uals are squared in the OLS method. Therefore, an outlier can have a strong influence on
the result of the analysis. In this case, the outlier is called influential.45
To find out whether an outlier is influential, we first have to detect the outlier(s).
And if an outlier is influential, we have to check whether this leads to a violation of the
assumptions.
Outliers can arise for different reasons. They can be due to errors in measurement or data entry, or they can be correct but unusual observations.
44 Another method to counter multicollinearity, which is beyond the scope of this text, is ridge
regression. By this method one trades a small amount of bias in the estimators for a large reduction
in variance. See Fox (2008, p. 325); Kmenta (1997, p. 440); Belsley et al. (1980, p. 219).
45 Excellent treatments of this topic may be found in Belsley et al. (1980); Fox (2008, p. 246).
Fig. 2.25 Regression line (solid line) for data with a high-leverage outlier, while the dashed line shows
the correct regression line
To demonstrate the effect of outliers, we will use a small simulation. We go back to our
Model 1 for simple regression. Table 2.5 shows the data and residuals. Now imagine
that a “small mistake” happened. For sales in the first period (i = 1) a wrong digit was
entered: instead of the correct value 2596, the value 2996 was entered, which is an
increase of 400 units.
If the expected value of the error in period 1 was E(ε1 ) = 0 before the mistake, it will
now be E(ε1 ) = 400. This is a violation of condition (2.51). It is instructive to look for
the effects of this mistake on the regression results.
Figure 2.25 shows the scatterplot with the outlier, point (203, 2996) on the left side
(represented by the bullet). And it also shows the corresponding regression line (solid
line). The dashed line represents the regression line from Fig. 2.4 with the correct data.
Usually we do not know this line when dealing with outliers. We inserted it here to illus-
trate the effect of the “small mistake”.
Table 2.14 shows the numerical results of regression analysis with the wrong and the
correct y-value in period 1. The bottom row shows the changes caused by the outlier.
• By increasing the observed y-value in period 1 by 400, the estimated value increases
by 149.
• The regression coefficient diminishes from 9.63 to 6.05 [chocolate bars/€]. The effect
on the slope of the regression line can be seen in Fig. 2.25. We see how the outlier
pulls on the regression line.
The example demonstrates clearly that the change of only one data point can have very
strong effects on the results of regression analysis, especially for small sample sizes.
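The effect of a single mis-entered value can be replicated with any small dataset. The numbers below are made up (they are not the textbook's Table 2.5); the helper `ols_line` is ours:

```python
def ols_line(x, y):
    """Intercept and slope of an OLS line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Made-up advertising/sales data (not the textbook's values)
x = [203, 215, 220, 228, 235, 240, 246, 250, 255, 260]
y_true = [2560, 2610, 2650, 2700, 2730, 2760, 2800, 2820, 2850, 2880]
a0, b0 = ols_line(x, y_true)

# Introduce the "small mistake": 400 units added to the first observation
y_bad = [y_true[0] + 400] + y_true[1:]
a1, b1 = ols_line(x, y_bad)

print(round(b0, 2), round(b1, 2))   # the slope drops noticeably; the intercept rises
```

Because the corrupted point lies at the left edge of the x-range (below the mean of x), inflating its y-value pulls the fitted line counter-clockwise, just as in Fig. 2.25.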
Detecting Outliers
When we encounter an outlier, we do not know the correct value, as we did in the simu-
lation above. So if we create a scatterplot, like the one in Fig. 2.25, the correct regression
line (dashed line) will be missing. And in a table like Table 2.14, we will only have the
values in the first row.
For detecting outliers one can use graphical and/or numerical methods. Graphical
methods are easier to understand, and they are quicker and more efficient.46 It can be
tedious to find unusual values in a possibly great number of numerical values. But when
looking at a scatterplot like the one in Fig. 2.25, unusual points47 such as the high marker
on the left side, can be easily detected.
We will now take a closer look at this point with numerical methods. To judge the size
of a residual, it is advantageous to standardize its value by dividing it through the stand-
ard error of the regression. In this way we get standardized residuals:
zi = ri /SE (2.69)
The residual of observation 1 is r1 = 332, and for SE we calculate 209. So, for observa-
tion 1 we get the standardized value
z1 = 332/209 = 1.59
The bar chart in the upper left-hand panel of Fig. 2.26 shows the standardized residuals
for our data. We can see that observation 1 has the largest residual. The value 1.59 can
46 Before doing a regression analysis one can use exploratory techniques of data analysis, like
box plots (box-and-whisker plots), for checking the data and detecting possible outliers. But these
methods do not show the effects on regression.
47 This may be different when the number of variables is large. In this case the detection of multi-
variate outliers by scatterplots can be difficult (see Belsley et al. 1980, p. 17).
Fig. 2.26 Bar charts of diagnostic values for the 12 observations: standardized residuals (upper left), leverage values (upper right), and studentized deleted residuals (lower left)
In simple regression, the leverage of observation i can be computed as

hi = 1/N + (1/(N − 1)) · ((xi − x̄)/sx)²   (1/N ≤ hi ≤ 1) (2.70)

The leverage hi
• increases with the squared distance (xi − x)2, which is the source of the leverage,49
• decreases with the standard deviation of the independent variable,
• decreases with the sample size.
The bar chart in the upper right-hand panel of Fig. 2.26 shows the leverage values for our
data. The mean of the N leverage values hi is h̄ = (J + 1)/N. For our example, we get h̄ = 2/12 = 0.1667. Leverages with hi > 2h̄ are considered high leverages.
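Eq. (2.70) and the 2h̄ rule can be applied in a few lines; the x-values below are made up:

```python
import math

# Made-up x-values of a simple regression (J = 1)
x = [203, 215, 220, 228, 235, 240, 246, 250, 255, 260, 248, 222]
N = len(x)
mx = sum(x) / N
sx = math.sqrt(sum((v - mx) ** 2 for v in x) / (N - 1))   # sample standard deviation

# Leverage values, Eq. (2.70)
h = [1 / N + (1 / (N - 1)) * ((v - mx) / sx) ** 2 for v in x]

h_bar = (1 + 1) / N                       # mean leverage (J + 1) / N
high = [i + 1 for i, hv in enumerate(h) if hv > 2 * h_bar]
print([round(v, 3) for v in h], high)     # observations with h_i > 2 * h_bar
```

A useful check: the leverage values always sum to J + 1, so their mean is (J + 1)/N by construction.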
For the leverage of the outlying observation 1 we get:

h1 = 1/12 + (1/11) · ((203 − 235.2)/18.07)² = 0.371
With h1 = 0.371 > 2h̄ = 0.333, observation 1 is a high-leverage point.
Types of Residuals
Because it is difficult to detect an outlier by its residual, different types of residuals are used.
So far, we have introduced the normal (unstandardized) and the standardized residual. Two
further types of residuals are the studentized residuals and studentized deleted residuals.
Fig. 2.27 Regression line (solid line) for data with a low-leverage outlier; the dashed line shows the correct regression line
Above we described how an outlier pulls the regression line towards itself and thus diminishes its residual. The strength of this effect is related to the leverage of the outlier. The calculation of studentized residuals and studentized deleted residuals takes the leverage of an observation into account. Compare the formulas of the four types of residuals:
Normal residuals: ei = yi − ŷi
Standardized residuals: zi = ei / SE (cf. Eq. 2.69)
Studentized residuals: ti = ei / (SE · √(1 − hi)) (2.72)
Studentized deleted residuals: ti* = ei / (SE(−i) · √(1 − hi)) (2.73)
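The four formulas can be sketched as a small Python helper; the inputs are the rounded values for observation 1 quoted below (SE = 208.9, SE(−1) = 170.2, h1 = 0.371), so the studentized residual comes out at about 2.00 instead of the printed 2.01:

```python
import math

def residual_types(e_i, se, se_del, h_i):
    """Normal, standardized, studentized, and studentized deleted
    residuals (cf. Eqs. 2.69, 2.72, 2.73)."""
    z = e_i / se                                   # standardized
    t = e_i / (se * math.sqrt(1.0 - h_i))          # studentized
    t_del = e_i / (se_del * math.sqrt(1.0 - h_i))  # studentized deleted
    return e_i, z, t, t_del

# Observation 1 with the incorrect sales value:
# e1 = 332, SE = 208.9, SE(-1) = 170.2, h1 = 0.371
e, z, t, t_del = residual_types(332, 208.9, 170.2, 0.371)
print(round(z, 2), round(t, 2), round(t_del, 2))  # ~1.59, ~2.00, ~2.46
```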
Table 2.15 shows the values for these four different types of residuals for observation
1, first for the incorrect sales value (after the mistake) and then for the correct sales
value (before the mistake), to identify the effect of the mistake in observation 1. We also
include the abbreviations used in SPSS.
The calculation of the studentized deleted residual ti* uses the standard error of the regression after deleting observation i, which we denote by SE(−i).50

Table 2.15 Values of different types of residuals for observation 1 (after and before the mistake)

Data        Residual r   Standardized residual z   Studentized residual t   Studentized deleted residual t*
(SPSS)      RES          ZRE                       SRE                      SDR
Incorrect   332          1.59                      2.01                     2.46
Correct     81           0.49                      0.62                     0.60

Here we have SE = 208.9 and SE(−i) = 170.2. For observation 1
we get:
ti* = 332 / (170.2 · √(1 − 0.37)) = 2.46
The lower left-hand panel in Fig. 2.26 shows the bar chart of the studentized deleted residuals. They follow a t-distribution with N − J − 2 degrees of freedom.51 Thus, for the p-value of the studentized deleted residual t1* = 2.46 we can infer p = 3.6%.52 This value is considerably smaller than the p-value of 11.2% that we got for the standardized residual of observation 1, and it lies below the 5% level. Thus, the studentized deleted residual responds more sensitively to residuals with high leverage and marks observation 1 as an outlier.
Cook’s Distance
Above, we defined the influence of a residual roughly as the product of two factors:
influence = size · leverage.
A concrete specification of this formula is Cook's distance, which can be calculated as:

Di = ti² / (J + 1) · hi / (1 − hi)   (2.74)
This statistic by Cook (1977) is currently the most frequently used measure of influence.
Its calculation is based on the studentized residual for an observation i and its hat value.
For observation 1 we get:
D1 = 2.01² / (1 + 1) · 0.37 / (1 − 0.37) = 2.02 · 0.587 = 1.19
50 By using SE(−i) instead of the standard error SE, the numerator and the denominator in the formula for the studentized deleted residuals become stochastically independent. See Belsley et al. (1980, p. 14).
51 See Fox (2008, p. 246); Belsley et al. (1980, p. 20).
Fig. 2.28 Regression line (solid line) after discarding observation 1 (outlier); the dotted line shows the regression with the outlier and the dashed line the correct regression line
The lower right-hand panel of Fig. 2.26 shows the bar chart of Cook’s distances for all
observations. We can see that the bar for observation 1 clearly stands out from the other
observations. Thus, Cook’s distance gives a clear indication that observation 1 is a highly
influential outlier.
In the literature, one finds different opinions concerning a cut-off value for the detection of influential outliers (e.g. 4/N = 0.333, 4/(N − J − 1) = 0.4, or simply 0.5); values greater than 1 are generally regarded as critical. Our value of Cook's distance here exceeds all these possible cut-off values by far. This also indicates that observation 1 is a highly influential outlier.
But in general, the best way of detecting influential outliers is to check a diagram
(such as Fig. 2.26) for values (dots or bars) that stand out from the others.
Outlier Deletion
Above, we showed the effect of a simulated outlier on regression results by comparing
the results with the correct data. But usually we do not know the correct data. Another
way to show the effect of an outlier is to repeat the analysis after deleting the observation
with the outlier. This is illustrated in Fig. 2.28. Table 2.16 shows the numerical results.
We can see that after deleting observation 1 (the outlier) the regression is close to the
regression with the (usually unknown) correct data. Thus, in this case, the deletion of the
outlier yields good results.
Table 2.16 Regression results: a) after discarding observation 1 (outlier), b) before discarding the
outlier, c) correct data
Data                   Coeff. b   R²     SE
a) Outlier discarded   10.8       0.52   170.2
b) With outlier        6.05       0.23   208.9
c) Correct data        9.63       0.55   164.7
2.3 Case Study

We will now use another sample related to the chocolate market to demonstrate how to conduct a regression analysis with the help of SPSS.
The marketing manager of a chocolate company wants to analyze the influence of
demographic variables on the shopping behavior of his customers. He wants to find out if
and how age and gender influence the shopping frequency in his outlet stores. His model
is:
Shopping frequency = f (age, gender).
Ŷ = b0 + b1 X1 + b2 X2
with
Ŷ estimated shopping frequency (shoppings)
X1 age
X2 gender (coded 0 for females and 1 for males)
A variable coded 0 or 1 is called a dummy variable. It can be treated like a metric variable. Dummy variables can be used to incorporate qualitative predictors into a linear model (cf. Sect. 2.4.1).
For the estimation, the manager started with a small sample of 40 customers, randomly selected from the company's database.
As the manager was not content with the results of his analysis, he additionally collected data on the income of his customers in a separate survey. Income had been omitted in the first survey because people usually do not like to report their income. For the second survey, the respondents had to be assured that their data would be kept confidential and anonymous. The income data from the second survey are also contained in the data file.
To conduct a regression analysis with SPSS, we can use the graphical user interface
(GUI). After loading the data file into SPSS we can see the data in the SPSS data editor.
To select the procedure for regression analysis we have to click on ‘Analyze’. A pull-
down menu opens with submenus for groups of procedures (see Fig. 2.29). The group
‘Regression’ contains (among other forms of regression analysis) the procedure of linear
regression (‘Linear’).
After selecting ‘Analyze/Regression/Linear’, the dialog box ‘Linear Regression’
opens, as shown in Fig. 2.30. The left field shows the list of variables. Our dependent
variable ‘shopping frequency’ has to be moved into the field ‘Dependent’. The independent variables ‘age’ and ‘gender’ have to be moved into the field ‘Independent(s)’.
SPSS offers various model-building methods. Here we choose the method ‘Enter’, which is the default option. This means that all selected independent variables will be included in the model as they were entered in the field ‘Independent(s)’. This is called blockwise regression.
The dialog box Regression contains several buttons that lead to further submenus.
If we click the button ‘Statistics’, the dialog box ‘Linear Regression: Statistics’ opens
(Fig. 2.31). Here we can request various statistical outputs. ‘Estimates’ and ‘Model fit’ are the default settings.
Fig. 2.29 Data editor with a selection of the procedure ‘Linear Regression’
If the data set contains missing values, which often occurs in practice, this can be taken into account with the options under ‘Missing Values’.53 The regression analysis in SPSS offers the possibility of excluding missing values listwise or pairwise. Missing values can also be replaced by means (cf. Fig. 2.32).
2.3.3 Results
53 Missing values are a frequent and unfortunately unavoidable problem when conducting surveys (e.g., because people cannot or do not want to answer some questions, or as a result of mistakes by the interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.
The last section of the output, with the estimated regression parameters, gives the constant term and the two coefficients. With these values we can write the estimated regression function:
Shoppings = 3.832 + 0.101 · age − 0.015 · gender (2.75)
This result indicates that the shopping frequency increases slightly with age and
decreases with gender. As gender was coded with 0 for females and 1 for males, the neg-
ative sign means that shopping frequency is somewhat lower for men than for women.
The ‘Model Summary’ contains global measures for evaluating the goodness-of-fit
of the estimated regression function. The coefficient of determination, R-square, tells us
that only 4.8% of the total variation of the dependent variable Y = ‘shopping frequency’
can be explained by the two predictors ‘age’ and ‘gender’. This is a very disappointing
result.
The second part, with the heading ‘ANOVA’ (analysis of variance), shows that, with an empirical F-value of only 0.923, the regression function is not statistically significant (cf. Sect. 2.2.3.3). The critical F-value for J = 2 and N − J − 1 = 37 df is 3.25. Thus, we get a p-value of 40.6%, much larger than the usual limit of α = 5% for statistical significance.
The first column, with the heading ‘Sum of Squares’, shows the decomposition of variation according to Eq. (2.34): SSE + SSR = SST. The next column gives the corresponding degrees of freedom (df) (cf. Table 2.7).
Shopping frequency cannot influence a person's age; nothing can change age. Thus, if there is a causal relationship, age must be the cause of changes in shopping frequency, and not vice versa. But why has the regression coefficient become negative, while the correlation between age and shopping frequency is positive?
The reason is that age is directly and indirectly causally related to shopping frequency, with income functioning as a mediator (cf. Fig. 2.17c). Age has a direct effect on shopping frequency, which is negative. And it has an indirect effect via income, which is positive and larger than the direct effect.
In Eq. (2.75) income is omitted. Thus, part of the effect of income on shopping frequency is erroneously assigned to the coefficient of age because age and income are positively correlated. The coefficient b1 = 0.101 in Eq. (2.75) comprises the direct and the indirect effect of age on shopping frequency. And as the positive indirect effect is larger than the negative direct effect, we get a misleading positive value for b1.
By including income in the regression Eq. (2.76), the direct and indirect effects of
age are separated. Thus, the coefficient b1 of age in the 2nd analysis reflects only the
direct effect, and this is negative. This means that within a group of customers with equal income, the shopping frequency will decrease with higher age. The manager had intuitively sensed this and therefore undertook the second regression analysis.
A1: Linearity
To check for non-linearity, we can plot the dependent variable ‘shopping frequency’
against any of the independent variables. For example, by plotting ‘shopping frequency’
versus ‘income’ we get the scatterplot in Fig. 2.36. In SPSS we can do this by select-
ing ‘Graphs / Chart Builder / Scatter/Dot’ and then moving ‘shopping frequency’ to the
y-Axis and ‘income’ to the x-Axis. The scatter in Fig. 2.36 does not indicate a violation
of the linearity assumption. Additionally, we can fit a straight line to the scatter by selecting ‘Total’ under the option ‘Linear Fit Lines’. In the same way, we can create scatterplots with the other independent variables.
Fig. 2.36 Scatterplot of the dependent variable ‘shopping frequency’ against ‘income’, with the fitted line y = −2.04 + 2.65 · x
A more efficient plot is the Tukey-Anscombe plot, which plots the residuals against the fitted y-values on the x-axis (cf. Sect. 2.2.5.2), as the fitted y-values are linear combinations of all x-values. We can easily create such a plot with the dialog box ‘Linear Regression: Plots’ of SPSS Regression (see Fig. 2.37). This box offers standardized values of the fitted y-values and the residuals. We put the standardized residuals *ZRESID on the y-axis and the standardized predicted values *ZPRED on the x-axis (as shown in Fig. 2.37) and receive the diagram in Fig. 2.38. The scatterplot does not show any outliers. The residuals seem to scatter randomly and do not show any suspicious pattern. This is what we want to see.
Measurement errors in an independent variable produce not only imprecision but also bias: the estimated regression coefficient will be underestimated (attenuated). But in the present case this does not pose a problem. The estimated regression coefficient of the variable ‘income’ remains very large here, despite possible attenuation, with a p-value of practically zero.
[Fig. 2.38 Scatterplot of the regression standardized residuals against the standardized predicted values (dependent variable: shopping frequency)]

[Histogram of the regression standardized residuals (dependent variable: shopping frequency; mean = 1.19E−15, std. dev. = 0.961, N = 40)]

[Normal P–P plot of the regression standardized residuals (expected cumulative probability plotted against observed cumulative probability)]
The tolerance values are calculated according to Eq. (2.67). The lowest tolerance value here results for the variable ‘income’: T1 = 0.726. Thus, for income we get the largest VIF value, VIF1 = 1/0.726 = 1.377. This is a very moderate value, so we do not have a collinearity problem here.
2.3.3.4 Stepwise Regression
Besides the blockwise regression (method ‘Enter’) used above, SPSS also offers a stepwise regression. This method can be chosen via the dialog box ‘Linear Regression’ (see Fig. 2.30).
If we choose stepwise regression, an algorithm of SPSS will build the model. It includes the independent variables sequentially, one by one, based on their statistical significance (this process is called forward selection). Non-significant variables are omitted. In this way, the algorithm tries to find a good model. This method can be useful if we have a great number of independent variables.
In our case study with only three independent variables, a total of seven different models (regression equations) can be formed: three with one independent variable, three with two independent variables, and one with three independent variables. With 10 independent variables, 1023 (= 2^10 − 1) models could be formed. The number of possible combinations increases exponentially with the number of variables.
For this reason, it can be very tempting to let the computer find a good model. But there is also a risk. The computer can only select variables according to statistical criteria; it cannot recognize whether a model is meaningful in terms of content. We have shown that even nonsense correlations can be statistically significant. Recognizing this is the task of the researcher. The computer knows nothing about causality because the data contain no information about causality. Thus, a computer could possibly "think" that tax is a good predictor of sales or profit because of a strong correlation.
We will now demonstrate the stepwise regression method. Figure 2.41 shows that ‘income’ is included in the first step and ‘age’ in the second step. The variable ‘gender’ is not included because its p-value exceeds α = 5%.
The target criterion for the successive inclusion of variables is the increase of
R-square. In the first step, the variable ‘income’ is selected because it has the highest correlation with the dependent variable and thus yields the highest R-square (see Fig. 2.42).
In each successive step, the variable is selected which yields the highest increase
of R-square. The process ends if there is no further variable that leads to a significant
increase in R-square. Here in our example the process ends after the second step.
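The first step of forward selection can be illustrated with a small sketch: since R-square in a simple regression equals the squared correlation, the algorithm starts with the variable that correlates most strongly with the dependent variable. The data and function names below are purely hypothetical, not the case-study data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def first_step(candidates, y):
    """Pick the candidate with the highest squared correlation with y.
    In a simple regression, R-square equals the squared correlation, so
    this is the variable that forward selection would enter first."""
    return max(candidates, key=lambda name: pearson_r(candidates[name], y) ** 2)

# Hypothetical mini data set (NOT the chocolate case-study data):
y      = [2, 4, 5, 4, 6, 7, 8, 9]
income = [1, 2, 3, 3, 4, 5, 6, 7]       # strongly related to y
age    = [30, 25, 40, 35, 50, 38, 45, 60]
gender = [0, 1, 0, 1, 0, 1, 0, 1]

print(first_step({'income': income, 'age': age, 'gender': gender}, y))  # income
```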
The estimated coefficients are shown in Fig. 2.43. After the first step, the coefficient of ‘income’ is somewhat biased downwards because it absorbs the negative effect of the omitted variable ‘age’. This is corrected in the second step by the inclusion of ‘age’. The coefficients of ‘income’ and ‘age’ are almost identical to the coefficients in Eq. (2.76), which additionally contains the variable ‘gender’.
The selection process of stepwise regression can be controlled via the dialog box ‘Options’ (cf. Fig. 2.32). The default settings of SPSS shown in Fig. 2.32 were used (PIN = 0.05; POUT = 0.10). The user can change the p-values for the entry (PIN) and removal (POUT) of a variable. An already selected variable can lose importance due to the inclusion of other variables, and thus its p-value can increase. If the p-value exceeds
the removal value POUT, the variable will be removed from the model. The removal value must always be larger than the entry value (PIN < POUT), because otherwise the algorithm might never terminate.
If we set PIN = 0.7 and POUT = 0.8, then the variable ‘gender’ will also be selected, and stepwise regression will yield the same regression function as the blockwise regression in Eq. (2.76).
If you choose backward elimination, then the algorithm starts with a model that
includes all variables and removes in each step the variable that results in the smallest
change in R-square.
Above, we demonstrated how to use the graphical user interface (GUI) of SPSS to
conduct a regression analysis. Alternatively, we can use the SPSS syntax which is a
programming language unique to SPSS. Each option we activate in SPSS's GUI is translated into SPSS syntax. If you click on ‘Paste’ in the dialog box ‘Linear Regression’ shown in Fig. 2.30, a new window opens with the corresponding SPSS syntax. However, you can also run SPSS by writing the syntax commands yourself. Using the SPSS syntax can be advantageous if you want to repeat an analysis multiple times (e.g., with changed data or different model specifications). The data are embedded in the syntax in the following form:

* Enter all data.
BEGIN DATA
1 6 37 0 1.800
2 12 25 0 2.900
3 2 20 1 2.000
--------------------------------
40 9 56 0 4.900
END DATA.

Figure 2.44
shows the SPSS syntax for running the analyses discussed above. The procedure ‘Linear Regression’ can be requested by the command ‘REGRESSION’ and several subcommands. The syntax shown here does not refer to an existing SPSS data file (*.sav); instead, the data are embedded in the commands between BEGIN DATA and END DATA.
For readers interested in using R (https://www.r-project.org) for data analysis, we pro-
vide the corresponding R-commands on our website www.multivariate-methods.info.
2.4 Modifications and Extensions

The flexibility of the linear regression model can be extended considerably by the use of dummy variables. In this way, qualitative (nominally scaled) variables can also be included in a regression model as explanatory variables or predictors. Dummy variables are binary (0/1) variables. Mathematically, they can be handled like metric variables.
We have encountered an example of a dummy variable in the case study, where we
estimated:
Shopping frequency = f (age, gender)
For the variable ‘gender’ we used a dummy variable with the values 0 and 1. If we
denote the dummy variable by d, we can write the estimated regression function in the
following form:
Ŷ = a + b1 · d + b2 · X (2.77)

with d denoting the gender dummy (0 = female, 1 = male) and X denoting age.
This can be generalized to qualitative variables with more than two categories. Let us assume that, instead of gender, we want to investigate whether hair color influences chocolate purchases. We distinguish between the colors blond, brown, and black. For q = 3 categories we now need two dummy variables:
d1 = 1 if hair color is blond, 0 otherwise
d2 = 1 if hair color is brown, 0 otherwise
Here, black is the baseline category for which we do not need a dummy. We now have to
estimate the regression function
Ŷ = a + b1 · d1 + b2 · d2 + b3 · X (2.78)
So in general, for a qualitative variable with q categories we need q − 1 dummy variables. By including all q dummies in the model, we would cause perfect multicollinearity between the independent variables and violate assumption A7.
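The q − 1 coding rule can be sketched as follows (the function name and the small example are ours):

```python
def dummy_code(values, baseline):
    """Code a qualitative variable with q categories as q - 1 dummy
    variables, leaving out the baseline category. Including all q
    dummies would create perfect multicollinearity."""
    categories = [c for c in sorted(set(values)) if c != baseline]
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

cats, dummies = dummy_code(['blond', 'brown', 'black', 'blond'], baseline='black')
print(cats)     # ['blond', 'brown']
print(dummies)  # [[1, 0], [0, 1], [0, 0], [1, 0]]
```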
An example of using dummy variables will be given in the next section on regression
analysis with time-series data.
Time progresses in a completely uniform manner and independently of all other events.54 Time has no cause. It has an ordering function and puts the data in a fixed and unchangeable order. With cross-sectional data, on the other hand, the order of the data is irrelevant and can be changed at will. The variable ‘time’ divides time into equidistant points or periods (e.g. days, weeks, months, years).
Linear trend model
Example
As a numerical example, we will use the sales data from our introductory example in
Table 2.3. Table 2.17 shows the sales of chocolate without the marketing variables,
but with a time variable t (t = 1, …, 12). The time variable here counts periods of
three months (quarters). The four dummy variables indicate certain quarters within a
year. ◄
Figure 2.45 shows a scatterplot of the sales data. We can recognize a slight increase over
time. The simplest time-series model is the linear trend model:
Y = α + β ·t + ε (2.79)
By simple regression, we get the following estimated model:
54 Since Albert Einstein (1879–1955) we know that this is not quite true. Relativity theory tells us
that time slows down with increasing speed and even comes to a standstill at the speed of light. But
for our problems we can neglect this.
ŷ = 2617 + 32.09 · t

[Fig. 2.45 Scatterplot of the sales data over time (t = 1, …, 12) with the estimated trend line]
The estimated model is represented by the trend line in Fig. 2.45. By extrapolating this
line, we can get a prediction of sales for any period (N + k) in the future (outside the
range of observations). Thus, for the next period N + 1 = 13 we get:
ŷN+1 = a + b · (N + 1) = 2617 + 32.09 · 13 = 3034
For 10 periods ahead we get:
ŷN+10 = a + b · (N + 10) = 2617 + 32.09 · 22 = 3322
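The extrapolation can be reproduced directly from the estimated coefficients (a = 2617, b = 32.09 as rounded in the text; with these rounded values the second prediction comes out at 3323 rather than the printed 3322):

```python
def trend_prediction(a, b, t):
    """Point prediction from the linear trend model: y-hat = a + b * t."""
    return a + b * t

print(round(trend_prediction(2617, 32.09, 13)))  # 3034
print(round(trend_prediction(2617, 32.09, 22)))  # ~3323 (text: 3322)
```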
To assess the goodness of the estimated model, we can use the measures we discussed in
Sect. 2.2.3:
Another common measure used in time-series analysis is the mean absolute deviation (MAD):

MAD = (Σᵢ |yi − ŷi|) / N = 2060.7 / 12 = 171.7 (2.81)
For assessing the predictive quality of a model, the calculation of the MAD should be
based on observations that have not been used for the estimation of the model. A model
might provide a good fit, but not necessarily also a good predictive performance.
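A sketch of Eq. (2.81); the three value pairs below are purely illustrative, not the chocolate data:

```python
def mad(actual, predicted):
    """Mean absolute deviation, Eq. (2.81)."""
    return sum(abs(y - yh) for y, yh in zip(actual, predicted)) / len(actual)

# Purely illustrative (hypothetical) values:
print(round(mad([2500, 2700, 2600], [2550, 2640, 2660]), 1))  # 56.7
```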
Here, all the measures above indicate a poor quality of the estimated model. This can
also be seen in Fig. 2.45, which shows a considerable scatter (unexplained variation) of
the observations around the trend line.
The seasonal pattern can be captured by extending the trend model with dummy variables q1, q2, q3 for the quarters:

Ŷ = a + b1 · q1 + b2 · q2 + b3 · q3 + b · t (2.82)
An alternative specification leads to a model without a constant term:
Ŷ = b1 · q1 + b2 · q2 + b3 · q3 + b4 · q4 + b · t (2.83)
After removing the constant term, we can include all 4 dummies into the model without
causing perfect multicollinearity. In SPSS we can do this via the dialog box “Options”
by removing the checkmark from the default option “Include constant in the equation”.
Estimating this model, we get
Ŷ = 2676 · q1 + 2571 · q2 + 2481 · q3 + 2887 · q4 + 26.38 · t (2.84)

[Fig. 2.46 Seasonal pattern: bar chart of the quarterly deviations (1st to 4th quarter)]

[Fig. 2.47 Sales data over time with the fitted seasonal trend model]
The seasonal dummies have considerably diminished the unexplained variation (see Fig. 2.47). This clearly improves the fit of the model.
By extrapolation of Eq. (2.84), we can make predictions for any period in the future. For
the next period (quarter) N + 1 we get:
ŷ13 = 2676 · 1 + 2571 · 0 + 2481 · 0 + 2887 · 0 + 26.38 · 13 = 3019
And for period 20 we get:
ŷ20 = 2676 · 0 + 2571 · 0 + 2481 · 0 + 2887 · 1 + 26.38 · 20 = 3415
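Both predictions can be reproduced from the estimated coefficients of Eq. (2.84); we assume here that the quarter of period t cycles as (t − 1) mod 4 + 1, in line with the quarterly dummies of Table 2.17:

```python
def seasonal_prediction(t, b_quarters, b_trend):
    """Prediction from the estimated model of Eq. (2.84), which has no
    constant term: y-hat = b_q(quarter of t) + b * t."""
    quarter_index = (t - 1) % 4          # 0..3 for quarters 1..4
    return b_quarters[quarter_index] + b_trend * t

b_q = [2676, 2571, 2481, 2887]           # quarter coefficients b1..b4
print(round(seasonal_prediction(13, b_q, 26.38)))  # 3019
print(round(seasonal_prediction(20, b_q, 26.38)))  # 3415
```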
The dashed line in Fig. 2.48 shows the predictions for the next eight periods, 13 to 20.
Prediction Error
Unfortunately, predictions are always associated with errors ("especially when they concern the future", as the well-known quip goes). Based on the standard error of the regression according to Eq. (2.30), we can calculate the standard error of the prediction for a future period N + k as follows:
sp(N + k) = SE · √(1 + 1/N + (N + k − t̄)² / Σₜ (t − t̄)²) (2.85)
where N = 12 is the period of the last observation and t̄ is the mean of the time variable. It is important to note that the prediction error increases with the prediction horizon, i.e. the further the prediction reaches into the future, the larger the error.
For the next period we get:
sp(13) = 163 · √(1 + 1/12 + (12 + 1 − 6.5)² / 143) = 191.4
And for period 20 we get:
sp(20) = 163 · √(1 + 1/12 + (12 + 8 − 6.5)² / 143) = 250.4
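Equation (2.85) as a short Python function, using SE = 163 and t = 1, …, 12 as in the text (with these rounded inputs the second value comes out at 250.3 rather than the printed 250.4):

```python
import math

def prediction_se(se, N, k, t_values):
    """Standard error of the prediction for period N + k, Eq. (2.85)."""
    t_mean = sum(t_values) / len(t_values)
    ss_t = sum((t - t_mean) ** 2 for t in t_values)
    return se * math.sqrt(1.0 + 1.0 / N + (N + k - t_mean) ** 2 / ss_t)

t_values = list(range(1, 13))  # t = 1, ..., 12; mean 6.5, sum of squares 143
print(round(prediction_se(163, 12, 1, t_values), 1))  # 191.4
print(round(prediction_se(163, 12, 8, t_values), 1))  # ~250.3 (text: 250.4)
```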
Interval Predictions
The predictions made above are called point predictions. With the help of the standard error of the prediction we can also make interval predictions, i.e. we can specify a confidence interval in which the future value will lie with a certain probability:
ŷN+k − tα/2 · sp(N + k) ≤ yN+k ≤ ŷN+k + tα/2 · sp(N + k) (2.86)
So, for period 13, we get the interval
3019 − tα/2 · 191.4 ≤ y13 ≤ 3019 + tα/2 · 191.4
tα/2 denotes the quantile of the t-distribution (Student distribution) for error probability α = 5% (confidence level 1 − α = 95%) in a two-tailed test. For sample sizes N > 30 we can assume tα/2 ≈ 2. Here we have only N − 4 = 8 degrees of freedom and thus we have to use tα/2 = 2.3. With this, we get the prediction interval
3019 − 440 ≤ y13 ≤ 3019 + 440
The prediction interval here has a span of 880. For period 20 the interval increases to
more than 1100. This explains why predictions so often fail, especially if they reach far
into the future.
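The interval of Eq. (2.86) can be sketched for period 13 with the values from the text (point prediction 3019, sp = 191.4, tα/2 = 2.3):

```python
def prediction_interval(y_hat, sp, t_quantile):
    """Interval prediction, Eq. (2.86): y-hat +/- t_quantile * sp."""
    return y_hat - t_quantile * sp, y_hat + t_quantile * sp

# Period 13: point prediction 3019, sp = 191.4, t-quantile 2.3 for 8 df
low, high = prediction_interval(3019, 191.4, 2.3)
print(round(low), round(high))  # 2579 3459
```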
2.5 Recommendations
Here are some recommendations for the practical application of regression analysis.
• The precision of the estimated regression coefficients depends not only on the number of observations but also on the variation of the independent variable. For multiple independent variables, the precision additionally depends on their collinearity (see Sect. 2.2.5.7).
• After estimating a regression function, the coefficient of determination must first be
checked for significance. If no significant test result can be achieved, the entire regres-
sion approach must be discarded.
• The individual regression coefficients must then be checked logically (for sign) and
statistically (for significance).
• It has to be checked whether the assumptions of the linear regression model are met
(see Sect. 2.2.5). For this, plotting the residuals is an easy and effective instrument
(see Sects. 2.2.5.1 and 2.2.5.2).
• If the model is used for explanation or decision making, the correctness of the causal-
ity assumptions is essential. This requires reasoning beyond statistics and information
on the generation of the data.
• To find a good model, you may need to remove variables from the equation or add
new variables. Modeling is often an iterative process in which the researcher formu-
lates new hypotheses based on empirical results and then tests them again.
• If the regression model has passed all statistical and logical checks, its validity has to
be checked against reality.
References
Further Reading
Fahrmeir, L., Kneib, T., Lang, S., & Marx, B. (2009). Regression: Models, methods and applications. Springer.
Hanke, J. E., & Wichern, D. (2013). Business forecasting (9th ed.). Prentice-Hall.
Härdle, W., & Simar, L. (2012). Applied multivariate analysis. Springer.
Stigler, S. M. (1986). The history of statistics. Harvard University Press.
3 Analysis of Variance
3.1 Problem
Both science and business practice are often confronted with questions regarding the
most suitable actions to achieve a certain goal. A test of the effectiveness of different
measures can be carried out by defining alternative measures (e.g. different advertis-
ing concepts) and then applying them in different groups. If the measures implemented
in the different groups lead to different results for the variable of interest, this can be
seen as an indication that the different measures influence the target variable in different
ways. Of course, this only applies if the different groups are comparable in their struc-
ture and the only difference between the groups is the measure applied to the groups.
The analysis of variance (ANOVA) is the most important statistical procedure for the
analysis and statistical evaluation of such situations. In the simplest case, it examines the
effect of one or more independent variables on a dependent variable. Thus, a presumed
causal relationship is investigated, which can be formally represented as:
Y = f(X1, X2, …, Xj, …, XJ)
Here, the independent variables (Xj) are nominally scaled (categorical) variables, which
can occur in different forms, while the dependent variable is always measured at the met-
ric scale level. Table 3.1 shows a number of examples from different fields of application
with the assumed direction of effect.
The example from marketing analyzes whether advertising as an independent varia-
ble (with the three states “internet”, “poster” and “newspaper”) has an influence on the
dependent variable “number of visitors”. In the example from education, it is investi-
gated whether the teaching methodology can change the grades in a school subject. In
all examples, the independent variables always comprise alternative states (factor levels)
that are assumed to influence the metrically measurable dependent variables (e.g. number
of visitors, annual sales, image, recovery time, grades) in different ways.
Another common feature of the examples in Table 3.1 is that they describe experi-
mental situations. Experiments are a classical instrument for the empirical investigation
of causal hypotheses. The researcher actively intervenes in an experiment by systemati-
cally varying (manipulating) the independent variables and then measuring the effects on
the dependent variable. The independent variables (X) are thus subjected to systematic
“treatment” by the user, which is why they are often referred to as the experimental fac-
tor. In contrast, the dependent variable (Y) is often referred to as the measured parameter
or criterion variable. Table 3.2 gives an overview of the different designations of the var-
iables used in ANOVAs that are common in the literature. In this chapter, an independ-
ent variable is consistently referred to as a “factor” and its alternative states as “factor
levels”.
The specific variation of the factors (treatment, manipulation) is based on theoretical
or logical considerations of the user. These are reflected in the so-called experimental
design. The experimental design is intended to ensure that the formed groups are equal
(do not differ systematically). Only under this condition can the differences in test out-
comes be unambiguously attributed to the differences in treatments (combinations of
factor levels). To meet this condition, the application of ANOVA requires that the test
objects are randomly assigned to the different factor levels. The average measured val-
ues of the criterion variable in the different groups are then compared with each other.
If significant differences occur, this is taken as an indicator that the factor levels do have
an influence on the dependent variable. If, on the other hand, this is not the case, this is taken as an indicator that the different factor levels do not influence the dependent variable differently.
If a factor with only two factor levels is considered, checking whether the dependent variable differs for the two factor levels corresponds to a simple test for a difference between two means. If, however, a factor has three or more factor levels, or if two or more factors are considered, simultaneous testing for differences between the means with simple mean-comparison tests is no longer possible, and an ANOVA is required. The term "analysis of variance" can be traced back to the fact that the scatter (= variance) within the different groups as well as between the groups is included in the test value (cf. Sect. 3.2.1.3).
The term “analysis of variance” is not only used to describe a specific procedure, but
also as a collective term for different variants of ANOVA. The differences between these
procedures are briefly described in Sect. 3.4.1 (see Table 3.15). Here, it may suffice to
point out that univariate (i.e. one dependent variable) analyses of variance (ANOVAs)
can be described as one-, two-, three-way (or one-, two-, three-factorial) etc. ANOVAs,
depending on the number of factors considered. This chapter focuses on the one- and
two-way ANOVA.
3.2 Procedure
In the following, the basic principle of ANOVAs is first explained using the example
of a univariate ANOVA with one dependent and one independent variable (one-way
ANOVA). The considerations are then extended to two-way ANOVAs (two independent
variables). In both cases, the procedure is divided into four steps which are shown in
Fig. 3.1.
In the first step, the model is formulated and central preliminary considerations for
the implementation of ANOVA are presented. The model formulation varies depending
on the chosen type of ANOVA (one-, two-, three-factorial, etc.). The second step encom-
passes the analysis of variation which is the basic principle of ANOVA. The more factors
are considered and the more factor levels they comprise, the more sources of variation
there are. Based on these considerations, the third step shows how to statistically test
whether differences between the mean values of the factor levels are significant and how
to assess the quality of the formulated variance model. Finally, the fourth step serves to
interpret the results and to examine follow-up questions arising from the results.
3.2.1.1 Model Formulation
The ANOVA model formulates how a certain observed value i of the dependent variable
(Y), which originates from a certain group (factor level) g of the independent variable
(X), can be reproduced.
1 This is called inferential statistics and has to be distinguished from descriptive statistics.
Inferential statistics makes inferences and predictions about a population based on a sample drawn
from the studied population.
ygi = μ + αg + εgi   (3.1)

with
ygi observed value i (i = 1, 2, …, N) of the dependent variable in factor level g (g = 1,
2, …, G)
μ total mean of the population (global expected value)
αg true effect of factor level g (g = 1, 2, …, G)
εgi error term (disturbance)
αg = μg – μ
with μg = mean for factor level g in the population
The effects of the different factor levels are reflected in the deviations (αg) of the mean values of these factor levels or groups (μg) from the total mean (expected value) μ.
These deviations reflect whether the factor levels have an effect on the dependent varia-
ble and thus provide an explanation for the measured values of the dependent variable.
This component of the model is called the systematic component.
In contrast, the error term εgi expresses measurement errors and the effect of variables
not considered in the model. It is assumed that all groups have approximately the same
degree of disturbance. This component is called the stochastic component.
The model of the ANOVA is thus composed of a systematic component and a stochas-
tic component, which are linearly linked to each other.
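The linear combination of the systematic and the stochastic component can be illustrated with a short sketch in Python; all numerical values here (overall mean, effects, and error spread) are hypothetical:

```python
import random

random.seed(1)  # make the simulated disturbances reproducible

mu = 50.0                                # hypothetical overall mean
alpha = {1: -10.0, 2: 11.0, 3: -1.0}     # hypothetical effects alpha_g (they sum to zero)
N = 5                                    # observations per factor level

# y_gi = mu + alpha_g + eps_gi: systematic component plus stochastic disturbance
y = {g: [mu + a + random.gauss(0, 3) for _ in range(N)] for g, a in alpha.items()}
```

Each list y[g] then contains N observations whose group mean scatters around μ + αg.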
As an application example, the manager of a supermarket chain wants to find out whether the placement of a certain type of chocolate influences its sales. Three types of placement are considered:

1. Candy section
2. Special placement
3. Cash register
The manager develops the following experimental design: Out of the largely com-
parable 100 shops of his supermarket chain, he randomly selects 15 supermar-
kets. Subsequently, each one of the three placements is implemented in 5 stores,
also selected at random, for one week. At the end, the chocolate sales (explained
variable) per supermarket are recorded in “kilograms per 1000 checkout transac-
tions”. Table 3.3 shows the three subsamples (g) with the chocolate sales in the five supermarkets (N = 5 observations for each type of placement). As each of the three groups comprises the same number of observations, the design is called “balanced”.2 ◄
The means of the three groups and the total mean of the data in the example are listed
in Table 3.4. These values are necessary to perform the ANOVA because this procedure
analyzes the differences between the means. The variances of the observed values around
these mean values play a decisive role in ANOVA since the unknown true mean values
μg of the factor levels can be estimated from the means of the observed values.
An analysis should always start with an illustration of the data.3 Box plots of the out-
put data, as shown in Fig. 3.2, are particularly suitable for comparing several samples.
Each box plot represents one of the three subsamples (type of placement), describing its position and statistical variation. The line inside the box marks the position of the median, and the box itself indicates the range that contains the central 50% of the observations. The whiskers mark the span of the data (maximum and minimum value, excluding outliers).
Figure 3.2 shows that there are differences between the sales volumes of the various
placement groups. Therefore, one can conclude that the type of placement has an influ-
ence on chocolate sales. For the moment, however, this is considered a hypothesis which
will have to be tested by an ANOVA.
2 In our example, only 5 observations per group and thus a total of 15 observations were chosen in
order to make the subsequent calculations easier to understand. The literature usually recommends
a minimum of 20 observations per group.
3 On the website www.multivariate-methods.info we provide supplementary material (e.g., Excel files).
Fig. 3.2 Box plots of the average chocolate sales in the three placements (candy section, special placement, cash register)
Since the sales volumes differ even for the same placement (variation within a group),
there must be other influencing factors in addition to placement. In general, there are
always many different influences, most of which cannot be observed.
In the model of ANOVA, these influences are represented by random variables (error
variables) ϵgi, which are contained in the observations ygi. The effects of the factor levels
on the sales quantities in our example result from the deviations of the group mean val-
ues from the total mean value. Since the model in Eq. (3.1) has G + 1 unknown param-
eters with only G categories, an auxiliary condition (reparameterization condition) is
required for the purpose of unambiguous determination (identifiability). One possibility is to assume that the effects cancel each other out; this constraint only fixes the scaling of the effects. In this case the following applies:

∑g=1…G αg = 0   (3.2)
Alternatively, one of the categories could be chosen as the reference category and its
effect set to zero.
The effects of the three forms of placement can be estimated by using the different
mean values. The “true” effect of placement g in the population (αg) is estimated by the
observed difference between the group mean and the overall mean (ag). Thus, the follow-
ing applies:
ag = (ȳg − ȳ)   (3.3)
with

ȳg = (1/N) · ∑i=1…N ygi   (group mean)   (3.4)

ȳ = (1/(G·N)) · ∑g=1…G ∑i=1…N ygi   (total mean)   (3.5)
Entering the values of Table 3.4 in Eq. (3.3) leads to the following deviations between
the group means and the overall mean:
a1 = (ȳ1 − ȳ) = 43.40 − 53.33 = −9.93 (Candy section)
a2 = (ȳ2 − ȳ) = 64.40 − 53.33 = 11.07 (Special placement)
a3 = (ȳ3 − ȳ) = 52.20 − 53.33 = −1.13 (Cash register)
The sum of the effects is zero (except for rounding errors). The special placement has the
strongest positive effect, while average sales are lowest in the candy section, with a value
of −9.93 (group average of 43.40).
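These estimates can be reproduced directly from the group means in Table 3.4; a short sketch in Python (the group labels are only illustrative):

```python
from statistics import mean

group_means = {"candy section": 43.40,
               "special placement": 64.40,
               "cash register": 52.20}   # group means from Table 3.4

# in a balanced design, the total mean is the simple mean of the group means
y_bar = mean(group_means.values())       # ≈ 53.33

# estimated effects a_g = group mean - total mean (Eq. 3.3)
effects = {g: round(m - y_bar, 2) for g, m in group_means.items()}
# candy section -9.93, special placement 11.07, cash register -1.13
```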
Despite these differences, the question remains whether the inferred effects were actu-
ally caused by the way the chocolate was placed. Due to the presence of non-observable
influencing variables (error terms), the estimated effects could possibly also have arisen
purely by chance. This question can be answered by a so-called variance decomposition.
The analysis of variation examines which part of the total variation can be explained by the systematic component of the formulated ANOVA model and which part remains unexplained (random component). The basic principle here is the decomposition of the deviations between the observed values ygi and the total mean ȳ (cf. the general relationship shown in Eq. 3.6).
total variation = explained variation + unexplained variation

SSt(otal) = SSb(etween) + SSw(ithin)   (3.6)

∑g=1…G ∑i=1…N (ygi − ȳ)² = ∑g=1…G N·(ȳg − ȳ)² + ∑g=1…G ∑i=1…N (ygi − ȳg)²
Here, SS stands for ‘sum of squares’ and reflects the various squared deviations from the
mean value. Accordingly, SSt stands for the total variation within a dataset. SSb reflects
the scatter (variance) between groups that can be explained by the formulated model, and
SSw corresponds to the scatter within a group that cannot be explained by the formulated
model. If the scatter decomposition in Eq. (3.6) is applied to the dataset in Table 3.3, the
data in Table 3.5 are obtained.
For our example, Table 3.5 shows that of the total variation of chocolate sales
(SSt = 1287.33) in the experiment, a variation of SSb = 1112.13 can be explained by the
type of placement, while SSw = 175.20 remains unexplained.
As an example, we will here determine the variance decomposition for the first obser-
vation value in supermarket 2 for the special placement (y21). According to Table 3.3,
this value is y21 = 68. Using the mean values in Table 3.4, the following calculations can
now be made:
The deviation from the total mean is
y21 − y = 68 − 53.3 = 14.7
The deviation
y2 − y = a2 = 11.1
can be explained by the effect of different placements, while the deviation of
y21 − y2 = 68 − 64.4 = 3.6
cannot be explained by placement.
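This decomposition of a single observation can be checked numerically; a minimal sketch using the rounded means from the text:

```python
y21, y2_bar, y_bar = 68.0, 64.4, 53.3   # observation, group mean, total mean (rounded)

total_dev = y21 - y_bar        # 14.7: deviation from the total mean
explained = y2_bar - y_bar     # 11.1: effect a2 of the special placement
unexplained = y21 - y2_bar     # 3.6: deviation that remains within the group

# the total deviation splits exactly into the explained and the unexplained part
assert abs(total_dev - (explained + unexplained)) < 1e-9
```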
Therefore, the relationship shown in Table 3.6 applies. This relationship is presented
graphically in Fig. 3.3 for the observation values y21 = 68 and y13 = 40 (cf. Table 3.3).
The equation in Table 3.6 also applies when the elements are squared and summed
over the observations (SS = sum of squares).
Fig. 3.3 Explained and unexplained deviations of two exemplary observation values (y21 = 68 with group mean ȳ2 = 64.4; y13 = 40 with group mean ȳ1 = 43.4; total mean ȳ = 53.3)
3.2.1.3 Statistical Evaluation
Eta-squared (η² = SSb/SSt, cf. Eq. 3.7) relates the explained variation to the total variation. A high eta-squared value indicates that the estimated model explains the sample data well. However, this does not mean that this statement also applies to the population. ANOVA uses the F-statistic to test the statistical significance of a model. The research
question here is: Does the factor under consideration have an effect (αg) that helps to
explain the variation of the dependent variable (y)? The answer to this question is pro-
vided by an F-test.
In the following, the steps of a classical F-test are first described in general terms and
then applied to the extended example described in Sect. 3.2.1.1 (Table 3.3). The results
lead to an ANOVA table which is important for the analysis of variance.
Steps of the F-test (Fig. 3.4):
1. State the null hypothesis
2. Calculate the empirical F-value (Femp)
3. Choose an error probability α (significance level) and make a decision
4. Interpret the results
Since the F-test assumes that the variances within the groups are homogeneous, this
assumption, called variance homogeneity, will be checked at the end of this section
(Fig. 3.4).
F-Test
The classical F-test can be summarized by the following steps (Fig. 3.4).4
Step 1
First, the null hypothesis of the F-test has to be formulated. It should be noted that dif-
ferent formulations of the null hypothesis exist in the literature, but they are identical in
their statements.
Option a:
For the stochastic model of ANOVA in Eq. (3.1), the null hypothesis states that all factor levels have an effect of zero (αg = 0) and thus no influence on the dependent variable:

H0: α1 = α2 = … = αG = 0   (3.8)
H1: at least two αg ≠ 0
Option b:
A second formulation of the null hypothesis states that all group mean values are identi-
cal. This also means that the factor levels have no effect on the dependent variable, since
they do not lead to differences in the group mean values.
H0: μ1 = μ2 = … = μG   (3.9)
H1: at least two μg differ
4 For a brief summary of the basics of statistical testing see Sect. 1.3.
Regarding option b, it should be noted that the ANOVA model is also formulated differently here compared to Eq. (3.1). The following applies:
ygi = μg + εgi   (3.10)
with
ygi observed value i (i = 1, …, N) in factor level g (g = 1, 2, …, G)
μg mean for factor level g in the population (expected value)
ϵgi error term (disturbance)
Option c:
The null hypotheses formulated in Eqs. (3.8, 3.9) are both equivalent to the statement
that the variation of the examined factor between the groups does not differ from the
variation within the groups. Accordingly, the null hypothesis could also be written in the
following form:
H0: SSb = SSw  or  SSb/SSw = 1   (3.11)
H1: SSb ≠ SSw  or  SSb/SSw > 1
Step 2
In the second step, the F-statistic (Femp) has to be calculated. Due to the relation in
Eq. (3.6) all three variants of the null hypothesis result in the following test statistic:
Femp = explained variance / unexplained variance = [SSb/(G − 1)] / [SSw/(G·(N − 1))] = MSb/MSw   (3.12)
The test variable follows an F-distribution and relates the variance between the groups
to the variance within the groups. The variances are calculated by dividing the variations
(SS) by their respective degrees of freedom (df).5 These variances are called mean square
deviations (MS). The name “analysis of variance” also refers to the scatter decomposi-
tion and the ratio of two variances considered in the F-test.
The stronger the experimental effects, the larger the F-value. If the disturbances are small, even small effects can be shown to be significant (i.e., attributable to the factor). However, the stronger the disturbances, the larger the (unexplained) variance in the denominator and the more difficult it becomes to prove significance. To use an analogy: the louder the ambient noise, the louder you have to shout to be understood. Communications engineering refers to this as the signal-to-noise ratio.
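For our example, the test statistic of Eq. (3.12) can be computed directly from the sums of squares reported in Table 3.5 (a sketch; the values are taken from the text):

```python
G, N = 3, 5                      # number of groups, observations per group
ss_b, ss_w = 1112.13, 175.20     # explained and unexplained variation (Table 3.5)

ms_b = ss_b / (G - 1)            # mean square between groups  (≈ 556.07)
ms_w = ss_w / (G * (N - 1))      # mean square within groups   (= 14.60)
f_emp = ms_b / ms_w              # empirical F-value           (≈ 38.09)
```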
Step 3
Once the empirical F-value has been calculated, it has to be compared with the theoret-
ical F-value (Fα) found in an F-table. The magnitude of Fα is determined by the error
probability α (significance level) chosen by the user and the degrees of freedom in the
Fig. 3.5 The p-value as a function of Femp (the critical F-value for α = 0.05 marks the region of rejection)
numerator and denominator of the test value. The decision to reject the null hypothesis is
determined by the following rules:
Femp > Fα → H0 is rejected → the relationship is significant
Femp ≤ Fα → H0 cannot be rejected
The error probability α is the probability that the null hypothesis is rejected even
though it is correct (first-order error). The smaller α, the greater the user’s effort not to
make an error when rejecting the null hypothesis. Usually a value of α = 0.05 or 5% is
chosen.6
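The decision rule can be sketched as follows; the critical value F(0.05; 2, 12) ≈ 3.89 is taken from a standard F-table:

```python
f_emp = 38.09    # empirical F-value from the application example
f_crit = 3.89    # tabulated critical value F(alpha = 0.05; df1 = 2, df2 = 12)

reject_h0 = f_emp > f_crit   # True: the placement effect is significant at the 5% level
```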
When using statistical software packages, the decision to reject the null hypothesis is
usually based on the so-called p-value (probability value). It is derived on the basis of the
empirical F-value and indicates the probability that the rejection of the null hypothesis
is a wrong decision. The greater Femp, the smaller p. Figure 3.5 shows the p-value as a
function of Femp. In SPSS, the p-value is called “significance” or “sig”. When using the
p-value, the null hypothesis is rejected if the following applies: p < α
6 The user can also choose other values for α. However, α = 5% is a kind of “gold standard” in statistics and goes back to R. A. Fisher (1890–1962), who developed the F-distribution. The user must also consider the consequences (costs) of a wrong decision when choosing α.
Table 3.7 ANOVA table for the one-way ANOVA in our application example

Source of variation      SS              df               MS
Between factor levels    SSb = 1112.13   G − 1 = 2        MSb = 556.07
Within factor levels     SSw = 175.20    G·(N − 1) = 12   MSw = 14.60
Total                    SSt = 1287.33   G·N − 1 = 14     MSt = 91.95
Step 4
If the null hypothesis is rejected on the basis of the F-test, it can be concluded that there
are statistically significant differences between the group mean values or between the
variations between and within the groups. This means that the factors considered in the
model do have a significant influence on the dependent variable.
7 The p-value can also be calculated with Excel by using the function F.DIST.RT(Femp;df1;df2). For
our application example, we get: F.DIST.RT(38.09;2;12) = 0.0000064 or 0.00064%. The reader
will also find a detailed explanation of the p-value in Sect. 1.3.1.2.
The central results of the ANOVA are reflected in the ANOVA table (cf. Table 3.7). It
provides information on whether the factor has a significant influence on the dependent
variable and how large the explanatory contribution of the model is. The F-test used here
is based on a so-called omnibus hypothesis, i.e. it tests whether there are fundamental
8 Guidance on testing the assumption of multivariate normal distribution is given in Sect. 3.5.
A detailed description of the testing of variance homogeneity using the Levene test is given in
Sect. 3.4.3.
differences between the groups. However, it is not possible to tell from the test whether only one, several, or even all groups differ from each other, nor how large these effects are. In our example, the box plots in Fig. 3.2 show that the difference between special placement and candy section is particularly large, while the difference between candy section and cash register placement is smaller. Thus, if the F-test
shows that a factor has a significant influence on the dependent variable, it cannot be
concluded from such a result that all factor level means (i.e., group means) are different
and thus all factor levels considered have a significant influence on the dependent varia-
ble. Rather, it is quite possible that several group means are the same and only one factor
level is different. For the user, however, an exact knowledge of the differences is often of
great interest. To analyze such differences, two situations must be distinguished:
a) Prior to the analysis (a priori), the user already has theoretically or factually logical
hypotheses as to where exactly mean differences in the factor levels exist. Whether
such presumed differences (contrasts) actually occur, can then be checked with the
help of a contrast analysis. Contrast analyses are thus confirmatory, i.e. hypothe-
sis-testing, analyses.
b) The user has no hypotheses about possible differences in the effect of the factor levels
and would therefore like to know, after a significant F-test (a posteriori), where empir-
ically significant differences in means are to be found. For this purpose, he can resort
to so-called post-hoc tests. Their application is therefore exploratory, i.e. hypothe-
sis-generating, and is carried out ad hoc.
• The Bonferroni test carries out the pairwise comparisons between the group mean val-
ues on the basis of t-tests.
• The Scheffé test performs pairwise comparisons simultaneously for all possible pairwise combinations of the mean values using the F-distribution.
• The Tukey test performs all possible pairwise comparisons between the groups using
the t-distribution.
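The core idea behind the Bonferroni correction — dividing the significance level by the number of pairwise comparisons to counter alpha-error inflation — can be sketched as follows (the group labels are only illustrative):

```python
from itertools import combinations

group_means = {"candy section": 43.40,
               "special placement": 64.40,
               "cash register": 52.20}

pairs = list(combinations(group_means, 2))   # 3 pairwise comparisons for 3 groups
alpha_adjusted = 0.05 / len(pairs)           # Bonferroni: test each pair at alpha/m ≈ 0.0167

# pairwise mean differences that the post-hoc tests assess
diffs = {(a, b): round(group_means[a] - group_means[b], 2) for a, b in pairs}
```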
9 The alpha error reflects the probability of rejecting the null hypothesis although it is true. For type
I and type II errors, refer to the basics of statistical testing in Sect. 1.3.
10 SPSS offers a total of 18 variants of post-hoc tests. See Fig. 3.16 in Sect. 3.3.3.2.
Apart from the assumed distributions, the tests differ mainly with regard to the correction
of alpha error inflation. Since the tests determine the overall error rate in different ways,
their results differ with regard to the confidence intervals shown.
For the application example, Fig. 3.2 already showed that there are differences
between the mean values of the three factor levels (types of placement). In addition,
Table 3.8 shows the mean value differences for the example calculated by the post-hoc
tests. The last column reports the significance level for the Bonferroni test. It can be seen
that all placement combinations differ significantly.
We can conclude that all three forms of placement have a significant impact on choc-
olate sales. The differences between the group means in our example are also apparent
from the box plots in Fig. 3.2.
3.2.2.1 Model Formulation
For the purpose of a unique determination (identifiability) of the effects, the effects
should add up to zero. In multi-factor ANOVA, the isolated effects of the factors are
referred to as main effects to distinguish them from interaction effects.
11 Please note that the number of 5 observations per group, and thus a total of 30 observations, was
chosen in order to make the subsequent calculations easier to understand. In the literature, at least
20 observations per group are usually recommended for a two-way ANOVA.
Table 3.10 Group mean values and margin mean values in the extended application example

Placement                  h1: Box   h2: Paper   Margin means (placement)
g1: Candy section          43.4      37.4        40.4
g2: Special placement      64.4      55.8        60.1
g3: Cash register          52.2      49.8        51.0
Margin means (packaging)   53.3      47.7        50.5
ag = (ȳg. − ȳ)   (3.14)

bh = (ȳ.h − ȳ)   (3.15)
with

ȳg. = (1/(H·N)) · ∑h=1…H ∑i=1…N yghi   (group mean of factor level g)

ȳ.h = (1/(G·N)) · ∑g=1…G ∑i=1…N yghi   (group mean of factor level h)

ȳ = (1/(G·H·N)) · ∑g=1…G ∑h=1…H ∑i=1…N yghi   (total mean)
For the effects of the three placement types, the following can be observed:

a1 = (ȳ1. − ȳ) = 40.4 − 50.5 = −10.1 (Candy section)
a2 = (ȳ2. − ȳ) = 60.1 − 50.5 = 9.6 (Special placement)
a3 = (ȳ3. − ȳ) = 51.0 − 50.5 = 0.5 (Cash register)

The sum of the three effects is again zero. The same applies to the effects of the two types of packaging:
b1 = (ȳ.1 − ȳ) = 53.33 − 50.50 = 2.83 (Box)
b2 = (ȳ.2 − ȳ) = 47.67 − 50.50 = −2.83 (Paper)
Different effects of the factors under consideration can be assumed if the factor levels of the two factors each show clear differences in average sales volumes, i.e., if the mean values differ noticeably across the levels of each factor.
Table 3.10 shows that the average sales volumes at the three factor levels of ‘place-
ment’ differ with respect to the two factor levels of ‘packaging’ (box and paper): For
example, in “special placement” the average sales amount of chocolate in box packag-
ing is 64.4, while in paper packaging it is only 55.8. It can also be seen that the average
chocolate sales volume in box packaging shows differences for the three types of placement (43.4 compared to 64.4 and 52.2). If these differences prove to be significant across
all levels and factors, we can assume that the factors have different main effects on the
dependent variable.
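Starting from the cell means in Table 3.10, the margin means and the main effects of both factors can be reproduced as follows (a sketch; the short labels are only illustrative):

```python
from statistics import mean

# cell means from Table 3.10: (placement, packaging) -> mean sales
cell = {("candy", "box"): 43.4, ("candy", "paper"): 37.4,
        ("special", "box"): 64.4, ("special", "paper"): 55.8,
        ("register", "box"): 52.2, ("register", "paper"): 49.8}

placements = ["candy", "special", "register"]
packagings = ["box", "paper"]

row_mean = {g: mean(cell[g, h] for h in packagings) for g in placements}  # placement margins
col_mean = {h: mean(cell[g, h] for g in placements) for h in packagings}  # packaging margins
grand = mean(cell.values())                                               # total mean 50.5

a = {g: round(row_mean[g] - grand, 2) for g in placements}  # main effects of placement
b = {h: round(col_mean[h] - grand, 2) for h in packagings}  # main effects of packaging
```

This reproduces a1 = −10.1, a2 = 9.6, a3 = 0.5 as well as b1 = 2.83 and b2 = −2.83.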
Given such differences, it can be concluded that the type of placement has an influence on the sales volumes of the two types of packaging. If the sales volumes of the two packaging types changed in exactly the same way across the placements, there would be no interaction between the factors.
The interaction effects are estimated by
(ab)gh = ȳgh − ŷgh   (3.16)

with

ȳgh = (1/N) · ∑i=1…N yghi : observed mean in cell (g, h)
ŷgh : estimated value for the mean of cell (g, h) without interaction
The estimated value ŷgh is the value we would expect for the cell (g, h), i.e. placement g
and packaging h, if there is no interaction. This value is derived from the group average
and the total average as follows:
ŷgh = ȳg. + ȳ.h − ȳ   (3.17)
Let us now look at the cell for g = 3 and h = 2 in the extended example. The observed
mean is y32 = 49.8 (Table 3.10). If any interaction exists, this value contains the interac-
tion effect. The estimated value without interaction can be calculated as:
ŷ32 = (51.00 + 47.67) − 50.50 = 48.17
Thus, the interaction effect is:
(ab)32 = 49.8 − 48.17 = 1.63
Due to the interaction, the sales volume of chocolate in paper is higher if offered at the
cash register.
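The worked example for cell (3, 2) can be condensed into a few lines (values from Table 3.10, rounded to two decimals):

```python
y32 = 49.8                                # observed mean in cell (cash register, paper)
row3, col2, grand = 51.00, 47.67, 50.50   # margin means and total mean

y32_hat = row3 + col2 - grand    # 48.17: expected cell mean without interaction (Eq. 3.17)
ab_32 = round(y32 - y32_hat, 2)  # 1.63: estimated interaction effect (Eq. 3.16)
```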
Interaction effects can also be tested for significance by the following hypotheses:
• H0: The mean values of the factor levels are identical, therefore there is no interaction
between the factors.
• H1: The mean values of the factor levels are not identical, therefore there is an interac-
tion between the factors.
Fig. 3.6 Three types of interaction effects, illustrated by plots of the group means against the levels of factor A (upper row, lines for the levels of factor B) and against the levels of factor B (lower row, lines for the levels of factor A)
group mean values. According to Leigh and Kinnear (1980), three types of interaction
effects can be distinguished, as shown in Fig. 3.6.12
12 Here, the different types of interaction are illustrated graphically. The interaction effects in
the application example correspond to those in the case study and are shown and explained in
Fig. 3.15.
c) Hybrid interaction effects: If ordinal and disordinal interaction effects occur simul-
taneously, we call this a hybrid interaction effect. While in the case of an ordinal
interaction effect both influencing factors can be interpreted, the interpretation in the
case of hybrid interaction effects is only possible for one of the two influencing factors: in one of the two plots the lines do not intersect, so the main effect of that factor can be interpreted. In the other plot, however, the lines intersect and the trend runs in the opposite direction, which means that the main effect of the other factor cannot be interpreted.
Fig. 3.7 Decomposition of the total variation (SSt) in a factorial design with two factors
Just as in the one-way ANOVA, the total variation is split into an explained variation
and an unexplained variation. The explained variation is then further divided into three
components which result from the influence of factor A, the influence of factor B and the
interaction of factors A and B. This results in the following decomposition of the total
variation:
SSt = SSA + SSB + SSAxB + SSw (3.18)
The sums of squares (SS) presented in Eq. (3.18) are calculated as follows:
SSt: total variation

SSt = ∑g=1…G ∑h=1…H ∑i=1…N (yghi − ȳ)²   (3.19)
SSw: unexplained variation within the cells

SSw = ∑g=1…G ∑h=1…H ∑i=1…N (yghi − ȳgh)²   (3.20)

The total variation explained by the model is obtained by adding the individual effects:

SSb = SSA + SSB + SSAxB   (3.21)
The variations produced by the isolated effects (main effects) of factor A (placement)
and factor B (packaging) are derived from the deviations of the row or column means
from the total mean. Eqs. (3.22, 3.23) show the general calculation of the variation
explained by the main effects.
SSA = H·N · ∑g=1…G (ȳg. − ȳ)²   (3.22)

SSB = G·N · ∑h=1…H (ȳ.h − ȳ)²   (3.23)

with
G: number of factor levels of factor A
H: number of factor levels of factor B
N: number of elements in cell (g, h)
ȳg.: mean of row g
ȳ.h: mean of column h
The variation generated by the interaction effects is obtained by summing the squared deviations between the cell means and the estimated values that would be expected without interaction:
SSAxB = N · ∑g=1…G ∑h=1…H (ȳgh − ŷgh)²   (3.24)
The means per group (i.e., cell) are shown in Table 3.10.
The following estimated values are obtained for the means to be expected without
interaction:
ŷ11 = 40.4 + 53.33 − 50.5 = 43.23
ŷ12 = 40.4 + 47.67 − 50.5 = 37.57
ŷ21 = 60.1 + 53.33 − 50.5 = 62.93
ŷ22 = 60.1 + 47.67 − 50.5 = 57.27
ŷ31 = 51.0 + 53.33 − 50.5 = 53.83
ŷ32 = 51.0 + 47.67 − 50.5 = 48.17
The variation generated by the interaction thus amounts to:

SSAxB = 5 · [(43.4 − 43.23)² + (37.4 − 37.57)² + (64.4 − 62.93)² + (55.8 − 57.27)² + (52.2 − 53.83)² + (49.8 − 48.17)²] = 48.47
SSb, the variation explained by the model, is the sum of the squared deviations of the cell means from the total mean (SSb = SSA + SSB + SSAxB = 2233.5). For our extended example, the result is

Eta-squared = SSb / SSt = 2233.5 / 2471.5 = 0.904
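The complete decomposition of the explained variation can be verified numerically; a sketch that derives SSA, SSB and SSAxB from the cell means of Table 3.10 (N = 5 observations per cell; SSt = 2471.5 as given in the text):

```python
n = 5   # observations per cell
cells = {("candy", "box"): 43.4, ("candy", "paper"): 37.4,
         ("special", "box"): 64.4, ("special", "paper"): 55.8,
         ("register", "box"): 52.2, ("register", "paper"): 49.8}

rows = ["candy", "special", "register"]
cols = ["box", "paper"]
grand = sum(cells.values()) / len(cells)                            # total mean 50.5
row_m = {g: sum(cells[g, h] for h in cols) / len(cols) for g in rows}
col_m = {h: sum(cells[g, h] for g in rows) / len(rows) for h in cols}

ss_A = len(cols) * n * sum((row_m[g] - grand) ** 2 for g in rows)   # ≈ 1944.2 (Eq. 3.22)
ss_B = len(rows) * n * sum((col_m[h] - grand) ** 2 for h in cols)   # ≈ 240.8  (Eq. 3.23)
ss_AxB = n * sum((cells[g, h] - (row_m[g] + col_m[h] - grand)) ** 2
                 for g in rows for h in cols)                       # ≈ 48.5   (Eq. 3.24)

ss_explained = ss_A + ss_B + ss_AxB    # ≈ 2233.5 (Eq. 3.21)
eta_squared = ss_explained / 2471.5    # ≈ 0.904
```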
When using the extended model, 90.4% of the total variation can be explained (previously 0.864 according to Eq. 3.7). A one-way ANOVA with the factor placement alone can explain 78.7% of the variation for the present data set. By extending the model, the unexplained variation can thus be reduced from 21.3% to 9.6%.
3.2.2.3 Statistical Evaluation
In a two-way ANOVA, the statistical assessment for different effects of the two factors is
carried out by comparing the means in all cells. If all means are approximately equal, the
factors have no effect (null hypothesis). The alternative hypothesis is that at least one of
the factors has an influence.
The global significance test for the two-factor model is therefore identical with the
test for the simple model (except for the different number of degrees of freedom; cf.
Eq. 3.12):
Table 3.13 shows the results of the specific F-tests at a confidence level of 95%. The unexplained variation is the same in all cases, namely SSw = 238.0.
The results show that the alternative hypothesis H1 is accepted for the main effects, i.e., both packaging and placement have an effect on the sales volume, whereas the interaction is not significant. This result does not necessarily mean that in reality there is no
connection, but on the basis of the available results the null hypothesis cannot be rejected
in this case (cf. the graphical analysis of the interactions in Fig. 3.15).
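The specific F-tests can be sketched with the sums of squares that follow from Table 3.10 and from SSw = SSt − SSb = 2471.5 − 2233.5 = 238.0; the critical values F(0.05; 2, 24) ≈ 3.40 and F(0.05; 1, 24) ≈ 4.26 are taken from a standard F-table:

```python
ss_A, ss_B, ss_AxB, ss_w = 1944.2, 240.8, 48.5, 238.0   # sums of squares
df_A, df_B, df_AxB, df_w = 2, 1, 2, 24                   # G-1, H-1, (G-1)(H-1), G*H*(N-1)

ms_w = ss_w / df_w                 # ≈ 9.92: mean square of the unexplained variation

f_A = (ss_A / df_A) / ms_w         # ≈ 98.0 > 3.40 -> placement is significant
f_B = (ss_B / df_B) / ms_w         # ≈ 24.3 > 4.26 -> packaging is significant
f_AxB = (ss_AxB / df_AxB) / ms_w   # ≈ 2.4 < 3.40 -> interaction is not significant
```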
For the two-way ANOVA, the central results are also included in an ANOVA table (see
Table 3.12). They provide information on whether the factors or their interaction have a
significant effect on the dependent variable. The F-statistics, on the other hand, are omni-
bus tests, i.e. they do not provide information about which levels of one or more factors
have a significant influence on the dependent variable and how large these effects are.
If there are a priori assumptions about possible differences in the two-way ANOVA,
these can of course also be tested using contrast analysis. And if such assumptions can-
not be made in advance, post-hoc tests can also be performed for significant F-values. In
our extended example, however, post-hoc tests for the type of packaging (box or paper)
are not useful, since there are only two factor levels. A post-hoc test can only be carried
out for the factor “placement”, and it is identical with the test in the case of a one-way
ANOVA (see Sect. 3.2.1.4).
Further explanations on contrasts and post-hoc tests for two-way ANOVAs can be
found in the case study (Sects. 3.3.3.2, 3.3.3.3) which uses the dataset of our extended
example, with a focus on the options implemented in SPSS.
Based on experience, the manager of a supermarket chain assumes that the sales of a
certain type of chocolate can be influenced by the type of packaging and the placement.
To test his assumption, he presents the chocolate in three different places (candy section,
special placement, cash register) and in two different types of packaging (box and
paper). This results in 3 × 2 = 6 different ways of presenting the chocolate. As described
in Sect. 3.2.2.1, 15 out of the 100 shops of his supermarket (SM) chain are randomly
selected and the chocolate is presented in a box and in paper in 5 shops per placement,
also selected at random, for one week each. Table 3.14 shows the achieved chocolate
sales in kilograms per 1,000 checkout transactions in the 15 supermarkets for chocolate
in boxes and in paper.13
With the help of the data collected, he now wants to check whether the packaging and
the placement have a significant influence on the sales volume. To answer this question,
the manager conducts a two-way ANOVA. If the influence of the placement turns out
to be significant, the manager wants to know in a second step whether all three place-
ments (candy section, special placement and cash register placement) have an influence
on chocolate sales and how strong these effects are. This is possible via a so-called post
hoc test.
13 For didactic reasons, the data of the extended example are also used in the case study (cf.
Sect. 3.2.2.1; Table 3.9). Note that the case study is thus based on a total of 30 cases only. In the
literature, a number of at least 20 observations per group is usually recommended.
Since the manager knows that both the ANOVA and the post-hoc test require variance
equality in the factor levels (groups), he would like to check this assumption in advance
using the Levene test.
To conduct a two-way (and any multi-factorial) ANOVA with SPSS, we can use the
graphical user interface (GUI). After loading the data file into SPSS, the data are avail-
able in the SPSS data editor. Click on ‘Analyze’ to select the procedure for ANOVA. A
pull-down menu opens with submenus for groups of procedures (see Fig. 3.8). The group
‘General Linear Model’ contains the procedure ‘Univariate …’, which means that only
one dependent variable is considered.
In the dialog box ‘Univariate’, select the dependent variable (sales volume of choco-
late) and the two independent, nominally scaled variables (placement and type of pack-
aging) from the list and transfer them to the field ‘Fixed Factors’ (see Fig. 3.9).
Fig. 3.8 Data editor with selection of the analysis method ‘Univariate’
Additionally, various statistics and parameters can be selected via the field ‘Options’
(see Fig. 3.10). For the present case study, ‘Descriptive statistics’ and ‘Estimates of effect
sizes’ have to be selected. By clicking on the box ‘Homogeneity test’, the Levene test for
homogeneity of variances is requested.
A plot of the factor level mean values can also be requested to visually check for the presence of interactions. To do this, click on the button ‘Plots’. In the dialog box ‘Profile Plots’, enter the factor placement as the ‘Horizontal axis’ and the factor packaging under ‘Separate lines’, and then transfer both to the field ‘Plots’ by clicking ‘Add’ (see Fig. 3.11).
3.3.3 Results
3.3.3.1 Two-way ANOVA
Since ANOVA requires homogeneity of variance across groups, we first check the result of Levene’s test (see Sect. 3.2.1.3) shown in Fig. 3.12. In the last column, the significance is given as Sig. = 0.499. The Levene test statistic is therefore not significant; rejecting the null hypothesis would be a wrong decision with a probability of 0.499. The null hypothesis (“the error variance of the dependent variable is the same across the groups”) is therefore retained. This means that there are no significant differences in the error variances of the factor levels (groups), i.e. variance homogeneity can be assumed.
Figure 3.13 lists the descriptive statistics: the average chocolate sales in kilograms per 1,000 checkout transactions for the different placements, as well as the standard deviations and the numbers of cases (N) for the two packaging types.
A review of the descriptive results reveals differences in the mean sales volumes for
the combinations of factor levels of the two independent variables: For example, the
packaging type box consistently shows higher sales figures than paper packaging in all
placements, and the special placement leads to the highest results, with 60.1 sales units.
The mean values thus already suggest the effectiveness of the two marketing measures.
The results of the two-way ANOVA are presented in an ANOVA table (Fig. 3.14). Since we used the same figures as in Sect. 3.2.2.2, the results correspond to the values in Tables 3.12 and 3.13. In addition to the empirical F-values, Fig. 3.14 shows the corresponding p-values (“significance”).
The structure of the table in Fig. 3.14 clearly reflects the basic principle of the vari-
ance decomposition because according to Eq. (3.6):
(constant-only model). The Null model serves primarily for comparison with other
models in order to clarify their explanatory power. With reference to the case study, the
constant term indicates the sum of the deviation squares that would be generated “on
average” if the supermarket manager had not undertaken any activities regarding place-
ment or packaging.
Column four in Fig. 3.14 shows the mean squares that result if the ‘Type III Sum of Squares’ is divided by the degrees of freedom (df). Using the mean squares, the test values of the F-statistics and their significance levels can be calculated directly (cf. Sects. 3.2.1.3, 3.2.2.3). It should be emphasized that in the case study the F-tests for the two marketing instruments (placement and packaging) are significant (cf. column Sig. in Fig. 3.14). This means that both measures have a significant influence on chocolate sales. In contrast, the interaction between placement and packaging is not significant (Sig. = 0.108).
14 For the types of interaction effects and the calculation of the interaction effect in the case study, see the explanations in Sect. 3.2.2.1.
Due to the interaction, the sales volume of chocolate in paper is higher if it is offered
at the cash register.
3.3.3.2 Contrast Analysis
Sometimes the user has a priori ideas about the effects of the factor levels, e.g. due to logical considerations. In these cases, these assumptions about the differences in the effectiveness of the factor levels can be checked by a so-called contrast analysis (cf. Sect. 3.2.1.3).
For our case study, it is assumed that prior research has shown that the special place-
ment of chocolate can increase sales. The supermarket manager would now like to know
whether this effect is also valid in his case. The manager is therefore interested in a con-
trast analysis for the factor “placement”. In this case, a one-way ANOVA can be used to
perform the contrast analysis.
In SPSS, the one-way ANOVA is called up by the menu sequence Analyze/Compare
Means/One-Way ANOVA. There, the sales volume can be entered as a dependent variable
and the placement as factor. Pressing the button ‘Contrasts’ opens the corresponding dia-
log box, as shown in Fig. 3.18.
In the case of one-way ANOVAs, contrast analysis compares the factor level of interest with the other factor levels, which, for this purpose, are combined into one group. This is achieved by specifying so-called contrast coefficients, which are often referred to as lambda coefficients.
In order to contrast special placement with the two other factor levels, the supermar-
ket manager selects a contrast coefficient of −1. Based on logical reasoning, the contrast
coefficients are set to +0.5 for the remaining factor levels, candy section and cash regis-
ter. As a result, special placement is regarded as an independent group, while the factor
levels candy section and cash register are combined into one group. In the dialog box
‘Contrasts’, these values are entered in the field ‘Coefficients’ and transferred to the anal-
ysis by clicking on ‘Add’.
Note that the absolute magnitude of the lambda coefficients is irrelevant. They merely indicate the weighting ratio of the means (here 1:1). Also note that the coefficients of the factor levels to be contrasted must have opposite algebraic signs and that the sum of all contrast coefficients must equal zero.
Figure 3.19 shows the results of the contrast analysis under the two assumptions “equal variances” and “no equal variances”. The contrast value reflects the difference between the two considered group averages and is calculated here as follows:
Contrast = 0.5 · 40.4 + (−1) · 60.1 + 0.5 · 51.0 = −14.4
The group mean values of the factor levels can be found in Fig. 3.13 (they are always
shown in the total sum line). Both t-tests lead to the same result and are highly signif-
icant with a p-value of 0.000. This means that the assumption of the supermarket man-
ager can be confirmed, i.e. the special placement significantly increases chocolate sales
compared to chocolate placement in the candy section and at the cash register.
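The contrast value can be reproduced in a few lines of Python. This is a minimal sketch: the helper name `contrast_value` is our own, while the group means are those reported in Fig. 3.13.

```python
# Reproducing the contrast value from the group means in Fig. 3.13.
# The helper name 'contrast_value' is our own.
def contrast_value(means, coefficients):
    """Weighted sum of group means; the coefficients must sum to zero."""
    assert abs(sum(coefficients)) < 1e-9, "contrast coefficients must sum to 0"
    return sum(c * m for c, m in zip(coefficients, means))

means = [40.4, 60.1, 51.0]   # candy section, special placement, cash register
coefs = [0.5, -1.0, 0.5]     # special placement vs. the other two combined

print(round(contrast_value(means, coefs), 1))  # -14.4
```

The built-in check mirrors the zero-sum requirement for contrast coefficients stated above.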
In one-way ANOVAs, contrast analysis refers solely to the differences between the
factor levels of the factor under consideration. If, however, a contrast analysis is performed within the framework of a multifactorial ANOVA, the comparisons are based on the means taken across the levels of the other factors (so-called marginal means).
Above, we demonstrated how to use the graphical user interface (GUI) of SPSS to con-
duct an analysis of variance. Alternatively, we can also use the SPSS syntax which is a
programming language unique to SPSS. Each option we activate in SPSS’s GUI is trans-
lated into SPSS syntax. If you click on ‘Paste’ in the main dialog box shown in Fig. 3.9,
a new window opens with the corresponding SPSS syntax. However, you can also use
the SPSS syntax directly and write the commands yourself. Using the SPSS syntax can
be advantageous if you want to repeat an analysis multiple times (e.g., testing different
model specifications). Figure 3.20 shows the SPSS syntax for the two-way ANOVA of
the case study.
Figure 3.21 shows the SPSS syntax for running an analysis of covariance (ANCOVA), which is presented in Sect. 3.4.2.
For readers interested in using R (https://www.r-project.org) for data analysis, we pro-
vide the corresponding R-commands on our website www.multivariate-methods.info.
BEGIN DATA
1 1 1 47 1,89 16
2 1 1 39 1,89 21
3 1 1 49 1,89 19
------------------
30 3 2 51 2,13 18
* Enter all data.
END DATA.
Fig. 3.20 SPSS syntax for the two-way ANOVA of the case study
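Beyond SPSS and R, the same model can also be fitted in Python. The following is a hedged sketch with statsmodels: the column names and the small dataset are our own inventions for illustration, not the case-study data.

```python
# Sketch of a two-way ANOVA outside SPSS, using Python/statsmodels.
# The column names and the small dataset are invented for
# illustration; they are NOT the case-study data.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "placement": ["candy"] * 4 + ["special"] * 4 + ["register"] * 4,
    "packaging": ["box", "box", "paper", "paper"] * 3,
    "sales":     [41, 44, 38, 36, 66, 63, 58, 55, 54, 51, 49, 47],
})

# 'C(...)' marks a categorical factor; '*' requests both main effects
# and the interaction, as in the SPSS 'Univariate' procedure.
model = smf.ols("sales ~ C(placement) * C(packaging)", data=df).fit()
table = anova_lm(model, typ=2)  # Type II; equals Type III for balanced designs
print(table)
```

For a balanced design such as the case study, Type II and Type III sums of squares coincide; for unbalanced data, SPSS's Type III corresponds to sum-to-zero coding (e.g. `C(placement, Sum)`) in statsmodels.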
3.4 Modifications and Extensions

This section presents different extensions of ANOVA. In addition, the analysis of covariance (ANCOVA) is considered more closely and explained for the case study introduced in Sect. 3.3.1. For this purpose, the data of the case study will be extended by two covariates (metrically scaled independent variables). Finally, in Sect. 3.4.3 the Levene test for checking the assumption of variance homogeneity is described in more detail.
The term “analysis of variance” comprises various forms of ANOVA, with the extensions
resulting from the inclusion of additional variables (see Table 3.15). Since these exten-
sions lead to changes in the procedure, they are referred to differently in the literature
(see last column in Table 3.15).
All variants of ANOVA follow the principle of the variance decomposition. Regarding
multifactorial analyses of variance, it should be noted that with an increasing number of
factors the possibilities of interaction relationships also increase.
As an example, Fig. 3.22 shows a three-way ANOVA including the different levels of
interactions: the interaction between all possible combinations of two factors and, addi-
tionally, the interaction between all three factors. If more than three factors are included
in the analysis, again, all the interactions of the factors have to be considered, although
in these cases the interactions can hardly be interpreted in terms of content.
In practical applications, ANCOVA has become much more important because it
allows the simultaneous consideration of nominally and metrically scaled independent
variables. Therefore, we will describe ANCOVA in more detail in the following section.
MANOVA allows a design with more than one dependent variable and several fac-
tors. MANCOVA also considers several dependent variables and includes metrically
scaled covariates in addition to nominally scaled factors. MANCOVAs can be per-
formed in SPSS via the menu sequence ‘Analyze/General Linear Model/Multivariate’.
MANOVA and MANCOVA result in a general linear model approach (for the general
linear model see Christensen, 1996, pp. 427–431 and for the multivariate ANOVA see
Christensen, 1996, pp. 367–374 as well as Haase & Ellis, 1987, pp. 404–413 or Warne, 2014, pp. 1–10).
3.4.2 Analysis of Covariance (ANCOVA)
For practical applications, analyses of variance that consider not only nominally scaled but also metrically scaled independent variables are of great importance. The metrically scaled independent variables are called covariates. Analyses of variance with
covariates are therefore called covariance analyses (ANCOVA). Due to the great impor-
tance of ANCOVAs, we will examine this type of analysis in more detail below, using
the case study in Sect. 3.3.
In order to perform the two-way ANOVA with covariates, the metrically scaled varia-
bles “price” and “temp” have to be inserted in the field ‘Covariates’ in the dialog box
‘Univariate’ (see Fig. 3.9). After the transfer, the sub-item ‘Post-hoc’ is automatically
hidden, as post-hoc tests are only defined for analyses without covariates. To execute the
procedure ‘Univariate’ with covariates, click on ‘OK’ again.
The resulting ANCOVA table in Fig. 3.23 contains the same columns as above: the degrees of freedom (df), the variances (mean squares), the empirical F-values (F), the significance level of the F-statistics (significance, Sig.) and the partial eta-squared values.
As the results show, the covariates price and temperature (with partial eta-squared values of 0.022 and 0.021, respectively) do not have any significant explanatory power with regard to the dependent variable. The supermarket manager’s assumption that the quantity sold can be additionally explained by these covariates can therefore not be confirmed.
Mathematically, the sales quantity is ‘corrected’ for the influence of the covariates.
This correction is expressed by the fact that the two-way ANOVA now relates to the total
variation minus the influence of the covariates. This has a direct effect on the results, as a
comparison of the ANOVA tables in Figs. 3.14 and 3.23 shows:
• The variation explained by the factor placement decreased in absolute terms from
1944.2 to 1207.881. The explanation by the packaging factor has also decreased from
240.833 to 82.605.
• The change in the constant term is also quite obvious: the explained variation in abso-
lute terms dropped from 76507.500 to 8.815. Thus, the constant term is no longer
significant and now only has a partial eta-squared of 0.038 (previously: 0.997). This
means that the original “explanatory power” of the constant term has apparently been
absorbed by the two covariates.
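The partial eta-squared values quoted here can be recomputed directly from the sums of squares. A minimal sketch (the function name is our own):

```python
# Sketch: SPSS's 'partial eta-squared' column recomputed from the
# 'Type III Sum of Squares': SS_effect / (SS_effect + SS_error).
# The function name is our own.
def partial_eta_squared(ss_effect, ss_error):
    return ss_effect / (ss_effect + ss_error)

# Illustration with made-up sums of squares:
print(round(partial_eta_squared(50.0, 150.0), 2))  # 0.25
```

In statsmodels, the corresponding ANCOVA would be specified by simply adding the covariates to the model formula, e.g. `'sales ~ C(placement) * C(packaging) + price + temp'` (column names assumed).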
Identical results are obtained with both method variants. The reader should examine the differences in the results when choosing different contrast options for the case study.
3.4.3 The Levene Test
As discussed in Sect. 3.2.1.3, the Levene test (cf. Levene, 1960) is a widely used tool for
statistically testing the assumption of variance homogeneity. The test is based on the null
hypothesis that the variances in the groups do not differ (or that the error variance of the
dependent variable is the same across all groups).
The decision to reject the null hypothesis is based on the following rules:
Lemp > Fα → H0 is rejected, i.e. there is no variance homogeneity
Lemp ≤ Fα → H0 is not rejected, i.e. variance homogeneity exists
In order to be able to assume variance homogeneity, Lemp should be as small as possi-
ble or the corresponding p-value as large as possible (at least p > 0.05). The Levene test
is relatively robust to a violation of the assumption of normal distribution. The following
test variable is used to test the null hypothesis:
Lemp = L1 / L2    (3.29)
with
L1 = 1/(G − 1) · Σ(g=1..G) Ng · (lg· − l)²    (3.30)
Table 3.17 Absolute deviations (l-values) and their mean values in the application example
above for the factor ‘placement’ (packaging type “box”)
Absolute deviations of the sample values from the respective sample mean
Candy section Special placement Cash register
|47 − 43.4| = 3.6 |68 − 64.4| = 3.6 |59 − 52.2| = 6.8
|39 − 43.4| = 4.4 |65 − 64.4| = 0.6 |50 − 52.2| = 2.2
|40 − 43.4| = 3.4 |63 − 64.4| = 1.4 |51 − 52.2| = 1.2
|46 − 43.4| = 2.6 |59 − 64.4| = 5.4 |48 − 52.2| = 4.2
|45 − 43.4| = 1.6 |67 − 64.4| = 2.6 |53 − 52.2| = 0.8
Group mean values of the deviations lg
Candy section: (3.6 + 4.4 + 3.4 + 2.6 + 1.6)/5 = 3.12
Special placement: (3.6 + 0.6 + 1.4 + 5.4 + 2.6)/5 = 2.72
Cash register: (6.8 + 2.2 + 1.2 + 4.2 + 0.8)/5 = 3.04
Total mean value of the deviations l
(3.12 + 2.72 + 3.04)/3 = 2.96
L2 = 1/(G · (N − 1)) · Σ(g=1..G) Σ(i=1..N) (lgi − lg·)²    (3.31)
Lemp follows an F distribution with df1 = G − 1 and df2 = G · (N − 1). To calculate Lemp, the absolute deviations lgi between the observed values ygi and the sample mean values yg· within the groups need to be considered first:

lgi = |ygi − yg·|  with g = 1, …, G and i = 1, …, N
Then the mean values for each group (lg·) and the total mean value (l) must be determined from the l-values:

lg· = 1/N · Σ(i=1..N) lgi    (3.32)

l = 1/G · Σ(g=1..G) lg·    (3.33)
For our application example of one-way ANOVA above, Table 3.17 shows the calcula-
tions for the three factor levels of the factor placement. The calculations for the example
are based on the initial data in Table 3.3 and the average chocolate sales in the three
supermarkets in Table 3.4.
Using the results in Table 3.17, the weighted variances L1 and L2 of the l-values can now also be determined according to Eqs. (3.30) and (3.31):

L1 = 1/2 · [5 · (3.12 − 2.96)² + 5 · (2.72 − 2.96)² + 5 · (3.04 − 2.96)²] = 1/2 · 0.448 = 0.224
L2 = 1/12 · [(3.6 − 3.12)² + (4.4 − 3.12)² + … + (0.8 − 3.04)²] = 1/12 · 43.328 = 3.611

Lemp = L1 / L2 = 0.224 / 3.611 = 0.062
With an assumed error probability of α = 0.05, the F-table gives a critical value of Fα = 3.89 for df1 = G − 1 = 2 and df2 = G · (N − 1) = 12. Since Lemp < Fα, the null hypothesis cannot be rejected, and the assumption of variance homogeneity can be regarded as fulfilled. For Lemp = 0.062, a p-value of p = 0.9402 follows.15 Since p > 0.05, the null hypothesis cannot be rejected on this basis either.
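The complete calculation can be replicated in Python and cross-checked against scipy, whose `levene` function with `center='mean'` implements exactly this mean-based variant of the test:

```python
# Replicating the Levene test calculation for the factor 'placement'
# (packaging type 'box'), using the data from Table 3.17, and
# cross-checking against scipy (center='mean' selects the classic
# mean-based Levene test used here).
from scipy import stats

groups = [
    [47, 39, 40, 46, 45],   # candy section
    [68, 65, 63, 59, 67],   # special placement
    [59, 50, 51, 48, 53],   # cash register
]
G = len(groups)             # 3 groups
N = len(groups[0])          # 5 observations per group

# absolute deviations from the group means
l = [[abs(y - sum(g) / N) for y in g] for g in groups]
l_group = [sum(row) / N for row in l]       # 3.12, 2.72, 3.04
l_total = sum(l_group) / G                  # 2.96

L1 = sum(N * (lg - l_total) ** 2 for lg in l_group) / (G - 1)
L2 = sum((lgi - lg) ** 2
         for row, lg in zip(l, l_group) for lgi in row) / (G * (N - 1))
L_emp = L1 / L2

W, p = stats.levene(*groups, center="mean")
print(round(L1, 3), round(L2, 3), round(L_emp, 3))  # 0.224 3.611 0.062
print(round(W, 3), round(p, 2))                     # 0.062 0.94
```

Note that scipy's default is `center='median'` (the Brown–Forsythe variant), which would give a slightly different statistic.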
3.5 Recommendations
For using ANOVA, the following prerequisites, which relate both to the characteristics of
the data collected and to the evaluation of the data, must be fulfilled:
Model foundation
ANOVA is a confirmatory (structure-checking) analysis. Accordingly, factual logic, expert knowledge and theory are decisive for formulating a well-founded model and for identifying the possible influencing variable(s) on a dependent variable.
15 The p-value can also be calculated with Excel by using the function F.DIST.RT(Femp; df1; df2). For the example in Sect. 3.2.1.1, we obtain: F.DIST.RT(0.062; 2; 12) = 0.9402. A detailed explanation of the p-value may be found in Sect. 1.3.1.2.
Independence of the factors
Problems arise if the factors are not separable from the customer’s point of view. This would be the case, for example, if “packaging” and “branding” were chosen as factors but the customer perceived both as inseparable.
Error terms
Error terms must not contain any influencing variables. If other influencing factors
(extraneous and confounding variables) are present, they are automatically included in
the error terms. This problem exists mainly in single-factor ANOVAs. A solution here is
to extend the model (e.g. to a multifactorial ANOVA, with the inclusion of covariates).
Manipulation check
Prior to an empirical investigation, it must be ensured that changes in the observations
of the dependent variable are definitely due to different factor levels of the selected fac-
tors. The targeted variation (manipulation) of the independent variable must be carried
out in advance by the user based on theoretical or logical considerations. These consider-
ations are then reflected in the so-called experimental design (cf. Kahn, 2011, pp. 687 ff.;
Perdue & Summers, 1986, pp. 317 ff.).
Number of factors
An ANOVA only makes sense if the effect of at least one factor with three or more factor
levels is examined. If there is one factor with only two factor levels, a simple comparison
of mean values should be used.
Number of observations
A model becomes more reliable if it is based on a large number of observations. The
more factors and factor levels are considered, the more observations are needed. A rule
of thumb is that a group should contain at least 20 observations. In addition, each cell
should be occupied by about the same number of cases (cf. Perreault & Darden, 1975,
p. 334 ff.). In order to counteract violations of the assumptions, the cases should be ran-
domly assigned to the groups in advance, if possible.
Model complexity
Getting started with ANOVA is easier if the beginner does not include too many factors (and possibly covariates) in the investigation, so that the interpretation of the results does not become too difficult (e.g. due to an increasing number of interaction effects).
Outlier analysis
Outliers have a particularly strong influence on the variances. They also affect the assumptions of variance homogeneity and normal distribution. They should therefore be identified and excluded from the analysis.
References
Perreault, W. D., & Darden, W. R. (1975). Unequal cell sizes in marketing experiments: Use of the
general linear hypothesis. Journal of Marketing Research, 12(3), 333–342.
Pituch, K. A., & Stevens, J. P. (2016). Applied multivariate statistics for the social sciences (6th
ed.). Routledge.
Shingala, M. C., & Rajyaguru, A. (2015). Comparison of post hoc tests for unequal variance.
Journal of New Technologies in Science and Engineering, 2(5), 22–33.
Smith, R. A. (1971). The effect of unequal group size on Tukey’s HSD procedure. Psychometrika,
36(1), 31–34.
Warne, R. T. (2014). A primer on multivariate analysis of variance (MANOVA) for behavioral sci-
entists. Practical Assessment, Research & Evaluation, 19(17), 1–10.
Further Reading
Gelman, A. (2005). Analysis of variance—Why it is more important than ever. The Annals of
Statistics, 33(1), 1–53.
Ho, R. (2006). Handbook of univariate and multivariate data analysis and interpretation with
SPSS. CRC Press.
Sawyer, S. F. (2009). Analysis of variance: The fundamental concepts. Journal of Manual &
Manipulative Therapy, 17(2), 27–38.
Scheffe, H. (1999). The analysis of variance. Wiley.
Turner, J. R., & Thayer, J. (2001). Introduction to analysis of variance: Design, analysis & interpretation. Sage Publications.
4 Discriminant Analysis
4.1 Problem
Imagine you want to find out what differentiates voters of different political parties (e.g.,
Democrats, Republicans, Libertarians, Greens). To do so, you draw a random sample of
voters of the various parties of interest, and collect socio-demographic, psychographic,
and attitudinal data. The variable indicating the party a person votes for is a categorical
(nominal) variable. Its values represent different categories that are mutually exclusive,
that is, each person can be assigned to one specific group (i.e., supporter of a particular
party). The variables considered to describe the voters might be age, income, consump-
tion orientation, or attitude towards technology. These variables are metrically scaled or
can be interpreted as metrically scaled. With the help of discriminant analysis, we can
examine which of the variables that describe the voters discriminate the different groups
of supporters. Table 4.1 shows some more exemplary research questions from various
disciplines that can be answered with the help of discriminant analysis.
Discriminant analysis is a multivariate method to analyze the relationship between
a single categorical dependent variable and a set of metric (normally distributed) inde-
pendent variables.1 The categorical variable is called grouping variable and reflects the
group an observation (i.e., object or subject) belongs to, such as, for example, buyers of
different brands, voters of different parties, patients with different symptoms, or firms
with different performances. If we consider just two groups, the technique is called
1 If we are interested in the question of whether two groups differ significantly with respect to just one variable, we can use an independent samples t-test. For more than two groups, we can use the univariate analysis of variance (see Sect. 3.2.1).
two-group discriminant analysis. If we take three or more groups into account, we call
the technique multi-group discriminant analysis (cf. Sect. 4.2.2.4).
The members of each group are described along a set of observed variables (describ-
ing variables). For example, we might observe socio-demographic or psychographic
variables related to the buyers of different brands, historical health data of patients with
different symptoms, or characteristics of firms. The researcher has to decide which
describing variables are considered, and theoretical considerations should guide the
selection process to ensure non-spurious relationships.
Overall, discriminant analysis may be used to pursue two different aims. First, we can
use a discriminant analysis to identify describing variables that discriminate between dif-
ferent groups and to assess the discriminatory power of the describing variables (discrimi-
nation task). Second, we can use a discriminant analysis to predict the group membership
of new observations based on the describing variables—once we know which describ-
ing variables distinguish the members of the groups (classification task). An example of
the latter task is credit scoring: customers of a bank who have a loan can be divided into
‘good’ and ‘bad’ customers according to their payment behavior. With the help of discrimi-
nant analysis, we can examine what variables (e.g. age, marital status, income, duration of
current employment, or number of existing loans) differ between the two groups. By doing
so, we identify the set of discriminatory variables. If a new customer applies for a loan, the
bank can thus predict the creditworthiness of this customer based on his characteristics.
4.2 Procedure
Table 4.2 Perceptions of buyers of the focal brand and the main competitor brand (example case)
Buyers of focal brand Buyers of main competitor
(group = 1) (group = 2)
Buyer Price Delicious Buyer Price Delicious
1 2 3 13 5 4
2 3 4 14 4 3
3 6 5 15 7 5
4 4 4 16 3 3
5 3 2 17 4 4
6 4 7 18 5 2
7 3 5 19 4 2
8 2 4 20 5 5
9 5 6 21 6 7
10 3 6 22 5 3
11 3 3 23 6 4
12 4 5 24 6 6
Application Example
A manager of a chocolate company wants to know whether the buyers of its own
brand (here: focal brand) perceive its chocolates differently compared to the buyers
of its main competitor brand. The manager considers two describing variables to be
relevant: ‘price’ and ‘delicious’. We examine 12 buyers of the focal brand and 12 buy-
ers of the main competitor brand. All 24 respondents provided information about their
perceptions using a 7-point scale (from 1 = ‘low’ to 7 = ‘high’). Table 4.2 shows the
collected data.2 ◄
The grouping variable has to be a categorical variable that reflects mutually exclusive and collectively exhaustive groups. The (two or more) groups can either be
determined by the research question at hand or be the result of, for example, a cluster
analysis (cf. Chap. 8). We can also convert a metric variable (e.g., a firm’s profit) into
a categorical variable (e.g., low vs. high performance) to form the grouping variable.
However, we need to be aware that we lose information if we do so.
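A median split is a common way to make such a conversion. The following is a minimal sketch with hypothetical profit figures:

```python
# Sketch: deriving a grouping variable from a metric one via a median
# split (hypothetical profit figures). Note the loss of information:
# only 'low'/'high' remains, not the magnitude of the differences.
profits = [1.2, 3.4, 0.8, 5.1, 2.2, 4.0]

sorted_p = sorted(profits)
n = len(sorted_p)
median = (sorted_p[n // 2 - 1] + sorted_p[n // 2]) / 2  # even n: mean of middle pair

groups = ["high" if p > median else "low" for p in profits]
print(round(median, 1), groups)  # 2.8 ['low', 'high', 'low', 'high', 'low', 'high']
```

The resulting labels could then serve as the grouping variable of a discriminant analysis, at the cost of the information loss discussed above.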
Generally, the number of groups should not be larger than the number of describing
variables. In the example above, we have two describing variables, namely ‘price’ and
‘delicious’. Thus, we restrict the example to two groups: ‘buyers of the focal brand’ and
‘buyers of the competing brand’. The groups are identified by a group index g (g = 1, 2,
…, G), where G is the total number of groups (here: G = 2 and g = 1, 2).
Discriminant function
The discriminant function is a linear combination of the describing variables:
Y = b0 + b1 X1 + b2 X2 + . . . + bJ XJ (4.1)
with
Y = discriminant variable
Xj = describing (independent) variable j (j = 1, 2, …, J)
bj = discriminant coefficient for describing variable j
b0 = constant term
The discriminant function is also called canonical discriminant function, and the discri-
minant variable Y is called canonical variable, with the term canonical indicating that
the variables are combined linearly. It is thus assumed that the relationship between the
independent and dependent variables is linear.
For each observation that is described by some variables, the discriminant function
predicts a value for the discriminant variable Y. The coefficients b0 and bj (j = 1, 2, …, J)
are estimated based on the observed data in such a way that the groups differ as much
as possible with respect to the values of the discriminant variable Y. Since the describ-
ing variables are metric, the resulting discriminant variable Y is also metric and does not
directly indicate the group membership. In Sect. 4.2.2.2 we discuss how estimates of
group membership are derived from the estimated value of the discriminant variable.
4.2.2.1 Discriminant Criterion
Discriminant analysis aims to identify the describing variables that discriminate the
groups. This implies that the groups differ with respect to the describing variables. If
they do so, we should observe different values for the describing variables in the two
groups of the example presented above. Figure 4.2 displays a scatterplot of the observed
values of the describing variables. The buyers of the focal brand are represented by red
squares and the buyers of the main competitor brand by black asterisks.
Fig. 4.2 Scatterplot of the observed values of the describing variables ‘price’ and ‘delicious’ (with group-wise histograms)
Figure 4.2 also shows the frequency distributions (histograms) of the values of the
describing variables (i.e. ‘price’ and ‘delicious’) below (‘price’) or beside (‘delicious’)
the scatterplot. The frequency distribution for each group is displayed separately and the
axes correspond to the original x- and y-axes. We learn that buyers of the competing
brand tend to rate ‘price’ higher than buyers of the focal brand (cf. histogram below the
scatterplot). The respective mean values of ‘price’ are 5.0 and 3.5. In contrast, buyers
of the focal brand rate, on average, ‘delicious’ higher than the buyers of the main com-
petitor brand (mean values: 4.5 vs. 4.0) (cf. histogram beside the scatterplot). However,
due to the significant overlaps of the two distributions, neither variable seems to separate
the two groups very well. Visual inspection of the histograms suggests that ‘price’ may
separate the groups better because the groups seem to differ more with respect to this
variable.
While the scatterplot in Fig. 4.2 provides a first glance at the data, it considers each
describing variable in isolation. With the help of the discriminant function, we can
consider both describing variables jointly. Since the two groups seem to differ in their
perceptions regarding ‘price’ (i.e., higher mean for buyers of the main competitor) and
‘delicious’ (i.e., higher mean for buyers of the focal brand), we expect that the coeffi-
cients for the two describing variables are contrary—one having a positive effect and the
other one having a negative effect. For the moment, let us assume the following: b0 = 0,
b1 = 0.5 and b2 = –0.5:
Y = 0.5 · X1 − 0.5 · X2
Based on this discriminant function, we can compute the value of the discriminant varia-
ble for each observation (Table 4.3). For example, buyer i = 1 has an estimated value for
the discriminant variable of –0.5 (y1 = 0.5 · 2 − 0.5 · 3).
Fig. 4.3 Discriminant axis with the group centroids (YA, YB) and the critical discriminant value Y*
If the describing variables are able to separate the groups well, the resulting values for
the discriminant variable Y should differ between the two groups. Thus, we can describe
each group g by its mean value for the discriminant variable. This group mean is called
centroid:
Yg = 1/Ig · Σ(i=1..Ig) Yig    (4.2)
If the two groups are well separated, the difference between the centroids is large. For
the assumed coefficients of b1 = 0.5 and b2 = –0.5, we get a group centroid of –0.5 for
the buyers of the focal brand (g = 1) and of 0.5 for the buyers of the main competitor’s
brand (g = 2).
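The computation of discriminant values and centroids can be sketched in a few lines. The coefficients and buyer i = 1 with (price, delicious) = (2, 3) are taken from the text; the remaining observations below are hypothetical stand-ins for Table 4.3.

```python
def discriminant_value(x1, x2, b0=0.0, b1=0.5, b2=-0.5):
    """Value of the discriminant variable Y = b0 + b1*X1 + b2*X2."""
    return b0 + b1 * x1 + b2 * x2

# hypothetical (price, delicious) ratings per buyer
group_1 = [(2, 3), (3, 4)]   # focal brand
group_2 = [(5, 4), (6, 5)]   # main competitor

def centroid(group):
    """Group centroid: mean discriminant value, Eq. (4.2)."""
    values = [discriminant_value(x1, x2) for x1, x2 in group]
    return sum(values) / len(values)

print(discriminant_value(2, 3))              # y1 = -0.5, as in the text
print(centroid(group_1), centroid(group_2))
```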
We can display the values of the group centroids on a so-called discriminant axis.
Figure 4.3 illustrates a discriminant axis in general terms. The difference in group cen-
troids can now be expressed in terms of a distance. Besides the values of the group
centroids for the discriminant variable, the discriminant axis also shows the critical
discriminant value (Y*). Knowledge about the critical discriminant value allows us to
assign new observations to one of the groups. In Fig. 4.3, observations with a value for
the discriminant variable lower than the critical value (Yi < Y*) are assigned to group
A, while observations with a value for the discriminant variable higher than the critical
value (Yi > Y*) are assigned to group B.
Fig. 4.4 Distribution of discriminant values in two different scenarios (upper part: little overlap of distributions of discriminant values of groups A and B; lower part: large overlap of distributions of discriminant values of groups A and B)
In the example, the standard deviations in the two groups are 0.67 (g = 1) and 0.60
(g = 2), respectively. We can use the information about the group centroids and the stand-
ard deviations within each group to plot the distribution of the discriminant values for
each group. Since we assume that the independent variables are normally distributed,
the discriminant values also follow a normal distribution. Figure 4.5 shows the distribu-
tions of the discriminant values for the buyers of the focal brand and the main competitor
brand based on the discriminant function Y = 0.5 · X1 − 0.5 · X2.
We can see that the distributions of the discriminant values overlap substantially.
There might be two reasons for this rather dissatisfying result. First, the groups of buyers
of the focal and the main competitor brand might not differ much with respect to their
perceptions of ‘price’ and ‘delicious’. Second, the assumed coefficients for the describ-
ing variables may not be ‘optimal’, that is, they are not able to separate the groups well.
Since we have not formally derived the coefficients but simply made an assumption, the
latter reason requires some attention.
The distributions in Fig. 4.5 are not given by the observed data but depend on the
coefficients bj which determine the estimated values for the discriminant variables, and
thus the centroids and the variations around them. Our aim is to separate the groups as
much as possible, that is, the centroids should be ‘far away’ from each other and the
variation in each group should be as small as possible. We can formally express this idea
with the following so-called discriminant criterion Γ:
Fig. 4.5 Distribution of discriminant values for buyers of the focal and the main competitor brand (b0 = 0, b1 = 0.5 and b2 = –0.5)
Γ = variation between groups / variation within groups
  = [Σ(g=1…G) Ig · (Ȳg − Ȳ)²] / [Σ(g=1…G) Σ(i=1…Ig) (Yig − Ȳg)²]
  = SSb / SSw (4.3)
The numerator of the discriminant criterion Γ in Eq. (4.3) represents the difference
between the centroids and thus the variation between groups (SSb). It is calculated as the
squared difference between a group’s centroid and the total mean. In order to account for
different group sizes, the differences are weighted by the respective group size Ig. Thus,
the larger the numerator, the larger the difference between the centroids.
The denominator in Eq. (4.3) represents the variation within groups (SSw), that is, the
squared difference between each discriminant value and the respective group centroid.
We here assume approximately equal dispersion matrices for the different groups. The
smaller the denominator, the smaller the dispersion, and the more likely we can observe
well separated groups. Thus, the larger SSb and the smaller SSw, the larger the value for
the discriminant criterion Γ, and the better the groups are separated.
Since the centroids are defined by the describing variables, the variation between
groups is also called explained variation. Yet, the variation within groups is not
explained by the describing variables and thus is called unexplained variation.3
3 See Sect. 3.2.1.2 for a more detailed discussion of the explained and unexplained variation.
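The discriminant criterion of Eq. (4.3) is straightforward to compute once the discriminant values and group labels are given. The following sketch uses a small synthetic data set; the function itself implements SSb/SSw exactly as defined above.

```python
def discriminant_criterion(values, labels):
    """Γ = SS_b / SS_w for given discriminant values and group labels, Eq. (4.3)."""
    groups = sorted(set(labels))
    grand_mean = sum(values) / len(values)
    ss_b = ss_w = 0.0
    for g in groups:
        y_g = [y for y, l in zip(values, labels) if l == g]
        cent = sum(y_g) / len(y_g)
        ss_b += len(y_g) * (cent - grand_mean) ** 2   # between-group variation
        ss_w += sum((y - cent) ** 2 for y in y_g)     # within-group variation
    return ss_b / ss_w

y = [0.0, 1.0, 3.0, 4.0]   # synthetic discriminant values
g = [1, 1, 2, 2]
print(discriminant_criterion(y, g))   # SS_b = 9, SS_w = 1, so Γ = 9
```

Note that rescaling all discriminant values by a constant leaves Γ unchanged, which anticipates the later observation that only the ratio of the discriminant coefficients is determined by maximizing Γ.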
We aim to estimate the coefficients bj in such a way that Γ is maximized. The con-
stant term b0 merely shifts the scale of the discriminant values, but does not influence the
value of the discriminant criterion Γ. Thus, it does not play an active role in the estima-
tion procedure. For our exemplary data, we get the following discriminant function when
maximizing Γ:
Y = −1.982 + 1.031 · X1 − 0.565 · X2
Based on this discriminant function, we compute again the discriminant values for each
observation (Table 4.4). The centroids of the two groups are now –0.914 for the focal
brand (g = 1) and 0.914 for the main competitor brand (g = 2). The respective standard
deviations are 1.079 for g = 1 and 0.915 for g = 2, which are in fact similar.
We can use the information about the centroids together with the individually esti-
mated values of the discriminant variables to compute SSb and SSw, which are equal to
20.07 and 22.0, respectively. Accordingly, the resulting value for the discriminant crite-
rion Γ is 0.912, which is the maximum value in this example. However, from Fig. 4.6 we
learn that there is still a substantial overlap, although the distributions in Fig. 4.6 overlap
less than the ones in Fig. 4.5.
Table 4.5 shows the resulting values of the discriminant criterion for various values
of the coefficients b1 and b2. Since b0 does not affect the discriminant criterion, we have
set its value to zero. For the values b1 = 1 and b2 = 0, the discriminant variable Y is equal
to X1 (i.e., price) and for the values b1 = 0 and b2 = 1, it is equal to X2 (i.e., delicious).
Consequently, the resulting value of the discriminant criterion reflects the discriminatory
Fig. 4.6 Distributions of discriminant values of the two groups (b0 = –1.982, b1 = 1.031 and b2 = –0.565)
power of the respective describing variable (Table 4.5). As Fig. 4.2 already suggested,
‘price’ has a greater discriminatory power (Γ = 0.466) than ‘delicious’ (Γ = 0.031). The
difference between the means of the describing variable ‘delicious’ is smaller than that
of ‘price’ and at the same time, the standard deviation is larger. This results in a lower
variation between groups (SSb) and a higher variation within groups (SSw) for ‘deli-
cious’, ultimately leading to a much lower value of the discriminant criterion. However,
we also learn that considering only the describing variable ‘price’ results in a lower value
of the discriminant criterion than considering both variables jointly (Γ = 0.466 compared
to Γ = 0.912). This result indicates that both variables actually contribute to the separa-
tion of the two groups.
b2 / b1 = −0.565 / 1.031 = −0.354 / 0.646 = −0.55
Any set of coefficients that meets the requirement of b2 = –0.55 ⋅ b1 leads to a value of
the discriminant criterion equal to 0.912. Yet no other combination of coefficients will
result in a higher value of the discriminant criterion. Thus, when maximizing the discri-
minant criterion, only the ratio of the discriminant coefficients b2/b1 is clearly defined
(here: –0.55). While the discriminant values change depending on the specifically chosen
coefficients, the value of the discriminant criterion does not change.
s²pooled = SSw / (N − G) (4.4)
with
• N: number of observations
• G: number of groups
The constant term b0 is then determined in such a way that the total mean of all discrimi-
nant values is equal to zero. When using SPSS to conduct a discriminant analysis, SPSS
carries out this standardization by default.
When the discriminant coefficients are standardized, the critical discriminant value Y*
is zero. In a two-group discriminant analysis, observations with an estimated discrimi-
nant value larger than zero belong to one group, and observations with an estimated dis-
criminant value smaller than zero are assigned to the other group.
Fig. 4.7 Observations and the discriminant axis in the space of the describing variables ‘price’ and ‘delicious’
Figure 4.7 is based on Fig. 4.2 and presents the mapping of the original observations
on the discriminant axis for the discriminant function Y = –1.982 + 1.031X1 –0.565X2.
The discriminant axis has a slope of –0.55, and at the coordinate origin Y =
–1.982. The critical discriminant value Y* of zero is indicated by the dashed line that
crosses the discriminant axis. We recognize that one observation of group g = 1 (focal
brand, red squares) and two observations of group g = 2 (main competitor, black aster-
isks) are not correctly identified (cf. Table 4.4). But overall, the two describing variables
‘price’ and ‘delicious’ in combination seem to separate the two groups quite well.
In the stepwise estimation procedure, the describing variables are included in the discriminant function consecutively according to their discriminatory power. If variables are not able to discriminate the groups, they are not included in the discriminant function.
The stepwise estimation procedure follows a sequential process of adding or deleting
describing variables:
1. Select the single best discriminating variable, i.e., the describing variable leading to
the highest value for the discriminant criterion when considered alone.
2. Combine the initial variable with each of the other describing variables, one at a time,
and choose the describing variable that is best able to improve the discriminating
power of the function (i.e., leads to the largest increase in the discriminant criterion).
3. Continue adding describing variables to the discriminant function that improve the
discriminant criterion. Note that as additional describing variables are included, some
previously selected describing variables may be removed if the information they con-
tain on group differences is provided by some combination of the other variables
included at later stages.
4. The procedure stops when no further improvement of the discriminant criterion is
possible.
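The four steps above can be sketched as a greedy forward-selection loop. The callback `criterion(subset)` is a hypothetical helper that fits a discriminant function on the given variable subset and returns Γ; the toy dictionary below reproduces the criterion values of the example (cf. Table 4.5). For brevity, the sketch omits the removal of previously selected variables described in step 3.

```python
def stepwise_select(variables, criterion, min_improvement=1e-6):
    """Forward selection: add the variable with the largest gain in Γ."""
    selected, best_gamma = [], 0.0
    remaining = list(variables)
    while remaining:
        # step 1/2: evaluate each candidate added to the current set
        gains = [(criterion(selected + [v]), v) for v in remaining]
        gamma, var = max(gains)
        if gamma - best_gamma < min_improvement:
            break                       # step 4: no further improvement
        selected.append(var)            # step 3: keep the best candidate
        remaining.remove(var)
        best_gamma = gamma
    return selected, best_gamma

# toy criterion values taken from the example
toy = {frozenset(['price']): 0.466,
       frozenset(['delicious']): 0.031,
       frozenset(['price', 'delicious']): 0.912}
print(stepwise_select(['price', 'delicious'], lambda s: toy[frozenset(s)]))
```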
On the one hand, the stepwise estimation procedure can be useful when we have a large
number of describing variables that potentially may discriminate the groups. By sequen-
tially selecting the variables that separate the groups, we may be able to reduce the set
of describing variables and develop a more parsimonious model. On the other hand, we
should only consider describing variables that are supported by theoretical considera-
tions or a priori knowledge. This means that there should be a good reason to include
the describing variables initially. Thus, the stepwise estimation procedure should not be
used to mine the data. Moreover, a non-significant result for a describing variable is also
a relevant finding.
To assess the quality of a discriminant function, we can use the discriminant criterion.
Moreover, we can compare the estimated and actual group membership of observations.
Γ / (1 + Γ) = (SSb / SSw) / (1 + SSb / SSw) = SSb / (SSw + SSb) = explained variation / total variation (4.5)
The denominator is now the sum of SSb and SSw and actually equal to the total variation.
Thus, the result of Eq. (4.5) is the share of explained variation.4 In our example, only
47.7% (= 0.912/(1 + 0.912)) of the variation in the dependent variable is explained by the
discriminant function, which is a rather disappointing result.
4 In the two-group case, the result of Eq. (4.5) corresponds to the coefficient of determination R² in regression analysis (cf. Sect. 2.2.3.2).
Since the first discriminant function was determined so that the eigenvalue and, thus, the
explained variation is maximized, the variation explained by the second discriminant
function cannot be any higher. Accordingly, each further discriminant function is deter-
mined in such a way that it explains a maximum proportion of the residual (unexplained)
variation.
We thus get: Γ1 ≥ Γ2 ≥ Γ3 ≥ … ≥ ΓK.
As a measure of the relative importance of a discriminant function, the relative eigen-
value can be used, which reflects the share of explained variance:
share of explained variance_k = Γk / (Γ1 + Γ2 + … + ΓK) = explained variance_k / Σ(k=1…K) explained variance_k (4.6)
Canonical Correlation
Besides the share of explained variation, we can also use the square root of Eq. (4.5) as a
measure of the discriminatory power of the discriminant function. This is called canoni-
cal correlation, and it measures the extent of association between the discriminant value
and the groups (Tatsuoka, 1988, p. 235).
c = √(Γ / (1 + Γ)) (4.7)
In our example, the canonical correlation is:
c = √(0.912 / (1 + 0.912)) = 0.691
The maximum (and best) value that can be achieved for the canonical correlation is 1.
Wilks’ lambda is an inverse quality measure, i.e., the smaller the value, the better the discriminatory power of the discriminant function (Λ = 1 − c²). In our example, we get
Λ = 1 / (1 + Γ) = 1 / (1 + 0.912) = 0.523 or Λ = 1 − c² = 1 − 0.477 = 0.523
We can transform Wilks’ lambda into a probabilistic variable that follows the chi-square
distribution with J × (G – 1) degrees of freedom:
χ²emp = −(N − (J + G)/2 − 1) · ln(Λ) (4.9)
with
• N: number of observations
• J: number of describing variables
• G: number of groups
The chi-square value increases with smaller values for Wilks’ lambda. Higher values
therefore indicate a better separation of the groups. For our example, we get:
χ²emp = −(24 − (2 + 2)/2 − 1) · ln(0.523) = 13.614
The corresponding null hypothesis H0 states that the mean discriminant value is equal
across all groups. With two degrees of freedom (df = 2), the theoretical chi-square value
at a 5% significance level is 5.99.5 Since the empirical chi-square value of 13.6 is larger
than the theoretical one, we reject H0 and accept H1 that the two groups differ with
respect to the mean value of the discriminant variable (p = 0.001). Consequently, the dis-
criminant function is significant.
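The chain of quality measures for the example can be reproduced with a few lines of arithmetic, using Γ = 0.912, N = 24 observations, J = 2 describing variables, and G = 2 groups from the text (small rounding differences against the book's figures are possible).

```python
import math

gamma, N, J, G = 0.912, 24, 2, 2

explained_share = gamma / (1 + gamma)        # Eq. (4.5): share of explained variation
c = math.sqrt(gamma / (1 + gamma))           # Eq. (4.7): canonical correlation
wilks_lambda = 1 / (1 + gamma)               # inverse quality measure
chi2 = -(N - (J + G) / 2 - 1) * math.log(wilks_lambda)   # Eq. (4.9)

print(round(explained_share, 3), round(c, 3),
      round(wilks_lambda, 3), round(chi2, 2))
# with df = J*(G-1) = 2, the critical chi-square value at the 5% level is 5.99,
# so the discriminant function is significant
```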
Λ = Π(k=1…K) Λk = Π(k=1…K) 1 / (1 + Γk) (4.10)
5 See Sect. 1.3 for a brief introduction to the basics of statistical testing.
Classification Matrix
Table 4.7 shows the so-called classification matrix that is a 2 × 2 table displaying the
number of correctly and incorrectly assigned observations. The diagonal cells contain
the number of correctly classified observations for each group and the off-diagonal cells
contain the number of incorrectly classified observations (relative frequencies related to
the actual group membership are reported in brackets).
At first glance, an overall hit rate of 87.5% sounds decent. However, we need to compare the hit rate of the estimated discriminant function with the hit rate that would be achieved by a random assignment of the observations to the groups.
The value of a particular discriminant coefficient depends on the other describing var-
iables considered in the discriminant function. Moreover, the signs of the coefficients
are arbitrary and do not provide insights into the discriminatory power of a particular
coefficient.
Yet, what is of superior interest is the discriminatory power of the describing vari-
ables. Describing variables with high discriminatory power differ significantly across
groups. In our example, we observe mean values of 3.5 and 5.0 for the describing var-
iable ‘price’ for the groups of buyers of the focal (g = 1) and the main competitor brand
(g = 2), respectively. For the describing variable ‘delicious’, the means are 4.5 for group
g = 1 and 4.0 for group g = 2. Thus, we expect that in our example ‘price’ has more dis-
criminatory power than ‘delicious’.
Femp = [SSb / (G − 1)] / [SSw / (N − G)] = Γ · (N − G) / (G − 1) (4.12)
The result of the F-test corresponds to the results of a univariate ANOVA and assesses
whether the groups differ with respect to the describing variable (cf. Sect. 3.2.1.3). In the
two-group case, the results of the F-test are equal to the results of an independent sam-
ples t-test.
Table 4.8 shows the result of the F-test for the example. The theoretical F-value for
df1 = (G–1) = 1 and df2 = (N–G) = 22 degrees of freedom is 4.30. For the describing var-
iable ‘price’ the empirical F-value is larger than the theoretical one, and, thus, ‘price’ has
significant discriminatory power (p = 0.004). In contrast, the describing variable ‘deli-
cious’ is not significant (p = 0.421) and does not separate the groups significantly.
Although ‘delicious’ alone has no significant discriminatory power, in combination
with ‘price’ it contributes to an increase in the discriminant criterion (cf. Table 4.5). The
discriminant criterion of the discriminant function when considering both variables is
0.912 compared to 0.466 if only ‘price’ is taken into account.
bj^std = bj · sj (4.13)
with
bj: discriminant coefficient of describing variable j
sj: standard deviation of describing variable j
sj = √(SSw / (N − G))
6 For example, if you had a describing variable ‘price’ and changed its unit of measurement from
EUR to Cent, the corresponding discriminant coefficient would decrease by a factor of 100. Yet the
transformation of the scale has no influence on the discriminatory power of the variable.
The sign of the standardized coefficients is not relevant when assessing the relative dis-
criminatory power, and we thus consider the absolute values of the standardized coef-
ficients. After standardization, the variable ‘price’ still has a larger absolute coefficient
than ‘delicious’ but the difference is less evident. This means that the describing variable
‘delicious’ does have some discriminatory power, although it is not significant.
It is important to note that the standardized coefficients cannot be used to compute the
discriminant values.
b̄j^std = Σ(k=1…K) bjk^std · [explained variance_k / Σ(k=1…K) explained variance_k] (4.14)
with
bjk^std: standardized discriminant coefficient for describing variable j in discriminant function k
Based on the estimated discriminant function, we can assign new observations to one
of the considered groups. We distinguish between three concepts to classify new
observations: the distance concept, the classification functions concept, and the probabil-
ity concept. Before discussing the different concepts in detail, we will outline conceptual
similarities and differences between them (see Table 4.9).
4.2.5.1 Distance Concept
Based on the discriminant function, we can compute the discriminant value for a new
observation by using the observed values of the describing variables for this observation
and the estimated unstandardized discriminant coefficients.
Let us assume that we observe a consumer with the values 5.0 and 5.0 for the describ-
ing variables ‘price’ and ‘delicious’, respectively. For this consumer, we get a discrimi-
nant value of 0.350 (= –1.982 + 5 · 1.031 + 5 · (–0.565)).
In our example, the squared distances to the group centroids are 1.598 for g = 1 and
0.319 for g = 2. Therefore, the consumer is assigned to group g = 2.
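The distance concept can be sketched directly from the estimated discriminant function and the group centroids reported above (−0.914 for the focal brand, 0.914 for the main competitor); small rounding differences against the book's values are possible because the coefficients are rounded.

```python
centroids = {1: -0.914, 2: 0.914}   # group centroids from the text

def classify_by_distance(x1, x2):
    """Assign the observation to the group with the closest centroid."""
    y = -1.982 + 1.031 * x1 - 0.565 * x2
    # squared distance of the discriminant value to each centroid
    d2 = {g: (y - c) ** 2 for g, c in centroids.items()}
    return min(d2, key=d2.get), y, d2

group, y, d2 = classify_by_distance(5.0, 5.0)
print(group, round(y, 3), {g: round(v, 3) for g, v in d2.items()})
# y ≈ 0.35, d² ≈ 1.6 for g = 1 and ≈ 0.32 for g = 2, so the consumer
# is assigned to group 2
```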
The distance concept just uses the distance of the discriminant value to the group cen-
troid to classify observations. A priori probabilities are not taken into account. Moreover,
the costs of misclassification are not considered.
If more than one discriminant function is considered, the squared Euclidean distance
and the Mahalanobis distance can be used to classify the observations. Originally, the
distance measures assume equal variance-covariance matrices in the groups but the
measures can be extended to capture unequal group variance-covariance matrices (cf.
Tatsuoka, 1988, p. 350). If more than one discriminant function is estimated, it is, how-
ever, not necessary to consider all possible discriminant functions. In this case, we can
focus on those discriminant functions that are significant without losing critical informa-
tion, which simplifies the computation of the distances.
Overall, the advantage of the distance concept is that it is easy to implement and intu-
itively appealing. The disadvantage is that it is a deterministic approach: observations are
assigned to one of the groups with a probability of 100%.
The coefficients of the classification functions are the larger, the higher the mean value and the smaller the variation of the describing variables in a group.
In our example, the mean values for ‘price’ are 3.5 for g = 1 and 5.0 for g = 2. The
corresponding standard deviations are 1.168 for g = 1 and 1.128 for g = 2. Group g = 2
thus has a higher mean value and a lower standard deviation for ‘price’ compared to
group g = 1. Consequently, the coefficient for ‘price’ in the classification functions is
larger for group g = 2 compared to group g = 1. More specifically, we get:
F1 = −6.597 + 1.729 · X1 + 1.280 · X2
F2 = −10.22 + 3.614 · X1 + 0.247 · X2
The values Fg themselves have no interpretative meaning and are just used for classifi-
cation purposes. For a new observation with the values 5.0 for ‘price’ and 5.0 for ‘deli-
cious’, we get a classification score of 8.444 for group g = 1 and of 9.083 for group
g = 2. Since we observe the highest value of the classification score for group g = 2, we
classify the consumer into group g = 2.
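The classification with Fisher's functions amounts to evaluating F1 and F2 and picking the larger score. The sketch below uses the coefficients given above; the scores differ from the book's 8.444 and 9.083 only by rounding of the coefficients.

```python
def f1(x1, x2):
    """Classification function for the focal brand (g = 1)."""
    return -6.597 + 1.729 * x1 + 1.280 * x2

def f2(x1, x2):
    """Classification function for the main competitor (g = 2)."""
    return -10.22 + 3.614 * x1 + 0.247 * x2

x1, x2 = 5.0, 5.0
scores = {1: f1(x1, x2), 2: f2(x1, x2)}
print(max(scores, key=scores.get),
      {g: round(s, 3) for g, s in scores.items()})
# scores ≈ 8.44 for g = 1 and ≈ 9.08 for g = 2, so the consumer
# is classified into group 2
```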
The classification functions concept relies on the assumption of equal variance-covar-
iance matrices within groups. The costs of misclassification are not taken into account.
Yet it is possible to explicitly consider a priori probabilities. If we do so, the results
based on the classification functions concept may differ from the results of the distance
concept; otherwise, both concepts lead to the same results.
4.2.5.3 Probability Concept
The probability concept is the most sophisticated approach. It explicitly considers
a priori probabilities and allows for taking the costs of misclassification into account.
Furthermore, we can focus on relevant discriminant functions only if several discri-
minant functions are estimated. Finally, it is also possible to account for unequal vari-
ance-covariance matrices. The sophistication of this concept comes at the price that it is
more difficult to comprehend.
Classification Rule
The probability concept allows considering a priori probabilities, similar to the classification functions approach. Additionally, it permits taking the costs of misclassification
into account. If neither probabilities nor the costs of misclassification are considered, the
probability concept leads to the same classification results as the distance concept.
Generally, the probability concept uses the following classification rule:
Assign observation i to the group for which the probability P(g|Yi) is maximum, where
P(g|Yi) is the probability of the membership of observation i in group g given its discri-
minant value Yi.
In decision theory, the classification probability P(g|Yi) is called a posteriori proba-
bility of group membership. We calculate the a posteriori probability with the help of the
Bayes theorem:
P(g|Yi) = P(Yi|g) · P(g) / Σ(g′=1…G) P(Yi|g′) · P(g′)
with
P(g): a priori probability of group g
dig: distance between the discriminant value of observation i and the centroid of group g
For a new observation with the values 5.0 for ‘price’ and 5.0 for ‘delicious’, we get:
Yi = −1.98 + 1.031 · 5 − 0.565 · 5 = 0.350
The squared distances to the group centroids are di12 = 1.598 and di22 = 0.319, respec-
tively (cf. Sect. 4.2.6). Transforming the distances results in the following densities:
Conditional Probability
The conditional probability represents the likelihood of observing discriminant value Y
for observation i given that i belongs to group g. In contrast to the a priori and a posteri-
ori probabilities, conditional probabilities do not have to add up to 1. Conditional prob-
abilities may be arbitrarily small with respect to all groups. Therefore, the conditional
probability is used for assessing how likely it is that an observation belongs to a group at
all.
The conditional probability can be determined based on the standard normal distribu-
tion (cf. Tatsuoka, 1988). For the observation with a value of 5.0 for ‘price’ and 5.0 for
‘delicious’, the discriminant value is 0.350 and it is closest to group g = 2:
|di2 | = |0.350 − 0.914| = 0.564
The resulting conditional probability is:7
P(Yi |g = 2) = 0.572
7 Visit www.multivariate-methods.info for more information on how to compute the conditional probability with Excel.
Fig. 4.8 Representation of the conditional probability (red area) under the density function of the standard normal distribution
About 60% of all observations in group g = 2 are further away from the centroid than
observation i. Observation i is therefore a rather good representative of group g = 2.
In contrast, for an observation i with the values 6.0 for ‘price’ and 1.0 for ‘delicious’,
we get a discriminant value of 3.639 and the following classification (a posteriori)
probabilities:
P(g = 1|Yi ) = 0.001
P(g = 2|Yi ) = 0.999
Observation i would, therefore, also be classified into group g = 2 with a very high prob-
ability. However, the distance to the group centroid is relatively large with di2 = 2.725.
This results in a conditional probability of:
P(Yi |g = 2) = 0.006
The probability that an observation within group g = 2 has a distance larger than obser-
vation i is extremely low—more specifically, about 0.6%. The conditional probability
for group g = 1 would, however, even be lower. Thus, it appears rather unlikely that the
observation actually belongs to any of the two groups.
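The a posteriori and conditional probabilities for the observation with 'price' = 6 and 'delicious' = 1 can be reproduced as follows. The sketch assumes equal a priori probabilities and uses the density transform exp(−d²/2), which follows from the normal-distribution assumption for the discriminant values; the conditional probability is the two-sided tail area of the standard normal distribution.

```python
import math

centroids = {1: -0.914, 2: 0.914}
y = -1.982 + 1.031 * 6 - 0.565 * 1      # discriminant value ≈ 3.639

# densities relative to each centroid, then normalize to posteriors (equal priors)
dens = {g: math.exp(-(y - c) ** 2 / 2) for g, c in centroids.items()}
posterior = {g: d / sum(dens.values()) for g, d in dens.items()}

def conditional(y, g):
    """P(Yi|g): share of group members lying even further from the centroid."""
    d = abs(y - centroids[g])
    phi = 0.5 * (1 + math.erf(d / math.sqrt(2)))   # standard normal CDF
    return 2 * (1 - phi)

print({g: round(p, 3) for g, p in posterior.items()})   # ≈ {1: 0.001, 2: 0.999}
print(round(conditional(y, 2), 3))                      # ≈ 0.006
```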
Figure 4.8 shows the relationship between the distance dig and the conditional proba-
bility. The greater the distance dig, the less likely it is that an observation within group g
will be observed at an equal or larger distance, and the less likely becomes the hypothe-
sis that observation i belongs to group g. The conditional probability P(Yi|g) corresponds
to the probability or significance level of the hypothesis.
Cost of Misclassification
The Bayesian decision rule, which is based on the expected value theory, can be
extended by considering various costs of misclassification, no matter whether the
expected value of a cost or loss criterion is minimized or whether the expected value of a
profit or gain criterion is maximized.
If we consider the cost of misclassification, we assign an observation i to the group
for which the expected value of the costs is minimal:
Eh(Cost) = Σ(g=1…G) Costgh · P(g|Yi) (4.20)
with
Costgh: costs of misclassifying an observation of group g into group h
Since rejecting the loan request results in overall lower costs, the expected profit is
higher when rejecting the loan. Thus, the bank should decide not to grant the loan to the
customer.
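The expected-cost rule of Eq. (4.20) can be sketched with purely hypothetical numbers for a loan decision: the posteriors and the cost entries below are illustrative assumptions, not values from the case discussed above.

```python
# hypothetical a posteriori probabilities P(g|Yi):
# group 1 = creditworthy, group 2 = not creditworthy
posterior = {1: 0.8, 2: 0.2}

# hypothetical costs Cost[h][g] of choosing action h when the customer
# actually belongs to group g
cost = {
    "grant":  {1: 0.0,   2: 1000.0},   # granting to a bad risk is expensive
    "reject": {1: 200.0, 2: 0.0},      # rejecting a good customer costs less
}

# Eq. (4.20): expected cost of each action; choose the minimum
expected = {h: sum(cost[h][g] * posterior[g] for g in posterior) for h in cost}
decision = min(expected, key=expected.get)
print(expected, decision)
```

With these numbers, granting has an expected cost of 200 and rejecting of 160, so the rule decides to reject, mirroring the bank's decision in the text.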
Compared to the distance concept and the classification functions approach, the prob-
ability concept is more elaborate in the sense that it is possible to consider costs of mis-
classification. If unequal costs of misclassification exist, it is recommended to use the
probability concept for classifying new observations.
Table 4.11 Chocolate flavors and perceived attributes examined in the case study
Chocolate flavors Perceived attributes
1 Milk 1 Price
2 Espresso 2 Refreshing
3 Biscuit 3 Delicious
4 Orange 4 Healthy
5 Strawberry 5 Bitter
6 Mango 6 Light
7 Cappuccino 7 Crunchy
8 Mousse 8 Exotic
9 Caramel 9 Sweet
10 Nougat 10 Fruity
11 Nut 11
Multicollinearity
Multicollinearity among the describing variables (i.e., correlation between the describ-
ing variables; cf. Sect. 2.2.5.7) may influence the final specification of the discriminant
function if stepwise procedures are used. If variables are highly correlated, one variable
can be explained by the other variable(s) and thus it adds little to the explanatory power
of the entire set. For this reason it may be excluded or not included in the first place.
Moreover, multicollinearity can influence the test of significance. To assess the degree of
multicollinearity, we can compute the tolerance value or variance inflation factor (VIF)
for each describing variable (see Sect. 2.2.5.7 for suggested remedies if multicollinearity
is critical).
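The variance inflation factor can be computed by regressing each describing variable on all others and forming VIF_j = 1/(1 − R_j²). The sketch below uses a small synthetic data matrix with uncorrelated columns, so both VIFs come out as 1.

```python
import numpy as np

def vif(X):
    """One VIF per column of the (n, p) matrix of describing variables."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])        # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # regress j on the rest
        resid = y - A @ coef
        r2 = 1 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        out.append(1 / (1 - r2))                         # VIF_j = 1/(1 - R_j^2)
    return out

X = np.array([[0.0, 1.0], [1.0, -1.0], [2.0, -1.0], [3.0, 1.0]])
print(vif(X))   # uncorrelated columns, so both VIFs are close to 1
```

Large VIF values (the tolerance is simply 1/VIF) would flag variables whose information is already contained in the others.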
8 We use the same data set as for logistic regression (cf. Sect. 5.4) in order to better illustrate simi-
larities and differences between the two methods.
9 Missing values are a frequent and unfortunately unavoidable problem when conducting surveys
(e.g. because people cannot or do not want to answer some question(s), or as a result of mistakes
by the interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.
composition of the three segments and their sizes. The size of the segment ‘Classic’ is
more than twice the size of the segments ‘Fruit’ and ‘Coffee’. The manager now wants to
study which product attributes discriminate the groups.
We select the option ‘Means’ to display the mean values of the describing variables in each group. This descriptive statistic provides us with a first indication of which
describing variables might be able to discriminate between the groups. Further, we select
‘Univariate ANOVAs’ to examine the discriminatory power of the describing variables
(cf. Sect. 4.2.4). If we choose this option, SPSS performs a one-way ANOVA test for
equality of group means for each describing variable (cf. Chap. 3). Additionally, we
select ‘Box’s M’ to get the results of the test whether the within-group variance-covari-
ance matrices are equal (cf. Sect. 4.2.6).
SPSS displays the standardized discriminant coefficients by default (cf. Sect. 4.2.2.2).
We thus select ‘Unstandardized’ to also retrieve the unstandardized discriminant
coefficients.
To obtain Fisher’s classification functions and the according coefficients, we further
select ‘Fisher’s’. A separate set of classification function coefficients is obtained for each
group.
You also have the option to substitute missing values with the respective mean value
(i.e., option ‘Replace missing values with mean’). If you activate this option, a missing
observation of a describing variable will be replaced with the mean value of this variable.
In our example, we have six missing values only, and we decide to ‘lose’ these observa-
tions instead of replacing them with the mean value.
4.3.3 Results
candidates for separating the groups, which is in line with our conclusions based on the
mean values.
Besides the result of the one-way ANOVAs, SPSS provides Wilks’ lambda (Fig.
4.15). Remember that smaller values of Wilks’ lambda indicate that a variable is better at
discriminating between the groups. In our example, the values for Wilks’ lambda do not
differ much. The describing variable ‘exotic’ has the smallest value for Wilks’ lambda
(0.800), followed by a value of 0.803 for ‘fruity’. The values of Wilks’ lambda for the describing variables ‘healthy’, ‘bitter’, and ‘sweet’ are the largest, supporting the result of the F-test that these variables are less suitable for discriminating between the groups.
Fig. 4.18 Standardized canonical discriminant function coefficients and structure matrix
variance = 89.8%) is much higher than for the second one (eigenvalue = 0.118, % of
variance = 10.2%).
The canonical correlation as a measure for the extent of association between the dis-
criminant value and the groups equals 0.714 for the first discriminant function, whereas
it is only 0.325 for the second one. Overall, we can conclude that discriminant function 1
has a much higher discriminatory power than discriminant function 2.
Figure 4.17 shows the multivariate Wilks’ lambda including the chi-square test.
Multivariate Wilks’ lambda indicates that both discriminant functions combined are
significant (p = 0.000). However, Wilks’ lambda testing for residual discrimination for
function 2 is not significant (p = 0.205). Thus, the second discriminant function does not
contribute significantly to the separation of the groups and we may consider just the first
discriminant function for interpretation.
Fig. 4.19 Discriminant function coefficients and discriminant values at the group centroids
The asterisks mark each variable’s largest absolute correlation with one of the discrimi-
nant functions. Within each function, these marked describing variables are ordered by
the size of the correlation. Thus, the ordering is different from that in the standardized
coefficients table.
The describing variables ‘fruity’, ‘exotic’, ‘price’, ‘crunchy’, ‘light’, and ‘refresh-
ing’ are most strongly correlated with the first discriminant function, although ‘light’
(correlation = 0.344) and ‘refreshing’ (correlation = 0.243) are rather weakly correlated
with discriminant function 1. Remember that the variables ‘fruity’, ‘exotic’, ‘price’, and
‘crunchy’ have already been identified as good potential discriminating variables when
we considered the group statistics (Fig. 4.14).
The describing variables ‘bitter’, ‘delicious’, ‘sweet’, and ‘healthy’ are most strongly
correlated with the second discriminant function. However, the variables ‘delicious’,
‘sweet’, and ‘healthy’ have rather low correlations with discriminant function 2 com-
pared to ‘bitter’ and are not significant when we use bootstrapping (we do not display the
results here).
In order to assess the discriminatory power of the describing variables with respect to
all discriminant functions, we compute the mean (standardized) discriminant coefficients
according to Eq. (4.14). Table 4.13 shows that the variable ‘fruity’ has the highest value
with 0.448 (= 0.461 · 0.898 + 0.336 · 0.102), while ‘bitter’ has the lowest one (0.201).
Thus, the variable ‘bitter’ has the least and ‘fruity’ has the greatest discriminatory power
overall. This result supports the previous notion that the variable ‘bitter’ is not a good
candidate for separating between the groups and that discriminant function 2 is not able
to separate the groups.
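The weighted mean of Eq. (4.14) can be sketched as follows, using the coefficient values quoted above for ‘fruity’:

```python
# Mean standardized discriminant coefficient (Eq. 4.14): the absolute
# coefficients of each discriminant function, weighted by that function's
# share of explained variance.
def mean_coefficient(coeffs, variance_shares):
    return sum(abs(b) * w for b, w in zip(coeffs, variance_shares))

shares = [0.898, 0.102]   # % of variance for functions 1 and 2
print(round(mean_coefficient([0.461, 0.336], shares), 3))   # 0.448 for 'fruity'
```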
Figure 4.19 shows the estimated unstandardized discriminant coefficients for the two
discriminant functions. The unstandardized discriminant coefficients are used to com-
pute the values for the discriminant variable for each observation (cf. Fig. 4.22). Next to
the coefficients, the (unstandardized) group centroids are displayed. The group centroids
indicate that discriminant function 1 is able to separate group g = 2 from the groups g = 1
and g = 3. In contrast, discriminant function 2 seems to be able to separate group g = 3
from the groups g = 1 and g = 2. Considering that group g = 3 represents the coffee fla-
vors, the relevance of the variable ‘bitter’ is reasonable.
The scatterplot in Fig. 4.20 is based on the two discriminant functions and shows the
discriminant values for each observation. The observations are represented as points in
a two-dimensional space that is spanned by the two discriminant functions. The dis-
criminant values determine the location of each observation. We realize that especially
the groups g = 1 (i.e., classic flavors) and g = 3 (i.e., coffee flavors) have a substantial
overlap and that the group centroids are rather close to each other. Discriminant function
1 seems to separate group g = 2 (i.e., fruit flavors) from the two other groups.
Classification Results
The next part of the SPSS output concerns the classification results (cf. Sect. 4.2.6).
SPSS reports the a priori probabilities (SPSS output not presented here). In our example,
we used the option ‘Compute from group sizes’, and thus the a priori probabilities corre-
spond to the group sizes in the observed data.
SPSS also provides information about the classification functions (Fisher’s linear dis-
criminant functions; Fig. 4.21). The coefficients of the classification functions are used
to classify the observations into one of the three groups (cf. Sect. 4.2.5).
Previously, we selected the option ‘Casewise results’ for the first 15 observations and
Fig. 4.22 presents the corresponding results. More specifically, we get information about:
• Actual group membership of an observation (Actual Group) that is retrieved from the
data set (variable ‘segment’).
• Estimated group membership (Predicted Group), with asterisks indicating an incor-
rect classification.
• Conditional probability P(D > d|G = g) that an observation of group g has a distance
greater than d to the centroid of group g.
Moreover, SPSS provides the classification matrix (Fig. 4.23). The classification matrix
shows the frequencies of actual and estimated group membership for each group based
on the classification (a posteriori) probabilities. The hit rates are 95.4% for group g = 1,
71.4% for group g = 2, and 13.0% for group g = 3. Overall, 85 (= 62 + 20 + 3) out of 116
observations are correctly classified, and the overall hit rate is therefore 73.3%. We note
that the observations belonging to group g = 3 are not well predicted (see also Fig. 4.22).
We have already learnt that discriminant function 1 is able to separate group g = 2 from
the two other groups but that there is a substantial overlap between groups g = 1 and
g = 3. Since we estimated the a priori probabilities from the actual group sizes and g = 1
is the largest group, the a priori probability for group g = 1 is much higher than for g = 3.
Therefore, most observations belonging to group g = 3 are assigned to group g = 1. If we
assume that the groups have equal sizes and do another discriminant analysis (results are
not reported), the hit rate equals 69.8%, which is actually lower.
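The hit rates above can be reproduced from the classification matrix; a sketch with NumPy (only the diagonal counts 62, 20, 3 and the group sizes are given in the text, so the off-diagonal split below is an illustrative assumption):

```python
import numpy as np

# Classification matrix in the style of Fig. 4.23: rows = actual group,
# columns = predicted group. Diagonal counts and row sums follow the text;
# the off-diagonal distribution is assumed for illustration.
conf = np.array([[62,  1,  2],
                 [ 6, 20,  2],
                 [20,  0,  3]])

group_hit_rates = np.diag(conf) / conf.sum(axis=1)   # ~0.954, 0.714, 0.130
overall_hit_rate = np.trace(conf) / conf.sum()       # 85 / 116 ~ 0.733
print(np.round(group_hit_rates, 3), round(overall_hit_rate, 3))
```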
It should be noted that the classification functions are always calculated based on the
pooled variance-covariance matrices. Thus, unlike the classification probabilities, they
do not change. In our case study, the classification results change slightly. The hit rate for
group g = 1 decreases marginally while it increases for group g = 3 (Fig. 4.27). Overall,
the hit rate is slightly higher, with 75.9% compared to 73.3%.
The territorial map presented in Fig. 4.28, however, looks rather different from the
one in Fig. 4.24. The territorial boundaries are no longer linear.
SPSS offers several criteria for the stepwise selection of the describing variables:

• Wilks’ lambda. The describing variables are entered into the discriminant function
based on how much they lower Wilks’ lambda. At each step, the variable that mini-
mizes the overall Wilks’ lambda is entered.
• Unexplained variance. At each step, the variable that minimizes the sum of the unex-
plained variation between groups is entered.
• Mahalanobis distance. Describing variables are selected based on their potential to
increase the distance between groups.
• Smallest F ratio. The describing variables are selected based on maximizing an F
ratio computed from the Mahalanobis distance between groups.
• Rao's V. This method (also called Lawley–Hotelling trace) measures the differences
between group means. At each step, the variable that maximizes the increase in Rao’s
V is entered.
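The first criterion, for example, can be sketched as a greedy forward selection on Wilks’ lambda (a simplified illustration; SPSS additionally applies the entry and removal checks discussed below):

```python
import numpy as np

def wilks_lambda(X, y):
    # Wilks' lambda = det(W) / det(T): within-group over total SSCP matrix.
    X, y = np.asarray(X, float), np.asarray(y)
    T = (X - X.mean(0)).T @ (X - X.mean(0))
    W = np.zeros_like(T)
    for g in np.unique(y):
        Xg = X[y == g]
        W += (Xg - Xg.mean(0)).T @ (Xg - Xg.mean(0))
    return np.linalg.det(W) / np.linalg.det(T)

def stepwise_wilks(X, y, n_steps):
    # Greedy forward selection: at each step, enter the variable that
    # minimizes the overall Wilks' lambda of the selected set.
    X, selected = np.asarray(X, float), []
    for _ in range(n_steps):
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        best = min(candidates,
                   key=lambda j: wilks_lambda(X[:, selected + [j]], y))
        selected.append(best)
    return selected
```

With toy data in which only one column differs between the two groups, that column is entered first, mirroring how ‘exotic’ (smallest Wilks’ lambda) enters in step 1.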
Fig. 4.30 Describing variables included in the discriminant functions (stepwise procedure)
There is no single superior method, and we can apply several of these criteria to test
the robustness of the results. In this case study, we use Wilks’ lambda, which is also the default option in
SPSS.
Further, you can choose the criteria that decide when to stop the entry or removal of
describing variables. Available alternatives are ‘Use F value’ or ‘Use probability of F’.
For the former alternative, a describing variable is entered into the model if its
F-value is greater than the entry value and is removed if its F-value is lower than the
removal value. The entry value must be greater than the removal value, and both val-
ues must be positive. To enter more variables into the model, lower the entry value. To
remove more variables from the model, increase the removal value.
With the latter approach, a variable is entered into the model if the significance level of
its F-value is lower than the entry value and removed if the significance level is greater
than the removal value. Again, the entry value must be smaller than the removal value,
and both values must be positive and less than 1. To enter more variables into the model,
increase the entry value. To remove more variables from the model, lower the removal
value.
For both approaches, SPSS provides some default values for entering and remov-
ing describing variables. We use the default criterion related to ‘Use F value’ (Entry:
3.84, Removal: 2.71). Finally, we request to display a summary of the different steps
(‘Summary of steps’).
Figure 4.30 shows the variables that are entered into the discriminant functions. In
step 1, the variable ‘exotic’ is entered. Actually, when using the blockwise estimation
procedure, the variable ‘exotic’ was the one with the smallest value for Wilks’ lambda,
followed by the variable ‘fruity’. Thus, it intuitively makes sense that the variable ‘fruity’
is entered in step 2. After step 4, the procedure stops since no more variables can be
entered or removed based on the set criteria. The discriminant functions then consider
the variables ‘exotic’, ‘fruity’, ‘price’, and ‘refreshing’.
Fig. 4.31 Eigenvalues for the two discriminant functions (stepwise procedure)
Figure 4.31 shows the eigenvalues for the two discriminant functions as well as the
share of explained variance (% of variance) and the canonical correlation. Since the
eigenvalues of different analyses are hard to compare, we focus on the share of explained
variance. Discriminant function 1 explains 95.5% of the explained variance, while dis-
criminant function 2 only explains the remaining 4.5% of the explained variance. The
canonical correlation of discriminant function 2 is also rather low. Thus, the results sug-
gest that—given the data at hand—one discriminant function is sufficient to explain the
differences between the groups (Fig. 4.31).
Figure 4.32 shows the standardized discriminant coefficients that allow for comparing
the discriminatory power of the describing variables. Here, the variable ‘price’ has the
greatest discriminatory power for discriminant function 1 and the variable ‘exotic’ has
the greatest discriminatory power for discriminant function 2. When you use the step-
wise estimation procedure in SPSS, bootstrapping is not available.
Moreover, the structure matrix is reported (Fig. 4.32). The structure matrix again
shows the correlation of each describing variable with the discriminant functions. The
asterisk marks each variable’s largest absolute correlation with one of the discriminant
functions. The variables with a superscript ‘b’ are actually not included in the final for-
mulation of the discriminant function but SPSS reports the values anyway. The describing
variables ‘fruity’ and ‘price’ are most strongly correlated with the first discriminant
function, and the variable ‘exotic’ has the highest correlation with discriminant function 2.
The group centroids suggest that while discriminant function 1 separates group g = 2
from the groups g = 1 and g = 3 (Fig. 4.33), discriminant function 2 is not really able to
separate among the groups. This result is to be expected since discriminant function 2 is
again not significant (SPSS result not reported here).
Finally, we take a look at the classification results (cf. Fig. 4.34). The hit rate for the
model with just four variables is 70.0% and, thus, slightly lower than the hit rate of the
model considering ten variables, which is 73.3%.
Generally, the stepwise estimation procedure should be applied with caution. We rec-
ommend using the blockwise estimation procedure unless you have many describing
variables.
Fig. 4.32 Standardized canonical discriminant function coefficients and structure matrix (stepwise esti-
mation)
Above, we demonstrated how to use the graphical user interface (GUI) of SPSS to con-
duct a discriminant analysis. Alternatively, we can also use the SPSS syntax which is
a programming language unique to SPSS. Each option we activate in SPSS’s GUI is
translated into SPSS syntax. If you click on ‘Paste’ in the main dialog box shown in Fig.
4.10, a new window opens with the corresponding SPSS syntax. However, you can also
use the SPSS syntax directly and write the commands yourself. Using the SPSS syntax
can be advantageous if you want to repeat an analysis multiple times (e.g., testing dif-
ferent model specifications). Figures 4.35, 4.36 show the SPSS syntax for running the
analyses discussed above.
For readers interested in using R (https://www.r-project.org) for data analysis, we pro-
vide the corresponding R-commands on our website (www.multivariate-methods.info).
4.4 Recommendations
We close this chapter with some prerequisites and recommendations for conducting a
discriminant analysis.
BEGIN DATA
3 3 5 4 1 2 3 1 3 4 1
6 6 5 2 2 5 2 1 6 7 1
2 3 3 3 2 3 5 1 3 2 1
---------------------
5 4 4 1 4 4 1 1 1 4 1
* Enter all data.
END DATA.
Fig. 4.35 SPSS syntax for blockwise estimation with pooled covariance matrices
Fig. 4.36 SPSS syntax for stepwise estimation with pooled covariance matrices
Alternative Methods
As an alternative to discriminant analysis, we can use logistic regression to discriminate
between groups and classify observations based on their characteristics/attributes if just
two groups are observed. The main difference between logistic regression and discrimi-
nant analysis is that logistic regression provides probabilities for the occurrence of alter-
native events or the classification into separate groups. In contrast, discriminant analysis
provides discriminant values from which probabilities can be derived in a separate step.
One advantage of logistic regression is that it is based on fewer assumptions about the
data than discriminant analysis. For example, discriminant analysis assumes normally
distributed describing variables with equal within-group variance-covariance matrices.
Logistic regression just assumes a multinomial distribution of the grouping variable.
Logistic regression is, therefore, more flexible and less sensitive than discriminant analy-
sis. If, however, the assumptions of discriminant analysis are fulfilled, discriminant anal-
ysis uses more information derived from the data and delivers more efficient estimates
(i.e., with smaller variance) than logistic regression (cf. Hastie et al., 2009, p. 128). This
is particularly advantageous for small sample sizes (N < 50). Empirical evidence suggests
that with large sample sizes both methods produce similar results, even if the assump-
tions of discriminant analysis are not fulfilled (cf. Michie et al., 1994, p. 214; Hastie
et al., 2009, p. 128; Lim et al., 2000, p. 216).
If the assumption of equal within-group variance-covariance matrices is violated,
you may consider using a quadratic discriminant analysis (QDA). We do not cover this
variation of linear discriminant analysis in this book and refer the interested reader to
Hastie et al. (2009).
If the main objective of the analysis is to classify new observations, machine learn-
ing methods such as decision trees and neural networks are alternative methods. In a
large study by Lim et al. (2000), 33 algorithms for classification were tested on 32 data
sets. Discriminant analysis and logistic regression were among the five best methods.
Consequently, if you aim to classify observations, it is advisable to start your analysis
with a discriminant analysis (or logistic regression) before applying more sophisticated
methods from computer science. The researchers noted that it was interesting that the ‘old’ method of discri-
minant analysis performed just as well as ‘newer’ methods.
References
Further reading
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees.
Chapman & Hall.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7(2), 179–188.
Green, P., Tull, D., & Albaum, G. (1988). Research for marketing decisions (5th ed.). Prentice
Hall.
Huberty, C. J., & Olejnik, S. (2006). Applied MANOVA and discriminant analysis (2nd ed.).
Wiley-Interscience.
IBM SPSS Inc. (2022). IBM SPSS Statistics 29 documentation. https://www.ibm.com/support/
pages/ibm-spss-statistics-29-documentation. Accessed November 4, 2022.
Klecka, W. (1993). Discriminant analysis (15th ed.). Sage.
Lachenbruch, P. (1975). Discriminant analysis. Springer.
5 Logistic Regression
5.1 Problem
In many problems in science and practice, questions of the following kind arise.
Often there are only two alternatives, e.g.: Does a patient have a certain disease or not?
Will he survive or not? Will a borrower repay his loan or not? Will a consumer buy a
product or not? In other cases, there are more than two alternatives, e.g.: Which brand
will a potential buyer choose? Which party will a voter vote for?
Logistic regression analysis can be used to answer such questions. As its name indi-
cates, logistic regression is a variant of regression analysis. In general, logistic regression
deals with problems of the form
Y = f (X1 , X2 , . . . , XJ ),
where the dependent variable (response variable) Y is categorical. The independent vari-
ables (predictors) can be metric or categorical variables. Today, logistic regression is the
most important method for analyzing problems with categorical responses.
We usually denote the values of the dependent variable by g = 1, …, G (or 1 and 0
for just two categories). They indicate alternative events (or groups, response categories,
etc.). Since the occurrence of events is usually subject to uncertainty, Y is regarded as a
random variable. The aim of logistic regression is then to estimate probabilities for pre-
dicting events:
π = f (X1 , X2 , . . . , XJ )
Here are some practical examples for logistic regression with just two alternatives:
• Design of an automatic detector for spam emails (junk emails). Observations: emails
that were spam emails or valid emails. Predictors: frequencies of certain words or
character strings (57 variables).1
1 Cf. Hastie et al. (2011, pp. 2, 300). The data set “Spambase” contains information on 4601 emails
• The systematic component is a linear function of the predictor X. For a given value x
the systematic component takes the value
z(x) = α + β x (5.3)
• The logistic function, from which logistic regression got its name, has the form
π = e^z/(1 + e^z) = 1/(1 + e^−z) (5.4)
and is shown in Fig. 5.1.
The systematic component is a real-valued function that can take any value between
−∞ and +∞. What we need for modeling the conditional probability π(x) is a function
that transforms the systematic component into a probability, i.e. into a range between 0
and 1. The logistic function is such a function.
2 Such a variable is called a Bernoulli variable and the events can be seen as outcomes of a
Bernoulli trial. The resulting probability distribution is called Bernoulli distribution. The name
goes back to Jacob Bernoulli (1656–1705). The simplest example of a Bernoulli trial is the tossing
of a coin with the expected value E(Y) = π = 0.5 and the variance V(Y) = π(1 – π). The Bernoulli
distribution is a special case of the binomial distribution for N = 1 trials. The
binomial distribution results from a sequence of N Bernoulli trials. Correspondingly, the buying
frequency (sum of buyers) is binomially distributed with sample size N. With increasing N the
binomial distribution converges to the normal distribution.
Fig. 5.1 The logistic function
The logistic function has an S shape, similar to the distribution function (cumula-
tive probability function) of the normal distribution.3 It can thus be used to transform a
real-valued variable Z (range [−∞, +∞]) into a probability (range [0,1]).
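A minimal sketch of the logistic function of Eq. (5.4):

```python
import math

def logistic(z):
    # Eq. (5.4): maps any real z to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

print(logistic(0.0))            # 0.5
print(round(logistic(-5), 4))   # 0.0067
print(round(logistic(5), 4))    # 0.9933
```

The symmetry logistic(z) + logistic(−z) = 1 reflects the S shape around π = 0.5.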
By inserting Eq. (5.3) into Eq. (5.4) we get the simple logistic regression model:
π(x) = e^(α+βx)/(1 + e^(α+βx)) = 1/(1 + e^−(α+βx)) (5.5)
where α and β are unknown parameters that have to be estimated on the basis of obser-
vations (yi, xi) of Y and X. The larger z(x), the greater π(x) = P(Y = 1|x). Accordingly, the
greater z(x), the smaller P(Y = 0|x).
For multiple logistic regression, the systematic component can be extended to
z(x) = α + β1 x1 + · · · + βJ xJ (5.6)
where x = (x1, …, xJ) is a vector of predictors. Thus, in the model of logistic regression,
the predictors are combined linearly, just as in multiple linear regression analysis.
3 This is the reason for the broad usage and the importance of the logistic function, since it is
much easier to handle than the distribution function of the normal distribution, which can only be
expressed as an integral and is therefore difficult to calculate. The logistic function was developed
by the Belgian mathematician Pierre-Francois Verhulst (1804–1849) to describe and predict popu-
lation growth as an improved alternative to the exponential function. The constant e = 2.71828 is
Euler’s number, which also serves as the basis of the natural logarithm.
By inserting Eq. (5.6) into Eq. (5.4) we get the binary logistic regression model:
π(x) = 1/(1 + e^−z(x)) = 1/(1 + e^−(α+β1x1+···+βJxJ)) (5.7)
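A sketch of the model in Eq. (5.7) with hypothetical coefficients (the values of α and the βj below are made up for illustration, not estimates from the case study):

```python
import math

def predict_proba(x, alpha, betas):
    # Eq. (5.7): the predictors enter linearly through z(x), and the
    # logistic function turns z(x) into a probability.
    z = alpha + sum(b * xj for b, xj in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two predictors (e.g., income and a gender dummy):
print(round(predict_proba([2.5, 1], alpha=-3.0, betas=[1.2, 0.5]), 3))   # 0.622
```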
Terminology
As in other methods, logistic regression uses different names for the variables in differ-
ent contexts:
• The dependent variable is also denoted as the response variable, grouping variable,
indicator variable, outcome variable, or y-variable.
• The independent variables are also called predictors, explanatory variables, covari-
ates, or just x-variables.
In some cases, the independent variables are referred to as covariates when they are met-
ric variables and factors when they are categorical variables.4
4 Categorical independent variables with more than two categories must be decomposed into binary
variables, as in linear regression analysis.
Fig. 5.2 Odds and logit as functions of the probability π
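The odds and the logit (log-odds) of a probability π can be computed as follows:

```python
import math

def odds(p):
    # Odds of an event with probability p.
    return p / (1 - p)

def logit(p):
    # Logit = natural logarithm of the odds; maps (0, 1) to (-inf, +inf).
    return math.log(odds(p))

print(odds(0.5), logit(0.5))   # 1.0 0.0
print(round(logit(0.8), 3))    # 1.386
```

The logit is the inverse of the logistic function: applying the logistic function of Eq. (5.4) to logit(p) returns p.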
5 Within the framework of generalized linear models (GLM), logit(π) forms a so-called link func-
tion by means of which a linear relationship is established between the expected value of a depend-
ent variable and the systematic component of the model. The logit link is used in particular when
a binomial distribution of the dependent variable is assumed (cf. Agresti, 2013, pp. 112–122; Fox,
2015, pp. 418 ff.).
5.2 Procedure
In this section, we will show how logistic regression works. The procedure can be struc-
tured into five steps that are shown in Fig. 5.3.
We will demonstrate the steps of logistic regression with a simple example.
Application Example
The product manager of a chocolate company wants to evaluate the market opportuni-
ties of a new product, extra dark chocolate in a gift box, which is to be positioned in the
premium segment. Because of its bitter taste and premium price, the manager wants to
investigate whether and how the demand for this new gourmet chocolate depends on the
income of the consumers and whether it is preferred more by women or by men.
For this purpose, he carries out a product test in which the test persons are asked,
after the presentation and tasting of the product, whether they will buy this new type
of chocolate. The test persons can choose between the following answer categories:
“yes”, “maybe”, “rather not”, “no”. For simplicity’s sake, we summarize the last
three answers into one category and refer to the alternative results as “Buy” and “No
buy”. Table 5.2 encompasses the demographic characteristics of N = 30 respondents
and their responses. Income is given in 1000 EUR. Gender is coded as 0 = female and
1 = male. A variable coded 0 or 1 is called a dummy variable. It can be treated like a
metric variable. Dummy variables can be used to incorporate qualitative predictors
into a linear model.6 ◄
1 Model formulation
In the first step, the user has to decide which events should be considered as possible
categories of the dependent variable and which variables should hypothetically be con-
sidered and investigated as independent (influencing) variables.
If there is a large number of categories, it may be necessary to combine several cate-
gories so as to reduce the number. Here we already combined the three answer categories
“maybe”, “rather not” and “no” into one category (“No buy”). Similarly, households may
be classified according to whether children are present or not, without further differen-
tiating according to the number of children. It would be different if it was a question of
brand choice, e.g., between Mercedes, BMW, and Audi. In this case, it would not make
sense to combine two of the three categories.
First, only Income shall be considered as an influencing variable. The product man-
ager assumes that income will have a positive influence on purchasing behavior. Thus, he
formulates the following model:
probability of buying = f (Income)
To estimate this model, it must be specified in more detail. The probabilities are not
directly observable, but are manifested in the respondents’ statements whether they will
buy the new chocolate or not. This is expressed by the variable Y with values yi (i = 1, …,
N), with yi = 1 for “Buy”, and 0 for “No buy”.
It is always useful to visualize the data to be analyzed at the outset. For this purpose,
we can use a scatterplot (see Fig. 5.4). Each observation of the variables X = Income and
Y = Buying is represented by a point (xi, yi). The scatter of the data points has a peculiar
look here. The points are arranged in two parallel lines. The upper row of points repre-
sents the “Buyers” and the lower row the “No buys”.
It is evident that the two clusters are overlapping on the x-axis. That is, for medium
incomes we find both Buyers and Non-buyers. However, the Buyers are shifted slightly
to the right side, towards higher incomes. This indicates that income has a positive influ-
ence on purchasing behavior, as already assumed by the product manager. We now want
to quantify this result of the visual inspection of the data by a numerical analysis.
We will now analyze the above data using four different models and then compare
the results.
Fig. 5.4 Scatterplot of the variables Income (X) and Buying (Y)
The first two models are simple linear regression models which can be estimated by using
the method of ordinary least-squares (OLS). They are easy to handle and can provide
good approximations. Thus, they are of high practical relevance. Besides, it is instructive
to compare these simpler models with the logistic regression models whose estimation
requires the application of the more complicated maximum likelihood method (ML).
Fig. 5.5 Estimated regression function for the linear probability model (Model 1)
In binary logistic regression, the dependent variable Y is not metric but can only take
the values 0 and 1. Thus we do not have an error term as in linear regression analysis.
But according to Eq. (5.2) the expected value of the binary variable Y is a conditional
probability and thus a metric variable. With this, we get the linear probability model7
π(x) = α + β x (5.15)
In our example, π(x) is the probability of buying given a certain income x. With the data
for the variables Buying and Income in Table 5.2 and by using the least-squares method,
we get:
7 More information on the linear probability model can be found in Agresti (2013, p. 117; 1996,
p. 74); Hosmer and Lemeshow (2000, p. 5).
The positive sign of the regression coefficient b confirms the product manager’s
assumption that income has a positive influence on the buying probability. The coeffi-
cient of determination (R-square) is only 16.6%, but this is not unusual for individual
data.
However, the model is not logically consistent because it can provide probabilities
that lie outside the interval from 0 to 1. For incomes below 734 EUR, we would get neg-
ative “probabilities”, and for incomes above 3324, we would get “probabilities” greater
than one. Despite these shortcomings the model offers useful approximations within the
range of observed incomes as we will later see when comparing the linear probability
model with the other models.
The advantage of the model is that it is easy to calculate and easy to interpret, as the
buying probability changes linearly with the income. For an income of 1500 EUR, the
expected buying probability is about p = 30%, as can easily be calculated with the esti-
mated function (Eq. 5.16). If the income increases from 1500 EUR to 1600 EUR and,
thus, x from 1.5 to 1.6, the buying probability increases by b/10 = 0.039 to p = 33.9%.
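The linear probability model is easy to evaluate within the observed income range; a sketch using coefficients back-calculated from the numbers in the text (slope b ≈ 0.39 from b/10 = 0.039, intercept chosen so that p(1.5) ≈ 0.30 — approximations, not the exact estimates of Eq. (5.16)):

```python
# Linear probability model (Eq. 5.15). The coefficients are approximations
# back-calculated from the text, not the reported estimates.
a, b = -0.285, 0.39

def p_linear(x):
    # x = income in 1000 EUR; returns the estimated buying probability.
    return a + b * x

print(round(p_linear(1.5), 3))   # ~0.30 at 1500 EUR
print(round(p_linear(1.6), 3))   # ~0.339 at 1600 EUR
```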
8 These groups (classes) must be distinguished from the category groups of the dependent variable
Y.
Fig. 5.6 Scatterplot of the grouped data
The scatter of the 5 points (x̄k, ȳk) is shown in Fig. 5.6. Now we have only K = 5 observa-
tions instead of N = 30. Of course, this method works better for larger sample sizes when
we can build more and larger predictor classes.
With Eq. (5.12) we can formulate the simple linear regression model Eq. (5.14) with
grouped data in logit form:
logit(y) = α + β x + ε (5.17)
With the data
(x̄k, logit(ȳk)), k = 1, …, K
This function, which we estimated on an aggregated basis, we can now apply to indi-
vidual income values and thus obtain estimates for individual probabilities. For the first
person with an income of 2530 EUR, we get:
p1 = 1/(1 + e^(3.48−1.73·x1)) = 1/(1 + e^(3.48−1.73·2.53)) = 0.71

p(x) = 1/(1 + e^(3.67−1.83x))
which is shown in Fig. 5.7. This function is very similar to the one we got in the previous
section with grouped data.
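The calculation for person 1 with the grouped-data estimates can be sketched as:

```python
import math

def p_hat(x):
    # Grouped-data estimates from the text: logit(p) = -3.48 + 1.73 x,
    # with x = income in 1000 EUR.
    return 1.0 / (1.0 + math.exp(3.48 - 1.73 * x))

print(round(p_hat(2.53), 2))   # 0.71 for person 1 (income 2530 EUR)
```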
Fig. 5.7 Estimated logistic regression function (Model 3)
5.2.1.4 Classification
The estimated probabilities can be used to predict purchasing behavior or, in the termi-
nology of classification, to assign persons to categories (groups).
Our sample comprises two categories: “Buy” and “No buy”. Now we want to find out
whether our model can correctly predict into which group a person belongs if the income
is known. If this works, we can apply the model to other persons in the population who
were not used for the analysis.
For predicting the group membership (classification), we use the estimated probabili-
ties. To transform a probability into a prediction of group membership, a cutoff value is
necessary. This value (threshold) we denote by p*. A case with an estimated probability
greater than p* will be “predicted” (classified) as “Buy”, otherwise as “No buy”. The follow-
ing applies:
ŷi = 1, if pi > p∗;  ŷi = 0, if pi ≤ p∗ (5.18)
The cutoff value for just two alternatives is usually the probability p* = 0.5. Using this
value, all three models give the same predictions (see Table 5.4). For person 1 they cor-
rectly predict “Buyer”, for person 15 they falsely predict “No buy”, and for person 30
they correctly predict “No buy”.
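The classification rule of Eq. (5.18) in code:

```python
def classify(p, cutoff=0.5):
    # Eq. (5.18): predict 1 ("Buy") if p exceeds the cutoff p*, else 0.
    return 1 if p > cutoff else 0

print([classify(p) for p in (0.71, 0.50, 0.30)])   # [1, 0, 0]
```

Note that a probability exactly equal to the cutoff is classified as “No buy”, matching the ≤ in Eq. (5.18).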
Predictions are usually directed towards the future. Strictly speaking, one can there-
fore only speak of predictions when it comes to future behavior. Here we use a retrospec-
tive design (“predicting into the past”) for checking the predictive power of a model.
Table 5.5 shows the estimated probabilities of all 30 persons, which were derived by
logistic regression with individual data (Model 3), with the observed group membership
(Buy or No buy) and the predicted group membership.
Note that the mean of the estimated buying probabilities is equal to the proportion of
observed Buyers (mean of the y-values). This corresponds to the least-squares method in
linear regression, where the mean values of the estimated and observed y-values are also
always identical.
282 5 Logistic Regression
Classification Tables
The total set of observations and predictions can be summarized in a classification
table (confusion matrix). Table 5.6 represents the classification table for the results in
Table 5.5.
In the diagonal of the four fields (under “Prediction”) are the case numbers of the
correct predictions: 9 “Buy” and 7 “No buy” (bold numbers). The remaining two fields
contain the numbers of incorrect predictions. The column “sum” presents the case num-
bers of the two category groups (Buy and No buy) and the total number of cases (here
the number of all test persons). These numbers are given by the data and do not have to
be calculated. They must match the sum of the cells in the same row.
The right side of the classification table shows three different measures of predictive
accuracy:
Sensitivity	proportion of correctly predicted Buyers relative to the total number of Buyers
Specificity	proportion of correctly predicted Non-buyers relative to the total number of Non-buyers
Hit rate	proportion of all correctly classified cases relative to the total number of cases
Table 5.7 shows how to calculate these three measures of accuracy. The hit rate is a
weighted average of sensitivity and specificity.
The hit rate of 53.3% achieved here is very modest and only slightly above what we
would expect from tossing a coin. For comparison, see the classification table for the lin-
ear probability model in Table 5.8. This model yields a hit rate of 60%, which might lead
to the conclusion that this model predicts or classifies better.
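The three accuracy measures can be computed directly from the four cells of a classification table. The sketch below uses the counts of Table 5.8; the shorthand names tp, fn, fp, tn (true/false positives/negatives) are ours, not the book's:

```python
# Sketch: sensitivity, specificity and hit rate from a classification
# table. Counts follow Table 5.8 (linear probability model, p* = 0.5):
# 9 of 16 buyers and 9 of 14 non-buyers are predicted correctly.
def accuracy_measures(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)                 # correct 'Buy' / all buyers
    specificity = tn / (tn + fp)                 # correct 'No buy' / all non-buyers
    hit_rate = (tp + tn) / (tp + fn + fp + tn)   # weighted average of the two
    return sensitivity, specificity, hit_rate

sens, spec, hit = accuracy_measures(tp=9, fn=7, fp=5, tn=9)
print(sens, spec, hit)
```

The hit rate equals the weighted average of sensitivity and specificity, with the group sizes (16 and 14) as weights.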
Table 5.8 Classification table for the linear probability model (p* = 0.5)

                Prediction                          Accuracy
Group           1 = Buy   0 = No buy   Sum          Proportion correct
1 = Buy         9         7            16           0.563   Sensitivity
0 = No buy      5         9            14           0.643   Specificity
Total           14        16           30           0.600   Hit rate
This, however, is deceptive. The hit rate is only of limited use as a measure
of predictive accuracy since it also depends on the selected cutoff value p*. If we
perform the classification with modified cutoff values, e.g. p* = 0.3 or p* = 0.7, using
the data in Table 5.5, we obtain the classifications shown in Tables 5.9 and 5.10. In both
cases we get an increased hit rate. This shows the influence of the cutoff value on the hit
rate. It is unusual, however, for the hit rate to rise both for higher and lower cutoff values
as is the case here.
[Fig. 5.9: ROC curve for the logistic model (Model 3); correctly predicted buyers (sensitivity) plotted over incorrectly predicted buyers (1 − specificity), with the point for p* = 0.5 marked]
ROC Curve
A generalized concept for assessing a classification table is the ROC curve (receiver
operating characteristic). While a classification table is always valid for a specific cutoff
value p*, the ROC curve gives a summary of the classifications for all possible values of
p*.
Figure 5.9 shows the ROC curve for the logistic model with the values from Table
5.5. A point on the ROC curve is valid for a certain cutoff value and therefore also for a
certain classification table.9 The ROC curve is obtained by plotting the sensitivity over
1 – specificity for different cutoff values p*. The classification table above for p* = 0.5
(Table 5.6) yields: 1 – specificity = 0.500 and sensitivity = 0.563. This point (0.500,
0.563) on the ROC curve in Fig. 5.9 is indicated by an arrow and is close to the diagonal
line.
9 The concept of the ROC curve originates from communications engineering. It was originally
developed during the Second World War for the detection of radar signals or enemy objects and
is used today in many scientific fields (see, e.g., Agresti, 2013, pp. 224 ff.; Hastie et al., 2011,
pp. 313 ff.; Hosmer et al., 2013, pp. 173 ff.). SPSS offers a procedure for creating ROC curves for
given classification probabilities or discriminant values. The above ROC curve was created with
Excel.
The diagonal line would be expected if the prediction were purely random, e.g. by
tossing a coin. It does not allow for any discrimination. The area under the ROC curve,
known as area under curve (AUC), is a measure of the overall predictive accuracy (clas-
sification performance) of the model. Its maximum is 1. Hosmer et al. (2013, p. 177)
give the following rule for judging the accuracy expressed by the ROC curve:
AUC < 0.7: not sufficient
0.7 ≤ AUC < 0.8: acceptable
0.8 ≤ AUC < 0.9: excellent
AUC ≥ 0.9: outstanding
For our model, we obtain AUC = 0.723. This value is obtained for all three mod-
els above, even though they lead to different classification tables for individual cutoff
values.10
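The construction of the ROC curve and of AUC can be sketched as follows. The data here are a small illustrative sample, not the 30 persons of Table 5.5, and observations are counted as positive when their probability reaches the current cutoff:

```python
# Sketch: an ROC curve built by sweeping the cutoff over all estimated
# probabilities; AUC is the area under the resulting piecewise-linear
# curve (trapezoidal rule).
def roc_points(y, p):
    cutoffs = sorted(set(p), reverse=True)
    pts = [(0.0, 0.0)]
    pos, neg = sum(y), len(y) - sum(y)
    for c in cutoffs:
        tp = sum(1 for yi, pi in zip(y, p) if pi >= c and yi == 1)
        fp = sum(1 for yi, pi in zip(y, p) if pi >= c and yi == 0)
        pts.append((fp / neg, tp / pos))  # (1 - specificity, sensitivity)
    return pts

def auc(pts):
    # trapezoidal area under the ROC curve
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

y = [1, 1, 0, 1, 0, 0]
p = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]
print(auc(roc_points(y, p)))
```

A model that predicted purely at random would produce points along the diagonal and an AUC near 0.5.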
• Sensitivity = “true positive”: The test will be positive if the patient is sick (disease is
correctly recognized).
• Specificity = “true negative”: The test will be negative if the patient is not sick.
Imagine that the disease is dangerous but curable if treated quickly. In this case, the
damage of a “false positive” (an unnecessary treatment) is smaller than the damage
of a “false negative” (the patient may die because the disease was not recognized).11
10 We also get the same value for AUC if we apply discriminant analysis to our data. Alternatively,
one can create the ROC curve based on discriminant values or classification probabilities.
11 Another danger of “false negatives” is the risk that sick persons may spread an infectious dis-
ease. The high rate of “false negative” test results contributed to the rapid spread of the corona
virus at the beginning of the pandemic in 2020 (cf. Watson et al., 2020).
In this case it would be useful to increase the sensitivity by lowering the cutoff value. In
Table 5.9, the sensitivity is increased to 100% by reducing the cutoff value to p* = 0.3.
If, on the other hand, the disease is not curable, a false diagnosis of disease (“false
positive”) might cause considerable harm by incurring severe anxiety and depression in
the patient, and, in the extreme, ruining his life (cf. Gigerenzer, 2002, pp. 3 ff.; Pearl,
2018, pp. 104 ff.). In this case, it would be appropriate to increase specificity by raising
the cutoff value to avoid “false positives”. In Table 5.10, the specificity is increased to
92.9% by increasing the cutoff value to p* = 0.7, thus reducing the percentage of “false
positives” to (1 – specificity) × 100 = 7.1%. For a medical test, this probability of “false
positives” would still be very high.
A similar problem is encountered when designing a spam filter for e-mails. If we set
“1 = Spam” and “0 = No spam” (corresponding to Buy and No buy in the above exam-
ple), sensitivity represents the ability to recognize spam correctly. A low cutoff value
leads to high sensitivity. But high sensitivity increases the risk that a valid e-mail (No
spam) is lost in the spam filter (“false positive”). Since it is unpleasant if an important
e-mail is lost in this way, the cutoff value will need to be set higher. As a result, sensitiv-
ity is reduced and we continue to receive spam.
The opposite situation is encountered in airport security. Here a “false positive” would
have no serious consequences, leading only to a more extensive checking of passengers.
But a “false negative” means that a terrorist is not detected, potentially leading to hun-
dreds of deaths. Thus, in this case the sensitivity has to be very high (and a low cutoff
value is necessary).
π(x) = e^(α + β1x1 + ··· + βJxJ) / (1 + e^(α + β1x1 + ··· + βJxJ)) = 1 / (1 + e^−(α + β1x1 + ··· + βJxJ))   (5.19)

with

π(x) = P(Y = 1 | x1, x2, ···, xJ)   (5.20)

The predictors are combined linearly, as in multiple regression analysis.
Alternatively, the following formulas are used:

π(x) = e^(α + Σj βjxj) / (1 + e^(α + Σj βjxj)) = e^(α + xβ′) / (1 + e^(α + xβ′)) = 1 / (1 + e^−(α + xβ′))

where x and β denote row vectors.
Continuing our example with the data in Table 5.2 we will now include the variable
Gender and estimate the model
probability of buying = f (Income, Gender)
According to Eq. (5.19), we formulate the following logistic model:
π(x) = 1 / (1 + e^−(α + β1x1 + β2x2))   (5.21)
Using the maximum likelihood method provides the following estimates for the
parameters:
a = −5.635, b1 = 2.351, b2 = 1.751
With these values we get the following logistic regression function:
p(x) = 1 / (1 + e^−(−5.635 + 2.351x1 + 1.751x2))   (5.22)
Example
For the first person in Table 5.2, a woman with an income of 2530 EUR, the systematic
component (the logit) has the value:
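The computation, which is cut off here, can be sketched as follows. We assume the gender coding used in the men/women comparison later in this section (man = 1, woman = 0); income is measured in units of 1000 EUR:

```python
import math

# Sketch: logit and buying probability for person 1 (a woman, income
# 2530 EUR, i.e. x1 = 2.53), using the estimates of Eq. (5.22).
# Gender coding woman = 0 is an assumption taken from the men/women
# calculation later in this section.
a, b1, b2 = -5.635, 2.351, 1.751
x1, x2 = 2.53, 0                 # income in 1000 EUR, gender (woman = 0)

z = a + b1 * x1 + b2 * x2        # systematic component (the logit)
p = 1 / (1 + math.exp(-z))       # Eq. (5.22)
print(round(z, 3), round(p, 3))
```

With this coding the woman's estimated buying probability lies somewhat above 0.5, driven mainly by her above-average income.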
[Figure: estimated buying probability p (0 to 1) plotted over income X (0 to 4), with separate logistic curves for men and women]
Table 5.11 Classification table for the multiple logistic model (p* = 0.5)

                Prediction                          Accuracy
Group           1 = Buy   0 = No buy   Sum          Proportion correct
1 = Buy         14        2            16           0.875   Sensitivity
0 = No buy      3         11           14           0.786   Specificity
Total           17        13           30           0.833   Hit rate
Due to its non-linearity, the logistic regression function needs to be estimated with the
maximum likelihood method, instead of the least-squares method.
The maximum likelihood (ML) principle states: Determine the estimated values for
the unknown parameters in such a way that the realized data attain maximum plausibility
(likelihood).12 Or, in other words, maximize the probability of obtaining the observed data.
For the estimation of the logistic regression model this means that for a person i
the probability p(xi) should be as large as possible if yi = 1, and as small as possible if
yi = 0. This can be summarized by the following expression, which should be as large as
possible:
12 The principle of the ML method goes back to Daniel Bernoulli (1700–1782), a nephew of Jakob
Bernoulli. Ronald A. Fisher (1890–1962) analyzed the statistical properties of the ML method and
paved the way for its practical application and dissemination. Besides the Least-Squares method,
the ML method is the most important statistical estimation method.
Fig. 5.11 ROC curve for multiple logistic regression (AUC = 0.813)
L(a, b) = ∏(i=1..N) p(xi)^yi · [1 − p(xi)]^(1−yi) → Max!   (5.24)

LL(a, b) = ∑(i=1..N) ( ln[p(xi)] · yi + ln[1 − p(xi)] · (1 − yi) ) → Max!   (5.25)
[Fig. 5.12: course of the log-likelihood LL (0 down to −50) plotted over the coefficient b (0 to 4)]
The log-likelihood function LL can only assume negative values, because the loga-
rithm of a probability is negative. The maximization of LL, therefore, means that the
value of LL comes as close as possible to the value 0. LL = 0 would result if the proba-
bilities of the chosen alternatives were all 1 and thus the probabilities for the non-chosen
alternatives were all 0.
Figure 5.12 illustrates the course of LL with the variation of the coefficient b. For
b = 1, the value for LL is −28. The maximum is LL = −18.027. It is achieved with
b = 1.83, our estimation for Model 3.
The solution to this optimization problem, i.e. maximizing the log-likelihood func-
tion, requires the application of iterative algorithms. This can be done by using qua-
si-Newton or gradient methods.13 These methods need a lot of processing capacity, but
with today’s computing power, this is of little importance. The fact that iterative algo-
rithms may not always converge or can get stuck in a local optimum is usually more of a
13 For logistic regression, quasi-Newton methods are primarily used, which converge quite quickly.
These methods are based on Newton’s method for finding the zero of a function. They use the first
and second partial derivatives of the LL function according to the unknown parameters to find the
optimum. The derivatives are approximated differently depending on the method. Special methods
are the Gauss–Newton method and its further development, the Newton–Raphson method. In the
meantime, the method of Iteratively Reweighted Least Squares (IRLS) is also widely used. Cf. e.g.
Agresti (2013, pp. 149 ff.); Fox (2015, pp. 431 ff.); Press et al. (2007, pp. 521 ff.).
problem. This danger, however, does not exist here since the LL function is concave and
therefore only one global optimum exists.14
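The iterative Newton-type estimation described above can be sketched in a few lines. The data below are a small illustrative sample (not the 30 persons of the chapter), and the model has a single predictor, z = a + b·x:

```python
import math

# Sketch: Newton-Raphson maximization of the log-likelihood (Eq. 5.25)
# for a logistic model with one predictor. Illustrative data only.
x = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
y = [0, 0, 1, 0, 1, 1]

a, b = 0.0, 0.0
for _ in range(25):
    p = [1 / (1 + math.exp(-(a + b * xi))) for xi in x]
    # gradient of LL with respect to a and b
    ga = sum(yi - pi for yi, pi in zip(y, p))
    gb = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
    # negative Hessian (observed information matrix)
    w = [pi * (1 - pi) for pi in p]
    haa = sum(w)
    hab = sum(wi * xi for wi, xi in zip(w, x))
    hbb = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = haa * hbb - hab * hab
    # Newton step: (a, b) += H^-1 * gradient
    a += (hbb * ga - hab * gb) / det
    b += (haa * gb - hab * ga) / det

print(round(a, 2), round(b, 2))
```

Because the log-likelihood is concave, the iteration converges to the unique global maximum; at the solution the gradient (the score equations) is zero.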
Due to the non-linearity of the logistic model, the interpretation of the model parameters
is more difficult than in other methods for analyzing dependencies (e.g., linear regres-
sion, analysis of variance). Figure 5.13 shows three diagrams (a–c) that illustrate the
effects of changes in the parameters of the logistic model with a single predictor.
The coefficient b determines how the independent variable x affects the dependent var-
iable p. A difficulty of interpretation of the logistic model results from the fact that the
effects on the dependent variable are not constant for equal changes in x:
14 McFadden (1974) has shown that with a linear systematic component of the logistic model, the
LL function is globally concave, which makes maximization much easier.
[Three panels, p (0 to 1) over X (−5 to 5): a) curves for a = 2, 0, −2; b) curves for b = 2, 1, 0.5; c) curves for b = 1 and b = −1]
Fig. 5.13 Curves of the logistic function for different values of the parameters a and b
• In linear regression, each change of x by one unit causes a constant change b in the
dependent variable.
• In logistic regression, the effect of a change in x on the dependent variable p also
depends on the value of p. The effect is greatest when p = 0.5, and the more p deviates
from 0.5, the smaller becomes the change in p.
At position p, the slope of the logistic function is p(1 − p) b, and thus for p = 0.5 the
slope will be 0.25 b. At p = 0.01 or p = 0.99, however, the slope will only be 0.01 b.
Because of the curvature of the logistic function, this describes only approximately the
change of p due to a change in x by one unit. The smaller the changes in x, the better the
approximation.
Example
We will illustrate the effects of a change in x with a numerical example. For a logistic
regression with a single predictor (Model 3), we had estimated the following function:
p(x) = 1 / (1 + e^−(a + b·x)) = 1 / (1 + e^(3.67 − 1.827x))   (5.26)
With an income of 2000 EUR (x = 2) we get approximately p = 0.5. For this value,
a change in income has the greatest effect on p. If the income increases in steps of 1
unit (1000 EUR), then p increases with decreasing rates.
Column 1a in Table 5.12 shows how p increases with increasing x. Column 2a
shows the increments of change (difference to the previous value). One can see that
the increments become smaller. This follows from the curvature of the logistic func-
tion. ◄
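The decreasing increments and the slope approximation p(1 − p)·b can be reproduced as follows (a sketch using the estimates of Eq. (5.26); the printed columns mirror the idea of Table 5.12, not its exact layout):

```python
import math

# Sketch: how p changes when income x increases in steps of one unit
# (Eq. 5.26), and how the tangent slope p(1 - p)*b compares with the
# actual increment for a full unit step.
a, b = -3.67, 1.827

def p(x):
    return 1 / (1 + math.exp(-(a + b * x)))

for x in [2, 3, 4]:
    increment = p(x + 1) - p(x)      # actual change of p
    slope = p(x) * (1 - p(x)) * b    # tangent slope at x
    print(x, round(p(x), 3), round(increment, 3), round(slope, 3))
```

Around p = 0.5 (x ≈ 2) the increment is largest; further out the increments shrink, and the tangent slope overstates the actual unit-step change because of the curvature.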
Odds
Besides probability, the “odds” are another concept to express the occurrence of random
events. Both concepts originate from gambling, with odds probably being older than
probability, and some prefer them to probability.15 Thus, when rolling dice, the odds to roll a
six are defined as follows:
odds ≡ favorable events / unfavorable events = 1/5
We state: “The odds to roll a six are 1 to 5.”
15 The term “odds” is used only in the plural. The concept of odds and its usefulness was described
by the Italian mathematician and physician Gerolamo Cardano (1501–1576), who supported
himself by gambling. In his “Book on Games of Chance” he wrote the first treatise on proba-
bility. The theory of probability emerged only later, in the seventeenth century, with the works
of the scientists Pierre de Fermat (1601–1665), Blaise Pascal (1623–1662) and Jakob Bernoulli
(1655–1705).
To illustrate the effect of a change of x on the odds, we insert x + 1 into the odds function odds(x) = p(x)/(1 − p(x)) = e^(a + bx) and get:

odds(x + 1) = e^(a + b(x+1)) = odds(x) · e^b   (5.31)

Thus, an increase of x by one unit changes the odds by the factor e^b.
Logit
The logit of a probability p is defined by:
logit(p) ≡ ln( p / (1 − p) )   (5.33)
Table 5.13 Effects of an increase in x by one unit (with positive and negative regression
coefficients)

An increase of x to x + 1 has the following effects:

             b > 0                              b < 0
p            Increase by roughly p(1 − p)·b     Reduction by roughly p(1 − p)·|b|
Odds         Increase by the factor e^b         Reduction by the factor e^−|b| = 1/e^|b|
Logit        Increase by the value of b         Reduction by the value of |b|
Odds ratio   e^b > 1                            e^b < 1
“Logit” is a short form for logarithmic odds (also log-odds) and can also be defined as16 :
logit(p) ≡ ln odds(p) (5.34)
By transforming the odds into logits, the range is extended to [−∞, +∞] (see Fig. 5.2,
right panel).
Following Eq. (5.31), the odds increase by the factor e^b if x increases by one unit. Thus, the logits (log-odds) increase by

ln(e^b) = b   (5.35)
So in our Model 3, the logits increase by b = 1.827 units if x increases by one unit (cf.
column 2c in Table 5.12).
This makes it easy to calculate with logits. The coefficient b represents the marginal
effect of x on logits, just as b is the marginal effect of x on Y in linear regression. If we
know the logits, we can calculate the corresponding probabilities, e.g. with Eq. (5.10):
p(x) = 1 / (1 + e^−z(x))
Thus, logits are usually not computed from probabilities,17 as Eq. (5.33) might suggest.
Instead, logits are used for computing probabilities, e.g. with Eq. (5.10). Table 5.13 sum-
marizes the effects described above.
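The effects summarized in Table 5.13 can be verified numerically with the estimates of Model 3 (a sketch; the value x = 2 is an arbitrary illustration):

```python
import math

# Sketch: a one-unit increase in x raises the logit by b and multiplies
# the odds by e^b (Table 5.13), shown for Model 3 (a = -3.67, b = 1.827).
a, b = -3.67, 1.827

def logit(x):
    return a + b * x

def odds(x):
    return math.exp(logit(x))   # odds = e^(a + b*x)

x = 2.0
print(round(logit(x + 1) - logit(x), 3))   # additive effect on the logit
print(round(odds(x + 1) / odds(x), 3))     # multiplicative effect on the odds
```

The additive effect on the logit is exactly b, regardless of x; the corresponding odds ratio e^b ≈ 6.2 is likewise constant, which is what makes logits and odds convenient for interpretation.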
16 The name “logit” was introduced by Joseph Berkson in 1944, who used it as an abbreviation for
“logistic unit”, in analogy to the abbreviation “probit” for “probability unit”. Berkson contributed
strongly to the development and popularization of logistic regression.
17 That is the reason why we used the “equal by definition” sign in Eqs. (5.33) and (5.34).
The odds ratio (OR) is the ratio of the odds of two groups (populations), e.g. men and women (or test group and control group), thus indi-
cating the difference between the two groups.
If we calculate the odds ratio for two values of a metric variable (as we have done
for Income above), its size depends on the unit of measurement of that variable and is
therefore not very meaningful. In the example above, we get OR ≈ 6 for an increase of
income x by one unit. The odds ratio is large because the unit of x is [1000 EUR].
The situation is different with binary variables, which can take only the values 0 and
1 and thus have no unit. In Model 4, we included the gender of the persons as a predictor
and estimated the following function:
p = 1 / (1 + e^−(a + b1x1k + b2x2k)) = 1 / (1 + e^−(−5.635 + 2.351x1k + 1.751x2k))
The binary variable Gender indicates two groups, men and women.
With an average income of 2 [1000 EUR], the following probabilities may be calcu-
lated for men and women:
Men:	pm = 1 / (1 + e^−(−5.635 + 2.351·2 + 1.751·1)) = 0.694

Women:	pw = 1 / (1 + e^−(−5.635 + 2.351·2 + 1.751·0)) = 0.283
This results in the corresponding odds ratios:
ORm = oddsm / oddsw = [pm/(1 − pm)] / [pw/(1 − pw)] = 2.267/0.393 = 5.8

ORw = oddsw / oddsm = [pw/(1 − pw)] / [pm/(1 − pm)] = 0.393/2.267 = 0.17
A man’s odds for buying are roughly six times higher than a woman’s. A woman’s odds
are less than 20% of a man’s odds.18 This seems to be a very large difference between
men and women.
Another, similar measure for the difference of two groups is the relative risk (RR),
which is the ratio of two probabilities.19 Analogously to the odds ratios we obtain here:
RRm = pm / pw = 0.694/0.283 = 2.5

RRw = pw / pm = 0.283/0.694 = 0.41
18 Alternatively,we may calculate the odds ratios with Eq. (5.32): ORm = eb2 = e1.751 = 5.76
and ORw = e−b2 = e−1.751 = 0.174.
19 In common language, the term risk is associated with negative events, such as accidents, illness
or death. Here the term risk refers to the probability of any uncertain event.
According to this measure, a man is 2.5 times more likely to buy the chocolate than a
woman at the given income. The values of RR are significantly smaller (or, more gener-
ally, closer to 1) than the values of the odds ratio OR and often come closer to what we
would intuitively assume. However, OR can also be used in situations where calculating
RR is not possible.20 The odds ratio, therefore, has a broader range of applications than
the relative risk.
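The two measures can be computed side by side (a sketch; starting from the rounded probabilities above, so the printed values differ from the text's 5.8 and 2.5 in the last digit):

```python
# Sketch: odds ratio vs. relative risk for men and women at an income
# of 2000 EUR, using the (rounded) probabilities derived above.
p_m, p_w = 0.694, 0.283

odds_m = p_m / (1 - p_m)    # odds for men
odds_w = p_w / (1 - p_w)    # odds for women
OR = odds_m / odds_w        # odds ratio
RR = p_m / p_w              # relative risk

print(round(OR, 2), round(RR, 2))
```

The odds ratio exaggerates the group difference relative to the relative risk whenever the probabilities are not small; for rare events the two measures nearly coincide.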
Once we have estimated a logistic model, we need to assess its quality or goodness-of-fit
since no one wants to rely on a bad model. We need to know how well our model fits the
empirical data and whether it is suitable as a model of reality. For this purpose, we need
measures for evaluating the goodness-of-fit. Such measures are:
20 This can be the case in so-called case-control studies where groups are not formed by random
sampling. Thus the size of the groups cannot be used for the estimation of probabilities. Such stud-
ies are often carried out for the analysis of rare events, e.g. in epidemiology, medicine or biology
(cf. Agresti, 2013, pp. 42–43; Hosmer et al., 2013, pp. 229–230).
A simple measure that is used quite often is the value –2LL (= −2 · LL). Since LL is
always negative, −2LL is positive. A small value for −2LL thus indicates a good fit of
the model for the available data. The factor 2 ensures that the resulting test statistic is
approximately chi-square distributed (see Sect. 5.2.4.1).
For Model 4, with the systematic component z = a + b1x1 + b2x2, we get
−2LL = −2 · (−16.053) = 32.11. For the simplest model, which contains only a constant, the
systematic component reduces to z = a = 0.134, and we get −2LL = −2 · (−20.728) = 41.46.
This primitive model is called the null model (constant-only model, 0-model) and it
has no meaning by itself. But it serves to construct the most important statistic for testing
the fit of a logistic model, the likelihood ratio statistic.
The logarithm of the ratio of likelihoods is thus equal to the difference in the log-like-
lihoods. With the above values from our example for multiple logistic regression
(Model 4) we get:
LLR = −2 · (LL0 − LLf ) = −2 · (−20.728 + 16.053) = 9.35
Under the null hypothesis H0: β1 = β2 = … = βJ = 0, the LR statistic is approximately
chi-square distributed with J degrees of freedom (df).21 Thus, we can use LLR to test the
statistical significance of a fitted model. This is called the likelihood ratio test (LR test),
which is comparable to the F-test in linear regression analysis.22
The tabulated chi-square value for α = 0.05 and 2 degrees of freedom is 5.99. Since
LLR = 9.35 > 5.99, the null hypothesis can be rejected and the model is considered to be
statistically significant. The p-value (empirical significance level) is only 0.009 and the
model can be regarded as highly significant.23 Figure 5.14 illustrates the log-likelihood
values used in the LR test.
21 Thus, in SPSS the LLR statistic is denoted as chi-square. For the likelihood ratio test statistic see,
e.g., Agresti (2013, p. 11); Fox (2015, pp. 346–348).
22 For a brief summary of the basics of statistical testing see Sect. 1.3.
23 We can calculate the p-value with Excel by using the function CHISQ.DIST.RT(x;df). Here, we get CHISQ.DIST.RT(9.35;2) = 0.009.
LLR = −2 · ln( Lr / Lf ) = −2 · (LLr − LLf)   (5.38)
with
LLr maximized log-likelihood for the reduced model (Model 3)
LLf maximized log-likelihood for the full model (Model 4)
With the above values we get:
LLR = −2 · (LLr − LLf ) = −2(−18.027 + 16.053) = 3.949
The LLR statistic is again approximately chi-square distributed, with the degrees of free-
dom resulting from the difference in the number of parameters between the two models.
In this case, with df = 1, we get a p-value of 0.047. Thus, the improvement of Model 4
compared to Model 3 is statistically significant for α = 0.05. A prerequisite for applying
the chi-square distribution is that the models are nested, i.e., the variables of one model
must be a subset of the variables of the other model.
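Both likelihood ratio tests of this section can be reproduced from the three maximized log-likelihoods. The sketch below uses closed-form chi-square survival functions for df = 1 and df = 2, so no statistics library is needed (the helper name chi2_sf is ours):

```python
import math

# Sketch: the overall LR test (null model vs. Model 4, df = 2) and the
# nested test (Model 3 vs. Model 4, df = 1), with stdlib-only p-values.
LL0, LLr, LLf = -20.728, -18.027, -16.053

def chi2_sf(x, df):
    # chi-square survival function, closed forms for df = 1 and df = 2
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("only df = 1 or 2 implemented in this sketch")

llr_global = -2 * (LL0 - LLf)   # overall model test, df = 2
llr_nested = -2 * (LLr - LLf)   # Model 4 vs. Model 3, df = 1

print(round(llr_global, 2), round(chi2_sf(llr_global, 2), 3))
print(round(llr_nested, 3), round(chi2_sf(llr_nested, 1), 3))
```

The p-values reproduce the 0.009 (overall test) and 0.047 (nested test) reported in the text.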
5.2.4.2 Pseudo-R-Square Statistics
There have been many efforts to create a similar measure for goodness-of-fit in logis-
tic regression as the coefficient of determination R2 in linear regression. These efforts
resulted in the so-called pseudo-R-square statistics. They resemble R2 insofar as their
values lie between 0 and 1 and larger values indicate a better fit. However, the
pseudo-R2 statistics do not measure a proportion of explained variation. They are based on the
ratio of two likelihoods, like the likelihood ratio statistic.
There are three different versions of pseudo-R-square statistics.
(a) McFadden’s R2
McF-R² = 1 − LLf / LL0 = 1 − (−16.053)/(−20.728) = 0.226   (5.39)
In contrast to the LLR statistic, which uses the logarithm of the ratio of likelihoods,
McFadden uses the ratio of the log-likelihoods. In case of a small difference between the
two log-likelihoods (of the fitted model and the null model), the ratio will be close to 1,
and McF − R2 thus close to 0. This means the estimated model is not much better than
the 0-model. Or, in other words, the estimated model is of no value.
If there is a big difference between the two log-likelihoods, it is exactly the other way
round. But with McFadden’s R2 it is almost impossible to reach values close to 1 with
empirical data. For a value of 1 (perfect fit), the likelihood would have to be 1, and thus
the log-likelihood, 0. The values are therefore in practice much lower than for R2. As
a rule of thumb, values from 0.2 to 0.4 can be considered to indicate a good model fit
(Louviere et al., 2000, p. 54).
(b) Cox & Snell's R2

R²CS = 1 − (L0 / Lf)^(2/N) = 1 − (exp(−20.728) / exp(−16.053))^(2/30) = 0.268   (5.40)
The Cox & Snell R2 can take only values <1, as L0 will always be >0. Thus, it will
deliver values <1 even for a perfect fit.
(c) Nagelkerke’s R2
R²Na = R²CS / (1 − L0^(2/N)) = 0.268 / (1 − exp(−20.728)^(2/30)) = 0.358   (5.41)
The pseudo-R2 according to Nagelkerke is based on the statistics of Cox and Snell. It
modifies it in such a way that a maximum value of 1 can be reached.
For our model, all three pseudo-R2 statistics provide rather low values, although the
model achieves quite a good fit and high significance. The values are below what would
be expected for R2 used in linear regression.
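All three pseudo-R² statistics can be computed from the two log-likelihoods alone (a sketch following Eqs. 5.39–5.41; the power (L0/Lf)^(2/N) is evaluated on the log scale to avoid under- or overflow):

```python
import math

# Sketch: McFadden, Cox & Snell and Nagelkerke pseudo-R-square for
# Model 4, from LL0 = -20.728, LLf = -16.053 and N = 30.
LL0, LLf, N = -20.728, -16.053, 30

mcfadden = 1 - LLf / LL0                          # Eq. (5.39)
cox_snell = 1 - math.exp(2 / N * (LL0 - LLf))     # Eq. (5.40), log-scale form
nagelkerke = cox_snell / (1 - math.exp(2 / N * LL0))  # Eq. (5.41)

print(round(mcfadden, 3), round(cox_snell, 3), round(nagelkerke, 3))
```

The output reproduces the values 0.226, 0.268 and 0.358 given in the text.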
However, this approach is only useful if a sufficiently large sample is available since the
reliability of the estimated logistic regression function decreases along with the size of
the training set. Besides, in this case the existing information is used only incompletely.
Better ways to achieve undistorted hit rates are provided by cross-validation meth-
ods (cf. Hastie et al., 2011, p. 241; James et al., 2014, p. 175) such as the leave-one-out
method (LOO). An element of the sample is singled out and classified using the logis-
tic regression function whose estimation is based on the other elements. This is then
repeated for all elements of the sample. In this way, an undistorted classification table
can be obtained by making full use of the available information. However, the method
is quite complex and therefore only practicable with small sample sizes. In our example,
using the LOO method yields 3 hits less, reducing the hit rate from 83.3% to 73.3%.
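The LOO procedure can be expressed as a small generic skeleton. The fit and predict stand-ins below are deliberately crude (a nearest-group-mean classifier on toy data), only to make the sketch runnable; in the text the roles are played by ML estimation of the logistic model and the cutoff classification:

```python
# Sketch of the leave-one-out (LOO) idea: each observation is classified
# by a model fitted on the remaining N-1 observations.
def loo_hit_rate(data, fit, predict):
    hits = 0
    for i in range(len(data)):
        training = data[:i] + data[i + 1:]   # all observations except i
        model = fit(training)
        x_i, y_i = data[i]
        hits += int(predict(model, x_i) == y_i)
    return hits / len(data)

# toy stand-ins: 'fit' stores the mean income of buyers and non-buyers,
# 'predict' assigns the class with the nearer mean
def fit(train):
    buy = [x for x, y in train if y == 1]
    no = [x for x, y in train if y == 0]
    return (sum(buy) / len(buy), sum(no) / len(no))

def predict(model, x):
    mean_buy, mean_no = model
    return 1 if abs(x - mean_buy) < abs(x - mean_no) else 0

data = [(1.2, 0), (1.8, 0), (2.1, 1), (2.4, 0), (3.0, 1), (3.3, 1)]
print(loo_hit_rate(data, fit, predict))
```

Because each case is predicted by a model that never saw it, the resulting hit rate is an (almost) unbiased estimate of out-of-sample accuracy, at the price of N model fits.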
With regard to assessing the quality of the underlying model, the problem arises that
the classification table and thus also the hit rate can change with a change in the selected
cutoff value. Therefore, we used the ROC curve (receiver operating characteristic) above,
which summarizes the classification tables for the possible cutoff values, as a generalized
concept. The area under the ROC curve, known as AUC, is a measure of the model’s pre-
dictive power (see Sect. 5.2.1.4).
The residuals are the differences ei = yi − pi (observed values minus estimated probabilities). Since y can only take the values 0 or 1
and p can take values in the range from 0 to 1, the residuals will take values from −1 to
+1. As with linear regression, the sum of the residuals is equal to zero.
To judge the size of a residual, it is advantageous to standardize its value by dividing
it by its standard deviation. Since in logistic regression the observations follow a Bernoulli dis-
tribution, we get:
zi = (yi − pi) / √(pi(1 − pi))   (5.42)
With a large sample size N, the standardized residuals are approximately normally dis-
tributed with a mean value of 0 and a standard deviation of 1. They are also referred to
as Pearson residuals. Table 5.14 presents the calculated residuals and Fig. 5.15 shows a
scatterplot of the standardized residuals z (std. resid.).
The sum of the squared Pearson residuals gives the Pearson chi-square statistic:
X² = ∑(i=1..N) (yi − pi)² / (pi(1 − pi)) = 30.039   (5.43)
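The standardized residuals of Eq. (5.42), their squared sum (Eq. 5.43), and the 2-sigma outlier screen can be sketched as follows. The y/p values are illustrative, not those of Table 5.14:

```python
import math

# Sketch: Pearson residuals, the Pearson chi-square statistic, and
# flagging observations beyond 2 standard deviations (2-sigma rule).
y = [1, 0, 1, 0, 1]
p = [0.8, 0.3, 0.4, 0.1, 0.9]

z = [(yi - pi) / math.sqrt(pi * (1 - pi)) for yi, pi in zip(y, p)]
x2 = sum(zi ** 2 for zi in z)                             # Pearson chi-square
outliers = [i for i, zi in enumerate(z) if abs(zi) > 2]   # indices to inspect

print([round(zi, 2) for zi in z], round(x2, 2), outliers)
```

In the chapter's data, this screen flags persons 2 and 23, whose residuals exceed 2 in absolute value.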
Detecting Outliers
Outliers can be identified by visual inspection of a scatterplot (like the one in Fig. 5.15)
or automatically by setting a cutoff value. In SPSS the default cutoff value is 2 [standard
deviations]. Observations with a standardized residual that is larger than 2 (in absolute
values) have a probability of occurrence below 5% (2-sigma rule). This can be seen as
a signal indicating the need for further investigation. Table 5.14 shows two such obser-
vations, namely persons 2 and 23, which can also be detected in Fig. 5.15 (the markers
outside of the two red lines).
We will take a closer look at observation 23 with a residual z = 2.522 [std. deviations].
This is a woman with an income of 1610 EUR, which is clearly below average. She
bought the gourmet chocolate although the results of our analysis indicate that the tested
chocolate is more likely to be bought by men with a higher income. In Fig. 5.15 this
observation is represented by the marker above the upper red line.
Influential Observations
Observations that have a strong influence on the estimated parameters are called influen-
tial observations. We want to investigate if person 23 is such an influential observation.
The easiest way to do so is to eliminate the conspicuous person from the data set and
repeat the estimation of the model.
Equation (5.22) above gives the estimated regression function for all data. Now we
compare this with the regression function we get after removing observation 23. The
results (expressed in logits) are:
[Fig. 5.15: scatterplot of the standardized (Pearson) residuals z for the 30 persons, with reference lines at z = ±2]
The second outlier here is observation 2 with a residual z = −2.326. This is a man
with an income above average who did not buy the chocolate. In Fig. 5.15 he is repre-
sented by the marker below the lower red line. After eliminating this observation we get:
Here the effect is not quite so strong. The parameter values are somewhere between the
previous results. There are two reasons for this:
For calculating the influence of an outlier we can thus use the simplified formula:
influence = size × leverage
Size is a function of the dependent variable (here: y or p), and leverage is a function
of the independent variable(s) X (here: Income). Or more precisely, the leverage effect
depends on the distance of the x-value from the center. In Fig. 5.16 the center (mean
income x = 2115 EUR) is marked by the dashed line. The income of the woman (person
23) is clearly further away from the center than the income of the man (person 2). Thus,
the woman here has greater leverage than the man.
If influential outliers have been detected, further analysis is necessary. Values can be
incorrect due to mistakes in measurement or data entry. If this is the case, they should
be corrected (if possible) or eliminated. Other reasons may be unusual events inside or
[Fig. 5.16: scatterplot of the standardized (Pearson) residuals z plotted over income X1, with the mean income marked by a dashed line]
outside the research context. In the first case, we should try to change the model specifi-
cation, in the second case, the outliers should be eliminated.
If we cannot identify any cause for the outliers, we have to assume that they are due
to random fluctuation. In this case, the outliers must be kept in the analysis. Dropping
outliers without good reason would constitute a manipulation of the regression results. A
value should only be eliminated if we know that it is incorrect and cannot be corrected.
Besides the quality of a model, information on the influence and importance of individ-
ual predictors is usually of interest. For this purpose, we have to examine the estimated
coefficients.
In linear regression analysis, the t-test is commonly used for testing if a coefficient
differs significantly from 0 and is thus of importance. Alternatively, one can also use the
F-test. Both tests will provide identical results.
Two tests commonly used in logistic regression are the Wald test and the likelihood
ratio test.24 The latter one we already used to evaluate the overall quality of the model.
However, these two tests do not always provide the same results.
24 Both tests are used in SPSS, but the LR test is only used in the NOMREG procedure for multi-
nomial logistic regression, not in binary Logistic Regression.
25 Named after the Hungarian mathematician Abraham Wald (1902–1950). For the Wald test see
The likelihood ratio test is much more computationally expensive than the Wald test
since a separate ML estimation must be carried out for each of the coefficients. For this
reason, the Wald test is often preferred. It can, however, be misleading because it systematically provides larger p-values than the likelihood ratio test.26 Consequently, it may
fail to indicate the significance of a coefficient (i.e., fail to reject a false null hypothesis), as is the
case here for the variable Gender. The likelihood ratio test is therefore clearly the better
test. The Wald test should only be used for large samples, as in this case the results of the
two tests converge.
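The logic of the two tests can also be reproduced outside SPSS. The following Python sketch uses simulated data and our own function names (not the book's dataset or values): it fits a binary logistic model by Newton-Raphson and then computes the Wald test and the likelihood ratio test for the gender-type coefficient.

```python
import math
import numpy as np

def fit_logit(X, y, iters=25):
    """ML estimation of a binary logit via Newton-Raphson.
    Returns coefficients, standard errors and the log-likelihood."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ b))
        H = X.T @ (X * (p * (1 - p))[:, None])   # information matrix
        b = b + np.linalg.solve(H, X.T @ (y - p))
    p = 1 / (1 + np.exp(-X @ b))
    H = X.T @ (X * (p * (1 - p))[:, None])
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    se = np.sqrt(np.diag(np.linalg.inv(H)))
    return b, se, ll

def chi2_sf_1df(x):
    """p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

# Simulated data loosely resembling the example (income in 1000 EUR)
rng = np.random.default_rng(7)
income = rng.uniform(1.0, 4.0, 200)
gender = rng.integers(0, 2, 200).astype(float)
z_true = -5.6 + 2.35 * income + 1.75 * gender    # assumed "true" coefficients
y = (rng.random(200) < 1 / (1 + np.exp(-z_true))).astype(float)

X_full = np.column_stack([np.ones(200), income, gender])
b, se, ll_f = fit_logit(X_full, y)

# Wald test for the gender coefficient: W = (b/se)^2, approx. chi-square, df = 1
wald = (b[2] / se[2]) ** 2
p_wald = chi2_sf_1df(wald)

# Likelihood ratio test: refit without gender, LLR = -2 * (LL0j - LLf)
_, _, ll_0 = fit_logit(X_full[:, :2], y)
llr = -2 * (ll_0 - ll_f)
p_lr = chi2_sf_1df(llr)
print(f"Wald p = {p_wald:.4f},  LR p = {p_lr:.4f}")
```

With large samples such as this one, the two p-values are typically very close, in line with the convergence noted above.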
It should also be noted that the likelihood ratio test we used here for testing the sig-
nificance of b2 (coefficient of the variable Gender) is identical to the test we performed
according to Eq. (5.38) for the comparison of Model 4 with Model 3. The significance of
the improvement of Model 4 versus Model 3 by including the variable Gender is equal to
the significance of the coefficient of the variable Gender.
In SPSS, as already mentioned, there are two procedures for performing logistic regres-
sion analyses, the procedure ‘Logistic Regression’ for binary logistic regression and the
procedure ‘Multinomial Logistic Regression’ (called NOMREG in SPSS). Both proce-
dures can be accessed via the menu item ‘Analyze/Regression’. Since the NOMREG pro-
cedure will be used for the case study in Sect. 5.4, we will show here extracts from the
output of the procedure ‘Logistic Regression’ for our exemplary dataset (the results above
were derived with MS Excel) and point out some differences between the procedures.27
The procedure ‘Logistic Regression’ can be reached via the menu items ‘Analyze/
Regression/Binary Logistic …’. When clicking on the latter item, the dialog box for
‘Logistic Regression’ shown in Fig. 5.17 opens. The binary variable ‘Buy’ is to be
entered as ‘Dependent’ and the variables ‘Income’ and ‘Gender’ are to be entered as
‘Covariates’.28
In the dialog box ‘Options’ (Fig. 5.18) a cutoff value for a classification table can be
specified. The default ‘Classification cutoff’ is 0.5. For obtaining a list of outliers, select
‘Casewise listing of residuals’. If you click on ‘Continue’ and ‘OK’, the data of Table 5.2
will result in the output shown in Fig. 5.19.
26 The reason is that the standard error becomes too large, especially if the absolute value of the
coefficient is large. This makes the Wald statistic too small and the p-value too large (as found by
Hauck & Donner, 1977). Agresti (2013, p. 169), points out that the likelihood ratio test uses more
information than the Wald test and is therefore preferable.
27 The user can find all Excel files used in this chapter on the website www.multivariate-methods.
info.
28 Binary logistic regression can also be performed using the SPSS syntax as shown in Sect. 5.4.4
(Fig. 5.42).
Fig. 5.20 Estimated regression coefficients, results of the Wald test and effect coefficients (Exp (B))
In the upper part, below the heading “Omnibus Tests of Model Coefficients”,
Fig. 5.19 shows the result of the likelihood ratio test according to Eq. (5.37), namely the
value of LLR (in SPSS denoted as ‘Chi-square’) and the corresponding p-value (denoted
as ‘Sig.’ for significance level).
Below the heading “Model Summary”, three further measures of quality are given: the
−2 log-likelihood, Cox & Snell’s R-square, and Nagelkerke’s R-square. McFadden’s R2 is only obtained by the procedure NOMREG. It is also stated in the output that the ML estimate required 5 iterations.
Figure 5.20 shows the estimated regression coefficients and the results of the Wald
test and corresponds to Table 5.15. Besides, the effect coefficients calculated according
to Eq. (5.31) are listed in the rightmost column. In SPSS, the likelihood ratio test of the
Table 5.16 Checking the regression coefficients with the likelihood ratio test
bj LL0j LLf LLRj p-value
Constant −5.635 −19.944 −16.053 7.783 0.005
Income 2.351 −19.643 −16.053 7.181 0.007
Gender 1.751 −18.027 −16.053 3.949 0.047
regression coefficients, as shown in Table 5.16, can only be obtained with the procedure
NOMREG.
Figure 5.21 shows the classification table and is consistent with the classification in
Table 5.11. But rows and columns are exchanged in SPSS according to the coding 0 and 1.
The cutoff value can be changed by the user. This is not possible in the procedure
NOMREG. Here, the default value is 0.5, which we also used above.
If you selected ‘Casewise listing of residuals’, you will now get the table in Fig. 5.22
which shows the two outliers (persons 2 and 23). For each person the following details
are indicated:
To generate the ROC curve for assessing the classification table, the estimated proba-
bilities (as given in Table 5.14) must first be generated and saved in the work file. To
do this, select the option ‘Save’ in the dialog box ‘Logistic Regression’ and then select
‘Probabilities’. This causes a variable ‘PRE_1’ to be created after the analysis has been
performed, which contains the estimated probabilities pi = P(Y = 1). The variable PRE_1
will appear in the work file.
Under the menu item ‘Analyze/ROC curve’ you reach the procedure ROC. There you
have to choose the variable ‘PRE_1’ as the ‘Test Variable’ and the variable ‘Buy’ as the
‘State Variable’. Besides, for ‘The Value of State Variable’ we must indicate the value of
Y to which the probabilities apply. In our case, this is 1 (= Buyer). For the display you
should select ‘With diagonal reference line’ and ‘Standard error and confidence interval’.
The output is shown in Figs. 5.23 and 5.24.
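The AUC that SPSS reports here can also be computed by hand. As a sketch (with made-up outcomes and probabilities standing in for the variables Buy and PRE_1), the AUC equals the probability that a randomly chosen buyer receives a higher predicted probability than a randomly chosen non-buyer:

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs in which the
    positive case gets the higher score; ties count half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Made-up outcomes (Buy) and predicted probabilities (stand-in for PRE_1)
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p = np.array([0.2, 0.4, 0.7, 0.3, 0.6, 0.9, 0.5, 0.45])
print(roc_auc(y, p))   # → 0.9375
```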
If logistic regression is extended to more than two categories (groups, events), it is called
multinomial logistic regression. The dependent variable Y can now take the values g = 1,
…, G. As before, x = (x1, x2, …, xJ) is a set of values (vector) of the J independent varia-
bles (predictors).
Y now designates a multinomial random variable. Analogous to Eq. (5.3), we denote
the conditional probability P(Y = g|x) for the occurrence of event g, given values x by
πg(x) = P(Y = g|x)    (g = 1, …, G)    (5.46)
and the following applies:
∑_{g=1}^{G} πg(x) = 1.
For a better understanding of the model for multinomial logistic regression, we will write
the model for binary logistic regression in a somewhat different form. In Eq. (5.7) we
expressed the binary logistic regression model for the probability P(Y = 1|x) by
Fig. 5.24 Area under the ROC curve (AUC) with p-value and confidence interval
π(x) = 1 / (1 + e^{−z(x)})    with    z(x) = α + β1x1 + · · · + βJxJ
Now we express the binary logistic regression model by two formulas, one for each
category:
P(Y = 1|x):  π1(x) = e^{z1(x)} / (e^{z1(x)} + e^{z0(x)})    (5.47)

P(Y = 0|x):  π0(x) = e^{z0(x)} / (e^{z1(x)} + e^{z0(x)})    (5.48)
Since the two probabilities must add up to 1, the parameters in one of the two equations
are redundant. Thus, to identify the parameters, we have to fix them in one equation.
We choose the second equation for this purpose. Setting the parameters in z0 (x) to zero,
we get z0 (x) = 0.29 Then we can write:
π1(x) = 1 / (1 + e^{−z(x)}) = 1 / (1 + e^{−0.313}) = 0.578

π0(x) = 1 / (1 + e^{z(x)}) = 1 / (1 + e^{0.313}) = 0.422
29 An alternative is to center the parameters so that their sum across the two categories is zero.
5.3 Multinomial Logistic Regression 317
Now we can, analogous to Eq. (5.47), formulate the model of multinomial logistic
regression as follows:
πg(x) = e^{zg(x)} / ∑_{h=1}^{G} e^{zh(x)}    (g = 1, …, G)    (5.49)

Setting zG(x) = 0 for the reference category G, this can be written as

πg(x) = e^{zg(x)} / (1 + ∑_{h=1}^{G−1} e^{zh(x)})    (g = 1, …, G − 1)    (5.50)

πG(x) = 1 / (1 + ∑_{h=1}^{G−1} e^{zh(x)}) = 1 − ∑_{h=1}^{G−1} πh(x)    (5.51)
By inserting the systematic component into Eq. (5.50) we get the model of multinomial
logistic regression in the following form:
πg(x) = e^{αg + βg1x1 + … + βgJxJ} / (1 + ∑_{h=1}^{G−1} e^{αh + βh1x1 + … + βhJxJ})    (g = 1, …, G − 1)    (5.52)
The parameters of the categories g = 1 to G − 1 express the relative effect with respect
to the reference category G. If, for instance, the categories correspond to the chocolate
brands Alpia, Lindt, and Milka, then the parameters for Alpia and Lindt would express
the relative importance with respect to Milka. But, of course, we can change the order of
the brands and choose Alpia or Lindt as the reference category.
For each category of the multinomial logistic model (except the reference category
G), a logistic regression function must be formed according to Eq. (5.50). Each of these
G − 1 functions includes all parameters.
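A small Python sketch (with purely hypothetical parameter values) illustrates how the G − 1 parameterized functions and the reference category together yield probabilities that sum to 1:

```python
import numpy as np

def category_probs(x, a, B):
    """Eq. (5.50): probabilities for G categories from the G-1
    parameterized logistic functions; the reference category has z_G = 0."""
    z = a + B @ x                       # systematic components, length G-1
    ez = np.exp(z)
    denom = 1.0 + ez.sum()
    return np.append(ez / denom, 1.0 / denom)   # last entry: category G

# Hypothetical parameters: G = 3 categories, J = 2 predictors
a = np.array([0.5, -1.0])               # intercepts alpha_1, alpha_2
B = np.array([[1.2, -0.4],              # beta_11, beta_12
              [0.3, 0.8]])              # beta_21, beta_22
x = np.array([1.0, 2.0])                # predictor values

p = category_probs(x, a, B)
print(p, p.sum())                       # the G probabilities sum to 1
```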
with ygi = 1 if category g was observed in case i, and ygi = 0 in all other cases. The proba-
bilities are calculated as follows:
pg(xi) = e^{ag + bg1x1i + … + bgJxJi} / (1 + ∑_{h=1}^{G−1} e^{ah + bh1x1i + … + bhJxJi})    (g = 1, …, G)    (5.54)
To illustrate the multinomial logistic regression, we will modify our example above.
Example
Two chocolate types A and B will now be offered to the test persons, resulting in a
total of three response categories:30
• g = 1: Buying A
• g = 2: Buying B
• g = 3: No buy
According to Eq. (5.50) we want to estimate the probabilities for a certain income x:
No buy:   p3(x) = 1 / (1 + ∑_{h=1}^{2} e^{ah + bh·x}) = 1 − (p1 + p2)

Buying A: p1 = 0.097 / (1 + 0.097 + 1.473) = 0.04
Buying B: p2 = 1.473 / (1 + 0.097 + 1.473) = 0.57
No buy:   p3 = 1 / (1 + 0.097 + 1.473) = 0.39
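The arithmetic of this example can be checked in a few lines (0.097 and 1.473 are the values of e^{z1} and e^{z2} taken from the example above):

```python
ez1, ez2 = 0.097, 1.473      # e^(a1 + b1*x) and e^(a2 + b2*x) from the example
denom = 1.0 + ez1 + ez2

p1 = ez1 / denom             # Buying A
p2 = ez2 / denom             # Buying B
p3 = 1.0 / denom             # No buy

print(round(p1, 2), round(p2, 2), round(p3, 2))   # → 0.04 0.57 0.39
```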
In Table 5.17 the probabilities of buying are compared for three different incomes. With
a low income of 2000 EUR, the probability is highest for No buy, with a medium income
of 3000 EUR for Chocolate B, and with a high income of 4000 EUR for chocolate A.
The sum of the three probabilities must always be 1. The diagram in Fig. 5.25 shows the
probabilities for incomes between 1500 and 5000 EUR and illustrates the nonlinearity of
logistic regression.
We now come back to the simple binary logistic regression with the two categories
1 = Buy and 0 = No buy. According to Eq. (5.12) we can express the logistic function for
probability P(Y = 1|x) in logit form:
logit p(x) ≡ ln[ p(x) / (1 − p(x)) ] = a + b·x
For obtaining the logit we only have to determine the systematic component z for a given
value x.
If we now change the coding of the second category to 2 = No buy and choose cat-
egory 2 explicitly as the baseline (reference category), the binary model can be formu-
lated as a so-called baseline logit model:
ln[ P(Y = 1|x) / P(Y = 2|x) ] = ln[ p1(x) / p2(x) ] = a + b·x    (5.55)
Although practically nothing changes, this formulation can easily be extended to a multi-
nomial baseline logit model.
The multinomial logistic model can be represented by a set of binary logistic equa-
tions for pairs of categories. The multinomial baseline logit model comprises a set of
G − 1 logit equations:
ln[ pg(x) / pG(x) ] = zg    (g = 1, …, G − 1)    (5.56)
with the systematic components zg = ag + bg1 x1 + . . . + bgJ xJ .
Each equation describes the effects of the predictors on the dependent variable rel-
ative to the baseline category. While the calculation of a probability always requires
all parameters (for all categories), a baseline logit requires only the parameters for the
respective category.
For G categories there are (G − 1) · G / 2 pairs of categories.
A subset of G − 1 baseline logits contains all information of the multinomial logistic
model. The rest is redundant. For G = 3 there are 3 possible pairs (baseline logits), of
which one is redundant.
In our example with G = 3 categories and choosing G as the baseline category, the fol-
lowing two equations result:
ln[ p1(x) / p3(x) ] = a1 + b1·x
ln[ p2(x) / p3(x) ] = a2 + b2·x
For a person with an income of 3000 EUR, the following logits are obtained with the
above parameters for the two chocolate types A and B:
ln[ p1(x) / p3(x) ] = −22.42 + 6.70 · 3 = −2.33
ln[ p2(x) / p3(x) ] = −7.93 + 2.77 · 3 = 0.387
From this we can derive the odds with Eq. (5.11). The chances that a person with an
income of 3000 EUR will buy chocolate A (g = 1) compared to not buying (g = 3) are odds13 = e^{−2.33} = 0.097.
Comparing Categories
For each other pair of categories (none of which is the reference category), we get the
logit by calculating:
ln[ pg(x) / ph(x) ] = ln[ pg(x) / pG(x) ] − ln[ ph(x) / pG(x) ] = zg − zh    (5.57)
And we get the odds by calculating:
oddsgh = e^{zg − zh}    (5.58)
So, for a person with an income of 3000 EUR we get:
z1 − z2 = −2.33 − 0.387 = −2.72
and: odds12 = e^{−2.72} = 0.066
31 In SPSS (procedure NOMREG) the user can choose any category as the reference category and
thus determine the odds using the baseline logit model. This is done in the dialog box by choosing
the option “Reference Category” and “Custom”. By default, the last category G is chosen. The
category with the lowest coding is chosen if the user chooses the category order “Descending” (the
default setting is “Ascending”).
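These calculations are easy to verify; slight deviations from the values in the text (−2.33, 0.387) stem from the rounding of the printed parameters:

```python
import math

# Baseline logits at an income of 3000 EUR, using the (rounded)
# parameters from the example
z1 = -22.42 + 6.70 * 3      # chocolate A vs. no buy
z2 = -7.93 + 2.77 * 3       # chocolate B vs. no buy

odds_13 = math.exp(z1)          # buying A vs. no buy
odds_23 = math.exp(z2)          # buying B vs. no buy
odds_12 = math.exp(z1 - z2)     # Eq. (5.58): buying A vs. buying B
print(odds_13, odds_23, odds_12)
```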
To check the goodness-of-fit of a binary logistic model we used (see Sect. 5.2.4):
These statistics and corresponding tests can also be used for the multinomial logistic
model. In Table 5.18 we compare the binary logistic model (Model 3) with the covariates
Income and Gender for N = 30 and the corresponding multinomial logistic model (Model
4) for N = 50 with respect to these measures. With the multinomial model, we get a much
better fit.
Besides these measures, SPSS provides two further measures for the “goodness-of-fit” for multinomial logistic regression: Pearson’s chi-square statistic and the deviance.
And additionally, SPSS also provides information criteria for model selection.
Unlike the LLR statistic and the pseudo-R square statistics, which become larger
with a better fit, the opposite is true for the Pearson statistic and the deviance. They are
“inverse” measures or actually “badness-of-fit measures” since they measure the lack of
fit. With better fit, they become smaller and, in extreme cases, even zero. In hypothesis
tests, therefore, the acceptance of the null hypothesis is desired, not the rejection. And a
larger p-value is better.
on residual analysis. The measure according to Eq. (5.43) differs from the Pearson
goodness-of-fit measure. It is not truly chi-square distributed (cf. Agresti, 2013, p. 138;
Hosmer & Lemeshow, 2000, p. 146). The approximate chi-square distribution is only
obtained with frequency data as used, e.g., in contingency analysis (see Chap. 6). For
frequency data, Pearson’s chi-square statistic X2 is calculated by
X2 = ∑_{cells} (observed frequency − expected frequency)^2 / expected frequency
In the SPSS procedure NOMREG for multinomial logistic regression, Pearson’s chi-
square statistic is calculated by:
X2 = ∑_{i=1}^{I} ∑_{g=1}^{G} (mig − eig)^2 / eig = ∑_{i=1}^{I} ∑_{g=1}^{G} rig^2    (5.59)

where
mig = observed frequency: number of events (e.g., buying) in cell ig
eig = expected frequency
The values rig are called Pearson residuals. X2 is approximately chi-square distributed for
sufficiently large observed frequencies mig with df = I (G − 1) − (J + 1).
In logistic regression the expected frequencies are calculated by using the derived
probabilities:
eig = ni · pig    with    pig = e^{zig} / ∑_{h=1}^{G} e^{zih}    (5.60)
where ni is the number of cases in cell i. This is different from contingency analysis.
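A minimal sketch of Eqs. (5.59) and (5.60) with made-up frequencies for two cells (men, women) and two categories (buy, no buy); since the fitted probabilities here equal the relative frequencies, the fit is perfect and X2 becomes zero:

```python
import numpy as np

# Made-up observed frequencies m_ig: rows = cells i (men, women),
# columns = categories g (buy, no buy)
m = np.array([[5.0, 15.0],
              [10.0, 10.0]])
n_i = m.sum(axis=1)                  # number of cases per cell
p = m / n_i[:, None]                 # fitted probabilities (= rel. frequencies)
e = n_i[:, None] * p                 # expected frequencies, Eq. (5.60)

r = (m - e) / np.sqrt(e)             # Pearson residuals r_ig
X2 = (r ** 2).sum()                  # Eq. (5.59)
print(X2)                            # → 0.0 (perfect fit)
```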
Example
As a simple example, we analyze the relationship between Buying and Gender, using
our data from Table 5.2. By counting, we get the frequencies listed in Table 5.19. ◄
With the derived probabilities according to Eq. (5.60) we calculate the expected frequen-
cies shown in Table 5.20. With X2 = 0, Pearson’s goodness-of-fit measure indicates a per-
fect fit between the observed and the expected frequencies.32
Each row in Table 5.20 corresponds to a cell of the 2 × 2 contingency table. By add-
ing rows, the table can be extended to any number of cells.
A cell is defined by
32 For X2 = 0, the p-value has to be 1.0, but it cannot be calculated as there are no degrees of free-
dom for this model. It just serves to demonstrate the principle of the calculation. The predicted
(expected) probabilities here are equal to the relative frequencies of the observed values in the
respective subpopulation, i.e. for men or for women.
5.3.4.2 Deviance
The calculation of the deviance is based on the same cells as the calculation of Pearson
goodness-of-fit measure. The values of these two measures are usually very similar. In
SPSS, deviance is calculated by (cf. IBM Corporation, 2022, p. 768):
X2 = 2 · ∑_{i=1}^{I} ∑_{g=1}^{G} mig · ln( mig / (ni · pig) )    (5.61)
This shows the similarity to Pearson’s goodness-of-fit measure (see Eq. 5.59). Both
measures are approximately chi-square distributed with df = I (G − 1) − (J + 1) for suffi-
ciently large observed frequencies mig.
Both measures face the same problems if the number of cells becomes large and the
number of observed events mig within the cells becomes small. For empty cells (with
mig = 0) a calculation is not possible, as the logarithm of zero is not defined. Thus, for
models with metric predictors, we have the same objection against the use of deviance as
for Pearson’s goodness-of-fit measure.
Deviance compares the fitted model with a so-called saturated model, measuring the
deviation from this model. That is where the name comes from.
The saturated model is the “best possible” model concerning fit. This model has a
separate parameter for each observation and therefore achieves a perfect fit. But this is
not a good model with regard to parsimony, as it does not simplify reality. Thus, the sat-
urated model is not a useful model. It just serves as a baseline for a comparison with the
fitted model. In linear regression, a saturated model for N observations would be a model
with J = N − 1 independent variables, e.g. a simple regression with two observations.
There is a similarity between deviance and the likelihood ratio statistic LLR. In Eq.
(5.37) we defined LLR as the difference of the log-likelihood of the fitted model and the
log-likelihood of a null model:
LLR = −2 · ln( L0 / Lf ) = −2 · (LL0 − LLf)
Thus, LLR measures the deviation from the “worst possible” model, the 0-model. The
greater the deviation, the greater the goodness-of-fit.
Deviance measures the deviation from the “best possible” model. The greater
the deviation, the worse the model. Thus, deviance measures the lack of fit, just like
Pearson’s goodness-of-fit measure. They are both inverse goodness-of-fit measures and
give similar results.
For models with individual (casewise) data, the likelihood of the saturated model is 1
for each observation and thus the sum of the logarithms will be zero. The deviance then
degenerates to:
D = −2 · LLf
This is the −2 log-likelihood statistic (−2LL) that we find as a summary statistic in
SPSS outputs.
The deviance or the −2LL statistic play the same role in logistic regression as the
sum of squared residuals (SSR) in linear regression. In linear regression, the parameters
are estimated by minimizing SSR, in logistic regression the parameters are estimated by
minimizing −2LL.
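For casewise binary data, the −2LL statistic can be sketched directly (made-up outcomes and fitted probabilities):

```python
import math

y = [1, 0, 1, 1, 0]            # made-up observed outcomes
p = [0.8, 0.3, 0.6, 0.9, 0.2]  # made-up fitted probabilities

# log-likelihood of the fitted model for casewise (binary) data
llf = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
          for yi, pi in zip(y, p))
D = -2 * llf                   # deviance = -2LL statistic
print(D)
```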
predictions for cases within the sample. Therefore, parsimony is an important criterion
for modeling.
Thus, in addition to significance tests (e.g., the likelihood ratio test), further criteria
have been developed that are useful for comparing and selecting models with different
numbers of variables. These include the Akaike information criterion (AIC) and the
Bayesian information criterion (BIC). As with the adjusted coefficient of determination
in linear regression, increasing model complexity is penalized by a correction term that
is added within the information criteria. A model with a smaller value of the information
criterion is the better model.
Akaike information criterion (AIC)
AIC = −2LL + 2 · number of parameters (5.63)
Bayesian information criterion (BIC)
BIC = −2LL + ln(N) · number of parameters (5.64)
with
N: sample size
number of parameters: [(G − 1) · (J + 1)] (= degrees of freedom)
J: Number of predictors
G: number of categories of the dependent variable
In the example for multinomial logistic regression (Sect. 5.3.1) with G = 3 and N = 50,
we get – with only one predictor, Income – the value 45.5 for −2LL. The number of
parameters (including an intercept) is:
[(G − 1) · (J + 1)] = [(3 − 1) · (1 + 1)] = 4
We get:
AIC = 45.5 + 2 · 4 = 45.5 + 8 = 53.5
BIC = 45.5 + ln(50) · 4 = 45.5 + 15.6 = 61.1
If we now include the variable Gender, the value of −2LL decreases to 23.6. But the
number of parameters increases to 6 and, with this, the size of the penalization:
AIC = 23.6 + 2 · 6 = 23.6 + 12 = 35.6
BIC = 23.6 + ln(50) · 6 = 23.6 + 23.5 = 47.1
Both measures are reduced by the inclusion of the additional variable Gender. Here the
reduction in −2LL overcompensates the penalty effect. This model extension is therefore
advantageous.
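The two computations can be reproduced with a few lines (the helper function is our own):

```python
import math

def aic_bic(minus2ll, n_params, n):
    """Information criteria according to Eqs. (5.63) and (5.64)."""
    return minus2ll + 2 * n_params, minus2ll + math.log(n) * n_params

aic1, bic1 = aic_bic(45.5, 4, 50)    # Income only
aic2, bic2 = aic_bic(23.6, 6, 50)    # Income and Gender
print(aic1, round(bic1, 1))          # → 53.5 61.1
print(aic2, round(bic2, 1))          # → 35.6 47.1
```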
AIC and BIC are only suitable for comparing models based on the same data set. The
measures do not make a statement about the absolute quality of the compared models,
they just indicate which one is better (the one with the lowest value).
5.4 Case Study 329
However, AIC and BIC do not always lead to the same results. As can be seen, the
penalty effect is greater for BIC than for AIC. Therefore, BIC favors more parsimo-
nious models. Which one of the two criteria is “more correct” cannot be decided une-
quivocally. The larger the sample, the more likely the best model will be selected with
BIC. For small samples, on the other hand, there is a risk that by using BIC a too simple
model will be selected (cf. Hastie et al., 2011, p. 235).
We will now use a larger sample related to the chocolate market to demonstrate how to
conduct a multinomial logistic regression with the help of SPSS.33
The manager of a chocolate company wants to know how consumers evaluate dif-
ferent chocolate flavors with respect to 10 subjectively perceived attributes since he
assumes this information to be strategically important for differentiating his offerings
from those of his competitors and for positioning new products. For this purpose, the
manager has identified 11 flavors and has selected 10 attributes that appear to be relevant
for the evaluation of these flavors.
First, a small pretest with 18 test persons is carried out. The persons are asked to eval-
uate the 11 flavors (chocolate types) with respect to the 10 attributes (see Table 5.21).34
A seven-point rating scale (1 = low, 7 = high) is used for each attribute. Thus, the inde-
pendent variables are perceived attributes of the chocolate types.
However, not all persons evaluated all 11 flavors. Thus, the data set contains only 127
evaluations instead of the complete number of 198 evaluations (18 persons × 11 flavors).
Any evaluation comprises the scale values of the 10 attributes for a certain flavor of one
respondent. It reflects the subjective assessment of a specific chocolate flavor by a par-
ticular test person. Since each test person assessed more than just one flavor, the observations are not independent. Yet in the following we will treat the observations as if they were independent.
Of the 127 evaluations, only 116 are complete, while 11 evaluations have missing val-
ues.35 We exclude all incomplete evaluations from the analysis. Consequently, the num-
ber of cases is reduced to 116.
33 We will use the dataset introduced in Chap. 4 for discriminant analysis in order to better illus-
trate similarities and differences between the two methods.
34 On the website www.multivariate-methods.info, we provide supplementary material (e.g., Excel
(e.g. because people cannot or do not want to answer some question(s), or as a result of mistakes
by the interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.
Table 5.21 Chocolate flavors and perceived attributes examined in the case study
Chocolate flavor Perceived attributes
1 Milk 1 Price
2 Espresso 2 Refreshing
3 Biscuit 3 Delicious
4 Orange 4 Healthy
5 Strawberry 5 Bitter
6 Mango 6 Light
7 Cappuccino 7 Crunchy
8 Mousse 8 Exotic
9 Caramel 9 Sweet
10 Nougat 10 Fruity
11 Nut
To investigate the differences between the various flavors, 11 groups could be con-
sidered, with each group representing one chocolate flavor. For the sake of simplicity,
the 11 products were grouped into three market segments. This was done by applying a
cluster analysis (see the results of the cluster analysis in Sect. 8.3). Table 5.22 shows the
composition of the three segments and their sizes. The size of the segment ‘Classic’ is
more than twice the size of the segments ‘Fruit’ and ‘Coffee’.
The manager now wants to use logistic regression to answer the following questions:
Here we show how to run the procedure for multinomial logistic regression (NOMREG)
of SPSS via the graphical user interface (GUI). Figure 5.26 shows the Data editor with the
Fig. 5.26 Data editor with the selection of the procedure ‘Multinomial Logistic Regression’ (NOMREG)
work file that contains our data, and the pulldown menus for ‘Analyze’ and ‘Regression’.
The 127 rows of the table contain the evaluations (cases) and the columns represent the 10
attributes (our independent variables). The last three columns contain the identification numbers
of the 18 respondents, the type numbers of the 11 flavors (product types), and the numbers
of the three segments (our dependent variable, not visible in Fig. 5.26).
For selecting an analysis procedure in SPSS, you have to click on ‘Analyze’ to
open the pulldown menu with the various groups of procedures. The procedure
‘Multinomial Logistic Regression’ is included in the group ‘Regression’. This group also
contains the procedure ‘Binary Logistic Regression’, which was discussed in Sect. 5.2.6.
After selecting ‘Analyze/Regression/Multinomial logistic …’, the dialog box shown
in Fig. 5.27 opens, with the left field showing the list of variables. Our dependent varia-
ble ‘Segment’ has to be moved to the field ‘Dependent’ using the left mouse button (this
has already been done in Fig. 5.27). The appendix ‘Last’ indicates that the last segment
(G = 3) has by default been chosen as the reference category. As mentioned above, the
user can select any category (segment) as the reference category. Here, we choose the
first segment as the baseline category because this segment is the largest one. For this,
we have to click on ‘Reference Category/Custom’ and enter 1 for ‘Value’.
We mentioned at the beginning that in logistic regression the independent variables are
usually called covariates. In SPSS, categorical independent variables are called factors, while metric independent variables are called covariates.
Accordingly, the dialog box contains two fields for specifying the independent variables.
Since in our case all the 10 independent variables are metric, they were moved to the
field ‘Covariate(s)’.
Further dialog boxes can be accessed via the buttons on the right side.
The top dialog box, ‘Model’ (Fig. 5.28), will not be required in the beginning. As
the first item, ‘Main effects’, shows, the Multinomial Logistic Regression procedure
estimates the coefficients for all selected predictors (factors and/or covariates) and the
intercept. This is the default option and is called “blockwise” regression. There are two
further items in this dialog box. If ‘Full factorial’ is chosen, all possible interaction
effects between the selected factorial (categorical) variables are also included in the
model. Covariate interactions are not estimated. The ‘Custom/Stepwise’ option allows
the user to specify interaction effects (factor or covariate interactions). Here you can also
choose whether a stepwise regression should be performed. In this case, the independent
variables are selected successively in the order of their significance by an algorithm. We
will use this option in Sect. 5.4.4. Finally, the user can choose a model without an inter-
cept. Click the ‘Continue’ button to return to the main menu.
In the second dialog box, ‘Statistics’, you can specify settings for the output.
Figure 5.29 shows the default settings.
• The ‘Case processing summary’ provides information about the specified categorical
variables (e.g., the number of cases by segments, missing values).
• In the field ‘Model’ you can request statistics for assessing the quality of the model,
such as the pseudo-R2 statistics (McFadden, Cox & Snell, Nagelkerke), the likelihood
ratio test, or the classification table. If you choose the option ‘Classification table’,
you will get a table with observed versus predicted responses. Pearson’s chi-square
statistic and the deviance can be requested with the option ‘Goodness-of-fit’. If you
choose this option, in our case you will get a warning in the output because the inde-
pendent variables are metric and therefore almost every case has a distinct covariate
pattern. For our total of 116 cases, we would get 113 distinct covariate patterns (i.e.
only three pairs with equal covariate patterns), resulting in 339 cells of which 226
would be empty.
• In the field ‘Parameters’ you can request a table with the estimated coefficients
including standard error, Wald test, and odds ratio (analogous to Table 5.15).
Selecting ‘Likelihood ratio tests’ will result in printouts of the likelihood ratio tests of
the coefficients (analogous to Table 5.16). Confidence intervals for the odds ratios are
also provided, and the user can specify the confidence probability of these intervals.
The dialog box ‘Criteria’ offers controls on the iterative algorithm to perform the
maximum likelihood estimate (e.g. the maximum number of iterations) and printing of
the iteration history. The dialog box ‘Options’ may be used to set parameters for per-
forming a stepwise regression. The dialog box ‘Save’ may be used to save individual
results, such as the estimated probabilities or the predicted category, in the work file by
appending them as new variables.
5.4.3 Results
can be regarded as highly statistically significant, i.e. the predictors discriminate between
the three segments.
The values of the three pseudo R square statistics in the lower part of Fig. 5.31 also
indicate an acceptable model fit. McFadden’s R2 results from the above log-likelihood
values:
McF-R2 = 1 − LLf / LL0 = 1 − (−142.016) / (−229.326) = 0.381
However, this is not very surprising as the segments were formed by a cluster analysis
of the same 10 attributes that we now use as predictors. It shows that the cluster analysis
probably worked correctly.
Fig. 5.32 Estimated parameters of the regression functions for segments 2 and 3
Fig. 5.33 Estimated coefficients for the segments ‘Fruit’ versus ‘Classic’
Fig. 5.34 Estimated coefficients for the segments ‘Coffee’ versus ‘Classic’
Figure 5.34 shows that the segments ‘Coffee’ and ‘Classic’ differ much less than the
segments ‘Fruit’ and ‘Classic’. The greatest difference concerns the attribute ‘bitter’. But
that ‘Coffee’ is perceived as less bitter than ‘Classic’ also seems questionable.
According to Eq. (5.57) we can determine the coefficients for other pairs of categories
by the differences of the respective logits. Figure 5.35 shows the results for ‘Fruit’ versus
Fig. 5.35 Estimated coefficients for the segments ‘Fruit’ versus ‘Coffee’
‘Coffee’. For ‘Coffee’ versus ‘Fruit’ (by switching the baseline), we would get identical
coefficients with opposite signs.
The segment ‘Fruit’ is perceived as more fruity than the segment ‘Coffee’, which is
not surprising. It is also perceived as lighter, more refreshing, and crunchier. And it is
perceived as less delicious, less expensive, and less healthy. Not all of these findings
match our intuitive expectations, but human attitudes are sometimes unpredictable.
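The logit-difference rule of Eq. (5.57) is easy to apply mechanically. The sketch below uses invented coefficient values (not the estimates in Fig. 5.32) purely to illustrate the bookkeeping:

```python
# Hypothetical coefficients relative to the baseline 'Classic' (illustrative values only)
b_fruit_vs_classic  = {"fruity": 1.20, "bitter": -0.40}
b_coffee_vs_classic = {"fruity": -0.10, "bitter": 0.55}

# Eq. (5.57): coefficients for 'Fruit' vs. 'Coffee' are the differences of the logits
b_fruit_vs_coffee = {k: b_fruit_vs_classic[k] - b_coffee_vs_classic[k]
                     for k in b_fruit_vs_classic}

# Switching the pair ('Coffee' vs. 'Fruit') just flips the signs
b_coffee_vs_fruit = {k: -v for k, v in b_fruit_vs_coffee.items()}
```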
Figure 5.32 also shows the values of the Wald statistics according to Eq. (5.44) and
the corresponding p-values. For the segments ‘Fruit’ vs. ‘Classic’, five coefficients are
significant for α = 5%, but for the other pairs of segments, most coefficients are not sig-
nificant. Table 5.23 shows the significant coefficients in the order of their significance
level. As can be seen, the largest coefficients do not always have the highest significance.
Finally, in the rightmost column of Fig. 5.32, the effect coefficients (odds ratios,
Exp(B)) of the covariates are shown (specific for the respective reference category). They
are all positive, but <1 for negative values of the regression coefficients and >1 for pos-
itive values (see Sect. 5.2.4.3). SPSS also indicates the confidence intervals of the odds
ratios.
Fig. 5.36 Testing the covariates with the likelihood ratio test
Classification Results
An important characteristic of logistic regression is that it provides probabilities for the
categories (groups). The probabilities may be used for the prediction or the assignment
of cases to categories. In SPSS we get the probabilities for each test person and flavor for
the three segments. But we can also derive these probabilities for persons that have not
taken part in the analysis. For the calculation see Eqs. (5.47) and (5.48) in Sect. 5.3. In
SPSS, the estimated probabilities can be requested with the dialog box ‘Save’. They will
then be appended to the working file as new variables. The probabilities are independent
of the chosen baseline category.
Figure 5.37 shows part of the working file in the data editor with the created varia-
bles EST1_1, EST2_1 and EST3_1 for the estimated probabilities of the three segments.
The variable PRE_1 indicates the predicted category. This is the category with the high-
est probability.
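The probabilities behind EST1_1 to EST3_1 follow from the multinomial logit formula of Eqs. (5.47) and (5.48): the baseline category contributes a logit of zero, and the predicted category (PRE_1) is the one with the highest probability. A minimal sketch with hypothetical logits:

```python
from math import exp

# Hypothetical logits of segments 2 and 3 relative to baseline segment 1 (illustrative)
z2, z3 = -1.2, -2.0

denom = 1 + exp(z2) + exp(z3)          # baseline contributes exp(0) = 1
probs = {"seg1": 1 / denom,
         "seg2": exp(z2) / denom,
         "seg3": exp(z3) / denom}

predicted = max(probs, key=probs.get)  # category with the highest probability
```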
For case 1 (respondent 1 and flavor 1 = Milk) we get the highest probability 0.84
for segment 1 (Classic). This prediction is correct. For case 10 (respondent 4 and flavor
2 = Espresso), segment 2 (Fruit) is predicted, which is false.
A summary of all observed and predicted segments is given by the classification table
in Fig. 5.38, which now contains 9 cells. The hits are shown in the diagonal cells. Of the
65 cases in segment 1 (Classic), 62 are predicted correctly (95.4%), and of the 28 cases
in segment 2 (Fruit), 23 are predicted correctly (82.1%). These are very good results. But
of the 23 cases in segment 3 (Coffee), only 6 are predicted correctly (26.1%). This shows
that logistic regression yields higher hit rates for larger segments. The overall hit rate is
78.4%.
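The hit rates of the classification table can be reproduced from the diagonal counts and the segment totals alone (the off-diagonal cells are not needed); a short Python check:

```python
# Observed segment sizes and correctly predicted cases from the classification table
totals = {"Classic": 65, "Fruit": 28, "Coffee": 23}
hits   = {"Classic": 62, "Fruit": 23, "Coffee": 6}

hit_rates = {s: 100 * hits[s] / totals[s] for s in totals}       # per-segment hit rate
overall = 100 * sum(hits.values()) / sum(totals.values())        # overall hit rate
```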
To check the classification, a ROC curve can be created for each segment. We will
do this for segment 1 (Classic). After selecting the item ‘Analyze/ROC curve’, we have
to choose the variable ‘EST1_1’ as the ‘Test Variable’ and the variable ‘Segment’ as the
‘State Variable’. For ‘The Value of State Variable’ we enter 1 for segment 1 (Classic).
Additionally, we select ‘With diagonal reference line’ and ‘Standard error and confidence
interval’. The output is shown in Fig. 5.39. The area under the curve (AUC) is 82.4%,
which is excellent.
We can do the same for the other two segments. For segment 2 (‘Fruit’) we have to
choose the variable ‘EST2_1’ as the ‘Test Variable’ and enter the value 2 for the ‘State
Variable’. We will get AUC = 91.5%. For segment 3 (Coffee) we get AUC = 79.5%.
These are excellent results. The cluster analysis that was used for the segmentation
apparently did a good job.
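The AUC that SPSS reports can be computed with the pairwise (Mann-Whitney) formulation: the share of (positive, negative) score pairs in which the positive case gets the higher estimated probability. A self-contained sketch with made-up scores (not the case-study data):

```python
def auc(pos_scores, neg_scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Toy scores: estimated probabilities for cases in the target segment (pos)
# and for all other cases (neg)
example_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```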
To construct a more parsimonious model, we can use the results above since the like-
lihood ratio test told us that the attributes ‘crunchy’, ‘delicious’, ‘bitter’ and ‘fruity’ have
the greatest importance for assigning cases to segments.
The option ‘Stepwise Regression’ offers another possibility. With this procedure, it
is left to the computer program to find a good model. An algorithm successively adds
variables to a model that in the beginning contains only the intercept. The variables
are selected in the order of their contribution to the likelihood of the model (forward
selection). Or the algorithm removes variables from a model that contains all independ-
ent variables (backward selection). This can be controlled via the dialog box ‘Model’
(Fig. 5.28). The statistical criteria for the selection process can be controlled via the
dialog box ‘Options’ (Fig. 5.40). The default threshold for including a variable into the
model is a p-value ≤5% of the likelihood ratio.
Figure 5.41 shows the results of the stepwise entry of covariates. The procedure
selects five variables that meet the default selection criterion (‘Entry Probability’ ≤5%).
These are ‘fruity’, ‘price’, ‘exotic’, ‘crunchy’, and ‘bitter’. The likelihood statistics are
not identical to the values in Fig. 5.36 because of multicollinearity. With these variables,
we achieve a hit rate of 69%.
If we increase the ‘Entry Probability’ to ≤10%, the attribute ‘refreshing’ is added to
the model and we achieve a hit rate of 71.6%. If we now discard the attribute ‘bitter’, we
can increase the hit rate to 73.1%. This demonstrates that we should not put too much
confidence in the automatic selection by stepwise regression.
In Sect. 5.2.6 we demonstrated how to use the GUI (graphical user interface) of SPSS
to conduct a binary logistic regression. Alternatively, we can use the SPSS syntax which
is a programming language unique to SPSS. Each option we activate in SPSS’s GUI is
translated into SPSS syntax. If you click on ‘Paste’ in the main dialog box shown in
BEGIN DATA
1 2530 0 1
2 2370 1 0
3 2720 1 1
-----------
30 1620 0 0
* Enter all data.
END DATA.
Fig. 5.42 SPSS syntax for binary logistic regression (Sect. 5.2.6)
Fig. 5.17, a new window opens with the corresponding SPSS syntax. However, you can
also use the SPSS syntax and write the commands yourself. Using the SPSS syntax can
be advantageous if you want to repeat an analysis multiple times (e.g., testing differ-
ent model specifications). Figure 5.42 shows the SPSS syntax for binary regression as
described in Sect. 5.2.6.
BEGIN DATA
3 3 5 4 1 2 3 1 3 4 1
6 6 5 2 2 5 2 1 6 7 1
2 3 3 3 2 3 5 1 3 2 1
---------------------
5 4 4 1 4 4 1 1 1 4 1
* Enter all data.
END DATA.
Fig. 5.43 SPSS syntax for blockwise estimation and ROC curve (Sect. 5.4.3.1)
Figures 5.43 and 5.44 show the SPSS syntax for running some analyses for the case
study presented here.
For readers interested in using R (https://www.r-project.org) for data analysis, we pro-
vide the corresponding R-commands on our website www.multivariate-methods.info.
5.5 Modifications and Extensions
Logistic regression is of great importance for the estimation of discrete choice models,
i.e. for models that describe, explain, predict, and support choices between two or more
discrete alternatives. As examples we used various instances of buying chocolate (e.g.
buying or not buying, buying type A or type B). For these kinds of models we have to
distinguish between two groups of independent variables (predictors):
36 There is a third group of variables that vary over persons and alternatives, e.g., the perceived
attributes that we encountered in the case study.
37 Logit choice models became popular through the work of Daniel McFadden (1974), who laid
the foundations for these models and their applications. In 2000 he was awarded the Nobel Prize
in economics. More information on these models can be found in the books by Ben-Akiva &
Lerman, 1985; Hensher et al., 2015; Train, 2009. Examples of their application are the choice of
transport alternatives (e.g. car, tram, bus, bicycle, walking; McFadden, 1984) or the interpretation
of market data derived from scanner panels (e.g. Guadagni & Little, 1983; Jain et al., 1994).
A problem with logistic regression models is the large number of parameters that have to
be estimated and interpreted, especially if the number of categories is large. Wikipedia
lists more than 200 chocolate brands. If we select 10 brands and use 10 predictors, we
have to estimate 99 parameters (9 intercepts and 90 coefficients) for a logistic regression
model.
For a logit choice model, the number of parameters is reduced to 19 (9 intercepts and
10 coefficients) because the logit choice model (in its basic form) uses generic coeffi-
cients, while the logistic regression model uses alternative-specific coefficients. For
example, there will be just one coefficient for the prices of the alternatives, and it is
assumed that price has the same effect on all alternatives. The possibility of specifying
generic coefficients that are constant over the alternatives makes it possible to formulate
very efficient and parsimonious models. Logistic regression does not allow the estima-
tion of generic coefficients. Neither can generic coefficients that do not vary over the
alternatives be estimated for characteristics of the persons. We will demonstrate this with
a small example.
Example
As an example, we simplify the model in Eq. (5.52). We assume G = 3 choice alter-
natives and only one predictor. For a person with an income x, we want to predict
the buying probabilities for the three alternatives. This yields the following logistic
regression model:
π_g(x) = e^(α_g + β_g x) / (e^(α_1 + β_1 x) + e^(α_2 + β_2 x) + e^(α_3 + β_3 x))   (g = 1, 2, 3)   (5.65)
Setting the parameters for the baseline category g = 3 to zero, we have to estimate
(J + 1) · (G − 1) = 2 · 2 = 4 parameters. ◄
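The parameter counts quoted above can be sanity-checked in a few lines; a small sketch of ours:

```python
def n_params_logistic(G, J):
    # (G-1) intercepts plus (G-1)*J alternative-specific coefficients
    return (G - 1) * (J + 1)

def n_params_logit_choice(G, J):
    # (G-1) intercepts plus J generic coefficients (basic form of the model)
    return (G - 1) + J
```

With 10 brands and 10 predictors this reproduces the counts of 99 and 19 parameters; with G = 3 and J = 1 it gives the 4 and 3 parameters of the small example.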
Now we contrast the model in the example with the corresponding logit choice model.
We assume the predictor will be the price p of the choice alternatives. This leads to the
following formula:
π_g(p) = e^(α_g + β p_g) / (e^(α_1 + β p_1) + e^(α_2 + β p_2) + e^(α_3 + β p_3))   (g = 1, 2, 3)   (5.66)
Now the price coefficient is a generic parameter, while the prices are specific to the alter-
natives. We expect a negative value for the price coefficient β. Setting α3 to zero, we
have to estimate (J + G − 1) = 1 + 3 − 1 = 3 parameters. For alternative 1 we can express
the probability by:
π_1(p) = 1 / (1 + e^((α_2 − α_1) + β (p_2 − p_1)) + e^(−α_1 + β (p_3 − p_1)))   (5.67)
The intercepts (constant terms) of the model can here be interpreted as utility values of
the alternatives relative to the baseline category. The model states that the probability of
buying alternative 1 depends on the differences in utility and price between choice 1 and
the other two alternatives.
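A numerical sketch (with invented parameter values) confirms that Eq. (5.67) is just Eq. (5.66) for g = 1, rewritten in terms of utility and price differences:

```python
from math import exp

alpha = [0.8, 0.3, 0.0]   # alpha_3 fixed to zero (baseline); illustrative values
beta = -0.5               # generic price coefficient (expected to be negative)
price = [2.0, 1.5, 1.0]   # illustrative prices of the three alternatives

# Eq. (5.66): multinomial logit with a generic price coefficient
util = [alpha[g] + beta * price[g] for g in range(3)]
denom = sum(exp(u) for u in util)
pi = [exp(u) / denom for u in util]

# Eq. (5.67): probability of alternative 1 via utility and price differences
pi1 = 1 / (1 + exp((alpha[1] - alpha[0]) + beta * (price[1] - price[0]))
             + exp(-alpha[0] + beta * (price[2] - price[0])))
```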
If we now substitute in the logit choice model the price of the alternatives with the
income of the chooser, we get the following formula:
π_1(x) = 1 / (1 + e^((α_2 − α_1) + β (x − x)) + e^(−α_1 + β (x − x))) = 1 / (1 + e^(α_2 − α_1) + e^(−α_1))
Because the income does not vary across the alternatives, it is eliminated from the logit
choice model.
As mentioned above, logistic regression does not allow the estimation of generic coef-
ficients. So, if we include an attribute of the alternatives into a logistic regression model,
we have to estimate a specific coefficient for each alternative. This can be useful in cer-
tain situations. For example, a strong brand might be less affected by an increase in price
than a weak brand and it would be interesting to measure this effect of the brand. But
logistic regression forces us to estimate all parameters specific to alternatives. So, in gen-
eral, we have to estimate many parameters.
The logit choice model can be extended to include characteristics of the persons by
using alternative-specific coefficients besides generic coefficients. So, the logistic regres-
sion model can be seen as a special case of an extended logit choice model.38
5.6 Recommendations
38 SPSS has no special procedure for logit choice analysis but the procedure COXREG for Cox-
Regression may be used for this calculation.
the dependent variable) should not be lower than 25, and the total should therefore
amount to at least 50.
• With a larger number of independent variables, larger numbers of cases per group are
required. There should be at least 10 cases per parameter to be estimated.
• The independent variables should be largely free of multicollinearity (no linear
dependencies).
Alternative Methods
An alternative method to binary logistic regression is linear regression as demonstrated
in Sect. 5.2.1 with the linear probability model (Model 1, Sect. 5.2.1.1) and with group-
ing the data and applying a logit transformation (Model 2, Sect. 5.2.1.2). These models
can provide good approximations and are computationally much easier. But the linear
probability model has only restricted validity, and any grouping of data leads to a loss of
information.
Another alternative to logistic regression analysis (LRA) is discriminant analysis
(DA) (cf. Chap. 4). Both methods can also be used for multinomial problems, and both
methods are based on a linear model (linear systematic component). While in LRA we
deal with states or events, in the context of DA we deal with the classification of ele-
ments into predefined groups, which is historically conditioned by the original areas of
application. But as we have shown, LRA can also be used for problems of classification.
The main difference between the two methods is that LRA provides probabilities for
the occurrence of alternative events or the classification into separate groups. In contrast,
DA provides discriminant values from which probabilities can be derived in a separate
step. A nice feature of DA is that it provides territorial maps for classification problems,
i.e. mappings of groups and boundaries between groups.
LRA is computationally more complex since the estimation of the parameters requires
the application of the maximum likelihood method and thus an iterative procedure.
Concerning the statistical properties of the methods, one advantage of LRA is that it is
based on fewer assumptions about the data than DA. Thus, LRA is more robust con-
cerning the data and, in particular, less sensitive to outliers.39 Experience shows, how-
ever, that both methods provide similarly good results, especially for large sample sizes
and even in cases where the assumptions of DA are not fulfilled (cf. Michie et al., 1994,
p. 214; Hastie et al., 2011, p. 128; Lim et al., 2000, p. 216).
References
39 DA assumes that the independent variables follow a multivariate normal distribution, whereas
LRA assumes that the dependent variable follows a binomial or multinomial distribution.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction to Statistical Learning.
Springer.
Jain, D., Vilcassim, N., & Chintagunta, P. (1994). A random-coefficients logit brand-choice model
applied to panel data. Journal of Business & Economic Statistics, 13(3), 317–326.
Lim, T., Loh, W., & Shih, Y. (2000). A comparison of predicting accuracy, complexity, and training
time of thirty-three old and new classification algorithms. Machine Learning, 40(3), 203–229.
Louviere, J., Hensher, D., & Swait, J. (2000). Stated choice methods. Cambridge University Press.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka
(Ed.), Frontiers in econometrics, 40 (pp. 105–142). Academic Press.
McFadden, D. (1984). Econometric analysis of qualitative response models. In Handbook of
Econometrics, Volume II, Chapter 24. Elsevier Science Publishers BV.
Michie, D., Spiegelhalter, D., & Taylor, C. (1994). Machine learning, neural and statistical
classification. Ellis Horwood Limited.
Pearl, J. (2018). The Book of Why. The new science of cause and effect. Basic Books.
Press, W., Flannery, B., Teukolsky, S., & Vetterling, W. (2007). Numerical recipes – The art of
scientific computing. Cambridge University Press.
Train, K. (2009). Discrete choice methods with simulation. Cambridge University Press.
Watson, J., Whiting, P., & Brush, J. (2020). Practice pointer: Interpreting a covid-19 test result.
British Medical Journal, 369, m1808.
Further Reading
6.1 Problem
Imagine you want to know whether there is an association between gender and being
vegetarian. To answer the question, you draw a random sample of consumers and meas-
ure their gender and diet (vegetarian/omnivore). The two variables ‘gender’ and ‘diet’ are
categorical (nominal) variables. The values of the variables represent different categories
that are mutually exclusive, that is, each consumer can be categorized to one specific
combination of the variables (e.g., female vegetarian, male omnivore).1
To analyze relations between two or more categorical (nominal) variables, we can
use contingency analysis (also called cross-tabulation). When we are interested in asso-
ciations between categorical variables—like in the example above—, we actually want
to assess whether the variables are independent. Thus, we want to test for independence.
However, sometimes we want to know whether a variable is equally distributed in two
or more samples, as in the following example: you are wondering whether coronary
artery abnormalities are equally distributed in patients who are treated with aspirin or
with gamma globulin. Thus, you draw a random sample of patients with coronary artery
abnormalities and you examine whether the patients were treated with aspirin or with
gamma globulin. Both variables are categorical variables. However, in this example we
would like to know whether the distribution of coronary artery abnormalities is equal in
the two groups (i.e., aspirin vs. gamma globulin). We compare the probabilities of devel-
oping coronary artery abnormalities given treatment with aspirin or gamma globulin.
In this example, we want to test for homogeneity. Such questions can also be addressed
with the help of contingency analysis.
Table 6.1 lists further examples of research questions from different disciplines that
can be answered with contingency analysis. It is further indicated whether the ques-
tions ask for a test for independence or homogeneity. Actually, the procedures to test for
1 We are aware that both variables have more than just two categories. We use this simplification to
illustrate the basic idea of the contingency analysis.
independence or homogeneity are alike. In the following, we will again use the choco-
late example and address a question about the association of two variables (i.e., test for
independence).
6.2 Procedure
The starting point of a contingency analysis is a cross table that reflects the joint distri-
bution of two variables. In general, theoretical considerations and reasonable arguments
should support the hypothesis of an association between the two variables. Otherwise,
we may wrongly infer dependencies.
Application Example
We collected data from 181 respondents about their age and preferred type of choc-
olate. The variable ‘age’ has just two distinct categories: ‘18 to 44 years’ (younger)
and ‘45 years and older’ (older). If the variable ‘age’ was measured as a metric varia-
ble, we could transform it into a categorical variable after the data has been collected.
The variable ‘preferred chocolate type’ also has two categories, namely ‘milk’ and
‘dark’. Each observation (i.e., respondent) can uniquely be assigned to a combination
of these categories. For example, we observe a younger person who loves milk choco-
late. Moreover, the individual observations are independent of each other, that is, we
have one single observation from each respondent. ◄
356 6 Contingency Analysis
2 Another possibility is the computation of mean values or ratios (cf. Zeisel, 1985). In addition,
methods for the analysis of multidimensional tables such as log-linear models have been developed
(cf. Fahrmeir & Tutz, 2001 for a literature overview).
Yet in the following, we focus on two categorical variables to ease the understanding
of the basic idea of contingency analysis. We want to answer the question whether there
is an association between age and respondents’ preference for either milk chocolate or
dark chocolate. Table 6.2 shows an excerpt of the collected data.
In a first step, we aggregate the collected data to create a cross table. To do so, we
compute the number of observations nij with a specific combination of the different cat-
egories of the variables, where i indicates the category of the variable displayed in the
rows (i = 1, …, I) and j represents the category of the variable displayed in the columns
(j = 1, …, J). The I · J possible combinations of categories of the two variables represent
the cells of the cross table (Table 6.3). Apart from the number of observations in each cell of the cross
table (nij), we list information about the total number of observations in each row (ni.)
and column (n.j) as well as the total number of observations (n).
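Aggregating raw observations into the cell counts n_ij and the marginals n_i. and n_.j is a mechanical counting step; a minimal Python sketch with a handful of invented observations:

```python
from collections import Counter

# Invented raw data: one (age group, preferred type) pair per respondent
obs = [("18-44", "milk"), ("45+", "dark"), ("18-44", "dark"),
       ("45+", "dark"), ("18-44", "milk")]

n_ij = Counter(obs)                 # cell counts n_ij
n_i = Counter(a for a, _ in obs)    # row totals n_i. (age groups)
n_j = Counter(t for _, t in obs)    # column totals n_.j (chocolate types)
n = len(obs)                        # total number of observations
```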
Table 6.4 shows the cross table for the chocolate example with 181 observations.
From Table 6.4 we learn that 75 out of the 181 surveyed consumers are younger (18–
44 years), and that 68 consumers prefer milk chocolate over dark chocolate. Out of those
68 consumers, 45 are between 18 and 44 years old. Yet, to facilitate the interpretation of
cross tables, we often consider relative frequencies (i.e., percentages) instead of abso-
lute frequencies. For this purpose, we can use three alternative ways to compute relative
frequencies:
Table 6.4 Cross table for the preferred type of chocolate and age
Age
Preferred type of chocolate 18 to 44 years 45 years and older Sum
Milk 45 23 68
Dark 30 83 113
Sum 75 106 181
Tables 6.5 to 6.7 show the respective cross tables. Each of these tables provides slightly
different representations of the same information. Hence, the selection of the appropriate
presentation of the cross table depends on the specific interest of the researcher.
For example, Table 6.5 shows that 66.2% of all consumers who prefer milk chocolate
are younger than 45 years, while 73.5% of all consumers who prefer dark chocolate are
45 years or older.
Table 6.6 illustrates that 60.0% of the younger consumers prefer milk chocolate,
while 78.3% of the older consumers prefer dark chocolate.
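The three kinds of relative frequencies (row, column, and total percentages) all derive from Table 6.4; a short sketch reproducing some of the figures quoted above:

```python
table = {("milk", "18-44"): 45, ("milk", "45+"): 23,
         ("dark", "18-44"): 30, ("dark", "45+"): 83}
n = sum(table.values())

row_tot = {t: sum(v for (tt, _), v in table.items() if tt == t)
           for t in ("milk", "dark")}
col_tot = {a: sum(v for (_, aa), v in table.items() if aa == a)
           for a in ("18-44", "45+")}

row_pct = {k: 100 * v / row_tot[k[0]] for k, v in table.items()}  # % within type
col_pct = {k: 100 * v / col_tot[k[1]] for k, v in table.items()}  # % within age group
tot_pct = {k: 100 * v / n for k, v in table.items()}              # % of all respondents
```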
When considering the complete sample of 181 consumers (Table 6.7), we learn that
62.4% of the respondents prefer dark chocolate. Thus, the cross table suggests that there
is a larger market for dark than for milk chocolate. Further taking into account that
older consumers tend to prefer dark chocolate while younger consumers tend to prefer
milk chocolate (Tables 6.5 and 6.6), we can derive some managerial implications. For
example, a retail manager may use these insights for assortment decisions across differ-
ent retail stores. Since the majority of people seems to prefer dark chocolate, the retail
manager may decide to offer generally more kinds of dark chocolate compared to milk
chocolate in the stores. If older consumers prevail in the trade area of a specific retail
store, the retail manager may offer an even greater variety of dark chocolates.
At first glance, the different cross tables presented in Tables 6.5 to 6.7 suggest an
association between the respondents’ age and preferred type of chocolate. Yet, the cross
tables do not provide sufficient evidence for assuming an association between the varia-
bles in a statistical sense. In the following, we will thus formally test whether there is a
significant association.
To test whether there is indeed an association between the categorical variables, the
expected number of observations should be 5 or more in at least 80% of the cells (cf. Everitt,
1992, p. 39). In our example, all expected cell counts are larger than 5, and we can thus continue
with the analysis. First, we assess whether the categorical variables are associated with
the help of the chi-square test (here: test for independence). Afterwards, we evaluate the
strength of the association between the variables.
If the two variables are independent, the deviations between the observed (nij) and
expected (eij) numbers of observations should be small. Large deviations, in contrast, provide
evidence for an association (dependence) between the two variables. In our example, the
deviations equal 16.8 in each cell (cf. Table 6.8). Given the sample size, the deviations
seem to be substantial and the variables ‘age’ and ‘preferred type of chocolate’ are prob-
ably associated.
Chi-square Test
We use the information about the deviations to formally assess whether an associ-
ation between the two variables exists in a statistical sense. We test the hypothesis of
independence:
H0: X and Y are independent of each other (not associated).
H1: X and Y are dependent on each other (associated).
We use the chi-square test statistic to test the null hypothesis. The chi-square test sta-
tistic takes all deviations of observed and expected number of observations (nij − eij)
into account. We divide each deviation by its expected count to normalize the deviations.
Moreover, we consider the squared deviations, so that positive and negative deviations
do not cancel each other out.
χ² = Σ_{i=1}^{I} Σ_{j=1}^{J} (n_ij − e_ij)² / e_ij    (6.2)
hypothesis, and accept the alternative hypothesis H1 that the variables ‘age’ and ‘pre-
ferred type of chocolate’ are dependent (associated). Based on Table 6.8, we conclude
that older people have a stronger preference for dark chocolate while younger people
rather prefer milk chocolate.
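The expected counts and the chi-square statistic of Eq. (6.2) can be reproduced from Table 6.4 in a few lines of Python:

```python
obs = [[45, 23],   # milk: 18-44, 45+
       [30, 83]]   # dark: 18-44, 45+
n = sum(map(sum, obs))
row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]

# Expected counts under independence: e_ij = n_i. * n_.j / n
e = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]

# Eq. (6.2): sum of squared, normalized deviations
chi2 = sum((obs[i][j] - e[i][j]) ** 2 / e[i][j]
           for i in range(2) for j in range(2))
```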
In our example, the Yates correction results in the following value for chi-square:
χ²_corr = n · (|n_11 · n_22 − n_12 · n_21| − n/2)² / (n_1. · n_2. · n_.1 · n_.2) = 181 · (|45 · 83 − 23 · 30| − 181/2)² / (68 · 113 · 75 · 106) = 25.86
When we compare the chi-square value of the Yates correction with the theoretical chi-
square value of 3.84, we again reject the null hypothesis. Generally, the Yates correction
results in smaller values for chi-square. Yet with an increasing sample size, the difference
between the chi-square values in Eqs. (6.2) and (6.3) diminishes (cf. Fleiss et al., 2003,
pp. 57–58; Everitt, 1992, p. 13).
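The Yates-corrected value can likewise be checked directly from the four cell counts with the 2×2 shortcut formula:

```python
n11, n12, n21, n22 = 45, 23, 30, 83
n = n11 + n12 + n21 + n22

# Yates continuity correction for a 2x2 table
chi2_corr = (n * (abs(n11 * n22 - n12 * n21) - n / 2) ** 2
             / ((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)))
```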
For samples with less than 20 observations or strongly asymmetric boundary distribu-
tions (large differences between ni. and/or n.j), we can use the Exact Fisher test to assess
whether the variables are independent (cf. Agresti, 2019, pp. 45–47). The Exact Fisher
test computes the probability of obtaining the observed data under the null hypothesis
that the proportions are the same. The Exact Fisher test does not impose any require-
ments on the sample size. The original test is designed for 2 × 2 cross tables and can be
computed manually and equals3:
p = (n_1.! · n_2.! · n_.1! · n_.2!) / (n_11! · n_12! · n_21! · n_22! · n!)    (6.4)
The smaller the computed probability, the more likely are the variables associated.
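Eq. (6.4) can be evaluated with exact integer factorials; for the chocolate table (which is far too large to require the test) the resulting point probability is tiny, consistent with the strong association found above:

```python
from fractions import Fraction
from math import factorial as f

n11, n12, n21, n22 = 45, 23, 30, 83
n1_, n2_ = n11 + n12, n21 + n22   # row totals
n_1, n_2 = n11 + n21, n12 + n22   # column totals
n = n11 + n12 + n21 + n22

# Eq. (6.4): probability of the observed 2x2 table under the null hypothesis
# (Fraction keeps the huge integers exact before converting to float)
p = float(Fraction(f(n1_) * f(n2_) * f(n_1) * f(n_2),
                   f(n11) * f(n12) * f(n21) * f(n22) * f(n)))
```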
cannot use it to assess the strength of associations. To illustrate this issue, let us just
duplicate the sample from our example. We would then observe 90 observations in cell
n11, and would expect 56.4 (= 150 · 136/362) observations in this cell, which is twice
the number of 28.2. The resulting chi-square value then equals 54.94 (= 27.47 · 2). But
in fact, the strength of the association between the variables ‘age’ and ‘preferred type of
chocolate’ is still the same, and should not be affected by the sample size. We thus need
to consider the sample size when evaluating the strength of association.
To assess the strength of association, we generally distinguish between measures
for the strength of associations that rely on chi-square and measures that are based on
probability considerations. We first discuss the measures that are based on chi-square—
namely, the phi coefficient, contingency coefficient, and Cramer’s V.
The contingency coefficient ranges between 0 and 1. Yet, it can rarely reach the maximum
value of 1; rather the upper limit is a function of the number of columns and rows in the
cross table. Therefore, the respective maximum value should be taken into account when
assessing the strength of an association. The upper limit of the contingency coefficient is:
CC_max = √((R − 1) / R)   with R = min(I, J)    (6.7)
Since the maximum value equals 0.707, the value of 0.362 for the contingency coeffi-
cient seems to indicate a rather strong association.
Alternatively, Cramer’s V is a measure for the strength of an association which ranges
between 0 and 1 and reaches 1 if each variable is completely determined by the other:
Cramer’s V = √(χ² / (n · (R − 1)))   with R = min(I, J)    (6.8)
If one of the variables is binary, the phi coefficient and Cramer’s V result in the same
value. Thus, in our example Cramer’s V also equals 0.389.
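The chi-square based measures are simple transformations of χ² and n; a short sketch (using the rounded χ² ≈ 27.47, so the last decimals may differ slightly from the values quoted in the text):

```python
from math import sqrt

chi2, n = 27.47, 181
R = 2                           # min(I, J) for the 2x2 table

phi = sqrt(chi2 / n)            # phi coefficient
cc = sqrt(chi2 / (chi2 + n))    # contingency coefficient
cc_max = sqrt((R - 1) / R)      # upper limit of CC, Eq. (6.7)
v = sqrt(chi2 / (n * (R - 1)))  # Cramer's V, Eq. (6.8)
```

For a 2×2 table (R = 2), phi and Cramer’s V coincide, as noted above.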
In Sect. 6.3.3, we will report approximate significance levels for the phi coefficient,
contingency coefficient and Cramer’s V that help to assess whether the association
between the two categorical variables is substantial or not in a statistical and not only
subjective sense.
Table 6.9 Cross table containing the relevant information to compute Goodman and Kruskal’s
lambda
Age
Preferred type of chocolate 18 to 44 years 45 years and older Sum
Milk 45 23 68
Dark 30 83 113
Sum 75 106 181
know that a respondent belongs to the age category ‘45 years and older’, our best guess
related to the preferred type of chocolate is ‘dark’. Now, we are right for 83 respond-
ents and wrong for 23 respondents. Thus, we make overall 53 (= 30 + 23) wrong predic-
tions when we know the age category a respondent belongs to. Compared to the situation
where we do not know a respondent’s age, the number of wrong predictions reduces by
15 (= 68 – 53). Using the number of wrong predictions without knowledge about age as
a base, we get the following coefficient:
λ_type = (68 − 53) / 68 = 0.221
We can apply the same logic to the situation where we want to make a prediction about
age without having any information about the preferred type of chocolate (here: the vari-
able ‘age’ is the target variable). The lambda coefficient then equals:
λ_age = (75 − 53) / 75 = 0.293
The lambda coefficient is asymmetric since its value depends on the target variable. In
general terms, the lambda coefficients are computed in the following way:
λ_row variable = (Σ_j max_i n_ij − max_i n_i.) / (n − max_i n_i.)   or   λ_column variable = (Σ_i max_j n_ij − max_j n_.j) / (n − max_j n_.j)    (6.9)
The lambda coefficients always range between 0 and 1. A value of 0 indicates that know-
ing the category of the second variable does not reduce the probability of making a
wrong guess when predicting the first (target) variable. In contrast, a value of 1 indicates
that knowing the category of the second variable results in an error-free prediction of
the first (target) variable. Thus, the size of the lambda coefficient is an indicator for the
strength of association. Moreover, we can test whether the lambda coefficients are statis-
tically significant (cf. Sect. 6.3.3).
Alternatively to computing lambda coefficients for each variable separately, we can
also use the so called symmetric lambda coefficient:
λ_sym = (½ (Σ_i max_j n_ij + Σ_j max_i n_ij) − ½ (max_j n_.j + max_i n_i.)) / (n − ½ (max_j n_.j + max_i n_i.))    (6.10)
In our example, the symmetric lambda coefficient equals 0.259. We recognize that all lambda
coefficients are smaller than the chi-square based measures, which is usually the case.
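All three lambda coefficients of Eqs. (6.9) and (6.10) can be reproduced from Table 6.9:

```python
obs = [[45, 23],   # milk: 18-44, 45+
       [30, 83]]   # dark: 18-44, 45+
n = sum(map(sum, obs))

col_best = sum(max(c) for c in zip(*obs))     # sum over columns of max_i n_ij
row_best = sum(max(r) for r in obs)           # sum over rows of max_j n_ij
max_row_tot = max(sum(r) for r in obs)        # max_i n_i.
max_col_tot = max(sum(c) for c in zip(*obs))  # max_j n_.j

lam_type = (col_best - max_row_tot) / (n - max_row_tot)  # target: type (rows)
lam_age = (row_best - max_col_tot) / (n - max_col_tot)   # target: age (columns)
lam_sym = ((col_best + row_best - (max_row_tot + max_col_tot))
           / (2 * n - (max_row_tot + max_col_tot)))      # Eq. (6.10)
```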
Another measure to assess the strength of the association is Goodman and Kruskal’s
tau (τ). Whereas the lambda coefficient is based on modal category assignment,
Goodman and Kruskal’s tau considers all marginal probabilities. The probability of mak-
ing a wrong prediction is based on random category assignment with probabilities spec-
ified by marginal probabilities. We illustrate the computation of the tau coefficient with
the help of Table 6.10.
Without knowing the age of a respondent, we would assign 37.6% of respondents to
the category ‘milk’ and 62.4% to the category ‘dark’. We make a correct prediction in
37.6% and 62.4% of all cases for the first and second category, respectively. Consequently, in total
53.1% (= 0.376² + 0.624²) of all cases are assigned correctly and 46.9% are assigned
incorrectly. If we knew the age of a respondent, the prediction improves: Among the
younger respondents, 60% prefer milk chocolate and 40% prefer dark chocolate. For the
older respondents, the corresponding values are 21.7% and 78.3%. Hence, we predict
52% (= 0.60² + 0.40²) of all younger, and 66% (= 0.217² + 0.783²) of all older respondents
correctly. Overall, we now predict 60.2% (= 0.52 · 0.414 + 0.66 · 0.586) correctly
and 39.8% incorrectly. This is an improvement of 7.1 percentage points. The tau coefficient again
sets the improvement in relation to the probability of making a wrong prediction without
knowledge about age. Therefore, we get:
$$\tau_{\text{type}} = \frac{0.469 - 0.398}{0.469} = \frac{0.071}{0.469} = 0.152$$
The tau coefficient is also an asymmetric measure. Yet, in our specific example, τ_age also
equals 0.152. The tau coefficients are smaller than the lambda coefficients and the measures
based on chi-square. The tau coefficients can also be tested for statistical significance (cf. Sect.
6.3.3).
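The tau computation can be sketched in the same way, again using the cell counts reconstructed from the reported percentages (illustrative, not the original raw data):

```python
# Goodman & Kruskal's tau with 'type' as the target variable (cf. Table 6.10).
# Cell counts reconstructed from the reported percentages (illustrative).
table = [
    [45, 23],  # milk
    [30, 83],  # dark
]
n = sum(sum(row) for row in table)
row_sums = [sum(row) for row in table]        # marginals of the target 'type'
col_sums = [sum(col) for col in zip(*table)]  # marginals of 'age'

# Probability of a wrong prediction without knowing the age:
# random assignment with the marginal shares of 'type'.
p_wrong_baseline = 1 - sum((r / n) ** 2 for r in row_sums)

# Probability of a wrong prediction when the age is known: within each age
# column, assign with the conditional shares; weight columns by their share.
p_wrong_conditional = sum(
    (col_sums[j] / n)
    * (1 - sum((table[i][j] / col_sums[j]) ** 2 for i in range(len(table))))
    for j in range(len(col_sums))
)

tau_type = (p_wrong_baseline - p_wrong_conditional) / p_wrong_baseline
print(round(tau_type, 3))  # → 0.152
```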
Table 6.10 Cross table containing the relevant information to compute Goodman and Kruskal’s
tau (target variable: type)
Age
Preferred type of chocolate 18 to 44 years (%) 45 years and older (%) Sum (%)
Milk 60.0 21.7 37.6
Dark 40.0 78.3 62.4
Sum 41.4 58.6
confounding variables, they may affect the association between the focal variables and
may, therefore, blur the results.
To illustrate the potential effect of a confounding variable on the results of a contin-
gency analysis, let us assume the following:
Example
We surveyed 132 young and 158 elderly people regarding the question whether or not
they purchase diet chocolate. Table 6.11 shows the results. The question to be answered
is whether or not the purchase of diet chocolate is related to age. ◄
Table 6.11 indicates that there is an association, i.e., older consumers buy relatively more
diet chocolate than younger consumers. However, imagine that we also asked respondents
to state their body weight and height, and we computed the BMI (i.e., body mass
index) to assess whether a respondent is obese or not. We divided the sample into two
subgroups according to obesity (3rd (confounding) variable). Tables 6.12 and 6.13 show
the results for the two subgroups. We find that the share of younger and older consumers
who buy (or do not buy) diet chocolate is similar. Around 80% of obese consumers buy
diet chocolate independent of their age (Table 6.12), whereas only around 20% of non-
obese consumers buy diet chocolate independent of their age (Table 6.13).
Thus, obesity influences the buying of diet chocolate in our example. Obese consum-
ers buy relatively more diet chocolate than non-obese consumers. Since age is usually
related to obesity, it seems at first glance that age is associated with the buying of diet
chocolate.
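The mechanism can be made concrete with a small calculation. The buy rates of 80% (obese) and 20% (non-obese) come from Tables 6.12 and 6.13; the obesity shares per age group below are hypothetical values chosen only to show how an apparent age effect emerges in the aggregate table.

```python
# Confounding illustration: identical buying behavior within the obesity
# subgroups still produces an apparent age effect in the aggregate table,
# because the (hypothetical) obesity share differs between the age groups.
p_buy_obese, p_buy_nonobese = 0.80, 0.20  # as in Tables 6.12 / 6.13
p_obese_young, p_obese_old = 0.25, 0.60   # hypothetical obesity shares

buy_young = p_obese_young * p_buy_obese + (1 - p_obese_young) * p_buy_nonobese
buy_old = p_obese_old * p_buy_obese + (1 - p_obese_old) * p_buy_nonobese

print(round(buy_young, 2), round(buy_old, 2))  # → 0.35 0.56
```

Although buying depends only on obesity, the aggregate buy rates differ by age group, which is exactly the spurious association described above.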
Considering a third variable may also lead to a refinement of the association between
the focal variables, or may result in an association that was originally not observed (sup-
pressed association). Theoretical considerations may guide the researcher to determine
what additional variables may be worthwhile to consider.
The manager of a chocolate company wants to find out whether the sales of certain types
of chocolate are associated with the time of the year. To find an answer to the ques-
tion, the manager asks a manager of a large supermarket chain for help. The manager
of the chocolate company would like to use scanner data to answer the question. The
manager of the supermarket chain provides a dataset with 220 purchases of three types
of chocolate (i.e., milk, dark, yoghurt) throughout the year. Each purchase can be assigned
to a season (spring = March to May, summer = June to August, fall = September to
November, and winter = December to February). In a first step, the manager creates a
cross table (Table 6.14).
The cross table suggests that there might be an association between the variables ‘sea-
son’ and ‘type of chocolate purchased’ but to test whether the two variables are really
associated the manager conducts a contingency analysis with the help of SPSS.
The data can be entered in two formats: case wise (i.e., each row represents an observation) or frequency (i.e., each row
represents the cell counts of a cross table) data. Figure 6.2 shows the data editor for both
formats.
For the case wise data, the first column is the ID of the observation (i.e., purchase),
the second column indicates the season (1 = spring, 2 = summer, 3 = fall, and 4 = win-
ter), and the third column represents the preferred type of chocolate (1 = milk, 2 = dark,
and 3 = yoghurt).
If only frequency data are available, the columns again represent the variables but the
rows represent the different combinations of the categories. The third column contains
the cell counts (i.e., frequencies). To indicate the cell counts, we use the option ‘Weight
Cases’ which is available in the menu ‘Data’ in SPSS (cf. Fig. 6.3). We weight the cases
by the ‘frequency’ variable.
Yet in the following, we will use the case wise data to conduct the contingency anal-
ysis. Go to ‘Analyze/Descriptive Statistics/Crosstabs’ (cf. Fig. 6.4) and enter the variable
‘season’ in the field ‘Row(s)’ and the variable ‘type’ in the field ‘Column(s)’. The deci-
sion which variable is placed in the rows and which in the columns is arbitrary and has
no consequences for the subsequent analyses (cf. Fig. 6.5).
6.3.3 Results
First, the cross table is shown in the SPSS output window. In addition to the number of
observations in each cell (Count), the row (season), column (type), and total percentages
are displayed. The information on the cell counts is identical to the information given in
Table 6.14. The expected counts for each cell eij (Expected Count) and the differences
between the observed and expected count (Residual) are also displayed (cf. Fig. 6.10).
Moreover, the results of the chi-square test statistic (Pearson Chi-Square) are shown
(cf. Fig. 6.11). Next to the degrees of freedom (df), the significance is displayed. The chi-
square test is significant at the 5% level. We reject the null hypothesis of independence.
Thus, the chi-square test statistic shows that the variables ‘season’ and ‘type of chocolate
purchased’ are associated, and thus, dependent. The footnote in the SPSS table further
states that the minimum expected number count is equal to 7.36, indicating that each cell
has more than 5 observations.
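The expected counts and the Pearson chi-square statistic reported by SPSS can also be computed by hand. The sketch below uses small hypothetical counts (not the data of Table 6.14) to show the mechanics:

```python
# Expected counts e_ij and Pearson chi-square statistic for a cross table.
# The observed counts are hypothetical stand-ins, not the data of Table 6.14.
observed = [
    [19, 8, 14],
    [10, 16, 12],
]
n = sum(sum(row) for row in observed)
row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]

# Expected count under independence: e_ij = (row total * column total) / n.
expected = [
    [row_sums[i] * col_sums[j] / n for j in range(len(col_sums))]
    for i in range(len(row_sums))
]

chi_square = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(row_sums))
    for j in range(len(col_sums))
)
df = (len(row_sums) - 1) * (len(col_sums) - 1)
print(df)  # → 2 (degrees of freedom for a 2x3 table)
```

Comparing `chi_square` against the critical value of the chi-square distribution with `df` degrees of freedom (or letting SPSS report the p-value) then decides the independence hypothesis.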
Furthermore, the so-called Likelihood Ratio and Linear-by-Linear Association are displayed.
For large sample sizes, the likelihood ratio test is similar to the chi-square test
and will not be discussed in this context. The Linear-by-Linear Association test is not
applicable to nominal variables and will be ignored (cf. Bishop et al., 2007). Since we
observe a 4 × 3 cross table, the Yates corrected chi-square (Continuity Correction) and
Fisher’s Exact Test are not computed.
Figure 6.12 shows the values for the measures that assess the strength of association
(phi, Cramer’s V, and Contingency Coefficient). Since the variables under consideration
have four and three levels, the values for phi and Cramer’s V are not identical. Assuming
a 5% significance level, the (approximate) significance for all three measures suggests
that we have to reject the null hypothesis which states that a specific measure equals
zero. This is not surprising since we already know that there is an association between
the two variables. The rule of thumb regarding the value of phi further suggests that the
association is more than trivial.
Figure 6.13 shows Goodman and Kruskal’s lambda and tau coefficients. First, the
three lambda coefficients are reported (symmetric lambda coefficient, lambda coefficient
for ‘season’ and ‘type’). Based on the asymptotic standard error of the statistics, we
can compute the approximate t-value and error probability (approximate significance).
Assuming a 5% significance level, the asymmetric lambda coefficient for ‘type’ is not
significant, indicating that knowledge about the season does not significantly improve
the prediction of the type of chocolate purchased. Moreover, the symmetric lambda
coefficient is not significant. The next few rows in Fig. 6.13 contain information about
Goodman and Kruskal’s tau for ‘season’ and ‘type’, and both coefficients are significant.
Overall, we have learnt that the season and type of chocolate purchased are associated
(i.e., dependent). The strength of association is significant and ‘more than trivial’ when
we consider the chi-square related measures. Yet, the measures related to probability
considerations are partly insignificant, leading to mixed findings.
In the case study, the contingency analysis was performed using the SPSS graphical user
interface (GUI). Alternatively, we can use the SPSS syntax which is a programming lan-
guage unique to SPSS. Each option we activate in SPSS’s GUI is translated into SPSS
syntax. If you click on ‘Paste’ in the main dialog box shown in Fig. 6.5, a new window
opens with the corresponding SPSS syntax. However, you can also use the SPSS syntax
and write the commands yourself. Using the SPSS syntax can be advantageous if you
want to repeat an analysis multiple times (e.g., testing different model specifications).
Fig. 6.13 Contingency analysis in SPSS: Strength of association based on probability considerations
BEGIN DATA
1 1 19
1 2 8
1 3 14
------
4 3 5
* Enter all data.
END DATA.
Fig. 6.14 SPSS syntax for conducting a contingency analysis based on frequency data
Figure 6.14 shows the syntax for the chocolate example when using information about
frequency data. The syntax does not refer to an existing data file of SPSS (*.sav); rather
we enter the data with the help of the syntax editor (BEGIN DATA… END DATA).
3. The proportion of cells with expected counts fewer than five should not exceed 20%
(rule-of-thumb). None of the expected counts should be less than one. Combining several
categories of a variable to meet these requirements should be carefully considered.
4. The chi-square test is valid if there are more than 60 observations. For samples with
20 to 60 observations, use the Yates correction. For sample sizes smaller than 20,
Fisher’s Exact test is appropriate. Yet, in such cases it is recommended to increase the
sample size to improve the robustness of the analysis.
5. The phi coefficient, contingency coefficient, Cramer’s V, Goodman and Kruskal’s
lambda and tau coefficients can be used to assess the strength of association. A mean-
ingful interpretation of the different coefficients requires knowledge about the mini-
mum and maximum value of these measures.
Contingency analysis examines the relation between two or more categorical variables
(i.e., multidimensional contingency tables). In this chapter, we focused on the simple
case of two variables. Although contingency analysis can also be used for more than
two variables, the application becomes a bit cumbersome. As an alternative to a
multidimensional contingency analysis, we can use log-linear models (cf. Agresti, 2019,
pp. 204–243; Everitt, 1992, pp. 80–107; see also the SPSS procedures HILOGLINEAR;
LOGLINEAR). Log-linear models use a model representation similar to a regression
model (cf. Chap. 2) and assess the influence of several nominal independent variables on a
nominal dependent variable. Ultimately, log-linear models provide information on whether
there is a significant relation between the independent and dependent variables. Log-linear
models assess the strength of the relationships based on the estimated coefficients.
Another method for the analysis of more than two categorical variables is correspond-
ence analysis. Correspondence analysis is a method for the visualization of cross tables.
Correspondence analysis thus serves to illustrate complex relations and can be classified
as a structure-discovering methodology. It is related to factor analysis (cf. Chap. 7).
Log-linear models and correspondence analysis are more complex than contingency
analysis but provide additional insights such as considering explicitly the multivariate
structure of the data and visualization of relations. Yet, contingency analysis as discussed
in this chapter is rather popular in practice. This is due to two reasons: it is easy to con-
duct, and cross tables are easy to interpret and understand. Moreover, several cross tables
provide a more comprehensive picture than more complex analyses which require a good
understanding of the methodology.
References
Further Reading
Fienberg, S. (2007). The analysis of cross-classified categorical data (2nd ed.). Springer.
Kateri, M. (2014). Contingency table analysis – methods and implementation using R. Birkhäuser.
Sirkin, R. M. (2005). Statistics for the social sciences (3rd ed.). SAGE Publications.
Wickens, T. (1989). Multiway contingency tables analysis for the social sciences. Psychology
Press.
7 Factor Analysis

7.1 Problem
Factor analysis is a multivariate method that can be used for analyzing large data sets
with two main goals:
1 Correlations form the basis of factor analysis. For readers who are not sufficiently familiar with
the term, the concept of correlations is explained in detail in Sect. 1.2.2.
[Flowchart: if the correlation matrix R is not suitable, no factor analysis is conducted; if the requirements are fulfilled, the factor analysis is run with the following steps:
• extract factors
• determine number of factors
• interpret factors
• compute factor scores]
Fig. 7.1 Process for testing the suitability of a correlation matrix R for factor analysis
Example
A high correlation (e.g., r_{x1,x2} = 0.9) was found between the advertising expenditure
for a product (x1) and its sales volume (x2). This may be interpreted in three ways:
[Figure: three interpretations of the correlation — case 1: x1 influences x2; case 2: x2 influences x1; case 3: a common factor F underlies both x1 and x2.]
While causal relations (cases 1 and 2) are assumed by the dependency methods of
multivariate data analysis (e.g. regression analysis, variance analysis, or discriminant
analysis), factor analysis is based on case 3. Data from an empirical survey are there-
fore only suitable for factor analysis if the user has reliable information that case 3
represents the relevant interpretation in this specific application case. Only if this is
true should a factor analysis be performed.
Therefore, to perform a factor analysis, the first step is to evaluate the suitability
of the data. The empirical suitability is given if the variables in a data set show suf-
ficiently high correlations. High correlations are the necessary prerequisite for factor
analysis, since without correlations no explanation of correlations is possible. Several
instruments are available for testing this prerequisite.
In practical applications, a large number of variables is usually considered, but not
all of them need to be highly correlated with each other. In the second step, the user
must therefore decide how many factors are to be extracted. To facilitate this decision,
factor analysis provides various statistical criteria. There are also various algorithms
for the subsequent extraction of the factors (e.g. principal component analysis (PCA),
principal axis analysis (PAA)). In addition to the number of factors, the user must also
decide on an extraction method.
The extraction procedures result in an assignment of the originally considered var-
iables to the extracted factors. The strength of these assignments is measured by the
correlations between the original variables and the factors, called factor loadings. The
factor loadings are an important reference point for the user to interpret the factors
logically. This interpretation is not trivial and is of fundamental importance, since it
is here that the user decides which substantive reason is represented by a factor. The
interpretation of the factors is therefore to be regarded as a separate third step in fac-
tor analysis, since it is of great importance for deriving consequences from the results
of a factor analysis.
The fourth and final step concerns the question of how the factors are assessed by the
persons who originally rated the investigated attributes. These (fictitious) assess-
ment values per person are called factor scores. They are derived from the assess-
ments of the initial variables. ◄
7.2 Procedure
As described above, factor analysis comprises the four steps shown in Fig. 7.3: The first
step when conducting a factor analysis is to evaluate the suitability of the data. We ulti-
mately combine variables that are highly correlated, and therefore, suitability of the data
is evaluated by the correlations among the variables. In the second step, we extract the
factors and decide on the number of factors to extract. In the third step, we interpret the
derived factors. Finally, we compute factor scores, i.e., evaluation values for the factors
instead of the attributes.
In the following, the four steps of a factor analysis are explained in detail using the fol-
lowing example.
The manager of a chocolate company wants to know how the various chocolate fla-
vors are perceived by his customers. For this purpose, 30 test persons are asked to
evaluate the chocolate flavors according to five attributes (milky, melting, artificial,
fruity and refreshing) on a 7-point scale (from 1 = ‘low’ to 7 = ‘high’). The results of
the survey are shown in Table 7.2.
With the help of this data set, the manager now wants to check whether there are
any correlations between the different perceptions of the five attributes and whether
these can be condensed to common causes (factors). ◄
A first glance at the data matrix of the application example shows that the ratings of
the variables ‘fruity’ and ‘refreshing’ show a similar pattern, i.e. higher (lower) values
of ‘fruity’ always occur with higher (lower) values of ‘refreshing’. This pattern indicates
that these two variables are probably positively correlated.2
To get more insights into the interrelations among the variables, we compute the
correlation matrix (Table 7.3). In the application example, the correlation matrix is a 5 × 5
symmetric matrix and displays the correlations between the variables. We observe
the highest correlation of 0.983 between the variables ‘fruity’ and ‘refreshing’, which
confirms the assumption derived from Table 7.2 that these two variables are positively
correlated. Thus, these two variables seem to ‘belong together’, and we may com-
bine them into one factor since they encompass similar information. Moreover, we
learn that the variables ‘milky’ and ‘artificial’ are also highly and positively correlated.

[Fig. 7.4: the variables x1 (milky), x2 (artificial), and x3 (melting) point to factor 1; the variables x4 (fruity) and x5 (refreshing) point to factor 2.]
The low correlations in Table 7.3 suggest that not all variables can be combined and that
the underlying structure is more complex.
At this point, we would like to mention that high negative correlations can also indi-
cate variables that ‘belong together’.
In a factor analysis, we interpret the correlations between two variables as being
dependent on a third (unobserved) common factor (Child, 2006, p. 21). In our example,
we seem to have two groups of variables that can be combined to two factors, F1 and F2
(Fig. 7.4). How these factors can be interpreted will be discussed in Sect. 7.2.3.
The calculation of the correlation matrix can be simplified if the variables in a data set
are standardized beforehand.3 Standardization has the advantage that variables that are
measured on different dimensions are comparable (see Sect. 7.2.4). The standardization
of the variables has no influence on the correlation matrix itself (Table 7.3), it just sim-
plifies the calculation of the correlation matrix R, and the following applies:
3 Standardized variables have an average of 0 and a variance of 1. For the standardization of varia-
bles see also Sect. 1.2.1.
$$R = \frac{1}{N-1} \cdot Z' \cdot Z \tag{7.1}$$
with
Z = (N × J) matrix of the standardized variables
N = number of observations
If standardized variables are used, the variance-covariance matrix and the correlation
matrix are identical, since the variance of standardized variables is 1 and is reported on
the main diagonal of the variance-covariance matrix. The lower triangular matrix shows
the covariances (cov) that correspond to the correlations for standardized variables. For
the standardized variables, the following applies:
cov(x1, x2) = r_{x1,x2}.
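The relationship in Eq. (7.1) is easy to verify numerically. The tiny data set below is made up for illustration; after standardization, Z′Z/(N − 1) has ones on the diagonal and the bivariate correlations off the diagonal:

```python
# Eq. (7.1): for standardized variables, R = Z'Z / (N - 1).
import math

x1 = [1.0, 2.0, 3.0, 4.0]  # made-up data, N = 4 observations
x2 = [2.0, 1.0, 4.0, 5.0]
N = len(x1)

def standardize(x):
    mean = sum(x) / len(x)
    # sample standard deviation (divisor N - 1), matching Eq. (7.1)
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (len(x) - 1))
    return [(v - mean) / sd for v in x]

z1, z2 = standardize(x1), standardize(x2)

# Off-diagonal element of Z'Z / (N - 1): the correlation, which equals the
# covariance of the standardized variables (cov(z1, z2) = r).
r12 = sum(a * b for a, b in zip(z1, z2)) / (N - 1)

# Diagonal element: the variance of a standardized variable is 1.
var_z1 = sum(a * a for a in z1) / (N - 1)
print(round(var_z1, 6), round(r12, 4))  # → 1.0 0.8485
```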
While high correlations among variables are a prerequisite for factor analysis, the ques-
tion arises whether the correlations observed in our application example are actually
‘high enough’. Several measures, all of them based on the correlation matrix, exist to
further evaluate the suitability of the data.
The Bartlett test of sphericity tests the hypothesis that the sample originates from a
population in which the variables are uncorrelated (Dziuban & Shirkey, 1974, p. 358).
If the variables are uncorrelated, the correlation matrix is an identity matrix (R = I),
and the data are thus unsuitable for a factor analysis.
The null hypothesis of the Bartlett test is, therefore, equivalent to the question
whether the correlation matrix deviates from an identity matrix only by chance.
Factor analysis requires metric variables, and thus the Bartlett test assumes that the
variables in the sample follow a normal distribution. The following test variable is used:4
$$\text{Chi-Square} = -\left(N - 1 - \frac{2J + 5}{6}\right) \cdot \ln(|R|) \tag{7.2}$$
4 For a brief summary of the basics of statistical testing see Sect. 1.3.
with
N = sample size
J = number of variables
|R| = determinant of the correlation matrix R
The test variable is approximately chi-square distributed, with J(J − 1)/2 df (degrees
of freedom). This means that the value of the test variable is highly influenced by the
sample size. In the application example, the Bartlett test results in a chi-square value of5
$$\text{Chi-Square} = -\left(30 - 1 - \frac{2 \cdot 5 + 5}{6}\right) \cdot \ln(0.00096) = 184.13$$
The degrees of freedom in the example are (5 · (5 − 1))/2 = 10, leading to a significance level of p = 0.000. Thus, with an error probability of almost zero we can assume that the correlation matrix is different from an identity matrix and thus suitable for factor analysis. ◄
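The Bartlett test statistic of the example can be reproduced in a few lines; the determinant 0.00096 is taken from the footnote:

```python
# Bartlett test of sphericity for the application example (Eq. 7.2).
import math

N = 30           # number of observations
J = 5            # number of variables
det_R = 0.00096  # determinant of the correlation matrix (footnote 5)

chi_square = -(N - 1 - (2 * J + 5) / 6) * math.log(det_R)
df = J * (J - 1) // 2

print(round(chi_square, 1), df)  # → 184.1 10
```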
Another criterion for assessing the suitability of a correlation matrix is the Kaiser–
Meyer–Olkin (KMO) criterion. The KMO criterion considers the bivariate and partial
correlation and is computed in the following way:
$$KMO = \frac{\sum\limits_{j} \sum\limits_{j' \neq j} r_{x_j x_{j'}}^2}{\sum\limits_{j} \sum\limits_{j' \neq j} r_{x_j x_{j'}}^2 + \sum\limits_{j} \sum\limits_{j' \neq j} \text{partial\_}r_{x_j x_{j'}}^2} \tag{7.3}$$
with
r_{xj xj′} = correlation between the variables xj and xj′
partial_r_{xj xj′} = partial correlation between the variables xj and xj′
A partial correlation is the degree of association between two variables after exclud-
ing the influences of all other variables. If partial correlations are rather small, the
variables share a common variance which may be caused by an underlying com-
mon factor. Thus, small partial correlations result in rather high values for KMO
5 The determinant of the correlation matrix in the application example is det = 0.001. Furthermore,
ln (0.00096) = −6.9485.
(cf. Eq. 7.3); the KMO criterion approaches a value close to 1 if all partial correla-
tions are close to zero. The KMO criterion equals 1 if all partial correlations are 0.
Thus, a value of 1 is the maximum value this criterion may reach. The KMO criterion
is equal to 0.5 if the correlation matrix equals the partial correlation matrix, which is
not desirable (Cureton & D’Agostino, 1993, p. 389). Thus, the value of KMO should
be larger than 0.5, and we would like to observe a value close to 1. Table 7.4 provides
indications for the interpretation of the KMO value according to Kaiser and Rice
(1974, p. 111).
In our example, KMO equals 0.576, which is above the threshold of 0.5 but rather small
and thus not really convincing. ◄
While the KMO criterion evaluates the suitability of all variables together, the
so-called measure of sampling adequacy (MSA) assesses the suitability of single
variables:
$$MSA_j = \frac{\sum\limits_{j' \neq j} r_{x_j x_{j'}}^2}{\sum\limits_{j' \neq j} r_{x_j x_{j'}}^2 + \sum\limits_{j' \neq j} \text{partial\_}r_{x_j x_{j'}}^2} \tag{7.4}$$
with r_{xj xj′} and partial_r_{xj xj′} defined as in Eq. (7.3).
The values for the MSA range between 0 and 1. Again, we would like to observe a
value that is greater than 0.5 and close to 1. The assessment in Table 7.4 is also valid
for evaluating MSA values.
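For three variables, the partial correlation between two variables controlling for the third has a simple closed form, which makes Eqs. (7.3) and (7.4) easy to demonstrate. The correlations below are hypothetical and do not come from the chocolate example:

```python
# KMO (Eq. 7.3) and MSA (Eq. 7.4) for three variables with hypothetical
# correlations. For J = 3, 'all other variables' is a single variable, so
# the first-order partial correlation formula applies directly.
import math

r12, r13, r23 = 0.70, 0.60, 0.65  # hypothetical bivariate correlations

def partial(r_ij, r_ik, r_jk):
    # Correlation of x_i and x_j after excluding the influence of x_k.
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

p12 = partial(r12, r13, r23)
p13 = partial(r13, r12, r23)
p23 = partial(r23, r12, r13)

sum_r2 = r12 ** 2 + r13 ** 2 + r23 ** 2
sum_p2 = p12 ** 2 + p13 ** 2 + p23 ** 2
kmo = sum_r2 / (sum_r2 + sum_p2)  # Eq. (7.3)

# MSA for x1: restrict both sums to the pairs involving x1 (Eq. 7.4).
msa_1 = (r12 ** 2 + r13 ** 2) / (r12 ** 2 + r13 ** 2 + p12 ** 2 + p13 ** 2)

print(round(kmo, 2))  # → 0.72
```

Because the partial correlations here are clearly smaller than the bivariate ones, KMO lies well above 0.5, as the text describes.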
Table 7.5 shows the MSA values for the five variables in our application example.
The variables ‘fruity’ and ‘refreshing’ have MSA values below 0.5. We may consider
eliminating these two variables from further analyses since they do not seem to be
sufficiently correlated with the other variables. Yet for illustrative purposes, we decide
to keep the two variables and continue with all five variables in our example. ◄
Conclusion
To assess the suitability of the data for factor analysis, different criteria can be used but
none of them is superior (Table 7.7). This is due to the fact that all criteria use the same
information to assess the suitability of the data. Therefore, we have to carefully evaluate
the different criteria to get a good understanding of the data.
For our example, it can be concluded that the initial data are only ‘moderately’ suita-
ble for a factor analysis. We will now continue with the extraction of factors to illustrate
the basic idea of factor analysis.
While the previous considerations referred to the suitability of initial data for a factor
analysis, in the following we will explore the question of how factors can actually be
extracted from a data set with highly correlated variables. To illustrate the correlations,
we first show how correlations between variables can also be visualized graphically by
vectors. The graphical interpretation of correlations helps to illustrate the fundamental
theorem of factor analysis and thus the basic principle of factor extraction. Building
on these considerations, various mathematical methods for factor extraction are then
presented, with an emphasis on principal component analysis and the factor-analytical
procedure of principal axis analysis. These considerations then lead to the question
of starting points for determining the number of factors to be extracted in a concrete
application.
[Figure: two vectors of standardized length 1 enclosing a 60° angle; A, B, and D mark the relevant points.]
two variables equals 0.5. With an angle of 60°, the length of AD is equal to 0.5 which is the
cosine of a 60° angle. The cosine is the quotient of the adjacent leg and the hypotenuse (i.e.,
AD/AC ). Since AC is equal to 1, the correlation coefficient is equal to the distance AD.
The above relationship is illustrated by a second example with three variables and the
following correlation matrix R.
$$R = \begin{pmatrix} 1 & & \\ 0.8660 & 1 & \\ 0.1736 & 0.6428 & 1 \end{pmatrix} \quad \text{which is equal to} \quad \begin{pmatrix} 0^\circ & & \\ 30^\circ & 0^\circ & \\ 80^\circ & 50^\circ & 0^\circ \end{pmatrix} \quad ◄$$
In example 2, we have chosen the correlations in such a way that a graphical illustration
in a two-dimensional space is possible. Figure 7.6 graphically illustrates the relationships
between the three variables in the present example. Generally we can state: the smaller
the angle, the higher the correlation between two variables.
The more variables we consider, the more dimensions we need to position the vectors
with their corresponding angles to each other.
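The angle–correlation correspondence of the three-variable example can be checked directly, since the correlation equals the cosine of the angle between the vectors:

```python
# Correlations as cosines of the angles between variable vectors
# (values from the three-variable example above).
import math

r_12 = math.cos(math.radians(30))  # angle between x1 and x2
r_23 = math.cos(math.radians(50))  # angle between x2 and x3
r_13 = math.cos(math.radians(80))  # angle between x1 and x3

print(round(r_12, 4), round(r_23, 4), round(r_13, 4))  # → 0.866 0.6428 0.1736
```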
[Vectors x1, x2, and x3 with angles of 30° between x1 and x2, 50° between x2 and x3, and 80° between x1 and x3.]
Fig. 7.6 Graphical representation of the correlation matrix with three variables
[Vectors x1 and x2, each enclosing a 30° angle with their resultant.]
Fig. 7.7 Factor extraction for two variables with a correlation of 0.5
with
z_nj = standardized value of variable j for observation n
a_jq = factor loading of variable j on factor q
p_nq = factor score of observation n on factor q
The factor loadings ajq indicate how strongly a factor is related to an initial variable.
Statistically, factor loadings therefore correspond to the correlation between an observed
variable and the extracted factor, which was not observed. As such, factor loadings are a
measure of the relationship between a variable and a factor.
We can express Eq. (7.5) in matrix notation:
Z = P · A′ (7.6)
The matrix of the standardized data Z has the dimension (N × J), where N is the number
of observations (cases) and J equals the number of variables. We observe the standardized
data matrix Z, while the matrices P and A are unknown and need to be determined. Here,
P reflects the matrix of the factor scores and A is the factor loading matrix.
In Eq. (7.1) we showed that the correlation matrix R can be derived from the stand-
ardized variables. When we substitute Z by Eq. (7.6), we get:
$$R = \frac{1}{N-1} \cdot Z' \cdot Z = \frac{1}{N-1} \cdot (P \cdot A')' \cdot (P \cdot A') = \frac{1}{N-1} \cdot A \cdot P' \cdot P \cdot A' = A \cdot \left(\frac{1}{N-1} \cdot P' \cdot P\right) \cdot A' \tag{7.7}$$
6 Ifa variable xj is transformed into a standardized variable zj, the mean value of zj = 0 and the var-
iance of zj = 1. This results in a considerable simplification in the representation of the following
relationships. See the explanations on standardization in Sect. 1.2.1.
The relationship expressed in Eq. (7.7) is called the fundamental theorem of factor analysis,
which states that the correlation matrix of the initial data can be reproduced by the
factor loading matrix A and the correlation matrix of the factors C:

$$R = A \cdot C \cdot A' \quad \text{with} \quad C = \frac{1}{N-1} \cdot P' \cdot P \tag{7.8}$$
Generally, factor analysis assumes that the extracted factors are uncorrelated. Thus, C
corresponds to an identity matrix. The multiplication of a matrix with an identity matrix
results in the initial matrix, and therefore Eq. (7.7) may be simplified to:
R = A · A′ (7.9)
Assuming independent (uncorrelated) factors, the empirical correlation matrix can be
reproduced by the factor loadings matrix A.
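This can be verified with the two-variable example used earlier (two variables correlated at 0.5, each enclosing a 30° angle with the first factor, cf. Figs. 7.7 and 7.10; the sign of the second loading follows from the geometry of Fig. 7.10). With the loading matrix A built from the cosines, A · A′ reproduces the correlation matrix exactly:

```python
# Fundamental theorem with uncorrelated factors (Eq. 7.9): R = A * A'.
# Loadings from the two-variable example: both variables load cos(30°) on
# factor 1; on the orthogonal factor 2 they load +cos(60°) and -cos(60°).
import math

A = [
    [math.cos(math.radians(30)), math.cos(math.radians(60))],   # x1
    [math.cos(math.radians(30)), -math.cos(math.radians(60))],  # x2
]

# Reproduced correlation: R_jk = sum over factors q of a_jq * a_kq.
R = [[sum(A[j][q] * A[k][q] for q in range(2)) for k in range(2)]
     for j in range(2)]

print(round(R[0][0], 3), round(R[0][1], 3))  # → 1.0 0.5
```

The diagonal elements equal 1 (the full standardized variance of each variable is reproduced) and the off-diagonal element equals the original correlation of 0.5.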
We now observe five variables, and the correlations are chosen in such a way that
we can depict the interrelations in two-dimensional space—which will hardly be the
case in reality.7 Table 7.8 shows the correlation matrix for the example, with the upper
triangular matrix containing the angle specifications belonging to the correlations. ◄
7 Please note that example 3 does not correspond to the application example in Sect. 7.2.1.
[Figs. 7.8 and 7.9: the five variables x1 to x5 represented as vectors with angles of 10°, 20°, and 60° between them; the first factor F1 is positioned at an angle of 45°12′, and the second factor F2 is orthogonal to F1.]
We can derive the factor loadings with the help of the angles between the variables
and the vector of the first factor. For example, the angle between the first factor and x1
equals 55°12′ (= 45°12′ + 10°), which corresponds to a factor loading of 0.571. Table 7.9
shows the factor loadings for all five variables.
Since factor analysis searches for factors that are independent (uncorrelated), a sec-
ond factor should be orthogonal to the first factor (Fig. 7.9). Table 7.10 shows the fac-
tor loadings for the corresponding second factor (factor 2). The negative factor loadings
of x1 and x2 indicate that the respective factor 2 is negatively correlated with the corre-
sponding variables.
If the extracted factors fully explained the variance of the observed variables, the sum
of the squared factor loadings for each variable would be equal to 1 (so-called unit vari-
ance). This relationship can be explained as follows:
1. By standardizing the initial variables, we end up with a mean of 0 and a standard devia-
tion of 1. Since the variance is the squared standard deviation, the variance also equals 1:
s_j² = 1
2. The variance of each standardized variable j is the main diagonal element of the correla-
tion matrix (variance-covariance-matrix) and it is the correlation of a variable with itself:
s_j² = 1 = r_jj.
3. If the factors completely reproduce the variance of the initial standardized variables,
the sum of the squared factor loadings will be 1.
[Figure: vector x1 encloses a 30° angle with resultant 1 (factor 1) and a 60° angle with resultant 2 (factor 2); vector x2 lies at 30° on the other side of resultant 1; the projections are marked by the points A, B, C, and D.]
Fig. 7.10 Graphical presentation of the case in which all variances in the variables are explained
To illustrate this, let us take an example where two variables are reproduced by two fac-
tors (Fig. 7.10). The factor loadings are the cosine of the angles between the vectors
reflecting the variables and factors. For x1, the factor loadings are 0.866 (=cos 30°) for
factor 1 and 0.5 (=cos 60°) for factor 2. The sum of the squared factor loadings is 1
(=0.8662 + 0.52). According to Fig. 7.10, we can express the factor loadings of x1 on the
factors 1 and 2 as follows:
OC/OA for factor 1 and OD/OA
for factor 2. If the two factors completely reproduce the standardized variance of the ini-
tial variables, the following relation has to be true:
(OC/OA)² + (OD/OA)² = 1,
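The geometric relation above can be checked numerically. A minimal sketch using the angles of Fig. 7.10 (30° to factor 1 and 60° to factor 2 for x1):

```python
import math

# Factor loadings of x1 are the cosines of the angles between the
# variable vector and the two orthogonal factor axes (Fig. 7.10).
a_f1 = math.cos(math.radians(30))  # loading on factor 1
a_f2 = math.cos(math.radians(60))  # loading on factor 2

# If the two factors fully reproduce the standardized variance,
# the squared loadings sum to the unit variance of 1.
unit_variance = a_f1**2 + a_f2**2
print(round(a_f1, 3), round(a_f2, 3), round(unit_variance, 3))  # 0.866 0.5 1.0
```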
The factor loadings represent the model parameters of the factor-analytical model which
can be used to calculate the so-called model-theoretical (reproduced) correlation matrix
(R̂). The parameters (factor loadings ajq) must now be determined in such a way that the
difference between the empirical correlation matrix (R) and the model-theoretical cor-
relation matrix (R̂), which is calculated with the derived factor loadings, is as small as
possible (cf. Loehlin, 2004, p. 160). The objective function is therefore:
F = R − R̂ → Min.!
For objective 1 we use principal component analysis (PCA). We look for a small number
of factors (principal components) which preserve a maximum of the variance (informa-
tion) contained in the variables. Of course, this requires a trade-off between the smallest
possible number of factors and a minimal loss of information. If we extract all possible
components, the fundamental theorem shown in Eq. (7.8) applies:
R = A · A′
with
R correlation matrix
A factor-loading matrix
A′ transposed factor-loading matrix
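Equation (7.8) can be sketched numerically: if all components are extracted, the loadings obtained from the eigendecomposition of R reproduce R exactly. The 3 × 3 correlation matrix below is illustrative, not the data of the running example:

```python
import numpy as np

# Illustrative correlation matrix (not the book's data).
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.4],
              [0.3, 0.4, 1.0]])

# PCA loadings from the eigendecomposition of R: each column of A is an
# eigenvector scaled by the square root of its eigenvalue.
eigvals, eigvecs = np.linalg.eigh(R)   # returned in ascending order
order = np.argsort(eigvals)[::-1]      # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
A = eigvecs * np.sqrt(eigvals)         # factor-loading matrix

# With all components extracted, the fundamental theorem R = A · A' holds.
print(np.allclose(R, A @ A.T))         # True
```

Dropping the smallest eigenvalue columns of A is exactly the trade-off described above: fewer components at the cost of an imperfectly reproduced R.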
For objective 2, we use factor analysis (FA) defined in a narrower sense. The factors are
interpreted as the causes of the observed variables and their correlations. In this case,
it is assumed that the factors do not explain all the variance in the variables. Thus, the
correlation matrix cannot be completely reproduced by the factor loadings and the funda-
mental theorem is transformed to:
R = A · A′ + U (7.11)
where U is a diagonal matrix that contains unique variances of the variables that cannot
be explained by the factors.8
While principal component analysis (objective 1) pursues a more pragmatic purpose
(data reduction), factor analysis (objective 2) is used in a more theoretical context (find-
ing and investigating hypotheses). So, many researchers strictly separate between princi-
pal component analysis and factor analysis and treat PCA as a procedure independent of
FA, and indeed, PCA and FA are based on fundamentally different theoretical models.
But both approaches follow the same steps (cf. Fig. 7.3) and use the same mathematical
methods. Besides, they usually also provide very similar results. For these reasons, PCA
is listed in many statistical programs as the default extraction procedure in the context of
factor analysis (as it is in SPSS).
8 Standardized variables with a unit variance of 1 are assumed. For the decomposition of the vari-
ance of an output variable, see also the explanations in Sect. 7.2.2.4.2 and especially Fig. 7.13.
[Figure: scatter plot of two standardized variables Z1 and Z2 (axes from −3.0 to 3.0) with the first and second principal components (PC_1, PC_2) drawn as new, rotated axes.]
Example
Let us return to our example in Sect. 7.2.1 with the correlation matrix shown in Table 7.3.9
The upper part of Table 7.11 shows the factor loading matrix resulting from these data
if all five principal components (as many as there are variables) are extracted with PCA.
With these loadings, the correlation matrix can be reproduced according to Eq. (7.8).
9 Remember that we use standardized variables and therefore the variance of each variable is 1 and
the total variance in the data set is 5.
The lower part of Table 7.11 presents the squares of the component loadings (factor
loadings), a²jq. If these are summed over the rows (i.e., across the components), we get the
variance of a variable that is covered by the extracted components according to Eq. (7.9).
As the variance of a standardized variable is 1, and as all possible 5 components were
extracted (Q = J), the sum of the squared loadings for each variable equals 1. This sum is
called the communality of a variable j:
Communality of variable j:

h²j = ∑(q=1…Q) a²jq  (7.12)
The sum of the squared loadings over the variables yields the eigenvalue of a component q. The eigenvalue can be interpreted as a measure of the information contained in a
factor.10 The following applies:
Eigenvalue of component q:

λq = ∑(j=1…J) a²jq  (7.13)
The eigenvalue divided by the number of variables gives the eigenvalue share of a com-
ponent. Cumulated over the components, it tells us how much information is explained
by the extracted components.
Here we see that 91.6% of the information is preserved if only two principal com-
ponents are extracted. With three components we can even preserve 99% of the infor-
mation. But the third component has an eigenvalue of only 0.369 (7.4%). Thus it seems
justified to drop it and restrict the procedure to only two factors for parsimony.
This can be visualized by a scree plot as shown in Fig. 7.12. A scree plot is a line plot
of the eigenvalues over the number of extracted components or factors. The third eigen-
value—and all the following eigenvalues—account only for small amounts of information.
PCA chooses the first principal component in such a way as to account for the maximum possible amount of information contained in the variables. Then the second principal component is chosen so that it accounts for the maximum proportion of the remaining variance.
Table 7.12 Initial and extracted communalities of the first two principal components (Q = 2) in
the application example
Principal component analysis
Initial communalities Extracted communalities
Milky 1.000 0.931
Melting 1.000 0.736
Artificial 1.000 0.927
Fruity 1.000 0.993
Refreshing 1.000 0.992
For interpreting the principal components, the following question should be answered:
Which collective term can be found for the variables that load highly on a principal
component?11
PCA is also often used to ensure the independence of the variables used for expla-
nation, which is usually required in linear models. The assumptions of linearity and
independence are especially important for the dependency-analytical procedures of
multivariate data analysis such as regression analysis, discriminant analysis or logistic regression.
11 On the problem of interpreting factors or principal components, also compare Sect. 7.2.3 and the
case study in Sect. 7.3.3.4.
7.2.2.4.2 Factor-Analytical Approach
In contrast to PCA, factor analysis aims at a more theoretical objective and is interested
in uncovering the causes (factors) of the observed variables and their correlations. As
the variables usually cannot be measured without any error, they cannot be completely
explained by the factors.
Moreover, it is assumed in factor analysis that each observed variable also contains
a specific variance that cannot be explained by the common factors. Error variance and
specific variance together make up the unique variance (also called residual variance) of
a variable j that cannot be explained by the common factors. For a standardized variable, the unique variance equals 1 − communality.
Thus, in factor analysis the communalities of the variables cannot be 1, but are always
smaller than 1, even if all possible factors are extracted. Figure 7.13 illustrates how the
variance of a variable j is decomposed in the factor-analytical model.
According to Eq. (7.11), for the factor-analytical model the following applies:
R = A · A′ + U
The matrix U contains the unique variances which cannot be explained by the extracted
factors. It follows that, in contrast to PCA, the empirical correlation matrix R cannot be
completely reproduced by the factor loadings.
The empirical correlation matrix contains values of 1 on its diagonal (see Table 7.3).
These are the variances of the standardized variables. As factor analysis assumes that
the total variance cannot be explained by the common factors, the diagonal has to be
substituted by the communalities of the variables, which must be less than 1. This has to
be done before the extraction of factors. But the communalities are only known after the
extraction of factors. This is the communality problem in factor analysis.
[Fig. 7.13: Decomposition of the unit variance of a standardized variable j into the portion explained by all extracted factors (= communality) and the portion that cannot be explained by them (= 1 − communality).]
The cumulative explained percentage of the first two components is 91.6%,
while the first two factors explain only 88.363%.
As Table 7.13 shows, in the present example the results of PCA and PAF are rather
similar. Also, a scree plot of PAF will show no visible difference to the scree plot of
PCA.12 The only exception is the lower loading of the variable ‘melting’ on factor 1,
which results in a lower communality for ‘melting’ in factor analysis.
12 In contrast, the case study shows major differences between PCA and PAF (cf. Sect. 7.3.3.4).
13 In contrast to PAF, the aim of the ML, GLS and ULS methods is to determine the factor loadings
in such a way that the difference between the empirical correlation matrix (R) and the model-the-
oretical correlation matrix (R̂) is minimal. In alpha factorization, Cronbach’s alpha is maximized,
and image factorization is based on the image of a variable. The listed procedures are all imple-
mented in SPSS, with PCA included as a further extraction procedure (see Fig. 7.21).
In contrast to PCA, all other extraction procedures listed in Table 7.15 lead to a result
of communalities < 1, since all methods always take into account a residual variance per
variable. This means that a maximum of J – 1 factors can be extracted.
For the example in Sect. 7.2.2, Table 7.16 shows the initial values of the communalities
according to the above coefficient of determination of the variables as well as the result of
the iterative estimation after extracting the maximum number of factors using PAF.
In contrast to PCA (see Table 7.11), PAF leads to communalities < 1 even when the maximum number of factors is extracted. Moreover, the maximum number of extractable factors is not J (in the example: J = 5 variables) but only J − 1 (in the example: 4), since the individual residual factors (U) must be taken into account. PAF yields an explained variance of 89.567% in the application example when four factors are extracted, whereas PCA explains 100% of the variance of the standardized initial variables with five principal components.
Table 7.16 Communalities in the application example using PAF and the extraction of four
factors
Principal axis factoring
Initial communalities Extracted communalities
Milky 0.931 0.966
Melting 0.541 0.556
Artificial 0.929 0.961
Fruity 0.974 0.999
Refreshing 0.973 0.996
In summary, the estimation procedures of factor analysis search for factors that can be
regarded as the cause of the correlation between correlated variables. However, this means
that the correlation between two variables becomes zero when this cause (factor) is found.
Thus, the interpretation of an empirical correlation shown in Sect. 7.1 (Fig. 7.2), which is
assumed by factor analysis, becomes apparent: If, for example, the advertising expenditure
of a product (x1) and its sales (x2) are strongly correlated, factor analysis assumes that this
correlation cannot be taken as an indication of a dependency between variables x1 and x2 but
is due, for example, to a general price increase as the cause (factor) of the correlation. When
interpreting the factors, it should therefore be possible to answer the following question:
How can we describe the effect that causes the high loadings of variables on a factor?14
PCA should be preferred if the goal of factor analysis is data aggregation. If users do
not have specific a priori knowledge, they will not be able to divide the total variance
of a variable into a common and a single residual variance. In this case, the information
required for PAF cannot be provided. If, on the other hand, knowledge of the structure
14 Note that PCA, in contrast, requires a collective term for correlating variables. Cf. the presenta-
tion on the problem of the interpretation of factors in Sect. 7.2.3 and the notes on the case study in
Sect. 7.3.3.4.
of the total variance is available (e.g. information on the separation into common variance, specific variance and error variance), a PAF can be performed. Formally, there will always be a difference in the main diagonal of the correlation matrix: in PCA, the main diagonal can reach a value of 1, while in PAF only values of less than 1 are possible.
7.2.2.5 Number of Factors
In the above procedures, it was assumed that the user knows the number of factors to be
extracted and therefore knows
• how the conflict between the objectives of the PCA can be resolved or
• how many causes exist which, in the case of PAF, explain the correlations between the
variables.
The number of factors required can also be derived from the user’s idea regarding the
percentage of variance in a data set that should be explained by a factor analysis (e.g. at
least 90%).
If this is known, the user can specify the number of factors (principal components) to be
extracted. The factor loadings are then estimated according to the specified number of
factors. In all extraction procedures, the factors are always extracted in such a way that
the first factor combines a maximum of the variance of all variables. Accordingly, the
eigenvalue of the first factor is always the highest, followed by the eigenvalue of the sec-
ond factor, and so on. Thus, the eigenvalues always decrease successively.
If a user does not have a clear idea regarding the number of factors to be extracted,
the following auxiliary criteria can be used for decision-making.
Scree test
The so-called scree test also uses the initial eigenvalues as shown in Table 7.17 and dis-
plays them in a coordinate system. If a ‘kink’ or ‘elbow’ shows up in the sequence of
eigenvalues, this means that the difference between two eigenvalues is greatest there. It
is recommended to extract the number of factors to the left of the ‘elbow’. For the appli-
cation example, Fig. 7.14 shows that the largest difference in eigenvalues occurs between
numbers of factors 2 and 3. Therefore, two factors should be extracted according to the
scree test.
The basic idea of this approach is that the factors with the smallest eigenvalues are
considered ‘scree’ (i.e. unsuitable) and are therefore not extracted. However, the scree test does not always provide a clear solution and leaves room for subjective judgement.
For this reason, the Kaiser criterion is usually preferred and frequently used in empirical
studies.
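Both criteria are easy to automate. The eigenvalues below are illustrative values chosen to match the shares reported in the text (91.6% after two components, 99% after three), not the exact output of Table 7.17:

```python
# Illustrative eigenvalues for J = 5 standardized variables (total variance 5).
eigenvalues = [2.90, 1.68, 0.37, 0.03, 0.02]

# Kaiser criterion: keep components with an eigenvalue above 1.
kaiser_k = sum(ev > 1.0 for ev in eigenvalues)

# Scree/elbow heuristic: find the largest drop between successive
# eigenvalues and keep the components to the left of that 'elbow'.
drops = [eigenvalues[i] - eigenvalues[i + 1] for i in range(len(eigenvalues) - 1)]
elbow_k = drops.index(max(drops)) + 1

# Cumulative eigenvalue share (explained variance) of the first two components.
total = sum(eigenvalues)
share_2 = sum(eigenvalues[:2]) / total
print(kaiser_k, elbow_k, round(share_2, 3))  # 2 2 0.916
```

Here both heuristics agree on two factors, as in the application example; in practice they can diverge, which is why the text recommends weighing them together.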
Table 7.19 Reproduced correlation matrix (R̂) based on the factor loadings matrix
Milky Melting Artificial Fruity Refreshing
Milky 0.968*
Melting 0.712 0.526*
Artificial 0.960 0.705 0.953*
Fruity 0.110 0.127 0.085 0.991*
Refreshing 0.042 0.077 0.017 0.983 0.981*
*: Communalities of variables after extraction of two factors using PAF
The reproduced correlation matrix (Table 7.19) matches the
original correlation matrix in the application example very well. The two-factor solution is therefore suitable for describing the five initial variables without a large loss of
information.
Once we have decided on the number of factors, we need to interpret the extracted fac-
tors. We use the factor loadings matrix to do so since high factor loadings indicate that a
variable is strongly correlated with a factor. In the application example, factor 1 is related
to the variables ‘milky’, ‘melting’, and ‘artificial’. Since the factor loadings are the result
of principal axis factoring, we need to answer the question how to label the effect that
causes the high factor loadings.
In this case, all three variables seem to relate to a factor that could be called ‘texture’,
but also ‘unhealthy aspects’. The second factor is related to the variables ‘fruity’ and
‘refreshing’ and might thus be labeled as ‘taste experience’ or ‘aroma’. At this point, it
becomes clear that the interpretation of the factors requires a high level of expertise and
some creativity on the part of the user.
Cross-loadings
Interpreting the factor solution is also difficult if the factor loadings do not clearly
indicate to which factor a certain variable belongs. This is the case if a variable loads
highly (in absolute terms) on more than one factor. High correlations on multiple fac-
tors are called cross-loadings. Cross-loadings make factor interpretation more difficult. If
cross-loadings occur, we have to decide which factor the variable should be assigned to.
There is a rule of thumb that can help us make this decision: The absolute value
of a factor loading should be greater than 0.5 to be relevant for a factor. If a variable
has absolute factor loadings greater than 0.5 for several factors, the variable should be
assigned to each one of these factors. In this case, however, a meaningful interpretation
of the factors may not be possible.
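The 0.5 rule of thumb can be sketched as a simple assignment function. The variable names and loadings below are the rotated solution from Table 7.21:

```python
# Rotated loadings from Table 7.21, per variable: (loading on F1, loading on F2).
loadings = {
    "milky":      (0.984, 0.032),
    "melting":    (0.722, 0.070),
    "artificial": (0.976, 0.007),
    "fruity":     (0.080, 0.992),
    "refreshing": (0.011, 0.990),
}

THRESHOLD = 0.5  # rule of thumb: |loading| > 0.5 is relevant for a factor

def assign(loads):
    """Return the factors (1-based) a variable is assigned to;
    more than one entry signals a cross-loading."""
    return [q + 1 for q, a in enumerate(loads) if abs(a) > THRESHOLD]

assignment = {var: assign(l) for var, l in loadings.items()}
print(assignment)  # each variable loads on exactly one factor here
```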
Factor rotation
If the assignment of variables to factors is ambiguous, factor rotation may be employed,
i.e., the factor vectors are rotated. If the coordinate system in Fig. 7.9 that is represented by the factor vectors is rotated around its origin, we get, for example, Fig. 7.15.
The angles between x1 and F1 as well as x2 and F1 are substantially smaller, and, thus,
the factor loadings are increased. The same applies to the variables x3, x4, and x5 and
[Fig. 7.15: The rotated factor axes F1 and F2 among the variable vectors x1–x5; after rotation, F1 lies close to x1 and x2, and F2 close to x3, x4, and x5.]
their angles with factor 2 (F2), and consequently for the factor loadings. Ultimately, the
interpretation of the factors is considerably easier.
There are basically two different options for rotating the coordinate system and
hence the factor vectors:
1. We can assume that the factors remain uncorrelated with each other. The factor
vectors maintain a 90° angle to each other. We call this method orthogonal (rectangular) rotation. The most popular orthogonal rotation method is the so-called varimax method.
2. If, however, a correlation between the rotated axes or factors is assumed, the vectors
of the factors are rotated at an oblique angle (<90°) to each other. Such rotation meth-
ods are called oblique rotation. Statistical software packages such as SPSS offer vari-
ous oblique rotation methods (cf. Fig. 7.22).
Table 7.21 shows the unrotated and rotated factor loadings for the application example,
with the rotated factor loadings obtained via the varimax method. Now the factor loadings for the variables ‘milky’, ‘melting’, and ‘artificial’ on factor 1 (F1) have increased.
The same applies to the variables ‘fruity’ and ‘refreshing’ and their factor loadings on
factor 2 (F2). Despite a clear loading structure, factor interpretation is not easy. Readers
should try to find their own interpretation. We offer the following labels: ‘texture’ for F1
and ‘aroma’ for F2.
Table 7.21 Unrotated and rotated factor loadings for the application example (PAF)
Unrotated factor loadings Rotated factor loadings
F1 F2 F1 F2
Milky 0.943 −0.280 0.984 0.032
Melting 0.707 −0.162 0.722 0.070
Artificial 0.928 −0.302 0.976 0.007
Fruity 0.389 0.916 0.080 0.992
Refreshing 0.323 0.936 0.011 0.990
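Using the loadings from Table 7.21, one can verify numerically that varimax rotation changes the loadings but not the communalities (the row sums of squared loadings). A quick check:

```python
import numpy as np

# Unrotated and varimax-rotated loadings from Table 7.21 (rows: milky,
# melting, artificial, fruity, refreshing; columns: F1, F2).
A_unrot = np.array([[0.943, -0.280],
                    [0.707, -0.162],
                    [0.928, -0.302],
                    [0.389,  0.916],
                    [0.323,  0.936]])
A_rot = np.array([[0.984, 0.032],
                  [0.722, 0.070],
                  [0.976, 0.007],
                  [0.080, 0.992],
                  [0.011, 0.990]])

# Communality of each variable: sum of squared loadings over the factors.
h2_unrot = (A_unrot**2).sum(axis=1)
h2_rot = (A_rot**2).sum(axis=1)

# Rotation only turns the axes, so the communalities agree (up to rounding).
print(np.round(h2_unrot, 2))
print(np.round(h2_rot, 2))
```

Both rows come out as roughly (0.97, 0.53, 0.95, 0.99, 0.98), matching the communalities marked in Table 7.19.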
For many research questions it is of great interest not only to reduce the variables to a
smaller number of factors and to find labels for the factors, but also to determine how the
objects score on the factors.
Factor analysis aims to present the standardized initial data matrix Z as a linear
combination of factors: Z = P · A′. So far, we have focused on determining A, the fac-
tor loadings matrix. Since Z is given, we still need to determine the matrix of factor
scores, P. P contains the estimated ratings of the respondents with regard to the factors
found. P thus answers the following question: How would the respondents have rated the
factors if they had had the opportunity to do so?
To retrieve P, we start from Z = P · A′ and multiply both sides by A:

Z · A = P · A′ · A  (7.15)

The matrix (A′ · A) is by definition square and therefore invertible, so we can multiply both sides by (A′ · A)⁻¹:

Z · A · (A′ · A)⁻¹ = P · A′ · A · (A′ · A)⁻¹ = P · E  (7.16)

Because P · E = P, we obtain:

P = Z · A · (A′ · A)⁻¹  (7.17)
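The least-squares expression P = Z · A · (A′ · A)⁻¹ can be verified in a few lines. The data below are randomly generated for illustration (not the book's data), constructed so that Z = P · A′ holds exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))     # illustrative loading matrix (J = 5, Q = 2)
P_true = rng.standard_normal((30, 2))
Z = P_true @ A.T                    # data generated exactly as Z = P · A'

# Solving Z = P · A' for P: multiply by A, then by the inverse of (A'A),
# which is a small square (Q x Q) matrix and hence invertible.
P = Z @ A @ np.linalg.inv(A.T @ A)
print(np.allclose(P, P_true))       # True
```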
Surrogates
Using surrogates is the simplest (but in most cases not the best) way of determining factor scores. When using this approach, we take the highest loading of a variable on a factor as a surrogate for its factor score. Yet, surrogates are only rough proxies for factor
scores since we assume that a single variable represents the latent factor. In the appli-
cation example, using surrogates would mean that the highest loading on each factor is
taken as a surrogate (cf. Table 7.21). Consequently, ‘milky’ would represent factor 1 (F1)
and ‘fruity’ would represent factor 2 (F2), with the rotated factor loadings of 0.984 and
0.992, respectively. However, surrogates are only appropriate if the highest factor load-
ing is a dominant value, i.e., the factor loadings of all other variables should be sub-
stantially lower than the surrogate’s factor loading. For factor 1, we observe the second
highest loading for ‘artificial’ with a value of 0.928. For factor 2, the second highest
loading is 0.990 for ‘refreshing’ (cf. Table 7.21). Both these factor loadings are close to
the highest factor loading, and thus using surrogates does not seem appropriate in this
case.
Summated scales
Alternatively, we can use summated scales. Summated scales combine different varia-
bles measuring the same concept into a single construct. They are calculated by taking
the mean of the high-loading variables of each factor. In our case, with two factors, we
expect two summated scales. The first scale can be derived directly from Table 7.21.
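As a quick sketch of the computation: the ratings below are hypothetical values for one respondent, and the item groupings follow the rotated loadings in Table 7.21:

```python
# Hypothetical standardized ratings of one respondent (illustration only).
ratings = {"milky": 1.2, "melting": 0.4, "artificial": 0.9,
           "fruity": -0.8, "refreshing": -1.1}

# Variables loading highly on each factor (rotated solution, Table 7.21).
scale_items = {"F1": ["milky", "melting", "artificial"],
               "F2": ["fruity", "refreshing"]}

# Summated scale: mean of the high-loading variables per factor.
scores = {f: sum(ratings[v] for v in items) / len(items)
          for f, items in scale_items.items()}
print(scores)
```

For this respondent the scale values are about 0.83 for F1 and −0.95 for F2.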
Summated scales, like surrogates, are based on information contained in the factor loadings. While surrogates focus on the highest loadings per factor, summated scales are based
on multiple loadings. However, neither makes use of the full information provided by the
factor loadings. This is the subject of regression analysis.
Regression analysis
Another approach for deriving factor scores is to use regression analysis (see Chap. 2).
The basic idea of this approach is illustrated in Fig. 7.16.
If we multiply the matrix of the standardized output data (Z) with the matrix of the
regression coefficients (so-called factor score coefficients), we obtain the matrix of the
factor scores P. For the application example, we obtain the factor score matrix P (size
30 × 2) by multiplying the output matrix Z (size 30 × 5) with the regression coefficients
listed in Table 7.22 (size 5 × 2).
For our application example, Table 7.23 shows the factor scores of the two-factor
solution for the first three persons and the last person.
When interpreting the factor scores, we need to be aware that they represent standard-
ized values, due to the standardization of the original data matrix. For the interpretation
of the factor scores this implies that
Fig. 7.16 The basic idea of using regression analysis for determining factor scores
Table 7.22 Regression coefficients for determining factor scores

             F1      F2
Milky        0.551   −0.049
Melting      0.015   −0.010
Artificial   0.422   0.001
Fruity       0.261   0.673
Refreshing   −0.281  0.331
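The multiplication Z · B can be reproduced for a single respondent with the coefficients of Table 7.22; the standardized ratings in z below are hypothetical:

```python
import numpy as np

# Factor score coefficients B from Table 7.22 (rows: milky, melting,
# artificial, fruity, refreshing; columns: F1, F2).
B = np.array([[ 0.551, -0.049],
              [ 0.015, -0.010],
              [ 0.422,  0.001],
              [ 0.261,  0.673],
              [-0.281,  0.331]])

# One hypothetical row of the standardized data matrix Z (1 x 5).
z = np.array([[0.5, -0.2, 0.8, 1.0, 0.3]])

p = z @ B   # factor scores of this respondent (1 x 2)
print(np.round(p, 3))
```

For the full example, multiplying the 30 × 5 matrix Z by the 5 × 2 matrix B yields the 30 × 2 factor score matrix P described in the text.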
An indication of the “average” evaluation per variable (factor score 0) can be derived
from the means of those variables in the initial data set that load on a factor.
Figure 7.17 summarizes the necessary steps when conducting a factor analysis and illus-
trates how to get from the original data matrix to the factor score matrix. Please note that
the size (length and width) of the boxes representing the various matrices reflects the size
of the matrix, that is, the number of rows and columns. Thus, the aim of factor analysis
is to transform the original data matrix into a matrix with fewer columns but the same
number of rows. In the application example, the original data matrix has 30 rows and 5
columns, while the factor score matrix has 30 rows and 2 columns.
We now use a larger sample related to the chocolate market to demonstrate how to con-
duct a factor analysis with the help of SPSS.15
A manager of a chocolate company wants to know how consumers evaluate different
chocolate flavors with respect to various attributes. For this purpose, the manager iden-
tified 11 flavors and selected 10 attributes that appear to be relevant for the evaluation of
these flavors.
A small pretest with 18 test persons was carried out. The persons were asked to eval-
uate the 11 flavors (chocolate types) with regard to the 10 attributes (see Table 7.24). A
seven-point rating scale (1 = low, 7 = high) was used for each attribute. Thus, the varia-
bles are perceived attributes of the chocolate types.
However, not all persons were able to evaluate all 11 flavors. Thus, the data set con-
tains only 127 evaluations instead of the complete number of 198 evaluations (18 per-
sons × 11 flavors). Every evaluation reflects the subjective assessment of all the 10
attributes with regard to a specific chocolate flavor by a particular test person. Since each
test person assessed more than just one flavor, the observations are not independent. Yet
for convenience, we will treat them here as if they were independent.
15 In the case study, the same data set is used as for discriminant analysis (Chap. 4), logistic regression (Chap. 5) and cluster analysis (Chap. 8). This is to better illustrate the similarities and differences between the different variants of the method.
Fig. 7.17 How to get from the original data matrix to the factor score matrix. (Figure annotations, condensed:) X contains the values of the observed variables (e.g. “creamy”) for the persons, with the attributes in the columns and the persons in the rows. Z is the standardized data matrix; R describes the statistical relation between the variables and contains the estimated communalities on the principal diagonal; both are square, their number of rows/columns determined by the number of attributes in Z. A contains the correlations between variables and factors, and A* those after rotation of the coordinate axes; these matrices are generally not square, since the number of factors (columns) should be smaller than the number of attributes (rows). P no longer contains the values of the output variables for the single persons/objects but rather their factor scores; its rows contain the persons and its columns the factors, so that the number of columns of A or A* equals the number of columns of P.
Of the 127 evaluations, only 116 are complete, while 11 evaluations contain missing
values (i.e., not all attributes of a flavor were evaluated).16 We exclude all incomplete
evaluations from the analysis. Consequently, the number of cases is reduced to 116.17
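This listwise exclusion of incomplete evaluations (as SPSS does by default) can be sketched in pure Python; the toy rows below stand in for the real 127-row data set:

```python
# Toy stand-in for the survey data: each row is one evaluation and None
# marks a missing attribute rating (the real data set has 127 rows).
evaluations = [
    [5, 4, 1],
    [3, 4, 6],
    [None, 5, 2],   # incomplete evaluation
    [2, 1, 7],
    [6, 2, 3],
]

# Listwise deletion: keep only evaluations without missing values.
complete = [row for row in evaluations if None not in row]
print(len(evaluations), len(complete))  # 5 4
```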
The manager of the chocolate company would like the survey to answer the following
three central questions:
To answer his questions, the manager carries out a factor analysis. According to the
considerations in Sect. 7.3.1, PAF is chosen as the extraction method. For the final
16 Missing values are a frequent and unfortunately unavoidable problem when conducting surveys
(e.g. because people cannot or do not want to answer some of the questions, or as a result of mis-
takes by the interviewer). The handling of missing values in empirical studies is discussed in Sect.
1.5.2.
17 In the following the same data set is used as in the case study of discriminant analysis (Chap. 4),
logistic regression (Chap. 5) and cluster analysis (Chap. 8). Thus, it is easier to demonstrate the
similarities and differences between the methods.
positioning of the chocolate varieties, he uses the factor scores, averaged over the inter-
viewed persons per chocolate variety.
Fig. 7.18 Data editor with a selection of the procedure ‘Factor analysis’
In a first step, we further activate ‘Coefficients’ and ‘Significance levels’ to obtain the correlation matrix and the significance levels of the correlations. Moreover, we activate ‘KMO
and Bartlett’s test of sphericity’ and we also request the anti-image matrix (‘Anti-image’)
and the reproduced correlation matrix (‘Reproduced’).
Next, we go to ‘Extraction’ to choose the method for the extraction of the factors. We
discussed that there are two distinct methods to extract the factors: principal components
analysis and principal axis factoring (cf. Sect. 7.2.2.3). Here we assume that specific var-
iances and measurement errors are relevant, and we thus choose ‘Principal axis factor-
ing’ (Fig. 7.21). SPSS offers several extraction methods, which were briefly described in
Table 7.15. Principal axis factoring (PAF) is chosen for the case study because the man-
ager is interested in the causes of the correlations (see Sect. 7.2.2.4.2). Additionally, we
select the option ‘Scree plot’, which helps to make the decision on the number of factors
to be extracted.
To obtain the rotated factor matrix, we open the dialog box ‘Factor Analysis:
Rotation’ and select ‘Varimax’ (Fig. 7.22). The varimax rotation results in uncorrelated
factors and helps with identifying the variables that belong to a specific factor.
Finally, we go to the dialog box ‘Factor Analysis: Factor Scores’ and select the
default option ‘Regression’ (Fig. 7.23). When the option ‘Regression’ is used, SPSS pro-
duces factor scores that have means of 0 and variances equal to the squared multiple cor-
relation between the estimated factor scores and the true factor scores. The scores may
be correlated even if factors are orthogonal. In this case, SPSS saves the estimated factor
scores as new variables to the SPSS data file. We will use the factor scores to position the
different chocolate flavors along the factors.
We further activate the option ‘Display factor score coefficient matrix’. By doing so,
we obtain the regression coefficients that are used to compute the factor scores in the
SPSS output window.
As an alternative, the option ‘Bartlett’ results in factor scores that have means of 0,
and the sum of squares of the unique factors over the range of variables is minimized.
Finally, the method ‘Anderson-Rubin’ is a modification of the Bartlett method which
ensures orthogonality of the estimated factors. The factor scores that are derived have
means of 0, standard deviations of 1, and are uncorrelated.
7.3.3 Results
In the following, the results of the factor analysis are presented according to the steps
presented in Sect. 7.2 (see Fig. 7.3). PAF was chosen as the extraction procedure because
the manager is interested in the causes of the correlations in his dataset. We will divide
the presentation of the results into two parts:
1. Checking whether the correlation matrix in the case study is suitable for a factor anal-
ysis (process step 1),
2. Conducting the PAF for the case study (steps 2 to 4).
Subsequently, in Sect. 7.3.3.3, we will deal with the manager’s question regarding the
positioning of the eleven chocolate types. The answer is based on the factor scores of the
respondents.
Because of the great importance of PCA in practical applications and the fundamental
difference to PAF (cf. the discussion in Sect. 7.2.2.4), we will again highlight the central
differences between using PCA and PAF in the case study (Sect. 7.3.3.4).
7.3.3.1 Prerequisite: Suitability
In the first step, the data matrix is standardized and the correlation matrix is computed
(Fig. 7.24).
The upper part of Fig. 7.24 shows that there are only a few high correlations in the
case study. The highest correlations exist between the variables ‘light’ and ‘sweet’
(r = 0.537) and between ‘light’ and ‘fruity’ (r = 0.549). There are many correlations below
0.2 and the lowest correlations exist between ‘healthy’ and ‘refreshing’ (r = 0.009) and
‘healthy’ and ‘delicious’ (r = –0.019). If we consider the significance levels of the correla-
tions (displayed in the lower matrix), only 25 out of 45 correlations are significant at the
5% level. We conclude that the data are probably not well suited for a factor analysis.
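The significance of a single correlation can be checked with a simple t statistic, t = r·√(n−2)/√(1−r²), with n−2 degrees of freedom. The following sketch (in Python rather than SPSS, and with an illustrative sample size of n = 120, not the exact case-study n) shows the idea for the two extreme correlations mentioned above:

```python
from math import sqrt

def corr_t_stat(r, n):
    """t statistic for testing H0: rho = 0; df = n - 2."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# illustrative sample size n = 120 (an assumption, not the case-study value)
print(corr_t_stat(0.549, 120))   # far above the ~1.98 critical value
print(corr_t_stat(0.009, 120))   # close to zero, clearly not significant
```

Comparing each statistic with the 5% critical value of the t distribution reproduces the kind of significance flags shown in the lower part of Fig. 7.24.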
To gain a more detailed insight into the suitability of the data for factor analysis, the
results for KMO and Bartlett’s test for sphericity are also considered (Fig. 7.25). The
KMO criterion is 0.701, which indicates that the data matrix (see Table 7.4) is only mod-
erately suitable for factor analysis. Nevertheless, KMO is above the critical value of 0.5.
The Bartlett test for sphericity is significant (p < 0.001), which leads to the conclusion that the correlation matrix is not an identity matrix and therefore correlations exist
between the initial data.
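Bartlett's test statistic can be computed directly from the determinant of the correlation matrix: χ² = −(n − 1 − (2p + 5)/6)·ln|R|, which under the null hypothesis is approximately chi-square distributed with p(p − 1)/2 degrees of freedom. A minimal Python sketch with a small, made-up three-variable matrix (not the case-study data):

```python
import numpy as np

def bartlett_sphericity(R, n):
    """Bartlett's test of sphericity: H0 is that R is an identity matrix.
    n = sample size, p = number of variables."""
    p = R.shape[0]
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return chi2, df

# made-up 3-variable correlation matrix (not the case-study data)
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
chi2, df = bartlett_sphericity(R, n=120)
print(chi2, df)   # a large chi2 relative to df speaks against sphericity
```

A p-value can then be read from the chi-square distribution; the further the determinant of R is below 1, the larger the statistic becomes.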
Figure 7.26 shows the anti-image correlation matrix with the variable-specific MSA
at the main diagonal. The variables ‘healthy’ (MSA = 0.492) and ‘price’ (MSA = 0.491)
have MSA values below the critical value of 0.5; thus we might consider ignoring these
Fig. 7.25 KMO and Bartlett’s test in the case study with 10 variables
variables in the further analyses. The remaining variables have MSA values that score
‘mediocre’ or even better. We decide to keep the variables ‘healthy’ and ‘price’ at this
point.
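The KMO criterion relates the squared correlations to the squared correlations plus the squared partial correlations, which can be obtained from the inverse of the correlation matrix. A minimal Python sketch (again with a made-up matrix, not the case-study data):

```python
import numpy as np

def kmo(R):
    """Kaiser-Meyer-Olkin measure: squared correlations relative to
    squared correlations plus squared partial correlations."""
    S = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(S), np.diag(S)))
    partial = -S / d                     # partial correlations
    np.fill_diagonal(partial, 0.0)
    off = R - np.eye(R.shape[0])         # off-diagonal correlations
    num = (off ** 2).sum()
    return num / (num + (partial ** 2).sum())

# made-up 3-variable correlation matrix (not the case-study data)
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
print(kmo(R))
```

The variable-specific MSA values work analogously, restricting both sums to the row of the variable in question; that is what SPSS reports on the diagonal of the anti-image correlation matrix.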
Generally speaking, the criteria for deciding whether the data are suited for factor analysis may be compared to a ‘bunch of flowers’: although all of them rely on the correlations among the variables, they may lead to ambiguous conclusions. It therefore makes little sense to single out one criterion; the decision-maker has to look at the whole picture and rely on the complete portfolio (bunch) of criteria.
In a final step, the communalities of the initial variables are checked, as these can pro-
vide information on whether any variables should be excluded from the analysis. This is
always the case if the communalities have very small values and the factors therefore can-
not reproduce the variance of a variable. For this purpose, we consider the matrix of ini-
tial and final communalities when PAF is used. Figure 7.27 shows the results for the case
study.
In theory, all variables with very small communalities should be deleted, i.e., ‘refreshing’ (0.182), ‘healthy’ (0.239) and ‘crunchy’ (0.237). This raises follow-up questions: Should we delete one, two or all three variables? What happens if the deletion of these three variables creates new critical candidates for deletion? And what does it mean if, in the end, PCA and PAF yield completely different results? Does that make variable selection a mere trial-and-error process?
In the following we will analyze, as an example, what happens if we delete the variable with the lowest extracted communality, that is, ‘refreshing’ (Fig. 7.27). To do so, we have to run a completely new factor analysis. In our case, KMO increases slightly
from 0.701 to 0.723, and the MSA value also increases marginally. The Bartlett test is
still significant.
Figure 7.28 shows the results for the communalities for 9 variables if PAF is applied.
After the deletion of ‘refreshing’ the critical variables in the 9-variable solution remain
the same (‘healthy’: 0.188; ‘crunchy’: 0.207). To see how the results change if other var-
iables with weak communalities are eliminated, the researcher can systematically check
the results by running a complete analysis for each combination. The results may vary
widely.
Does that mean the decision-maker has to fall back on ‘gut feeling’? The answer is no. What is needed is a comprehensive theory that guides the researcher through the set of critical questions. That is why we recommend looking for a sound, applicable theory before embarking on far-reaching but potentially misleading empirical designs. For didactic reasons we decide to keep the variables ‘healthy’ and ‘crunchy’ and to continue with the determination of the number of factors (based on nine variables and PAF).
Number of factors
By default, SPSS uses the Kaiser (eigenvalue) criterion to determine the number of
factors.
Figure 7.29 shows the eigenvalues of the factors if we consider just nine variables.
The column ‘Initial Eigenvalues/Total’ shows that according to the Kaiser criterion
(eigenvalues > 1) three factors have to be extracted. A separate factor analysis on the
basis of the three-factor solution results in 45.17% cumulative variance explained. The
values related to ‘Extraction Sums of Squared Loadings’ consider the common variance
instead of the total variance and are thus lower than the values reported in the columns
‘Initial Eigenvalues’. The columns related to ‘Rotation Sums of Squared Loadings’ rep-
resent the distribution of the variance after the VARIMAX rotation.
Next, we have a look at the scree test. Figure 7.30 shows the scree plot. The result is
ambiguous. We could justify either a three-factor or a two-factor solution.
18 By reducing the number of variables to 9, the number of valid cases in the case study changes to
117.
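The Kaiser criterion can be reproduced in a few lines of code: extract the eigenvalues of the correlation matrix and count those greater than 1. The sketch below uses an artificial two-block correlation matrix, not the case-study data:

```python
import numpy as np

# artificial correlation matrix with two blocks of correlated variables
R = np.array([[1.0, 0.6, 0.6, 0.1, 0.1],
              [0.6, 1.0, 0.6, 0.1, 0.1],
              [0.6, 0.6, 1.0, 0.1, 0.1],
              [0.1, 0.1, 0.1, 1.0, 0.5],
              [0.1, 0.1, 0.1, 0.5, 1.0]])

eigvals = np.linalg.eigvalsh(R)[::-1]    # eigenvalues, descending
n_factors = int((eigvals > 1.0).sum())   # Kaiser criterion
explained = eigvals.cumsum() / eigvals.sum()
print(n_factors, explained[n_factors - 1])   # 2 factors, 74% of variance
```

Plotting the sorted eigenvalues against their rank yields the scree plot; the ‘elbow’ in that curve is the visual counterpart of this count.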
If the residuals below the main diagonal in the reproduced correlation matrix are taken into account (see also
Fig. 7.28), it can be seen that poor reproductions are mainly related to the variables with
small communalities. These are the variables ‘bitter’ (communality 0.378) and ‘crunchy’
(communality 0.207). In contrast, the variable ‘healthy’ with the smallest communality
(0.188) is not affected. The reason for this is the fact that, in principle, it forms a factor
of its own in factor extraction and dominates factor 3 (see Fig. 7.33). In summary, it can
be said that although the overall goodness (45.17% explained total variance) is rather
low, the reproductions of the correlations are relatively good considering the consistently
rather low communalities.
Factor interpretation
To interpret the three-factor solution, we first use the unrotated factor matrix in Fig. 7.32.
The variables ‘light’, ‘sweet’, ‘bitter’, and ‘fruity’ load highly on factor 1. The varia-
ble ‘delicious’ has a cross-loading with factors 1 and 2. Besides ‘delicious’, the varia-
bles ‘price’ and ‘exotic’ are correlated with factor 2. The variable ‘healthy’ is the only
one that loads rather highly on factor 3 (0.406). This might be a first hint that there will
be incompatibilities with the conceptual requirements of factor analysis. The variable
‘crunchy’ correlates rather weakly with all three factors. Should we delete this variable?
For further interpretation of the factor solution, we look at the factor loadings
obtained after applying the VARIMAX rotation. In the two-dimensional (as well as in
the three-dimensional) case, we can perform the rotation graphically by trying to rotate
the coordinate system in such a way that the angles between the variable and the factor
vectors decrease. In the case of more than three factors, however, it is necessary to per-
form the rotation purely mathematically.
The VARIMAX rotation is an orthogonal rotation method, thus maintaining the
assumption that the factors should be independent (i.e., uncorrelated). Since the rotation
of the factors changes the factor loadings but not the communalities of the model, the
unrotated solution is primarily suitable for the selection of the number of factors and for
the quality assessment of the factor solutions. However, an interpretation of the deter-
mined factors on the basis of an unrotated model is not recommended, since the applica-
tion of a rotation method changes the distribution of the explained variance portion of a
variable among the factors.
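The VARIMAX rotation itself can be sketched compactly. The following Python function implements the standard SVD-based iteration for the varimax criterion (a generic textbook algorithm, not SPSS's exact implementation, and without Kaiser normalization). Because the rotation matrix is orthogonal, the row-wise sums of squared loadings, i.e. the communalities, are unchanged:

```python
import numpy as np

def varimax(A, tol=1e-8, max_iter=200):
    """Orthogonal VARIMAX rotation of a p x m loading matrix A."""
    p, m = A.shape
    T = np.eye(m)
    crit_old = 0.0
    for _ in range(max_iter):
        L = A @ T
        # gradient of the varimax criterion (standard SVD formulation)
        G = A.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt
        if s.sum() < crit_old * (1 + tol):
            break
        crit_old = s.sum()
    return A @ T

# small made-up loading matrix with two factors
A = np.array([[0.6, 0.6], [0.7, 0.5], [0.6, -0.5], [0.5, -0.6]])
L = varimax(A)
# communalities are identical before and after rotation
print(np.allclose((A ** 2).sum(axis=1), (L ** 2).sum(axis=1)))
```

The rotated loadings tend toward a ‘simple structure’: each variable loads highly on one factor and close to zero on the others, which is what makes the rotated solution easier to interpret.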
Figure 7.33 shows the analytical solution of the rotated factor loading matrix for the
case study. Compared to Fig. 7.32, noticeable changes in the factor loadings can be observed, which ultimately facilitate the factor interpretation and make it less ambiguous.
Now the variables ‘light’, ‘sweet’, ‘fruity’, ‘bitter’, ‘exotic’, and ‘crunchy’ are more
clearly related to factor 1. Factor 2 correlates with ‘price’ and ‘delicious’, while factor 3
is only correlated with the variable ‘healthy’.
Overall, it becomes clear that the results of a factor analysis often raise many ques-
tions and rarely provide clear answers. However, this is precisely what makes factor
analysis so “flexible” and leaves room for interpretation. In particular, the interpreta-
tion of the factors is always subjective and ultimately requires a great deal of expertise
on the part of the user in the field of investigation. This often makes it difficult to find
the “right” labels for the abstract, unobserved factors. This is precisely why information
about the variables that load strongly on a factor is used to interpret the factors. The fol-
lowing interpretations are proposed here for the case study:
We encourage the reader to challenge these propositions and to find alternative terms for
the factor labels. Moreover, we would like to motivate the reader to carefully compare
and inspect the different alternative options when conducting a factor analysis to gain
confidence about the robustness of the results.
Factor scores
After extracting the factors, the question of how the interviewed persons would assess the (fictitious) factors often remains of interest. These assessments can be estimated from the matrix of the standardized initial data (Z) in the form of so-called factor scores. We use the regression
method to calculate the factor scores (see Fig. 7.23 and Sect. 7.2.4). SPSS provides the
Factor Score Coefficient Matrix (Fig. 7.34) to calculate the factor scores from the output
data.
The regression coefficients serve as weights for the standardized initial values to com-
pute the factor scores. From Fig. 7.34 we learn that the variables ‘light’, ‘sweet’, ‘fruity’,
‘bitter’, ‘exotic’, and ‘crunchy’ have the highest weights, corresponding to the previous
results. As expected, the variables ‘price’ and ‘delicious’ received the highest weights for
factor 2, and the variable ‘healthy’ has the highest relevance when the factor score for factor 3 is computed. Since all variables are considered when computing the factor score for
one factor, the factor scores are correlated.
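In matrix terms, the regression method computes the coefficient matrix as B = R⁻¹A and the scores as F = ZB. A hedged Python sketch with simulated data; the loading matrix here is a stand-in derived from eigenvectors of R, not the PAF solution of the case study:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] += X[:, 0]                        # induce some correlation
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized data matrix Z
R = np.corrcoef(Z, rowvar=False)

# stand-in loading matrix A (2 factors) from the leading eigenvectors;
# a real PAF solution would iterate on the reduced correlation matrix
vals, vecs = np.linalg.eigh(R)
A = vecs[:, ::-1][:, :2] * np.sqrt(vals[::-1][:2])

B = np.linalg.solve(R, A)   # factor score coefficients: B = R^-1 A
F = Z @ B                   # factor scores, one row per respondent
print(F.mean(axis=0))       # means are (numerically) zero
```

The matrix B corresponds to the Factor Score Coefficient Matrix in Fig. 7.34, and the rows of F correspond to the new variables FAC1_1, FAC2_1, etc. in the SPSS data file.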
SPSS computes the factor scores and saves these values as new variables in the SPSS
data file. In SPSS, the factor scores of the individual cases are appended to the data
matrix as new variables: FAC1_1, FAC2_1 and FAC3_1 (Fig. 7.35).
As generally discussed in Sect. 7.2.4, when interpreting factor scores, care must be taken to remember that they represent standardized values with a mean of zero and a variance of 1. The resulting interpretation is illustrated here for persons 1, 2 and 10:
• The factor scores for person 1 are –1.093, –0.321, and –0.520.
This means that person 1 rates all three factors below average compared to the aver-
age of all respondents. With the rating scale used (1 = low, 7 = high), this means that
person 1 rates all three factors low compared to the average.
• The factor scores for person 2 are 0.319, 0.690, and 0.427.
Person 2 shows the opposite phenomenon, i.e. the assessments are above average for
all three factors. Person 2 therefore rates all three factors highly compared to the aver-
age values.
Fig. 7.35 SPSS data editor with the factor scores of the first 28 persons in the case study
• The factor scores for person 10 are –0.002, –0.650, and –0.086.
Person 10 has values close to zero for factors 1 and 3, i.e. the assessment of these fac-
tors corresponds to the average assessment of the persons interviewed.
7.3.3.3 Product Positioning
To position the eleven types of chocolate selected for the case study, the manager of the
chocolate company uses the (fictitious) evaluations of the three factors as represented by
the factor scores of the 116 interviewed persons (see Fig. 7.35). For product positioning, he
needs the average evaluations of the three factors for each type of chocolate. These can be
calculated using the SPSS procedure ‘Means’, which is called up via the following menu
sequence: ‘Analyze/Compare Means and Proportions/Means’. In the procedure ‘Means’,
the three variables with the factor scores (“FAC1_1”; “FAC2_1”; “FAC3_1”) must be
added to the ‘Dependent List’, and the chocolate type must be specified in the ‘Independent
List’. Afterwards, the means can be calculated via the menu ‘Options’ (see Fig. 7.36).
The data matrix for positioning is thus an 11 × 3 matrix, with the 11 chocolate flavors
as cases and the three average factor assessments as variables. It should be noted that
averaging per chocolate means that the information about the variations in assessment
between the various individuals is lost. Depending on the heterogeneity among the test
persons, this loss of information can be large and may not be acceptable. Figure 7.37
shows the average factor ratings for the 11 chocolate flavors.
Figure 7.37 shows that only the fruit varieties (orange, strawberry, mango) have high
positive values in the taste experience (factor 1). This means that they are perceived
above average compared to all other varieties. On the other hand, the fruit varieties show
high negative values for the other two factors, ‘value-for-money’ and the ‘health aspect’;
here they are perceived as below average. As shown above, the scale used in a survey
(here: 1 to 7) determines what “average” means. Because the factor scores are standardized, the mean of each factor j over all respondents is zero (z̄j = 0); the means of the initial variables that load on a factor thus define this reference point. A good overall impression is obtained by the graphical representation of
the values in Fig. 7.37. The resulting three-dimensional factor space (perception space) is
shown in Fig. 7.38.
Figure 7.38 shows that the perception of the three types of fruit chocolate is very dif-
ferent from that of the other types of chocolate (in all three dimensions). In contrast, the
remaining eight types of chocolate are positioned relatively close to each other in the
Fig. 7.37 Means of the factor scores for each chocolate flavor
Fig. 7.38 Three-dimensional representation of the chocolate flavors in the factor space (vertical axis: factor 1 ‘taste experience’, EV: 2.397; the fruit flavors Mango, Strawberry and Orange stand apart from Cappuccino, Milk, Mousse, Biscuit, Caramel, Nut, Nougat and Espresso)
consumers’ perceptual space.19 For the manager of the chocolate company, this means
that advertising for the fruit varieties should emphasize the taste dimension.
However, caution is also recommended when interpreting the factor space in
Fig. 7.38: Due to the different eigenvalues of the factors (see also Fig. 7.29), the factors
have a different significance for explaining the variances in the data set. Especially factor
3 has only a very small explanatory power.
19 Cluster analysis is the central methodological instrument for identifying similarly perceived
objects. The cluster analysis presented in this book (cf. Chap. 8) is also based on the data set used
in this case study (Table 7.24) and confirms the result of a two-cluster solution that is emerging
here.
Fig. 7.39 Initial and extracted communalities: PCA vs. PAF (10 variables)
The decision for one or the other of the two approaches must be made on logical grounds, as
they have completely different objectives.20
In our case study, the application of PAF was determined by the manager’s question.
But in the following we will briefly describe the results of the case study if we use PCA
instead of PAF.
• In PCA, three principal components are extracted when applying the Kaiser crite-
rion. While PAF could explain only 45.17% of the variance (see Fig. 7.29), PCA can
explain 63.15% of the total variance.
• A comparison between the rotated factor loading matrix and the rotated component
matrix does not lead to major differences, neither in terms of the assignment of the
variables to the factors nor in terms of the loadings. It is remarkable, however, that
the variable ‘healthy’ (with a loading of 0.900) loads on the third principal component
only and thus can be explained best by PCA (communality of 0.823). In PAF, this fac-
tor loading is only 0.421 and the communality is 0.188 (see Fig. 7.31 and 7.33).
• Obviously, the rotated loading matrices show only slight differences in the two extrac-
tion methods. The difference is rather that the factors have to be interpreted against
different backgrounds. In PCA, the aim is not to find the cause of the correlations, but
a collective term for the correlating variables.
• However, there are also differences with regard to the factor scores. This means that
the evaluation behavior of the respondents with regard to the factors (main compo-
nents) is represented differently in the two extraction methods.
Above we demonstrated how to use the graphical user interface (GUI) of SPSS to con-
duct a factor analysis. Alternatively, we can use the SPSS syntax which is a programming
language unique to SPSS. Each option we activate in SPSS’s GUI is translated into SPSS
syntax. If you click on ‘Paste’ in the main dialog box shown in Fig. 7.19, a new window
opens with the corresponding SPSS syntax.
However, you can also use the SPSS syntax and write the commands yourself. Using
the SPSS syntax can be advantageous if you want to repeat an analysis multiple times
(e.g., testing different model specifications). Figure 7.40 shows the SPSS syntax for
running the factor analysis in the case study, applying the principal axis factoring on 9
variables (without ‘refreshing’). The second part of the syntax contains the procedure
‘Means’, which was used to calculate the means of the factor scores per factor for the 11
chocolate flavors (Fig. 7.37) and to create the factor space in Fig. 7.38. The syntax does
not refer to an existing data file of SPSS (*.sav); rather, we enter the data with the help of
the syntax editor (BEGIN DATA… END DATA).
BEGIN DATA
3 3 5 4 1 2 3 1 3 4 1 1
6 6 5 2 2 5 2 1 6 7 3 1
2 3 3 3 2 3 5 1 3 2 4 1
-------------------------
5 4 4 1 4 4 1 1 1 4 18 11
* Enter all data.
END DATA.
Fig. 7.40 SPSS syntax for conducting the factor analysis in the case study
7.4 Extension: Confirmatory Factor Analysis (CFA)
So far, the term factor analysis has been used to describe the method of exploratory fac-
tor analysis (EFA). In addition to EFA, as presented above, confirmatory factor analysis
(CFA) is also very important for practical applications. In the following, we will high-
light the conceptual differences between the two approaches.
Theoretical foundation
The aim of EFA is to identify those variables that are highly correlated and can be com-
bined to form factors. There are no a priori defined relationships between the variables;
rather, we use EFA to explore the data and search for structures in a set of variables (i.e.,
we aim to discover structures). We often perform EFA to reduce the number of variables
to be considered in further analyses.
In contrast, CFA aims to measure given, so-called hypothetical constructs via empir-
ically collected variables (measurement variables or indicators). The hypothetical con-
structs correspond to the factors. Accordingly, CFA is used exclusively to test whether a
factor (construct) is reflected in the variables specified by the user and measured empiri-
cally. CFA is thus an instrument for operationalizing hypothetical constructs, also known
as latent variables.
Let us use the variables of our case study (Sect. 7.3) to exemplify CFA. We assume
that a user wants to operationalize the hypothetical constructs ‘taste experience’ (factor
1) and ‘value for money’ (factor 2) by means of certain indicators (measurement vari-
ables). Since the factors are latent variables that cannot be directly measured, the user
looks for indicators (variables) that reflect the two factors in reality. For this reason, we
also speak of reflective measurement models when referring to hypothetical constructs.
Each measurement variable should represent as good a reflection of a construct as possi-
ble. For this to be true, the variables assigned to a construct must have high correlations.
The measurement variables of different constructs, however, should not be correlated if
the constructs (factors) under consideration are independent. Figure 7.41 illustrates the
correlations and shows that each indicator is generated by the postulated construct, but
each measurement is also subject to a measurement error (δ).
Due to this basic idea of CFA, the process steps presented in Fig. 7.3 for EFA have
to be changed slightly: in principle, step 1 is not necessary, since the variables should be
selected in advance in such a way that all indicators (variables) that reflect a construct
(factor) show high correlations. Thus, the first step is not used to check whether a data
set is suitable for factor analysis, but to check whether the variables assigned to a con-
struct in advance do have the expected high correlations.
Furthermore, CFA is based on the PAF model. According to Eq. (7.10) the following
applies: R = AA’ + U. As Fig. 7.41 shows, each measurement is reflected in a regres-
sion equation, which is represented for the seven measurement variables (indicators) as
follows:
x1 = λ11F1 + δ1    x5 = λ52F2 + δ5
x2 = λ21F1 + δ2    x6 = λ62F2 + δ6
x3 = λ31F1 + δ3    x7 = λ72F2 + δ7
x4 = λ41F1 + δ4
with the assignment of indicators to constructs shown in Fig. 7.41: the indicators x1 ‘light’, x2 ‘sweet’, x3 ‘fruity’ and x4 ‘bitter’ (with measurement errors δ1 to δ4) reflect the construct ‘taste experience’ (F1), while x5 ‘expensive’, x6 ‘saturating’ and x7 ‘quantity’ (with δ5 to δ7) reflect ‘value-for-money’ (F2); the correlation between the factors is fixed at rF1,F2 = 0.
The quantities λiq form the regression coefficients to be estimated from the empirical
measurements, which correspond to the factor loadings in the case of standardized meas-
urement variables.
The example above shows that the following information is required before carrying
out a CFA and has to be determined a priori by the user on the basis of factual logical or
theoretical considerations:
Thus, step 2 (extracting the factors and determining their number) has a completely differ-
ent meaning in CFA, since the number of factors to be extracted is determined a priori by
the user and only one of the factor-analytical approaches (usually PAF or the ML method)
can be used as an extraction method. In contrast, PCA has no significance for CFA.
If two or more factors are considered simultaneously in a model, the assignment of
the measurement variable to the factors is reflected in the factor loading matrix. Whereas
in EFA all factor loadings must be estimated, CFA only requires an estimate of the factor
loadings that have definitely been assigned to a factor. All other factor loadings can a
priori be set to zero. The estimation of the factor loading matrix is also carried out within
the framework of CFA on the basis of the fundamental theorem of factor analysis (see
Sect. 7.2.2.2). Here, the goal of the estimation is to minimize the difference between the
(model-theoretical) correlation matrix calculated with the help of the parameter estimates
and the empirical correlation matrix.
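The estimation idea can be illustrated numerically: given a loading matrix with the a priori zeros, the model-implied correlation matrix is R = AA′ + U, and the estimator searches for loadings that minimize its distance to the empirical matrix. A Python sketch with purely illustrative loading values (not estimates from the case study), using unweighted least squares as the discrepancy:

```python
import numpy as np

# illustrative CFA loading matrix with a priori zeros (7 indicators, 2 factors)
lam = np.zeros((7, 2))
lam[:4, 0] = [0.8, 0.7, 0.75, 0.6]   # indicators of factor 1
lam[4:, 1] = [0.7, 0.65, 0.6]        # indicators of factor 2

U = np.diag(1.0 - (lam ** 2).sum(axis=1))  # unique variances
R_model = lam @ lam.T + U                  # model-implied matrix: R = AA' + U

def discrepancy(R_emp, R_mod):
    """Unweighted least-squares distance minimized during estimation."""
    return ((R_emp - R_mod) ** 2).sum()

print(np.diag(R_model))   # unit diagonal, as required for correlations
```

An actual CFA program (e.g., AMOS) minimizes such a discrepancy, usually a maximum-likelihood variant, only over the free loadings, keeping the zero restrictions fixed.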
          Exploratory factor analysis      Confirmatory factor analysis
Variable  Factor 1     Factor 2            Factor 1             Factor 2
          “???”        “???”               “taste experience”   “value-for-money”
X1:       λ11          λ12                 λ11                  0
X2:       λ21          λ22                 λ21                  0
X3:       λ31          λ32                 λ31                  0
X4:       λ41          λ42                 λ41                  0
X5:       λ51          λ52                 0                    λ52
X6:       λ61          λ62                 0                    λ62
X7:       λ71          λ72                 0                    λ72
Fig. 7.42 Estimation of the factor loading matrix for EFA and CFA
Since the factors of CFA are already defined in terms of content prior to the examina-
tion, step 3 of the process (interpreting the factors), which is very important for EFA, is
omitted.
Figure 7.42 illustrates the central differences in the basic ideas of EFA and CFA using
the factor loading matrix.
After performing a CFA, it is essential to assess the validity of the model in order to con-
firm the measurement model. CFA provides a large number of quality criteria for this pur-
pose (cf. Harrington, 2009, pp. 5–7). However, it is beyond the scope of this book to discuss
CFA in detail. The central differences between EFA and CFA are summarized in Table 7.25.
For further information see the bibliographic information at the end of this chapter.
SPSS provides an independent software package called AMOS (Analysis of Moment
Structures) for performing a CFA.
7.5 Recommendations
We close this chapter with some requirements and recommendations for conducting a
factor analysis.
The considerations in this chapter have shown that an exploratory factor analysis
(EFA) can lead to different results for the same initial data, depending on how the pro-
cess options are determined (see Fig. 7.17). The recommendations listed in Table 7.26
are based on what has proven to be effective and may serve as a beginner’s guide for
defining the parameters. Anyone wanting to dive deeper into the topic is referred to the
specialized literature (see references at the end of this chapter).
References
Child, D. (2006). The essentials of factor analysis (3rd ed.). Bloomsbury Academic.
Cureton, E. E., & D’Agostino, R. B. (1993). Factor analysis: An applied approach. Erlbaum.
Dziuban, C. D., & Shirkey, E. C. (1974). When is a correlation matrix appropriate for factor analysis? Some decision rules. Psychological Bulletin, 81(6), 358.
Guttman, L. (1953). Image theory for the structure of quantitative variates. Psychometrika, 18(4),
277–296.
Harrington, D. (2009). Confirmatory factor analysis. Oxford University Press.
Kaiser, H. F., & Rice, J. (1974). Little jiffy, mark IV. Educational and Psychological Measurement, 34(1), 111–117.
Loehlin, J. (2004). Latent variable models: An introduction to factor, path, and structural equation
analysis (4th ed.). Psychology Press.
Further Reading
Bartholomew, D. J., Knott, M., & Moustaki, I. (2011). Latent variable models and factor analysis:
A unified approach (Vol. 904). Wiley.
Costello, A. B., & Osborne, J. W. (2005). Best practices in exploratory factor analysis: Four rec-
ommendations for getting the most from your analysis. Practical Assessment, Research and
Evaluation, 10(7), 1–9.
Harman, H. (1976). Modern factor analysis (3rd ed.). The University of Chicago Press.
Kaiser, H. F. (1970). A second generation little jiffy. Psychometrika, 35(4), 401–415.
Kim, J. O., & Mueller, J. (1978). Introduction to factor analysis: What it is and how to do it.
SAGE.
Stewart, D. (1981). The application and misapplication of factor analysis. Journal of Marketing
Research, 18(1), 51–62.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Allyn & Bacon.
Thompson, B. (2004). Exploratory and confirmatory factor analysis – Understanding concepts
and applications. American Psychological Association.
Yong, A. G., & Pearce, S. (2013). A beginner’s guide to factor analysis: Focusing on exploratory
factor analysis. Tutorials in Quantitative Methods for Psychology, 9(2), 79–94.
8 Cluster Analysis
8.1 Problem
Empirical studies often deal with large data sets that show great variety with regard to
certain attributes, i.e. the elements of the data set have a high degree of heterogeneity. If
highly heterogeneous data are described by a mean value, for example, the variance or
standard deviation is high. This indicates that the mean value is of little statistical sig-
nificance as an attribute measure of the overall data set. The lower the heterogeneity, the
more reliable the mean value and the smaller the associated standard deviation.
One way to solve the problem of data heterogeneity is to merge persons (or objects)
into comparable (i.e., homogenous) groups. This means that the survey population is
broken down into groups in which the persons (or objects) show a high degree of homo-
geneity (intra-group homogeneity), while there is a high degree of heterogeneity between
groups (intergroup heterogeneity). In this way, statistical analyses can be carried out for
each group separately, providing significantly more reliable results per group.
Cluster analysis is the methodological instrument for breaking down heterogeneous
survey results into homogenous groups. It can be applied in many disciplines such as
medicine, sociology, biology or economics and is used to determine similarities, e.g. of
patients, buyers, plant species, companies or products. Table 8.1 shows selected research
questions in different fields of application, all of which aim to form groups of objects or
persons.
The questions make it clear that cluster analysis is related to exploratory data analysis
procedures, because it leads to suggestions for grouping surveyed objects, thus generat-
ing “new findings” or discovering structures in data sets.
To visualize the procedure of cluster analysis, two examples are shown in Fig. 8.1. In
the first case (diagram A), age and income were recorded for 30 persons. The average
age of the survey population is 31.4 years and the average income is 2595 €. However,
as Fig. 8.1A shows, these two averages are not very meaningful for characterizing the
30 persons since the standard deviation (s) of age is s(age) = ± 8.1 years and the standard deviation for income is s(income) = ± 1151 €. Clustering the data into two groups, as indicated in diagram A, leads to more meaningful results. For the group of younger persons (g = 1), the average age is 24.7 years (s(1,age) = ± 2.4 years), with an average income of 1550 € (s(1,income) = ± 225 €). In contrast, the group of older persons (g = 2) is on average 38.2 years old (s(2,age) = ± 4.1 years) and has an average income of 3640 € (s(2,income) = ± 475 €).1
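The reduction in spread achieved by grouping can be verified directly: the within-group standard deviations are far smaller than the overall standard deviation. A Python sketch with small made-up age data in the spirit of diagram A:

```python
import numpy as np

# made-up ages for eight persons in two groups (cf. diagram A)
age = np.array([23.0, 25.0, 27.0, 24.0, 36.0, 40.0, 38.0, 42.0])
group = np.array([1, 1, 1, 1, 2, 2, 2, 2])

overall_sd = age.std(ddof=1)
within_sd = [age[group == g].std(ddof=1) for g in (1, 2)]
print(overall_sd, within_sd)   # within-group spread is much smaller
```

The same comparison, applied to each cluster variable, is what makes the group means meaningful attribute measures.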
Another example is shown in Fig. 8.1B. Ten products were examined with regard to
their price and quality levels as perceived by consumers. The graphic representation sug-
gests three segments that show a high degree of similarity in the assessments. This result
might, for example, allow a supplier to set up segment-specific marketing campaigns.
1 In diagram A the two characteristics “income” and “age” are not independent. This means that the
two-cluster solution could have been achieved on the basis of only one of the two characteristics.
On the independence of cluster variables, see Sect. 8.2.1.
Fig. 8.1 Two examples of clustering. Diagram A: income versus age of 30 persons, with group 1 (1550 €; 24.7 years), group 2 (3640 €; 38.2 years) and the total average (2595 €; 31.4 years). Diagram B: price level versus quality level of ten products (P1 to P10), grouped into three segments.
Since in both examples the objects were described by just two variables, they can
be graphically represented in a diagram or scatterplot. In many applications, however,
the objects are described by considerably more variables. In these cases, the results of
a cluster analysis can no longer be visualized. For example, all 20,000 enrolled students
of a university are listed as cases (objects), and age, gender, number of semesters, and
high school graduation grade are collected as attributes. If a clustering is carried out on
the basis of these data, the university management may use the results, for example, to
develop group-specific offers. Again, the number of groups should be chosen in such a
way that the students within a group are as similar to each other as possible while there
are only minor similarities between the student groups.
If a graphical illustration (in two- or three-dimensional space) is to be used in a mul-
ti-variable case, the set of variables can be aggregated in advance, e.g. with the help of
factor analysis (see Chap. 7). Using, for example, the first two factors, a visualization is
possible. If the user is interested in more detailed knowledge about the differences of a
clustering solution, discriminant analysis (see Chap. 4) can be used for this purpose. In
discriminant analysis, the result of the cluster analysis (number of groups) is used as a
dependent variable and the differences between the groups are examined on the basis of
the independent variables.
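As a sketch of this two-step approach, the following compresses a multi-variable data matrix to two component scores that could then be plotted. The data matrix is hypothetical, and PCA via SVD is used here as a simple stand-in for the factor analysis discussed in Chap. 7:

```python
import numpy as np

# Sketch: compress a J-variable data matrix to two principal components
# so that a clustering solution can be visualized in a scatterplot.
# The data matrix is hypothetical (N = 6 objects, J = 4 variables);
# PCA via SVD stands in for the factor analysis of Chap. 7.
X = np.array([
    [23.0, 1.0,  2.0, 2.1],
    [25.0, 1.0,  4.0, 1.7],
    [31.0, 2.0,  8.0, 2.9],
    [22.0, 2.0,  3.0, 1.3],
    [45.0, 1.0, 12.0, 2.4],
    [41.0, 2.0, 10.0, 3.0],
])

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize first
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt[:2].T                      # one 2D point per object

print(scores.shape)  # (6, 2)
```

Each row of `scores` is one object's position on the first two components, ready for a two-dimensional scatterplot.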
8.2 Procedure
The first step in performing a cluster analysis is to decide which variables should be used to cluster a set of objects. The choice of the cluster variables determines how the resulting homogeneous groups can be described later. In the second step, we have to decide how the similarity between the objects is to be measured, i.e., which proximity measure to use.
A central goal of cluster analysis is to group objects together that are homogeneous with
respect to certain criteria in order to subject them to different measures. Take, for exam-
ple, a market segmentation analysis, in which groups of customers with similar purchas-
ing behavior (so-called market segments) are identified and subsequently subjected to
segment-specific marketing concepts in order to avoid the scattering losses typical of
undifferentiated marketing. Segmentation (clustering) typically leads to more effective
and efficient marketing concepts.
The example of market segmentation shows that the homogeneity of a cluster (seg-
ment) is defined by the variables used for cluster formation. The selection of the cluster
variable is fundamental in preparing a cluster analysis, even though this content-driven
question is not important for the clustering algorithm itself. The “correct” definition of
the cluster variable determines how well the results of a cluster analysis can be used later
on. Based on the example of market segmentation (cf. Wedel & Kamakura, 2000; Wind,
1978, pp. 318 ff.), the following characteristics can be derived that should be fulfilled by
cluster variables in general:
set, but the principal components are independent of each other. The cluster analysis
should then be performed on the basis of the factor values. It should be noted, how-
ever, that it might be difficult to interpret the factors and thus the factor values if only
the central factors and not all factors are used. If fewer principal components are extracted than there are variables, part of the initial information is lost.
• Mahalanobis distance as proximity measure: If the Mahalanobis distance is used
to determine the differences between the objects, any correlations between the variables can be excluded (in the distance calculation between objects). However, the
Mahalanobis distance imposes certain requirements on the data (e.g. uniform mean
values of the variables in all groups), which are often not fulfilled, especially in clus-
ter analysis (Kline, 2011, p. 54).
Cluster variables should, if possible, be manifest variables that are also observable and
measurable in reality. If, on the other hand, cluster variables are hypothetical variables
(latent variables), suitable operationalizations must be found.
If cluster variables are measured on different scales (measurement units), variables with larger value ranges will dominate the distances between objects. In order to establish comparability between the variables, a standardization procedure should be carried out in advance for metrically scaled cluster variables. As a consequence, all (standardized) variables have a mean value of zero and a variance of one.2
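A minimal sketch of this z-standardization for one metric cluster variable (the income values below are hypothetical):

```python
from statistics import fmean, pstdev

# Sketch of z-standardization: after the transformation the variable has
# mean 0 and variance 1, so variables measured on different scales
# contribute comparably to the distances. Income values are hypothetical.
income = [1200, 2400, 1800, 3000]

mean, sd = fmean(income), pstdev(income)
z = [(x - mean) / sd for x in income]

print(round(abs(fmean(z)), 9))   # 0.0
print(round(pstdev(z) ** 2, 9))  # 1.0
```

Applied column by column to the raw data matrix, this makes the subsequent distance calculations scale-independent.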
Typically, following a cluster analysis the user would like to subject the resulting clusters
to specific measures. These measures should be tailored to the properties of the clusters. It is
therefore important to ensure that the cluster variable can be influenced by the user.
Since the various clusters should be as heterogeneous as possible, the cluster variable(s) must have a high separating power to distinguish clusters. Cluster variables that
are very similar for all objects (so-called constant characteristics) lead to a levelling of
the differences between the objects and thus cause distortions when fusing objects. Since
constant characteristics are not separable, they should be excluded from the analysis in
advance. This applies in particular to characteristics with a high share of zero values.
If a cluster analysis is carried out on the basis of a sample with the goal of drawing
conclusions about the population, it must be ensured that the individual groups contain
enough elements to represent the corresponding subgroups of the population. Since it
is usually not known in advance which groups are represented in a population—since
finding such groups is precisely the aim of cluster analysis—outliers in a data set should
be eliminated. Outliers influence the fusion process, make it more difficult to recognize
relations between objects, and thus result in distortions.3
2 On the standardisation of variables, cf. the explanations on the statistical principles in Sect. 1.2.
3 For the analysis of outliers, see also the explanations on the statistical principles in Sect. 1.5.1 as
well as the description of the single linkage method in Sect. 8.2.3.2, which is particularly suitable
for identifying outliers in cluster analyses.
Cluster characteristics may change over time. However, it is important for the pro-
cessing of clusters that their characteristics remain stable at least for a certain period
of time, since measures developed on the basis of clustering will usually need a certain
amount of time to show effects.
The starting point of cluster analysis is a raw data matrix with N objects (e.g. persons,
companies, products) which are described along J variables. The general structure of this
raw data matrix is illustrated in Table 8.2.
This matrix contains the values of the variables for each object. They can be met-
ric and/or non-metric. The first step is about determining the similarities between the
objects by using a statistical measure. For this purpose, the raw data matrix is converted
into a distance or similarity matrix (Table 8.3) that is always a square (N × N) matrix.
This matrix contains the similarity or dissimilarity values (distance values) between
the objects, which are calculated using the object-related variable values from the raw
data matrix. Measures that quantify similarities or differences between objects are gener-
ally referred to as proximity measures.
• Similarity measures reflect the similarity of two objects: the greater the value of a
similarity measure, the more similar two objects are to each other.
• Distance measures reflect the dissimilarity between two objects: the greater the value
of a distance measure, the more dissimilar two objects are to each other. If two objects
are completely identical, the distance measure equals zero.
Application example
In a survey, 30 persons were asked about their perceptions of five chocolate flavors.
The test persons assessed the flavors ‘Biscuit’, ‘Nut’, ‘Nougat’, ‘Cappuccino’ and
‘Espresso’ regarding the attributes (i.e., variables) ‘price’, ‘bitter’ and ‘refreshing’ on
a 7-point scale from high (= 7) to low (= 1). Table 8.5 shows the mean subjective perception values of the 30 persons interviewed for the five chocolate flavors.6 ◄
For all metrics considered below, we need to ensure that comparable measures are used. This is fulfilled in our application example, as all three attributes were assessed on a 7-point rating scale. If this requirement is not fulfilled, the initial data must first be made comparable, e.g. with the help of a standardization procedure.7
4 The selection of proximity measures shown in Table 8.4 is based on the proximity measures provided in the SPSS procedure “Hierarchical Cluster Analysis”.
5 On the website www.multivariate-methods.info, we provide supplementary material (e.g., Excel […]) … matrix.
7 On the standardization of variables, see the comments on statistical basics in Sect. 1.2.1.
[Fig. 8.3: Euclidean distance in the two-variable case: objects k = (xk1; xk2) and l = (xl1; xl2) plotted in the (x1, x2) plane; the legs of the right triangle are (xk1 − xl1) and (xk2 − xl2).]
d_{k,l} = [ Σ_{j=1}^{J} (x_{kj} − x_{lj})² ]^{1/2}    (8.1)
with
dk,l distance between objects k and l
xkj,xlj value of variable j for objects k, l (j = 1,2,…,J)
For a two-variable case, Fig. 8.3 illustrates the Euclidean distance as the direct (shortest)
connection between the objects k and l, which corresponds to the hypotenuse of the right triangle.
In the two-variable case, the Euclidean distance between the points k = ‘Biscuit’ with
the coordinates (6,1) and l = ‘Nut’ with the coordinates (2,5) is calculated as follows:
d_{Biscuit, Nut} = √((6 − 2)² + (1 − 5)²) = √(16 + 16) = 5.656
Table 8.6 Distance matrix according to the squared Euclidean distance in the application
example
Biscuit Nut Nougat Cappuccino Espresso
Biscuit 0
Nut 6 0
Nougat 4 6 0
Cappuccino 56 26 44 0
Espresso 75 41 59 11 0
For the multi-variable case (example in Table 8.5), the squared Euclidean distance for the product pair ‘Biscuit’ and ‘Nut’ is calculated as follows:
d²_{Biscuit, Nut} = (1 − 2)² + (2 − 3)² + (1 − 3)² = 1 + 1 + 4 = 6
By squaring the values, large differences have a higher impact on the distance measure, while small difference values have a lower impact. Moreover, positive and negative differences can no longer cancel each other out. The Euclidean distance is then obtained by calculating the square root of the squared Euclidean distance. In the example above, the squared Euclidean distance is 6, so the Euclidean distance is √6 = 2.449.
Both the squared Euclidean distance and the Euclidean distance can be used for
measuring the dissimilarity of objects. Since simulation studies have shown that many
algorithms give the best results when using the squared Euclidean distance, the consider-
ations below focus on the squared Euclidean distance.
Table 8.6 summarizes the squared Euclidean distances for our numerical exam-
ple with five products. Since the distance of an object to itself is always zero, the main
diagonal of a distance matrix contains zeros. The distance matrix already shows that
the smallest distance is observed between ‘Biscuit’ and ‘Nougat’, while ‘Biscuit’ and
‘Espresso’ are the most dissimilar flavors (indicated by bold values).
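Assuming the attribute values implied by the worked calculations in the text (e.g., d²(Biscuit, Nut) = (1 − 2)² + (2 − 3)² + (1 − 3)² = 6), a minimal sketch reproducing the values of Table 8.6:

```python
# Squared Euclidean distances for the five flavors of the application
# example. The attribute values (price, bitter, refreshing) are
# reconstructed from the worked calculations in the text; the resulting
# distances match Table 8.6.
flavors = {
    'Biscuit':    (1, 2, 1),
    'Nut':        (2, 3, 3),
    'Nougat':     (3, 2, 1),
    'Cappuccino': (5, 4, 7),
    'Espresso':   (6, 7, 6),
}

def sq_euclidean(x, y):
    """Squared Euclidean distance: sum of squared variable differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

names = list(flavors)
for i, k in enumerate(names):
    for l in names[i + 1:]:
        print(f'{k} - {l}: {sq_euclidean(flavors[k], flavors[l])}')
```

The printed lower triangle contains, among others, Biscuit–Nougat = 4 (most similar) and Biscuit–Espresso = 75 (most dissimilar), as highlighted in the table.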
The idea of the city block metric is derived from a city like Manhattan (with streets in a checkerboard pattern), where the distance between two locations is determined by driving along the blocks of streets from one of the locations to the other one. It is therefore also called Manhattan metric or taxi driver metric. It plays an important role in certain practical applications, such as the grouping of locations. The distance is derived by calculating the difference between two objects for each variable and adding the resulting absolute difference values:
d_{k,l} = Σ_{j=1}^{J} |x_{kj} − x_{lj}|    (8.2)
Figure 8.4 illustrates this for a two-variable case.
[Fig. 8.4: City block distance in the two-variable case: objects k = (xk1; xk2) and l = (xl1; xl2) in the (x1, x2) plane, with the legs |xk1 − xl1| and |xk2 − xl2|.]
The distance of the objects k = ‘Biscuit’ with the coordinates (6,1) and l = ‘Nut’ with
the coordinates (2,5) is:
d_{Biscuit, Nut} = |6 − 2| + |1 − 5| = 8,
which results from inserting the values in Eq. (8.2).
For the multi-variable case (example in Table 8.5), the city block metric for the prod-
uct pair ‘Biscuit’ and ‘Nut’ is calculated as follows, with the first number in the differ-
ence calculation representing the property value of ‘Biscuit’:
d_{Biscuit, Nut} = |1 − 2| + |2 − 3| + |1 − 3| = 1 + 1 + 2 = 4
Table 8.7 Distance matrix according to the city block metric (L1 norm)
Biscuit Nut Nougat Cappuccino Espresso
Biscuit 0
Nut 4 0
Nougat 2 4 0
Cappuccino 12 8 10 0
Espresso 15 11 13 5 0
This means that the distance between the products ‘Biscuit’ and ‘Nut’ is 4 according to
the city block metric. Based on the city block metric, the distances for all other pairs of
objects are determined in the same way, with the results shown in Table 8.7.
As shown in Table 8.7, the products ‘Nougat’ and ‘Biscuit’ have the greatest similarity with a distance value of 2, while the least similarity exists between ‘Espresso’ and ‘Biscuit’ with a distance of 15.
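Using the attribute values reconstructed from the worked calculations in the text, the city block distances of Table 8.7 can be sketched as:

```python
# City block (L1) distances for the five flavors; attribute values
# (price, bitter, refreshing) are reconstructed from the worked
# calculations in the text. Results match Table 8.7.
flavors = {
    'Biscuit':    (1, 2, 1),
    'Nut':        (2, 3, 3),
    'Nougat':     (3, 2, 1),
    'Cappuccino': (5, 4, 7),
    'Espresso':   (6, 7, 6),
}

def city_block(x, y):
    """L1 norm: sum of absolute variable differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

print(city_block(flavors['Biscuit'], flavors['Nut']))       # 4
print(city_block(flavors['Nougat'], flavors['Biscuit']))    # 2
print(city_block(flavors['Espresso'], flavors['Biscuit']))  # 15
```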
Regarding the most similar and the most dissimilar pair, the squared Euclidean dis-
tance leads to the same results as the city block metric. A complete comparison of the
results of the city block metric (L1 norm) and the Euclidean distance (L2 norm) (cf.
Table 8.8) shows a shift in the rank order for the product pairs ‘Cappuccino’ and ‘Nougat’ as well as ‘Espresso’ and ‘Nut’. Thus, the choice of the distance measure influences the rank order of the object pairs. This means that the proximity measure should not be chosen arbitrarily, but according to its suitability for the application.
The results are different due to the way the calculated differences are taken into
account: While in the city block metric all difference values are weighted equally, large
differences have a stronger effect in the Euclidean distance metric due to the effect of
squaring.
d_{k,l} = [ Σ_{j=1}^{J} |x_{kj} − x_{lj}|^r ]^{1/r}    (8.3)
with
dk,l distance between objects k and l
xkj,xlj value of variable j for objects k, l (j = 1,2,…,J)
r ≥ 1 Minkowski constant
The Minkowski constant r is a positive constant. For r = 1, the city block metric (L1
norm) results, and for r = 2, the Euclidean distance (L2 norm) follows.
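The Minkowski metric of Eq. (8.3) can be sketched as a single function that covers both special cases:

```python
# Minkowski metric of Eq. (8.3): r = 1 yields the city block metric,
# r = 2 the Euclidean distance. Attribute values for 'Biscuit' and 'Nut'
# are reconstructed from the worked calculations in the text.
def minkowski(x, y, r):
    """L_r norm of the difference vector between two objects."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

biscuit, nut = (1, 2, 1), (2, 3, 3)

print(minkowski(biscuit, nut, 1))            # 4.0   (city block)
print(round(minkowski(biscuit, nut, 2), 3))  # 2.449 (Euclidean, sqrt(6))
```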
The similarity between two objects k and l can be computed as the Pearson correlation coefficient across the J variables:
r_{k,l} = Σ_{j=1}^{J} (x_{jk} − x̄_k)(x_{jl} − x̄_l) / [ Σ_{j=1}^{J} (x_{jk} − x̄_k)² · Σ_{j=1}^{J} (x_{jl} − x̄_l)² ]^{1/2}    (8.4)
with
x_{jk} value of variable j (j = 1, 2, …, J) for object k (analogously for object l)
x̄_k, x̄_l average value over all J variables for object k or l
The Pearson correlation between two objects k and l takes all variables of an object into
account.8 For the application example, the similarity matrix shown in Table 8.9 is based
on the Pearson correlation coefficient.
8A detailed description of the calculation of the correlation coefficient may be found in Sect. 1.2.2.
[Fig. 8.5: Profile curves of ‘Biscuit’ and ‘Espresso’ across the attributes ‘Price’, ‘Bitter’ and ‘Refreshing’ on the 7-point scale: the two profiles run in parallel at different levels.]
When comparing these similarity values with the distance values in Table 8.6, it becomes clear that the relation between the objects has changed significantly. According to the squared Euclidean distance, ‘Espresso’ and ‘Biscuit’ are most dissimilar, while they are defined as the most similar product pair by the correlation coefficient. Similarly, according to the squared Euclidean distance, ‘Nougat’ and ‘Biscuit’ are very similar (with a distance of 4), while the pair is considered completely dissimilar according to the correlation coefficient (with a correlation of 0 in Table 8.9).
These comparisons show that similarity or distance measures need to be chosen
depending on the target of the researcher. To illustrate this, let us take a look at the pro-
file curves of ‘Biscuit’ and ‘Espresso’ shown in Fig. 8.5 according to the initial data in
our example.
The profiles show that although the values for ‘Biscuit’ and ‘Espresso’ are distant
from each other, their profiles are the same. This explains why the products are dissim-
ilar when using a distance measurement while they are found to be similar when using
the correlation coefficient. In general, the following can be concluded:
• Distance measures consider the absolute distance between objects, and the dissimilar-
ity is greater if two objects are further away from each other according to the consid-
ered variables.
• Similarity measures based on correlation values consider how similar the profiles of
two objects are, regardless of the specific values of the objects according to the con-
sidered variables.
Let us consider an example: For a number of companies, product sales have been recorded over a period of five years (= variables). With the help of a cluster analysis, these companies are grouped according to either
• the level of their sales, or
• the development (trend) of their sales over time.
In the first case, clustering is based on the sales level, which means that the proxim-
ity between the companies must be determined using a distance measure. In the second
case, similar sales trends are of interest. Therefore, a similarity measure (e.g. the Pearson
correlation coefficient) is a feasible proximity measure.
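The contrast between the two kinds of proximity measure can be sketched with the chocolate data (attribute values reconstructed from the worked calculations in the text): ‘Biscuit’ and ‘Espresso’ are far apart, yet their profiles are parallel.

```python
from math import sqrt

# Distance vs. profile similarity. 'Biscuit' and 'Espresso' have the
# largest squared Euclidean distance (75), but their profiles across the
# three attributes are parallel, so their Pearson correlation is 1.
# 'Biscuit' and 'Nougat' are close in distance but uncorrelated.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

biscuit, espresso, nougat = [1, 2, 1], [6, 7, 6], [3, 2, 1]

d2 = sum((a - b) ** 2 for a, b in zip(biscuit, espresso))
print(d2)                                    # 75
print(round(pearson(biscuit, espresso), 6))  # 1.0
print(round(pearson(biscuit, nougat), 6))    # 0.0
```

A distance measure would assign these companies (or flavors) to different clusters; a correlation-based similarity measure would group them together.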
The previous section has shown how to derive a distance or similarity matrix from a data set using proximity measures. The obtained distance or similarity matrix is the starting point for the subsequent use of clustering methods, which aim to assign similar objects
to the same cluster. Cluster analysis offers the user a wide range of algorithms for group-
ing a given set of objects. The great advantage of cluster analysis is that a large number
of variables can be used simultaneously to group the objects. Clustering methods can be
classified according to the type of clustering procedure. Figure 8.6 provides an overview
of different clustering methods (clustering algorithms).
The following explanations focus on hierarchical cluster procedures since they are of
great importance for practical applications. Explanations of the partitioning procedures
are given in Sect. 8.4. The hierarchical procedures may be classified into agglomerative and divisive methods. While the agglomerative methods start from the finest partition (each of the examined objects forms its own cluster), the divisive methods start from the coarsest partition (all examined objects are in one group). Agglomerative methods thus successively merge objects into groups, while divisive clustering methods successively divide the full sample into groups.9
Due to their great practical importance, we first focus on hierarchical, agglomerative clustering methods. The case study in Sect. 8.3 describes a hierarchical cluster analysis using Ward’s method. Sect. 8.4.2 gives a comparatively short description of partitioning clustering methods, concentrating on k-means and two-step cluster analysis since these methods play a central role in the analysis of large data sets.
9 Due to their rather minor practical importance, divisive cluster procedures will not be discussed here. If you consider applying a divisive clustering algorithm, you can do this in SPSS by clicking on ‘Analyze/Classify/Tree’.
[Fig. 8.6: Overview of cluster methods, divided into hierarchical methods and partitioning methods.]
• Starting point: Each object represents a cluster. With N objects, there are N single-object clusters.
• Step 1: For the objects (clusters) present at a given fusion stage, the pairwise distances or similarities between the objects (clusters) are calculated.
• Step 2: The two objects (clusters) with the smallest distance (or greatest similar-
ity) to each other are combined into a cluster. The number of objects or groups thus
decreases by 1.
• Step 3: The distances between the new cluster and the remaining objects or groups are
calculated, resulting in the so-called reduced distance matrix.
• Step 4: Steps 2 and 3 are repeated until all objects are contained in only one cluster
(so-called single-cluster solution). With N objects, a total of N–1 fusion steps are car-
ried out.
10 The course of a fusion process is usually illustrated by a table (so-called agglomeration schedule) and by a dendrogram or icicle diagram. Both options are explained in detail for the single-linkage method in Sect. 8.2.3.2.1.
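The four steps above can be sketched as a generic agglomerative loop. Here the single-linkage rule (cluster distance = smallest pairwise object distance) is used together with the squared Euclidean distances of Table 8.6:

```python
# Generic agglomerative algorithm (steps 1-4) for the five-flavor
# example, with the single-linkage rule as cluster distance.
# Pairwise object distances are the squared Euclidean values of Table 8.6.
d = {
    ('Biscuit', 'Nut'): 6, ('Biscuit', 'Nougat'): 4,
    ('Biscuit', 'Cappuccino'): 56, ('Biscuit', 'Espresso'): 75,
    ('Nut', 'Nougat'): 6, ('Nut', 'Cappuccino'): 26,
    ('Nut', 'Espresso'): 41, ('Nougat', 'Cappuccino'): 44,
    ('Nougat', 'Espresso'): 59, ('Cappuccino', 'Espresso'): 11,
}

def dist(a, b):
    return d[(a, b)] if (a, b) in d else d[(b, a)]

clusters = [frozenset([x]) for x in
            ['Biscuit', 'Nut', 'Nougat', 'Cappuccino', 'Espresso']]
schedule = []
while len(clusters) > 1:
    # Step 2: find the pair of clusters with the smallest distance,
    # where cluster distance = min over all object pairs (single linkage).
    p, q = min(
        ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
        key=lambda pq: min(dist(x, y) for x in pq[0] for y in pq[1]),
    )
    schedule.append(min(dist(x, y) for x in p for y in q))
    # Step 3: fuse and continue with the reduced set of clusters.
    clusters = [c for c in clusters if c not in (p, q)] + [p | q]

print(schedule)  # [4, 6, 11, 26] -- the single-linkage fusion levels
```

With N = 5 objects, exactly N − 1 = 4 fusion steps are carried out, ending in the single-cluster solution.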
Starting from the distance matrix, hierarchical agglomerative methods recalculate the distance between a newly formed cluster P + Q and every remaining cluster R with the general recurrence formula
D(R; P + Q) = A · D(R, P) + B · D(R, Q) + E · D(P, Q) + G · |D(R, P) − D(R, Q)|    (8.5)
with
D(R, P) distance between clusters R and P
D(R, Q) distance between clusters R and Q
D(P, Q) distance between clusters P and Q
The values A, B, E and G are constants that vary depending on the algorithm used. The agglomerative methods listed in Table 8.10 are characterized by assigning certain values to the constants in Eq. (8.5). Table 8.10 shows the respective values and the resulting distance calculations for selected agglomerative processes (cf. Kaufman & Rousseeuw, 2005, pp. 225 ff.).
While for the first four methods all available proximity measures may be used, the
application of the methods “Centroid”, “Median” and “Ward” only makes sense if a dis-
tance measure is used. Regarding the measurement level of the data, the procedures can
be applied to both metric and non-metric data. The only decisive factor here is that the
proximity measures used are matched to the measurement level (metric or non-metric).
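The recurrence of Eq. (8.5) can be sketched as a small helper function. The constants for the single- and complete-linkage cases follow from Eqs. (8.6) and (8.7); Table 8.10 itself is not reproduced here:

```python
# Lance-Williams recurrence of Eq. (8.5): the distance between a cluster R
# and a newly formed cluster P+Q is a weighted combination of the old
# distances, with constants A, B, E, G chosen per method.
def lance_williams(d_rp, d_rq, d_pq, A, B, E, G):
    return A * d_rp + B * d_rq + E * d_pq + G * abs(d_rp - d_rq)

# Single linkage:   A = B = 0.5, E = 0, G = -0.5  ->  min(d_rp, d_rq)
# Complete linkage: A = B = 0.5, E = 0, G = +0.5  ->  max(d_rp, d_rq)
# Example: R = Cappuccino, P = Biscuit, Q = Nougat (values from Table 8.6).
d_rp, d_rq, d_pq = 56, 44, 4

print(lance_williams(d_rp, d_rq, d_pq, 0.5, 0.5, 0.0, -0.5))  # 44.0
print(lance_williams(d_rp, d_rq, d_pq, 0.5, 0.5, 0.0, 0.5))   # 56.0
```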
[Dendrogram residue (cf. Figs. 8.7 and 8.8): the objects Espresso, Cappuccino, Nut, Nougat and Biscuit plotted against the heterogeneity coefficient; distance scales 4–16 and 4–26.]
[Fig. 8.8: Dendrogram of the complete-linkage method: objects Espresso, Cappuccino, Nut, Nougat, Biscuit; fusion stages 1–4 at distances 4, 6, 11 and 75 (distance axis broken between 20 and 72).]
those objects that are merged at a certain clustering level. For the three clustering meth-
ods presented in Sect. 8.2.3.2, the corresponding dendrograms are shown in Figs. 8.7, 8.8
and 8.9, respectively.
[Fig. 8.9: Dendrogram of Ward’s method: objects Espresso, Cappuccino, Nut, Nougat, Biscuit plotted against the variance criterion (axis broken between 18 and 62); fusion stages 1–4 with error sums of squares 4, 5.333, 10.833 and 65.600.]
Stage 1
The objects that have the smallest distance, i.e. the objects that are most similar, are
combined. Thus, in the first iteration the objects ‘Biscuit’ and ‘Nougat’ are combined
with a distance of 4 (see stage 1 in Fig. 8.7).
Stage 2
Since ‘Biscuit’ and ‘Nougat’ now form a separate cluster, the distance of this cluster to
all other objects must be determined next. The new distance between the new cluster
‘Biscuit, Nougat’ and an object R is determined according to Eq. (8.5) as follows (see
Table 8.10).
D(R; P + Q) = 0.5{D(R, P) + D(R, Q) − |D(R, P) − D(R, Q)|} (8.6)
Thus, the distance sought is simply the smallest value of the individual distances:
D(R; P + Q) = min{D(R, P); D(R, Q)}
The single-linkage method thus assigns to a newly formed group the smallest distance
resulting from the old distances of the objects combined in the group to a specific other
object. Therefore, this method is also known as “nearest neighbor method”.
Let us illustrate this approach with the example of the distance between the cluster ‘Biscuit, Nougat’ and ‘Cappuccino’. To calculate the new distance, the distances between
‘Biscuit’ and ‘Cappuccino’ and between ‘Nougat’ and ‘Cappuccino’ are calculated. The
initial distance matrix (Table 8.6) shows that the first distance is 56 and the second dis-
tance is 44. Thus, in the second iteration of the single-linkage procedure, the distance
between the ‘Biscuit, Nougat’ group and ‘Cappuccino’ is 44.
Formally, these distances can also be determined using Eq. (8.6). P + Q represents the group ‘Nougat’ (P) and ‘Biscuit’ (Q), and R represents a remaining object. In our example, the new distances between ‘Nougat, Biscuit’ and the other objects result as follows (see the values in Table 8.6):
D(Nut; Nougat + Biscuit) = min{6; 6} = 6
D(Cappuccino; Nougat + Biscuit) = min{44; 56} = 44
D(Espresso; Nougat + Biscuit) = min{59; 75} = 59
The reduced distance matrix is obtained by removing the rows and columns of the
merged objects from the distance matrix and inserting a new column and row for the
newly built cluster. At the end of the first iteration, a reduced distance matrix is generated
(Table 8.12), which is used as starting point for the second iteration.
In stage 2, again the objects (clusters) with the smallest distance (according to the reduced distance matrix) are combined. This means that ‘Nut’ is included in the cluster ‘Nougat, Biscuit’, because it has the smallest distance d = 6 (see stage 2 in Fig. 8.7).
Stage 3
For the reduced distance matrix in the second iteration, the distances between the group ‘Nougat, Biscuit, Nut’ and ‘Cappuccino’ or ‘Espresso’ are calculated as follows:
D(Cappuccino; Nougat + Biscuit + Nut) = min{44; 26} = 26
D(Espresso; Nougat + Biscuit + Nut) = min{59; 41} = 41
The smallest distance in the resulting reduced distance matrix (Table 8.13) is now d = 11 between ‘Cappuccino’ and ‘Espresso’, so these two objects are merged in stage 3 (see stage 3 in Fig. 8.7).
Stage 4
The distance between the remaining clusters ‘Nut, Nougat, Biscuit’ and ‘Espresso,
Cappuccino’ is calculated on the basis of Table 8.13 as follows:
D(Nougat, Biscuit, Nut; Cappuccino, Espresso) = 0.5 · {(26 + 41) − |26 − 41|} = 26
This means that the two clusters ‘Nut, Nougat, Biscuit’ and ‘Espresso, Cappuccino’ are
combined in step 4 at a distance of d = 26 (cf. stage 4 in Fig. 8.7). After this step, all five
objects are combined in one cluster.
Summary
The clustering steps may be summarized in an agglomeration schedule. Table 8.14 shows
the agglomeration schedule for the single-linkage method for the application example
(cf. Table 8.5) and the corresponding distance matrix (Table 8.6). In the first step, objects
1 (Biscuit) and 3 (Nougat) are fused at a heterogeneity coefficient of 4.0. This corre-
sponds to the squared Euclidean distance between the two objects. In Table 8.14, these
objects are identified as single (“0” in the column “Stage Cluster First Appears”). In con-
trast, a newly formed cluster is always identified by the smallest number of the fused
objects (here: 1). This cluster 1 is then fused with object 2 (‘Nut’) in stage 2. The group
formed in this way is again identified as “1”. Only in stage 4 is it merged with group “4”, which consists of objects 4 (‘Cappuccino’) and 5 (‘Espresso’).
The steps of the clustering process shown in Table 8.14 can also be illustrated graph-
ically. Figure 8.7 shows the development of the dendrogram in the four clustering steps.
In software programs, however, only the overall process is shown (cf. the dendrogram for
stage 4).
Stage 1
The complete-linkage method also merges the objects ‘Biscuit’ and ‘Nougat’ in the first
step since they have the smallest distance (dk,l = 4) according to Table 8.6. The sin-
gle-linkage and the complete-linkage processes differ, however, in how the next dis-
tances are calculated.
Complete-linkage clustering calculates distances according to Eq. (8.5) (see
Table 8.10):
D(R; P + Q) = 0.5 · {D(R, P) + D(R, Q) + |D(R, P) − D(R, Q)|} (8.7)
Equivalently, the distance can also be determined as follows:
D(R; P + Q) = max{D(R, P); D(R, Q)}
Therefore, this process is also referred to as “furthest neighbor method”. So, starting
with the distance matrix in Table 8.6, the objects ‘Biscuit’ and ‘Nougat’ are combined in
the first step. However, the distance of this group to the others, e.g. ‘Cappuccino’, is now determined by the largest individual distance, which yields the reduced distance matrix. Formally, the individual distances according to Eq. (8.7) are as follows:
D(Nut; Biscuit + Nougat) = max{6; 6} = 6
D(Cappuccino; Biscuit + Nougat) = max{56; 44} = 56
D(Espresso; Biscuit + Nougat) = max{75; 59} = 75
Stage 2
In stage 2, ‘Nut’ is included in the group ‘Biscuit, Nougat’ as it has the smallest distance
with d = 6 (Table 8.15). The process now continues in the same way as with the sin-
gle-linkage method (cf. the agglomeration schedule in Table 8.14), with the respective
distances always determined according to Eq. (8.7). Figure 8.8 shows the final result as a
dendrogram.
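The complete-linkage update after the first fusion can be sketched directly; the resulting values are those of the reduced distance matrix (Table 8.15) implied by the calculations above:

```python
# Complete-linkage distances between the new cluster {Biscuit, Nougat}
# and the remaining objects, per Eq. (8.7): the largest of the old
# individual distances from Table 8.6.
old = {  # (distance to Biscuit, distance to Nougat)
    'Nut': (6, 6),
    'Cappuccino': (56, 44),
    'Espresso': (75, 59),
}
reduced = {obj: max(pair) for obj, pair in old.items()}

print(reduced)  # {'Nut': 6, 'Cappuccino': 56, 'Espresso': 75}
```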
8.2.3.2.3 Ward’s Method
Ward’s method is widely used in practice. It differs from the previous ones not only in
how the new distances are calculated, but also in its clustering approach. The distance
between the last cluster formed and the remaining groups is calculated as follows (see
Table 8.10):
D(R; P + Q) = 1/(N_R + N_P + N_Q) · {(N_R + N_P) · D(R, P) + (N_R + N_Q) · D(R, Q) − N_R · D(P, Q)}    (8.8)
According to Ward’s method, it is not the groups with the smallest distance that are combined, but rather those objects (groups) whose combination increases the variance in the resulting group the least. The variance s_g² of a group g is calculated as follows:
s_g² = Σ_{i=1}^{N_g} Σ_{j=1}^{J} (x_{ijg} − x̄_{jg})²    (8.9)
with
x_{ijg} observed value of variable j (j = 1, …, J) for object i (i = 1, …, N_g) in group g
x̄_{jg} mean value of the observed values of variable j in group g (= (1/N_g) Σ_{i=1}^{N_g} x_{ijg})
Equation (8.9) is also called variance criterion or error sum of squares. If the Ward procedure is based on the squared Euclidean distance as a proximity measure, in the first step we can use the squared Euclidean distances in Table 8.6 for the five-products example.
Accordingly, the error sum of squares has a value of zero in the first step. This means
that each object forms an “independent group” and no variance occurs in the variable
values of the objects. The target criterion for the Ward procedure for grouping objects
(groups) is:
“Fuse those objects (groups) that increase the error sum of squares the least.”
This error sum of squares is the heterogeneity measure in the Ward procedure. It
can be shown that the values of the distance matrix in Table 8.6 (squared Euclidean
distances) and the distances calculated with the help of Eq. (8.8) correspond exactly to
twice the increase of the sum of error squares according to Eq. (8.9) with the clustering
of two objects (groups).
In the first stage, Ward’s method also combines the two objects with the smallest
squared Euclidean distance. In our application example, these are the products ‘Biscuit’
and ‘Nougat’, which have a squared Euclidean distance of 4 (see Table 8.6). Taking
into account that the mean value of the variable ‘Price’ is (1 + 3)/2 = 2, the error sum of
squares in this case is 2.
In the next step, the distances between the new group ‘Nougat, Biscuit’ and the remaining objects are determined according to Eq. (8.8):
D(Nut; Nougat + Biscuit) = 1/3 · {(1 + 1) · 6 + (1 + 1) · 6 − 1 · 4} = 6.667
D(Cappuccino; Nougat + Biscuit) = 1/3 · {(1 + 1) · 56 + (1 + 1) · 44 − 1 · 4} = 65.333
D(Espresso; Nougat + Biscuit) = 1/3 · {(1 + 1) · 75 + (1 + 1) · 59 − 1 · 4} = 88.000
The result is the reduced distance matrix in Table 8.16, which also shows the double
increase in the error sum of squares when two objects (groups) are merged.
The double increase of the error sum of squares is smallest with the addition of ‘Nut’
to the group ‘Biscuit, Nougat’. In this case, the error sum of squares is only increased by
1/2 of 6.667 = 3.333. After this step, the total error sum of squares is:
s² = 2 + 3.333 = 5.333
where the value 2 represents the increase in the error sum of squares of the first step.
Upon completion of this fusion, the products ‘Biscuit, Nougat, Nut’ are in one group,
and the error sum of squares is 5.333.
In stage 3, the distances between the group ‘Biscuit, Nougat, Nut’ and the remaining
products have to be determined. For this purpose, we use Eq. (8.8) and the results of the
first run presented in Table 8.16:
D(Cappuccino; Biscuit + Nougat + Nut) = 1/4 · {(1 + 2) · 65.333 + (1 + 1) · 26 − 1 · 6.667} = 60.333
D(Espresso; Biscuit + Nougat + Nut) = 1/4 · {(1 + 2) · 88.000 + (1 + 1) · 41 − 1 · 6.667} = 84.833
The result of this step is shown in Table 8.17. It becomes clear that the double increase
in the error sum of squares is smallest when we combine the objects ‘Cappuccino’ and
‘Espresso’ in this step. In this case, the error sum of squares only increases by 1/2 of
11 = 5.5, i.e. the result after merging is as follows:
s² = 5.333 + 5.5 = 10.833
The value 10.833 reflects the amount of the error sum of squares after completion of the
third step. According to Eq. (8.9), the total value is correctly split into the following two
individual values: s²(Biscuit, Nougat, Nut) = 5.333 and s²(Cappuccino, Espresso) = 5.5.
In the last step, the groups ‘Biscuit, Nougat, Nut’ and ‘Cappuccino, Espresso’ are
merged, leading to a double increase of the error sum of squares:
D(Biscuit, Nougat, Nut; Espresso, Cappuccino) = 1/5 · {(3 + 1) · 84.833 + (3 + 1) · 60.333 − 3 · 11} = 109.533
After this step, all objects are merged into one cluster, with the variance criterion
increasing by ½ of 109.533 = 54.767. The total error sum of squares in the final state is
therefore 10.833 + 54.767 = 65.6.
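The whole Ward procedure can be sketched as a short loop: clusters are fused where the fusion level (twice the ESS increase) is smallest, and distances are updated with Eq. (8.8). The attribute values are reconstructed from the worked calculations in the text:

```python
# Ward's method for the five-flavor example. Distances start as squared
# Euclidean distances (Table 8.6) and are updated with Eq. (8.8); each
# fusion level equals twice the increase in the error sum of squares.
data = {
    'Biscuit': (1, 2, 1), 'Nut': (2, 3, 3), 'Nougat': (3, 2, 1),
    'Cappuccino': (5, 4, 7), 'Espresso': (6, 7, 6),
}

def sq_euclid(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

clusters = {(k,): 1 for k in data}  # cluster (tuple of names) -> size
dist = {(p, q): sq_euclid(data[p[0]], data[q[0]])
        for i, p in enumerate(clusters) for q in list(clusters)[i + 1:]}

def D(p, q):
    return dist[(p, q)] if (p, q) in dist else dist[(q, p)]

ess = 0.0
while len(clusters) > 1:
    keys = list(clusters)
    p, q = min(((a, b) for i, a in enumerate(keys) for b in keys[i + 1:]),
               key=lambda pq: D(*pq))
    ess += D(p, q) / 2              # fusion level = 2 x ESS increase
    new, n_new = p + q, clusters[p] + clusters[q]
    for r in keys:
        if r in (p, q):
            continue
        nr = clusters[r]
        dist[(r, new)] = ((nr + clusters[p]) * D(r, p)
                          + (nr + clusters[q]) * D(r, q)
                          - nr * D(p, q)) / (nr + n_new)   # Eq. (8.8)
    del clusters[p], clusters[q]
    clusters[new] = n_new
    print(sorted(new), round(ess, 3))
# Printed ESS after each fusion: 2.0, 5.333, 10.833, 65.6 -- as in the text.
```

The loop reproduces the sequence of the worked example: fusion levels 4, 6.667, 11 and 109.533, with the total error sum of squares ending at 65.6.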
The clustering process according to Ward’s method can also be summarized by a
dendrogram, with the error sum of squares (variance criterion) listed after each stage of
fusion (Fig. 8.9).
Table 8.17 Matrix of double increases in heterogeneity after the second clustering step of the Ward process
            Biscuit, Nougat, Nut   Cappuccino
Cappuccino  60.333
Espresso    84.833                 11
• Dilating procedures tend to group the objects into individual groups of approximately
equal size.
• Contracting algorithms tend to form a few large groups with many small ones “left
over”. Contracting algorithms are thus especially suitable for identifying outliers in an
object space.
• Conservative procedures show no tendency to dilate or contract.
1. Since the single-linkage method tends to form many small and a few large groups (contracting method), it forms a good basis for the identification of outliers in a set of objects. However, the disadvantage of this procedure is that it tends to form chains (because of the large groups), which means that poorly separated groups are not detected.
2. While the fusion processes in the application example are identical for the single- and
complete-linkage procedures, the complete-linkage procedure tends to form smaller
groups. This is due to the fact that the largest value of the individual distances is used
as the new distance. Therefore, the complete-linkage method is not suitable for detect-
ing outliers in a population of objects. Instead, these lead to a distortion of the group-
ing in the complete-linkage process and should therefore be eliminated beforehand
(e.g. with the help of the single-linkage process).
3. Compared to other algorithms, Ward’s method mostly finds very good partitionings
and usually correctly assigns the elements to groups. Ward’s method is a very good
clustering algorithm, particularly in the following cases (Milligan, 1980; Punj &
Stewart, 1983):
– The use of a distance measure is a useful criterion (in terms of content) for deter-
mining similarity.
– All variables are metric.
– No outliers are contained in a set of objects or were previously eliminated.
– The variables are uncorrelated.
– The number of elements in each group is approximately the same.
– The groups have approximately the same extension.
The latter three conditions relate to the applicability of the variance criterion used in the
Ward procedure. However, the Ward process tends to build clusters of equal size and is
not able to detect elongated groups or groups with a small number of elements.
11 For the extended example, the dendrograms were created using the procedure CLUSTER in SPSS (see Sect. 8.3.2).
[Fig. 8.10 Example data: three groups (A, B, C) and six objects marked as outliers 1–6]
If we apply the single-linkage method to the example data in Fig. 8.10, the corresponding
dendrogram in Fig. 8.11 shows that the method tends to form chains. While the objects
of the three different groups are merged more or less on the same level, the objects
marked as outliers are only fused at the end of the process. This demonstrates that the
single-linkage method is particularly suitable for detecting outliers in a set of objects.
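This behavior is easy to reproduce. The following Python sketch (hypothetical toy data, naive O(n³) implementation, not the book's SPSS procedure) runs single-linkage agglomeration on two tight groups plus one distant point; the outlier joins only at the last and largest merge height:

```python
import math

def single_linkage_merges(points):
    """Naive single-linkage agglomeration (for illustration only);
    returns the merge heights in fusion order."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        heights.append(d)
        clusters[a] = clusters[a] + clusters.pop(b)
    return heights

# Two tight groups plus one distant outlier (hypothetical data):
pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (20, 20)]
h = single_linkage_merges(pts)
# The outlier is fused only at the very last, largest fusion height:
assert h[-1] == max(h)
```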
If, on the other hand, the complete-linkage and Ward methods are applied to the
data, they lead to clearly different clustering processes. Figure 8.12 shows that the com-
plete-linkage process does recognize a three-cluster solution, but only group C is defi-
nitely isolated, while group B is only partially separated and the majority of the elements
in B are combined with the objects in group A. The complete-linkage method is there-
fore not able to reproduce the “true grouping” (as given by the example data) according
to Fig. 8.10.
The dendrogram of Ward’s method in Fig. 8.13 shows that a four-cluster solution is
formed at a normalized heterogeneity measure of about 4 and a three-cluster solution
at a heterogeneity measure of about 7. The two-cluster solution only emerges at a clus-
ter distance of approx. 8. Ward’s method is thus able to reproduce the “true grouping”
according to the example data in Fig. 8.10. In the three-cluster solution the six outliers
are distributed over the three groups.
484 8 Cluster Analysis
[Fig. 8.13 Dendrogram of Ward's method for the extended example, with the 2-, 3- and 4-cluster solutions marked]
Fig. 8.14 Agglomeration schedule of the Ward method for the extended example
For Ward’s method, the fusion process illustrated by the dendrogram in Fig. 8.13 is
also represented by an agglomeration schedule (Fig. 8.14).12 We can see that only single objects are merged in the first five stages (all numbers are “0” in the column “Stage
Cluster First Appears”), while in stages 50 and 53 to 55, the clusters formed on previous
levels are merged. We can also see that the outlier objects 53 and 54 (identifiers 0) are
merged at stages 50 and 52. The column “Coefficients” lists the error sum of squares
(variance criterion) for each case, which is used by Ward’s method as a measure of heter-
ogeneity in the fusion process.
12 The agglomeration schedule was also created using the procedure CLUSTER in SPSS.
In the previous sections we described different methods for merging individual objects
into groups. In this process, all agglomerative processes start with the finest partitioning
(all objects form separate clusters) and end with grouping all objects into one large clus-
ter. In a third step, we now need to decide which number of groups represents the “best”
solution. Since the user does not know the optimal number of clusters beforehand, this
number needs to be determined on the basis of statistical criteria. Usually, the user does
not have any factually justifiable ideas about the grouping of the objects of investigation
and therefore tries to uncover a grouping inherent in the data with the help of the cluster
analysis. Against this background, the determination of the number of clusters should
also be oriented towards statistical criteria and not be justified factually (with regard to
the cases assigned to the groups).
In deciding on the number of clusters, there is always a trade-off between the “homo-
geneity” and the “manageability” of a clustering solution. Factual logic considerations
can also be used to resolve this conflict, but these should only relate to the number of
clusters to be chosen and not to the cases grouped together in the clusters.
The following sections describe different options for determining the number of
clusters.13
13 Since there are no criteria available in SPSS for determining the optimal number of clusters, it is recommended to use alternative programs such as S-Plus, R or SAS and the cubic clustering criterion (CCC) if available.
Fig. 8.15 Scree plot for determining the number of clusters according to the elbow criterion
Rule of Calinski/Harabasz
The criterion of Calinski and Harabasz was identified as the best stopping rule in comparative simulation studies (cf. Milligan & Cooper, 1985), as it was able to reveal the “true” group structure in over 90% of the cases examined. “True” here means that the grouping built into the simulated data could be uncovered.
The criterion according to Calinski and Harabasz (1974), which is suitable for met-
ric attributes, considers the ratio of the deviation between groups (SSb) and the devia-
tion within a group (SSw) as a test statistic, in analogy to the analysis of variance (cf.
Chap. 3). This test statistic is calculated for all clustering solutions. If its value decreases
monotonically with an increasing number of groups, this indicates that there is no group
structure in the data set. If, however, the test statistic increases with the number of clus-
ters, a hierarchically structured data set can be assumed. The cluster number k at which
the test statistic reaches a maximum is taken as the number of groups that exist within a
data set.
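Assuming the standard definition of the variance-ratio criterion, CH(k) = (SSb/(k−1)) / (SSw/(n−k)), a self-contained Python sketch might look like this (toy data of our own, not from the book):

```python
def calinski_harabasz(points, labels):
    """Variance-ratio criterion: CH = (SSb / (k-1)) / (SSw / (n-k))."""
    n, k = len(points), len(set(labels))
    dim = len(points[0])
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    ssb = ssw = 0.0
    for g in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == g]
        cent = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        ssb += len(members) * sum((cent[d] - grand[d]) ** 2 for d in range(dim))
        ssw += sum((p[d] - cent[d]) ** 2 for p in members for d in range(dim))
    return (ssb / (k - 1)) / (ssw / (n - k))

# Two clearly separated toy groups yield a large index value:
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ch = calinski_harabasz(pts, [0, 0, 0, 1, 1, 1])
```

The test statistic would be computed for each candidate k and the k with the maximum value chosen.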
Test of Mojena
The test of Mojena was also identified as one of the ten best methods for determining
the number of clusters. In the following, we will have a look at this test, because it can
be easily carried out with the help of a spreadsheet. As a starting point, this test uses the
standardized clustering coefficients (α̃) per clustering step, which are calculated as fol-
lows (Mojena, 1977, p. 359):14
ᾱ = (1/(n−1)) Σi αi ;   sα = √[(1/(n−2)) Σi (αi − ᾱ)²] ;   α̃i = (αi − ᾱ)/sα   (i = 1, …, n−1)      (8.10)
The optimum number of clusters is indicated by the group number for which a given
threshold value of the standardized clustering coefficient is exceeded for the first time.
The literature mentions different requirements for defining this threshold value. In his
simulation study, Mojena (1977) achieved the best results with a threshold value of 2.75.
The studies of Milligan and Cooper (1985) mention a value of 1.25, with the quality of
the result only slightly varying for threshold values between 1 and 2. However, the opti-
mal parameter strongly depends on the data structure. We recommend selecting a value
between 1.8 and 2.7, since this seems well suited for most data sets, based on various
studies carried out in the past.
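Since the test only needs the fusion coefficients, it can indeed be carried out in a few lines. A Python sketch of Eq. (8.10) with hypothetical coefficients (ours, not the case-study values):

```python
import math

def mojena_standardized(alphas):
    """Standardize the m = n-1 fusion coefficients per Eq. (8.10):
    mean over the m coefficients, variance with divisor m-1 (= n-2)."""
    m = len(alphas)
    mean = sum(alphas) / m
    s = math.sqrt(sum((a - mean) ** 2 for a in alphas) / (m - 1))
    return [(a - mean) / s for a in alphas]

# Hypothetical fusion coefficients for 11 objects (10 fusion steps):
coeffs = [0.8, 1.8, 3.2, 4.9, 7.0, 10.4, 14.0, 18.1, 22.4, 48.0]
tilde = mojena_standardized(coeffs)
# With threshold 2.0, the first exceedance occurs at the last step:
exceed = [i for i, t in enumerate(tilde) if t > 2.0]
```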
14 For a brief summary of the basics of statistical testing, see Sect. 1.3.
15 In addition to KM-CA, two-step cluster analysis may also be used to optimize a clustering solution found by another procedure. Both methods belong to the partitioning clustering methods described in detail in Sect. 8.4.2.
The interpretation of a clustering solution should be based on the values of the cluster
variables in the clusters identified. It is useful to make a comparison with the survey pop-
ulation by calculating the t-values and F-values. Additionally, a discriminant analysis
may be used for analyzing the differences between the clusters found (cf. Chap. 4).
Calculation of t-values
Similarly to testing for differences in mean values, t-values for each variable j in each
cluster g may be calculated as follows:
tgj = (x̄gj − x̄j) / sj      (8.11)

with
x̄gj   average of variable j in cluster g (g = 1, …, G)
x̄j    average of variable j in the survey population (j = 1, …, J)
sj    standard deviation of variable j in the survey population
Thus, these values do not actually assess the quality of a clustering solution, but can be
used to characterize the clusters.
Calculation of F-values
To assess the homogeneity of a cluster in comparison to the survey population, F-values
for each variable j in each group g can also be calculated, analogous to the F-test:18
18 For a brief summary of the basics of statistical testing, see Sect. 1.3.
Fgj = s²gj / s²j      (8.12)

with
s²gj   variance of variable j in cluster g (g = 1, …, G)
s²j    variance of variable j in the survey population (j = 1, …, J)
The lower the F-value, the smaller the dispersion of this variable within the cluster compared to the survey population. The F-value should not exceed 1, because otherwise the
corresponding variable has a greater variation within the cluster than in the survey pop-
ulation. The calculations of t- and F-values are shown in detail in Sect. 8.3.3.3 for the
chocolate case study.
Discriminant analysis
A discriminant analysis (see Chap. 4) can also be used to characterize a clustering solu-
tion. In this case, the clusters found by the cluster analysis form the dependent, nomi-
nally scaled variable. The metrically scaled variables used for clustering can be used as
the independent variables of the discriminant analysis. In this way, it is possible to deter-
mine, for example, which variable is particularly responsible for the separation between
the clusters. In addition, other variables that the user considers useful can also be used in
discriminant analysis. In this way, it is possible to investigate differences with regard to
other variables for the groups identified in the cluster analysis.
From the above considerations, the following four-step procedure may be recommended
for conducting a hierarchical, agglomerative cluster analysis:
Further recommendations for conducting a cluster analysis can be found in Sect. 8.5.
We now use a larger sample to demonstrate how to conduct a cluster analysis with the
help of SPSS.
A manager of a chocolate company wants to know how consumers evaluate different
chocolate flavors with respect to subjectively perceived attributes. For this purpose, the
manager has identified 11 flavors and has selected 10 attributes that appear to be relevant
for the evaluation of these flavors.
A small pretest with 18 test persons was carried out. The persons were asked to eval-
uate the 11 flavors (chocolate types) with regard to the 10 attributes (see Table 8.19).19 A
seven-point rating scale (1 = low, 7 = high) was used for each attribute. Thus, the varia-
bles represent perceived attributes of the chocolate types.
However, not all persons were able to evaluate all 11 flavors. Thus, the data set con-
tains only 127 evaluations instead of the complete number of 198 evaluations (18 per-
sons × 11 flavors). Any evaluation comprises the scale values of the 10 attributes for a
certain flavor as given by a respondent. Thus, it reflects the subjective assessment of a
specific chocolate flavor by a particular test person. Since each test person assessed more
than just one flavor, the observations are not independent. Yet, for simplicity’s sake, we
will treat the observations as such in the following.
Of the 127 evaluations, only 116 are complete, while 11 evaluations contain miss-
ing values.20 We exclude all incomplete evaluations from the analysis. Consequently, the
number of cases is reduced to 116.
The manager of the chocolate company now wants to know which chocolate varieties
are rated similarly in terms of their attributes by his customers. To answer this question,
first the average ratings of the 18 test persons regarding the attributes of the 11 choc-
olate flavors are determined. The mean values are calculated with the SPSS procedure
‘Means’, which is called up via the following menu sequence: ‘Analyze/Compare Means/
Means’. The input matrix for the cluster analysis is thus an 11 × 10 matrix, with the
11 chocolate flavors as cases and the 10 averaged attribute assessments as variables. It
should be noted that averaging per chocolate means that the information about the varia-
tions in assessment between individuals is lost.21
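The aggregation step performed by the SPSS procedure 'Means' can be sketched in plain Python (hypothetical mini data set with two attributes instead of ten):

```python
from collections import defaultdict

# Stand-in for the SPSS 'Means'/AGGREGATE step: average each attribute
# per chocolate flavor (hypothetical mini data set, two attributes only).
ratings = [
    ("Milk",  {"price": 4, "sweet": 5}),
    ("Milk",  {"price": 5, "sweet": 4}),
    ("Mango", {"price": 3, "sweet": 5}),
    ("Mango", {"price": 4, "sweet": 6}),
]

sums = defaultdict(lambda: defaultdict(float))
counts = defaultdict(int)
for flavor, attrs in ratings:
    counts[flavor] += 1
    for name, value in attrs.items():
        sums[flavor][name] += value

# Rows of the resulting flavor x attribute input matrix:
means = {f: {a: s / counts[f] for a, s in attr_sums.items()}
         for f, attr_sums in sums.items()}
print(means["Milk"]["price"])  # 4.5
```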
19 Supplementary material (e.g. Excel files) is made available on the website www.multivariate.de, with the help of which the reader can deepen his or her understanding of cluster analysis.
20 Missing values are a frequent and unfortunately unavoidable problem when conducting surveys
(e.g. because people cannot or do not want to answer the question, or as a result of mistakes by the
interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.
21 The mean values were calculated on the basis of the data set that was also used in the case study
of discriminant analysis (Chap. 4), logistic regression (Chap. 5) and factor analysis (Chap. 7). Using
the same case study allows us to illustrate the similarities and differences between the methods.
8.3 Case Study 495
To identify groups with similarly perceived chocolate flavors (market segments), the
manager conducts a cluster analysis. Based on the recommendations in Sect. 8.2.3.3, he
uses Ward’s method as a fusion algorithm and squared Euclidean distances as a proxim-
ity measure.
The sub-menu ‘Method’ is used to determine the clustering method (clustering algo-
rithm) and the proximity measure (‘Measure’). Various measures are available for met-
rically scaled (‘Interval’), nominally scaled (‘Counts’) and binary-coded (‘Binary’)
variables (cf. Fig. 8.19).
Overall, the proximity measures listed in Table 8.4 are available, comprising measures
for metric (interval-scaled) data, count data and binary data. Seven different clustering
methods can be selected from the drop-down list ‘Cluster Method’. In the case study,
the single-linkage method (nearest neighbor) is initially selected to identify outliers.
Then the 11 chocolate varieties are analyzed using Ward’s method and with the squared
Euclidean distance as a proximity measure. Once the settings have been entered in the
sub-menus, the ‘Continue’ button takes us back to the starting point of the ‘Hierarchical
Cluster Analysis’ and the analysis can be started by pressing ‘OK’.
8.3.3 Results
In the following, the results of the cluster analysis generated with SPSS are presented
first. Then criteria for determining the cluster number in the case study are shown. The
considerations conclude with a characterization of the clustering solution found, using t- and F-values.
variance criterion) which is calculated at the end of a clustering step. The objects or
clusters merged into a new cluster are assigned the number of the first object (cluster)
as identifier. The column “Stage Cluster First Appears” indicates the clustering step
in which the respective object (cluster) has been merged for the first time. The column
“Next Stage” indicates the next clustering step in which the cluster will be used. For example, in stage 7, cluster 1 (which was formed in stage 4) is combined with object 7 with a
heterogeneity measure of 13.969. The resulting cluster is assigned the identifier “1” and
used again in step 8. The value “0” in the column “Stage Cluster First Appears” indicates
that a single object is included in the clustering. Overall, it is clear that in the first four
fusion steps the chocolate varieties ‘Caramel’ (9), ‘Nut’ (11), ‘Nougat’ (10) and ‘Biscuit’
(3) are combined, with the error sum of squares after the fourth step being 4.906. This
means that the variance of the variable values in this group is still relatively small.
The dendrogram in Fig. 8.23 provides a graphical illustration of the clustering pro-
cess. In the dendrogram, SPSS normalizes the used heterogeneity measure to the interval
[0;25].
Fig. 8.21 Squared Euclidean distance matrix of the eleven chocolate types
Fig. 8.22 Agglomeration schedule of Ward’s method for the case study
Elbow criterion
To use the elbow criterion, the error sum of squares is plotted against the corresponding
cluster number in a diagram (see also Sect. 8.2.4.1). Figure 8.24 shows the resulting dia-
gram for the case study based on the coefficient values in the agglomeration schedule
(see Fig. 8.22). The diagram shows a clear “elbow” in the two-cluster solution of the
case study. However, it should be noted that in most analyses an elbow is evident in the
two-cluster solution, since the increase in heterogeneity is always greatest for the step
from the two- to the single-cluster solution. In practical applications, therefore, a second
elbow in the diagram is always necessary for an unequivocal decision. However, a sec-
ond elbow is not visible in this case.
Test of Mojena
For the implementation of the test of Mojena, we transfer the clustering steps from
Fig. 8.22 column ‘Coefficients’ to a spreadsheet, and calculate the standardized clustering coefficients (α̃) per clustering stage according to Eq. (8.10) (see Sect. 8.2.4.2). This leads to an average clustering coefficient of ᾱ = 14.066 and a standard deviation of the coefficients of sα = 18.096. The results are shown in Table 8.20. With 2 as the critical threshold value, a two-cluster solution is suggested, since the threshold is first exceeded at the last fusion step (α̃₁₀ = 2.436).
Both a two-cluster and a three-cluster solution will be discussed below. Figure 8.25
shows which object is located in which cluster. For our example, the cluster assign-
ments for the 2-, 3-, 4- and 5-cluster solution are given. In a two-cluster solution, the
types ‘Milk’, ‘Espresso’, ‘Biscuit’, ‘Cappuccino’, ‘Mousse’, ‘Caramel’, ‘Nougat’ and
‘Nut’ (classic cluster) are grouped together in a first cluster and the types ‘Strawberry’,
‘Mango’ and ‘Orange’ (fruit cluster) are grouped in a second cluster.
In order to compare the agglomerative procedures, the case study was also analyzed with the complete-linkage, average-linkage, centroid and median methods. The main difference in comparison to Ward’s method is that for these methods the column “Coefficients” of the agglomeration schedule (see Fig. 8.22) does not contain the development of the error sum of squares, but the distances or similarities of the objects or groups that were merged. However, all procedures lead to identical solutions in the
two-cluster case, i.e. a “fruit cluster” and a “classic cluster” are identified.
Fig. 8.26 Cluster membership and final cluster centers according to the k-means method
and ‘healthy’) are more pronounced in cluster 1 (fruit cluster) than in cluster 2 (clas-
sic chocolate flavors). Furthermore, the reported analysis of variance (ANOVA) table shows the
squared deviations between the two groups per variable (see Fig. 8.27). The greater these
deviations are, the more a variable is responsible for the differences between the two
clusters. It can be seen that especially the variables ‘price’, ‘light’, ‘crunchy’, ‘exotic’
and 'fruity' show clear differences in the two clusters (cf. column “Sig.” in Fig. 8.27).
The F-tests, however, should only be used for descriptive purposes, as the clusters have been chosen to maximize the differences between cases in different clusters. The F-values and
the reported significance levels should not be interpreted here as statistical tests of the
hypothesis that the cluster means are equal. The above information can nevertheless be
used for the interpretation of the clusters as described in the following section.
Fig. 8.28 Mean values and variances of the assessments in the survey population (total) and the two clusters
Using the results from Fig. 8.28, the t- and F-values for the two clusters can now be
calculated as described in Sect. 8.2.5. The results are shown in Table 8.21.
The calculation of t- and F-values for the case study is shown here for the variable
‘price’ in the fruit cluster.
The t-value is calculated according to Eq. (8.11) as follows:

t = (3.581 − 4.675) / √0.636 = −1.372

For the F-value, Eq. (8.12) gives the following result:

F = 0.037 / 0.636 = 0.058
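The same two calculations in Python (input values taken from the worked example above; the function names are ours):

```python
import math

def t_value(cluster_mean, pop_mean, pop_var):
    """Eq. (8.11): deviation of the cluster mean from the population mean,
    scaled by the population standard deviation."""
    return (cluster_mean - pop_mean) / math.sqrt(pop_var)

def f_value(cluster_var, pop_var):
    """Eq. (8.12): cluster variance relative to the population variance."""
    return cluster_var / pop_var

# Variable 'price' in the fruit cluster (values from Fig. 8.28):
t = t_value(3.581, 4.675, 0.636)
f = f_value(0.037, 0.636)
print(round(t, 3), round(f, 3))  # -1.372 0.058
```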
Using the t- and F-values listed in Table 8.21, the two clusters can now be described as
follows:
• In the cluster ‘Classic’, the variables mostly have lower values than in the survey pop-
ulation (t-value < 0), while they have higher values in the cluster ‘Fruit’ (t-value > 0).
This means that in the cluster ‘Fruit’ these variables are perceived as much more
important. In the cluster ‘Classic’, however, they are perceived as significantly
weaker. Only the values of the variables ‘price’, ‘delicious’ and ‘healthy’ are above
average in the ‘Classic’ cluster, whereas they are below average in the ‘Fruit’ cluster.
The largest differences in the mean values are found for the variables ‘price’, ‘exotic’
and ‘fruity’.
• With regard to the homogeneity of the variables in the two clusters, the F-values
predominantly indicate significantly lower variances than in the survey population
(F-value < 1). In the ‘Classic’ cluster only the variable ‘sweet’ has a slightly higher
variance than in the survey population. In the ‘Fruit’ cluster, this applies to the variables
‘refreshing’ and ‘healthy’. Overall, however, with regard to the homogeneity in both
clusters, it can be stated that the variables almost always have a significantly lower
variance than in the survey population.

Table 8.22 Definition of the three groups for discriminant analysis and logistic regression

Group (segment)        Chocolate flavors in the segment               Cases (n)
g = 1 | Seg_1 Classic  Milk, Biscuit, Mousse, Caramel, Nougat, Nut    65
g = 2 | Seg_2 Fruit    Orange, Strawberry, Mango                      28
g = 3 | Seg_3 Coffee   Espresso, Cappuccino                           23
Above, we demonstrated how to use the graphical user interface (GUI) of SPSS to con-
duct a cluster analysis. Alternatively, we can use the SPSS syntax which is a program-
ming language unique to SPSS. Each option that is activated in SPSS’s GUI is translated
22 Multinomial logistic regression requires at least three groups. In case of a two-cluster solution a
binary logistic regression would have to be performed.
into SPSS syntax. If you click on ‘Paste’ in the main dialog box shown in Fig. 8.17, a new window opens with the corresponding SPSS syntax.

BEGIN DATA
3 3 5 4 1 2 3 1 3 4
6 6 5 2 2 5 2 1 6 7
2 3 3 3 2 3 5 1 3 2
-------------------
5 4 4 1 4 4 1 1 1 4
* Enter all data.
END DATA.
* Calculation of the means per chocolate flavor and output in own data set.
DATASET DECLARE DATACluster.
AGGREGATE
/OUTFILE='DATACluster'
/BREAK=type
/price=MEAN(price)
/refreshing=MEAN(refreshing)
/delicious=MEAN(delicious)
/healthy=MEAN(healthy)
/bitter=MEAN(bitter)
/light=MEAN(light)
/crunchy=MEAN(crunchy)
/exotic=MEAN(exotic)
/sweet=MEAN(sweet)
/fruity=MEAN(fruity).

Fig. 8.29 SPSS syntax for calculating the means of the attribute assessments in the case study
However, you can also use the SPSS syntax directly and write the commands your-
self. Using the SPSS syntax can be advantageous if you want to repeat an analysis mul-
tiple times (e.g., testing different model specifications). The syntax does not refer to an
existing data file of SPSS (*.sav); rather, we enter the data with the help of the syntax
editor (BEGIN DATA … END DATA).
Figure 8.29 shows the syntax file used to calculate the mean assessments for the
10 attributes of the 11 chocolate flavors from the 127 individual assessments. With
these mean values, a data matrix is generated which forms the basis for carrying out
a cluster analysis. Figure 8.30 shows the SPSS syntax for running various clustering
methods.
For readers interested in using R (https://www.r-project.org) for data analysis, we provide the corresponding R commands on our website www.multivariate-methods.info.
BEGIN DATA
Milk 4.5000 4.0000 4.3750 3.8750 3.2500 3.7500 4.0000 2.3750 4.6250 4.1250
Espresso 5.1667 4.2500 3.8333 3.8333 2.1667 3.7500 3.2727 2.3333 3.7500 3.4167
Biscuit 5.0588 3.8235 4.7647 3.4375 4.2353 4.4706 3.7647 2.7059 3.5294 3.5294
Orange 3.8000 5.4000 3.8000 2.4000 5.0000 5.0000 5.0000 4.4000 4.0000 4.6000
Strawberry 3.4444 5.0556 3.7778 3.7647 3.9444 5.3889 5.0556 4.9444 4.2222 5.2778
Mango 3.5000 3.5000 3.8750 4.0000 4.6250 5.2500 5.5000 6.0000 4.7500 5.3750
Cappuccino 5.2500 3.4167 4.5833 3.9167 4.3333 4.4167 4.6667 3.6667 4.5000 3.5833
Mousse 5.8571 4.4286 4.9286 3.8571 4.0714 5.0714 2.9286 2.0909 4.5714 3.7857
Caramel 5.0833 4.0833 4.6667 4.0000 4.0000 4.2500 3.8182 1.5455 3.7500 4.1667
Nougat 5.2727 3.6000 3.9091 4.0909 4.0909 4.0909 4.5455 1.7273 3.9091 3.8182
Nut 4.5000 4.0000 4.2000 3.9000 3.7000 3.9000 3.6000 2.2000 3.5000 3.7000
END DATA.
Fig. 8.30 SPSS syntax for running the cluster analyses of the case study
8.4 Modifications and Extensions 511
The previous explanations concentrated on cases in which objects are described by met-
ric data and a hierarchical, agglomerative clustering method is used. But cluster analysis
is also capable of processing non-metric data. In these cases, however, different proxim-
ity measures have to be used, as already indicated in Sect. 8.2.2.1. In Sect. 8.4.1, we will
describe common proximity measures for non-metric variables.
In Sect. 8.4.2, k-means cluster analysis and two-step cluster analysis, two frequently
used partitioning cluster procedures (see Fig. 8.6), will be presented and compared with
each other.
Example
The nominal variable ‘delivery complaints’ has four categories as attribute values;
thus, it is coded by four binary variables. Table 8.23 shows the corresponding trans-
formation of the complaint categories into binary variables. Each position in the row
stands for a complaint type, which is coded with 1 (= attribute value exists) if it is
valid. ◄
The example shows that the number of categories (values) of a nominal variable deter-
mines the length of the binary variables consisting of zeros and ones. In the above exam-
ple, the number of possible complaint types determines the length of the binary variable
consisting of zeros and ones (see 3rd column). If there are no complaints, the coding in
the example is ‘0000’.
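The binary coding described above corresponds to one-hot encoding; a Python sketch with hypothetical category names:

```python
# One-hot (binary) coding of the nominal variable 'delivery complaints'
# as in Table 8.23; the four category names are hypothetical.
CATEGORIES = ["late delivery", "wrong item", "damaged item", "incomplete"]

def one_hot(complaints):
    """Code a set of observed complaint types as a 0/1 vector."""
    return [1 if c in complaints else 0 for c in CATEGORIES]

print(one_hot({"wrong item"}))   # [0, 1, 0, 0]
print(one_hot(set()))            # [0, 0, 0, 0]  -> the '0000' case
```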
To calculate the similarity or dissimilarity between objects with nominally scaled
variables, the proximity measures discussed in Sect. 8.4.1.2 can be used. It should be
noted, however, that similarity coefficients that count common non-possession as a
match should not be used. The reason for this is that with such similarity coefficients
(e.g. SM-coefficient; see Sect. 8.4.1.2.2) a large and, in particular, very different number
of attribute categories leads to distortions in the similarity measure.
Example
In a survey, 100 people were asked about their assessment of five chocolate types
(Espresso, Cappuccino, Biscuit, Nut and Nougat) with regard to the preferred packag-
ing. The answer categories ‘paper’, ‘tin can’ and ‘gift box’ were specified as possible
types of packaging.
Table 8.24 shows the frequencies per type of packaging for the corresponding
chocolate types, with multiple mentions being possible. In total, N = 606 answers
were given. ◄
The data in Table 8.24 form a cross table of the two nominally scaled variables
‘Chocolate type’ and ‘Packaging type’. The test statistic of the chi-square homogeneity test can now be used as the distance measure between two objects k and l. This test checks the null hypothesis that the two
objects are from the same distribution (population). In the procedure ‘Cluster’ of SPSS,
the chi-square statistic used to determine distances for frequency data is calculated as
follows:
Table 8.25 Distance matrix of the frequency data according to the chi-square statistic

Chi-square value between frequency sets
             Espresso   Cappuccino   Biscuit   Nut     Nougat
Espresso     0.000
Cappuccino   2.209      0.000
Biscuit      6.931      5.901        0.000
Nut          6.642      4.994        8.387    0.000
Nougat       6.470      4.754        7.941    0.430   0.000
Chi-square = [ Σi Σj (nij − eij)² / eij ]^0.5      (8.13)

with
nij   number of entries of variable j for object i (i = k, l; j = 1, …, J) (cell frequency)
eij   expected number of entries of variable j for object i when attributes are independent [(row sum × column sum)/total sum]
The larger the chi-square, the greater the probability that the two objects do not come
from the same population and should therefore be classified as dissimilar. The distance
matrix according to the chi-square for the frequency data in the above example (see
Table 8.24) is shown in Table 8.25.
The distances for all five objects according to the chi-square statistic in Table 8.25
show that the frequency data of ‘Nut’ and ‘Nougat’ (with a value of the chi-square sta-
tistic of 0.430) have the smallest distance (greatest similarity) and would therefore be
merged at the first stage. Accordingly, ‘Espresso’ and ‘Cappuccino’ would be fused in
the next step (with a value of the chi-square statistic of 2.209).
If the absolute frequencies show large differences between the individual pair comparisons, the phi-square statistic should be used to determine the distance. It is based on
the chi-square statistic, but additionally normalizes the data by dividing them by the total
number of cases of the two objects under consideration.
For the sake of clarity, we will show the calculation of the chi-square statistic of 2.209
for the objects ‘Espresso’ and ‘Cappuccino’. For this purpose, only the first two rows of
Table 8.24 are considered in Table 8.26.
In addition, the expected frequency eij which is independent of the two types of choc-
olate must be calculated according to Eq. (8.14):
eij = (sum of row × sum of column) / total sum      (8.14)
(Example: the expected frequency for the cell ‘Espresso’—‘paper’ equals: (101 ⋅ 59) /
212 = 28.108).
For the expected frequencies (independent of the two varieties) the result is given in
Table 8.27.
Using Tables 8.26 and 8.27, the chi-square statistic for the flavors ‘Espresso’ and ‘Cappuccino’ can now be calculated according to Eq. (8.13): Chi-square = (4.8778)^0.5 = 2.209.
For the example, the following phi-square measure results between the types Espresso and Cappuccino, with 212 representing the total number of cases of the two objects considered:

φ²(Espresso, Cappuccino) = (4.8778 / 212)^0.5 = 0.152
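A Python sketch of Eqs. (8.13) and (8.14); the cell frequencies below are hypothetical, chosen only so that the marginals match the worked example (row sums 101 and 111, 'paper' column sum 59, total 212):

```python
import math

def chi_square_distance(row_k, row_l):
    """Eq. (8.13): square root of the chi-square statistic of the
    2 x J frequency table formed by the two objects' rows."""
    total = sum(row_k) + sum(row_l)
    col_sums = [a + b for a, b in zip(row_k, row_l)]
    chi2 = 0.0
    for row in (row_k, row_l):
        row_sum = sum(row)
        for n, col in zip(row, col_sums):
            e = row_sum * col / total      # expected frequency, Eq. (8.14)
            chi2 += (n - e) ** 2 / e
    return math.sqrt(chi2)

def phi_square_distance(row_k, row_l):
    """Chi-square additionally normalized by the total number of cases."""
    total = sum(row_k) + sum(row_l)
    return math.sqrt(chi_square_distance(row_k, row_l) ** 2 / total)

# Hypothetical cell frequencies (paper, tin can, gift box):
esp = [30, 40, 31]
cap = [29, 52, 30]
d = chi_square_distance(esp, cap)
```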
For the determination of similarities between objects with binary variable structures, a large
number of measures have been developed in the literature. Most of them can be traced back
to the following general similarity function (Kaufman & Rousseeuw, 2005, p. 22):
Sij = (a + δ · d) / (a + δ · d + λ · (b + c))      (8.15)
with
Sij similarity between objects i and j
δ, λ (constant) weighting factors
Variables a, b, c and d correspond to the identifiers in Table 8.28, where, for example,
variable a corresponds to the number of properties present in both objects (1 and 2).
Depending on the choice of weighting factors δ and λ, we will get different similarity
measures for objects with binary variables. Table 8.29 gives an overview, where M is the
number of features (cf. Kaufman & Rousseeuw, 2005, p. 24).
The procedure ‘Hierarchical Cluster Analysis’ in SPSS offers a total of 7 distance
measures and 20 similarity measures for calculating the proximity of objects with a
binary variable structure. The choice of the proximity measure should be based on logi-
cal considerations and depends on the specific situation.
In the following, the similarity coefficients simple matching (SM), Jaccard and Russell
& Rao (RR), which are frequently used in practical applications in the case of binary
variables, are examined in more detail. We will use the example in Table 8.30.
Table 8.30 Binary attributes of the chocolate flavors

Attributes: (1) Storing time > 1 year, (2) Seasonal packaging, (3) National ads, (4) Paper
packaging, (5) XL size, (6) Sales promotion, (7) Special display, (8) Brand product,
(9) Margin > 20%, (10) Storage problems

Flavor       (1)  (2)  (3)  (4)  (5)  (6)  (7)  (8)  (9)  (10)
Espresso      1    1    1    1    0    0    1    0    0    0
Cappuccino    1    1    1    1    1    0    1    0    1    0
Biscuit       1    1    0    1    0    1    0    1    0    1
Nut           1    0    1    1    1    1    1    1    1    0
Nougat        1    1    0    1    1    1    0    1    1    0
consideration. The first step is to determine how many properties both products have
in common. In our example, the chocolate bars ‘Espresso’ and ‘Cappuccino’ have five
components in common (‘Storing time > 1 year’, ‘Seasonal packaging’, ‘National ads’,
‘Paper packaging’ and ‘Special display’). The components that are only present in one
product are counted next. In our example, two attributes can be found in this category
(‘XL size’ and ‘Margin > 20%’). If we put the number of attributes present in both products
(a = 5) in the numerator, and in the denominator the sum of this number (a = 5) and the
number of attributes present in only one product (b + c = 2), the Jaccard coefficient for
the products ‘Espresso’ and ‘Cappuccino’ is 5/7 = 0.714.
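Using the binary profiles of ‘Espresso’ and ‘Cappuccino’ from Table 8.30, the Jaccard coefficient (and, for comparison, the SM and RR coefficients discussed below) can be sketched as follows; this is an illustrative Python snippet, not SPSS output:

```python
def binary_similarities(x, y):
    """Simple matching (SM), Jaccard, and Russell & Rao (RR)
    coefficients for two binary attribute vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    m = a + b + c + d  # total number of features
    return {"SM": (a + d) / m,
            "Jaccard": a / (a + b + c),
            "RR": a / m}

espresso   = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
cappuccino = [1, 1, 1, 1, 1, 0, 1, 0, 1, 0]
sims = binary_similarities(espresso, cappuccino)
print(round(sims["Jaccard"], 3))  # 0.714
```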
In the same way, the corresponding similarities are calculated for all other object
pairs. Table 8.32 shows the results. With regard to this matrix, two things should be
noted:
• The similarity of two objects is not influenced by their order in the comparison, i.e. it
is irrelevant whether the similarity between ‘Espresso’ and ‘Cappuccino’ or between
‘Cappuccino’ and ‘Espresso’ is measured (symmetry property). This also explains
why the similarity of the products in Table 8.32 is represented only by the lower trian-
gular matrix.
• The values of the similarity measurement are between 0 (“total dissimilarity”, a = d
= 0) and 1 (“total similarity”, b = c = 0). If the conformity between the characteris-
tics of a single product and itself is checked, one naturally finds complete conformity.
Thus, it is also understandable that only the number 1 can be found in the diagonal of
the matrix.
8.4 Modifications and Extensions 519
These considerations now put us in a position to determine the most similar and the most
dissimilar pair. The chocolate bars ‘Espresso’ and ‘Cappuccino’ have the greatest simi-
larity (Jaccard coefficient = 0.714). The lowest similarity exists between ‘Cappuccino’
and ‘Biscuit’ with a Jaccard coefficient of 0.3.
• The object pair ‘Espresso’ and ‘Cappuccino’, for example, takes third place in the
order of similarity according to the RR coefficient. According to the other two simi-
larity measures, however, these two products are most similar and therefore rank first.
• Whereas ‘Biscuit’ – ‘Espresso’ and ‘Biscuit’ – ‘Cappuccino’ show only low similar-
ity (below 0.375) according to the Jaccard and RR coefficients, the pair ‘Biscuit’ –
‘Espresso’ achieves a value of similarity of 0.5 according to the SM-coefficient, while
‘Biscuit’ – ‘Cappuccino’ has a similarity of 0.3.
Example
In the case of the variable ‘gender’, for example, the existence of the characteristic
‘male’ has the same significance as its absence. This does not apply to the attribute
‘nationality’ with the expressions ‘American’ and ‘non-American’, because the exact
nationality that may be of interest cannot be determined by the statement ‘non-Amer-
ican’. Thus, if the presence of a component has the same significance for the grouping
as its absence, similarity measures which take into account all equal characteristics in
the numerator should be preferred (e.g., the SM coefficient). Conversely, if only the joint
presence of characteristics is informative, it is advisable to use the Jaccard coefficient or
related proximity measures. ◄
In case of unequally distributed characteristics (e.g., cases of suffering from a very rare
disease), the application of proximity measures that also count joint absences leads to
distortions (in this example, the highly probable case that two persons do not suffer from
this rare disease would be interpreted as a similarity).
the unweighted or weighted mean of the values calculated in the previous step. Let us
assume, for example, that the similarity of the products ‘Nut’ and ‘Nougat’ is determined
on the basis of nominal and metric properties. The SM coefficient for these two products
is 0.7 (see Table 8.31). The resulting distance between the two types of chocolate bars is
0.3, which is obtained by subtracting the similarity value from 1.
For the metric properties, we calculated a squared Euclidean distance of 6 (Table 8.6)
for these two products. If the unweighted arithmetic mean is now used as the common
distance measure, we obtain a value of [(0.3 + 6)/2 = ] 3.15 in our example. Alternatively,
the distance can be obtained by using the weighted arithmetic mean. For this purpose,
external weights for the metric and non-metric distances need to be specified. For example,
the share of each variable type in the total number of variables could be used as a
weighting factor. With ten nominal and ten metric features for classification, both shares
equal 0.5, so the weighted result in our example does not differ from the unweighted
arithmetic mean.
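The combination of both distance components can be sketched as follows, with the values 0.3 (= 1 minus the SM coefficient) and 6 (squared Euclidean distance) taken from the example:

```python
def combined_distance(d_nominal, d_metric, w_nominal=0.5, w_metric=0.5):
    """Weighted arithmetic mean of a nominal-based and a metric-based
    distance; equal weights reproduce the unweighted mean."""
    return w_nominal * d_nominal + w_metric * d_metric

d_nom = 1 - 0.7  # 1 minus the SM coefficient for 'Nut' and 'Nougat'
d_met = 6        # squared Euclidean distance from Table 8.6
print(round(combined_distance(d_nom, d_met), 2))  # 3.15
```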
One possibility for converting the present ratio scales into binary scales is dichotomiza-
tion. In this case, a threshold is established to separate the low- and high-priced choco-
late bars. If this limit is assumed to be 1.60 €, for example, the price specifications up
to 1.59 € are assigned a 0 key and the prices above this value are assigned a 1 key. The
advantage of this approach is its simplicity and quick application. On the other hand,
the high loss of information is problematic, since ‘Biscuit’ is on the same price level
as ‘Espresso’, although the latter is 0.40 € more expensive. Another difficulty is the
threshold definition: arbitrarily establishing a threshold can easily distort the actual
conditions and thus the grouping result.

Excerpt of the price-class coding (cf. Table 8.35):

Price class     Coding
1.41–1.69 €     1 0 0
1.70–1.99 €     1 1 0
2.00–2.30 €     1 1 1
The loss of information can be reduced if price intervals are formed and each interval
is binary-coded in such a way that “1” is assigned if the price of a product falls within
the interval and “0” otherwise.
The first price class is coded with three zeros, because each question is answered with
no. If the other classes are also treated in the same way, the coding shown in Table 8.35
is obtained. If the binary coding scheme is used to encode ‘Nut’, for example, we
obtain the number sequence “0 0 0” for this product. Table 8.36 lists the codings for
all chocolate bars.
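The cumulative interval coding can be sketched as follows; the class bounds are read off the price classes above, and the exact handling of prices that fall on a bound is an assumption:

```python
def interval_code(price, upper_bounds=(1.40, 1.69, 1.99)):
    """Cumulative binary coding: one indicator per class bound,
    asking 'is the price above this bound?'. The first price
    class (all answers 'no') is coded 0 0 0."""
    return [1 if price > b else 0 for b in upper_bounds]

print(interval_code(1.20))  # [0, 0, 0] -> first price class
print(interval_code(1.85))  # [1, 1, 0] -> class 1.70-1.99 EUR
```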
The particular advantage of this method is its low loss of information, which is even
smaller with reduced sizes of the price class intervals. For example, seven price class
intervals reduce the span by half and thus better reflect the actual price differences.
However, the disadvantage of this method is that the importance of the considered
attribute (‘price’ in the above example) increases with the number of intervals created for
this attribute. If we assume, for example, that in a study only properties with two compo-
nents exist in addition to ‘price’, it can be seen that, in case of four price class intervals,
the price is three times as important as each one of the other attributes. Reducing the
price interval spans by half will result in an importance five times as high as the other
features. The extent to which a higher importance of an individual feature is desirable
must be decided on a case-by-case basis.
8.4.2.1 K-means clustering
KM-CA starts with the assumption that a data set is partitioned into k clusters. A cluster i
is represented by its centroid, which is calculated by averaging the cases assigned to clus-
ter i. This is the reason why the method is called “k-means”. It is also known as cluster
center analysis. The target criterion (Z) of KM-CA is formally represented as follows:
Z = Σ_(i=1..k) Σ_(xj ∈ Si) ‖xj − μi‖² → Min!   (8.16)
Fig. 8.31 Example of the k-means algorithm with 21 cases and 3 cluster centers (CC)
Equation (8.16) shows that a given number of objects (x) is to be split into k partitions.
This is done in such a way that the sum of the squared deviations between the data
points (xj) and the center of gravity of a cluster (mean value μi) results in a minimum.
According to this target criterion, KM-CA performs clustering by minimizing the vari-
ance criterion (see Eq. (8.9)).
8.4.2.1.1 Procedure of KM-CA
The KM-CA procedure will be explained in the following on the basis of Fig. 8.31.

Step 1: Determination of the number of cluster centers to which the objects will be
assigned
This number (k) may either be specified by the user based on logical reasoning or may
be determined automatically by SPSS. The example assumes that we want to form three
cluster centers (k = 3). These cluster centers are (randomly) placed in the coordinate
system and are shown as rectangles in Fig. 8.31B (Initial Clusters).
Step 2: Allocation of cases (data points) to the (initial) cluster centers depending on
the variance of the clusters
In order to be able to assign the 21 data points to the three given cluster centers, the
Euclidean distances (cf. Eq. (8.1)) between all data points (xj) and the three cluster
centers (μi) are calculated. A given data point is then assigned to that one of the three
clusters in which the so-called variance criterion (cf. Eq. (8.9)) is increased least. In our
example, 7 cases are classified into cluster 1 (CC_1), 8 cases are classified into cluster 2
(CC_2) and 6 cases are classified into cluster 3 (CC_3) (see Fig. 8.31B).
μi^(t+1) = (1 / |Si^(t)|) · Σ_(xj ∈ Si^(t)) xj   (8.17)
Figure 8.31C shows that each of the cluster centers of the three groups is now located
in the center of the objects belonging to the corresponding cluster. The variance of the
clusters is now calculated with the new cluster centers, analogously to step 2. Thus, we
can check whether a reduction of the cluster variance can be achieved and whether data
points should be reassigned to another cluster. A comparison between panels C and D in
Fig. 8.31 shows that case xn is reassigned from cluster 2 to cluster 1 and case xm is reas-
signed from cluster 1 to cluster 3.
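The iteration described in the steps above can be sketched as follows. For reproducibility, this sketch places the initial cluster centers on the first k data points instead of inserting them randomly as in Fig. 8.31:

```python
import math

def kmeans(points, k, iters=20):
    """Minimal k-means sketch: assign each point to the nearest
    center, recompute centers as cluster means (Eq. (8.17)),
    repeat until the assignment no longer changes."""
    centers = list(points[:k])  # simplification: first k points as start
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda i: math.dist(p, centers[i]))
                      for p in points]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for i in range(k):  # centroid update per cluster
            members = [p for p, a in zip(points, assign) if a == i]
            if members:
                centers[i] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers, assign

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(pts, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```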
tests for the hypothesis of equality of cluster mean values. The F-values only have a
descriptive value and only provide an indication as to whether the features of the var-
iables differ in the clusters. If the user wants to check the stability of the solution, the
KM-CA can be performed several times, with the cases assigned in a different, randomly
selected order. It should also be noted that the calculation of the cluster centers depends
on the subjective initial classification of the clusters (step 1). If the user initially decides
to propose a different classification, it is possible that the calculation results in a different
solution. That is why there is always a risk of finding a local optimum, but not a global
optimum. It is therefore advisable to try different initial partitions and to compare the
results with each other.
8.4.2.2.1 Procedure of TS-CA
are metrically scaled, the Euclidean distance (see Eq. (8.1)) can be used to calculate the
distances. If, however, the variables have different scale levels, the user must use the
probability-theoretical model approach of the log-likelihood distance. The log-likelihood
criterion can also be used for purely metric variables. To determine the final number of
clusters, the criteria listed in Sect. 8.2.4 can be used, for example.
• The results of the partitioning procedures are influenced by the target function under-
lying the “reassignment” of the objects.
• The initial partition is often selected based on subjective judgement and can influence
the results of the clustering process. If the initial partition is created randomly, the
clustering solutions may vary and the results may not be comparable.
• With partitioning procedures, local optima rather than global optima may be
determined.
8.5 Recommendations
After determining the cluster variable, the first step of a cluster analysis is to decide
which proximity measure and which fusion algorithm should be used. Ultimately, these
decisions should be based on the specific application and the properties of the different
agglomerative cluster procedures discussed in Sect. 8.2.3.3. In general, the Ward method
leads to fairly good partitions and, at the same time, indicates the correct number of
clusters. To validate the results of the Ward method, other algorithms can be applied,
but the properties of the different algorithms should always be taken into account (cf.
Table 8.18).
many of these questions. Against this background, it becomes clear that the user has a
wide range of interpretation in cluster analysis. On the one hand, this is an advantage of
cluster analysis, since it opens up a wide field of applications. On the other hand, it bears
the risk that data may be manipulated to obtain the desired results. Therefore, we
recommend always answering the following questions when presenting the results of a
cluster analysis:
References
Calinski, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in
Statistics—Theory and Methods, 3(1), 1–27.
García-Escudero, L., Gordaliza, A., Matrán, C., & Mayo-Iscar, A. (2010). A review of robust clus-
tering methods. Advances in Data Analysis and Classification, 4, 89–109.
Kline, R. (2011). Principles and practice of structural equation modeling (3rd ed.). Guilford Press.
Lance, G. N., & Williams, W. T. (1966). A general theory of classificatory sorting strategies: I.
Hierarchical systems. Computer Journal, 9, 373–380.
Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen
clustering algorithms. Psychometrika, 45(3), 325–342.
Milligan, G. W., & Cooper, M. (1985). An examination of procedures for determining the number
of clusters in a data set. Psychometrika, 50(2), 159–179.
Mojena, R. (1977). Hierarchical grouping methods and stopping rules: An evaluation. The
Computer Journal, 20(4), 359–363.
Punj, G., & Stewart, D. (1983). Cluster analysis in marketing research: Review and suggestions for
application. Journal of Marketing Research, 20(2), 134–148.
Wedel, M., & Kamakura, W. A. (2000). Market segmentation: Conceptual and methodological
foundations (2nd ed.). Springer.
Wind, Y. (1978). Issues and advances in segmentation research. Journal of Marketing Research,
15(3), 317–337.
Further reading
Anderberg, M. R. (2014). Cluster analysis for applications: Probability and mathematical statis-
tics: A series of monographs and textbooks (Vol. 19). Academic Press.
Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display of
genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25),
14863–14868.
Everitt, B., Landau, S., Leese, M., & Stahl, D. (2011). Cluster analysis (5th ed.). Wiley.
Hennig, C., Meila, M., Murtagh, F., & Rocci, R. (Eds.). (2015). Handbook of cluster analysis.
Chapman & Hall/CRC.
Kaufman, L., & Rousseeuw, P. (2005). Finding groups in data: An introduction to cluster analysis
(2nd ed.). Wiley.
Romesburg, H. C. (2004). Cluster analysis for researchers. Lulu.com.
Wierzchoń, S., & Kłopotek, M. (2018). Modern Algorithms of Cluster Analysis. Springer Nature.
9 Conjoint Analysis

9.1 Problem
(traditional) conjoint analysis and choice-based conjoint (CBC) analysis. The former
asks consumers to evaluate all stimuli using ordinal or metric measurement scales (e.g.,
by ranking or rating). For example, consumers may state their preferences by ranking 10
different stimuli, giving the lowest number to the most preferred stimulus (i.e., rank = 1)
and the highest number to the least preferred stimulus (i.e., rank = 10). In contrast, in CBC
analyses consumers choose a stimulus (e.g., object, product) out of a small set of stimuli
and do so multiple times. For instance, we present three different chocolate bars to a con-
sumer and ask her which one she would buy. Then we present another set of chocolate
bars to her and ask the same question again. Since consumers just select one product, the
observed evaluations are nominal (i.e., 1 if an object is chosen, and 0 otherwise).
In the following, we use the term conjoint analysis whenever consumers evaluate the
stimuli with the help of ordinal or metric measurement scales. If consumers evaluate the
stimuli by making choices, we refer to CBC analysis.
In a final step, we analyze the collected preference data. Traditional conjoint as well
as CBC analyses assume that stated preferences reflect the stimulus’ total utility values.
The stimulus (i.e., object, product) with the highest total utility is the most preferred one.
We presume that an object’s (total) utility equals the sum of the utility contributions of
each of its attribute levels. The stated preference for an object serves as a proxy for an
object’s total utility value. For example, if a consumer evaluated a specific bar of choco-
late with ‘8’ on a rating-scale from 1 to 10 (1 = ‘not attractive at all’ to 10 = ‘very attrac-
tive’), we assume that ‘8’ reflects the chocolate’s total utility value.
The total utility value is the result of the utility contributions of the chocolate’s spe-
cific attribute levels. We call these utility contributions partworths. Figure 9.1 illustrates
this idea. The chocolate bar has a cocoa content of 75%, is wrapped in paper, is offered
for 1.20 EUR and produced by the brand ‘ChocoMania’. Each attribute level contributes
to the chocolate’s total utility value. For instance, the cocoa content of 75% has a part-
worth of 5, and the sum of the partworths across all attribute levels equals 8.
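The additive composition can be illustrated in two lines; only the cocoa partworth of 5 and the total utility of 8 are stated in the text, so the split across the remaining attribute levels is purely hypothetical:

```python
# Hypothetical partworths: only 'cocoa 75%' = 5 and the total of 8
# are given in the text; the other three values are assumed.
partworths = {"cocoa 75%": 5, "paper": 1, "price 1.20 EUR": 1,
              "brand ChocoMania": 1}
print(sum(partworths.values()))  # 8
```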
It is the aim of conjoint analysis to identify the utility contribution of each attribute
level. Such knowledge allows the researcher to elicit consumers’ most preferred product,
which is the product with the highest utility and encompasses the most preferred attrib-
ute levels. The key idea of conjoint analysis thus is to decompose consumers’ overall
preferences for objects into preferences for attribute levels. Consequently, conjoint anal-
ysis is a decompositional procedure.
Frequently, we do not only want to study consumers’ preferences regarding the stimuli
that have been evaluated, but we also want to use the results of conjoint analyses to predict
consumers’ preferences for objects that have not been evaluated. In this case, we use the
results of conjoint analyses for simulation purposes. In our example, we are interested in
the potentially ‘best’ chocolate bar. Based on the results of the conjoint analysis, we can
identify the ‘best’ chocolate bar that will probably be most successful in the market.
As stated above, conjoint as well as CBC analyses aim to derive each attribute level’s
partworth from stated preferences. When using conjoint analysis, we usually have suf-
ficient information to estimate the partworths for each individual consumer. Yet in CBC
studies, we just observe consumers’ choices, which contain less information than ordinal
or metric data in conjoint studies. For this reason, in CBC studies we typically cannot
estimate individual-level partworths but only partworths for the complete study sample.
This difference is critical if we want to use the results for simulation purposes (cf. Sects.
9.2.5.4 and 9.4.5.3).

Fig. 9.1 Illustration of the relation between partworths and total utility
Conjoint analyses are frequently applied in marketing. Yet, conjoint analyses can also
be used in other disciplines and Table 9.1 lists some examples.
We will now proceed with discussing (traditional) conjoint analysis in detail (cf. Sect.
9.2). After demonstrating how to conduct a conjoint analysis with the help of SPSS (cf.
Sect. 9.3), we describe the CBC analysis in more detail (cf. Sect. 9.4). CBC analysis
has been developed to address shortcomings of (traditional) conjoint analysis and has
become very popular in research and practice. Various other variations of traditional
conjoint analysis have been developed to address its specific limitations. We will briefly
introduce some important further developments in Sect. 9.5.2.
9.2 Procedure
Conjoint analysis generally follows a five-step procedure. In the following, we will pres-
ent and discuss the various steps of a conjoint analysis (Fig. 9.2).
A conjoint analysis starts with selecting the attributes and attribute levels that describe
the stimuli. In a second step, we generate an experimental design that represents the
stimuli based on the considered attributes and attribute levels. In step 3, we discuss alter-
native methods to evaluate the stimuli. After the preferences have been collected, we
use the stated preferences to estimate the partworths of the attribute levels to map the
respondents’ preferences (step 4). In a final step, we interpret the estimated partworths
and discuss how the results can be used to support the decision-making of managers,
policymakers, and others.
The first step when conducting a conjoint analysis is to decide on the attributes and
attribute levels that are used to describe the stimuli. In the following, we will again use
an example related to the chocolate market.
In general, several aspects need to be considered when selecting the attributes and attrib-
ute levels for conjoint analyses.
1. The attributes have to be relevant for the respondents’ decisions: Only attributes that
respondents take into account when making decisions should be considered. Focus
groups can be used to identify relevant attributes. If consumers substantially dif-
fer concerning the attributes that are relevant for their decision-making, alternative
approaches to (traditional) conjoint analysis have been developed, such as the adap-
tive conjoint analysis (ACA) (cf. Sect. 9.5).
2. The selected attributes have to be independent of each other: Conjoint analysis
assumes that each attribute level contributes to the total utility independently of the
other attribute levels. A violation of this assumption contradicts the additive model
of conjoint analysis. Moreover, it should be ensured that the attribute levels are
empirically independent. This means any combination of attribute levels can actually
occur and is not perceived as dependent by the respondent. Especially when consider-
ing characteristics such as brand and price, it is important to ensure that no implausi-
ble stimuli are considered in the survey design.
3. Managers, policymakers or others have to be able to adapt the attributes: To be able
to act on the results of a conjoint analysis, we need to be able to adapt the attributes.
For example, considering brand as an attribute is in conflict with this requirement.
Yet, brand is sometimes included to assess whether products are simply preferred
because of the brand name.
4. The attribute levels have to be realistic and feasible: To be able to act on the results of
the conjoint analysis, managers, policymakers or others need to be able to change the
product design regarding the preferred attribute levels.
5. The individual attribute levels need to be compensatory: Conjoint analysis assumes
that a poor attribute level of one attribute can be compensated by a certain level of
another attribute. For example, an increase in price that usually reduces total utility
can be compensated by an improvement in another, desirable attribute. This require-
ment implicitly assumes a decision-making process in which respondents simultane-
ously evaluate all attributes.
6. The considered attributes and attribute levels must not be exclusion criteria: Exclusion
criteria exist when certain attribute levels must be present from the perspective of the
consumer. If exclusion criteria occur, the requirement of a compensatory relationship
between the attribute levels is not met.
7. The number of attributes and attribute levels needs to be limited: The effort to evalu-
ate the stimuli grows exponentially with the number of attributes and attribute levels.
It is suggested to consider not more than six attributes with three to four levels each.
When we apply the different requirements to the example, we conclude that we meet all of
them (Table 9.3). Actually, the number of attributes and attribute levels is rather small. In an
actual study, we would probably want to consider additional attributes such as packaging
size or packaging material. However, we keep this example small for illustrative purposes.
Note that once you have decided on the attributes and attribute levels, you may no
longer change or adapt them. Thus, it is of critical importance to carefully select the
attributes and attribute levels.
Table 9.3 Do the attributes and attribute levels in the example meet the requirements?

Criterion                  Assessment
Relevance                  The attributes and attribute levels were identified with the
                           help of focus groups
Independence               The UTZ label is not correlated with price, and there is only
                           a weak correlation between price and cocoa content to be
                           found when inspecting market prices
Adaptability               A chocolate manufacturer can manipulate all three attributes
                           and attribute levels, at least after making some investment
                           in R&D or changes in the supply chain
Realistic and feasible     All attribute levels reflect realistic and feasible attribute
                           levels
Compensatory               Consumers probably accept a higher price for the cocoa
                           content they prefer. The same most likely applies if they
                           have to pay more for a chocolate with a sustainability label
No exclusion criteria      None of the attributes and attribute levels represents an
                           exclusion criterion
Limited number of          We consider just 3 attributes with 2 or 3 levels each
attributes and levels
1. Definition of stimuli: The researcher has to decide how the different stimuli are pre-
sented to the respondents. The fundamental decision is whether the stimuli are pre-
sented with all attributes simultaneously or whether just two attributes are presented
at a time (cf. Sect. 9.2.2.1).
2. Number of stimuli: The number of possible stimuli increases exponentially with the
number of attributes and attribute levels. For example, when considering 3 attributes
with 3 levels each, 27 (= 3³) different combinations of attribute levels (i.e., stimuli)
are possible. If we consider 6 attributes with 3 levels each, 729 (= 3⁶) possible stimuli
exist. To avoid information overload and resulting fatigue effects, it is advisable to
use a reduced design that only takes a subset of all possible stimuli into account (cf.
Sect. 9.2.2.2).
9.2.2.1 Definition of Stimuli
When conducting a conjoint analysis, respondents state their preferences regarding
different combinations of attribute levels (i.e., stimuli). The stimuli can be designed
in two alternative ways: either they are described based on all considered attributes
simultaneously or they are composed of just two attributes. The former approach is
called full-profile method and the latter trade-off method.
Full-profile Method
Table 9.4 shows three exemplary stimuli according to the full-profile method. Each stim-
ulus is described based on all three attributes and the stimuli differ concerning the attrib-
ute levels. Overall, we can describe 18 (=3 · 2 · 3) different stimuli.
Trade-off Method
When using the trade-off method, we need to develop a trade-off matrix for each possible
pair of attributes. With J attributes, we get a total of (J choose 2) = J · (J − 1) / 2 trade-off
matrices. In our example, this results in (3 choose 2) = 3 trade-off matrices (Table 9.5).
Each cell of a trade-off
matrix forms a stimulus. For example, the combination A1B1 describes the stimulus with
30% cocoa and an UTZ label. Yet, this stimulus contains no information about the price.
Both approaches have advantages and disadvantages related to the following aspects:
Despite the potential limitations of the full-profile method, it has prevailed due to the
greater realism of the evaluation task. Moreover, pictures of the stimuli may be used
to further enhance the realism of the evaluation task. Because of its great relevance in
research and practice, we use the full-profile method in the following.
9.2.2.2 Number of Stimuli
Reduced Design
Drawing a random sample from all possible stimuli is the simplest approach to preparing
a reduced design. However, experimental research suggests several other approaches to
develop reduced designs that are prevalent in research and practice (cf. Kuhfeld et al.,
1994). These approaches systematically select certain stimuli from the set of all possible
stimuli. The basic idea of all these approaches is to find a subset of stimuli that allows
estimating all utility contributions (partworths) unambiguously.
If all attributes have the same number of levels, we refer to symmetric designs while
we refer to asymmetric designs if the number of levels varies across attributes. An advan-
tage of symmetric designs is that reduced designs are relatively easy to develop. A spe-
cial case of a reduced symmetric design is the Latin square, which is briefly described
to illustrate the basic idea of reduced designs. Its application is limited to the case of
exactly three attributes. If each attribute has three levels, the full factorial design
comprises 27 (= 3³) stimuli (Table 9.6).
Of those 27 stimuli, nine are selected in such a way that each attribute level is con-
sidered exactly once with each level of another attribute (bold combinations in Table
9.6). Thus, each attribute level is represented exactly three times instead of nine times.
Table 9.7 shows the corresponding Latin square design.
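One standard construction of a Latin square design keeps exactly those combinations whose third level index equals the sum of the first two, modulo three; the following sketch illustrates that idea and is not necessarily the bold selection shown in Table 9.6:

```python
from itertools import product

def latin_square_design(levels=3):
    """Select levels² of the levels³ stimuli so that each level of
    one attribute meets each level of any other attribute exactly
    once: fix the third index as (first + second) mod levels."""
    return [(a, b, (a + b) % levels)
            for a, b in product(range(levels), repeat=2)]

design = latin_square_design()
print(len(design))  # 9
```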
As mentioned above, developing a reduced asymmetric design is more challenging
(cf. Addelman, 1962a, b). Fortunately, software packages such as SPSS offer methods
to generate reduced asymmetric designs, thus decreasing the effort for the researcher to
develop such designs manually. For our example, a reduced design with nine stimuli was
generated with the help of SPSS (Table 9.8; cf. Sect. 9.3.2).
Generally, reduced designs should be orthogonal, which means that the attribute levels
are independent of each other (i.e., no multicollinearity). Table 9.8 shows that, for
Table 9.8 Reduced asymmetric design for the example

Stimulus   Cocoa content (%)   UTZ label   Price (EUR)
1          70                  1           0.80
2          30                  0           1.20
3          70                  1           1.20
4          30                  1           1.00
5          50                  1           1.20
6          70                  0           1.00
7          50                  0           0.80
8          50                  1           1.00
9          30                  1           0.80
example, the level ‘70%’ for the attribute ‘cocoa content’ occurs with and without an
UTZ label and with all three price levels. The same applies to the other levels of the
attribute ‘cocoa content’. Consequently, the reduced design presented in Table 9.8 is
orthogonal. Orthogonal designs ensure that we can calculate the partworths of the differ-
ent attribute levels in the later analysis.
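This pairwise coverage, a necessary condition for the orthogonality just described, can be checked mechanically; the following sketch encodes Table 9.8 and tests that every level of each attribute occurs with every level of every other attribute:

```python
from itertools import combinations

# Reduced design from Table 9.8: (cocoa %, UTZ label, price EUR)
design = [(70, 1, 0.80), (30, 0, 1.20), (70, 1, 1.20),
          (30, 1, 1.00), (50, 1, 1.20), (70, 0, 1.00),
          (50, 0, 0.80), (50, 1, 1.00), (30, 1, 0.80)]

def covers_all_pairs(design):
    """Check that every level of each attribute occurs with every
    level of every other attribute at least once (a necessary
    condition for the balanced design described above)."""
    n_attr = len(design[0])
    for i, j in combinations(range(n_attr), 2):
        seen = {(row[i], row[j]) for row in design}
        levels_i = {row[i] for row in design}
        levels_j = {row[j] for row in design}
        if seen != {(a, b) for a in levels_i for b in levels_j}:
            return False
    return True

print(covers_all_pairs(design))  # True
```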
However, reduced designs often do not allow for evaluating interaction effects since
reduced designs lead to a loss of information. Thus, reduced designs can only be used
meaningfully if interaction effects are negligible.
After developing the experimental design, we ask respondents to evaluate the selected
stimuli. We can use either metric (rating scale, dollar metric, and constant sum scale)
or ordinal (ranking and paired comparisons) evaluation methods to collect information
about respondents’ preferences (Fig. 9.3). In the following, we describe the different
evaluation methods.
Evaluation methods
The evaluation methods differ with respect to the following criteria:

• information content,
• number of evaluations,
• uniqueness of the evaluations.
Generally, metric evaluation methods (i.e., rating, dollar metric, constant sum) contain more information about the respondents’ preferences than ordinal methods (i.e., ranking or paired comparison), because respondents indicate not only their preferences but also the strength of their preferences. When using metric evaluation methods, the assigned preference values can be interpreted as the total utility values of the stimuli. However, this interpretation is only acceptable if we can assume completeness, reflexivity, and transitivity of the preferences. A drawback of the metric methods is the relatively low reliability of the evaluations (cf. Green & Srinivasan, 1978, p. 112), which results from the fact that respondents are often not able to provide reliable information on the strength of their preferences.
Mostly, the number of evaluations corresponds to the number of stimuli considered
in the conjoint study. Only the method of paired comparisons requires more evaluations
to ensure sufficient information about the respondents’ preferences. In our example with
nine stimuli, respondents need to make 36 (=9 · (9-1)/2) paired comparisons. These
comparisons need to be consistent to allow the order of the stimuli to be derived (condi-
tion of transitivity). The more evaluations, the higher the chance that respondents are not
able to make consistent assessments. Instead of considering all attributes, respondents
may focus on the attributes that are particularly important to them. One way to address
this issue is to let respondents divide the stimuli into several subgroups according to their
preferences (stacking) and ask for an evaluation among these subgroups.
Ordinal methods result in unique evaluations, whereas metric methods can result in
identical evaluations for several stimuli (so-called ties). On the one hand, metric meth-
ods do not force respondents to place the stimuli in a strict order if they do not perceive
any differences in preference. On the other hand, the assignment of numerous identical
evaluations may indicate that respondents were overburdened with the evaluation task or
did not make any effort to evaluate the stimuli corresponding to their preferences. A high
number of ties may also indicate that the considered attributes and/or attribute levels are
not relevant for a respondent. In any case, the estimation of partworths will be difficult
if many ties occur. If a respondent evaluates all stimuli as equal, it is not possible to esti-
mate the utility contributions at all.
Given the advantages and disadvantages of the different evaluation methods, the rat-
ing and ranking methods prevail in practice because respondents find them easier to use
(cf. Wittink et al., 1994, p. 44). Therefore, we use a rating scale from 1 to 10 for the eval-
uation of the stimuli in our example (Table 9.9).
No ties are observed and the different evaluations provide some evidence for the
strength of preferences. We learn that respondent i = 1 has the highest preference for
chocolate with a cocoa content of 50%, an UTZ label, and a price of 1.00 EUR. Overall,
we recognize that this respondent evaluates stimuli with a cocoa content of 50% most
favorably, followed by stimuli with a cocoa content of 30%. The stimuli with a cocoa
content of 70% are evaluated least favorably. For the attributes ‘UTZ label’ and ‘price’, the preferences are less pronounced.
After all respondents have evaluated the different stimuli, we can start to determine the
partworths of the attribute levels. In the following, we will first describe the specification
of the utility function that links the utility contributions of the attribute levels to the total
utility value of a stimulus. There are three approaches to do so:
• partworth model,
• vector model and
• ideal point model.
We describe all three approaches before we explain how to derive the utility contribu-
tions of the attribute levels (partworths).
In addition to the utility contributions stemming from the attribute levels, often a con-
stant term is added in the utility function, which reflects a basic utility value:
yk = β0 + Σ_{j=1}^{J} Σ_{m=1}^{Mj} βjm · xjmk   (9.2)
with
β0 constant term
Partworth Model
The presentation of the utility function in Eq. (9.1) requires an estimation of the util-
ity contribution of each attribute level. As mentioned above, the utility contribution of
an attribute level is called partworth, and thus, the specification of the utility function
in Eq. (9.1) is called partworth model. The partworth model does not assume any rela-
tion between the attribute levels and their utility contributions (Fig. 9.4). Hence, the
partworth model is very flexible. Moreover, it only requires nominally scaled attribute
levels.
For metric attributes—such as price—the attribute levels are converted into binary variables. Generally, we have to use Mj − 1 binary (dummy) variables to represent the different levels of an attribute. For example, for the attribute ‘price’ with three levels—‘0.80 EUR’, ‘1.00 EUR’, and ‘1.20 EUR’—we need two binary variables to represent the different levels. Variable 1 takes on the value 1 if the stimulus has a price of 0.80 EUR, otherwise it
is 0. Variable 2 takes on the value 1 if the stimulus has a price of 1.00 EUR, otherwise
it is 0. If both variables are equal to 0, we know that the stimulus has a price of 1.20
EUR. The price level of 1.20 EUR serves as the reference level. However, it is up to the
researcher to decide what value is used as reference value. In our example, the lowest or
the average price can also serve as reference value. The same logic applies to the attribute
‘cocoa content’. Again, here the highest level (i.e., 70%) serves as the reference level.
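The dummy coding described above can be sketched as a small helper function (the function name is ours, not part of the chapter):

```python
def dummy_code(level, non_reference_levels):
    """Return Mj - 1 binary indicators; the omitted reference level codes as all zeros."""
    return [1 if level == other else 0 for other in non_reference_levels]

# Attribute 'price' with the reference level 1.20 EUR:
print(dummy_code(0.80, [0.80, 1.00]))  # [1, 0]
print(dummy_code(1.00, [0.80, 1.00]))  # [0, 1]
print(dummy_code(1.20, [0.80, 1.00]))  # [0, 0] -> reference level
```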
Thus, the utility function for our example can be formulated as follows:
yk = β0 + β11 · x11k + β12 · x12k + β21 · x21k + β31 · x31k + β32 · x32k,
where j = 1 represents the attribute ‘cocoa content’, j = 2 denotes the attribute ‘UTZ
label’, and j = 3 represents the attribute ‘price’.
For stimulus k = 1 in our example, which has a cocoa content of 70%, an UTZ label,
and a price of 0.80 EUR (cf. Table 9.9), we get the following formulation of the utility function:

y1 = β0 + β21 · 1 + β31 · 1 = β0 + β21 + β31
We aim to estimate the constant term β0 and the partworths βjm in such a way that the
resulting total utility values yk correspond ‘as well as possible’ to the empirically col-
lected evaluations of the stimuli. In the example, respondent i = 1 evaluated stimulus
k = 1 with ‘3’ (on a 10-point scale). We use this rating (stated preference) as a proxy for
the stimulus’ total utility (i.e., y1 = 3).
In total, we need to estimate 6 (= 1 + 2 + 1 + 2) parameters based on 9 observations.
Thus, we have 3 degrees of freedom, which is sufficient to estimate the utility parameters
for each individual respondent.
Vector Model
For metrically scaled attributes, we can alternatively assume a linear relationship
between the attribute levels’ utility contributions and the total utility value (Fig. 9.5). For
the attribute ‘price’, we may assume that the total utility value decreases with increasing
price levels. If we accept such a linear relationship for all attributes, we obtain the fol-
lowing general form of the utility function:
yk = β0 + Σ_{j=1}^{J} βj · xjk   (9.3)
In this equation, we use the ideal point model for the attribute ‘cocoa content’, the part-
worth model for the attribute ‘UTZ label’, and the vector model for the attribute ‘price’.
(Figure: ideal point model; partworths plotted against the attribute levels, here cocoa content of 30, 50, and 70%)
Implementing the ideal point model requires knowledge about respondents’ ideal
level of an attribute. This information can be difficult to obtain. To address this chal-
lenge, we can alternatively use the partworth model that is also able to capture the idea
that an ideal point exists for consumers.
We have learned from Table 9.9 that respondent i = 1 seems to prefer a cocoa content
of 50% compared to 30% and 70%. It appears that there is an ideal point for this specific
respondent. However, we do not know whether it is actually 50% or some value around
50%. Therefore, we keep the partworth model for the attribute ‘cocoa content’ and pro-
ceed with the following specification of the utility function:
yk = β0 + β11 · x11k + β12 · x12k + β21 · x21k + β3 · x3k
Generally, there is a trade-off between the flexibility of the different approaches to spec-
ify the utility function and the number of parameters that need to be estimated. The part-
worth model is the most flexible approach since it does not assume a specific functional
relationship between the partworths and the total utility value. However, the number of
parameters to be estimated for the partworth model increases with the number of levels
considered per attribute in comparison to the other two approaches. Yet, due to its great
flexibility and its ability to take into account attributes measured at different scales, the
partworth model is used quite often (cf. Green et al., 2001, p. 59). Thus, we may want to
start with a specification of the utility function that is as flexible as possible (i.e., part-
worth model for all attributes). If we learn from later analyses that a linear relationship
can be assumed for some (or even all) attributes, we can still change the specification of
the utility function and re-run the analysis. Note that we usually assume the same specifi-
cation of the utility function for all respondents.
Table 9.10 Coding of the variables to estimate the utility parameters (partworth model)

Stimulus | Cocoa content = 30% (x11k) | Cocoa content = 50% (x12k) | UTZ label (x21k) | Price = 0.80 EUR (x31k) | Price = 1.00 EUR (x32k) | Rating (yk)
1 | 0 | 0 | 1 | 1 | 0 | 3
2 | 1 | 0 | 0 | 0 | 0 | 4
3 | 0 | 0 | 1 | 0 | 0 | 1
4 | 1 | 0 | 1 | 0 | 1 | 6
5 | 0 | 1 | 1 | 0 | 0 | 8
6 | 0 | 0 | 0 | 0 | 1 | 2
7 | 0 | 1 | 0 | 1 | 0 | 9
8 | 0 | 1 | 1 | 0 | 1 | 10
9 | 1 | 0 | 1 | 1 | 0 | 7
If consumers ranked the stimuli and the most preferred stimulus has rank 1, we first have
to recode the ranking in such a way that the most preferred stimulus receives the highest
value.1 This is achieved by subtracting the observed rank from the maximum value for
the rank + 1. For example, if nine stimuli have been ranked, then the most preferred stim-
ulus receives a value of 9 (= 9 – 1 + 1), and the least preferred stimulus receives a value
of 1 (= 9 – 9 + 1).
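The recoding rule can be written in one line; a brief sketch:

```python
def recode_ranks(ranks):
    """Recode ranks (1 = most preferred) into preference values (highest = most preferred)."""
    k = max(ranks)  # maximum value for the rank
    return [k - r + 1 for r in ranks]

# Among nine ranked stimuli, rank 1 becomes the value 9 and rank 9 becomes 1.
print(recode_ranks([1, 9, 5]))  # [9, 1, 5]
```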
In our example, we measured the preferences using a rating scale and we can thus
use the ratings as dependent variables. Table 9.11 shows the estimated utility parame-
ters (partworths) based on a regression analysis. The parameters for the attribute levels
‘cocoa content = 30%’ and ‘cocoa content = 50%’ are both positive. The attribute level
‘cocoa content = 70%’ served as the reference level, and thus, has a utility parameter
(partworth) of 0. Accordingly, the level of ‘50%’ has the highest partworth and is the
preferred level of the attribute ‘cocoa content’ for respondent i = 1. This result sug-
gests that respondent i = 1 likes some bitterness but not too much bitterness. This result
is in line with our previous conjecture. Moreover, we learn that the respondent prefers
an UTZ label compared to no UTZ label. Additionally, the respondent prefers a lower
price to a higher price. The level ‘1.20 EUR’ is the benchmark and has a parameter of 0.
Consequently, the respondent prefers a price of 0.80 EUR over a price of 1.00 EUR and
a price of 1.20 EUR.
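The estimation itself is an ordinary least-squares regression of the ratings on the coded variables of Table 9.10. A sketch with NumPy (the estimates should correspond to Table 9.11 up to rounding):

```python
import numpy as np

# Design matrix from Table 9.10: columns x11k, x12k, x21k, x31k, x32k.
X = np.array([
    [0, 0, 1, 1, 0],  # stimulus 1: 70 %, UTZ label, 0.80 EUR
    [1, 0, 0, 0, 0],  # stimulus 2: 30 %, no UTZ label, 1.20 EUR
    [0, 0, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0],
], dtype=float)
y = np.array([3, 4, 1, 6, 8, 2, 9, 10, 7], dtype=float)  # ratings of respondent i = 1

# Add the constant term and solve the least-squares problem.
A = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

labels = ['constant', 'cocoa 30%', 'cocoa 50%', 'UTZ label', 'price 0.80', 'price 1.00']
for name, b in zip(labels, beta):
    print(f'{name:<10} {b: .3f}')
# The cocoa content of 50% receives the highest partworth (7.000); the reference
# levels (cocoa 70%, no UTZ label, price 1.20 EUR) have a partworth of 0.
```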
For illustrative purposes, we also estimate the utility parameters for the mixed model.
We use the partworth model for the attributes ‘cocoa content’ and ‘UTZ label’, and the
vector model for the attribute ‘price’. Table 9.12 shows the corresponding coding of the
independent variables. Now we have to estimate 5 parameters including the constant
term.
Table 9.13 shows the results of the regression analysis for the mixed model. When
using the vector model for ‘price’, we estimate a negative parameter that indicates that
the respondent prefers a lower price to a higher price. The partworths for the different
1 This comment applies to an estimation of the utility function in, for example, Excel (see www.multivariate-methods.info). If you use SPSS for your analysis, SPSS does the recoding for you.
Table 9.12 Coding of the variables to estimate the utility parameters (mixed model)

Stimulus | Cocoa content = 30% | Cocoa content = 50% | UTZ label | Price | Rating
1 | 0 | 0 | 1 | 0.80 | 3
2 | 1 | 0 | 0 | 1.20 | 4
3 | 0 | 0 | 1 | 1.20 | 1
4 | 1 | 0 | 1 | 1.00 | 6
5 | 0 | 1 | 1 | 1.20 | 8
6 | 0 | 0 | 0 | 1.00 | 2
7 | 0 | 1 | 0 | 0.80 | 9
8 | 0 | 1 | 1 | 1.00 | 10
9 | 1 | 0 | 1 | 0.80 | 7
price levels are now – 4 (= – 5·0.80) for a price of 0.80 EUR, – 5 for a price of 1.00 EUR,
and – 6 for a price of 1.20 EUR. Overall, the findings are the same.
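The mixed model can be estimated the same way; only the design matrix changes, with price now entering as a metric variable (Table 9.12). A sketch:

```python
import numpy as np

# Table 9.12: dummy coding for cocoa content and UTZ label, metric price.
X = np.array([
    [0, 0, 1, 0.80],
    [1, 0, 0, 1.20],
    [0, 0, 1, 1.20],
    [1, 0, 1, 1.00],
    [0, 1, 1, 1.20],
    [0, 0, 0, 1.00],
    [0, 1, 0, 0.80],
    [0, 1, 1, 1.00],
    [1, 0, 1, 0.80],
])
y = np.array([3, 4, 1, 6, 8, 2, 9, 10, 7], dtype=float)

A = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(beta, 3))  # constant, cocoa 30%, cocoa 50%, UTZ label, price
# The price parameter is -5, so the implied price partworths are
# -5 * 0.80 = -4, -5 * 1.00 = -5, and -5 * 1.20 = -6, as stated in the text.
```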
It is plausible that a sustainability label is evaluated positively. For the attribute ‘cocoa content’,
we can think of all kinds of relationships (decreasing, increasing, ideal point). Yet, exam-
ining the evaluations suggested that the respondent has an ideal point related to cocoa
content. Thus, the results of the partworth model seem plausible.
Additionally, the significance of the utility parameters indicates whether the attributes
and levels have been relevant for the respondents. Yet, the small number of degrees of freedom often leads to large standard errors, and thus, insignificant parameters.
Therefore, insignificant parameters have to be treated with care. In the example, all
parameters are significant when using the partworth model (Table 9.11), while the util-
ity parameter for ‘UTZ label’ is not significant at the 5% level (p = 0.11) when imple-
menting the mixed model. Thus, the two different utility functions lead to contradictory
results regarding the relevance of the sustainability label.
Predictive Validity
For assessing the predictive validity, we can use a holdout sample. A holdout sam-
ple consists of stimuli that are evaluated by the respondents, but are not included in the
estimation of the parameters. While considering holdout stimuli allows us to assess the
predictive validity, one reason not to use holdout stimuli is that they increase the num-
ber of evaluations a respondent has to make. Thus, the cognitive effort increases for the
respondents, which can negatively affect the reliability and validity of the evaluations.
If we use holdout stimuli, we predict the total utility values for these stimuli using
the estimated utility parameters (partworths). These values are then compared to the
observed total utility values. We can again use a correlation measure to assess the pre-
dictive validity. Alternatively, we can predict which stimulus in the holdout sample is
the most preferred one (first choice) and assess whether we also observe the highest total
utility value for this specific stimulus. If this is the case, we observe a so-called ‘hit’. The
percentage of first-choice hits in the holdout sample can serve as a measure of predictive
validity. In our example, we did not consider any holdout stimuli, and we refer the reader
to Sect. 9.3.3 for a more detailed discussion.
In a final step, we discuss what insights and implications we can derive from the esti-
mated utility parameters. First, we elaborate on the insights regarding the preference
structure of individual respondents. We also discuss how to obtain the relative impor-
tance of attributes. Second, we discuss how to compare the results across respondents
and how to use the findings to predict consumer behavior to support decisions by manag-
ers, policymakers, etc.
wj = (max_m bjm − min_m bjm) / Σ_{j=1}^{J} (max_m bjm − min_m bjm)   (9.4)
In the example, the attribute ‘cocoa content’ is the most important attribute for the
respondent i = 1 (Table 9.14). The relative importance equals 71.2% (or 0.712 =
7.00/9.83) when considering the mixed model. A change in cocoa content leads to a sub-
stantial change in the total utility value. In contrast, the attribute ‘UTZ label’ is the least
important one (relative importance = 8.5% or 0.085).
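Applying Eq. (9.4) to the mixed-model results reproduces the reported importances; a short sketch (partworth ranges taken from the values quoted in the text):

```python
# Range (max - min) of the partworths per attribute, mixed model.
ranges = {
    'cocoa content': 7.000 - 0.0,  # partworths 3.67, 7.00, 0 (reference)
    'UTZ label': 0.833 - 0.0,      # partworths 0.83, 0 (reference)
    'price': -4.0 - (-6.0),        # partworths -4, -5, -6
}
total = sum(ranges.values())       # 9.833
for attr, rng in ranges.items():
    print(f'{attr:<14} {rng / total:.3f}')
# cocoa content 0.712, UTZ label 0.085, price 0.203
```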
Note that the relative importance of an attribute may depend on the number of levels
of an attribute (number-of-levels effect) and the range of that attribute (bandwidth effect)
(Verlegh et al., 2002).
The so-called number-of-levels effect occurs when an increase in the number of levels
of an attribute—while holding the range of levels constant—results in a higher attrib-
ute importance. For instance, we considered three levels for the attribute ‘price’ rang-
ing from 0.80 EUR to 1.20 EUR. If we increased the number of levels to five but kept
the range constant (i.e., 0.80 EUR, 0.90 EUR, 1.00 EUR, 1.10 EUR, 1.20 EUR), the
attribute ‘price’ would become more important for respondents, simply because there are more levels and respondents pay more attention to changes in price. Moreover, a larger range
(bandwidth) of attribute levels can also lead to a higher importance of that attribute (e.g.,
if we increased the price range from 0.80 EUR to 1.50 EUR). Thus, the researcher needs
to be careful when deciding on the number of levels and attribute ranges right from the
start of a conjoint analysis (cf. Sect. 9.2.1).
We can transform and standardize the individual utility parameters in such a way that
the estimated utility parameters for all respondents share the same ‘zero point’. Usually,
we set the lowest partworth of an attribute equal to zero. Thus, in a first step the differ-
ence between the individual partworths and the lowest partworth of the corresponding
attribute is computed:
b*jm = bjm − bj,min   (9.5)

with bj,min denoting the smallest partworth of attribute j.
Equation (9.5) applies to the partworth model. If a vector model was used, we first need to compute the partworths by multiplying the utility parameters with the respective values for the attribute. For the mixed model, we thus obtain the following values of b*jm for the attribute ‘price’: 2 for a price of 0.80 EUR, 1 for a price of 1.00 EUR, and 0 for a price of 1.20 EUR.
In a second step, we consider the maximum total utility for an individual respondent.
The maximum total utility value is the sum of the most preferred attribute levels. In the
example, the respondent prefers a cocoa content of 50%, an UTZ label, and a price of
0.80 EUR. Such a chocolate has a total utility of 16.27 (= 6.44 + 7.00 + 0.83 + 2.00) when
considering the mixed model. We use the maximum total utility to standardize the scales
by setting the total utility of the most preferred stimulus to 1. Doing so results in the
standardized partworths:
bjm^std = b*jm / (b0 + Σ_{j=1}^{J} max_m b*jm)   (9.6)
Table 9.15 shows the resulting standardized partworths for respondent i = 1. If we sum
up the partworths of the most preferred attribute levels and consider the constant term,
we get a value of 1. If we standardize the utility parameters (partworths) according to
Eq. (9.6), we can compare the results from various individual analyses.
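The two standardization steps (Eqs. 9.5 and 9.6) can be sketched as follows, using the mixed-model estimates quoted in the text:

```python
b0 = 6.444  # constant term of the mixed model
partworths = {
    'cocoa content': [3.667, 7.000, 0.0],  # 30 %, 50 %, 70 % (reference)
    'UTZ label': [0.833, 0.0],
    'price': [-4.0, -5.0, -6.0],           # -5 times the price level
}

# Step 1 (Eq. 9.5): shift so that the lowest partworth per attribute is 0.
b_star = {a: [b - min(bs) for b in bs] for a, bs in partworths.items()}

# Step 2 (Eq. 9.6): divide by the maximum total utility value.
max_total = b0 + sum(max(bs) for bs in b_star.values())
b_std = {a: [round(b / max_total, 3) for b in bs] for a, bs in b_star.items()}

print(round(max_total, 2))  # 16.28; the text reports 16.27 due to rounding
print(b_std['price'])       # shifted price partworths 2, 1, 0, scaled by the maximum
```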
A joint estimation pools the evaluations of all respondents and thus uses many more observations to estimate the same number of utility parameters. Thus, we have more degrees of freedom, which results in more efficient estimates of the utility parameters. However, a joint estimation assumes that preferences are sufficiently homogeneous, and we have to check carefully whether it leads to valid results. For example, we
can first conduct individual analyses to test whether the standardized utility parameters
vary substantially across respondents. If this is the case, there is a high degree of hetero-
geneity and we should not run a joint estimation but rather use a cluster analysis to iden-
tify homogeneous groups, that is, groups with similar preference structures (cf. Chap. 8).
The respondent prefers the first alternative with no UTZ label for a price of 1.00 EUR
compared to the second alternative with an UTZ label for 1.20 EUR. Yet, the difference
in estimated utilities is rather small.
Based on these results, we can predict consumers’ buying behavior. There are three basic approaches (choice rules) to predict consumer behavior:

• the first-choice rule,
• the probabilistic choice rule (BTL), and
• the logit rule.
Since conjoint analysis makes no assumptions about how the total utility values are
linked to actual choice behavior, we need to specify a choice rule.
First-choice Rule
The first-choice rule predicts that consumers will ‘for sure’ choose the product with
the highest utility. Thus, we assign a choice probability of 100% to the alternative with
the highest total utility; the other alternative receives a choice probability of 0%. If two
alternatives have exactly the same total utility value, the choice probability is distributed
equally across these stimuli (i.e., 50%). In our example, we predict that the consumer
chooses alternative 1 with a choice probability of 100%, although we may wonder
whether the world is really only black or white (0/1).
Probabilistic Choice Rule (BTL)
The probabilistic choice rule, based on the Bradley-Terry-Luce (BTL) model, derives the choice probability of an alternative from its share of the sum of the total utility values:

Pik* = uik* / Σ_{k*=1}^{K*} uik*   (9.7)
In the example, the choice probability for the first alternative equals 0.505 (=
8.44/16.71) or 50.5%, while the choice probability for the second alternative is 0.495 or
49.5%.
Logit Rule
A variation of the probabilistic choice rule is the logit rule that relies on the following
equation to compute the choice probability:
Pik* = exp(uik*) / Σ_{k*=1}^{K*} exp(uik*)   (9.8)
The logit rule implies an s-shaped relationship between the utility and the choice probability (cf. Fig. 9.25). In our example, the choice probabilities are 54% (= exp(8.44)/
(exp(8.44) + exp(8.27))) for the first alternative and 46% for the second alternative. The
choice probabilities do not differ substantially since the total utility values are rather sim-
ilar. If we observe a large difference in the total utility values, the logit rule converges to
the first-choice rule.
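The three choice rules can be contrasted in a few lines; a sketch using the two estimated total utilities from the example (8.44 and 8.27):

```python
import math

def first_choice(utilities):
    """First-choice rule: all probability mass on the highest utility (ties share equally)."""
    top = max(utilities)
    winners = [u == top for u in utilities]
    return [w / sum(winners) for w in winners]

def btl(utilities):
    """Probabilistic (Bradley-Terry-Luce) rule: probability proportional to utility."""
    total = sum(utilities)
    return [u / total for u in utilities]

def logit(utilities):
    """Logit rule (Eq. 9.8): probability proportional to exp(utility)."""
    total = sum(math.exp(u) for u in utilities)
    return [math.exp(u) / total for u in utilities]

u = [8.44, 8.27]  # estimated total utilities of the two alternatives
print(first_choice(u))                  # [1.0, 0.0]
print([round(p, 3) for p in btl(u)])    # [0.505, 0.495]
print([round(p, 3) for p in logit(u)])  # [0.542, 0.458]
```

The nearly identical utilities yield nearly identical BTL probabilities, while the logit rule spreads them somewhat further apart, matching the numbers in the text.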
Remember that conjoint analysis does not imply a specific choice rule. The researcher
has to decide which rule to apply to predict consumer behavior. Since consumers usually make choices rather than rank or rate products, traditional conjoint analysis has been criticized for being far from reality. As
an answer to this criticism, the choice-based conjoint (CBC) analysis was developed.
Because of its relevance in research and practice, we discuss this variant of the tradi-
tional conjoint analysis separately in Sect. 9.4.
9.3 Case Study
We now use a larger sample and a slightly different research question to demonstrate
how to conduct a conjoint analysis with the help of SPSS. The manager of the chocolate
manufacturer actually knows a lot about consumer preferences when it comes to choc-
olate bars. Yet, the chocolate company considers introducing a range of chocolate truf-
fles. The manager visited a confectionary trade fair and organized several focus groups.
Table 9.16 shows the attributes and attribute levels that seem to be relevant for consum-
ers when buying chocolate truffles.
In total, we could create 324 (=3·3·2·2·3·3) stimuli. In the following, we describe
how to use SPSS to generate a reduced design and analyze the collected data.
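The count of 324 possible stimuli follows from the full factorial combination of the attribute levels; a quick sketch (only the numbers of levels per attribute are taken from Table 9.16; the concrete level labels do not matter for the count):

```python
from itertools import product

levels = [3, 3, 2, 2, 3, 3]  # number of levels per attribute (cf. Table 9.16)
stimuli = list(product(*[range(n) for n in levels]))
print(len(stimuli))  # 324 stimuli in the full factorial design
```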
To conduct a conjoint analysis with SPSS, several steps are necessary. First, we generate
the experimental design. Second, we create a data file that contains the evaluations of the
different stimuli. Third, we estimate the utility parameters for each individual respondent
and the sample.
• Alternative 1: mixed mini truffles (5 g) that contain superfoods and have a creamy filling.
The truffles are not wrapped individually and will be offered for a price of 6.99 EUR.
• Alternative 2: fruity truffles with a medium size (10 g) that contain a liquid filling but
no superfoods, are individually wrapped in paper and will be offered for 7.99 EUR.
These two alternatives have been added to the file and appear in rows 21 and 22 in
Fig. 9.12.
1. We can ask the respondents to order the stimuli from most preferred to least preferred.
In this case, the first column in the SPSS data file contains the respondent’s ID; the
second column contains the number of the most preferred stimulus, and so forth.
2. We can ask the respondents to rank the stimuli. A lower rank implies greater prefer-
ence. In this case, the second column of the SPSS data file contains the rank of stimu-
lus k = 1, and so forth.
Fig. 9.13 Excerpt of the data file showing the respondents’ preferences (ranking)
3. We can ask the respondents to use a metric scale to evaluate the stimuli. A higher
score implies greater preference, so the second column of the SPSS data file contains
the value assigned to stimulus k = 1.
In our example, we used the ranking method for evaluating the stimuli (option 2). The
respondents were asked to rank the stimuli according to their preferences. The most pre-
ferred stimulus received the lowest rank (i.e., rank = 1). Each person evaluated 20 stimuli
(including four holdout stimuli). Figure 9.13 shows an excerpt of the SPSS data file.
Besides the partworth and vector models, we can also specify the ideal point model
with the SPSS subcommands ‘IDEAL’ (i.e., inverted u-shaped quadratic relationship) or
‘ANTIIDEAL’ (i.e., u-shaped quadratic relationship).
After indicating the link between the utility parameters and the total utility values, we
need to tell SPSS how the data have been collected. We can choose between the subcom-
mands ‘SEQUENCE’, ‘RANK’, or ‘SCORE’. These options correspond to the above-de-
scribed options on how to enter the data into the SPSS data file. In our example, we
have ranking data and we hence use the subcommand ‘RANK’. SPSS now automatically
recodes the ranking to values that reflect total utility values (cf. Sect. 9.2.4.2). We further
indicate the name of the stimuli that are considered in the later estimations (here: the 16
stimuli of the experimental design and the 4 holdout stimuli):
The ‘PRINT’ subcommand allows us to define which results will be displayed in the
SPSS output file. If we use the option ‘ANALYSIS’, only the results of the experimental
data analysis are included. The estimated utility parameters for each respondent as well
as the overall results of a joint analysis are presented. The option ‘SIMULATION’ leads
to the reporting of the simulation data only. The results of the first-choice rule, proba-
bilistic choice (BTL) rule, and logit rule are displayed. The option ‘SUMMARYONLY’
reports the result of the joint estimation but not the individual results. Finally, the option
‘ALL’ reports the results of both the experimental and simulation data analyses. This
option is the default option. If you choose the option ‘NONE’, no results are reported
in the SPSS output file. We choose the option ‘ALL’ in our example because we want to
explore whether there is heterogeneity among the respondents, and we want to know the
results for the simulation.
/PRINT = all
The subcommand ‘UTILITY’ can be used to save the estimated utility parameters to a
new SPSS data file. If you want to do so, you have to indicate the file where the results
should be saved (cf. Sect. 9.3.4).
The subcommand ‘PLOT’ produces plots in addition to the numerical output. The fol-
lowing three options are available for this subcommand:
1. SUMMARY. Produces bar charts of the relative importance for all attributes and of the
utility contributions for each attribute.
2. SUBJECT. Produces bar charts of the relative importance for all attributes and of the
utility contributions for each attribute clustered by respondents.
3. ALL. Plots both summary and subject charts.
Since we do not require any plots for our case study, we ignore this subcommand.
9.3.3 Results
Before the results of the individual and joint estimations are presented, SPSS reports any
reversals. A reversal occurs if an assumed link between the utility parameters and the
total utility value is not confirmed, for example, if we assume a negative linear relation
between price and total utility but we find a positive relation for an individual respond-
ent. In our example, no reversals occur, i.e., the assumed relationship between price and
total utility is valid for all respondents.
9.3.3.1 Individual-level Results
In the following, we present the SPSS results of the individual analyses. For respondent
i = 1, we get the utility parameters displayed in Fig. 9.14.
Note that SPSS presents the partworths for all attribute levels, although we used the
vector model for the attribute ‘price’. The estimated parameter for the attribute ‘price’ is
displayed later in the SPSS output file. In our example, the price parameter for respondent i = 1 equals –1.545. To derive the partworths for the different price levels, we multiply the prices with the price parameter (e.g., –9.257 = 5.99 · (–1.545); deviations are due to rounding).
Moreover, SPSS uses effect coding for the attribute levels of the partworth model.
That is, the partworth for the reference level is not 0, as for dummy coding, but the partworths belonging to one attribute add up to 0. Effect coding uses the values +1, 0, and –1 to represent categorical variables. Table 9.17 shows, as an example, the coding of the different attribute levels for dummy and effect coding for the attribute ‘flavor’.
The general interpretation of the effect-coded partworths is similar to the interpreta-
tion of the dummy-coded partworths. The attribute level with the highest value is the
Table 9.17 Dummy and effect coding for the attribute ‘flavor’

               | Dummy coding | Effect coding
Level = fruity | 1  0         |  1   0
Level = nutty  | 0  1         |  0   1
Level = mixed  | 0  0         | –1  –1
preferred level (e.g., ‘fruity’), and the attribute level with the smallest partworth is the least preferred one (e.g., ‘nutty’). From Fig. 9.14, we learn that respondent i = 1 has a preference for fruity truffles that have a size of 5 g, contain superfoods, have a creamy filling,
are not individually wrapped, and are offered for 5.99 EUR.
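The sum-to-zero property of effect coding can be illustrated briefly; the parameter values below are made up purely for illustration and are not taken from the SPSS output:

```python
# Effect coding of the attribute 'flavor' (cf. Table 9.17):
effect = {'fruity': [1, 0], 'nutty': [0, 1], 'mixed': [-1, -1]}

# With effect coding, the implied partworth of the reference level 'mixed'
# is the negative sum of the estimated parameters (hypothetical values here).
b_fruity, b_nutty = 1.2, -0.8
b_mixed = -(b_fruity + b_nutty)
print(round(b_mixed, 3))  # -0.4; the three partworths add up to 0
```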
Next, SPSS presents the relative importance of each attribute (Fig. 9.15). For respond-
ent i = 1, packaging is the most important attribute (29.057%) followed by truffle size
(20.755%) and type of flavor (17.642%). The attribute ‘filling’ is the least important
attribute for this particular respondent.
As goodness-of-fit measures, SPSS uses Pearson’s correlation (‘Pearson’s R’) and
Kendall’s tau for the estimation (‘Kendall’s tau’) and holdout samples (‘Kendall’s tau for
Holdouts’) (Fig. 9.16). Since we used ranking data, we focus on Kendall’s tau that equals
1 for the estimation and holdout samples. That is, we are able to completely reproduce
the original ranking based on the estimated utility function.
Finally, the estimated total utility values for the two alternatives considered for simulation (‘Preference Scores of Simulations’) are displayed (Fig. 9.17). Respondent i = 1
has a strong preference for the first alternative, i.e. mixed mini truffles (5 g) that contain
superfoods, have a creamy filling, are not wrapped individually, and are offered for a
price of 6.99 EUR.
2 We leave it up to the reader to inspect the data in more detail and to examine the heterogeneity in
the preference structures. Visit www.multivariate-methods.info to obtain the dataset.
Table 9.18 Preferred attribute levels of the two groups in the data set

Attribute | Group 1 | Group 2
Flavor | Fruity | Nutty
Size of truffles | 5 g | 15 g
Superfood | Containing superfoods | Not containing superfoods
Filling | Creamy | Creamy
Packaging | No individual packaging | No individual packaging
Price | Negative | Negative
At the very end of the SPSS output file, the results for the simulation are reported.
Remember that the manager considered two alternative chocolate truffles for the simula-
tion process.
• Alternative 1: mixed mini truffles (5 g) that contain superfoods and have a creamy fill-
ing. The truffles are not wrapped individually and will be sold for a price of 6.99 EUR.
• Alternative 2: fruity truffles with a medium size (10 g) that contain a liquid filling but
no superfoods, are individually wrapped in paper and will be offered for 7.99 EUR.
SPSS presents the derived choice probabilities according to the different choice rules
(Fig. 9.20). When using the first-choice rule (‘Maximum Utility’), we predict that
Alternative 1 is chosen by all 41 respondents. The probabilistic choice rule (‘Bradley-
Terry-Luce’) and the logit rule (‘Logit’) provide a more nuanced picture. According to
the probabilistic choice rule, Alternative 1 is chosen with a probability of 65.0%. This
probability is substantially smaller than the probability of 91.3% that is derived when
using the logit rule. This difference is due to the large difference in the estimated
total utility values (‘Preference Scores of Simulations’). If large differences in
total utility values are predicted, the logit rule converges to the result of the first-
choice rule.
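The three choice rules can be sketched in Python; the utility values below are hypothetical, chosen only to illustrate how a large utility gap drives the logit rule toward the first-choice prediction while the Bradley-Terry-Luce rule stays closer to the utility ratio.

```python
import math

def choice_shares(utilities):
    """Choice probabilities for one respondent under three common choice rules."""
    # First-choice rule: all probability mass on the stimulus with maximum utility.
    first = [1.0 if u == max(utilities) else 0.0 for u in utilities]
    # Bradley-Terry-Luce rule: utility share (requires positive utility values).
    total = sum(utilities)
    btl = [u / total for u in utilities]
    # Logit rule: exponentiated utilities, normalized.
    denom = sum(math.exp(u) for u in utilities)
    logit = [math.exp(u) / denom for u in utilities]
    return first, btl, logit

# Hypothetical total utilities for two simulated alternatives:
first, btl, logit = choice_shares([5.0, 2.0])
```

With these values the BTL share of the first alternative is 5/7 ≈ 0.71, while the logit probability is already above 0.95 — mirroring the qualitative pattern (65.0% vs. 91.3%) reported for the case study.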
Fig. 9.20 Estimated total utility values and preference probabilities for the two alternatives considered
for simulation (joint estimation)
The manager of the chocolate company can now use these results to make a deci-
sion about which kind of truffles to launch in the market. However, it might be advis-
able to take into account the heterogeneity among respondents. If there are two groups
(segments) of consumers that differ concerning their preferences, and both segments are
interesting as target groups, the manager might want to consider introducing two differ-
ent kinds of truffles to the market.
We further requested SPSS to save the individually estimated utility parameters in
a new data file. Figure 9.21 shows an excerpt of this data set. For the attribute ‘price’,
the estimated utility parameters are saved, not the partworths for each level. Besides the
individually estimated utility parameters (‘CONSTANT’, ‘flavor1’ etc.), the data file con-
tains the estimated total utility values for the 20 stimuli including the holdout stimuli
(‘SCORE1’ to ‘SCORE20’) and the alternatives considered for simulation (‘SIMUL01’
and ‘SIMUL02’). A closer look at the estimated total utility values for the two alter-
natives considered for simulation shows that there is some heterogeneity among the
respondents. While some respondents have a very strong preference for Alternative 1,
other respondents show more balanced preferences.
In the next step, we can use the individually estimated utility parameters to compute the
standardized utility parameters as described in Sect. 9.2.5.2. Subsequently, we could use
these standardized utility parameters to conduct a cluster analysis to identify groups of
respondents with similar preferences (cf. Chap. 8).
Figures 9.22 and 9.23 show the SPSS syntax for the chocolate truffle example. Figure 9.22
illustrates how to generate a reduced design with the SPSS procedure ORTHOPLAN. The
subcommand ‘MIXHOLD’ indicates whether the holdout stimuli are mixed with the stim-
uli of the experimental design (YES) or whether they are presented at the end of the data
file (NO). We decided to present them at the end of the data file (cf. Fig. 9.10).
Figure 9.23 shows the SPSS syntax for the CONJOINT procedure to estimate the util-
ity parameters (partworths).
For readers interested in using R (https://www.r-project.org) for conducting a conjoint
analysis, we provide the corresponding R commands on our website
(www.multivariate-methods.info).
Fig. 9.22 SPSS syntax for generating a reduced design with the procedure ORTHOPLAN

Fig. 9.23 SPSS syntax for estimating the utility parameters with the procedure CONJOINT

9.4 Choice-based Conjoint Analysis

Traditional conjoint analysis (as described in Sect. 9.2) has been criticized because of
the non-realistic evaluation of the stimuli. Usually, respondents evaluate a set of objects
jointly and choose the object they prefer the most. This is also the basic idea of discrete
choice analysis that is frequently used in economics (cf. McFadden, 1974). Louviere and
Woodworth (1983) introduced the basic idea of discrete choice analysis into business
(esp. marketing) research and developed the so-called choice-based conjoint (CBC)
analysis that asks consumers to choose a stimulus from a set of stimuli.
Consumer preferences are thus measured on a nominal scale (1 if the stimulus is cho-
sen and 0 otherwise). The nominal variable that reflects consumer preferences contains
less information than evaluations of stimuli using an ordinal or metric scale (cf. Sect.
9.2.3). Therefore, it is usually not possible to estimate the utility parameters for each
individual; rather, a homogeneous preference structure is assumed and we estimate one
set of utility parameters for all respondents. In principle, however, an individual estima-
tion of the utility parameters is possible if consumers make a sufficient number of choice
decisions (i.e., 50 or more choices). Yet in practice, it is hardly possible to ask consumers
to make 50 or more choice decisions. Instead, sophisticated statistical models such as
latent class models and Bayesian models are used to derive utility parameters at a group
or individual level.

Fig. 9.24 Process steps of CBC analysis (1 Selection of attributes and attribute levels,
2 Design of the experimental study, …)
The first step when conducting a CBC analysis is to decide on the attributes and attribute
levels that are used to describe the stimuli. In general, the same aspects need to be con-
sidered when selecting the attributes and attribute levels for a CBC analysis as would be
considered for a conjoint analysis.
The attributes must be relevant and independent of each other. Moreover, the company
must be able to act on (vary) the attributes. The attribute levels have to be realistic and
feasible, and the attribute levels need to be compensatory. Moreover, the considered
attributes and attribute levels must not be exclusion (knock-out) criteria. Finally, the number of attributes
and attribute levels needs to be limited. Again, it is recommended to use not more than 6
attributes with 3 to 4 levels each. As in conjoint analysis, we need to be aware that once
we have decided on the attributes and attribute levels, we can no longer change or adapt
them.
In the following, we use a small example to illustrate the CBC analysis. We use a
rather small example to facilitate understanding—especially of the estimation procedure
(cf. Sect. 9.4.4). CBC analysis is not part of the IBM SPSS software package but we can
use the SPSS procedure COXREG to analyze data that we collected with a CBC design.
Yet, there are many commercial software tools for conducting CBC analyses, such as
tools by Sawtooth Software or Conjoint.ly. However, you can also carry out a CBC anal-
ysis with the help of R. Since not every reader is familiar with R, we use Microsoft Excel
for our example.
Let us imagine the manager of a chocolate company wants to measure and analyze
the preferences for dark chocolate. More specifically, the manager wants to learn
more about the cocoa content and the price consumers prefer. Thus, the manager
considers the attributes ‘cocoa content’ and ‘price’ in the study. Table 9.20 shows
the considered attributes and attribute levels. For the sake of simplicity, we consider
just two attributes with two levels each. The two different levels of cocoa content
are different from the ones in our previous example since here we focus on dark
chocolate only. ◄
For the study, the researcher has to define the stimuli and choice sets and has to decide
how many choices the respondents have to make.
Apart from deciding whether to include a no-choice option or not, we need to define
the number of stimuli presented in a choice set. The more stimuli are presented, the more
difficult the task of the respondent becomes. Moreover, considering a larger number of
stimuli in a choice set may affect the relative attribute importance. Respondents seem to
pay more attention to the attributes if fewer stimuli are considered in a choice set. This
indicates that respondents are overwhelmed with the choice task if choice sets consist of
many stimuli. Due to these arguments, we often observe a small number of stimuli in a
choice set in CBC studies.
The use of a small number of stimuli in a choice set is also driven by the fact that
we aim for minimal overlap of the attribute levels. There is minimal overlap between
the attribute levels if the stimuli in each choice set have non-overlapping attribute levels,
which means that an attribute level only appears once in a choice set. Minimal overlap
of attribute levels forces respondents to make trade-off decisions between the different
stimuli in a choice set. Keeping in mind that many CBC designs consider 3 or 4 levels
for each attribute (cf. Sect. 9.4.1), the maximum number of stimuli in a choice set is 4 if
we want to achieve minimal overlap. Since we only consider 2 levels for each attribute
in our example, each choice set consists of just 2 stimuli (Table 9.21). Consequently, we
have no overlap. Yet it is important to note that some overlap in attribute levels is not
critical and may facilitate the estimation of interaction effects. Besides the 2 stimuli, we
also consider a no-choice option in our example.
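One way to construct choice sets with minimal overlap for the 2×2 example can be sketched as follows. This is an illustrative construction, not necessarily the procedure used for Table 9.21: we enumerate all stimuli, keep only pairs that share no attribute level, and append a no-choice option to each set.

```python
from itertools import product

# Full 2x2 design for the dark-chocolate example.
cocoa = ["60%", "78%"]
price = ["1.50 EUR", "2.00 EUR"]
stimuli = list(product(cocoa, price))   # 4 stimuli

def minimal_overlap(s1, s2):
    """True if the two stimuli share no attribute level."""
    return all(a != b for a, b in zip(s1, s2))

# Pair up stimuli so that no attribute level repeats within a choice set,
# then add a no-choice option to each set.
choice_sets = [
    (s1, s2, "no-choice")
    for i, s1 in enumerate(stimuli)
    for s2 in stimuli[i + 1:]
    if minimal_overlap(s1, s2)
]
```

For two attributes with two levels each, this yields exactly two choice sets of two stimuli plus the no-choice option — consistent with the design size used in the example.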
In contrast to conjoint analysis, in a CBC study respondents are asked to indicate which
stimulus in a choice set they would choose. CBC analysis assumes that respondents
select the most preferred stimulus in a choice set. By gathering respondents’ prefer-
ences with the help of choices, we can avoid the problem of a subjective use of scales.
Additionally, making choices instead of evaluating stimuli on a scale is considered more
realistic.
Table 9.23 represents the choice of one exemplary respondent i = 1 for choice set
r = 1. Respondent i = 1 chooses stimulus k = 2, which is therefore the most preferred one. However,
we neither learn anything about the strength of the preference nor whether stimulus k = 1
is preferred over the no-choice option or not. That is because respondents’ choice deci-
sions contain less information than evaluations on a metric or ordinal scale, as only a
nominally scaled variable is recorded.
After respondents have made their choices, we determine the partworths of the attribute
levels. In the following, we discuss the specification of the utility function, and how to
derive the partworths.
For our example with the two attributes ‘cocoa content’ and ‘price’, the utility function of
stimulus k can be written as

$$y_k = \beta_{11} \cdot x_{11k} + \beta_{21} \cdot x_{21k}$$

where $x_{11k}$ indicates whether stimulus k has a cocoa content of 60% (1) or not (0), and
$x_{21k}$ indicates whether stimulus k has a price of 1.50 EUR (1) or not (0).
Although a stimulus has a total utility value independent of the choice set(s) it is part
of, we consider the choice set in the general formulation of the utility function. If we
include a no-choice option, we represent this option as an attribute of the stimulus. Thus,
we get the following general formulation of the utility function if we use the partworth
model for all attributes:
$$y_k = \sum_{j=1}^{J} \sum_{m=1}^{M_j} \beta_{jm} \cdot x_{jmk} \quad (k \in K_r) \tag{9.9}$$

with

$y_k$: total utility of stimulus k
$\beta_{jm}$: partworth of level m of attribute j
$x_{jmk}$: 1 if stimulus k has level m of attribute j, 0 otherwise
$K_r$: set of stimuli in choice set r

For our example, the no-choice option is represented by attribute j = 3, and $x_{31k}$ equals 1
if the stimulus is the no-choice option and 0 otherwise.
If the vector model is employed for all attributes, we get the following formulation:

$$y_k = \sum_{j=1}^{J} \beta_{j} \cdot x_{jk} \quad (k \in K_r) \tag{9.10}$$

with

$\beta_j$: utility parameter of attribute j
$x_{jk}$: value of attribute j for stimulus k

Since the variable for the no-choice option is a binary variable, we can either use the
formulation $x_{jmk}$ or $x_{jk}$. For our example, we can write:
$$y_k = \underbrace{\beta_{11} \cdot x_{11k}}_{\text{cocoa content}} + \underbrace{\beta_{21} \cdot x_{21k}}_{\text{price}} + \underbrace{\beta_{3} \cdot x_{3k}}_{\text{no-choice}}$$
In total, we need to estimate three parameters including the parameter for the no-choice
option.
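The dummy-coded utility function above amounts to a simple inner product. The sketch below uses hypothetical parameter values (the real ones are estimated later) to show how a stimulus's total utility is computed from its coding.

```python
def total_utility(beta, x):
    """Linear utility: sum of parameters times dummy-coded attribute levels."""
    return sum(b * xi for b, xi in zip(beta, x))

# beta = (b11: 60% cocoa, b21: 1.50 EUR, b3: no-choice) -- the three parameters
# to be estimated; the values here are hypothetical, for illustration only.
beta = (0.8, 0.5, -0.3)
x_stimulus = (1, 1, 0)   # 60% cocoa at 1.50 EUR, not the no-choice option
x_nochoice = (0, 0, 1)
total_utility(beta, x_stimulus)   # b11 + b21, i.e. about 1.3
```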
In the logit model, the probability of choosing stimulus k from choice set r is

$$\text{Prob}(k) = \frac{\exp(y_k)}{\sum_{k' \in K_r} \exp(y_{k'})} \quad (k \in K_r) \tag{9.11}$$

Equation (9.11) illustrates that the probability of choosing a stimulus depends on its utility
and the utility of all other stimuli in a choice set. The choice probability for stimulus k is
thus a non-linear function of the utility value of stimulus k and the utility values of the
other considered stimuli in a choice set. More specifically, there is an s-shaped relationship
between the choice probability and the estimated utility because the model relies on the
logistic function (Fig. 9.25).
The right-hand side of Eq. (9.11) further shows that the probability of choosing stim-
ulus k among the stimuli in choice set r depends on the difference in utility values. Thus,
the choice probability in the logit model is determined solely by the differences in utility
values, not by their absolute value. Stated differently, it is only the differences in utility
that matter.
If two stimuli have exactly the same utility values and we just consider two stimuli in
a choice set, the choice probability for both stimuli is 0.5 or 50% (Fig. 9.25). For exam-
ple, let us assume two stimuli with a utility value of 2 for both stimuli. Equation (9.11)
then equals for k = 1 (and k = 2):
$$\text{Prob}(k=1) = \frac{\exp(2)}{\exp(2) + \exp(2)} = \frac{1}{1 + \exp(-1 \cdot (2-2))} = \frac{1}{1 + \exp(0)} = 0.5$$

Fig. 9.25 Choice probability (vertical axis, 0 to 1) as a function of the utility difference $(y_k - y_{k'})$
A characteristic of the logit model is that if two stimuli have very similar utility val-
ues, small changes in the utility values have a strong effect on the choice probabilities.
For example, if we assume utility values of 2.2 and 1.8 for k = 1 and k = 2, respectively,
the choice probabilities are Prob(k = 1) = 60% and Prob(k = 2) = 40%. But in the case of
large differences in utility values, small changes in these values have only a minor effect
on the choice probabilities.
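This sensitivity pattern is easy to verify numerically. The following minimal sketch computes logit probabilities for a two-stimulus choice set; the utility values are the ones used in the text (2.0/2.0 and 2.2/1.8) plus a hypothetical large-gap pair to show the flat tail of the s-curve.

```python
import math

def logit_probs(utilities):
    """Logit choice probabilities for the stimuli in one choice set."""
    denom = sum(math.exp(u) for u in utilities)
    return [math.exp(u) / denom for u in utilities]

equal  = logit_probs([2.0, 2.0])   # equal utilities: [0.5, 0.5]
small  = logit_probs([2.2, 1.8])   # small gap near the curve's center: about [0.60, 0.40]
large  = logit_probs([6.0, 1.0])   # hypothetical large gap ...
larger = logit_probs([6.4, 1.0])   # ... the same 0.4 utility change now barely matters
```

A 0.4 utility shift moves the probability by about 10 percentage points near the center of the curve, but by less than one point in its flat region — exactly the behavior described above.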
Another characteristic of the logit model is that the model relies on the assumption
of ‘independence of irrelevant alternatives’ (IIA). The IIA assumption implies that new
alternatives affect the existing alternatives in the same way. For example, if we consider
a no-choice option, the choice probabilities for k = 1 and k = 2 change. If we assume a
utility parameter of 1 for the no-choice option, we get the following equations for k = 1
(k = 2) and the no-choice option:
$$\text{Prob}(k=1) = \text{Prob}(k=2) = \frac{1}{1 + \exp(-1 \cdot (2-2)) + \exp(-1 \cdot (2-1))} = \frac{1}{1 + \exp(0) + \exp(-1)} = 0.42$$

$$\text{Prob}(\text{no-choice}) = \frac{1}{1 + \exp(-1 \cdot (1-2)) + \exp(-1 \cdot (1-2))} = \frac{1}{1 + \exp(1) + \exp(1)} = 0.16$$
This is the result of the model’s inherent IIA assumption. However, there might be situa-
tions in which we introduce a new alternative that is a close substitute for an existing alter-
native. In this case, we would expect that the substitute suffers more from the introduction
of the new alternative than the other choice option. For that reason, the IIA assumption
is considered a limitation of the logit model, and alternative models such as nested logit
models have been developed to address this limitation (cf. Train, 2009, pp. 77–78).
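The IIA property from the numerical example above can be reproduced in a few lines: adding the no-choice option (utility 1) to the set with two equal-utility stimuli (utility 2 each) lowers both of their probabilities proportionally, so their odds ratio stays 1:1.

```python
import math

def logit_probs(utilities):
    """Logit choice probabilities for one choice set."""
    denom = sum(math.exp(u) for u in utilities)
    return [math.exp(u) / denom for u in utilities]

# Two stimuli with equal utility split the probability 50:50 ...
before = logit_probs([2.0, 2.0])
# ... adding a no-choice option with utility 1 draws from BOTH proportionally (IIA):
after = logit_probs([2.0, 2.0, 1.0])   # approx. [0.42, 0.42, 0.16]
```

Under a nested logit or a similar relaxation of IIA, a close substitute would instead lose disproportionately more probability — which is why those models exist.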
So far, we have assumed that we know the total utility values of the stimuli. Yet, we
actually have to estimate the utility parameters to derive the total utility values and the
resulting choices. In the following, we discuss how to estimate the utility parameters.
$$L = \prod_{r=1}^{R} \prod_{k=1}^{K_r} \text{Prob}(k)^{d_k} \rightarrow \max! \tag{9.12}$$
with
L likelihood function
Prob(k) estimated choice probability for stimulus k in choice set r
dk binary variable indicating whether stimulus k in choice set r has been chosen
(1) or not (0)
The value of the likelihood function depends solely on the utility parameters. We aim
to identify those utility parameters that result in a choice probability of 1 for the cho-
sen stimulus in a choice set. If we are able to identify the utility parameters that lead
to a choice probability of 1 for each chosen stimulus across all choice sets, the likeli-
hood function reaches a value of 1, which is its maximum value (minimum value = 0).
However, it is very unlikely that we obtain a value of 1 for the likelihood function.
Usually, the values for the likelihood are rather small since we multiply a number of
probabilities. For example, if we observed 12 choices and were able to predict each
choice with a probability of 0.9, the resulting value for the likelihood function would
be 0.282 (= 0.912). Small values for the likelihood make it more difficult to identify
the parameters that maximize the likelihood. To address this issue, we take the natural
logarithm of the likelihood and maximize the ln-likelihood function. We thus transform
Eq. (9.12) into the following form:
$$\ln L = LL = \sum_{r=1}^{R} \sum_{k=1}^{K_r} d_k \cdot \ln(\text{Prob}(k)) \rightarrow \max! \tag{9.13}$$
Taking the natural logarithm does not influence the estimated utility parameters. The
value for the ln-likelihood lies within the interval ]−∞, 0]. Thus, maximizing LL corresponds
to bringing its value as close to 0 as possible. To identify the utility parameters that maximize LL, we
can use an iterative procedure such as the Newton–Raphson algorithm (cf. Train, 2009,
pp. 187–188).
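The iterative search can be sketched in pure Python. The book refers to the Newton-Raphson algorithm; for brevity the sketch below uses plain gradient ascent on the ln-likelihood instead, and the toy choice data (dummy coding [60% cocoa, 1.50 EUR, no-choice]) are hypothetical.

```python
import math

def ll_and_grad(beta, choice_sets):
    """Ln-likelihood of the observed choices under the logit model, plus its gradient."""
    ll = 0.0
    grad = [0.0] * len(beta)
    for stimuli, chosen in choice_sets:          # stimuli: list of dummy vectors
        utils = [sum(b * x for b, x in zip(beta, xs)) for xs in stimuli]
        denom = sum(math.exp(u) for u in utils)
        probs = [math.exp(u) / denom for u in utils]
        ll += math.log(probs[chosen])
        for j in range(len(beta)):
            # dLL/dbeta_j = x_chosen,j - sum_k Prob(k) * x_k,j
            grad[j] += stimuli[chosen][j] - sum(p * xs[j] for p, xs in zip(probs, stimuli))
    return ll, grad

# Hypothetical choices in the two types of choice sets of the example
# (the integer is the index of the chosen stimulus):
data = [
    ([(1, 1, 0), (0, 0, 0), (0, 0, 1)], 1),
    ([(1, 0, 0), (0, 1, 0), (0, 0, 1)], 0),
    ([(1, 1, 0), (0, 0, 0), (0, 0, 1)], 0),
    ([(1, 0, 0), (0, 1, 0), (0, 0, 1)], 1),
    ([(1, 1, 0), (0, 0, 0), (0, 0, 1)], 2),
]

beta = [0.0, 0.0, 0.0]
for _ in range(500):                              # gradient ascent with a fixed step size
    _, grad = ll_and_grad(beta, data)
    beta = [b + 0.05 * g for b, g in zip(beta, grad)]
```

Each iteration moves the parameters uphill on the LL surface; with conflicting observed choices, LL converges to a finite maximum below 0, as described above.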
The ML estimation procedure is based on the idea that observed choices can be gen-
erated from a set of utility parameters. There is one set of utility parameters that can best
describe the empirically observed choice decisions. These utility parameters are system-
atically searched for and determined with the help of the iterative algorithm. Figure 9.26
illustrates an exemplary progression of the LL function if one parameter is varied. For
our example, we can see that the maximum of the LL function is around –10 and the util-
ity parameter that maximizes the LL function is around 0.5.
To derive robust parameter estimates, a sufficient number of choices must be
observed. That is, we need a sufficient number of degrees of freedom. The literature
refers to at least 60 degrees of freedom for the ML method; conservative voices even
recommend 120 degrees of freedom (cf. Eliason, 1993, p. 83). In our example, we need
to estimate three parameters and thus we would need to observe at least 63 choices.
However, our experimental design considers just two choice sets. Consequently, we need
to observe about 32 respondents to estimate the utility parameters.
Since we can hardly observe a sufficient number of choices from one single respond-
ent, an individual estimation of the parameters is not feasible in CBC studies. We thus
Fig. 9.26 Progression of the LL function when varying a single utility parameter
estimate one set of utility parameters for all respondents. If we assume that there is het-
erogeneity among the consumers, we may want to split the sample based on observed
characteristics (e.g., gender, age groups etc.). We can then estimate the utility parameters
for each sub-sample—if there are sufficient observations for each sub-sample. However,
this approach requires the user to know which characteristics are relevant for heterogene-
ity in preferences. Alternatively, other estimation methods such as latent class models or
Bayesian models can be employed to obtain group-specific or individual-specific utility
parameters. Yet, these methods are not discussed in this book.
Attribute levels and observed choice:

Respondent  Choice set  Stimulus  60% cocoa  78% cocoa  1.50 EUR  2.00 EUR  No-choice option  Observed choice
6           12          1         1          0          1         0         0                 0
6           12          2         0          1          0         1         0                 1
6           12          3         0          0          0         0         1                 0
We use an iterative algorithm to identify those utility parameters that maximize the
likelihood and the LL value. On our website (www.multivariate-methods.info), we demonstrate
how to use the Microsoft Excel Solver to maximize the LL function. The maximum
values for the likelihood and LL are 0.005 and –5.205, respectively. The likelihood value
is still rather small but we have now reached the maximum value, given the observed
choices. One reason for the rather small value might be that there is heterogeneity in the
respondents’ preferences, which makes it difficult to identify one set of utility parameters
that maps all observed choices. The utility parameters that result in the maximum value
for LL are: b11 = − 14.419, b21 = 13.033, and b3 = − 1.386. In the following, we discuss
how to assess the estimated utility function.
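With the reported estimates (b11 = −14.419, b21 = 13.033, b3 = −1.386) we can compute the predicted probabilities for one choice set. The dummy coding follows the choice-data table above; the stimulus labels are ours, added for readability.

```python
import math

# Parameter estimates reported in the text (b11: 60% cocoa, b21: 1.50 EUR, b3: no-choice).
b11, b21, b3 = -14.419, 13.033, -1.386

stimuli = {
    "60% cocoa, 1.50 EUR": (1, 1, 0),
    "78% cocoa, 2.00 EUR": (0, 0, 0),   # reference levels
    "no-choice":           (0, 0, 1),
}
utils = {name: b11 * x1 + b21 * x2 + b3 * x3 for name, (x1, x2, x3) in stimuli.items()}
denom = sum(math.exp(u) for u in utils.values())
probs = {name: math.exp(u) / denom for name, u in utils.items()}
```

For this set the model predicts a choice probability of roughly two thirds for the 78% cocoa truffle at 2.00 EUR, in line with the observed choice in the table above.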
We can also use the LLR test to determine the significance of the estimated utility
parameters. The LLR test for a single utility parameter is obtained by setting the utility
parameter to zero in the utility function and then maximizing the likelihood function for
this reduced model over the remaining parameters:
$$LLR_j = -2 \cdot (LL_{0j} - LL_b) \tag{9.15}$$
Table 9.27 shows the significance levels for the estimated utility parameters based on Eq.
(9.15). In our example, only the utility parameter for ‘60% cocoa content’ is significant.
Another common measure of the goodness-of-fit for the complete model is
McFadden’s R-square (also known as the likelihood ratio index):
$$R_M^2 = 1 - \frac{LL_b}{LL_0} \qquad (0 \le R_M^2 \le 1) \tag{9.16}$$
For the example, McFadden’s R-square equals 0.605. Given that the value ranges between
0 and 1, the obtained value indicates a high goodness-of-fit of the estimated utility function.
Besides the two measures based on LL, we can compute the hit rate for the estimation
sample (i.e., set up a classification matrix, cf. Chap. 5 ). That is, we count the number
of times we predict the chosen stimulus correctly if we always choose the stimulus with
the highest choice probability. In our example, we get a hit rate of 0.833 (= 10/12) or
83.3%. The benchmark is a hit rate of 50% and, thus, the model fits the data quite well,
although the likelihood value is rather small.
Predictive Validity
To assess the predictive validity, we can use a holdout sample and compute the hit rate
for this sample. In our example, we did not consider a holdout sample, and thus we are
not able to assess the predictive validity of the model.
Finally, we briefly discuss the insights and implications we can derive from the results of
a CBC analysis.
• Define sub-groups a priori and estimate the utility parameters for each sub-group sep-
arately by splitting the sample. This approach requires a priori knowledge about het-
erogeneity. For example, if we are confident that gender differentiates the respondents
in their preferences, we may split the sample into women and men. However, often
we do not know what observable variables are associated with heterogeneity in the
respondents’ preferences.
• Use more advanced estimation procedures such as Bayesian methods or latent
class models to derive utility parameters at the individual or segment level (cf.
Gustafsson et al., 2007). In contrast to defining groups a priori, latent class models
use the respondents’ choice decisions to identify respondents with similar preferences
(i.e., choice behavior) and to jointly estimate the utility parameters for those respond-
ents who demonstrate similar choice behavior. This will result in segment-specific
utility parameters.
9.4.5.4 Conclusion
Overall, CBC analysis and conjoint analysis share the same steps. Yet, there are also fun-
damental differences as described in this section. While we discussed CBC analysis in
some detail, it is not the only variation of conjoint analysis that has established itself
in practice. In Sect. 9.5.2, we will briefly elaborate on other developments to give the
reader an idea of the manifold methodologies available for measuring and mapping con-
sumer preferences.
9.5 Recommendations
In conclusion, the following recommendations can be given for using a conjoint analysis.
Survey Design
The survey design should not include more than 20 stimuli. If this number is exceeded in
the full factorial design, a reduced design (e.g., a fractional factorial design) should be created.
Evaluation of Stimuli
The evaluation method needs to be determined based on the concrete research question.
Conjoint analysis has several limitations that have been addressed by further develop-
ments of the methodology. The two main limitations are the way respondents evaluate
the stimuli (e.g., using ordinal or metric scales) and that only a limited number of
attributes and attribute levels can be considered. In addition to CBC analysis, the
following methods address these limitations.
MaxDiff Method
We have already discussed CBC analysis as one alternative to conjoint analy-
sis to address the criticism that the evaluation task in conjoint analyses is unrealistic.
Another approach that addresses this weakness is the MaxDiff method (cf. Louviere
et al., 2015).
MaxDiff is a technique that might be considered a more sophisticated extension of the
method of paired comparisons. With MaxDiff, respondents are shown a set (subset) of
the possible stimuli in the study and are asked to indicate (among this subset, with a min-
imum of three stimuli) the best and worst stimulus. MaxDiff assumes that respondents
evaluate all possible pairs of stimuli within the displayed subset and choose the pair that
reflects the maximum difference in preference.
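The information gained from a single MaxDiff task can be sketched as the set of pairwise preferences it implies. The stimuli A–D and the best/worst answers below are hypothetical.

```python
def implied_pairs(subset, best, worst):
    """Pairwise preferences implied by one MaxDiff answer (best/worst in a subset)."""
    pairs = set()
    for item in subset:
        if item != best:
            pairs.add((best, item))     # the best stimulus beats every other shown one
        if item not in (best, worst):
            pairs.add((item, worst))    # every other shown stimulus beats the worst
    return pairs

shown = ["A", "B", "C", "D"]
implied_pairs(shown, best="A", worst="D")
# yields A>B, A>C, A>D, B>D, C>D: 5 of the 6 possible pairs from a single task
```

Only the relation between the remaining stimuli (here B vs. C) stays unknown, which is why MaxDiff is considerably more efficient than asking all paired comparisons explicitly.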
One limitation of ACA that results from the adaptive approach is that the results across
individual respondents may be difficult to compare.
All of these three options have been implemented by Sawtooth Software in their com-
mercial software packages. Yet, other commercial software packages are also available
and there are some R packages available as well. The wide availability of further devel-
opments and software packages illustrates the high relevance of the conjoint methodol-
ogy for research and practice.
References
Green, P. E., & Srinivasan, V. (1978). Conjoint analysis in consumer research: Issues and outlook.
Journal of Consumer Research, 5(2), 103–123.
Green, P. E., Krieger, A. M., & Agarwal, M. K. (1991). Adaptive conjoint analysis: Some caveats
and suggestions. Journal of Marketing Research, 28(2), 215–222.
Green, P. E., Krieger, A. M., & Wind, Y. (2001). Thirty years of conjoint analysis: Reflections and
prospects. Interfaces, 31(3), 56–73.
Gustafsson, A., Herrmann, A., & Huber, F. (Eds.). (2007). Conjoint measurement—methods and
applications. Springer.
Haaijer, R., Kamakura, W. A., & Wedel, M. (2001). The ‘no-choice’ alternative to conjoint choice
experiments. International Journal of Market Research, 43(1), 93–106.
Johnson, R. M., & Orme, B. K. (1996). How many questions should you ask in choice-based
conjoint studies? Research Paper Series. Sawtooth Software. https://www.sawtoothsoftware.com/
download/techpap/howmanyq.pdf. Accessed 19 Sept. 2020.
Kuhfeld, W. F., Tobias, R. D., & Garratt, M. (1994). Efficient experimental design with marketing
research applications. Journal of Marketing Research, 31(4), 545–557.
Kumar, V., & Gaeth, G. J. (1991). Attribute order and product familiarity effects in decision tasks
using conjoint analysis. International Journal of Research in Marketing, 8(2), 113–124.
Louviere, J. J., & Woodworth, G. (1983). Design and analysis of simulated consumer choice or
allocation experiments: An approach based on aggregated data. Journal of Marketing Research,
20(4), 350–367.
Louviere, J. J., Flynn, T., & Marley, A. A. J. (2015). Best-Worst Scaling. Cambridge University
Press.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka
(Ed.), Frontiers in econometrics (pp. 105–142). Academic Press.
Swait, J., & Louviere, J. (1993). The role of the scale parameter in the estimation and comparison
of multinomial logit models. Journal of Marketing Research, 30(3), 305–314.
Train, K. (2009). Discrete choice methods with simulation. Cambridge University Press.
Verlegh, P. W. J., Schifferstein, H. N. J., & Wittink, D. R. (2002). Range and number-of-levels
effects in derived and stated measures of attribute importance. Marketing Letters, 13(1), 41–52.
Vriens, M., Oppewal, H., & Wedel, M. (1998). Ratings-based versus choice-based latent class
conjoint models. International Journal of Market Research, 40(3), 1–11.
Wittink, D. R., Vriens, M., & Burhenne, W. (1994). Commercial use of conjoint analysis in
Europe: Results and critical reflections. International Journal of Research in Marketing, 11(1),
41–52.
Index
A
Adaptive choice-based conjoint (ACBC), 598
Adaptive conjoint analysis (ACA), 597
Agglomeration schedule. See Cluster analysis
Akaike information criterion, 328
Alpha factoring, 411
Analysis of variance. See ANOVA
ANCOVA, 194
ANOVA, 12, 148
multivariate, 192
one-way, 150
two-way, 166
ANOVA table, 82, 162, 177, 182
Anti-image covariance matrix, 392
Area under curve. See ROC curve
Autocorrelation, 91, 106
Average (arithmetic mean), 15
Average linkage. See Clustering algorithms

B
Bandwidth effect, 558
Bartlett test, 389, 429
Baseline logit model, 320
Bayesian information criterion, 328
Bayes theorem, 231
Beta coefficients, 75
Binary variable, 9
BLUE characteristics, 92
Bonferroni test, 165
Boxplot, 45
Bradley-Terry-Luce rule (BTL rule), 562

C
Calinski & Harabasz rule, 489
Causality, 40
causal diagram, 100
regression analysis, 69, 99, 129
Central limit theorem, 26
Chi-square statistic
Cluster analysis, 512
Contingency analysis, 360
Pearson, 323
Choice-based conjoint analysis (CBCA), 578
Choice rules (Conjoint), 561
City block metric (L1-norm), 464
Classification
a posteriori probability, 231
a priori probability, 228
conditional probability, 232
function, 229
matrix, 222
score, 229
table, 283
Cluster analysis, 14, 454
Agglomeration schedule, 471, 476, 487
Hierarchical agglomerative, 470, 474
Partitioning, 523
Cluster center analysis. See K-means clustering
Clustering algorithms, 469, 471
Centroid clustering, 471
Complete linkage, 477
K-means, 490, 493, 503, 523
Median clustering, 471
Single linkage, 474, 493

F
Factor analysis, 382
confirmatory, 14
exploratory, 13
Factor extraction methods, 410
Factorial design, 167
Factor levels of ANOVA, 148
Factor loading matrix
unrotated, 415
Factor loadings, 396
Factor rotation, 418
Factor score coefficients, 420, 439
Factor score determination
Regression method, 420
Summated scales, 420
Surrogates, 419
Factor scores, 419
First-choice rule, 561
F-test, 82, 159, 164, 225
Full-profile method, 541, 581

G
GLS method, 411
Goldfeld-Quandt test, 105
Goodman and Kruskal’s lambda, 363
Goodman and Kruskal’s tau, 363
Goodness-of-fit, 77, 83, 299, 323
Grouping variable, 204

H
Heteroscedasticity, 104
Histogram, 45
Hit rate, 222, 283
Homoscedasticity, 91, 163

J
Jaccard coefficient, 516

K
Kaiser criterion. See Eigenvalue criterion
Kaiser-Meyer-Olkin (KMO) criterion, 390
K-means clustering. See Clustering algorithms

L
LAD method, 72
Lambda coefficient, 363, 370, 378
Latent variable, 5
Latin square, 543
Least-squares method, 71, 92
Leave-one-out method, 224, 304
Level of significance (sig), 29, 34
Levene test, 163, 165, 181, 197
Leverage effect, 117, 307
Likelihood ratio
statistic, 300
test, 301, 309, 593
Linear probability model, 276
Linear trend model, 139
Link function, 271
Logistic function, 268
Logistic regression, 266
binary, 268
multinomial, 314
multiple, 287
Logit, 270, 296
Logit choice model, 348
Logit rule, 562
Log-likelihood function, 290
Log-linear model, 378
Longitudinal data, 5
Lurking variables, 98