ML & AI in Marketing and Sales (2021) Nildri Syam - Emerland

Machine Learning and Artificial
Intelligence in Marketing and Sales

Praise for Machine Learning and Artificial Intelligence in Marketing and Sales
The world of business in general and marketing in particular is

going through a radical transformation with the application of
machine learning and artificial intelligence touching almost every
aspect of business. These advanced tools are no longer solely the
domain of statisticians and data scientists. Business practitioners
find themselves interacting with these methods and, at the same
time, data scientist often find themselves sitting at the management
table managing groups and discussing their analysis. Machine
Learning and Artificial Intelligence in Marketing and Sales:
Essential Reference for Practitioners and Data Scientists strikes a,
difficult to achieve, balance between providing sufficient
information on commonly used but complex machine learning
and AI tools, and yet keeping the book accessible and applicable
to business practitioners with technical orientation. This is a great
introduction book for those who wish to know not only about
machine learning and AI, but also what it really is, and how to
apply it in marketing and sales settings.
Oded Netzer, Professor of Business, Columbia University
This book is a great resource for Data scientists as a reference to

anchor your technical understanding, build your intuition of the
core machine learning models and at the same time elevate it for
application in the real-world context of Marketing and Sales. It
would be a good foundational book for students and applied
practitioners of non-ML, non-stat backgrounds to gain
confidence in your work and to stand behind the choices you
make in your model building process. As the authors say, it does
a good job at bridging the DS-real world application gap. I also
liked the choice of the three models (NN, RF, SVM) to allow for
focus and not overload the reader with tons of other material
available. These three will get 80290% of your job done as a
modeler. I also liked the Executive Summary sections which
appeal to the applied modeler in me and the Technical Detours
which piqued a deeper intellectual curiosity and understanding of
the details. Because they are positioned as detours, I could still
read and use the book without making the technical parts
overwhelming.
Vijay Jayanti, Head of Marketing Data Sciences
at WhatsApp Inc.
In Machine Learning and Artificial Intelligence in Marketing and

Sales, Syam and Kaul have teamed up to present a timely
explanation of important topics in the evolving high-tech world
of big data. For readers well-versed in the Support Vector
Machine, artificial neural nets, and deep learning, the book will be
immediately useful. For readers new to these topics, the authors’
accessible style lowers entry barriers. The book is required reading
for managers, analysts, professors, and consultants involved in
marketing and sales.
David J. Curry. Professor of Marketing, University of Cincinnati
Syam and Kaul’s book is a comprehensive treatise on data science

of marketing, a rich and deeply informative dive into the next
generation of marketing analytics solutions. The work
comprehensively integrates the theoretical concepts of Machine
Learning with practical applications of marketing, making it
essential for either ML Engineers solving marketing problems or
marketing analysts looking to get a rigorous treatment of the
nascent science.
Alex Vayner, Data science and AI expert, Partner, PA Consulting
The authors have skillfully tailored the content to a wide audience.

I found this book as a solid reference guide for students and a
reference for data science practitioners alike. While the book
covers the most important Machine Learning topics in lucid
detail, it also provides insightful executive summaries, and, most
importantly, showcases applications of each model in the practical
world of Sales and Marketing. I will wholeheartedly recommend
this book to anyone interested in learning Machine Learning and
Artificial Intelligence.
Sunish Mittal, Vice President, Data and Analytics, Aramark
This page intentionally left blank
Machine Learning and Artificial
Intelligence in Marketing and
Sales: Essential Reference for
Practitioners and Data
Scientists
BY
NILADRI SYAM
University of Missouri, USA
RAJEEVE KAUL
McDonald’s Corporation, USA
United Kingdom – North America – Japan – India – Malaysia – China

Emerald Publishing Limited
Howard House, Wagon Lane, Bingley BD16 1WA, UK
First edition 2021
Copyright © 2021 by Emerald Publishing Limited.

All rights of reproduction in any form reserved.
Reprints and permissions service

Contact: permissions@emeraldinsight.com
No part of this book may be reproduced, stored in a retrieval system, transmitted in any form or
by any means electronic, mechanical, photocopying, recording or otherwise without either the
prior written permission of the publisher or a licence permitting restricted copying issued in the
UK by The Copyright Licensing Agency and in the USA by The Copyright Clearance Center.
Any opinions expressed in the chapters are those of the authors. Whilst Emerald makes every
effort to ensure the quality and accuracy of its content, Emerald makes no representation
implied or otherwise, as to the chapters’ suitability and application and disclaims any warranties,
express or implied, to their use.
British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library
ISBN: 978-1-80043-881-1 (Print)

ISBN: 978-1-80043-880-4 (Online)
ISBN: 978-1-80043-882-8 (Epub)
This book is dedicated to the memory of Ma and Baba: NS
This book is dedicated to the memory of my parents who encouraged me to
remember that learning is a lifelong process of staying abreast of change: RK
Table of Contents
List of Figures, Tables and Illustrations xiii
Foreword xvii
Preface xix
Acknowledgments xxi
Introduction 1
Chapter 1 Introduction and Machine Learning Preliminaries:

Training and Performance Assessment 5
1. Training of Machine Learning Models 5
1.1 Regression and Classification Models 6
1.2 Cost Functions and Training of Machine
Learning Models 7
1.3 Maximum Likelihood Estimation 9
1.4 Gradient-Based Learning 10
2. Performance Assessment for Regression and
Classification Tasks 13
2.1 Performance Assessment for Regression Models 14
2.2 Performance Assessment for Classification 15
Technical Detour 1 23
Chapter 2 Neural Networks in Marketing and Sales 25

1. Introduction to Neural Networks 25
1.1 Early Evolution 25
1.2 The Neural Network Model 26
x Table of Contents
1.3 Cost Functions and Training of Neural Networks

Using Backpropagation 38
1.4 Output Nodes 40
2. Feature Importance Measurement and Visualization 42
2.1 Neural Interpretation Diagram (NID) 43
2.2 Profile Method for Sensitivity Analysis 44
2.3 Feature Importance Based on Connection
Weights 45
2.4 Randomization Approach for Weight and Input
Variable Significance 48
2.5 Feature Importance Based on Partial Derivatives 49
3. Applications of Neural Networks to Sales and
Marketing 49
4. Case Studies 54
Case Study 1: Churn Prediction 54
Case Study 2: Rent Value Prediction 57
Linear Activation Function for Continuous
Regression Outputs 62
Sigmoid Activations Function for Binary Outputs 63
Softmax Activation Function for Multi-class Outputs 64
Chapter 3 Overfitting and Regularization in Machine Learning

Models 65
1. Hyperparameters, Overfitting, Bias-variance
Tradeoff, and Cross-validation 65
1.1 Hyperparameters 66
1.2 Overfitting 66
1.3 Bias-variance Tradeoff 68
1.4 Cross-validation 70
2. Regularization and Weight Decay 72
2.1 L2 Regularization 73
2.2 L1 Regularization 74
Table of Contents xi
2.3 L1 and L2 Regularization as Constrained

Optimization Problems 75
2.4 Regularization through Input Noise 76
2.5 Regularization through Early Stopping 77
2.6 Regularization through Sparse Representations 77
2.7 Regularization through Bagging and Other
Ensemble Methods 78
Weight Decay in L2 Regularization 80
Weight Decay in L1 Regularization 81
Chapter 4 Support Vector Machines in Marketing and Sales 85

1. Introduction to Support Vector Machines 85
1.1 Early Evolution 85
1.2 Nonlinear Classification Using SVM 86
2. Separating Hyperplanes 88
3. Role of Kernels in Machine Learning 90
3.1 Kernels as Measures of Similarity 91
3.2 Nonlinear Maps and Kernels 94
3.3 Kernel Trick 98
4. Optimal Separating Hyperplane 99
4.1 Margin between Two Classes 99
4.2 Maximal Margin Classification and Optimal
Separating Hyperplane 101
5. Support Vector Classifier and SVM 106
6. Applications of SVM in Marketing and Sales 114
7. Case Studies 120
Case Study 1: Consumer Choice Modeling 121
Case Study 2: Rent Value vs Location 125
xii Table of Contents

Illustration 3 134
Illustration 4 135
Illustration 5 136
Chapter 5 Random Forest, Bagging, and Boosting of Decision Trees 139

1. Early Evolution of Decision Trees: AID, THAID,
CHAID 139
2. Classification and Regression Trees (CART) 143
2.1 Regression Trees 147
2.2 Classification Trees 151
3. Decision Trees and Segmentation 155
4. Bootstrapping, Bagging, and Boosting 158
4.1 Bootstrapping 159
4.2 Bagging 161
4.3 Boosting 165
5. Random Forest 169
6. Applications of Random Forests and Decision Trees
in Marketing and Sales 171
7. Case Studies 176
Case Study 1: Caravan Insurance 177
Case Study 2: Wine Quality 178
Technical detour 1: 179
References 183
Index 191
List of Figures, Tables and Illustrations
Figure 1.1. Gradient Descent for a Minimization Problem. 11

Figure 1.2. Small and Large Step Sizes (Learning Rates) in
Gradient Descent. 11
Figure 1.3. Classifying “1” and “2” and 0–1 Loss. 15
Figure 1.4. Confusion Matrix Showing True Positives and
False Positives. 17
Figure 1.5. Instances Sorted by Decreasing Order of Predicted
Class Probabilities and Confusion Matrix Corre-
sponding to a Threshold of 0.8. 18
Figure 1.6. Instances Sorted by Decreasing Order of Predicted
Class Probabilities and Confusion Matrix Corre-
sponding to a Threshold of 0.584. 19
Figure 1.7. Receiver Operating Characteristics (ROC) Curve. 19
Figure 1.8. Area Under the Curve (AUC). 20
Figure 1.9. Cumulative Response Curve. 21
Figure 1.10. Lift Chart. 22
Figure 1.11. Gini Coefficient. 22
Figure 2.1. Linear Relationship between “Rent Value”and
“Distance to City Center”. 27
Figure 2.2. Nonlinear Relationship between “Rent Value” and
“Distance to City Center.” 27
Figure 2.3. NN with 1 Hidden Node, p Input Nodes and 1
Output Node. 29
Figure 2.4. Input Information at Hidden Node m Is a
Weighted Sum of Input Nodes Plus a Bias. 30
Figure 2.5. Activation Function fm Acts at the Hidden
Node m. 30
xiv List of Figures, Tables and Illustrations
Figure 2.6. NN with 1 Hidden Layer Containing M Hidden

Nodes. 31
Figure 2.7. Information Flow in NN with 1 Hidden Node. 32
Figure 2.8. Sigmoid Function. 33
Figure 2.9. An “Inverted U” Shaped Nonlinear Pattern. 34
Figure 2.10. Simple NN with 2 Hidden Nodes. 34
Figure 2.11. NN with Two Hidden Units to Model a Nonlinear
Relationship. 35
Figure 2.12. Positively Weighted (Left) and Negatively
Weighted (Right) Sigmoids. 35
Figure 2.13. NN for Multiclass Classification with 1 Hidden
Layer. 36
Figure 2.14. Learning Slowdown with Quadratic Cost for
Binary Output. 41
Figure 2.15. NID for Neural Network for Predicting Choice of
PB. 44
Figure 2.16. Profiles for Sensitivity of Predicted Response to
Input Variable. 45
Figure 2.17. NN with 3 Inputs Nodes, 2 Hidden Nodes and 1
Output Node. 46
Figure 2.18. Estimated Weights after Training of Network. 46
Figure 2.19. Contribution of Each Input Neuron to the Output
via Each Hidden Neuron. 47
Figure 2.20. Relative Contribution of Each Input Neuron to
Outgoing Signal. 47
Figure 2.21. Relative Importance of Each Input. 47
Figure 3.1. Plots of Prediction Error versus Model Complexity. 67
Figure 3.2. For a Given Training Data Set, a More Complex
Model (Right) Is Likely to Overfit. 68
Figure 3.3. Fivefold Cross-validation. 71
Figure 3.4. The Estimated Function (Bold Curve) Passes
Closer to Data Points, Say Point a, with Smaller
Decay Parameter (Left Panel) than with Larger
Decay Parameter (Right Panel). 73
Figure 3.5. Training for Too Long (Too Many Epochs) Can
Raise Validation Error. 77
Figure 4.1. Plot of Two Nonlinearly Separable Classes. 87
List of Figures, Tables and Illustrations xv
Figure 4.2. Poor Classification of “1” and “2” Classes Using

Logistic Regression. 87
Figure 4.3. Very Good Classification Accuracy Using Support
Vector Machine. 88
Figure 4.4. Linearly Separable Points “1” and “2” 89
Figure 4.5. The Inner Product Is a Measure of the Angle
between Two Vectors. 92
Figure 4.6. A New Point x Is Assigned to Class. Whose Class
Mean Is Closer to It. 93
Figure 4.7. Two Classes “1” and “2” Are Not Linearly
Separable in Input Space. 96
Figure 4.8. The Transformed Points Are Linearly Separable in
Feature Space. 97
Figure 4.9. Classes Are Separable in Higher Dimensional
Feature Space. 98
Figure 4.10. Separability between Classes Seen from Two Van-
tage Points: East–West and North–South Axes. 100
Figure 4.11. The Margin of the “1” Points from Hyperplane
(Bold Line). 100
Figure 4.12. Same Two Classes Separated by a Wider (Left) and
Narrower (Right) Margin. 101
Figure 4.13. Nearest Points from Hyperplane to Both Classes
Lie Exactly on Margin. 102
Figure 4.14. Support Vectors, Like Point xa, Represents Con-
sumers That Make Hard Decisions. 103
Figure 4.15. Different Likelihoods of Churn When We Move
Away from Separating Hyperplane. 103
Figure 4.16. Linearly Separable Points in Input Space. 105
Figure 4.17. Support Vectors from the Two Classes Are Circled. 105
Figure 4.18. Optimal Separating Hyperplane in Input Space. 106
Figure 4.19. Classes That Are Not Linearly Separable. 108
Figure 4.20. Support Vectors from the Two Classes Are Circled. 111
Figure 4.21. Optimal Separating Hyperplane in Feature Space. 112
Figure 4.22. The Data for the XOR Problem. 113
Figure 4.23. Plot of XOR Points. 113
Figure 4.24. Support Vector Machine Successfully Separates
Classes in XOR Problem. 114
xvi List of Figures, Tables and Illustrations
Figure 4.25. Latitude of Acceptance Noncompensatory Choice

Rule. 122
Figure 5.1. First Binary Partition of a Two-Dimensional
Feature Space. 144
Figure 5.2. Second Binary Partition of a Two-Dimensional
Feature Space. 144
Figure 5.3. Third Binary Partition of a Two-Dimensional
Feature Space. 145
Figure 5.4. Final Partitioned Two-Dimensional Feature Space
Showing Nonoverlapping Regions. 145
Figure 5.5. Tree Diagram of Recursive Binary Partitioning. 146
Figure 5.6. Arbitrary Partition Does Not Allow Easy
Interpretation. 146
Figure 5.7. Hierarchical Clustering of Seven Objects: A
Through G. 157
Figure 5.8. A Dendrogram Is a Tree-Like Representation
of Hierarchical Clustering. 158
Figure 5.9. Clipping the Dendrogram at Different Levels Gives
Different Numbers of Clusters. 159
Table 5.1. Categorical Data for CHAID Analysis. 153

Table 5.2. Training Data Set for Bootstrapping. 160
Illustration 4.1. Nonlinear maps transform nonlinearly separable

classes to linearly separable classes in feature
space. 95
Illustration 4.2. Nonlinearly separable data in input space is
separable in higher dimensional feature space. 97
Illustration 4.3. Constructing an optimal separating hyperplane
using inner products and support vectors. 104
Illustration 4.4. Constructing optimal separating hyperplane for
nonlinearly separable classes using kernels and
support vectors. 110
Illustration 4.5. Identifying support vectors to compute optimal
separating hyperplane for nonlinearly separable
classes. 113
Foreword
The book is written in five chapters covering important concepts, principles, and
practices on contemporary machine learning topics. Each page is written in an
easy-to-read format using clean and lean sentences unencumbered by complex
jargon. Each chapter provides solid theoretical background of the methods
selected, and further elaborates the how and the why aspects of model selection,
model building and model validation; the step-by-step approaches of leveraging
specific techniques such as the Neural Network, the decision trees, and the vector
machines; and the pros and the cons of certain machine learning (ML) proced-
ures. Moreover, each topic provides many real-world examples that connect the
theory with the applied use. Lists and depth of supporting reference materials are
also excellent.
We are at an exciting time where the era of big data, machine learning, arti-
ficial intelligence, cloud computing and advanced analytics is ushering unprece-
dented access to and uses of large volumes of data to improve our predictive
power to unlock transformational changes that impact many aspects of our lives
in the retail, the financial, the manufacturing, the technology, the healthcare, and
other industries.
As both big and small firms alike are fine-tuning their pricing, promotion,
distribution, customer-retention, risk-management, and go-to-market strategies,
data scientists are increasingly expected to know cutting-edge solutions, equip
themselves with many facets of ML techniques and solutions. This book
undoubtedly provides the foundational background, the tools, and the necessary
tips to grasp many of the ML methods currently in use. In addition, as ML is
rapidly and dynamically evolving to impact our daily life, the timeliness of this
book is undoubtedly very appropriate.
I have worked as a data scientist for diverse organizations and have taught
analytics and ML classes in universities. A few pages into this book, I knew that it
is a special treat for my appetite, and it really struck a chord. Many of the well-
organized core concepts expounded in the book are not only refreshing but also
the kind I wish I had long ago. I also dare to describe this book as a versatile tool
and a must-have reference material for both beginners and seasoned data scien-
tists alike, business leaders and those who embrace data and analytics-driven
decision-making processes. In addition, analytics teachers and their respective
students can benefit from the in-depth analysis of the contemporary data science
topics and the plethora of examples provided. I commend the authors for a job
well done.
xviii Foreword
Dawit Mulugeta, PhD (Biostatistics), VP Analytics and Risk Management,

Wells Fargo
Dawit Mulugeta is an applied data scientist. Currently he holds a vice president
of analytics and risk management role with Wells Fargo. In addition, he teaches
popular analytics and ML classes in the Department of Management Sciences as
well as in the Department of Accounting and Information Systems of the Max M.
Fisher College of Business of the Ohio State University in Columbus. Dawit earned
a PhD in biostatistics from the University of Wisconsin in Madison, USA.
Preface
Machine Learning and Artificial Intelligence in Marketing and Sales: Essential

Reference for Practitioners and Data Scientists is intended for a variety of
audiences:
• Marketing and sales practitioners who want to develop a deeper appreciation

for how machine learning models can be applied to problems in marketing and
sales.
• Data scientists who are tasked with implementing data-based solutions to
business problems in the domain of marketing and sales.
• Software engineers and IT developers who will implement, or assist in the
implementation of, and manage the solutions for marketing and sales organi-
zations in their company.
• Students in master’s programs in data science and in MBA programs and
management consultants who wish to have a deeper appreciation of how
machine learning and AI are impacting the world of marketing and sales.
This book represents a fruitful collaboration between an academic and an

industry practitioner. Each author made a genuine attempt to understand and
incorporate in the book the viewpoints and tastes of the other. This main purpose
of this book is to bridge, what we call, the Domain Specialist – Data Scientist Gap
(DS-DS Gap). It is aimed at the translators, that is, the boundary-spanners,
among two distinct audiences – the marketing practitioners and the data scien-
tists. It is the experience of the authors that often in companies one of the biggest
barriers to success is the ability of the technical and sales/marketing business
teams to effectively understand and communicate with one another in solving
business problems. Marketing practitioners and data scientists wishing to work
together to solve marketing and sales issues using the methods of machine
learning need to have a shared understanding of how these methods can be
applied to marketing and sales. This book treads the fine line between the very
technical books on the one hand, and, on the other hand, the purely qualitative
books that merely mention the various AI and machine learning applications in
Any collaboration between an academic and an industry practitioner, with the
former emphasizing theory and the latter emphasizing business applications,
always has a fair bit of tension – the desirable kind of tension that has hopefully
xx Preface
made the final product better. We have tried earnestly to strike a balance between
the theoretical/technical and applications aspects. Having said that, we have
decided to err on the side of applications, anchoring our narrative on the con-
nections between the techniques and their applications in a business setting. We
have deliberately kept the technical details to a bare minimum in the main body of
the text, and have dealt with technical details through the various “Technical
detours” that have been collected together at the end of each chapter. This allows
readers who do not wish to tackle the technical issues to be able to read the
chapters easily without being distracted or overwhelmed by technical details. In
addition, to help readers get a quick overview of the concepts involved, we have
added “Executive summaries” as and when needed. As far as possible in each
chapter we have tried to emphasize the intuitions behind the, sometimes complex,
concepts of machine learning methods.
As far as the marketing and sales practitioners are concerned, the book
assumes that they are interested in actual implementation of machine learning
models, either by themselves or in collaboration with the data scientists in their
organizations. For this reason, this book is not just a high-level overview of
machine learning applications to marketing and sales. Thus, by its very nature the
chapters assume that the reader is willing to handle some technical material.
However, we have made every attempt to make the chapters self-contained by
providing the background material needed to understand the chapter contents.
Each chapter has a section on the existing applications of machine learning and
artificial intelligence (AI) to issues in marketing and sales. We have tried to focus
on applications that have been archived in the major peer-reviewed research
journals in marketing, operations research, machine learning, expert systems, etc.
By their nature, journal articles have details of implementation and data sets and
interested readers can go through the articles listed in the references at the end of
every chapter for further details. Finally, for those wishing to get hands-on
experience of actually running analyses of marketing and sales data using
machine learning models, each chapter has a couple of detailed “Case studies” at
the end.
Acknowledgments
Niladri
I would like to gratefully acknowledge the support of my wife, Nivedita,
without whose patience and encouragement this book would never have materi-
alized. I would also like to thank my teacher, Professor Bibek Debroy, who
introduced me to the power of quantitative modeling and testing of business
phenomena in the 1980s, much before the term “analytics” had become fash-
ionable. The Center for Sales and Customer Development (CSCD) at the Uni-
versity of Missouri provided the support and proper climate which greatly
facilitated the writing of this book.
Rajeeve
This book would not be possible without the unyielding support and encour-
agement of my wife, Shalini, and the patience and understanding of our son,
Harsha. Working on a book while staying abreast with challenging executive roles
required them to sacrifice time that we could have spent building life memories –
for which I am so grateful. I would also like to thank my many professors who
encouraged me to learn and experiment with so many quantitative methods across
diverse fields from statistics, to marketing, finance, operations research, etc. I
further extend my gratitude to the many incredible executives across so many
industries who adopted my quantitative solutions to improve decision in areas
including pricing, marketing, supply chain,and digital among others, and the
companies that allowed me to follow my curiosity to develop and deploy these
models.
Introduction
This book grew out of many discussions that we, the two coauthors, had over
the span of several years. Over the years, the field of machine learning has gone
from an esoteric topic discussed among a small select group of practitioners and
researchers to an ever-growing tsunami of interest and practitioners approaching
an increasingly digitized world from their diverse backgrounds. We were both
interested in machine learning, but we had approached it from two very different
perspectives, and could relate to this heterogeneity in thinking. One of us is an
academic and was focused primarily on the theory of machine learning and in
doing research on this topic. The other author is an industry practitioner and
was focused primarily on the applications of machine learning models in
marketing and sales. Of course, despite the different areas of emphasis each one
of us is interested in both theories and applications, and both realize that these
should go hand in hand. As our discussions progressed, we felt that the existing
resources in machine learning did not quite serve the needs of the diverse
stakeholders that have to work together for successful industry applications of
machine learning in marketing and sales. This motivated the need for a book
that would speak to both the sales/marketing business teams and the data
scientists who are tasked with solving business problems in the domains of
This book takes a different approach compared to two distinct existing
categories of books that deal with machine learning and AI – the technical and
the qualitative books. The former category of books does not focus on appli-
cations in a specific domain in any detail, and often use stylized examples drawn
not from business but from the physical sciences. The latter (qualitative) cate-
gory of books does not provide any details of the statistical and mathematical
concepts that drive machine learning techniques and their applications. Neither
of these types of books serve well the needs of practitioners coming from varied
backgrounds working on actual implementation of machine learning models in
the field of sales and marketing in a business enterprise. As far as the technical
aspects are concerned, we have avoided machine learning algorithms and have
focused instead on the concepts and ideas that underlie machine learning models
and methods. Our interest is in connecting the concepts that underpin these
methods and bridge the gap between the data scientist and the business
Machine Learning and Artificial Intelligence in Marketing and Sales, 1–3

Copyright © 2021 by Emerald Publishing Limited
All rights of reproduction in any form reserved
doi:10.1108/978-1-80043-880-420211001
2 Machine Learning and Artificial Intelligence in Marketing and Sales
practitioner. There are many online and other resources for those readers
wishing to familiarize themselves with algorithms.
A major decision in writing any book is always what to include and what to
leave out. Machine learning and AI are flourishing fields of research with scholars
from diverse disciplines such as applied mathematics, statistics, operations
research, engineering, and computer science actively contributing to them. There
is an enormous variety of machine learning models, and trying to include all these
models would make the book unwieldy and unfit for the non-expert. We have
therefore decided to focus on just three of the most commonly used methods in
marketing and sales applications – Neural Network, Support Vector Machine
(SVM), and Random Forest. A key motivation for this approach is the
acknowledgment that though there are many ideas and approaches, not all of
them have efficacy across a broad set of business problems. As such, it makes
sense to focus on methods that have been validated for their applicability in
solving a wide variety of marketing problems. Importantly, these three models are
exemplars of three distinct classes of machine learning models, and many of the
latest developments in the field are based on these three fundamental models.
Thus, any understanding of the latest developments in machine learning, deep
learning, and AI require an understanding of these models. For example,
advances in deep learning including Convolutional Neural Networks (CNN) and
Recurrent Neural Networks (RNN) are based on the fundamental ideas of Neural
Networks. An SVM is a prominent example of the class of kernel-based machine
learning models. A Random Forest is a good representative of the class of tree-
based learning models and is a good context to discuss ideas of bagging, boosting
and gradient boosting.
This book is about machine learning and Artificial Intelligence (AI). While
different authors have different ideas about the distinction between them, the
consensus opinion is that machine learning is a subfield of AI. At a broad
level, AI is the umbrella term used to denote the entire suite of technologies
that are designed to mimic human abilities. Thus, AI includes machine
learning, deep learning, natural language processing (NLP) etc. Many authors
argue that neural networks are part of deep learning since the latter are
essentially ‘big’ neural networks with many layers with complex intercon-
nections between them. To the extent that machine learning and deep learning
are often identified as distinct subfields of AI (see, for instance, the SAS
Institute white paper by Thompson, Li, and Bolen), the question of whether
neural networks should be included under machine learning often becomes a
matter of taste and preferences. We would like to avoid these matters of
semantics, and hence we have included both machine learning and AI in the
title. The reader can think of the content of the book as “narrow AI” which
is supervised machine learning.
Before discussing the three specific machine learning models, we will discuss
the concepts of training and performance assessment since these concepts are
applicable to all machine learning models. This is done in Chapter 1. Here we also
discuss the linear regression model for a continuous dependent variable (often
called a response variable or a target variable in a machine learning context) and
Introduction 3
the logistic regression model for a categorical dependent variable. These will form
useful benchmarks for the machine learning models discussed in Chapter 2
(Neural Networks), Chapter 4 (Support Vector Machine) and Chapter 5
(Random Forest). In Chapter 3 we discuss the very important concept of over-
fitting and regularization. We have decided to introduce these after the chapter on
Neural Networks since it is easier to grasp these concepts when discussed in the
context of a specific model, even though they are applicable for all models.
Chapter 1
Introduction and Machine Learning

Preliminaries: Training and Performance
Assessment
Chapter Outline
1. Training of Machine Learning Models
1.1 Regression and Classifications Models
1.2 Cost Functions and Training of Machine Learning Models
1.3 Maximum Likelihood Estimation
1.4 Gradient-Based Learning
2. Performance Assessment for Regression and Classification Models
2.1 Performance Assessment for Regression
2.2 Performance Assessment for Classification
2.2.1 Percent Correctly Classified (PCC) and Hit Rate
2.2.2 Confusion Matrix
2.2.3 Receiver Operating Characteristics (ROC) Curve and the Area under
the Curve (AUC)
2.2.4 Cumulative Response Curve and Lift (Gains) Chart
2.2.5 Gini Coefficient
Technical Appendix
1. Training of Machine Learning Models

In this chapter, we will restrict our discussion to models that have a specific
response variable. Response variables are also called target variables and machine
learning models with such variables are known as supervised learning. These
models are distinguished from unsupervised learning models, like clustering
models, which do not have pre-specified response variables. We first describe
briefly two categories of supervised learning models that are of interest to us –
regression models and classification models. They are distinguished by the “type”
of response variable.

doi:10.1108/978-1-80043-880-420211002
1.1 Regression and Classification Models

Regression and classifications models are discussed in almost all statistics text-
books and we will not repeat these details here. We only mention them very
briefly to set the stage for the discussion of machine learning models in the later
chapters of our book.
Regression models have a continuous response variable (often called a
dependent variable). We will consider the case of a linear regression. Consider the
case of a consumer products company that provides free samples to consumers to
induce trial and also word-of-mouth to sell its products. Sometimes these com-
panies may have their salespeople stationed at various retailers to distribute their
samples in the hope that, after trying it, the consumers will like the product and
purchase the product after their initial trial (“trial-and-repeat” purchase models in
marketing). There is obviously some time lag between trial and repeat, and
suppose the company wants to understand how their distribution of samples in a
given month induces repeat purchases in the next month. We will denote the
number of samples in a given month, say, October, as X and the number of repeat
purchases in November by Y. We can treat the number of purchases Y as a
continuous variable. Thus, Y is the continuous response variable and X is the
explanatory (also called independent) variable. A simple model to predict Y based
on X is
Y ¼ w0 1 w1 X 1 « (1.1)
The epsilon term («) at the end is the error term. It captures the fact that the
relationship between X and Y has randomness owing to a host of factors. The
common sources of randomness are the many other factors that also affect
purchases in November apart from trials in October. Of course, these have not
been modeled, and thus, there will be errors when we use only one explanatory
variable to predict purchases in November. In the simple linear regression
above, the effect of the number of trial samples in October is given by the
parameter w1 (parameters that multiply inputs are also called coefficients and in
machine learning models like Neural Networks, they are called weights). The
slope, given by w1, intuitively captures the additional purchases in November
due to an extra trial sample in October. The intercept, given by w0, intuitively
captures the purchases in November if there were no trial samples in October
(in machine learning models like Neural Networks this parameter is called the
bias). Instead of just one explanatory variable, one could include other variables as
well on the right-hand side of the equation, and then we would have a multiple
regression.
In this book, we will refer to a model with a continuous response variable as a
regression model and different machine learning techniques can be used to analyze
such models. The traditional linear regression described above can serve as a
useful benchmark to compare with the more recent machine learning models.
In marketing the response variable we are often interested in is categorical. For
instance, consider the case of a bank that wants to predict whether its customers
are likely to churn (leave) or not. A sales organization may be interested in
Introduction and Machine Learning Preliminaries 7
categorizing their prospects as being either in the “buy” or “not buy” category. In
lead scoring, a sales organization may want to categorize their sales leads as
belonging to one of many different classes based on their propensities to buy: very
unlikely, unlikely, likely, very likely. These are classification tasks, with the first
two being binary classification and the third being multiclass classification.
We will briefly describe the case of binary classification. The traditional
workhorse for analyzing models with a binary categorical response variable is a
logistic regression. In the bank churn example, suppose the two classes are “churn”
or “not churn,” and the bank wants to understand to what extent the amount of
“balance” that the customer has is predictive of churn. The answer is not clear a
priori. On the one hand, a customer with a large balance can be considered as
having a deeper relationship with the bank, and therefore, less likely to churn. On
the other hand, such attractive customers are targets of competitive offers from
other banks and are more likely to churn. We use the balance a customer has in the
bank as the explanatory variable X. The response variable Y 5 {11, 21} is coded
as: 11[ “churn” and 21 [ “not churn.” We cannot use a linear regression here
since we would like to model the probability of churning, and unlike the contin-
uous response of a linear regression which can take on any value, probabilities
have to lie in the interval [0, 1].
The logistic regression works by defining p 5 Probability(Y 5 11), and then
positing the relationship
Log½p=ð1 2 pÞ ¼ w0 1 w1 X1 (1.2)
The term on the left-hand side, Log[p/(12p)], is called the log odds ratio. This
formulation generates the probability of churning, p. It also ensures that the sum,
Probability(“churn”) 1 Probability (“not churn”), adds up to 1 as is expected of
probabilities. Based on these probabilities, one can classify customers as
belonging to the category “churn” (“not churn”) if p . 0.5 (p , 0.5).
In this book, we will refer to a model with a categorical response variable, both
binary and multiclass, as a classification model, and various machine learning
models can be used for classification tasks. The logistic regression described
above can serve as a benchmark to compare with machine learning classification
models.
1.2 Cost Functions and Training of Machine Learning Models

Machine learning practitioners often talk of cost functions. Take the example of a
company trying to predict sales of a certain product. Data are available over
many past periods, and in each period, sales are affected by factors like the
company’s own price, advertising spending, and the competitor’s prices among
other factors. Given this situation, we want to accurately predict the sales of the
company. One way to do this is to create a mathematical formulation (model)
that allows us to predict sales based on observed factors (like price, advertising,
etc.) for each recorded period in the past. Then, we can compare the actual past
sales value against the sales value predicted by this model to see how well the
model is performing. In this case, the cost function is a function of the difference
between the predicted output of the model and the actual sales value for all past
periods. The model is said to perform well when the cost (also called error or loss)
is minimized. The minimization of cost is achieved by choosing appropriate
parameters of the mathematical model. This process is called training the machine
learning model.
For a machine learning model, training is said to occur when the model esti-
mates the “best” values of the parameters. What does best mean? At this point, we
formalize the concept of a cost function a bit more. Consider the linear regression
model specified above. Given a specific input data point, X 5 x, and some values of
the parameters (weights), the regression model can make a prediction f (x).1 This is,
given specific values of w0 and w1 and a data point x, the regression model makes a
prediction y 5 f (x) 5 w0 1w1x. On the other hand, the input data point x has an
actual observed y (also called target) associated with it. Intuitively, the cost function
measures the discrepancy between the model prediction y and the actual y for all
possible values of input x. The goal of training is to choose those parameters
(weights w0 and w1) that minimize this cost. These cost minimizing weights are the
“best” weights.
In our discussions earlier, the cost function was based on sales – specifically it
was the difference between actual observed sales and the sales predicted by the
model. In business, typical “performance indicators” one encounters are sales,
margins, inventory balances, profits, hours worked, and payroll to name a few.
Any of these, or any combination of these, could be used to define the cost
function.
By far the most common technique for training most machine learning models
is to use the method of Maximum Likelihood Estimation (MLE). Maximum
likelihood estimators have desirable statistical properties and are therefore
advantageous to use. Elements of this philosophy are also applied extensive in
deep learning. It is noteworthy that there is a close theoretical connection between
cost (loss) functions and maximum likelihood, and therefore we will use the
maximum likelihood framework to address the issue of choosing appropriate cost
functions. A standard result from statistics is that, when the errors in a regression
model are Gaussian then minimizing the sum-of-squares cost with respect to the
weights is equivalent to maximizing the log-likelihood. In the maximum likeli-
hood framework, the appropriate cost function for regression-type outputs is the
sum of squares cost (loss), and for binary output the appropriate cost is the cross-
entropy cost. Expressions for the sum-of-squares cost function for regression and
the cross-entropy cost function for binary classification are in the appendix.
Technical Detour 1
1
The uppercase denotes the variable and the lowercase is a specific value of that variable.
1.3 Maximum Likelihood Estimation

As we have mentioned already, although one simple way to estimate the param-
eters is to minimize the cost function, a more general approach is to use the
principle of Maximum Likelihood Estimation. Intuitively speaking, the maximum
likelihood estimation principle says that the most reasonable values of the unknown
parameters are those that maximize the probability of the observed training sample.
The minimization of the cost function, or maximization of the likelihood, is an
optimization problem and the gradient based methods are very frequently used to
accomplish this. We discuss them in the next section.
Every textbook on statistics has detailed discussions of maximum likelihood
estimation and we will not repeat it here. Our goal in introducing maximum
likelihood, very briefly, is only to give an intuition and to have consistent notation
for all our sections. Suppose there are p predictor variables X1,…, Xp. Recall that
we use uppercases for the variables and lowercases for specific observations which
will constitute our data points. In the observed data, (xi, yi), each observed input
data point xi is a p-dimensional vector xi 5 (xi1,…, xip), i 5 1,…, N, with i being
the index for the observation. The response, or target, corresponding to the ith
observation is yi. If there are N input data points, x1,…, xN, where each point is be
a p-dimensional vector, we consider that these xi are drawn independently from a
true data distribution pdata(x). However, the true data distribution is unknown, and
therefore, we need an estimate of pdata(x). We accomplish this by positing a
probability model which is a family of distributions (indexed by parameter u) called
the model distribution pmodel(x; u). For a specific u 5 u9, pmodel(x; u9) provides an
estimate of the true probability pdata(x). Though we would ideally like to match the
model distribution with the true data distribution, since we do not know the true
data distribution we instead match the model distribution with the empirical dis-
tribution pdata defined by the training data. Hence, the task of estimation reduces to
the task of finding the right parameter u which minimizes some measure of “dis-
tance” between the model distribution pmodel(x; u) and the empirical distribution
pdata. Intuitively speaking, the likelihood function measures the match (goodness-
of-fit) of the model distribution with the empirical distribution. It is defined as a
function of the parameter u. Clearly, minimizing the distance between the model
and the empirical distribution is equivalent to maximizing the match between them.
Maximum likelihood estimation is the process of choosing that u which maximizes
the likelihood which is a measure of the match between the model and empirical
distributions.
To build intuition, consider an illustration. Suppose in a classification context
we are trying to distinguish between people who default (on a loan) versus people
who do not. The training data is (xi, yi), i 5 1,…, N, where xi is a set of explanatory
variables and yi is 1 or 0 depending on whether individual i has defaulted or not.
The distribution of 1s and 0s corresponding to all the yis, i 5 1,…, N, in the
training data is the empirical distribution pdata. Suppose, we use a model (say,
logistic regression) which takes the xi as the input for individual i and outputs a
predicted probability yi 5 Pr{ith individual defaults}. Then the distribution of
predicted probabilities across all individuals in the training data set is the model
distribution pmodel(x; u). The goal of the maximum likelihood estimation is to find
that u9such that in the distribution pmodel(x; u9) we have a probability as close as
possible to 1 for individuals who are defaulters and as close as possible to 0 for
non-defaulters.
A desirable feature of the maximum likelihood approach is its connection with
the cost function. We could simply define the negative of the likelihood as the cost
that needs to be minimized (as a technical matter, the logarithm of the likelihood
is used, but this does not change the conceptual ideas). From information theory
the negative of the log likelihood is the cross-entropy between the empirical data
distribution and the model distribution. Minimizing this cost will give us the u
that is consistent with maximum likelihood. The cross-entropy cost is widely used
in machine learning and it has the advantage of being derived from maximum
likelihood estimation procedures. This unified framework for identifying a cost
function has a major advantage in that it obviates the need for coming up with
different cost functions for different models, but rather we can define the cost
function as soon as we specify a model distribution pmodel(x; u).
Technical Detour 2
1.4 Gradient-Based Learning

How exactly are model parameters chosen? We will describe the optimization
process by considering the minimization of the cost function, even though the
maximization of the likelihood function follows a similar principle. Essentially this
is an optimization problem, and the workhorse for cost minimization using com-
puters is the gradient descent algorithm (for maximizing the likelihood, it is ascent
instead of descent). The gradient descent proceeds in a stepwise fashion where the
weights and biases are initialized to some starting values and then updated in a
sequence of steps until a stopping rule is applied to stop the process. As the name
suggests, gradient descent requires the calculation of gradients of the cost function
with respect to all the parameters. In keeping with the applied orientation of this
book, we will give an intuitive understanding of how gradient descent works.
Consider, for simplicity, the case of only one weight w in the cost function
C(w). The task here is to find that w which minimizes C(w). Starting from an
initial value for the weight the process moves stepwise and the weight is updated
at each step. Suppose the value of the weight at step t is wt. The gradient descent
prescription is to update the weight as follows:
wt 1 1 ¼ wt 2 g:½Gradient of cost function at wt (1.3)
In the simple case where there is only one weight to be determined, the
geometrical analog of the gradient is the slope of the cost function as shown
in Fig. 1.1. It is given mathematically by the derivative of the cost function with
respect to the weight. For multidimensional cases with many weights the equivalent
of this derivative is called a gradient. In Fig. 1.1, the slope (tangent line) at wt is
Fig. 1.1. Gradient Descent for a Minimization Problem.
negative and so, from the previous formula for weight updating, the next value of
the weight, wt11, will be larger than wt.
On the other hand, if we are currently at a point “a” on the right side of the
minimum point, the slope is positive, and the updating rule would result in a smaller
weight. In either case, the weights keep gradually moving toward the minimum
point.
The other quantity in the weight updating equation is g which is a learning rate
parameter. It determines the rate at which, starting from an initial weight, the
updating procedure converges to the minimum point. It is called the learning rate
because it determines the rate of approach to the minimum point which is the goal
of learning in machine learning models. This parameter is exogenously chosen by
the model builder and there are complexities involved in its choice. The most
serious one is that, if the learning rate is too large then the learning procedure
could overshoot the minimum point thus failing to converge, and worse, may even
diverge. Fig. 1.2 makes this clear. The picture on the left shows a small learning
rate which allows the updating procedure to slowly approach the minimum point,
whereas in the picture on the right shows a large learning rate where weight
updating overshoots the minimum point.
Fig. 1.2. Small and Large Step Sizes (Learning Rates) in

Gradient Descent.
Of course, if the learning rate is too small, the training will take an unac-
ceptably long time. So, a proper choice of the learning rate is a hyper-parameter.
The model builder will need to try different learning rates and judge their effec-
tiveness, using cross-validation and other methods, before settling on an appro-
priate learning rate. Overall, the plain vanilla gradient descent has oftentimes
been found to be unstable and slow. To the extent that these problems are caused
by the inappropriate choice of the learning rate parameter, machine learning
researchers have suggested stable methods of choosing this parameter so that we
are guaranteed to converge to a local minimum regardless of the starting point.
Here we will not discuss these more technically advanced methods, many of which
use Newton’s method involving second derivatives or Hessians that take into
account the rate of change of slope.
In the case of multiple weights, we operate in multi-dimensional space and
the direction of descent becomes critical. In such cases we have to study the
directional derivatives at various points in multidimensional space. Think of the
mars rover that finds itself on a slope, tasked with finding the bottom of a basin.
In order to descend to the bottom most rapidly, the rover should determine the
most efficient course to the bottom. If the rover could only roll down a foot at a
time, which direction would it roll first? Well, since the goal is to get to the
bottom it is reasonable to expect that the rover would want to go as far down as
it can in that foot of distance covered. This would be achieved by finding the
slope (rise over run) for the one foot move in all the different directions the rover
can move. The first step taken should be in the direction where the reduction in
altitude is greatest, and this is the direction of greatest slope (steepest descent).
As the rover moves one foot at a time from each location, it goes lower and
lower till it can go no lower-telling the computer that it has reached the bottom
of the basin. Intuitively speaking, the gradient specifies the optimal direction of
descent.
Another important issue in gradient descent is to determine how much of the
training data set should be used. Note that the calculation of the cost function
uses the training data (see “Technical Detour 1” in the appendix). Therefore, each
step of gradient descent requires the algorithm to process the entire training data
set to calculate the gradient. This causes severe slowing down of gradient descent
when the training data is large. Mini-batch gradient descent is a clever trick that
lets us get around this problem. In this method the whole training data is divided
into smaller mini batches. These mini batches are then used for training, and the
advantage is that weight updating can start as soon as we process a mini batch
rather than having to wait to process the entire training data before updating
weights. Thus, the batch size, say B, is another hyper-parameter in tuning neural
networks. When B 5 N, that is, when the entire training data us used we are back
to the standard gradient descent, also called batch gradient descent. In the polar
opposite case, when B 5 1 we have stochastic gradient descent (SGD). SGD is
often used when we have real-time streaming data. In these cases, we perform
online learning, where the gradients are calculated and weights updated for each
single training data point and then the results are averaged to get an estimate of
the weights. Though an infinite sequence of incoming training data was an early
motivation for weight updating using each incoming training data point, since by
definition one could not wait for the “entire” data, this idea is now routinely used
even when there is a finite batch of data available. In SGD first there is a random
permutation of the training data, and then data points are drawn one by one
without replacement. After each draw the gradient is calculated and the weight
updating is done. The weight estimate is calculated as the average of all the
weights so calculated during a pass over the entire data. In the boxed text here, we
summarize our discussion.
Executive Summary
Training a machine learning model refers to the process of learning (deter-
mining) the unknown parameters using the training data. For example, for
neural networks the unknown parameters are the weights and biases. The usual
method is to learn the parameters by minimizing a cost function (also called
loss or error). The cost function is a measure of model performance, and the
best performing models have the minimum cost. The cost function calculates
the difference between the model prediction, of some managerially relevant
performance indicators such as sales, profit, margin, and market share, and the
observed target.
Training via cost minimization is done using gradient descent. The term
“descent” refers to the goal of reducing the cost function till we achieve its
minimum. In multidimensional space, there could be many possible directions
of descent, and the gradient specifies the direction, which will result in the
maximum reduction of the cost function. Intuitively speaking, in gradient
descent the weights are updated in a stepwise manner, where each step is
taken in a direction that reduces the cost the most. The process stops when we
reach the minimum cost, and the weights corresponding to the minimum cost
are the desired “optimal” weights. Gradient descent requires the analyst to
choose a step size (learning rate), which is usually done by cross-validation. A
small (large) learning rate will require more (less) time for convergence, but is
less (more) likely to overshoot the point of minimum cost.
If the training data set is large, all the data may not be used. The subset of
data chosen for training is called the mini batch, and gradient descent
calculated using it is called mini batch gradient descent. For streaming data,
the mini batch size is one since weights are updated using gradient descent for
each incoming training data at a time. This is the idea behind stochastic
gradient descent.
2. Performance Assessment for Regression and

Classification Tasks
The measurement of performance of any statistical model is important. A new
model may need to be benchmarked against other existing models, a focal model
may need to be compared vis-à-vis other models of the same family but which
have been parameterized differently, or different types of models may need to
be compared to see how they are able to explain the data etc. A critical aspect of
the performance of a machine learning model is that we are mostly interested
in the performance of the model on new data that has not be used to train
(estimate) the model. Said more formally, the new data is test data and we want a
model that will have low test error. Since the ability of a model to fit test data well
is related to its ability to generalize beyond the training data, the criterion of
minimizing test error is also equivalently stated as minimizing generalization
error. Of course, the model itself is trained (estimated) using the training data and
by choosing parameters (weights and biases) that minimize the training error.
Thus, a good machine learning model will have to balance (1) training error, and
(2) test error. While the machine learning model seeks to minimize training
error, it also needs to minimize the gap between test error and training error. In
terms of the underfitting-overfitting dichotomy, discussed in detail in Section 1 of
Chapter 3, a large gap between training and test error is a sign of overfitting.
Having made the case for assessing performance using the test (generaliza-
tion) error, which we will define formally in Chapter 3, it is worth having an
intuition for why training error cannot be used as a good estimate of test error.
Indeed, a central tension, as far as performance assessment is concerned, is the
fact that often these two errors have very different behaviors. Very complex
models are better able to capture the underlying patterns, and perhaps even
idiosyncrasies, of the training data, and yet, precisely for this reason may not be a
good fit with new data that was not part of the training – namely, the test data. In
other words, a very complex model is likely to have very small training error but
a large test error. The specific functions used for model assessment and evalua-
tion depend on the model. While the technical statistics literature has used many
assessment measures, we will only present the most commonly used ones in
applications.
2.1 Performance Assessment for Regression Models

For regression type models with continuous response variables the most commonly
used fit measure is the quadratic cost function given by C(u) (see the Technical
Appendix at the end of the chapter). Thus, the model’s performance is assessed by
computing the quadratic cost for a new test data set. This measure is similar to the
Mean squared error (MSE). If the MSE is calculated on the test data, it is referred
to as the Predicted MSE (PMSE). Another commonly used measure is the Mean
absolute error which replaces the squares in C(u) by the absolute values. An analyst
may sometimes prefer a relative measure which makes the assessment metric unit-
free. For example, the mean squared error is normalized by using total sum of
squares, much like the R2 in linear regression. In order to account for overfitting,
we often use assessment criteria that are modified by applying a penalty to the
number of parameters to be estimated. The Adjusted R2, AIC or BIC are examples
of such measures.
2.2 Performance Assessment for Classification

We now turn to the various measures to assess performance for classification
tasks.
2.2.1 Percent Correctly Classified (PCC) and Hit Rate

For qualitative or categorical response variables, model assessment is based on
other cost functions. To set the stage for a formal definition of model assessment
for categorical response variables, suppose in a K category case the categorical
response yi corresponding to input data point xi can take one of K values indexed
by k 5 1, 2,…, K. Our probability model yields a probability for an observation
xi to belong to category. Thus, for each xi there are K probabilities, and opti-
mality requires that the predicted category of xi is that category with the highest
probability. The simplest cost function for categorical response is the 0–1 loss: Let
Hi 5 1 if the prediction is correct and 0 if it is incorrect. The 0–1 loss function
leads to the simplest approach to quantifying model performance which is the hit
rate or percent correctly classified. If the test data is (xi, yi) and is of size N then the
hit rate is
1 N
Hit rate ¼ + Hi (1.4)
N i¼1
The major drawback of the hit rate or PCC is that this is an overall measure
and does not account for individual class performance. If we look beyond overall
performance and also consider how the classifier performs with respect to each
individual class then we realize that PCC has serious shortcomings. A visual
depiction will make this concept clear. Consider a simple example of classifying
the “1” and “2” where there are (many) more “2” than “1” in the data set
(Fig. 1.3).
Fig. 1.3. Classifying “1” and “2” and 0–1 Loss.

Clearly, if we assume a 0–1 loss, on which PCC is based, then misclassification

is minimized by simply classifying all points as “2”. If in some practical appli-
cations such points like the “1” have to be correctly identified, then the PCC or
test error rate criteria will fail miserably. Another drawback is that once the
classification threshold is crossed, the PCC does not account for the distance
between the actual and the predicted values.
2.2.2 Confusion Matrix

A useful, visually friendly tool to assess the performance of models for categorical
data is the Confusion Matrix. In a binary classification case, the confusion matrix
provides a simple method of visualizing the model’s predictions vis-à-vis the
actual data. Suppose the two classes are “1” and “2”. Then we can conveniently
display the classification model’s performance in a 2 3 2 matrix as follows.
The diagonal entries are correct predictions and their sum (or proportion of
total data) is representative of the accuracy of the classifier for any test data set.
The true positives are examples (observed input data x) which have been predicted
to be in class “1” and are also actually in class “1”. The false positives are
examples that have been predicted to be in class “1” but are actually in class “2”.
It is important to realize that the default binary classifier (regardless of whether
we use neural networks, discriminant analysis, support vector machines, etc.) will
assign an example to a class or not based on whether the predicted probability of
belonging to that class is greater than 0.5 or less than 0.5. This, of course, is a
consequence of the classification rule that assigns an example to a class if the
probability of belonging to that class is the highest (among all probabilities of
class membership). In the binary classification, this specializes to a threshold
probability of 0.5. The standard classification algorithm proceeds by trying to
minimize the total error but does not care about the error rates in different classes.
However, in many practical applications class-wise errors may be more important
that overall total error. We will illustrate with an example below.
Suppose a sales organization builds a propensity scoring model that will be used
to guide its salesforce regarding which customers to call. Essentially, the training
customer data has customers who either “Buy” (“1”) or “Not-buy” (“2”) and,
based on a set of demographic, psychographic and behavior variables (the set of
independent variables X), the propensity scoring model will give predicted
probabilities of buying (“1”) or not buying (“2”) for a new test group of cus-
tomers who were not part of the training. The firm’s salespeople will make calls
on customers based on the customers’ probabilities of purchase. In this industry
getting customers to buy is very important due to the large revenue generated
(compared to the cost of an additional call), and therefore the firm would like the
salesforce to make a call on a customer when the propensity to buy exceeds some
threshold. The default classification rule would use a threshold of 0.5 and put a
potential customer, X 5 x, in the “1” class if Pr{Class 5 “1”}.0.5. Using the
default classification rule, it is quite possible to have a large number of customers
who actually will buy (Actual 5 “1” in Fig. 1.4) but are predicted not to buy
(Predicted 5 “2” in Fig. 1.4). In terms of Fig. 1.4 the number of “false negatives”
Fig. 1.4. Confusion Matrix Showing True Positives and

False Positives.
could be very high. Thus, even though the overall error rate may be low, among
the customers who will actually buy (column Actual “1”) there may be an
unacceptably high prediction error rate. The salesforce will then not make calls on
many customers who most likely would buy and this flies in the face of the selling
strategy of the firm. An obvious way to correct this would be to change the
classification threshold so that a new customer x is assigned to Y if Pr{Class 5
“1”}.0.2. This will increase the overall error (since we are moving away from
the optimal threshold of 0.5 for binary classification), but importantly, will reduce
the error rate among the critical group of customers who will actually buy.
Depending on the application context, decision makers may be willing to make
such tradeoffs.
2.2.3 Receiver Operating Characteristics (ROC) Curve and Area under the Curve
(AUC)
As is clear from the above discussion, different classification thresholds lead to
different entries in the cells of the confusion matrix. That is, for each possible
classification threshold there is a confusion matrix corresponding to it. To see this
more clearly, consider the following table showing just 10 instances (data points)
from a larger data set of 20 instances with 10 instances whose true class is “1”
and 10 instances whose true class is “2”.
In Fig. 1.5, the 10 instances have been sorted in decreasing order of the prob-
ability of being in the “1” class as predicted by our probability model (say, logistic
regression). This is the column titled Pr{Class 5 “1”}. Suppose our classification
threshold is 0.8. This means that if the predicted probability of an instance is
greater or equal to 0.8 then we classify it as “1” and if the predicted probability is
less than 0.8 then we classify it as “2”. We can see from the column “Pr{Class 5
‘1’}” that 5 of the 10 instances are classified as “1” (since there are 5 probabilities
greater than or equal to 0.8). The confusion matrix corresponding to a classification
threshold of 0.8 is the 3 3 3 matrix on the right. We can see that the sum of cell
entries in row “Predict ‘1’” is 5. Now, from the column “True Class” we can see
that, of the five instances that are predicted to be in “1” category, only three have a
true class of “1” and two have a true class of “2”. Hence the numbers 3 and 2 in
the row “Predict “1” ”. Similarly, if the classification threshold is 0.584 then we
have the confusion matrix as shown in the 3 3 3 matrix on the right in Fig. 1.6.
The Receiver Operating Characteristics (ROC) curve is a simple graphical
depiction of how the two errors – true positives and false positives – can be
simultaneously depicted for all possible thresholds. Fig. 1.7 shows a typical ROC
curve.
The point (0.2, 0.3) corresponds to the threshold of 0.8 as shown in Fig. 1.5. If
we convert the cell numbers in the confusion matrix in Fig. 1.5 to rates, then the “False
positive rate” 5 2/(2 1 8) 5 0.2 and the “True positive rate” 5 3/(3 1 7) 5 0.3.
The point (0.5, 0.5) in the ROC curve in figure corresponds to the threshold of 0.584
in Fig. 1.7.
The better a classifier is, the higher will be the proportion of true positives and
so the ROC curve will hug the left and top lines (top left corner) more and more.
Clearly, the point (true positive 5 1, false positive 5 0) corresponds to perfect
classification. The dashed 45° line corresponds to random chance. While the ROC
curve has a lot of information about how the relative proportions of true and false
positives change as the threshold changes, sometimes it is useful to have one single
summary measure of the performance of a binary classifier. This summary
measure is the Area Under the Curve (AUC), and as the name suggests it is simply
the area under the ROC curve. A classifier that is no better than random chance
will have an AUC of 0.5. This is because the diagonal line represents the state of
random chance, and clearly the area below it is 0.5 as shown in Fig. 1.8. Perfect
classification means that the ROC curve coincides with the top and left margins
and so the area under it is 1. Thus, the AUC lies between 0.5 and 1. Finally, it can
be shown that, up to a transformation, the AUC is equivalent to the Gini coef-
ficient which we discuss later in this section.
Fig. 1.5. Instances Sorted by Decreasing Order of Predicted Class

Probabilities and Confusion Matrix Corresponding to a Threshold of 0.8.
Fig. 1.6. Instances Sorted by Decreasing Order of Predicted Class

Probabilities and Confusion Matrix Corresponding to a Threshold of 0.584.
Fig. 1.7. Receiver Operating Characteristics (ROC) Curve.

Fig. 1.8. Area Under the Curve (AUC).
2.2.4 Cumulative Response Curve and Lift (Gains) Chart

As is perhaps clear from our discussion above, the ROC curve is not the most
intuitive and easy to explain method of evaluating classification outcomes. For
ease of exposition, analysts often use the cumulative response curve to give a
simple visual means of evaluating a classifier’s performance. The steps to obtain
the cumulative response curve are:
(1) Given a probability model, generate the predicted probability of “1” (positive
response) for each customer in the test sample
(2) Sort the customers by decreasing order of predicted probabilities of “1”
(3) Draw a graph such that:
• The “Percentage of customers” is on the X-axis in decreasing order of their

predicted probabilities of “1”
• The “Cumulative percentage of actual ‘1’” is on the Y-axis.
The typical cumulative response curve is shown below (Fig. 1.9).

Suppose the overall response (belonging to class “1”) in the data set is 10%.
So, if we randomly mail the promotion to the entire dataset we will get a 10%
response. However, suppose we mail to the top 10% of the dataset based on the
10% customers with the highest predicted probabilities. We may then find that the
actual response is 30%.
A lift (gains) chart is a tabular depiction showing the lift in each decile, where
the lift measures how the response rate in each decile of the data compares to the
average response rate in the entire data set. The lift chart is generated as follows.
(1) Generate predicted probability of “1” (positive response) for each customer
in test sample
Fig. 1.9. Cumulative Response Curve.
(2) Sort by decreasing order of predicted probabilities of “1”

(3) Divide all customers sequentially into groups. We often group them into
deciles (thus, higher deciles have customers with higher predicted probabilities)
(4) Calculate the average actual response rate (actual “1”) for each group
(5) Lift chart shows relationship between the ordered groups and lift for a group.
Lift is response in a group as proportion of overall response
The table below shows a typical lift (gains) chart (Fig. 1.10).
While the entire lift curve is informative, sometimes we want just one summary
measure to gauge the performance of a classifier. A convenient summary is the top
decile lift which is lift of the top decile. Another summary measure is the variation
in response rates across all 10 deciles. A visual depiction of the lift is given by a lift
(gains) curve. A lift (gains) curve can be drawn by using lift as Y-axis and the
“Percentage of customers” on X-axis in decreasing order of their predicted prob-
abilities of “1”.
2.2.5 Gini Coefficient

The Gini coefficient is a very commonly used summary measure especially in
database marketing applications. The Gini coefficient uses the area under the
Lorenz curve which is essentially the curve depicting the cumulative percentage
of actual responders. The cumulative percentage of responses is plotted on the
Y-axis (as in the cumulative response curve in Section 1.4) and the cumulative
percentage of customers in decreasing order of their predicted probabilities of
response is plotted on the X-axis.
Fig. 1.10. Lift Chart.
In Fig. 1.11, let
AL 5 Area under cumulative responses curve (Lorenz curve), and

AR 5 Area under diagonal line representing random performance
Then the Gini coefficient 5(AL2AR)/AR.
Fig. 1.11. Gini Coefficient.

Since AR 5 0.5, the Gini coefficient 5 2AL21. This makes it clear that the
Gini coefficient ranges between 0 and 1. The perfect classifier would have a
Lorenz curve that coincides with the left and top borders on the figure and so the
AL 5 1. Hence, the Gini coefficient 5 1. On the other hand, the worst classifier,
one that does no better than random chance, has an AL 5 0.5. Hence, the Gini
coefficient is 0.
TECHNICAL APPENDIX
Technical Detour 1
We present the expressions of the cost functions for regression and a binary
classification problem. Suppose there are p predictor variables X1,…, Xp. There
are N observed data points, (xi, yi), i 5 1,…, N. Each observed input data xi is a
p-dimensional vector xi 5 (xi1,…, xip), and the response corresponding to the ith
observation is yi. The sum-of-squares cost, and indeed the cross-entropy cost, for
a regression model is
N
CðuÞ ¼ + ðyi 2 f ðxi ÞÞ2 (A1.1)
i¼1
Consider the case of binary classification. The expressions for the cross-entropy
cost (negative of log likelihood) are simplified if the response yi is treated as 0 or 1.
The probability of yi 5 1 is obviously conditioned on xi and the parameter u:
Prob(yi 5 1│xi, u). To simplify notation, we will omit the conditioning arguments.
The cross-entropy cost is given by
N
CðuÞ ¼ 2 + yi logPrðyi ¼ 1Þ 1 ½1 2 yi log½1 2 Prðyi ¼ 1Þ (A1.2)
i¼1
Technical Detour 2
By definition, the maximum likelihood estimator for u is
N
uML ¼ argmaxpmodel ðx; uÞ ¼ argmax ∏ pmodel ðxi ; uÞ (A1.3)
u u i¼1
where the last expression follows because xi are independent. The product term
in the last expression is the likelihood function. Because the logarithm is mono-
tonic, taking the logarithm of the likelihood function does not change the optimal
choice of parameter u. This gives us the log-likelihood. It can be shown that this
gives us an equivalent definition of the MLE,
uML ¼ argmaxEx;pdata log pmodel ðxi ; uÞ (A1.4)
u
The negative of the log likelihood is the cost that needs to be minimized. This
cost is called the cross-entropy between the empirical data distribution and the
model distribution, and is defined as
CðuÞ ¼ 2 Ex;pdata ½log pmodel ðx; uÞ (A1.5)
We can see that minimizing this cost will give us the u that is consistent with
maximum likelihood estimation.
Chapter 2
Neural Networks in Marketing and Sales
Chapter Outline
1. Introduction to Neural Networks
1.1 Early Evolution
1.2 The Neural Network Model
1.2.1 NN for Regression
1.2.2 NN for Classification
1.3 Cost Functions and Training of Neural Networks Using Backpropagation
1.4 Output Nodes
1.4.1 Linear Activation Function for Continuous Outputs
1.4.2 Sigmoid Activation Function for Binary Outputs
1.4.3 Softmax Activation Function for Multiclass Outputs
2. Feature Importance Measurement and Visualization
2.1 Neural Interpretation Diagram (NID)
2.2 Profile Method for Sensitivity Analysis
2.3 Feature Importance Based on Connection Weights
2.4 Randomization Approach for Weight and Input Variable Significance
2.5 Partial Derivatives Approach
3. Applications of Neural Networks in Marketing and Sales
4. Case Studies
Technical Appendix
1. Introduction to Neural Networks

1.1 Early Evolution
Neural networks can be considered to be the quintessential machine learning
models. They are among the oldest and best studied of all machine learning
models. The origins of modern neural networks can be traced to the work of
Warren McCulloch and Walter Pitts in their seminal paper, “A Logical Calculus
of Ideas Immanent in Nervous Activity” that appeared in the Bulletin of Math-
ematical Biophysics in 1943. From the title of their work it is clear that the early
neural networks were inspired by, though not designed to exactly model, the
human brain. These mathematical models employed a kind of threshold logic

doi:10.1108/978-1-80043-880-420211003
where various “nodes” received inputs from some nodes and then, in turn, acted
upon other nodes based on whether they were sufficiently activated, that is, if their
activation exceeded a threshold. In the machine learning terminology, nodes are
sometimes referred to as units and we will switch between the two. This creates a
network of nodes, with nodes in the initial layer acting on nodes in intermediate
layers, until finally, some desirable outputs are obtained from the output layer.
This architecture of a neural network is loosely based on the patterns of con-
nections between neurons in the human brain where a given neuron receives
electrical signals through dendrites emanating from a preceding layer of neurons.
The recipient neuron then, in turn, sends spikes of electrical signals to other
neurons when excitatory input dominates inhibitory inputs. Thus, the electrical
signals either excite or dampen activity in connected neurons and learning occurs
by changing the influence of neurons on other neurons. Of course, this is a gross
idealization of the human brain since much of the working of the brain is not yet
understood.
At a very simple level, a neural network (NN), sometimes also called an
artificial neural network (ANN), has training and testing modes. Firing rules are
the key to how a NN learns. In the training mode the NN is trained to fire/not fire
for different input patterns (in training data). In the testing mode the NN sees if
an input training pattern is detected in new, non-training data. If yes, the cor-
responding output becomes the current output. If not, then the NN uses the firing
rule to determine current output. In more recent times there has been considerable
work on the Bayesian approach to neural networks. In any case, Bayesian neural
networks build on the fundamental ideas of neural networks discussed in this
chapter, and we will not discuss the additional complexities of the Bayesian
approach here.
1.2 The Neural Network Model

Owing to their strong predictive abilities, NNs have been widely used for predicting
sales in many different industries (Carbonneau, Laframboise, & Vahidov, 2008).
We use this as a motivating example. Consider the problem of predicting real estate
sales in any city. Dollar sales in the case of real estate markets are dependent on the
real estate valuation. There could, of course, be many predictor variables, but
perhaps the strongest predictor of real estate valuation of a property is the location
of the property. For simplicity consider the relationship between “Rent value” (Y)
and the “Distance to the city center” (X), where the latter measure operationalizes
the location of the property. According to the traditional monocentric land value
model, the land value decreases monotonically as distance from central areas
increases (Frew & Wilson, 2002). Since cities developed around ports or railheads,
most businesses and their ancillary supporting commercial establishments grew
organically around the city center. This made the central areas attractive for
commercial, and also residential, establishments, thereby increasing their rent
values. Suppose an analyst believes that the monotonically decreasing relationship
between rent value and distance to the city center could be captured reasonably well
by a linear relationship. Let Y be “rent value” and X be “distance to city center.” If
Neural Networks in Marketing and Sales 27
Fig. 2.1. Linear Relationship between “Rent Value” and

“Distance to City Center”.
we were to assume a linear relationship and fit a linear regression it would be

mathematically represented as: yi 5 w0 1 w1 xi 1 «i , where i is the ith observa-
tion. Here yi and xi are specific values of the variables X and Y. A graphical
representation of the relationship would be as follows (Fig. 2.1).
However, what if the relationship between “rent value” and “distance to city
center” is actually nonlinear as given in Fig. 2.2?
The relationship shown in Fig. 2.2 between rent value and distance to the city
center is consistent with a multicentric land value model. This model posits that
there are multiple “centers” in a city, and apart from the first “center” which is
essentially the historic city center there is also a second “center” which is at a
moderate distance from the city center. The second center represents an attractive
suburban location. Indeed, Frew and Wilson’s (2002) empirical analysis finds
precisely this pattern.
Fig. 2.2. Nonlinear Relationship between “Rent Value” and

“Distance to City Center.”
Clearly a linear regression is not appropriate and we will have to generalize,

beyond the assumed linear relationship, to consider nonlinear relationships. This
generalization is important since, in the real world, relationships between variables
seldom follow nice linear forms. The assumption of linearity is often done for
simplification, but in the process, we lose a lot of information about the real world
and so we end up with models that don’t really explain or predict reality accurately.
We could generalize to a nonlinear relationship by positing a mathematical
relationship such as yi 5 f ðxi Þ 1 «i , where the function f ðxi Þ generalizes the
linear function assumed in a linear regression (in the linear case we simply have
f ðxi Þ 5 w0 1 w1 xi Þ. Though many complex nonlinear functions can be used,
consider the most common nonlinear functional form which is a polynomial:
f ðxi Þ 5 w0 1 w1 xi 1 w2 x2i 1 … 1 wM xM i . This idea motivates a further gener-
alization where the right-hand side of the previous equation can be represented as
a weighted sum of many simpler functions fm ðxi Þ. In the specific case where f (xi)
is a polynomial, the simpler functions that constitute the sum are:
fm ðxÞ 5 xm for m 5 0; ::::; M. Seen from the viewpoint of fitting functions to
explain relationships between variables, a neural network, in essence, captures the
idea of representing complex nonlinear functions as a weighted sum of simpler
functions. Such representations of complex functions as a sum of other functions
have been studied in statistics as “basis function regression” where the component
simpler functions in the weighted sum are the “basis functions.”
See Technical Detour 1
Executive Summary
Neural networks can be conceptualized as fitting complex functions to explain
relationships in the data. Seen through the lens of fitting a complex function
to explain real-world relationships, neural networks provide a very flexible
architecture for replicating such complex functions by weighted sums of
simpler functions.
Below we will give more details on how a neural network implements this idea
of expressing a highly complex, nonlinear, function f (xi) in terms of a weighted
sum of other simpler (basis) functions fm(xi).
1.2.1 NN for Regression

We first describe a NN for the univariate regression case, that is, where there is a
continuous one-dimensional target variable Y. Suppose there are p predictor
variables X1,…, Xp. The target observations are yi, i 5 1,…, N, with i being the
index for the observation. In the observed data, (xi, yi), each observed input xi is a
p-dimensional vector xi 5 (xi1,…, xip). Again, making a connection with the “rent
value” example, suppose, apart from “distance to city center,” the rent values are
affected by the condition of the property in terms of upgrades and updates, square
footage, number of bathrooms, the area of the basement, if any, and macro
factors which include economic indicators like incomes, interest rates, employ-
ment rates etc. We use two summary measures called “condition” and “mac-
ro_factors” for these other determinants of rent value, apart from location. So
here p 5 3 and the input vector is xi 5 (xi1, xi2, xi3)5(distance to city center,
condition, macro factors). In machine learning terminology, the inputs are called
features. Vector notation and vector calculations allow exposition of calculations
with multiple features in simple neat forms instead of having to deal with
cumbersome notations. Moreover, computer programs and libraries are written in
code that is optimized for vector and matrix calculations.
To clearly see the components of a NN, consider the simplest NN with one
hidden layer containing just one hidden node, apart from the p input nodes and
the one output node as shown below. To keep the notation in the figures simple,
we will suppress the index i for the ith observation vector xi 5 (xi1,…, xip) and
instead use the vector of input features x 5 (x1,…, xp) and target y (Fig. 2.3).
The p input nodes correspond to the p-dimensional input vector x 5(x1,…, xp),
and the single output node corresponds to the one-dimensional output y. In a
neural network the input information received at each hidden node (or simply,
input at the hidden node) is a weighted sum of the p input nodes (x1,…, xp) plus a
bias at that hidden node. Thus, the input at a specific hidden node – say, node m -
is: wm0 1 wm1x1 1 wm2x2 1…1 wmpxp. In this summation the quantity wml is the
weight corresponding to input node xl (l 5 1,…, p) and wm0 is the bias at hidden
node m. It may be useful to visualize this in our NN in Fig. 2.4 where there is a
single hidden node m.1We have circled the input information at hidden node m (in
dashes) to draw attention to it.
At each hidden node a function, called an activation function, acts on the
information received at that node, i.e., on the weighted sum of input nodes. The
activation function, as the name suggests, acts as a fence or control valve, which
Fig. 2.3. NN with 1 Hidden Node, p Input Nodes and 1 Output

Node.
1
As shown in “Technical detour 2” the summation wm0 1 wm1x1 1 wm2x2 1…1 wmpxp can
be written conveniently as +l wml xl .
Fig. 2.4. Input Information at Hidden Node m Is a Weighted Sum of

Input Nodes Plus a Bias.
Fig. 2.5. Activation Function fm Acts at the Hidden Node m.
activates/opens to allow the information from the hidden neuron to proceed to the
output node. Fig. 2.5 shows the activation function fm acting at the hidden node m
and creating an output hm that goes to the output node (see expression in dashes).
So far, we have discussed a NN with just one hidden node to simply illustrate
the information flow in the network. Of course, in most cases there are many
hidden nodes. Suppose there are M hidden nodes and one output y as shown in
Fig. 2.6.
Each hidden node has an activation function acting on it, where the activation
function at hidden node m (m 5 1,…, M) is fm.2 Just as the input information
received at a hidden node is a weighted sum of input nodes, so too the input
information received at the (single) output node is a weighted sum of the hidden
nodes. One can easily see that the output node therefore receives information
which is a weighted sum of functions, since an activation function acts at each
hidden node. Recall our conceptualization of a NN as representing complex
functions by a weighted sum of simpler (basis) functions. One could think of the M
hidden nodes (indexed by m 5 1,…, M) as playing the role of M basis functions,
and the information received at the output node as a weighted sum of the basis
2
The reader should note that hm is just the output from hidden node m after the activation
function fm acts on the input information received at that node (see Fig. 2.5).
Fig. 2.6. NN with 1 Hidden Layer Containing M Hidden Nodes.
functions. In the standard NN the activation functions at all hidden nodes have the
same functional form and differ only in how inputs xi are weighted. Thus, in
Fig. 2.6, fm 5f, for all hidden nodes m 5 1,…, M.
To see a concrete illustration of how different weight vectors are used for
different hidden nodes, let us revisit the “rent value” example. The activation
function at the first hidden node would act on a weighted sum of the three inputs
(recall, that there is also a constant). So, the activation function for first hidden
node acts on
constant1 1 a* distance_to_city_center 1 b*condition 1 c*macro_factors
In this case p 5 3 and the weight vector for m 5 1 (hidden node 1) is w1 5 (w10,
w11, w12, w13) 5 (constant1, a, b, c). Similarly, the activation function at hidden
node 2 would act on
constant2 1 d* distance_to_city_center 1 e*condition 1 f*macro_factors
The weight vector for m 5 2 (hidden node 2) is w2 5 (w20, w21, w22, w23) 5
(constant2, d, e, f).
We purposely motivated a NN using the conceptualization of these networks
as models that fit complex functions to capture relationships in the data. This
conceptualization can serve to allay much of the (sometimes unfair) character-
ization of a NN as a black box. Even though the interpretation of the weights may
not be straightforward, the connection with basis function regression, which has a
long history in the statistics literature, tells us that the simple NN model is hardly
a mysterious formulation.
We now turn to the output node. Just as an activation function acts on the
input information received at a hidden node, similarly we have an activation
function that acts on the input information received at the output node. One
motivation for this is that some contexts, such as classification, require a final
transformation at the output nodes. Unlike regressions, which we have discussed
Fig. 2.7. Information Flow in NN with 1 Hidden Node.
so far and which have a continuous dependent variable, classification requires the
calculation of a probability of class membership. Probabilities should be numbers
in the interval [0, 1] and all probabilities should add to 1. Since the un-
transformed raw output from the output node can take any value (which vio-
lates the requirement for a probability), the activation function at that node
transforms the un-transformed values into probabilities (see Section 1.2.2).
To distinguish activation functions acting at the hidden layer versus the output
layer, we use the superscript “(1)” and “(2)” respectively. Recall, as stated earlier
in the current section, the activation functions at all nodes in the hidden layer
have the same functional form. We denote the activation function at the hidden
layer as f (1) and at the output layer as function f (2). To fix ideas, it may be useful to
visualize the full information flow in a NN with one hidden layer. In the interests
of not cluttering Fig. 2.7, we will again show only one hidden node – node m – but
display the activation functions at both the hidden node and the output node. As
with the activation functions, to distinguish the weights at the hidden and output
nodes we use superscripts “(1)” and “(2)” respectively.
Fig. 2.7 shows only one hidden node merely for simplicity, but in a more
general network with M hidden nodes (as in Fig. 2.6) the activation function f (2)
at the output node will act on the weighted sum of hidden nodes h1, … hm plus a
bias at the output node.
See Technical Detour 2
Executive Summary
Consider a neural network with one hidden layer. Each hidden node in this
layer has an activation function that acts on the information received at that
node. The information received at a hidden node is a weighted sum of all the
input nodes plus a bias. The activation functions at different hidden nodes
have the same functional form and only differ in the weights assigned to the
weighted sum of input nodes.
The activation function at the output node acts on the information

received at that node. The information received at the output node is a
weighted sum of hidden nodes plus a bias. If there are multiple output nodes,
the activation functions at different output nodes have the same functional
form and only differ in the weights assigned to the weighted sum of hidden
nodes.
We have spoken of activation functions in general terms without providing any

details of specific types of such functions that are commonly used. In Section 1.4
we have a detailed discussion of output nodes, which depend on the type of task
(regression or classification), but here we will discuss the activation function at the
hidden units. Though this is not an exhaustive list, some commonly used acti-
vation functions for the hidden nodes are the Sigmoid function, Rectified Linear
Units (ReLU) which have recently become very popular, especially in deep
learning, and Maxout Units which are a generalization of the ReLU. The Sig-
moid is a very commonly used activation function for the hidden nodes, and we
use it to give an intuition for how NNs can fit nonlinear functions to capture
complex patterns in the data. The sigmoid activation function is a nonlinear
function which takes any possible value of z (between negative and positive
infinity) and maps it to the interval between 0 and 1 (see “Technical detour 6” for
a formal definition). It has an S-shape as shown in Fig. 2.8.
The theoretical basis for why NNs can be used to approximate complex
functions comes from what has been called the Universal Approximation Theo-
rem. There are many variants of the universal approximation theorem and they
prove that any arbitrary function can be approximated by a NN with a single
hidden layer of nodes. In particular, it has been shown that a NN with one single
hidden layer with the sigmoid activation function can do the job of approximating
any arbitrary function (Cybenko, 1989). In order to provide an intuition for how
hidden nodes with the sigmoid activation function helps the NN capture
Fig. 2.8. Sigmoid Function.

Fig. 2.9. An “Inverted U” Shaped Nonlinear Pattern.
nonlinear relationships, let us consider a commonly occurring pattern in many

marketing and sales applications – the “inverted U” shape (Fig. 2.9).
The careful reader will discern an “inverted U” shape in a certain region of
Fig. 2.1 that shows the relationship between “Rent value” and “Distance to city
center.” This is at the second “center” of the city corresponding to a suburban
location. The above nonlinear pattern can be captured by a NN with two hidden
nodes as shown in Fig. 2.10. To focus on the hidden nodes, and to keep the
network uncluttered, we will consider only one input node. Drawing a parallel
with our opening example of predicting “rent value,” we have one input node or
feature x (corresponding to the input variable “distance to city center”) and one
output node y (corresponding to “rent value”).
Now, suppose the weights on the links between nodes in Fig. 2.10 have the
pattern as shown below in Fig. 2.11 the important thing to notice is that the
Fig. 2.10. Simple NN with 2 Hidden Nodes.

Fig. 2.11. NN with Two Hidden Units to Model a Nonlinear

Relationship.
Fig. 2.12. Positively Weighted (Left) and Negatively Weighted

(Right) Sigmoids.
weights from the two hidden nodes to the output node are opposite in sign (“a,”
“b,” “c,” and “d” are all positive quantities)
The information received at the output node is a weighted sum of two sigmoid
activation functions with a positive and negative weight respectively. What is the
resultant shape? Shown below in Fig. 2.12 are a positively and a negatively
weighted sigmoid. It is easy to see that the weighted sum of these two can capture
an “inverted-U shaped” relationship.
By suitably stitching together many sigmoids with suitably adjusted weights, a
NN can capture many complex relationships between the input and output var-
iables. In our specific “rent value” example, one can now see how a NN with
several hidden nodes, each with a sigmoid activation function acting on it, can
capture the nonlinear pattern shown in Fig. 2.1. A more technical description of
the network in Fig. 2.10 with one input node, two hidden nodes with a sigmoid
activations, and one output node is given in “Technical detour 3.”
Technical Detour 3
As we have already mentioned, the sigmoid function outputs values between

0 and 1 and therefore it can be used as the activation function at the output nodes
in the case of binary outputs – for example “Buy/Not buy”– since it outputs
probabilities. Finally, we note that if we have multiple output variables (yi is
K-dimensional, say), then there will be K output nodes corresponding to the
case of a multivariate regression. Pictorially, the network will be similar to that
in Fig. 2.13.
The “function fitting” view of NN that we have emphasized is also helpful to
conceptualize networks with multiple hidden layers, such as Deep Networks, as
“composite functions.” Such chain structures of composite functions are repre-
sented as networks (see “Technical detour 2” for details). In a NN with n total
layers (including the output layer but not counting the input layer) there are n21
hidden layers (layers 1 through n21) with layer n being the output later. The
number of layers is called depth of network. The function f ( j) is the activation
function for hidden layer j, j 5 1,…, n21. Function f (n) is the activation function
for the output layer n and allows one final transformation of the vector of out-
puts. NNs use specific activation functions f ( j) at different layers j depending on
the type of NN.
Most NN applications in business, and especially in marketing and sales, have
used a single hidden layer. Multilayer NNs are mainly used in the context of Deep
Fig. 2.13. NN for Multiclass Classification with 1 Hidden Layer.

Learning and there are many other aspects of the architecture of deep networks
(Convolutional NNs, Recurrent NNs, etc.) that are much more complex than
simple single hidden layer NNs. Deep learning is an evolving field and its business
applications are still nascent. In the current chapter we will only cover NNs with a
single hidden layer.
1.2.2 NN for Classification

Many interesting sales and marketing questions require methods of classifica-
tion. For example, “lead scoring” in sales requires potential customers to be
classified into groups with various propensities to purchase. In marketing,
customers can be segmented into groups based on the amount of business they
do – heavy versus light users. We now turn to the multi-class classification
case. Again, consider N observations (xi, yi), i 5 1,…, N, where each input xi is
p-dimensional vector. If the classification has K classes indexed by k 5 1,…, K,
then there are K target measurements yk, k 5 1,…, K. Each of these is coded 1
or 0 depending on whether class k occurs or not. When K-class classification is
modeled as a NN we have K nodes where node k models the probability of
belonging to class k.3 Again, we will only assume one hidden layer with M
hidden nodes {h1,…, hM} and K output nodes {y1,…, yK}. The NN can be
shown pictorially shown as,
Similar to the univariate regression case with only one output, the activation
function f (1) at hidden node m operates on the weighted input vector xi (corre-
sponding to the ith observation) and outputs hmi for m 5 1,…, M. However,
different from the univariate regression case, here there are K outputs yki (k 5
1,…, K) for each input vector xi. For output node k define activation function fk(2)
which acts on the weighted sum of the M hidden nodes h1,…, hM.4 The final
ð2Þ
activation function at output node k, fk , allows one final transformation of the
information received from the hidden layer. Summarizing, the idea of a neural
network for multi-class classification is a simple extension of a univariate
regression, except that we have as many output nodes as there are classes
(compare with Fig. 2.6).
Most modern NNs for multi-class classification use the softmax function as the
final activation function f (2) to transform the continuous output from the output
node k to valid probabilities. Essentially, the softmax function allows us to map
output vectors that may have positive and negative valued components into
probabilities where each component is bounded by 0 and 1 and where the values
of the vector components sum to 1 (see “Technical detour 6” for a formal
3
It is now clear why sometimes the final transformation f (2) is needed at the output layer.
For classification, this is the transformation that generates class probabilities from the un-
transformed, and unrestricted, continuous output that the NN would otherwise yield.
4
At the output node k the activation function is f k(2) (w(2)k hi). This function acts on w(2)k hi
which is the weighted sum of hidden nodes (see the dot product notation in “Technical
detour 2”). The weight vector is w(2)k 5 (w(2)k0, w(2)k1, w(2)k2,…, w(2)kM) and vector of hidden
nodes is hi 5 (h0i, h1i,…, hMi), where h0i 5 1.
definition). The softmax activation function serves as a gate, or control valve,

classifying the input into one or the other classes (categories).
The vector of parameters of the NN with a single hidden layer, denoted
collectively as the vector u, can be summarized as
ðwð1Þ m0; wð1Þ m1; wð1Þ m2; :::; wð1Þ mpÞ; m ¼ 1; :::; M; ← Mð p 1 1Þweights
ðwð2Þ k0; wð2Þ k1; wð2Þ k2; :::; wð2Þ kMÞ; k ¼ 1; :::; K; ← KðM 1 1Þweights (2.1)
1.3 Cost Functions and Training of Neural Networks

Using Backpropagation
In the case of a neural network, training is said to occur when the neural
network estimates the “best” values of the parameters given in (2.1) above. As
mentioned in our discussion of cost functions in chapter 1, the best parameters
are the cost minimizing parameter values. The cost could be defined in terms
of any business-relevant factor. For instance, it could be defined in terms of
sales, profits, inventory balances, etc. So, a neural network may be deployed to
predict the critical issue of minimum inventory balance – it is costly for the
company to buy and maintain too much inventory, but on the other hand too
little inventory results in stock-out costs. This is especially true for e-commerce
businesses that have large warehouses that hold inventory to ship based on
orders received from their websites. In this case the cost function would be the
difference between the model prediction of inventory balance and the actual
inventory balance.
As mentioned in Chapter 1, most machine learning models are trained using
the technique of Maximum Likelihood Estimation (MLE), and the concepts of
ML Estimation are applied to solving for parameters of NNs as well. Incidentally,
to foreshadow our discussion of output units in Section 1.4, there is also a very
close theoretical relationship between the type of output units and cost functions
used. Thus, the choice of a cost function using the maximum likelihood frame-
work is related to the type of output units of the neural network. We have seen
that, for Gaussian errors in a regression model the minimization of the sum-of-
squares cost with respect to the weights is equivalent to maximizing the
log-likelihood. Recall also that in the maximum likelihood framework, the
appropriate cost function for regression-type outputs is the sum of squared error
cost (loss), and for binary output the appropriate cost is the cross-entropy cost.
Details of the sum-of-squares cost function for a neural network for regression
and the cross-entropy cost function for a neural network for K-class classification
are in the appendix.
Technical Detour 4
In the boxed text below, we summarize our discussion above.
Executive Summary
The cost function provides a measure of how the neural network model is
performing. Performance is measured in terms of how close the model is able
to predict the actual observed data. Since the cost function is calculated using
the actual data, it is affected by the size of the actual training input data set,
the weights and biases used in specifying the network, and the activation
functions. The more commonly used cost functions are quadratic cost and
cross entropy for neural networks for regression and classification problems
respectively.
As we saw in Chapter 1, training of machine learning models proceeds by

using gradient descent. The specific application of gradient descent to NNs is
called the backpropagation algorithm and it relies on the chain-rule of calculus.
The name backpropagation comes from the idea that the algorithm involves
propagating the errors back from the output layer to the hidden layer and then to
the input layer. The gradient based update requires the derivatives of the cost
function with respect to all the weights and biases in the network. As can be seen
from (2.1) above, a neural network can have many weights and biases, and the
situation is even more complicated in the case of Deep Networks with many
hidden layers. This could make the standard gradient descent very inefficient.
The backpropagation algorithm is a recursive program that allows efficient
computation of the gradients. The complete backpropagation algorithm has two
passes and works as follows:
(1) Suppose we are at step r. Given a set of weights (as in (2.1)), the forward pass
computes the predicted f (xi) for input data xi.
(2) Then for the backward pass the errors are calculated at the nodes in the
output layer and are backpropagated using a recursion to compute the errors
at the hidden nodes.
(3) Both sets of errors are then used in calculating the gradients at the outer and
hidden layers.
(4) Finally, the gradients are used in the gradient descent updating rules. This
yields the weights at step (r 1 1). The algorithm proceeds iteratively till some
stopping rule terminates it.
Why is the training algorithm for neural networks called the backpropagation
algorithm? From (2.1) we can see that we need to estimate the weights at both the
hidden and output layers. It turns out that when gradient descent is applied to a
neural network, the gradients with respect to the weights at both the hidden and
output layers involve the difference between the model prediction f (xi) and the
actual observed yi. These discrepancies are therefore errors which are defined at
the hidden and output nodes. The algorithm works by propagating this error from
the output node back to the hidden node in a convenient recursive manner (see
step 2 in algorithm above). Since computers can easily handle recursive rela-
tionships using logical “do loops,” the backpropagation algorithm allows efficient
calculations. Detailed derivations of the backpropagation equations are given in
the appendix.
Technical Detour 5
1.4 Output Nodes

NNs provide a set of hidden nodes, denoted by hi 5 (h1i,…, hMi) corresponding to
an input xi, and the output layer transforms an affine combination of these hidden
features to perform the task that we want our NN to perform. Below we distin-
guish between three very common tasks in machine learning, and focus on their
respective output nodes: regression, binary classification, multiclass classification.
1.4.1 Linear Activation Function for Continuous Regression Outputs

For the univariate regression case, given hidden nodes hi 5 (h1i,…, hMi), corre-
sponding to input xi, a linear output node f (2) produces an affine transformation
which is essentially a linear transformation. For estimation purposes it is generally
assumed that in the regression equation yi 5 f (xi) 1 ei, the output yi follows a
Gaussian distribution with mean f (xi). In this case, the maximum likelihood
approach for estimation is equivalent to minimizing quadratic loss function (least
squares error). The above can be straightforwardly extended to the multivariate
case. In this case for the kth output (k 5 1,…, K), the regression equation is yki 5
fk(xi) 1 eki, As far as the rate of learning is concerned, there is no “learning
slowdown” with linear activation functions at the output nodes since they do not
saturate, unlike sigmoid activation functions at the output nodes as described
below.
1.4.2 Sigmoid Activation Function for Binary Outputs

When a binary output y is needed, there is only one output node which has a
sigmoid activation function f (2) 5 s (defined in “Technical detour 6”). Thus, given
the weighted sum of the hidden nodes, zi, the activation at the (single) final output
node is the sigmoid function acting on zi.5 In the binary case we can have only two
possible outcomes, 1 or 0, and we assume that the output y, given x, has a Ber-
noulii distribution. Therefore, the model needs to only predict P(y 5 1) and the
final transformation by the activation function f (2) (zi) ensures that these outputs
are valid probabilities.
M
ð2Þ
The weighted sum of the hidden nodes is zi 5 + wm hmi .
5
m50
The cost function used by maximum likelihood estimation in the classifica-

tion case is the cross-entropy cost (as given in (A2.3) in the appendix). It is
worthwhile exploring the advantage of using cross-entropy cost compared to
quadratic cost in the binary classification context. Essentially, quadratic cost in
this context would suffer from a learning slowdown problem. It can be shown that
the learning rate depends on the rate of change of the quadratic cost, which in
turn depends on the rate of change of the sigmoid function with respect to zi.
The rate of change of the sigmoid function with respect to zi, which is its slope, is
small for very small and very large values of zi. In Fig. 2.14 if the desired output
is 0 but the current prediction f (2) (zi) is close to 1 then the rate of change, and
therefore the learning rate, is very slow. Of course, the derivative is also very
small for small values of zi. Obviously, this is due to the property of the sigmoid
activation function f (2), and some authors refer to this as the “saturation” of the
sigmoid in some regions. The use of quadratic cost with sigmoid activation
functions violates a desirable characteristic of how human beings learn – we tend
to learn most rapidly when we are very wrong! However, with the squared error
cost, we just saw that when the function value is close to 1 (corresponding to
large zi) but the desired value is 0, implying a situation of being “very wrong,”
then learning is very slow. A cross-entropy cost function has this desirable
feature as we will see next.
Suppose instead that we use the cross-entropy loss. In our specific case, the
training data is (xi, yi), i 5 1,…, N, where xi is a set of explanatory variables and
the one-dimensional target variable yi is 1 or 0 for each individual i. The learning
rate in the cross-entropy function does not depend on the derivative f (2)9(zi) (see
“Technical detour 6”). Moreover, with the cross-entropy function the learning
rate is higher when the difference between the predicted value f (2) (zi) and the true
target value yi, is large. In other words, the learning is faster when the model is
more wrong in terms of predicting the correct target output. This property of the
cross-entropy functions makes it particularly desirable to use in classification
problems.
Fig. 2.14. Learning Slowdown with Quadratic Cost for Binary

Output.
1.4.3 Softmax Activation Function for Multiclass Outputs

When a multi-class output y is needed, the softmax activation function is used
for output nodes. Essentially, the softmax function is a generalization of the
sigmoid function to multiple classes (categories). Suppose there are K classes
with targets yki, k 5 1,…, K, taking values 1 or 0 depending on whether
observation (say, an individual) i belongs to class k or not. The NN provides a
weighted combination of intermediate outputs generated by the hidden layer as
zki for k 5 1,…, K.6 We use the softmax function as the final activation
ð2Þ
function fk to transform a continuous output at node k, zki, to probabilities (see
“Technical detour 6”).
Maximum likelihood estimation for multi-class classification uses the cross-
entropy cost much like in the case of a binary output. The cross-entropy cost
allied with softmax activation at the output layer has the same advantage of
allowing the NN to learn most rapidly when the error is largest (just as in the
sigmoid activation for the binary case above).
It is worth noting that the term softmax comes from the fact that the softmax
function provides a continuous (hence “soft”) approximation to the discontinuous
function max(.). Just as an illustration, consider two points 0 and x. Clearly
the function max{0, x} is discontinuous at 0 but softmax(0, x) is continuous.
Thus, when we need to optimize in situations involving the max function and its
discontinuity becomes problematic, the softmax approximation can be used
instead.
Technical Detour 6
2. Feature Importance Measurement and Visualization

Much of the research and practice-related work in neural networks has focused
on prediction. Obviously, prediction is important, but if one wants to use
neural networks to inform one’s decisions about domain-specific matters then
the interpretation of neural network outputs becomes very important. Inter-
pretable machine learning is an active area of research and, broadly speaking,
it also goes under the name of Explainable AI. To take a familiar example, it is
not enough for sales managers to forecast/predict sales but also to understand
the most important drivers of sales – personal calls, price, feature displays,
advertising, promotional discounts etc. There are quite a few methods that
have been developed for interpreting neural network weights and for quanti-
fying and graphically visualizing feature (input variable) importance. To that
6
Corresponding to output node k, the weighted sum of the hidden nodes is
M
ð2Þ
zki 5 + wkm hmi .
m50
extent, the unfair characterization of neural networks as black boxes is quite

undeserved. We will discuss:
• Neural interpretation diagrams (NID) (Özesmi & Özesmi, 1999)

• Profile method for sensitivity analysis (Lek et al., 1996)
• Feature importance using connection weights (Garson, 1991)
• Randomization approach (Olden & Jackson, 2002)
• Partial derivatives approach (Dimopoulos, Bourret & Lek 1995)
As mentioned, interpretative machine learning is an evolving research field and

so, our list above is by no means exhaustive. They are some of the more
commonly used, and research based, methods designed specifically to interpreting
neural network outputs.
2.1 Neural Interpretation Diagram (NID)

Özesmi and Özesmi (1999) have proposed using a neural interpretation diagram
to better visualize the strength and valence of connections between nodes. A
neural interpretation diagram provides a visual depiction of a neural network
with:
• Thickness of lines between nodes representing magnitudes of weights on the

links from a node in a layer to a node in the next layer
• “Black” lines for positive weights (positive effect) and “gray” lines for negative
weights
In this chapter we consider (shallow) neural networks with only one hidden
layers, apart from an input and an output layer. Therefore, there are links from
nodes in input layer to nodes in the hidden layer and also links from nodes in
hidden layer to nodes in output layer. The NID method is purely graphical and
does not quantify the magnitude of the impact of an input variable of the output.
Instead it provides directional inferences about the impact of input variables. For
a neural network with only one hidden layer the effect of an input variable is
positive if there is a: (1) Positive input-hidden and positive hidden-output links, or
(2) Negative input-hidden and negative hidden-output links. The effect of an
input variable is negative if there is a: (1) Positive input-hidden and negative
hidden-output links, or (2) Negative input-hidden and positive hidden-output
links.
As an illustration, consider the problem of classification where the task is to
classify industrial buyers of a company, say PB, that sells spirometers. Spirom-
eters are medical devices used for measuring the volume of air inhaled and
exhaled by the lungs. They are used, among many other things, to diagnose the
lung capacity of people with COPD. PB sells to institutions like hospitals and
smaller clinics and it wants to predict whether a medical entity will “Buy” or “Not
buy” their product. This binary dependent variable was called “ChoicePB.” The
Fig. 2.15. NID for Neural Network for Predicting Choice of PB.
company has data from past purchases and wants to use this data to make their
binary prediction. The predictor (independent) variables used are: 1. Price;
2. Storage/retrieval; 3. Repairs/service (low cost); 4. Sanitary; 5. Supplies (avail-
ability); 6. Easy to operate; 7. Service (quick response); 8. Accuracy (provides
accurate readings). Using the statistical package R, we fit a neural network with
one hidden layer containing three nodes, and the NID for this network is dis-
played in Fig. 2.15.
We will demonstrate the interpretation of directional impact of an input
variable on the output using the case of price. We choose price since the
directional impact of price on the probability of choosing a product is simple
– price should have a negative impact. In Fig. 2.15, we can see that the link
from price to all the hidden nodes H1, H2 and H2, are negative (gray lines).
Further, the links from all the hidden nodes H1, H2 and H3 to the output
node are positive (black lines). Thus, the overall effect of price on the
probability of choosing PB is negative, consistent with our intuition. Among
the non-price attributes, we can see the “Storage” has a stronger impact than
“Repair” based on the thickness of the lines, and the effect of both these
variables is positive.
2.2 Profile Method for Sensitivity Analysis

Lek, Beland, Dimopoulos, Lauga, and Moreau (1995), Lek et al. (1996) have
proposed the profile method for visualizing the sensitivity of the output(s) of a
neural network to variations in input variables. While the neural interpretation
diagram is useful for visualizing the relative magnitudes and strengths of the
weights on connections between various nodes, it is unable to capture the possibly
nonlinear relationship between each input variable and the output. The profile
method enables one to do this. In this method the contribution of each input
variable on the output is plotted by varying each input variable over its entire
range, while all the other input variables are held constant at various percentiles
(say 20th, 40th, 60th, 80th).
Fig. 2.16. Profiles for Sensitivity of Predicted Response to Input

Variable.
We will provide an illustration of a neural network where 10 predictor vari-

ables are used to predict a continuous response variable. The response is sales of a
durable good in units, and the predictor variables include price, competitor price,
advertising, location of the product on the retail shelves etc. Suppose we want to
focus on how competitor price affects the sales. Just for illustration, we will show
two sensitivity plots for how the focal variable (competitor price) affects sales
when all other variables are held at two different levels – 25th percentile shown
with the light dotted curve (….) and 75th percentile shown with the bold dashed
curve (----) (Fig. 2.16).
2.3 Feature Importance Based on Connection Weights

The NID gives an easy visual depiction that allows us to infer the direction of
impact as well as the relative importance of the inputs, based on the thickness of
the connecting lines. Its major drawback is that it does not quantify the impor-
tance of the input variables. Garson (1991) proposes a method to achieve that end
and it uses the weights that are the outputs of all neural network programs. We
will illustrate how this method works using an example of a neural network with 3
input nodes, 1 hidden layer containing 2 hidden nodes and 1 output node
(Fig. 2.17).
Suppose the weights after fitting this neural network are (Fig. 2.18))
The weight-based input importance method of Garson (1991) proceeds in three
steps
Fig. 2.17. NN with 3 Inputs Nodes, 2 Hidden Nodes and 1 Output

Node.
Fig. 2.18. Estimated Weights after Training of Network.
Step 1: Calculate the contribution of each input neuron to the output via each
hidden neuron calculated as product of input-hidden and hidden-output
connection (Fig. 2.19)
e.g., c11 5 w11*wO1 5 0.510*1389 5 708. We can fill in the other cells similarly
to obtain
Step 2: Calculate relative contribution of each input neuron to outgoing
signal from each hidden neuron (Fig. 2.20). For example r11 5 jc11 j 1 jcjc11 j
12 j 1 jc13 j
5
708=ð708 1 518:9 1 433:37Þ 5 0:43. The other entries in the columns corre-
sponding to “Hidden h1” and “Hidden h2” in Fig. 2.20 can be calculated
similarly. The row sums are, for example, S1 5 r11 1 r21 5 0.43 1 0.53 5 0.96
etc.
Fig. 2.19. Contribution of Each Input Neuron to the Output via

Each Hidden Neuron.
Fig. 2.20. Relative Contribution of Each Input Neuron to Outgoing

Signal.
Fig. 2.21. Relative Importance of Each Input.
Step 3: Finally, the relative importance of each input can be calculated using
the row sums in the last column in step 2 (Fig. 2.21). For example, e.g., RI1 5 S1/
(S1 1 S21S3)*100 5 0.96/(0.96 1 0.6310.41)*100 5 48%. This quantifies the
relative importance of the input variables.
2.4 Randomization Approach for Weight and Input Variable Significance

Olden and Jackson (2002) point out that it is desirable to eliminate connection
weights that are not significantly different from mere randomness – that is,
weights that could be expected by chance alone even if there were no relationship
between an input predictor variable and the output. Since the predictor impor-
tance is based on connection weights, as in Garson (1991) described in Section 2.3
above, the randomization approach also gives us a way to detect non-significant
predictor variables.
The randomization approach has the benefit of providing an alternate, sta-
tistically valid, method of doing variable (feature) selection. Many pruning
algorithms need an arbitrarily set threshold to eliminate links (Bishop, 1995).
Obtaining statistical significances of all weights on links between nodes allows us
to do variable selection without making this arbitrary judgment – one can elim-
inate an input node (variable) if all links from that node to hidden nodes have
non-significant weights on them. The randomization protocol advocated by
Olden and Jackson (2002) can be described as follows:
• Construct many NNs, for different initial random weights, using original data
• Select the NN with best predictive performance, and record the initial random
weights used for this best fitting NN. Calculate and record
• input-hidden-output connection weights (For example, c11 in step 1 in

Garson (1991)). These are “observed” weights.
• overall connection weight (which is a row sum in step 1 in Garson (1991). For
example, c1 5 c11 1 c21). These are “observed” overall connection weights
• relative importance for each input variable (For example, RI1 in step 3 in
Garson (1991)). These are “observed” relative importances.
• Randomly permute original response variable (yrandom). Permutation means

that the response for the ith observation, yi, will be replaced by the response for
some other observation, say, the jth observation, yj. This will be done for all
observations i 5 1,…, N, where the replacements will be done randomly
• Construct NN using yrandom and the initial random connection weights (from
step 2).
• Essentially, the permutation of the response variable breaks the association
between the predictor variable and the output variable. The weights on the
links in a neural network trained using a permuted data set would reflect those
that might be obtained by mere chance.
• Repeat steps (3) and (4) a large number of times each time recording 2(a), (b)
and (c). E.g., randomized c11, randomized c1, and randomized RI1.
The statistical significance of observed c11, c1, RI1 are calculated as the pro-
portion of randomized values equal to or more extreme than the observed values.
Let’s take the example of the relative importance (RI1) for a given predictor
variable. Clearly, if the observed RI1 is not very different from many of the
randomized RI1 then we would not be able to put much faith in this observed RI1.
In other words, this predictor would be non-significant.
2.5 Feature Importance Based on Partial Derivatives

Dimopoulos et al. (1995) have used the partial derivatives method to study the
impact of an input variable on the output. Consider the OLS regression with two
predictors: Y 5 b0 1 b1X11 b2X2 1 «. The parameter “b1” captures the effect of
the predictor X1 on the response Y. Suppose we write the function f (X1, X2) 5 a1
b1X11 b2X2. The parameter “b1” is the partial derivative of the function f with
respect to the predictor X1. The parameter “b2” can be similarly interpreted. To
the extent that the relative magnitude of “b1” can serve as a measure for the
importance of the predictor X1, the partial derivative approach can provide useful
insights about predictor importance.
The reader can find details of the partial derivatives approach to investigate the
importance of the predictors in the case of a neural network with a single
continuous output node in Dimopoulos et al. (1995). However, they have not
explicitly derived the partial derivatives in the case of neural networks with a
multi-class output. Modern neural networks for multi-class classification have
many output nodes and use a softmax activation function at the output layer, for
a final transformation of the outputs, after using a sigmoid activation at the
hidden layer. It is possible to derive the explicit analytical expressions for the
partial derivatives in the case of neural networks with binary and multi-class
outputs.
We end this section by noting that the Bayesian approach to neural networks
provides other means to study predictor importance. Bayesian neural networks
with Automatic Relevance Detection (ARD) can aid decision-makers to draw
inferences about the relative importance of different predictors.
3. Applications of Neural Networks to Sales and Marketing

There is a long history of firms from all industries applying NNs to sales and
marketing practice. While there are myriad instances of NN applications that
have been mentioned in the trade press and practitioner magazines, we will focus
mostly on the successful applications documented in practically-oriented and
peer-reviewed research journals. We adapt the framework employed by Syam and
Sharma (2018) to discuss how NNs have been fruitfully used to enhance sales
and marketing efficiency and effectiveness. Syam and Sharma (2018) discuss the
impact of AI and machine learning on the sales function by outlining their impact
on the different stages of the selling process (Table 2, page 141): prospecting, pre-
approach, approach, presentation, overcoming objections, close, and follow-up. It
should be noted that we will discuss the impact of neural networks on marketing
within this framework since the initial stages of the sales process have a strong
overlap with marketing functions at most firms. In order to draw a boundary
around our discussion, and reflecting our interest in both sales and marketing, we
will focus on those aspects of the marketing function that are closely intertwined
with the selling function. So, while this will not allow an exhaustive discussion of
all the ways in which marketing has been impacted by machine learning, the
reader will get enough of a flavor of how these new technologies are disrupting
marketing. In terms of the impact of machine learning, and specifically NN, on
the stages of the selling process there is similarity between pre-approach and
approach so we combine them in our discussions. Similarly, objection handling
and close will be combined.
Perhaps the greatest impact of neural networks in sales has been in the pro-
specting stage. In this stage of the selling process the firm performs the tasks of
finding customers and qualifying them by scoring the potential customers based
on their propensity to purchase. From a sales perspective, these constitute the
firm’s lead generation and lead qualifying functions. Some authors use the term
prospecting only for “lead generation,” and put “lead qualification” as a separate
stage (see Exhibit 2.5, page 48, of Johnston & Marshall, 2013). Our focus is on the
function itself rather than in delineating the specific stage of the sales process in
which the function occurs. The consumers’ propensity to purchase has been
extensively used to obtain estimates of demand. Purchase propensities have also
been used to divide the customer base into more or less attractive segments, and
then to target a subset of these segments for the firm’s selling and marketing
efforts. In this way, targeting, which is a typical marketing function, has a strong
overlap with the lead qualification aspect of the sales process and we will discuss
them together. Therefore, under the broad umbrella of the prospecting stage we
discuss the inter-related topics of (1) demand estimation and sales forecasting, (2)
segmentation, targeting and positioning (STP), (3) lead generation, and lead
qualification. STP are typical marketing functions that have maximum overlap
with the initial stages of the sales process, which we call the “customer develop-
ment” stages of the process, where marketing-sales integration is the most intense.
Agarwal and Schorling (1996) is an early paper in marketing that has provided
evidence of the superior performance of neural networks for market share fore-
casting. The authors investigated whether artificial neural networks perform
better than the standard Multinomial logit (MNL) in predicting brand shares of
grocery products. They chose frequently purchased grocery products like catsup,
peanut butter, dishwashing liquid, etc. in a B2C retail context using a well-studied
data set from IRI Marketing Research. They divided their data into 4 clusters: All
households (cluster 0), households purchasing 1 brand; households purchasing
2–3 brands; households purchasing . 5 4 brands. They used the same obser-
vations for both the neural network and MNL for a fair comparison and found
that neural networks outperform the MNL on many dimensions. Interestingly,
the performance of the neural network compared to the MNL is even better in
more complex situations like segments with more brands. In addition, the neural
network is less sensitive to the number of observations and robust to different
estimation periods. While Agarwal and Schorling (1996) have used a neural
network for classification in the binary choice context, most papers on sales
forecasting and demand forecasting have used neural networks in the regression
context. Zhang (2004) is an excellent book that details many applications of
neural networks for business forecasting. Coming to sales forecasting,
Carbonneau et al. (2008) and Thiesing and Vornberger (1997) use neural net-
works to forecast demand and show that these methods are superior to the
traditional forecasting methods like trend, moving average and linear regression.
In a major study, Hill, O’Connor, and Remus (1996) showed that, across monthly
and quarterly time series data, NNs performed significantly better than six sta-
tistical time series methods that were generated in a well-known forecasting
competition (Makridakis et al., 1982). The authors further note that NNs espe-
cially do better than the other methods for discontinuous time series data. The use
of NNs in forecasting aggregate market demand has been demonstrated by
Gruca, Klemz, and Petersen (1999), Hruschka (1993), and Yao, Teng, Poh, and
Tan (1998). NNs have been found to be very successful in handling chaotic time
series data that are very unlikely to obey simple linear relationships between the
independent variables (“input cells” in NN) and dependent variables (“output
cells”) assumed by traditional autoregressive methods that model linear stationary
processes (Landt, 1997; Lawrence, Tsoi, & Gilles, 1996). Therefore, neural net-
works, which do not assume any such relationships, and are very effective at
pattern recognition, are particularly suited for analyzing such time series data.
The use of NNs for market segmentation has been documented by Bloom
(2005), Hruschka and Natter (1999), Krycha (1999), Balakrishnan, Cooper,
Jacob, and Lewis (1996), and Fish, Barnes, and Aiken (1995) among others.
Boone and Roehm (2002) show how retail segmentation can be done using
artificial neural networks. Indeed, Chattopadhyay, Dan, Majumdar, and Chak-
raborty (2012) have compiled a list of more than 1000 articles from all disciplines
that have used NN in segmentation!
Most papers in the area of segmentation use unsupervised neural networks
(Krotov & Hopfield, 2019) where both the input and output variables (units) are
segmentation criteria and with one hidden layer whose units are segment members
(Hruschka & Natter, 1999). A multinomial logit determines segment membership
for the hidden layer whereas for the output layer the segment memberships
obtained in the intermediate hidden layer are weighted by segmentation criterion
specific weights. The output units are obtained using a binomial logit function
which uses the weighted sum of the memberships over all segments. Hruschka and
Natter (1999) show that neural networks perform much better for segmentation
than the traditional K-means clustering approach.
When it comes to targeting, NNs have mostly been used for targeting indi-
vidual customers rather than segments (which forms part of a firm’s STP strat-
egy). Zahavi and Levin (1997) have used NNs for targeting customers with
mailings. NNs have found very fruitful applications in lead generation and lead
qualification. This has been done both at the segment level and at an individual
customer level – whether for one-on-one marketing in B2C contexts or for B2B
business with a fewer number of industrial customers. Lead qualification can be
broadly conceived as not only certifying and vetting all the information about the
customer, but also “scoring” the objective “quality” of the lead in terms of
the consumers’ propensity to buy. The task of certifying, verifying and vetting the
information are examples of mundane tasks that can be automated, using neural
networks and other machine learning methods, to free up employee time. On the
other hand, lead qualification and scoring is an example of using neural networks
to actively generate customer intelligence that allows the firm to better target its
customers.
Lead qualification and scoring models are usually based on NNs that generate
choice probabilities. These models have binary or multiclass outputs capturing the
probabilities of belonging to various classes, such as “buy”/“not buy,” or even
more fine-grained categories of purchase. The individual-level choice probabilities
are then used to determine more or less “attractive customers” in terms of their
propensities to purchase. The attractive customers are said to be qualified, and the
firm then targets them with its sales and marketing efforts.
West, Brockett, and Golden (1997) is an early paper in a major marketing
journal that has demonstrated the superiority of neural networks for predicting
individual consumer choice. The authors study the realistic situation of non-
compensatory consumer choice, that requires nonlinear utility models, and
investigate the performance of neural networks compared to traditional statistical
methods. They state that: “…the results reveal that the neural network model
outperformed the statistical procedures in terms of explained variance and out-of-
sample predictive accuracy” (page 370). Specifically, they consider three
commonly used non-compensatory choice rules – Satisficing rule, Latitude of
acceptance rule; and Weighted additive rule – and find that the neural network
performs much better than the standard linear statistical models like Logistic
Regression and Discriminant Analysis.
Some authors have leveraged the ability of neural networks to better handle
data that is increasingly available from online sources. The increasingly important
online commerce especially benefits from the ability of NNs to handle the enor-
mous volume, complexity and “real-timeness” of the data. Potharst, Kaymak,
and Pijls (2001) have documented a 70% response rate when the mailing is guided
by neural networks (compared to just 30% for traditional lead qualification
methods) when identifying consumers that are likely to respond to direct mailings
by a Dutch charity organization. Kim, Street, Russell, and Menczer (2005) use
NNs, guided by a genetic algorithm, to identify the feature subset that maximizes
classification accuracy. Their form of genetic algorithm is called the evolutionary
local selection algorithm (ELSA) and it accomplishes “feature selection” to search
through a multitude of features (demographic variables) which are then fed to a
neural network that predicts “buy” or “not buy.” The authors show that this NN
approach dominates the traditional methods of feature selection (done by prin-
cipal components analysis) and classification (done by logistic regression).
In the context of using neural networks for choice modeling in a B2B situation,
Kumar, Rao, and Soni (1995) study a supermarket’s item selection decision. In
their analysis these authors use data from a supermarket’s item selection decision
with a total of 1048 observations with 770 rejects (coded as 0) and 278 accepts
(coded as 1). They do a comparison of neural networks and logistic regression. An
important contribution of this paper is a comparison of neural networks and
logistic regression under different classification thresholds, apart from the stan-
dard threshold of 0.5. They find that the performance of a neural network is better
than a logistic regression, and importantly, the performance becomes even better
when the classification thresholds are more stringent – for example, when an
object is classified as 0 (or 1) when the probability is less than 0.25 (or greater than
0.75). They take this as evidence that a neural network provides more confidence
than a logistic regression in making such a binary decision.
In terms of pre-approach and approach stages on the sales process, the largest
impact of AI has been the emergence of mobile and web-based means through
which the selling organization can contact the customers. A company called
6sense uses data on customers’ visits to the client’s site in combination with third-
party data and social media feeds to predict when the customer may be ready to
buy, and therefore the best time for client’s sales people to approach their
potential buyers. The most exciting development in AI-powered conversational
software has been the emergence of chatbots. While earlier chatbots used other
models of Natural Language Processing (NLP) like Markov chains and genetic
algorithms (Abdul-Kader & Woods, 2015), the most recent techniques for
conversational AI use neural network based deep learning methods (Gao, Galley,
& Li, 2019).
The major impact of machine learning and AI in the presentation stage have
come about by immersive technologies like mobile virtual reality (VR), 360-
degree video and augmented reality (AR). Such immersive technologies enable
higher user engagement than “plain” videoconferencing by enhancing the sense of
the presenter being present in the room. Garg and Tai (2014) show how artificial
neural networking (NN), along with genetic programming, can be used to
improve rapid prototyping by optimizing the parameter settings that control the
wear strength and tensile strength of the prototype. Multi-layer neural networks
like deep learners foster real-time image and audio processing which drive virtual
reality displays enabling effective sales presentation technologies (Dooley, 2017).
Intel’s RealSense Vision Processor allows firms to present presentation prototypes
in 3D and use advanced cloud-deployed algorithms to process raw image streams
in real-time. Thieme, Song, and Calantone (2000) have used neural networks in
the critical new product development selection process.
Of all the stages of the sales process, overcoming objections and closing are
perhaps least impacted by AI and machine learning including neural networks.
Much of the closing of big-ticket sales is still done in person, but the overcoming
objections function is being rapidly disrupted by AI through robo-advisors.
Though sales people still have a role in overcoming objections, especially where
standard FAQs are insufficient, AI is rapidly making inroads. Again the more
cutting-edge deep learning type neural networks are leading the charge by being
able to analyze video and text thus facilitating real time interactions with
customers.
The final stage of the sales process, follow-up, has two aspects: current order
processing and customer engagement after the current order is filled. As far as
applying neural networks is concerned, current order processing has been tackled
under the umbrella of supply chain functions comprising of order recording, order
processing, inventory management and order fulfilment. Sustrova (2016) shows
how NNs can be used for managing a company’s order cycle leading to reduced
storage costs, reduced inventory purchase costs and optimum ordering levels.
Chiu and Lin (2004) accomplish complete order fulfilment across the supply chain
by training three separate NNs for the supply network, the production network
and the delivery network.
Machine learning and AI have be applied to the post order follow-up stage of
the sales process. A company called Gainsight and Survey Monkey have teamed
up to offer software that automatically alerts salespeople to the need to invoice
after the close along with prompting them on upsell and cross-sell opportunities.
Guido, Prete, Miraglia, and De Mare (2011) shows how neural networks can
improve the effectiveness of direct mail marketing campaigns by providing better
predictions of purchase intention through cross-selling and up-selling. The con-
sumer’s response rates are modeled using factors that are likely to have an impact
on the purchase intention. The authors also benchmark NNs against the more
traditional methods like multiple regression analysis and logistic regression and
find that NNs perform better. Knott, Hayes and Neslin (2002) develop models to
determine the Next Product To Buy (NPTB) and find that a NN has a slight
advantage over other competing models. Linder, Geier, and Kolliker (2004)
provide some recommendations of whether NNs, regression or classification trees
should be used depending on the type of customer database being analyzed.
4. Case Studies
In this section we will present a couple of case studies about the application of
neural networks in marketing. The goal of these case studies is to provide some
details on how neural networks have been used in important application areas in
marketing and sales. We will focus on the business context of the applications,
describe the data set and, wherever possible, provide some theoretical background
for our choice of predictors and dependent variable. We will describe details of the
analyses done on the data sets and the results obtained, especially with a view to
visualization and interpretation of the results. We also compare the strengths and
weaknesses of neural networks compared to the traditional econometric models
where the latter are used as benchmarks.
Case Study 1: Churn Prediction

The management of customer churn has been a very important application area in
marketing. Indeed, Neslin, Gupta, Kamakura, Lu, and Mason (2006) organized a
tournament in which they made available a churn data set and invited scholars,
consultants and industry practitioners to enter the tournament with their models
for predicting churn. The large volume of articles, published in both marketing
and non-marketing journals, on the topic of churn models has been summarized
well by Blattberg, Kim and Neslin (2008) in Table 24.2 (page 612). These authors
provide an exhaustive list of various factors that determine the levels of churn.
Because of the importance of churn, especially in industries where customers can
exit and not return unless the company exerts considerable effort to acquire them
again, scholars have investigated the problem of churn prediction in many
different industries. To take a few examples, Coussement and Van den Poel (2008)
have considered the market for subscription services, Verbeke, Dejaeger, Mar-
tens, Hur, and Baesens (2012) have considered the telecommunication sector and
Xie, Li, Ngai, and Yin (2009) have considered bank customer churn.
We use a publicly available data set to illustrate the use of a neural network for
churn prediction. The data comes from a European bank and the goal of the
analysis is to predict whether a customer will leave the bank or not. The data
comes from historical records at the bank and the variables in the dataset are
• RowNumber
• CustomerId
• Surname
• CreditScore
• Geography
• Gender
• Age
• Tenure
• Balance
• NumOfProducts (How many accounts, bank account affiliated products the
person has)
• HasCrCard (Whether they have a credit card issued by the bank)
• IsActiveMember (Whether they do regular business with the bank)
• EstimatedSalary
• Exited (Did they leave the bank after all?)
The aim of this section is to give one simple illustration of the analysis that can
be done, and so we will not show detailed analyses of the data using an exhaustive
search over all the different tuning parameters to construct the “best fitting”
neural network. The features we will use are variables 4 through 12 in the list
above (“CreditScore” to “IsActiveMember”), and the target will be variable 14
(“Exited”). Thus, we will use binary classification models. We will use a logistic
regression as a benchmark model and investigate whether, and to what extent, a
neural network performs better than it.
Both the logistic regression and neural network were trained on a random
sample of 80% of the data and tested on a random sample of 20% of the data. In
both we did a 3-fold cross-validation. In addition, for the neural network we
found that a network with 15 hidden nodes performed reasonably well and we
chose a weight decay parameter of 0.8. We show the inputs chosen for running a
logistic regression for this data set.
Response Column: 14
Predictor columns: 4:12 (4 through 12)
Training Percentage: 80
Testing Percentage: 20
Number of folds for Cross-validation: 3
The inputs chosen to run a neural network are as follows
Response Column: 14
Number of Nodes in Network: 5, 10, 15
Number of times Averaging: 2
Decay: 0.8
Probability Range to Exclude: 0.5, 0.5
While the other inputs in the interface for a neural network are straightfor-
ward, two aspects require clarification. First, the number of nodes is given as
“5,10,15.” The idea is to compute neural networks with different number of
hidden nodes and then to select the best fitting neural network. This step can be
accomplished by writing simple code in all software programs. In this case a
neural network with 15 hidden nodes performs the best, and so the fit statistics in
the following paragraph and outputs only relate to the neural network with 15
hidden nodes. Second, the “Probability range to exclude” input is related to the
probability threshold selected for classifying a data point as belonging to category
“1” (Exit) versus “0” (Not exit). If we select the usual threshold of 0.5 then the
input in this cell is “0.5,0.5.” We will give more explanations of this when we
discuss other variants of binary classification models, but for now the reader
should go with the input as shown.
We briefly describe the fit statistics. The left and right tables below show the
Confusion Matrices for the logistic regression and neural network respectively,
both calculated on the test data.
0 1 0 1
0 1568 35 0 1509 71
1 334 63 1 234 186
Clearly, the neural network has a superior predictive performance since

it classifies 84.75% of the test data correctly (1695/2000) as compared to
81.55% for the logistic regression (1631/2000). The number of correctly
classified data points is given by the diagonal elements in each table.
Moreover, the AUC for the logistic regression is approximately 0.74 whereas
for the neural network it is approximately 0.86. Since the possible values for
the AUC range from 0.5 to 1, therefore an increase in AUC of approximately
0.12 (50.86–0.74) represents a considerable improvement in predictive accu-

racy by using a neural network.
It is also interesting to identify the most important features for churn predic-
tion. For the logistic regression, the top 5 features in terms of their importance are
(1) Age (2) Balance (3) IsActiveMember (4) Gender (5) CreditScore. Similarly, for
the neural network the top 5 features are (1) Age (2) NumOfProducts (3) Balance
(4) IsActiveMember (5) Gender. In terms of the more interesting non-
demographic variables it seems that “Balance” and “IsActiveMember” are
important in predicting churn. Moreover, from the neural network output Bal-
ance is positively related to churn, that is, people with a higher balance are more
likely to churn. This may be due to the fact that high-net-worth individuals, who
are more likely to maintain a higher balance, are also more attractive targets for
competitor banks and are more likely to seriously consider competitive induce-
ments. Coming to the other feature, being an active member, who does frequent
business with the bank, makes one less likely to churn. This makes intuitive sense.
Interestingly, a well-established framework in customer relationship management
is called RFM, which is an acronym for: Recency, Frequency, Monetary Value.
One could argue that in the case of banks, IsActiveMember could be a proxy for
“frequency,” in that, active members are those who do frequent business with
their bank.
Case Study 2: Rent Value Prediction

The second case study deals with the opening illustration in the neural networks
chapter – that of predicting rent value based on the distance to the city center.
Unlike case study 1 which has a binary target, here we illustrate a neural network
for a continuous target (dependent variable) which is rent value.
For this case study we use a simulated data set with only two variables,
(1) Rent value

(2) Distance to city center
For the purposes of benchmarking the neural network we use a simple linear
regression. The inputs for the linear regression are as follows.
Response Column: 1
Predictor columns: 2
As in case study 1, we use 80% of the data for training and a randomly selected
sample of 20% of the data for testing. For both the linear regression and the
neural network we use a 3-fold cross validation. The inputs chosen to run a neural
network are as follows
Response Column: 1
Number of Nodes in Network: 5, 10, 15
Decay: 0.8
Since there is only one feature, therefore it is not meaningful to investigate the
relative importance of features. Thus, we will only consider the fit statistics. Since
this is a regression task, we will look at the predicted MSE (mean squared error).
The predicted MSE for the linear regression calculated on the test data is
approximately 6921. The predicted MSE for the neural network is approximately
2543. Note that predicted MSE is an error and so smaller values are more
desirable. Clearly, the neural network does a much better job of prediction on the
test data set. It is worth noting that, as mentioned in Fig. 2.2 in Section 1.2, the
true relation between rent value and distance to the city center has a “cubic”
pattern. The simulated data captures this relationship. Thus, it stands to reason
that the linear regression will not be able to adequately capture this relationship.
How about a nonlinear regression as given below (using DCC 5 distance to city
center)?
Rent value ¼ a 1 b*DCC 1 c*DCC2 1 d*DCC3 1 «
This will require the analyst to create two additional variables from the raw
data, namely DCC2 and DCC.3 This is the nonlinear regression that was esti-
mated by Frew and Wilson (2002), and they found support for this regression.
The important thing to realize is that such nonlinearities are automatically
captured by a neural network without the need of significant human intervention in
terms of pre-processing the data to create new variables!
APPENDIX
Technical Detour 1
The functions fm(xi) (m 5 1,…, M) which form the sum f ðxi Þ 5 w0 1
M
+ wm fm ðxi Þare called “basis functions,” and the summation on the right hand
m51
side is called the “basis expansion” of f (xi). In the specific case where f (xi) has a
2
The reader should note that hm is just the output from hidden node m after the activation
function fm acts on the input information received at that node (see Fig. 2.5).
3
It is now clear why sometimes the final transformation f (2) is needed at the output layer.
For classification, this is the transformation that generates class probabilities from the un-
transformed, and unrestricted, continuous output that the NN would otherwise yield.
polynomial representation, the basis functions are: fm(x) 5 xm for m 5 0,…, M.

Of course, other basis functions can be used such as sigmoid functions, radial
basis functions and splines, but whatever the type of basis function the essential
idea remains the same. For a compact notation, we set f0(xi) 5 1, and write the
M
summation as f ðxi Þ 5 + wm fm ðxi Þ. A basis expansion is merely an equivalent
m50
expression of a complex function f (xi) in terms of a weighted sum of other (often
simpler) functions fm(xi).
Technical Detour 2
The input information received at hidden node m is the weighted sum of input
nodes (that is, x1, x2,…, xp) plus a bias acting at hidden node m. In symbols, the
input at hidden node m is: wm0 1 wm1x1 1 wm2x2 1…1 wmpxp. In this expression,
the weights are wm1,..., wmp and the bias is wm0. It is convenient to express this sum
in a convenient vector notation. For that purpose we need to augment the input
vector x 5 (x1, x2,…, xp) by adding a term x0 5 1. The augmented input vector
becomes x 5 (x0, x1,…, xp). With this augmentation the sum wm0 1 wm1 x1 1
p
wm2 x2 1 … 1 wmp xp 5 + wml xl , where the reader should note that the sum-
l50
mation index l starts from l 5 0. The summation on the right hand side of the
previous equation can be written conveniently in vector notation as wm×x. This
term is called the dot product, or the inner product, of two vectors, namely, the
vector of weights wm 5 (wm0, wm1, wm2,…, wmp) and the vector of inputs x 5 (x0,
x1,…, xp) with x0 5 1. It is an example of the expositional simplicity we can
obtain with vector notation.
We can make the dependence on the ith observation explicit by writing the sum:
p
wm0 1 wm1 xi1 1 wm2 xi2 1 … 1 wmp xip 5 + wml xil . As before, the summation
l50
on the right hand side of this equation can be written in vector notation as
wm×xi, where the vector of inputs for the ith observation xi 5 (xi0, xi1,…, xip) with
xi0 5 1.
Now consider the, more realistic, case where there are M hidden nodes. The
activation function at all hidden nodes have the same form, that is fm(xi) f
(wm×xi), for m 5 1,…, M. The right-hand side shows that only the vector of
weights wm are different for different nodes m. As mentioned in Section 1.2, to
distinguish quantities in the hidden layer from the output layer (which has only
one node in the univariate case) we use superscripts “(1)” and “(2)” respectively.
With the new notation, the input information received at hidden node m is the
weighted sum w(1)m×xi. The activation function at hidden node m is f (1) and it
acts on w(1)m×xi to produce output f (1) (w(1)m×xi). Let the weight on the output
from the hidden node m be w(2)m, m 5 1,…,M. Hence the basis expansion of
M
ð2Þ ð1Þ
f (xi) becomes f ðxi Þ 5 + wm f ð1Þ ðwm :xi Þand thus, the regression yi 5 f ðxi Þ 1 «i
m50
becomes:
M
yi ¼ + wð2Þ
m f
ð1Þ
ðwð1Þ
m :xi Þ 1 «i (A2.1)
m¼0
This mathematical formulation defines a NN for the univariate regression case

with p input nodes, one hidden layer (with M hidden nodes in it) and one output
node.
Let us define hmi 5 f (1) (w(1)m×xi) for m 5 1,…, M and define h0i 5 1. Recall
that hm is notation for the hidden node m, and hmi just makes the dependence on
the specific input data point xi explicit. At the output node, the activation function
M
ð2Þ
f (2) acts on the input information received at that node: + wm hmi . This term is
m50
the weighted sum of the hidden nodes plus the bias: w(2)1h1i 1 w(2)2h2i 1…1 w(2)
MhMi 1 w 0. As with the hidden nodes, we can conveniently write the input
(2)
information received at the output node by the dot product w(2).hi with weights
w(2) 5 (w(2)0, w(2)1, w(2)2,…, w(2)M) and hidden nodes hi 5 (h0i, h1i,…, hMi) where
M
ð2Þ
h0i 5 1. Thus, wð2Þ × hi 5 + wm hmi . After the final transformation the output
m50
would be f ðxi Þ 5 f ð2Þ ðwð2Þ × hi Þ. One can see that this idea of forming composite
functions can be generalized to form longer chains. We have just seen that f (xi) 5
f (2) (w (2)×hi). Since hmi 5 f (1) (w(1)m xi), therefore f (xi) can be expressed as a
composite function f (xi) 5 f (2) (f (1) (xi)). Proceeding similarly, we can also have
longer chains: f (xi) 5 f (n) (…f (2) (f (1) (xi)...).
Technical Detour 3
For a complete description, we need to specify the activation functions at both the
hidden and output layers. First, consider the hidden layer. Making a connection
ð1Þ ð1Þ
with Fig. 2.10, we have, h1 5 f ð1Þ ðw1 xÞ 5 1
2w
ð1Þ
x
and h2 5 f ð1Þ ðw2 xÞ 5
11e 1
ð1Þ ð1Þ ð1Þ ð1Þ ð1Þ ð1Þ
1
ð1Þ
2w x
, where the weights are given as w1 x 5 w01 1 w11 x; w2 x 5 w02 1 w12 x.
11e 2
The activation function in the output layer, f (2), is linear in hm, m 5 1, 2. Thus,
ð2Þ ð2Þ ð2Þ
y 5 f ð2Þ ðhÞ 5 w0 1 w1 h1 1 w2 h2 1 «. Note that the activation function in
output layer is linear in the case of regression. As already mentioned, for a
binary output variable, the activation function in the output layer is usually
sigmoid.
Technical Detour 4
The formal definitions of the maximum likelihood estimator are given in Chapter
1. There we show that maximum likelihood is related to the cross-entropy cost. In
this technical detour we just present explicit expressions of the cost functions for
regression and classification problems for the specific case of a neural network.
The sum-of-squares cost, and indeed the cross-entropy cost, for a regression
model for a 1-dimensional output yi is
N
CðuÞ ¼ + ðyi 2 f ðxi ÞÞ2 (A2.2)
i¼1
For a K-class classification problem the cross-entropy cost is

1 N K ð2Þ
CðuÞ ¼ 2 + + yik log fk ðzi Þ (A2.3)
N i¼1 k¼1
Technical Detour 5
In chapter 1 we mentioned the gradient descent updating of the weights. The
gradient descent occurs according to the updating rule
∂CðwðtÞ Þ
wðt 1 1Þ ¼ wðtÞ 2 g (A2.4)
∂w
ðtÞ
The geometrical analog of the derivative, ∂Cðw Þ
∂w , is a slope and for multi-
dimensional cases with many weights the equivalent of this derivative is called a
gradient.
To give a more detailed demonstration of the backpropagation algorithm we
will consider the case of a NN with one hidden layer and K output nodes. The
parameters are given in (2.1) and are collectively denoted by u. We will consider
supervised learning where there is a well-defined K-dimensional “target variable”
yi 5(yi1, yi2, …, yiK) corresponding to an input data vector xi, i 5 1,…,N.
Learning is accomplished by minimizing a loss (cost) function, and here we use a
quadratic loss:
N N K
CðuÞ ¼ + Ci ¼ + + ðyik 2 fk ðxi ÞÞ2
i¼1 i¼1 k¼1
where fk ðxi Þ is the output of the NN model at node k and yi is the target cor-
responding to input xi . The gradient based update requires the derivatives. The
derivative with respect to the output nodes is
∂Ci ð2Þ9 ð2Þ ∂ ð2Þ
ð2Þ
¼ 2 2ðyik 2 fk ðxi ÞÞfk ðwk ×hi Þ ð2Þ
ðwk ×hi Þ
∂wkm ∂wkm
Which gives us
∂Ci ð2Þ9 ð2Þ
ð2Þ
¼ 2 2ðyik 2 fk ðxi ÞÞfk ðwk ×hi Þhmi
∂wkm
The derivative of the cost with respect to the hidden node is

∂Ci K ∂ ð2Þ ð2Þ
ð1Þ
¼ 2 + 2ðyik 2 fk ðxi ÞÞ ð1Þ
fk ðwk ×hi Þ
∂wml k¼1 ∂wml
The partial derivative inside the summation on the right-hand side can be
evaluated as
∂ ð2Þ ð2Þ ð2Þ9 ð2Þ ð2Þ
ð1Þ
fk ðwk ×hi Þ ¼ fk ðwk ×hi Þwkm f ð1Þ9 ðwð1Þ
m ×xi Þxli
∂wml
In the above evaluation, the term xil in the last expression comes from
p
ð1Þ ð1Þ
differentiating the argument of f (1), which is wm ×xi 5 + wml xli , with respect to
l50
ð1Þ
wml . Summarizing, the gradients of the cost function with respect to the output
and hidden layers are
8
>
> ∂Ci ð2Þ9 ð2Þ
>
> ¼ 2 2ðyik 2 fk ðxi ÞÞfk ðwk ×hi Þhmi
>
> ð2Þ
< ∂wkm
(A2.5)
>
> ∂Ci K
>
> ð2Þ9 ð2Þ ð2Þ
¼ 2 + 2ðyik 2 fk ðxi ÞÞfk ðwk ×hi Þwkm f ð1Þ9 ðwð1Þ
>
> m ×xi Þxli
: ∂w ð1Þ
k ¼ 1
ml
The gradients above can be rewritten as

∂Ci ð2Þ ∂Ci ð1Þ
ð2Þ
¼ dki hmi ; ð1Þ
¼ dmi xli (A2.6)
∂wkm ∂wml
ð2Þ ð1Þ
The terms dki and dmi are the “errors” at the nodes in the output and hidden
layers. Thus, the gradients of the cost function with respect to the output and
hidden layers can be written in terms of the errors. Clearly, from their definitions
and comparing with (A2.5) we have the back propagations equations
9
K
ð1Þ ð2Þ ð2Þ
dmi ¼ f ð1Þ ðwm
ð1Þ
:xi Þ + wkm dki (A2.7)
k¼1
The above is a recursive relationship between errors at different layers. Given

ð2Þ
the errors at the output layer nodes, dki , it propagates the errors back to the
ð1Þ
hidden layer nodes dmi .
Technical Detour 6
Linear Activation Function for Continuous Regression Outputs

When the final output from the NN is a continuous regression output, then the
activation function at an output unit produces an affine transformation given by
M
ð2Þ
f ð2Þ ðhiÞ 5 + wm hmi . For estimation purposes it is generally assumed that in the
m50
M
ð2Þ
regression equation yi 5 f (xi)1 ei, with f ðxi Þ 5 + wm hmi , the output yi follows
m50
a Gaussian distribution with mean f (xi).
When we extend this to the multivariate case, then for the kth output (k 5 1,…,
M
ð2Þ
K), the regression equation is yki 5 fk ðxi Þ 1 eki ; with fk ðxi Þ 5 + wkm hmi .
m50
Sigmoid Activations Function for Binary Outputs

Quadratic cost and learning slowdown: We first formally define a sigmoid acti-
vation function. As mentioned in the text, apart from its use as an activation
function at the output layer for binary outputs, it is also very commonly used as
an activation function for the hidden layer. A sigmoid function, s(z), is given by
1
sðzÞ ¼ (A2.8)
1 1 e2z
The sigmoid activation function is a nonlinear function which takes any
possible value of z (between negative and positive infinity) and maps it to the
interval between 0 and 1.
We now provide a sketch of how the quadratic cost with the sigmoid activation
function would suffer a learning slowdown problem. Suppose we were to use the
N M
ð2Þ
quadratic loss Cquadratic 5 + ðyi 2 f ð2Þ ðzi ÞÞ2 , where zi 5 + wm hmi is the affine
i51 m50
combination of hidden nodes corresponding to input xi. Consider learning the
ð2Þ
weight wm . In gradient based learning the learning rate with respect to weight
ð2Þ ∂Cquadratic
wm would depend on the derivative ð2Þ . Now, by the Chain rule of calculus,
∂wm
∂Cquadratic N ∂f ð2Þ ðzi Þ ∂zi

¼ 2 + 2ðyi 2 f ð2Þ ðzi ÞÞ (A2.9)
∂wm
ð2Þ
i¼1 ∂zi ð2Þ
∂wm
ð2Þ
Clearly, it becomes obvious the rate of learning with respect to wm depends
ð2Þ
on ∂f ∂zðz
i
iÞ
which is the derivative of the function f (2) (zi) with respect to zi. From
the shape of the sigmoid function shown in Fig. 2.14, the derivative, written as
f (2)9(zi), is very small for large and small values of zi. Thus, the learning rate is
small as well.
Cross entropy cost does not have learning slowdown: Consider the training data
(xi, yi), i 5 1,…, N. The distribution of 1s and 0s for the yi constitute the empirical
distribution pdata of the training data. The model predictions P(yi│xi) 5 f (2) (zi)
constitute the model distribution pmodel(x; u). The gradient with cross-entropy is,
∂Ccross 2 entropy 1 N ð2Þ
ð2Þ
¼ + ðf ðzi Þ 2 yi Þhmi (A2.10)
∂wm N i¼1
The learning rate with the cross-entropy function, which is proportion

∂Ccross 2 entropy
to ð2Þ , does not depend on the derivative f (2)9(zi). One can see from the right
∂wm
hand side of (A2.10) that with the cross-entropy function the learning rate is
higher when the error (f (2) (zi) – yi), which is the difference between the predicted
value f (2) (zi) and the true target value yi, is large.
Softmax Activation Function for Multi-class Outputs

There are K classes with targets yki, k 5 1,…, K, taking values 1 or 0. As
mentioned in Section 1.4.3, we use the softmax function as the final activation
function to transform the final output into probabilities. For argument zi 5(z1i,
ð2Þ
z2i,…, zKi) the softmax activation function at output node k, that is fk ðziÞ, is
given by the following.
ð2Þ ezki
fk ðzi Þ ¼ K
(A2.11)
+ ezji
j¼1
We can clearly see that all elements of the vector zi enter the computation
above.
We provide an intuition for how the softmax function is a continuous approx-
imation to the discontinuous function max(.). Many authors distinguish between
the softmax function and the softmax activation function used in NN output layers.
For a set of points x1, x2,…, xn, the softmax function is given by: softmax
n
ðx1 ; x2 ; …; xn Þ 5 ln + exj . First, it is easy to see that max{0, x} is discontinuous
j51
at 0 but softmax(0, x) is continuous. Now, because we exponentiate therefore if xi is
the largest component, that is xi 5 maxj xj, then exi will dominate all the other terms
n n
in the summation + exj . Hence, ln + exj ln exi 5 xi 5 maxj xj . In other
j51 j51
words, not only is the softmax function continuous, it also approximates the max
function. Also, note that when the probabilities of class membership are given by
the softmax activation function in (1.5), then maximizing the log likelihood is
K
z
equivalent to maximizing over terms like lnð Ke ki Þ 5 zki 2 lnð + ezji Þ. One can
+ ezji j51
j51
now clearly see how the softmax function is involved in the softmax activation
function at the output nodes.
Chapter 3
Overfitting and Regularization in Machine

Learning Models
Chapter Outline
1. Hyperparameters, Overfitting, Bias-variance Tradeoff, and Cross-validation
1.1 Hyperparameters
1.2 Overfitting
1.3 Bias-variance Tradeoff
1.4 Cross-validation
2. Regularization and Weight Decay
2.1 L2 Regularization
2.3 L1 and L2 Regularization as Constrained Optimization Problems
2.4 Regularization through Input Noise
2.5 Regularization through Early Stopping
2.6 Regularization through Sparse Representations
2.7 Regularization through Bagging and Other Ensemble Methods
Technical Appendix
1. Hyperparameters, Overfitting, Bias-variance Tradeoff,

and Cross-validation
In this chapter, we discuss many topics that are important to all machine learning
models, not just neural networks. Concepts like hyperparameters, overfitting, the
bias-variance tradeoff, cross-validation, regularization through weight decay and
others are important for any machine learning model. That said, some of these
rather abstract concepts and ideas become more concrete if one can connect them
to a specific machine learning model. Because of this we have decided to discuss
these ideas after covering neural networks, a prominent example of a machine
learning model. Thus, in discussing the concepts of this chapter we will often
make connections with neural networks for the sake of concreteness. However,
the reader should keep in mind their broader applicability.

doi:10.1108/978-1-80043-880-420211004
1.1 Hyperparameters
A hyperparameter is a parameter whose value has to be fixed prior to training on
a specific data set. We can conceive of a hierarchy of parameters. The training
process results in the estimation of certain parameters, but the training itself
requires some “higher level” parameters to be set before training begins. Said
differently, hyperparameters cannot be estimated while fitting the model to the
training data set. Hyperparameter often relate to model specification, model
architecture, or to the specification of the learning algorithm.
Consider the case of a neural network. Any practically useful NN is likely to
have a large number of weights and biases that need to be estimated. This
problem is exacerbated for deep NNs. NNs have a hierarchy of parameters: at the
lower level we have the weights and biases which are estimated using some variant
of the backpropagation algorithm. Recall that this algorithm requires one to
specify a “learning rate.” The learning rate is an example of a parameter at a
higher level, in that, the NN model estimates the weights and biases given a
certain learning rate which is set exogenously. There are other such higher-level
parameters, for example the number of hidden nodes or the regularization
parameter (which we will soon explain). These higher-level parameters are called
hyperparameters.
Many, though not all, hyperparameters are designed so as to allow general-
ization of the model to new test data, once the model has been fit using the
training data. Overfitting of the model to the training data is a situation where the
generalizability of the model to new test data is compromised. Thus, many
hyperparameters are chosen to explicitly counter the tendency of a NN model to
overfit to the training data. Perhaps the most powerful suite of methods for
countering overfitting is regularization. Regularization requires a regularization
weight and this is another hyperparameter.
1.2 Overfitting
The real issue of overfitting is how to design the machine learning model to
properly tradeoff between training error and test error – these are the errors the
model makes when we evaluate the fit of the model to the training data and test
data respectively. Models which provide a very good fit to the training data, i.e,
they have very small training error, may suffer from high test error. As a meta-
phor consider a student who prepares for a math test via rote learning. If this
student prepares for the test by only memorizing the answers to the sample
questions, and nothing else, he is less likely to successfully tackle a new problem
encountered during the test itself which is different from the sample questions. His
learning will not generalize well to new problems.
Consider the case of a neural network. One useful heuristic to determine when
a neural network model has started to overfit is to plot two curves, one each for
the training and test data. The curves in question are the plots of the prediction
error against the model complexity (e.g., the number of hidden nodes for a neural
network).
Overfitting and Regularization in Machine Learning Models 67
Fig. 3.1. Plots of Prediction Error versus Model Complexity.
As a rule of thumb, over fitting occurs when the prediction error of the training
data keeps decreasing while the prediction error of the test data stops decreasing
or even starts to increase. Toward the right side of Fig. 3.1 one would have the
region with high variance and low bias, and toward the left side one would have
the region of low variance and high bias. Consider a NN with a given number of
input units and output units. As we increase the number of hidden units, thereby
increasing model complexity, we can plot the training error and test error. When
the training error stops decreasing or when the test error starts increasing then you
are done with training. Some analysts also plot the learning curves with the
number of training epochs as the X-axis. One of the most practically convenient
means of avoiding overfitting is early stopping where the number of epochs of
training is stopped before training error reaches its minimum (we will discuss this
further when we discuss regularization).
While there are many sources of overfitting, a useful starting point to form an
intuition would be to look at the issue of over-parameterization where the model
has too many parameters relative to the size of the training data. Over-
parameterization could become a serious problem especially in deep learning
models owing to large number of parameters, unless there is also a very large
training data set. To take an extreme example, consider the case where the
number of training data points is actually smaller than the number of parameters.
In multiple linear regression, it is well known that the proper estimation of the
parameters (the coefficients) requires the number of observations to be larger than
the number of parameters. In fact, with more parameters than observations in a
linear regression, an infinite number of parameter values will fit the training data
exactly. This is an important concept, and we provide more details in the
appendix.
Technical Detour 1
Fig. 3.2. For a Given Training Data Set, a More Complex

Model (Right) Is Likely to Overfit.
A similar logic is at play in more complex machine learning models. In these

models too when the number of parameters is large compared to the size of the
training data an infinite number of parameter vectors (values of the parameters)
will give zero training error. If the analyst chooses any one parameter vector from
this infinite set, the model with this set of parameters is highly unlikely to fit a new
test data set – in other words, there will be overfitting since our test error would be
high despite having zero training error. While overfitting could be a problem with
all statistical models as our very simple illustration with regression above shows,
the large number of parameters in NNs (especially in deep NNs) exacerbates this
issue and one has to pay particular attention to it in machine learning contexts.
More generally, overfitting is related to model complexity where, for a given
training data set, a more complex model with more parameters has a higher
chance of overfitting than a simpler model. In Fig. 3.2, both the panels have the
same training data but the model on the right panel is more complex. The model
in the right panel with more parameters overfits the data since it picks up idio-
syncrasies of the training data.
1.3 Bias-variance Tradeoff

Technically speaking, overfitting in machine learning is best understood through
the Bias-Variance Tradeoff. In Fig. 3.2 above, the left panel has a high bias but a
low variance while the right panel has a low bias but a high variance. Consider a
regression problem Y 5 f(X)1e, where the error e has a mean of zero and some
variance. We consider an input point x0 and calculate the expected prediction
error of a regression fit b
f ðx0 Þ at that point using the squared error loss. It can be
shown that the expected prediction error (alternatively called the test or gener-
alization error) is
EPE ¼ Bias2 1 Variance 1 Irreducible error (3.1)
The irreducible error comes from the variance of the error term in the
regression. Ignoring the irreducible error for the moment, the more complex we
make the model we reduce the bias but this increases the variance. A situation of
high bias is often called underfitting and one of high variance is overfitting.
Executive Summary
A major concern while training any machine learning model, including neural
networks, is the problem of overfitting. Overfitting relates to the ability of the
model to generalize to new data that was not part of the training. More
complex models are more likely to overfit training data. The minimization of
overfitting lies in achieving a proper tradeoff between training error and test
error. Training (test) error is the error that the model makes when we evaluate
the fit of the model to the training (test) data.
Technically speaking, overfitting is quantified by the Bias-Variance
tradeoff. Regularization through weight decay, and other techniques of tun-
ing the hyperparameters of a model, are important methods for correcting
overfitting.
The last term in (3.1), the irreducible error, is the variance of the error term
in the regression model Y 5 f (X)1«. This error could be due to the fact that
there are many unmeasured (un-modeled) variables that also affect Y apart
from the ones in the function f (X). Moreover, there is always the chance of
measurement error. We assume that all these sources of “model” error are
captured by the error term e, and this generates the irreducible error term in
(3.1). We now turn to the first and second terms in (3.1). Bias, variance, and the
bias-variance tradeoff are important concepts in machine learning and we will
give some intuition for them. Both bias and variance are related to sampling
variability. Statisticians conceive of the training data sample as being a random
draw from an underlying data distribution. Different training data samples
correspond to different random draws and so the training samples are different.
Now the training of a model, i.e., learning the parameters of the model, depends
on the training sample used. Thus, the prediction b f ðx0 Þ, corresponding to a
given input point x0, will be different for different samples. Because there are
differences in the predictions, hence, we can talk of the average prediction.
Intuitively speaking, the discrepancy of the average prediction from the “true”
f ðx0 Þ, is captured by the concept of bias. This idea can be made more precise.
Statisticians use the general concept of expectation, and for our purposes it
suffices to note that, under reasonable conditions, the sample average is an
estimate of the expectation. The discrepancy of the expected prediction from the
true f (x0) is the bias. Moreover, as the predictions are different for different
samples, one can also calculate the variance of the predictions around their
mean. This is the “variance” term in the bias-variance decomposition formula
above.
How is the bias-variance tradeoff impacted by the complexity of the model?

From Fig. 3.2, for any given training sample, as the model complexity increases
the model fit becomes better and so the discrepancy of the model prediction
from the “truth” is smaller. But this is true for all training samples, and hence
the discrepancy of the average (expected) model prediction from the “truth” is
small as well (the expectation is over all training samples). In other words, the
bias is smaller with a more complex model. What about the variance? A more
complex model is more likely to pick up specific nuances and idiosyncrasies of a
particular training data set. This will result in a good prediction for that
particular data set. However, precisely for this reason it is also less likely to fit
other data sets, with the result that the predictions will be very different for
different training data sets. Said differently, the variance will increase with a
more complex model.
The detailed derivation of the mathematical expression for the bias-variance
decomposition is given in the appendix.
Technical Detour 2
1.4 Cross-validation
We now turn to the important concept of cross-validation. Since the ability to
generalize to new data is a critical consideration for any machine learning model,
we would like to investigate the performance of the model on new test data. This
requires one to keep aside some data as the test data and estimate the model only
on the training data. This makes inefficient use of the data, especially for small
data sets. Cross-validation is a useful technique for assessing the generalizability
of the model, and thus avoiding overfitting, while at the same time making effi-
cient use of the available data.
Ideally, if we have enough data then we should randomly divide the data into:
(a) Training data (b) Validation data (c) Test data.
The training data is used to train the neural network and the modeler has many
design choices. The training data set could be used multiple times for different
choices of hyperparameters like the learning rate, weight decay parameter, or
even for different modeling architecture choices like the number of hidden nodes
in the hidden layer.
The validation data is used to evaluate the performance of these various
hyperparameter and model architecture choices. For example, the tuning of
the weight decay parameter can be done by choosing different values of
weight decay and estimating the regularized model using the training data.
Each of these versions of the neural network model (corresponding to
different weight decay parameters) can be evaluated for accuracy by using the
validation data.
The test data is generally used just once at the very end after all the hyper-
parameters have been tuned and the best set of hyperparameters and parameters
(weights and biases) have been obtained. Often the process unfolds in stages:
• The entire labeled data is divided into training and test data. The test data is
held aside.
• The training data is further divided into “training without validation” and
validation data.
As mentioned, when the data is insufficient to be split into training, validation,

and test data such that each step has enough data, then we have to resort to
sample reuse techniques. A very prominent sample reuse technique is cross-
validation. In K-fold cross validation there are K segments into which the data
are divided. We will demonstrate with the case of fivefold cross-validation. The
entire data is divided into five equal groups. Then four of the five groups are used
to train the model and one group is used for testing. This process is repeated five
times where each of the five groups becomes a test set and the other four are
training data. The following diagram shows how this works (Fig. 3.3).
Once the test errors for each of the five folds of cross-validation are deter-
mined, the overall cross-validation error is an average of the five errors. The case
where K 5 N is a special case called the leave-out-one cross-validation (LOOCV)
What exactly does cross-validation estimate? An important concept is that
cross-validation estimates the expected test error. We will now provide some
conceptual underpinnings in order to give more insight into the expected test
error. Both the training and test data can be thought of as having been drawn
randomly from an underlying distribution called the true data distribution. In
Section 1.3 of Chapter 1 we encountered this distribution and denoted it by
pdata(x). The training data can be considered as a random draw from the data
generating distribution, and clearly different training data sets corresponding to
different random draws will result in different estimated model parameters. Thus,
because of sampling variability, the different estimated model parameters will
lead to different predictions b f ðx0 Þ corresponding to the same input x0. Now,
suppose we have a cost (loss) function C(.). For a given training data set T and an
independent test sample (x, y) we can compute the test (generalization) error
which is a function of the predicted parameters. When we have multiple training
data sets then, because different training data sets give different predicted
parameters, we will have different test errors corresponding to the different
training data sets. The average of the test errors is the object of interest. As
already mentioned in our discussion of bias-variance tradeoff (in Section 1 of the
current chapter), the sample average is an estimate of the expectation. Intuitively
Fig. 3.3. Fivefold Cross-validation.

speaking, cross-validation measures the expected test error where the expectation
is over all training samples. Since training occurs K times during cross-validation,
each of the training data sets can be considered to be a random sample from the
underlying data generating distribution. Then the average of the test errors over
the K training data sets can be thought of as an estimate of the expected test error,
and this is the quantity that cross-validation computes. Of course, this is likely to
hold if K 5 5 or 10 since in this case the training data sets are likely to be different
from each other. But if K is close to N then the training data sets are essentially
the same.
Technical Detour 3
2. Regularization and Weight Decay

At the core of the issue of model building is the ability of the statistical or
mathematical model to come up with insights and results that accurately explain
the phenomenon under study. This becomes difficult in practice as there are often
multiple models that are being evaluated with different logical explanations of the
phenomenon under study. As such, typically the modeling effort is guided by
philosophical considerations and principles. A key concept in problem solving is
Occam’s Razor (law of parsimony) which states that entities should not be
multiplied without necessity – so when faced with competing hypothesis that
make the same predictions, one should select the one with the fewest assumptions.
In a practical model building scenario with hundreds of millions of rows and
thousands of columns of data, the model building effort focuses on selecting a
parsimonious model over one that has more parameters. In traditional parametric
models there are statistical methodologies that help one chose between competing
models - these work better for smaller and less complex formulations but may not
do as well for problems with much larger, and often unstructured, data sets as is
typical in machine learning applications. Take the example of a linear regression
model with large amounts of data - the estimate of variance is influenced by the
sample size so it is easy to find more and more significant predictors in a model as
sample size increases. This issue gets more complex as we start relaxing
assumptions of linearity and fitting more flexible machine learning models in
practice. One example of a nonlinear problem is a binary choice targeting model
for whether we will successfully acquire a customer or not. The data matrix for
this problem would be factors of the customer and their situation and the attri-
butes of the solution offered to the customer. In this case there can be multiple
combinations of the explanatory variables that can conceivably predict proba-
bility of success in the targeting activity. For business decision making, the need is
to always identify a handful of explanatory variables that can be influenced to
change the outcome in our favor. More complex models are more likely to overfit
the data and regularization techniques are designed to offer a logical approach to
avoiding unnecessary model complexity.
Since overfitting of machine learning models, and the consequent lack of

generalizability, is a pervasive concern, few areas have received as much attention
as regularization. In simple terms regularization can be thought of as the entire
suite of methods that are designed to reduce test error possibly at the expense of
higher training error. Of the many methods of regularization, “weight decay,”
which is theoretically grounded in the ideas behind ridge regression and lasso,
provides a good starting point to provide some intuition for regularization as a
whole. Essentially, weight decay is a penalty on the magnitude of the weights.
Intuitively speaking, by penalizing the magnitude of the weights, we force the
weights to be small and thus ensure that the function is simple. In the extreme
case, the reader will note that a regression with a weight vector of zero essentially
reduces to the simplest model – a straight line with only an intercept. A function
with large weights would tend to pass close to every input point and would
necessarily be highly nonlinear (like a very high dimensional polynomial). Such a
function would “hard fit” the training data (see the right panel in Fig. 3.2). This
could cause it to lose generalizability on the test data. In Fig. 3.4, the estimated
functions are the curves shown. The left panel corresponds to a smaller weight
decay parameter which, as we will soon explain, does not enforce small weights –
i.e., does not cause the weights to decay too much. The right panel with larger
weight decay parameter enforces smaller weights. One can see that the estimated
curve passes closer to point A, for instance, on the left panel than on the right one.
There are many methods of regularization but we will discuss only a few of
them which are most often used in shallow neural networks. Some of the regu-
larization methods (e.g., multitask learning, dropout, etc.) are more suited for
deep networks with many hidden layers, and we will not discuss them here.
Both L2 regularization and L1 regularization (which we will discuss next) are
examples of parameter norm penalty methods which work by controlling model
complexity through the weights in the model. In simple terms, a norm of a vector
is its length. In both cases we penalize the magnitude of the weights of the NN
Fig. 3.4. The Estimated Function (Bold Curve) Passes Closer to

Data Points, Say Point a, with Smaller Decay Parameter (Left Panel) than
with Larger Decay Parameter (Right Panel).
(biases are often not penalized) and the difference lies in how the norm is
measured. In these methods, we add parameter norm penalty to the cost function
that will be minimized. Recall, that in the standard regression the weights are
chosen by: Minimizing [(sum of squared errors)]. In the weight-decay setup with
L2 regularization, weights are chosen by
Minimizing½ðsum of squared errorsÞ 1 lðsum of squared weightsÞ (3.2)
l is a hyperparameter weighting the contribution of the penalty applied to the

norm and is called the weight decay parameter. In L2 regularization the norm used
is the sum of squared weights and the idea comes from the methods of ridge
regression. Essentially, the program in (3.2) does not allow the weights to become
too large. How much emphasis the program puts on not allowing the weights to
be too large depends on the pre-determined value of l. As l increases, the penalty
term becomes more important. Often l is chosen by cross-validation.
How does weight decay via L2 regularization work? We will illustrate with the
regression context with the squared error cost C(u). To set a benchmark, consider
the cost C(u) without weight decay. We will focus on a single weight w, ignoring
the subscripts and superscripts on the weights for expositional ease. From
Section 1 of Chapter 1, one gradient descent step would update a weight w as:
w→w 2 g:½Gradient of cost function at w, where g is the learning rate. Now
suppose we minimize the penalized cost. We show in “Technical detour 4” that
in this case one gradient descent step would update the weight as:
ð1 2 2glÞw 2 g:½Gradient of cost function at w (3.3)
Thus, at each step of gradient descent the weight decays (is rescaled) by a
factor of (122gl). Notice that the factor for weight decay is a function of two
important hyperparameters – the learning rate g and the weight decay parameter
l. For many values of g and l, the factor (122gl) is less than one, and the
amount of decay increases as g and/or l increases.
While the above shows how the weights are affected by regularization at each
step of gradient descent, the effect of L2 regularization (weight decay) after full
training is over has a desirable feature. It can be shown that L2 regularization
shrinks the weights more along directions in which the weights do not contribute
much to reducing the cost function during training. Said differently, the weights
are not decayed much in directions along which these weights contribute a lot to
reducing the cost function. This is desirable since the goal of training is to
minimize the cost function, and directions along which the weights do not reduce
cost function much are “unimportant” directions. It is along these unimportant
directions that L2 regularization decays the weights
L1 regularization is the other common parameter norm penalty. As in L2 regu-
larization, here too we penalize the magnitude of the weights of the NN. While L2
regularization uses the squares of the weights as the norm (magnitude), in L1
regularization we use the absolute value. In the statistics literature L1 regulari-

zation has been analyzed as the Lasso. As shown in “Technical detour 4,” in L1
regularization the weight decays by a constant independent of the magnitude of w.
Compare this with L2 regularization where the decay is proportional to w. Thus,
when a weight has a large (small) magnitude, L2 regularization shrinks it more
(less) compared to L1 regularization.
Technical Detour 4
A major advantage of L1 regularization is its ability to do feature selection which

is a consequence of its sparsity property. Sparsity is an aspect of L1 regularization
where some regularized weights can become exactly zero. Since weights multiply
input features (input nodes) and hidden nodes, therefore a weight of zero corre-
sponding to an input feature is equivalent to that input feature not playing any role
in determining the output from the neural network. In that sense, we can say that this
feature can be safely ignored and this is tantamount to feature selection. In contrast,
with L2 regularization the regularized weight is a fraction of the un-regularized
weight, but doesn’t disappear completely as long as the un-regularized weight is
nonzero. L1 regularization is the recommended approach when there are many
features and a sparse model which uses only a subset of the features is a desirable
goal. For details on the sparsity property of L1 regularization see “Technical detour
5” in the appendix.
Technical Detour 5
2.3 L1 and L2 Regularization as Constrained Optimization Problems

L1 and L2 regularization can be seen through the lens of constrained optimization.
Recall from (3.2) above that the weights are chosen by: Minimizing [(sum of
squared errors) 1 l (sum of squared weights)]. This expression makes it clear that
there are two forces at work. On the one hand the minimizing program tries to
obtain the best fit to the training data due to the presence of the cost term: (sum of
squared errors). On the other hand, the program is not allowed to choose weights
that are too large due to the presence of the penalty term: (sum of squared
weights). So, this is like a constrained optimization, where the main objective of
choosing weights that minimize the cost, (sum of squared errors), is constrained
by the requirement that the weights are not too large. As usual, the presence of the
constraint will imply that the final cost with the constraint will be larger than the
cost without the constraint. This is because the program has to pay attention to
conflicting goals, and so cannot focus entirely on the goal of minimizing the cost.
The parameter l governs the extent to which the optimization program pays
attention to the constraint of having small weights.
Technical Detour 6
An advantage of the constrained optimization approach is that it allows us to

explicitly specify the constraint region for the weights. This can be useful if we
have some prior idea of which constraint regions would be more fruitful to search.
The weight decay approach in Section 2.1 does not motivate the researcher to
think in terms of constraint regions, and hence the constrained optimization
approach, by focusing on constraint regions, may sometimes be a more fruitful
avenue. Of course, the constrained optimization route is equally applicable to L1
regularization with the appropriate changes in the constraint. Constrained opti-
mization when applied to L1 case gives us another view of sparsity. If the constant
which constrains the parameter region (the “t” in “Technical detour 6”), is made
sufficiently small it will cause some of the weights to be zero.
2.4 Regularization through Input Noise

Another important concept is that inducing some noise, with infinitesimal vari-
ation, at the input stage is equivalent to L2 regularization of the weights. So see
this most cleanly consider a regression with a single input variable xi, where i 5
1,…, N are the N training input data points. Each input data point xi has a
N
corresponding target yi. The quadratic cost function, C 5 + ðf ðxi Þ 2 yi Þ2 . Sup-
i51
pose that we add a small noise (error), «i, to the input each time input xi
is presented.1 Thus, the cost is perturbed by the input noise since xi is replaced by
xi 1 «i. It can be shown that the optimal f (xi) that minimizes this perturbed cost is
f ðxi Þ ¼ yi 1 ðnegligible termsÞ (3.4)
If the function f (xi) is linear in xi then it can be shown that the perturbed cost
function with input noise can be written in the form of (3.2). Thus, training with
input noise is equivalent to L2 regularization of the weights. It can be shown that
a similar insight holds for regressions with multi-dimensional inputs since the
logic proceed exactly as above by using gradients instead of the single valued
derivatives. For details on the relationship between input noise and L2 regulari-
zation see “Technical detour 7” in the appendix.
Technical Detour 7
1
Each time input xi is presented, a random vector «i is added where E½«i 5 0; E½«2i 5 s2
and further the «i’s are uncorrelated.
2.5 Regularization through Early Stopping

Another powerful method of regularization is to stop training early. Intuitively,
the training data is a random sample from an underlying “true” data generating
distribution. The cost (loss) function is constructed using the training sample data
since we have no access to the population data from the true distribution. Thus,
the optimal weights obtained by minimizing the cost function constructed from
the training sample will almost definitely not be the true optimal weights obtained
by minimizing the true cost function (based on the (infinite) population data). The
effect of training for a longer period will ensure that the training algorithm will
more closely approach the minimizer of the training cost and thereby miss the
minimizer of the true cost. This will have the effect of reducing training error but
increasing validation error. Fig. 3.5 shows a schematic of this discussion.
Technically, it can be shown that if certain plausible conditions hold then the
training time (number of iterations of gradient descent) plays a role that is the
inverse of the weight decay coefficient of L2 regularization.
2.6 Regularization through Sparse Representations

This type of weight decay is most often used in neural networks and deep learning
models. While weight decay adds a penalty on the model parameters, another
effective technique for regularization is to add a penalty on the activations at the
nodes. Usually hidden nodes are chosen for this type of regularization. The idea is
to have only a small subset of the nodes (neurons) being activated for any given
data point (data instance). Since the signal that is transmitted from a node to the
next layer is a product of the node activation and the weights connecting that
node to all the nodes in the next layer, one can see that having an activated node
but suppressing the weight (as in weight decay) would have a similar effect as
suppressing the activation itself while leaving the weights unchanged.
Fig. 3.5. Training for Too Long (Too Many Epochs) Can Raise
Validation Error.
Similar to how sparsity for weights is achieved by imposing a L1 penalty on the

weights, sparsity of the representation is achieved by imposing a L1 penalty on the
activations, mostly in the hidden units.2 This approach requires a slight modifi-
cation to the backpropagation algorithm, but fundamentally the algorithm
remains unchanged. Essentially, gradient flow to the output nodes from the
hidden node has an additional term (the one that involves l)
2.7 Regularization through Bagging and Other Ensemble Methods

Various methods of model averaging can also play the role of regularization in
that they may counter the tendency of overfitting to training data. The basic idea
for why model averaging may work is that different models are likely to make
different errors and so systematic errors are less likely to creep in when we use
many different models and take their average. Importantly, we need to ensure that
the errors of the various models are uncorrelated for model averaging to be useful.
This is an important idea and the details are given in the “Technical detour 8.” A
very common method for model averaging is called bagging (bootstrap averaging)
and, due to its importance and wide applicability, we will consider it in a separate
chapter.
Technical Detour 8
TECHNICAL APPENDIX
Technical Detour 1
Instead of a formal proof, we will provide a sketch of the arguments for why a
linear multiple regression requires more training data points than parameters.
Suppose a model tries to learn the function y 5 w0 1 w1 x1 1 w2 x21«. This
model has three parameters w0, w1, and w2. Suppose, further, that there are only
two training data points: (x11, x21, y1) and (x12, x22, y2). Note that the second
index i in xji (j 5 1, 2) denotes the observation i among training data points. The
cost function that should be minimized is
C ¼ ½y1 2 ðw0 1 w1 x11 1 w2 x21 Þ2 1 ½y2 2 ðw0 1 w1 x12 1 w2 x22 Þ2
The reader should note that the only unknowns in the cost function are the
weights w0, w1 and w2, and the function should be minimized to determine the
2
Assuming there are M hidden nodes, h1,…, hM, then mathematically speaking the
M
~
objective to be minimized becomes CðuÞ 5 CðuÞ 1 l + jhm j.
m51
values of these weights. Clearly, the cost function would be minimized if we could
find weights such that
y1 ¼ w0 1 w1 x11 1 w2 x21 ;
and
y2 ¼ w0 1 w1 x21 1 w2 x22
Then the cost would be exactly zero. In particular, let us select any arbitrary
b 0 . Then the above system of equation becomes
value for w0, say, w
b 0 ¼ w1 x11 1 w2 x21 ;
y1 2 w
and
b 0 ¼ w1 x21 1 w2 x22
y2 2 w
This is a system of two linear equations in two unknowns, w1 and w2, and they
can be solved simultaneously to get the unique solutions w b 2 . Now, we will
b 1 and w
get different values for w b 2 depending on the chosen value of w
b 1 and w b 0 . Since w
b0
was chosen arbitrarily, therefore an infinite number of solutions exist that will
give zero error on the training data.
Technical Detour 2
For a formal derivation of the bias-variance tradeoff we follow Hastie, Tibshirani,
and Friedman (2009, p. 223). Consider a regression problem Y 5 f (X)1«, with E
[«] 5 0 and variance VarðeÞ 5 s2« . We consider an input point x0 and calculate the
expected prediction error of a regression fit b
f ðx0 Þ at that point using the squared
error loss.
E½ðY 2 b
f ðx0 ÞÞ2 jX ¼ x0 ¼ E½ðf ðx0 Þ 2 b
f ðx0 ÞÞ2 1 s2« (A3.1)
Now the term E½ðf ðx0 Þ 2 b 2

f ðx0 ÞÞ can be rewritten as
2
E½ðf ðx0 Þ 2 b
f ðx0 ÞÞ2 ¼ E½b
f ðx0 ÞÞ2 2 2ðE½b f ðx0 ÞÞ2 1 ðE½b
f ðx0 ÞÞ2 1 ðE½b
f ðx0 ÞÞ2
2 2f ðx0 ÞE½b f ðx0 Þ 1 f ðx0 Þ
2
where the last two terms have been taken out of the expectation keeping in mind
that f (x0) is a constant with respect to the expectation. Thus, the right-hand side
of the previous equality is
2
E½b
f ðx0 ÞÞ2 2 2b
f ðx0 ÞE½b
f ðx0 Þ 1 ðE½b
f ðx0 ÞÞ2 1 ðE½b
f ðx0 ÞÞ2 2 2f ðx0 ÞE½b
f ðx0 Þ 1 f 2 ðx0 Þ
2
f ðx0 ÞÞ2 2 E½b
¼ E½b f ðx0 Þ2 1 ðE½b
f ðx0 Þ 2 f ðx0 ÞÞ2 (A3.2)
One can recognize thatðE½b f ðx0 Þ 2 f ðx0 ÞÞ2 5 Bias2 ðb

f ðx0 ÞÞ. This is because the
b
bias, defined formally as: Bias 5 ðE½f ðx0 Þ 2 f ðx0 ÞÞ, measures by how much the
expectation of the estimate, that is E½b f ðx0 Þ, differs from the true mean f ðx0 Þ.
2
f ðx0 Þ 2 E½b
Also E½b 2
f ðx0 Þ 5 Varianceðb
2
f ðx0 ÞÞ, and so
E½ðf ðx0 Þ 2 b
f ðx0 ÞÞ2 ¼ Bias2 ðb
f ðx0 ÞÞ 1 Varianceðb
f ðx0 ÞÞ (A3.3)
Putting (A3.1) and (A3.3) together, the expected prediction error

E½ðY 2 b
f ðx0 ÞÞ2 jX ¼ x0 ¼ s2« 1 Bias2 ðb
f ðx0 ÞÞ 1 Varianceðb
f ðx0 ÞÞ
¼ Irreducible error 1 Bias2 1 Variance (A3.4)
Technical Detour 3
We give a formal definition of cross-validation. Suppose the data set (xi, yi) is of
size N, that is i 5 1,.., N. We are doing a K-fold cross validation,
2k
so the data has
been randomly divided into K groups. Denote by b f ðxÞ the fitted function
computed with kth group of data removed. Then the cross validation estimate of
the expected test error is
1 K 2k
+ ½Test error over test data x in group k using b
f ðxÞ as the predicted value of x
K k¼1
As mentioned in the chapter, an important concept is that cross-validation

estimates the expected test error, where the expectation is over training samples
that are random samples from the underlying data generating distribution. Sup-
pose we have a cost (loss) function L(.), then for a given training data set T and an
independent test sample (x, y) we can compute the test (generalization) error:
ErrT 5 E½Lðy; bf ðxÞÞjT
Cross validation essentially computes the expected test (generalization) error
where the expectation is over training samples
Err ¼ E½ErrT
Technical Detour 4
Weight Decay in L2 Regularization
In the usual estimation program where the weights are not penalized we minimize
the cost function C(u). When we penalize the magnitude of the weights of the NN
(biases are often not penalized) we add a parameter norm penalty J(u) to the cost
function that will be minimized
~
CðuÞ ¼ CðuÞ 1 lJðuÞ (A3.5)
l is a hyperparameter weighting the contribution of the penalty applied to the

norm and is called the weight decay parameter.
In L2 regularization specifically the norm used is the sum of squared weights
J(u) and the idea comes from the methods of ridge regression. How does weight
decay work? We will illustrate with the regression context with the squared error
cost C(u). To set a benchmark, consider the cost C(u) without weight decay. We
will focus on a single weight w, ignoring the subscripts and superscripts on the
weights for expositional ease. One gradient descent step would update a weight w
as: w→w 2 g ∂CðuÞ
∂w , where g is the learning rate. Now suppose we minimize the
~
penalized cost CðuÞ. One gradient descent step would update the weight as:
∂ ∂CðuÞ
w→w 2 g ðCðuÞ 1 lJðuÞÞ ¼ ð1 2 2glÞw 2 g (A3.6)
∂w ∂w
The second equation above follows because with L2 regularization, the
parameter norm penalty J(u) has the square of the weight, that is w2. The deriv-
ative with respect to w yields the term 2w. The above expression makes it clear that
at each step of gradient descent the weight decays by a factor of (122gl), which is
less than 1.
Weight Decay in L1 Regularization

As above, the penalized cost function is CðuÞ~ 5 CðuÞ 1 lJðuÞ; where the un-
regularized cost is given in the usual manner. To simplify our exposition, we
will not use superscripts and subscripts on the weights and define the regularized
~
cost as CðuÞ 5 CðuÞ 1 l+jwj. The function jwj is discontinuous at w 5 0. For the
w
moment, lets hold that aside and focus on w 0. Then the gradient of the
∂ ~ ∂
regularized cost function is ∂w CðuÞ 5 ∂w CðuÞ 1 lsignðwÞ. Thus, the weight update
at any given step of gradient descent is
∂
w→½w 2 glsignðwÞ 2 g CðuÞ (A3.7)
∂w
Note that, unlike L2 regularization, in the above expression the weight
decreases by a constant amount glsignðwÞ, since the sign of w is a fixed constant
(either 11 or 21).
Technical Detour 5
We follow the treatment of James, Witten, Hastie, and Tibshirani (2013, p. 225)
to analyze the sparsity property. Since the optimal weights with the L1 regularized
cost function does not have convenient analytical solutions, we will consider a
special case that is enough to make the central point of sparsity. We consider the
linear regression context with one output where we have f (1) as an identity acti-
vation function and there is only one hidden node M 5 1. Then, ignoring the
N p
intercept, the un-regularized cost function is CðuÞ 5 + ðyi 2 + wl xil Þ2 . To
i51 l51
simplify further, suppose the number of training data points N and the dimension
of the input data vector p are equal, i.e., N 5 p. So, the data matrix X is a square
matrix and suppose it is a diagonal matrix with 1s on the diagonal and 0s
p
elsewhere. Then the un-regularized cost simplifies to CðuÞ 5 + ðyl 2 wl Þ2 . The
l 51
optimal weights in the un-regularized problem can be easily obtained as
b unregularized
w l ¼ yl (A3.8)
p p
~
Now the regularized cost is CðuÞ 5 + ðyl 2 wl Þ2 1 l + jwl j. We need to take
l51 l51
~
the derivative of CðuÞ with respect to wl. However, jwl j is discontinuous at 0 and
so we appeal to the method of subgradients. It can be shown that the optimal
regularized weights are
8
>
> l l
>
> yl 2 yl .
>
> 2 2
>
<
l 2l
b regularized
w ¼ yl 1 yl , (A3.9)
l >
> 2 2
>
>
>
> l
>
:0 jyl j #
2
At least in this simplified scenario we can see from (A3.8) and (A3.9) how the
un-regularized and regularized weights compare. Importantly, from (A3.9) we can
clearly see the sparsity property of the L1 regularization. If the absolute value of
yl, which is the least squares estimator of the weights, is less than l/2 then the
regularized weights are shrunk entirely to zero. This is the feature selection
property of L1 regularization. While we have taken a very special case for the
above analysis, the main ideas behind the sparsity of L1 regularization are
essentially the same for the more complex cases.
Technical Detour 6
We sketch below how L1 and L2 regularization can be seen as constrained opti-
mization. For instance, the L2 regularization program can be recast as
argminCðuÞ (A3.10)
ð1Þ ð2Þ
wml ;wkm
subject to JðuÞ # t
We can write the Lagrangian corresponding to (A3.10) as L(u, l) 5 C(u) 1 l

(J(u) 2 t). As is usual in solving constrained optimization problems, the mini-
mization of the Lagrangian is done with respect to both the weight vector u and
the parameter l. The correspondence between the constrained optimization
approach and the weight decay approach outlined in Section 2.1 can be seen if we
have a fixed l 5 l* in the Lagrangian for the constrained optimization. In that
case, minimizing the Lagrangian is equivalent to minimizing L(u, l*) 5 C(u) 1
l* J(u) over u only. Thus, this objective function is the same as that in L1 and L2
regularizations.
Technical Detour 7
We consider a regression with a single input variable xi, where i 5 1,…, N are the
N training input data points. Each input data point xi has a corresponding target
N
yi. The quadratic cost function, C 5 + ðf ðxi Þ 2 yi Þ2 . Suppose each time input xi
i51
is presented, a random vector «i is added where E½«i 5 0; E½«2i 5 s2 and further
the «is are uncorrelated.
When input error is incorporated in the cost function, then the quantity that we
N
~ is the expectation of + ðf ðxi 1 «i Þ 2 yi Þ2 . We note that
minimize, denoted by C,
i51
we have added noise «i to the input xi. The expectation is necessary since «i is a
random quantity. Thus, the cost function with input error is:
N
~ 5 E« ½ + ðf ðxi 1 «i Þ 2 yi Þ2 . Now, by the Taylor’s expansion we have
C i
i51
2
~ ¼ +
N ∂2 f ðxi Þ ∂f ðxi Þ 1 2 ∂2 f ðxi Þ
C ðf ðxi Þ 2 yi Þ2 1 s2 ðf ðxi Þ 2 yi Þ 1 E«i ð«i 1 «i Þ
i¼1 ∂xi 2 ∂xi 2 ∂xi 2
where we have used E½«i 5 0; E½«2i 5 s2 Expanding the last squared term above
and applying the expectation we have

~ ¼ C 1 + s2 ðf ðxi Þ 2 yi Þ∂ f ðxi Þ 1 s2 ð∂f ðxi ÞÞ2
N 2
C (A3.11)
i¼1 ∂x2i ∂xi
We want the function that minimizes this error. So, using functional differ-
entiation of (A3.11) with respect to f (xi), and making use of the definition of C we
obtain
f ðxi Þ ¼ yi 1 Oðs2 Þ (A3.12)
where is O(s ) is a term of order s. Back substituting (A3.12) into (A3.11) we

2 2
note that the term involving the second derivative of f (xi) in (A3.8) vanishes to
O(s2). Therefore, the effective cost with input errors can be written as
~ ¼ C 1 s2 + ð∂f ðxi ÞÞ2

N
C (A3.13)
i¼1 ∂xi
If the function f (xi) is linear in xi then ∂f∂xðxi i Þ is the weight, and (A3.13) is the
cost function corresponding to L2 regularization of the weights.
Technical Detour 8
Suppose there are n regression models where the error for each model i 5 1,…,n is
«i. The errors are zero mean E½«i 5 0, have variance E½«2i 5 v and covariances
E½«i «j 5 c. The ensemble prediction is just the average prediction, and its
expected squared error is

1 n v ðn 2 1Þc
E ð + «i Þ 2 ¼ 1 (A3.14)
n i¼1 n n
Clearly, if the errors are perfectly correlated, that is, c 5 v, the mean squared
error is v and so there is no reduction of variance by averaging.
Chapter 4
Support Vector Machines in Marketing

and Sales
Chapter Outline
1. Introduction to Support Vector Machines
1.1 Early Evolution
1.2 Nonlinear Classification Using SVM
2. Separating Hyperplanes
3. Role of Kernels in Machine Learning
3.1 Kernels as Measures of Similarity
3.2 Nonlinear Maps and Kernels
4. Optimal Separating Hyperplane
4.1 Margin between Two Classes
4.2 Maximal Margin Classification and Optimal Separating Hyperplane
5. Support Vector Classifier (Nonseparable Case) and SVM
6. Applications of SVM in Marketing and Sales
7. Case Studies
Technical Appendix
Worked-out Illustrations
1. Introduction to Support Vector Machines

1.1 Early Evolution
The methodology that leads to Support Vector Machines (SVMs) was developed
by Vladimir Vapnik and Alexey Chervonenkis in the early 1960s. What these
collaborators had developed was the Vapnik–Chervonenkis theory, a form of
computational learning theory. Computational learning theory approaches
learning processes from a statistical point of view (Vapnik, 1989). The major
business and other practical applications of this approach to learning really began
when this theory was adapted to the ever-present problem of classification. This
happened in the early 1990s through the work of Vapnik and his collaborators
like Bernhard Boser, Isabelle Guyon, and others (Boser, Guyon, & Vapnik, 1992).
While the theory of SVMs has also been applied for regression problems,

doi:10.1108/978-1-80043-880-420211005
arguably its most significant business applications have been in classification. In

this domain, SVMs have been found to be a powerful method for achieving
nonlinear classification using the mathematical apparatus of kernels. SVMs have
been extensively used in text categorization, image classification, and handwriting
recognition. In cases where labeled target (output) variables are not available,
SVMs have also been used for unsupervised learning for segmentation, a very
important part of marketing strategy.
1.2 Nonlinear Classification Using SVM

While SVMs have been used for unsupervised learning and also for regression in a
supervised setting, we will restrict our discussion in this chapter to supervised
classification. Apart from its extensive use in classification in practice, the main
ideas behind SVMs can be most effectively conveyed in this case as well.
Consider a very common context, that of binary classification. Suppose the
input data points xi are each associated with a response variable yi (also called
output or target variable) where each yi can be either 11 or 21. Suppose the input
points xi lie in a two-dimensional space defined by axes (X1, X2), so that each
observation xi 5 (xi1, xi2), i 5 1,…, N. A long-standing workhorse for such
classification problems in marketing is the logistic regression. In a logistic
regression, we model the log odds (or logit transformation) of the target variable
yi.1 The important aspect of the logistic regression to note is that it is linear in the
input variables xi1 and xi2. This is clear from the classification function, f (xi) 5
b01 b1 xi11b2 xi2, which is linear. The classification rule for a logistic regression
is given in “Technical detour 1.”
Technical Detour 1
We will now show a how a linear classifier is very restrictive in many realistic
cases. Consider below a fragment of a much larger data set.
The data plot is given below (Fig. 4.1):
Clearly, no linear classification function can distinguish between the “1” and
the “2” classes. If we apply the logistic regression classifier to these data set, we
obtain the following classifier (bold line) (Fig. 4.2):
In fact, all points have been classified as “2,” and so the performance of
logistic regression is very poor. The reason that all points have been classified as
“2” is because there are more “2” data points in the training data set, and,
therefore, the misclassification rate is lower by “playing it safe” and classifying
everything as “2.” We have discussed this aspect of classifiers in more details in
The log odds is given by log[pi/(1 2 pi)], where pi is the probability that yi 5 11.
1
Support Vector Machines in Marketing and Sales 87
Fig. 4.1. Plot of Two Nonlinearly Separable Classes.
Fig. 4.2. Poor Classification of “1” and “2” Classes Using Logistic
Regression.
Fig. 4.3. Very Good Classification Accuracy Using Support Vector

Machine.
chapter 1 under assessment methods for classification tasks (see percent correctly
classified in Section 2).
As a comparison, we have done classification on these data set using SVM.
The plot of the classification of the same data using a standard SVM package is as
follows (Fig. 4.3). The SVM classifier achieves an accuracy of 92%.
We have just seen an illustration of how SVMs are a powerful method of
achieving nonlinear classification. In the next few sections, we provide some
insights into the SVM technique.
2. Separating Hyperplanes
We first define a hyperplane. Quite simply, a hyperplane is the generalization of a
line when we operate in a higher dimensional space. Thus, in two dimensions, a
hyperplane is just a line. Formally, if we are on a p-dimensional space, then
a hyperplane is defined by the equation:
b0 1 b1 x1 1 b2 x2 1 ::: 1 bp xp ¼ 0 (4.1)
If a point xi 5(xi1, xi2,…, xip) in p-dimensional space satisfies (4.1), then we say
that this point lies on the hyperplane. A hyperplane that separates two classes is a
separating hyperplane.
Consider again the binary classification context. As before, the input data
points xi are each associated with a response variable yi (also called output or
target variable) where each yi can be either 11 or 21. In two dimensions, consider
the set of “1” and “2” points which are separable, that is, which can be separated
by lines drawn in this space as the figure below shows. Suppose, the “1” points
correspond to yi 5 11 and the “2“ points correspond to yi 5 21.
In the two-dimensional Fig. 4.4, since the “1” and “2” points are linearly
separable we can draw many lines that separate them. Consider any one of these
lines – say the thick bold line. Imagine a point xi with coordinates xi 5(xi1, xi2).
Fig. 4.4. Linearly Separable Points “1” and “2”
Let the equation of the thick bold line in this two-dimensional space be:
b0 1 b1 x1 1 b2 x2 5 0. Suppose, the point xi lies exactly on this thick bold line. If
we substitute the values of xi1 and xi2 in the equation of the thick bold line, we will
get a value of zero: b0 1 b1 xi1 1 b2 xi2 5 0. However, for points that lie strictly
above or below this line, the corresponding expression will either be a positive
value or a negative value. In fact, for points in the “1” category, we will have a
positive number, and for points in the “2” category, we will have negative
number. The same logic carries over to a p-dimensional space. Said differently,
the sign of the expression on the left-hand side of (4.1) can be used for classifica-
tion. “Technical detour 2” provides more details.
A major drawback of this approach to classification is that there could be
many separating hyperplanes, as is also clear from the figure above. This lack of
uniqueness is undesirable, and we address the question of uniqueness of the
separating hyperplanes in Section 4.
Technical Detour 2
As a historical note, learning models that form linear combinations of inputs

xi 5(xi1, xi2,…, xip) and output the sign are called perceptrons. Perceptrons were
the early progenitors of many of the neural network models that we use today.
Executive summary
Classes that can be separated by a line are called linearly separable. In two
dimensions, the linear separator is the usual straight line. The generalization
of a linear separator to higher dimensions is called a separating hyperplane. A
separating hyperplane is a function of input points. The sign of the function,
with a given any input point as its argument, can be used to classify that input
point.
(Continued)
A major drawback of this way of classifying linearly separable classes is
that there is no unique separating hyperplane – in fact, we can find an infinite
number of separating hyperplanes.
3. Role of Kernels in Machine Learning

A fundamental requirement for machine learning models is the measurement of
similarity between two objects. Consider, for example, the classic problem of
binary classification where an object has to be assigned to one of two classes.
Suppose the object in question is a potential customer, and the company wants to
classify this potential customer as “buy” or “not buy.” Since the potential
customer is “new,” the company does not yet have any observation about whether
the class of the potential customer is “buy” or “not buy,” and, thus, has to make a
prediction. An obvious means for doing this is to look into the company’s
database and to study the behavior of the company’s existing customer base. If
customers who are similar to the focal new customer are buyers, then it is plau-
sible to also classify the new customer as a buyer. This leads to the question of
how the company could determine the similarity of two customers. The company
could record some observed variables, such as the customers’ demographic,
behavioral, and psychographic characteristics, and use these to measure the
similarity between customers.
At a very general level, a kernel measures the similarity between objects. Among
the demographic, behavioral, and psychographic variables, some will be such that
the similarity between customers on those variables is easy to measure. For
instance, the similarity in income and gender (using a coding of 1/0 for male/
female) is easy to measure. However, suppose that the company wants to also
mine the information embedded in emails sent by the customers. This could be
done by keyword searches to uncover specific phrases, questions about product
quality, price queries, delivery schedules, financing terms etc. How can one
measure the “similarity” of two emails? This is facilitated by defining a kernel that
measures the similarity of two documents. As a related problem, one can imagine
the problem of measuring the similarity of two character strings. Specifically,
consider two strings x and x9 of lengths l and l9.
String x: JDHFOVN VGQDFOFPBMDFTOVPSVSHY0MDGFK
String x9:
KGKRPWNFCRKTRPQLNVHSFWHFKIPTJWOVPGAKMDLHODD
How should one measure the similarity between the strings x and x9? We notice
that both have the substring “OVP” in common. Thus, one solution would be to
enumerate the number of common substrings and to use that as part of a measure
of similarity between strings. Even more complex would be an analysis of voice
mails left by the customers, in which nonverbal communication characteristics
like pitch, tone, speech rate, accent etc., would be aspects of a measure of simi-
larity between two voicemail messages. Plainly speaking, a kernel, denoted by
k(x, x9), is a generalized measure of similarity between objects including complex
objects like text messages, voice, and video etc. Of course, for similarities between
objects that are easy to quantify, such as metric variables like income, weights etc,
the kernel will reduce to the familiar distance between them.
Another power of kernels derives from the fact that they allow us to redefine
nonlinear operations and function as linear operations and function in another
(possibly higher dimensional) space. Since linear operations and linear functions
are easier to manipulate mathematically and also interpreting them to gain sub-
stantive insights is more straightforward, this is a major benefit of kernels. We will
now provide a more formal treatment of both the “similarity measuring” and
“nonlinear mapping” properties of kernels.
3.1 Kernels as Measures of Similarity

We will use the case of binary classification as a motivating illustration. Suppose,
we have N observed data points (xi, yi), i 5 1,…, N, where xi could lie in
p-dimensional input space X (think of p input demographic, behavioral, and
psychographic variables) and yi is a binary variable with values “11” or “21.”
We can think of xi as an object, perhaps a very unstructured object in multidi-
mensional space. The task we are faced with is to classify a new object x as either
lying in class “11” or “21,” that is, to determine whether the corresponding y is
“11” or “21.” To accomplish this, we will need to define the similarity measure
between the new point (x, y) and the observed data points (xi, yi), i 5 1,…, N.
Since the only values for the response variables yi and y are “11” or “21,” which
makes the measurement of their similarities quite straightforward, the real
problem lies in defining similarity measures for the input objects xi and x. If the
inputs are indeed similar, then we can consider the responses to be similar as well
following the intuitive dictum that “similar inputs should lead to similar
responses.”
A very simple and well-known measure of similarity of two vectors is the dot
product, also called the inner product, which has a simple geometric interpretation.
For our purposes, the reader should just keep in mind that the inner product of two
vectors is proportional to the angle between the vectors.2 The figure below (Fig. 4.5)
shows a two-dimensional space (p 5 2) with two vectors x5(x1, x2) and x95(x91,
x92). Clearly, the angle between them is a measure of the similarity between the
vectors.
We have just seen that the inner product of two vectors is proportional to the
angle between them. The constant of proportionality depends on the norm of
the two vectors. Intuitively speaking, the most commonly used norm, which is the
Euclidean norm, is just the square of the length. So, the Euclidean norm of x5(x1,
x2) is the square of the Euclidean distance from the origin (0, 0) to the point (x1,
x2). Hence, the inner product allows one to carry out mathematical computations
that involve the geometric concepts of angles, lengths, and distances. In the simple
case where there is a concrete representation of the input object x as a vector, the
2
To be precise, the inner product of two vectors is proportional to the cosine of the angle
between them.
Fig. 4.5. The Inner Product Is a Measure of the Angle between Two
Vectors.
simplest kernel is merely the inner product which, as a matter of notation, is

written in angled brackets:
kðx; xÞ ¼ Æx; x= æ (4.2)
Now, not all input objects of interest in machine learning have a straightfor-
ward representation in terms of vectors. One may be interested in similarity
measures in more general objects like strings, sentence structures, documents,
trees etc. The concept of kernels can be used in these more general cases too. For
instance, a very popular kernel used for measuring the similarity of two docu-
ments is based on the idea of using the cosine as a similarity measure, and,
therefore, uses the inner product. Some details of the above discussion are in
“Technical detour 3.”
Technical Detour 3
We will now present a simple algorithm that can accomplish binary classifi-
cation using only inner products, i.e., kernels, in the case where the input objects
can be represented as vectors. Consider N data points (xi, yi), i 5 1,…, N, where
input vectors xi 5(xi1, xi2) lie in two-dimensional space and yi is a binary response
variable with values “11” or “21.” A subset of the N points have responses with
yi 5 11 values, and the rest have responses with yi 5 21 values.
In the figure Fig. 4.6 suppose the mean over inputs xi for the response class yi
5 11 be denoted by x1 and the mean over inputs xi for the response class yi 5 21
Fig. 4.6. A New Point x Is Assigned to Class. Whose Class Mean Is

Closer to It.
be denoted by x2. In other words, these vectors are the centroids of the “1” and
“2” classes. The point halfway between the class means is xc 5 (x1 1 x2)/2.
Consider the problem of classifying a new point x 5(x1, x2) in either class “1”
or class “2.” The most intuitive way to do so is to assign this new point to that
class whose class mean is closer to that point. In geometrical terms, as can be seen
from the figure above, this is equivalent to classifying it in class “1” if the angle
between the vectors “x 2 xc” and “x1 - x-” is less than 90o, and to classify it in
class “2” if this angle is greater than 90o. These vectors are the two bold directed
arrows in the figure. Therefore, geometrically speaking, the classification can be
done simply by comparing angles! Now, we note that the angles are related to inner
products. Therefore, the inner product can act as a classifier. Further, since the
cosine of an angle greater than 90o is negative and cosine of an angle less than 90o
is positive, hence we can use just the sign of the inner product for classification.
Specifically, it can be shown that the classifier is:
y ¼ signðÆx 2 xC ; x 1 2 x æÞ (4.3)
Finally, we note that the mean vectors x1 and x2 together involve all the data
points in the training data set. This is so because the vector x1, being the centroid
of the “1” class is computed using all data points in that class. Similarly, the
vector x2 is computed using all data points in the “2” class. Hence, the inner
product-based classifier involves the entire training data which implies that the
computational complexity of SVM increases proportionately with the size of the
training data set.
As noted in Section 3, the separating hyperplane that classifies two linearly
separable classes is not unique. A fundamental result in the field of SVMs shows
that the unique optimal separating hyperplane can be constructed using only
kernels and is similar in form to expression (4.3) above, written as a linear

combination of all the training data points. The details of the above discussion are
presented in “Technical detour 4.”
Technical Detour 4
3.2 Nonlinear Maps and Kernels

In more complex cases, however, the simple inner product-based classification
described in the previous section is not general enough to achieve classification.
To appreciate why this is so, we start by noting that an inner product, say Æx; x= æ,
is linear in its arguments x and x9. However, we can have situations where the
classes cannot be separated linearly but require nonlinear functions of inputs x to
separate them. In this connection, we recall from Section 1.2 that the standard
logistic regression models can handle only linearly separable data. Some variants
of logistic regression can handle nonlinearly separable classes. Such nonlinear
logistic regression models use interaction terms and terms of higher orders of
explanatory variables in the function f(xi). However, such an approach requires
the analyst to explicitly incorporate the type of nonlinearity, which is often not
obvious a priori, in the function f(xi). The kernel-based methods that are being
discussed here can accommodate nonlinearities much more naturally. In the
kernel-based methods, we can define the kernel in terms of nonlinear maps f(x) of
the inputs x. Thus, one can generalize (4.2) and define the kernel:
kðx; xÞ ¼ ÆfðxÞ; fðx= Þæ (4.4)
The map f(.) transforms objects from the input space, denoted by X, to an
entity called the feature space denoted by F. Thus, if x is in X, then f(x) is in F.
This feature space is usually of a higher dimension than the input space and is
such that it allows computations of inner products of two vectors f(x) and f(x9)
that lie in it.
We will provide a simple intuition for how kernels perform the task of
incorporating nonlinearities. Consider a two-dimensional input space. Thus, each
input point x has coordinates x5 (x1, x2). By definition, the inner product of two
input points x and x9 in input space, given by Æx; x= æ, is linear in x. Therefore, it
involves only linear terms of the coordinates x1 and x2. Now consider the
quadratic expression given by the square of the inner product: Æx; x= æ2 . One can
easily guess that, in the input space, this nonlinear expression will involve squares
and product terms like x12, x22, and x1x2. Define a three-dimensional feature space
whose axes correspond to these three terms involving squares and products of x1
and x2. The critical thing to realize is that if the axes of a 3-D feature space are
these squares and products of x1 and x2, then the expression Æx; x= æ2 is just a linear
combination of these axes. That is, Æx; x= æ2 is linear in the feature space. Finally, let
f(.) be a map which takes the 2-D input vector x5 (x1, x2) and maps it to a 3-D
vector involving squares and products of x1 and x2. It can be shown that Æx; x= æ2 ,
which is linear in 3-D feature space, can be expressed as an inner product

ÆfðxÞ; fðx= Þæ in that space (recall that inner products are linear entities). If we
define a kernel as this inner product, then this kernel allows us to transform the
nonlinear expression Æx; x= æ2 in the 2-D input space to a linear expression in 3-D
feature space. We provide more details in “Technical detour 5.”
From this discussion, it is clear that with the use of a nonlinear map f in the
kernel k(x, x9), we can achieve classification even in cases where the classes are not
linearly separable. Thus, even if the decision boundary between two classes is
nonlinear in the input space (x1, x2), in the three-dimensional feature space it
becomes a hyperplane, which is linear by definition.
Technical Detour 5
Executive summary
Kernels are widely used in machine learning. They play two important roles.
First, a kernel is a generalized measure of similarity between two objects – not
just vectors, but strings, sentence structures, documents, trees etc. This
property of a kernel is due to the fact that it is a generalization of an inner
product which is a measure of the angle between two vectors. The angle
between two vectors captures the similarity between them.
Second, a kernel allows us to transform nonlinear functions of vectors in a
lower dimensional input space into linear functions in a higher dimensional
feature space. This task is accomplished by using a nonlinear map that takes
vectors in input space to vectors in feature space. Using this map, a nonlinear
function of vectors in the input space is expressed as an inner product in
feature space. Linearity in feature space follows because inner products are
linear.
The idea that two nonlinearly separable classes of points in input space can be
made linearly separable in feature space, once the input points are mapped to
feature space by a nonlinear map f, is a fundamental concept in SVM. Therefore,
to fix ideas, we present a concrete illustration of this phenomenon.
Illustration 4.1: Nonlinear maps transform nonlinearly separable classes to
linearly separable classes in feature space.
Consider two classes of points in two-dimensional input space (X1, X2).
• The four points in the set{(1, 1); (1, 21); (21, 21); (21, 1)} are labeled as 11.
• The four points in the set{(0.5, 0.5); (0.5, 20.5); (20.5, 20.5); (20.5, 0.5)} are
labeled as 21.
As seen in the plot below, clearly these two classes are not linearly separable in
input space (Fig. 4.7).
Fig. 4.7. Two Classes “1” and “2” Are Not Linearly Separable in
Input Space.
Now, there exists a nonlinear map fðx1 ; x2 Þ such that when the input points
(vectors) are mapped into feature space, then these points are linearly separable in
feature space.3 This nonlinear map takes vectors in two-dimensional input space
and maps them to two-dimensional feature space. Suppose, we denote the axes of
the feature space as W1 and W2.
The input points mapped to feature space are: f(1, 1)5(1, 1); f(1, 21)5 (5, 3);
f(21, 21)5(3, 3); f(21, 1)5(3, 5) for the “1” points. The “2” points are
unaffected by this mapping. Then, the transformed points in feature space are.
Just by “eyeballing” the figure above (Fig. 4.8), one can see that the two classes
are linearly separable once they have been transformed by the nonlinear mapping
f and mapped to the feature space. It is important to note that there could be
many functions that transform nonlinearly separable points so that they become
linearly separable in a different space but not all of these functions are kernels.
For a function to be a kernel, it has to satisfy some special properties, but these
are beyond the scope of this book. Of course, finding the right nonlinear map, or
equivalently, an appropriate kernel, is a nontrivial problem. Moreover, different
kernels can have different feature spaces, even to the point of having feature
spaces of different dimensions. Despite the difficulty of finding the appropriate
kernel, in practice there are some commonly used kernels, and we will discuss
them later. While eyeballing the figure above shows that there could be many
hyperplanes that perform linear separation of the classes in feature space, con-
structing the optimal hyperplane requires more technical machinery. We will
revisit this illustration in Section 4 where we construct the optimal separating
hyperplane.
3
The interested reader can find the definition of the nonlinear map in (A4.25) under
“Worked-out Illustrations” at the end of this chapter.
Fig. 4.8. The Transformed Points Are Linearly Separable in Feature

Space.
The nonlinear map in Illustration 4.1 above did not change the number of
dimensions. The dimensions of the input space and feature spaces were the same.
However, in most practical applications, the nonlinear transformation maps the
input data into a feature space which has more dimensions than the input space.
Often the number of dimensions of the feature space greatly exceeds that of the
input space. To get an intuition of this, let us consider a simple illustration.
Illustration 4.2: Nonlinearly separable data in input space are separable in
higher dimensional feature space.
Consider the seven data points on a single-dimensional input space X1:
23, 22, 2, 3 are in class “2”

21, 0, 1 are in class “1”
Clearly, the classes are not linearly separable in one dimensional space. Now
consider the nonlinear map that transforms the one-dimensional input space to
two-dimensional feature space:
fðx1 Þ ¼ ðx1 ; x21 Þ
For instance, f(23) 5 (23, 9). The others six points can be similarly mapped
into two dimensions. Let us denote the dimensions in feature space as W1 ([ x1)
Fig. 4.9. Classes Are Separable in Higher Dimensional Feature

Space.
and W2 ([ x12). In two dimensions the seven points xi 5 (xi1, xi2) can be
plotted as:
Now, the classes are linearly separable in the higher dimensional space. One
possible separating hyperplane is shown as the bold horizontal line in Fig. 4.9.
In our discussion of a kernel as a generalized measure of similarity in Section
3.1, we mentioned that we can compute the similarity between two input objects x
and x9 even when the objects are not vectors. This is the case with input objects
that are strings, documents, voice recordings, X-ray images etc. In such cases too,
as long as one can use a (possibly nonlinear) transformation f such that f(.)
defines a legitimate kernel, then nonlinearly separable classes of these objects can
be linearly separated after transformation to a higher dimension feature space.
In the machine learning community, one often hears mention of the so-called
kernel trick. While there are many formal definitions of it, for application-oriented
readers it suffices to have an intuitive idea of what this concept is. We state it as
follows.
3.3 Kernel Trick

If a machine learning algorithm can be formulated in terms of inner products
Æx; x= æ in the original input space X, then one can construct an alternate algorithm
by replacing Æx; x= æ with a kernel function k(x, x9).
In Section 3.2, we have started with a feature map f, and assuming that inner
products can be defined on the feature space F, we defined a kernel k(x, x9) as
ÆfðxÞ; fðx= Þæ. It can be shown that one can proceed in the reverse direction as
well. Starting with a kernel k(x, x9), it is possible to construct a feature space F
such that this kernel computes the inner product in that feature space. Details are
in “Technical appendix 6.”
We conclude the section on kernels by listing some of the most commonly used
kernels.
• The kth degree polynomial kernel.

• Radial basis function (RBF) kernel: It is a special case of the Gaussian kernel.
• Cosine similarity kernel: This kernel is extensively used in machine learning for
text data.
• Sigmoid kernel: This kernel is used in neural networks.
Detailed expressions of these kernels are also in “Technical appendix 6.”
Technical Detour 6
4. Optimal Separating Hyperplane

Consider two classes that are linearly separable. From Section 3, we recall that a
major drawback of using separating hyperplanes is that there could be an infinite
number of such hyperplanes that do an equally good job of separating the two
classes. One method for obtaining uniqueness of the separating hyperplane is to
appeal to the concept of a margin.
4.1 Margin between Two Classes

Before formally defining a margin, let us provide an intuition for how the concept
of a margin could achieve uniqueness of the separating hyperplane. In the figure
below it is clear that, seen from the vantage point of the East–West axis, the “1”
and “2” classes can be separated by many separating hyperplanes – like the three
dotted lines shown here just for illustration. However, seen from the vantage point
of the North–South axis, there seems to be a sense in which we can have
uniqueness in how the two classes can be separated! The key lies in identifying the
two points, one from each class, such that they are closest among all such pairs of
points. These two points are used to define a measure of separation. In the figure
below, they are point xa for the “1” class and xb for the “2” class (Fig. 4.10). The
concept of a margin builds on this simple, but powerful, insight.
The margin of the point (xi, yi) to the hyperplane b0 1 b1 x1 1
b2 x2 1 ⋯ 1 bp xp 5 0 is the distance from xi to the hyperplane if the point (xi, yi)
is correctly classified. For a set of correctly classified points (x1, y1),…, (xN, yN)
the margin is the minimum distance from this set to the hyperplane. This is shown
in the figure below where we show the margin for the set of correctly classified
“1” points (Fig. 4.11). The minimum distance to the hyperplane for the “1”
points corresponds to point xa, and this point defines the margin. When the set of
points is not explicitly mentioned, we will assume that we are referring to the
training data set.
Now, from Section 2 of this chapter, we know that the sign of the expression
on the left-hand side of (4.1) is positive for points in the “1” category and
Fig. 4.10. Separability between Classes Seen from Two Vantage

Points: East–West and North–South Axes.
Fig. 4.11. The Margin of the “1” Points from Hyperplane (Bold
Line).
negative for points in the “2” category. Also, the class label for the “1” category
is yi 5 11, and the class label for the “2” category is yi 5 21. It can be shown
that these two facts together imply that for correctly classified points, the margin
is just the geometrical distance of the separating hyperplane from that point, and
for incorrectly points, the margin is the negative of the geometrical distance.
Technical Detour 7
4.2 Maximal Margin Classification and Optimal Separating Hyperplane

The margin provides us a way to obtain a unique optimal separating hyperplane.
This is the hyperplane that maximizes the margin, and hence, maximizes the
distance between the closest points from two classes. To get an intuition, a
visualization of this idea is helpful. The figure below shows that, when the classes
are linearly separable, different separating hyperplanes, depending on their
orientation, define different margins (Fig. 4.12). It is clear from the figure that the
“best” separation corresponds to the hyperplane that defines the widest margin.
Given our motivation of the “best” separator as one that has the widest
margin, we can describe the optimization program, also called the maximal
margin classification, as:.
objective function : Choose b0 ; b1 ; …; bp ; to maximize the margin M
(4.5)
Constraint: All points are at least a distance M away from the hyperplane
Just as in the standard Ordinary Least Squares (OLS) regression, the goal of
estimation is to estimate the coefficients of the linear regression equation; here
too, the goal is to estimate the coefficients of the hyperplane. This is done such
that the coefficients are chosen so as to maximize the margin. It is noteworthy that
there are as many constraints as there are training data points since there is one
constraint corresponding to each data point in the training data.
The power of this formulation comes from the fact that the optimal separating
hyperplane can be expressed only in terms of inner products. Specifically, the
optimal separating hyperplane, defined for a generic point x, can be expressed as a
linear combination of terms like Æx; xi æ where xi, i 5 1,…, N, are points in the
training data set. This may remind the reader of our discussion in Section 3.1
where we had sketched out a geometrical argument for how classification can be
achieved by using only inner products. It is satisfying that the geometrical
argument presented in that section is consistent with the more formal machinery
in the current section. The interested reader can look up the technical details
pertaining to the above discussion in “Technical detour 8” in the appendix at the
end of this chapter.
The constraint that all points have to be at least a distance M from the
hyperplane (in the optimization program (4.5)) is satisfied as an equality for the
nearest point(s). That is, the point closest to the hyperplane for a given class will
Fig. 4.12. Same Two Classes Separated by a Wider (Left) and

Narrower (Right) Margin.
Fig. 4.13. Nearest Points from Hyperplane to Both Classes Lie

Exactly on Margin.
lie exactly on the margin. For example, in the figure below the point xa in the
“1”class lies exactly on the margin (Fig. 4.13).
Another interesting property of the solution to (4.5) is that the optimal sepa-
rating hyperplane is fully characterized only by the points (vectors) xi that lie on
the margin. These vectors xi that lie exactly on the margin, and which are the only
ones required to compute the optimal separating hyperplane, are called support
vectors. All input data points that lie inside the margin do not factor in the
computation of the optimal separating hyperplane. In a previous paragraph in
this section, we had mentioned that the hyperplane can be constructed using inner
products of a generic point x with all points xi, i 5 1,…, N, in the training data
set. The fact that the optimal separating hyperplane actually uses only a small
subset of the points in the training data set is an example of the sparsity property
of SVMs.
To get some intuition for this important property of SVMs suppose that, in the
context of churn modeling, we want to classify “nonchurners” and “churners” for
a bank. In the figure below, the nonchurners are represented by “1” and churners
are represented by “2.” We apply the labels yi 5 11 to the former and yi 5 21 to
the latter. Also, suppose the historical database of the bank has a set of p variables
that can be used as explanatory variables. These could include demographic,
psychographic, and behavior variables pertaining to a customer. Thus, customer i
can be represented by the vector xi 5(xi1, xi2,…, xip), and there are N customers
i 5 1,…, N. By definition, the support vectors are the points closest to the
boundary for a given class – e.g., in the figure below, it is point xa for the “1”
class (Fig. 4.14).
Intuitively speaking, we expect that consumers who make the “hard choice” of
whether to churn or not would be the most important in determining what makes
a bank customer churn. This is because in choice situations one can really
understand why people make the choices they do by understanding the choice
Fig. 4.14. Support Vectors, Like Point xa, Represents Consumers

That Make Hard Decisions.
processes of those who make such difficult trade-offs. As a practical matter, the
support vectors identify consumers that make hard choices. In our bank churn case,
these are the customers who are almost indifferent, that is, are closest to the fence,
between churning and not churning. In other words, a point like xa represents
nonchurners who are most at risk to churn. From the point of view of interpre-
tation and practical ramifications, this is very important. As a managerial
implication, the firm’s retention efforts could be targeted at such customers. It is
worth noting that the SVM methodology automatically identifies such support
vectors, and this is standard output of most software packages that perform SVM
analysis. Furthermore, moving away from the margin has an interesting
interpretation.
In the figure Fig. 4.15 as one moves away from the margin toward the
“nonchurners” class, the likelihood of churn decreases. This is because this
Fig. 4.15. Different Likelihoods of Churn When We Move Away

from Separating Hyperplane.
represents a move away from the “churners” class. Similarly, as one moves away
from the margin toward the “churners” class, the likelihood of churn increases.
This is because this represents a move away from the “nonchurners” class.
Technical Detour 8
Executive summary
The concept of a margin is very useful since it allows one to find a unique
separating hyperplane between two classes. The margin of a point to the
hyperplane is the distance from that point to the hyperplane if the point is
correctly classified. For a set of training data points, the margin is the min-
imum distance from this set of points to the hyperplane.
For linearly separable classes, the optimal separating hyperplane is the one
that maximizes the margin M, subject to the constraint that all points are at
least a distance M away from the hyperplane. The SVM technique with the
maximum margin approach has two powerful properties: (1) The computa-
tion of the optimal separating hyperplane requires computing only inner
products, (2) The optimal separating hyperplane is specified only by the
support vectors. Because the support vectors are a subset of all vectors, hence
the SVM is said to have the sparsity property. Geometrically, points that lie
exactly on the margin are support vectors. From a practical point of view,
support vectors correspond to consumers who make “hard choices” – i.e.,
they are almost indifferent between belonging to one or the other class.
We end this section with an illustration. The purpose of this illustration is to

show how the optimal separating hyperplane can be computed using only: (1)
inner products and (2) support vectors. While practical applications of SVMs will
involve a preprogrammed computer software package, for the purpose of building
some intuition for how support vectors enter into the construction optimal
hyperplanes, a simple hand-worked illustration is often useful.
Illustration 4.3: Constructing an optimal separating hyperplane using inner
products and support vectors.
Consider two classes of points in two-dimensional input space (X1, X2).
These points are shown in the plot below (Fig. 4.16).

Clearly, these two classes are linearly separable in input space, but our goal is
to demonstrate how to compute the analytical expression of the optimal separating
Fig. 4.16. Linearly Separable Points in Input Space.
hyperplane using only inner products and the support vectors. We first note, that
just by a visual inspection of the plotted points, the support vectors are: s1 5 (1, 0)
for the “2” class and s2 5(2, 0) for the “1” class. These are circled in the figure
below (Fig. 4.17).
The target values corresponding to these points are y1 5 21 and y2 5 11. Let
us consider a generic point x5(x1, x2). Under “Illustration 4.3” in the “Worked-
out Illustrations” section, we show that the optimal separating hyperplane is
given by:
f ðxÞ ¼ 2 3 2 2 , x; s1 . 1 2 , x; s2 .
The critical aspect to notice is that the optimal separating hyperplane can be
specified by a linear combination of the inner products of the vector x with the
Fig. 4.17. Support Vectors from the Two Classes Are Circled.
Fig. 4.18. Optimal Separating Hyperplane in Input Space.
two support vectors. Expanding the inner products in terms of the components of
the vectors, the optimal separating hyperplane in input space is:
f ðx1 ; x2 Þ ¼ 2 3 1 2x1 :
We can plot the optimal separating hyperplane. Points (x1, x2) that lie exactly
on the hyperplane are characterized by f (x1, x2) 5 0. Therefore, the line 2312x1
5 0 passes through the point (3/2, 0) and is perpendicular to the X1 axis. This
hyperplane is shown below (bold vertical line) (Fig. 4.18):
The reader will note that in the illustration above we based our calculations on
being able to identify the support vectors. In the simple setting above, the support
vectors are easy to identify by eyeballing a plot of the data. In general, with many
data points, as in common in real machine learning applications, it may be
impossible to identify the support vectors merely by plotting the data. Then, we
will have to formally solve the program in (4.5) to obtain the optimal separating
hyperplane. In these cases, for computational purposes, it is often easier to solve
an equivalent problem called the Wolfe Dual program. This program yields the
optimal li, and the points (vectors) xi corresponding to li . 0 are the support
vectors. We will not present these details here, but the interested reader will find it
in the appendix.
Technical Detour 9
5. Support Vector Classifier and SVM

So far, we have considered the case where the two classes are linearly
separable, and in Section 4, we saw how to construct the optimal separating
hyperplane for such data. In many situations, the classes are not separable,
and there is no feasible solution in input space that achieves perfect sepa-
ration using a linear classifier. Consider for instance the situation portrayed
in the figure below:
The points denoted 1 and 2 belong to class “2,” and we can see that the two
classes are not linearly separable. The bold dashed lines on either side of the
hyperplane (bold solid line) are the margin lines. We will use the term “margin
line(s)” for the line(s) at a distance M from the hyperplane, and the distance itself
will be called the “margin width.” Consider point 1. This point is on the wrong
side of the bold dashed margin line (the line at the bottom) and at a distance
of Mj1 from it.4 The variable j1, called slack variable, quantifies the
extent of misclassification of point 1. Since distances from the hyperplane
define margin lines, in essence, we have a “new” margin line (the dotted line)
at a distance M(12 j1) from the hyperplane (rather than at a distance of M). Since
j1 . 0, the new margin width is smaller than M. To the extent that a larger
margin width results in a better separation, the new margin line will achieve
weaker separation. The bold dashed lines on either side of the hyperplane are
often called soft margins since they can be violated by some points. Notice that,
with respect to the bold dashed margin line, point 1 is not misclassified – even
though it is on the wrong side of this margin line, it is on the correct side of the
hyperplane. On the other hand, point 2 is misclassified. The distance of point 2
from the bold dashed margin, Mj2, is greater than M. Thus, its distance from the
hyperplane is M(12 j2), and this is a negative quantity. One can see that, in this
construction, misclassification of point i occurs when ji . 1. At this point, it will
be useful to relate the values of ji to the misclassification of point i.
• If ji 5 0, then point i is exactly on the margin line (bold dashed line in

Fig. 4.19).
• If 0, ji # 1, then point i is on the wrong side of the margin but on the right
side of the hyperplane. So, it is correctly classified.
• If ji . 1, then point i is on the wrong side of the hyperplane. So, it is
misclassified.
From the above, it is clear how the slack variable ji quantifies the extent of
misclassification of point i, and the higher its value, the more severe the
misclassification. Thus, the sum +ji quantifies the total amount of misclassifi-
cation across all data points. i
The major difference in the nonseparable case compared to the separable case
discussed in Section 4.2 is how the constraints in the program for the optimal
separating hyperplane are handled. In the program for the optimal separating
hyperplane in (4.5), we imposed the constraint that all points are at least a
4
In our construction, the distance from the margin line for points on the wrong side of it is
stated as proportional to the margin width M. Equivalently, the distance could be stated in
absolute terms. For technical reasons, having to do with optimization, the former approach
is often adopted.
Fig. 4.19. Classes That Are Not Linearly Separable.
distance M away from the hyperplane. Here, we relax this constraint using the
slack variables. We instead impose the constraint the constraints that all points
are at least a distance Mð1 2 ji Þ away from the hyperplane. This is the soft
margin referred to above. Additionally, in order to control the extent of
N
misclassification, we also impose the constraint that the sum + ji should not be
i51
larger than a prespecified constant C. This constant is determined as a tuning
parameter in SVM. Since ji .1 quantifies the extent of misclassification and
because C . 0, therefore, at the most C points can be misclassified. Therefore, a
large value of C allows for more misclassification of the training data. This, in
turn, may allow for better classification of test data, since the model does not
hard-fit the training data. In this way, the constant C is related to the bias–
variance trade-off and is usually selected via cross-validation.
The support vector classifier is a natural extension of the maximal margin
classifier. Like the latter, it uses a linear boundary, but it allows for some
misclassification when applied to nonlinearly separable classes. Conceptually, it
does this by using slack variables and then controlling the amount of misclassi-
fication using a constraint. As a rough analogy, consider the idea behind fitting a
linear regression to a scatter of points. Since the scatter of data points are usually
noncollinear, a line cannot pass exactly through all the points. Thus, the modeler
allows errors in the fit and then estimates that regression line which minimizes the
total error. The optimization program for a support vector classifier is:
objective function : Choose b0 ; b1 ; …; bp ; and j1 ; …; jN to maximize the margin M
Constraint 1: All points are at least a distance Mðð1 2 «i Þ away from the hyperplane
Constraint 2: The sum +i zi , C (4.6)
Note that there is some similarity between the nonseparable case presented
here and the separable case in section (4.5). However, there are three salient
differences: (1) slack variables ji have been incorporated, and as the objective
function shows, they are also decision variables along with the b coefficients, (2)
the right-hand side of the first constraint now incorporates soft margins that allow
some misclassification, and (3) the second constraint has been added which caps
the extent of misclassification. Conceptually, the goal of the estimation is to
maximize the margin subject to capping the extent of misclassification using the
slack variables. Obviously, if the cap C is chosen to be too small compared to the
actual nonseparability in the training data, the program will be infeasible. See
the appendix for a formal statement of the optimization program for the non-
separable case.
Technical Detour 10
Using the support vector classifier, in the nonseparable case too the optimal
separating hyperplane is characterized only by those xi for which the corre-
sponding li . 0. These are the support vectors. Interestingly, unlike the separable
case, in the nonseparable case there are two types of support vectors:
(1) For ji 5 0, the point xi lies on margin line.

(2) For ji . 0, the point xi lies on the wrong side of margin line.
The second category of support vectors, corresponding to ji . 0, did not figure

in the separable case since in that case we did not have any slack variables – said
differently, the only case was ji 5 0.
We now proceed to describe an SVM. In Section 3.2, we had described how
two nonlinearly separable classes in input space can be made linearly separable in
feature space, once the input points are mapped to a higher dimensional feature
space by a nonlinear map. Furthermore, we had shown how this powerful idea is
implemented using a kernel, which is a generalization of an inner product. In
Section 4.2, we have also seen that the optimal separating hyperplane obtained by
solving the maximal margin classifier program in (4.5) can be expressed in terms
of inner products. It turns out that the solution to the support vector classifier
program in (4.6) above also shares this feature – that is, this optimal separating
hyperplane can also be expressed in terms of inner products. One can see this
almost intuitively since the support vector classifier also uses a linear boundary
like the maximal margin classifier, and the only difference from the latter is that it
allows some misclassification for nonlinearly separable classes.
Since the algorithm for separability can be expressed solely in terms of inner
products, an application of the kernel trick suggests that one could replace the
inner products by more general kernels k(x, x9) 5 ÆfðxÞ; fðx= Þæ. In simple terms,
the SVM is the methodology of setting up the classification program using kernels
to expand the feature space, so that the classes are linearly separable in feature
space, and then computing the optimal separating hyperplane in feature space in
terms of these kernels. These kernels are defined by nonlinear maps that transform
vectors x in input space into vectors fðxÞ in feature space. Thus, the optimal
separating hyperplane that results from solving this program is also expressed in
terms of kernels. These kernels give linear separating boundaries in feature space,
which result in nonlinear boundaries in input space.
Technical Detour 11
Executive summary
The support vector classifier is a natural generalization of the maximal margin
classifier to nonseparable classes. Unlike the case of linearly separable classes,
for nonlinearly separable classes we use soft margins which can be violated by
some points. In this case, the optimal separating hyperplane is the one that
maximizes the margin M, subject to the constraint that all points are at least a
certain distance away from the hyperplane – that distance is a fraction of the
margin width M.
Intuitively, the support vector classifier allows some misclassification, but
then caps the total amount of misclassification allowed. It quantifies the
extent of misclassification of an individual point by using slack variables. The
total amount of misclassification across all training data points is capped
using a user-defined constant. The support vector classifier also shares the
desirable properties of the maximal margin classifier, in that, the optimal
separating hyperplane requires computing inner products, and it can be
specified using only support vectors.
The SVM is a powerful method for classifying nonseparable classes.
Intuitively, the SVM methodology uses kernels to transform input data to a
higher dimensional feature space where the data become linearly separable,
and then computing the linear optimal separating hyperplane in feature space
in terms of the kernels. Transformed back to input space, the separating
boundary is nonlinear, and, thus, it achieves classification of nonlinearly
separable classes in input space.
We end this section with two illustrations. The first one computes the optimal
separating hyperplane for the data in Illustration 4.1 in Section 3.2 where we had
two classes that are not linearly separable. This illustration demonstrates how
kernels and support vectors are used in the construction of a separating hyper-
plane for such data. The first illustration is such that support vectors can be
identified by plotting and “eyeballing” the data. In the second illustration, support
vectors are not easily identified, and we demonstrate the use of the formal
machinery involving solving the Wolfe dual program.
Illustration 4.4: Constructing optimal separating hyperplane for nonlinearly
separable classes using kernels and support vectors.
Recall from Illustration 4.1 the two classes of points in two-dimensional input
space (X1, X2). For the reader’s convenience, we will show the points comprising
the two classes:
• The four points in the set {(1, 1); (1, 21); (21, 21); (21, 1)} are labeled as 11.
• The four points in the set {(0.5, 0.5); (0.5, 20.5); (20.5, 20.5); (20.5, 0.5)} are
labeled as 21.
These classes are not linearly separable in input space, but they are linearly
separable in feature space once they are transformed by the mapping f which is
defined in Illustration 4.1. The goal now is to construct the separating hyperplane
in feature space using kernels. We start by noting that the only points that enter
the calculation of the separating hyperplane are the support vectors. In the figure
below (Fig. 4.20), the two support vectors are circled. They are: s1 5 (0.5, 0.5) for
the “2” class and s2 5 (1, 1) for the “1” class.
The target values corresponding to these points are y1 5 21 and y2 5 11. We
apply the two support vectors as arguments of the function f(x) for the separating
hyperplane (the interested reader can see Eq. A4.20). Then corresponding to s1
and s2, we have two equations which are solved simultaneously to obtain the
coefficients of the optimal separating hyperplane.
Let us consider a generic point x5(x1, x2). In “Illustration 4.4” under the
“Worked-out Illustrations” section, we show that the optimal separating hyper-
plane is given by:
f ðxÞ ¼ 2 4 2 6 , fðxÞ; fðs1 Þ . 1 6 , fðxÞ; fðs2 Þ .
It is important to note that this hyperplane lies in the feature space with W1 and
W2 as axes and, since it is expressed in terms of inner products which are linear, is
Fig. 4.20. Support Vectors from the Two Classes Are Circled.
itself linear in feature space. For the generic input vector (x1, x2), let us denote its
image under f as f(x1, x2) 5 (w1, w2). Evaluating the inner products where the
vectors in feature space are f(x) 5 (w1, w2), f(s1) 5 (0.5, 0.5), and f(s2) 5 (1, 1),
the previous expression for the hyperplane in feature space becomes:
f ðx1 ; x2 Þ ¼ 2 4 1 3w1 1 3w2
We can plot the optimal separating hyperplane. Points (x1, x2) that lie exactly
on the hyperplane are characterized by f (x1, x2) 5 0. Therefore, the line 2
413w113w2 5 0 passes through the points (0, 4/3) and (4/3, 0). This hyperplane is
shown below (bold solid line in figure below) (Fig. 4.21)
Since w1 and w2 are nonlinear functions of the input vector (x1, x2), therefore
the hyperplane when transformed back to input space defines a nonlinear
boundary which classifies two nonlinearly separable classes. So, how do we use
our optimal separating hyperplane to classify a new point in input space?
Consider the point (0.25, 0.75) in input space. In order to apply our optimal
separating hyperplane, we first have to transform this input point to feature space
using the mapping f. We have f(0.25, 0.75) 5 (1.75, 2.25). Then, f (0.25, 0.75) 5
24 1 3*1.75 1 3*2.25 . 0. So, this input point is classified as belonging to the
“1” class. Is this classification reasonable? To speak to the reasonableness of any
classification, we note that, first, no model is perfect and there may be classifi-
cation errors. The optimal separating hyperplane is based on all the training data,
and the model is designed to minimize overall misclassification across all data
points. Second, the kernel itself is a modeler-defined choice, and the classification
accuracy of any separating hyperplane is conditional on the specific choice of the
kernel.
Finally, we present an illustration where the support vectors cannot be iden-
tified merely by “eyeballing” a plot of the data from the two classes. In such a
Fig. 4.21. Optimal Separating Hyperplane in Feature Space.

situation, we will demonstrate how the support vectors can be identified and the
optimal separating hyperplane constructed by formally solving the full SVM
program.
Illustration 4.5: Identifying support vectors to compute optimal separating
hyperplane for nonlinearly separable classes.
We present an example from Cui and Curry (2005) to show how the use of
kernels can efficiently achieve proper classification in the “XOR problem” where
it is well-known that no linear classifier exists. The XOR problem has four points
ai, i 5 1, 2, 3, 4, in two-dimensional space with axes X1 and X2. The target
variable yi corresponding to each of the four points is either 11 or 21. A table
with the data for the XOR problem is below (Fig. 4.22).
In the plot of the data below a “11” point is represented as a “circle” and a “21”
point is represented as a “square” for ease of visualization (Fig. 4.23).
Fig. 4.22. The Data for the XOR Problem. Source: Cui and Curry
(2005).
Fig. 4.23. Plot of XOR Points. Source: Cui and Curry (2005).
As we can see, no linear classifier can perform this classification task. We

propose a kernel-based nonlinear classifier which is a hyperplane the feature
space. The feature space contains images of input points x under a nonlinear map
f(x). We use a polynomial kernel of degree 2:
kðx; zÞ ¼ ð1 1 Æx; zæÞ2 ¼ ð1 1 x1 z1 1 x2 z2 Þ2
As shown in “Illustration 4.5” under “Worked out Illustrations,” the right-

hand side, which is a nonlinear expression in input space, can be expressed as an
inner product in feature space using the nonlinear map. This map transforms
vectors in 2-dimensional input space to vectors in 6-dimensional feature space.
Following the standard solution process, we find that all li . 0, therefore all 4
vectors are support vectors. Finally, the optimal classifier (decision boundary) is:
x21 2 x22 ¼ 0
One can easily check that this classifier, which is nonlinear in input space,
correctly classifies the squares and circles as can be seen in the plot of the classifier
below in input space (Fig. 4.24).
The decision boundary is in the shape of a cross – that is, it is the combination
of both the sloped bold lines. This is because the decision boundary as given in the
previous equation is x12 2 x22 5 0. This reduces to (x1 2 x2) (x1 1 x2) 5 0 so that
x1 5 x2 and x1 5 2x2.
6. Applications of SVM in Marketing and Sales

We will discuss the impact of machine learning on marketing and sales along
several dimensions listed below. The intention is not to provide an exhaustive
Fig. 4.24. Support Vector Machine Successfully Separates Classes in

XOR Problem. Source: Cui and Curry (2005).
record of all possible applications of SVMs to marketing and sales, but to give a
sense of some of the most common applications.
• Market segmentation;
• Target marketing;
• Predictive modeling/Response modeling;
• Sales/demand forecasting;
• Churn prediction;
• Time series analysis in marketing;
• Text mining and analysis.
A perennial problem in marketing is market segmentation. Most marketing

plans essentially start with segmenting the market since the holy grail of mar-
keting is to understand how segments of customers differ on various dimensions
so that the firm can customize its offerings based on the targeted segment. A
variant of SVM called Support Vector Clustering (SVC) has been successfully
used for market segmentation and other clustering tasks (Huang, Tzeng, & Ong,
2007). The SVC method was developed by Ben-Hur, Horn, Siegelmann, and
Vapnik (2001) and is closely based on the ideas behind SVM that we will develop
in this chapter. Traditionally, segmentation has been done using clustering
techniques like k-means or agglomerative hierarchical methods etc. A major
disadvantage of many traditional methods, like the most popularly used k-means
clustering, is that they are unable to deal with nonconvex cluster shapes. They are
also not ideal for data with qualitative attributes. More recently, Self-Organizing
Feature Maps (SOFM), which are a type of neural network, have been used for
segmentation. However, in an application to segmentation of the soft drinks
market, Huang et al. (2007) show that SVC dominates both k-means and SOFM.
Mizuno, Saji, Sumita, & Suzuki (2008) have also successfully employed SVMs for
market segmentation.
Once segmentation is done, the marketer has to choose one or more of the
segments to target. SVMs have been successfully employed for target marketing.
Most of these applications involve identifying customer segments that have a high
likelihood of being interested in the firm’s product and service offerings. These
days with the ascendency of personalized marketing, the models have been used to
identify individual customers who have a high probability of buying the firm’s
product. For example, Shih, Chen, and Chang (2014) have used SVM, along with
some other machine learning methods, for targeting potential customers of a bank
in Taiwan who may have a high tendency of applying for loan products of the
bank. They find that while neural networks have a slight edge over SVM, the
performance of SVM and neural networks is almost similar. In light of
the stronger theoretical basis of SVM, and its better interpretability, compared to
a neural network, this research makes the case for SVM in target marketing
applications. Govidarajan (2013) has used an ensemble methods with base clas-
sifiers as SVM and RBF and has demonstrated the usefulness of this ensemble for
direct marketing.
The ability of SVM to efficiently perform nonlinear classification makes it an

ideal tool for predictive modeling or response modeling. Consumer response or
consumer choice in the presence of marketing stimuli is a classic problem in
marketing. An early paper that dealt with the predictive power of SVM for
consumer choice is Cui and Curry (2005). Consistent with the “customer-focused”
approach of marketing, the authors performed extensive benchmarking of the
predictive power of SVM vis-à-vis the commonly used logit model for a wide
variety of customer utility specifications posited to model consumer choice for a
multiattribute product (Table 3, p. 605). The systemic components of the six
consumer utility specifications studied in this work involve various combinations
of “utility for an attribute,” “individual characteristics of consumer,” “interac-
tions between attribute utility and consumer characteristics,” “quadratic effect of
attribute utility,” and “interactions between quadratic attribute effect and con-
sumer characteristics.” These six utility specifications then give rise to 16 treat-
ment conditions formed by a fractional factorial experimental design. The authors
find strong evidence for the superior performance of SVM compared to the logit
model for all 16 treatment conditions involving the six consumer utility specifi-
cations. Importantly, since it is well known that the logit formulation cannot
handle noncompensatory choice rules, the authors compared SVM to the logit
model only for compensatory, but nonlinear, choice situations. This was intended
to rule out situations where the logit model has inherent drawbacks, even though
many of these noncompensatory choice rules have been shown to have practical
relevance. In the case of noncompensatory choice, the performance advantage of
SVM is far superior to the performance of the standard logit model.
Continuing with response modeling, the Conjoint Analysis Methodology has
undoubtedly been one of the major successes in the field of marketing when it
comes to modeling consumer choice. This technique is extensively used in practice
for the purpose of uncovering consumer preferences. Traditionally, the conjoint
analysis methods were designed for simple linear consumer preference models, but
later some efforts were made to handle nonlinear preferences via polyhedral
methods (Toubia, Hauser, & Simester, 2004) and Hierarchical Bayesian (HB)
methods (Andrews, Ansari, & Currim 2002). Evgeniou, Boussios, and Zacharia
(2005) have used SVM in Conjoint Analysis because of the ability of SVM to
handle highly nonlinear choice situations. Based on extensive empirical testing
they conclude that,
Our methods are shown experimentally to be more robust to noise

than both logistic regression and the polyhedral methods, to either
significantly outperform or never be worse than both logistic
regression and the polyhedral methods, and to estimate nonlinear
utility models faster and better than all methods including HB.
(pp. 415–416).
As the authors point out, a major advantage of the SVM-based conjoint

method here is that the utility function (preferences) are estimated based only on
“hard choices” that consumers have to make. We will see later that the technical
counterpart of “hard choices” turns out to be the support vectors of the SVM, and
since the support vectors are automatically determined as part of the estimation
process, the “hard choices” that shape consumer preferences are also automati-
cally found by the SVM-based conjoint method. This is consistent with the
intuition that people’s preferences are really formed by hard choices they have to
make rather than by obvious trade-offs, with the added advantage that there are
much fewer hard choices compared to the set of all choices that people make.
Cheung, James, Law, and Tsui (2000, 2003) have used SVM to mine customer
preferences for product recommendations. In light of the explosion in online
customer ratings and reviews in all product categories, businesses are very keen to
use them to recommend suitable products to potential customers. Recommen-
dation systems are usually of two types: (1) Content-based and (2) Collaborative-
based. Simply stated, content-based recommendation systems look at the match
between product attributes and other product information with customer inter-
ests, while collaborative systems leverage the preference ratings from the other
customers for recommending products/services to the focal customer. In theory, it
is straightforward to extract product contents from web pages owing to the vast
amount of information of products available nowadays, but translating the
contents to a manageable number of attributes that can be processed by statistical
methods is daunting because of the large number of resulting attributes. For
example, consider the very common application of movie recommendations. One
important aspect of a movie is its “cast.” However, this is a multivalued object
and the usual way it is coded is via binary variables like “cast includes Keanu
Reeves,” “cast includes Meryl Streep” etc., each of which takes 0/1 values. This
scheme explodes the dimensionality of the set of attributes. SVM is known to have
superior performance in the case of high-dimensional data sets, and this property
can be exploited for content-based recommendation systems. Using an Internet
Movie Database (IMDB), Cheung et al. (2003) show that SVM has a much
superior performance compared to Naı̈ve Bayes and k-Nearest Neighbor in cal-
ibrating recommendation systems for movies.
Kim, Lee, and Cho (2008) have used support vector regression for response
modeling in direct marketing where they use a customer database to estimate the
amount of purchases made. As opposed to the SVM methodology that was
developed originally for classification, as in the consumer choice contexts
described in the preceding paragraphs, they have used a variant which is
appropriate for regression – the support vector regression. The authors have
shown how a more sophisticated sampling method proposed by them can yield
better results and, at the same time, can reduce the estimation load that results
from using the entire data set. This addresses a drawback of SVM which places a
lot on estimation when the training data set is very large. Since the use of larger
data sets is becoming increasingly common, this method can be fruitfully used to
efficiently train SVMs. In a similar vein, Shin and Cho (2006) have used SVM for
response modeling in direct marketing using the Direct Marketing Educational
Foundation (DMEF) data set. Using the classification based SVM, they again
demonstrate the practical usefulness of a specific sampling procedure to train an
SVM with a large training data set. In the domain of response modeling, a novel
contribution is that of Martı́nez-Ruiz Gomez-Borja, Molla-Descals, and Rojo-

Alvarez (2008) who have employed SVM to estimate the effects of firm pricing on
brand substitution. In Martı́nez-Ruiz, Molla-Descals, Gomez-Borja, and Rojo-
Alvarez (2006), the authors analyze the effect of the length of temporary retail
price promotion periods on sales. They employ daily store sales of ground coffee
(nonperishable) and yogurt (perishable) from a retail scanner data set. The
authors develop a SVM-based robust regression technique to analyze this data set.
They benchmark their SVM based semiparametric model against a standard
semiparametric model employed by Van Heerde, Leeflang, and Wittink (2001)
and also against a fully parametric linear model. They find that the SVM-based
model outperforms the fully parametric methods (as may be expected), but more
importantly, also dominates the standard semiparametric method.
Another classic problem in sales and marketing, and one which continues to
have strong interest among practitioners and researchers, is the estimation of
demand functions and sales response functions. Bajari, Nekipelov, Ryan, and
Yang (2015) have documented how machine learning can outperform traditional
statistical methods for demand estimation. Referring to the performance of
machine learning models in the classic problem of demand estimation, the authors
state that “…we claim that as compared to common methods an applied
econometrician might use in off the shelf statistical software, these methods are
considerably more accurate” (p. 481–482). Among the machine learning methods
that Bajari et al. (2015) have used to compare with “traditional econometric and
statistical methods” is SVM. One can conceive of the problem of estimating the
sales response to be very similar to the problem of demand estimation, in that,
both try to predict the quantity demanded (sold) based on several covariates. In
the specific case of sales response function, the covariates include, among others,
those related to the multidimensional effort that the salespeople exert to sell the
product. Thus, while the choice of specific covariates may differ for demand
estimation and sales response estimation, as a statistical problem they are both
similar. It is worth noting that Bajari et al. (2015) compared machine learning
models with traditional statistical methods for the same data set with the same
covariates. A major strength of machine learning models, of which SVM is one, is
their ability to handle highly complex, unstructured data of the type that the
traditional statistical methods cannot handle. This implies that, in terms of per-
formance, machine learning models will likely have an even greater advantage
than that documented by Bajari et al. Carbonneau, Laframboise, and Vahidov
(2008) have used SVM to forecast demand. They use two data sets: one based on
simulating a supply chain and the other being actual data from Canadian
foundries. The authors find that forecasting based on SVM performs better than
the traditional forecasting methods like trend, moving average, and linear
regression.
Gordini and Veglio (2017) have demonstrated the use of SVM in churn
modeling. Unlike most studies on churn modeling, this research focuses on the
business-to-business (B2B) setting in the e-commerce industry. The authors
develop a parameter-selection technique, called SVMauc, which is based on the
“Area Under the Curve” (AUC). Gordini and Veglio (2017) have benchmarked
the SVM with smart parameter selection relative to logistic regression, neural
network, and classic (without SVMauc) SVM, and find the superior performance
of their model. They further note that their SVM-based model shows better
generalization performance when applied to noisy, imbalanced, and nonlinear
marketing data, which most firms are increasingly capturing in the CRM systems.
Coussement and Van den Poel (2008) have studied churn prediction in a business-
to-customer (B2C) setting, specifically for subscription services. They too find that
SVM performs better than logistic regression, especially when the former is
trained by a suitable parameter selection technique. However, they find that a
Random Forest performs better in their specific data set. Huang, Chen, Hsu,
Chen, and Wu (2004) have used SVM for credit rating prediction using multiple
data sets from credit rating organizations in Taiwan, called the Taiwan Ratings
Corporation, and the United States. They find that SVM has better predictive
accuracy compared to a neural network trained on the same data set and both are
much superior to logistic regression.
Another field in which the SVM methodology has shown superior performance
is in analyzing time-series data, especially chaotic time series. In an early paper,
Mukherjee, Osuna, and Girosi (1997) benchmarked the SVM-based time series
predictions with several alternatives and found that SVM dominated its alter-
natives in terms of prediction accuracy. Most importantly, the authors generated
chaotic time-series data and employed sophisticated nonlinear analysis techniques
as alternatives to benchmark their SVM-based approach. The chaotic time-series
they considered are the Mackey-Glass time series, the Ikeda map, and the Lorenz
time series (Mukherjee et al., 1997). The alternative techniques used to analyze
these chaotic time-series are taken from Casdagli (1989) and include nonlinear
approximation techniques such as polynomial, rational, local polynomial, RBFs,
and neural networks. The authors report that, “The SVM performs better than
the approaches presented in [1],” where [1] refers to the alternative approaches.
This research provides very strong evidence for the usefulness of SVM in
analyzing time-series which are otherwise difficult to handle using more tradi-
tional methods. Sapankevych and Sankar (2009) provide a wide-ranging survey of
applications of SVM to time-series predictions. They too note that,
The underlying motivation for using SVMs is the ability of this

methodology to accurately forecast time series data when the
underlying system processes are typically nonlinear, nonstationary,
and not defined a priori. SVMs have also been proven to outperform
other nonlinear techniques such as neural network based nonlinear
prediction techniques such as multilayer perceptrons (p. 25).
This papers provides many cites of research in different fields that have
employed SVM for time-series analysis, and we will not repeat them here.
Text analysis has become very popular as a means of obtaining customer
intelligence and other marketing intelligence. This radical growth of textual
analysis is an offshoot of the large volume of text that is generated by various
marketing and sales efforts of a firm, and the responses to these efforts by its
customers (Netzer, Feldman, Goldenberg, & Fresko, 2012; Tirunillai & Tellis,
2014). SVMs have been extensively used in text classification and many other text
analysis techniques (Chapter 6 of Aggarwal & Zhai, 2012; Chau & Chen, 2008;
Joachims, 1998; Zhang, Dang, Chen, Thurmond, & Larson, 2009 etc). Li and Wu
(2010) have used SVM for text mining and sentiment analysis for online forums
by using data from 31 different sports-related topic forums spanning a wide range
of topics and 220,053 posts. They first algorithmically determine the emotional
polarity of a text, that is, sentiment analysis, and obtain a value for each piece of
text. They then combine this algorithm with SVM in an unsupervised text mining
approach to form clusters. The goal is to group the topic forums into clusters
where the center of each cluster would represent a hotspot forum. They report
experimental results which confirm that SVM performs the clustering task very
well, with the top 4 hotspot forums identified by SVM being similar to results
obtained from using k-means clustering.
A very common application area within text analysis is text classification
which involves automatically categorizing textual documents into topical cate-
gories such that information can then be easily searched. A persistent problem
with text classification is the overwhelming number of applications where the
training data set is very imbalanced. Consider, for example, the task of classifying
news articles as “interesting” or “not interesting” for a particular reader. The
standard means of doing this is to use the SVM as a binary classifier and then to
perform the multicategory classification task (it is multicategory since there are
many categories of new articles) by adopting a one-against-all learning strategy.
Clearly, there are many more training examples in the “not interesting” category.
There are many techniques that have been suggested to handle such imbalanced
training data when using SVM as the base classifier. In a wide ranging empirical
study, Sun, Lim, and Liu (2008) discuss and compare the various methods of
handling such imbalanced training samples in the case textual data. They find
that, using the area under the Precision-Recall Curve as the evaluation criterion
for model performance, the standard SVM with suitable adjustments of the
threshold may be the best performer.
7. Case Studies
In this section, we will present a couple of case studies about the application of
SVMs in marketing. We describe the data sets and demonstrate the analyses done
on them. We also compare the strengths and weaknesses of SVMs compared to
the traditional econometric models.
As is the case for the other machine learning methods in this book, our goal is
for the applications-oriented reader to be able to see the details of some marketing
applications of SVMs. We will focus on understanding the business context, the
data set, the choice of predictors and dependent variable, visualization and
interpretation of results, and finally, communication of the results in a business-
relevant manner to other stakeholders. We will also provide the results of a
nonmachine learning benchmark model. This will allow the reader to clearly
contrast the findings from the SVM analysis against the findings from the
benchmark model.
Case Study 1: Consumer Choice Modeling

The modeling of consumer choice has a very long history in marketing. Initially
drawing on theories from the mathematical psychology and economics
literatures of the 1960s onwards, marketers have developed discrete choice
models with an eye to practical implementation and measurement. Corstjens and
Gautschi (1983) provide an excellent survey of formal choice models in
marketing, with details of the theoretical background of the marketing appli-
cations of these models, and we will not repeat it here. These authors draw
attention to the axioms and postulates of preferences, the nature of choice
alternatives, the choice rules, and the nature of choice outcomes. The major
conceptual unpinning of discrete choice models in marketing is the view of
products as “bundles of attributes.” The major advantage of this conceptuali-
zation is that a new product entrant into an existing market does not require the
modeler/analyst to completely respecify the utility function.5
We will provide here a very brief summary of discrete consumer choice the-
ories in marketing and then focus on choice rules – specifically noncompensatory
choices. Consider an individual facing a discrete set of choice alternatives. In the
marketing and sales contexts, these alternatives could be products, services,
competing firms etc. Let us consider a set of competing brands offered by
different firms in a certain product category. Products in this category can be
considered as a bundle of attributes, and of course, the various brands represent
different combinations of attributes and attribute levels chosen by a company. A
given individual’s utility from consuming a certain company’s product is a
function of the company’s choice of attributes and attribute levels. In discrete
choice models, a rational individual chooses that product alternative which
maximizes his/her utility. Within this overall modeling framework, different
types of discrete choice models make different assumptions about the form of the
utility function: whether it is random or not, whether it is linear or not based on
the assumed choice rules etc. We will focus on choice rules in consumer choice
models.
There are two major categories of choice rules – whether they are compen-
satory or noncompensatory. The simplest utility function for compensatory
choice is a linear function of attributes where the weight of each attribute is a
measure of the importance of that attribute for the individual. Consider, the
simple case of a two-attribute product whose attributes are A1 and A2. For
example, in the case of an automobile, one could think of the two attributes
being “power,” measured as horsepower delivered by the engine, and
“efficiency,” measured as miles per gallon. A compensatory choice rule could be
5
See last para, page 21, of Corstjens and Gautschi (1983).
modeled as a linear utility function: U 5 a 1 bA1 1 gA2. This form of the

utility function models compensatory choice since a consumer could remain at
the same utility level even if a company was to decrease the power in its vehicle
as long as it compensated by a proportionate increase in its efficiency. If an
individual consumer was deciding between two automobile brands based on
power and efficiency, then discrete choice theory tells us that she would choose
that brand which gives the higher utility. The simple linear model has been
widely used since its birth during the 1960s owing to its simplicity. However,
almost as soon as its inception, it has been criticized as unrealistic and not
capturing actual consumer behavior (Anderson, 1970, 1971). As an illustration,
the compensatory choice rule allows a consumer to remain at the same utility
even with an automobile that has very miniscule power as along as it is
compensated with very high levels of efficiency! Marketing scholars and others
have proposed alternate choice rules that address the drawbacks of compen-
satory choice. Most of these noncompensatory choice rules rely on nonlinear
utility functions. One such noncompensatory choice rule is the Latitude of
Acceptance (LOA). The LOA accommodates the fact that the acceptance range
for a product is when the attributes lie between some minimum and maximum
limits. As shown in the figure below, the product is acceptable if the attributes
lie within the intervals: a # A1 # b and c # A2 # d.
Fig. 4.25. Latitude of Acceptance Noncompensatory Choice Rule.
We will analyze simulated data for an individual consumer. Suppose, we have

the data from past choices of one single individual in n choice occasions. In each
choice occasion, the individual is presented with an attribute pair, (A1, A2), and
he/she makes the binary choice of “buy/not buy” (coded as “1” or “2”). This
type of data is typical in Conjoint Analysis experiments which are designed to
estimate an individual’s utility function. In Choice-Based Conjoint (CBC)
analysis, respondents are presented with different “product profiles,” which are
different combination of attribute levels, and their choices are recorded. The
variables in our data set are.
(1) Choice (0 or 1)
(2) A1 (value of attribute 1)
(3) A2 (value of attribute 2)
To establish a baseline first, we analyze this binary choice situation using

logistic regression. Both the logistic regression and SVM were trained on a
random sample of 80% of the data and tested on a random sample of 20% of the
data. In both, we did a 3-fold cross-validation. We show the inputs chosen for
running a logistic regression for this data set:
Response Column: 1
Predictor columns: 2:3 (2 and 3)
The inputs chosen to run a SVM are as follows:
Response Column: 1
Predictor columns: 2:3 (2 and 3)
Kernel: Radial
Kernel Parameter: 0.5,1,1.5
Cost: 0.8
While the other inputs for a SVM are straightforward, three aspects require
clarification. First, we are required to choose a kernel. Here we chose the “radial”
kernel. Second, the radial kernel needs a user-specified kernel parameter – the
gamma parameter value. Intuitively, the gamma value governs how far
the influence of a particular training example reaches. When gamma is small, the
influence reaches far, and when it is large, the influence is close. We have chosen
three parameter values: 0.5, 1, 1.5. The idea is to compute SVMs with different
parameter values and then to select the best fitting model. This step can be
accomplished by writing simple code in all software programs. Third, we also
need to input a “Cost” value. This is a regularization parameter for SVM. In this
case, a SVM with kernel parameter of 1.5 has the best performance based on the
box plot of cross-validation error as seen below:
We briefly describe the fit statistics. The Confusion Matrix for the logistic
regression is calculated on the test data and is given by the following:
Actual + Actual -
Predicted + 0 0
Predicted - 46 154
The percent correctly classified (PCC) for test data for the logistic regression
is 77%. On the face of it, a PCC of almost 80% may not seem too bad, but there
is a very serious problem with logistic regression as applied to this data set.
None of the points that are actually in the “1” class are correctly classified! In
fact, not one of the data points is predicted to lie in the “1” class. Thus, when
we go from overall performance to a more fine-grained performance by class,
the logistic regression performs poorly. Intuitively, this happens because there
are more “Actual 2”s in the data, compared to “Actual 1”s, and the method
plays it safe (as far as maximizing PCC is concerned) by classifying everything
as “-.”
The Confusion Matrix for the logistic regression is calculated on the test data
and is given by the following.
Actual + Actual -
Predicted + 45 4
Predicted - 10 141
The PCC for test data for the SVM is 93%. The performance of SVM is
significantly better than that of logistic regression. This pattern of results
is confirmed by looking at the AUC as well. For the logistic regression, the AUC is
approximately 0.51. The AUC for the best-performing SVM, the one with gamma
value of 1.5, is significantly higher at about 0.99. The reason for the much superior
performance of SVM compared to logistic regression is obviously because logistic
regression provides a linear boundary that is capable of faithfully classifying only
linearly separable classes. However, noncompensatory choice rules like the LOA
require nonlinear classification as can be seen clearly from Fig. 4.25.
As was already mentioned, conjoint analysis, which has been one of the major
successes in the field of marketing in terms of widespread applications in industry
and elsewhere, is very similar in spirit to the current case study. There too, one
main goal is to estimate an individual’s utility function and predict his/her product
choices. Hence, it is no surprise that the SVM methodology has been successfully
employed for conjoint analysis, given that many real choice situations follow
noncompensatory choice rules and nonlinear utility functions. Evgeniou, Bous-
sios, and Zacharia (2005) have shown that, compared to other statistical models,
apart from being able to handle highly nonlinear choice situations, the SVM
methodology applied to conjoint analysis can handle many more attributes, is
robust to noise, does not suffer from the curse of dimensionality, and does not
make distributional assumptions that may or may not be satisfied.
Case Study 2: Rent Value vs Location

Recall that in the chapter on neural networks we had used a motivating example of
predicting real estate sales in any city. The dollar sales in the case of real estate
markets are dependent on the real estate valuation. Among the many predictors of
real estate valuation of a property, the location of the property is perhaps the most
important. We consider the relationship between “Rent value” (Y) and the “Dis-
tance to the city center” (X), where the latter measure operationalizes the location of
the property. We use this case study to demonstrate the use of SVM even in
regression contexts with a continuous response (dependent) variable.
To establish a baseline first, we fit a linear regression. Both the linear regression
and SVM were trained on a random sample of 80% of the data and tested on a
random sample of 20% of the data. In both, we did a 3-fold cross-validation. We

show the inputs chosen for running a linear regression for this data set.
Response Column: 1
As was mentioned in the neural networks chapter, a linear regression will not
provide a good fit to the data since empirical research has established that there is
a cubic relationship between “distance to city center” and “rent value” (Frew &
Wilson 2002). The PMSE (predicted mean squared error) for the linear regression
fit is 6921.
An important learning point from this case is to demonstrate the importance of
an appropriate choice of a kernel for SVM. We will first run a SVM with a radial
basis kernel. The inputs for this are as follows.
Response Column: 1
Kernel: Radial
Kernel Parameter: 0.5,1,1.5
Cost: 0.8
The performance of the SVM with the radial kernel on test data is significantly
better than the benchmark of linear regression. The PMSE is 2373.8, which, as
expected, is significantly better than the linear regression.
We now try the SVM with a polynomial kernel. The inputs are as follows:
Response Column: 1
Kernel: Polynomial
Kernel Parameter: 1, 2, 3
Cost: 0.8
The kernel parameter is the degree of the polynomial and we have selected
parameters 1, 2, and 3 corresponding to linear, quadratic, and cubic polynomials –
the best fitting polynomial kernel will be selected. We find that the best fitting
SVM has a PMSE of 2074. This is even better than the performance of the SVM
with radial basis kernel. Interestingly, the best fitting kernel is the cubic poly-
nomial which has the lowest CV error. This is consistent with the true relationship
between “rent value” and “distance to city center” based on empirical research
(Frew & Wilson 2002). This case study underscores the importance of using the
appropriate kernel for SVM and often the best guide for this is the amount of
domain knowledge that the analyst brings to the task.
TECHNICAL APPENDIX
Technical Detour 1
As mentioned in the text, suppose the input points xi lie in a two dimensional
space defined by axes X5(X1, X2), so that each xi 5 (xi1, xi2), i 5 1,…, N. In a
logistic regression, we model the log odds (or logit transformation) of the target
variable yi, and the model has the form:
Prðyi ¼ 1 1jX ¼ xi Þ p
Log ¼ Logð Þ ¼ b0 1 b1 xi1 1 b2 xi2
Prðyi ¼ 2 1jX ¼ xi Þ 12p
The linear expression in the input variables, xi1 and xi2, on the right-hand side
is the linear classification function. The dependent variable is the log odds ratio,
log[pi/(12 pi)], where pi 5 Pr(yi 5 11|X 5 xi). If the linear classification function
is denoted by f (xi), so that f (xi) 5 b01 b1 xi11b2 xi2, then the classification rule
is:

1 1; f ðxi Þ . 0
2 1; f ðxi Þ , 0
Since the logistic regression model is log[pi/(12 pi)]] 5 f (xi), the classification
rule can be restated as:

1 1; pi . 0:5
2 1; pi , 0:5
Technical Detour 2
Suppose the equation of the hyperplane is b0 1 b1 x1 1 b2 x2 5 0. Since none of the
“1” or “2” points lie on it therefore for any point xi 5(xi1, xi2) either
b0 1 b1 xi1 1 b2 xi2 . 0 or b0 1 b1 xi1 1 b2 xi2 , 0. Said differently, the hyperplane
defined by the thick bold line separates the “1” and the “2” points.
We can generalize this idea and define a separating hyperplane in p-dimensions
as:
(
b0 1 b1 xi1 1 b2 xi2 1 ::: 1 bp xip . 0; if yi ¼ 1
(A4.1)
b0 1 b1 xi1 1 b2 xi2 1 ::: 1 bp xip , 0; if yi ¼ 2 1
In this way, we can see that the sign of the separating hyperplane, corre-
sponding to an input point xi, can be used to classify that point.
These above two inequalities can be combined to give the condition for a
separating hyperplane as:
yi ðb0 1 b1 xi1 1 b2 xi2 1 ::: 1 bp xip Þ . 0 (A4.2)
Technical Detour 3
A very simple and well-known measure of similarity of two vectors in
p-dimensional space, x5(x1,…, xp) and x95(x91,…, x9p), is the dot product also
called the inner product. This is defined as:
p
=
Æx; x= æ ¼ + xj xj (A4.3)
j¼1
The inner product above has a geometric interpretation in that it computes the
angle between x and x9(actually, cosine of the angle). For our practical purposes,
we take the norm of a vector x, denoted by ‖x‖, to be a measure its length. The
norm is also defined in terms of the inner product of the vector x and itself.
‖x‖2 ¼ Æx; xæ
The inner product of two vectors, in terms of the cosine of the angle u between
them, is:
Æx; x= æ ¼ ‖x‖ ‖x= ‖Cos u
Hence, the inner product allows one to carry out mathematical computations
that involve geometric concepts of angles, lengths, and distances. In this simple
case where there is a concrete representation of the input object x as a vector, the
kernel is merely the inner product:
kðx; x= Þ ¼ Æx; x= æ (A4.4)
A very popular kernel used for measuring the similarity of two documents is
based on the idea of using the cosine as a similarity measure. This kernel is:
Æx; x= æ
kðx; x= Þ ¼
‖x‖ ‖x= ‖
Technical Detour 4
Consider N data points (xi, yi), i 5 1,…, N, where input vectors xi 5(xi1, xi2) lie
in 2-dimensional space and yi is a binary response variable with values “11” or
“21.” Suppose there are N1 responses with yi 5 11 values and N- responses with
yi 5 21, so that N1 1 N- 5 N. Also suppose the set of indices i such that yi 5 11
(yi 5 21) be denoted by P (M).
Now, for ease of exposition, we assume that the distance of the class means x1
and x- from the origin is the same (see Fig. 4.6). Importantly, the reader should
keep in mind that the basic idea of this construction works for any arbitrary
locations of the class means. Getting back to the symmetric case, the class means
at equal distance and symmetrically located on either side of the vertical imply
that x21 5 x2-, and we denote this common value by x2*. The point halfway
between the class means is:
xC ¼ ðx 1 1 x Þ=2 ; and it has coordinates xC ¼ ð0; x*2 Þ:
To classify a new point x5 (x1, x2), we have to determine whether this new
point is closer to the class mean of class “1” or class “2.” We classify it in class
“1” (or “2”) if the angle of the vector “x2xC” with the vector “x1 2 x2” is less
(or greater) than 90o, respectively. Recall that Cosine 90o 5 0, cosine of an angle
greater than 90o is negative, and cosine of an angle less than 90o is positive.
Therefore, the inner product (which is a measure of the cosine of the angle
between two vectors) can act as a classifier. Specifically, the classifier is:
y ¼ signðÆx 2 xC ; x 1 2 x æÞ (A4.5)
The direction vectors “x2xC” and “x1 2 x2” in coordinate form are (x1, x22
x2*) and (x11 2 x12, 0). Thus, the inner product on the right-hand side of pre-
ceding equation is:
Æx 2 xC ; x 1 2 x æ ¼ Æx; x 1 æ 2 Æx; x 2 æ
The expression on the right-hand side uses the fact that x21 5 x22 since the X2
coordinates of the class means are the same. From (A4.5) then the required
classifier is:
y ¼ signðÆx; x 1 æ 2 Æx; x 2 æÞ (A4.6)
Now, since x 1 5 N11 + xi and x 5 N12 + xi , and since inner products have the
i2P i2M
linearity property, therefore the term in parenthesis on the right-hand side of
(A4.6) can be written in terms of sums and differences of inner products of the
new point x and all the input points xi. Finally, since an inner product is a kernel,
thus, one can see how a classifier can be expressed in terms of kernels as:
N
y ¼ signð + ai kðxi ; xÞ 1 b0 Þ (A4.7)
i¼1
Technical Detour 5
As mentioned in the text, the map f(.) maps objects from the input space X to an
inner product space called the feature space F. The simplest nonlinear map f(.)
takes the input vector x and maps it to a feature space which has products and
powers of components of x, which are x1 and x2. For example, for input vector
x5(x1, x2), consider the nonlinear map:
pffiffiffi
fðxÞ ¼ fðx1 ; x2 Þ ¼ ðx21 ; x22 ; 2 x1 x2 Þ (A4.8)
Then, the inner product in feature space is:

=2 =2 = =
ÆfðxÞ; fðx= Þæ ¼ x21 x1 1 x22 x2 1 2x1 x2 x1 x2 ¼ Æx; x= æ2 (A4.9)
Thus, even though Æx; x= æ2 is nonlinear in the 2-dimensional input space X, the
kernel kðx; x= Þ 5 ÆfðxÞ; fðx= Þæ has mapped this nonlinear quantity into a linear
quantity. This linear quantity resides in the 3-dimensional feature space F defined
pffiffiffi
by dimensions W1, W2, W3, say, where W1 5 x12, W2 5 x22, W3 5 2x1x2. The
alert reader will note that linearity in the feature space F follows from the fact that
an inner product defined on a space is, by definition, linear in that space.
Technical Detour 6
Given a set of input points x1, x2,…, xN in X, we can use the kernel to form the
kernel matrix:
0 1
kðx1 ; x1 Þ ⋯ kðx1 ; xN Þ
B C
G ¼ @ ⋮ ⋮ ⋮ A
kðxN ; x1 Þ ⋯ kðxN ; xN Þ
This is called the Gram matrix. If it is positive definite, then it can be shown
that there exists a feature space with an inner product defined on it based on this
kernel.
We provide more details on the commonly used kernels mentioned in the text:
• The kth degree polynomial kernel: k(x, x9)5(11x; x=)k.

• Radial basis function (RBF) kernel: kðx; x= Þ 5 expð 2 g‖x 2 x= ‖2 Þ. The rbf
kernel is a special case of the Gaussian kernel: kðx; x= Þ 5
expð 2 12ðx 2 x= ÞT S 2 1 ðx 2 x= ÞÞ. The Gaussian kernel reduces to the rbf kernel
if the covariance matrix S is isotropic (or spherical), that is, it is diagonal with
all the diagonal elements being the same. =
• ÆfðxÞ;fðx Þæ
Cosine similarity kernel: kðx; x= Þ 5 ‖fðxÞ‖ ‖fðx= Þ‖
. This kernel is extensively used
in machine learning for text analysis.
• Sigmoid kernel: kðx; x= Þ 5 tanhðg1 Æx; x= æ 1 g2 Þ. This kernel is used in neural
networks.
Technical Detour 7
The margin of the point (xi, yi) to the hyperplane b0 1 b1 x1 1
b2 x2 1 ::: 1 bp xp 5 0 is the distance from xi to the hyperplane if the point (xi, yi)
5 (xi1,…, xip, yi) is correctly classified. A standard result from coordinate
geometry shows that the geometrical distance from (xi, yi) to the hyperplane
jb 1 b1 xi1 1 b2 xi2 1 ::: 1 bp xip j
b0 1 b1 x1 1 b2 x2 1 ::: 1 bp xp 5 0 is 0 p ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ffi . Formally speaking,
2
b1 1 b2 1 ::: 1 bp
2 2
we can define the margin of the point (xi, yi) 5 (xi1,…, xip, yi) as:
yi ðb0 1 b1 xi1 1 b2 xi2 1 ::: 1 bp xip Þ
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi (A4.10)
b21 1 b22 1 ::: 1 b2p
Since yi 5 1 or 21 depending on whether b0 1 b1 xi1 1 b2 xi2 1 ::: 1 bp xip is .

0 or ,0, therefore the numerator of the expression above is positive for correctly
classified points. Moreover, for correctly classified points the expression above
equals the distance from the point to the hyperplane. From this, it directly follows
that for misclassified points, where the product term yi (b0 1 b1 xi1 1
b2 xi2 1 ::: 1 bp xip ) , 0, the margin is the negative of the geometrical distance.
Technical Detour 8
The development of the optimization program for the maximal margin hyper-
plane is based on the treatment in Hastie, Tibshirani, and Friedman (2009). The
formal statement of the program described in (4.5) is:
8
>
> max Mb0 ;b1 ;:::;bp
>
< s:t: ‖b‖ ¼ 1
p (A4.11)
>
> yi ðb0 1 + bj xij Þ $ M "i ¼ 1; 2:::; N
>
: j¼1
In the above, the first constraint involves the norm of the vector b5(b1,…, bp),
defined as, ‖b‖2 5 b21 1 ::: 1 b2p . The reader will note that, since yi equals either 11
or -1 and because of the first constraint, the left-hand side of the second constraint
is essentially the (signed) distance from input point xi to the hyperplane. See the
expression for the distance from xi to the hyperplane in (A4.10). Therefore, the
two constraints together imply that all the points are at least a distance M away
from the hyperplane.
Since the goal is to determine the optimal values of the coefficients bj, j 5 0,
1,…,N, it is convenient to recast the objective function in (A4.11) in terms of the
bj. Note that the first constraint in (A4.11) is essentially ‖b‖ 5 1 where b5(b1,..,
bp). In order to recast the objective function in (A4.11) in terms of b, we relax the
requirement that ‖b‖ 5 1. When we do not impose the unity requirement on the
norm ‖b‖, then the requirement that the distance of the hyperplane from any
y ðb 1 b x 1 b x 1 ::: 1 b x Þ
point xi 5(xi1,…, xip) is at least M becomes: i 0 1 i1 ‖b‖ 2 i2 p ip
$M.
Clearly, if coefficients (b0, b) satisfy this inequality constraint, so does (kb0, kb),
where k is a constant. This is because, by replacing (b0, b) with (kb0, kb) in the
numerator of the ratio on the left-hand side, the term becomes kyi
(b0 1 b1 xi1 1 b2 xi2 1 ::: 1 bp xip ). Also, by replacing (b0, b) with (kb0, kb) in the
denominator of the ratio on the left-hand side, the term becomes:
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ðkb1 Þ2 1 ::: 1 ðkbp Þ2 5 k b21 1 ::: 1 b2p 5 k‖b‖. Consequently, the k’s in the
numerator and denominator cancel out. Thus, there is an indeterminacy which
can be avoided if we arbitrarily set M‖b‖ 5 1. Therefore, in the above construc-
tion, the width of the margin becomes M 5 1=‖b‖.
The program for the optimal separating hyperplane can now be restated as:
8
>
> 1 p 2
>
> + bj
>
>
min
< 0 1 p2 j ¼ 1
b ; b ;:::;b
(A4.12)
>
> p
>
> ðb 1 + bj xij Þ $ 1 "i ¼ 1; 2:::; N
>
> s:t: y i
: 0
j¼1
This form is intuitively appealing since, unlike (A4.11), the objective function
is directly stated in terms of the coefficients which are precisely the quantities that
need to be estimated. Also note that since M 5 1=‖b‖ therefore the maximizing
program of (A4.11) has been recast as a minimizing program in (A4.12). As usual,

we form the Lagrangian:
1 p 2 N p
Lp ¼ + bj 2 + li ½yi ð + xij bj 1 b0 Þ 2 1
2 j¼1 i¼1 j¼1
where li are the Lagrange multipliers. We use the subscript p in the Lagrangian to
make it explicit that this is the “primal” problem, to be distinguished from the
“dual” problem to be specified in “Technical detour 9” below.
This Lagrangian needs to be solved for the bs and the ls. Consider the
derivative w.r.t. bj. It is
∂Lp
∂bj 5 bj 2 l1 y1 x1j 2 l2 y2 x2j 2 ::: 2 lN yN xNj , for j 5 1,…,p. Setting it equal
to 0 and solving gives the First Order Conditions (FOCs) with respect to bj, j 5
1,…, p.
N
b j ¼ + li yi xij for j ¼ 1; …; p
FOC condition w:r:t: bj : b (A4.13)
i¼1
N
FOC w:r:t: b0 : + li yi ¼ 0 (A4.14)
i¼1
In addition, the Karush–Kuhn–Tucker conditions yield:

p
li ½yi ð + xij bj 1 b0 Þ 2 1 ¼ 0; for all i (A4.15)
j¼1
Now, we know that for the training data (xl, yl), l 5 1,…, N, classification by a
separating hyperplane requires computation of terms like ðb0 1 xl :bÞ. For a
compact notation, we use the dot product xl :b. Consider the terms xl :b for l 5
1,…,N. The optimal separating hyperplane requires the optimal values b bj for j 5
bj obtained in (A4.13) we get:
1,…, p. Using the optimal b
0 1
N
B + li yi xi1 C
B i¼1 C
B C
B N C
B + lyx C
b ¼ ðxl1 ; xl2 ; :::; xlp ÞB
xl :b i i i2 C
B i¼1 C
B C
B ⋮ C
B N C
@ + lyx A
i i ip
i¼1
N p
b 5 + li yi + xlk xik . Finally, we note
Writing out this product gives us: xl :b
i51 k51
that the second summation on the right-hand side is an inner product, and so:
N
b 5 + li yi Æxl ; xi æ, for l 5 1,…, N.
xl :b
i51
Thus, the function f(x) for the optimal separating hyperplane, defined for a
generic point x as the argument, is:
N
b 0 1 x:b
f ðxÞ ¼ b b ¼ b
b 0 1 + li yi Æx; xi æ (A4.16)
i¼1
b0 1x:b).
The optimal classifier is just: sign(f(x)) 5 sign(b b
Finally, (A4.16) makes it clear that those points (vectors) for which the cor-
responding li 5 0 will not figure in the specification of the separating hyperplane.
Hence, we should only focus on vectors for which li . 0. Now, the KKT con-
p
ditions in (A4.15) imply that if li . 0 thenyi ð + xij bj 1 b0 Þ 5 1. These vectors xi
j51
corresponding to which li . 0 are support vectors.
Technical Detour 9
The optimal separating hyperplane for linearly separable classes is solved using
the program (A4.11) in “Technical detour 8.” There, we showed that we could,
equivalently, solve program (A4.12). For computational purposes, it is often
easier to solve the Wolfe Dual program. Substituting the FOCs (A4.13) and
(A4.14) into the Lagrangian primal Lp given in Technical Detour 8, we can form
the Lagrangian dual. Then, the Wolfe Dual is specified as:
N 1 N N
max LD ¼ + li 2 + + li lk yi yk xTi xk
l1 ;:::lN i¼1 2 i¼1 k¼1
(A4.17)
N
s:t: + li yi ¼ 0; li $ 0
i¼1
In the above, xiT is the transpose of the vector xi. This program yields the
optimal li, and the points (vectors) xi corresponding to li . 0 are the support
vectors.
Technical Detour 10
The formal statement of the program described in (4.6) for the nonseparable
case is:
8
>
>
> max Mb0 ;b1 ;:::;bp ;j1 ;:::;jN
>
>
> s:t: ‖b‖ ¼ 1
>
>
>
< p
yi ðb0 1 + bj xij Þ $ Mð1 2 ji Þ "i ¼ 1; 2:::; N (A4.18)
>
> j¼1
>
>
>
>
N
>
> ji $ 0; + ji # C
>
: i¼1
Notice that the differences from the separable case in (A.4.11) are that: (1) the
right-hand side of the margin constraint (the second constraint) now incorporates
soft margins that allow some misclassification, and (2) we have added a new
constraint which caps the extent of misclassification using the pre-specified con-
stant C.
Similar to “Technical detour 8” we have to set up the Lagrangian and solve the
FOCs. Just as in that case, here too it is convenient to recast the program such
that the objective function is the norm ‖b‖, whereupon the second constraint in
p
(A4.18) reduces to: yi ðb0 1 + xij bj Þ $ 1 2 ji , for all i. This is analogous to how
j51
(A4.11) has been recast as (A4.12) for the maximal margin classifier in the
separable case. One salient difference is that now the KKT conditions imply that
p
if li . 0 then yi ðb0 1 + xij bj Þ 5 12 ji (previously, in the separable case the
j51
right-hand side was just 1). As before, the optimal separating hyperplane is
characterized only by those xi for which the corresponding li .0. These are the
support vectors.
Technical Detour 11
In the SVM program, the solution of the optimal separating hyperplane is
accomplished by setting up the Wolfe dual as in (A4.17). The major difference
from (A4.17) is that the objective function is expressed in terms of the kernel as
shown below:
N 1 N N
max LD ¼ + li 2 + + li lk yi yk kðxi ; xk Þ (A4.19)
l1 ;:::lN i¼1 2 i¼1 k¼1
It can be shown that, for a generic point x, the optimal separating hyperplane
can be expressed in terms of kernels as:
N
f ðxÞ ¼ b0 1 + li yi kðx; xi Þ (A4.20)
i¼1
The classifier is just: sign(f (x)).
WORKED-OUT ILLUSTRATIONS
Illustration 3
We had identified the support vectors as s1 5 (1, 0) for the “2” class and s2 5(2, 0)
for the “1” class. The target labels corresponding to these points are y1 5 21 and
y2 5 11. We apply the two support vectors as arguments of the function f (x) for
the separating hyperplane in (A4.16). So, corresponding to s1 and s2, we have two
equations that have to be solved simultaneously. Consider the function f(x) cor-
responding to the support vector s1 2 that is f(s1). The summation of the right-
hand side will clearly involve only the two support vectors since only these
correspond to nonzero li. Let us write the inner products on the right-hand side of
(A4.16) as dot products for compact notation. Thus, for example, , s1, s2. is
written as s1.s2, and so on. Thus, from (A4.16), the equation corresponding to s1 is:
f ðs1 Þ 5 b0 1 l1 ð 2 1Þs1 :s1 1 l2 ð1Þs1 :s2 . Now, since s1 is in “2” class, therefore f(s1)
5 21. Similarly, the function f(x) corresponding to the support vector s2 – that is
f(s2) – will also involve only the two support vectors. We then have the two
following equations corresponding to the two support vectors as:

2 1 ¼ b0 1 l1 ð 2 1Þs1 :s1 1 l2 ð1Þs1 :s2
(A4.21)
1 1 ¼ b0 1 l1 ð 2 1Þs2 :s1 1 l2 ð1Þs2 :s2
In these equations, the dot products (inner products) are defined on input
space. Using s1 5 (1, 0) and s2 5(2, 0), the above simultaneous equations become:

2 1 ¼ b0 2 l1 ð1; 0Þ:ð1; 0Þ 1 l2 ð1; 0Þ:ð2; 0Þ
1 1 ¼ b0 2 l1 ð2; 0Þ:ð1; 0Þ 1 l2 ð2; 0Þ:ð2; 0Þ
Evaluating the inner products using (A4.3) gives:

b0 2 l1 1 2l2 ¼ 2 1
b0 2 2l1 1 4l2 ¼ 1
Also, the FOC with respect to b0 given in (A4.14) yields l1y1 1 l2y2 5 0. This
gives 2l1 1 l2 5 0, so that l1 5 l2. Substituting in the simultaneous equations
above gives the optimal values:
b*0 ¼ 3; l*1 ¼ 2; l*2 ¼ 2
Consider a generic point x5(x1, x2). From (A4.16), the optimal separating
hyperplane is:
f ðx1 ; x2 Þ ¼ 3 1 ð2Þð1Þðx1 ; x2 Þ:ð1; 0Þ 1 ð2Þð1Þðx1 ; x2 Þ:ð2; 0Þ:
To be consistent with writing hyperplanes in terms of the inner product

notation, the above is just:
f ðx1 ; x2 Þ ¼ 2 3 2 2 , x; s1 . 1 2 , x; s2 . (A4.22)
Evaluating the inner products, this finally gives the optimal separating
hyperplane, where we note that this hyperplane lies in input space:
f ðx1 ; x2 Þ ¼ 2 3 1 2x1 (A4.23)
Illustration 4
We had identified the two support vectors as: s1 5 (0.5, 0.5) for the “2” class
and s2 5 (1, 1) for the “1” class. The target labels corresponding to these points
are y1 5 21 and y2 5 11. We apply the two support vectors as arguments of the
function f(x) for the separating hyperplane given in (A4.20). Consider f(s1). The
summation of the right-hand side will clearly involve only the support vectors
since only these correspond to nonzero li. Also, recall that the kernel k(s1, xi) is
the inner product , f(s1), f(xi)., which we will write as a dot product for
compact notation. Then, corresponding to s1 and s2, we have two simultaneous
equations where we use f(s1) 5 21 and f(s2) 5 11 on the left-hand side:

2 1 ¼ b0 1 l1 ð 2 1Þfðs1 Þ:fðs1 Þ 1 l2 ð1Þfðs1 Þ:fðs2 Þ
(A4.24)
1 1 ¼ b0 1 l1 ð 2 1Þfðs2 Þ:fðs1 Þ 1 l2 ð1Þfðs2 Þ:fðs2 Þ
In these equations, the inner products are defined on feature space. Now, from
Illustration 4.1, the nonlinear map that we use to define the kernel is:

ð2 2 x2 1 jx1 2 x2 j; 2 2 x1 1 jx1 2 x2 jÞ if x21 1 x21 . 0:5
fðx1 ; x2 Þ ¼ (A4.25)
ðx1 ; x2 Þ otherwise
Thus, f(s1) 5 (0.5, 0.5) and f(s2) 5 (1, 1). This nonlinear map takes vectors in
2-dimensional input space and maps them to 2-dimensional feature space. Sup-
pose, we denote the axes of the feature space as W1 and W2. Using the map in
(A4.25), the simultaneous equations in (A4.24) become:

2 1 ¼ b0 2 l1 ð0:5; 0:5Þ:ð0:5; 0:5Þ 1 l2 ð0:5; 0:5Þ:ð1; 1Þ
1 1 ¼ b0 2 l1 ð1; 1Þ:ð0:5; 0:5Þ 1 l2 ð1; 1Þ:ð1; 1Þ
Evaluating the inner products using (A4.3) gives:

b0 2 0:5l1 1 l2 ¼ 2 1
b0 2 l1 1 2l2 ¼ 1
Also, the FOC with respect to b0 given in (A4.14) yields l1y1 1 l2y2 5 0. This
gives 2l1 1 l2 5 0, so that l1 5 l2. Substituting in the simultaneous equations
above gives the optimal values:
b*0 ¼ 4; l*1 ¼ 6; l*2 ¼ 6
Consider a generic point x5(x1, x2). From (A4.20), the optimal separating
hyperplane is:
f ðxÞ ¼ 4 1 ð6Þð1Þ kðx; s1 Þ 1 ð6Þð1Þ kðx; s2 Þ
Expanding the kernels, the above can be written as:

f ðxÞ ¼ 2 4 2 6 , fðxÞ; fðs1 Þ . 1 6 , fðxÞ; fðs2 Þ . (A4.26)
Note that this hyperplane lies in the feature space with W1 and W2 as the two
axes. We denote the image of a generic input vector (x1, x2) under f as f(x1, x2) 5
(w1, w2). Evaluating the inner products using f(x)5(w1, w2), f(s1)5(0.5, 0.5), and
f(s2)5(1, 1), the previous expression for the hyperplane is:
f (x1, x2) 5 2413w113w2
Illustration 5
Since there are 4 training data points, from (A4.20), the hyperplane (decision
4
boundary) in the feature space is f ðxÞ 5 b0 1 + li yi kðx; ai Þ. We use a poly-
i51
nomial kernel of degree 2. For a generic x5(x1, x2) and z5(z1, z2) the kernel is
defined as:
kðx; zÞ ¼ ð1 1 Æx; zæÞ2 ¼ ð1 1 x1 z1 1 x2 z2 Þ2 (A4.27)
where the last expression results from evaluating the inner product in input space
using (A4.3). This gives:
kðx; zÞ ¼ x21 z21 1 x22 z22 1 2x1 x2 z1 z2 1 2x1 z1 1 2x2 z2 1 1
The nonlinear expression on the right-hand side above can be expressed as an

inner product, ,f(x), f(z)., in feature space. This is done using the following
nonlinear map f which maps x5(x1, x2) in 2-dimensional input space into a
6-dimensional feature space:
pffiffiffi pffiffiffi pffiffiffi
fðxÞ ¼ ð x21 ; x22 ; 2 x1 x2 ; 2 x1 ; 2 x2 ; 1Þ (A4.28)
Then, we can see that:

kðx; zÞ ¼ x21 z21 1 x22 z22 1 2x1 x2 z1 z2 1 2x1 z1 1 2x2 z2 1 1 ¼ , fðxÞ; fðzÞ . (A4.29)
Expanding the terms in the hyperplane (decision boundary),

4
f ðxÞ 5 b0 1 + li yi kðx; ai Þ, we get:
i51
f ðxÞ ¼ b0 2 l1 ð 2 x2 1 1Þ2 1 l2 ð 2 x1 1 1Þ2 2 l3 ðx2 1 1Þ2 1 l4 ðx1 1 1Þ2 (A4.30)
The second term on the right-hand side has (2x2 11)2, which is just k(x, a1) the
kernel of x5(x1, x2) and a1 5 (0, 21), where we recall that the kernel is defined as
k(x, a1) 5 (11 Æx; a1 æ)2. The other three-squared terms can be similarly obtained.
To obtain the optimal li, i 5 1,…,4, and b0 we specify the Lagrangian (Wolfe)
dual program below whose objective function is given in (A4.19):
4 1 4 4
max LD ¼ + li 2 + + li lk yi yk kðai ; ak Þ
l1 ;:::lN i¼1 2 i¼1 k¼1
4 (A4.31)
s:t: + li yi ¼ 2 l1 1 l2 2 l3 1 l4 ¼ 0
i¼1
li $ 0; i ¼ 1; 2; 3; 4
In the program above, the labels yi are known and the kernels k(ai, ak), i, k 5 1,
2, 3, 4, can be easily calculated using the definition of the kernel above. Thus, it
remains to solve for the li. Using standard techniques, we obtain:
l*1 ¼ l*2 ¼ l*3 ¼ l*4 ¼ 0:5
Since all li . 0, therefore all 4 vectors are support vectors. The decision
boundary with these li* is:
f ðxÞ ¼ b0 1 x21 x22 :
To determine b0, we can use any support vector. We use vector x1. By defi-
nition y1 f(x1) 5 1, where y1 5 21 and x1 5 (0, 21). This implies that b0 5 0.
Finally, the optimal classifier (decision boundary) in input space is:
x21 2 x22 ¼ 0 (A4.32)
Chapter 5
Random Forest, Bagging, and Boosting of

Decision Trees
Chapter Outline
1. Early Evolution of Decision Trees: AID, THAID, CHAID
2. Classification and Regression Trees (CART)
2.1 Regression Trees
2.1.1 Greedy Algorithm
2.1.2 Cost Complexity Pruning
2.2 Classification Trees
3. Decision Trees and Segmentation
4. Bootstrapping, Bagging and Boosting
4.1 Bootstrapping
4.2 Bagging
4.3 Boosting
5. Random Forest
6. Applications of Random Forests and Decision Trees in Marketing and Sales
7. Case Studies
Technical Appendix
1. Early Evolution of Decision Trees: AID, THAID, CHAID

A random forest is a very popular machine learning model which, as the name
suggests, builds on the ideas of decision trees. Therefore, before discussing random
forests it is important to give some background of decision trees. This will enable a
deeper appreciation of how random forests, which are essentially aggregates of
decision trees, improve on a single decision tree. Decision tree-based methods are
a commonly used class of supervised learning which can be used in both regression
and classification contexts, that is, with both continuous and categorical output
variables. They can handle both continuous and categorical input variables.
Among database marketers they are very popular since they are highly visual and
easy to interpret. Also, it is easy to communicate the results of the decision
tree analyses to non-technical audiences in plain English. A major advantage is
that these are non-parametric methods that do not make strong distributional

doi:10.1108/978-1-80043-880-420211006
assumptions. The earliest decision tree method was the Automatic Interaction
Detection (AID) method developed by Morgan and Sonquist (1963). The AID
model was developed for regression trees which have a continuous response
variable. After several years, in the early 1970s, Messenger and Mandell (1972)
proposed the first algorithm for classification trees which have a categorical
response variable. The Chi Squared Automatic Interaction Detection (CHAID)
method was developed by Kass (1980). Together, the AID, THAID, and CHAID
can be thought of as the first wave of tree-based methods.
While some of the early methods were designed to handle continuous response
variables, the easiest means of developing an intuition of how these tree-based
models work is to consider a categorical response variable with several categorical
predictor (explanatory) variables. In a typical marketing context, consider the
case of classifying a group of people into “buy” or “not buy” for a specific product.
Suppose that the three predictor variables are “gender,” “marital status,” and
“occupation.” To make the illustration simple, suppose further that marital status
has two categories, “married” and “not-married,” where divorced people are put in
the not-married category. Similarly, “occupation” also has two categories, “white
collared” and “blue collared.” For the moment we will not concern ourselves with
the issue that many occupations do not fit into this simplistic categorization. We
will describe the CHAID algorithm as a prototype for the early tree-based models.
The reason for using predictor variables with only two categories is that we can
avoid the complexities involved in finding the “optimal” combination of categories
for variables that have more than two categories. This will allow us to focus on
providing a simple intuition for how the tree is built in stages, rather than being
ensnared in technical details.
Most texts that explain the CHAID method take a terse algorithmic approach
since they want to show how the method works for a general data set. We will
take a descriptive approach and tie our explanation to the specific marketing
illustration detailed in the previous paragraph. Hence, the steps in our description
of CHAID will be different from the steps of the formal algorithm detailed in the
more technical resources.
Step 1: Cross-tabulate the response variable with each of the predictor variables.
Since we have three predictor variables we have three cross-tabulations, with the
cross-tabulation with respect to “gender,” followed by “marital status,” followed by
“occupation” from left to right below.
The chi-squared statistic is used to test for the statistical significance of a

cross-tabulation. This explains the use of chi-squared in the CHAID acronym.
So here we have three chi-squared numbers corresponding to the three cross-
tabulations.
Random Forest, Bagging, and Boosting of Decision Trees 141
Step 2: Find the “most significant” of these predictor variables. Based on the
three chi-squared values, suppose all three are significant and “gender” is the most
significant. Then, the first branch of the tree is for “gender” as shown below.
Step 3: Consider the node “male” in the tree in step 2. Cross-tabulate the
response variable with each of the remaining predictor variables, “marital status”
and “occupation,” for the males as shown below.
Then calculate the two chi-squared values corresponding to the two cross-
tabulations with the predictor variables “marital status” and “occupation.”
Step 4: Find the “most significant” among the predictor variables “marital
status” and “occupation” for the males. Suppose, based on the chi-squared values,
both are significant and “marital status” is the most significant. Then, the branch
of the tree in step 2 starting from the node “male” is for “marital status” as shown
below.
Step 5: Consider the married males corresponding to the node “married” in

the tree in step 4. Create the cross-tabulation for the only remaining predictor
variable – “occupation.” This is shown below.
We have a chi-squared value for the above cross-tab.

Step 6: Based on the chi-squared value, suppose this predictor variable is
significant, Then the branch of the tree in step 4 starting from the node “married”
is for “occupation” as shown below.
If the cross-tabulation in step 5 is not significant then we will not have this last
branch and the process ends.
Step 7: Carry out steps 5 and 6 for the “not married” node in the tree in step 4.
Finally, carry out steps 3 to step 6 for the “female” node in the tree in step 2.
It is important to note three aspects of CHAID. First, CHAID, like all decision
trees, is well suited to detect interactions among the predictor variables – hence the
term “interaction detection” in the acronyms AID, THAID and CHAID. There
can be different branching/splitting variables for different nodes at the same level
in the tree. For example, suppose that in the tree in step 2, the branch after the
“female” node is “occupation” and there are no more splits after that. Therefore,
the effect of the predictor variable “marital status” on the response (“buy”/”not
buy”) depends on another predictor variable “gender” – marital status matters for
the males but not for the females. In other words, the manner in which a given
predictor variable affects the response depends on other predictor variables – a
classic interaction effect! One can incorporate interaction effects in regression, but
in a classic chicken-and-egg conundrum, one needs to have some prior knowledge

or intuition for what interactions to include in the regression equation. The theory
itself is not very helpful for detecting interactions. Second, there are additional
steps if some of the predictor variables have more than two categories. In that case,
there is a process of merging the categories so as to arrive at the optimal combi-
nation of categories for each predictor variable. For each predictor variable we use
the optimally merged categories in the cross-tabulation. Essentially, categories of a
predictor variable that are similar with respect to the response variable are merged
together. In that case, the statistical significance of the chi-squared test is adjusted
(by the Bonferroni multiplier) to account for the bias toward predictors with more
categories in them. Our exposition was simple because with only two categories this
additional step is irrelevant. Third, the standard CHAID model handles contin-
uous predictor variables by forming categories of such variables. Hence, our
illustration using categorical predictor variables is consistent with the standard
model.
In terms of chronology CHAID was the last among the initial wave of tree-
based methods, but it was the easiest for us to give a simple intuition of how
decision trees are built using this model. Some fundamental concepts underlying
the other two models in the first wave of decision tree models, AID and THAID,
are shared with the second wave of tree models starting with CART. We will
discuss those in our treatment of the CART model.
2. Classification and Regression Trees (CART)

At a fundamental level, there are two ideas underlying decision trees. First, it
involves a partition of the feature space into non-overlapping regions. A feature
space is just the set of all possible values of the features or predictor variables.
Second, in each of the regions resulting from the partition, the algorithm makes
the same prediction for all training observations in that region. The prediction rule
depends on whether the context is that of regression or classification.
The partitioning of the feature space is done based on the predictor variables.
Consider a financial products company which sells a certain financial product.
The company is interested in classifying people who may or may not be interested
in buying this product. So, the binary response is “buy” or “not buy.” From their
past experience, the company knows that the product had high appeal among
consumers who either already have high savings or have a potential of generating
high savings in their professional careers. Specifically, the company is interested in
retirees (assuming a retirement age of 65) with high savings or young professionals
under 40 who currently may have low savings, but have a long enough remaining
work life to generate high savings. Young professionals under 40 who have high
savings are already targeted by many bigger rivals and are seen as less promising
prospects for the focal company. Said differently, these two segments of cus-
tomers have a higher proportion of the “buy” response compared to other seg-
ments. Let the two explanatory variables be X1 for “savings” and X2 for “age,”
where each lies in some continuous interval.
• The first partition is: Split feature space at X2 5 k1, where k1 5 65 years.
Fig. 5.1 shows this partition.
• The second partition is: For X2 . k1 split region at X1 5 k2. Here k2 is a
suitable high dollar amount of savings. Pictorially (Fig. 5.2), we have
• The third partition is: For X2 , k1 split region at X2 5 k3 where k3 5 40 years.
Now the partitioned feature space becomes (Fig. 5.3)
• The fourth partition is: For X2 , k3 split region at X1 5 k4. Here k4 is a
relatively low dollar amount of savings. The final partitioned feature space is
(Fig. 5.4)
In terms of the regions shown in Fig. 5.4, the financial product company’s
target segments are region R5 which has people age 65 and above (i.e., retirees)
with high savings and R1 which has working professionals under 40 who
currently have low savings.
Fig. 5.1. First Binary Partition of a Two-Dimensional Feature Space.
Fig. 5.2. Second Binary Partition of a Two-Dimensional

Feature Space.
Fig. 5.3. Third Binary Partition of a Two-Dimensional

Feature Space.
Fig. 5.4. Final Partitioned Two-Dimensional Feature Space

Showing Nonoverlapping Regions.
We can depict the sequence of binary partitions from Figs. 5.1–5.4 as a tree
diagram. Since the binary partitions proceeded in a sequence, the process is called
recursive binary partitioning. The tree is shown in Fig. 5.5.
Since the modeling of interactions among predictor variables is one of the major
strengths of tree-based models, we will demonstrate the manner in which the tree in
Fig. 5.5 captures interactions. Note that, for people with age greater than 65, a high
level of savings (“savings” . k2) has a higher proportion of “buy” responses. On
the other hand, among people with age less than 40, a low level of a savings
(“savings”, k4) has a higher proportion of “buy” responses. Thus, the manner in
which the predictor variable “savings” affects the response (“buy”/“not buy”)
depends on the predictor “age.”
Fig. 5.5. Tree Diagram of Recursive Binary Partitioning.
One might wonder why we restrict ourselves to recursive binary splits. The
main reason is interpretability, since arbitrary partitions can lead to regions that
are difficult to describe and interpret. As an illustration, consider the partition
below.
In Fig. 5.6, there are non-overlapping regions and oddly shaped regions that
are difficult to describe. For example, in region R there are two points such that
the line segment that connects them does not lie entirely in the region. Regions
like this are called non-convex, and they create problems for many optimization
algorithms. Partitions such as in Fig. 5.6 would be hard to describe as trees.
Recursive binary partitions allow the entre partition to be conveniently repre-
sented as a single tree, enabling all the easy and intuitive interpretation and
visualization of tree diagrams.
Fig. 5.6. Arbitrary Partition Does Not Allow Easy Interpretation.

Once the feature space is partitioned into non-overlapping regions as described

above, the decision tree algorithm makes the same prediction for all training
observations in a given region. As mentioned above, the specific prediction for a
region depends on whether we have a regression or a classification context. For a
regression context, as in AID, the prediction is the mean of the response values
corresponding to the training observations in that region. For a classification
context, as in THAID, the prediction is the mode of the response values, that is,
the predicted class of an observation in any region is the most commonly
occurring class in that region.
2.1 Regression Trees

Consider the case of a continuous response variable. The tree growing method
proceeds in a sequence of binary splits which generate non-overlapping regions.
In the illustration above our goal was to give an intuition for how a recursive
binary partition can be represented as a tree diagram. To this end, we had laid
aside any discussion of how exactly split points are decided – e.g., 65 years for the
“age” variable for the first split etc. However, in actual applications of decision
trees we need to use the training data itself to decide the “optimal” split points.
For this purpose, we need a criterion to measure how good a split is. For a
regression tree this criterion is the sum of squares (SS) over the entire training
data. Once we have the sum of squares for each of the two regions created by a
binary split, then we can just add up these two terms to obtain the total sum of
squares over the training data. How is the sum of squares in a given region
computed? Suppose our training data consists of N observations (xi, yi), i 5 1,…, N,
with observation xi belonging to a p-dimensional space. Thus, the ith observation
xi is a p-dimensional vector xi 5(xi1,…, xip). Consider a binary partition of the
training data into two regions. If the predicted value corresponding to xi is f(xi),
then the sum of squares for a region is just the sum of terms like (yi 2 f(xi))2 over
all observations xi in that region. The reader may realize that the sum of squares
over the training data is similar to the usual squared error cost in ordinary least
squares regression. It can be shown that if the prediction is a constant for
all points xi in a region, then the sum of squares for the region is minimized if the
prediction corresponding to all points in that region is just the average of the
responses for that region. That is, in region Rm, the optimal prediction for every
point xi in Rm is yðmÞ , where the quantity
yðmÞ is the average of all the responses yi
for points in Rm (see “Technical detour 1” for in intuitive sketch of this fact).
Note the parenthesis in the subscript for the average. The precise manner in
which optimal split points are determined is based on the total sum of squares
across the two regions resulting from the split. This is discussed in more detail
next in the greedy algorithm.
2.1.1 Greedy Algorithm

We describe the algorithm that accomplishes the recursive binary partition
used to fit a regression tree to a training data set. This algorithm is called
“greedy” because it looks just one step ahead instead of looking ahead till the
end. Thus, at any step it optimally decides just the next splitting variable and
split point in a myopic manner rather than being far-sighted and accounting
for the future steps in the tree formation process. To make a distinction
between a predictor variable and a specific value that it could take, we use the
uppercase Xj to denote the jth predictor variable and the lowercase xij for the jth
component of the ith observation xi 5(xi1,…, xip). We provide an intuitive
sketch of the greedy algorithm which accomplishes the recursive binary
partition.
Step 1: For a predictor variable Xj (j 5 1,…, p) and a split point “s” compute
the total sum of squares over the two regions that are generated by a binary
partition. This total sum of squares is minimized over all predictors Xj and all
split points “s”. This identifies the next optimal splitting variable Xk and its split
point sk*.
A bit of intuition about this critical step may be helpful for the reader. Once
we have a predictor variable Xj and a split point “s” we can partition the feature
space into two regions: R1 and R2. In this binary partition, R1 is the set of all
observations xi whose jth component xij is less than “s”. Similarly, R2 is the set of
all observations xi whose jth component xij is greater than “s”. Different split
points “s” give different regions R1 and R2. We choose that split point “s” which
minimizes the sum of squares over the training data. The sum of squares was
described in the opening paragraph of Section 2.1, and here we specialize it to a
regression tree. In the case of a regression tree, for any region we have the same
predicted value for all points in that region – the average response. Using that
prediction, and the actual yi corresponding to each point xi in a region, we can
compute the sum of squares for that region. We do the same for the other region.
We add them to get the sum of squares over the training data. This sum of
squares is a function of the assumed splitting variable Xj and split point “s” which
generates the two regions R1 and R2. We choose the variable and split point to
minimize the sum of squares.
Practically speaking, this step is accomplished in two stages: (1) In the first
stage, for a fixed Xj, we assume a split point “s”. For that Xj and s we form the
binary partition as described in the previous paragraph. We then select that split
point “s” which minimizes the total sum of squares. This is the optimal split point
given the splitting variable Xj (2) In the second stage we vary the predictors Xj
over all j 5 1,…, p, and repeat step 1. Of course, when the splitting variable and
split point are varied to generate two new regions though a binary split, the
predicted response also changes since the response in a region depends on the
training observations included in that region.
Step 2: Form the two regions based on the splitting variable Xk and split
point sk*. Thus, R1 is the set of all observations xi whose kth component xik is
less than sk*. Similarly, R2 is the set of all observations xi whose kth component
xik is greater than sk*. Repeat step 1 for region R1, so this region also splits into
two in an optimal manner. This gives us three regions. Split one of these three
regions in similar manner and continue till some stopping rule terminates this
process. A common stopping rule is based on a pre-determined number of

observations in each node, where the process stops when this number is
reached.
For interested readers we provide some technical details of the greedy
algorithm.
Technical detour 1:
Recall that the AID method discussed in Section 1 generates a regression tree.
Similar to the CART method for regression trees, it too uses an algorithm very
similar to the greedy algorithm just described. However, the AID and CART
differ on other aspects, and a major distinction is the rule for determining how
large the tree should be. The most common stopping rule for AID is based on the
sum of squares and the algorithm stops if the reduction in the sum of squares as
a result of a proposed split is not large enough to justify the split. The deter-
mination of how much reduction of sum of squares is desirable at each step is
left to the analyst’s judgment. In CART the determination of how large to grow
the tree is based on important considerations of controlling overfitting the tree
to the training data. Thus, tree size is chosen based on the training data itself.
Controlling overfitting is accomplished via the important concept of cost
complexity pruning of the tree. We turn to this next. In the typical imple-
mentation of CART solution, trees are grown to maximum depth and then
pruned back using the pruning criterion.
2.1.2 Cost Complexity Pruning

We first provide some intuition for why this specific type of tree pruning is
needed. A very large tree is likely to overfit the training data, whereas a tree that
is too small may not capture important information in the data. The CART
approach to tree size determination is to err on the side of capturing the infor-
mation in the data by growing a large enough tree, but then pruning the tree in a
specific manner to control overfitting. In this approach, the tree size is a tuning
parameter. In CART, a standard approach to growing a large tree is to use a
non-stringent stopping criterion, for instance, growing it till each node has a
minimum number of training observations. This will likely generate a large tree
that may overfit the data. Now, the most obvious way to control overfitting
would be to consider different sub-trees and then use cross-validation to choose
the appropriate sub-tree. Cross-validation requires a measure of the goodness-of-
fit of a tree on test data. This goodness-of-fit measure is the total sum of squares
for the entire tree. We first need a means to quantify the size of a tree – it is the
number of terminal nodes. The topmost node of a tree is often referred to as the
“root” node, and the nodes that do not have branches emanating from them are
“terminal” nodes. For example, in Fig. 5.5 there are five terminal nodes where
each terminal node represents a region Rm, m 5 1,…, 5. Suppose a tree has M
terminal nodes. Each terminal node m defines a region Rm, m 5 1,…, M. When
we use the terminology of nodes when describing the partitioning the feature
space, we talk of a “parent node” giving rise to two “children nodes” as result of a
binary partition. We have already described how we calculate the sum of squares
for a given region. Let us denote the sum of squares for region m by SSm, m 5
1,…, M. The sum of squares over the training data is the sum of all the SSm for
m 5 1,…, M. That is: SS 5 SS11…1 SSM. The SS is a measure of the goodness-
of-fit of a tree, and since it is like a cost/error function the goal is to minimize it.
The sub-tree that has the minimum SS on the test data would be the desired sub-
tree that avoids overfitting. While theoretically sound, this process is computa-
tionally infeasible owing to the very large number of sub-trees that one would
have to evaluate. Cost complexity pruning gives us a way to do pruning by only
considering a small set of sub-trees.
Intuitively, the cost complexity pruning method essentially puts a penalty
on the size of the tree such that the minimizing criterion is a combination of
the total sum of squares over all terminal nodes plus the number of terminal
nodes. Suppose that we have grown a large tree T9which we wish to prune.
Consider a sub-tree of T9denoted by T. The cost complexity criterion, which is
a function of sub-tree T, is a combination of the total sum of squares and the
sub-tree size measured by the number of terminal nodes, where we assume that
the sub-tree has M nodes. The cost complexity criterion is,Cost complexity
(T) 5 SS 1 lM
The parameter l plays a role similar to the “weight decay” parameters dis-
cussed in Chapter 3. Since the goal is to minimize the cost complexity criterion,
therefore larger trees with more terminal nodes will be penalized. For larger
values of l, large tree sizes are penalized even more and the criterion yields
smaller trees that are less likely to overfit the training data. For each l we find
the sub-tree T that minimizes the above cost complexity criterion. We denote it
by Tl. It has been shown that we are guaranteed to find such a Tl corresponding
to each l (Brieman, Friedman, Ohlsen, & Stone, 1984). The idea is to succes-
sively combine internal nodes (nodes between the root and the leaves). When
two nodes are combined, it causes an increase in the sum of squares (SS). To get
an intuition for this, consider that the largest possible tree, one where each node
is just a single point, has a SS of zero since the predicted value (which is the
average of the response corresponding to a single observation) exactly equals the
actual response. Therefore, when we combine two nodes to create a smaller tree,
the SS increases. At each step we combine those two nodes which cause the
smallest increase in SS. As we continue in this way we produce the tree with a
single node which has the largest possible SS. Brieman et al. (1984) have shown
that this sequence must contain the cost complexity minimizing sub-tree Tl. In
“Technical detour 2” we provide details of the cost complexity pruning
technique.
Technical detour 2:
Finally, l is a tuning parameter estimated using cross-validation, and we sketch

the cross-validation process when we summarize the regression tree building process.
The entire regression tree building process can be summarized as follows:
Step 1: Grow a tree to training data using recursive binary splitting. This step uses
the greedy algorithm described above to grow the tree. This step results in a
large tree which may overfit the training data.
Step 2: Do cost complexity pruning which will give a set of smaller sub-trees, Tl,
indexed by the tuning parameters l. We must now choose one of these sub-
trees. This is done via cross-validation as shown in step 3.
Step 3: Do a K-fold cross-validation. For each l do the following
a. Divide the training data into K folds. For each k 5 1,…, K, do: (i) Using the
process in step 1, grow a large tree using the training data that remains when we
leave out the kth fold, and (ii) Using the process in step 2, do cost complexity
pruning of the large tree generated in step (i).
b. Compute the predicted mean squared error on the test data in the kth fold that
was left out in step 3a.
c. Compute the average of the predicted mean squared errors from step 3b over
all the K folds. So, we now have the cross-validation error for each l.
Step 4: Choose the l corresponding to the lowest cross-validation error, and denote
it by l*. The required tree is the one corresponding to l* from step 2.
2.2 Classification Trees

Classification trees build decision trees when the response variable is categorical.
Overall, the main concepts behind classification trees are similar to those of
regression trees in Section 2.1, except for (1) the rule for making the predictions
for points in a region, and (2) the criterion for splitting the nodes (partitioning the
regions) and pruning the tree.
Just as in the case of regression trees, the predicted response for all training data
points (observations) in a given region is just a constant. However, for classification
trees the prediction f(x) in a region is the mode of the responses yi corresponding to
all observations xi in that region. That is, we take the most frequently occurring
class in a given region as the prediction for all observations in that region.
Growing a classification tree also proceeds via recursive binary partitions, and
the algorithm for doing so is similar to the greedy algorithm we described in
Section 2.1.1. Here too, once we have grown a large tree the task of pruning it
takes place by a process similar to the cost complexity pruning that was
described in Section 2.1.2. The main difference in the case of classification trees is
that we do not use the sum of squares criterion for growing and pruning the tree.
Instead, the most commonly used criteria are the (1) Gini index (2) Cross entropy.
The Gini index is a measure of statistical variance across the classes, and is
perhaps the most commonly used measure of inequality in groups. The cross-
entropy is an information theoretic measure and is often referred to as Shannon’s
entropy.
Suppose we want to build a classification tree in the multi-class case where there
are K classes of the response variable, with the classes indexed by k 5 1,…, K.
Suppose we have a partition of the feature space into M regions Rm, m 5 1,…, M.
We denote by pmk the proportion of training data points in region m that belong to
class k. The Gini index for node m (region m) is given by
K
Gini index ¼ + pmk ð1 2 pmk Þ
k¼1
The cross-entropy for node m (region m) is given by

K
Cross entropy ¼ 2 + pmk log pmk
k¼1
As the recursive binary partition proceeds in a stepwise fashion, each newly

created region after a partition is more homogeneous than the regions at the
previous step. The homogeneity is with respect to the response variable.
Consider a response variable with two classes: “buy” and “not buy.” Suppose
there is a binary partition of a region. After the partition, one of the resulting
regions would have a higher proportion of “buy” observations than the parent
region that was split. Similarly, the other resulting region would have a higher
proportion of “not buy” observations than the region that was split. Said
differently, the nodes have been made “purer,” in the sense that both the chil-
dren nodes resulting from a binary partition are more homogenous than the
parent node. Now, a partition decreases both the Gini index and the cross-
entropy and therefore these are called measures of node impurity – in the sense
that successive splits increase node purity, that is, reduce the impurity of nodes.
To build some intuition for why more pure nodes, that is, nodes that are more
homogeneous, have a smaller Gini index, consider again the response variable
with two classes – class 1 is “buy” and class 2 is “not buy.” Consider the extreme
case where we have partitioned the feature space into two regions with only
“buy” responses in one region 1 and only “not buy” responses in region 2. This is
a case of complete homogeneity of the regions (nodes). Since we have only
observations belonging to class 1 in region 1, therefore p11 5 1. There are no
observations belonging to class 2 in region 1, and therefore p12 5 0. Similarly,
since we have only observations belonging to class 2 in region 2, therefore p22 5 1.
Also, p21 5 0. From the formula above, the Gini index for region 1 (node 1) is:
p11(1 2 p11) 1 p12(1 – p12) 5 0. Similarly, the Gini index for region 2 (node 2) is:
p21(1 2 p21) 1 p22(1 – p22) 5 0. So, the minimum value of the Gini index is 0. In a
diametrically opposite situation, when a region is completely heterogeneous,
i.e., when there is a completely random distribution of “buy” and “not buy”
responses in that region, then in the two-class case the Gini index is 0.5. This is
quickly checked by noting that if p11 5 0.5 and p12 5 0.5, then the Gini index for
region 1 is 0.5*(1–0.5) 1 0.5*(1–0.5) 5 0.5. While we have sketched out the
characteristics of the Gini index for the two-class case, the same underlying logic
applies to the general case of K classes. The cross-entropy cost also has the similar
behavior to the Gini index.
Recall that in the case of a regression tree, the predictor variable that causes
the largest decrease in the sum of squares was the one chosen to generate the next
binary partition. This logic carries over to classification trees as well – except that
the predictor that causes the largest decrease in the Gini index, or equivalently,
the cross entropy, is the one chosen for the next binary partition.
To build intuition, it may be useful to see this in a simple illustration. Let us
revisit the opening illustration of the CHAID example in Section 1. Recall that in
a typical marketing context we wanted to classify a group of people into “buy”
and “not buy” for a specific product. Instead of the three predictor variables,
suppose for the current illustration of how the Gini index performs binary splits
we consider only two predictors: “gender” and “marital status.” As before, we
assume that marital status has two categories, “married” and “not-married,”
where divorced people are put in the not-married category. For illustrative pur-
poses, consider the synthetic data in Table 5.1.
In order to use the Gini index for determining which predictor variable to split
the data on, we first have to compute the decrease in Gini index as a result of
splitting on both the variables, and then choose that predictor variable that cor-
responds to a larger decrease.
At the root node the proportion of “buy” and “not buy” are 0.5 each since
there are 6 “buy” and 6 “not buy” out of 12 customers. Therefore, the Gini index
for the root node is: Root: 0.5*(1–0.5) 1 0.5*(1–0.5) 5 0.5
Consider the split with respect to the predictor variable “gender.” Let region
1 be “males” and region 2 be “females.” Also, let the response class “buy” be
Table 5.1. Categorical Data for CHAID Analysis.

Customer # Response Gender Marital Status
1 Buy M Unmarried
2 Not buy F Unmarried
4 Not buy M Married
6 Buy M Married
7 Buy M Married
8 Not buy F Married
9 Buy F Unmarried
10 Buy M Married
11 Buy M Unmarried
12 Not buy F Married
class 1 and “not buy” be class 2. Among the 6 males, there are 5 “buy” and so
the proportion of “buy” is p11 5 5/6 5 0.83. Similarly, the proportion of “not
buy” among the males is p12 5 0.17. Among the 6 females, there is 1 “buy” and
so the proportion of “buy” is p21 5 0.17. Similarly, the proportion of “not buy”
among the females is p22 5 0.83. The Gini indices for regions 1 (male) and 2
(female) are: Males: p11(1 2 p11) 1 p12(1 2 p12) 5 0.83*(1–0.83) 1
0.17*(1–0.17) 5 0.2822; Females: p21(1 – p21) 1 p22(1 – p22) 5 0.17*(1–0.17) 1
0.83*(1–0.83) 5 0.2822
Accounting for the fact that there are 6 males and 6 females in a customer
base of 12, the decrease in Gini index when we split the root node by “gender” is:
0.5 2 (6/12)*0.2822 2 (6/12)*0.2822 5 0.22
Consider now the split with respect to the predictor variable “marital status.”
Let region 1 be “married” and region 2 be “unmarried.” Among the 6 married
customers, there are 3 “buy” and so the proportion of “buy” is p11 5 0.5.
Similarly, the proportion of “not buy” among the married customers is p12 5 0.5.
Among the 6 unmarried customers, there are 3 “buy” and so the proportion of
“buy” is p21 5 0.5. Similarly, the proportion of “not buy” among the unmarried
customers is p22 5 0.5. The Gini indices for regions 1 (married) and 2 (unmar-
ried) are: Married: p11(1 2 p11) 1 p12(1 2 p12) 5 0.5*(1–0.5) 1 0.5*(1–0.5) 5 0.5
Females: p21(1 2 p21) 1 p22(1 2 p22) 5 0.5*(1–0.5) 1 0.5*(1–0.5) 5 0.5
Accounting for the fact that there are 6 married and 6 unmarried people in the
customer base of 12, the decrease in Gini index when we split the root node by
“marital status” is: 0.5 2 (6/12)*0.5 – (6/12)*0.5 5 0
Since a split on the predictor variable “gender” causes a larger decrease in the
Gini index, we choose that predictor as the next splitting variable. This would
result in the tree shown in step 2 in our description of the CHAID algorithm in
Section 1.
It might be instructive to see how an informational measure such as cross-
entropy would select predictor variables to perform the split. Using the same
illustration in the previous paragraph, we first calculate the entropy for the root
node. At the root node the proportion of “buy” and “not buy” are 0.5, and so the
entropy (using logarithms of base 10) is: Root: 2[0.5*log(0.5) 1 0.5*log(0.5)] 5
0.3010
Consider the split with respect to the predictor variable “gender.” As before, we use
p11 5 0.83 and p12 5 0.17 for the proportions of “buy” and “not buy” among the
males. Similarly, we use p21 5 0.17 and p22 5 0.83 for the proportion of “buy” and
“not buy” among the females. The cross-entropy for regions 1 (male) and 2 (female)
are: Males: 2[p11log(p11) 1 p12log(p12)] 5 2[0.83*(20.08) 1 0.17*(20.77)] 5 0.196;
Females: 2[p21log(p21) 1 p22log(p22)] 5 2[0.17*(20.77) 1 0.83*(20.08)] 5 0.196
Accounting for the fact that there are 6 males and 6 females in a customer base
of 12, the decrease in Gini index when we split the root node by “gender” is:
0.3010 2 (6/12)*0.196 2 (6/12)*0.196 5 0.105
Consider the split with respect to the predictor variable “marital status.” As before,
we use p11 5 0.5 and p12 5 0.5 for the proportions of “buy” and “not buy” among the
married customers. Similarly, we use p21 5 0.5 and p22 5 0.5 for the proportion of
“buy” and “not buy” among the unmarried customers. The cross-entropy for regions
1 (married) and 2 (unmarried) are: Married: 2[p11log(p11) 1 p12log(p12)] 5 2

[0.5*log(0.5) 1 0.5*log(0.5)] 5 0.3010 Unmarried: 2[p21log(p21) 1 p22log(p22)] 5 2
[0.5*log(0.5) 1 0.5*log(0.5)] 5 0.3010
Accounting for the fact that there are 6 males and 6 females in a customer base
of 12, the decrease in Gini index when we split the root node by “gender” is:
0.3010 2 (6/12)*0.3010 2 (6/12)*0.3010 5 0
Thus, the cross-entropy measure identifies the same predictor variable as the
next splitting variable – it is “gender.” Also note that, similar to the Gini index,
the cross-entropy criterion also suggests that splitting the root node by “marital
status” would be completely futile since it would not result in any reduction of
either criterion.
Executive Summary
Decision trees are among the most intuitive, easy to interpret and visualize
among all machine learning models. This makes the results of decision trees
very easy to communicate to non-technical stakeholders. Moreover, decision
trees are particularly well-suited for identifying how interactions between
predictors can affect the response variable. Two predictors are said to interact
if the effect of one predictor on the response variable depends on the other
predictor.
Modern decision trees are grown using the “Greedy Algorithm.” Essen-
tially the greedy algorithm proceeds by recursively partitioning the entire data
set into smaller subsets, or regions, based on the predictors. The goal is to
form regions such that observations in a given region are similar with respect
to their association with the response variable. It is called “greedy” because it
myopically looks only one step ahead rather than considering the entire tree
growing process.
The predictive performance of decision trees can be improved by an
aggregation over multiple trees. This is the idea behind bagging and boosting of
decision trees and random forests. A random forest is an aggregation technique
that improves on bagging in a specific way.
3. Decision Trees and Segmentation

It can be argued that segmentation is at the heart of all marketing processes. In fact
the trio of Segmentation, Targeting and Positioning (STP) is perhaps the most
common strategic approach in modern-day marketing. As the acronym STP sug-
gests, this process starts with market segmentation. Even as far back as 1978,
Mahajan and Jain (1978) had remarked that the variety and sophistication of
various market segmentation methods had increased enormously. Decision trees
have long been used for segmenting markets where the marketer is able to identify
and measure a relevant response variable. In other words, decision trees can do
supervised segmentation. An obvious categorical response variable is “buy” or “not

buy.” As another example of a categorical response, many businesses, especially
those with a subscription service format, are interested in whether customers will
“churn” or “not churn.” As we have already mentioned, decision trees enable the
analyst to capture interactions among predictor variables. From a practical point of
view, such interactions are merely combinations of predictor variables that define
segments of customers. Let us revisit the opening illustration in Section 1 which is the
CHAID example. The response variable was “buy” or “not buy” and the three
predictor variables are “gender,” “marital status” and “occupation.” Consider
the tree shown in step 6 of the CHAID algorithm. Consider the left-most branch.
The left-most node is clearly a terminal node since the branch that reaches this node
has used all the three predictor variables and there cannot be any more branches
emanating from the left-most node. Then, starting bottom-up from the left-most
terminal node, we can see that this terminal node describes the segment of: “Married
males in white collar occupations.” Similarly, once the tree is finally built, each of the
terminal nodes will define a distinct segment of customers. Since this is a supervised
segmentation case, the segments would differ with respect to the response variable.
That is, different segments will differ on the proportions of customers who will “buy”
(equivalently, “not buy”) the product.
Another useful class of models are the unsupervised segmentation models.
These are also called clustering models. Strictly speaking, clustering models are
not decision trees in the sense that we have described them in Sections 1 and 2.
However, a very common type of clustering models, hierarchical clustering, are
depicted using dendrograms which have a tree structure. Once a clustering algo-
rithm is run and a dendrogram is obtained, we can identify segments from the
dendrogram. Because hierarchical clustering schemes involve tree-like structures
and are used for segmentation, we are discussing them in the current section on
segmentation. Our discussion will be brief since they are not the standard decision
trees which are usually supervised machine learning models. Readers interested in
segmentation techniques in marketing can refer to the excellent chapter on clus-
tering in the edited book by Bagozzi (1994).
Suppose we have a set of objects, such as customers, defined by certain char-
acteristics. In a marketing context the characteristics could be the customers’
demographic and psychographic variables along with their past purchase behav-
iors. Companies often want to segment their customer base on the basis of the
aforementioned variables. Another classic use of segmentation in marketing has
been the attempt to understand market structure by grouping competing brands
that are considered to be the most similar, that is, substitutes, by consumers. This
grouping of brands is accomplished by using brand-level and firm-level variables.
Hierarchical clustering models need a measure for the “distances” between objects
(say, customers) based on their characteristics. When the customer characteristics
are metric, the usual Euclidean distance is commonly used. Many other distance
measures, depending on the type of data, have been proposed in the clustering
literature. Most clustering schemes attempt to form groups such that there is
homogeneity among the customers within a group and heterogeneity between or
across groups. Readers may see the obvious similarity with the partitioning schemes
that have been the theme of our discussions about decision trees in Chapters 1
and 2. Tree partitioning also promotes homogeneity within a region and hetero-
geneity between regions. Since our main interest is in tree-structures, we will pro-
vide below a very brief sketch of how the hierarchical clustering scheme leads to a
tree structure from which segments can be identified.
Suppose we have seven objects denoted alphabetically from A through G (cor-
responding to the bold dots) in Fig. 5.7.
The first step is to identify the two objects that have the smallest distance
between them. In the formal clustering algorithm we usually have a matrix of
distances between the objects – with N objects we would have an N3N matrix
whose entries are pair-wise distances between the N objects. In our sketch of the
hierarchical scheme we will just do this visually. By visual inspection we can see
that the smallest distance is between objects F and G. At the first step we form the
first group as {F, G} as shown by the circle labeled 1 in Fig. 5.7. Thus, {F, G} is
now one entity. The algorithm then needs a measure of the distance between the
group {F, G} and the other remaining 5 objects. In most marketing and behav-
ioral science applications we choose the maximum distance (called complete
linkage) or the average distance (called average linkage). The complete linkage
distance between an object outside the group, say B, and the group {F, G} would
be max[distance(BF); distance(BG)]. The average linkage distance is similarly
defined using the average instead of the maximum. Using the complete linkage or
average linkage distance, and the usual distance for pairs of objects that are not in
the group {F, G}, we can form another matrix of pair-wise distances and then
select the smallest distance. In Fig. 5.7 above, suppose it is the distance between
objects A and D. So, at the second step we form the second group {A, D}. See
circle labeled 2 in Fig. 5.7. We now need to again calculate the pair-wise distances
treating {F, G} and {A, D} as two distinct entities. To calculate pair-wise
Fig. 5.7. Hierarchical Clustering of Seven Objects: A Through G.

distances at this step, we need the distance between two groups (entities). Both the
complete linkage and average linkage criteria are well-suited for this. At the third
step, object B is added to the group {F, G}, so the third group is {B, F, G}, as
seen in the circle labeled 3. The clustering process proceeds in a sequential manner
as described until we have the fourth group as {A, B, D, F, G}, the fifth as {C, E}
and finally in step six we have all {A, B, C, D, E, F, G}.
Our interest is in representing this clustering process as a tree-like structure. Such
a hierarchical tree structure in the context of clustering is called a dendrogram. The
y-axis shows the different distances at which different objects and groups have been
joined together (Fig. 5.8).
How do we identify segments from the tree above? When we clip the
dendrogram at different distances we arrive at a different number of clusters along
with the objects that lie in the clusters. This can be visually depicted as in Fig. 5.9.
When we clip the dendrogram at a certain level, we ignore the part above that
level and focus only on the part below it. Consider the first clip from the top in
Fig. 5.9. We can detect two distinct segments by focusing on the dendrogram
below that level. The segments are {A, B, D, F, G} and {C, E}. Similarly, the
second clip from the top of the figure gives us three segments. Having discussed
decision trees, we now turn to the random forest method which, as we noted, is
built on ideas of decision trees. To understand random forests, we first have to
familiarize ourselves with the understand the concepts of bootstrapping, bagging,
and boosting.
4. Bootstrapping, Bagging, and Boosting

Before proving the details of bootstrapping, bagging and boosting, we will provide
an overall picture of how they all fit together to accomplish the main goal of
Fig. 5.8. A Dendrogram Is a Tree-Like Representation

of Hierarchical Clustering.
Fig. 5.9. Clipping the Dendrogram at Different Levels Gives

Different Numbers of Clusters.
improving a machine learning model. Random forests are based on the concept of
bagging, in the sense that a random forest is a specific type of improvement over
bagged trees. At an intuitive level, bagging of any machine learning model, including
a decision tree, is the aggregation over different variants of the same model, where
model aggregation helps improve the predictions of the learning model. The concept
of bagging, in turn, requires us to understand the concept of bootstrapping, since the
latter generates the model variants that are aggregated. Boosting is another tech-
nique to improve the prediction of any machine learning model, including decision
trees, and is also based on the ideas of fitting multiple models. Unlike bagging, the
models are fit in a sequential manner – any given model is fit based on information
from previous models.
4.1 Bootstrapping
As mentioned above, bagging is the aggregation over different variants of a model.
The model variants are obtained when the same model is estimated on different
samples from the same training data set. The process of sampling repeatedly from
the training data is conceptually different from the usual idea in statistics of sam-
pling from the population. While the latter may be ideal, it is often not practical to
get multiple samples from the population. This technique of sampling from the
training data forms the basis of bootstrapping. It is for this reason that bagging
is referred to as bootstrap aggregation. We will expand on these concepts in the
following paragraphs.
Bootstrapping was originally designed to assess the uncertainty associated with a
statistical learning model or any statistical parameter estimate. The most familiar
example of a parameter is the coefficient of a simple linear regression with just one
predictor, and bootstrapping can be used to get a measure of the uncertainty of the
estimate of the coefficient. Of course, the main strength of bootstrapping lies in its
ability to help us assess the uncertainty of complex parameter estimates whose
uncertainty is difficult to estimate otherwise, unlike that of regression coefficients

where standard methods exist to assess their uncertainty. Similarly, bootstrapping
can be used to assess the uncertainty associated with very complex statistical
learning models.
Bootstrapping involves obtaining multiple random samples from the same
training data where the samples are drawn with replacement. Since the readily
available training data set is used, it is convenient to use it rather than getting
additional independent samples from the population. Colloquially, we often speak
of “pulling oneself up by one’s bootstrap” – meaning that one hasn’t obtained any
external help but has used one’s own resources to better oneself. We will use the
familiar example of a simple linear regression to illustrate how bootstrapping
works. Suppose, we want to fit the regression Y 5 b0 1 b1X to the training data set
V which has 4 points V 5 {v1, v2, v3, v4} where vi 5 (xi, yi). Specifically, let the
training data set be (Table 5.2).
Table 5.2. Training Data Set for Bootstrapping.

Observation X Y
V1 X1 Y1
V2 X2 Y2
V3 X3 Y3
V4 X4 Y4
Suppose we want to use bootstrapping to assess the uncertainty associated with

the estimate of the slope b1. The first order of business is to resample, say D times,
with replacement from the training data V. Because the samples are drawn with
replacement, there is the chance that an observation is repeated in a sample.
Importantly, each of our samples is of the same size as the training data, i.e., of
size 4. Suppose the bootstrap samples are as follows.
Sample 1: V(1) 5
Sample 2: V(2) 5
Continuing in this manner, we obtain the Dth sample.

Sample D: V(D) 5
Note that in sample 1 the observation v3 has been repeated and v1 is not
included in it. We fit the model, Y 5 b0 1 b1X, D times to each of the D
bootstrap samples. Corresponding to sample V(d) we obtain the parameter esti-
mates (b0(d), b1(d)), d 5 1,…, D. An obvious way to assess uncertainty associated
with the estimate of b1 is to obtain its standard error. The standard error of the
estimate of b1 can be obtained by using the variation in the D estimates: b1(1),
b1(2), …, b1(D) . It is now a simple matter to use standard measures of variation to
calculate the standard error of b. The same procedure is employed to assess
uncertainty for more complex statistical learning models.
4.2 Bagging
Bagging draws on the bootstrapping idea of drawing multiple samples with
replacement from the training data, but does so in order to improve the estimate
of the parameters or the prediction from a machine learning model. One sense in
which parameter estimates or prediction can be improved is to reduce their
variance. Hence the role of bootstrapping in bagging, which is variance reduction,

is different than its original purpose of assessing uncertainty that we have dis-
cussed in the previous paragraphs. Since bootstrapping is based on the idea of
aggregation, it is appropriate to inquire how aggregation may also be used for
variance reduction. To get a simple intuition, for this, consider N independent
observations, x1,…, xN, from a distribution with variance v. Then each of the
observations has a variance of v. However, the average of these observations,
which is essentially an aggregate, has variance of v/N. Once we have understood
how multiple bootstrap samples are drawn from training data (as shown in
Section 4.1), the technique of bagging becomes straightforward.
While bagging can be used for any statistical learning model, it is especially
useful for a decision tree which is the topic of the current chapter. In the termi-
nology of Chapter 3, a major downside of decision trees is that they have high
variance, even though they may have low bias if grown deep. Therefore, any
technique that can reduce the variance of trees can benefit from the easy inter-
pretability and explainability of trees and yet avoid their major drawback. Hence,
we will illustrate the bagging process in the case of a regression tree.
(1) Draw D bootstrap samples from the training data – the samples are of the
same size as the original training data and are drawn with replacement.
(2) Grow a regression tree on each of the D bootstrap samples. Usually the trees
are not pruned.
(3) For a given observation x, make the D predictions, one for each tree. Denote
the D predictions by yð1Þ ,…,
yðDÞ . We give some details of this step. Consider
the tree grown on the bootstrap sample V(d), and consider the prediction made
by this tree for a given observation x. Since a tree is essentially a partition of
the feature space into distinct regions, the observation x will lie in some region
of the tree. As we saw in Section 2.1, the prediction corresponding to x is just
the average of the responses for that region. Let us denote this by yðdÞ . This
process is repeated for the D bootstrap samples V , d 5 1,…, D, and D trees
(d)
are grown. Importantly, each of these trees may have different partitions of the
feature space and also have different sizes. Thus, the regions in which the focal
observation x lies in the different trees may be defined by different combina-
tions of predictor variables. Each of these trees yields a prediction for the
observation x, and this gives us D predictions yð1Þ ,…,
yðDÞ .
(4) The bagging estimate of the prediction for observation x is the average of the
D predictions yð1Þ ,…,
yðDÞ . This aggregate prediction has a lower variance
and hence bagging improves the learning model.
A similar logic and bagging process holds also for bagging classification trees.
The two differences pertain to steps 3 and 4 above. In step 3 we need to make a
prediction from each of the D trees for a given observation x. Recall from Section
2.2, that for each classification tree the prediction in a region is the mode of the
responses corresponding to all observations in that region. That is, we take the most
frequently occurring class in a given region as the prediction for all observations in
that region. Let us denote the prediction from the dth tree as b yðdÞ . For the
observation x, the prediction b yðdÞ from the tree grown on the bootstrap sample V(d)
is a vector with a 1 for the most commonly occurring class in a region and 0s
elsewhere. Suppose we have a binary classification case with responses: “buy” or
“not buy.” Suppose “buy” is coded as 1 and “not buy” as 0. Consider the tree
grown on the bootstrap sample V(d). If the most commonly occurring class in a
region is “buy” then the prediction corresponding to an observation x in that region
would be written as yðdÞ 5 (1, 0). In step 4 we need a single bagging estimate from
the D predictions b yð1Þ ,…, b
yðDÞ . In the classification context, the bagging estimate is
just that class which is the most commonly occurring among the D predictions.
It may be useful to have some intuition for why decision trees particularly
benefit from bagging. Decision trees often lack robustness and would benefit the
most from the aggregation that bagging provides. When a decision tree model is
fit multiple times to different training data sets, the different tree variants are
likely to involve different predictors (features) and may even have different sizes
(number of terminal nodes) for different data sets. The average of these trees is
less likely to omit an important predictor and is also less likely to emphasize an
unimportant one – both of these types of mis-specifications could be artifacts of
over-reliance on just one training data set, but are unlikely to be repeated when
one uses many training data sets.
A major advantage of bagging is that it automatically gives us a way to gauge
prediction accuracy on test data without doing explicit cross-validation (see
Chapter 3). In bagging, a given tree is grown on a bootstrap sample. Since the
bootstrap samples are drawn from the original training data without replacement,
it is possible for a given training observation to not be included in a given
bootstrap sample. As an illustration, consider the first bootstrap sample V(1) in
Section 4.1 where the training observation v1 is not included. Because observation
v1 has not been used in the tree grown on bootstrap sample V(1), it can be used as a
test data point for this tree. For a given tree, observations which have not been
used to grow the tree are called out-of-bag observations. Let us now look at this
from the point of view of individual observations in the original training data. For
a given training observation, we consider the set of trees for which this particular
observation is an out-of-bag observation. This observation can play the role of
test data for all of these trees. We can consider the set of predictions that these
trees (for which this particular observation was out-of-bag) make for this
particular data point, and take the average of these predictions. This average is
the out-of-bag prediction for this observation. We can now compare the out-of-
bag prediction for this observation with its actual response. This is a measure of
test error for this particular observation. Doing this for all points in the original
training data set, we can obtain an aggregate measure like the out-of-bag MSE
(mean square error). In this way we have obtained a useful measure of prediction
accuracy, the test error, without doing explicit cross-validation. While the out-of-
bag MSE is appropriate for a regression tree, we can use the similar ideas in this
paragraph to compute the out-of-bag prediction for a particular observation for a
classification problem. This can be compared to the actual response for that
observation. Doing this for all training observations, we can obtain the overall
out-of-bag classification error. While we have described bagging for decision
trees, this method works in a similar manner for more complex machine learning
models.
At this juncture it may be useful to discuss how one may determine the relative
importance of the predictors in tree aggregates. As mentioned earlier, one of the
main attractions of trees is their ease of interpretation and visualization. Unfor-
tunately, while tree aggregation methods like bagging may improve predictive
accuracy, they tend to lose the simple interpretation of trees. Before explaining
predictor, or feature, importance in tree aggregates we will first describe predictor
importance in single decision trees. There are several measures of importance of
predictor variables, and every software package can output at least one variable
importance measure. To give a flavor of how importance measures work, we will
restrict ourselves to the two most common methods and not try to be exhaustive.
The first method is essentially based on the idea that is used to grow the tree in the
first place. To illustrate the idea, let us first consider classification trees. As shown
in Section 2.2, at any node, the decision of which predictor variable to do the next
split on is based on the reduction in some node impurity measure (Gini or cross-
entropy). Recall that the predictor variable that resulted in the maximum
reduction in node impurity was the one chosen as the next splitting variable. One
common, perhaps the most common, importance measure is based on this logic.
For each predictor variable, we record the total impurity reduction due to all
splits (across all nodes in the tree) over this predictor. This gives the variable
importance for the single classification tree. This can be easily extended to an
aggregation of trees, for instance, a random forest in Section 5 which is based on
an aggregation of bagged trees. We simply average the total impurity reduction
due to this predictor over all trees in the aggregation. A somewhat more
sophisticated measure of variable importance is the permutation importance
measure due to Breiman (2001). First consider the importance of a given predictor
for a single tree. The steps followed by the permutation importance measure are:
• We start with the tree grown using the training data set. We record its predictive
performance on test data. For this we can use any standard acceptable measure
for model performance – e.g., number of observations correctly classified.
• Suppose we have p predictor variables, X1,…, Xp, and we are measuring the
importance of variable Xk. We randomly permutate the focal predictor vari-
able Xk. This permutation changes the data set in a particular way. The ith
observation is xi 5 (xi1,…, xip), i 5 1,…, N. Its kth component is xik. A per-
mutation of the predictor Xk will replace the kth component of observation i,
that is, xik, with the kth component of observation j, that is, xjk. This will be
done for all the observations i 5 1,…, N.
• Grow a tree with the permutated training data. The permutated training data
uses the permutated variable Xk and all the other non-permutated variables.
We record the predictive performance of this tree on test data.
• The difference between the performance of the trees grown with the original
training data and the permutated training data is a measure of importance of
the predictor Xk in the given tree.
To build some intuition for the permutation importance method, note that if
the predictor Xk had a strong relationship with the response variable in the
original training data (before permutation) then this relationship would be broken
by the permutation. The stronger the previous relationship the more we are likely
to see a decrease in relationship strength due to the permutation. Hence the dif-
ference between the predictive performance of trees grown without and with the
permutation of a predictor variable captures the importance of that variable.
Now, we can extend this idea to an aggregation of trees quite naturally. We can
compute the difference in prediction accuracy before and after permutation for
each tree and then average this across all the trees in the aggregation. We can
determine the variable importance for regression trees using exactly the same
procedures as we have described for classification, except that we use the sum of
squares as the criterion instead of node purity measures like Gini or cross-entropy.
Before leaving the topic of bagging it is important to point out one effective way
in which bagged trees can be further improved. This will also set us up for random
forests which are designed to precisely realize this improvement. The intuition in
the opening paragraph of Section 4.2 gives just a rough-and-ready idea for why
aggregation may work in general. The reader may have noticed that the intuition
depends on the observations being independent. It is important to realize that, the
multiple trees grown in bagging may not be independent. This is because if there are
strong relationships between certain predictors and the response then all (or almost
all) trees are likely to capture them in a similar manner. Now, since the different
bootstrap samples used in bagging are drawn from the same training data, the trees
grown on these bootstrap samples can be considered to be identically distributed.
Hence, from what we just discussed the trees grown during bagging may be
identically distributed but not necessarily independent.
4.3 Boosting
Boosting, like bagging, is also a “committee-based” learning model, in that, these
methods leverage the advantages of aggregating many variants of some basic
learning model. However, boosting does not use bootstrap samples. The major
distinction from bagging is that the different variants of the basic model in
boosting are not independent. Recall that bagging involved fitting the model
multiple times to independent bootstrap samples from the training data. In
boosting, on the other hand, the model variants are fit sequentially, and at each
stage, the model builds on the model variant at the previous stage – that is, at each
stage the boosting process leverages information from the previous stage.
Boosting can be used for many types of machine learning models, but they have
been found to provide the biggest improvements for decision trees. Moreover,
boosting has been found to be especially useful for classification trees.
We can consider the boosted tree T as the additive combination of a sequence of
D stages of model fitting. At a given stage d (d 5 1,…, D) the current model is the
additive combination of trees till stage d-1. At the stage d, the component tree Td is
added to the current model until the process goes through all D stages. Importantly,
at stage d the boosting algorithm focuses only on the current “best” tree Td without
adjusting the trees that have already been built at earlier stages. This process is
sometimes referred to as the forward stagewise additive modeling process.
To get an intuition for how boosting may improve the performance for a
decision tree, one must first note that each of the individual trees in the sequence
can be shallow trees. Reducing overfitting is the hallmark of a good learning model.
By definition, shallow trees are “weak learners” and are less likely to overfit the
training data. Despite the shallowness of the component trees in the aggregation of
trees in boosting, the predictive ability of the model is not harmed because of the
careful weighting scheme of the training data – in growing the tree at any stage,
observations that were misclassified by the tree in the previous stage are given a
higher weightage and the current tree thus concentrates on them. In this way,
boosting combines the twin benefits of less overfitting and better predictions. As in
the case of bagging, boosting especially helps decision trees because single trees are
often quite poor in their predictive ability. Thus, the improvement due to aggre-
gation schemes that enhance the predictive ability of trees can be quite dramatic.
We will first discuss boosting for a binary classification tree. Suppose we have
N training observations (xi, yi), i 5 1,…, N. Boosting involves growing D trees Td
sequentially in stages, where the stages (trees) are indexed by d 5 1,…, D. The
boosting process involves assigning weights wi, i 5 1,…, N, to the individual
observations such that, at each step, the observations that were misclassified by
the tree in the previous step are assigned higher weights. In this way the boosting
algorithm ensures that the decision tree at a given step pays more attention to
observations that were misclassified at the earlier step. Once the D trees are grown
sequentially, the final prediction is based on the weighted combination of trees,
D
+ ld T d
d ¼1
We will sketch a common boosting algorithm for a binary classification tree – the
Ada.Boost.M1 algorithm. Instead of the formal technical statements, we will sketch
the intuitions behind each step of AdaBoost.M1 for a two-class classification tree.
Step 1: Initialize the beginning weights to wi 5 1/N for all i 5 1,…, N. So, we start
in stage 1 by assigning equal weights to all training observations.
Step 2: For stages d 5 1,…, D, perform the following steps
(a) Fit classification tree Td to training data weighted with weights wid, i 5 1,…,
N. The trees are fit by minimizing a special loss function (based on expo-
nential loss).
(b) Compute a measure of misclassification error rate over the training data. This
quantifies the total amount of misclassification over the training data,
weighted by the weights wid.
(c) Compute “tree weight” ld that will be used as weight on tree Td in the weighted
combination of trees. This weight is calculated based on the misclassification
error rate computed in step 2b. The tree weight is small if the misclassification
error rate is large and vice versa.
(d) Update the weights wid (on the training observations) to obtain new weights
wid11. The weights are updated such that the new weights wid11 are larger than
the current weights wid for observations i that are misclassified in tree Td. The
updated weights wid11 are used to weight the training observations in the next
stage, and the next tree, Td11, is grown. This process continues until all D trees
are built. D
Step 3: Output aggregate tree: + ld T d . The final prediction is based on this.
d 51
Having described the boosting of classification trees, we now turn to the boosting
process for regression trees. The underlying logic is similar to boosting classification
trees, even though specific details vary. Here too multiple regression trees are built
sequentially, where each regression tree is small in size. This prevents overfitting.
Moreover, here too the boosting process induces the tree at a given stage to pay
more attention to errors made in the previous stages. In the case of classification
trees we tweaked the training data at a stage to put more weight on observations that
were misclassified by the tree at the earlier stage. In the case of regression trees, the
tree at a given stage it fit to the “residuals” from the current model rather than to the
actual response Y. Recall that in a simple OLS regression Y 5 f(X) 1 e, the residual
corresponding to any X 5 x is the “error”: y2f(x). Thus, fitting the tree to the
residuals is tantamount to focusing on errors from the previous trees.
There are two hyperparameters that control overfitting. One hyperparameter
controls the number of splits, say s, in the recursive partitioning process for
growing the trees. The number of splits controls the tree size, and smaller trees are
less likely to overfit. The number of splits can be rather small, even just 1, so that
we only grow very shallow trees. The other hyperparameter that controls over-
fitting is the weight on the weighted combination of the D regression trees
Td(x), d 5 1,…, D, below
D
TðxÞ ¼ + lT d ðxÞ
d ¼1
Unlike the case of the AdaBoost.M1 algorithm for classification trees, a very
commonly used algorithm for regression trees treats the weight l as a hyper-
parameter – so it is chosen by the analyst using cross-validation or other means.
Since the boosting algorithm learns over many stages, by adding a tree at each
stage, the parameter l controls the rate at which the boosting algorithm learns.
When learning is slow, there is also less chance of overfitting.
To set the stage for the algorithm for boosting regression trees it may be
helpful to recall the stage-wise nature of the boosting process. The boosted tree
T(x) is the outcome of a sequence of D stages of model fitting. At each stage d, the
component added to the current model could be the weighted tree lTd(x). Of
course, the current model is the weighted combination of trees till stage d-1. We
now sketch the algorithm for boosting regression trees.
Step 1: Initialize the beginning residuals to be the actual response: ri 5 yi for all
i 5 1,…, N. Initialize T(x) 5 0.

(a) Fit regression tree Td(x) to data (xi, rid), i 5 1,…, N. The tree can have s splits.
The predictor variables are xi and the responses at stage d are the residuals rid
from the current model. The current model is the weighted combination of trees
till stage d-1. In fitting tree Td(x) the s splits will partition the feature space into
distinct regions. As we discussed Section 2.1 on regression trees, for the squared
error loss function, the prediction in each region will be the average of the
residuals corresponding to the xi in that region.
(b) Update the current model by adding lTd(x)
(c) Update the residuals. The residuals now become rid11, i 5 1,…, N. The tree in
the next stage, Td11(x), is fit to data (xi, rid11), i 5 1,…, N. The predictor variables
are xi and the responses at stage d11 are the residuals rid11 from the updated
model in step (b) above. D
Step 3: Output aggregate tree: T(x) 5 + lT d ðxÞ. The final prediction is based on
this. d 51
More details of boosting regression trees are given in the “Technical detour 3.”
Technical detour 3:
We have discussed boosting of classification trees using the exponential loss

function and boosting of regression trees using squared error loss. Gradient boosting
is a boosting technique that can handle any loss function for boosting trees, and
therefore, greatly expands the applicability of boosting. As the name suggests,
gradient boosting utilizes the gradient which is a critical component in a common
numerical approach to minimizing loss functions – the gradient descent approach
(see Chapter 1). Essentially, gradient descent finds the parameter that minimizes the
loss function in a sequential manner. At each stage of the process the optimization
algorithm updates the parameter by adding a term which depends on the gradient
of the loss function. The main idea in gradient boosting is to draw a parallel
between boosting and numerical optimization using gradients by exploiting the fact
that boosting too is a sequential process. From our discussions of boosting it is clear
that boosting can be seen as a model that is built stage-wise by an aggregation of
trees. At any given stage, the optimal tree is the one that results in the maximal
reduction of the loss function. Now, finding the optimal tree that results in the
maximal reduction for a general loss function is difficult. However, computing the
gradient of any differentiable loss function is straightforward. Moreover, the
negative of the gradient determines the direction of the maximal reduction for any
loss function. Thus, it is reasonable that, at a given stage of the boosting process,
the tree predictions would correspond to the negative of the gradient – this is the
key insight in gradient boosting! In each stage of gradient boosting the algorithm
fits the tree to the negative gradient values.
For the sake of completeness, we will briefly mention the stochastic gradient
boosting algorithm (Friedman, 2002). Here, at each stage in the boosting process,
we sample a fraction of the training observations without replacement. The tree is
grown with this small sample, which can be much smaller than the training data
set size when the latter is large. Even though the parallel is not exact, this idea
bears some resemblance to the idea of mini-batch gradient descent that we have
discussed in Chapter 2. In the interests of speeding up computations, especially
when the training data set is large, smaller subsets of the entire training data set
are actually used for training. Just as the batch size in mini-batch gradient descent
was an additional hyper-parameter that needed tuning, in stochastic gradient
boosting too the fraction of the training data used becomes a hyperparameter.
Having discussed some prominent aggregation methods like bagging and
boosting, we now turn to random forests.
Executive Summary
While decision trees are easy to interpret and visualize, they sometimes lack
the predictive ability of other machine learning methods. Aggregation over
multiple trees is one way in which the predictive accuracy of decision trees
can be improved. Bagging and boosting of decision trees are two popular
tree aggregation methods. While bagging and boosting can be used to
aggregate other non-tree methods, they often perform best with decision
trees.
Bagging is the process of averaging over multiple trees where the trees are
grown using independent bootstrap samples from the training data set. A
bootstrap sample is a random sample that is drawn from the training data
with replacement and which is of the same size as the training data set.
Bootstrapping is done when it is not possible to get enough different samples
from the population itself. The average of the prediction over multiple tree
has a lower variance than the prediction from an individual tree. Lower
variance is desirable, and in this way, averaging improves decision trees.
Boosting also involves averaging over multiple trees, but unlike bagging,
the trees are not independent. The trees are grown sequentially, and the tree at
a given stage leverages information from the previous stage. Specifically, the
trees are grown using training data that has been weighted based on the earlier
tree. In a classification tree, for instance, the boosting process puts more
weight on observations that were misclassified by the tree at the earlier stage.
Thus, each tree improves the prediction accuracy by concentrating on pre-
viously misclassified observations.
5. Random Forest
As the name suggests, a random forest is an aggregation of decision trees. A
random forest makes an improvement over bagging by a simple, yet very effective,
tweak. Random forests have shown consistent good performance, and there is no
need to tune many parameters except for the number of trees and the number of
predictors that are randomly chosen for splitting the trees. Moreover, since they use
bootstrap samples and bagging they can leverage a major advantage of bagging -
that is, we can compute the out-of-bag error to measure prediction accuracy on test
data without having to divide the data into training and test sets (see Section 4.2).
This is certainly a very useful aspect of random forests when it is not so easy, or it is
expensive, to collect data. The computation of out-of-bag prediction with random
forests is similar to bagging. For a given training observation we can compute the
random forest predictor by averaging over only those trees for which this particular
observation is an out-of-bag observation – that is, averaging over trees that were
grown without this observation being in the bootstrap sample used for tree
growing.
Recall that from the final paragraph of Section 4.2 on bagging it was sug-
gested that the different trees grown in bagging may not be independent, even
though they are identically distributed. This is because all the trees in bagging
are likely to include particularly strong relationships between the predictors and
the response by using these predictors as splitting variables at the top of all the
trees. Recall that if x1,…, xN are N independent random variables (observa-
tions), each with a variance of v, then the variance of their average is v/N. Thus,
the variance can be made arbitrarily small by increasing N. However, consider
N identically distributed (but not independent) random variables (observations)
x1,…, xN, each with a variance of v. As expected, the variance of the average of
these observations depends on the correlation between pairs of variables. It can
be shown that the variance of the average cannot be made arbitrarily small, even
with very large N, if the correlation is not small. Thus, significant variance
reduction to improve the model can be achieved by pursuing the twin objectives
of aggregation (that is, by averaging) and also reducing the pair-wise
correlations.
Random forest is a technique to grow uncorrelated trees so that one is able to
achieve significant variance reduction by aggregation, beyond what simple
bagging can deliver. As in bagging, multiple trees are grown using bootstrapped
samples. For a given tree, each split considers only a randomly selected subset of
the p predictor variables. Note that in the tree growing algorithms discussed so
far, each binary partition (split) considered all the predictor variables. By using
a different randomly selected subset each split would therefore be based on a
different set of predictor variables. When each tree in the random forest is
grown in this manner, they will become uncorrelated. Suppose the size of the
subset is k , p. When the subset size k is smaller, the trees will be more
uncorrelated. Hence, when the analyst suspects that the data has a large number
of correlated predictors variables, then fitting a random forest with a small k
would be helpful. Of course, when k 5 p in a random forest we are just doing
bagging. Similar to bagging, in a regression tree context, once the D trees Td(x)
(d 5 1,…, D) are grown the final prediction for a random forest is based on the
average.
1 D d
TðxÞ ¼ + T ðxÞ
D d ¼1
In the classification context, the D classification trees will give D predictions

for any observation x. Again, as in bagging, the random forest estimate is just that
class which is the most commonly occurring class among the D predictions.
We end the topic of random forests by providing an intuition for how the
process of selecting only a subset of predictors for splitting trees can reduce
correlation among the trees. A major reason for trees to be similar, i.e., correlated,
is that all trees are likely to select a predictor from the top two or three most
important predictors as the first splitting variable. Once the first splitting variable
has been identified, the second splitting variable in all trees is also likely to be from
the set of the top two or three most important variables. Thus, the upper branches
of all the trees is likely to be very similar. This will be true as long as all the
predictor variables are considered for the splits. However, if the splitting variables
can come from only a subset of the predictors, then there is a chance that the most
powerful predictor (or one of the top two or three most powerful predictors) is
absent from this subset. Therefore, it is less likely for the trees to be correlated.
6. Applications of Random Forests and Decision Trees in

Marketing and Sales
The applications of decision tree in marketing are not as widespread as one might
have expected owing to their ease of interpretation and their ability to uncover
patterns of interactions between predictors. An early decision tree model is due to
Currim, Meyer, and Le (1988), in which the authors use these trees to model
consumer choice data. The authors point out the advantages of decision trees for
modeling consumer choice processes because such choice processes are of a
“contingent,” or hierarchical, nature. The authors give the example of a consumer
purchasing soft drinks who may consider price of a particular brand only after she
has found the product acceptable on dimensions such as taste and sugar content
etc. A clear hierarchical nature of choice processes makes tree-based models ideal
for them. The authors argue that hierarchical models of decision making often
used for the study of market structure may benefit from tree models (by Grover &
Dillon, 1985; Urban, Johnson, & Hauser, 1984). Despite the fact that, strictly
speaking, they are not choice models they have a hierarchical structure reflecting
the choice processes of consumers (Urban et al., 1984). Currim et al. (1988) apply
the decision tree to a consumer panel data and compare it with disaggregate logit
analysis. The data provided by IRI was for coffee purchases by 2000 households.
The authors found that decision trees faithfully recovered actual brand choices on
both training and test samples. Moreover, it had predictive performance similar to
a disaggregate logit model.
Researchers in direct marketing were perhaps the first people in marketing to
adopt decision tree approaches to data analysis. This was due to the importance
to direct marketers of having simple rules regarding who to send mailings out to.
This required statistical procedures that are intuitive, easily visualized and easily
communicated to decision makers in the organization. Decision trees fit these

requirements. An early paper that advocated for using decision trees in direct
marketing was Magidson (1988). Magidson (1989) compared CHAID to logit
models in the context of direct marketing. Haughton and Oulabi (1997) investi-
gated a direct marketing context where the goal was to classify prospects as
“responders” or “non responders” to a direct mailing program. Another goal was
to compare the performance of CART and CHAID models for this classification
task. The direct marketing dataset the authors used had 10,000 customers who
were mailed the direct marketing material. They used 30 predictor variables which
contained both continuous and categorical variables. Overall, the authors found
that these tree models have a similar performance as far a “lift” is concerned, at
least for this data set (see Chapter 2 for a description of “lift”). An important
observation that the authors make is that when there are many continuous pre-
dictor variables then CART would be a better choice, whereas when there are
many categorical predictor variables, and the analyst has some prior knowledge
of particularly useful predictors, then CHAID would be a better choice. Baines,
Worcester, David, and Mortimore (2003) have used classification trees to perform
market segmentation in the context of political campaigns. Kim, Lee, Shaw,
Chang, and Nelson (2001) have investigated the provision of personalized rec-
ommendations by internet stores and web service providers. These personalized
recommendations are suggestions about products and services based on the
demographics and past purchase behaviors of customers who visit internet
storefronts. The authors use tree induction to match customer profiles, based on
demographics and past purchase behaviors, to product and service categories.
This paper demonstrates the advantages of decision trees to capture a rules-based
approach to personalized recommendations. The authors performed an experi-
ment to test their proposed decision tree approach using survey data from 330
respondents gathered over the internet. The effectiveness of their decision tree
methodology for rules-based recommendation was superior to a benchmark
preference-scoring approach. Cho and Ngai (2003) demonstrate the use of deci-
sion trees in a novel sales context. They use decision trees, among other machine
learning methods like neural networks, to predict the length of service of insur-
ance sales agents – an understanding of this will help the selection decisions for
insurance sales agents. Data from the Hong Kong branch of an international
insurance company was gathered from 1998 to 2000. The records of 3,053 agents
were used as training data set while the records for 500 others was used as the test
data set. While the neural network had the best predictive performance, the
decision tree was useful for identifying important predictors of length of service.
Bensic, Sarlija, and Zekic-Susac (2005) have used decision trees for addressing an
important problem of much interest to marketers as well as finance and
accounting practitioners and researchers – credit scoring for individuals and
businesses. Using data set from a Croatian savings and loan association which
extends loan to small businesses and medium-sized enterprises, they found that
the CART model performed worse than a neural network. Galindo and Tamayo
(2000) studied default rates on mortgage loans using CART decision-tree models,
NNs, the k-nearest neighbor and the probit model. These authors found that the
CART decision-tree models provide the best estimation for mortgage loan
default.
Thrasher (1991) has used decision trees for segmentation, a fundamental
marketing task. Tirenni, Kaiser, and Herrmann (2007) have used decision trees
to formulate a segmentation methodology for customers using their lifetime
values. In this approach different segments have different lifetime values. Using
data from a major European airline the authors have used their methodology to
predict future segments of customers according to their demographic and
behavioral characteristics. Thomassey and Fiordaliso (2006) have used decision
trees for sales forecasting of products. The authors make the case that decision
trees are well suited to uncover simple rules for forecasting in categories like
textiles, which have numerous new items with short life times. While many other
methods like regression, Box and Jenkins, neural networks or fuzzy systems
have been used, quite successfully, in other contexts, the authors suggest that
these methods are inappropriate for categories like textiles where replacement of
items at the end of each season makes past data unavailable. While their data
comes from the apparel industry, the methodology is applicable to situations
like new products or new customers where past data is not readily available. In
addition, many of the other methods are not suitable for uncovering under-
standable relationships in the data. Many simpler, often parametric, models
may be useful to understand relationships between predictors and the response
variable they lack the predictive ability of machine learning models when
dealing with complex, nonlinear data. Decision trees are a non-parametric
method that can handle complex data and yet are easily interpretable and
explainable. The authors propose a method where products are grouped into
clusters. The clusters, called “prototypes,” are formed based on products which
have similar historical sales profiles along with descriptive criteria. The authors
use the popular k-means clustering algorithm. Then new products, or products
which for any reason do not have historical sales profiles, are assigned to these
clusters based on their descriptive criteria. This step is accomplished using
decision trees. Essentially, they are clustered with similar products based on
their descriptive criteria. Then the sales profiles of the cluster are used as a proxy
for the sales forecast for the products which do not have historical data. Using
some standard rule-based forecasting methods as benchmarks, the authors find
that with a real data set from a French textile distributor their method (clus-
tering followed by decision trees) performs the best. Sheu, Su, and Chu (2009)
used decision trees to segment online gaming customers. Specifically, they
investigated the relationship between influential predictors and customer loy-
alty. The predictors of loyalty that the authors investigated fall under the rubric
of “experiential marketing.” These factors go beyond the functional “features
and benefits” view of marketing to the experiences consumer have when they
interact with the brand. In terms of the focal response variable, customer loy-
alty, the authors measure dimensions such as “repurchase desire,” “public praise
and recommendation desire,” and “cross-purchase desire.” Abrahams et al.
(2009) employ a novel variant on decision trees which incorporates a profit-
optimizing algorithm. Using their new method they provide actionable
recommendations on the profitable demographics that the managers of PetPlan,

a pet insurance company, should market to. They demonstrate that their profit-
maximizing approach to decision tree building, called “Sequential Binary Pro-
gramming (SBP)” dominates the standard decision tree approach that focuses
on node purity as the tree growing criterion. Rokach, Naamani, and Shmilovici
(2008) address the issue of which customers the firm should target with a new
product offer in order to maximize profit. They highlight the exploration/
exploitation tradeoff that marketing organizations face. Exploration refers to
the phase where the company interacts with customers to explore their behav-
iors. In the language of machine learning, in this phase the firm is able to label a
set of unlabeled customer records in terms of some response variable of interest.
The firm can then build a model to predict the customers’ response and prof-
itability based on the labeled records. In the exploitation phase the firm applies
the predictive model developed in the earlier stage to classify unlabeled customer
records to determine which customer are the “best” in terms of response and
profitability. In this research the authors propose a novel method called
“Pessimistic Active Learning (PAL)” which improves on the standard “active
learning” approach. Active learning refers to selection of unlabeled records for
labeling – here the focus is on improved exploration. In PAL, the authors
augment the standard approach to active learning by overlaying cost/profit
criterion on the standard criterion of improving exploration. They find that
their proposed methodology performs better than appropriately chosen bench-
marks. Duchessi and Lauria (2013) use decision trees to investigate the online
and mobile technologies and services used by ski resorts. These technologies are
used by the ski resorts for their advertising and promotional outreach to their
two main customer segments – millennials and non-millennials. The main dis-
tinguishing feature of this research is that technologies are profiled rather than
customers as is usually the case in marketing. Delen, Kuzey, and Uyar (2013)
have used decision trees to look at firm performance using financial ratios. The
research is interesting in that the authors have benchmarked several decision tree
algorithms including CHAID and CART. Using a data set from the Istanbul
Stock exchange, they find that “earnings before tax-to-equity ratio” and “net
profit margin” are the two most important financial predictors of firm perfor-
mance. Coussement, Van den Bossche, and De Bock (2014) benchmarked
decision trees with logistic regression and RFM (Recency, Frequency, Monetary
value) analysis to study the relative performance of these methods with regards
to accuracy of data for segmentation. Specifically, they investigate the problems
with data accuracy, an important aspect of data quality, in the context of three
widely-used segmentation techniques in direct marketing. These are decision
trees, RFM analysis and logistic regression. This study used typical direct
marketing data sets provided by the Direct Marketing Educational Foundation
(DMEF). The study’s main findings are that: (1) when data quality is not
compromised, decision trees have the best performance among the three
methods, and (2) even when data quality is compromised, decision trees
continue to remain the best of the three methods.
Apampa (2016) has investigated customer response to marketing efforts of a

Portuguese bank using random forests. Using a data set with 45,211 customer
records and 17 predictors, the author studied the performance of random forests
compared to three other classification techniques – logistic regression, CART
decision trees, naı̈ve Bayes. Because the raw data set was very imbalanced, with
many more customers in one class, the author balanced the data set. A major
takeaway from this study is the importance of balancing the data sets when
analyzing data with a categorical response. With balancing, the performance of
the random forest increased dramatically compared to its performance in the
imbalanced data set. On the balanced data set the performance of random forest
was found to be good, but less than the best classifier which was the CART
decision tree. Bahnsen, Aouada, and Ottersten (2015) propose a novel meth-
odology that augments a decision tree by taking into account the fact that the
costs due to misclassification are “example-dependent.” Example-dependent
costs are situations where the misclassification cost may vary from consumer
to consumer. The standard classification models using decision trees either do
not take misclassification costs into account, or if they do, they incorporate the
costs either before or after training. Moreover, they do not account for the fact
that they are example-dependent. The authors propose a new cost-dependent
impurity measure for decisions trees so that the misclassification costs are
incorporated during training of the model itself. Using three real-world appli-
cations of credit card fraud detection, credit scoring and direct marketing, the
authors demonstrate the superior performance of their proposed method.
Bejju (2016) used decision trees to evaluate and optimize the pricing strategies
of e-commerce websites. Lin, Wu, Lin, Wen, and Lin (2017) propose an
ensemble random forest approach that utilizes a heuristic sampling procedure to
use big data for classification. Their sampling scheme is aimed at addressing the
problems of imbalanced data sets in classification tasks, but specifically for big
data contexts with very large amounts of data. Using a data set of 500,000
customer records and 16 predictors from the China Life Insurance Company the
authors report that the ensemble random forest dominates SVM and logistic
regression for their classification task. De Caigny, Coussement, and De Bock
(2018) propose a new tree-based model which is geared to overcoming some
known difficulties with existing classification models for customer churn. On the
one hand, logistic regression can handle linear relationships well but not
the interactions between predictors, and on the other hand, decision trees can
handle interactions well but have shortcomings in capturing linear relationships.
The authors propose a model called Logit leaf Model (LLM) in which the
decision tree splits the data into homogenous groups. Then a logit model is fit to
the data in each group. The authors use 14 churn data sets and benchmark their
LLM against some well-known churn predictions models that have been used in
the literature. The models used to benchmark LLM are decision trees, random
forests, and logistic model trees. While the LLM procedure has some similarities
with ensemble methods where the outputs from many different classifiers are
aggregated in some fashion, there is one important difference. Ensemble methods

train all the constituent models on the entire training data set before aggregating
their outputs. However, the LLM algorithm first creates subsets from the entire
training data using a decision tree, and then trains each of the constituent clas-
sifiers on a subset. The main finding is that on across the 14 data sets, the LLM
algorithm performs better than its two constituents, namely logistic regression
and decision trees, and has at least as good a performance as a random forest. A
related research is that by Coussement and Van den Poel (2008). In the context of
subscription services, the authors analyze data from a Belgian newspaper pub-
lishing company. The authors compare a support vector machine, which was
trained using two parameter-selection techniques designed for this purpose, with
a random forest. They find that, despite the careful parameter-selection technique
to guide the SVM training, the random forest’s performance dominates the
performance of the support vector machine. Kumar and Ravi (2008) have use an
ensemble of multilayer perceptron (MLP), logistic regression, decision trees,
random forest, radial basis function (RBF) network and support vector machine
(SVM) to attack the churn problem for a bank. They have documented very good
performance of this ensemble. In this work a classification and regression tree
(CART) was used for feature selection. For understanding rules for early warning
of churn they have used decision trees. Lemmens and Croux (2006) have studied
bagging and boosting classification trees fit on churn data. Using data from a US
telecommunications company they find that both bagging and boosting can
significantly improve on the performance of classification trees. In addition, they
show that when using techniques to balance data which is inherently imbalanced,
then the analyst should use some appropriate bias correction. This is because, in
drawing a balanced sample to increase the percentage of churners in the training
data (the raw data set usually has a much greater proportion of non-churners)
one runs the risk of over-estimating the number of churners compared the actual
situation. They find that applying their novel bias-correction method, when
balanced data is used to train bagged and boosted classifiers, can reduce the
classification error rate. Razmochaeva and Klionsky (2019) have used random
forests to study the management of retails sales processes. Specifically, they
use random forest classifiers to achieve data or dimension reduction given the
enormous amounts of data in the context of retail sales which typically have
thousands of products.
7. Case Studies
In this section, we will present a couple of case studies about the application of
random forests in marketing. We describe the data sets and demonstrate the
analyses done on them.
Case Study 1: Caravan Insurance

The data used in this case study comes from a real data set from an insurance
company. This data set was first used in the COIL Challenge 2000 that was
organized by the Computational Intelligence and Learning group which is a
cooperative of four EU funded research groups. This data mining competition ran
from March to May 2000 and attracted entries from 43 participants.
The data is about the ownership of insurance policies and the task is to predict
future buyers of the caravan insurance policy based on various sociodemographic
and product ownership data. There are 43 sociodemographic features and 42
ownership related features for a total of 85 features – 83 numeric and 2 symbolic
features. Because of the large number of features in the data, we will not list them
here. Instead, interested readers can look them up in (van der Putten & van
Someren, 2000).
We will use a logistic regression as a benchmark model to compare with the
Random Forest. The inputs chosen to run a logistic regression are as follows.
Response Column: 1
The inputs chosen for the Random Forest are.
Response Column: 1
Number of Trees: 50, 100, 300, 500
We compute Random Forests with 50, 100, 300, 500 trees and then select the
best fitting model. This step can be accomplished by writing simple code in all
software programs. From the AUC plots we find that the best-fitting model is the
one with 300 trees.
The confusion matrix for the logistic regression is.
Actual + Actual -
Predicted + 2 7
Predicted - 78 1078
The percent correctly classified (PCC) for test data for the logistic regression is
92.7%. The AUC for the logistic regression model is approximately 0.725.
The Confusion Matrix for the best fitting Random Forest as calculated on the
test data is.
Actual + Actual -
Predicted + 1 9
Predicted - 59 1096
The percent correctly classified (PCC) for test data for the Random Forest is
94.2%. The AUC for the best-performing Random forest model is approximately
0.7. With this data set the performance of the Random Forest model is not
significantly better than the benchmark logistic regression.
Case Study 2: Wine Quality

This case study is in the context of wine marketing for Portuguese wines. The data
used in this case study comes from a real data set about the quality of vinho verde,
a well-recognized variety of Portuguese wine. The data set has a bunch of
physicochemical and sensory features that are used to evaluate the quality of
wines. The physicochemical features such as density, alcohol content and pH level
are easily available based on analytical tests run during the wine certification
stage. The sensory features are based on tests by human experts. A description of
the features is available in Cortez, Cerdeira, Almeida, Matos, and Reis (2009).
The target variable is a gradation of the wines on a scale from 0 to 10 with higher
numbers representing better quality. For simplicity, we will consider this a binary
target (response) variable with numbers 5 and below coded as 0 and numbers 6
and above coded as 1. The task is to predict the quality of wine (using the binary
target variable) based on the physicochemical and sensory features.
As a benchmark, we first run a logistic regression on this data set. The inputs
chosen to run a logistic regression are as follows.
Response Column: 12
The inputs chosen to run a Random Forest are as follows.
Response Column: 12
Number of Trees: 50, 100, 300, 500
We compute Random Forests with 50, 100, 300, 500 trees and then select the
best fitting model. The Confusion Matrix for the logistic regression calculated on
the test data is.
Actual + Actual -
Predicted + 127 47
Predicted - 47 99
The percent correctly classified (PCC) for test data for the logistic regression is
70.6%. The AUC for the logistic regression model is approximately 0.834.
The Confusion Matrix for the best fitting Random Forest as calculated on the
test data is.
Actual + Actual -
Predicted + 128 38
Predicted - 33 121
The percent correctly classified (PCC) for test data for the Random Forest is
77.8%. The AUC for the best-performing Random Forest model is 0.875. We see
that the predictive performance of the Random Forest is significantly better than
the benchmark logistic regression model.
TECHNICAL APPENDIX
Technical detour 1:
Our training data consists of N observations (xi, yi), i 5 1,…, N, with obser-
vation xi belonging to a p-dimensional space. Thus, the ith observation xi is a
p-dimensional vector xi 5(xi1,…, xip). Consider a partition of the training data
into M regions Rm, m 5 1,…, M. For a regression tree the usual cost that we
minimize is the sum of squares over the training data which we denoted by SS in
Section 2.1. Mathematically,
N
SS ¼ + ðyi 2 f ðxi ÞÞ2
i¼1
As mentioned in the text, for a regression tree the prediction for all points in a
given region is just a constant. Further, for a square error cost (loss) the optimal
prediction f(xi) for a region happens to be the average over all responses yi cor-
responding to the observations xi in that region.
We will provide a simple illustration for why the predicted constant for a
region is the average response in that region. Since the prediction f(xi) is constant
cm in each region Rm we can write
8
>
> c1 if xi is in region R1
<
c2 if xi is in region R2
f ðxi Þ ¼
>M
>
:
cM if xi is in region RM
Suppose there are 3 points (xi, yi), i 5 1, 2, 3, partitioned into two regions Ri,
i 5 1, 2 such that (x1, y1) e R2, (x2, y2) e R1, (x3, y3) e R2 . Since observation
1 and 3 are in region 2 and observation 2 is in region 1, the sum of squares is
ðy1 2 c2 Þ2 1 ðy2 2 c1 Þ2 1 ðy3 2 c2 Þ2 . To find the optimal c2 (for example) take
First Order Condition w.r.t. c2 and solve. This gives c2 5 y1 12 y3 5
yð2Þ . This simple
illustration should be enough for the reader to form an intuition that the same will
hold with N observations partitioned into M regions. We now provide the details
of the greedy algorithm for recursive binary partitioning. We use the uppercase Xj
to denote the jth predictor variable and the lowercase xij for the jth component of
the ith observation xi 5(xi1,…, xip).
Step 1: For a given predictor variable Xj (j 5 1,…, p) and split point “s”
perform a binary partition of the feature space into two regions.
R1 5 {set of points xi such that xij # s}

R2 5 {set of points xi such that xij . s}
Let yð1Þ be average of yi corresponding to all xi in R1. Similarly,

yð2Þ is the
average of yi corresponding to all xi in R2. Find the predictor variable and split
point which is the solution to the minimization below,
minf + yð1Þ Þ2 1
ðyi 2 + yð2Þ Þ2 g
ðyi 2
j; s xi is in R1 xi is in R2
The first summation inside the curly braces ranges over all observations xi in
region R1. Similarly, the second summation ranges over all observations xi in
region R2. The minimization over both the predictor variable and the split point
gives us the best predictor, say Xk, and the best split point sk*.
While the program above requires us to solve for both the predictor variable
and the split point, in practical implementations, this is usually done in two
stages:
(a) In the first stage, for a fixed Xj, we assume a split point “s”. For that Xj and s
we form the binary partition. Then the following program is solved for the
optimal split point
minf + yð1Þ Þ2 1
ðyi 2 + yð2Þ Þ2 g
ðyi 2
s xi is in R1 xi is in R2
This gives us the optimal split point given the splitting variable Xj (b)
(b) Vary the predictors Xj over all j 5 1,…, p, and repeat stage (a).
Step 2: Form the two regions based on the splitting variable Xk and split point
sk*. These are:
R1 5 {set of points xi such that xik # sk*}

R2 5 {set of points xi such that xik . sk*}
Repeat step 1 for region R1 above. This gives three regions. Split one of these
three regions in similar manner and continue until a stopping rule ends the
process.
This process grows a large tree that should be pruned to control overfitting.
Technical detour 2:
As mentioned in Section 2.1.2, the cost complexity pruning method essentially
puts a penalty on the size of the tree. The minimizing criterion for growing the
tree is not just the sum of squares (SS) but a sum of SS plus the number of ter-
minal nodes. Suppose that using the greedy algorithm we have grown a large tree
T9which we wish to prune. Suppose that T is a sub-tree obtained by pruning the
tree T9. That is, T has been obtained by collapsing a number of the nodes of T9.
Suppose further that the sub-tree T has M terminal nodes indexed by m 5 1,…, M.
As usual, node m defines a region Rm. In the regression tree context, we know that
the prediction corresponding to all observations xi in region m is the same con-
stant yðmÞ (which is just the average of the responses yi corresponding to the xi in
region m). With these predictions, the sum of squares over the training data set
becomes
SS ¼ SS1 1 SS2 1 … 1 SSM
¼ + yð1Þ Þ2
ðyi 2 1 + yð2Þ Þ2 1 ::: 1
ðyi 2 + yðMÞ Þ2
ðyi 2
xi is in R2 xi is in RM
xi is in R1
The cost complexity criterion, which is a function of tree T, is shown below.

Cost complexity (T) 5 SS 1 lM.
The desired optimal sub-tree is the one that minimizes the cost complexity
criterion. Here, a larger l will induce the minimization program to choose a
smaller tree.
Technical detour 3:
We will provide more details of the algorithm for boosting regression trees. Let us
denote the boosted (aggregated) tree and the D component trees as functions of
D
the vector of predictor variables x 5 (x1, …, xp): TðxÞ 5 + lT j ðxÞ. Recall that
d 51
in our notation, the ith training observation is xi 5 (xi1, …, xip), i 5 1,…, N.
Step 1: Initialize the beginning residuals to be the actual response: ri 5 yi for all
i 5 1,…, N. Initialize T(x) 5 0.
(a) Fit regression tree Td(x) to data (xi, rid), i 5 1,…, N. The tree can have s splits.
The predictor variables are xi and the responses at stage d are the residuals rid
d 21 d 21
from the current model: + lT j ðxÞ. Thus, the ith residual is rid 5 yi - + lT j ðxi Þ,
j51 j51
i 5 1,…, N. In fitting tree Td(x) the s splits will partition the feature space into
distinct regions. As we discussed Section 2.1 on regression trees, the prediction in
each region will be the average of the residuals corresponding to the xi in that
region.
(b) Update the weighted combination of trees by adding lTd(x) to the current
d 21 d
model. The model now becomes: + lT j ðxÞ 1 gT d ðxÞ 5 + lT j ðxÞ.
j51 j51
(c) Update the residuals. The residuals now become rid11 5 rid - lTd(xi) i 5 1,…,
N. This is because, after updating the weighted combination of trees as in step 2b
above, the new residual corresponding to the ith observation is: rid11 5 yi 2
d d 21
+ lT j ðxi Þ5 yi 2 + lT j ðxi Þ 2 lT d ðxi Þ5 rid 2 lTd(xi). The tree in the next stage
j51 j51
Td11(x) is fit to data (xi, rid11), i 5 1,…, N. The predictor variables are xi and the
d
responses at stage d11 are the residuals rid11 from the current model + lT j ðxÞ.
j51
Continue till all D trees are grown. D
Step 3: Output aggregate tree: T 5 + lT j ðxÞ. The final prediction is based on
this. d 51
References
Abdul-Kader, S., & Woods, J. (2015). Survey of chatbot design techniques in speech
conversation systems. International Journal of Advanced Computer Science and
Applications, 6(7), 72–80.
Abrahams, A., Becker, A., Sabido, D., D’Souza, R., George, M., & Kransnodebski,
M. (2009). Inducing a marketing strategy for a new pet insurance company using
decision trees. Expert Systems with Applications, 36, 1914–1923.
Agarwal, D., & Schorling, C. (1996). Market share forecasting: An empirical
comparison of artificial neural networks and multinomial logit model. Journal of
Retailing, 72(4), 383–407.
Aggarwal, C., & Zhai, C. X. (2012). A survey of text classification algorithms. In
C. Aggarwal & C. X. Zhai (Eds.), Mining text data (pp. 163–222). Berlin: Springer.
Anderson, N. (1970). Functional measurement and psychophysical judgment.
Psychological Review, 77, 153–170.
Anderson, N. (1971). Integration theory and attitude change. Psychological Review,
78, 177–206.
Andrews, R., Ansari, A., & Imran, C. (2002). Hierarchical Bayes versus finite mixture
conjoint analysis models: A comparison of fit, prediction, and partworth recovery.
Journal of Marketing Research, 39, 87–98.
Apampa, O. (2016). Evaluation of classification and ensemble algorithms for bank
customer marketing response prediction. Journal of International Technology and
Information Management, 24(4), 85–100.
Bagozzi, R. (1994). Advanced methods of marketing research. Cambridge, MA: Basil
Blackwell Ltd.
Bahnsen, A. C., Aouada, D., & Ottersten, B. (2015). Example-dependent cost-
sensitive decision trees. Expert Systems with Applications, 42, 6609–6619.
Baines, P., Worcester, R., David, J., & Mortimore, R. (2003). Market segmentation
and product differentiation in political campaigns. Journal of Marketing
Management, 19(1–2), 225–249.
Bajari, P., Nekipelov, D., Ryan, S. P., & Yang, M. (2015). Machine learning methods
for demand estimation. The American Economic Review, 105(5), 481–485.
Balakrishnan, P., Cooper, M., Jacob, V., & Lewis, P. (1996). Comparative performance
of the FSCL neural net and K-means algorithm for market segmentation. European
Journal of Operational Research, 93(10), 346–357.
Bejju, A. (2016). Sales analysis of E-commerce websites using data mining techniques.
International Journal of Computer Applications, 133(5), 36–40.
Ben-Hur, A., Horn, D., Siegelmann, H., & Vapnik, V. (2001). Support vector
clustering. Journal of Machine Learning Research, 2, 125–137.
Bensic, M., Sarlija, N., & Zekic-Susac, M. (2005). Modeling small-business credit
scoring by using logistic regression, neural networks and decision trees. Intelligent
Systems in Accounting, Finance and Management, 13, 133–150.
184 References
Bishop, C. (1995). Regularization and complexity control in feed forward networks.

Technical Report NCRG/95/022, University of Aston, Birmingham.
Blattberg, R., Kim, B.-D., & Neslin, S. (2008). Database marketing: Analyzing and
managing customers. Berlin: Springer.
Bloom, J. (2005). Market segmentation - a neural network application. Annals of
Tourism Research, 32, 93–111. doi:10.1016/j.annals.2004.05.001
Boone, D., & Roehm, M. (2002). Retail segmentation using artificial neural networks.
International Journal of Research in Marketing, 19(3), 287–301.
Boser, B., Guyon, I. M., & Vapnik, V. (1992). A training algorithm for optimal
margin classifiers. COLT ’92: Proceedings of the fifth annual workshop on
Computational learning theory (pp. 144–152). doi:10.1145/130385.130401
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Brieman, L., Friedman, J., Ohlsen, R., & Stone, C. (1984). Classification and
regression trees. New York, NY: Wadsworth.
Carbonneau, R., Laframboise, K., & Vahidov, R. (2008). Application of machine
learning techniques for supply chain demand forecasting. European Journal of
Operational Research, 184, 1140–1154.
Casdagli, M. (1989). Nonlinear prediction of chaotic time-series. Physica, 35, 335–356.
Chattopadhyay, M., Dan, P. K., Majumdar, S., & Chakraborty, P. S. (2012).
Application of artificial neural network in market segmentation: A review on recent
trends. Management Science Letters, 2, 425–438. Retrieved from https://arxiv.org/abs/
1202.2445
Chau, M., & Chen, H. (2008). A machine learning approach to web page filtering
using content and structure analysis. Decision Support Systems, 44(2), 482–494.
Cheung, W., James, K., Law, M., & Tsui, K.-C. (2000). Mining customer preference
ratings for product recommendation using the support vector machine and the
latent class model. In C. A. Brebbia & N. F. F. Ebecken (Eds.), Data mining II.
WIT Press. 1-85312-821-X. Retrieved from www.witpress.com
Cheung, W., James, K., Law, M., & Tsui, K.-C. (2003). Mining customer product
ratings for personalized marketing. Decision Support Systems, 35, 231–243.
Chiu, M., & Lin, G. (2004). Collaborative supply chain planning using the artificial
neural network approach. Journal of Manufacturing Technology Management,
15(8), 787–796.
Cho, V., & Ngai, E. (2003, July). Data mining for selection of insurance sales agents.
Expert Systems, 20(3), 123–132.
Corstjens, M., & Gautschi, D. (1983). Formal choice models in marketing. Marketing
Science, 2(1), 19–56.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine
preferences by data mining from physicochemical properties. Decision Support
Systems, 47, 547–553.
Coussement, K., & Van den Poel, D. (2008). Churn prediction in subscription services:
An application of support vector machines while comparing two parameter-
selection techniques. Expert Systems with Applications, 34, 313–327.
Coussement, K., Bossche Van den, F., & De Bock, K. (2014). Data accuracy’s impact
on segmentation performance: Benchmarking RFM analysis, logistic regression,
and decision trees. Journal of Business Research, 67, 2751–2758.
Cui, D., & Curry, D. (2005). Prediction in marketing using the support vector
machine. Marketing Science, 24(4), 595–615.
References 185
Currim, I., Meyer, R., & Le, N. (1988). Disaggregate tree- structured modeling of
consumer choice data. Journal of Marketing Research, 25(August), 253–265.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function.
Mathematics of Control, Signals and Systems, 2, 303–314.
De Caigny, A., Coussement, K., & De Bock, K. W. (2018). A new hybrid classification
algorithm for customer churn prediction based on logistic regression and decision
trees. European Journal of Operational Research, 269, 760–772.
Delen, D., Kuzey, C., & Uyar, A. (2013). Measuring firm performance using financial
ratios: A decision tree approach. Expert Systems with Applications, 40, 3970–3983.
Dimopoulos, Y., Paul, B., & Sovan, L. (1995). Use of some sensitivity criteria for
choosing networks with good generalization ability. Neural Processing Letters,
2(6), 1–4.
Dooley, B. (2017). Why AI with augmented and virtual reality will be the next big thing.
Retrieved from https://upside.tdwi.org/articles/2017/04a/04/ai-with-augmented-and-
virtual-reality-next-big-thing.aspx
Duchessi, P., & Lauria, E. (2013). Decision tree models for profiling ski resorts’
promotional and advertising strategies and the impact on sales. Expert Systems
with Applications, 40, 5822–5829.
Evgeniou, T., Boussios, C., & Zacharia, G. (2005). Generalized robust conjoint
estimation. Marketing Science, 24(3), 415–429.
Fish, K., Barnes, J., & Aiken, M. (1995). Artificial neural networks: A new
methodology for industrial market segmentation. Industrial Marketing
Management, 24(5), 431–438.
Frew, J. F., & Wilson, B. (2002). Estimating the connection between location and
property value. Journal of Real Estate Practice and Education, 5(1), 17–25.
Friedman, J. (2002). Stochastic gradient boosting. Computational Statistics and Data
Analysis, 38(4), 367–378.
Galindo, J., & Tamayo, P. (2000). Credit risk assessment using statistical and machine
learning: Basic methodology and risk modelling applications. Computational
Economics, 15, 107–143.
Gao, J., Galley, M., & Li, L. (2019). Neural approaches to conversational AI.
Foundations and TrendsÒ in Information Retrieval, 13(2–3), 127–298. doi:
10.1561/1500000074
Garg, A., & Tai, K. (2014). An ensemble approach of machine learning in evaluation
of mechanical property of the rapid prototyping fabricated prototype. Applied
Mechanics and Materials, 575, 493–496.
Garson, G. D. (1991). Interpreting neural network connection weights. Artificial
Intelligence Expert, 6(4), 46–51.
Gordini, N., & Veglio, V. (2017). Customers churn prediction and marketing retention
strategies. An application of support vector machines based on the AUC
parameter-selection technique in B2B e-commerce industry. Industrial Marketing
Management, 62, 100–107.
Govidarajan, M. (2013). A hybrid framework using RBF and SVM for direct marketing.
International Journal of Advanced Computer Science and Applications, 4(4), 121–126.
Grover, R., & Dillon, W. R. (1985). A probabilistic model for testing hypothesized
hierarchical market structures. Marketing Science, 4(Fall), 312–335.
Gruca, T., Klemz, B., & Petersen, E. (1999). Mining sales data using a neural network
model of market response. ACM SIGKDD, 1(1), 39–43.
186 References
Guido, G., Prete, M. I., Miraglia, S., & De Mare, I. (2011). Targeting direct
marketing campaigns by neural networks. Journal of Marketing Management,
27(9–10), 992–1006.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). Elements of statistical learning.
Springer series in statistics. New York, NY: Springer.
Haughton, D., & Oulabi, S. (1997). Direct marketing modeling with CART and
CHAID. Journal of Direct Marketing, 11(4), 42–52.
Hill, T., O’Connor, M., & Remus, W. (1996). Neural network models for time series
forecasts. Management Science, 42(7), 1082–1092.
Hruschka, H., & Natter, M. (1999). Comparing performance of feedforward neural
nets and K-means for market segmentation. European Journal of Operational
Research, 114, 346–353.
Hruschka, H. (1993). Determining market response functions by neural network
modeling: A comparison of econometric techniques. European Journal of
Operational Research, 66(1), 346–353.
Huang, Z., Chen, H., Hsu, C.-J., Chen, W.-H., & Wu, S. (2004). Credit rating analysis
with support vector machines and neural networks: A market comparative study.
Decision Support Systems, 37, 543–558.
Huang, J.-J., Tzeng, G.-H., & Ong, C.-S. (2007). Marketing segmentation using
support vector clustering. Expert Systems with Applications, 32, 313–317.
James, G., Witten, D., Hastie, T., & Tibshirani, R., (2013). An introduction to statistical
learning: With applications in R (1st ed., Springer Texts in Statistics). New York,
NY: Springer.
Joachims, T. (1998). Text categorization with support vector machines: Learning with
many relevant features. Proc. of ECML, 98, 137–142. Springer-Verlag.
Johnston, M., & Marshall, G. (2013). Sales force management (11th ed.). Abingdon:
Routledge.
Kass, G. (1980). An exploratory technique for investigating large quantities of
categorical data. Journal of the Royal Statistical Society, Series C (Applied
Statistics), 29(2), 119–127.
Kim, J. W., Lee, B. H., Shaw, M. J., Chang, H.-Lu, & Nelson, M. (2001). Application
of decision-tree induction techniques to personalized advertisements on internet
storefronts. International Journal of Electronic Commerce, 5(3), 45–62.
Kim, Y., Street, W. N., Russell, G. J., & Menczer, F. (2005). Customer targeting: A neural
network approach guided by genetic algorithm. Management Science, 51, 264–276.
Kim, D., Lee, H.-J., & Cho, S. (2008). Response modeling with support vector
regression. Expert Systems with Applications, 34, 1102–1108.
Knott, A., Hayes, A., & Scott, N. (2002). Marketplace: Next-product-to-buy models
for cross-selling applications. Journal of Interactive Marketing, 16(Summer), 3.
Krotov, D., & Hopfield, J. J. (2019, April 16). Unsupervised learning by competing
hidden units. Proceedings of the National Academy of Sciences, 116(16), 7723–7731.
Krycha, K. A. (1999). Market segmentation and profiling using artificial neural
networks. In W. Gaul & H. Locarek-Junge (Eds.), Classification in the
information age. Studies in classification, data analysis, and knowledge
organization. Berlin, Heidelberg: Springer.
Kumar, D. A., & Ravi, V. (2008). Predicting credit card customer churn in banks
using data mining. International Journal of Data Analysis Techniques and
Strategies, 1(1), 4–28.
References 187
Kumar, A., Rao, V., & Soni, H. (1995). An empirical comparison of neural networks
and logistic regression. Marketing Letters, 6(4), 251–263.
Landt, F. W. (1997). Stock price predictions using neural networks. Leiden: Leiden
University.
Lawrence, S., Tsoi, A., & Gilles, C. (1996). Noisy time series prediction using symbolic
representation and recurrent neural network grammatical inference. University of
Maryland, College Park, MD: University of Maryland, Institute for Advanced
Computer Sciences.
Lek, S., Beland, A., Dimopoulos, I., Lauga, J., & Moreau, J. (1995). Improved
estimation using neural networks, of the food consumption of fish populations.
Marine and Freshwater Research, 46, 1229–1236.
Lek, S., Delacoste, M., Baran, P., Dimopoulos, I., Lauga, J., & Aulagnier, S. (1996).
Application of neural networks to modelling nonlinear relationships in ecology.
Ecological Modelling, 90(1), 39–52.
Lemmens, A., & Croux, C. (2006). Bagging and boosting of classification trees to
predict churn. Journal of Marketing Research, 42, 276–286.
Li, N., Wu, D. D. (2010). Using text mining and sentiment analysis for online forums
hotspot detection and forecast. Decision Support Systems, 48, 354–368.
Lin, W., Wu, Z., Lin, L., Wen, A., & Lin, L. (2017). An ensemble random forest
algorithm for insurance big data analysis. IEEE Access, 5, 16568–16575.
Linder, R., Geier, J., & Kolliker, M. (2004). Artificial neural networks, classification
trees and regression: Which methods for which customer base? Database Marketing
and Customer Strategy Management, 11(4), 344–356.
Magidson, J. (1988). New statistical techniques in direct marketing: Progression
beyond regression. Journal of Direct Marketing, 2(4), 6–18.
Magidson, J. (1989). CHAID, logit, and log-linear modeling. Marketing Research
Systems, Report 11-130, 101–114.
Mahajan, V., & Jain, A. (1978). An approach to normative segmentation. Journal of
Marketing Research, 15, 338–345.
Makridakis, S., Anderson, A., Carbone, R., Fildes, R., Hibon, M., Lewandowski, R.,
… Winkler, R. (1982). The accuracy of extrapolation (time series) methods: Results
of a forecasting competition. Journal of Forecasting, 1, 111–153.
Martı́nez-Ruiz, M. P., Molla-Descals, A., Gomez-Borja, M., & Rojo-Alvarez, J.
(2006). Assessing the impact of temporary retail price discounts intervals using
SVM semiparametric regression. International Review of Retail, Distribution and
Consumer Research, 16(2), 181–197.
Martı́nez-Ruiz, M. P., Gomez-Borja, M. A., Molla-Descals, A., & Rojo-Alvarez, J. L.
(2008). Using support vector semiparametric regression to estimate the effects of
pricing on brand substitution. International Journal of Marketing Research, 50(4),
555–557.
McCulloch, W., & Pitts, W. (1943). A logical calculus of the ideas immanent in
nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
Messenger, R., & Mandell, L. (1972). A modal search technique for predictable nominal
scale multivariate analysis. Journal of the American Statistical Association, 67(340),
768–772.
Mizuno, M., Saji, A., Sumita, U., & Suzuki, H. (2008). Optimal threshold analysis of
segmentation methods for identifying target customers. European Journal of
Operational Research, 186, 358–379.
188 References
Morgan, J., & Sonquist, J. (1963). Problems in the analysis of survey data, and a
proposal. Journal of the American Statistical Association, 58, 415–434.
Mukherjee, S., Osuna, E., & Girosi, F. (1997). Nonlinear prediction of chaotic time
series using support vector machines. Neural Networks for Signal Processing [1997]
VII. Proceedings of the 1997 IEEE Workshop. Added to IEEE Explore in 06
August 2002. doi:10.1109/NNSP.1997.622433
Neslin, S., Gupta, S., Kamakura, W., Lu, J., & Mason, C. (2006, May). Defection
detection: Measuring and understanding the predictive accuracy of customer churn
models. Journal of Marketing Research, 43(2), 204–211.
Netzer, O., Feldman, R., Goldenberg, J., & Fresko, M. (2012). Mine your own
business: Market-structure surveillance through text mining. Marketing Science,
31(3), 521–543.
Olden, J. D., & Jackson, D. A. (2002). Illuminating the “black box”: A randomization
approach for understanding variable contributions in artificial neural networks.
Ecological Modelling, 154(1–2), 135–150.
Özesmi, S. L., & Özesmi, U. (1999). An artificial neural network approach to spatial
habitat modelling with interspecific interaction. Ecological Modelling, 116(1), 15–31.
Potharst, R., Kaymak, U., & Pijl, W. (2001). Neural networks for target selection in direct
marketing. ERIM Report Series Research in Management, ERS-2001-14-LIS.
Retrieved from https://www.igi-global.com/chapter/neural-networks-business/27261
Razmochaeva, N., & Klionsky, D. (2019). Data presentation and application of
machine learning methods for automating retail sales management processes.
IEEE Explore. doi:10.1109/EIConRus.2019.8657077
Rokach, L., Naamani, L., & Shmilovici, A. (2008). Pessimistic cost-sensitive active
learning of decision tree for profit maximizing targeting campaigns. Data Mining
and Knowledge Discovery, 17, 283–316.
Sapankevych, N., & Sankar, R. (2009). Time series prediction using support vector
machines. IEEE Computational Intelligence Magazine, 4(2), 24–38.
Sheu, J.-J., Su, Y.-H., & Chu, Ko-T. (2009). Segmenting online game customers - the
perspective of experiential marketing. Expert Systems with Applications, 36,
8487–8495.
Shih, J.-Y., Chen, W.-H., & Chang, Y.-J. (2014). Developing target marketing models
for personal loans. IEEE. 978-1-4799-6410-9.
Shin, H.J., & Cho, S. (2006). Response modeling with support vector machines.
Expert Systems with Applications, 30, 746–760.
Sun, A., Lim, E.-P., & Liu, Y. (2009). On strategies for imbalanced text classification
using SVM: A comparative study. Decision Support Systems, 48, 191–201.
Sustrova, T. (2016). An artificial neural network model for a wholesale company’s
order-cycle management. International Journal of Engineering Business Management.
doi:10.5772/63727
Syam, N., & Sharma, A. (2018). Waiting for a sales renaissance in the fourth industrial
revolution: Machine learning and artificial intelligence in sales research and
practice. Industrial Marketing Management, 69, 135–146.
Thieme, J., Song, M., & Calantone, R. (2000). Artificial neural network decision
support systems for new product development project selection. Journal of
Marketing Research, 37(4), 499–507.
References 189
Thiesing, F. M., & Vornberger, O. (1997). Sales forecasting using neural networks.
Proceedings of International Conference on Neural Networks (ICNN’97), Houston,
TX, USA (Vol. 4; pp. 2125–2128). doi:10.1109/ICNN.1997.614234
Thomassey, S., & Fiordaliso, A. (2006). A hybrid sales forecasting system based on
clustering and decision trees. Decision Support Systems, 42, 408–421.
Thompson, W., Li, H., & Bolen, A. Artificial intelligence, machine learning, deep
learning and beyond: Understanding AI technologies and how they lead to smart
applications. Retrieved from https://www.sas.com/en_us/insights/articles/big-data/
artificial-intelligence-machine-learning-deep-learning-and-beyond.html
Thrasher, R. (1991). Cart: A recent advance in tree-structured list segmentation
methodology. Journal of Direct Marketing, 5(1), 35–47.
Tirenni, G., Kaiser, C., & Herrmann, A. (2007). Applying decision trees for value-
based customer relations management: Predicting airline customers’ future values.
Database Marketing & Customer Strategy Management, 14(2), 130–142.
Tirunillai, S., & Tellis, G. (2014). Mining marketing meaning from online chatter:
Strategic brand analysis of big data using Latent Dirichlet allocation. Journal of
Marketing Research, LI, 463–479.
Toubia, O., Hauser, J. R., & Simester, D. I. (2004). Polyhedral methods for adaptive
choice-based conjoint analysis. Journal of Marketing Research, 41, 116–131.
Urban, G. L., Johnson, P. L., & Hauser, J. R. (1984). Testing competitive market
structures. Marketing Science, 3(Spring), 83–112.
van der Putten, P., & van Someren, M. (2000). CoIL challenge 2000: The insurance
company case. Published by Sentient Machine Research, Amsterdam. Also, a
Leiden Institute of Advanced Computer Science Technical Report 2000-09. June
22, 2000. Retrieved from http://liacs.leidenuniv.nl/;puttenpwhvander/tic.html
Van Heerde, H. J., Leeflang, P. S. H., & Wittink, D. R. (2001). Semiparametric analysis
to estimate the deal effect curve. Journal of Marketing Research, 38(May), 197–215.
Vapnik, V. N. (1989). Statistical learning theory. New York: Wiley-Interscience. ISBN
978-0-471-03003-4.
Verbeke, W., Dejaeger, K., Martens, D., Hur, J., & Baesens, B. (2012). New insights
into churn prediction in the telecommunication sector: A profit driven data mining
approach. European Journal of Operational Research, 218, 211–229.
West, P., Brockett, P., & Golden, L. (1997). A comparative analysis of neural
networks and statistical methods for predicting consumer choice. Marketing
Science, 16(4), 370–391.
Xie, Y., Li, X., Ngai, E. W. T., & Yin, W. (2009). Customer churn prediction using
improved balanced random forests. Expert Systems with Applications, 36, 5445–5454.
Yao, J., Teng, N., Poh, T., & Tan, C. (1998). Forecasting and analysis of marketing
data using neural networks. Journal of Information Science and Engineering, 14(4),
843–862.
Zahavi, J., & Levin, N. (1997). Applying neural computing to target marketing.
Journal of Direct Marketing, 11(1), 5–22.
Zhang, Y., Dang, Y., Chen, H., Thurmond, M., & Larson, C. (2009). Automatic
online news monitoring and classification for syndromic surveillance. Decision
Support Systems, 47(4), 508–517.
Zhang, P. (2004). Neural networks in business forecasting. Hershey, PA: IRM Press.
Index
Activation function, 29–32, 37, 60 Business-to-customer setting

Active learning, 173–174 (B2C setting), 118–119
Ada. Boost. M1 algorithm, 166–167
Area under curve (AUC), 13–14, Caravan insurance, 176
118–119 Chain-rule of calculus, 39, 63
Artificial intelligence (AI), 1, 2 Chaotic time series, 119
Artificial neural network (ANN), 26 Chi Squared automatic interaction
Augmented reality (AR), 53 detection method (CHAID
Automatic interaction detection method), 139–143
method (AID method), Chi-squared statistic, 140
139–143 Choice rules, 121–122
Automatic relevance detection (ARD), Choice-based conjoint analysis (CBC
49 analysis), 122
Average linkage, 157–158 Churn
modeling, 118–119
Backpropagation prediction, 54–57
cost functions and training of neural Classification
networks using, 38–40 models, 2–3
equations, 62 NN for, 37–38
Bagging, 158–159, 161–165, 169 performance assessment for classifi-
regularization through, 78 cation tasks, 9–19
Basis expansion, 58–59 trees, 150–155
Basis function(s), 58–59 Classification and regression trees
regression, 28, 31 (CART), 143–155, 175–176
Batch Classifier, 93
gradient descent, 8–9 Clustering models, 156
size, 8–9 Coefficients, 2
Bayesian approach, 26 Collaborative-based recommendation
Bayesian neural networks, 49 system, 116–117
Bias, 2, 69 Complete linkage, 157–158
bias-variance tradeoff, 68–70 Composite functions, 36
Binary choice targeting model, 72 Computational learning theory, 85–86
Binary classification, 3 Confusion matrix, 12–13
Boosting, 158–159, 165–169 Conjoint analysis, 124–125
Bootstrap(ping), 158–161 methodology, 116
aggregation, 159 Connection weights, feature
Business-to-business setting importance based on,
(B2B setting), 118–119 45–47
192 Index
Consumer choice modeling, 121 Empirical distribution, 5

Content-based recommendation Ensemble methods, regularization
system, 116–117 through, 78
Convolutional neural networks Ensemble random forest approach,
(CNN), 2 175–176
Cosine similarity kernel, 130 Euclidean distance, 156–157
Cost complexity Euclidean norm, 91–92
criterion, 181 Evolutionary local selection algorithm
pruning, 149–150 (ELSA), 52
Cost function, 9 Example-dependent costs, 175–176
and training of machine learning Expectation, 69
models, 3–4 Expected test error, 71–72
Cross-entropy cost, 4, 6, 19–20, 38, 41, Explanatory variable, 2
151–152
Cross-validation, 70–72, 80 Feature importance
Cumulative response curve, 15–17 based on connection weights, 45–47
“Customer-focused” approach based on partial derivatives, 49
of marketing, 116 measurement, 42–49
Feature selection, 75
Decision trees, 155 Feature space, 94, 129, 143
applications in marketing and sales, First Order Conditions (FOCs), 132
171–176 Fivefold cross-validation, 71
bootstrapping, bagging, and boost- Forward stagewise additive modeling
ing, 158–169 process, 165–166
case studies, 176–179
decision tree-based methods, Gainsight and Survey Monkey
139–140 company, 54
early evolution of, 139–143 Gaussian distribution, 62
random forest, 169–171 Gaussian errors, 38
and segmentation, 155–158 Generalizability, 73
Default classification rule, 12–13 Generalization error, 9–10, 68
Dendrograms, 156, 158 Gini coefficient, 17–19
Dependent variable, 2 Gini index, 151–152
Depth of network, 36 Goodness-of-fit measure, 149–150
Descent, 9 Gradient, 61
Direct Marketing Educational boosting, 168
Foundation (DMEF), with cross-entropy, 63
117–118, 173–174 descent, 9, 61
Directional derivatives, 8 gradient-based learning, 6–9
Distance to city center (DCC), 58 Gram matrix, 130
Dot product, 59, 91, 128 Greedy algorithm, 147–149, 155
“Earnings before tax-to-equity ratio”, Hard choices, 116–117

173–174 Hidden nodes, 31–32, 59–60
Index 193
Hierarchical Bayesian method (HB Law of parsimony, 72

method), 116 Lead qualification and scoring models,
Hierarchical clustering, 156 52
Hit rate, 11–12 Learning rate, 66
Hyperparameters, 66, 167 with cross-entropy function, 63
Hyperplanes, 88 parameter, 7–8
margin between classes, 99–100 Learning slowdown, 63
maximal margin classification, Leave-out-one cross-validation
101–106 (LOOCV), 71
optimal separating hyperplane, Lift chart, 15–17
99–106 Linear activation function for
separating, 88–89 continuous regression
outputs, 40, 62–63
Independent variable, 2 Linear regression model, 2–3
Inner product, 59, 91, 128 Log odds, 86, 127
Intel’s RealSense Vision Processor, 53 ratio, 3
Internet Movie Database (IMDB), Log-likelihood, 19
116–117 Logistic regression, 3, 86
“Inverted U” shape, 33–34 Logit leaf model (LLM), 175–176
Irreducible error, 69
Machine learning, 1–2
K-fold cross validation, 71 implementation, 1
Karush–Kuhn–Tucker conditions, 132 industry applications, 1
Kernel(s), 94–98 kernels in, 90–99
kernel-based nonlinear classifier, Margin, 104, 130
114 width, 107
in machine learning, 90–99 Maximal margin classification,
matrix, 130 101–106
as measures of similarity, 91–94 Maximum likelihood estimation
trick, 98–99 (MLE), 4–6, 38
kth degree polynomial kernel, 130 Maximum likelihood estimator,
19, 60–61
L1 regularization, 74–75 Mean squared error (MSE), 10, 58
as constrained optimization Mini-batch gradient descent, 8–9
problems, 75–76 Misclassification costs, 175–176
weight decay in, 81 Model distribution, 5
L2 regularization, 73–74 Monocentric land value model, 26–27
as constrained optimization Multi-class classification, 37
problems, 75–76 Multicentric land value model, 27
weight decay in, 80–81 Multilayer NNs, 36–37, 53
Lagrange multipliers, 131–132 Multilayer perceptron (MLP),
Lasso, 73 175–176
Latitude of acceptance rule (LOA Multinomial logit (MNL), 50–51
rule), 52, 121–122
194 Index
Natural language processing (NLP), Pessimistic active learning (PAL),

2, 53 173–174
“Net profit margin”, 173–174 Polynomial kernel, 114
Neural interpretation diagrams (NID), Predicted mean squared error (PMSE),
43–44 126
Neural networks (NN), 2, 25–26, 53 Predicted MSE (PMSE), 10
applications to sales and marketing, Prediction rule, 143
49–54 Profile method for sensitivity analysis,
case studies, 54–58 44–45
for classification, 37–38 Propensity scoring model, 12–13
cost functions and training of Prototypes, 173–174
neural networks using
backpropagation, 38–40 Quadratic cost, 63
early evolution, 25–26 function, 76, 83
feature importance measurement
and visualization, 42–49 Radial basis function kernel (RBF
model, 26–38 kernel), 99, 130, 175–176
output nodes, 40–42 Radial basis kernel, 126
for regression, 28–37 Radial kernel, 123
Next Product To Buy (NPTB), 54 Random forest, 2, 139–140, 169–171
Non-compensatory choice rules, applications in marketing and sales,
52, 121 171–176
Non-convex region, 146 Randomization approach for weight
Non-parametric methods, 139–140 and input variable
Nonlinear maps and kernels, 94–98 significance, 48–49
Norm, 91–92, 128 Receiver operating characteristics
curve (ROC curve), 13–14
Online learning, 8–9 Recency, frequency, monetary value
Optimal classifier, 114 analysis (RFM analysis),
Ordinary least squares regression 173–174
(OLS regression), 101 Rectified Linear Units (ReLU), 33
Out-of-bag observations, 163 Recurrent neural networks (RNN), 2
Output nodes, 40–42 Recursive binary partitioning, 145
Overfitting, 66–68 Regression
cost complexity pruning, 149–150
Parameter norm penalty methods, greedy algorithm, 147–149
73–74 models, 2–3
Partial derivatives, feature importance NN for, 28–37
based on, 49 performance assessment for, 9–19
Percent correctly classified (PCC), trees, 147–150
11–12, 124, 176 Regularization, 66, 72–78
Perceptrons, 89 through bagging and ensemble
Permutation importance measure, methods, 78
164 through early stopping, 77
Index 195
through input noise, 76 Streaming data, 8–9

through sparse representations, Sum of squares (SS), 147, 150, 181
77–78 cost, 4, 19
Rent value error cost, 38
location vs., 125 Supervised learning models, 1
prediction, 57–58 Supervised segmentation, 155–156
Response variables, 1–3 Support vector classifier, 106–114
Ridge regression, 73 Support vector clustering (SVC), 115
“Root” node, 149–150 Support vector machine (SVM), 2,
85–86, 175–176
Sales and marketing applications in marketing and sales,
applications of NN to, 49–54 114–120
SVM applications in, 114–120 case studies, 120–127
Sampling variability, 69 early evolution, 85–86
Satisficing rule, 52 hyperplanes, 88–89
Segmentation, 155–158 kernels in machine learning, 90–99
Segmentation, targeting and nonlinear classification using, 86–88
positioning (STP), optimal separating hyperplane,
50, 155–156 99–106
Self-organizing feature maps (SOFM), support vector classifier and,
115 106–114
Separability, 109–110 Support vectors, 102
Separating hyperplanes, 88–89, 127 SVMauc technique, 118–119
Sequential binary programming (SBP),
173–174 Taiwan Ratings Corporation, 118–119
Shannon’s entropy, 151–152 Target variables, 1–3
Sigmoid activation function, 33 Test data, 9–10, 71
for binary outputs, 40–41, 63 Test error, 9–10, 66, 68
Sigmoid function, 33, 36 Text analysis, 119–120
Sigmoid kernel, 130 Text classification, 120
Similarity, kernels as measures of, THAID, 139–143
91–94 Top decile lift, 17
Slack variable, 107–109 Training error, 9–10, 66
Soft margins, 107 Training of machine learning models,
Softmax activation function for 1–9
multiclass outputs, 42, 64 cost functions and training of
Softmax function, 37 machine learning models,
Sparse representations, regularization 3–4
through, 77–78 gradient-based learning, 6–9
Sparsity, 75, 81–82 MLE, 5–6
Stochastic gradient boosting algorithm, regression and classification models,
169 2–3
Stochastic gradient descent (SGD), 8–9 Tree size, 149–150
Stopping rule, 148–149 Tree-based model, 175–176
196 Index
“Trial-and-repeat” purchase models, 2 Weight decay, 72–78

True data distribution, 5, 71–72 in L1 regularization, 81
in L2 regularization, 80–81
Underfitting, 69 parameter, 74, 80, 150
Units, 25–26 Weight(s), 2
Universal approximation theorem, vectors, 31
33–34 weight-based input importance
Unsupervised segmentation models, method, 45
156 weighted additive rule, 52
Wine quality, 178
Vapnik–Chervonenkis theory, 85–86 Wolfe Dual program, 106, 133
Variance, 69
Virtual reality (VR), 53 XOR problem, 113
Visualization, 42–49

ML &amp; AI in Marketing and Sales (2021) Nildri Syam - Emerland

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

ML &amp; AI in Marketing and Sales (2021) Nildri Syam - Emerland

Uploaded by

Copyright:

Available Formats

Machine Learning and Artiﬁcial

Intelligence in Marketing and Sales

The world of business in general and marketing in particular is

This book is a great resource for Data scientists as a reference to

In Machine Learning and Artiﬁcial Intelligence in Marketing and

Syam and Kaul’s book is a comprehensive treatise on data science

The authors have skillfully tailored the content to a wide audience.

United Kingdom – North America – Japan – India – Malaysia – China

First edition 2021

Copyright © 2021 by Emerald Publishing Limited.

Reprints and permissions service

British Library Cataloguing in Publication Data

ISBN: 978-1-80043-881-1 (Print)

List of Figures, Tables and Illustrations xiii

Chapter 1 Introduction and Machine Learning Preliminaries:

Chapter 2 Neural Networks in Marketing and Sales 25

1.3 Cost Functions and Training of Neural Networks

Chapter 3 Overﬁtting and Regularization in Machine Learning

2.3 L1 and L2 Regularization as Constrained

Chapter 4 Support Vector Machines in Marketing and Sales 85

Technical Detour 3 128

Chapter 5 Random Forest, Bagging, and Boosting of Decision Trees 139

Figure 1.1. Gradient Descent for a Minimization Problem. 11

Figure 2.6. NN with 1 Hidden Layer Containing M Hidden

Figure 4.2. Poor Classiﬁcation of “1” and “2” Classes Using

Figure 4.25. Latitude of Acceptance Noncompensatory Choice

Table 5.1. Categorical Data for CHAID Analysis. 153

Illustration 4.1. Nonlinear maps transform nonlinearly separable

Dawit Mulugeta, PhD (Biostatistics), VP Analytics and Risk Management,

Machine Learning and Artiﬁcial Intelligence in Marketing and Sales: Essential

• Marketing and sales practitioners who want to develop a deeper appreciation

This book represents a fruitful collaboration between an academic and an

Machine Learning and Artiﬁcial Intelligence in Marketing and Sales, 1–3

Introduction and Machine Learning

1. Training of Machine Learning Models

Machine Learning and Artiﬁcial Intelligence in Marketing and Sales, 5–24

1.1 Regression and Classiﬁcation Models

1.2 Cost Functions and Training of Machine Learning Models

1.3 Maximum Likelihood Estimation

1.4 Gradient-Based Learning

Fig. 1.1. Gradient Descent for a Minimization Problem.

Fig. 1.2. Small and Large Step Sizes (Learning Rates) in

2. Performance Assessment for Regression and

2.1 Performance Assessment for Regression Models

2.2 Performance Assessment for Classiﬁcation

2.2.1 Percent Correctly Classiﬁed (PCC) and Hit Rate

Fig. 1.3. Classifying “1” and “2” and 0–1 Loss.

Clearly, if we assume a 0–1 loss, on which PCC is based, then misclassiﬁcation

2.2.2 Confusion Matrix

Fig. 1.4. Confusion Matrix Showing True Positives and

Fig. 1.5. Instances Sorted by Decreasing Order of Predicted Class

Fig. 1.6. Instances Sorted by Decreasing Order of Predicted Class

Fig. 1.7. Receiver Operating Characteristics (ROC) Curve.

Fig. 1.8. Area Under the Curve (AUC).

2.2.4 Cumulative Response Curve and Lift (Gains) Chart

• The “Percentage of customers” is on the X-axis in decreasing order of their

The typical cumulative response curve is shown below (Fig. 1.9).

Fig. 1.9. Cumulative Response Curve.

(2) Sort by decreasing order of predicted probabilities of “1”

2.2.5 Gini Coefﬁcient

Fig. 1.10. Lift Chart.

In Fig. 1.11, let

AL 5 Area under cumulative responses curve (Lorenz curve), and

Then the Gini coefﬁcient 5(AL2AR)/AR.

ML & AI in Marketing and Sales (2021) Nildri Syam - Emerland

ML & AI in Marketing and Sales (2021) Nildri Syam - Emerland