Machine Learning and Artificial Intelligence in Marketing and Sales (2021), Emerald Publishing
BY
NILADRI SYAM
University of Missouri, USA
RAJEEVE KAUL
McDonald’s Corporation, USA
No part of this book may be reproduced, stored in a retrieval system, transmitted in any form or
by any means electronic, mechanical, photocopying, recording or otherwise without either the
prior written permission of the publisher or a licence permitting restricted copying issued in the
UK by The Copyright Licensing Agency and in the USA by The Copyright Clearance Center.
Any opinions expressed in the chapters are those of the authors. Whilst Emerald makes every
effort to ensure the quality and accuracy of its content, Emerald makes no representation
implied or otherwise, as to the chapters’ suitability and application and disclaims any warranties,
express or implied, to their use.
Foreword xvii
Preface xix
Acknowledgments xxi
Introduction 1
References 183
Index 191
List of Figures, Tables and Illustrations

Foreword
The book is written in five chapters covering important concepts, principles, and
practices on contemporary machine learning topics. Each page is written in an
easy-to-read format using clean and lean sentences unencumbered by complex
jargon. Each chapter provides solid theoretical background of the methods
selected, and further elaborates the how and the why aspects of model selection,
model building and model validation; the step-by-step approaches of leveraging
specific techniques such as neural networks, decision trees, and support vector
machines; and the pros and cons of certain machine learning (ML) procedures.
Moreover, each topic provides many real-world examples that connect the
theory with the applied use. The lists and depth of supporting reference materials are
also excellent.
We are at an exciting time where the era of big data, machine learning, arti-
ficial intelligence, cloud computing and advanced analytics is ushering in unprece-
dented access to and uses of large volumes of data to improve our predictive
power to unlock transformational changes that impact many aspects of our lives
in the retail, the financial, the manufacturing, the technology, the healthcare, and
other industries.
As both big and small firms alike are fine-tuning their pricing, promotion,
distribution, customer-retention, risk-management, and go-to-market strategies,
data scientists are increasingly expected to know cutting-edge solutions and to
equip themselves with many facets of ML techniques and solutions. This book
undoubtedly provides the foundational background, the tools, and the necessary
tips to grasp many of the ML methods currently in use. In addition, as ML is
rapidly and dynamically evolving to impact our daily life, this book is
undoubtedly timely.
I have worked as a data scientist for diverse organizations and have taught
analytics and ML classes in universities. A few pages into this book, I knew it
was a special treat, and it really struck a chord. Many of the well-
organized core concepts expounded in the book are not only refreshing but also
the kind I wish I had long ago. I also dare to describe this book as a versatile tool
and a must-have reference material for beginners and seasoned data scientists
alike, for business leaders, and for those who embrace data and analytics-driven
decision-making processes. In addition, analytics teachers and their respective
students can benefit from the in-depth analysis of the contemporary data science
topics and the plethora of examples provided. I commend the authors for a job
well done.
Preface
made the final product better. We have tried earnestly to strike a balance between
the theoretical/technical and applications aspects. Having said that, we have
decided to err on the side of applications, anchoring our narrative on the con-
nections between the techniques and their applications in a business setting. We
have deliberately kept the technical details to a bare minimum in the main body of
the text, and have dealt with technical details through the various “Technical
detours” that have been collected together at the end of each chapter. This allows
readers who do not wish to tackle the technical issues to be able to read the
chapters easily without being distracted or overwhelmed by technical details. In
addition, to help readers get a quick overview of the concepts involved, we have
added “Executive summaries” as and when needed. As far as possible in each
chapter we have tried to emphasize the intuitions behind the, sometimes complex,
concepts of machine learning methods.
As far as the marketing and sales practitioners are concerned, the book
assumes that they are interested in actual implementation of machine learning
models, either by themselves or in collaboration with the data scientists in their
organizations. For this reason, this book is not just a high-level overview of
machine learning applications to marketing and sales. Thus, by its very nature the
chapters assume that the reader is willing to handle some technical material.
However, we have made every attempt to make the chapters self-contained by
providing the background material needed to understand the chapter contents.
Each chapter has a section on the existing applications of machine learning and
artificial intelligence (AI) to issues in marketing and sales. We have tried to focus
on applications that have been archived in the major peer-reviewed research
journals in marketing, operations research, machine learning, expert systems, etc.
By their nature, journal articles have details of implementation and data sets and
interested readers can go through the articles listed in the references at the end of
every chapter for further details. Finally, for those wishing to get hands-on
experience of actually running analyses of marketing and sales data using
machine learning models, each chapter has a couple of detailed “Case studies” at
the end.
Acknowledgments
Niladri
I would like to gratefully acknowledge the support of my wife, Nivedita,
without whose patience and encouragement this book would never have materi-
alized. I would also like to thank my teacher, Professor Bibek Debroy, who
introduced me to the power of quantitative modeling and testing of business
phenomena in the 1980s, much before the term “analytics” had become fash-
ionable. The Center for Sales and Customer Development (CSCD) at the Uni-
versity of Missouri provided the support and proper climate which greatly
facilitated the writing of this book.
Rajeeve
This book would not be possible without the unyielding support and encour-
agement of my wife, Shalini, and the patience and understanding of our son,
Harsha. Working on a book while staying abreast with challenging executive roles
required them to sacrifice time that we could have spent building life memories –
for which I am so grateful. I would also like to thank my many professors who
encouraged me to learn and experiment with so many quantitative methods across
diverse fields from statistics, to marketing, finance, operations research, etc. I
further extend my gratitude to the many incredible executives across so many
industries who adopted my quantitative solutions to improve decisions in areas
including pricing, marketing, supply chain, and digital, among others, and the
companies that allowed me to follow my curiosity to develop and deploy these
models.
Introduction
This book grew out of many discussions that we, the two coauthors, had over
the span of several years. Over the years, the field of machine learning has gone
from an esoteric topic discussed among a small select group of practitioners and
researchers to one attracting an ever-growing tsunami of interest, with practitioners approaching
an increasingly digitized world from their diverse backgrounds. We were both
interested in machine learning, but we had approached it from two very different
perspectives, and could relate to this heterogeneity in thinking. One of us is an
academic and was focused primarily on the theory of machine learning and in
doing research on this topic. The other author is an industry practitioner and
was focused primarily on the applications of machine learning models in
marketing and sales. Of course, despite the different areas of emphasis each one
of us is interested in both theories and applications, and both realize that these
should go hand in hand. As our discussions progressed, we felt that the existing
resources in machine learning did not quite serve the needs of the diverse
stakeholders that have to work together for successful industry applications of
machine learning in marketing and sales. This motivated the need for a book
that would speak to both the sales/marketing business teams and the data
scientists who are tasked with solving business problems in the domains of
marketing and sales.
This book takes a different approach compared to two distinct existing
categories of books that deal with machine learning and AI – the technical and
the qualitative books. The former category of books does not focus on applications
in a specific domain in any detail, and often uses stylized examples drawn
not from business but from the physical sciences. The latter (qualitative) cate-
gory of books does not provide any details of the statistical and mathematical
concepts that drive machine learning techniques and their applications. Neither
of these types of books serves well the needs of practitioners coming from varied
backgrounds working on actual implementation of machine learning models in
the field of sales and marketing in a business enterprise. As far as the technical
aspects are concerned, we have avoided machine learning algorithms and have
focused instead on the concepts and ideas that underlie machine learning models
and methods. Our interest is in connecting the concepts that underpin these
methods and bridging the gap between the data scientist and the business
practitioner. There are many online and other resources for those readers
wishing to familiarize themselves with algorithms.
A major decision in writing any book is always what to include and what to
leave out. Machine learning and AI are flourishing fields of research with scholars
from diverse disciplines such as applied mathematics, statistics, operations
research, engineering, and computer science actively contributing to them. There
is an enormous variety of machine learning models, and trying to include all these
models would make the book unwieldy and unfit for the non-expert. We have
therefore decided to focus on just three of the most commonly used methods in
marketing and sales applications – Neural Network, Support Vector Machine
(SVM), and Random Forest. A key motivation for this approach is the
acknowledgment that though there are many ideas and approaches, not all of
them have efficacy across a broad set of business problems. As such, it makes
sense to focus on methods that have been validated for their applicability in
solving a wide variety of marketing problems. Importantly, these three models are
exemplars of three distinct classes of machine learning models, and many of the
latest developments in the field are based on these three fundamental models.
Thus, any understanding of the latest developments in machine learning, deep
learning, and AI requires an understanding of these models. For example,
advances in deep learning including Convolutional Neural Networks (CNN) and
Recurrent Neural Networks (RNN) are based on the fundamental ideas of Neural
Networks. An SVM is a prominent example of the class of kernel-based machine
learning models. A Random Forest is a good representative of the class of tree-
based learning models and is a good context to discuss ideas of bagging, boosting
and gradient boosting.
This book is about machine learning and Artificial Intelligence (AI). While
different authors have different ideas about the distinction between them, the
consensus opinion is that machine learning is a subfield of AI. At a broad
level, AI is the umbrella term used to denote the entire suite of technologies
that are designed to mimic human abilities. Thus, AI includes machine
learning, deep learning, natural language processing (NLP) etc. Many authors
argue that neural networks are part of deep learning since deep learning models are
essentially ‘big’ neural networks with many layers and complex interconnections
between them. To the extent that machine learning and deep learning
are often identified as distinct subfields of AI (see, for instance, the SAS
Institute white paper by Thompson, Li, and Bolen), the question of whether
neural networks should be included under machine learning often becomes a
matter of taste and preferences. We would like to avoid these matters of
semantics, and hence we have included both machine learning and AI in the
title. The reader can think of the content of the book as “narrow AI” which
is supervised machine learning.
Before discussing the three specific machine learning models, we will discuss
the concepts of training and performance assessment since these concepts are
applicable to all machine learning models. This is done in Chapter 1. Here we also
discuss the linear regression model for a continuous dependent variable (often
called a response variable or a target variable in a machine learning context) and
the logistic regression model for a categorical dependent variable. These will form
useful benchmarks for the machine learning models discussed in Chapter 2
(Neural Networks), Chapter 4 (Support Vector Machine) and Chapter 5
(Random Forest). In Chapter 3 we discuss the very important concept of over-
fitting and regularization. We have decided to introduce these after the chapter on
Neural Networks since it is easier to grasp these concepts when discussed in the
context of a specific model, even though they are applicable for all models.
Chapter 1: Introduction and Machine Learning Preliminaries
Chapter Outline
1. Training of Machine Learning Models
1.1 Regression and Classification Models
1.2 Cost Functions and Training of Machine Learning Models
1.3 Maximum Likelihood Estimation
1.4 Gradient-Based Learning
2. Performance Assessment for Regression and Classification Models
2.1 Performance Assessment for Regression
2.2 Performance Assessment for Classification
2.2.1 Percent Correctly Classified (PCC) and Hit Rate
2.2.2 Confusion Matrix
2.2.3 Receiver Operating Characteristics (ROC) Curve and the Area under
the Curve (AUC)
2.2.4 Cumulative Response Curve and Lift (Gains) Chart
2.2.5 Gini Coefficient
Technical Appendix
The epsilon term (ε) at the end is the error term. It captures the fact that the
relationship between X and Y has randomness owing to a host of factors. The
common sources of randomness are the many other factors that also affect
purchases in November apart from trials in October. Of course, these have not
been modeled, and thus, there will be errors when we use only one explanatory
variable to predict purchases in November. In the simple linear regression
above, the effect of the number of trial samples in October is given by the
parameter w1 (parameters that multiply inputs are also called coefficients and in
machine learning models like Neural Networks, they are called weights). The
slope, given by w1, intuitively captures the additional purchases in November
due to an extra trial sample in October. The intercept, given by w0, intuitively
captures the purchases in November if there were no trial samples in October
(in machine learning models like Neural Networks this parameter is called the
bias). Instead of just one explanatory variable, one could include other variables as
well on the right-hand side of the equation, and then we would have a multiple
regression.
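The slope and intercept above can be illustrated with a small numerical sketch. The trial-sample and purchase figures below are made up for illustration (not data from the book); the closed-form least-squares estimates stand in for whatever estimation routine one actually uses:

```python
# Hypothetical data: trial samples sent in October (x) and the same
# customers' purchases in November (y). Values are illustrative only.
x = [1, 2, 3, 4, 5, 6]
y = [3.1, 4.9, 7.2, 9.1, 10.8, 13.2]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form least-squares estimates for y = w0 + w1*x + epsilon.
w1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
     sum((xi - mean_x) ** 2 for xi in x)        # slope: extra purchases per extra trial
w0 = mean_y - w1 * mean_x                       # intercept: purchases with zero trials

print(round(w0, 2), round(w1, 3))
```

Here the fitted slope w1 is read exactly as the text describes: the additional November purchases attributable to one extra October trial sample.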
In this book, we will refer to a model with a continuous response variable as a
regression model and different machine learning techniques can be used to analyze
such models. The traditional linear regression described above can serve as a
useful benchmark to compare with the more recent machine learning models.
In marketing the response variable we are often interested in is categorical. For
instance, consider the case of a bank that wants to predict whether its customers
are likely to churn (leave) or not. A sales organization may be interested in
categorizing their prospects as being either in the “buy” or “not buy” category. In
lead scoring, a sales organization may want to categorize their sales leads as
belonging to one of many different classes based on their propensities to buy: very
unlikely, unlikely, likely, very likely. These are classification tasks, with the first
two being binary classification and the third being multiclass classification.
We will briefly describe the case of binary classification. The traditional
workhorse for analyzing models with a binary categorical response variable is a
logistic regression. In the bank churn example, suppose the two classes are “churn”
or “not churn,” and the bank wants to understand to what extent the amount of
“balance” that the customer has is predictive of churn. The answer is not clear a
priori. On the one hand, a customer with a large balance can be considered as
having a deeper relationship with the bank, and therefore, less likely to churn. On
the other hand, such attractive customers are targets of competitive offers from
other banks and are more likely to churn. We use the balance a customer has in the
bank as the explanatory variable X. The response variable Y ∈ {+1, −1} is coded
as: +1 ≡ “churn” and −1 ≡ “not churn.” We cannot use a linear regression here
since we would like to model the probability of churning, and unlike the contin-
uous response of a linear regression which can take on any value, probabilities
have to lie in the interval [0, 1].
The logistic regression works by defining p = Probability(Y = +1), and then
positing the relationship

Log[p/(1 − p)] = w0 + w1X1 (1.2)

The term on the left-hand side, Log[p/(1 − p)], is called the log odds ratio. This
formulation generates the probability of churning, p. It also ensures that the sum,
Probability(“churn”) + Probability(“not churn”), adds up to 1 as is expected of
probabilities. Based on these probabilities, one can classify customers as
belonging to the category “churn” (“not churn”) if p > 0.5 (p < 0.5).
In this book, we will refer to a model with a categorical response variable, both
binary and multiclass, as a classification model, and various machine learning
models can be used for classification tasks. The logistic regression described
above can serve as a benchmark to compare with machine learning classification
models.
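The log-odds formulation in Eq. (1.2) can be sketched in a few lines of code. The intercept and balance coefficient below are hypothetical, chosen only to show how the probability and the 0.5 cutoff work; in practice they would be estimated from the bank's data:

```python
import math

# Assumed (not estimated) parameters of log[p/(1-p)] = w0 + w1 * balance.
w0, w1 = -2.0, 0.00004

def churn_probability(balance):
    """p = Probability(Y = +1) implied by the log-odds relationship."""
    log_odds = w0 + w1 * balance
    return 1.0 / (1.0 + math.exp(-log_odds))   # inverting the log-odds keeps p in (0, 1)

def classify(balance):
    """Assign "churn" if p > 0.5, else "not churn"."""
    return "churn" if churn_probability(balance) > 0.5 else "not churn"

print(classify(10_000), classify(100_000))
```

Unlike a linear regression, the output of `churn_probability` always lies in (0, 1), which is exactly why the logistic form is used for a binary response.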
model is performing. In this case, the cost function is a function of the difference
between the predicted output of the model and the actual sales value for all past
periods. The model is said to perform well when the cost (also called error or loss)
is minimized. The minimization of cost is achieved by choosing appropriate
parameters of the mathematical model. This process is called training the machine
learning model.
For a machine learning model, training is said to occur when the model esti-
mates the “best” values of the parameters. What does best mean? At this point, we
formalize the concept of a cost function a bit more. Consider the linear regression
model specified above. Given a specific input data point, X = x, and some values of
the parameters (weights), the regression model can make a prediction f(x).¹ That is,
given specific values of w0 and w1 and a data point x, the regression model makes a
prediction y = f(x) = w0 + w1x. On the other hand, the input data point x has an
actual observed y (also called target) associated with it. Intuitively, the cost function
measures the discrepancy between the model prediction y and the actual y for all
possible values of input x. The goal of training is to choose those parameters
(weights w0 and w1) that minimize this cost. These cost minimizing weights are the
“best” weights.
In our discussions earlier, the cost function was based on sales – specifically it
was the difference between actual observed sales and the sales predicted by the
model. In business, typical “performance indicators” one encounters are sales,
margins, inventory balances, profits, hours worked, and payroll to name a few.
Any of these, or any combination of these, could be used to define the cost
function.
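The idea of training as cost minimization can be made concrete with a toy sketch. The data below are made up, and the crude grid search stands in for a real optimizer (actual training uses gradient-based methods, discussed next); the point is only that the "best" weights are the ones with the lowest cost:

```python
# Hypothetical past periods: input x (e.g., trial samples) and observed sales y.
periods_x = [1.0, 2.0, 3.0, 4.0]
actual_y  = [3.0, 5.0, 7.1, 9.0]

def sum_of_squares_cost(w0, w1):
    """Cost: squared difference between predicted and actual values, summed
    over all past periods."""
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(periods_x, actual_y))

# Evaluate the cost on a grid of candidate weights and keep the minimizer.
candidates = [(a / 10, b / 10) for a in range(0, 21) for b in range(0, 31)]
best_w0, best_w1 = min(candidates, key=lambda w: sum_of_squares_cost(*w))

print(best_w0, best_w1)
```

The chosen pair is "best" precisely in the sense the text defines: among all candidates, it makes the cost function smallest.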
By far the most common technique for training most machine learning models
is to use the method of Maximum Likelihood Estimation (MLE). Maximum
likelihood estimators have desirable statistical properties and are therefore
advantageous to use. Elements of this philosophy are also applied extensively in
deep learning. It is noteworthy that there is a close theoretical connection between
cost (loss) functions and maximum likelihood, and therefore we will use the
maximum likelihood framework to address the issue of choosing appropriate cost
functions. A standard result from statistics is that, when the errors in a regression
model are Gaussian then minimizing the sum-of-squares cost with respect to the
weights is equivalent to maximizing the log-likelihood. In the maximum likeli-
hood framework, the appropriate cost function for regression-type outputs is the
sum of squares cost (loss), and for binary output the appropriate cost is the cross-
entropy cost. Expressions for the sum-of-squares cost function for regression and
the cross-entropy cost function for binary classification are in the appendix.
Technical Detour 1
¹ The uppercase denotes the variable and the lowercase is a specific value of that variable.
distribution pmodel(x; θ). The goal of the maximum likelihood estimation is to find
that θ′ such that in the distribution pmodel(x; θ′) we have a probability as close as
possible to 1 for individuals who are defaulters and as close as possible to 0 for
non-defaulters.
A desirable feature of the maximum likelihood approach is its connection with
the cost function. We could simply define the negative of the likelihood as the cost
that needs to be minimized (as a technical matter, the logarithm of the likelihood
is used, but this does not change the conceptual ideas). From information theory
the negative of the log likelihood is the cross-entropy between the empirical data
distribution and the model distribution. Minimizing this cost will give us the θ
that is consistent with maximum likelihood. The cross-entropy cost is widely used
in machine learning and it has the advantage of being derived from maximum
likelihood estimation procedures. This unified framework for identifying a cost
function has a major advantage in that it obviates the need for coming up with
different cost functions for different models, but rather we can define the cost
function as soon as we specify a model distribution pmodel(x; θ).
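The identity between the negative log-likelihood and the cross-entropy cost can be checked numerically. The defaulter labels and model probabilities below are hypothetical, and the model is Bernoulli (probability p_i of "defaulter" for individual i):

```python
import math

# Made-up labels (1 = defaulter, 0 = non-defaulter) and the model's
# predicted probabilities of default for five individuals.
y = [1, 0, 1, 1, 0]
p = [0.9, 0.2, 0.8, 0.7, 0.1]

# Log-likelihood of the observed labels under the Bernoulli model.
log_likelihood = sum(math.log(pi) if yi == 1 else math.log(1 - pi)
                     for yi, pi in zip(y, p))

# Cross-entropy cost between the empirical labels and the model distribution.
cross_entropy = -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                     for yi, pi in zip(y, p))

# The two quantities agree: minimizing cross-entropy maximizes likelihood.
print(round(cross_entropy, 4))
```

This is the unified framework the text describes: once the model distribution is specified, the cost function comes for free as the negative log-likelihood.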
Technical Detour 2
In the simple case where there is only one weight to be determined, the
geometrical analog of the gradient is the slope of the cost function as shown
in Fig. 1.1. It is given mathematically by the derivative of the cost function with
respect to the weight. For multidimensional cases with many weights the equivalent
of this derivative is called a gradient. In Fig. 1.1, the slope (tangent line) at wt is
negative and so, from the previous formula for weight updating, the next value of
the weight, wt+1, will be larger than wt.
On the other hand, if we are currently at a point “a” on the right side of the
minimum point, the slope is positive, and the updating rule would result in a smaller
weight. In either case, the weights keep gradually moving toward the minimum
point.
The other quantity in the weight updating equation is g, the learning rate
parameter. It determines the rate at which, starting from an initial weight, the
updating procedure converges to the minimum point. It is called the learning rate
because it determines the rate of approach to the minimum point which is the goal
of learning in machine learning models. This parameter is exogenously chosen by
the model builder and there are complexities involved in its choice. The most
serious one is that, if the learning rate is too large then the learning procedure
could overshoot the minimum point thus failing to converge, and worse, may even
diverge. Fig. 1.2 makes this clear. The picture on the left shows a small learning
rate which allows the updating procedure to slowly approach the minimum point,
whereas the picture on the right shows a large learning rate where weight
updating overshoots the minimum point.
Of course, if the learning rate is too small, the training will take an unacceptably
long time. The learning rate is thus a hyper-parameter that must be chosen with care.
The model builder will need to try different learning rates and judge their effec-
tiveness, using cross-validation and other methods, before settling on an appro-
priate learning rate. Overall, the plain vanilla gradient descent has oftentimes
been found to be unstable and slow. To the extent that these problems are caused
by the inappropriate choice of the learning rate parameter, machine learning
researchers have suggested stable methods of choosing this parameter so that we
are guaranteed to converge to a local minimum regardless of the starting point.
Here we will not discuss these more technically advanced methods, many of which
use Newton’s method involving second derivatives or Hessians that take into
account the rate of change of slope.
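The effect of the learning rate can be seen on a one-weight example. The quadratic cost below is an illustrative stand-in (not a model from the book), with its minimum at w = 3; a small learning rate converges, while an overly large one overshoots and diverges:

```python
# Gradient descent on the toy cost C(w) = (w - 3)**2, minimized at w = 3.
def gradient(w):
    return 2 * (w - 3)            # derivative (slope) of the cost at w

def run(learning_rate, steps=50, w=0.0):
    for _ in range(steps):
        w = w - learning_rate * gradient(w)   # the weight updating rule
    return w

w_small = run(0.1)    # small learning rate: slowly approaches the minimum
w_large = run(1.1)    # too-large learning rate: overshoots and diverges
print(round(w_small, 4))
```

Note also the point made about Fig. 1.1: starting at w = 0 the slope is negative, so the very first update moves the weight upward, toward the minimum.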
In the case of multiple weights, we operate in multi-dimensional space and
the direction of descent becomes critical. In such cases we have to study the
directional derivatives at various points in multidimensional space. Think of the
Mars rover that finds itself on a slope, tasked with finding the bottom of a basin.
In order to descend to the bottom most rapidly, the rover should determine the
most efficient course to the bottom. If the rover could only roll down a foot at a
time, which direction would it roll first? Well, since the goal is to get to the
bottom it is reasonable to expect that the rover would want to go as far down as
it can in that foot of distance covered. This would be achieved by finding the
slope (rise over run) for the one foot move in all the different directions the rover
can move. The first step taken should be in the direction where the reduction in
altitude is greatest, and this is the direction of greatest slope (steepest descent).
As the rover moves one foot at a time from each location, it goes lower and
lower till it can go no lower, telling the computer that it has reached the bottom
of the basin. Intuitively speaking, the gradient specifies the optimal direction of
descent.
Another important issue in gradient descent is to determine how much of the
training data set should be used. Note that the calculation of the cost function
uses the training data (see “Technical Detour 1” in the appendix). Therefore, each
step of gradient descent requires the algorithm to process the entire training data
set to calculate the gradient. This causes severe slowing down of gradient descent
when the training data is large. Mini-batch gradient descent is a clever trick that
lets us get around this problem. In this method the whole training data is divided
into smaller mini batches. These mini batches are then used for training, and the
advantage is that weight updating can start as soon as we process a mini batch
rather than having to wait to process the entire training data before updating
weights. Thus, the batch size, say B, is another hyper-parameter in tuning neural
networks. When B = N, that is, when the entire training data is used, we are back
to the standard gradient descent, also called batch gradient descent. In the polar
opposite case, when B = 1, we have stochastic gradient descent (SGD). SGD is
often used when we have real-time streaming data. In these cases, we perform
online learning, where the gradients are calculated and weights updated for each
single training data point and then the results are averaged to get an estimate of
the weights. Though an infinite sequence of incoming training data was an early
motivation for weight updating using each incoming training data point, since by
definition one could not wait for the “entire” data, this idea is now routinely used
even when there is a finite batch of data available. In SGD first there is a random
permutation of the training data, and then data points are drawn one by one
without replacement. After each draw the gradient is calculated and the weight
updating is done. The weight estimate is calculated as the average of all the
weights so calculated during a pass over the entire data. In the boxed text here, we
summarize our discussion.
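The batch-size spectrum can be sketched as follows. The one-parameter model (y = w·x), the synthetic data, and the learning rate are all assumptions for illustration, and for simplicity this version keeps the final weight rather than averaging the weights over a pass, as the averaged SGD variant described above does:

```python
import random

# Synthetic data: y is roughly 2*x plus a little noise.
random.seed(0)
data = [(x, 2.0 * x + random.uniform(-0.1, 0.1)) for x in range(1, 21)]

def train(batch_size, epochs=30, lr=0.0001, w=0.0):
    examples = data[:]
    for _ in range(epochs):
        random.shuffle(examples)                 # random permutation of the data
        for i in range(0, len(examples), batch_size):
            batch = examples[i:i + batch_size]   # draw one mini batch
            # Gradient of the sum-of-squares cost over this mini batch only;
            # the weight is updated as soon as the batch is processed.
            grad = sum(2 * (w * x - y) * x for x, y in batch)
            w -= lr * grad
    return w

w_batch = train(batch_size=len(data))   # B = N: batch gradient descent
w_mini  = train(batch_size=5)           # 1 < B < N: mini-batch gradient descent
w_sgd   = train(batch_size=1)           # B = 1: stochastic gradient descent
print(round(w_batch, 2), round(w_mini, 2), round(w_sgd, 2))
```

All three settings recover a weight near 2; what differs is how often the weight is updated per pass over the data, which is exactly the trade-off that makes B a tuning hyper-parameter.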
Executive Summary
Training a machine learning model refers to the process of learning (deter-
mining) the unknown parameters using the training data. For example, for
neural networks the unknown parameters are the weights and biases. The usual
method is to learn the parameters by minimizing a cost function (also called
loss or error). The cost function is a measure of model performance, and the
best performing models have the minimum cost. The cost function calculates
the difference between the model prediction, of some managerially relevant
performance indicators such as sales, profit, margin, and market share, and the
observed target.
Training via cost minimization is done using gradient descent. The term
“descent” refers to the goal of reducing the cost function till we achieve its
minimum. In multidimensional space, there could be many possible directions
of descent, and the gradient specifies the direction, which will result in the
maximum reduction of the cost function. Intuitively speaking, in gradient
descent the weights are updated in a stepwise manner, where each step is
taken in a direction that reduces the cost the most. The process stops when we
reach the minimum cost, and the weights corresponding to the minimum cost
are the desired “optimal” weights. Gradient descent requires the analyst to
choose a step size (learning rate), which is usually done by cross-validation. A
small (large) learning rate will require more (less) time for convergence, but is
less (more) likely to overshoot the point of minimum cost.
If the training data set is large, all the data may not be used. The subset of
data chosen for training is called the mini batch, and gradient descent
calculated using it is called mini batch gradient descent. For streaming data,
the mini batch size is one since weights are updated using gradient descent for
each incoming training data point. This is the idea behind stochastic
gradient descent.
may need to be compared vis-à-vis other models of the same family but which
have been parameterized differently, or different types of models may need to
be compared to see how they are able to explain the data etc. A critical aspect of
the performance of a machine learning model is that we are mostly interested
in the performance of the model on new data that has not been used to train
(estimate) the model. Said more formally, the new data is test data and we want a
model that will have low test error. Since the ability of a model to fit test data well
is related to its ability to generalize beyond the training data, the criterion of
minimizing test error is also equivalently stated as minimizing generalization
error. Of course, the model itself is trained (estimated) using the training data and
by choosing parameters (weights and biases) that minimize the training error.
Thus, a good machine learning model will have to balance (1) training error, and
(2) test error. While the machine learning model seeks to minimize training
error, it also needs to minimize the gap between test error and training error. In
terms of the underfitting-overfitting dichotomy, discussed in detail in Section 1 of
Chapter 3, a large gap between training and test error is a sign of overfitting.
Having made the case for assessing performance using the test (generaliza-
tion) error, which we will define formally in Chapter 3, it is worth having an
intuition for why training error cannot be used as a good estimate of test error.
Indeed, a central tension, as far as performance assessment is concerned, is the
fact that often these two errors have very different behaviors. Very complex
models are better able to capture the underlying patterns, and perhaps even
idiosyncrasies, of the training data, and yet, precisely for this reason may not be a
good fit with new data that was not part of the training – namely, the test data. In
other words, a very complex model is likely to have very small training error but
a large test error. The specific functions used for model assessment and evalua-
tion depend on the model. While the technical statistics literature has used many
assessment measures, we will only present the most commonly used ones in
applications.
The major drawback of the hit rate or PCC is that this is an overall measure
and does not account for individual class performance. If we look beyond overall
performance and also consider how the classifier performs with respect to each
individual class then we realize that PCC has serious shortcomings. A visual
depiction will make this concept clear. Consider a simple example of classifying
instances into classes "1" and "2" where there are (many) more "2"s than "1"s
in the data set (Fig. 1.3).
could be very high. Thus, even though the overall error rate may be low, among
the customers who will actually buy (column Actual “1”) there may be an
unacceptably high prediction error rate. The salesforce will then not make calls on
many customers who most likely would buy and this flies in the face of the selling
strategy of the firm. An obvious way to correct this would be to change the
classification threshold so that a new customer x is assigned to class "1" if
Pr{Class = "1"} > 0.2. This will increase the overall error (since we are moving
away from the optimal threshold of 0.5 for binary classification) but, importantly,
will reduce the error rate among the critical group of customers who will actually buy.
Depending on the application context, decision makers may be willing to make
such tradeoffs.
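The tradeoff just described can be made concrete with a small sketch. All the predicted probabilities below are invented for illustration; they are constructed so that lowering the threshold raises the overall error while sharply cutting the error among actual buyers.

```python
# Hypothetical predicted purchase probabilities: 10 actual buyers ("1")
# and 90 actual non-buyers. All numbers are made up for illustration.
buyers = [0.55, 0.45, 0.40, 0.35, 0.30, 0.28, 0.25, 0.22, 0.15, 0.10]
non_buyers = ([0.45, 0.40, 0.35, 0.30, 0.28, 0.26, 0.24, 0.22, 0.21, 0.20]
              + [0.05] * 80)

def miss_rate_among_buyers(threshold):
    """Share of actual buyers that the classifier fails to flag."""
    return sum(p < threshold for p in buyers) / len(buyers)

def overall_error(threshold):
    """Overall misclassification rate at the given threshold."""
    missed = sum(p < threshold for p in buyers)             # buyers not flagged
    false_alarms = sum(p >= threshold for p in non_buyers)  # non-buyers flagged
    return (missed + false_alarms) / (len(buyers) + len(non_buyers))

# Lowering the threshold from 0.5 to 0.2 raises the overall error
# (0.09 -> 0.12) but cuts the miss rate among buyers from 0.9 to 0.2.
```

With these numbers the salesforce would reach 8 of the 10 likely buyers instead of only 1, at the cost of a few more wasted calls.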
2.2.3 Receiver Operating Characteristics (ROC) Curve and Area under the Curve
(AUC)
As is clear from the above discussion, different classification thresholds lead to
different entries in the cells of the confusion matrix. That is, for each possible
classification threshold there is a confusion matrix corresponding to it. To see this
more clearly, consider the following table, showing just 10 instances (data
points) from a larger data set of 20 instances, of which 10 have true class "1"
and 10 have true class "2".
In Fig. 1.5, the 10 instances have been sorted in decreasing order of the prob-
ability of being in the "1" class as predicted by our probability model (say, logistic
regression). This is the column titled Pr{Class = "1"}. Suppose our classification
threshold is 0.8. This means that if the predicted probability of an instance is
greater than or equal to 0.8 then we classify it as "1", and if the predicted probability
is less than 0.8 then we classify it as "2". We can see from the column "Pr{Class =
'1'}" that 5 of the 10 instances are classified as "1" (since there are 5 probabilities
greater than or equal to 0.8). The confusion matrix corresponding to a classification
threshold of 0.8 is the 3 × 3 matrix on the right. We can see that the sum of cell
entries in row "Predict '1'" is 5. Now, from the column "True Class" we can see
that, of the five instances that are predicted to be in the "1" category, only three have a
18 Machine Learning and Artificial Intelligence in Marketing and Sales
true class of "1" and two have a true class of "2". Hence the numbers 3 and 2 in
the row "Predict '1'". Similarly, if the classification threshold is 0.584 then we
have the confusion matrix shown in the 3 × 3 matrix on the right in Fig. 1.6.
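The mechanics of filling in a confusion matrix from a threshold can be sketched as follows. The 20 scores below are hypothetical, chosen only so that they reproduce the counts discussed above (5 instances predicted "1" at threshold 0.8, of which 3 are truly "1" and 2 are truly "2").

```python
# Hypothetical scores: 10 instances of true class "1", 10 of true class "2".
data = [(0.95, "1"), (0.90, "1"), (0.88, "2"), (0.85, "2"), (0.80, "1"),
        (0.75, "2"), (0.70, "1"), (0.65, "2"), (0.60, "1"), (0.55, "2"),
        (0.50, "1"), (0.45, "2"), (0.40, "1"), (0.35, "2"), (0.30, "1"),
        (0.25, "2"), (0.20, "1"), (0.15, "2"), (0.10, "1"), (0.05, "2")]

def confusion_matrix(data, threshold):
    """Counts keyed by (predicted class, true class) at the given threshold."""
    counts = {("1", "1"): 0, ("1", "2"): 0, ("2", "1"): 0, ("2", "2"): 0}
    for prob, true in data:
        pred = "1" if prob >= threshold else "2"
        counts[(pred, true)] += 1
    return counts

cm = confusion_matrix(data, 0.8)
# cm[("1", "1")] -> 3 true positives; cm[("1", "2")] -> 2 false positives
```

Changing the threshold argument regenerates the whole matrix, which is exactly the observation that motivates the ROC curve below.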
The Receiver Operating Characteristics (ROC) curve is a simple graphical
device that depicts the two rates – the true positive rate and the false positive
rate – simultaneously for all possible thresholds. Fig. 1.7 shows a typical ROC
curve.
The point (0.2, 0.3) corresponds to the threshold of 0.8 as shown in Fig. 1.5. If
we convert the cell numbers in the confusion matrix in Fig. 1.5 to rates, then the "False
positive rate" = 2/(2 + 8) = 0.2 and the "True positive rate" = 3/(3 + 7) = 0.3.
The point (0.5, 0.5) in the ROC curve in Fig. 1.7 corresponds to the threshold of 0.584
in Fig. 1.6.
The better a classifier is, the higher will be the proportion of true positives and
so the ROC curve will hug the left and top lines (top left corner) more and more.
Clearly, the point (true positive rate = 1, false positive rate = 0) corresponds to perfect
classification. The dashed 45° line corresponds to random chance. While the ROC
curve has a lot of information about how the relative proportions of true and false
positives change as the threshold changes, sometimes it is useful to have one single
summary measure of the performance of a binary classifier. This summary
measure is the Area Under the Curve (AUC), and as the name suggests it is simply
the area under the ROC curve. A classifier that is no better than random chance
will have an AUC of 0.5. This is because the diagonal line represents the state of
random chance, and clearly the area below it is 0.5 as shown in Fig. 1.8. Perfect
classification means that the ROC curve coincides with the top and left margins
and so the area under it is 1. Thus, the AUC lies between 0.5 and 1. Finally, it can
be shown that, up to a transformation, the AUC is equivalent to the Gini coef-
ficient which we discuss later in this section.
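The threshold sweep behind an ROC curve, and the AUC computed from it, can be sketched as follows. This is an illustrative implementation, not the book's; the class labels "1"/"2" follow the running example, and the trapezoidal rule is one standard way to compute the area.

```python
def roc_points(scores, labels, positive="1"):
    """(false positive rate, true positive rate) pairs over all thresholds."""
    pos = sum(l == positive for l in labels)
    neg = len(labels) - pos
    pts = {(0.0, 0.0), (1.0, 1.0)}          # the curve's two end points
    for t in sorted(set(scores)):           # one confusion matrix per threshold
        tp = sum(s >= t and l == positive for s, l in zip(scores, labels))
        fp = sum(s >= t and l != positive for s, l in zip(scores, labels))
        pts.add((fp / neg, tp / pos))
    return sorted(pts)

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every "1" above every "2" gives AUC = 1, a useless one gives 0.5, and the Gini coefficient mentioned above is 2·AUC − 1.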
(1) Given a probability model, generate the predicted probability of "1" (positive
response) for each customer in the test sample.
(2) Sort the customers in decreasing order of their predicted probabilities of "1".
(3) Draw a graph with the "Percentage of customers" (in this sorted order) on
the X-axis and the lift on the Y-axis.
Introduction and Machine Learning Preliminaries 21
The table below shows a typical lift (gains) chart (Fig. 1.10).
While the entire lift curve is informative, sometimes we want just one summary
measure to gauge the performance of a classifier. A convenient summary is the top
decile lift, namely the lift achieved in the top decile. Another summary measure is the variation
in response rates across all 10 deciles. A visual depiction of the lift is given by a lift
(gains) curve. A lift (gains) curve can be drawn by using lift as Y-axis and the
“Percentage of customers” on X-axis in decreasing order of their predicted prob-
abilities of “1”.
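The decile lifts behind such a chart can be sketched as follows. The function is an illustrative implementation; the scores and responses in the usage example are hypothetical, constructed so that the 20 best-scored of 100 customers all respond.

```python
def decile_lift(scores, responses):
    """Lift per decile: the decile's response rate divided by the overall
    rate, with customers sorted by decreasing predicted probability of "1"."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n = len(scores)
    overall = sum(responses) / n
    lifts = []
    for d in range(10):
        idx = order[d * n // 10:(d + 1) * n // 10]   # customers in decile d
        lifts.append(sum(responses[i] for i in idx) / len(idx) / overall)
    return lifts

# Hypothetical perfect ranking: the 20 highest-scored of 100 customers
# all respond, so the first two deciles show a lift of 5.0 and the rest 0.0.
scores = [1 - i / 100 for i in range(100)]
responses = [1] * 20 + [0] * 80
lifts = decile_lift(scores, responses)   # lifts[0] is the top decile lift
```

The first element is the top decile lift discussed above, and the spread of the ten values is the across-decile variation measure.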
Since AR = 0.5, the Gini coefficient = 2AL − 1. This makes it clear that the
Gini coefficient ranges between 0 and 1. The perfect classifier would have a
Lorenz curve that coincides with the left and top borders of the figure, so
AL = 1 and hence the Gini coefficient = 1. On the other hand, the worst classifier,
one that does no better than random chance, has AL = 0.5 and hence a Gini
coefficient of 0.
TECHNICAL APPENDIX
Technical Detour 1
We present the expressions for the cost functions for regression and for a binary
classification problem. Suppose there are p predictor variables X1,…, Xp. There
are N observed data points (xi, yi), i = 1,…, N. Each observed input xi is a
p-dimensional vector xi = (xi1,…, xip), and the response corresponding to the ith
observation is yi. The sum-of-squares cost for a regression model is

C(θ) = Σ_{i=1}^{N} (yi − f(xi))²    (A1.1)
Consider the case of binary classification. The expression for the cross-entropy
cost (the negative of the log likelihood) is simplified if the response yi is treated as 0 or 1.
The probability of yi = 1 is obviously conditioned on xi and the parameter θ:
Prob(yi = 1 | xi, θ). To simplify notation, we will omit the conditioning arguments.
The cross-entropy cost is given by

C(θ) = − Σ_{i=1}^{N} { yi log Pr(yi = 1) + [1 − yi] log[1 − Pr(yi = 1)] }    (A1.2)
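The two costs (A1.1) and (A1.2) translate directly into code. This is a minimal sketch; the function names are ours, and the inputs are plain lists of targets and model outputs.

```python
import math

def sum_of_squares_cost(y, preds):
    """Regression cost (A1.1): sum of squared residuals over the N points."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, preds))

def cross_entropy_cost(y, probs):
    """Binary cross-entropy cost (A1.2), the negative log likelihood,
    with y_i coded 0/1 and probs[i] = Pr(y_i = 1)."""
    return -sum(yi * math.log(p) + (1 - yi) * math.log(1 - p)
                for yi, p in zip(y, probs))
```

Minimizing `cross_entropy_cost` over the model's parameters is exactly maximum likelihood estimation, as Technical Detour 2 shows.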
Technical Detour 2
By definition, the maximum likelihood estimator for θ is

θ_ML = argmax_θ p_model(X; θ) = argmax_θ ∏_{i=1}^{N} p_model(xi; θ)    (A1.3)

where the last expression follows because the xi are independent. The product term
in the last expression is the likelihood function. Because the logarithm is mono-
tonic, taking the logarithm of the likelihood function does not change the optimal
choice of the parameter θ. This gives us the log likelihood. It can be shown that this
gives us an equivalent definition of the MLE,

θ_ML = argmax_θ E_{x∼p̂_data} log p_model(x; θ)    (A1.4)
The negative of the log likelihood is the cost that needs to be minimized. This
cost is called the cross-entropy between the empirical data distribution and the
model distribution, and is defined as

C(θ) = − E_{x∼p̂_data} [log p_model(x; θ)]    (A1.5)
We can see that minimizing this cost will give us the u that is consistent with
maximum likelihood estimation.
Chapter 2
Chapter Outline
1. Introduction to Neural Networks
1.1 Early Evolution
1.2 The Neural Network Model
1.2.1 NN for Regression
1.2.2 NN for Classification
1.3 Cost Functions and Training of Neural Networks Using Backpropagation
1.4 Output Nodes
1.4.1 Linear Activation Function for Continuous Outputs
1.4.2 Sigmoid Activation Function for Binary Outputs
1.4.3 Softmax Activation Function for Multiclass Outputs
2. Feature Importance Measurement and Visualization
2.1 Neural Interpretation Diagram (NID)
2.2 Profile Method for Sensitivity Analysis
2.3 Feature Importance Based on Connection Weights
2.4 Randomization Approach for Weight and Input Variable Significance
2.5 Partial Derivatives Approach
3. Applications of Neural Networks in Marketing and Sales
4. Case Studies
Technical Appendix
where various “nodes” received inputs from some nodes and then, in turn, acted
upon other nodes based on whether they were sufficiently activated, that is, if their
activation exceeded a threshold. In the machine learning terminology, nodes are
sometimes referred to as units and we will switch between the two. This creates a
network of nodes, with nodes in the initial layer acting on nodes in intermediate
layers, until finally, some desirable outputs are obtained from the output layer.
This architecture of a neural network is loosely based on the patterns of con-
nections between neurons in the human brain where a given neuron receives
electrical signals through dendrites emanating from a preceding layer of neurons.
The recipient neuron then, in turn, sends spikes of electrical signals to other
neurons when excitatory input dominates inhibitory inputs. Thus, the electrical
signals either excite or dampen activity in connected neurons and learning occurs
by changing the influence of neurons on other neurons. Of course, this is a gross
idealization of the human brain since much of the working of the brain is not yet
understood.
At a very simple level, a neural network (NN), sometimes also called an
artificial neural network (ANN), has training and testing modes. Firing rules are
the key to how a NN learns. In the training mode the NN is trained to fire/not fire
for different input patterns (in training data). In the testing mode the NN checks
whether a trained input pattern is detected in new, non-training data. If it is, the
corresponding output becomes the current output. If not, the NN uses the firing
rule to determine the current output. In more recent times there has been considerable
work on the Bayesian approach to neural networks. In any case, Bayesian neural
networks build on the fundamental ideas of neural networks discussed in this
chapter, and we will not discuss the additional complexities of the Bayesian
approach here.
Executive Summary
Neural networks can be conceptualized as fitting complex functions to explain
relationships in the data. Seen through the lens of fitting a complex function
to explain real-world relationships, neural networks provide a very flexible
architecture for replicating such complex functions by weighted sums of
simpler functions.
Below we will give more details on how a neural network implements this idea
of expressing a highly complex, nonlinear, function f (xi) in terms of a weighted
sum of other simpler (basis) functions fm(xi).
affected by the condition of the property in terms of upgrades and updates, square
footage, number of bathrooms, the area of the basement, if any, and macro
factors which include economic indicators like incomes, interest rates, employ-
ment rates etc. We use two summary measures called “condition” and “mac-
ro_factors” for these other determinants of rent value, apart from location. So
here p = 3 and the input vector is xi = (xi1, xi2, xi3) = (distance to city center,
condition, macro factors). In machine learning terminology, the inputs are called
features. Vector notation and vector calculations allow exposition of calculations
with multiple features in simple neat forms instead of having to deal with
cumbersome notations. Moreover, computer programs and libraries are written in
code that is optimized for vector and matrix calculations.
To clearly see the components of a NN, consider the simplest NN with one
hidden layer containing just one hidden node, apart from the p input nodes and
the one output node as shown below. To keep the notation in the figures simple,
we will suppress the index i for the ith observation vector xi = (xi1,…, xip) and
instead use the vector of input features x = (x1,…, xp) and target y (Fig. 2.3).
The p input nodes correspond to the p-dimensional input vector x = (x1,…, xp),
and the single output node corresponds to the one-dimensional output y. In a
neural network the input information received at each hidden node (or simply,
input at the hidden node) is a weighted sum of the p input nodes (x1,…, xp) plus a
bias at that hidden node. Thus, the input at a specific hidden node – say, node m –
is: wm0 + wm1x1 + wm2x2 + … + wmpxp. In this summation the quantity wml is the
weight corresponding to input node xl (l = 1,…, p) and wm0 is the bias at hidden
node m. It may be useful to visualize this in our NN in Fig. 2.4, where there is a
single hidden node m.¹ We have circled the input information at hidden node m (in
dashes) to draw attention to it.
At each hidden node a function, called an activation function, acts on the
information received at that node, i.e., on the weighted sum of input nodes. The
activation function, as the name suggests, acts as a fence or control valve, which
1. As shown in "Technical detour 2" the summation wm0 + wm1x1 + wm2x2 + … + wmpxp
can be written conveniently as Σ_l wml xl.
activates/opens to allow the information from the hidden neuron to proceed to the
output node. Fig. 2.5 shows the activation function fm acting at the hidden node m
and creating an output hm that goes to the output node (see expression in dashes).
So far, we have discussed a NN with just one hidden node to simply illustrate
the information flow in the network. Of course, in most cases there are many
hidden nodes. Suppose there are M hidden nodes and one output y as shown in
Fig. 2.6.
Each hidden node has an activation function acting on it, where the activation
function at hidden node m (m = 1,…, M) is fm.² Just as the input information
received at a hidden node is a weighted sum of input nodes, so too the input
information received at the (single) output node is a weighted sum of the hidden
nodes. One can easily see that the output node therefore receives information
which is a weighted sum of functions, since an activation function acts at each
hidden node. Recall our conceptualization of a NN as representing complex
functions by a weighted sum of simpler (basis) functions. One could think of the M
hidden nodes (indexed by m = 1,…, M) as playing the role of M basis functions,
and the information received at the output node as a weighted sum of the basis
2. The reader should note that hm is just the output from hidden node m after the
activation function fm acts on the input information received at that node (see Fig. 2.5).
Neural Networks in Marketing and Sales 31
functions. In the standard NN the activation functions at all hidden nodes have the
same functional form and differ only in how the inputs are weighted. Thus, in
Fig. 2.6, fm = f for all hidden nodes m = 1,…, M.
To see a concrete illustration of how different weight vectors are used for
different hidden nodes, let us revisit the “rent value” example. The activation
function at the first hidden node would act on a weighted sum of the three inputs
(recall that there is also a constant). So, the activation function for the first hidden
node acts on

constant1 + a*distance_to_city_center + b*condition + c*macro_factors

In this case p = 3 and the weight vector for m = 1 (hidden node 1) is w1 = (w10,
w11, w12, w13) = (constant1, a, b, c). Similarly, the activation function at hidden
node 2 would act on

constant2 + d*distance_to_city_center + e*condition + f*macro_factors

The weight vector for m = 2 (hidden node 2) is w2 = (w20, w21, w22, w23) =
(constant2, d, e, f).
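The computation at the two hidden nodes can be sketched directly. The numeric weights standing in for (constant1, a, b, c) and (constant2, d, e, f) below are hypothetical, and the sigmoid is used as the activation function, one common choice discussed later in the chapter.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def hidden_outputs(x, weight_vectors):
    """Output h_m of each hidden node: the activation function applied to
    the bias plus the weighted sum of the p input features."""
    outs = []
    for w in weight_vectors:     # w = (bias w_m0, w_m1, ..., w_mp)
        z = w[0] + sum(wl * xl for wl, xl in zip(w[1:], x))
        outs.append(sigmoid(z))
    return outs

# Hypothetical "rent value" example with p = 3 features:
# x = (distance_to_city_center, condition, macro_factors)
x = (2.0, 0.5, 1.0)
w1 = (0.1, -0.8, 0.4, 0.3)    # stands in for (constant1, a, b, c)
w2 = (-0.2, 0.5, 0.2, 0.1)    # stands in for (constant2, d, e, f)
h1, h2 = hidden_outputs(x, [w1, w2])
```

The two hidden outputs h1 and h2 would then be combined, with their own weights and a bias, at the output node.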
We purposely motivated a NN using the conceptualization of these networks
as models that fit complex functions to capture relationships in the data. This
conceptualization can serve to allay much of the (sometimes unfair) character-
ization of a NN as a black box. Even though the interpretation of the weights may
not be straightforward, the connection with basis function regression, which has a
long history in the statistics literature, tells us that the simple NN model is hardly
a mysterious formulation.
We now turn to the output node. Just as an activation function acts on the
input information received at a hidden node, similarly we have an activation
function that acts on the input information received at the output node. One
motivation for this is that some contexts, such as classification, require a final
transformation at the output nodes. Unlike regressions, which we have discussed
so far and which have a continuous dependent variable, classification requires the
calculation of a probability of class membership. Probabilities should be numbers
in the interval [0, 1] and all probabilities should add to 1. Since the un-
transformed raw output from the output node can take any value (which vio-
lates the requirement for a probability), the activation function at that node
transforms the un-transformed values into probabilities (see Section 1.2.2).
To distinguish activation functions acting at the hidden layer versus the output
layer, we use the superscripts "(1)" and "(2)" respectively. Recall, as stated earlier
in the current section, the activation functions at all nodes in the hidden layer
have the same functional form. We denote the activation function at the hidden
layer as f (1) and at the output layer as function f (2). To fix ideas, it may be useful to
visualize the full information flow in a NN with one hidden layer. In the interests
of not cluttering Fig. 2.7, we will again show only one hidden node – node m – but
display the activation functions at both the hidden node and the output node. As
with the activation functions, to distinguish the weights at the hidden and output
nodes we use superscripts “(1)” and “(2)” respectively.
Fig. 2.7 shows only one hidden node merely for simplicity, but in a more
general network with M hidden nodes (as in Fig. 2.6) the activation function f (2)
at the output node will act on the weighted sum of the hidden nodes h1,…, hM plus a
bias at the output node.
Executive Summary
Consider a neural network with one hidden layer. Each hidden node in this
layer has an activation function that acts on the information received at that
node. The information received at a hidden node is a weighted sum of all the
input nodes plus a bias. The activation functions at different hidden nodes
have the same functional form and only differ in the weights assigned to the
weighted sum of input nodes.
weights from the two hidden nodes to the output node are opposite in sign ("a,"
"b," "c," and "d" are all positive quantities).
The information received at the output node is a weighted sum of two sigmoid
activation functions with a positive and negative weight respectively. What is the
resultant shape? Shown below in Fig. 2.12 are a positively and a negatively
weighted sigmoid. It is easy to see that the weighted sum of these two can capture
an “inverted-U shaped” relationship.
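This stitching of two sigmoids can be sketched in a couple of lines. The shifts of +2 and −2 below are arbitrary illustrative values that pull the two S-curves apart; any similar offsets would do.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def inverted_u(x):
    """Weighted sum of a positively and a negatively weighted sigmoid;
    the offsets +2 and -2 are hypothetical and just separate the S-curves."""
    return sigmoid(x + 2) - sigmoid(x - 2)

# inverted_u(x) rises, peaks at x = 0, then falls: an inverted-U shape.
```

Evaluating the function on a grid and plotting it reproduces the shape in Fig. 2.12: low in both tails and highest in the middle.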
By suitably stitching together many sigmoids with suitably adjusted weights, a
NN can capture many complex relationships between the input and output var-
iables. In our specific “rent value” example, one can now see how a NN with
several hidden nodes, each with a sigmoid activation function acting on it, can
capture the nonlinear pattern shown in Fig. 2.1. A more technical description of
the network in Fig. 2.10 with one input node, two hidden nodes with sigmoid
activations, and one output node is given in "Technical detour 3."
Technical Detour 3
Learning and there are many other aspects of the architecture of deep networks
(Convolutional NNs, Recurrent NNs, etc.) that are much more complex than
simple single hidden layer NNs. Deep learning is an evolving field and its business
applications are still nascent. In the current chapter we will only cover NNs with a
single hidden layer.
3. It is now clear why sometimes the final transformation f (2) is needed at the output
layer. For classification, this is the transformation that generates class probabilities
from the un-transformed, and unrestricted, continuous output that the NN would
otherwise yield.
4. At output node k the activation function is f (2)(wk(2)·hi). This function acts on
wk(2)·hi, which is the weighted sum of the hidden nodes (see the dot product
notation in "Technical detour 2"). The weight vector is wk(2) = (wk0(2), wk1(2),…,
wkM(2)) and the vector of hidden nodes is hi = (h0i, h1i,…, hMi), where h0i = 1.
Technical Detour 4
Executive Summary
The cost function provides a measure of how the neural network model is
performing. Performance is measured in terms of how closely the model is able
to predict the actual observed data. Since the cost function is calculated using
the actual data, it is affected by the size of the actual training input data set,
the weights and biases used in specifying the network, and the activation
functions. The more commonly used cost functions are quadratic cost and
cross entropy for neural networks for regression and classification problems
respectively.
(1) Suppose we are at step r. Given a set of weights (as in (2.1)), the forward pass
computes the predicted f (xi) for input data xi.
(2) Then for the backward pass the errors are calculated at the nodes in the
output layer and are backpropagated using a recursion to compute the errors
at the hidden nodes.
(3) Both sets of errors are then used in calculating the gradients at the outer and
hidden layers.
(4) Finally, the gradients are used in the gradient descent updating rules. This
yields the weights at step (r + 1). The algorithm proceeds iteratively until some
stopping rule terminates it.
Why is the training algorithm for neural networks called the backpropagation
algorithm? From (2.1) we can see that we need to estimate the weights at both the
hidden and output layers. It turns out that when gradient descent is applied to a
neural network, the gradients with respect to the weights at both the hidden and
output layers involve the difference between the model prediction f (xi) and the
actual observed yi. These discrepancies are therefore errors which are defined at
the hidden and output nodes. The algorithm works by propagating this error from
the output node back to the hidden node in a convenient recursive manner (see
step 2 in algorithm above). Since computers can easily handle recursive rela-
tionships using logical “do loops,” the backpropagation algorithm allows efficient
calculations. Detailed derivations of the backpropagation equations are given in
the appendix.
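The four steps above can be sketched for a one-hidden-layer network with sigmoid hidden nodes, a linear output, and the squared-error cost. This is an illustrative implementation, not the book's derivation: the starting weights and the single training observation in the usage example are hypothetical.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_step(x, y, W1, b1, w2, b2, lr):
    """One forward + backward pass of backpropagation."""
    p, M = len(x), len(W1)
    # Step 1: forward pass - hidden outputs h_m and prediction f(x)
    h = [sigmoid(b1[m] + sum(W1[m][l] * x[l] for l in range(p)))
         for m in range(M)]
    f = b2 + sum(w2[m] * h[m] for m in range(M))
    # Step 2: error at the output node, backpropagated to the hidden nodes
    delta_out = f - y
    delta_hid = [delta_out * w2[m] * h[m] * (1 - h[m]) for m in range(M)]
    # Steps 3 and 4: gradients plugged into the gradient-descent updates
    b2 -= lr * delta_out
    for m in range(M):
        w2[m] -= lr * delta_out * h[m]
        b1[m] -= lr * delta_hid[m]
        for l in range(p):
            W1[m][l] -= lr * delta_hid[m] * x[l]
    return b2, f

# Hypothetical starting weights; repeatedly training on one observation
# shows the prediction converging toward the target.
x, y = (0.5, 1.0, -0.5), 0.8
W1 = [[0.1, -0.2, 0.3], [-0.1, 0.2, 0.1]]
b1 = [0.0, 0.1]
w2 = [0.5, -0.5]
b2, f = 0.0, 0.0
for _ in range(200):
    b2, f = train_step(x, y, W1, b1, w2, b2, lr=0.1)
```

Note how the hidden-node errors `delta_hid` are obtained recursively from the output error `delta_out`, which is exactly the propagation of errors backward that gives the algorithm its name.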
Technical Detour 5
Technical Detour 6
6. Corresponding to output node k, the weighted sum of the hidden nodes is
z_ki = Σ_{m=0}^{M} wkm(2) hmi.
In this chapter we consider (shallow) neural networks with only one hidden
layer, apart from an input and an output layer. Therefore, there are links from
nodes in the input layer to nodes in the hidden layer, and links from nodes in the
hidden layer to nodes in the output layer. The NID method is purely graphical and
does not quantify the magnitude of the impact of an input variable on the output.
Instead it provides directional inferences about the impact of input variables. For
a neural network with only one hidden layer, the effect of an input variable is
positive if there are: (1) positive input-hidden and positive hidden-output links, or
(2) negative input-hidden and negative hidden-output links. The effect of an
input variable is negative if there are: (1) positive input-hidden and negative
hidden-output links, or (2) negative input-hidden and positive hidden-output
links.
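The sign rules above can be sketched as a small function. This is our illustrative encoding, not part of the NID method itself: each input-to-output path contributes positively when its two links share a sign and negatively otherwise, and a clean directional inference requires all paths to agree.

```python
def input_effect_sign(w_input_hidden, w_hidden_output):
    """Directional effect of one input through a single hidden layer.
    w_input_hidden[m]: link from the input to hidden node m;
    w_hidden_output[m]: link from hidden node m to the output."""
    path_signs = [1 if wih * who > 0 else -1
                  for wih, who in zip(w_input_hidden, w_hidden_output)]
    if all(s > 0 for s in path_signs):
        return "positive"
    if all(s < 0 for s in path_signs):
        return "negative"
    return "mixed"       # paths disagree; NID gives no clean direction
```

For example, negative links from an input to every hidden node combined with positive hidden-output links (the price pattern in Fig. 2.15) yield "negative".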
As an illustration, consider the problem of classification where the task is to
classify industrial buyers of a company, say PB, that sells spirometers. Spirom-
eters are medical devices used for measuring the volume of air inhaled and
exhaled by the lungs. They are used, among many other things, to diagnose the
lung capacity of people with COPD. PB sells to institutions like hospitals and
smaller clinics and it wants to predict whether a medical entity will “Buy” or “Not
buy” their product. This binary dependent variable was called “ChoicePB.” The
Fig. 2.15. NID for Neural Network for Predicting Choice of PB.
company has data from past purchases and wants to use this data to make their
binary prediction. The predictor (independent) variables used are: 1. Price;
2. Storage/retrieval; 3. Repairs/service (low cost); 4. Sanitary; 5. Supplies (avail-
ability); 6. Easy to operate; 7. Service (quick response); 8. Accuracy (provides
accurate readings). Using the statistical package R, we fit a neural network with
one hidden layer containing three nodes, and the NID for this network is dis-
played in Fig. 2.15.
We will demonstrate the interpretation of directional impact of an input
variable on the output using the case of price. We choose price since the
directional impact of price on the probability of choosing a product is simple
– price should have a negative impact. In Fig. 2.15, we can see that the link
from price to all the hidden nodes H1, H2 and H2, are negative (gray lines).
Further, the links from all the hidden nodes H1, H2 and H3 to the output
node are positive (black lines). Thus, the overall effect of price on the
probability of choosing PB is negative, consistent with our intuition. Among
the non-price attributes, we can see that "Storage" has a stronger impact than
“Repair” based on the thickness of the lines, and the effect of both these
variables is positive.
Step 1: Calculate the contribution of each input neuron to the output via each
hidden neuron, computed as the product of the input-hidden and hidden-output
connection weights (Fig. 2.19). For example, c11 = w11 × wO1 = 0.510 × 1389 ≈ 708.
We can fill in the other cells similarly to obtain
Step 2: Calculate the relative contribution of each input neuron to the outgoing
signal from each hidden neuron (Fig. 2.20). For example,
r11 = |c11| / (|c11| + |c21| + |c31|) = 708/(708 + 518.9 + 433.37) = 0.43.
The other entries in the columns corresponding to "Hidden h1" and "Hidden h2"
in Fig. 2.20 can be calculated similarly. The row sums are, for example,
S1 = r11 + r12 = 0.43 + 0.53 = 0.96, etc.
Step 3: Finally, the relative importance of each input can be calculated using
the row sums in the last column in step 2 (Fig. 2.21). For example,
RI1 = S1/(S1 + S2 + S3) × 100 = 0.96/(0.96 + 0.63 + 0.41) × 100 = 48%. This
quantifies the relative importance of the input variables.
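The three steps can be sketched as one function. This is an illustrative implementation of the connection-weights procedure; the toy weights in the usage example are hypothetical, not the values from the figures.

```python
def garson_importance(w_ih, w_ho):
    """Relative importance (%) of each input via the three steps above.
    w_ih[m][l]: weight from input l to hidden node m;
    w_ho[m]: weight from hidden node m to the output."""
    M, p = len(w_ih), len(w_ih[0])
    # Step 1: contribution of input l to the output via hidden node m
    c = [[w_ih[m][l] * w_ho[m] for l in range(p)] for m in range(M)]
    # Step 2: share of each input in each hidden node's outgoing signal
    r = [[abs(c[m][l]) / sum(abs(c[m][k]) for k in range(p))
          for l in range(p)] for m in range(M)]
    s = [sum(r[m][l] for m in range(M)) for l in range(p)]   # row sums S_l
    # Step 3: normalize the row sums into percentages
    return [100 * s[l] / sum(s) for l in range(p)]

# Toy check: two inputs, two hidden nodes, hypothetical weights.
ri = garson_importance([[1.0, 1.0], [1.0, 3.0]], [1.0, 1.0])
# the second input, carrying the larger weight, gets the larger share
```

The returned percentages sum to 100, mirroring the 48% computed for the first input in the worked example.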
• Construct many NNs, for different initial random weights, using the original data
• Select the NN with the best predictive performance, and record the initial random
weights used for this best-fitting NN. Calculate and record
The statistical significance of observed c11, c1, RI1 are calculated as the pro-
portion of randomized values equal to or more extreme than the observed values.
Let’s take the example of the relative importance (RI1) for a given predictor
variable. Clearly, if the observed RI1 is not very different from many of the
randomized RI1 then we would not be able to put much faith in this observed RI1.
In other words, this predictor would be non-significant.
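The significance calculation at the heart of this randomization approach is a simple empirical proportion. The function below is our illustrative sketch of that final step, applied to any statistic (c11, S1 or RI1) computed on the observed and the randomized networks.

```python
def empirical_p_value(observed, randomized):
    """Share of randomized statistics at least as large as the observed one;
    a small value suggests the observed importance is unlikely under chance."""
    return sum(r >= observed for r in randomized) / len(randomized)
```

If the observed relative importance sits well inside the spread of the randomized values, the proportion is large and the predictor is judged non-significant.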
with the selling function. So, while this will not allow an exhaustive discussion of
all the ways in which marketing has been impacted by machine learning, the
reader will get enough of a flavor of how these new technologies are disrupting
marketing. In terms of the impact of machine learning, and specifically NN, on
the stages of the selling process there is similarity between pre-approach and
approach so we combine them in our discussions. Similarly, objection handling
and close will be combined.
Perhaps the greatest impact of neural networks in sales has been in the pro-
specting stage. In this stage of the selling process the firm performs the tasks of
finding customers and qualifying them by scoring the potential customers based
on their propensity to purchase. From a sales perspective, these constitute the
firm’s lead generation and lead qualifying functions. Some authors use the term
prospecting only for “lead generation,” and put “lead qualification” as a separate
stage (see Exhibit 2.5, page 48, of Johnston & Marshall, 2013). Our focus is on the
function itself rather than in delineating the specific stage of the sales process in
which the function occurs. The consumers’ propensity to purchase has been
extensively used to obtain estimates of demand. Purchase propensities have also
been used to divide the customer base into more or less attractive segments, and
then to target a subset of these segments for the firm’s selling and marketing
efforts. In this way, targeting, which is a typical marketing function, has a strong
overlap with the lead qualification aspect of the sales process and we will discuss
them together. Therefore, under the broad umbrella of the prospecting stage we
discuss the inter-related topics of (1) demand estimation and sales forecasting, (2)
segmentation, targeting and positioning (STP), (3) lead generation, and lead
qualification. STP are typical marketing functions that have maximum overlap
with the initial stages of the sales process, which we call the “customer develop-
ment” stages of the process, where marketing-sales integration is the most intense.
Agarwal and Schorling (1996) is an early paper in marketing that has provided
evidence of the superior performance of neural networks for market share fore-
casting. The authors investigated whether artificial neural networks perform
better than the standard Multinomial logit (MNL) in predicting brand shares of
grocery products. They chose frequently purchased grocery products like catsup,
peanut butter, dishwashing liquid, etc. in a B2C retail context using a well-studied
data set from IRI Marketing Research. They divided their data into four clusters: all households (cluster 0); households purchasing 1 brand; households purchasing 2–3 brands; and households purchasing 4 or more brands. They used the same observations for both the neural network and MNL for a fair comparison and found
that neural networks outperform the MNL on many dimensions. Interestingly,
the performance of the neural network compared to the MNL is even better in
more complex situations like segments with more brands. In addition, the neural
network is less sensitive to the number of observations and robust to different
estimation periods. While Agarwal and Schorling (1996) have used a neural
network for classification in the binary choice context, most papers on sales
forecasting and demand forecasting have used neural networks in the regression
context. Zhang (2004) is an excellent book that details many applications of
neural networks for business forecasting. Coming to sales forecasting,
Neural Networks in Marketing and Sales 51
Carbonneau et al. (2008) and Thiesing and Vornberger (1997) use neural net-
works to forecast demand and show that these methods are superior to the
traditional forecasting methods like trend, moving average and linear regression.
In a major study, Hill, O’Connor, and Remus (1996) showed that, across monthly
and quarterly time series data, NNs performed significantly better than six sta-
tistical time series methods that were generated in a well-known forecasting
competition (Makridakis et al., 1982). The authors further note that NNs espe-
cially do better than the other methods for discontinuous time series data. The use
of NNs in forecasting aggregate market demand has been demonstrated by
Gruca, Klemz, and Petersen (1999), Hruschka (1993), and Yao, Teng, Poh, and
Tan (1998). NNs have been found to be very successful in handling chaotic time
series data that are very unlikely to obey simple linear relationships between the
independent variables (“input cells” in NN) and dependent variables (“output
cells”) assumed by traditional autoregressive methods that model linear stationary
processes (Landt, 1997; Lawrence, Tsoi, & Gilles, 1996). Therefore, neural net-
works, which do not assume any such relationships, and are very effective at
pattern recognition, are particularly suited for analyzing such time series data.
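To make the regression use of NNs for forecasting concrete, here is a minimal sketch (entirely our own, with a simulated series and an arbitrary lag depth and network size, not a model from the cited papers) of feeding lagged sales values to a small network:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Simulated monthly sales series with trend and seasonality -- illustrative only
t = np.arange(120)
sales = 100 + 0.5 * t + 20 * np.sin(t / 6.0) + rng.normal(0, 2, size=t.size)

# Turn the series into a supervised problem: predict sales[t] from the previous 3 months
LAGS = 3
X = np.column_stack([sales[i:len(sales) - LAGS + i] for i in range(LAGS)])
y = sales[LAGS:]

# A small one-hidden-layer network in place of a linear autoregression
nn = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
nn.fit(X[:-12], y[:-12])     # train on all but the last year
preds = nn.predict(X[-12:])  # forecast the held-out year
```

Because the network learns the input–output mapping from the data, no linear or stationary structure needs to be assumed in advance.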
The use of NNs for market segmentation has been documented by Bloom
(2005), Hruschka and Natter (1999), Krycha (1999), Balakrishnan, Cooper,
Jacob, and Lewis (1996), and Fish, Barnes, and Aiken (1995) among others.
Boone and Roehm (2002) show how retail segmentation can be done using
artificial neural networks. Indeed, Chattopadhyay, Dan, Majumdar, and Chak-
raborty (2012) have compiled a list of more than 1000 articles from all disciplines
that have used NN in segmentation!
Most papers in the area of segmentation use unsupervised neural networks
(Krotov & Hopfield, 2019) where both the input and output variables (units) are
segmentation criteria and with one hidden layer whose units are segment members
(Hruschka & Natter, 1999). A multinomial logit determines segment membership
for the hidden layer whereas for the output layer the segment memberships
obtained in the intermediate hidden layer are weighted by segmentation criterion
specific weights. The output units are obtained using a binomial logit function
which uses the weighted sum of the memberships over all segments. Hruschka and
Natter (1999) show that neural networks perform much better for segmentation
than the traditional K-means clustering approach.
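As a point of comparison, the traditional K-means benchmark can be sketched in a few lines; the segmentation criteria and segment count below are illustrative assumptions, not those of Hruschka and Natter (1999):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Illustrative segmentation criteria for 300 customers: e.g., price sensitivity,
# purchase frequency, and brand loyalty (all simulated here)
X = rng.normal(size=(300, 3))
X[:100] += 2.0  # plant one well-separated group so the clusters are meaningful

# Standardize criteria so no single one dominates the distance metric
X_std = StandardScaler().fit_transform(X)

# Classic K-means benchmark: each customer is assigned to exactly one segment
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)
segments = km.labels_
```

Unlike the neural network formulation described above, K-means produces hard, all-or-nothing segment assignments, which is one reason the weighted-membership NN approach can perform better.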
When it comes to targeting, NNs have mostly been used for targeting indi-
vidual customers rather than segments (which forms part of a firm’s STP strat-
egy). Zahavi and Levin (1997) have used NNs for targeting customers with
mailings. NNs have found very fruitful applications in lead generation and lead
qualification. This has been done both at the segment level and at the individual customer level – whether for one-on-one marketing in B2C contexts or for B2B business with a smaller number of industrial customers. Lead qualification can be
broadly conceived as not only certifying and vetting all the information about the
customer, but also “scoring” the objective “quality” of the lead in terms of
the consumers’ propensity to buy. The task of certifying, verifying and vetting the
information are examples of mundane tasks that can be automated, using neural
networks and other machine learning methods, to free up employee time. On the
other hand, lead qualification and scoring is an example of using neural networks
to actively generate customer intelligence that allows the firm to better target its
customers.
Lead qualification and scoring models are usually based on NNs that generate
choice probabilities. These models have binary or multiclass outputs capturing the
probabilities of belonging to various classes, such as “buy”/“not buy,” or even
more fine-grained categories of purchase. The individual-level choice probabilities
are then used to determine more or less “attractive customers” in terms of their
propensities to purchase. The attractive customers are said to be qualified, and the
firm then targets them with its sales and marketing efforts.
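A minimal sketch of such a lead-scoring model, using simulated data and a purely illustrative network (not any specific published model), might look like this:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(42)

# Hypothetical lead features: e.g., firm size, past contacts, web visits (simulated)
X = rng.normal(size=(500, 3))
# Simulated "buy" / "not buy" labels with some dependence on the features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 500) > 0).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X, y)

# Choice probabilities = propensity to buy; rank leads from most to least attractive
propensity = clf.predict_proba(X)[:, 1]
ranked_leads = np.argsort(-propensity)
qualified = ranked_leads[:50]  # e.g., target the top 50 leads
```

The network's class probabilities serve directly as lead scores: the firm qualifies the highest-propensity leads and directs its selling effort there.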
West, Brockett, and Golden (1997) is an early paper in a major marketing
journal that has demonstrated the superiority of neural networks for predicting
individual consumer choice. The authors study the realistic situation of non-compensatory consumer choice, which requires nonlinear utility models, and investigate the performance of neural networks compared to traditional statistical
methods. They state that: “…the results reveal that the neural network model
outperformed the statistical procedures in terms of explained variance and out-of-
sample predictive accuracy” (page 370). Specifically, they consider three
commonly used non-compensatory choice rules – the satisficing rule, the latitude of acceptance rule, and the weighted additive rule – and find that the neural network
performs much better than the standard linear statistical models like Logistic
Regression and Discriminant Analysis.
Some authors have leveraged the ability of neural networks to better handle
data that is increasingly available from online sources. The increasingly important
online commerce especially benefits from the ability of NNs to handle the enor-
mous volume, complexity and “real-timeness” of the data. Potharst, Kaymak,
and Pijls (2001) have documented a 70% response rate when the mailing is guided
by neural networks (compared to just 30% for traditional lead qualification
methods) when identifying consumers that are likely to respond to direct mailings
by a Dutch charity organization. Kim, Street, Russell, and Menczer (2005) use
NNs, guided by a genetic algorithm, to identify the feature subset that maximizes
classification accuracy. Their form of genetic algorithm is called the evolutionary
local selection algorithm (ELSA) and it accomplishes “feature selection” to search
through a multitude of features (demographic variables) which are then fed to a
neural network that predicts “buy” or “not buy.” The authors show that this NN
approach dominates the traditional methods of feature selection (done by prin-
cipal components analysis) and classification (done by logistic regression).
In the context of using neural networks for choice modeling in a B2B situation,
Kumar, Rao, and Soni (1995) study a supermarket’s item selection decision. In
their analysis these authors use data from a supermarket’s item selection decision
with a total of 1048 observations with 770 rejects (coded as 0) and 278 accepts
(coded as 1). They do a comparison of neural networks and logistic regression. An
important contribution of this paper is a comparison of neural networks and
logistic regression under different classification thresholds, apart from the stan-
dard threshold of 0.5. They find that the performance of a neural network is better
than a logistic regression, and importantly, the performance becomes even better
when the classification thresholds are more stringent – for example, when an
object is classified as 0 (or 1) when the probability is less than 0.25 (or greater than
0.75). They take this as evidence that a neural network provides more confidence
than a logistic regression in making such a binary decision.
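The stringent-threshold idea generalizes easily: classify an object only when its predicted probability falls below a low cutoff or above a high cutoff, and abstain otherwise. A small sketch with made-up probabilities:

```python
import numpy as np

def classify_with_band(probs, low=0.25, high=0.75):
    """Classify as 0 below `low`, 1 above `high`, abstain (-1) in between."""
    labels = np.full(probs.shape, -1)  # -1 marks "no confident decision"
    labels[probs <= low] = 0
    labels[probs >= high] = 1
    return labels

# Illustrative predicted acceptance probabilities for six items
probs = np.array([0.05, 0.20, 0.40, 0.60, 0.80, 0.95])
decisions = classify_with_band(probs)
# decisions is [0, 0, -1, -1, 1, 1]: the two middle items are left undecided
```

Under such stringent cutoffs, a model whose probabilities are pushed toward 0 and 1 (as Kumar, Rao, and Soni found for the neural network) classifies more of the cases confidently.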
In the pre-approach and approach stages of the sales process, the largest impact of AI has been the emergence of mobile and web-based means through which the selling organization can contact customers. A company called
6sense uses data on customers' visits to the client's site in combination with third-party data and social media feeds to predict when a customer may be ready to buy, and therefore the best time for the client's salespeople to approach their potential buyers. The most exciting development in AI-powered conversational
software has been the emergence of chatbots. While earlier chatbots used other
models of Natural Language Processing (NLP) like Markov chains and genetic
algorithms (Abdul-Kader & Woods, 2015), the most recent techniques for
conversational AI use neural network based deep learning methods (Gao, Galley,
& Li, 2019).
The major impact of machine learning and AI in the presentation stage has come about through immersive technologies like mobile virtual reality (VR), 360-degree video, and augmented reality (AR). Such immersive technologies enable
higher user engagement than “plain” videoconferencing by enhancing the sense of
the presenter being present in the room. Garg and Tai (2014) show how artificial neural networks (NN), along with genetic programming, can be used to
improve rapid prototyping by optimizing the parameter settings that control the
wear strength and tensile strength of the prototype. Multi-layer neural networks
like deep learners foster real-time image and audio processing which drive virtual
reality displays enabling effective sales presentation technologies (Dooley, 2017).
Intel’s RealSense Vision Processor allows firms to present presentation prototypes
in 3D and use advanced cloud-deployed algorithms to process raw image streams
in real-time. Thieme, Song, and Calantone (2000) have used neural networks in
the critical new product development selection process.
Of all the stages of the sales process, overcoming objections and closing are
perhaps least impacted by AI and machine learning including neural networks.
Much of the closing of big-ticket sales is still done in person, but the overcoming
objections function is being rapidly disrupted by AI through robo-advisors.
Though sales people still have a role in overcoming objections, especially where
standard FAQs are insufficient, AI is rapidly making inroads. Again the more
cutting-edge deep learning type neural networks are leading the charge by being
able to analyze video and text thus facilitating real time interactions with
customers.
The final stage of the sales process, follow-up, has two aspects: current order
processing and customer engagement after the current order is filled. As far as
applying neural networks is concerned, current order processing has been tackled
under the umbrella of supply chain functions comprising order recording, order processing, inventory management and order fulfilment. Sustrova (2016) shows
how NNs can be used for managing a company’s order cycle leading to reduced
storage costs, reduced inventory purchase costs and optimum ordering levels.
Chiu and Lin (2004) accomplish complete order fulfilment across the supply chain
by training three separate NNs for the supply network, the production network
and the delivery network.
Machine learning and AI have been applied to the post-order follow-up stage of the sales process. The companies Gainsight and Survey Monkey have teamed
up to offer software that automatically alerts salespeople to the need to invoice
after the close along with prompting them on upsell and cross-sell opportunities.
Guido, Prete, Miraglia, and De Mare (2011) show how neural networks can
improve the effectiveness of direct mail marketing campaigns by providing better
predictions of purchase intention through cross-selling and up-selling. The con-
sumer’s response rates are modeled using factors that are likely to have an impact
on the purchase intention. The authors also benchmark NNs against the more
traditional methods like multiple regression analysis and logistic regression and
find that NNs perform better. Knott, Hayes and Neslin (2002) develop models to
determine the Next Product To Buy (NPTB) and find that a NN has a slight
advantage over other competing models. Linder, Geier, and Kolliker (2004)
provide some recommendations of whether NNs, regression or classification trees
should be used depending on the type of customer database being analyzed.
4. Case Studies
In this section we will present a couple of case studies about the application of
neural networks in marketing. The goal of these case studies is to provide some
details on how neural networks have been used in important application areas in
marketing and sales. We will focus on the business context of the applications,
describe the data set and, wherever possible, provide some theoretical background
for our choice of predictors and dependent variable. We will describe details of the
analyses done on the data sets and the results obtained, especially with a view to
visualization and interpretation of the results. We also compare the strengths and
weaknesses of neural networks compared to the traditional econometric models
where the latter are used as benchmarks.
Customer churn has been studied in many different industries. To take a few examples, Coussement and Van den Poel (2008)
have considered the market for subscription services, Verbeke, Dejaeger, Mar-
tens, Hur, and Baesens (2012) have considered the telecommunication sector and
Xie, Li, Ngai, and Yin (2009) have considered bank customer churn.
We use a publicly available data set to illustrate the use of a neural network for
churn prediction. The data comes from a European bank and the goal of the
analysis is to predict whether a customer will leave the bank or not. The data
comes from historical records at the bank and the variables in the dataset are
• RowNumber
• CustomerId
• Surname
• CreditScore
• Geography
• Gender
• Age
• Tenure
• Balance
• NumOfProducts (the number of accounts and bank-affiliated products the person holds)
• HasCrCard (whether they have a credit card issued by the bank)
• IsActiveMember (whether they do regular business with the bank)
• EstimatedSalary
• Exited (whether the customer ultimately left the bank)
The aim of this section is to give one simple illustration of the analysis that can
be done, and so we will not show detailed analyses of the data using an exhaustive
search over all the different tuning parameters to construct the “best fitting”
neural network. The features we will use are variables 4 through 12 in the list
above (“CreditScore” to “IsActiveMember”), and the target will be variable 14
(“Exited”). Thus, we will use binary classification models. We will use a logistic
regression as a benchmark model and investigate whether, and to what extent, a
neural network performs better than it.
Both the logistic regression and the neural network were trained on a random sample of 80% of the data and tested on the remaining 20%. In both cases we did a 3-fold cross-validation. In addition, for the neural network we
found that a network with 15 hidden nodes performed reasonably well and we
chose a weight decay parameter of 0.8. We show the inputs chosen for running a
logistic regression for this data set.
Response Column: 14
Predictor columns: 4:12 (4 through 12)
Training Percentage: 80
Testing Percentage: 20
Number of folds for Cross-validation: 3
The inputs chosen for running the neural network on this data set are as follows.
Response Column: 14
Predictor columns: 4:12 (4 through 12)
Training Percentage: 80
Testing Percentage: 20
Number of folds for Cross-validation: 3
Number of Nodes in Network: 5, 10, 15
Number of times Averaging: 2
Decay: 0.8
Probability Range to Exclude: 0.5, 0.5
While the other inputs in the interface for a neural network are straightfor-
ward, two aspects require clarification. First, the number of nodes is given as
“5,10,15.” The idea is to compute neural networks with different number of
hidden nodes and then to select the best fitting neural network. This step can be
accomplished by writing simple code in most software packages. In this case a
neural network with 15 hidden nodes performs the best, and so the fit statistics in
the following paragraph and outputs only relate to the neural network with 15
hidden nodes. Second, the “Probability range to exclude” input is related to the
probability threshold selected for classifying a data point as belonging to category
“1” (Exit) versus “0” (Not exit). If we select the usual threshold of 0.5 then the
input in this cell is “0.5,0.5.” We will give more explanations of this when we
discuss other variants of binary classification models, but for now the reader
should go with the input as shown.
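The inputs above come from a point-and-click interface; a rough code equivalent (our own sketch with simulated stand-in data, not the software used for the case study) would grid-search the three node counts with 3-fold cross-validation, with scikit-learn's alpha penalty standing in for the weight-decay parameter:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Stand-in for the bank churn data: 9 predictor columns, binary "Exited" target
X = rng.normal(size=(1000, 9))
y = (X[:, 0] - X[:, 3] + rng.normal(0, 1, 1000) > 0).astype(int)

# 80% train / 20% test, as in the case study
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Try 5, 10, and 15 hidden nodes with 3-fold CV; alpha plays the role of weight decay
grid = GridSearchCV(
    MLPClassifier(alpha=0.8, max_iter=300, random_state=0),
    param_grid={"hidden_layer_sizes": [(5,), (10,), (15,)]},
    cv=3,
)
grid.fit(X_tr, y_tr)
best_nodes = grid.best_params_["hidden_layer_sizes"]
test_accuracy = grid.score(X_te, y_te)  # accuracy on the held-out 20%
```

The grid search automates the "compute networks with different numbers of hidden nodes and pick the best" step described above.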
We briefly describe the fit statistics. In the confusion matrices computed on the test data (rows: actual class, columns: predicted class), the row for actual class 0 reads 1,568 correctly classified and 35 misclassified for the logistic regression, and 1,509 correctly classified and 71 misclassified for the neural network.
For the purposes of benchmarking the neural network we use a simple linear
regression. The inputs for the linear regression are as follows.
Response Column: 1
Predictor columns: 2
Training Percentage: 80
Testing Percentage: 20
Number of folds for Cross-validation: 3
As in case study 1, we use a randomly selected 80% of the data for training and the remaining 20% for testing. For both the linear regression and the
neural network we use a 3-fold cross validation. The inputs chosen to run a neural
network are as follows.
Response Column: 1
Predictor columns: 2
Training Percentage: 80
Testing Percentage: 20
Number of folds for Cross-validation: 3
Number of Nodes in Network: 5, 10, 15
Number of times Averaging: 2
Decay: 0.8
Since there is only one feature, it is not meaningful to investigate the relative importance of features. We will therefore only consider the fit statistics. Since
this is a regression task, we will look at the predicted MSE (mean squared error).
The predicted MSE for the linear regression calculated on the test data is
approximately 6921. The predicted MSE for the neural network is approximately
2543. Note that predicted MSE is an error and so smaller values are more
desirable. Clearly, the neural network does a much better job of prediction on the
test data set. It is worth noting that, as mentioned in Fig. 2.2 in Section 1.2, the
true relation between rent value and distance to the city center has a “cubic”
pattern. The simulated data captures this relationship. Thus, it stands to reason
that the linear regression will not be able to adequately capture this relationship.
How about a nonlinear regression as given below (using DCC = distance to city center)?

$$\text{Rent value} = a + b \cdot DCC + c \cdot DCC^2 + d \cdot DCC^3 + \varepsilon$$
This will require the analyst to create two additional variables from the raw data, namely $DCC^2$ and $DCC^3$. This is the nonlinear regression that was estimated by Frew and Wilson (2002), and they found support for this regression. The important thing to realize is that such nonlinearities are automatically captured by a neural network without the need for significant human intervention in terms of pre-processing the data to create new variables!
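This point can be demonstrated on simulated data that follows a cubic pattern (the coefficients and noise level below are made up for illustration): the linear model sees only the raw DCC and misses the curvature, while a neural network on the same single raw feature captures it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Simulated cubic relation between rent value and distance to city center (DCC)
dcc = rng.uniform(0, 10, 400)
rent = 500 - 60 * dcc + 9 * dcc**2 - 0.4 * dcc**3 + rng.normal(0, 5, 400)

X_raw = dcc.reshape(-1, 1)  # the only feature either model receives

# Linear model on raw DCC only: cannot capture the cubic pattern
lin = LinearRegression().fit(X_raw, rent)
mse_lin = mean_squared_error(rent, lin.predict(X_raw))

# Neural network on the same raw DCC: learns the nonlinearity by itself
nn = MLPRegressor(hidden_layer_sizes=(20,), solver="lbfgs",
                  max_iter=5000, random_state=0).fit(X_raw, rent)
mse_nn = mean_squared_error(rent, nn.predict(X_raw))
```

No $DCC^2$ or $DCC^3$ columns were constructed for the network; its hidden layer builds the needed nonlinear transformations internally.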
APPENDIX
Technical Detour 1
The functions $f_m(x_i)$ ($m = 1, \ldots, M$), which form the sum $f(x_i) = w_0 + \sum_{m=1}^{M} w_m f_m(x_i)$, are called "basis functions," and the summation on the right-hand side is called the "basis expansion" of $f(x_i)$. In the specific case where $f(x_i)$ has a

2. The reader should note that $h_m$ is just the output from hidden node $m$ after the activation function $f_m$ acts on the input information received at that node (see Fig. 2.5).

3. It is now clear why sometimes the final transformation $f^{(2)}$ is needed at the output layer. For classification, this is the transformation that generates class probabilities from the untransformed, and unrestricted, continuous output that the NN would otherwise yield.
Technical Detour 2
The input information received at hidden node $m$ is the weighted sum of the input nodes (that is, $x_1, x_2, \ldots, x_p$) plus a bias acting at hidden node $m$. In symbols, the input at hidden node $m$ is $w_{m0} + w_{m1}x_1 + w_{m2}x_2 + \cdots + w_{mp}x_p$. In this expression, the weights are $w_{m1}, \ldots, w_{mp}$ and the bias is $w_{m0}$. It is convenient to express this sum in vector notation. For that purpose we augment the input vector $x = (x_1, x_2, \ldots, x_p)$ by adding a term $x_0 = 1$. The augmented input vector becomes $x = (x_0, x_1, \ldots, x_p)$. With this augmentation, $w_{m0} + w_{m1}x_1 + w_{m2}x_2 + \cdots + w_{mp}x_p = \sum_{l=0}^{p} w_{ml} x_l$, where the reader should note that the summation index $l$ starts from $l = 0$. The summation on the right-hand side can be written conveniently in vector notation as $w_m \cdot x$. This term is called the dot product, or the inner product, of two vectors, namely the vector of weights $w_m = (w_{m0}, w_{m1}, w_{m2}, \ldots, w_{mp})$ and the vector of inputs $x = (x_0, x_1, \ldots, x_p)$ with $x_0 = 1$. It is an example of the expositional simplicity we can obtain with vector notation.
We can make the dependence on the $i$th observation explicit by writing the sum $w_{m0} + w_{m1}x_{i1} + w_{m2}x_{i2} + \cdots + w_{mp}x_{ip} = \sum_{l=0}^{p} w_{ml} x_{il}$. As before, the summation on the right-hand side of this equation can be written in vector notation as $w_m \cdot x_i$, where the vector of inputs for the $i$th observation is $x_i = (x_{i0}, x_{i1}, \ldots, x_{ip})$ with $x_{i0} = 1$.
Now consider the, more realistic, case where there are $M$ hidden nodes. The activation functions at all hidden nodes have the same form, that is, $f_m(x_i) = f(w_m \cdot x_i)$ for $m = 1, \ldots, M$. The right-hand side shows that only the vector of weights $w_m$ differs across nodes $m$. As mentioned in Section 1.2, to distinguish quantities in the hidden layer from the output layer (which has only one node in the univariate case) we use superscripts "(1)" and "(2)" respectively. With the new notation, the input information received at hidden node $m$ is the weighted sum $w^{(1)}_m \cdot x_i$. The activation function at hidden node $m$ is $f^{(1)}$ and it acts on $w^{(1)}_m \cdot x_i$ to produce output $f^{(1)}(w^{(1)}_m \cdot x_i)$. Let the weight on the output from hidden node $m$ be $w^{(2)}_m$, $m = 1, \ldots, M$. Hence the basis expansion of $f(x_i)$ becomes $f(x_i) = \sum_{m=0}^{M} w^{(2)}_m f^{(1)}(w^{(1)}_m \cdot x_i)$ and thus the regression $y_i = f(x_i) + \varepsilon_i$ becomes:

$$y_i = \sum_{m=0}^{M} w^{(2)}_m f^{(1)}(w^{(1)}_m \cdot x_i) + \varepsilon_i \tag{A2.1}$$
We can express the input information received at the output node as the dot product $w^{(2)} \cdot h_i$ with weights $w^{(2)} = (w^{(2)}_0, w^{(2)}_1, w^{(2)}_2, \ldots, w^{(2)}_M)$ and hidden-node outputs $h_i = (h_{0i}, h_{1i}, \ldots, h_{Mi})$ where $h_{0i} = 1$. Thus, $w^{(2)} \cdot h_i = \sum_{m=0}^{M} w^{(2)}_m h_{mi}$. After the final transformation the output would be $f(x_i) = f^{(2)}(w^{(2)} \cdot h_i)$. One can see that this idea of forming composite functions can be generalized to form longer chains. We have just seen that $f(x_i) = f^{(2)}(w^{(2)} \cdot h_i)$. Since $h_{mi} = f^{(1)}(w^{(1)}_m \cdot x_i)$, $f(x_i)$ can be expressed as a composite function $f(x_i) = f^{(2)}(f^{(1)}(x_i))$. Proceeding similarly, we can also have longer chains: $f(x_i) = f^{(n)}(\cdots f^{(2)}(f^{(1)}(x_i)) \cdots)$.
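The composite chain $f^{(2)}(f^{(1)}(\cdot))$ is easy to see in code. This sketch (our own, with arbitrary illustrative weights) computes one forward pass of a one-hidden-layer regression network with sigmoid hidden units and an identity output:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Augmented input x = (x0 = 1, x1, x2): one observation with p = 2 features
x = np.array([1.0, 0.5, -1.2])

# Hidden-layer weights W1: one row w_m per hidden node (M = 3), bias in column 0
W1 = np.array([[0.10, 0.4, -0.3],
               [-0.20, 0.8, 0.5],
               [0.05, -0.6, 0.9]])

# f(1): sigmoid activation at each hidden node, h_m = f(1)(w_m . x)
h = sigmoid(W1 @ x)

# Augment hidden outputs with h0 = 1, then apply the output-layer weights w(2)
h_aug = np.concatenate(([1.0], h))
w2 = np.array([0.3, 1.0, -0.7, 0.2])  # output weights, bias first

# f(2) is the identity for regression: f(x) = w(2) . h
f_x = w2 @ h_aug
```

Each line mirrors one term in the derivation: the dot products $w^{(1)}_m \cdot x$, the hidden outputs $h_m$, and the final sum $w^{(2)} \cdot h$.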
Technical Detour 3
For a complete description, we need to specify the activation functions at both the hidden and output layers. First, consider the hidden layer. Making a connection with Fig. 2.10, we have

$$h_1 = f^{(1)}(w^{(1)}_1 x) = \frac{1}{1 + e^{-w^{(1)}_1 x}} \quad \text{and} \quad h_2 = f^{(1)}(w^{(1)}_2 x) = \frac{1}{1 + e^{-w^{(1)}_2 x}},$$

where the weighted sums are given as $w^{(1)}_1 x = w^{(1)}_{01} + w^{(1)}_{11} x$ and $w^{(1)}_2 x = w^{(1)}_{02} + w^{(1)}_{12} x$. The activation function in the output layer, $f^{(2)}$, is linear in $h_m$, $m = 1, 2$. Thus, $y = f^{(2)}(h) = w^{(2)}_0 + w^{(2)}_1 h_1 + w^{(2)}_2 h_2 + \varepsilon$. Note that the activation function in the output layer is linear in the case of regression. As already mentioned, for a binary output variable, the activation function in the output layer is usually sigmoid.
Technical Detour 4
The formal definitions of the maximum likelihood estimator are given in Chapter
1. There we show that maximum likelihood is related to the cross-entropy cost. In
this technical detour we just present explicit expressions of the cost functions for
regression and classification problems for the specific case of a neural network.
The sum-of-squares cost, and indeed the cross-entropy cost, for a regression model with a 1-dimensional output $y_i$ is

$$C(\theta) = \sum_{i=1}^{N} (y_i - f(x_i))^2 \tag{A2.2}$$
Technical Detour 5
In Chapter 1 we mentioned the gradient descent updating of the weights. Gradient descent occurs according to the updating rule

$$w(t+1) = w(t) - \gamma \, \frac{\partial C(w(t))}{\partial w} \tag{A2.4}$$

The geometrical analog of the derivative $\frac{\partial C(w(t))}{\partial w}$ is a slope, and for multidimensional cases with many weights the equivalent of this derivative is called a gradient.
To give a more detailed demonstration of the backpropagation algorithm we will consider the case of a NN with one hidden layer and $K$ output nodes. The parameters are given in (2.1) and are collectively denoted by $\theta$. We will consider supervised learning where there is a well-defined $K$-dimensional "target variable" $y_i = (y_{i1}, y_{i2}, \ldots, y_{iK})$ corresponding to an input data vector $x_i$, $i = 1, \ldots, N$. Learning is accomplished by minimizing a loss (cost) function, and here we use a quadratic loss:

$$C(\theta) = \sum_{i=1}^{N} C_i = \sum_{i=1}^{N} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2$$

where $f_k(x_i)$ is the output of the NN model at node $k$ and $y_i$ is the target corresponding to input $x_i$. The gradient-based update requires the derivatives. The derivative with respect to the output-layer weights is

$$\frac{\partial C_i}{\partial w^{(2)}_{km}} = -2\,(y_{ik} - f_k(x_i))\, f^{(2)\prime}_k(w^{(2)}_k \cdot h_i)\, \frac{\partial}{\partial w^{(2)}_{km}} (w^{(2)}_k \cdot h_i)$$

which gives us

$$\frac{\partial C_i}{\partial w^{(2)}_{km}} = -2\,(y_{ik} - f_k(x_i))\, f^{(2)\prime}_k(w^{(2)}_k \cdot h_i)\, h_{mi}$$

The partial derivative inside the summation on the right-hand side can be evaluated as

$$\frac{\partial}{\partial w^{(1)}_{ml}} f_k^{(2)}(w^{(2)}_k \cdot h_i) = f^{(2)\prime}_k(w^{(2)}_k \cdot h_i)\, w^{(2)}_{km}\, f^{(1)\prime}(w^{(1)}_m \cdot x_i)\, x_{il}$$

In the above evaluation, the term $x_{il}$ in the last expression comes from differentiating the argument of $f^{(1)}$, which is $w^{(1)}_m \cdot x_i = \sum_{l=0}^{p} w^{(1)}_{ml} x_{il}$, with respect to $w^{(1)}_{ml}$. Summarizing, the gradients of the cost function with respect to the output and hidden layers are

$$\frac{\partial C_i}{\partial w^{(2)}_{km}} = -2\,(y_{ik} - f_k(x_i))\, f^{(2)\prime}_k(w^{(2)}_k \cdot h_i)\, h_{mi}$$
$$\frac{\partial C_i}{\partial w^{(1)}_{ml}} = -\sum_{k=1}^{K} 2\,(y_{ik} - f_k(x_i))\, f^{(2)\prime}_k(w^{(2)}_k \cdot h_i)\, w^{(2)}_{km}\, f^{(1)\prime}(w^{(1)}_m \cdot x_i)\, x_{il} \tag{A2.5}$$
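The gradients in (A2.5) can be checked numerically. This sketch (our own, with sigmoid hidden units and linear outputs so that $f^{(2)\prime}_k = 1$) compares the analytic gradient for one output-layer weight against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2, x):
    """One-hidden-layer NN: sigmoid hidden units, linear (identity) outputs."""
    h = np.concatenate(([1.0], sigmoid(W1 @ x)))  # h0 = 1 for the output bias
    return W2 @ h, h

rng = np.random.default_rng(0)
x = np.concatenate(([1.0], rng.normal(size=2)))   # augmented input, p = 2
y = np.array([0.7, -0.3])                         # K = 2 targets
W1 = rng.normal(size=(3, 3))                      # M = 3 hidden nodes
W2 = rng.normal(size=(2, 4))                      # output weights w(2)

f, h = forward(W1, W2, x)

# Analytic gradient from (A2.5) with f(2)' = 1: dC/dw(2)_km = -2 (y_k - f_k) h_m
grad_analytic = -2.0 * np.outer(y - f, h)

# Finite-difference estimate for one entry, w(2)_{k=0, m=1}
eps = 1e-6
W2p = W2.copy()
W2p[0, 1] += eps
fp, _ = forward(W1, W2p, x)
cost = lambda f_out: np.sum((y - f_out) ** 2)
grad_numeric = (cost(fp) - cost(f)) / eps
```

Agreement between `grad_analytic[0, 1]` and `grad_numeric` confirms the output-layer formula; the same perturbation trick applied to `W1` would check the hidden-layer formula.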
Technical Detour 6
When we extend this to the multivariate case, then for the $k$th output ($k = 1, \ldots, K$) the regression equation is $y_{ki} = f_k(x_i) + \varepsilon_{ki}$, with $f_k(x_i) = \sum_{m=0}^{M} w^{(2)}_{km} h_{mi}$.
We can clearly see that all elements of the vector $z_i$ enter the computation above.
We provide an intuition for how the softmax function is a smooth approximation to the non-differentiable function $\max(\cdot)$. Many authors distinguish between the softmax function and the softmax activation function used in NN output layers. For a set of points $x_1, x_2, \ldots, x_n$, the softmax function is given by

$$\mathrm{softmax}(x_1, x_2, \ldots, x_n) = \ln \sum_{j=1}^{n} e^{x_j}$$

First, whereas $\max\{0, x\}$ has a kink at 0 and is not differentiable there, $\mathrm{softmax}(0, x)$ is smooth everywhere. Now, because we exponentiate, if $x_i$ is the largest component, that is $x_i = \max_j x_j$, then $e^{x_i}$ will dominate all the other terms in the summation $\sum_{j=1}^{n} e^{x_j}$. Hence, $\ln \sum_{j=1}^{n} e^{x_j} \approx \ln e^{x_i} = x_i = \max_j x_j$. In other words, not only is the softmax function smooth, it also approximates the max function. Also, note that when the probabilities of class membership are given by the softmax activation function in (1.5), then maximizing the log likelihood is equivalent to maximizing over terms like $\ln\!\left(\frac{e^{z_{ki}}}{\sum_{j=1}^{K} e^{z_{ji}}}\right) = z_{ki} - \ln\!\left(\sum_{j=1}^{K} e^{z_{ji}}\right)$. One can now clearly see how the softmax function is involved in the softmax activation function at the output nodes.
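A quick numeric check of the approximation $\ln \sum_j e^{x_j} \approx \max_j x_j$ (the values are chosen arbitrarily):

```python
import numpy as np

def softmax_fn(xs):
    """The 'softmax function' of the text: a smooth approximation to max."""
    return np.log(np.sum(np.exp(xs)))

xs = np.array([1.0, 2.0, 10.0])
approx = softmax_fn(xs)  # close to 10, because e^10 dominates the sum
exact = np.max(xs)
gap = approx - exact     # always positive, and at most ln(n)
```

The gap is tiny here because the largest component dominates; in the worst case of $n$ equal components the gap is exactly $\ln n$.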
Chapter 3
Chapter Outline
1. Hyperparameters, Overfitting, Bias-variance Tradeoff, and Cross-validation
1.1 Hyperparameters
1.2 Overfitting
1.3 Bias-variance Tradeoff
1.4 Cross-validation
2. Regularization and Weight Decay
2.1 L2 Regularization
2.2 L1 Regularization
2.3 L1 and L2 Regularization as Constrained Optimization Problems
2.4 Regularization through Input Noise
2.5 Regularization through Early Stopping
2.6 Regularization through Sparse Representations
2.7 Regularization through Bagging and Other Ensemble Methods
Technical Appendix
1.1 Hyperparameters
A hyperparameter is a parameter whose value has to be fixed prior to training on
a specific data set. We can conceive of a hierarchy of parameters. The training
process results in the estimation of certain parameters, but the training itself
requires some “higher level” parameters to be set before training begins. Said
differently, hyperparameters cannot be estimated while fitting the model to the
training data set. Hyperparameter often relate to model specification, model
architecture, or to the specification of the learning algorithm.
Consider the case of a neural network. Any practically useful NN is likely to
have a large number of weights and biases that need to be estimated. This
problem is exacerbated for deep NNs. NNs have a hierarchy of parameters: at the
lower level we have the weights and biases which are estimated using some variant
of the backpropagation algorithm. Recall that this algorithm requires one to
specify a “learning rate.” The learning rate is an example of a parameter at a
higher level, in that, the NN model estimates the weights and biases given a
certain learning rate which is set exogenously. There are other such higher-level
parameters, for example the number of hidden nodes or the regularization
parameter (which we will soon explain). These higher-level parameters are called
hyperparameters.
Many, though not all, hyperparameters are designed so as to allow general-
ization of the model to new test data, once the model has been fit using the
training data. Overfitting of the model to the training data is a situation where the
generalizability of the model to new test data is compromised. Thus, many
hyperparameters are chosen to explicitly counter the tendency of a NN model to
overfit to the training data. Perhaps the most powerful suite of methods for
countering overfitting is regularization. Regularization requires a regularization
weight and this is another hyperparameter.
1.2 Overfitting
The real issue of overfitting is how to design the machine learning model to
properly tradeoff between training error and test error – these are the errors the
model makes when we evaluate the fit of the model to the training data and test
data respectively. Models which provide a very good fit to the training data, i.e., have very small training error, may suffer from high test error. As a metaphor, consider a student who prepares for a math test via rote learning. If this
student prepares for the test by only memorizing the answers to the sample
questions, and nothing else, he is less likely to successfully tackle a new problem
encountered during the test itself which is different from the sample questions. His
learning will not generalize well to new problems.
Consider the case of a neural network. One useful heuristic to determine when
a neural network model has started to overfit is to plot two curves, one each for
the training and test data. The curves in question are the plots of the prediction
error against the model complexity (e.g., the number of hidden nodes for a neural
network).
Overfitting and Regularization in Machine Learning Models 67
As a rule of thumb, overfitting occurs when the prediction error of the training
data keeps decreasing while the prediction error of the test data stops decreasing
or even starts to increase. Toward the right side of Fig. 3.1 one would have the
region with high variance and low bias, and toward the left side one would have
the region of low variance and high bias. Consider a NN with a given number of
input units and output units. As we increase the number of hidden units, thereby
increasing model complexity, we can plot the training error and test error. When the training error stops decreasing, or when the test error starts increasing, training should stop. Some analysts also plot the learning curves with the
number of training epochs as the X-axis. One of the most practically convenient
means of avoiding overfitting is early stopping where the number of epochs of
training is stopped before training error reaches its minimum (we will discuss this
further when we discuss regularization).
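The early stopping rule sketched above can be written in a few lines. The training loop, patience value, and toy U-shaped validation curve below are illustrative assumptions, not code from any particular package.

```python
def train_with_early_stopping(train_step, val_error, max_epochs=500, patience=10):
    """Stop training when validation error has not improved for `patience` epochs."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(epoch)                 # one epoch over the training data
        err = val_error()                 # error on held-out validation data
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                         # validation error stopped improving
    return best_epoch, best_err

# Toy illustration: a U-shaped validation curve with its minimum at epoch 50.
errs = [(e - 50) ** 2 for e in range(500)]
current = [0]
best_epoch, best_err = train_with_early_stopping(
    train_step=lambda e: current.__setitem__(0, e),
    val_error=lambda: errs[current[0]],
)
```

The returned epoch marks where validation error was lowest; in practice one would restore the weights saved at that epoch.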
While there are many sources of overfitting, a useful starting point to form an intuition would be to look at the issue of over-parameterization, where the model has too many parameters relative to the size of the training data. Over-parameterization can become a serious problem especially in deep learning models, owing to their large number of parameters, unless there is also a very large training data set. To take an extreme example, consider the case where the
number of training data points is actually smaller than the number of parameters.
In multiple linear regression, it is well known that the proper estimation of the
parameters (the coefficients) requires the number of observations to be larger than
the number of parameters. In fact, with more parameters than observations in a
linear regression, an infinite number of parameter values will fit the training data
exactly. This is an important concept, and we provide more details in the
appendix.
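The extreme case just described, more parameters than observations, can be demonstrated numerically; the data values below are arbitrary illustrative numbers.

```python
import numpy as np

# Two training points but three regression parameters (w0, w1, w2):
# fix w0 arbitrarily and solve the remaining 2x2 system exactly.
X = np.array([[1.0, 2.0, 3.0],   # row i = (1, x1i, x2i)
              [1.0, 4.0, 5.0]])
y = np.array([7.0, 13.0])

def solve_given_w0(w0):
    w12 = np.linalg.solve(X[:, 1:], y - w0)   # solve for (w1, w2) given w0
    return np.array([w0, *w12])

# Three different choices of w0 all fit the training data with zero error,
# so the training data cannot identify a unique parameter vector.
for w0 in (0.0, 1.0, -5.0):
    assert np.allclose(X @ solve_given_w0(w0), y)
```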
Technical Detour 1
68 Machine Learning and Artificial Intelligence in Marketing and Sales
The irreducible error comes from the variance of the error term in the regression. Ignoring the irreducible error for the moment, the more complex we make the model, the more we reduce the bias, but the more we increase the variance. A situation of high bias is often called underfitting and one of high variance is called overfitting.
Executive Summary
A major concern while training any machine learning model, including neural
networks, is the problem of overfitting. Overfitting relates to the ability of the
model to generalize to new data that was not part of the training. More
complex models are more likely to overfit training data. The minimization of
overfitting lies in achieving a proper tradeoff between training error and test
error. Training (test) error is the error that the model makes when we evaluate
the fit of the model to the training (test) data.
Technically speaking, overfitting is quantified by the Bias-Variance
tradeoff. Regularization through weight decay, and other techniques of tuning the hyperparameters of a model, are important methods for correcting
overfitting.
The last term in (3.1), the irreducible error, is the variance of the error term in the regression model Y = f(X) + ε. This error could be due to the fact that there are many unmeasured (un-modeled) variables that also affect Y apart from the ones in the function f(X). Moreover, there is always the chance of measurement error. We assume that all these sources of "model" error are captured by the error term ε, and this generates the irreducible error term in (3.1). We now turn to the first and second terms in (3.1). Bias, variance, and the
bias-variance tradeoff are important concepts in machine learning and we will
give some intuition for them. Both bias and variance are related to sampling
variability. Statisticians conceive of the training data sample as being a random
draw from an underlying data distribution. Different training data samples
correspond to different random draws and so the training samples are different.
Now the training of a model, i.e., learning the parameters of the model, depends
on the training sample used. Thus, the prediction f̂(x0), corresponding to a given input point x0, will be different for different samples. Because there are differences in the predictions, we can talk of the average prediction. Intuitively speaking, the discrepancy of the average prediction from the "true" f(x0) is captured by the concept of bias. This idea can be made more precise.
Statisticians use the general concept of expectation, and for our purposes it
suffices to note that, under reasonable conditions, the sample average is an
estimate of the expectation. The discrepancy of the expected prediction from the
true f (x0) is the bias. Moreover, as the predictions are different for different
samples, one can also calculate the variance of the predictions around their
mean. This is the “variance” term in the bias-variance decomposition formula
above.
Technical Detour 2
1.4 Cross-validation
We now turn to the important concept of cross-validation. Since the ability to
generalize to new data is a critical consideration for any machine learning model,
we would like to investigate the performance of the model on new test data. This
requires one to keep aside some data as the test data and estimate the model only
on the training data. This makes inefficient use of the data, especially for small
data sets. Cross-validation is a useful technique for assessing the generalizability
of the model, and thus avoiding overfitting, while at the same time making efficient use of the available data.
Ideally, if we have enough data then we should randomly divide the data into:
(a) Training data (b) Validation data (c) Test data.
The training data is used to train the neural network and the modeler has many
design choices. The training data set could be used multiple times for different
choices of hyperparameters like the learning rate, weight decay parameter, or
even for different modeling architecture choices like the number of hidden nodes
in the hidden layer.
The validation data is used to evaluate the performance of these various
hyperparameter and model architecture choices. For example, the tuning of
the weight decay parameter can be done by choosing different values of
weight decay and estimating the regularized model using the training data.
Each of these versions of the neural network model (corresponding to
different weight decay parameters) can be evaluated for accuracy by using the
validation data.
The test data is generally used just once at the very end, after all the hyperparameters have been tuned and the best set of hyperparameters and parameters (weights and biases) have been obtained. Often the process unfolds in stages:
• The entire labeled data is divided into training and test data. The test data is
held aside.
• The training data is further divided into “training without validation” and
validation data.
Roughly speaking, cross-validation measures the expected test error where the expectation
is over all training samples. Since training occurs K times during cross-validation,
each of the training data sets can be considered to be a random sample from the
underlying data generating distribution. Then the average of the test errors over
the K training data sets can be thought of as an estimate of the expected test error,
and this is the quantity that cross-validation computes. Of course, this is likely to hold if K = 5 or 10, since in this case the training data sets are likely to be different from each other. But if K is close to N, then the training data sets are essentially the same.
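The K-fold procedure described above can be written out directly. The fold count, toy data, and least-squares fitting routine below are our illustrative choices, not the book's.

```python
import numpy as np

def k_fold_cv_error(x, y, fit, predict, K=5, seed=0):
    """Average squared test error over K folds; each fold is held out once."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(x[train], y[train])            # fit with group k removed
        errors.append(np.mean((predict(model, x[test]) - y[test]) ** 2))
    return np.mean(errors)                         # estimate of expected test error

# Toy data: a noisy line, fit by ordinary least squares (degree-1 polyfit).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = 2 * x + 1 + 0.1 * rng.normal(size=100)
cv_err = k_fold_cv_error(x, y,
                         fit=lambda a, b: np.polyfit(a, b, 1),
                         predict=lambda m, a: np.polyval(m, a))
```

For this well-specified model, the cross-validation error comes out close to the noise variance, as expected.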
Technical Detour 3
2.1 L2 Regularization
Both L2 regularization and L1 regularization (which we will discuss next) are
examples of parameter norm penalty methods which work by controlling model
complexity through the weights in the model. In simple terms, a norm of a vector
is its length. In both cases we penalize the magnitude of the weights of the NN
(biases are often not penalized) and the difference lies in how the norm is
measured. In these methods, we add a parameter norm penalty to the cost function that will be minimized. Recall that in standard regression the weights are chosen by: Minimize [(sum of squared errors)]. In the weight-decay setup with L2 regularization, weights are chosen by

Minimize [(sum of squared errors) + λ(sum of squared weights)]   (3.2)
Thus, at each step of gradient descent the weight decays (is rescaled) by a factor of (1 − 2γλ). Notice that the factor for weight decay is a function of two important hyperparameters – the learning rate γ and the weight decay parameter λ. For many values of γ and λ, the factor (1 − 2γλ) is less than one, and the amount of decay increases as γ and/or λ increases.
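The rescaling by (1 − 2γλ) can be seen in a one-parameter example. The cost function C(w) = (w − 3)² and the values of γ and λ below are illustrative assumptions.

```python
# One gradient-descent step with L2 weight decay:
# w <- (1 - 2*gamma*lam) * w - gamma * dC/dw
def decayed_step(w, grad, gamma=0.1, lam=0.01):
    return (1 - 2 * gamma * lam) * w - gamma * grad

# Minimize C(w) = (w - 3)^2 with weight decay; the penalized minimum of
# (w - 3)^2 + lam * w^2 is 3 / (1 + lam), slightly shrunk toward zero.
w = 0.0
for _ in range(200):
    w = decayed_step(w, grad=2 * (w - 3))
```

With λ = 0 the iteration converges to the unpenalized minimum w = 3; with λ = 0.01 it settles slightly below 3, illustrating the shrinkage toward zero.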
While the above shows how the weights are affected by regularization at each
step of gradient descent, the effect of L2 regularization (weight decay) after full
training is over has a desirable feature. It can be shown that L2 regularization
shrinks the weights more along directions in which the weights do not contribute
much to reducing the cost function during training. Said differently, the weights
are not decayed much in directions along which these weights contribute a lot to
reducing the cost function. This is desirable since the goal of training is to
minimize the cost function, and directions along which the weights do not reduce
cost function much are “unimportant” directions. It is along these unimportant
directions that L2 regularization decays the weights.
2.2 L1 Regularization
L1 regularization is the other common parameter norm penalty. As in L2 regularization, here too we penalize the magnitude of the weights of the NN. While L2 regularization uses the squares of the weights as the norm (magnitude), in L1 regularization the norm is the sum of the absolute values of the weights.
Technical Detour 4
Technical Detour 5
The parameter λ governs the extent to which the optimization program pays attention to the constraint of having small weights.
Technical Detour 6
If the function f(xi) is linear in xi then it can be shown that the perturbed cost function with input noise can be written in the form of (3.2). Thus, training with input noise is equivalent to L2 regularization of the weights. It can be shown that a similar insight holds for regressions with multi-dimensional inputs, since the logic proceeds exactly as above by using gradients instead of single-valued derivatives. For details on the relationship between input noise and L2 regularization see "Technical detour 7" in the appendix.
Technical Detour 7
¹ Each time input xi is presented, random noise εi is added, where E[εi] = 0, E[εi²] = σ², and the εi's are uncorrelated.
Fig. 3.5. Training for Too Long (Too Many Epochs) Can Raise
Validation Error.
Technical Detour 8
TECHNICAL APPENDIX
Technical Detour 1
Instead of a formal proof, we will provide a sketch of the arguments for why a
linear multiple regression requires more training data points than parameters.
Suppose a model tries to learn the function y = w0 + w1 x1 + w2 x2 + ε. This model has three parameters w0, w1, and w2. Suppose, further, that there are only two training data points: (x11, x21, y1) and (x12, x22, y2). Note that the second index i in xji (j = 1, 2) denotes the observation i among the training data points. The cost function that should be minimized is

C = [y1 − (w0 + w1 x11 + w2 x21)]² + [y2 − (w0 + w1 x12 + w2 x22)]²

The reader should note that the only unknowns in the cost function are the weights w0, w1 and w2, and the function should be minimized to determine the
² Assuming there are M hidden nodes h1,…, hM, then mathematically speaking the objective to be minimized becomes C̃(θ) = C(θ) + λ Σ_{m=1}^{M} |hm|.
values of these weights. Clearly, the cost function would be minimized if we could find weights such that

y1 = w0 + w1 x11 + w2 x21, and y2 = w0 + w1 x12 + w2 x22

Then the cost would be exactly zero. In particular, let us select any arbitrary value for w0, say ŵ0. Then the above system of equations becomes

y1 − ŵ0 = w1 x11 + w2 x21, and y2 − ŵ0 = w1 x12 + w2 x22

This is a system of two linear equations in two unknowns, w1 and w2, and they can be solved simultaneously to get the unique solutions ŵ1 and ŵ2. Now, we will get different values for ŵ1 and ŵ2 depending on the chosen value of ŵ0. Since ŵ0 was chosen arbitrarily, an infinite number of solutions exist that will give zero error on the training data.
Technical Detour 2
For a formal derivation of the bias-variance tradeoff we follow Hastie, Tibshirani, and Friedman (2009, p. 223). Consider a regression problem Y = f(X) + ε, with E[ε] = 0 and variance Var(ε) = σ_ε². We consider an input point x0 and calculate the expected prediction error of a regression fit f̂(x0) at that point using the squared error loss:

E[(Y − f̂(x0))² | X = x0] = E[(f(x0) − f̂(x0))²] + σ_ε²   (A3.1)

where the last term has been taken out of the expectation, keeping in mind that f(x0) is a constant with respect to the expectation. Adding and subtracting (E[f̂(x0)])², the first term on the right-hand side of the previous equality is

E[f̂(x0)²] − 2E[f̂(x0)]E[f̂(x0)] + (E[f̂(x0)])² + (E[f̂(x0)])² − 2f(x0)E[f̂(x0)] + f²(x0)

= E[f̂(x0)²] − (E[f̂(x0)])² + (E[f̂(x0)] − f(x0))²   (A3.2)

so that

E[(f(x0) − f̂(x0))²] = Bias²(f̂(x0)) + Variance(f̂(x0))   (A3.3)
Technical Detour 3
We give a formal definition of cross-validation. Suppose the data set (xi, yi) is of size N, that is, i = 1,…, N. We are doing a K-fold cross-validation, so the data has been randomly divided into K groups. Denote by f̂^(−k)(x) the fitted function computed with the kth group of data removed. Then the cross-validation estimate of the expected test error is

(1/K) Σ_{k=1}^{K} [Test error over test data x in group k, using f̂^(−k)(x) as the predicted value of x]
Technical Detour 4
Weight Decay in L2 Regularization
In the usual estimation program where the weights are not penalized, we minimize the cost function C(θ). When we penalize the magnitude of the weights of the NN (biases are often not penalized), we add a parameter norm penalty J(θ) to the cost function that will be minimized:

C̃(θ) = C(θ) + λJ(θ)   (A3.5)

We will focus on a single weight w, ignoring the subscripts and superscripts on the weights for expositional ease. One gradient descent step would update a weight w as: w → w − γ ∂C(θ)/∂w, where γ is the learning rate. Now suppose we minimize the penalized cost C̃(θ). One gradient descent step would update the weight as:

w → w − γ ∂/∂w [C(θ) + λJ(θ)] = (1 − 2γλ)w − γ ∂C(θ)/∂w   (A3.6)

The second equation above follows because with L2 regularization the parameter norm penalty J(θ) contains the square of the weight, that is, w². The derivative with respect to w yields the term 2w. The above expression makes it clear that at each step of gradient descent the weight decays by a factor of (1 − 2γλ), which is less than 1.
Technical Detour 5
We follow the treatment of James, Witten, Hastie, and Tibshirani (2013, p. 225)
to analyze the sparsity property. Since the optimal weights with the L1-regularized cost function do not have convenient analytical solutions, we will consider a special case that is enough to make the central point of sparsity. We consider the linear regression context with one output, where the activation f(1) is the identity function and there is only one hidden node, M = 1. Then, ignoring the intercept, the un-regularized cost function is C(θ) = Σ_{i=1}^{N} (yi − Σ_{l=1}^{p} wl xil)². To simplify further, suppose the number of training data points N and the dimension of the input data vector p are equal, i.e., N = p. So, the data matrix X is a square matrix, and suppose it is a diagonal matrix with 1s on the diagonal and 0s elsewhere. Then the un-regularized cost simplifies to C(θ) = Σ_{l=1}^{p} (yl − wl)². The optimal weights in the un-regularized problem can be easily obtained as

ŵl(unregularized) = yl   (A3.8)

Now the regularized cost is C̃(θ) = Σ_{l=1}^{p} (yl − wl)² + λ Σ_{l=1}^{p} |wl|. We need to take the derivative of C̃(θ) with respect to wl. However, |wl| is discontinuous at 0, and so we appeal to the method of subgradients. It can be shown that the optimal regularized weights are

ŵl(regularized) = yl − λ/2 if yl > λ/2;  yl + λ/2 if yl < −λ/2;  0 if |yl| ≤ λ/2   (A3.9)
At least in this simplified scenario we can see from (A3.8) and (A3.9) how the
un-regularized and regularized weights compare. Importantly, from (A3.9) we can
clearly see the sparsity property of the L1 regularization. If the absolute value of yl, which is the least squares estimate of the weight, is less than λ/2, then the regularized weight is shrunk entirely to zero. This is the feature selection
property of L1 regularization. While we have taken a very special case for the
above analysis, the main ideas behind the sparsity of L1 regularization are
essentially the same for the more complex cases.
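The piecewise solution (A3.9) is the familiar soft-thresholding operator, which can be coded directly; the sample values of yl and λ below are illustrative.

```python
# Soft-thresholding from (A3.9): the optimal L1-regularized weight in the
# diagonal-design special case.
def l1_weight(y_l, lam):
    if y_l > lam / 2:
        return y_l - lam / 2
    if y_l < -lam / 2:
        return y_l + lam / 2
    return 0.0                    # small least-squares weights are zeroed out

# With lam = 1, any least-squares weight with |y_l| <= 0.5 is set to zero,
# and the larger weights are shrunk by 0.5 toward zero.
weights = [l1_weight(y, lam=1.0) for y in (-2.0, -0.3, 0.1, 0.4, 3.0)]
# -> [-1.5, 0.0, 0.0, 0.0, 2.5]
```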
Technical Detour 6
We sketch below how L1 and L2 regularization can be seen as constrained optimization. For instance, the L2 regularization program can be recast as

argmin_{w(1)ml, w(2)km} C(θ)   subject to J(θ) ≤ t   (A3.10)
Technical Detour 7
We consider a regression with a single input variable xi, where i = 1,…, N are the N training input data points. Each input data point xi has a corresponding target yi. The quadratic cost function is C = Σ_{i=1}^{N} (f(xi) − yi)². Suppose each time input xi is presented, random noise εi is added, where E[εi] = 0, E[εi²] = σ², and the εi's are uncorrelated.

When input noise is incorporated in the cost function, the quantity that we minimize, denoted by C̃, is the expectation of Σ_{i=1}^{N} (f(xi + εi) − yi)². We note that we have added noise εi to the input xi. The expectation is necessary since εi is a random quantity. Thus, the cost function with input noise is C̃ = E_ε[Σ_{i=1}^{N} (f(xi + εi) − yi)²]. Now, by the Taylor expansion we have

C̃ = Σ_{i=1}^{N} [ (f(xi) − yi)² + σ²(f(xi) − yi) ∂²f(xi)/∂xi² + E_εi[(εi ∂f(xi)/∂xi + ½ εi² ∂²f(xi)/∂xi²)²] ]

where we have used E[εi] = 0, E[εi²] = σ². Expanding the last squared term above and applying the expectation (keeping terms up to order σ²), we have

C̃ = C + Σ_{i=1}^{N} [ σ²(f(xi) − yi) ∂²f(xi)/∂xi² + σ² (∂f(xi)/∂xi)² ]   (A3.11)

We want the function that minimizes this error. So, using functional differentiation of (A3.11) with respect to f(xi), and making use of the definition of C, we obtain

f(xi) = yi + O(σ²)   (A3.12)

Note that the term involving the second derivative of f(xi) in (A3.11) then vanishes to O(σ²). Therefore, the effective cost with input noise can be written as

C̃ = C + σ² Σ_{i=1}^{N} (∂f(xi)/∂xi)²   (A3.13)

If the function f(xi) is linear in xi, then ∂f(xi)/∂xi is the weight, and (A3.13) is the cost function corresponding to L2 regularization of the weights.
Technical Detour 8
Suppose there are n regression models, where the error for each model i = 1,…, n is εi. The errors have zero mean, E[εi] = 0, variance E[εi²] = v, and covariances E[εi εj] = c. The ensemble prediction is just the average prediction, and its expected squared error is

E[((1/n) Σ_{i=1}^{n} εi)²] = v/n + (n − 1)c/n   (A3.14)

Clearly, if the errors are perfectly correlated, that is, c = v, the mean squared error is v, and so there is no reduction of variance by averaging.
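Formula (A3.14) can be verified by simulation; the values n = 5, v = 1, and c = 0.3 below are illustrative assumptions.

```python
import numpy as np

# Simulate correlated model errors and compare the ensemble's mean squared
# error with the prediction v/n + (n - 1)c/n from (A3.14).
rng = np.random.default_rng(0)
n, v, c = 5, 1.0, 0.3
cov = np.full((n, n), c) + np.eye(n) * (v - c)    # error covariance matrix
errors = rng.multivariate_normal(np.zeros(n), cov, size=200_000)
mse_ensemble = np.mean(errors.mean(axis=1) ** 2)  # squared error of the average
predicted = v / n + (n - 1) * c / n               # = 0.44 here, well below v = 1
```

The simulated value agrees with the formula, and both are well below the single-model error variance v, showing the variance reduction from averaging imperfectly correlated models.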
Chapter 4
Chapter Outline
1. Introduction to Support Vector Machines
1.1 Early Evolution
1.2 Nonlinear Classification Using SVM
2. Separating Hyperplanes
3. Role of Kernels in Machine Learning
3.1 Kernels as Measures of Similarity
3.2 Nonlinear Maps and Kernels
4. Optimal Separating Hyperplane
4.1 Margin between Two Classes
4.2 Maximal Margin Classification and Optimal Separating Hyperplane
5. Support Vector Classifier (Nonseparable Case) and SVM
6. Applications of SVM in Marketing and Sales
7. Case Studies
Technical Appendix
Worked-out Illustrations
Technical Detour 1
We will now show how a linear classifier is very restrictive in many realistic cases. Consider below a fragment of a much larger data set.
The data plot is given below (Fig. 4.1):
Clearly, no linear classification function can distinguish between the "+" and the "−" classes. If we apply the logistic regression classifier to this data set, we obtain the following classifier (bold line) (Fig. 4.2):
In fact, all points have been classified as "−," and so the performance of logistic regression is very poor. The reason that all points have been classified as "−" is that there are more "−" data points in the training data set, and, therefore, the misclassification rate is lower by "playing it safe" and classifying everything as "−." We have discussed this aspect of classifiers in more detail in
¹ The log odds is given by log[pi/(1 − pi)], where pi is the probability that yi = +1.
Support Vector Machines in Marketing and Sales 87
Fig. 4.2. Poor Classification of "+" and "−" Classes Using Logistic Regression.
chapter 1 under assessment methods for classification tasks (see percent correctly classified in Section 2).
As a comparison, we have done classification on this data set using SVM. The plot of the classification of the same data using a standard SVM package is as follows (Fig. 4.3). The SVM classifier achieves an accuracy of 92%.
We have just seen an illustration of how SVMs are a powerful method of
achieving nonlinear classification. In the next few sections, we provide some
insights into the SVM technique.
2. Separating Hyperplanes
We first define a hyperplane. Quite simply, a hyperplane is the generalization of a
line when we operate in a higher dimensional space. Thus, in two dimensions, a
hyperplane is just a line. Formally, if we are in a p-dimensional space, then a hyperplane is defined by the equation:
b0 + b1 x1 + b2 x2 + … + bp xp = 0   (4.1)

If a point xi = (xi1, xi2,…, xip) in p-dimensional space satisfies (4.1), then we say that this point lies on the hyperplane. A hyperplane that separates two classes is a separating hyperplane.
Consider again the binary classification context. As before, the input data points xi are each associated with a response variable yi (also called the output or target variable), where each yi can be either +1 or −1. In two dimensions, consider the set of "+" and "−" points which are separable, that is, which can be separated by lines drawn in this space as the figure below shows. Suppose the "+" points correspond to yi = +1 and the "−" points correspond to yi = −1.
In the two-dimensional Fig. 4.4, since the "+" and "−" points are linearly separable, we can draw many lines that separate them. Consider any one of these lines – say the thick bold line. Imagine a point xi with coordinates xi = (xi1, xi2).
Let the equation of the thick bold line in this two-dimensional space be: b0 + b1 x1 + b2 x2 = 0. Suppose the point xi lies exactly on this thick bold line. If we substitute the values of xi1 and xi2 in the equation of the thick bold line, we will get a value of zero: b0 + b1 xi1 + b2 xi2 = 0. However, for points that lie strictly above or below this line, the corresponding expression will be either a positive value or a negative value. In fact, for points in the "+" category we will have a positive number, and for points in the "−" category we will have a negative number. The same logic carries over to a p-dimensional space. Said differently, the sign of the expression on the left-hand side of (4.1) can be used for classification. "Technical detour 2" provides more details.
A major drawback of this approach to classification is that there could be
many separating hyperplanes, as is also clear from the figure above. This lack of
uniqueness is undesirable, and we address the question of uniqueness of the
separating hyperplanes in Section 4.
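Classification by the sign of the hyperplane expression in (4.1) is straightforward to implement. The hyperplane coefficients and test points below are illustrative assumptions, not taken from the book's figures.

```python
import numpy as np

# Label a point +1 or -1 according to the sign of b0 + b1*x1 + ... + bp*xp.
def hyperplane_classify(X, b0, b):
    return np.sign(b0 + X @ b)

# Illustrative hyperplane x1 + x2 - 1 = 0 in two dimensions.
b0, b = -1.0, np.array([1.0, 1.0])
points = np.array([[2.0, 2.0],    # above the line
                   [0.0, 0.0],    # below the line
                   [3.0, -1.0]])  # above the line
labels = hyperplane_classify(points, b0, b)   # -> [1., -1., 1.]
```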
Technical Detour 2
Executive summary
Classes that can be separated by a line are called linearly separable. In two dimensions, the linear separator is the usual straight line. The generalization of a linear separator to higher dimensions is called a separating hyperplane. A separating hyperplane is a function of input points. The sign of the function, with any given input point as its argument, can be used to classify that input point.
A major drawback of this way of classifying linearly separable classes is
that there is no unique separating hyperplane – in fact, we can find an infinite
number of separating hyperplanes.
objects like text messages, voice, video, etc. Of course, for similarities between objects that are easy to quantify, such as metric variables like income or weight, the kernel will reduce to the familiar distance between them.
Another power of kernels derives from the fact that they allow us to redefine nonlinear operations and functions as linear operations and functions in another (possibly higher dimensional) space. Since linear operations and linear functions are easier to manipulate mathematically, and interpreting them to gain substantive insights is more straightforward, this is a major benefit of kernels. We will now provide a more formal treatment of both the "similarity measuring" and "nonlinear mapping" properties of kernels.
² To be precise, the inner product of two vectors is proportional to the cosine of the angle between them.
Fig. 4.5. The Inner Product Is a Measure of the Angle between Two
Vectors.
Now, not all input objects of interest in machine learning have a straightforward representation in terms of vectors. One may be interested in similarity measures for more general objects like strings, sentence structures, documents, trees, etc. The concept of kernels can be used in these more general cases too. For instance, a very popular kernel used for measuring the similarity of two documents is based on the idea of using the cosine as a similarity measure, and, therefore, uses the inner product. Some details of the above discussion are in "Technical detour 3."
Technical Detour 3
We will now present a simple algorithm that can accomplish binary classification using only inner products, i.e., kernels, in the case where the input objects can be represented as vectors. Consider N data points (xi, yi), i = 1,…, N, where the input vectors xi = (xi1, xi2) lie in two-dimensional space and yi is a binary response variable with values "+1" or "−1." A subset of the N points have responses with yi = +1 values, and the rest have responses with yi = −1 values.
In Fig. 4.6, let the mean over inputs xi for the response class yi = +1 be denoted by x̄+ and the mean over inputs xi for the response class yi = −1 be denoted by x̄−. In other words, these vectors are the centroids of the "+" and "−" classes. The point halfway between the class means is xc = (x̄+ + x̄−)/2.
Consider the problem of classifying a new point x = (x1, x2) in either class "+" or class "−." The most intuitive way to do so is to assign this new point to the class whose mean is closer to that point. In geometrical terms, as can be seen from the figure above, this is equivalent to classifying it in class "+" if the angle between the vectors x − xc and x̄+ − x̄− is less than 90°, and to classify it in class "−" if this angle is greater than 90°. These vectors are the two bold directed arrows in the figure. Therefore, geometrically speaking, the classification can be done simply by comparing angles! Now, we note that angles are related to inner products. Therefore, the inner product can act as a classifier. Further, since the cosine of an angle greater than 90° is negative and the cosine of an angle less than 90° is positive, we can use just the sign of the inner product for classification. Specifically, it can be shown that the classifier is:

y = sign(⟨x − xc, x̄+ − x̄−⟩)   (4.3)
Finally, we note that the mean vectors x̄+ and x̄− together involve all the data points in the training data set. This is so because the vector x̄+, being the centroid of the "+" class, is computed using all data points in that class. Similarly, the vector x̄− is computed using all data points in the "−" class. Hence, the inner product-based classifier involves the entire training data, which implies that the computational complexity of SVM increases proportionately with the size of the training data set.
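The nearest-centroid rule of (4.3) can be sketched in a few lines; the small data set below is an illustrative assumption.

```python
import numpy as np

# Classifier (4.3): assign x by the sign of the inner product between
# (x - x_c) and the difference of the class means.
def centroid_classify(x, X_pos, X_neg):
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)   # class centroids
    x_c = (m_pos + m_neg) / 2            # halfway point between the means
    return np.sign(np.dot(x - x_c, m_pos - m_neg))

X_pos = np.array([[2.0, 2.0], [3.0, 3.0]])   # "+" class (illustrative)
X_neg = np.array([[0.0, 0.0], [1.0, 0.0]])   # "-" class (illustrative)
label = centroid_classify(np.array([2.5, 2.5]), X_pos, X_neg)   # -> 1.0
```

Note that both class means enter the rule, so every training point participates in each prediction, as discussed above.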
As noted in Section 3, the separating hyperplane that classifies two linearly separable classes is not unique. A fundamental result in the field of SVMs shows that the unique optimal separating hyperplane can be constructed using only a subset of the training points, the so-called support vectors.
Technical Detour 4
The map φ(·) transforms objects from the input space, denoted by X, to an entity called the feature space, denoted by F. Thus, if x is in X, then φ(x) is in F. This feature space is usually of a higher dimension than the input space and is such that it allows computation of inner products of two vectors φ(x) and φ(x′) that lie in it.
We will provide a simple intuition for how kernels perform the task of incorporating nonlinearities. Consider a two-dimensional input space. Thus, each input point x has coordinates x = (x1, x2). By definition, the inner product of two input points x and x′ in input space, given by ⟨x, x′⟩, is linear in x. Therefore, it involves only linear terms of the coordinates x1 and x2. Now consider the quadratic expression given by the square of the inner product: ⟨x, x′⟩². One can easily guess that, in the input space, this nonlinear expression will involve squares and product terms like x1², x2², and x1x2. Define a three-dimensional feature space whose axes correspond to these three terms involving squares and products of x1 and x2. The critical thing to realize is that if the axes of a 3-D feature space are these squares and products of x1 and x2, then the expression ⟨x, x′⟩² is just a linear combination of these axes. That is, ⟨x, x′⟩² is linear in the feature space. Finally, let φ(·) be a map which takes the 2-D input vector x = (x1, x2) and maps it to a 3-D vector involving squares and products of x1 and x2. It can be shown that ⟨x, x′⟩², a nonlinear function in the input space, equals the inner product ⟨φ(x), φ(x′)⟩, which is linear in the feature space.
Technical Detour 5
Executive summary
Kernels are widely used in machine learning. They play two important roles.
First, a kernel is a generalized measure of similarity between two objects – not
just vectors, but strings, sentence structures, documents, trees etc. This
property of a kernel is due to the fact that it is a generalization of an inner
product which is a measure of the angle between two vectors. The angle
between two vectors captures the similarity between them.
Second, a kernel allows us to transform nonlinear functions of vectors in a
lower dimensional input space into linear functions in a higher dimensional
feature space. This task is accomplished by using a nonlinear map that takes
vectors in input space to vectors in feature space. Using this map, a nonlinear
function of vectors in the input space is expressed as an inner product in
feature space. Linearity in feature space follows because inner products are
linear.
The idea that two nonlinearly separable classes of points in input space can be
made linearly separable in feature space, once the input points are mapped to
feature space by a nonlinear map φ, is a fundamental concept in SVM. Therefore,
to fix ideas, we present a concrete illustration of this phenomenon.
Illustration 4.1: Nonlinear maps transform nonlinearly separable classes to
linearly separable classes in feature space.
Consider two classes of points in two-dimensional input space (X1, X2).
• The four points in the set {(1, 1); (1, −1); (−1, −1); (−1, 1)} are labeled as +1.
• The four points in the set {(0.5, 0.5); (0.5, −0.5); (−0.5, −0.5); (−0.5, 0.5)} are
labeled as −1.
As seen in the plot below, clearly these two classes are not linearly separable in
input space (Fig. 4.7).
96 Machine Learning and Artificial Intelligence in Marketing and Sales
Fig. 4.7. Two Classes “+” and “−” Are Not Linearly Separable in
Input Space.
Now, there exists a nonlinear map φ(x1, x2) such that when the input points
(vectors) are mapped into feature space, then these points are linearly separable in
feature space.³ This nonlinear map takes vectors in two-dimensional input space
and maps them to two-dimensional feature space. Suppose we denote the axes of
the feature space as W1 and W2.
The input points mapped to feature space are: φ(1, 1) = (1, 1); φ(1, −1) = (5, 3);
φ(−1, −1) = (3, 3); φ(−1, 1) = (3, 5) for the “+” points. The “−” points are
unaffected by this mapping. The transformed points in feature space are plotted in Fig. 4.8.
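The separability of the mapped points can be checked directly. The Python sketch below tests every point against one candidate separating line in feature space, −4 + 3W1 + 3W2 = 0 (this particular line is, in fact, the optimal separating hyperplane constructed later in this chapter):

```python
# "+" points after the nonlinear map; "-" points are unchanged by it.
plus_mapped = [(1, 1), (5, 3), (3, 3), (3, 5)]
minus = [(0.5, 0.5), (0.5, -0.5), (-0.5, -0.5), (-0.5, 0.5)]

def side(w):
    """Which side of the line -4 + 3*W1 + 3*W2 = 0 a feature-space
    point falls on: +1 above, -1 below."""
    w1, w2 = w
    return 1 if -4 + 3 * w1 + 3 * w2 > 0 else -1

print(all(side(w) == 1 for w in plus_mapped))   # True
print(all(side(w) == -1 for w in minus))        # True
```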
Just by “eyeballing” the figure above (Fig. 4.8), one can see that the two classes
are linearly separable once they have been transformed by the nonlinear mapping
φ and mapped to the feature space. It is important to note that there could be
many functions that transform nonlinearly separable points so that they become
linearly separable in a different space but not all of these functions are kernels.
For a function to be a kernel, it has to satisfy some special properties, but these
are beyond the scope of this book. Of course, finding the right nonlinear map, or
equivalently, an appropriate kernel, is a nontrivial problem. Moreover, different
kernels can have different feature spaces, even to the point of having feature
spaces of different dimensions. Despite the difficulty of finding the appropriate
kernel, in practice there are some commonly used kernels, and we will discuss
them later. While eyeballing the figure above shows that there could be many
hyperplanes that perform linear separation of the classes in feature space, con-
structing the optimal hyperplane requires more technical machinery. We will
revisit this illustration in Section 4 where we construct the optimal separating
hyperplane.
³ The interested reader can find the definition of the nonlinear map in (A4.25) under
“Worked-out Illustrations” at the end of this chapter.
The nonlinear map in Illustration 4.1 above did not change the number of
dimensions. The dimensions of the input space and feature spaces were the same.
However, in most practical applications, the nonlinear transformation maps the
input data into a feature space which has more dimensions than the input space.
Often the number of dimensions of the feature space greatly exceeds that of the
input space. To get an intuition of this, let us consider a simple illustration.
Illustration 4.2: Nonlinearly separable data in input space are separable in
higher dimensional feature space.
Consider the seven data points on a single-dimensional input space X1 shown in
the figure below. Clearly, the classes are not linearly separable in one-dimensional
space. Now consider the nonlinear map that transforms the one-dimensional
input space to two-dimensional feature space:
φ(x1) = (x1, x1²)
For instance, φ(−3) = (−3, 9). The other six points can be similarly mapped
into two dimensions. Let us denote the dimensions in feature space as W1 (≡ x1)
and W2 (≡ x1²). In two dimensions the seven points (xi1, xi1²) can be
plotted as in Fig. 4.9.
Now, the classes are linearly separable in the higher dimensional space. One
possible separating hyperplane is shown as the bold horizontal line in Fig. 4.9.
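The mechanics of this illustration can be reproduced in a few lines of Python. Since the original seven points appear only in the figure, the points below are illustrative stand-ins chosen to match the pattern (outer points in one class, inner points in the other), and the threshold W2 = 2.5 is likewise just one workable choice of horizontal separating line:

```python
def phi(x1):
    """Nonlinear map from 1-D input space to 2-D feature space."""
    return (x1, x1 * x1)

# Illustrative stand-ins for the seven points (not the book's exact data).
plus = [-3, -2, 2, 3]    # labeled +1
minus = [-1, 0, 1]       # labeled -1

# No single cut point separates the classes on the X1 axis, but the
# horizontal line W2 = 2.5 separates them perfectly in feature space.
print(all(phi(x)[1] > 2.5 for x in plus))    # True
print(all(phi(x)[1] <= 2.5 for x in minus))  # True
```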
In our discussion of a kernel as a generalized measure of similarity in Section
3.1, we mentioned that we can compute the similarity between two input objects x
and x9 even when the objects are not vectors. This is the case with input objects
that are strings, documents, voice recordings, X-ray images etc. In such cases too,
as long as one can use a (possibly nonlinear) transformation φ such that φ(.)
defines a legitimate kernel, then nonlinearly separable classes of these objects can
be linearly separated after transformation to a higher dimension feature space.
In the machine learning community, one often hears mention of the so-called
kernel trick. While there are many formal definitions of it, for application-oriented
readers it suffices to have an intuitive idea of what this concept is: the kernel
allows one to compute inner products of vectors in feature space directly from the
vectors in input space, without ever explicitly constructing the map φ or working
in the higher dimensional feature space.
We conclude the section on kernels by listing some of the most commonly used
kernels.
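For concreteness, the three kernels most often encountered in practice (linear, polynomial, and Gaussian/RBF) can be written out as plain Python functions. The parameter values (degree, offset c, gamma) are illustrative defaults, not prescriptions:

```python
import math

def linear_kernel(x, z):
    """K(x, z) = <x, z>."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, c=1.0):
    """K(x, z) = (c + <x, z>)^degree."""
    return (c + linear_kernel(x, z)) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2), the Gaussian kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (1.0, 2.0), (2.0, 1.0)
print(linear_kernel(x, z))      # 4.0
print(polynomial_kernel(x, z))  # 25.0
print(rbf_kernel(x, x))         # 1.0: a point is maximally similar to itself
```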
Technical Detour 6
Fig. 4.11. The Margin of the “+” Points from the Hyperplane (Bold
Line).
The signed distance is positive for points in the “+” category and negative for
points in the “−” category. Also, the class label for the “+” category is yi = +1,
and the class label for the “−” category is yi = −1. It can be shown that these two
facts together imply that for correctly classified points, the margin is just the
geometrical distance of the separating hyperplane from that point, and for
incorrectly classified points, the margin is the negative of the geometrical distance.
Technical Detour 7
Just as the goal of estimation in standard Ordinary Least Squares (OLS)
regression is to estimate the coefficients of the linear regression equation, here the
goal is to estimate the coefficients of the hyperplane. The coefficients are chosen
so as to maximize the margin. It is noteworthy that
there are as many constraints as there are training data points since there is one
constraint corresponding to each data point in the training data.
The power of this formulation comes from the fact that the optimal separating
hyperplane can be expressed only in terms of inner products. Specifically, the
optimal separating hyperplane, defined for a generic point x, can be expressed as a
linear combination of terms like ⟨x, xi⟩ where xi, i = 1, …, N, are points in the
training data set. This may remind the reader of our discussion in Section 3.1
where we had sketched out a geometrical argument for how classification can be
achieved by using only inner products. It is satisfying that the geometrical
argument presented in that section is consistent with the more formal machinery
in the current section. The interested reader can look up the technical details
pertaining to the above discussion in “Technical detour 8” in the appendix at the
end of this chapter.
The constraint that all points have to be at least a distance M from the
hyperplane (in the optimization program (4.5)) is satisfied as an equality for the
nearest point(s). That is, the point closest to the hyperplane for a given class will
lie exactly on the margin. For example, in the figure below the point xa in the
“+” class lies exactly on the margin (Fig. 4.13).
Another interesting property of the solution to (4.5) is that the optimal sepa-
rating hyperplane is fully characterized only by the points (vectors) xi that lie on
the margin. These vectors xi that lie exactly on the margin, and which are the only
ones required to compute the optimal separating hyperplane, are called support
vectors. Input data points that do not lie exactly on the margin play no part in the
computation of the optimal separating hyperplane. In a previous paragraph in
this section, we had mentioned that the hyperplane can be constructed using inner
products of a generic point x with all points xi, i = 1, …, N, in the training data
set. The fact that the optimal separating hyperplane actually uses only a small
subset of the points in the training data set is an example of the sparsity property
of SVMs.
To get some intuition for this important property of SVMs suppose that, in the
context of churn modeling, we want to classify “nonchurners” and “churners” for
a bank. In the figure below, the nonchurners are represented by “+” and churners
are represented by “−.” We apply the labels yi = +1 to the former and yi = −1 to
the latter. Also, suppose the historical database of the bank has a set of p variables
that can be used as explanatory variables. These could include demographic,
psychographic, and behavioral variables pertaining to a customer. Thus, customer i
can be represented by the vector xi = (xi1, xi2, …, xip), and there are N customers
i = 1, …, N. By definition, the support vectors are the points closest to the
boundary for a given class – e.g., in the figure below, it is point xa for the “+”
class (Fig. 4.14).
Intuitively speaking, we expect that consumers who make the “hard choice” of
whether to churn or not would be the most important in determining what makes
a bank customer churn. This is because in choice situations one can really
understand why people make the choices they do by understanding the choice
processes of those who make such difficult trade-offs. As a practical matter, the
support vectors identify consumers that make hard choices. In our bank churn case,
these are the customers who are almost indifferent, that is, are closest to the fence,
between churning and not churning. In other words, a point like xa represents
nonchurners who are most at risk to churn. From the point of view of interpre-
tation and practical ramifications, this is very important. As a managerial
implication, the firm’s retention efforts could be targeted at such customers. It is
worth noting that the SVM methodology automatically identifies such support
vectors, and this is standard output of most software packages that perform SVM
analysis. Furthermore, moving away from the margin has an interesting
interpretation.
In Fig. 4.15, as one moves away from the margin toward the
“nonchurners” class, the likelihood of churn decreases. This is because this
represents a move away from the “churners” class. Similarly, as one moves away
from the margin toward the “churners” class, the likelihood of churn increases.
This is because this represents a move away from the “nonchurners” class.
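To see this "standard output" concretely, the sketch below fits a linear SVM with scikit-learn (assumed installed) on a toy churn-style data set; the four customers and their two explanatory variables are invented for illustration. The fitted object exposes the support vectors, i.e., the customers closest to the fence, directly:

```python
from sklearn.svm import SVC

# Toy data: two explanatory variables per customer (values illustrative).
X = [[1.0, 1.0], [1.5, 0.5], [4.0, 4.0], [4.5, 3.5]]
y = [-1, -1, 1, 1]                 # -1 = churner, +1 = nonchurner

clf = SVC(kernel="linear", C=1e6)  # a large C approximates a hard margin
clf.fit(X, y)

# Standard output: the support vectors, the customers "on the fence."
print(clf.support_vectors_)
print(clf.predict([[1.2, 0.8], [4.2, 3.8]]))  # churner, nonchurner
```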
Technical Detour 8
Executive summary
The concept of a margin is very useful since it allows one to find a unique
separating hyperplane between two classes. The margin of a point to the
hyperplane is the distance from that point to the hyperplane if the point is
correctly classified. For a set of training data points, the margin is the min-
imum distance from this set of points to the hyperplane.
For linearly separable classes, the optimal separating hyperplane is the one
that maximizes the margin M, subject to the constraint that all points are at
least a distance M away from the hyperplane. The SVM technique with the
maximum margin approach has two powerful properties: (1) The computa-
tion of the optimal separating hyperplane requires computing only inner
products, (2) The optimal separating hyperplane is specified only by the
support vectors. Because the support vectors are a subset of all the training
vectors, the SVM is said to have the sparsity property. Geometrically, points that lie
exactly on the margin are support vectors. From a practical point of view,
support vectors correspond to consumers who make “hard choices” – i.e.,
they are almost indifferent between belonging to one or the other class.
• The four points in the set {(2, 0); (3, 1); (3, −1); (4, 0)} are labeled as +1.
• The four points in the set {(1, 0); (0, 1); (0, −1); (−1, 0)} are labeled as −1.
hyperplane using only inner products and the support vectors. We first note that,
just by a visual inspection of the plotted points, the support vectors are: s1 = (1, 0)
for the “−” class and s2 = (2, 0) for the “+” class. These are circled in the figure
below (Fig. 4.17).
The target values corresponding to these points are y1 = −1 and y2 = +1. Let
us consider a generic point x = (x1, x2). Under “Illustration 4.3” in the “Worked-
out Illustrations” section, we show that the optimal separating hyperplane is
given by:
f(x) = −3 − 2⟨x, s1⟩ + 2⟨x, s2⟩
The critical aspect to notice is that the optimal separating hyperplane can be
specified by a linear combination of the inner products of the vector x with the
Fig. 4.17. Support Vectors from the Two Classes Are Circled.
two support vectors. Expanding the inner products in terms of the components of
the vectors, the optimal separating hyperplane in input space is:
f(x1, x2) = −3 + 2x1
We can plot the optimal separating hyperplane. Points (x1, x2) that lie exactly
on the hyperplane are characterized by f(x1, x2) = 0. Therefore, the line −3 + 2x1
= 0 passes through the point (3/2, 0) and is perpendicular to the X1 axis. This
hyperplane is shown below (bold vertical line) (Fig. 4.18):
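The algebra of this illustration is easy to verify in Python. The sketch below encodes the support-vector form of the hyperplane and checks that it coincides with the expanded form −3 + 2x1, and that each support vector lies exactly on its margin (f(s) = y):

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

s1, s2 = (1, 0), (2, 0)   # support vectors for the "-" and "+" classes

def f(x):
    """Optimal separating hyperplane, written using only inner products
    with the two support vectors."""
    return -3 - 2 * inner(x, s1) + 2 * inner(x, s2)

# The support-vector form agrees with the expanded form -3 + 2*x1 ...
for x in [(0.0, 0.0), (2.5, -1.0), (3.0, 1.0)]:
    assert f(x) == -3 + 2 * x[0]

# ... and each support vector lies exactly on its margin: f(s) = y.
print(f(s1), f(s2))  # -1 1
```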
The reader will note that in the illustration above we based our calculations on
being able to identify the support vectors. In the simple setting above, the support
vectors are easy to identify by eyeballing a plot of the data. In general, with many
data points, as is common in real machine learning applications, it may be
impossible to identify the support vectors merely by plotting the data. Then, we
will have to formally solve the program in (4.5) to obtain the optimal separating
hyperplane. In these cases, for computational purposes, it is often easier to solve
an equivalent problem called the Wolfe dual program. This program yields the
optimal λi, and the points (vectors) xi corresponding to λi > 0 are the support
vectors. We will not present these details here, but the interested reader will find them
in the appendix.
Technical Detour 9
When the two classes overlap, there is no feasible solution in input space that
achieves perfect separation using a linear classifier. Consider for instance the
situation portrayed in the figure below:
The points denoted 1 and 2 belong to class “−,” and we can see that the two
classes are not linearly separable. The bold dashed lines on either side of the
hyperplane (bold solid line) are the margin lines. We will use the term “margin
line(s)” for the line(s) at a distance M from the hyperplane, and the distance itself
will be called the “margin width.” Consider point 1. This point is on the wrong
side of the bold dashed margin line (the line at the bottom) and at a distance
of Mξ1 from it.⁴ The variable ξ1, called a slack variable, quantifies the
extent of misclassification of point 1. Since distances from the hyperplane
define margin lines, in essence, we have a “new” margin line (the dotted line)
at a distance M(1 − ξ1) from the hyperplane (rather than at a distance of M). Since
ξ1 > 0, the new margin width is smaller than M. To the extent that a larger
margin width results in a better separation, the new margin line will achieve
weaker separation. The bold dashed lines on either side of the hyperplane are
often called soft margins since they can be violated by some points. Notice that,
with respect to the bold dashed margin line, point 1 is not misclassified – even
though it is on the wrong side of this margin line, it is on the correct side of the
hyperplane. On the other hand, point 2 is misclassified. The distance of point 2
from the bold dashed margin, Mξ2, is greater than M. Thus, its distance from the
hyperplane is M(1 − ξ2), and this is a negative quantity. One can see that, in this
construction, misclassification of point i occurs when ξi > 1. At this point, it will
be useful to relate the values of ξi to the misclassification of point i.
From the above, it is clear how the slack variable ξi quantifies the extent of
misclassification of point i, and the higher its value, the more severe the
misclassification. Thus, the sum Σi ξi quantifies the total amount of misclassifi-
cation across all data points.
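One way to express this construction in code, under the convention used above (distances stated as proportions of the margin width M), is the following Python sketch; the signed distance is positive on a point's correct side of the hyperplane and negative on the wrong side:

```python
def slack(y, signed_distance, M):
    """Slack xi for a point with label y (+1 or -1): zero when the point
    is at least M from the hyperplane on its correct side; between 0 and 1
    inside the margin; greater than 1 once the point is misclassified."""
    return max(0.0, 1.0 - y * signed_distance / M)

M = 1.0
print(slack(+1, 1.5, M))   # 0.0  outside the margin, correctly classified
print(slack(+1, 0.4, M))   # 0.6  inside the margin, still on the correct side
print(slack(+1, -0.5, M))  # 1.5  wrong side of the hyperplane: misclassified
```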
The major difference in the nonseparable case compared to the separable case
discussed in Section 4.2 is how the constraints in the program for the optimal
separating hyperplane are handled. In the program for the optimal separating
hyperplane in (4.5), we imposed the constraint that all points are at least a
⁴ In our construction, the distance from the margin line for points on the wrong side of it is
stated as proportional to the margin width M. Equivalently, the distance could be stated in
absolute terms. For technical reasons, having to do with optimization, the former approach
is often adopted.
distance M away from the hyperplane. Here, we relax this constraint using the
slack variables. We instead impose the constraint that all points are at least a
distance M(1 − ξi) away from the hyperplane. This is the soft margin referred to
above. Additionally, in order to control the extent of misclassification, we also
impose the constraint that the sum Σi ξi over all N data points should not be
larger than a prespecified constant C. This constant is determined as a tuning
parameter in SVM. Since misclassification of point i corresponds to ξi > 1, and
because the slacks sum to at most C, at most C points can be misclassified.
Therefore, a large value of C allows for more misclassification of the training
data. This, in turn, may allow for better classification of test data, since the model
does not hard-fit the training data. In this way, the constant C is related to the
bias–variance trade-off and is usually selected via cross-validation.
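A typical cross-validation search for this tuning parameter looks like the following scikit-learn sketch (library assumed installed, data simulated). One caution: scikit-learn parameterizes the problem through a penalty on the slack variables, so its C plays the inverse role of the cap C in the text; a small scikit-learn C tolerates more training misclassification.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Simulated two-class data standing in for, e.g., churners vs. nonchurners.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Five-fold cross-validation over a grid of candidate C values.
search = GridSearchCV(SVC(kernel="rbf"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print(search.best_params_)   # the C with the best cross-validated accuracy
```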
The support vector classifier is a natural extension of the maximal margin
classifier. Like the latter, it uses a linear boundary, but it allows for some
misclassification when applied to nonlinearly separable classes. Conceptually, it
does this by using slack variables and then controlling the amount of misclassi-
fication using a constraint. As a rough analogy, consider the idea behind fitting a
linear regression to a scatter of points. Since the scatter of data points are usually
noncollinear, a line cannot pass exactly through all the points. Thus, the modeler
allows errors in the fit and then estimates that regression line which minimizes the
total error. The optimization program for a support vector classifier is:
Objective function: Choose β0, β1, …, βp, and ξ1, …, ξN to maximize the margin M
Constraint 1: All points are at least a distance M(1 − ξi) away from the hyperplane
Constraint 2: The sum Σi ξi ≤ C (4.6)
Note that there is some similarity between the nonseparable case presented
here and the separable case in (4.5). However, there are three salient
differences: (1) slack variables ji have been incorporated, and as the objective
function shows, they are also decision variables along with the b coefficients, (2)
the right-hand side of the first constraint now incorporates soft margins that allow
some misclassification, and (3) the second constraint has been added which caps
the extent of misclassification. Conceptually, the goal of the estimation is to
maximize the margin subject to capping the extent of misclassification using the
slack variables. Obviously, if the cap C is chosen to be too small compared to the
actual nonseparability in the training data, the program will be infeasible. See
the appendix for a formal statement of the optimization program for the non-
separable case.
Technical Detour 10
Using the support vector classifier, in the nonseparable case too the optimal
separating hyperplane is characterized only by those xi for which the corre-
sponding λi > 0. These are the support vectors. Interestingly, unlike the separable
case, in the nonseparable case there are two types of support vectors: those that
lie exactly on the soft margin, and those that violate it (points with ξi > 0).
In the nonseparable case with kernels, a nonlinear map φ transforms
vectors x in input space into vectors φ(x) in feature space. Thus, the optimal
separating hyperplane that results from solving this program is also expressed in
terms of kernels. These kernels give linear separating boundaries in feature space,
which result in nonlinear boundaries in input space.
Technical Detour 11
Executive summary
The support vector classifier is a natural generalization of the maximal margin
classifier to nonseparable classes. Unlike the case of linearly separable classes,
for nonlinearly separable classes we use soft margins which can be violated by
some points. In this case, the optimal separating hyperplane is the one that
maximizes the margin M, subject to the constraint that all points are at least a
certain distance away from the hyperplane – that distance is a fraction of the
margin width M.
Intuitively, the support vector classifier allows some misclassification, but
then caps the total amount of misclassification allowed. It quantifies the
extent of misclassification of an individual point by using slack variables. The
total amount of misclassification across all training data points is capped
using a user-defined constant. The support vector classifier also shares the
desirable properties of the maximal margin classifier, in that the optimal
separating hyperplane requires computing only inner products, and it can be
specified using only the support vectors.
The SVM is a powerful method for classifying nonseparable classes.
Intuitively, the SVM methodology uses kernels to transform input data to a
higher dimensional feature space where the data become linearly separable,
and then computes the linear optimal separating hyperplane in feature space
in terms of the kernels. Transformed back to input space, the separating
boundary is nonlinear, and, thus, it achieves classification of nonlinearly
separable classes in input space.
We end this section with two illustrations. The first one computes the optimal
separating hyperplane for the data in Illustration 4.1 in Section 3.2 where we had
two classes that are not linearly separable. This illustration demonstrates how
kernels and support vectors are used in the construction of a separating hyper-
plane for such data. The first illustration is such that support vectors can be
identified by plotting and “eyeballing” the data. In the second illustration, support
vectors are not easily identified, and we demonstrate the use of the formal
machinery involving solving the Wolfe dual program.
Illustration 4.4: Constructing optimal separating hyperplane for nonlinearly
separable classes using kernels and support vectors.
Recall from Illustration 4.1 the two classes of points in two-dimensional input
space (X1, X2). For the reader’s convenience, we will show the points comprising
the two classes:
• The four points in the set {(1, 1); (1, −1); (−1, −1); (−1, 1)} are labeled as +1.
• The four points in the set {(0.5, 0.5); (0.5, −0.5); (−0.5, −0.5); (−0.5, 0.5)} are
labeled as −1.
These classes are not linearly separable in input space, but they are linearly
separable in feature space once they are transformed by the mapping φ which is
defined in Illustration 4.1. The goal now is to construct the separating hyperplane
in feature space using kernels. We start by noting that the only points that enter
the calculation of the separating hyperplane are the support vectors. In the figure
below (Fig. 4.20), the two support vectors are circled. They are: s1 = (0.5, 0.5) for
the “−” class and s2 = (1, 1) for the “+” class.
The target values corresponding to these points are y1 = −1 and y2 = +1. We
apply the two support vectors as arguments of the function f(x) for the separating
hyperplane (the interested reader can see Eq. A4.20). Then, corresponding to s1
and s2, we have two equations which are solved simultaneously to obtain the
coefficients of the optimal separating hyperplane.
Let us consider a generic point x = (x1, x2). In “Illustration 4.4” under the
“Worked-out Illustrations” section, we show that the optimal separating hyper-
plane is given by:
f(x) = −4 − 6⟨φ(x), φ(s1)⟩ + 6⟨φ(x), φ(s2)⟩
It is important to note that this hyperplane lies in the feature space with W1 and
W2 as axes and, since it is expressed in terms of inner products which are linear, is
Fig. 4.20. Support Vectors from the Two Classes Are Circled.
itself linear in feature space. For the generic input vector (x1, x2), let us denote its
image under φ as φ(x1, x2) = (w1, w2). Evaluating the inner products where the
vectors in feature space are φ(x) = (w1, w2), φ(s1) = (0.5, 0.5), and φ(s2) = (1, 1),
the previous expression for the hyperplane in feature space becomes:
f(x1, x2) = −4 + 3w1 + 3w2
We can plot the optimal separating hyperplane. Points (x1, x2) that lie exactly
on the hyperplane are characterized by f(x1, x2) = 0. Therefore, the line −4 + 3w1
+ 3w2 = 0 passes through the points (0, 4/3) and (4/3, 0). This hyperplane is
shown below (bold solid line) (Fig. 4.21).
Since w1 and w2 are nonlinear functions of the input vector (x1, x2), the
hyperplane, when transformed back to input space, defines a nonlinear
boundary which classifies two nonlinearly separable classes. So, how do we use
our optimal separating hyperplane to classify a new point in input space?
Consider the point (0.25, 0.75) in input space. In order to apply our optimal
separating hyperplane, we first have to transform this input point to feature space
using the mapping φ. We have φ(0.25, 0.75) = (1.75, 2.25). Then, f(0.25, 0.75) =
−4 + 3 × 1.75 + 3 × 2.25 = 8 > 0. So, this input point is classified as belonging to the
“+” class. Is this classification reasonable? To speak to the reasonableness of any
classification, we note that, first, no model is perfect and there may be classifi-
cation errors. The optimal separating hyperplane is based on all the training data,
and the model is designed to minimize overall misclassification across all data
points. Second, the kernel itself is a modeler-defined choice, and the classification
accuracy of any separating hyperplane is conditional on the specific choice of the
kernel.
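The classification step just described reduces to a few lines of Python. Since the map φ itself is defined in the chapter appendix, the sketch starts from the already-mapped coordinates φ(0.25, 0.75) = (1.75, 2.25):

```python
def f(w):
    """Optimal separating hyperplane in feature space (axes W1, W2)."""
    w1, w2 = w
    return -4 + 3 * w1 + 3 * w2

w = (1.75, 2.25)                  # the point (0.25, 0.75) after mapping
label = "+" if f(w) > 0 else "-"
print(f(w), label)  # 8.0 +
```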
Finally, we present an illustration where the support vectors cannot be iden-
tified merely by “eyeballing” a plot of the data from the two classes. In such a
situation, we will demonstrate how the support vectors can be identified and the
optimal separating hyperplane constructed by formally solving the full SVM
program.
Illustration 4.5: Identifying support vectors to compute optimal separating
hyperplane for nonlinearly separable classes.
We present an example from Cui and Curry (2005) to show how the use of
kernels can efficiently achieve proper classification in the “XOR problem,” where
it is well known that no linear classifier exists. The XOR problem has four points
ai, i = 1, 2, 3, 4, in two-dimensional space with axes X1 and X2. The target
variable yi corresponding to each of the four points is either +1 or −1. A table
with the data for the XOR problem is below (Fig. 4.22).
In the plot of the data below, a “+1” point is represented as a circle and a “−1”
point is represented as a square for ease of visualization (Fig. 4.23).
Fig. 4.22. The Data for the XOR Problem. Source: Cui and Curry
(2005).
Fig. 4.23. Plot of XOR Points. Source: Cui and Curry (2005).
One can easily check that this classifier, which is nonlinear in input space,
correctly classifies the squares and circles as can be seen in the plot of the classifier
below in input space (Fig. 4.24).
The decision boundary is in the shape of a cross; that is, it is the combination
of both the sloped bold lines. This is because the decision boundary as given in the
previous equation is x1² − x2² = 0. This reduces to (x1 − x2)(x1 + x2) = 0, so that
x1 = x2 and x1 = −x2.
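The cross-shaped boundary is easy to exercise in code. The four points below are illustrative stand-ins consistent with that boundary (the original data table from Cui and Curry (2005) is in Fig. 4.22, not reproduced here):

```python
def classify(x1, x2):
    """The kernel-derived XOR classifier: sign(x1^2 - x2^2). Its decision
    boundary is the cross given by x1 = x2 and x1 = -x2."""
    return 1 if x1 * x1 - x2 * x2 > 0 else -1

# Illustrative points on either side of the cross-shaped boundary.
circles = [(1, 0), (-1, 0)]    # labeled +1
squares = [(0, 1), (0, -1)]    # labeled -1

print(all(classify(*p) == 1 for p in circles))   # True
print(all(classify(*p) == -1 for p in squares))  # True
```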
The intent of this section is not to provide an exhaustive record of all possible
applications of SVMs to marketing and sales, but to give a sense of some of the
most common applications:
• Market segmentation;
• Target marketing;
• Predictive modeling/Response modeling;
• Sales/demand forecasting;
• Churn prediction;
• Time series analysis in marketing;
• Text mining and analysis.
In SVM-based conjoint analysis, the statistical counterpart of “hard choices” turns out to be the support vectors of the SVM, and
since the support vectors are automatically determined as part of the estimation
process, the “hard choices” that shape consumer preferences are also automati-
cally found by the SVM-based conjoint method. This is consistent with the
intuition that people’s preferences are really formed by the hard choices they have
to make rather than by obvious trade-offs, with the added advantage that there are
far fewer hard choices compared to the set of all choices that people make.
Cheung, James, Law, and Tsui (2000, 2003) have used SVM to mine customer
preferences for product recommendations. In light of the explosion in online
customer ratings and reviews in all product categories, businesses are very keen to
use them to recommend suitable products to potential customers. Recommen-
dation systems are usually of two types: (1) Content-based and (2) Collaborative-
based. Simply stated, content-based recommendation systems look at the match
between product attributes and other product information with customer inter-
ests, while collaborative systems leverage the preference ratings from the other
customers for recommending products/services to the focal customer. In theory, it
is straightforward to extract product content from web pages owing to the vast
amount of product information available nowadays, but translating the
content into a manageable number of attributes that can be processed by statistical
methods is daunting because of the large number of resulting attributes. For
example, consider the very common application of movie recommendations. One
important aspect of a movie is its “cast.” However, this is a multivalued object
and the usual way it is coded is via binary variables like “cast includes Keanu
Reeves,” “cast includes Meryl Streep” etc., each of which takes 0/1 values. This
scheme explodes the dimensionality of the set of attributes. SVM is known to have
superior performance in the case of high-dimensional data sets, and this property
can be exploited for content-based recommendation systems. Using an Internet
Movie Database (IMDB), Cheung et al. (2003) show that SVM has a much
superior performance compared to Naïve Bayes and k-Nearest Neighbor in
calibrating recommendation systems for movies.
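The dimensionality explosion from a multivalued attribute like "cast" can be seen in a few lines. The sketch below uses scikit-learn's MultiLabelBinarizer (library assumed installed; the movies and names are invented) to turn cast lists into the 0/1 columns described above:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# One cast list per movie (illustrative data).
casts = [
    {"Keanu Reeves", "Carrie-Anne Moss"},
    {"Meryl Streep"},
    {"Keanu Reeves", "Meryl Streep"},
]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(casts)   # one 0/1 column per distinct cast member
print(mlb.classes_)
print(X.shape)  # (3, 3) here; many thousands of columns on real movie data
```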
Kim, Lee, and Cho (2008) have used support vector regression for response
modeling in direct marketing where they use a customer database to estimate the
amount of purchases made. As opposed to the SVM methodology that was
developed originally for classification, as in the consumer choice contexts
described in the preceding paragraphs, they have used a variant which is
appropriate for regression – the support vector regression. The authors have
shown how a more sophisticated sampling method proposed by them can yield
better results and, at the same time, can reduce the estimation load that results
from using the entire data set. This addresses a drawback of SVM, which places a
heavy computational burden on estimation when the training data set is very
large. Since the use of larger
data sets is becoming increasingly common, this method can be fruitfully used to
efficiently train SVMs. In a similar vein, Shin and Cho (2006) have used SVM for
response modeling in direct marketing using the Direct Marketing Educational
Foundation (DMEF) data set. Using the classification based SVM, they again
demonstrate the practical usefulness of a specific sampling procedure to train an
SVM with a large training data set. In the domain of response modeling, a novel
118 Machine Learning and Artificial Intelligence in Marketing and Sales
study benchmarks the SVM with smart parameter selection relative to logistic regression, neural
network, and classic (without SVMauc) SVM, and finds the superior performance
of the proposed model. The authors further note that their SVM-based model shows better
generalization performance when applied to noisy, imbalanced, and nonlinear
marketing data, which most firms are increasingly capturing in their CRM systems.
Coussement and Van den Poel (2008) have studied churn prediction in a business-to-consumer (B2C) setting, specifically for subscription services. They too find that
SVM performs better than logistic regression, especially when the former is
trained by a suitable parameter selection technique. However, they find that a
Random Forest performs better in their specific data set. Huang, Chen, Hsu,
Chen, and Wu (2004) have used SVM for credit rating prediction using data
sets from credit rating organizations in Taiwan (the Taiwan Ratings
Corporation) and the United States. They find that SVM has better predictive
accuracy compared to a neural network trained on the same data set and both are
much superior to logistic regression.
Another field in which the SVM methodology has shown superior performance
is in analyzing time-series data, especially chaotic time series. In an early paper,
Mukherjee, Osuna, and Girosi (1997) benchmarked the SVM-based time series
predictions with several alternatives and found that SVM dominated its alter-
natives in terms of prediction accuracy. Most importantly, the authors generated
chaotic time-series data and employed sophisticated nonlinear analysis techniques
as alternatives to benchmark their SVM-based approach. The chaotic time-series
they considered are the Mackey-Glass time series, the Ikeda map, and the Lorenz
time series (Mukherjee et al., 1997). The alternative techniques used to analyze
these chaotic time-series are taken from Casdagli (1989) and include nonlinear
approximation techniques such as polynomial, rational, local polynomial, RBFs,
and neural networks. The authors report that, “The SVM performs better than
the approaches presented in [1],” where [1] refers to the alternative approaches.
This research provides very strong evidence for the usefulness of SVM in
analyzing time-series which are otherwise difficult to handle using more traditional
methods. Sapankevych and Sankar (2009) provide a wide-ranging survey of
applications of SVM to time-series predictions. Their survey cites research from
many different fields that has employed SVM for time-series analysis, and we will
not repeat those citations here.
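The windowed-prediction setup used in this literature can be sketched briefly. The series below is a simulated noisy sine wave standing in for the chaotic series (Mackey-Glass, Ikeda, Lorenz) discussed above, and scikit-learn's SVR is assumed as the implementation; none of the numbers come from the cited studies.

```python
# A minimal sketch of SVM-based time-series prediction with a sliding window:
# predict y[t] from the previous 10 observations. Data are simulated.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
t = np.arange(300)
series = np.sin(0.1 * t) + 0.05 * rng.normal(size=t.size)

window = 10                        # embedding dimension of the window
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

model = SVR(kernel="rbf", C=10.0, gamma=0.5).fit(X[:250], y[:250])
mse = np.mean((model.predict(X[250:]) - y[250:]) ** 2)
print("out-of-sample MSE:", round(mse, 4))
```

The sliding window turns a one-dimensional series into a standard regression data set, after which support vector regression applies unchanged.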
Text analysis has become very popular as a means of obtaining customer
intelligence and other marketing intelligence. This radical growth of textual
analysis is an offshoot of the large volume of text that is generated by various
marketing and sales efforts of a firm, and the responses to these efforts by its
customers (Netzer, Feldman, Goldenberg, & Fresko, 2012; Tirunillai & Tellis,
2014). SVMs have been extensively used in text classification and many other text
analysis techniques (Chapter 6 of Aggarwal & Zhai, 2012; Chau & Chen, 2008;
Joachims, 1998; Zhang, Dang, Chen, Thurmond, & Larson, 2009, etc.). Li and Wu
(2010) have used SVM for text mining and sentiment analysis for online forums
by using data from 31 different sports-related topic forums spanning a wide range
of topics and 220,053 posts. They first algorithmically determine the emotional
polarity of a text, that is, sentiment analysis, and obtain a value for each piece of
text. They then combine this algorithm with SVM in an unsupervised text mining
approach to form clusters. The goal is to group the topic forums into clusters
where the center of each cluster would represent a hotspot forum. They report
experimental results which confirm that SVM performs the clustering task very
well, with the top 4 hotspot forums identified by SVM being similar to results
obtained from using k-means clustering.
A very common application area within text analysis is text classification
which involves automatically categorizing textual documents into topical cate-
gories such that information can then be easily searched. A persistent problem
with text classification is the overwhelming number of applications where the
training data set is very imbalanced. Consider, for example, the task of classifying
news articles as “interesting” or “not interesting” for a particular reader. The
standard means of doing this is to use the SVM as a binary classifier and then to
perform the multicategory classification task (it is multicategory since there are
many categories of news articles) by adopting a one-against-all learning strategy.
Clearly, there are many more training examples in the “not interesting” category.
There are many techniques that have been suggested to handle such imbalanced
training data when using SVM as the base classifier. In a wide-ranging empirical
study, Sun, Lim, and Liu (2008) discuss and compare the various methods of
handling such imbalanced training samples in the case of textual data. They find
that, using the area under the Precision-Recall Curve as the evaluation criterion
for model performance, the standard SVM with suitable adjustments of the
threshold may be the best performer.
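The threshold-adjustment idea can be sketched as follows. The "interesting"/"not interesting" data below are simulated (30 rare positives against 570 negatives), and scikit-learn's linear SVM is an assumed stand-in for the classifiers in the study; the specific quantile rule is an illustrative choice, not Sun, Lim, and Liu's procedure.

```python
# Sketch of threshold adjustment for an imbalanced binary task with a linear
# SVM: instead of the default decision threshold of 0, shift the threshold so
# the share of predicted positives matches the rare-class prior.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
n_pos, n_neg, p = 30, 570, 50
X = np.vstack([rng.normal(0.4, 1, (n_pos, p)),    # "interesting"
               rng.normal(0.0, 1, (n_neg, p))])   # "not interesting"
y = np.array([1] * n_pos + [-1] * n_neg)

clf = LinearSVC(C=1.0).fit(X, y)
scores = clf.decision_function(X)

threshold = np.quantile(scores, 1 - n_pos / len(y))  # match the class prior
pred = np.where(scores > threshold, 1, -1)
recall_pos = np.mean(pred[y == 1] == 1)
print("recall on the rare class:", recall_pos)
```

With the default threshold, the rare class is easily swamped; moving the threshold trades a few false positives for much better recall on the minority class.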
7. Case Studies
In this section, we will present a couple of case studies about the application of
SVMs in marketing. We describe the data sets and demonstrate the analyses done
on them. We also compare the strengths and weaknesses of SVMs against those of
the traditional econometric models.
As is the case for the other machine learning methods in this book, our goal is
for the applications-oriented reader to be able to see the details of some marketing
applications of SVMs. We will focus on understanding the business context, the
data set, the choice of predictors and dependent variable, visualization and
interpretation of results, and finally, communication of the results in a business-
relevant manner to other stakeholders. We will also provide the results of a
Support Vector Machines in Marketing and Sales 121
nonmachine learning benchmark model. This will allow the reader to clearly
contrast the findings from the SVM analysis against the findings from the
benchmark model.
5. See last para, page 21, of Corstjens and Gautschi (1983).
analysis, respondents are presented with different “product profiles,” which are
different combinations of attribute levels, and their choices are recorded. The
variables in our data set are:
(1) Choice (0 or 1)
(2) A1 (value of attribute 1)
(3) A2 (value of attribute 2)
Response Column: 1
Predictor columns: 2:3 (2 and 3)
Training Percentage: 80
Testing Percentage: 20
Number of folds for Cross-validation: 3
Kernel: Radial
Kernel Parameter: 0.5,1,1.5
Cost: 0.8
Number of times Averaging: 1
While the other inputs for a SVM are straightforward, three aspects require
clarification. First, we are required to choose a kernel. Here we chose the “radial”
kernel. Second, the radial kernel needs a user-specified kernel parameter – the
gamma parameter value. Intuitively, the gamma value governs how far
the influence of a particular training example reaches. When gamma is small, the
influence reaches far, and when it is large, the influence is close. We have chosen
three parameter values: 0.5, 1, 1.5. The idea is to compute SVMs with different
parameter values and then to select the best fitting model. This step can be
accomplished by writing simple code in all software programs. Third, we also
need to input a “Cost” value. This is a regularization parameter for SVM. In this
case, a SVM with kernel parameter of 1.5 has the best performance based on the
box plot of cross-validation errors.
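The tuning loop just described, fitting one SVM per candidate kernel parameter and keeping the best cross-validated model, can be sketched in a few lines. Here scikit-learn stands in for the "software programs" mentioned in the text, and the choice data are simulated under an assumed noncompensatory rule, so the selected gamma and PCC will not match the case study's.

```python
# Sketch of the SVM inputs listed above: radial kernel, kernel parameter
# (gamma) in {0.5, 1, 1.5}, cost 0.8, an 80/20 train/test split, and
# 3-fold cross-validation. The choice data are simulated.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(3)
A = rng.uniform(0, 4, size=(1000, 2))                     # attributes A1, A2
choice = ((A[:, 0] > 1.6) & (A[:, 1] > 1.6)).astype(int)  # noncompensatory rule

X_tr, X_te, y_tr, y_te = train_test_split(A, choice, test_size=0.2,
                                          random_state=0)
grid = GridSearchCV(SVC(kernel="rbf", C=0.8),
                    param_grid={"gamma": [0.5, 1.0, 1.5]}, cv=3)
grid.fit(X_tr, y_tr)                      # picks the best-fitting gamma by CV
print("best gamma:", grid.best_params_["gamma"])
print("test PCC:", grid.best_estimator_.score(X_te, y_te))
```

The grid search automates exactly the "compute SVMs with different parameter values and select the best fitting model" step described in the text.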
We briefly describe the fit statistics. The Confusion Matrix for the logistic
regression is calculated on the test data and is given by the following:
              Actual +   Actual -
Predicted +        0          0
Predicted -       46        154
The percent correctly classified (PCC) for test data for the logistic regression
is 77%. On the face of it, a PCC of almost 80% may not seem too bad, but there
is a very serious problem with logistic regression as applied to this data set.
None of the points that are actually in the “1” class are correctly classified! In
fact, not one of the data points is predicted to lie in the “1” class. Thus, when
we go from overall performance to a more fine-grained performance by class,
the logistic regression performs poorly. Intuitively, this happens because there
are more “Actual -”s in the data, compared to “Actual +”s, and the method
plays it safe (as far as maximizing PCC is concerned) by classifying everything
as “-.”
The Confusion Matrix for the SVM is calculated on the test data
and is given by the following:

              Actual +   Actual -
Predicted +       45          4
Predicted -       10        141
The PCC for test data for the SVM is 93%. The performance of SVM is
significantly better than that of logistic regression. This pattern of results
is confirmed by looking at the AUC as well. For the logistic regression, the AUC is
approximately 0.51. The AUC for the best-performing SVM, the one with gamma
value of 1.5, is significantly higher at about 0.99. The reason for the much superior
performance of SVM compared to logistic regression is that logistic
regression provides a linear boundary, which is capable of faithfully classifying only
linearly separable classes. However, noncompensatory choice rules like the LOA
require nonlinear classification, as can be seen clearly from Fig. 4.25.
As was already mentioned, conjoint analysis, which has been one of the major
successes in the field of marketing in terms of widespread applications in industry
and elsewhere, is very similar in spirit to the current case study. There too, one
main goal is to estimate an individual’s utility function and predict his/her product
choices. Hence, it is no surprise that the SVM methodology has been successfully
employed for conjoint analysis, given that many real choice situations follow
noncompensatory choice rules and nonlinear utility functions. Evgeniou, Bous-
sios, and Zacharia (2005) have shown that, compared to other statistical models,
apart from being able to handle highly nonlinear choice situations, the SVM
methodology applied to conjoint analysis can handle many more attributes, is
robust to noise, does not suffer from the curse of dimensionality, and does not
make distributional assumptions that may or may not be satisfied.
Response Column: 1
Predictor columns: 2
Training Percentage: 80
Testing Percentage: 20
Number of folds for Cross-validation: 3
As was mentioned in the neural networks chapter, a linear regression will not
provide a good fit to the data since empirical research has established that there is
a cubic relationship between “distance to city center” and “rent value” (Frew &
Wilson, 2002). The PMSE (predicted mean squared error) for the linear regression
fit is 6921.
An important learning point from this case is to demonstrate the importance of
an appropriate choice of a kernel for SVM. We will first run a SVM with a radial
basis kernel. The inputs for this are as follows.
Response Column: 1
Predictor columns: 2
Training Percentage: 80
Testing Percentage: 20
Number of folds for Cross-validation: 3
Kernel: Radial
Kernel Parameter: 0.5,1,1.5
Cost: 0.8
Number of times Averaging: 1
The performance of the SVM with the radial kernel on test data is significantly
better than the linear regression benchmark: the PMSE is 2373.8, compared to
6921 for the linear regression.
We now try the SVM with a polynomial kernel. The inputs are as follows:
Response Column: 1
Predictor columns: 2
Training Percentage: 80
Testing Percentage: 20
Number of folds for Cross-validation: 3
Kernel: Polynomial
Kernel Parameter: 1, 2, 3
Cost: 0.8
Number of times Averaging: 1
The kernel parameter is the degree of the polynomial and we have selected
parameters 1, 2, and 3 corresponding to linear, quadratic, and cubic polynomials –
the best fitting polynomial kernel will be selected. We find that the best fitting
SVM has a PMSE of 2074. This is even better than the performance of the SVM
with radial basis kernel. Interestingly, the best fitting kernel is the cubic poly-
nomial which has the lowest CV error. This is consistent with the true relationship
between “rent value” and “distance to city center” based on empirical research
(Frew & Wilson, 2002). This case study underscores the importance of using an
appropriate kernel for SVM; often the best guide is the domain
knowledge that the analyst brings to the task.
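The kernel-choice comparison in this case study can be sketched as follows. The rent/distance data are simulated with an assumed cubic relationship (the coefficients are illustrative, not from Frew & Wilson), scikit-learn's SVR is assumed as the software, and the PMSE values will therefore differ from those in the text.

```python
# Sketch of the polynomial-kernel SVR above: degrees 1, 2, and 3 are
# cross-validated and the best-fitting degree is kept. Data are simulated
# with a true cubic relationship between rent and distance.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(4)
dist = rng.uniform(0, 3, 500)                       # distance to city center
rent = 10 - 6 * dist + 4 * dist**2 - 0.9 * dist**3 + rng.normal(0, 0.5, 500)

X_tr, X_te, y_tr, y_te = train_test_split(dist.reshape(-1, 1), rent,
                                          test_size=0.2, random_state=0)
poly = GridSearchCV(SVR(kernel="poly", C=0.8, coef0=1.0),
                    param_grid={"degree": [1, 2, 3]}, cv=3)
poly.fit(X_tr, y_tr)
pmse = np.mean((poly.predict(X_te) - y_te) ** 2)
print("best degree:", poly.best_params_["degree"])
print("polynomial-kernel PMSE:", round(pmse, 2))
```

Note the `coef0=1.0` argument: with the default of 0, scikit-learn's polynomial kernel contains only the homogeneous degree-d terms and could not represent the lower-order terms of a cubic.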
TECHNICAL APPENDIX
Technical Detour 1
As mentioned in the text, suppose the input points xi lie in a two-dimensional
space defined by axes X = (X1, X2), so that each xi = (xi1, xi2), i = 1,…, N. In a
logistic regression, we model the log odds (or logit transformation) of the target
variable yi, and the model has the form:

Log[Pr(yi = +1 | X = xi) / Pr(yi = −1 | X = xi)] = Log[p/(1 − p)] = b0 + b1xi1 + b2xi2

The linear expression in the input variables, xi1 and xi2, on the right-hand side
is the linear classification function. The dependent variable is the log odds ratio,
log[pi/(1 − pi)], where pi = Pr(yi = +1 | X = xi). If the linear classification function
is denoted by f(xi), so that f(xi) = b0 + b1xi1 + b2xi2, then the classification rule
is:

+1, if f(xi) > 0
−1, if f(xi) < 0

Since the logistic regression model is log[pi/(1 − pi)] = f(xi), the classification
rule can be restated as:

+1, if pi > 0.5
−1, if pi < 0.5
Technical Detour 2
Suppose the equation of the hyperplane is b0 + b1x1 + b2x2 = 0. Since none of the
“+” or “−” points lie on it, for any point xi = (xi1, xi2) either
b0 + b1xi1 + b2xi2 > 0 or b0 + b1xi1 + b2xi2 < 0. Said differently, the hyperplane
defined by the thick bold line separates the “+” and the “−” points.
We can generalize this idea and define a separating hyperplane in p dimensions
as:

b0 + b1xi1 + b2xi2 + … + bpxip > 0, if yi = +1
b0 + b1xi1 + b2xi2 + … + bpxip < 0, if yi = −1     (A4.1)

In this way, we can see that the sign of the separating hyperplane, corresponding
to an input point xi, can be used to classify that point.
These two inequalities can be combined to give the condition for a
separating hyperplane as:

yi(b0 + b1xi1 + b2xi2 + … + bpxip) > 0     (A4.2)
Technical Detour 3
A very simple and well-known measure of similarity of two vectors in
p-dimensional space, x = (x1,…, xp) and x′ = (x′1,…, x′p), is the dot product, also
called the inner product. This is defined as:

⟨x, x′⟩ = Σ(j=1..p) xj x′j     (A4.3)

The inner product above has a geometric interpretation in that it computes the
angle between x and x′ (actually, the cosine of the angle). For our practical purposes,
we take the norm of a vector x, denoted by ‖x‖, to be a measure of its length. The
norm is also defined in terms of the inner product of the vector x and itself:

‖x‖² = ⟨x, x⟩

The inner product of two vectors, in terms of the cosine of the angle θ between
them, is:

⟨x, x′⟩ = ‖x‖ ‖x′‖ cos θ

Hence, the inner product allows one to carry out mathematical computations
that involve geometric concepts of angles, lengths, and distances. In this simple
case where there is a concrete representation of the input object x as a vector, the
kernel is merely the inner product:

k(x, x′) = ⟨x, x′⟩     (A4.4)

A very popular kernel used for measuring the similarity of two documents is
based on the idea of using the cosine as a similarity measure. This kernel is:

k(x, x′) = ⟨x, x′⟩ / (‖x‖ ‖x′‖)
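The two kernels of this detour can be written out directly for a pair of NumPy vectors; the example vectors are illustrative.

```python
# The inner-product kernel (A4.4) and the document-similarity (cosine)
# kernel from Technical Detour 3.
import numpy as np

def linear_kernel(x, xp):
    """k(x, x') = <x, x'>."""
    return float(np.dot(x, xp))

def cosine_kernel(x, xp):
    """k(x, x') = <x, x'> / (||x|| ||x'||)."""
    return float(np.dot(x, xp) / (np.linalg.norm(x) * np.linalg.norm(xp)))

x, xp = np.array([1.0, 2.0]), np.array([2.0, 4.0])
print(linear_kernel(x, xp))    # 10.0
print(cosine_kernel(x, xp))    # parallel vectors, so the cosine is 1.0
```

The cosine kernel ignores vector length and depends only on direction, which is why it is popular for comparing documents of different sizes.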
Technical Detour 4
Consider N data points (xi, yi), i = 1,…, N, where input vectors xi = (xi1, xi2) lie
in 2-dimensional space and yi is a binary response variable with values “+1” or
“−1.” Suppose there are N+ responses with yi = +1 values and N− responses with
yi = −1, so that N+ + N− = N. Also suppose the set of indices i such that yi = +1
(yi = −1) is denoted by P (M).
Now, for ease of exposition, we assume that the distance of the class means x̄+
and x̄− from the origin is the same (see Fig. 4.6). Importantly, the reader should
keep in mind that the basic idea of this construction works for any arbitrary
locations of the class means. Getting back to the symmetric case, the class means
at equal distance and symmetrically located on either side of the vertical imply
that x̄+2 = x̄−2, and we denote this common value by x2*. The point halfway
between the class means is:

xC = (x̄+ + x̄−)/2, and it has coordinates xC = (0, x2*).

To classify a new point x = (x1, x2), we have to determine whether this new
point is closer to the class mean of class “+” or class “−.” We classify it in class
“+” (or “−”) if the angle of the vector x − xC with the vector x̄+ − x̄− is less
(or greater) than 90°, respectively. Recall that the cosine of 90° is 0, the cosine of an
angle greater than 90° is negative, and the cosine of an angle less than 90° is positive.
Therefore, the inner product (which is a measure of the cosine of the angle
between two vectors) can act as a classifier. Specifically, the classifier is:

y = sign(⟨x − xC, x̄+ − x̄−⟩)     (A4.5)

The direction vectors x − xC and x̄+ − x̄− in coordinate form are (x1, x2 −
x2*) and (x̄+1 − x̄−1, 0). Thus, the inner product on the right-hand side of the
preceding equation is:

⟨x − xC, x̄+ − x̄−⟩ = ⟨x, x̄+⟩ − ⟨x, x̄−⟩

The expression on the right-hand side uses the fact that x̄+2 = x̄−2, since the X2
coordinates of the class means are the same. From (A4.5), then, the required
classifier is:

y = sign(⟨x, x̄+⟩ − ⟨x, x̄−⟩)     (A4.6)

Now, since x̄+ = (1/N+) Σ(i∈P) xi and x̄− = (1/N−) Σ(i∈M) xi, and since inner products have the
linearity property, the term in parentheses on the right-hand side of
(A4.6) can be written in terms of sums and differences of inner products of the
new point x and all the input points xi. Finally, since an inner product is a kernel,
one can see how a classifier can be expressed in terms of kernels as:

y = sign(Σ(i=1..N) ai k(xi, x) + b0)     (A4.7)
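The class-mean construction of this detour can be sketched directly in code. The small data set is illustrative, with class means placed symmetrically as in the text; `np.dot` plays the role of the kernel, and any other kernel function could be substituted via the linearity argument.

```python
# Sketch of the classifier (A4.6)/(A4.7): assign a new point to the class
# whose mean it is closer to, written purely in terms of kernel evaluations.
import numpy as np

def mean_classifier(X, y, x_new, kernel=np.dot):
    """y_hat = sign(<x, mean+> - <x, mean->), in kernel form."""
    pos, neg = X[y == 1], X[y == -1]
    # <x, mean+> = (1/N+) sum of k(x, xi) over the "+" points, by linearity
    score = (np.mean([kernel(x_new, xi) for xi in pos])
             - np.mean([kernel(x_new, xi) for xi in neg]))
    return 1 if score > 0 else -1

# Class means (2.5, 1) and (-2.5, 1): symmetric about the vertical axis,
# as in the construction above.
X = np.array([[2.0, 1.0], [3.0, 1.0], [-2.0, 1.0], [-3.0, 1.0]])
y = np.array([1, 1, -1, -1])
print(mean_classifier(X, y, np.array([1.0, 1.0])))    # 1
print(mean_classifier(X, y, np.array([-1.0, 1.0])))   # -1
```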
Technical Detour 5
As mentioned in the text, the map φ(.) maps objects from the input space X to an
inner product space called the feature space F. The simplest nonlinear map φ(.)
takes the input vector x and maps it to a feature space which has products and
powers of the components of x, which are x1 and x2. For example, for input vector
x = (x1, x2), consider the nonlinear map:

φ(x) = φ(x1, x2) = (x1², x2², √2 x1x2)     (A4.8)

A direct computation shows that ⟨φ(x), φ(x′)⟩ = (x1x′1 + x2x′2)² = ⟨x, x′⟩².
Thus, even though ⟨x, x′⟩² is nonlinear in the 2-dimensional input space X, the
kernel k(x, x′) = ⟨φ(x), φ(x′)⟩ has mapped this nonlinear quantity into a linear
quantity. This linear quantity resides in the 3-dimensional feature space F defined
by dimensions W1, W2, W3, say, where W1 = x1², W2 = x2², W3 = √2 x1x2. The
alert reader will note that linearity in the feature space F follows from the fact that
an inner product defined on a space is, by definition, linear in that space.
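The identity underlying this detour is easy to check numerically; the two test vectors below are arbitrary.

```python
# Numerical check of Technical Detour 5: the explicit feature map phi of
# (A4.8) reproduces the squared inner product, <phi(x), phi(x')> = <x, x'>^2.
import numpy as np

def phi(x):
    """phi(x1, x2) = (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, xp = np.array([1.0, 2.0]), np.array([3.0, 1.0])
lhs = float(np.dot(phi(x), phi(xp)))   # inner product in feature space F
rhs = float(np.dot(x, xp) ** 2)        # squared inner product in input space X
print(lhs, rhs)                        # both approximately 25.0
```

This is the kernel trick in miniature: the right-hand side never constructs the feature space, yet computes the same quantity as the left-hand side.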
Technical Detour 6
Given a set of input points x1, x2,…, xN in X, we can use the kernel to form the
kernel matrix:

G = | k(x1, x1)  …  k(x1, xN) |
    |     ⋮      ⋱      ⋮     |
    | k(xN, x1)  …  k(xN, xN) |

This is called the Gram matrix. If it is positive definite, then it can be shown
that there exists a feature space with an inner product defined on it based on this
kernel.
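The Gram matrix and its definiteness condition can be checked numerically; the radial kernel and the random input points below are illustrative assumptions.

```python
# Building the Gram matrix of Technical Detour 6 for a radial kernel and
# checking that it has no negative eigenvalues, the condition that
# guarantees an underlying feature space.
import numpy as np

def gram_matrix(X, kernel):
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

X = np.random.default_rng(5).normal(size=(6, 2))
G = gram_matrix(X, rbf)
eigvals = np.linalg.eigvalsh(G)        # symmetric, so eigenvalues are real
print("positive semidefinite:", bool(np.all(eigvals > -1e-10)))
```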
We provide more details on the commonly used kernels mentioned in the text. The polynomial kernel of degree d is k(x, x′) = (1 + ⟨x, x′⟩)^d, as in (A4.27) below, and the radial basis kernel with parameter γ is k(x, x′) = exp(−γ‖x − x′‖²).
Technical Detour 7
The margin of the point (xi, yi) to the hyperplane b0 + b1x1 +
b2x2 + … + bpxp = 0 is the distance from xi to the hyperplane if the point (xi, yi)
= (xi1,…, xip, yi) is correctly classified. A standard result from coordinate
geometry shows that the geometric distance from (xi, yi) to the hyperplane
b0 + b1x1 + b2x2 + … + bpxp = 0 is:

|b0 + b1xi1 + b2xi2 + … + bpxip| / √(b1² + b2² + … + bp²)

Formally speaking, we can define the margin of the point (xi, yi) = (xi1,…, xip, yi) as:

yi(b0 + b1xi1 + b2xi2 + … + bpxip) / √(b1² + b2² + … + bp²)     (A4.10)
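The margin formula (A4.10) amounts to a one-line function; the hyperplane and point below are illustrative.

```python
# The signed margin of (A4.10): the distance from a point to the hyperplane
# b0 + b1 x1 + ... + bp xp = 0, signed by the class label y.
import numpy as np

def margin(x, y, b0, b):
    """y (b0 + b.x) / ||b||; positive iff the point is correctly classified."""
    return y * (b0 + np.dot(b, x)) / np.linalg.norm(b)

# Hyperplane x1 + x2 - 1 = 0 (b0 = -1, b = (1, 1)); the point (1, 1) with
# label +1 sits at distance 1/sqrt(2) on its correct side.
print(margin(np.array([1.0, 1.0]), +1, -1.0, np.array([1.0, 1.0])))
```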
Technical Detour 8
The development of the optimization program for the maximal margin hyperplane
is based on the treatment in Hastie, Tibshirani, and Friedman (2009). The
formal statement of the program described in (4.5) is:

max(b0, b1,…, bp) M
s.t. ‖b‖ = 1
     yi(b0 + Σ(j=1..p) bj xij) ≥ M, for all i = 1, 2,…, N     (A4.11)

In the above, the first constraint involves the norm of the vector b = (b1,…, bp),
defined as ‖b‖² = b1² + … + bp². The reader will note that, since yi equals either +1
or −1 and because of the first constraint, the left-hand side of the second constraint
is essentially the (signed) distance from input point xi to the hyperplane. See the
expression for the distance from xi to the hyperplane in (A4.10). Therefore, the
two constraints together imply that all the points are at least a distance M away
from the hyperplane.
Since the goal is to determine the optimal values of the coefficients bj, j = 0,
1,…, p, it is convenient to recast the objective function in (A4.11) in terms of the
bj. Note that the first constraint in (A4.11) is essentially ‖b‖ = 1, where b = (b1,…,
bp). In order to recast the objective function in (A4.11) in terms of b, we relax the
requirement that ‖b‖ = 1. When we do not impose the unity requirement on the
norm ‖b‖, then the requirement that the distance of the hyperplane from any
point xi = (xi1,…, xip) is at least M becomes:

yi(b0 + b1xi1 + b2xi2 + … + bpxip) / ‖b‖ ≥ M

Clearly, if coefficients (b0, b) satisfy this inequality constraint, so does (kb0, kb),
where k is a positive constant. This is because, by replacing (b0, b) with (kb0, kb) in the
numerator of the ratio on the left-hand side, the term becomes
kyi(b0 + b1xi1 + b2xi2 + … + bpxip). Also, by replacing (b0, b) with (kb0, kb) in the
denominator of the ratio on the left-hand side, the term becomes:
√((kb1)² + … + (kbp)²) = k√(b1² + … + bp²) = k‖b‖. Consequently, the k's in the
numerator and denominator cancel out. Thus, there is an indeterminacy which
can be avoided if we arbitrarily set M‖b‖ = 1. Therefore, in the above construction,
the width of the margin becomes M = 1/‖b‖.
The program for the optimal separating hyperplane can now be restated as:

min(b0, b1,…, bp) (1/2) Σ(j=1..p) bj²
s.t. yi(b0 + Σ(j=1..p) bj xij) ≥ 1, for all i = 1, 2,…, N     (A4.12)

This form is intuitively appealing since, unlike (A4.11), the objective function
is directly stated in terms of the coefficients, which are precisely the quantities that
need to be estimated. Also note that, since M = 1/‖b‖, maximizing the margin M
is equivalent to minimizing the norm ‖b‖.
To solve (A4.12), we form the Lagrangian:

Lp = (1/2) Σ(j=1..p) bj² − Σ(i=1..N) li [yi(b0 + Σ(j=1..p) bj xij) − 1]

where the li are the Lagrange multipliers. We use the subscript p in the Lagrangian to
make it explicit that this is the “primal” problem, to be distinguished from the
“dual” problem to be specified in “Technical detour 9” below.
This Lagrangian needs to be solved for the b's and the l's. Consider the
derivative with respect to bj. It is:

∂Lp/∂bj = bj − l1y1x1j − l2y2x2j − … − lNyNxNj, for j = 1,…, p.

Setting it equal to 0 and solving gives the First Order Conditions (FOCs):

FOC w.r.t. bj:  b̂j = Σ(i=1..N) li yi xij, for j = 1,…, p     (A4.13)

FOC w.r.t. b0:  Σ(i=1..N) li yi = 0     (A4.14)
Now, we know that for the training data (xl, yl), l = 1,…, N, classification by a
separating hyperplane requires computation of terms like (b0 + xl·b). For a
compact notation, we use the dot product xl·b. Consider the terms xl·b for l =
1,…, N. The optimal separating hyperplane requires the optimal values b̂j for j =
1,…, p. Using the optimal b̂j obtained in (A4.13) we get:

xl·b̂ = (xl1, xl2,…, xlp) · (Σ(i=1..N) li yi xi1, Σ(i=1..N) li yi xi2, …, Σ(i=1..N) li yi xip)

Writing out this product gives us: xl·b̂ = Σ(i=1..N) li yi Σ(k=1..p) xlk xik. Finally, we note
that the second summation on the right-hand side is an inner product, and so:

xl·b̂ = Σ(i=1..N) li yi ⟨xl, xi⟩, for l = 1,…, N.

Thus, the function f(x) for the optimal separating hyperplane, defined for a
generic point x as the argument, is:

f(x) = b̂0 + x·b̂ = b̂0 + Σ(i=1..N) li yi ⟨x, xi⟩     (A4.16)

The optimal classifier is just: sign(f(x)) = sign(b̂0 + x·b̂).
Finally, (A4.16) makes it clear that those points (vectors) for which the
corresponding li = 0 will not figure in the specification of the separating hyperplane.
Hence, we should only focus on vectors for which li > 0. Now, the KKT conditions
in (A4.15) imply that if li > 0 then yi(Σ(j=1..p) xij bj + b0) = 1. These vectors xi,
corresponding to which li > 0, are the support vectors.
Technical Detour 9
The optimal separating hyperplane for linearly separable classes is solved using
the program (A4.11) in “Technical detour 8.” There, we showed that we could,
equivalently, solve program (A4.12). For computational purposes, it is often
easier to solve the Wolfe Dual program. Substituting the FOCs (A4.13) and
(A4.14) into the Lagrangian primal Lp given in Technical Detour 8, we can form
the Lagrangian dual. Then, the Wolfe Dual is specified as:
N 1 N N
max LD ¼ + li 2 + + li lk yi yk xTi xk
l1 ;:::lN i¼1 2 i¼1 k¼1
(A4.17)
N
s:t: + li yi ¼ 0; li $ 0
i¼1
In the above, xiT is the transpose of the vector xi. This program yields the
optimal li, and the points (vectors) xi corresponding to li . 0 are the support
vectors.
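The Wolfe dual (A4.17) is a small quadratic program and can be solved numerically; the tiny hand-made data set below is illustrative, and SciPy's SLSQP solver is an assumed stand-in for a dedicated QP solver.

```python
# A minimal sketch of solving the Wolfe dual (A4.17) for a tiny linearly
# separable data set. Points whose lambda comes out positive are the
# support vectors.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 0.0], [0.5, -1.0], [2.0, 0.0], [2.5, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)      # Q_ik = yi yk <xi, xk>

neg_LD = lambda lam: 0.5 * lam @ Q @ lam - lam.sum()   # minimize -LD
res = minimize(neg_LD, x0=np.ones(4),
               constraints={"type": "eq", "fun": lambda lam: lam @ y},
               bounds=[(0.0, None)] * 4)
lam = res.x
print("lambdas:", np.round(lam, 4))    # only the two support vectors are > 0
```

Here the maximal margin hyperplane separates (1, 0) and (2, 0), so those two points come out as the support vectors with equal multipliers, while the two interior points get lambda of essentially zero.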
Technical Detour 10
The formal statement of the program described in (4.6) for the nonseparable
case is:
8
>
>
> max Mb0 ;b1 ;:::;bp ;j1 ;:::;jN
>
>
> s:t: ‖b‖ ¼ 1
>
>
>
< p
yi ðb0 1 + bj xij Þ $ Mð1 2 ji Þ "i ¼ 1; 2:::; N (A4.18)
>
> j¼1
>
>
>
>
N
>
> ji $ 0; + ji # C
>
: i¼1
Notice that the differences from the separable case in (A.4.11) are that: (1) the
right-hand side of the margin constraint (the second constraint) now incorporates
soft margins that allow some misclassification, and (2) we have added a new
constraint which caps the extent of misclassification using the pre-specified con-
stant C.
Similar to “Technical detour 8” we have to set up the Lagrangian and solve the
FOCs. Just as in that case, here too it is convenient to recast the program such
that the objective function is the norm ‖b‖, whereupon the second constraint in
p
(A4.18) reduces to: yi ðb0 1 + xij bj Þ $ 1 2 ji , for all i. This is analogous to how
j51
134 Machine Learning and Artificial Intelligence in Marketing and Sales
(A4.11) has been recast as (A4.12) for the maximal margin classifier in the
separable case. One salient difference is that now the KKT conditions imply that
p
if li . 0 then yi ðb0 1 + xij bj Þ 5 12 ji (previously, in the separable case the
j51
right-hand side was just 1). As before, the optimal separating hyperplane is
characterized only by those xi for which the corresponding li .0. These are the
support vectors.
Technical Detour 11
In the SVM program, the solution of the optimal separating hyperplane is
accomplished by setting up the Wolfe dual as in (A4.17). The major difference
from (A4.17) is that the objective function is expressed in terms of the kernel as
shown below:
N 1 N N
max LD ¼ + li 2 + + li lk yi yk kðxi ; xk Þ (A4.19)
l1 ;:::lN i¼1 2 i¼1 k¼1
It can be shown that, for a generic point x, the optimal separating hyperplane
can be expressed in terms of kernels as:
N
f ðxÞ ¼ b0 1 + li yi kðx; xi Þ (A4.20)
i¼1
WORKED-OUT ILLUSTRATIONS
Illustration 3
We had identified the support vectors as s1 = (1, 0) for the “−” class and s2 = (2, 0)
for the “+” class. The target labels corresponding to these points are y1 = −1 and
y2 = +1. We apply the two support vectors as arguments of the function f(x) for
the separating hyperplane in (A4.16). So, corresponding to s1 and s2, we have two
equations that have to be solved simultaneously. Consider the function f(x)
corresponding to the support vector s1 – that is, f(s1). The summation on the right-hand
side will clearly involve only the two support vectors since only these
correspond to nonzero li. Let us write the inner products on the right-hand side of
(A4.16) as dot products for compact notation. Thus, for example, ⟨s1, s2⟩ is
written as s1·s2, and so on. Thus, from (A4.16), the equation corresponding to s1 is:
f(s1) = b0 + l1(−1)s1·s1 + l2(+1)s1·s2. Now, since s1 is in the “−” class, therefore f(s1)
= −1. Similarly, the function f(x) corresponding to the support vector s2 – that is,
f(s2) – will also involve only the two support vectors. We then have the two
following equations corresponding to the two support vectors:

−1 = b0 + l1(−1)s1·s1 + l2(+1)s1·s2
+1 = b0 + l1(−1)s2·s1 + l2(+1)s2·s2     (A4.21)

In these equations, the dot products (inner products) are defined on input
space. Using s1 = (1, 0) and s2 = (2, 0), the above simultaneous equations become:

−1 = b0 − l1(1, 0)·(1, 0) + l2(1, 0)·(2, 0)
+1 = b0 − l1(2, 0)·(1, 0) + l2(2, 0)·(2, 0)

Also, the FOC with respect to b0 given in (A4.14) yields l1y1 + l2y2 = 0. This
gives −l1 + l2 = 0, so that l1 = l2. Substituting in the simultaneous equations
above gives the optimal values:

b0* = −3;  l1* = 2;  l2* = 2

Consider a generic point x = (x1, x2). From (A4.16), the optimal separating
hyperplane is:

f(x1, x2) = −3 + (2)(−1)(x1, x2)·(1, 0) + (2)(+1)(x1, x2)·(2, 0)

Evaluating the inner products, this finally gives the optimal separating
hyperplane, where we note that this hyperplane lies in input space:

f(x1, x2) = −3 + 2x1     (A4.23)
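This worked illustration can be verified numerically: with l1 = l2, the two equations in (A4.21) become a 2-by-2 linear system in (b0, l).

```python
# Numerical check of Illustration 3: solving the system (A4.21) with
# l1 = l2 = l recovers b0 = -3, l = 2, i.e., the hyperplane f(x) = -3 + 2 x1.
import numpy as np

s1, s2 = np.array([1.0, 0.0]), np.array([2.0, 0.0])   # support vectors
A = np.array([[1.0, -s1 @ s1 + s1 @ s2],   # b0 + l(-s1.s1 + s1.s2) = -1
              [1.0, -s2 @ s1 + s2 @ s2]])  # b0 + l(-s2.s1 + s2.s2) = +1
b0, l = np.linalg.solve(A, np.array([-1.0, 1.0]))
print(b0, l)                                          # -3.0 2.0

f = lambda x: b0 + l * (-1) * (x @ s1) + l * (+1) * (x @ s2)
print(f(np.array([1.5, 0.0])))   # the midpoint lies on the boundary: 0.0
```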
Illustration 4
We had identified the two support vectors as: s1 5 (0.5, 0.5) for the “2” class
and s2 5 (1, 1) for the “1” class. The target labels corresponding to these points
are y1 5 21 and y2 5 11. We apply the two support vectors as arguments of the
function f(x) for the separating hyperplane given in (A4.20). Consider f(s1). The
summation of the right-hand side will clearly involve only the support vectors
since only these correspond to nonzero li. Also, recall that the kernel k(s1, xi) is
the inner product , f(s1), f(xi)., which we will write as a dot product for
compact notation. Then, corresponding to s1 and s2, we have two simultaneous
equations where we use f(s1) 5 21 and f(s2) 5 11 on the left-hand side:
2 1 ¼ b0 1 l1 ð 2 1Þfðs1 Þ:fðs1 Þ 1 l2 ð1Þfðs1 Þ:fðs2 Þ
(A4.24)
1 1 ¼ b0 1 l1 ð 2 1Þfðs2 Þ:fðs1 Þ 1 l2 ð1Þfðs2 Þ:fðs2 Þ
In these equations, the inner products are defined on feature space. Now, from
Illustration 4.1, the nonlinear map that we use to define the kernel is:
φ(x1, x2) = (2 − x2 + |x1 − x2|, 2 − x1 + |x1 − x2|) if x1² + x2² > 0.5; (x1, x2) otherwise   (A4.25)
136 Machine Learning and Artificial Intelligence in Marketing and Sales
Thus, φ(s1) = (0.5, 0.5) and φ(s2) = (1, 1). This nonlinear map takes vectors in
2-dimensional input space and maps them to 2-dimensional feature space. Sup-
pose, we denote the axes of the feature space as W1 and W2. Using the map in
(A4.25), the simultaneous equations in (A4.24) become:
−1 = b0 − λ1 (0.5, 0.5)·(0.5, 0.5) + λ2 (0.5, 0.5)·(1, 1)
+1 = b0 − λ1 (1, 1)·(0.5, 0.5) + λ2 (1, 1)·(1, 1)
Also, the FOC with respect to b0 given in (A4.14) yields λ1y1 + λ2y2 = 0. This gives −λ1 + λ2 = 0, so that λ1 = λ2. Substituting in the simultaneous equations above (the inner products evaluate to 0.5, 1, and 2) gives −1 = b0 + 0.5λ and +1 = b0 + λ, whose solution is the optimal values:
b0* = −3, λ1* = 4, λ2* = 4
Consider a generic point x = (x1, x2). From (A4.20), the optimal separating hyperplane is:
f(x) = −3 + (4)(−1) k(x, s1) + (4)(+1) k(x, s2)
Note that this hyperplane lies in the feature space with W1 and W2 as the two axes. We denote the image of a generic input vector (x1, x2) under φ as φ(x1, x2) = (w1, w2). Evaluating the inner products using φ(x) = (w1, w2), φ(s1) = (0.5, 0.5), and φ(s2) = (1, 1), the previous expression for the hyperplane is:
f(x1, x2) = −3 + 2w1 + 2w2
Illustration 5
Since there are 4 training data points, from (A4.20), the hyperplane (decision boundary) in the feature space is f(x) = b0 + Σ_{i=1}^{4} λi yi k(x, ai). We use a polynomial kernel of degree 2. For a generic x = (x1, x2) and z = (z1, z2) the kernel is defined as:
k(x, z) = (1 + ⟨x, z⟩)² = (1 + x1z1 + x2z2)²   (A4.27)
where the last expression results from evaluating the inner product in input space using (A4.3). This gives:
k(x, z) = x1²z1² + x2²z2² + 2x1x2z1z2 + 2x1z1 + 2x2z2 + 1
Evaluating the right-hand side at z = a1 = (0, −1) gives (1 − x2)², which is just k(x, a1), the kernel of x = (x1, x2) and a1 = (0, −1), where we recall that the kernel is defined as k(x, a1) = (1 + ⟨x, a1⟩)². The other three squared terms can be similarly obtained.
To obtain the optimal λi, i = 1,…, 4, and b0 we specify the Lagrangian (Wolfe) dual program below, whose objective function is given in (A4.19):
max_{λ1,…,λ4} LD = Σ_{i=1}^{4} λi − (1/2) Σ_{i=1}^{4} Σ_{k=1}^{4} λi λk yi yk k(ai, ak)
s.t. Σ_{i=1}^{4} λi yi = −λ1 + λ2 − λ3 + λ4 = 0   (A4.31)
λi ≥ 0, i = 1, 2, 3, 4
In the program above, the labels yi are known and the kernels k(ai, ak), i, k = 1, 2, 3, 4, can be easily calculated using the definition of the kernel above. Thus, it remains to solve for the λi. Using standard techniques, we obtain:
λ1* = λ2* = λ3* = λ4* = 0.5
Since all λi > 0, all 4 vectors are support vectors. The decision boundary with these λi* is:
f(x) = b0 + x1² − x2²
To determine b0, we can use any support vector. We use the vector a1. By definition y1 f(a1) = 1, where y1 = −1 and a1 = (0, −1). This implies that b0 = 0. Finally, the optimal classifier (decision boundary) in input space is:
x1² − x2² = 0   (A4.32)
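The classifier in (A4.32) can be checked numerically. Since the four training points are not listed in this excerpt, the sketch below assumes they are a1 = (0, −1), a2 = (1, 0), a3 = (0, 1), a4 = (−1, 0) with labels −1, +1, −1, +1, which is consistent with a1 = (0, −1), with the sign pattern in (A4.31), and with the boundary x1² − x2² = 0:

```python
# Hypothetical training points and labels (assumed for illustration).
pts = [(0.0, -1.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
ys = [-1, 1, -1, 1]
lams = [0.5, 0.5, 0.5, 0.5]   # optimal multipliers from the text
b0 = 0.0                      # optimal intercept from the text

def k(x, z):
    # Polynomial kernel of degree 2: (1 + <x, z>)^2
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

def f(x):
    # Decision boundary: b0 + sum_i lam_i * y_i * k(x, a_i)
    return b0 + sum(l * y * k(x, a) for l, y, a in zip(lams, ys, pts))
```

With these values, f(x) reduces algebraically to x1² − x2², matching (A4.32), and every training point satisfies the support-vector condition yi f(ai) = 1.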
Chapter 5
Chapter Outline
1. Early Evolution of Decision Trees: AID, THAID, CHAID
2. Classification and Regression Trees (CART)
2.1 Regression Trees
2.1.1 Greedy Algorithm
2.1.2 Cost Complexity Pruning
2.2 Classification Trees
3. Decision Trees and Segmentation
4. Bootstrapping, Bagging and Boosting
4.1 Bootstrapping
4.2 Bagging
4.3 Boosting
5. Random Forest
6. Applications of Random Forests and Decision Trees in Marketing and Sales
7. Case Studies
Technical Appendix
assumptions. The earliest decision tree method was the Automatic Interaction
Detection (AID) method developed by Morgan and Sonquist (1963). The AID
model was developed for regression trees which have a continuous response
variable. After several years, in the early 1970s, Messenger and Mandell (1972) proposed THAID, the first algorithm for classification trees, which have a categorical response variable. The Chi-Squared Automatic Interaction Detection (CHAID) method was developed by Kass (1980). Together, AID, THAID, and CHAID can be thought of as the first wave of tree-based methods.
While some of the early methods were designed to handle continuous response
variables, the easiest means of developing an intuition of how these tree-based
models work is to consider a categorical response variable with several categorical
predictor (explanatory) variables. In a typical marketing context, consider the
case of classifying a group of people into “buy” or “not buy” for a specific product.
Suppose that the three predictor variables are “gender,” “marital status,” and
“occupation.” To make the illustration simple, suppose further that marital status
has two categories, “married” and “not-married,” where divorced people are put in
the not-married category. Similarly, "occupation" also has two categories, "white collar" and "blue collar." For the moment we will not concern ourselves with
the issue that many occupations do not fit into this simplistic categorization. We
will describe the CHAID algorithm as a prototype for the early tree-based models.
The reason for using predictor variables with only two categories is that we can
avoid the complexities involved in finding the “optimal” combination of categories
for variables that have more than two categories. This will allow us to focus on
providing a simple intuition for how the tree is built in stages, rather than being
ensnared in technical details.
Most texts that explain the CHAID method take a terse algorithmic approach
since they want to show how the method works for a general data set. We will
take a descriptive approach and tie our explanation to the specific marketing
illustration detailed in the previous paragraph. Hence, the steps in our description
of CHAID will be different from the steps of the formal algorithm detailed in the
more technical resources.
Step 1: Cross-tabulate the response variable with each of the predictor variables.
Since we have three predictor variables we have three cross-tabulations, with the
cross-tabulation with respect to “gender,” followed by “marital status,” followed by
“occupation” from left to right below.
Step 2: Find the “most significant” of these predictor variables. Based on the
three chi-squared values, suppose all three are significant and “gender” is the most
significant. Then, the first branch of the tree is for “gender” as shown below.
Step 3: Consider the node “male” in the tree in step 2. Cross-tabulate the
response variable with each of the remaining predictor variables, “marital status”
and “occupation,” for the males as shown below.
Then calculate the two chi-squared values corresponding to the two cross-
tabulations with the predictor variables “marital status” and “occupation.”
Step 4: Find the “most significant” among the predictor variables “marital
status” and “occupation” for the males. Suppose, based on the chi-squared values,
both are significant and “marital status” is the most significant. Then, the branch
of the tree in step 2 starting from the node “male” is for “marital status” as shown
below.
If the cross-tabulation in step 5 is not significant then we will not have this last
branch and the process ends.
Step 7: Carry out steps 5 and 6 for the “not married” node in the tree in step 4.
Finally, carry out steps 3 to step 6 for the “female” node in the tree in step 2.
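At each of these steps, the "most significant" predictor is the one whose cross-tabulation with the response yields the largest chi-squared statistic. A minimal sketch of that computation (our own helper; the 5/1 and 1/5 counts are illustrative, since the actual cross-tabulations appear only in the figures):

```python
def chi_squared(table):
    # Pearson chi-squared statistic for a cross-tabulation.
    # table[i][j]: count in predictor category i, response category j.
    row = [sum(r) for r in table]             # row totals
    col = [sum(c) for c in zip(*table)]       # column totals
    n = sum(row)                              # grand total
    stat = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            exp = row[i] * col[j] / n         # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat
```

A sharply diagonal table such as [[5, 1], [1, 5]] gives a large statistic, while a perfectly balanced table such as [[3, 3], [3, 3]] gives zero, so the former predictor would be selected for the split.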
It is important to note three aspects of CHAID. First, CHAID, like all decision
trees, is well suited to detect interactions among the predictor variables – hence the
term “interaction detection” in the acronyms AID, THAID and CHAID. There
can be different branching/splitting variables for different nodes at the same level
in the tree. For example, suppose that in the tree in step 2, the branch after the
“female” node is “occupation” and there are no more splits after that. Therefore,
the effect of the predictor variable “marital status” on the response (“buy”/”not
buy”) depends on another predictor variable “gender” – marital status matters for
the males but not for the females. In other words, the manner in which a given
predictor variable affects the response depends on other predictor variables – a
classic interaction effect! One can incorporate interaction effects in regression, but
Random Forest, Bagging, and Boosting of Decision Trees 143
• The first partition is: split the feature space at X2 = k1, where k1 = 65 years. Fig. 5.1 shows this partition.
• The second partition is: for X2 > k1, split the region at X1 = k2. Here k2 is a suitably high dollar amount of savings. Pictorially (Fig. 5.2), we have
• The third partition is: for X2 < k1, split the region at X2 = k3, where k3 = 40 years. Now the partitioned feature space becomes (Fig. 5.3)
• The fourth partition is: for X2 < k3, split the region at X1 = k4. Here k4 is a relatively low dollar amount of savings. The final partitioned feature space is (Fig. 5.4)
In terms of the regions shown in Fig. 5.4, the financial product company’s
target segments are region R5 which has people age 65 and above (i.e., retirees)
with high savings and R1 which has working professionals under 40 who
currently have low savings.
We can depict the sequence of binary partitions from Figs. 5.1–5.4 as a tree
diagram. Since the binary partitions proceeded in a sequence, the process is called
recursive binary partitioning. The tree is shown in Fig. 5.5.
Since the modeling of interactions among predictor variables is one of the major
strengths of tree-based models, we will demonstrate the manner in which the tree in
Fig. 5.5 captures interactions. Note that, for people with age greater than 65, a high level of savings ("savings" > k2) has a higher proportion of "buy" responses. On the other hand, among people with age less than 40, a low level of savings ("savings" < k4) has a higher proportion of "buy" responses. Thus, the manner in
which the predictor variable “savings” affects the response (“buy”/“not buy”)
depends on the predictor “age.”
One might wonder why we restrict ourselves to recursive binary splits. The
main reason is interpretability, since arbitrary partitions can lead to regions that
are difficult to describe and interpret. As an illustration, consider the partition
below.
In Fig. 5.6, there are non-overlapping regions and oddly shaped regions that
are difficult to describe. For example, in region R there are two points such that
the line segment that connects them does not lie entirely in the region. Regions
like this are called non-convex, and they create problems for many optimization
algorithms. Partitions such as in Fig. 5.6 would be hard to describe as trees.
Recursive binary partitions allow the entire partition to be conveniently represented as a single tree, enabling the easy and intuitive interpretation and visualization of tree diagrams.
“greedy” because it looks just one step ahead instead of looking ahead till the
end. Thus, at any step it optimally decides just the next splitting variable and
split point in a myopic manner rather than being far-sighted and accounting
for the future steps in the tree formation process. To make a distinction
between a predictor variable and a specific value that it could take, we use the
uppercase Xj to denote the jth predictor variable and the lowercase xij for the jth
component of the ith observation xi = (xi1,…, xip). We provide an intuitive
sketch of the greedy algorithm which accomplishes the recursive binary
partition.
Step 1: For a predictor variable Xj (j 5 1,…, p) and a split point “s” compute
the total sum of squares over the two regions that are generated by a binary
partition. This total sum of squares is minimized over all predictors Xj and all
split points “s”. This identifies the next optimal splitting variable Xk and its split
point sk*.
A bit of intuition about this critical step may be helpful for the reader. Once
we have a predictor variable Xj and a split point “s” we can partition the feature
space into two regions: R1 and R2. In this binary partition, R1 is the set of all
observations xi whose jth component xij is less than “s”. Similarly, R2 is the set of
all observations xi whose jth component xij is greater than “s”. Different split
points “s” give different regions R1 and R2. We choose that split point “s” which
minimizes the sum of squares over the training data. The sum of squares was
described in the opening paragraph of Section 2.1, and here we specialize it to a
regression tree. In the case of a regression tree, for any region we have the same
predicted value for all points in that region – the average response. Using that
prediction, and the actual yi corresponding to each point xi in a region, we can
compute the sum of squares for that region. We do the same for the other region.
We add them to get the sum of squares over the training data. This sum of
squares is a function of the assumed splitting variable Xj and split point “s” which
generates the two regions R1 and R2. We choose the variable and split point to
minimize the sum of squares.
Practically speaking, this step is accomplished in two stages: (1) In the first
stage, for a fixed Xj, we assume a split point “s”. For that Xj and s we form the
binary partition as described in the previous paragraph. We then select that split
point “s” which minimizes the total sum of squares. This is the optimal split point
given the splitting variable Xj (2) In the second stage we vary the predictors Xj
over all j 5 1,…, p, and repeat step 1. Of course, when the splitting variable and
split point are varied to generate two new regions though a binary split, the
predicted response also changes since the response in a region depends on the
training observations included in that region.
Step 2: Form the two regions based on the splitting variable Xk and split
point sk*. Thus, R1 is the set of all observations xi whose kth component xik is
less than sk*. Similarly, R2 is the set of all observations xi whose kth component
xik is greater than sk*. Repeat step 1 for region R1, so this region also splits into
two in an optimal manner. This gives us three regions. Split one of these three regions in a similar manner and continue till some stopping rule terminates this process.
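The greedy split search just described can be sketched in plain Python (our own minimal implementation, not the book's code): scan every predictor and every observed split point, and return the pair that minimizes the total sum of squares over the two resulting regions.

```python
def best_split(X, y):
    # X: list of observations (each a list of p feature values); y: responses.
    # For every feature j and candidate split point s (taken from observed
    # values), partition into R1 = {x: x[j] < s} and R2 = {x: x[j] >= s},
    # predict each region's mean response, and keep the (j, s) pair that
    # minimizes the total sum of squared errors -- one greedy step.
    def ss(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)              # region prediction: mean response
        return sum((v - m) ** 2 for v in vals)

    best = None
    p = len(X[0])
    for j in range(p):
        for s in sorted({x[j] for x in X}):
            left = [yi for xi, yi in zip(X, y) if xi[j] < s]
            right = [yi for xi, yi in zip(X, y) if xi[j] >= s]
            if not left or not right:
                continue                       # skip empty partitions
            total = ss(left) + ss(right)
            if best is None or total < best[0]:
                best = (total, j, s)
    return best  # (total SS, splitting variable index, split point)
```

Applying the same function recursively to each resulting region grows the full tree; here only the single myopic step is shown.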
Technical detour 1:
Recall that the AID method discussed in Section 1 generates a regression tree.
Similar to the CART method for regression trees, it too uses an algorithm very
similar to the greedy algorithm just described. However, the AID and CART
differ on other aspects, and a major distinction is the rule for determining how
large the tree should be. The most common stopping rule for AID is based on the
sum of squares and the algorithm stops if the reduction in the sum of squares as
a result of a proposed split is not large enough to justify the split. The deter-
mination of how much reduction of sum of squares is desirable at each step is
left to the analyst’s judgment. In CART the determination of how large to grow
the tree is based on important considerations of controlling overfitting the tree
to the training data. Thus, tree size is chosen based on the training data itself.
Controlling overfitting is accomplished via the important concept of cost
complexity pruning of the tree. We turn to this next. In the typical implementation of CART, trees are grown to maximum depth and then pruned back using the pruning criterion.
each terminal node represents a region Rm, m = 1,…, 5. Suppose a tree has M terminal nodes. Each terminal node m defines a region Rm, m = 1,…, M. When we use the terminology of nodes in describing the partitioning of the feature space, we speak of a "parent node" giving rise to two "children nodes" as a result of a binary partition. We have already described how we calculate the sum of squares for a given region. Let us denote the sum of squares for region m by SSm, m = 1,…, M. The sum of squares over the training data is the sum of all the SSm for m = 1,…, M. That is: SS = SS1 + … + SSM. The SS is a measure of the goodness-
of-fit of a tree, and since it is like a cost/error function the goal is to minimize it.
The sub-tree that has the minimum SS on the test data would be the desired sub-
tree that avoids overfitting. While theoretically sound, this process is computa-
tionally infeasible owing to the very large number of sub-trees that one would
have to evaluate. Cost complexity pruning gives us a way to do pruning by only
considering a small set of sub-trees.
Intuitively, the cost complexity pruning method essentially puts a penalty
on the size of the tree such that the minimizing criterion is a combination of
the total sum of squares over all terminal nodes plus the number of terminal
nodes. Suppose that we have grown a large tree T′ which we wish to prune. Consider a sub-tree of T′ denoted by T. The cost complexity criterion, which is a function of the sub-tree T, is a combination of the total sum of squares and the sub-tree size measured by the number of terminal nodes, where we assume that the sub-tree has M terminal nodes. The cost complexity criterion is:
Cost complexity(T) = SS + λM
The parameter λ plays a role similar to the "weight decay" parameters discussed in Chapter 3. Since the goal is to minimize the cost complexity criterion, larger trees with more terminal nodes will be penalized. For larger values of λ, large tree sizes are penalized even more and the criterion yields smaller trees that are less likely to overfit the training data. For each λ we find the sub-tree T that minimizes the above cost complexity criterion. We denote it by Tλ. It has been shown that we are guaranteed to find such a Tλ corresponding to each λ (Breiman, Friedman, Olshen, & Stone, 1984). The idea is to successively combine internal nodes (nodes between the root and the leaves). When
two nodes are combined, it causes an increase in the sum of squares (SS). To get
an intuition for this, consider that the largest possible tree, one where each node
is just a single point, has a SS of zero since the predicted value (which is the
average of the response corresponding to a single observation) exactly equals the
actual response. Therefore, when we combine two nodes to create a smaller tree,
the SS increases. At each step we combine those two nodes which cause the
smallest increase in SS. As we continue in this way we produce the tree with a
single node which has the largest possible SS. Breiman et al. (1984) have shown that this sequence must contain the cost complexity minimizing sub-tree Tλ. In
“Technical detour 2” we provide details of the cost complexity pruning
technique.
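The criterion can be made concrete with a small sketch. The (SS, M) pairs below are hypothetical stand-ins for the nested sequence of sub-trees produced by successively combining nodes (SS rises as terminal nodes are merged away); the minimizer shifts toward smaller trees as λ grows:

```python
def cost_complexity(ss, m, lam):
    # Cost complexity criterion: total sum of squares over terminal nodes
    # plus a penalty of lam per terminal node (m = number of terminal nodes).
    return ss + lam * m

# Hypothetical pruning sequence: as nodes are combined, SS increases
# while the number of terminal nodes falls.
subtrees = [(0.0, 8), (2.0, 6), (5.0, 4), (12.0, 2), (30.0, 1)]

def best_subtree(lam):
    # The sub-tree T_lam minimizing SS + lam*M for a given lam.
    return min(subtrees, key=lambda t: cost_complexity(t[0], t[1], lam))
```

For a tiny λ the full 8-leaf tree wins; for a moderate λ a 4-leaf sub-tree wins; for a very large λ the criterion collapses the tree to a single node.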
Technical detour 2:
Step 1: Grow a tree to training data using recursive binary splitting. This step uses
the greedy algorithm described above to grow the tree. This step results in a
large tree which may overfit the training data.
Step 2: Do cost complexity pruning, which will give a set of smaller sub-trees Tλ, indexed by the tuning parameter λ. We must now choose one of these sub-trees. This is done via cross-validation as shown in step 3.
Step 3: Do a K-fold cross-validation. For each λ do the following:
a. Divide the training data into K folds. For each k 5 1,…, K, do: (i) Using the
process in step 1, grow a large tree using the training data that remains when we
leave out the kth fold, and (ii) Using the process in step 2, do cost complexity
pruning of the large tree generated in step (i).
b. Compute the predicted mean squared error on the test data in the kth fold that
was left out in step 3a.
c. Compute the average of the predicted mean squared errors from step 3b over
all the K folds. So, we now have the cross-validation error for each λ.
Step 4: Choose the λ corresponding to the lowest cross-validation error, and denote it by λ*. The required tree is the one corresponding to λ* from step 2.
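Steps 3 and 4 can be sketched generically as follows (our own helper functions; `fit_and_score` is a hypothetical stand-in for growing, pruning, and scoring a tree at a given λ on one fold):

```python
def kfold_indices(n, k):
    # Partition indices 0..n-1 into k folds (earlier folds get the extras).
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_error(lams, n, k, fit_and_score):
    # For each candidate lam, average the held-out error over the k folds.
    # fit_and_score(train_idx, test_idx, lam) grows and prunes a tree on the
    # training indices and returns the mean squared error on the held-out fold.
    folds = kfold_indices(n, k)
    all_idx = set(range(n))
    errors = {}
    for lam in lams:
        fold_errs = []
        for test_idx in folds:
            train_idx = sorted(all_idx - set(test_idx))
            fold_errs.append(fit_and_score(train_idx, test_idx, lam))
        errors[lam] = sum(fold_errs) / k
    # Step 4: pick the lam with the lowest cross-validation error.
    return min(errors, key=errors.get), errors
```

Plugging in a real tree-growing routine for `fit_and_score` completes the procedure; the scaffolding around it is exactly the loop described in steps 3a–3c.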
perhaps the most commonly used measure of inequality in groups. The cross-
entropy is an information theoretic measure and is often referred to as Shannon’s
entropy.
Suppose we want to build a classification tree in the multi-class case where there are K classes of the response variable, with the classes indexed by k = 1,…, K. Suppose we have a partition of the feature space into M regions Rm, m = 1,…, M. We denote by pmk the proportion of training data points in region m that belong to class k. The Gini index for node m (region m) is given by:
Gini index = Σ_{k=1}^{K} pmk (1 − pmk)
applies to the general case of K classes. The cross-entropy cost behaves similarly to the Gini index.
Recall that in the case of a regression tree, the predictor variable that causes
the largest decrease in the sum of squares was the one chosen to generate the next
binary partition. This logic carries over to classification trees as well – except that
the predictor that causes the largest decrease in the Gini index, or equivalently,
the cross entropy, is the one chosen for the next binary partition.
To build intuition, it may be useful to see this in a simple illustration. Let us
revisit the opening illustration of the CHAID example in Section 1. Recall that in
a typical marketing context we wanted to classify a group of people into “buy”
and “not buy” for a specific product. Instead of the three predictor variables,
suppose for the current illustration of how the Gini index performs binary splits
we consider only two predictors: “gender” and “marital status.” As before, we
assume that marital status has two categories, “married” and “not-married,”
where divorced people are put in the not-married category. For illustrative pur-
poses, consider the synthetic data in Table 5.1.
In order to use the Gini index for determining which predictor variable to split
the data on, we first have to compute the decrease in Gini index as a result of
splitting on both the variables, and then choose that predictor variable that cor-
responds to a larger decrease.
At the root node the proportions of "buy" and "not buy" are 0.5 each since there are 6 "buy" and 6 "not buy" out of 12 customers. Therefore, the Gini index for the root node is: Root: 0.5*(1 − 0.5) + 0.5*(1 − 0.5) = 0.5
Consider the split with respect to the predictor variable “gender.” Let region
1 be “males” and region 2 be “females.” Also, let the response class “buy” be
class 1 and “not buy” be class 2. Among the 6 males, there are 5 “buy” and so
the proportion of “buy” is p11 5 5/6 5 0.83. Similarly, the proportion of “not
buy” among the males is p12 5 0.17. Among the 6 females, there is 1 “buy” and
so the proportion of “buy” is p21 5 0.17. Similarly, the proportion of “not buy”
among the females is p22 5 0.83. The Gini indices for regions 1 (male) and 2
(female) are: Males: p11(1 2 p11) 1 p12(1 2 p12) 5 0.83*(1–0.83) 1
0.17*(1–0.17) 5 0.2822; Females: p21(1 – p21) 1 p22(1 – p22) 5 0.17*(1–0.17) 1
0.83*(1–0.83) 5 0.2822
Accounting for the fact that there are 6 males and 6 females in a customer
base of 12, the decrease in Gini index when we split the root node by “gender” is:
0.5 2 (6/12)*0.2822 2 (6/12)*0.2822 5 0.22
Consider now the split with respect to the predictor variable “marital status.”
Let region 1 be "married" and region 2 be "unmarried." Among the 6 married customers, there are 3 "buy" and so the proportion of "buy" is p11 = 0.5. Similarly, the proportion of "not buy" among the married customers is p12 = 0.5. Among the 6 unmarried customers, there are 3 "buy" and so the proportion of "buy" is p21 = 0.5. Similarly, the proportion of "not buy" among the unmarried customers is p22 = 0.5. The Gini indices for regions 1 (married) and 2 (unmarried) are: Married: p11(1 − p11) + p12(1 − p12) = 0.5*(1 − 0.5) + 0.5*(1 − 0.5) = 0.5; Unmarried: p21(1 − p21) + p22(1 − p22) = 0.5*(1 − 0.5) + 0.5*(1 − 0.5) = 0.5
Accounting for the fact that there are 6 married and 6 unmarried people in the customer base of 12, the decrease in Gini index when we split the root node by "marital status" is: 0.5 − (6/12)*0.5 − (6/12)*0.5 = 0
Since a split on the predictor variable “gender” causes a larger decrease in the
Gini index, we choose that predictor as the next splitting variable. This would
result in the tree shown in step 2 in our description of the CHAID algorithm in
Section 1.
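The two Gini-decrease computations above can be reproduced in a few lines using the counts from Table 5.1:

```python
def gini(counts):
    # Gini index of a node; counts are class counts, e.g. [buys, not_buys].
    n = sum(counts)
    props = [c / n for c in counts]
    return sum(p * (1 - p) for p in props)

def gini_decrease(parent_counts, children_counts):
    # Weighted decrease in Gini index from splitting the parent node into
    # the given children (weights = child size / parent size).
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * gini(c) for c in children_counts)
    return gini(parent_counts) - weighted

# Table 5.1 counts: 6 males (5 buy, 1 not buy), 6 females (1 buy, 5 not buy);
# 6 married (3 buy, 3 not buy), 6 unmarried (3 buy, 3 not buy).
gender_drop = gini_decrease([6, 6], [[5, 1], [1, 5]])    # about 0.22
marital_drop = gini_decrease([6, 6], [[3, 3], [3, 3]])   # exactly 0
```

With the exact fractions 5/6 and 1/6 the gender decrease is 2/9 ≈ 0.222, which rounds to the 0.22 reported in the text, and the marital-status decrease is zero, so "gender" is chosen for the split.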
It might be instructive to see how an informational measure such as cross-
entropy would select predictor variables to perform the split. Using the same
illustration in the previous paragraph, we first calculate the entropy for the root node. At the root node the proportions of "buy" and "not buy" are 0.5, and so the entropy (using logarithms of base 10) is: Root: −[0.5*log(0.5) + 0.5*log(0.5)] = 0.3010
Consider the split with respect to the predictor variable "gender." As before, we use p11 = 0.83 and p12 = 0.17 for the proportions of "buy" and "not buy" among the males. Similarly, we use p21 = 0.17 and p22 = 0.83 for the proportions of "buy" and "not buy" among the females. The cross-entropies for regions 1 (male) and 2 (female) are: Males: −[p11*log(p11) + p12*log(p12)] = −[0.83*(−0.08) + 0.17*(−0.77)] = 0.196; Females: −[p21*log(p21) + p22*log(p22)] = −[0.17*(−0.77) + 0.83*(−0.08)] = 0.196
Accounting for the fact that there are 6 males and 6 females in a customer base of 12, the decrease in cross-entropy when we split the root node by "gender" is: 0.3010 − (6/12)*0.196 − (6/12)*0.196 = 0.105
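These entropy figures can be checked with a short sketch (base-10 logarithms as in the text, using the exact fractions 5/6 and 1/6 rather than the rounded 0.83 and 0.17):

```python
from math import log10

def entropy(props):
    # Shannon entropy with base-10 logarithms, as used in the text.
    return -sum(p * log10(p) for p in props if p > 0)

root = entropy([0.5, 0.5])           # 0.3010
males = entropy([5 / 6, 1 / 6])      # about 0.196
females = entropy([1 / 6, 5 / 6])    # same by symmetry
drop = root - 0.5 * males - 0.5 * females  # decrease from splitting on gender
```

The result, about 0.105, matches the text's figure for the "gender" split.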
Consider the split with respect to the predictor variable "marital status." As before, we use p11 = 0.5 and p12 = 0.5 for the proportions of "buy" and "not buy" among the married customers. Similarly, we use p21 = 0.5 and p22 = 0.5 for the proportions of "buy" and "not buy" among the unmarried customers. The cross-entropies for regions
Executive Summary
Decision trees are among the most intuitive machine learning models and the easiest to interpret and visualize. This makes their results very easy to communicate to non-technical stakeholders. Moreover, decision
trees are particularly well-suited for identifying how interactions between
predictors can affect the response variable. Two predictors are said to interact
if the effect of one predictor on the response variable depends on the other
predictor.
Modern decision trees are grown using the “Greedy Algorithm.” Essen-
tially the greedy algorithm proceeds by recursively partitioning the entire data
set into smaller subsets, or regions, based on the predictors. The goal is to
form regions such that observations in a given region are similar with respect
to their association with the response variable. It is called “greedy” because it
myopically looks only one step ahead rather than considering the entire tree
growing process.
The predictive performance of decision trees can be improved by an
aggregation over multiple trees. This is the idea behind bagging and boosting of
decision trees and random forests. A random forest is an aggregation technique
that improves on bagging in a specific way.
that have been the theme of our discussions about decision trees in Sections 1 and 2. Tree partitioning also promotes homogeneity within a region and heterogeneity between regions. Since our main interest is in tree structures, we will provide below a very brief sketch of how the hierarchical clustering scheme leads to a
tree structure from which segments can be identified.
Suppose we have seven objects denoted alphabetically from A through G (cor-
responding to the bold dots) in Fig. 5.7.
The first step is to identify the two objects that have the smallest distance
between them. In the formal clustering algorithm we usually have a matrix of
distances between the objects – with N objects we would have an N×N matrix
whose entries are pair-wise distances between the N objects. In our sketch of the
hierarchical scheme we will just do this visually. By visual inspection we can see
that the smallest distance is between objects F and G. At the first step we form the
first group as {F, G} as shown by the circle labeled 1 in Fig. 5.7. Thus, {F, G} is
now one entity. The algorithm then needs a measure of the distance between the
group {F, G} and the other remaining 5 objects. In most marketing and behav-
ioral science applications we choose the maximum distance (called complete
linkage) or the average distance (called average linkage). The complete linkage
distance between an object outside the group, say B, and the group {F, G} would
be max[distance(BF); distance(BG)]. The average linkage distance is similarly
defined using the average instead of the maximum. Using the complete linkage or
average linkage distance, and the usual distance for pairs of objects that are not in
the group {F, G}, we can form another matrix of pair-wise distances and then
select the smallest distance. In Fig. 5.7 above, suppose it is the distance between
objects A and D. So, at the second step we form the second group {A, D}. See
circle labeled 2 in Fig. 5.7. We now need to again calculate the pair-wise distances
treating {F, G} and {A, D} as two distinct entities. To calculate pair-wise
distances at this step, we need the distance between two groups (entities). Both the
complete linkage and average linkage criteria are well-suited for this. At the third
step, object B is added to the group {F, G}, so the third group is {B, F, G}, as
seen in the circle labeled 3. The clustering process proceeds in a sequential manner
as described until we have the fourth group as {A, B, D, F, G}, the fifth as {C, E}
and finally in step six we have all {A, B, C, D, E, F, G}.
Our interest is in representing this clustering process as a tree-like structure. Such
a hierarchical tree structure in the context of clustering is called a dendrogram. The
y-axis shows the different distances at which different objects and groups have been
joined together (Fig. 5.8).
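The merge sequence can be sketched in a few lines of pure Python. The seven coordinates below are hypothetical, chosen only so that F and G merge first and so that clipping at two clusters recovers the segments read off the dendrogram:

```python
import math

# Hypothetical coordinates for objects A..G (indices 0..6).
points = [(1.5, 1.5), (2.7, 2.75), (5.0, 0.0), (1.9, 1.8),
          (5.4, 0.6), (3.0, 3.0), (3.2, 3.1)]  # A, B, C, D, E, F, G

def complete_linkage(points, n_clusters):
    # Agglomerative clustering: start with singletons and repeatedly merge
    # the two clusters with the smallest complete-linkage distance (the
    # maximum pairwise distance between their members).
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    merges = []
    while len(clusters) > n_clusters:
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[a][:], clusters[b][:]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

clusters, merges = complete_linkage(points, 2)
```

With these coordinates the first merge is {F, G}, and stopping at two clusters yields the segments {A, B, D, F, G} and {C, E}, mirroring the first clip of the dendrogram.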
How do we identify segments from the tree above? When we clip the
dendrogram at different distances we arrive at a different number of clusters along
with the objects that lie in the clusters. This can be visually depicted as in Fig. 5.9.
When we clip the dendrogram at a certain level, we ignore the part above that
level and focus only on the part below it. Consider the first clip from the top in
Fig. 5.9. We can detect two distinct segments by focusing on the dendrogram
below that level. The segments are {A, B, D, F, G} and {C, E}. Similarly, the
second clip from the top of the figure gives us three segments. Having discussed
decision trees, we now turn to the random forest method which, as we noted, is
built on ideas of decision trees. To understand random forests, we first have to
familiarize ourselves with the concepts of bootstrapping, bagging, and boosting.
improving a machine learning model. Random forests are based on the concept of
bagging, in the sense that a random forest is a specific type of improvement over
bagged trees. At an intuitive level, bagging of any machine learning model, including
a decision tree, is the aggregation over different variants of the same model, where
model aggregation helps improve the predictions of the learning model. The concept
of bagging, in turn, requires us to understand the concept of bootstrapping, since the
latter generates the model variants that are aggregated. Boosting is another tech-
nique to improve the prediction of any machine learning model, including decision
trees, and is also based on the ideas of fitting multiple models. Unlike bagging, the
models are fit in a sequential manner – any given model is fit based on information
from previous models.
4.1 Bootstrapping
As mentioned above, bagging is the aggregation over different variants of a model.
The model variants are obtained when the same model is estimated on different
samples from the same training data set. The process of sampling repeatedly from
the training data is conceptually different from the usual idea in statistics of sam-
pling from the population. While the latter may be ideal, it is often not practical to
get multiple samples from the population. This technique of sampling from the
training data forms the basis of bootstrapping. It is for this reason that bagging
is referred to as bootstrap aggregation. We will expand on these concepts in the
following paragraphs.
Bootstrapping was originally designed to assess the uncertainty associated with a
statistical learning model or any statistical parameter estimate. The most familiar
example of a parameter is the coefficient of a simple linear regression with just one
predictor, and bootstrapping can be used to get a measure of the uncertainty of the
estimate of the coefficient. Of course, the main strength of bootstrapping lies in its
ability to help us assess the uncertainty of complex parameter estimates whose
standard errors are difficult to derive analytically.
160 Machine Learning and Artificial Intelligence in Marketing and Sales
As a simple illustration, suppose the training data consists of three observations
v1, v2, v3, and we draw bootstrap samples of the same size with replacement, for
example:
Sample 1: V(1) = {v2, v3, v3}
Sample 2: V(2) = {v1, v2, v2}
Note that in sample 1 the observation v3 has been repeated and v1 is not
included in it. We fit the model, Y = b0 + b1X, D times, once to each of the D
bootstrap samples. Corresponding to sample V(d) we obtain the parameter esti-
mates (b0(d), b1(d)), d = 1,…, D. An obvious way to assess uncertainty associated
with the estimate of b1 is to obtain its standard error. The standard error of the
estimate of b1 can be obtained by using the variation in the D estimates: b1(1),
b1(2), …, b1(D). It is now a simple matter to use standard measures of variation to
calculate the standard error of b1. The same procedure is employed to assess
uncertainty for more complex statistical learning models.
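The procedure just described can be sketched in a few lines of code. This is a minimal illustration, assuming Python with NumPy; the data set, the true coefficients, and the helper name `fit_slope` are all invented for the example:

```python
# Bootstrap standard error of a simple-regression slope b1.
# Synthetic data for illustration only.
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 1000                          # N observations, D bootstrap samples
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(size=N)    # true b0 = 2, b1 = 3

def fit_slope(x, y):
    # OLS slope b1 for the simple regression Y = b0 + b1*X
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

# Draw D bootstrap samples (with replacement, same size N) and refit
slopes = []
for _ in range(D):
    idx = rng.integers(0, N, size=N)      # a bootstrap sample V(d)
    slopes.append(fit_slope(x[idx], y[idx]))

se_b1 = np.std(slopes, ddof=1)            # bootstrap standard error of b1
print(round(se_b1, 3))
```

The variation across the D refitted slopes directly estimates the sampling uncertainty of b1, exactly as described above.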
4.2 Bagging
Bagging draws on the bootstrapping idea of drawing multiple samples with
replacement from the training data, but does so in order to improve the estimate
of the parameters or the prediction from a machine learning model. One sense in
which parameter estimates or predictions can be improved is to reduce their
variance. The bagging procedure for a regression tree consists of the following steps:
(1) Draw D bootstrap samples from the training data – the samples are of the
same size as the original training data and are drawn with replacement.
(2) Grow a regression tree on each of the D bootstrap samples. Usually the trees
are not pruned.
(3) For a given observation x, make D predictions, one for each tree. Denote
the D predictions by ŷ(1), …, ŷ(D). We give some details of this step. Consider
the tree grown on the bootstrap sample V(d), and consider the prediction made
by this tree for a given observation x. Since a tree is essentially a partition of
the feature space into distinct regions, the observation x will lie in some region
of the tree. As we saw in Section 2.1, the prediction corresponding to x is just
the average of the responses for that region. Let us denote this by ŷ(d). This
process is repeated for the D bootstrap samples V(d), d = 1,…, D, and D trees
are grown. Importantly, each of these trees may have different partitions of the
feature space and also have different sizes. Thus, the regions in which the focal
observation x lies in the different trees may be defined by different combina-
tions of predictor variables. Each of these trees yields a prediction for the
observation x, and this gives us D predictions ŷ(1), …, ŷ(D).
(4) The bagging estimate of the prediction for observation x is the average of the
D predictions ŷ(1), …, ŷ(D). This aggregate prediction has a lower variance
and hence bagging improves the learning model.
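Steps (1)–(4) can be sketched as follows, assuming scikit-learn's tree grower is available; the two-predictor data set is synthetic and exists only for illustration:

```python
# Bagging regression trees: grow D unpruned trees on bootstrap samples,
# then average their predictions for a given observation x.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
N, D = 200, 25
X = rng.uniform(-3, 3, size=(N, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

trees = []
for d in range(D):
    idx = rng.integers(0, N, size=N)      # bootstrap sample V(d)
    tree = DecisionTreeRegressor()        # grown deep, not pruned
    tree.fit(X[idx], y[idx])
    trees.append(tree)

def bagged_predict(x):
    # Average the D per-tree predictions for observation x
    preds = [t.predict(x.reshape(1, -1))[0] for t in trees]
    return np.mean(preds)

x_new = np.array([1.0, 0.0])
print(bagged_predict(x_new))   # compare with sin(1.0) ≈ 0.84
```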
A similar logic and bagging process holds also for bagging classification trees.
The two differences pertain to steps 3 and 4 above. In step 3 we need to make a
prediction from each of the D trees for a given observation x. Recall from Section
2.2, that for each classification tree the prediction in a region is the mode of the
responses corresponding to all observations in that region. That is, we take the most
frequently occurring class in a given region as the prediction for all observations in
that region. Let us denote the prediction from the dth tree as ŷ(d). For the
observation x, the prediction ŷ(d) from the tree grown on the bootstrap sample V(d)
is a vector with a 1 for the most commonly occurring class in a region and 0s
elsewhere. Suppose we have a binary classification case with responses: “buy” or
“not buy.” Suppose “buy” is coded as 1 and “not buy” as 0. Consider the tree
grown on the bootstrap sample V(d). If the most commonly occurring class in a
region is “buy” then the prediction corresponding to an observation x in that region
would be written as ŷ(d) = (1, 0). In step 4 we need a single bagging estimate from
the D predictions ŷ(1), …, ŷ(D). In the classification context, the bagging estimate is
just that class which is the most commonly occurring among the D predictions.
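A toy illustration of this majority-vote rule, with hypothetical per-tree predictions:

```python
# The bagging estimate for classification is the most common class
# among the D per-tree predictions (here D = 5, predictions invented).
from collections import Counter

tree_predictions = ["buy", "not buy", "buy", "buy", "not buy"]
bagged_class = Counter(tree_predictions).most_common(1)[0][0]
print(bagged_class)   # "buy" wins 3 votes to 2
```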
It may be useful to have some intuition for why decision trees particularly
benefit from bagging. Decision trees often lack robustness and would benefit the
most from the aggregation that bagging provides. When a decision tree model is
fit multiple times to different training data sets, the different tree variants are
likely to involve different predictors (features) and may even have different sizes
(number of terminal nodes) for different data sets. The average of these trees is
less likely to omit an important predictor and is also less likely to emphasize an
unimportant one – both of these types of mis-specifications could be artifacts of
over-reliance on just one training data set, but are unlikely to be repeated when
one uses many training data sets.
A major advantage of bagging is that it automatically gives us a way to gauge
prediction accuracy on test data without doing explicit cross-validation (see
Chapter 3). In bagging, a given tree is grown on a bootstrap sample. Since the
bootstrap samples are drawn from the original training data with replacement,
it is possible for a given training observation to not be included in a given
bootstrap sample. As an illustration, consider the first bootstrap sample V(1) in
Section 4.1 where the training observation v1 is not included. Because observation
v1 has not been used in the tree grown on bootstrap sample V(1), it can be used as a
test data point for this tree. For a given tree, observations which have not been
used to grow the tree are called out-of-bag observations. Let us now look at this
from the point of view of individual observations in the original training data. For
a given training observation, we consider the set of trees for which this particular
observation is an out-of-bag observation. This observation can play the role of
test data for all of these trees. We can consider the set of predictions that these
trees (for which this particular observation was out-of-bag) make for this
particular data point, and take the average of these predictions. This average is
the out-of-bag prediction for this observation. We can now compare the out-of-
bag prediction for this observation with its actual response. This is a measure of
test error for this particular observation. Doing this for all points in the original
training data set, we can obtain an aggregate measure like the out-of-bag MSE
(mean square error). In this way we have obtained a useful measure of prediction
accuracy, the test error, without doing explicit cross-validation. While the out-of-
bag MSE is appropriate for a regression tree, we can use similar ideas in this
paragraph to compute the out-of-bag prediction for a particular observation for a
classification problem. This can be compared to the actual response for that
observation. Doing this for all training observations, we can obtain the overall
out-of-bag classification error. While we have described bagging for decision
trees, this method works in a similar manner for more complex machine learning
models.
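The out-of-bag computation described in this paragraph might be sketched as follows; scikit-learn trees and a synthetic one-predictor data set are assumed:

```python
# Out-of-bag MSE: for each observation, average predictions from only
# those trees whose bootstrap sample did not contain it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
N, D = 300, 50
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

trees, in_bag = [], []
for d in range(D):
    idx = rng.integers(0, N, size=N)
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    in_bag.append(set(idx.tolist()))      # who was in bootstrap sample d

oob_pred = np.full(N, np.nan)
for i in range(N):
    # trees for which observation i is out-of-bag (roughly 1/3 of them)
    preds = [t.predict(X[i:i + 1])[0]
             for t, bag in zip(trees, in_bag) if i not in bag]
    if preds:
        oob_pred[i] = np.mean(preds)

mask = ~np.isnan(oob_pred)
oob_mse = np.mean((y[mask] - oob_pred[mask]) ** 2)
print(round(oob_mse, 3))
```

No separate test set was held out: the out-of-bag observations play that role automatically.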
At this juncture it may be useful to discuss how one may determine the relative
importance of the predictors in tree aggregates. As mentioned earlier, one of the
main attractions of trees is their ease of interpretation and visualization. Unfor-
tunately, while tree aggregation methods like bagging may improve predictive
accuracy, they tend to lose the simple interpretation of trees. Before explaining
predictor, or feature, importance in tree aggregates we will first describe predictor
importance in single decision trees. There are several measures of importance of
predictor variables, and most software packages can output at least one variable
importance measure. To give a flavor of how importance measures work, we will
restrict ourselves to the two most common methods and not try to be exhaustive.
The first method is essentially based on the idea that is used to grow the tree in the
first place. To illustrate the idea, let us first consider classification trees. As shown
in Section 2.2, at any node, the decision of which predictor variable to do the next
split on is based on the reduction in some node impurity measure (Gini or cross-
entropy). Recall that the predictor variable that resulted in the maximum
reduction in node impurity was the one chosen as the next splitting variable. One
common, perhaps the most common, importance measure is based on this logic.
For each predictor variable, we record the total impurity reduction due to all
splits (across all nodes in the tree) over this predictor. This gives the variable
importance for the single classification tree. This can be easily extended to an
aggregation of trees, for instance, a random forest in Section 5 which is based on
an aggregation of bagged trees. We simply average the total impurity reduction
due to this predictor over all trees in the aggregation. A somewhat more
sophisticated measure of variable importance is the permutation importance
measure due to Breiman (2001). First consider the importance of a given predictor
for a single tree. The steps followed by the permutation importance measure are:
• We start with the tree grown using the training data set. We record its predictive
performance on test data. For this we can use any standard acceptable measure
for model performance – e.g., number of observations correctly classified.
• Suppose we have p predictor variables, X1,…, Xp, and we are measuring the
importance of variable Xk. We randomly permute the focal predictor vari-
able Xk. This permutation changes the data set in a particular way. The ith
observation is xi = (xi1,…, xip), i = 1,…, N. Its kth component is xik. A per-
mutation of the predictor Xk will replace the kth component of observation i,
that is, xik, with the kth component of some observation j, that is, xjk. This will be
done for all the observations i = 1,…, N.
• Grow a tree with the permuted training data. The permuted training data
uses the permuted variable Xk and all the other non-permuted variables.
We record the predictive performance of this tree on test data.
• The difference between the performance of the trees grown with the original
training data and the permuted training data is a measure of importance of
the predictor Xk in the given tree.
To build some intuition for the permutation importance method, note that if
the predictor Xk had a strong relationship with the response variable in the
original training data (before permutation) then this relationship would be broken
by the permutation. The stronger the original relationship, the larger the decrease
in predictive performance we are likely to see after the permutation. Hence the
difference between the predictive performance of trees grown without and with the
permutation of a predictor variable captures the importance of that variable.
Now, we can extend this idea to an aggregation of trees quite naturally. We can
compute the difference in prediction accuracy before and after permutation for
each tree and then average this across all the trees in the aggregation. We can
determine the variable importance for regression trees using exactly the same
procedures as we have described for classification, except that we use the sum of
squares as the criterion instead of node purity measures like Gini or cross-entropy.
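The permutation steps above can be sketched for a single classification tree. The data here are synthetic, deliberately built so that X1 is informative and X2 is pure noise:

```python
# Permutation importance for one tree: permute predictor Xk in the
# training data, regrow the tree, and measure the drop in test accuracy.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
N = 500
X = rng.normal(size=(N, 2))
y = (X[:, 0] > 0).astype(int)            # only X1 drives the response
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

def accuracy(model):
    return (model.predict(X_te) == y_te).mean()

base = accuracy(DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr))

importance = {}
for k in range(2):
    X_perm = X_tr.copy()
    X_perm[:, k] = rng.permutation(X_perm[:, k])   # permute predictor Xk
    perm_acc = accuracy(
        DecisionTreeClassifier(random_state=0).fit(X_perm, y_tr))
    importance[k] = base - perm_acc      # drop in test performance

print(importance)
```

Permuting the informative X1 breaks its relationship with the response, so its accuracy drop dwarfs that of the noise predictor X2.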
Before leaving the topic of bagging it is important to point out one effective way
in which bagged trees can be further improved. This will also set us up for random
forests which are designed to precisely realize this improvement. The intuition in
the opening paragraph of Section 4.2 gives just a rough-and-ready idea for why
aggregation may work in general. The reader may have noticed that the intuition
depends on the observations being independent. It is important to realize that the
multiple trees grown in bagging may not be independent. This is because if there are
strong relationships between certain predictors and the response then all (or almost
all) trees are likely to capture them in a similar manner. Now, since the different
bootstrap samples used in bagging are drawn from the same training data, the trees
grown on these bootstrap samples can be considered to be identically distributed.
Hence, from what we just discussed the trees grown during bagging may be
identically distributed but not necessarily independent.
4.3 Boosting
Boosting, like bagging, is also a “committee-based” learning model, in that, these
methods leverage the advantages of aggregating many variants of some basic
learning model. However, boosting does not use bootstrap samples. The major
distinction from bagging is that the different variants of the basic model in
boosting are not independent. Recall that bagging involved fitting the model
multiple times to independent bootstrap samples from the training data. In
boosting, on the other hand, the model variants are fit sequentially, and at each
stage, the model builds on the model variant at the previous stage – that is, at each
stage the boosting process leverages information from the previous stage.
Boosting can be used for many types of machine learning models, but it has
been found to provide the biggest improvements for decision trees. Moreover,
boosting has been found to be especially useful for classification trees.
We can consider the boosted tree T as the additive combination of a sequence of
D stages of model fitting. At a given stage d (d = 1,…, D) the current model is the
additive combination of trees till stage d−1. At stage d, the component tree Td is
added to the current model until the process goes through all D stages. Importantly,
at stage d the boosting algorithm focuses only on the current “best” tree Td without
adjusting the trees that have already been built at earlier stages. This process is
sometimes referred to as the forward stagewise additive modeling process.
To get an intuition for how boosting may improve the performance for a
decision tree, one must first note that each of the individual trees in the sequence
can be shallow trees. Reducing overfitting is the hallmark of a good learning model.
By definition, shallow trees are “weak learners” and are less likely to overfit the
training data. Despite the shallowness of the component trees in the aggregation of
trees in boosting, the predictive ability of the model is not harmed because of the
careful weighting scheme of the training data – in growing the tree at any stage,
observations that were misclassified by the tree in the previous stage are given a
higher weight and the current tree thus concentrates on them. In this way,
boosting combines the twin benefits of less overfitting and better predictions. As in
the case of bagging, boosting especially helps decision trees because single trees are
often quite poor in their predictive ability. Thus, the improvement due to aggre-
gation schemes that enhance the predictive ability of trees can be quite dramatic.
We will first discuss boosting for a binary classification tree. Suppose we have
N training observations (xi, yi), i = 1,…, N. Boosting involves growing D trees Td
sequentially in stages, where the stages (trees) are indexed by d = 1,…, D. The
boosting process involves assigning weights wi, i = 1,…, N, to the individual
observations such that, at each step, the observations that were misclassified by
the tree in the previous step are assigned higher weights. In this way the boosting
algorithm ensures that the decision tree at a given step pays more attention to
observations that were misclassified at the earlier step. Once the D trees are grown
sequentially, the final prediction is based on the weighted combination of trees,
λ1T1 + λ2T2 + … + λDTD
We will sketch a common boosting algorithm for a binary classification tree – the
AdaBoost.M1 algorithm. Instead of formal technical statements, we will sketch
the intuitions behind each step of AdaBoost.M1 for a two-class classification tree.
Step 1: Initialize the beginning weights to wi = 1/N for all i = 1,…, N. So, we start
in stage 1 by assigning equal weights to all training observations.
Step 2: For stages d = 1,…, D, perform the following steps:
(a) Fit classification tree Td to training data weighted with weights wid, i = 1,…,
N. The trees are fit by minimizing a special loss function (based on expo-
nential loss).
(b) Compute a measure of misclassification error rate over the training data. This
quantifies the total amount of misclassification over the training data,
weighted by the weights wid.
(c) Compute “tree weight” λd that will be used as weight on tree Td in the weighted
combination of trees. This weight is calculated based on the misclassification
error rate computed in step 2b. The tree weight is small if the misclassification
error rate is large and vice versa.
(d) Update the weights wid (on the training observations) to obtain new weights
wi,d+1. The weights are updated such that the new weights wi,d+1 are larger than
the current weights wid for observations i that are misclassified in tree Td. The
updated weights wi,d+1 are used to weight the training observations in the next
stage, and the next tree, Td+1, is grown. This process continues until all D trees
are built.
Step 3: Output the aggregate tree λ1T1 + … + λDTD. The final prediction is based on this.
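A compact sketch of these steps, using depth-1 trees (“stumps”) as the component trees; the two-class data are synthetic, coded +1/−1, and the tree-weight formula follows the standard AdaBoost.M1 recipe:

```python
# AdaBoost.M1 sketch: reweight misclassified observations at each stage,
# then predict with the sign of the lambda-weighted sum of trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
N, D = 400, 20
X = rng.normal(size=(N, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # classes coded +1 / -1

w = np.full(N, 1.0 / N)                      # Step 1: equal weights
trees, lambdas = [], []
for d in range(D):                           # Step 2
    tree = DecisionTreeClassifier(max_depth=1)
    tree.fit(X, y, sample_weight=w)          # (a) fit weighted stump
    miss = tree.predict(X) != y
    err = w[miss].sum() / w.sum()            # (b) weighted error rate
    lam = np.log((1 - err) / max(err, 1e-10))  # (c) tree weight lambda_d
    w = w * np.exp(lam * miss)               # (d) up-weight misclassified
    trees.append(tree)
    lambdas.append(lam)

def boosted_predict(Xq):
    # Step 3: sign of the weighted combination of the D trees
    score = sum(l * t.predict(Xq) for t, l in zip(trees, lambdas))
    return np.sign(score)

train_acc = (boosted_predict(X) == y).mean()
print(round(train_acc, 3))
```

Even though each stump alone is a weak learner, the weighted combination classifies the diagonal boundary well.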
Having described the boosting of classification trees, we now turn to the boosting
process for regression trees. The underlying logic is similar to boosting classification
trees, even though specific details vary. Here too multiple regression trees are built
sequentially, where each regression tree is small in size. This prevents overfitting.
Moreover, here too the boosting process induces the tree at a given stage to pay
more attention to errors made in the previous stages. In the case of classification
trees we tweaked the training data at a stage to put more weight on observations that
were misclassified by the tree at the earlier stage. In the case of regression trees, the
tree at a given stage is fit to the “residuals” from the current model rather than to the
actual response Y. Recall that in a simple OLS regression Y = f(X) + e, the residual
corresponding to any X = x is the “error”: y − f(x). Thus, fitting the tree to the
residuals is tantamount to focusing on errors from the previous trees.
There are two hyperparameters that control overfitting. One hyperparameter
controls the number of splits, say s, in the recursive partitioning process for
growing the trees. The number of splits controls the tree size, and smaller trees are
less likely to overfit. The number of splits can be rather small, even just 1, so that
we only grow very shallow trees. The other hyperparameter that controls
overfitting is the weight λ in the weighted combination of the D regression trees
Td(x), d = 1,…, D, below:
T(x) = λT1(x) + λT2(x) + … + λTD(x)
Unlike the case of the AdaBoost.M1 algorithm for classification trees, a very
commonly used algorithm for regression trees treats the weight λ as a
hyperparameter – so it is chosen by the analyst using cross-validation or other means.
Since the boosting algorithm learns over many stages, by adding a tree at each
stage, the parameter λ controls the rate at which the boosting algorithm learns.
When learning is slow, there is also less chance of overfitting.
To set the stage for the algorithm for boosting regression trees it may be
helpful to recall the stage-wise nature of the boosting process. The boosted tree
T(x) is the outcome of a sequence of D stages of model fitting. At each stage d, the
component added to the current model is the weighted tree λTd(x). Of
course, the current model is the weighted combination of trees till stage d−1. We
now sketch the algorithm for boosting regression trees.
Step 1: Initialize the beginning residuals to be the actual response: ri = yi for all
i = 1,…, N. Initialize T(x) = 0.
Step 2: For each stage d = 1,…, D: fit a regression tree Td with s splits to the
training data (xi, ri), i = 1,…, N; add the shrunken tree to the current model, so
that T(x) becomes T(x) + λTd(x); and update the residuals, so that ri becomes
ri − λTd(xi).
Step 3: Output the boosted tree T(x) = λT1(x) + … + λTD(x).
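One way to sketch this residual-fitting process, assuming scikit-learn and synthetic one-predictor data, with single-split trees and a shrinkage weight of 0.1:

```python
# Boosting shallow regression trees: each tree is fit to the current
# residuals, and the model grows by lambda times that tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
N, D, lam = 300, 100, 0.1
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

r = y.copy()                              # residuals start as the response
trees = []
for d in range(D):
    tree = DecisionTreeRegressor(max_leaf_nodes=2)   # a single split
    tree.fit(X, r)
    r = r - lam * tree.predict(X)         # shrink and subtract the new tree
    trees.append(tree)

def boosted_predict(Xq):
    # T(x) = sum over d of lam * T_d(x)
    return sum(lam * t.predict(Xq) for t in trees)

mse = np.mean((y - boosted_predict(X)) ** 2)
print(round(mse, 3))
```

Because each component tree makes only one split, the model learns slowly, which is precisely what keeps overfitting in check.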
More details of boosting regression trees are given in the “Technical detour 3.”
Technical detour 3:
For the sake of completeness, we will briefly mention the stochastic gradient
boosting algorithm (Friedman, 2002). Here, at each stage in the boosting process,
we sample a fraction of the training observations without replacement. The tree is
grown with this small sample, which can be much smaller than the training data
set size when the latter is large. Even though the parallel is not exact, this idea
bears some resemblance to the idea of mini-batch gradient descent that we have
discussed in Chapter 2. In the interests of speeding up computations, especially
when the training data set is large, smaller subsets of the entire training data set
are actually used for training. Just as the batch size in mini-batch gradient descent
was an additional hyper-parameter that needed tuning, in stochastic gradient
boosting too the fraction of the training data used becomes a hyperparameter.
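In scikit-learn's gradient boosting implementation, for instance, this subsampling fraction is exposed as the `subsample` parameter; a value below 1.0 triggers stochastic gradient boosting. A minimal sketch on synthetic data:

```python
# Stochastic gradient boosting: each stage is trained on a random
# fraction (here half) of the training observations.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

sgb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                subsample=0.5,    # fraction used per stage
                                random_state=0)
sgb.fit(X, y)
mse = np.mean((y - sgb.predict(X)) ** 2)
print(round(mse, 3))
```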
Having discussed some prominent aggregation methods like bagging and
boosting, we now turn to random forests.
Executive Summary
While decision trees are easy to interpret and visualize, they sometimes lack
the predictive ability of other machine learning methods. Aggregation over
multiple trees is one way in which the predictive accuracy of decision trees
can be improved. Bagging and boosting of decision trees are two popular
tree aggregation methods. While bagging and boosting can be used to
aggregate other non-tree methods, they often perform best with decision
trees.
Bagging is the process of averaging over multiple trees where the trees are
grown using independent bootstrap samples from the training data set. A
bootstrap sample is a random sample that is drawn from the training data
with replacement and which is of the same size as the training data set.
Bootstrapping is done when it is not possible to get enough different samples
from the population itself. The average of the prediction over multiple trees
has a lower variance than the prediction from an individual tree. Lower
variance is desirable, and in this way, averaging improves decision trees.
Boosting also involves averaging over multiple trees, but unlike bagging,
the trees are not independent. The trees are grown sequentially, and the tree at
a given stage leverages information from the previous stage. Specifically, the
trees are grown using training data that has been weighted based on the earlier
tree. In a classification tree, for instance, the boosting process puts more
weight on observations that were misclassified by the tree at the earlier stage.
Thus, each tree improves the prediction accuracy by concentrating on pre-
viously misclassified observations.
5. Random Forest
As the name suggests, a random forest is an aggregation of decision trees. A
random forest makes an improvement over bagging by a simple, yet very effective,
tweak. Random forests have shown consistently good performance, and there is no
need to tune many parameters except for the number of trees and the number of
predictors that are randomly chosen for splitting the trees. Moreover, since they use
bootstrap samples and bagging, they can leverage a major advantage of bagging –
that is, we can compute the out-of-bag error to measure prediction accuracy on test
data without having to divide the data into training and test sets (see Section 4.2).
This is certainly a very useful aspect of random forests when it is not so easy, or it is
expensive, to collect data. The computation of out-of-bag prediction with random
forests is similar to bagging. For a given training observation we can compute the
random forest predictor by averaging over only those trees for which this particular
observation is an out-of-bag observation – that is, averaging over trees that were
grown without this observation being in the bootstrap sample used for tree
growing.
Recall from the final paragraph of Section 4.2 on bagging that the different
trees grown in bagging may not be independent, even
though they are identically distributed. This is because all the trees in bagging
are likely to include particularly strong relationships between the predictors and
the response by using these predictors as splitting variables at the top of all the
trees. Recall that if x1,…, xN are N independent random variables (observa-
tions), each with a variance of v, then the variance of their average is v/N. Thus,
the variance can be made arbitrarily small by increasing N. However, consider
N identically distributed (but not independent) random variables (observations)
x1,…, xN, each with a variance of v. As expected, the variance of the average of
these observations depends on the correlation between pairs of variables. It can
be shown that the variance of the average cannot be made arbitrarily small, even
with very large N, if the correlation is not small. Thus, significant variance
reduction to improve the model can be achieved by pursuing the twin objectives
of aggregation (that is, by averaging) and also reducing the pair-wise
correlations.
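The variance claim in this paragraph can be written out explicitly. If x1,…, xN are identically distributed with variance v and common pairwise correlation ρ, a standard calculation gives

```latex
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)
  = \rho v + \frac{1-\rho}{N}\,v
```

As N grows the second term vanishes, but the first term ρv remains. Averaging alone therefore cannot drive the variance to zero unless the pairwise correlation ρ is also small, which is exactly the motivation for decorrelating the trees.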
Random forest is a technique to grow uncorrelated trees so that one is able to
achieve significant variance reduction by aggregation, beyond what simple
bagging can deliver. As in bagging, multiple trees are grown using bootstrapped
samples. For a given tree, each split considers only a randomly selected subset of
the p predictor variables. Note that in the tree growing algorithms discussed so
far, each binary partition (split) considered all the predictor variables. By using
a different randomly selected subset, each split would therefore be based on a
different set of predictor variables. When each tree in the random forest is
grown in this manner, the trees become decorrelated. Suppose the size of the
subset is k < p. The smaller the subset size k, the more decorrelated the trees
will be. Hence, when the analyst suspects that the data have a large number
of correlated predictor variables, fitting a random forest with a small k
would be helpful. Of course, when k = p in a random forest we are just doing
bagging. Similar to bagging, in a regression tree context, once the D trees Td(x)
(d = 1,…, D) are grown, the final prediction for a random forest is based on the
average:
T(x) = (1/D)[T1(x) + T2(x) + … + TD(x)]
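A sketch of this procedure on synthetic data: bagged trees where each split considers only k of the p predictors. Here the per-split subsetting is delegated to scikit-learn's `max_features` option on the tree grower:

```python
# Random forest for regression: bootstrap samples plus a random subset
# of k < p predictors considered at each split, then average the trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
N, D, p, k = 300, 50, 5, 2               # k < p predictors tried per split
X = rng.normal(size=(N, p))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=N)

forest = []
for d in range(D):
    idx = rng.integers(0, N, size=N)              # bootstrap sample V(d)
    tree = DecisionTreeRegressor(max_features=k)  # random subset per split
    forest.append(tree.fit(X[idx], y[idx]))

def rf_predict(Xq):
    # Final prediction T(x) = (1/D) * sum over d of T_d(x)
    return np.mean([t.predict(Xq) for t in forest], axis=0)

mse = np.mean((y - rf_predict(X)) ** 2)
print(round(mse, 3))
```

Setting k = p in this sketch recovers plain bagging, as noted above.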
CART decision-tree models provide the best estimation for mortgage loan
default.
Thrasher (1991) has used decision trees for segmentation, a fundamental
marketing task. Tirenni, Kaiser, and Herrmann (2007) have used decision trees
to formulate a segmentation methodology for customers using their lifetime
values. In this approach different segments have different lifetime values. Using
data from a major European airline the authors have used their methodology to
predict future segments of customers according to their demographic and
behavioral characteristics. Thomassey and Fiordaliso (2006) have used decision
trees for sales forecasting of products. The authors make the case that decision
trees are well suited to uncover simple rules for forecasting in categories like
textiles, which have numerous new items with short lifetimes. While many other
methods like regression, Box–Jenkins time-series models, neural networks, or fuzzy systems
have been used, quite successfully, in other contexts, the authors suggest that
these methods are inappropriate for categories like textiles where replacement of
items at the end of each season makes past data unavailable. While their data
comes from the apparel industry, the methodology is applicable to situations
like new products or new customers where past data is not readily available. In
addition, many of the other methods are not suitable for uncovering under-
standable relationships in the data. Many simpler, often parametric, models
may be useful for understanding relationships between predictors and the response
variable, but they lack the predictive ability of machine learning models when
dealing with complex, nonlinear data. Decision trees are a non-parametric
method that can handle complex data and yet are easily interpretable and
explainable. The authors propose a method where products are grouped into
clusters. The clusters, called “prototypes,” are formed based on products which
have similar historical sales profiles along with descriptive criteria. The authors
use the popular k-means clustering algorithm. Then new products, or products
which for any reason do not have historical sales profiles, are assigned to these
clusters based on their descriptive criteria. This step is accomplished using
decision trees. Essentially, they are clustered with similar products based on
their descriptive criteria. Then the sales profiles of the cluster are used as a proxy
for the sales forecast for the products which do not have historical data. Using
some standard rule-based forecasting methods as benchmarks, the authors find
that with a real data set from a French textile distributor their method (clus-
tering followed by decision trees) performs the best. Sheu, Su, and Chu (2009)
used decision trees to segment online gaming customers. Specifically, they
investigated the relationship between influential predictors and customer loy-
alty. The predictors of loyalty that the authors investigated fall under the rubric
of “experiential marketing.” These factors go beyond the functional “features
and benefits” view of marketing to the experiences consumers have when they
interact with the brand. In terms of the focal response variable, customer loy-
alty, the authors measure dimensions such as “repurchase desire,” “public praise
and recommendation desire,” and “cross-purchase desire.” Abrahams et al.
(2009) employ a novel variant of decision trees which incorporates a profit-
optimizing algorithm. Using their new method they provide actionable
recommendations.
7. Case Studies
In this section, we will present a couple of case studies about the application of
random forests in marketing. We describe the data sets and demonstrate the
analyses done on them.
Logistic regression settings:
Response Column: 1
Predictor columns: 2:86 (2 through 86)
Training Percentage: 80
Testing Percentage: 20
Number of folds for Cross-validation: 3

Random Forest settings:
Response Column: 1
Predictor columns: 2:86 (2 through 86)
Training Percentage: 80
Testing Percentage: 20
Number of folds for Cross-validation: 3
Number of Trees: 50, 100, 300, 500
Number of times Averaging: 3
We compute Random Forests with 50, 100, 300, 500 trees and then select the
best fitting model. This step can be accomplished by writing simple code in any
standard software program. From the AUC plots we find that the best-fitting model is the
one with 300 trees.
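This selection step might look as follows; the data here are a synthetic stand-in for the case-study data set, and scikit-learn is assumed:

```python
# Fit Random Forests with different numbers of trees and keep the
# model with the highest AUC on the held-out 20% test split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
N = 1000
X = rng.normal(size=(N, 10))
y = (X[:, 0] + X[:, 1] + rng.normal(size=N) > 0).astype(int)
X_tr, y_tr, X_te, y_te = X[:800], y[:800], X[800:], y[800:]  # 80/20 split

aucs = {}
for n_trees in (50, 100, 300, 500):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf.fit(X_tr, y_tr)
    aucs[n_trees] = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

best = max(aucs, key=aucs.get)
print(best, round(aucs[best], 3))
```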
The confusion matrix for the logistic regression is:
Actual + Actual -
Predicted + 2 7
Predicted - 78 1078
The percent correctly classified (PCC) for test data for the logistic regression is
92.7%. The AUC for the logistic regression model is approximately 0.725.
The confusion matrix for the best-fitting Random Forest, calculated on the test
data, is:

              Actual +   Actual -
Predicted +          1          9
Predicted -         59       1096
The percent correctly classified (PCC) for test data for the Random Forest is
94.2%. The AUC for the best-performing Random Forest model is approximately
0.7. With this data set, the performance of the Random Forest is not
significantly better than that of the benchmark logistic regression.
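The PCC figures quoted above follow directly from the confusion matrices; a quick check in Python, with the counts copied from the text:

```python
# Recompute the percent correctly classified (PCC) from the
# confusion matrices reported for this case study.
def pcc(tp, fp, fn, tn):
    """Correct predictions (the diagonal) divided by all test observations."""
    return (tp + tn) / (tp + fp + fn + tn)

# Logistic regression: Predicted+/Actual+ = 2, Predicted+/Actual- = 7,
# Predicted-/Actual+ = 78, Predicted-/Actual- = 1078
print(round(100 * pcc(2, 7, 78, 1078), 1))  # 92.7

# Random Forest: 1, 9, 59, 1096
print(round(100 * pcc(1, 9, 59, 1096), 1))  # 94.2
```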
Settings for the benchmark logistic regression:
Response Column: 12
Predictor columns: 1:11 (1 through 11)
Training Percentage: 80
Testing Percentage: 20
Number of folds for Cross-validation: 3
Settings for the Random Forest:
Response Column: 12
Predictor columns: 1:11 (1 through 11)
Training Percentage: 80
Testing Percentage: 20
Number of folds for Cross-validation: 3
Number of Trees: 50, 100, 300, 500
Number of times Averaging: 3
We compute Random Forests with 50, 100, 300, and 500 trees and then select the
best-fitting model. The confusion matrix for the logistic regression,
calculated on the test data, is:

              Actual +   Actual -
Predicted +        127         47
Predicted -         47         99
The percent correctly classified (PCC) for test data for the logistic regression is
70.6%. The AUC for the logistic regression model is approximately 0.834.
The confusion matrix for the best-fitting Random Forest, calculated on the test
data, is:

              Actual +   Actual -
Predicted +        128         38
Predicted -         33        121
The percent correctly classified (PCC) for test data for the Random Forest is
77.8%. The AUC for the best-performing Random Forest model is 0.875. We see
that the predictive performance of the Random Forest is significantly better than
the benchmark logistic regression model.
TECHNICAL APPENDIX
Technical detour 1:
Our training data consists of N observations (xi, yi), i = 1,…, N, with
observation xi belonging to a p-dimensional space. Thus, the ith observation xi
is a p-dimensional vector xi = (xi1,…, xip). Consider a partition of the
training data into M regions Rm, m = 1,…, M. For a regression tree the usual
cost that we minimize is the sum of squares over the training data, which we
denoted by SS in Section 2.1. Mathematically,
$$SS = \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2$$
As mentioned in the text, for a regression tree the prediction for all points in a
given region is just a constant. Further, for a square error cost (loss) the optimal
prediction f(xi) for a region happens to be the average over all responses yi cor-
responding to the observations xi in that region.
We will provide a simple illustration for why the predicted constant for a
region is the average response in that region. Since the prediction f(xi) is constant
cm in each region Rm we can write
$$f(x_i) = \begin{cases} c_1 & \text{if } x_i \text{ is in region } R_1 \\ c_2 & \text{if } x_i \text{ is in region } R_2 \\ \;\vdots & \\ c_M & \text{if } x_i \text{ is in region } R_M \end{cases}$$
Suppose there are 3 points (xi, yi), i = 1, 2, 3, partitioned into two regions
Ri, i = 1, 2, such that (x1, y1) ∈ R2, (x2, y2) ∈ R1, (x3, y3) ∈ R2. Since
observations 1 and 3 are in region 2 and observation 2 is in region 1, the sum
of squares is

$$(y_1 - c_2)^2 + (y_2 - c_1)^2 + (y_3 - c_2)^2$$

To find the optimal c2 (for example), take the first-order condition with
respect to c2 and solve. This gives

$$c_2 = \frac{y_1 + y_3}{2} = \bar{y}_{(2)}$$

This simple illustration should be enough for the reader to form an intuition
that the same will hold with N observations partitioned into M regions. We now
provide the details of the greedy algorithm for recursive binary partitioning.
We use the uppercase Xj to denote the jth predictor variable and the lowercase
xij for the jth component of the ith observation xi = (xi1,…, xip).
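Before turning to the algorithm, the claim that the optimal constant for a region is the average response can be checked numerically. A small sketch of our own, with two hypothetical responses standing in for the points in R2:

```python
# Grid-search the constant c2 that minimizes the squared-error cost
# for the two points in region R2; the minimizer is their average.
import numpy as np

y1, y3 = 2.0, 6.0  # hypothetical responses for the two points in R2

def ss(c):
    # squared-error cost contributed by region R2 for a candidate constant c
    return (y1 - c) ** 2 + (y3 - c) ** 2

grid = np.linspace(0.0, 10.0, 1001)
c_best = grid[np.argmin([ss(c) for c in grid])]
print(c_best)  # the average (y1 + y3) / 2 = 4
```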
Step 1: For a given predictor variable Xj (j = 1,…, p) and split point “s”,
perform a binary partition of the feature space into two regions

$$R_1(j, s) = \{ x \mid X_j \le s \} \qquad R_2(j, s) = \{ x \mid X_j > s \}$$

and solve

$$\min_{j,\,s} \Big\{ \sum_{x_i \in R_1} \big( y_i - \bar{y}_{(1)} \big)^2 + \sum_{x_i \in R_2} \big( y_i - \bar{y}_{(2)} \big)^2 \Big\}$$

The first summation inside the curly braces ranges over all observations xi in
region R1. Similarly, the second summation ranges over all observations xi in
region R2. The minimization over both the predictor variable and the split point
gives us the best predictor, say Xk, and the best split point sk*.
While the program above requires us to solve for both the predictor variable
and the split point, in practical implementations, this is usually done in two
stages:
(a) In the first stage, for a fixed Xj, we assume a split point “s”. For that Xj
and s we form the binary partition. Then the following program is solved for the
optimal split point:

$$\min_{s} \Big\{ \sum_{x_i \in R_1} \big( y_i - \bar{y}_{(1)} \big)^2 + \sum_{x_i \in R_2} \big( y_i - \bar{y}_{(2)} \big)^2 \Big\}$$

This gives us the optimal split point given the splitting variable Xj.
(b) Vary the predictors Xj over all j = 1,…, p, and repeat stage (a).
Step 2: Form the two regions based on the splitting variable Xk and split point
sk*. These are:

$$R_1 = \{ x \mid X_k \le s_k^* \} \qquad R_2 = \{ x \mid X_k > s_k^* \}$$
Repeat step 1 for region R1 above. This gives three regions. Split one of these
three regions in a similar manner and continue until a stopping rule ends the
process.
This process grows a large tree that should be pruned to control overfitting.
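As a hedged sketch (our own code, not the book's), the exhaustive search of Step 1 over predictors and split points can be written directly:

```python
# Minimal sketch of the greedy split search: scan every predictor X_j and
# split point s, and return the pair minimizing the two-region sum of squares.
import numpy as np

def best_split(X, y):
    """X: (N, p) array of predictors, y: (N,) array of responses."""
    best = (None, None, np.inf)  # (j, s, cost)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # each region predicts its own mean response
            cost = ((left - left.mean()) ** 2).sum() + \
                   ((right - right.mean()) ** 2).sum()
            if cost < best[2]:
                best = (j, s, cost)
    return best

# Toy data: the response depends only on whether the first predictor exceeds 5
X = np.array([[1.0, 9.0], [2.0, 1.0], [8.0, 4.0], [9.0, 2.0]])
y = np.array([0.0, 0.0, 10.0, 10.0])
j, s, cost = best_split(X, y)
print(j, s, cost)  # splits on predictor 0 at s = 2.0 with zero cost
```

Applied recursively to the two resulting regions, this is exactly the greedy procedure described above.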
Technical detour 2:
As mentioned in Section 2.1.2, the cost complexity pruning method essentially
puts a penalty on the size of the tree. The minimizing criterion for growing the
tree is not just the sum of squares (SS) but the SS plus a penalty on the
number of terminal nodes. Suppose that using the greedy algorithm we have grown
a large tree T′ which we wish to prune. Suppose that T is a sub-tree obtained by
pruning the tree T′. That is, T has been obtained by collapsing a number of the
nodes of T′. Suppose further that the sub-tree T has M terminal nodes indexed by
m = 1,…, M. As usual, node m defines a region Rm. In the regression tree
context, we know that the prediction corresponding to all observations xi in
region m is the same constant ȳ(m) (which is just the average of the responses
yi corresponding to the xi in region m). With these predictions, the sum of
squares over the training data set becomes
$$SS = SS_1 + SS_2 + \cdots + SS_M = \sum_{x_i \in R_1} \big( y_i - \bar{y}_{(1)} \big)^2 + \sum_{x_i \in R_2} \big( y_i - \bar{y}_{(2)} \big)^2 + \cdots + \sum_{x_i \in R_M} \big( y_i - \bar{y}_{(M)} \big)^2$$
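In scikit-learn the penalty weight on the number of terminal nodes is exposed as the `ccp_alpha` parameter; a brief sketch of pruning a large greedily grown regression tree (synthetic data and parameter choices are ours):

```python
# Cost-complexity pruning sketch: grow a full tree, then refit with a
# positive ccp_alpha so that nodes of the large tree are collapsed.
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

full = DecisionTreeRegressor(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)  # candidate penalty weights

# Refit with a mid-range alpha: the pruned sub-tree has fewer terminal nodes
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
print(full.get_n_leaves(), pruned.get_n_leaves())
```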
Technical detour 3:
We will provide more details of the algorithm for boosting regression trees. Let
us denote the boosted (aggregated) tree and the D component trees as functions
of the vector of predictor variables x = (x1, …, xp):

$$T(x) = \sum_{d=1}^{D} \lambda T^d(x)$$

Recall that in our notation, the ith training observation is xi = (xi1, …, xip),
i = 1,…, N.
Step 1: Initialize the beginning residuals to be the actual response: ri = yi
for all i = 1,…, N. Initialize T(x) = 0.
Step 2: For stages d = 1,…, D, perform the following steps:
(a) Fit regression tree Td(x) to data (xi, rid), i = 1,…, N. The tree can have s
splits. The predictor variables are xi and the responses at stage d are the
residuals rid from the current model $\sum_{j=1}^{d-1} \lambda T^j(x)$. Thus,
the ith residual is

$$r_i^d = y_i - \sum_{j=1}^{d-1} \lambda T^j(x_i), \qquad i = 1, \ldots, N$$

In fitting tree Td(x) the s splits will partition the feature space into
distinct regions. As we discussed in Section 2.1 on regression trees, the
prediction in each region will be the average of the residuals corresponding to
the xi in that region.
(b) Update the weighted combination of trees by adding λTd(x) to the current
model. The model now becomes:

$$\sum_{j=1}^{d-1} \lambda T^j(x) + \lambda T^d(x) = \sum_{j=1}^{d} \lambda T^j(x)$$
(c) Update the residuals. The residuals now become rid+1 = rid − λTd(xi),
i = 1,…, N. This is because, after updating the weighted combination of trees as
in step 2b above, the new residual corresponding to the ith observation is

$$r_i^{d+1} = y_i - \sum_{j=1}^{d} \lambda T^j(x_i) = y_i - \sum_{j=1}^{d-1} \lambda T^j(x_i) - \lambda T^d(x_i) = r_i^d - \lambda T^d(x_i)$$

The tree in the next stage Td+1(x) is fit to data (xi, rid+1), i = 1,…, N. The
predictor variables are xi and the responses at stage d+1 are the residuals
rid+1 from the current model $\sum_{j=1}^{d} \lambda T^j(x)$. Continue till all
D trees are grown.

Step 3: Output the aggregate tree

$$T(x) = \sum_{d=1}^{D} \lambda T^d(x)$$

The final prediction is based on this.
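Steps 1–3 translate almost line for line into code. A sketch using shallow scikit-learn trees as the base learners; D, λ and the data are our own illustrative choices:

```python
# Boosting regression trees on residuals, following Steps 1-3 above,
# then verifying the residual identity r_i = y_i - T(x_i).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

D, lam = 100, 0.1   # number of stages D and shrinkage lambda
r = y.copy()        # Step 1: residuals start at the actual response
trees = []
for d in range(D):  # Step 2
    t = DecisionTreeRegressor(max_depth=2).fit(X, r)  # (a) fit to residuals
    trees.append(t)
    r = r - lam * t.predict(X)                        # (c) update residuals

# Step 3: aggregate prediction T(x) = sum over d of lambda * T_d(x)
T = sum(lam * t.predict(X) for t in trees)
print(np.allclose(y - T, r))  # the residual identity holds
```

Because each stage fits only the part of the response the current model has not yet explained, the training error shrinks gradually at a rate controlled by λ.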
References
Abdul-Kader, S., & Woods, J. (2015). Survey of chatbot design techniques in speech
conversation systems. International Journal of Advanced Computer Science and
Applications, 6(7), 72–80.
Abrahams, A., Becker, A., Sabido, D., D’Souza, R., George, M., & Kransnodebski,
M. (2009). Inducing a marketing strategy for a new pet insurance company using
decision trees. Expert Systems with Applications, 36, 1914–1923.
Agarwal, D., & Schorling, C. (1996). Market share forecasting: An empirical
comparison of artificial neural networks and multinomial logit model. Journal of
Retailing, 72(4), 383–407.
Aggarwal, C., & Zhai, C. X. (2012). A survey of text classification algorithms. In
C. Aggarwal & C. X. Zhai (Eds.), Mining text data (pp. 163–222). Berlin: Springer.
Anderson, N. (1970). Functional measurement and psychophysical judgment.
Psychological Review, 77, 153–170.
Anderson, N. (1971). Integration theory and attitude change. Psychological Review,
78, 177–206.
Andrews, R., Ansari, A., & Imran, C. (2002). Hierarchical Bayes versus finite mixture
conjoint analysis models: A comparison of fit, prediction, and partworth recovery.
Journal of Marketing Research, 39, 87–98.
Apampa, O. (2016). Evaluation of classification and ensemble algorithms for bank
customer marketing response prediction. Journal of International Technology and
Information Management, 24(4), 85–100.
Bagozzi, R. (1994). Advanced methods of marketing research. Cambridge, MA: Basil
Blackwell Ltd.
Bahnsen, A. C., Aouada, D., & Ottersten, B. (2015). Example-dependent cost-
sensitive decision trees. Expert Systems with Applications, 42, 6609–6619.
Baines, P., Worcester, R., David, J., & Mortimore, R. (2003). Market segmentation
and product differentiation in political campaigns. Journal of Marketing
Management, 19(1–2), 225–249.
Bajari, P., Nekipelov, D., Ryan, S. P., & Yang, M. (2015). Machine learning methods
for demand estimation. The American Economic Review, 105(5), 481–485.
Balakrishnan, P., Cooper, M., Jacob, V., & Lewis, P. (1996). Comparative performance
of the FSCL neural net and K-means algorithm for market segmentation. European
Journal of Operational Research, 93(10), 346–357.
Bejju, A. (2016). Sales analysis of E-commerce websites using data mining techniques.
International Journal of Computer Applications, 133(5), 36–40.
Ben-Hur, A., Horn, D., Siegelmann, H., & Vapnik, V. (2001). Support vector
clustering. Journal of Machine Learning Research, 2, 125–137.
Bensic, M., Sarlija, N., & Zekic-Susac, M. (2005). Modeling small-business credit
scoring by using logistic regression, neural networks and decision trees. Intelligent
Systems in Accounting, Finance and Management, 13, 133–150.
Currim, I., Meyer, R., & Le, N. (1988). Disaggregate tree- structured modeling of
consumer choice data. Journal of Marketing Research, 25(August), 253–265.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function.
Mathematics of Control, Signals and Systems, 2, 303–314.
De Caigny, A., Coussement, K., & De Bock, K. W. (2018). A new hybrid classification
algorithm for customer churn prediction based on logistic regression and decision
trees. European Journal of Operational Research, 269, 760–772.
Delen, D., Kuzey, C., & Uyar, A. (2013). Measuring firm performance using financial
ratios: A decision tree approach. Expert Systems with Applications, 40, 3970–3983.
Dimopoulos, Y., Paul, B., & Sovan, L. (1995). Use of some sensitivity criteria for
choosing networks with good generalization ability. Neural Processing Letters,
2(6), 1–4.
Dooley, B. (2017). Why AI with augmented and virtual reality will be the next big thing.
Retrieved from https://upside.tdwi.org/articles/2017/04a/04/ai-with-augmented-and-
virtual-reality-next-big-thing.aspx
Duchessi, P., & Lauria, E. (2013). Decision tree models for profiling ski resorts’
promotional and advertising strategies and the impact on sales. Expert Systems
with Applications, 40, 5822–5829.
Evgeniou, T., Boussios, C., & Zacharia, G. (2005). Generalized robust conjoint
estimation. Marketing Science, 24(3), 415–429.
Fish, K., Barnes, J., & Aiken, M. (1995). Artificial neural networks: A new
methodology for industrial market segmentation. Industrial Marketing
Management, 24(5), 431–438.
Frew, J. F., & Wilson, B. (2002). Estimating the connection between location and
property value. Journal of Real Estate Practice and Education, 5(1), 17–25.
Friedman, J. (2002). Stochastic gradient boosting. Computational Statistics and Data
Analysis, 38(4), 367–378.
Galindo, J., & Tamayo, P. (2000). Credit risk assessment using statistical and machine
learning: Basic methodology and risk modelling applications. Computational
Economics, 15, 107–143.
Gao, J., Galley, M., & Li, L. (2019). Neural approaches to conversational AI.
Foundations and Trends® in Information Retrieval, 13(2–3), 127–298. doi:
10.1561/1500000074
Garg, A., & Tai, K. (2014). An ensemble approach of machine learning in evaluation
of mechanical property of the rapid prototyping fabricated prototype. Applied
Mechanics and Materials, 575, 493–496.
Garson, G. D. (1991). Interpreting neural network connection weights. Artificial
Intelligence Expert, 6(4), 46–51.
Gordini, N., & Veglio, V. (2017). Customers churn prediction and marketing retention
strategies. An application of support vector machines based on the AUC
parameter-selection technique in B2B e-commerce industry. Industrial Marketing
Management, 62, 100–107.
Govidarajan, M. (2013). A hybrid framework using RBF and SVM for direct marketing.
International Journal of Advanced Computer Science and Applications, 4(4), 121–126.
Grover, R., & Dillon, W. R. (1985). A probabilistic model for testing hypothesized
hierarchical market structures. Marketing Science, 4(Fall), 312–335.
Gruca, T., Klemz, B., & Petersen, E. (1999). Mining sales data using a neural network
model of market response. ACM SIGKDD, 1(1), 39–43.
Guido, G., Prete, M. I., Miraglia, S., & De Mare, I. (2011). Targeting direct
marketing campaigns by neural networks. Journal of Marketing Management,
27(9–10), 992–1006.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). Elements of statistical learning.
Springer series in statistics. New York, NY: Springer.
Haughton, D., & Oulabi, S. (1997). Direct marketing modeling with CART and
CHAID. Journal of Direct Marketing, 11(4), 42–52.
Hill, T., O’Connor, M., & Remus, W. (1996). Neural network models for time series
forecasts. Management Science, 42(7), 1082–1092.
Hruschka, H., & Natter, M. (1999). Comparing performance of feedforward neural
nets and K-means for market segmentation. European Journal of Operational
Research, 114, 346–353.
Hruschka, H. (1993). Determining market response functions by neural network
modeling: A comparison of econometric techniques. European Journal of
Operational Research, 66(1), 346–353.
Huang, Z., Chen, H., Hsu, C.-J., Chen, W.-H., & Wu, S. (2004). Credit rating analysis
with support vector machines and neural networks: A market comparative study.
Decision Support Systems, 37, 543–558.
Huang, J.-J., Tzeng, G.-H., & Ong, C.-S. (2007). Marketing segmentation using
support vector clustering. Expert Systems with Applications, 32, 313–317.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical
learning: With applications in R (1st ed., Springer Texts in Statistics). New York,
NY: Springer.
Joachims, T. (1998). Text categorization with support vector machines: Learning with
many relevant features. Proc. of ECML, 98, 137–142. Springer-Verlag.
Johnston, M., & Marshall, G. (2013). Sales force management (11th ed.). Abingdon:
Routledge.
Kass, G. (1980). An exploratory technique for investigating large quantities of
categorical data. Journal of the Royal Statistical Society, Series C (Applied
Statistics), 29(2), 119–127.
Kim, J. W., Lee, B. H., Shaw, M. J., Chang, H.-Lu, & Nelson, M. (2001). Application
of decision-tree induction techniques to personalized advertisements on internet
storefronts. International Journal of Electronic Commerce, 5(3), 45–62.
Kim, Y., Street, W. N., Russell, G. J., & Menczer, F. (2005). Customer targeting: A neural
network approach guided by genetic algorithm. Management Science, 51, 264–276.
Kim, D., Lee, H.-J., & Cho, S. (2008). Response modeling with support vector
regression. Expert Systems with Applications, 34, 1102–1108.
Knott, A., Hayes, A., & Scott, N. (2002). Marketplace: Next-product-to-buy models
for cross-selling applications. Journal of Interactive Marketing, 16(Summer), 3.
Krotov, D., & Hopfield, J. J. (2019, April 16). Unsupervised learning by competing
hidden units. Proceedings of the National Academy of Sciences, 116(16), 7723–7731.
Krycha, K. A. (1999). Market segmentation and profiling using artificial neural
networks. In W. Gaul & H. Locarek-Junge (Eds.), Classification in the
information age. Studies in classification, data analysis, and knowledge
organization. Berlin, Heidelberg: Springer.
Kumar, D. A., & Ravi, V. (2008). Predicting credit card customer churn in banks
using data mining. International Journal of Data Analysis Techniques and
Strategies, 1(1), 4–28.
Kumar, A., Rao, V., & Soni, H. (1995). An empirical comparison of neural networks
and logistic regression. Marketing Letters, 6(4), 251–263.
Landt, F. W. (1997). Stock price predictions using neural networks. Leiden: Leiden
University.
Lawrence, S., Tsoi, A., & Gilles, C. (1996). Noisy time series prediction using symbolic
representation and recurrent neural network grammatical inference. University of
Maryland, College Park, MD: University of Maryland, Institute for Advanced
Computer Sciences.
Lek, S., Beland, A., Dimopoulos, I., Lauga, J., & Moreau, J. (1995). Improved
estimation using neural networks, of the food consumption of fish populations.
Marine and Freshwater Research, 46, 1229–1236.
Lek, S., Delacoste, M., Baran, P., Dimopoulos, I., Lauga, J., & Aulagnier, S. (1996).
Application of neural networks to modelling nonlinear relationships in ecology.
Ecological Modelling, 90(1), 39–52.
Lemmens, A., & Croux, C. (2006). Bagging and boosting of classification trees to
predict churn. Journal of Marketing Research, 42, 276–286.
Li, N., & Wu, D. D. (2010). Using text mining and sentiment analysis for online forums
hotspot detection and forecast. Decision Support Systems, 48, 354–368.
Lin, W., Wu, Z., Lin, L., Wen, A., & Lin, L. (2017). An ensemble random forest
algorithm for insurance big data analysis. IEEE Access, 5, 16568–16575.
Linder, R., Geier, J., & Kolliker, M. (2004). Artificial neural networks, classification
trees and regression: Which methods for which customer base? Database Marketing
and Customer Strategy Management, 11(4), 344–356.
Magidson, J. (1988). New statistical techniques in direct marketing: Progression
beyond regression. Journal of Direct Marketing, 2(4), 6–18.
Magidson, J. (1989). CHAID, logit, and log-linear modeling. Marketing Research
Systems, Report 11-130, 101–114.
Mahajan, V., & Jain, A. (1978). An approach to normative segmentation. Journal of
Marketing Research, 15, 338–345.
Makridakis, S., Anderson, A., Carbone, R., Fildes, R., Hibon, M., Lewandowski, R.,
… Winkler, R. (1982). The accuracy of extrapolation (time series) methods: Results
of a forecasting competition. Journal of Forecasting, 1, 111–153.
Martı́nez-Ruiz, M. P., Molla-Descals, A., Gomez-Borja, M., & Rojo-Alvarez, J.
(2006). Assessing the impact of temporary retail price discounts intervals using
SVM semiparametric regression. International Review of Retail, Distribution and
Consumer Research, 16(2), 181–197.
Martı́nez-Ruiz, M. P., Gomez-Borja, M. A., Molla-Descals, A., & Rojo-Alvarez, J. L.
(2008). Using support vector semiparametric regression to estimate the effects of
pricing on brand substitution. International Journal of Marketing Research, 50(4),
555–557.
McCulloch, W., & Pitts, W. (1943). A logical calculus of the ideas immanent in
nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
Messenger, R., & Mandell, L. (1972). A modal search technique for predictable nominal
scale multivariate analysis. Journal of the American Statistical Association, 67(340),
768–772.
Mizuno, M., Saji, A., Sumita, U., & Suzuki, H. (2008). Optimal threshold analysis of
segmentation methods for identifying target customers. European Journal of
Operational Research, 186, 358–379.
Morgan, J., & Sonquist, J. (1963). Problems in the analysis of survey data, and a
proposal. Journal of the American Statistical Association, 58, 415–434.
Mukherjee, S., Osuna, E., & Girosi, F. (1997). Nonlinear prediction of chaotic time
series using support vector machines. Neural Networks for Signal Processing
VII: Proceedings of the 1997 IEEE Workshop. doi:10.1109/NNSP.1997.622433
Neslin, S., Gupta, S., Kamakura, W., Lu, J., & Mason, C. (2006, May). Defection
detection: Measuring and understanding the predictive accuracy of customer churn
models. Journal of Marketing Research, 43(2), 204–211.
Netzer, O., Feldman, R., Goldenberg, J., & Fresko, M. (2012). Mine your own
business: Market-structure surveillance through text mining. Marketing Science,
31(3), 521–543.
Olden, J. D., & Jackson, D. A. (2002). Illuminating the “black box”: A randomization
approach for understanding variable contributions in artificial neural networks.
Ecological Modelling, 154(1–2), 135–150.
Özesmi, S. L., & Özesmi, U. (1999). An artificial neural network approach to spatial
habitat modelling with interspecific interaction. Ecological Modelling, 116(1), 15–31.
Potharst, R., Kaymak, U., & Pijl, W. (2001). Neural networks for target selection in direct
marketing. ERIM Report Series Research in Management, ERS-2001-14-LIS.
Retrieved from https://www.igi-global.com/chapter/neural-networks-business/27261
Razmochaeva, N., & Klionsky, D. (2019). Data presentation and application of
machine learning methods for automating retail sales management processes.
IEEE Xplore. doi:10.1109/EIConRus.2019.8657077
Rokach, L., Naamani, L., & Shmilovici, A. (2008). Pessimistic cost-sensitive active
learning of decision tree for profit maximizing targeting campaigns. Data Mining
and Knowledge Discovery, 17, 283–316.
Sapankevych, N., & Sankar, R. (2009). Time series prediction using support vector
machines. IEEE Computational Intelligence Magazine, 4(2), 24–38.
Sheu, J.-J., Su, Y.-H., & Chu, Ko-T. (2009). Segmenting online game customers - the
perspective of experiential marketing. Expert Systems with Applications, 36,
8487–8495.
Shih, J.-Y., Chen, W.-H., & Chang, Y.-J. (2014). Developing target marketing models
for personal loans. IEEE. 978-1-4799-6410-9.
Shin, H.J., & Cho, S. (2006). Response modeling with support vector machines.
Expert Systems with Applications, 30, 746–760.
Sun, A., Lim, E.-P., & Liu, Y. (2009). On strategies for imbalanced text classification
using SVM: A comparative study. Decision Support Systems, 48, 191–201.
Sustrova, T. (2016). An artificial neural network model for a wholesale company’s
order-cycle management. International Journal of Engineering Business Management.
doi:10.5772/63727
Syam, N., & Sharma, A. (2018). Waiting for a sales renaissance in the fourth industrial
revolution: Machine learning and artificial intelligence in sales research and
practice. Industrial Marketing Management, 69, 135–146.
Thieme, J., Song, M., & Calantone, R. (2000). Artificial neural network decision
support systems for new product development project selection. Journal of
Marketing Research, 37(4), 499–507.
Thiesing, F. M., & Vornberger, O. (1997). Sales forecasting using neural networks.
Proceedings of International Conference on Neural Networks (ICNN’97), Houston,
TX, USA (Vol. 4; pp. 2125–2128). doi:10.1109/ICNN.1997.614234
Thomassey, S., & Fiordaliso, A. (2006). A hybrid sales forecasting system based on
clustering and decision trees. Decision Support Systems, 42, 408–421.
Thompson, W., Li, H., & Bolen, A. Artificial intelligence, machine learning, deep
learning and beyond: Understanding AI technologies and how they lead to smart
applications. Retrieved from https://www.sas.com/en_us/insights/articles/big-data/
artificial-intelligence-machine-learning-deep-learning-and-beyond.html
Thrasher, R. (1991). Cart: A recent advance in tree-structured list segmentation
methodology. Journal of Direct Marketing, 5(1), 35–47.
Tirenni, G., Kaiser, C., & Herrmann, A. (2007). Applying decision trees for value-
based customer relations management: Predicting airline customers’ future values.
Database Marketing & Customer Strategy Management, 14(2), 130–142.
Tirunillai, S., & Tellis, G. (2014). Mining marketing meaning from online chatter:
Strategic brand analysis of big data using Latent Dirichlet allocation. Journal of
Marketing Research, LI, 463–479.
Toubia, O., Hauser, J. R., & Simester, D. I. (2004). Polyhedral methods for adaptive
choice-based conjoint analysis. Journal of Marketing Research, 41, 116–131.
Urban, G. L., Johnson, P. L., & Hauser, J. R. (1984). Testing competitive market
structures. Marketing Science, 3(Spring), 83–112.
van der Putten, P., & van Someren, M. (2000). CoIL challenge 2000: The insurance
company case. Published by Sentient Machine Research, Amsterdam. Also, a
Leiden Institute of Advanced Computer Science Technical Report 2000-09. June
22, 2000. Retrieved from http://liacs.leidenuniv.nl/;puttenpwhvander/tic.html
Van Heerde, H. J., Leeflang, P. S. H., & Wittink, D. R. (2001). Semiparametric analysis
to estimate the deal effect curve. Journal of Marketing Research, 38(May), 197–215.
Vapnik, V. N. (1989). Statistical learning theory. New York: Wiley-Interscience. ISBN
978-0-471-03003-4.
Verbeke, W., Dejaeger, K., Martens, D., Hur, J., & Baesens, B. (2012). New insights
into churn prediction in the telecommunication sector: A profit driven data mining
approach. European Journal of Operational Research, 218, 211–229.
West, P., Brockett, P., & Golden, L. (1997). A comparative analysis of neural
networks and statistical methods for predicting consumer choice. Marketing
Science, 16(4), 370–391.
Xie, Y., Li, X., Ngai, E. W. T., & Yin, W. (2009). Customer churn prediction using
improved balanced random forests. Expert Systems with Applications, 36, 5445–5454.
Yao, J., Teng, N., Poh, T., & Tan, C. (1998). Forecasting and analysis of marketing
data using neural networks. Journal of Information Science and Engineering, 14(4),
843–862.
Zahavi, J., & Levin, N. (1997). Applying neural computing to target marketing.
Journal of Direct Marketing, 11(1), 5–22.
Zhang, Y., Dang, Y., Chen, H., Thurmond, M., & Larson, C. (2009). Automatic
online news monitoring and classification for syndromic surveillance. Decision
Support Systems, 47(4), 508–517.
Zhang, P. (2004). Neural networks in business forecasting. Hershey, PA: IRM Press.
Index