Predictive Modeling Using Segmentation
Nissan Levin and Jacob Zahavi
JOURNAL OF INTERACTIVE MARKETING
people who purchased general art books, followed by people who purchased geographical books, and so on. In cases where males and females may react differently to the product offering, gender may also be used to partition customers into groups. Generally, the list is first partitioned by product/attribute type, then by RFM, and then by gender (i.e., the segmentation process is hierarchical). This segmentation scheme is also known as FRAC—Frequency, Recency, Amount (of money), and Category (of product) (Kestnbaum, 1998).

Manually based RFM and FRAC methods are subject to judgmental and subjective considerations. Also, the basic assumption behind the RFM method may not always hold. For example, for durable products, such as cars or refrigerators, recency may work in a reverse way—the longer the time since the last purchase, the higher the likelihood of purchase. Finally, to meet segment size constraints, it may be necessary to run the RFM/FRAC iteratively, each time combining small segments and splitting up large segments, until a satisfactory solution is obtained. This may increase computation time significantly.

2.2 Decision Trees
Several "automatic" methods have been devised in the literature to take away the judgmental and subjective considerations inherent in the manual segmentation process. By and large, these methods map data items (customers, in our case) into one of several predefined classes. In the simplest case, the purpose is to segment customers into one of two classes, based on some type of binary response, such as buy/no buy, loyal/nonloyal, pay/no-pay, etc. Thus, tree classifiers are choice based. Without loss of generality we refer to the choice variable throughout this paper as purchase/no-purchase, thus classifying the customers into segments of "buyers" and "nonbuyers."

Several automatic tree classifiers have been discussed in the literature, among them AID (Automatic Interaction Detection; Sonquist, Baker, & Morgan, 1971); CHAID (Chi-square AID; Kass, 1983); CART (Classification and Regression Trees; Breiman, Friedman, Olshen, & Stone, 1984); ID3 (Quinlan, 1986); C4.5 (Quinlan, 1993); and others. For a comprehensive survey of the automatic construction of decision trees from data, see Murthy (1998).

Basically, all automatic tree classifiers share the same structure. Starting from a "root" node (the whole population), tree classifiers employ a systematic approach to grow a tree into "branches" and "leaves." In each stage, the algorithm looks for the "best" way to split a "father" node into several "children" nodes, based on some splitting criteria. Then, using a set of predefined termination rules, some nodes are declared "undetermined" and become the father nodes in the next stages of the tree development process, while others are declared "terminal" nodes. The process proceeds in this way until no node is left in the tree that is worth splitting any further. The terminal nodes define the resulting segments. If each node in a tree is split into two children only, one of which is a terminal node, the tree is said to be "hierarchical."

Three main considerations are involved in developing automatic trees:

1. Growing the tree
2. Determining the best split
3. Termination rules

Growing the Tree. One grows the tree by successively partitioning nodes based on the data. A node may be partitioned based on several variables at a time, or even on a function of variables (e.g., a linear combination of variables). With so many variables involved, there is practically an infinite number of ways to split a node. Take as an example just a single continuous variable. This variable alone can be partitioned in an infinite number of ways, let alone when several variables are involved.

In addition, each node may be partitioned into several descendants (or splits), each of which becomes a "father" node to be partitioned in the next stage of the tree development process. Thus, the larger the number of splits per node, the larger the tree and the more prohibitive the calculations.

Indeed, several methods have been applied
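The generic structure shared by these tree classifiers—greedily splitting each "father" node and stopping on termination rules—can be sketched in a few lines. The data layout, the response-rate-gap scorer, and the thresholds below are our own illustrative assumptions, not the implementation of any of the algorithms compared in this paper.

```python
# Minimal sketch of the generic tree-growing loop: greedily split each
# "father" node until the termination rules fire. Names and thresholds
# are illustrative only.

def best_binary_split(rows, predictors):
    """Greedy step: try every (predictor, value) binary split and keep the
    one that best separates buyers from nonbuyers (max RR difference)."""
    best = None
    for p in predictors:
        for value in {r[p] for r in rows}:
            left = [r for r in rows if r[p] <= value]
            right = [r for r in rows if r[p] > value]
            if not left or not right:
                continue
            rr = lambda g: sum(r["buy"] for r in g) / len(g)
            score = abs(rr(left) - rr(right))
            if best is None or score > best[0]:
                best = (score, p, value, left, right)
    return best

def grow(rows, predictors, min_size=2, depth=0, max_depth=3):
    """Declare a node terminal (a segment) or split it and recurse."""
    split = best_binary_split(rows, predictors)
    if split is None or len(rows) < 2 * min_size or depth >= max_depth:
        return {"segment": rows}          # terminal node = resulting segment
    _, p, value, left, right = split
    return {"split_on": (p, value),
            "left": grow(left, predictors, min_size, depth + 1, max_depth),
            "right": grow(right, predictors, min_size, depth + 1, max_depth)}
```

Because the splitter looks only at the current stage, this sketch is "greedy" in exactly the sense discussed below: it never looks ahead past the split being evaluated.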
in practice to reduce the number of possible partitions of a node:

I. All continuous variables are categorized, prior to the tree development, into a small number of ranges ("binning"). A similar procedure applies to integer variables which assume many values (such as the frequency of purchase).
II. Nodes are partitioned on only one variable at a time ("univariate" algorithm).
III. The number of splits per each "father" node is often restricted to two ("binary" trees).
IV. Splits are based on a "greedy" algorithm in which splitting decisions are made sequentially, looking only at the impact of the split in the current stage, but never beyond (i.e., there is no "looking ahead").

Determining the Best Split. With so many possible partitions per node, the question is what is the best split? There is no unique answer to this question, as one may use a variety of splitting criteria, each of which may result in a different "best" split. We can classify the splitting criteria into two "families": node-value based criteria and partition-value based criteria.

● Node-value based criteria: seeking the split that yields the best improvement in the node value.
● Partition-value based criteria: seeking the split that separates the node into groups that are as different from each other as possible.

We discuss the splitting criteria at more length in Appendix A.

Termination Rules. The larger the tree, the larger the risk of overfitting. Hence it is necessary to control the size of a tree by means of termination rules that determine when to stop growing the tree. These termination rules should be set to ensure statistical validity of the results and avoid overfitting.

2.3 Extended Tree Classifiers
We conclude this section by discussing the three tree classifiers that we use in our paper: a variation of AID, which we refer to as the Standard Tree Algorithm (STA); the commonly used CHAID, which was extended to consider an entropy criterion; and a new tree classifier based on a Genetic Algorithm (GA). Basically, AID, STA and CHAID are combinatorial algorithms in the sense that they go over all possible combinations of variables to partition a node. Consequently, these algorithms are computationally intensive and therefore, as discussed above, are limited to splits based on one variable, or at best two variables, at a time. In contrast, GA is a non-combinatorial algorithm in the sense that the candidate solutions (splits) are generated by a random, yet systematic, search method. This opens up the possibility of considering partitions based on more than two variables at a time, hopefully yielding more "homogeneous" segments. Practically, this renders the GA algorithm "non-greedy" by making it possible to look ahead beyond the current stage when splitting a node. However, because of computational constraints, we have limited the number of simultaneous variables to split a node to only three or four predictors at a time.

These algorithms are further described in Appendix B.
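The two families of splitting criteria can be made concrete with a small sketch. The entropy-style node value and the response-rate-gap partition value chosen below are illustrative stand-ins for the criteria detailed in Appendix A, and the example counts are made up.

```python
import math

# Illustrative scoring of one candidate binary split under the two
# families of criteria: node-value based and partition-value based.
# Function choices and counts are ours, not the paper's exact criteria.

def node_value(buyers, total):
    """Entropy-style node value: lower means a 'purer' node."""
    rr = buyers / total
    if rr in (0.0, 1.0):
        return 0.0
    return -(rr * math.log(rr) + (1 - rr) * math.log(1 - rr))

def node_value_criterion(parent, left, right):
    """Node-value family: improvement in size-weighted node value."""
    n = parent[1]
    weighted = ((left[1] / n) * node_value(*left)
                + (right[1] / n) * node_value(*right))
    return node_value(*parent) - weighted

def partition_value_criterion(left, right):
    """Partition-value family: how different the children are (RR gap)."""
    return abs(left[0] / left[1] - right[0] / right[1])

# (buyers, customers) counts for a father node and one candidate split:
parent, left, right = (50, 1000), (40, 400), (10, 600)
gain = node_value_criterion(parent, left, right)   # > 0: purer children
gap = partition_value_criterion(left, right)       # RR gap between children
```

Under the node-value criterion the best split maximizes the purity gain; under the partition-value criterion it maximizes the gap between the children's response rates. The two rankings need not agree, which is why a different "best" split can emerge from each family.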
ples—a calibration (training) sample, consisting of 60% of the observations, to build the model (tree), and a holdout sample, containing the rest of the customers, to validate the model.

As alluded to earlier, only binary predictors may take part in the partitioning process. Hence, all continuous and multivalued integer variables were categorized, prior to invoking the tree algorithm, into ranges, each represented by a binary variable assuming the value of 1 if the variable falls in the interval and 0 otherwise. This process is also referred to as "binning." Depending upon the tree, the resulting categories may be either overlapping (i.e., X ≤ a_i, i = 1, 2, . . . , where X is a predictor and a_i a given breakpoint) or non-overlapping (i.e., a_{i-1} ≤ X < a_i, i = 1, 2, . . . ).

The trees were evaluated using various goodness-of-fit criteria that express how well the segmentation process is capable of discriminating between buyers and nonbuyers. A common measure is in terms of the percentage of buyers "captured" per the percentage of the audience mailed; the higher the percentage of buyers, the "better" the model. For example, a segmentation scheme that captures 80% of the buyers for 50% of the audience is better than a segmentation scheme that captures only 70% of the buyers for the same audience.

Below we discuss several considerations in setting up the study.

Feature Selection
In DBM applications, the number of potential predictors could be very large, often on the order of several hundred predictors, or even more. Thus, the crux of the model-building process is to pick the handful of predictors that "best" explain, from a statistical point of view, the customers' choice decision. This problem, referred to as the feature selection (or specification) problem, is a tough combinatorial problem and is definitely the most complex issue in building large-scale multivariate models.

Decision trees possess an advantage over statistical regression models in that they have an inherent mechanism to pick the predictors affecting the terminal segments. They do it by going over all possible combinations to grow a tree (subject to the computational constraints discussed in the previous section) and selecting the best split for each node using one of many splitting criteria (discussed in Appendix A).

In contrast to decision trees, logistic regression models do not enjoy this benefit, and one needs to invoke a specification procedure to select the most influential predictors to include in the final model. Hence, using logistic regression models with multiple potential predictors is not easy and definitely not as straightforward as building decision trees. In our case, we use a rule-based expert system to weed out the "bad" from the "good" predictors. These rules were calibrated, based on statistical theory and practice, using an extensive experimentation process.

The Predictors Set
The mix of predictors is a major factor affecting the resulting tree structure; the larger the number of potential predictors to split a tree, the larger the number of segments and the smaller the size of each terminal segment. To determine the impact of the mix and the number of predictors on the performance of the tree classifiers, we have experimented with several sets of predictors. The basic set, the affinity set, contains the basic attributes of collectible items; the second set also includes recency predictors; the third set is augmented to include frequency measures. Finally, the last set contains all of the predictors available in the data set, several hundred in all. More details about the participating predictors are found in Appendix C.

Min/Max Segment Size
Size constraints are most crucial in segmentation analysis. To minimize the error incurred in case wrong decisions are made (e.g., because of sampling errors), segments are required to be "not-too-small" and "not-too-big." If a segment is too small, the probability of making an error increases due to the lack of statistical significance. If the segment is too big, then if a "good" segment is somehow eliminated from the mailing (Type I error), large foregone profits are
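The "binning" step described above—turning a continuous predictor into 0/1 indicators, with either overlapping or non-overlapping ranges—can be sketched as follows; the breakpoints and the example value are illustrative.

```python
# Sketch of binning a continuous predictor X into 0/1 indicator variables:
# overlapping ranges (X <= a_i) or non-overlapping ranges (a_{i-1} <= X < a_i).
# Breakpoints and the example value are made up for illustration.

def overlapping_bins(x, breakpoints):
    """One indicator per breakpoint a_i: 1 if X <= a_i, else 0."""
    return [1 if x <= a else 0 for a in breakpoints]

def non_overlapping_bins(x, breakpoints):
    """One indicator per interval [a_{i-1}, a_i); exactly one fires."""
    edges = [float("-inf")] + breakpoints + [float("inf")]
    return [1 if lo <= x < hi else 0 for lo, hi in zip(edges, edges[1:])]

money_spent = 75
cuts = [50, 100, 200]
print(overlapping_bins(money_spent, cuts))      # [0, 1, 1]
print(non_overlapping_bins(money_spent, cuts))  # [0, 1, 0, 0]
```

Note the difference the text points out: overlapping indicators are nested (every indicator above the first 1 also fires), whereas non-overlapping indicators partition the range, so exactly one of them is 1 for any value.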
incurred; and if a "bad" segment makes it to the mailing (Type II error), large out-of-pocket costs are incurred.

Consequently, we have built a mechanism into all our automatic tree algorithms to account for minimum and maximum constraints on the resulting segment size. In our study, we used two sets of min/max constraints on the segment size, 150/3,000 and 300/6,000, respectively.

Splitting Criteria
Finally, as discussed in Appendix A, one may define a variety of splitting criteria which belong to the node-value and partition-value families. We have used four different criteria in our study, all of which seek to maximize a statistical measure Y, as follows.

Criterion 1 (CHAID). Y is the statistic for the chi-square test of independence:

   Y = Σ_splits (Observed − Expected)² / Expected

Clearly, the larger the value of Y, the larger the difference between the response rates of the resulting child nodes, and the "better" the split. This statistic is distributed as chi-square with (k − 1) degrees of freedom, where k is the number of splits for the node. Then, if the resulting P_value is less than or equal to the level of significance, we conclude that Y is big "enough" and that the resulting split is "good." Since CHAID uses a sequence of tests (each possible split constitutes a test), an adjusted P_value measure, using the Bonferroni multiplier, is often used to determine the "best" split.

Criterion 2 (CHAID). Y is the total entropy associated with a given partition (into M splits). It is a measure of the information content of the split; the larger the entropy, the better.

Criterion 3 (STA). Y is the number of standard deviations that the response rate (RR) of the smaller child node (the one with the fewer customers) is away from the overall response rate of the training audience (TRR). Large values of Y (e.g., Y ≥ 2) mean that the true (but unknown) response rate of the resulting segment is significantly different from the TRR, indicating a "good" split.

Criterion 4 (STA, GA). Y is the larger response rate of the two children nodes (in a binary split). This criterion seeks the split that maximizes the difference in the response rates of the two descendant nodes.

All these criteria are further discussed in Appendix A.

4. RESULTS AND ANALYSIS
The combination of several tree classifiers, predictor sets, and splitting criteria gives rise to numerous trees. To simplify the presentation, we provide only selective results in this paper. We evaluate all trees based on goodness-of-fit criteria. As a reference point, we compare the automatic segmentation and the manually based segmentation to logistic regression.

To allow for a "fair" comparison between the models, each model was optimized to yield the best results: the manually based segmentation by using domain experts to determine the FRAC segmentation criteria; the automatic decision trees by finding the best tree for the given splitting criterion and constraints; and the logistic regression by finding the "best" set of predictors that explain customer choice.

Goodness-of-Fit Results
In this section we analyze the performance of each method on its own, using goodness-of-fit measures that exhibit how well a model is capable of discriminating between buyers and nonbuyers. Goodness-of-fit results are typically presented by means of gains tables in descending order of the predicted response. In the segmentation-based methods, the predicted response rate of a rollout segment is given by the corresponding response rate of the segment in the calibration (training) sample. In logistic regression, the response rates are estimated directly by the model. Several measures can be used to assess the performance of a response model:
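Two of the splitting statistics above lend themselves to a short sketch: the chi-square Y of Criterion 1 and the "number of sigmas" Y of Criterion 3. The counts below are illustrative; in the study these statistics are computed for every candidate split of a node.

```python
import math

# Illustrative computation of the chi-square statistic (Criterion 1) and
# the sigma statistic (Criterion 3). Example counts are made up.

def chi_square_y(split_counts):
    """Criterion 1: split_counts is a list of (buyers, nonbuyers) per child.
    Sums (Observed - Expected)^2 / Expected over the children's cells."""
    b = sum(c[0] for c in split_counts)
    n = sum(c[1] for c in split_counts)
    t = b + n
    y = 0.0
    for bi, ni in split_counts:
        ti = bi + ni
        for observed, expected in ((bi, ti * b / t), (ni, ti * n / t)):
            y += (observed - expected) ** 2 / expected
    return y

def sigmas_y(smaller_child, trr):
    """Criterion 3: deviations of the smaller child's RR from TRR."""
    buyers, nonbuyers = smaller_child
    n = buyers + nonbuyers
    rr = buyers / n
    return (rr - trr) / math.sqrt(rr * (1 - rr) / n)

y_chi = chi_square_y([(40, 360), (10, 590)])   # two-way split
y_sig = sigmas_y((40, 360), trr=0.05)          # RR = 0.10 vs TRR = 0.05
```

With these counts the smaller child's response rate of 10% lies more than 3 "sigmas" above a 5% training response rate, so both criteria would flag the split as "good."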
● The behavior of the actual response rates (the ratio of the number of buyers "captured" to the size of the audience mailed) across segments, which in a "good" model should exhibit a nicely declining pattern.
● The difference between the actual response rates of the top and the bottom segments; the larger the difference, the better the fit.

All the results below are presented for the holdout sample. Table 1 presents the gains table for RFM-based segmentation; Table 2, for FRAC-based segmentation. Out of the many tree classifiers that we analyzed, we present one gains table for STA, with a 1-variable split, splitting criterion 4, a predictor set that contains the affinity and recency attributes, and min/max constraints on the resulting segment size of 150 and 3,000, respectively (Table 3). Finally, Table 4 exhibits the logistic regression results by deciles using the entire set of predictors.

Observing Table 1 (RFM segmentation), other than the top two segments, the response rates across segments in the list are pretty flat and somewhat fluctuating; both are indicative of a relatively poor fit.

By comparison, in the FRAC segmentation (Table 2), the top segments perform significantly better than the bottom segments, with the top segment yielding a response rate of 12.99% versus an average response rate of only .69% for the entire holdout audience.

Interestingly, the automatic segmentation methods (Table 3) are not lagging behind in terms of discriminating between the buying and the nonbuying segments, with the top segment outperforming the bottom segments by a wide margin.

Finally, the logistic regression model (Table 4) exhibits excellent performance, with the top decile responding more than six times better than the bottom decile.

Tree Performance
To evaluate and compare the automatic segmentation to the judgmentally based segmentation and the logistic regression model, we look at the percentage of buyers captured at several representative audiences. The reference point consists of the top 30% of the customers in the list of segments, arranged in descending order of the response rate of the segments in the calibration sample. Note that in a tree analysis, the response probability of a customer is determined by the response rate of his/her peers, that is, the response rate of the segment that the customer belongs to. Since segments' sizes are discrete, we use interpolation to exhibit the performance results at exactly 30% of the audience. Of course, no interpolation is required for logistic regression, since here the response probability is calculated individually for each customer in the list. As additional reference points, we also present the performance results for the top 10% and 50% of the audience.

Table 5 presents the performance results at these audience levels for several tree classifiers, as well as the results of RFM, FRAC, and logistic regression. To bring all methods to an equal footing for comparison purposes, all methods were calibrated on the same data set using the entire set of predictors. Comparing the results, we conclude:

● The logistic regression model outperforms all other models—the judgmentally based, as well as the automatic models.
● The RFM-based models are the worst.
● The automatic tree classifiers perform extremely well, being comparable to the FRAC-based model and getting close to the logistic regression model.
● Increasing the minimum size constraint usually increases the variance of the fit results across all segments of a tree. This is a manifestation of the fact that larger segments are less "homogeneous" and thus exhibit larger variation. Indeed, smaller segments are more stable, but increase the risk of overfitting. So one needs to trade off variance against segment size to find the most suitable one for the occasion. Unfortunately, there is no optimal way to find the "best" segment size other than by experimentation.

We note that adding more predictors to the
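The "percentage of buyers captured at a given audience level" measure, with the interpolation described above, can be sketched as follows; the segment data are made up, and the function names are ours.

```python
# Sketch of the buyers-captured measure: segments are ranked by their
# calibration response rate, and linear interpolation inside the segment
# that straddles the cutoff gives the value at exactly, e.g., 30% of the
# audience. Segment data below are illustrative.

def buyers_captured_at(segments, audience_pct):
    """segments: list of (calibration_rr, audience_size, buyers).
    Returns the fraction of all buyers captured in the top audience_pct."""
    total_aud = sum(s[1] for s in segments)
    total_buy = sum(s[2] for s in segments)
    target = audience_pct * total_aud
    cum_aud = cum_buy = 0.0
    for _, aud, buy in sorted(segments, key=lambda s: -s[0]):
        if cum_aud + aud >= target:
            frac = (target - cum_aud) / aud   # interpolate inside segment
            return (cum_buy + frac * buy) / total_buy
        cum_aud += aud
        cum_buy += buy
    return 1.0

segs = [(0.13, 1000, 130), (0.05, 2000, 100), (0.01, 7000, 70)]
print(round(buyers_captured_at(segs, 0.30), 3))   # → 0.767
```

Here the top 30% of a 10,000-customer audience (3,000 customers) happens to coincide with the boundary of the second segment, capturing 230 of 300 buyers; in general the cutoff falls inside a segment and the interpolation step prorates that segment's buyers.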
TABLE 1
Gains Table for RFM-Based Segmentation
Segments with at Least 100 Customers in the Holdout Sample, Sorted by Descending Response Rates of the Segments in the Calibration Sample.

SEG  CLB RR%  HLD RR%  CUM CLB RR%  CUM HLD RR%  CUM CLB BUY%  CUM CLB AUD%  CUM HLD BUY%  CUM HLD AUD%  %CLB AUD/%HLD AUD
144 2.38 2.62 2.38 2.62 29.22 7.47 27.88 7.68 0.98
244 1.90 2.38 2.27 2.47 36.07 9.66 34.55 9.61 1.14
322 1.24 0.58 2.21 2.34 37.44 10.33 35.15 10.32 0.94
434 1.10 0.86 2.11 2.21 39.27 11.34 36.36 11.29 1.04
423 0.96 0.00 1.98 1.99 41.55 12.75 36.36 12.60 1.05
223 0.90 1.02 1.91 1.93 42.92 13.67 37.58 13.42 1.12
143 0.90 0.94 1.83 1.84 44.75 14.91 39.39 14.75 0.93
333 0.87 1.43 1.72 1.79 47.49 16.83 43.64 16.78 0.94
433 0.86 1.02 1.63 1.71 50.23 18.76 46.67 18.82 0.95
134 0.84 0.62 1.58 1.63 52.05 20.09 47.88 20.16 0.99
344 0.78 1.23 1.42 1.55 58.45 25.08 56.97 25.23 0.98
334 0.75 0.56 1.36 1.47 61.19 27.29 58.79 27.47 0.99
131 0.72 0.00 1.35 1.44 62.10 28.06 58.79 28.13 1.16
133 0.71 0.88 1.27 1.36 66.67 31.97 64.24 32.41 0.91
123 0.67 1.02 1.24 1.35 68.49 33.63 66.67 34.04 1.01
233 0.62 1.82 1.22 1.36 69.41 34.52 69.09 34.96 0.97
122 0.58 0.38 1.17 1.27 72.60 37.88 70.91 38.29 1.01
412 0.58 0.67 1.14 1.25 74.43 39.81 72.73 40.17 1.03
323 0.57 0.33 1.11 1.19 76.71 42.23 73.94 42.69 0.96
132 0.46 0.37 1.07 1.15 78.54 44.65 75.15 44.95 1.07
213 0.45 0.00 1.06 1.14 79.00 45.27 75.15 45.54 1.03
234 0.44 0.00 1.05 1.12 79.45 45.90 75.15 46.15 1.05
312 0.39 1.02 1.01 1.11 81.28 48.76 79.39 49.00 1.00
212 0.38 0.24 0.95 1.03 84.47 53.90 81.21 54.28 0.97
211 0.34 0.27 0.86 0.92 89.50 62.98 84.85 63.44 0.99
411 0.28 0.37 0.81 0.87 92.69 69.81 88.48 70.19 1.01
112 0.28 0.40 0.79 0.85 93.61 71.81 89.70 72.29 0.95
444 0.27 0.42 0.76 0.83 95.43 75.93 92.12 76.26 1.04
311 0.26 0.27 0.74 0.81 96.80 79.11 93.33 79.31 1.04
432 0.26 0.85 0.74 0.81 97.26 80.18 94.55 80.29 1.09
413 0.23 0.00 0.72 0.79 98.17 82.63 94.55 82.54 1.09
111 0.20 0.46 0.71 0.78 99.09 85.40 96.36 85.28 1.01
121 0.17 0.00 0.70 0.76 99.54 87.02 96.36 86.79 1.08
313 0.07 0.21 0.67 0.74 100.00 90.94 97.58 90.73 0.99
All 0.61 0.69 100.00 100.00 100.00 100.00 0.98
TABLE 2
Gains Table for FRAC-Based Segmentation
Segments with at Least 100 Customers in the Holdout Sample, Sorted by Descending Response Rates of the Segments in the Calibration Sample.

SEG  CLB RR%  HLD RR%  CUM CLB RR%  CUM HLD RR%  CUM CLB BUY%  CUM CLB AUD%  CUM HLD BUY%  CUM HLD AUD%  %CLB AUD/%HLD AUD
process does not necessarily improve the model fit. It turns out that in our case the affinity considerations possess the most prediction power, with the additional recency and frequency variables adding very little to improving the fit. This phenomenon may be typical in the collectible industry, where a customer either likes a given range of products (e.g., dolls) or not, and may not extend to other industries.

Finally, it would be interesting to compare the various tree classifiers to one another to find out which one performs the best. But this requires extensive experimentation with multiple tree classifiers and many data sets, which was beyond the scope of this research.

TABLE 3
Gains Table for STA, 1-Variable Split, Affinity and Recency Predictors, Splitting Criterion 4, and Min/Max Constraint 150/3,000
Segments with at Least 25 Customers in the Holdout Sample, Sorted by Descending Response Rates of the Segments in the Calibration Sample.

SEG  CLB RR%  HLD RR%  CUM CLB RR%  CUM HLD RR%  CUM CLB BUY%  CUM CLB AUD%  CUM HLD BUY%  CUM HLD AUD%  %CLB AUD/%HLD AUD
TABLE 4
Gains Table for Logistic Regression by Deciles Based on the Entire Set of Predictors

% Prospects  Actual % Response  Actual Resp Rate %  % Response/% Prospects  Pred Resp Rate %
 10.00   62.42  4.29  6.24  4.26
 20.00   77.58  2.67  3.88  2.57
 30.00   83.03  1.90  2.77  1.83
 40.00   86.67  1.49  2.17  1.43
 50.00   89.70  1.23  1.79  1.18
 60.00   92.12  1.06  1.53  1.00
 70.00   96.36  0.95  1.38  0.87
 80.00   98.18  0.84  1.23  0.78
 90.00  100.00  0.76  1.11  0.70
100.00  100.00  0.69  1.00  0.63

6. CONCLUSIONS
In this paper we evaluated the performance of automatic tree classifiers versus the judgmentally based RFM and FRAC methods and logistic regression. The methods were evaluated based on goodness-of-fit measures, using real data from the collectible industry. Three automatic tree classifiers participated in the analysis—a modified AID tree that we term STA, an extension of the CHAID method, and a newly developed GA tree.

The evaluation process, which involves several predictor sets, several splitting criteria, and several constraints on the minimum and maximum size of the terminal segments, shows that
● Max Q(RR) = Q(0) = Q(1)
● Min Q(RR) = Q(1/2)
● Q(RR) is a concave function of RR (i.e., the second derivative is positive).

a. The piecewise linear function (& Mitchell, 1993)

   Q(RR) = 1 − RR   if RR ≤ 1/2
   Q(RR) = RR       if RR > 1/2                          (A.1)

b. The quadratic function

   Var(Y) = Σ_i (Y_i − Ȳ)² / (B + N) = Σ_i Y_i² / (B + N) − Ȳ²

But since Y_i is a binary variable (1 buy, 0 no-buy),

   Σ_i Y_i = Σ_i Y_i² = B,   Ȳ = Σ_i Y_i / (B + N) = RR

so the variance reduces to RR(1 − RR).

c. The entropy function

   Q(RR) = −[RR log(RR) + (1 − RR) log(1 − RR)]          (A.3)

…descendant left node (denoted by the index 1) and the right node (denoted by the index 2), respectively. In the following we always assume N1 < N2 (i.e., the left node is the smaller one).
FIGURE A.1. Node value functions.

RR1, RR2 – the response rates of the left and the right nodes, respectively.
Q(RR1), Q(RR2) – the corresponding node value functions.

Thus, the improvement in the node value resulting from the split is given as the difference:

   (N1/N) Q(RR1) + (N2/N) Q(RR2) − Q(RR)                 (A.5)

And we seek the split that yields the maximal improvement in the node value. But since, for a given father node, Q(RR) is the same for all splits, Eq. (A.5) is equivalent to maximizing the node value (A.4).

Clearly, in DBM applications, where the number of buyers is largely outnumbered by the number of nonbuyers, the reference point of 1/2 may not be appropriate. A more suitable reference point to define the node value is TRR, where TRR is the overall response rate of the training audience. Another alternative is the cutoff response rate, CRR, calculated based on economic criteria. The resulting node value functions in this case satisfy all the conditions above, except that they are not symmetrical.

Now, depending upon the value function Q(RR), this yields several heuristic criteria for determining the best split. For example, for the piecewise linear function and a hierarchical tree, a possible criterion is choosing the split that maximizes the response rate of the smaller child node, Max {RR1, 1 − RR1}. In a binary tree, one may choose the split which yields the largest difference in the response rates of the two descendant nodes. In the quadratic case, a reasonable function is (RR − TRR)². Or one can use the entropy function (A.3).

Finally, we note that with a concave node value function, basically any split results in a positive value improvement, however small. Thus, when using the node-value based criteria for determining the best split, it is necessary to
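The node value functions of the appendix and the improvement measure of Eq. (A.5) can be sketched directly; the function forms follow (A.1) and (A.3), while the reference-point parameter and example values are our own.

```python
import math

# Sketch of the node value functions discussed in this appendix and the
# split improvement of Eq. (A.5). Example numbers are illustrative.

def q_piecewise(rr):
    """Piecewise linear node value, Eq. (A.1)."""
    return 1 - rr if rr <= 0.5 else rr

def q_quadratic(rr, ref=0.5):
    """Quadratic node value with a configurable reference point
    (1/2 by default; TRR or CRR as discussed in the text)."""
    return (rr - ref) ** 2

def q_entropy(rr):
    """Entropy function, Eq. (A.3); note it is largest at RR = 1/2."""
    if rr in (0.0, 1.0):
        return 0.0
    return -(rr * math.log(rr) + (1 - rr) * math.log(1 - rr))

def improvement(q, rr, rr1, n1, rr2, n2):
    """(N1/N) Q(RR1) + (N2/N) Q(RR2) - Q(RR), Eq. (A.5)."""
    n = n1 + n2
    return (n1 / n) * q(rr1) + (n2 / n) * q(rr2) - q(rr)

# A split of a 50% - RR node into 20% and 80% children:
gain = improvement(q_piecewise, 0.5, 0.2, 500, 0.8, 500)   # 0.3
```

Because the father node's Q(RR) is the same for every candidate split, ranking splits by this improvement is equivalent to ranking them by the weighted value of the children alone, as the text observes about (A.4) and (A.5).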
TABLE A.1
Components of Y

Split           Buyers            Nonbuyers   Total
1     Observed  B1                N1          T1 = B1 + N1
      Expected  T1 · B/T          T1 · N/T
2     Observed  B2                N2          T2 = B2 + N2
      Expected  T2 · B/T          T2 · N/T
3     Observed  B3                N3          T3 = B3 + N3
      Expected  T3 · B/T          T3 · N/T
Total           B = B1 + B2 + B3  N = N1 + N2 + N3  T = T1 + T2 + T3
impose a threshold level on the minimum segment size and/or the minimum improvement in the node value; otherwise the algorithm will keep partitioning the tree until each node contains exactly one customer.

Partition-Value-Based Criteria
Instead of evaluating nodes, one can evaluate partitions. A partition is considered a "good" one if the resulting response rates of the children nodes are significantly different from one another. This can be expressed in terms of a test of hypothesis. For example, in a two-way split case:

   H0: p1 = p2
   H1: p1 ≠ p2

where p1 and p2 are the true (but unknown) response rates of the left child node and the right child node, respectively.

A common way to test the hypothesis is by calculating the P_value, defined as the probability to reject the null hypothesis, for the given sample statistics, if it is true. Then, if the resulting P_value is less than or equal to a predetermined level of significance (often 5%), the hypothesis is rejected; otherwise, the hypothesis is accepted.

(a) The normal test
The hypothesis testing procedure draws on the probability laws underlying the process. In the case of a two-way split, as above, the hypothesis can be tested using the normal distribution. One can find the Z value corresponding to the P_value, denoted Z_{1−α/2}, using:

   Z_{1−α/2} = Abs(RR1 − RR2) / √[RR1(1 − RR1)/(B1 + N1) + RR2(1 − RR2)/(B2 + N2)]   (A.6)

where:
RR1, RR2 – the response rates of the left child node (denoted by the index 1) and the right child node (denoted by the index 2), respectively
B1, B2 – the corresponding numbers of buyers
N1, N2 – the corresponding numbers of nonbuyers

and then extract the P_value from a normal distribution table.

(b) The chi-square test
In the case of a multiple split, the test of hypothesis is conducted by means of the chi-square test of independence. The statistic for conducting this test, denoted by Y, is given by:

   Y = Σ_splits (Observed − Expected)² / Expected        (A.7)

Table A.1 exhibits the calculation of the components of Y for a three-way split, extending the notation above to the case of three child nodes.
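The normal test of Eq. (A.6) is easy to sketch; the counts below are illustrative, and the closed-form P_value (via the error function, instead of a normal distribution table) is a convenience we add.

```python
import math

# Sketch of the normal test of Eq. (A.6) for a two-way split.
# Example counts are made up; the erf-based P_value replaces the
# "normal distribution table" lookup mentioned in the text.

def z_statistic(b1, n1, b2, n2):
    """Z for H0: p1 = p2, from the children's buyer/nonbuyer counts."""
    rr1, rr2 = b1 / (b1 + n1), b2 / (b2 + n2)
    se = math.sqrt(rr1 * (1 - rr1) / (b1 + n1)
                   + rr2 * (1 - rr2) / (b2 + n2))
    return abs(rr1 - rr2) / se

def two_sided_p_value(z):
    """P_value from the standard normal distribution."""
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

z = z_statistic(40, 360, 10, 590)   # children with RR 10% and 1.7%
p = two_sided_p_value(z)            # reject H0 at the 5% level if p <= 0.05
```

For these counts the response-rate gap is many standard errors wide, so the P_value is far below 5% and the partition would be judged "good."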
This table can easily be extended to more than three splits per node. Y is distributed according to the chi-square distribution with (k − 1) degrees of freedom, where k is the number of splits for the node. One can then extract the P_value for the resulting value of Y from the chi-square distribution. The best split is the one with the smallest P_value.

(c) The smallest child test
Finally, this criterion is based on testing the hypothesis:

   H0: p ≤ TRR
   H1: p > TRR

where p here stands for the true (but unknown) response rate of the smaller child node, and TRR is the observed response rate for the training audience.

To test this hypothesis we define a statistic Y denoting the number of standard deviations ("sigmas") that the smaller segment is away from TRR, i.e.:

   Y = (RR − TRR) / √[RR(1 − RR)/N]

where RR is the observed response rate of the smaller child node, and N the number of observations.

Large values of Y mean that p is significantly different from TRR, indicating a "good" split. For example, one may reject the null hypothesis, concluding that the split is a good one, if Y is larger than 2 "sigmas."

APPENDIX B: TREE CLASSIFIERS
We discuss below the three tree classifiers that were involved in our study—STA, CHAID, and GA.

STA—Standard Tree Algorithm
STA is an AID-like algorithm. The basic AID algorithm is a univariate binary tree. In each iteration, each undetermined node is partitioned, based on one variable at a time, into two descendant nodes. The objective is to partition the audience into two groups that exhibit substantially less variation than the father node. AID uses the sum of squared deviations of the response variable from the mean as the measure of the node value, which, in the binary yes/no case, reduces to the minimum variance criterion, (RR − 0.5)², where RR is the response rate (the ratio of the number of responders to the total number of customers) for the node (see also Appendix A).

In each stage, the algorithm searches over all remaining predictors, net of all predictors that have already been used in previous stages to split father nodes, to find the partition that yields the maximal reduction in the variance.

In this work we have expanded the AID algorithm in two directions:

● Splitting a node based on two predictors at a time, to allow the interaction terms to also affect the tree structure.
● Using different reference points in the evaluation criterion that are more appropriate for splitting populations with marked differences between responders and nonresponders. Possible candidates are the overall response rate of the training audience, or even the cutoff response rate separating between targets and nontargets.

We therefore refer to our algorithm as STA (Standard Tree Algorithm) to distinguish it from the conventional AID algorithm.

CHAID
CHAID (Chi-Square AID) is the most common of all tree classifiers. Unlike AID, CHAID is not a binary tree, as it may partition a node into more than two branches. CHAID categorizes all independent continuous and multivalued integer variables by "similarity" measures, and considers the resulting categories for a variable as a whole unit (group) for splitting purposes. Take for example the variable MONEY (money spent), which is categorized into four ranges, each represented by a dummy 0/1 variable which assumes the value of 1 if the variable value falls
in the corresponding range, and 0 otherwise. Denote the resulting four categorical variables as A, B, C and D, respectively. Since MONEY is an ordinal variable, there are three possibilities to split this variable into two adjacent categories: (A, BCD), (AB, CD), (ABC, D); three possibilities to split the variable into three adjacent categories: (A, B, CD), (AB, C, D), (A, BC, D); and one way to split the variable into four adjacent categories: (A, B, C, D). Now, CHAID considers each of these partitions as a possible split, and seeks the best combination to split the node from among all possible combinations. As a result, a node in CHAID may be partitioned into more than two splits, as many as four splits in this particular example. The best split is based on a chi-square test, which is what gave this method its name.

Clearly, there are many ways to partition a variable with K values into M categories (children nodes). To avoid choosing a combination that randomly yields a “good” split, some versions of CHAID use an adjusted P_value criterion to compare candidate splits.

Let L denote the number of possible ways to combine a variable with K values into M categories.

Let α denote the Type-I error (also known as the level of significance) in the chi-square test for independence. α is the probability of rejecting the null hypothesis that there is no significant difference in the response rates of the resulting child nodes, when the null hypothesis is true.

Now, the probability of accepting the null hypothesis in one combination is (1 − α), and in L successive combinations (assuming the tests of hypotheses are independent) it is (1 − α)^L. Hence the probability of making a Type-I error in at least one combination is 1 − (1 − α)^L, which is greater than α.

To yield a “fair” comparison of the various combinations, α is replaced by the resulting P_value. In most cases, the P_value is very small, and we can therefore use the approximation

1 − (1 − P_value)^L ≈ L · P_value

The resulting quantity, L · P_value, is the adjusted P_value, and L is referred to as the Bonferroni multiplier. Each combination yields a different adjusted P_value. The “best” combination to partition the node by is the one that yields the smallest adjusted P_value.

The number of possibilities for combining a variable with K values into M categories depends on the type of the variable involved. In our algorithm, we distinguish among three cases:

● Ordinal variables, where one may combine only adjacent categories (as in the case of the variable MONEY above).
● Ordinal variables with a missing value, which may be combined with any of the other categories.
● Nominal variables, where one may combine any two (or more) values, including the missing value (e.g., the variable MARITAL, with four nominal values: M - married, S - single, W - widow, D - divorced).

Table B.1 exhibits the number of combinations for several representative values of K and M.

Genetic Algorithm (GA)

All tree algorithms described above are combinatorial, in the sense that in each stage they go over all possible combinations to partition a node. This number gets excessively large even for one-variable splits and becomes computationally prohibitive with multivariable splits. Consequently, all tree algorithms are univariate (AID, CHAID) or at best bivariate (STA). Yet, it is conceivable that splits based on several variables at a time (more than two) may be more “homogeneous” and therefore better from the standpoint of segmentation. Thus, by confining oneself to using only univariate or even bivariate tree algorithms, one may miss out on the better splits that could have been obtained otherwise with multivariate algorithms.

To resolve this issue, we developed a Genetic Algorithm (GA) tree for segmentation, which, unlike the other trees, is a non-combinatorial algorithm in the sense that it employs a random, yet systematic, search approach to grow a tree.
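To make the smallest-child test of Appendix A concrete, here is a minimal Python sketch. The function name, the returned tuple, and the 2-sigma default threshold are illustrative assumptions; the statistic Y itself is computed exactly as defined above, Y = (RR − TRR) / √(RR(1 − RR)/N).

```python
import math

def smallest_child_test(rr, trr, n, threshold=2.0):
    """Smallest-child test sketch: the number of standard deviations
    ("sigmas") separating the smaller child's observed response rate RR
    from the training response rate TRR, given n observations.

    Function name and packaging are illustrative, not from the paper.
    """
    y = (rr - trr) / math.sqrt(rr * (1.0 - rr) / n)
    # Declare the split "good" when the smaller child is more than
    # `threshold` sigmas away from TRR.
    return y, abs(y) > threshold

# Example: a smaller child with a 6% response rate over 2,000 customers,
# against a 4% training response rate.
y, good = smallest_child_test(rr=0.06, trr=0.04, n=2000)
# y is about 3.77 sigmas, so this split would be declared good.
```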
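For the ordinal case, the combination counts illustrated by the MONEY example follow a standard identity: splitting K ordered values into M adjacent, non-empty categories amounts to choosing M − 1 cut points among the K − 1 gaps, i.e., C(K − 1, M − 1). A short sketch of this count and of the Bonferroni-adjusted P_value; the function names are hypothetical, and only the ordinal case is covered (the missing-value and nominal cases described above yield larger counts).

```python
from math import comb

def bonferroni_multiplier_ordinal(k, m):
    """Number of ways to split k ordered categories into m adjacent,
    non-empty groups: choose m-1 cut points among k-1 gaps."""
    return comb(k - 1, m - 1)

def adjusted_p_value(p_value, L):
    """Bonferroni-adjusted P_value: 1 - (1 - p)^L is approximately
    L * p when p is small."""
    return L * p_value

# MONEY example from the text, K = 4 ordered ranges (A, B, C, D):
L2 = bonferroni_multiplier_ordinal(4, 2)  # 3: (A,BCD), (AB,CD), (ABC,D)
L3 = bonferroni_multiplier_ordinal(4, 3)  # 3: (A,B,CD), (AB,C,D), (A,BC,D)
L4 = bonferroni_multiplier_ordinal(4, 4)  # 1: (A,B,C,D)

# A raw chi-square P_value of 0.004 for a two-way split is penalized
# by the L = 3 competing two-way combinations:
adj = adjusted_p_value(0.004, L2)  # 0.012
```

With K = 4, the counts 3, 3, and 1 match the partitions enumerated in the text.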
Since gender is a very important attribute in the collectible industry, it was also included in the affinity set.

The second set includes the predictors in the affinity set plus recency predictors representing the number of months since the last purchase, broken down into mutually exclusive and exhaustive time segments (0–6 months, 7–12 months, 13–24 months, etc.).

The third set includes the affinity and recency predictors, plus frequency measures representing the total number of products purchased by a customer in the past from all media channels, broken down by major product categories.

Finally, the last set contains all predictors in the database, including purchases from direct-mail sources by product categories, money spent on past purchases by product categories, time since joining the list, as well as a variety of other indicators.

Several procedures were used in the study to collapse the number of predictors for modeling purposes to a more manageable size.

REFERENCES

Ben-Akiva, M., & Lerman, S.R. (1987). Discrete Choice Analysis. Cambridge, MA: The MIT Press.

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth.

Bult, J.R., & Wansbeek, T. (1995). Optimal Selection for Direct Mail. Marketing Science, 14, 378–394.

Davis, L. (Ed.). (1991). Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold.

Haughton, D., & Oulabi, S. (1997). Direct Marketing Modeling with CART and CHAID. Journal of Direct Marketing, 11, 42–52.

Holland, J.H. (1975). Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press.

Kass, G. (1983). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29.

Kestnbaum, R.D. (1998). Kestnbaum & Company, Chicago, Private Communication.

Levin, N., & Zahavi, J. (1996). Segmentation Analysis with Managerial Judgment. Journal of Direct Marketing, 10, 28–47.

Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Michalski, R.S., Carbonell, J.G., & Mitchell, T.M. (1983). Machine Learning—An Artificial Intelligence Approach. Palo Alto, CA: Tioga Publishing Company.

Morwitz, G.V., & Schmittlein, D. (1992). Using Segmentation to Improve Sales Forecasts Based on Purchase Intent: Which “Intenders” Actually Buy? Journal of Marketing Research, 29, 391–405.

Morwitz, G.V., & Schmittlein, D. (1998). Testing New Direct Marketing Offerings: The Interplay of Management Judgment and Statistical Models. Management Science, 44, 610–628.

Murthy, K.S. (1998). Automatic Construction of Decision Trees from Data: A Multi-disciplinary Survey. Data Mining and Knowledge Discovery, 2, 345–389.

Novak, P.T., de Leeuw, J., & MacEvoy, B. (1992). Richness Curves for Evaluating Market Segmentation. Journal of Marketing Research, 29, 254–267.

Quinlan, J.R. (1986). Induction of Decision Trees. Machine Learning, 1, 81–106.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.

Shepard, D. (Ed.). (1995). The New Direct Marketing. New York: Irwin Professional Publishing.

Sonquist, J., Baker, E., & Morgan, J.N. (1971). Searching for Structure. Ann Arbor: University of Michigan, Survey Research Center.

Weinstein, A. (1994). Market Segmentation. New York: Irwin Professional Publishing.