Predictive Modeling Using Segmentation
Nissan Levin and Jacob Zahavi
JOURNAL OF INTERACTIVE MARKETING
people who purchased general art books, followed by people who purchased geographical books, and so on. In cases where males and females may react differently to the product offering, gender may also be used to partition customers into groups. Generally, the list is first partitioned by product/attribute type, then by RFM, and then by gender (i.e., the segmentation process is hierarchical). This segmentation scheme is also known as FRAC—Frequency, Recency, Amount (of money), and Category (of product) (Kestnbaum, 1998).

Manually based RFM and FRAC methods are subject to judgmental and subjective considerations. Also, the basic assumption behind the RFM method may not always hold. For example, for durable products, such as cars or refrigerators, recency may work in a reverse way—the longer the time since the last purchase, the higher the likelihood of purchase. Finally, to meet segment size constraints, it may be necessary to run the RFM/FRAC iteratively, each time combining small segments and splitting up large segments, until a satisfactory solution is obtained. This may increase computation time significantly.

2.2 Decision Trees
Several "automatic" methods have been devised in the literature to take away the judgmental and subjective considerations inherent in the manual segmentation process. By and large, these methods map data items (customers, in our case) into one of several predefined classes. In the simplest case, the purpose is to segment customers into one of two classes, based on some type of binary response, such as buy/no buy, loyal/nonloyal, pay/no-pay, etc. Thus, tree classifiers are choice based. Without loss of generality we refer to the choice variable throughout this paper as purchase/no-purchase, thus classifying the customers into segments of "buyers" and "nonbuyers."

Several automatic tree classifiers have been discussed in the literature, among them AID (Automatic Interaction Detection; Sonquist, Baker, & Morgan, 1971); CHAID (Chi-square AID; Kass, 1983); CART (Classification and Regression Trees; Breiman, Friedman, Olshen, & Stone, 1984); ID3 (Quinlan, 1986); C4.5 (Quinlan, 1993); and others. For a comprehensive survey of the automatic construction of decision trees from data, see Murthy (1998).

Basically, all automatic tree classifiers share the same structure. Starting from a "root" node (the whole population), tree classifiers employ a systematic approach to grow a tree into "branches" and "leaves." In each stage, the algorithm looks for the "best" way to split a "father" node into several "children" nodes, based on some splitting criteria. Then, using a set of predefined termination rules, some nodes are declared "undetermined" and become the father nodes in the next stages of the tree development process, while others are declared "terminal" nodes. The process proceeds in this way until no node is left in the tree that is worth splitting any further. The terminal nodes define the resulting segments. If each node in a tree is split into two children only, one of which is a terminal node, the tree is said to be "hierarchical."

Three main considerations are involved in developing automatic trees:

1. Growing the tree
2. Determining the best split
3. Termination rules

Growing the Tree. One grows the tree by successively partitioning nodes based on the data. A node may be partitioned based on several variables at a time, or even on a function of variables (e.g., a linear combination of variables). With so many variables involved, there is practically an infinite number of ways to split a node. Take as an example just a single continuous variable. This variable alone can be partitioned in an infinite number of ways, let alone when several variables are involved.

In addition, each node may be partitioned into several descendants (or splits), each of which becomes a "father" node to be partitioned in the next stage of the tree development process. Thus, the larger the number of splits per node, the larger the tree and the more prohibitive the calculations.

Indeed, several methods have been applied
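The generic structure shared by these tree classifiers—greedily splitting each "father" node and stopping on termination rules—can be sketched in a few lines. The data layout, the response-rate-gap scorer, and the thresholds below are our own illustrative assumptions, not the implementation of any of the algorithms compared in this paper.

```python
# Minimal sketch of the generic tree-growing loop: greedily split each
# "father" node until the termination rules fire. Names and thresholds
# are illustrative only.

def best_binary_split(rows, predictors):
    """Greedy step: try every (predictor, value) binary split and keep the
    one that best separates buyers from nonbuyers (max RR difference)."""
    best = None
    for p in predictors:
        for value in {r[p] for r in rows}:
            left = [r for r in rows if r[p] <= value]
            right = [r for r in rows if r[p] > value]
            if not left or not right:
                continue
            rr = lambda g: sum(r["buy"] for r in g) / len(g)
            score = abs(rr(left) - rr(right))
            if best is None or score > best[0]:
                best = (score, p, value, left, right)
    return best

def grow(rows, predictors, min_size=2, depth=0, max_depth=3):
    """Declare a node terminal (a segment) or split it and recurse."""
    split = best_binary_split(rows, predictors)
    if split is None or len(rows) < 2 * min_size or depth >= max_depth:
        return {"segment": rows}          # terminal node = resulting segment
    _, p, value, left, right = split
    return {"split_on": (p, value),
            "left": grow(left, predictors, min_size, depth + 1, max_depth),
            "right": grow(right, predictors, min_size, depth + 1, max_depth)}
```

Because the splitter looks only at the current stage, this sketch is "greedy" in exactly the sense discussed below: it never looks ahead past the split being evaluated.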
in practice to reduce the number of possible partitions of a node:

I. All continuous variables are categorized, prior to the tree development, into a small number of ranges ("binning"). A similar procedure applies to integer variables which assume many values (such as the frequency of purchase).
II. Nodes are partitioned on only one variable at a time ("univariate" algorithm).
III. The number of splits per each "father" node is often restricted to two ("binary" trees).
IV. Splits are based on a "greedy" algorithm in which splitting decisions are made sequentially, looking only at the impact of the split in the current stage, but never beyond (i.e., there is no "looking ahead").

Determining the Best Split. With so many possible partitions per node, the question is what is the best split? There is no unique answer to this question, as one may use a variety of splitting criteria, each of which may result in a different "best" split. We can classify the splitting criteria into two "families": node-value based criteria and partition-value based criteria.

● Node-value based criteria: seeking the split that yields the best improvement in the node value.
● Partition-value based criteria: seeking the split that separates the node into groups that are as different from each other as possible.

We discuss the splitting criteria at more length in Appendix A.

Termination Rules. The larger the tree, the larger the risk of overfitting. Hence it is necessary to control the size of a tree by means of termination rules that determine when to stop growing the tree. These termination rules should be set to ensure statistical validity of the results and avoid overfitting.

2.3 Extended Tree Classifiers
We conclude this section by discussing the three tree classifiers that we use in our paper: a variation of AID, which we refer to as the Standard Tree Algorithm (STA); the commonly used CHAID, which was extended to consider an entropy criterion; and a new tree classifier based on a Genetic Algorithm (GA). Basically, AID, STA and CHAID are combinatorial algorithms in the sense that they go over all possible combinations of variables to partition a node. Consequently, these algorithms are computationally intensive and therefore, as discussed above, are limited to splits based on one variable, or at best two variables, at a time. In contrast, GA is a non-combinatorial algorithm in the sense that the candidate solutions (splits) are generated by a random, yet systematic, search method. This opens up the possibility of considering partitions based on more than two variables at a time, hopefully yielding more "homogeneous" segments. Practically, this renders the GA algorithm "non-greedy" by making it possible to look ahead beyond the current stage when splitting a node. However, because of computational constraints, we have limited the number of simultaneous variables to split a node to only three or four predictors at a time.

These algorithms are further described in Appendix B.
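The two families of splitting criteria can be made concrete with a small sketch. The entropy-style node value and the response-rate-gap partition value chosen below are illustrative stand-ins for the criteria detailed in Appendix A, and the example counts are made up.

```python
import math

# Illustrative scoring of one candidate binary split under the two
# families of criteria: node-value based and partition-value based.
# Function choices and counts are ours, not the paper's exact criteria.

def node_value(buyers, total):
    """Entropy-style node value: lower means a 'purer' node."""
    rr = buyers / total
    if rr in (0.0, 1.0):
        return 0.0
    return -(rr * math.log(rr) + (1 - rr) * math.log(1 - rr))

def node_value_criterion(parent, left, right):
    """Node-value family: improvement in size-weighted node value."""
    n = parent[1]
    weighted = ((left[1] / n) * node_value(*left)
                + (right[1] / n) * node_value(*right))
    return node_value(*parent) - weighted

def partition_value_criterion(left, right):
    """Partition-value family: how different the children are (RR gap)."""
    return abs(left[0] / left[1] - right[0] / right[1])

# (buyers, customers) counts for a father node and one candidate split:
parent, left, right = (50, 1000), (40, 400), (10, 600)
gain = node_value_criterion(parent, left, right)   # > 0: purer children
gap = partition_value_criterion(left, right)       # RR gap between children
```

Under the node-value criterion the best split maximizes the purity gain; under the partition-value criterion it maximizes the gap between the children's response rates. The two rankings need not agree, which is why a different "best" split can emerge from each family.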
ples—a calibration (training) sample, consisting of 60% of the observations, to build the model (tree), and a holdout sample, containing the rest of the customers, to validate the model.

As alluded to earlier, only binary predictors may take part in the partitioning process. Hence, all continuous and multivalued integer variables were categorized, prior to invoking the tree algorithm, into ranges, each represented by a binary variable assuming the value of 1 if the variable falls in the interval and 0 otherwise. This process is also referred to as "binning." Depending upon the tree, the resulting categories may be either overlapping (i.e., X ≤ a_i, i = 1, 2, . . . , where X is a predictor and a_i a given breakpoint) or non-overlapping (i.e., a_{i-1} ≤ X < a_i, i = 1, 2, . . . ).

The trees were evaluated using various goodness-of-fit criteria that express how well the segmentation process is capable of discriminating between buyers and nonbuyers. A common measure is in terms of the percentage of buyers "captured" per the percentage of the audience mailed; the higher the percentage of buyers, the "better" the model. For example, a segmentation scheme that captures 80% of the buyers for 50% of the audience is better than a segmentation scheme that captures only 70% of the buyers for the same audience.

Below we discuss several considerations in setting up the study.

Feature Selection
In DBM applications, the number of potential predictors could be very large, often on the order of several hundred predictors, or even more. Thus, the crux of the model-building process is to pick the handful of predictors that "best" explain, from a statistical point of view, the customers' choice decision. This problem, referred to as the feature selection (or specification) problem, is a tough combinatorial problem and is definitely the most complex issue in building large-scale multivariate models.

Decision trees possess an advantage over statistical regression models in that they have an inherent mechanism to pick the predictors affecting the terminal segments. They do it by going over all possible combinations to grow a tree (subject to the computational constraints discussed in the previous section) and selecting the best split for each node using one of many splitting criteria (discussed in Appendix A).

In contrast to decision trees, logistic regression models do not enjoy this benefit, and one needs to invoke a specification procedure to select the most influential predictors to include in the final model. Hence, using logistic regression models with multiple potential predictors is not easy and definitely not as straightforward as building decision trees. In our case, we use a rule-based expert system to weed out the "bad" from the "good" predictors. These rules were calibrated, based on statistical theory and practice, using an extensive experimentation process.

The Predictors Set
The mix of predictors is a major factor affecting the resulting tree structure; the larger the number of potential predictors to split a tree, the larger the number of segments and the smaller the size of each terminal segment. To determine the impact of the mix and the number of predictors on the performance of the tree classifiers, we have experimented with several sets of predictors. The basic set, the affinity set, contains the basic attributes of collectible items; the second set also includes recency predictors; the third set is augmented to include frequency measures. Finally, the last set contains all of the predictors available in the data set, several hundred in all. More details about the participating predictors are found in Appendix C.

Min/Max Segment Size
Size constraints are most crucial in segmentation analysis. To minimize the error incurred in case wrong decisions are made (e.g., because of sampling errors), segments are required to be "not-too-small" and "not-too-big." If a segment is too small, the probability of making an error increases due to the lack of statistical significance. If the segment is too big, then if a "good" segment is somehow eliminated from the mailing (Type I error), large foregone profits are
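The "binning" step described above—turning a continuous predictor into 0/1 indicators, with either overlapping or non-overlapping ranges—can be sketched as follows; the breakpoints and the example value are illustrative.

```python
# Sketch of binning a continuous predictor X into 0/1 indicator variables:
# overlapping ranges (X <= a_i) or non-overlapping ranges (a_{i-1} <= X < a_i).
# Breakpoints and the example value are made up for illustration.

def overlapping_bins(x, breakpoints):
    """One indicator per breakpoint a_i: 1 if X <= a_i, else 0."""
    return [1 if x <= a else 0 for a in breakpoints]

def non_overlapping_bins(x, breakpoints):
    """One indicator per interval [a_{i-1}, a_i); exactly one fires."""
    edges = [float("-inf")] + breakpoints + [float("inf")]
    return [1 if lo <= x < hi else 0 for lo, hi in zip(edges, edges[1:])]

money_spent = 75
cuts = [50, 100, 200]
print(overlapping_bins(money_spent, cuts))      # [0, 1, 1]
print(non_overlapping_bins(money_spent, cuts))  # [0, 1, 0, 0]
```

Note the difference the text points out: overlapping indicators are nested (every indicator above the first 1 also fires), whereas non-overlapping indicators partition the range, so exactly one of them is 1 for any value.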
incurred; and if a "bad" segment makes it to the mailing (Type II error), large out-of-pocket costs are incurred.

Consequently, we have built a mechanism into all our automatic tree algorithms to account for minimum and maximum constraints on the resulting segment size. In our study, we used two sets of min/max constraints on the segment size, 150/3,000 and 300/6,000, respectively.

Splitting Criteria
Finally, as discussed in Appendix A, one may define a variety of splitting criteria which belong to the node-value and partition-value families. We have used four different criteria in our study, all of which seek to maximize a statistical measure Y, as follows.

Criterion 1 (CHAID). Y is the statistic for the chi-square test of independence:

   Y = Σ_splits (Observed − Expected)² / Expected

Clearly, the larger the value of Y, the larger the difference between the response rates of the resulting child nodes, and the "better" the split. This statistic is distributed as chi-square with (k − 1) degrees of freedom, where k is the number of splits for the node. Then, if the resulting P_value is less than or equal to the level of significance, we conclude that Y is big "enough" and that the resulting split is "good." Since CHAID uses a sequence of tests (each possible split constitutes a test), an adjusted P_value measure, using the Bonferroni multiplier, is often used to determine the "best" split.

Criterion 2 (CHAID). Y is the total entropy associated with a given partition (into M splits). It is a measure of the information content of the split; the larger the entropy, the better.

Criterion 3 (STA). Y is the number of standard deviations that the response rate (RR) of the smaller child node (the one with the fewer customers) is away from the overall response rate of the training audience (TRR). Large values of Y (e.g., Y ≥ 2) mean that the true (but unknown) response rate of the resulting segment is significantly different from the TRR, indicating a "good" split.

Criterion 4 (STA, GA). Y is the larger response rate of the two children nodes (in a binary split). This criterion seeks the split that maximizes the difference in the response rates of the two descendant nodes.

All these criteria are further discussed in Appendix A.

4. RESULTS AND ANALYSIS
The combination of several tree classifiers, predictor sets, and splitting criteria gives rise to numerous trees. To simplify the presentation, we provide only selective results in this paper. We evaluate all trees based on goodness-of-fit criteria. As a reference point, we compare the automatic segmentation and the manually based segmentation to logistic regression.

To allow for a "fair" comparison between the models, each model was optimized to yield the best results: the manually based segmentation by using domain experts to determine the FRAC segmentation criteria; the automatic decision trees by finding the best tree for the given splitting criterion and constraints; and the logistic regression by finding the "best" set of predictors that explain customer choice.

Goodness-of-Fit Results
In this section we analyze the performance of each method on its own, using goodness-of-fit measures that exhibit how well a model is capable of discriminating between buyers and nonbuyers. Goodness-of-fit results are typically presented by means of gains tables in descending order of the predicted response. In the segmentation-based methods, the predicted response rate of a rollout segment is given by the corresponding response rate of the segment in the calibration (training) sample. In logistic regression, the response rates are estimated directly by the model. Several measures can be used to assess the performance of a response model:
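Two of the splitting statistics above lend themselves to a short sketch: the chi-square Y of Criterion 1 and the "number of sigmas" Y of Criterion 3. The counts below are illustrative; in the study these statistics are computed for every candidate split of a node.

```python
import math

# Illustrative computation of the chi-square statistic (Criterion 1) and
# the sigma statistic (Criterion 3). Example counts are made up.

def chi_square_y(split_counts):
    """Criterion 1: split_counts is a list of (buyers, nonbuyers) per child.
    Sums (Observed - Expected)^2 / Expected over the children's cells."""
    b = sum(c[0] for c in split_counts)
    n = sum(c[1] for c in split_counts)
    t = b + n
    y = 0.0
    for bi, ni in split_counts:
        ti = bi + ni
        for observed, expected in ((bi, ti * b / t), (ni, ti * n / t)):
            y += (observed - expected) ** 2 / expected
    return y

def sigmas_y(smaller_child, trr):
    """Criterion 3: deviations of the smaller child's RR from TRR."""
    buyers, nonbuyers = smaller_child
    n = buyers + nonbuyers
    rr = buyers / n
    return (rr - trr) / math.sqrt(rr * (1 - rr) / n)

y_chi = chi_square_y([(40, 360), (10, 590)])   # two-way split
y_sig = sigmas_y((40, 360), trr=0.05)          # RR = 0.10 vs TRR = 0.05
```

With these counts the smaller child's response rate of 10% lies more than 3 "sigmas" above a 5% training response rate, so both criteria would flag the split as "good."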
● The behavior of the actual response rates (the ratio of the number of buyers "captured" to the size of the audience mailed) across segments, which in a "good" model should exhibit a nicely declining pattern.
● The difference between the actual response rates of the top and the bottom segments; the larger the difference, the better the fit.

All the results below are presented for the holdout sample. Table 1 presents the gains table for RFM-based segmentation; Table 2, for FRAC-based segmentation. Out of the many tree classifiers that we analyzed, we present one gains table for STA, with a 1-variable split, splitting criterion 4, a predictor set that contains the affinity and recency attributes, and min/max constraints on the resulting segment size of 150 and 3,000, respectively (Table 3). Finally, Table 4 exhibits the logistic regression results by deciles using the entire set of predictors.

Observing Table 1 (RFM segmentation), other than the top two segments, the response rates across segments in the list are pretty flat and somewhat fluctuating; both are indicative of a relatively poor fit.

By comparison, in the FRAC segmentation (Table 2), the top segments perform significantly better than the bottom segments, with the top segment yielding a response rate of 12.99% versus an average response rate of only .69% for the entire holdout audience.

Interestingly, the automatic segmentation methods (Table 3) are not lagging behind in terms of discriminating between the buying and the nonbuying segments, with the top segment outperforming the bottom segments by a wide margin.

Finally, the logistic regression model (Table 4) exhibits excellent performance, with the top decile responding more than six times better than the bottom decile.

Tree Performance
To evaluate and compare the automatic segmentation to the judgmentally based segmentation and the logistic regression model, we look at the percentage of buyers captured at several representative audiences. The reference point consists of the top 30% of the customers in the list of segments, arranged in descending order of the response rate of the segments in the calibration sample. Note that in a tree analysis, the response probability of a customer is determined by the response rate of his/her peers, that is, the response rate of the segment that the customer belongs to. Since segments' sizes are discrete, we use interpolation to exhibit the performance results at exactly 30% of the audience. Of course, no interpolation is required for logistic regression, since here the response probability is calculated individually for each customer in the list. As additional reference points, we also present the performance results for the top 10% and 50% of the audience.

Table 5 presents the performance results at these audience levels for several tree classifiers, as well as the results of RFM, FRAC, and logistic regression. To bring all methods to an equal footing for comparison purposes, all methods were calibrated on the same data set using the entire set of predictors. Comparing the results, we conclude:

● The logistic regression model outperforms all other models—the judgmentally based, as well as the automatic models.
● The RFM-based models are the worst.
● The automatic tree classifiers perform extremely well, being comparable to the FRAC-based model and getting close to the logistic regression model.
● Increasing the minimum size constraint usually increases the variance of the fit results across all segments of a tree. This is a manifestation of the fact that larger segments are less "homogeneous" and thus exhibit larger variation. Indeed, smaller segments are more stable, but increase the risk of overfitting. So one needs to trade off variance against segment size to find the most suitable one for the occasion. Unfortunately, there is no optimal way to find the "best" segment size other than by experimentation.

We note that adding more predictors to the
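The "percentage of buyers captured at a given audience level" measure, with the interpolation described above, can be sketched as follows; the segment data are made up, and the function names are ours.

```python
# Sketch of the buyers-captured measure: segments are ranked by their
# calibration response rate, and linear interpolation inside the segment
# that straddles the cutoff gives the value at exactly, e.g., 30% of the
# audience. Segment data below are illustrative.

def buyers_captured_at(segments, audience_pct):
    """segments: list of (calibration_rr, audience_size, buyers).
    Returns the fraction of all buyers captured in the top audience_pct."""
    total_aud = sum(s[1] for s in segments)
    total_buy = sum(s[2] for s in segments)
    target = audience_pct * total_aud
    cum_aud = cum_buy = 0.0
    for _, aud, buy in sorted(segments, key=lambda s: -s[0]):
        if cum_aud + aud >= target:
            frac = (target - cum_aud) / aud   # interpolate inside segment
            return (cum_buy + frac * buy) / total_buy
        cum_aud += aud
        cum_buy += buy
    return 1.0

segs = [(0.13, 1000, 130), (0.05, 2000, 100), (0.01, 7000, 70)]
print(round(buyers_captured_at(segs, 0.30), 3))   # → 0.767
```

Here the top 30% of a 10,000-customer audience (3,000 customers) happens to coincide with the boundary of the second segment, capturing 230 of 300 buyers; in general the cutoff falls inside a segment and the interpolation step prorates that segment's buyers.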
TABLE 1
Gains Table for RFM-Based Segmentation
Segments with at Least 100 Customers in the Holdout Sample, Sorted by Descending Response Rates of the Segments in the Calibration Sample.

SEG  CLB RR%  HLD RR%  CUM CLB RR%  CUM HLD RR%  CUM CLB BUY%  CUM CLB AUD%  CUM HLD BUY%  CUM HLD AUD%  %CLB AUD/%HLD AUD
144 2.38 2.62 2.38 2.62 29.22 7.47 27.88 7.68 0.98
244 1.90 2.38 2.27 2.47 36.07 9.66 34.55 9.61 1.14
322 1.24 0.58 2.21 2.34 37.44 10.33 35.15 10.32 0.94
434 1.10 0.86 2.11 2.21 39.27 11.34 36.36 11.29 1.04
423 0.96 0.00 1.98 1.99 41.55 12.75 36.36 12.60 1.05
223 0.90 1.02 1.91 1.93 42.92 13.67 37.58 13.42 1.12
143 0.90 0.94 1.83 1.84 44.75 14.91 39.39 14.75 0.93
333 0.87 1.43 1.72 1.79 47.49 16.83 43.64 16.78 0.94
433 0.86 1.02 1.63 1.71 50.23 18.76 46.67 18.82 0.95
134 0.84 0.62 1.58 1.63 52.05 20.09 47.88 20.16 0.99
344 0.78 1.23 1.42 1.55 58.45 25.08 56.97 25.23 0.98
334 0.75 0.56 1.36 1.47 61.19 27.29 58.79 27.47 0.99
131 0.72 0.00 1.35 1.44 62.10 28.06 58.79 28.13 1.16
133 0.71 0.88 1.27 1.36 66.67 31.97 64.24 32.41 0.91
123 0.67 1.02 1.24 1.35 68.49 33.63 66.67 34.04 1.01
233 0.62 1.82 1.22 1.36 69.41 34.52 69.09 34.96 0.97
122 0.58 0.38 1.17 1.27 72.60 37.88 70.91 38.29 1.01
412 0.58 0.67 1.14 1.25 74.43 39.81 72.73 40.17 1.03
323 0.57 0.33 1.11 1.19 76.71 42.23 73.94 42.69 0.96
132 0.46 0.37 1.07 1.15 78.54 44.65 75.15 44.95 1.07
213 0.45 0.00 1.06 1.14 79.00 45.27 75.15 45.54 1.03
234 0.44 0.00 1.05 1.12 79.45 45.90 75.15 46.15 1.05
312 0.39 1.02 1.01 1.11 81.28 48.76 79.39 49.00 1.00
212 0.38 0.24 0.95 1.03 84.47 53.90 81.21 54.28 0.97
211 0.34 0.27 0.86 0.92 89.50 62.98 84.85 63.44 0.99
411 0.28 0.37 0.81 0.87 92.69 69.81 88.48 70.19 1.01
112 0.28 0.40 0.79 0.85 93.61 71.81 89.70 72.29 0.95
444 0.27 0.42 0.76 0.83 95.43 75.93 92.12 76.26 1.04
311 0.26 0.27 0.74 0.81 96.80 79.11 93.33 79.31 1.04
432 0.26 0.85 0.74 0.81 97.26 80.18 94.55 80.29 1.09
413 0.23 0.00 0.72 0.79 98.17 82.63 94.55 82.54 1.09
111 0.20 0.46 0.71 0.78 99.09 85.40 96.36 85.28 1.01
121 0.17 0.00 0.70 0.76 99.54 87.02 96.36 86.79 1.08
313 0.07 0.21 0.67 0.74 100.00 90.94 97.58 90.73 0.99
All 0.61 0.69 100.00 100.00 100.00 100.00 0.98
TABLE 2
Gains Table for FRAC-Based Segmentation
Segments with at Least 100 Customers in the Holdout Sample, Sorted by Descending Response Rates of the Segments in the Calibration Sample.

SEG  CLB RR%  HLD RR%  CUM CLB RR%  CUM HLD RR%  CUM CLB BUY%  CUM CLB AUD%  CUM HLD BUY%  CUM HLD AUD%  %CLB AUD/%HLD AUD
process does not necessarily improve the model fit. It turns out that in our case the affinity considerations possess the most prediction power, with the additional recency and frequency variables adding very little to improving the fit. This phenomenon may be typical in the collectible industry, where a customer either likes a given range of products (e.g., dolls) or not, and may not extend to other industries.

Finally, it would be interesting to compare the various tree classifiers to one another to find out which one performs the best. But this requires extensive experimentation with multiple tree classifiers and many data sets, which was beyond the scope of this research.

TABLE 3
Gains Table for STA, 1-Variable Split, Affinity and Recency Predictors, Splitting Criterion 4, and Min/Max Constraint 150/3,000
Segments with at Least 25 Customers in the Holdout Sample, Sorted by Descending Response Rates of the Segments in the Calibration Sample.

SEG  CLB RR%  HLD RR%  CUM CLB RR%  CUM HLD RR%  CUM CLB BUY%  CUM CLB AUD%  CUM HLD BUY%  CUM HLD AUD%  %CLB AUD/%HLD AUD
TABLE 4
Gains Table for Logistic Regression by Deciles Based on the Entire Set of Predictors

% Prospects  Actual % Response  Actual Resp Rate %  % Response/% Prospects  Pred Resp Rate %
 10.00   62.42  4.29  6.24  4.26
 20.00   77.58  2.67  3.88  2.57
 30.00   83.03  1.90  2.77  1.83
 40.00   86.67  1.49  2.17  1.43
 50.00   89.70  1.23  1.79  1.18
 60.00   92.12  1.06  1.53  1.00
 70.00   96.36  0.95  1.38  0.87
 80.00   98.18  0.84  1.23  0.78
 90.00  100.00  0.76  1.11  0.70
100.00  100.00  0.69  1.00  0.63

6. CONCLUSIONS
In this paper we evaluated the performance of automatic tree classifiers versus the judgmentally based RFM and FRAC methods and logistic regression. The methods were evaluated based on goodness-of-fit measures, using real data from the collectible industry. Three automatic tree classifiers participated in the analysis—a modified AID tree that we term STA, an extension of the CHAID method, and a newly developed GA tree.

The evaluation process, which involves several predictor sets, several splitting criteria, and several constraints on the minimum and maximum size of the terminal segments, shows that
● Max Q(RR) = Q(0) = Q(1)
● Min Q(RR) = Q(1/2)
● Q(RR) is a concave function of RR (i.e., the second derivative is positive).

a. The piecewise linear function (& Mitchell, 1993)

   Q(RR) = 1 − RR   if RR ≤ 1/2
   Q(RR) = RR       if RR > 1/2                          (A.1)

b. The quadratic function

   Var(Y) = Σ_i (Y_i − Ȳ)² / (B + N) = Σ_i Y_i² / (B + N) − Ȳ²

But since Y_i is a binary variable (1 buy, 0 no-buy),

   Σ_i Y_i = Σ_i Y_i² = B,   Ȳ = Σ_i Y_i / (B + N) = RR

so the variance reduces to RR(1 − RR).

c. The entropy function

   Q(RR) = −[RR log(RR) + (1 − RR) log(1 − RR)]          (A.3)

…descendant left node (denoted by the index 1) and the right node (denoted by the index 2), respectively. In the following we always assume N1 < N2 (i.e., the left node is the smaller one).
FIGURE A.1. Node value functions.

RR1, RR2 – the response rates of the left and the right nodes, respectively.
Q(RR1), Q(RR2) – the corresponding node value functions.

Thus, the improvement in the node value resulting from the split is given as the difference:

   (N1/N) Q(RR1) + (N2/N) Q(RR2) − Q(RR)                 (A.5)

And we seek the split that yields the maximal improvement in the node value. But since, for a given father node, Q(RR) is the same for all splits, Eq. (A.5) is equivalent to maximizing the node value (A.4).

Clearly, in DBM applications, where the number of buyers is largely outnumbered by the number of nonbuyers, the reference point of 1/2 may not be appropriate. A more suitable reference point to define the node value is TRR, where TRR is the overall response rate of the training audience. Another alternative is the cutoff response rate, CRR, calculated based on economic criteria. The resulting node value functions in this case satisfy all the conditions above, except that they are not symmetrical.

Now, depending upon the value function Q(RR), this yields several heuristic criteria for determining the best split. For example, for the piecewise linear function and a hierarchical tree, a possible criterion is choosing the split that maximizes the response rate of the smaller child node, Max {RR1, 1 − RR1}. In a binary tree, one may choose the split which yields the largest difference in the response rates of the two descendant nodes. In the quadratic case, a reasonable function is (RR − TRR)². Or one can use the entropy function (A.3).

Finally, we note that with a concave node value function, basically any split results in a positive value improvement, however small. Thus, when using the node-value based criteria for determining the best split, it is necessary to
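The node value functions of the appendix and the improvement measure of Eq. (A.5) can be sketched directly; the function forms follow (A.1) and (A.3), while the reference-point parameter and example values are our own.

```python
import math

# Sketch of the node value functions discussed in this appendix and the
# split improvement of Eq. (A.5). Example numbers are illustrative.

def q_piecewise(rr):
    """Piecewise linear node value, Eq. (A.1)."""
    return 1 - rr if rr <= 0.5 else rr

def q_quadratic(rr, ref=0.5):
    """Quadratic node value with a configurable reference point
    (1/2 by default; TRR or CRR as discussed in the text)."""
    return (rr - ref) ** 2

def q_entropy(rr):
    """Entropy function, Eq. (A.3); note it is largest at RR = 1/2."""
    if rr in (0.0, 1.0):
        return 0.0
    return -(rr * math.log(rr) + (1 - rr) * math.log(1 - rr))

def improvement(q, rr, rr1, n1, rr2, n2):
    """(N1/N) Q(RR1) + (N2/N) Q(RR2) - Q(RR), Eq. (A.5)."""
    n = n1 + n2
    return (n1 / n) * q(rr1) + (n2 / n) * q(rr2) - q(rr)

# A split of a 50% - RR node into 20% and 80% children:
gain = improvement(q_piecewise, 0.5, 0.2, 500, 0.8, 500)   # 0.3
```

Because the father node's Q(RR) is the same for every candidate split, ranking splits by this improvement is equivalent to ranking them by the weighted value of the children alone, as the text observes about (A.4) and (A.5).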
TABLE A.1
Components of Y

Split           Buyers            Nonbuyers   Total
1     Observed  B1                N1          T1 = B1 + N1
      Expected  T1 · B/T          T1 · N/T
2     Observed  B2                N2          T2 = B2 + N2
      Expected  T2 · B/T          T2 · N/T
3     Observed  B3                N3          T3 = B3 + N3
      Expected  T3 · B/T          T3 · N/T
Total           B = B1 + B2 + B3  N = N1 + N2 + N3  T = T1 + T2 + T3
impose a threshold level on the minimum segment size and/or the minimum improvement in the node value; otherwise the algorithm will keep partitioning the tree until each node contains exactly one customer.

Partition-Value-Based Criteria
Instead of evaluating nodes, one can evaluate partitions. A partition is considered a "good" one if the resulting response rates of the children nodes are significantly different from one another. This can be expressed in terms of a test of hypothesis. For example, in a two-way split case:

   H0: p1 = p2
   H1: p1 ≠ p2

where p1 and p2 are the true (but unknown) response rates of the left child node and the right child node, respectively.

A common way to test the hypothesis is by calculating the P_value, defined as the probability to reject the null hypothesis, for the given sample statistics, if it is true. Then, if the resulting P_value is less than or equal to a predetermined level of significance (often 5%), the hypothesis is rejected; otherwise, the hypothesis is accepted.

(a) The normal test
The hypothesis testing procedure draws on the probability laws underlying the process. In the case of a two-way split, as above, the hypothesis can be tested using the normal distribution. One can find the Z value corresponding to the P_value, denoted Z_{1−α/2}, using:

   Z_{1−α/2} = Abs(RR1 − RR2) / √[RR1(1 − RR1)/(B1 + N1) + RR2(1 − RR2)/(B2 + N2)]   (A.6)

where:
RR1, RR2 – the response rates of the left child node (denoted by the index 1) and the right child node (denoted by the index 2), respectively
B1, B2 – the corresponding numbers of buyers
N1, N2 – the corresponding numbers of nonbuyers

and then extract the P_value from a normal distribution table.

(b) The chi-square test
In the case of a multiple split, the test of hypothesis is conducted by means of the chi-square test of independence. The statistic for conducting this test, denoted by Y, is given by:

   Y = Σ_splits (Observed − Expected)² / Expected        (A.7)

Table A.1 exhibits the calculation of the components of Y for a three-way split, extending the notation above to the case of three child nodes.
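The normal test of Eq. (A.6) is easy to sketch; the counts below are illustrative, and the closed-form P_value (via the error function, instead of a normal distribution table) is a convenience we add.

```python
import math

# Sketch of the normal test of Eq. (A.6) for a two-way split.
# Example counts are made up; the erf-based P_value replaces the
# "normal distribution table" lookup mentioned in the text.

def z_statistic(b1, n1, b2, n2):
    """Z for H0: p1 = p2, from the children's buyer/nonbuyer counts."""
    rr1, rr2 = b1 / (b1 + n1), b2 / (b2 + n2)
    se = math.sqrt(rr1 * (1 - rr1) / (b1 + n1)
                   + rr2 * (1 - rr2) / (b2 + n2))
    return abs(rr1 - rr2) / se

def two_sided_p_value(z):
    """P_value from the standard normal distribution."""
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

z = z_statistic(40, 360, 10, 590)   # children with RR 10% and 1.7%
p = two_sided_p_value(z)            # reject H0 at the 5% level if p <= 0.05
```

For these counts the response-rate gap is many standard errors wide, so the P_value is far below 5% and the partition would be judged "good."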
This table can easily be extended to more than three splits per node. Y is distributed according to the chi-square distribution with (k − 1) degrees of freedom, where k is the number of splits for the node. One can then extract the P_value for the resulting value of Y from the chi-square distribution. The best split is the one with the smallest P_value.

(c) The smallest child test
Finally, this criterion is based on testing the hypothesis:

   H0: p ≤ TRR
   H1: p > TRR

where p here stands for the true (but unknown) response rate of the smaller child node, and TRR is the observed response rate for the training audience.

To test this hypothesis we define a statistic Y denoting the number of standard deviations ("sigmas") that the smaller segment is away from TRR, i.e.:

   Y = (RR − TRR) / √[RR(1 − RR)/N]

where RR is the observed response rate of the smaller child node, and N the number of observations.

Large values of Y mean that p is significantly different from TRR, indicating a "good" split. For example, one may reject the null hypothesis, concluding that the split is a good one, if Y is larger than 2 "sigmas."

APPENDIX B: TREE CLASSIFIERS
We discuss below the three tree classifiers that were involved in our study—STA, CHAID, and GA.

STA—Standard Tree Algorithm
STA is an AID-like algorithm. The basic AID algorithm is a univariate binary tree. In each iteration, each undetermined node is partitioned, based on one variable at a time, into two descendant nodes. The objective is to partition the audience into two groups that exhibit substantially less variation than the father node. AID uses the sum of squared deviations of the response variable from the mean as the measure of the node value, which, in the binary yes/no case, reduces to the minimum variance criterion, (RR − 0.5)², where RR is the response rate (the ratio of the number of responders to the total number of customers) for the node (see also Appendix A).

In each stage, the algorithm searches over all remaining predictors, net of all predictors that have already been used in previous stages to split father nodes, to find the partition that yields the maximal reduction in the variance.

In this work we have expanded the AID algorithm in two directions:

● Splitting a node based on two predictors at a time, to allow the interaction terms to also affect the tree structure.
● Using different reference points in the evaluation criterion that are more appropriate for splitting populations with marked differences between responders and nonresponders. Possible candidates are the overall response rate of the training audience, or even the cutoff response rate separating between targets and nontargets.

We therefore refer to our algorithm as STA (Standard Tree Algorithm) to distinguish it from the conventional AID algorithm.

CHAID
CHAID (Chi-Square AID) is the most common of all tree classifiers. Unlike AID, CHAID is not a binary tree, as it may partition a node into more than two branches. CHAID categorizes all independent continuous and multivalued integer variables by "similarity" measures, and considers the resulting categories for a variable as a whole unit (group) for splitting purposes. Take for example the variable MONEY (money spent), which is categorized into four ranges, each represented by a dummy 0/1 variable which assumes the value of 1 if the variable value falls
in the corresponding range, and 0 otherwise. Denote the resulting four categorical variables as A, B, C and D, respectively. Since MONEY is an ordinal variable, there are three possibilities to split this variable into two adjacent categories: (A, BCD), (AB, CD), (ABC, D); three possibilities to split the variable into three adjacent categories: (A, B, CD), (AB, C, D), (A, BC, D); and one way to split the variable into four adjacent categories: (A, B, C, D). Now, CHAID considers each of these partitions as a possible split, and seeks the best combination to split the node from among all possible combinations. As a result, a node in CHAID may be partitioned into more than two splits, as many as four splits in this particular example. The best split is based on a chi-square test, which is what gave this method its name.

Clearly, there are many ways to partition a variable with K values into M categories (children nodes). To avoid choosing a combination that randomly yields a “good” split, some versions of CHAID use an adjusted P_value criterion to compare candidate splits.

Let L denote the number of possible ways to combine a variable with K values into M categories.

Let α denote the Type-I error (also known as the level of significance) in the chi-square test for independence. α is the probability of rejecting the null hypothesis that there is no significant difference in the response rates of the resulting child nodes, when the null hypothesis is true.

Now, the probability of accepting the null hypothesis in one combination is (1 − α), and in L successive combinations (assuming the tests of hypotheses are independent) it is (1 − α)^L. Hence the probability of making a Type-I error in at least one combination is 1 − (1 − α)^L, which is greater than α.

To yield a “fair” comparison of the various combinations, α is replaced by the resulting P_value. In most cases, the P_value is very small, and we can therefore use the approximation

1 − (1 − P_value)^L ≈ L · P_value

The resulting quantity, L · P_value, is the adjusted P_value, and L is referred to as the Bonferroni multiplier. Each combination yields a different adjusted P_value. The “best” combination to partition the node by is the one that yields the smallest adjusted P_value.

The number of possibilities for combining a variable with K values into M categories depends on the type of the variable involved. In our algorithm, we distinguish among three cases:

● Ordinal variables, where one may combine only adjacent categories (as in the case of the variable MONEY above).
● Ordinal variables with a missing value, which may be combined with any of the other categories.
● Nominal variables, where one may combine any two (or more) values, including the missing value (e.g., the variable MARITAL, with four nominal values: M - married, S - single, W - widow, D - divorced).

Table B.1 exhibits the number of combinations for several representative values of K and M.

Genetic Algorithm (GA)

All tree algorithms described above are combinatorial, in the sense that in each stage they go over all possible combinations to partition a node. This number gets excessively large even for one-variable splits and becomes computationally prohibitive with multivariable splits. Consequently, all tree algorithms are univariate (AID, CHAID) or at best bivariate (STA). Yet, it is conceivable that splits based on several variables at a time (more than two) may be more “homogeneous” and therefore better from the standpoint of segmentation. Thus, by confining oneself to using only univariate or even bivariate tree algorithms, one may miss out on the better splits that could have been obtained otherwise with multivariate algorithms.

To resolve this issue, we developed a Genetic Algorithm (GA) tree for segmentation, which, unlike the other trees, is a non-combinatorial algorithm in the sense that it employs a random, yet systematic, search approach to grow a tree.
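To make the smallest-child test of Appendix A concrete, here is a minimal Python sketch. The function name, the returned tuple, and the 2-sigma default threshold are illustrative assumptions; the statistic Y itself is computed exactly as defined above, Y = (RR − TRR) / √(RR(1 − RR)/N).

```python
import math

def smallest_child_test(rr, trr, n, threshold=2.0):
    """Smallest-child test sketch: the number of standard deviations
    ("sigmas") separating the smaller child's observed response rate RR
    from the training response rate TRR, given n observations.

    Function name and packaging are illustrative, not from the paper.
    """
    y = (rr - trr) / math.sqrt(rr * (1.0 - rr) / n)
    # Declare the split "good" when the smaller child is more than
    # `threshold` sigmas away from TRR.
    return y, abs(y) > threshold

# Example: a smaller child with a 6% response rate over 2,000 customers,
# against a 4% training response rate.
y, good = smallest_child_test(rr=0.06, trr=0.04, n=2000)
# y is about 3.77 sigmas, so this split would be declared good.
```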
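For the ordinal case, the combination counts illustrated by the MONEY example follow a standard identity: splitting K ordered values into M adjacent, non-empty categories amounts to choosing M − 1 cut points among the K − 1 gaps, i.e., C(K − 1, M − 1). A short sketch of this count and of the Bonferroni-adjusted P_value; the function names are hypothetical, and only the ordinal case is covered (the missing-value and nominal cases described above yield larger counts).

```python
from math import comb

def bonferroni_multiplier_ordinal(k, m):
    """Number of ways to split k ordered categories into m adjacent,
    non-empty groups: choose m-1 cut points among k-1 gaps."""
    return comb(k - 1, m - 1)

def adjusted_p_value(p_value, L):
    """Bonferroni-adjusted P_value: 1 - (1 - p)^L is approximately
    L * p when p is small."""
    return L * p_value

# MONEY example from the text, K = 4 ordered ranges (A, B, C, D):
L2 = bonferroni_multiplier_ordinal(4, 2)  # 3: (A,BCD), (AB,CD), (ABC,D)
L3 = bonferroni_multiplier_ordinal(4, 3)  # 3: (A,B,CD), (AB,C,D), (A,BC,D)
L4 = bonferroni_multiplier_ordinal(4, 4)  # 1: (A,B,C,D)

# A raw chi-square P_value of 0.004 for a two-way split is penalized
# by the L = 3 competing two-way combinations:
adj = adjusted_p_value(0.004, L2)  # 0.012
```

With K = 4, the counts 3, 3, and 1 match the partitions enumerated in the text.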
Since gender is a very important attribute in the collectible industry, it was also included in the affinity set.

The second set includes the predictors in the affinity set plus recency predictors representing the number of months since the last purchase, broken down into mutually exclusive and exhaustive time segments (0–6 months, 7–12 months, 13–24 months, etc.).

The third set includes the affinity and recency predictors, plus frequency measures representing the total number of products purchased by a customer in the past from all media channels, broken down by major product categories.

Finally, the last set contains all predictors in the database, including purchases from direct-mail sources by product categories, money spent on past purchases by product categories, time since joining the list, as well as a variety of other indicators.

Several procedures were used in the study to collapse the number of predictors for modeling purposes to a more manageable size.

REFERENCES

Ben-Akiva, M., & Lerman, S.R. (1987). Discrete Choice Analysis. Cambridge, MA: The MIT Press.

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth.

Bult, J.R., & Wansbeek, T. (1995). Optimal Selection for Direct Mail. Marketing Science, 14, 378–394.

Davis, L. (Ed.). (1991). Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold.

Haughton, D., & Oulabi, S. (1997). Direct Marketing Modeling with CART and CHAID. Journal of Direct Marketing, 11, 42–52.

Holland, J.H. (1975). Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press.

Kass, G. (1983). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29.

Kestnbaum, R.D. (1998). Kestnbaum & Company, Chicago, Private Communication.

Levin, N., & Zahavi, J. (1996). Segmentation Analysis with Managerial Judgment. Journal of Direct Marketing, 10, 28–47.

Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Michalski, R.S., Carbonell, J.G., & Mitchell, T.M. (1983). Machine Learning—An Artificial Intelligence Approach. Palo Alto, CA: Tioga Publishing Company.

Morwitz, G.V., & Schmittlein, D. (1992). Using Segmentation to Improve Sales Forecasts Based on Purchase Intent: Which “Intenders” Actually Buy? Journal of Marketing Research, 29, 391–405.

Morwitz, G.V., & Schmittlein, D. (1998). Testing New Direct Marketing Offerings: The Interplay of Management Judgment and Statistical Models. Management Science, 44, 610–628.

Murthy, K.S. (1998). Automatic Construction of Decision Trees from Data: A Multi-disciplinary Survey. Data Mining and Knowledge Discovery, 2, 345–389.

Novak, P.T., de Leeuw, J., & MacEvoy, B. (1992). Richness Curves for Evaluating Market Segmentation. Journal of Marketing Research, 29, 254–267.

Quinlan, J.R. (1986). Induction of Decision Trees. Machine Learning, 1, 81–106.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.

Shepard, D. (Ed.). (1995). The New Direct Marketing. New York: Irwin Professional Publishing.

Sonquist, J., Baker, E., & Morgan, J.N. (1971). Searching for Structure. Ann Arbor: University of Michigan, Survey Research Center.

Weinstein, A. (1994). Market Segmentation. New York: Irwin Professional Publishing.