
WAGNER A. KAMAKURA and MICHEL WEDEL*

The authors address the situation in which a researcher wants to cross-tabulate two sets of discrete variables collected in independent samples, but a subset of the variables is common to both samples. The authors propose a statistical data-fusion model that allows for statistical tests of association using multiple imputations. The authors illustrate this approach with an application in which they compare the cross-tabulation results from fused data with those obtained from complete data. Their approach is also compared to the traditional hot-deck procedure.

Statistical Data Fusion for Cross-Tabulation

*Wagner A. Kamakura is Thomas Marshall Chair in Marketing, Joseph Katz Graduate School of Business, University of Pittsburgh. Michel Wedel is Professor of Marketing Research, Faculty of Economics, University of Groningen, Netherlands. The authors thank Professor Jose A. Mazzon for providing the data for their illustration, Professor Ulf Bockenholt for his valuable suggestion on the test of the adequacy of the data-fusion model, Professor Tom Steerneman for his suggestion on the power calculations, and the anonymous reviewers, whose comments have helped improve this article.

Journal of Marketing Research, Vol. XXXIV (November 1997), 485-498

In this study, we address the following problem: A marketing researcher conducts a survey and then must relate its results to those obtained in a previous study conducted with another independent sample. This problem is common in marketing research and is found whenever a researcher needs to consolidate results obtained from two independent samples. This situation arises, for example, for market research agencies that maintain consumer panels for consumer nondurables. The panel data typically contain information on brands purchased in a large variety of categories, as well as prices, sales promotions, and so on. However, media exposure of the respondents is in general not available, because the collection of such data imposes too great a burden on the panel members and potentially biases buying behavior. In addition, the costs of collecting single-source data on both purchase behavior and media exposure often are prohibitive. However, media exposure is available from separate studies supplied by specialized research firms. Market research agencies wanting to provide their clients with insights into advertising effectiveness must cross-tabulate product usage assessed in their panels with media exposure as assessed by the media studies. This may be accomplished by data fusion of purchase and media data, which has become popular as a media planning tool, particularly in Europe (cf. Adamek 1994; Baker, Harris, and O'Brien 1989; Buck 1989; Roberts 1994).

The problem of combining data from two different sources has been solved by data-fusion procedures; in the statistical literature this class of methods is also known as file concatenation methods. Here, we propose a model-based procedure for data fusion of discrete variables that extends previous work in this area. Previous work on file concatenation has been tailored predominantly to continuous variables (cf. Rubin 1986). Although procedures for estimating models for discrete variables when data are missing have been provided by Little and Rubin (1987), the focus has been on handling the problem of missing data.

Our procedure is based on a mixture model. The advantage of the mixture model approach over previous heuristic data-fusion methods is that it forces the market researcher to state assumptions that otherwise might not be explicit in the data-fusion process, thus circumventing ad hoc and subjective decisions. In addition, the mixture is a parsimonious model specification that circumvents the cumbersome search for an appropriate specification of data-fusion models. To assess the uncertainty associated with data fusion, we use multiple imputations: the missing observations are imputed more than once, which makes it possible to assess the degree of uncertainty about the missing data. We show how cross-tables of any two partially observed variables in the two data sets, and tests of their association, can be computed. We first review previous work in this field and then motivate and describe our data-fusion model. We then apply our approach to a customer satisfaction study, assess its validity empirically, and compare it to competing methods.

LITERATURE REVIEW

In the statistical literature, several authors have addressed the file-concatenation problem. Reviews are provided by Ford (1983), Rogers (1984), and Rubin (1986). Commonly used methods for the concatenation of two files, A and B, are hot-deck procedures. Hot-deck procedures duplicate data on the basis of some heuristic rule: When a value is missing from sample A (the recipient sample), a reported value is taken from sample B (the donor sample) to replace it (Ford 1983). Thus, each recipient subject is linked to one (or more) donor subjects in file B on the basis of variables observed in both files.

This is accomplished by matching subjects on the basis of some measure of distance or similarity. The common variables on which subjects are matched are often demographics, but other variables may be included in the files for the specific purpose of data fusion (Roberts 1994).

Different matching algorithms can be used to search for the linkages of the subjects in the two files in hot-deck procedures. First, when quantitative (metric) common variables are used, a distance function between subjects in the two samples is defined. The subjects in file A are then matched with subjects in sample B according to a minimum distance function. Alternatively, model-based procedures for file concatenation may be used. These involve regression functions estimated between the common variables and the unique variables in sample A, which subsequently are used to predict the missing values of these variables in sample B, and vice versa (Rubin 1986). For example, Rubin and Thayer (1978) provide an early application of regression-based methods to calculate correlations among psychological test items applied in nonoverlapping samples.

Second, if the variables are qualitative (discrete), the common variables are grouped into categories, and an exact match between subjects in the two files is sought along these categories. For those subjects for whom an exact match does not exist, the procedure searches for a match at a lower level of detail by omitting some variables or collapsing the categories of one or more variables. Often, a distinction is made between critical variables, for which the match between the donor and recipient must be exact, and matching variables, which are used to find the best possible match between donors and recipients within the classes defined by the critical variables. The hot-deck procedures that are based on discrete variables often involve complicated classification processes, in which the sample units are classified into disjoint, homogeneous imputation groups. Observations in these imputation groups are considered exchangeable, and missing values in the recipient sample can therefore be imputed from subjects in the same imputation group in the donor sample (for applications to purchase and media data fusion, see Antoine and Santini 1987; Baker, Harris, and O'Brien 1994; Buck 1989; O'Brien 1991; Roberts 1994).

A major disadvantage of hot-deck procedures is their heuristic nature. Hot-deck methods require several subjective decisions that critically affect the quality of the complete data set obtained. The choice of the distance and matching measures, the definition of the levels of the matching procedure in terms of categories and variables, and the distinction between critical and matching variables are some examples. Moreover, the statistical properties of the fused data set are unknown, and consequently, the significance levels of chi-square tests from cross-tables calculated from the fused data are invalid. Current hot-deck procedures use a single imputed value and therefore theoretically have poor sampling properties (Li, Raghunathan, and Rubin 1991), even though they could be extended for multiple imputations.
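For concreteness, the following minimal sketch (ours, not taken from any of the cited procedures; the function name `hot_deck_impute` and the toy records are illustrative) shows the basic logic of such a hot-deck match: donors must agree exactly with the recipient on the critical variables, and the donor with the fewest mismatches on the remaining matching variables supplies the missing values.

```python
def hot_deck_impute(recipients, donors, critical, matching, donor_unique):
    """Toy hot-deck imputation: exact match on critical variables,
    nearest match (fewest category mismatches) on the matching variables."""
    fused = []
    for r in recipients:
        # donors that agree exactly on every critical variable
        pool = [d for d in donors if all(d[c] == r[c] for c in critical)]
        if not pool:                  # fall back to all donors if no exact match
            pool = donors
        # donor with the smallest mismatch count on the matching variables
        best = min(pool, key=lambda d: sum(d[m] != r[m] for m in matching))
        fused.append({**r, **{u: best[u] for u in donor_unique}})
    return fused

# illustrative records: age and gender are critical, education is a matching
# variable, and "media" is observed only in the donor file
donors = [{"age": 1, "gender": 0, "educ": 2, "media": 3},
          {"age": 1, "gender": 0, "educ": 1, "media": 1}]
recipients = [{"age": 1, "gender": 0, "educ": 1}]
print(hot_deck_impute(recipients, donors, ["age", "gender"], ["educ"], ["media"]))
```

Each of the subjective choices discussed above (which variables are critical, how mismatches are weighted, how ties are broken) enters such a procedure as an explicit rule chosen by the analyst.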
Model-based file concatenation methods overcome several of these problems. Rubin (1986) proposed regression-based procedures for file concatenation on the basis of continuous data. However, this method does not apply to discrete data. Little and Rubin (1987) propose log-linear models for partially classified contingency tables. These procedures apply to discrete variables with missing data, but they were not developed in the context of data fusion and may pose problems of model selection when applied to the large data-fusion problems in marketing research. In addition, when calculating association tests, the uncertainty that arises from the imputation of missing data should be taken into account. Meng and Rubin (1992) propose test statistics for testing associations among variables when part of the observations are missing, which take into account the uncertainty due to imputation. Our approach builds on this stream of research.

We develop a procedure specifically tailored to discrete data and the problem of forming cross-classified tables with chi-square tests of significance among variables from independent studies obtained from separate samples, a problem commonly found in marketing research. Rather than concatenate two independent files by matching them on the basis of the information on the common variables only, we propose a mixture model that identifies homogeneous imputation groups on the basis of all information available from the two samples. This proposed approach ties in with the practice in marketing research of using imputation groups for data fusion. Whereas in the hot-deck procedure these imputation groups are usually delineated a priori on the basis of arbitrary criteria, our proposed mixture model identifies them post hoc on the basis of the statistical analyses of the common and unique data for the data sets to be fused. In addition, our procedure circumvents problems of model selection encountered in some existing approaches to modeling under missing data applied to data-fusion problems in marketing. We use multiple imputations to provide an assessment of the uncertainty caused by the data-fusion process. This study extends the earlier work by Rubin and colleagues on file concatenation and multiple imputation to discrete data and involves a new mixture model that deals with partially observed data. We apply a significance test, originally developed for missing data problems, to the file-concatenation problem. Furthermore, we provide insight into the performance of data fusion relative to complete data, both in an absolute sense and relative to traditional hot-deck methods.

DEVELOPMENT OF THE DATA-FUSION MODEL

We introduce the basic notation required for our approach. Measures are obtained from two samples, denoted A and B, on sets of categorical variables. Let

i = 1,..., N indicate subjects in sample A,
i = N + 1,..., N + M indicate subjects in sample B,
j = 1,..., P indicate variables unique to sample A,
j = P + 1,..., P + Q indicate variables common to samples A and B,
j = P + Q + 1,..., P + Q + R indicate variables unique to sample B,
k_j = 1,..., K_j indicate categories of variable j,
t = 1,..., T denote homogeneous imputation groups, and
s = 1,..., S denote multiple imputations.

The purpose is to fuse the samples A and B to enable the investigation of bivariate relationships of the discrete variables measured in those two samples. These discrete variables are denoted by X_ijk, where X_ijk = 1 if the observed value of variable j for subject i falls in category k of the K_j categories of this variable, and X_ijk = 0 otherwise. A complete (N + M) by (P + Q + R) data matrix X is formed, in which part of the observations are missing. Missing data are indicated by X^m; observed data are denoted by X^o.
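The indicator coding and the block structure of X can be made concrete with a short sketch (our own illustration; the toy sizes and the use of NaN to mark the missing block X^m are assumptions, not part of the article).

```python
import numpy as np

def one_hot(code, K):
    """Indicator coding X_ijk: 1 if the value falls in category k, 0 otherwise."""
    x = np.zeros(K)
    x[code] = 1.0
    return x

# toy sizes: P variables unique to A, Q common, R unique to B, all with K categories
N, M, P, Q, R, K = 4, 3, 2, 2, 2, 3
rng = np.random.default_rng(0)

rows = []
for i in range(N + M):
    row = []
    for j in range(P + Q + R):
        in_A = i < N
        observed = (j < P + Q) if in_A else (j >= P)
        if observed:
            row.append(one_hot(rng.integers(K), K))
        else:
            row.append(np.full(K, np.nan))      # X^m: missing by study design
    rows.append(np.concatenate(row))

X = np.vstack(rows)   # (N + M) x sum_j K_j indicator matrix with structured missingness
print(np.isnan(X).reshape(N + M, P + Q + R, K)[:, :, 0].astype(int))
```

The printout shows that the first N rows lack the Set III block and the last M rows lack the Set I block, which is the block-missing pattern depicted schematically in Figure 1.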

Figure 1
SCHEMATIC REPRESENTATION OF THE DATA-FUSION PROBLEM

In terms of the example given in the introductory section, the variables common to samples A and B can be demographics, whereas the variables unique to sample A or sample B can be choice behavior toward a set of brands and media exposure, respectively. To simplify the presentation, we assume no additional missing data among the variables in sample A or sample B. We also assume that there are no subjects for whom all variables are assessed.¹ The sample design underlying both samples is assumed to be a simple random sample. As a consequence of these assumptions, samples A and B can be considered repeated independent samples from the same population, which enables us to form the complete data matrix X displayed in Figure 1.

It is useful to note from Figure 1 that the missing observations have a special structure, which distinguishes the data-fusion problem from the typical imputation of missing data. The advantage of formulating the data-fusion problem as a missing-data problem is that the corresponding missing-data problem has attractive properties that facilitate the statistical treatment. In particular, the missing-data generation mechanism is not stochastic, as in many other missing-data problems, but deterministic, because it is induced by the study designs of the two samples. Therefore, this particular missing-data mechanism is ignorable (Rubin 1976). A missing-data mechanism is said to be ignorable if (1) the missing-data generation mechanism is not relevant for inference on parameters describing the complete data and (2) the mechanism generating the missing data depends only on the observed data and not on the missing data.

¹We note that in situations in which all variables are measured for some subjects, the performance of our (and other model-based) data-fusion procedures improves, because then information on the partial association of the two sets of unique variables in samples A and B, given the common variables, is available.

Data fusion depends on the relationship between the missing and nonmissing observations, which can be described by a model. Little and Rubin (1987) propose the use of log-linear models when the missing data mechanism can be ignored. They show that the likelihood can be factored on the basis of the marginal tables for which complete information on all cells is available. In our particular problem (i.e., data fusion), we intend to cross-classify two discrete variables, one in Set I, X_j (j ≤ P), and one in Set III, X_j' (j' > P + Q). Little and Rubin's approach can be applied by computing the marginal table from sample A, classified by the variables in Set I and the common Set II, and the table computed from sample B, classified by the variables in Set III and the common Set II. The likelihood estimates for the parameters of a log-linear model describing the relation between X_j and X_j' can be obtained by using the factorization of the joint probability into the product of the probabilities for these two marginal tables. This log-linear model requires many parameters to be estimated in applications of data fusion in marketing and is to be estimated for all possible combinations of two variables in the unique Sets I and III. For example, in the illustration we provide subsequently, this model would be impractical, requiring the fitting of two 16-dimensional tables, with more than 100,000 cells each, for 80 pairs of variables.

Instead, we assume the existence of a limited set of unobserved imputation groups to describe the dependencies in the tables. This circumvents the need to estimate a high-dimensional table for each cross-classification of two variables. The model is estimated for all variables simultaneously, which thus eliminates the need to estimate and select a model for each pair of variables (from the unique Sets I and III) to be cross-tabulated.

The mixture model presents a particularly attractive alternative to the traditional hot-deck classification procedures. Using the latter methods, imputation groups of exchangeable observations are typically derived using some a priori procedure, which involves several subjective decisions on the part of the researcher. On the contrary, our model enables the identification of imputation groups post hoc from the available data. Our mixture model presents a statistical approach to the derivation of imputation groups, which has the advantage that measurement error in the variables is accounted for, and the statistical framework causes subjective decisions in various stages of the data-fusion process to be largely circumvented. Mixture models are by now established tools in marketing research (cf. Dillon and Kumar 1994; Wedel and DeSarbo 1994), and several studies have shown their ability to identify homogeneous latent groups in data (e.g., DeSarbo and Cron 1988; Wedel and DeSarbo 1995).

We assume that T homogeneous imputation groups underlie the data on the three sets of variables (where the number of groups T is not known in advance) and have prior probabilities π_t:

(1) f_i(X_i \mid \Omega) = \sum_{t=1}^{T} \pi_t \, f_{i|t}(X_i \mid \theta_t),

where Ω = {π_t, θ_t; t = 1, 2,..., T} contains all the relevant parameters, and X_i denotes the ith row of X. For the purposes of data imputation, the mixture model, by assumption, has the attractive property that the observed sets of variables are locally independent within imputation groups. Assuming that the observed and missing sets of variables are locally independent within imputation group t, we can write the joint distribution of the missing and nonmissing observations as the product of their marginal distributions:

(2) f_{i|t}(X_i^o, X_i^m \mid \theta_t) = f_{i|t}(X_i^o \mid \theta_t) \, f_{i|t}(X_i^m \mid \theta_t).

We assume that the j = 1,..., P + Q + R variables are multinomial with K_j categories (K_j ≥ 2) and are described by the multinomial probability density with response probabilities θ_jkt for category k of variable j in group t:

(3) f_{ij|t}(X_{ij} \mid \theta_{jt}) = \prod_{k=1}^{K_j} \theta_{jkt}^{X_{ijk}}.

Here, X_ij may denote both the nonmissing and missing observations. In estimating the model, the researcher identifies the imputation groups in such a way that all observed variables are locally independent within the groups. This means, for example, that within the imputation groups in sample A, the P unique variables are approximately independent of the Q common variables. Moreover, the P and Q variables are approximately independent among themselves within each imputation group. Therefore, the procedure identifies the imputation groups not only by using the information about the association among the Q common variables, but also by capitalizing on the dependence among the P unique variables and their joint association with the Q common variables. The intuition behind the application of the mixture model is to obtain groups of observations that are exchangeable. This implies that if some of the observations within a certain group of subjects are missing (for example, the R unique variables for subjects from sample A), they can be replaced by values obtained from the model estimates provided by subjects from the other sample, B, belonging to that same latent imputation group. This idea is analogous to the one underlying the identification of homogeneous imputation groups in hot-deck procedures. However, here we attempt to identify the homogeneous groups on the basis of statistical estimation, rather than heuristic rules.
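The following sketch (ours; the names `pi` and `theta` and the toy categories are illustrative) evaluates Equations 1 through 3 directly: within group t the variables are locally independent, so a subject contributes a product of multinomial probabilities over whichever variables are observed for that subject.

```python
import numpy as np

def component_loglik(x_obs, theta_t):
    """log f_{i|t}(X_i^o | theta_t): sum of log multinomial probabilities
    over the observed variables only (local independence, Equations 2-3)."""
    return sum(np.log(theta_t[j][k]) for j, k in x_obs.items())

def mixture_loglik(x_obs, pi, theta):
    """log f_i(X_i^o | Omega) = log sum_t pi_t f_{i|t}(X_i^o | theta_t)  (Equation 1)."""
    logs = np.array([np.log(pi[t]) + component_loglik(x_obs, theta[t])
                     for t in range(len(pi))])
    m = logs.max()
    return m + np.log(np.exp(logs - m).sum())   # log-sum-exp for numerical stability

# toy example: T = 2 groups, three variables with 2, 3, and 2 categories
pi = np.array([0.6, 0.4])
theta = [
    {0: [0.8, 0.2], 1: [0.5, 0.3, 0.2], 2: [0.7, 0.3]},   # group 1
    {0: [0.3, 0.7], 1: [0.1, 0.2, 0.7], 2: [0.4, 0.6]},   # group 2
]
# a subject from sample A: variable 2 (unique to B) is missing and simply omitted
x_obs = {0: 1, 1: 2}
print(mixture_loglik(x_obs, pi, theta))
```

Because unobserved variables simply drop out of the product, the same expression serves for subjects from sample A (Set III missing) and from sample B (Set I missing).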
Model Estimation

The parameters of the model are estimated by maximizing the likelihood function. The log-likelihood can be written as

(4) \ln L(\Omega \mid X^o) = \sum_{i=1}^{N} \ln\left[\sum_{t=1}^{T} \pi_t \, f_{i|t}(X_i^o \mid \theta_t)\right] + \sum_{i=N+1}^{N+M} \ln\left[\sum_{t=1}^{T} \pi_t \, f_{i|t}(X_i^o \mid \theta_t)\right].

This likelihood is composed of two terms, one for sample A and one for sample B, in which the mixture prior probabilities as well as the response probabilities for the common variables are restricted to be the same. Note that Equation 4 does not involve the missing observations, and therefore the imputation procedure described subsequently is noniterative. The log-likelihood is maximized using an iterative EM algorithm (Dempster, Laird, and Rubin 1977).

In the EM algorithm, the E- and M-steps are alternated until the improvement in the likelihood function is below a user-specified threshold. We do not provide the derivation of the EM algorithm for this problem because it is straightforward and available elsewhere (Dempster, Laird, and Rubin 1977), but its steps are summarized in the Appendix. The expressions for the parameter estimates have closed form, and therefore the implementation of the EM algorithm is computationally attractive, even for the large samples encountered in data-fusion problems. The EM algorithm suffers from convergence to local optima, and therefore we use several sets of random starting values of the posterior probabilities. The model comprises three sets of parameters: the two sets of mixture model parameters for the unique sets of variables in the first and second samples, respectively, and the mixture model parameters for the common variables across the two samples. Identification of each of these three sets of parameters follows from standard identification proofs for mixtures of exponential family distributions (Titterington, Smith, and Makov 1985). Under standard regularity conditions, the parameter estimates are asymptotically normal, with covariance matrix Σ_Ω, which is estimated by inverting the negative Hessian matrix of second derivatives. When estimates of the model parameters are obtained, estimates of the posterior probabilities, π_it, that subject i belongs to imputation group t can be calculated using Bayes' theorem (see the Appendix).

The selection of a parsimonious number of imputation groups, T, is of concern in the application of mixture models to data-fusion problems, because the quality of the data fusion will not increase indefinitely with increasing T. The quality of the data fusion depends on the precision with which the mixture model parameters are estimated, because these form the basis for imputation. On the one hand, as the number of imputation groups decreases, heterogeneity within the groups increases, and the model estimates become increasingly burdened with aggregation bias. On the other hand, as T increases, the number of observations in each imputation group decreases, which negatively affects the precision of the estimates, thus inflating their estimated variances. Therefore, it is likely that there is an optimal value of T. Unfortunately, the standard likelihood-ratio (LR) test statistic for testing T versus T + 1 groups is invalid because the T-group solution is at the boundary of the parameter space of the (T + 1)-group solution (Aitkin and Rubin 1985), so this statistic cannot be used to determine the optimal number of imputation groups (or segments) T. In line with current practice in the mixture model literature, we use the Consistent Akaike Information Criterion (CAIC; Bozdogan 1987) as a heuristic, in which that number of groups, T, is selected for which the CAIC reaches a minimum. In addition, it is important to ensure that the imputation groups are well separated. We use an entropy statistic, E_T, to investigate the degree of separation in the estimated posterior probabilities (cf. DeSarbo et al. 1992).

E_T is a relative measure and is bounded between 0 and 1, where values close to 1 indicate that the posteriors are well separated. A value close to 0 is of concern because it implies that the posteriors are not sufficiently separated.
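The EM steps themselves are given in the authors' Appendix, which is not reproduced here; the sketch below is our own generic implementation of closed-form EM updates for a mixture of independent multinomials with block-missing variables, together with the CAIC, computed as -2 ln L + p(ln n + 1), and one common form of the entropy-based separation measure, E_T = 1 - [Σ_i Σ_t (-π_it ln π_it)] / (n ln T). All function names and the missing-value convention (category code -1) are ours.

```python
import numpy as np

def em_multinomial_mixture(X, K, T, n_iter=200, seed=0, tol=1e-8):
    """EM for a finite mixture of locally independent multinomials.
    X: (n, J) integer array of category codes, with -1 where a variable is unobserved."""
    rng = np.random.default_rng(seed)
    n, J = X.shape
    pi = np.full(T, 1.0 / T)
    theta = [rng.dirichlet(np.ones(K[j]), size=T) for j in range(J)]   # theta[j][t, k]
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: log posterior weight of each group for each subject (observed variables only)
        logw = np.tile(np.log(pi), (n, 1))
        for j in range(J):
            obs = X[:, j] >= 0
            logw[obs] += np.log(theta[j][:, X[obs, j]]).T
        m = logw.max(axis=1, keepdims=True)
        loglik = (m.ravel() + np.log(np.exp(logw - m).sum(axis=1))).sum()  # before M-step
        post = np.exp(logw - m)
        post /= post.sum(axis=1, keepdims=True)            # posterior probabilities pi_it
        # M-step: closed-form updates of the priors and response probabilities
        pi = np.clip(post.mean(axis=0), 1e-12, None)
        pi /= pi.sum()
        for j in range(J):
            obs = X[:, j] >= 0
            for k in range(K[j]):
                theta[j][:, k] = post[obs][X[obs, j] == k].sum(axis=0)
            theta[j] = theta[j] + 1e-10
            theta[j] /= theta[j].sum(axis=1, keepdims=True)
        if loglik - prev < tol:
            break
        prev = loglik
    return pi, theta, post, loglik

def caic(loglik, n_params, n):
    """Consistent Akaike Information Criterion (Bozdogan 1987)."""
    return -2.0 * loglik + n_params * (np.log(n) + 1.0)

def entropy_separation(post):
    """One common scaled-entropy separation measure based on the posteriors pi_it."""
    n, T = post.shape
    ent = -(post * np.log(np.clip(post, 1e-12, None))).sum()
    return 1.0 - ent / (n * np.log(T))

# toy usage with the stacked-matrix convention used in the text: -1 marks X^m
X = np.array([[0, 1, -1], [1, 0, -1], [-1, 1, 0], [-1, 0, 1]])
pi, theta, post, ll = em_multinomial_mixture(X, K=[2, 2, 2], T=2)
print(round(caic(ll, n_params=1 + 3 * 2, n=4), 2), round(entropy_separation(post), 2))
```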
Multiple Imputation of the Missing Data

When the parameters of the model have been estimated, the data-fusion process involves the imputation of the missing observations in X. There are important problems in imputing a single "best" value for each missing value. A single imputed value cannot represent the uncertainty involved in the imputation process and therefore leads to underestimation of uncertainty in subsequent analyses of the imputed data set. Several studies have shown that these effects can be serious and that multiple imputation is preferred over single imputation with regard to the estimates of confidence intervals and significance levels. In our application this is important because we want to assess the significance level of the association of pairs of discrete variables. Multiple imputation replaces each missing observation with S (≥ 2) values. From a Bayesian perspective, the data-imputation procedure involves drawing S sets of values from the posterior density of the missing data, X_i^m, which is obtained by integrating the density of X_i^m with respect to the distribution of the parameters:

(5) f_i(X_i^m \mid X_i^o) = \int f_i(X_i^m \mid X_i^o, \Omega) \, f(\Omega \mid X^o) \, d\Omega,

where Ω = [π, θ] are the parameters of the latent class model.

The purpose here is to cross-classify two discrete variables from the sets {X_j; j = 1,..., P} and {X_j'; j' = P + Q + 1,..., P + Q + R}, which results in a K_j by K_j' contingency table. This is equivalent to estimating the joint (posterior) probability distribution of X_j and X_j', which is characterized by the parameters representing the cell probabilities: Φ = (φ_kl; k = 1,..., K_j, l = 1,..., K_j'). (For convenience of notation, we omit the subscripts for the variables j and j' on the category indicators k and l.) The posterior density is computed by calculating the conditional posterior density given the observed and missing data and integrating out the distribution of the missing data from that expression:

(6) f(\Phi \mid X^o) = \int f(\Phi \mid X^o, X^m) \left[\prod_{i=1}^{N+M} f_i(X_i^m \mid X_i^o)\right] dX^m.

In using Equations 5 and 6 for imputation, the researcher substitutes Equations 1 and 2 and approximates the integrals by sampling from the respective distributions. The imputation procedure operates as follows: For a particular subject, first draw values of the parameters, Ω*, from the posterior distribution f(Ω | X^o), which is asymptotically multivariate normal with covariance matrix Σ_Ω.² The drawn parameter values for the common variables enable the calculation of the posterior probabilities that the subject belongs to imputation group t, for all t: π_it*. Then draw the imputation group t* from a multinomial distribution with probabilities π_it*. Finally, draw X_i^m* conditional upon the values of the parameters Ω* and the imputation group t*, from f_{i|t*}(X_i^m | X_i^o). This involves drawing from a multinomial distribution with probabilities θ_jkt*. This is repeated S times, and the posterior density of the parameters Φ is approximated by computing Φ* for each of the S repeated samples. The imputation procedure itself is noniterative, which offers obvious computational advantages. It can be summarized as follows (a code sketch appears at the end of this subsection):

1. Initialize s = 1;
2. Initialize i = 1;
3. Sample the parameters Ω* from MVN(Ω̂, Σ_Ω);
4. Calculate the posteriors π_it*;
5. Sample the imputation group t* from π_it*;
6. Sample X_i^m* from f_{i|t*}(X_i^m | X_i^o);
7. i = i + 1; repeat steps 3 through 6 if i ≤ N;
8. s = s + 1; repeat steps 2 through 7 if s ≤ S.

Thus, the S imputed data sets are analyzed, and the results are combined. For moderate fractions of missing information, a relatively small number (e.g., S = 3) of draws suffices. However, data-fusion problems present a class of file-concatenation problems with particularly high fractions of missing information. The average ratios of missing to complete information are as high as γ = 1.5 or γ = 2 (corresponding to fractions of missing information in the range of 1/3 to 1/2). Imputation methods with modest numbers of imputations that work well under common missing value problems (i.e., 3 ≤ S ≤ 5) are not suited to deal with these cases and may have poor sampling properties (Li, Raghunathan, and Rubin 1991). Because the fraction of missing information in data-fusion problems is large, a large number of imputations (e.g., S = 100) must be used.

²From a Bayesian perspective, the covariance matrix of the maximum likelihood estimates obtained by inverting the Hessian provides an approximation to the posterior distribution of the parameters in large samples. Because the sample sizes in data fusion are large, this approximation is fairly accurate in practice.
Testing for Association in the Fused Table

Our multiple imputation procedure produces one cross-tabulation, with cell proportions Φ_s = {φ_11s,..., φ_KjKj's}, for each of the S imputations. The final estimates of the cell proportions in the cross-table are obtained by averaging the S estimates across imputations:

(7) \bar{\Phi} = \frac{1}{S} \sum_{s=1}^{S} \Phi_s.

To test for association between variables j and j' in the fused table, we use the test statistic proposed by Meng and Rubin (1992) in the context of imputation of missing values in surveys, which takes into account the uncertainty arising from the imputation of missing data. This statistic, D_S, for testing the independence assumption of the variables j and j' after data fusion, is computed as

(8) D_S = \frac{D_B}{H(1 + v)},

where v = [(S + 1)/(H(S - 1))](\bar{D}_W - D_B) is the estimated odds of the fraction of missing information with respect to the parameters Φ, and H is the number of degrees of freedom. To compute the statistic D_S, it suffices to have access to the LR test statistics for each of the S fused data sets and the estimates of the joint probabilities, Φ_s, in each of these data sets. D_B is simply the LR test statistic calculated at the average of the S estimates of the joint probabilities, φ̄_kl; D̄_W is the average of the LR statistics calculated at each estimate φ_kls.

Under the null hypothesis, the test statistic D_S follows an approximate asymptotic F_{H,G} distribution, where the denominator degrees of freedom, G, depend on H, S, and the estimated fraction of missing information.

The test statistic D_S is appropriate under the assumption that the fractions of missing information are approximately constant across the parameters (i.e., γ_i = γ for that particular table). Li, Raghunathan, and Rubin (1991) investigate the sensitivity of D_S to this assumption by Monte Carlo simulation and conclude that the test based on D_S asymptotically has the correct level as S → ∞.
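A sketch of the combination rule (our reconstruction of Equations 7 and 8 following Meng and Rubin 1992; the scipy call, the toy tables, and the treatment of the denominator degrees of freedom G as a user-supplied argument are ours, since the article's formula for G is not reproduced above):

```python
import numpy as np
from scipy import stats

def lr_statistic(P, n):
    """Likelihood-ratio (G^2) statistic for independence in a two-way table,
    evaluated at cell proportions P (K_j x K_j'), with total sample size n."""
    expected = np.outer(P.sum(axis=1), P.sum(axis=0))
    mask = P > 0
    return 2.0 * n * (P[mask] * np.log(P[mask] / expected[mask])).sum()

def meng_rubin_D(P_list, n, G=1e6):
    """Combine S imputed cross-tables (Equations 7-8).
    P_list: S arrays of cell proportions phi_s; n: sample size;
    G: denominator degrees of freedom (a very large G reduces the F_{H,G}
    reference distribution to chi-square_H / H)."""
    S = len(P_list)
    P_bar = sum(P_list) / S                               # Equation 7
    H = (P_bar.shape[0] - 1) * (P_bar.shape[1] - 1)       # degrees of freedom
    D_B = lr_statistic(P_bar, n)                          # LR at the averaged proportions
    D_W = np.mean([lr_statistic(P, n) for P in P_list])   # average of per-imputation LRs
    v = (S + 1) / (H * (S - 1)) * (D_W - D_B)             # estimated odds of missing info
    D_S = D_B / (H * (1.0 + v))                           # Equation 8
    return D_S, H, 1.0 - stats.f.cdf(D_S, H, G)

# toy check: two mildly different 2x3 tables of cell proportions
P1 = np.array([[.20, .15, .15], [.10, .20, .20]])
P2 = np.array([[.22, .14, .14], [.09, .21, .20]])
print(meng_rubin_D([P1, P2], n=2000))
```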
EMPIRICAL ILLUSTRATION

We illustrate the proposed statistical data-fusion method on data from a commercial customer satisfaction study. The study pertains to the measurement of customer satisfaction among the most valuable customers of a large multibranch bank in Latin America. The study includes data collected in a survey and from internal records. In the study, complete data from both sources are available on all subjects. However, problems of partial data in independent samples and the need for data fusion may occur in similar studies (e.g., a survey that is conducted among a sample of customers, but internal records are available only for another sample; or cases in which it is not possible to identify the same subjects in the survey and in the internal records). We use the data from the present study because they enable us to simulate the situation in which incomplete data have been collected and to compare the fused data to the actual complete data. Three sets of variables (among many others) have been assessed in the study:

(I) Eight survey measures assessing the staff (e.g., perceived politeness, attentiveness, availability of clerks), which are binary coded ("a little or not at all" versus "a lot"); one measure of bank use; and one measure of overall satisfaction (whether the respondent would recommend the bank to friends) on a five-point scale (from "certainly not" to "certainly yes");
(II) Fifteen variables on demographics (including gender, age, and education) and the use of several of the bank's services (e.g., savings, credit cards, insurance); and
(III) Eight performance variables available from internal records, including the number of transactions per month, the contribution of the account, funds in the bank, the marginal contribution of the account per dollar, and so on.

Observations on all variables are obtained for all 2000 subjects randomly sampled for our tests. We simulate a situation with two samples by randomly assigning the subjects to two groups. We assume that the variables for the first group have been collected by the survey among the bank customers, in which only the variables in Sets I and II have been assessed. For this group, the variables in Set III from the internal records are assumed missing and are not used for estimation. For the second group, we assume that the data are obtained from internal records, so that only the variables in Sets II and III are available. For this group, the satisfaction measurements in Set I are assumed to be unknown and are ignored in the estimation stage.

The main purpose of this treatment of the original data is to enable the assessment of the data-fusion results. This enables us to assess the validity of the proposed imputation procedure. To this end, we compare the imputed cross-tables from the incomplete data with the actual cross-tables from the complete data. We then assess the correspondence between the association tests computed from the complete tables and those obtained from the imputed tables. The first test shows how well the imputed data emulate the complete data. The second test indicates whether the fused tables lead to the same conclusions regarding the association between the two discrete variables in each cross-tabulation.

Estimation Results

The first step in the analysis is to estimate the mixture imputation model. The model was applied to the data for T = 1 to T = 7 imputation groups. In Table 1, we provide the log-likelihood, the CAIC statistic, and the ρ² measure of explained variance (relative to the aggregate model) for these solutions. To overcome problems of local optima, each of the solutions reported is the best among 15 solutions obtained from different random starting values of the posterior membership probabilities. The CAIC statistic indicates that the T = 5 solution provides the best representation of the data, and consequently, we use that solution as the basis of our imputation procedure. In Table 2, we provide the parameter estimates of the T = 5 solution. The imputation groups are well separated; the entropy criterion calculated from the posteriors equals .85.

It is clear that the proposed model is able to merge the two incomplete data sets properly only if the five identified imputation groups represent homogeneous classes of subjects in terms of their demographic profiles, survey responses, and internal records, and if the association among those variables across subjects is captured properly by the differences across these homogeneous groups. To verify this important assumption, the mixture model was applied to each of the two data sets separately, restricting T = 5 for each data set. If the solution obtained from our mixture imputation model properly captures the structure in both data sets, we should expect it to fit the two data sets well, relative to the two models applied to each data set independently.

The goodness-of-fit results presented in Table 3 show a lower CAIC for our model than for the combination of the two independent models. In other words, after accounting for the differences in sample size and the number of parameters being estimated, our mixture imputation model fits the two data sets well relative to the two independent models. This result leads us to the conclusion that the five imputation groups identified by the model properly reflect the data structure in both data sets.

Table 1
LOG-LIKELIHOOD, CAIC, AND ρ² FOR THE T = 1 TO T = 7 IMPUTATION MODELS

T    Log-likelihood    CAIC        ρ²
1    -36484.63         73519.72    .1298
2    -33338.07         67871.01    .1850
3    -32500.19         66668.96    .1970
4    -31954.07         66135.77    .2038
5    -31589.97         65966.62*   .2076
6    -31367.50         66080.76    .2087
7    -31121.15         66147.11    .2110

*Indicates the minimum value of CAIC.
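As a quick arithmetic check (ours), the CAIC entries in Table 1 are consistent with CAIC = -2 ln L + p(ln n + 1), using n = 2000 and, for the T = 5 solution, the p = 324 parameters reported in Table 3.

```python
import math

def caic(loglik, n_params, n):
    """Consistent Akaike Information Criterion (Bozdogan 1987)."""
    return -2.0 * loglik + n_params * (math.log(n) + 1.0)

print(round(caic(-31589.97, 324, 2000), 2))   # about 65966.6, matching the T = 5 row of Table 1
```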

Table 2
PARAMETER ESTIMATES

Set I: Satisfaction Variables (Survey) Segment 2 3 4 5


Segment 2 3 4 5 10. Mutualfund usage
Use .168 .114 .011 .029 .126
I. Would recommend bank tofriends Don't .832 .886 .989 .971 .874
Certainly not .012 .051 .519 .072 .237 II. Tax-deferredfund usage
.005 .128 .202 .044 .253 Use .698 .589 .114 .341 .619
.131 .410 .203 .233 .347 Don't .302 .411 .886 .658 .381
Certainly yes .852 .411 .076 .651 .167 12. Commodities fund usage
2. Clerks are competent Use .449 .273 .042 .172 .390
A bit .005 .398 .829 .014 .964 Don't .551 .727 .958 .828 .610
A lot .995 .602 .171 .986 .036 13. Fixed ratefund usage
3. Clerks are attentive Use .237 .098 .008 .057 .169
A bit .016 .239 .832 .050 .969 Don't .763 .902 .992 .943 .831
A lot .984 .761 .168 .950 .031 14. Auto insurance usage
4. Clerks are polite Use .279 .222 .009 .022 .161
A bit .006 .091 .677 .046 .837 Don't .721 .778 .991 .978 .839
A lot .994 .909 .323 .954 .163 15. Home insurance usage
5. Clerks are fast, agile Use .204 .171 .007 .019 .118
A bit .072 .768 .976 .157 .987 Don't .796 .828 .993 .981 .882
A lot .928 .232 .024 .843 .013
6. Clerks answer questions Set Ill: Performance Variables in Quintiles (Internal Records)
A bit .013 .659 .881 .034 .958
A lot .987 .341 .119 .966 .042 Segment 2 3 4 5
7. Clerks are well informed
I. Number of transactions in respondent's account
A bit .056 .823 .938 .099 .988
1 bottom 20% .088 .013 .355 .504 .008
A lot .944 .177 .062 .901 .012
2 .227 .070 .298 .323 .164
8. Clerks are well groomed
3 .257 .303 .189 .101 .233
A bit .033 .307 .673 .022 .785
4 .199 .279 .100 .023 .262
A lot .967 .693 .327 .978 .215
5 top 20% .229 .334 .059 .048 .333
9. Clerks are available when needed
2. Contribution of respondent's account
A bit .038 .478 .928 .077 .967
I bottom 20% .000 .660 .112 .000 .162
A lot .962 .522 .072 .923 .033
2 .000 .259 .493 .000 .378
10. % offunds in this bank
3 .034 .053 .271 .154 .460
<25% .102 .174 .772 .357 .181
4 .492 .025 .088 .416 .000
25-50% .065 .162 .103 .163 .242 .037 .429
5 top 20% .475 .003 .000
50-75% .132 .111 .038 .158 .096 3. Respondent's funds in bank
75-100% .701 .553 .087 .322 .481 I bottom 20% .000 .340 .833 .000 .000
2 .000 .623 .167 .054 .073
Set II: Demographics and Status Variables (Common Variables) .027 .037 .000 .251 .634
3
Segment 2 3 4 5 4 .378 .000 .000 .337 .284
5 top 20% .595 .000 .000 .358 .009
I. Gender 4. Number of customers in respondent's branch
Male .195 .212 .355 .285 .209 I bottom 20% .127 .241 .174 .084 .174
Female .805 .788 .645 .715 .791 2 .294 .244 .225 .159 .225
2. Age 3 .170 .193 .098 .192 .176
<40 .149 .279 .355 .137 .278 4 .236 .165 .238 .329 .255
40-50 .303 .428 .390 .218 .372 5 top 20% .173 .157 .265 .236 .170
5Q-60 .289 .203 .150 .248 .246 5. Customers/employee in respondent's branch
>60 .259 .090 .105 .397 .104 I bottom 20% .184 .114 .151 .165 .198
3. Education 2 .193 .147 .076 .253 .140
Junior .085 .021 .088 .303 .041 3 .178 .194 .191 .225 .226
High .191 .189 .131 .139 .148 4 .306 .249 .376 .221 .213
College .724 .790 .781 .558 .811 5 top 20% .139 .297 .206 .135 .223
4. Savings usage 6. Customers/managers ratio in respondent's branch
Use .760 .685 .304 .650 .710 1 bottom 20% .187 .075 .189 .143 .195
Don't .240 .315 .696 .350 .290 2 .172 .218 .155 .220 .169
5. Credit card usage 3 .226 .232 .240 .198 .291
Use .368 .304 .080 .062 .252 4 .207 .213 .247 .247 .137
Don't .632 .696 .920 .938 .748 5 top 20% .207 .263 .169 .192 .208
6. Bank card usage 7. Volume of applications per transaction in respondent's account
Use .857 .902 .631 .618 .871 1 bottom 20% .000 .403 .758 .000 .000
Don't .143 .098 .369 .382 .129 2 .000 .597 .227 .005 .080
7. Certificate of deposit usage 3 .080 .000 .014 .149 .750
Use .687 .434 .058 .347 .498 4 .439 .000 .000 .280 .170
Don't .313 .566 .942 .653 .502 5 top 20% .480 .000 .000 .566 .000
8. Special checking usage 8. Marginal contribution (per $) of respondent's account
Use .936 .915 .447 .567 .897 1 bottom 20% .000 .689 .359 .000 .011
Don't .064 .085 .553 .433 .103 2 .048 .215 .100 .028 .688
9. Automatic bill payment usage 3 .434 .012 .010 .317 .157
Use .773 .695 .240 .437 .687 4 .437 .036 .025 .204 .144
Don't .227 .305 .760 .563 .313 5 top 20% .081 .047 .506 .451 .000

Table 3
TESTING THE FIVE-GROUP SOLUTION

Model                                 Log-Likelihood    Number of Parameters    Sample Size    CAIC
First sample (demo + perceptions)     -13792.56         164                     1000           28718
Second sample (demo + performance)    -17687.60         254                     1000           37130
First + Second Samples                -31480.16         418                     2000           66138
Proposed Model                        -31589.97         324                     2000           65643

This conclusion was confirmed in a direct comparison of the parameter estimates. The estimates from the proposed model (Table 2) were close to those obtained from the two independent data sets, with a mean absolute deviation (MAD) of .032 across all response probabilities for the demographic variables, a MAD of .009 for the response probabilities of the ten items from the survey, and a MAD of .028 for the eight performance variables from internal records. This model comparison can be seen as a test that indicates whether the imputation procedure will work well. If the results from the two independent mixture models differ substantially from the results of the mixture imputation model, the imputation procedure is likely not to produce good results in that particular application.

Although the actual interpretation of the imputation groups is not of primary interest, we provide a brief description of the results in Table 2. Subjects in Group 1 (22.3%) have a positive opinion about the clerks, a high probability of recommending the bank to friends, and a high percentage of their funds in the bank. They are users of most of the services of the bank, are older, more often female, and more educated. Consistent with their satisfaction scores, they have a high volume and contribution of transactions, have a large percentage of funds in the bank, and are customers of branches of the bank with low customer/employee ratios.

Subjects in Group 2 (21.8%) have a mixed opinion of the clerks and score positive on some aspects and negative on others. They have a moderate percentage of funds in the bank and are moderately inclined to recommend the bank to friends. They are younger, female, and users of many of the bank's services. However, they provide a low contribution and volume, have a low percentage of funds in the bank, and are customers of branches with high customer/employee ratios.

Imputation Group 3 (15.7%) consists of dissatisfied consumers. They have negative opinions about the clerks in general, a low percentage of funds in the bank, and a low probability of recommending the bank to friends. This group contains more males than the other groups; members are relatively young and use fewer of the bank's services. Consistent with their opinions about the bank, the members of this group have a low number and volume of transactions, a low percentage of funds in the bank, and are customers of branches with high customer/employee ratios.

Group 4 (19.8%) consists of older, less educated people, and more of them are males compared with the other groups. They have positive opinions about the clerks and would recommend the bank to friends. They use only specific services of the bank (savings and bank cards). The number of their transactions is low, but the volume and contribution of funds is high.

Customers in Group 5 (20.4%) have negative opinions of the clerks and would not recommend the bank to friends. They are younger, more educated, and are moderate users of the bank's services. Although they have a high number of transactions, their contribution and volume of funds are low to moderate. They are customers at branches with high customer/employee ratios, which may cause their negative opinions about the clerks.

Imputation Results

To assess the degree of fit between the tabulations of actual data and our imputed estimates, we processed all 80 possible pairs of the 10 survey variables and 8 internal measures and compared the imputed cell frequencies in these tables with the actual frequencies. As was expected, the fit between the cell frequencies from the "fused" tables and those from the complete data is not perfect, because the imputed cell frequencies were obtained from incomplete data. Nevertheless, the imputation procedure performs well and produces cell frequencies that are highly correlated (r = .97) with those from the actual tables across all 960 cells of the 80 two-way tables. In Figure 2, we show a scatterplot of the actual and imputed cell frequencies in which each observation is labeled by the number of the column variable (from Set III) in the cross-tabulation.

However, this association between actual and imputed cell frequencies might overestimate the ability of the imputations to reflect the true association among the variables being cross-tabulated. Part of this association could be due to the model's ability to fit the row and column margins of the cross-tables.³ To partial out the row and column margins, we transformed the actual and imputed cell frequencies into deviations from the expected frequencies under the hypothesis of independence between rows and columns. After this transformation, the correlation between actual and imputed adjusted cell frequencies was reduced to r = .78.

The proposed data-fusion procedure can be assessed further by comparing the results of the tests of association from the imputed cross-tabulations with those obtained from the actual (i.e., complete) ones. In Table 4, we summarize the results of these tests across all 80 cross-tabulations at the .05 and .10 significance levels. We can see that these tests agreed in 59 (74%) and 56 (70%) of the 80 tables at the .05 and .10 significance levels, respectively. Compared to the complete tables, the significance tests in the imputed tables were more likely to produce false negatives (i.e., accepting independence when there was association in the complete data; 17 of 50 cases) than false positives (4 of 30 cases). In other words, the tests of association based on the fused tables are more conservative than those based on the complete data.

³We thank an anonymous reviewer for pointing out this problem to us.
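The margin adjustment described above amounts to the following computation (a sketch with made-up frequencies, not the study data): each cell count is replaced by its deviation from the count expected under row-column independence before the imputed and actual tables are correlated.

```python
import numpy as np

def deviations_from_independence(counts):
    """Cell counts minus the counts expected under row-column independence."""
    counts = np.asarray(counts, dtype=float)
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / counts.sum()
    return counts - expected

# illustrative 2x3 tables standing in for one actual and one imputed cross-table
actual = np.array([[60, 30, 10], [20, 40, 40]])
imputed = np.array([[55, 33, 12], [24, 37, 39]])
d_act = deviations_from_independence(actual).ravel()
d_imp = deviations_from_independence(imputed).ravel()
print(np.corrcoef(d_act, d_imp)[0, 1])    # analogue of the r = .78 reported above
```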

Figure 2
COMPARISON OF ACTUAL AND IMPUTED FREQUENCIES (DEVIATIONS FROM INDEPENDENCE)
Panels (actual versus imputed deviations): Proposed Model (r = .78); Hot Deck-Simple Matching (r = .66); Hot Deck-Age, Gender, and Education as Critical (r = .63); Hot Deck-Age and Gender Critical (r = .63).

This is expected because the fused tables are obtained from incomplete data containing much less information on the association of the variables, which is expected to reduce the power of the tests. On the basis of the empirical results in Table 4, the relative power of the Meng and Rubin test compared with the LR test on the complete data is approximately 60-65% on average.

We further illustrate the imputation procedure by focusing on a selection of cross-classifications of the variables in Set I and Set III. The purpose is to show that the statistical tests on the imputed and complete data correspond, and we also infer the same substantive conclusions from the imputed data as would be inferred from the original complete data. First, we cross-classify the satisfaction survey variables I-5 "Clerks are fast/agile" and I-9 "Clerks are available when needed" from Set I with the variables III-1 "Respondent's number of transactions," III-3 "Respondent's funds in the bank," III-4 "Number of customers in respondent's branch," and III-5 "Ratio of customers per employee in respondent's branch" from Set III. These cross-classifications would be requested by managers to provide insight into (1) how the evaluation of the bank's clerks, obtained from the survey, relates to clients' frequency and volume of service use, obtained from internal records, and (2) whether certain characteristics of a branch (number of customers and ratio of customers per employee) are related to customers' assessments of its personnel.

Table 4
SUMMARY OF TESTS OF ASSOCIATION FOR ALL 80 CROSS-TABULATIONS

Conclusions at 5% Significance Level
                        Imputed Association    Imputed Independence    Total Actual
Actual Association      33                     17                      50
Actual Independence     4                      26                      30
Total Imputed           37                     43                      80

Conclusions at 10% Significance Level
                        Imputed Association    Imputed Independence    Total Actual
Actual Association      33                     21                      54
Actual Independence     3                      23                      26
Total Imputed           36                     44                      80

Second, we cross-tabulate the survey variables (Set I) I-1 "Would recommend the bank to friends" and I-10 "Percentage of funds the customer states to have in this bank" with the variables from the internal records (Set III) III-1 "Respondent's number of transactions," III-2 "Contribution of the respondent's account," III-3 "Respondent's funds in the bank," III-7 "Respondent's applications per transaction," and III-8 "Marginal contribution of the respondent's account (per $)." It could be of interest to managers to relate overall satisfaction measures from the survey to several specific performance variables in this way.

In Table 5, we display the LR test statistic and the corresponding p-value calculated from the actual cross-classification, as well as the results for the test statistic D_S and the corresponding p-values calculated from the imputed data (using 100 imputations). The table also shows γ/(1 + γ), the estimated effective fraction of missing information for each cross-table. Table 5 shows that the estimated fractions of effective missing information vary across the cross-tables but are in all cases lower than .55. In four of the cross-tables, the test statistic on the actual data does not reject the null hypothesis of independence at p < .05. The test statistic from the imputed data agrees with these results, which leads us to conclude that "Clerks are fast, agile" is not related to the number of customers in the respondent's branch. "Clerks are available when needed" is not associated with the respondent's total number of transactions or with the number of customers or the ratio of customers per employee in the respondent's main branch. However, both of these attitudes toward clerks are associated with the respondent's funds in the bank; the tests on both the actual and the imputed data support these findings. The associations of "Clerks are fast, agile" with the number of transactions and with the customer/employee ratio are not supported by the more conservative tests on the fused data.

On the basis of both the actual and fused tables, we conclude that the indirect measures of satisfaction (i.e., whether the customer would recommend the bank to friends and the percentage of funds the customer allocates to this bank) are significantly associated with the customer's amount of funds in the bank, the volume of applications per transaction, and the marginal contribution of the account per unit of deposits. The actual tables also lead to the conclusion that the indicators of customer satisfaction are related to the respondent's number of transactions with the bank and to the total contribution of his or her account to the bank. However, these last two conclusions are not supported by the more conservative tests obtained from the incomplete data.

We inspect four of the cross-classification tables in detail to reveal further strengths and weaknesses of the proposed procedure. We show two cross-tables for which the tests on the imputed data correspond to the tests on the actual data and two tables for which this is not the case. Table 6 depicts the results of the cross-classifications of the survey variables I-1 "Would recommend bank to friends" and I-10 "% of funds in the bank" with the internal record variable III-1 "Number of transactions." The estimates of the φ_kl are obtained by averaging the S estimates across imputations (see Equation 7). Table 6 shows the estimates of the joint classification probabilities, φ_kl, from the actual data and the imputed data. These probabilities are expressed as percentages of the grand total. Note from Table 5 that the association of I-10 and III-1 (right-hand panel in Table 6) is significant for both the actual and the imputed data, whereas the association of I-1 and III-1 (left-hand panel in Table 6) is significant for the actual table but not for the imputed data. Apparently, the variability across the 100 imputations in the latter case causes the association not to be significant for the incomplete data. The left-hand panel of Table 6 shows that the customers who would recommend the bank to friends execute more transactions per month (i.e., are heavy users of the bank). Although the results of the significance tests differ, the percentages calculated from the imputed and actual data are close. The right-hand panel of Table 6 shows a positive association between having more funds in the bank and the number of transactions per month, which is significant in both the imputed and actual tables. The actual (in parentheses) and imputed percentages also are relatively close. Table 7 shows the cross-classifications of the survey variables I-5 "Clerks are fast/agile" and I-9 "Clerks are available when needed" with the "Ratio of customers to employees" from the internal records (III-5).

Table 5
χ² TESTS ON THE ACTUAL DATA (LR) AND THE MENG-RUBIN TEST (D_S) ON THE FUSED DATA

Set I (Survey)    Set III (Internal Records)    Df    LR    p-value    γ/(1+γ)    D_S    p-value
Clerks are fast, agile Respondent's number of transactions 4 18.3 .00 .37 .08 .98
Clerks are fast, agile Respondent's funds in the bank 4 51.4 .00 .33 10.3 .00
Clerks are fast, agile Number of customers in respondent's branch 4 4.55 .34 .38 .46 .77
Clerks are fast, agile Customers/employee in respondent's branch 4 56.4 .00 .33 1.26 .30
Clerks are available when needed Respondent's number of transactions 4 7.89 .10 .39 .08 .99
Clerks are available when needed Respondent's funds in the bank 4 28.6 .00 .53 6.24 .00
Clerks are available when needed Number of customers in respondent's branch 4 1.79 .77 .39 .34 .85
Clerks are available when needed Customers/employee in respondent's branch 4 6.36 .17 .33 1.09 .36
Would recommend the bank Respondent's number of transactions 12 65.8 .00 .25 .75 .70
Would recommend the bank Contribution of respondent's account 12 105.6 .00 .19 2.97 .00
Would recommend the bank Respondent's funds in the bank 12 132.3 .00 .29 5.40 .00
Would recommend the bank Respondent's volume/transaction 12 110.4 .00 .24 5.21 .00
Would recommend the bank Respondent's unit contribution (per $) 12 34.3 .00 .21 2.40 .00
% of funds in the bank Respondent's number of transactions 12 165.8 .00 .12 3.64 .00
% of funds in the bank Contribution of respondent's account 12 113.9 .00 .17 1.38 .18
% of funds in the bank Respondent's funds in the bank 12 285.6 .00 .25 4.72 .00
% of funds in the bank Respondent's volume/transaction 12 274.8 .00 .18 4.31 .00
% of funds in the bank Respondent's unit contribution (per $) 12 115.3 .00 .12 3.39 .00

Table 6
CELL PERCENTAGES OF IMPUTED AND ACTUAL (IN PARENTHESES) TABLES CROSSING COLUMN VARIABLES WITH "NUMBER OF TRANSACTIONS" (III-1, IN QUINTILES)

                 I-1. "Recommend bank to friends"         I-10. "% of funds in bank"
III-1            1. No    2      3      4. Yes            0-25    25-50    50-75    75-100
1 bottom 20%     3.8      2.1    4.5    7.6               7.6     2.5      1.9      6.2
                 (4.0)    (2.1)  (4.7)  (7.8)             (9.1)   (3.6)    (1.9)    (4.4)
2                3.5      2.5    5.5    9.4               6.8     3.1      2.3      8.9
                 (3.6)    (3.2)  (4.0)  (9.2)             (6.8)   (2.9)    (2.8)    (7.7)
3                3.3      2.7    6.0    9.9               5.8     3.2      2.4      10.5
                 (3.1)    (3.7)  (6.5)  (8.7)             (4.8)   (3.3)    (2.6)    (11.1)
4                2.4      2.1    4.8    8.2               4.3     2.7      2.0      9.0
                 (2.2)    (2.8)  (4.7)  (8.4)             (4.6)   (2.3)    (1.7)    (9.9)
5 top 20%        2.7      2.5    5.8    9.8               4.5     3.1      2.3      10.8
                 (1.9)    (1.6)  (4.8)  (12.4)            (3.3)   (2.2)    (2.2)    (13.0)

Table 7
CELL PERCENTAGES OF IMPUTED AND ACTUAL (IN PARENTHESES) TABLES CROSSING SATISFACTION VARIABLES WITH "CUSTOMER/EMPLOYEE RATIO IN RESPONDENT'S BRANCH" (III-5, IN QUINTILES)

                 I-5. "Clerks are fast"          I-9. "Clerks are available"
III-5            A bit          A lot            A bit          A lot
1 bottom 20%     9.9 (9.3)      7.8 (7.1)        7.6 (7.1)      8.7 (9.6)
2                9.3 (7.4)      8.6 (9.3)        6.9 (5.2)      9.7 (11.6)
3                12.4 (10.9)    9.5 (11.1)       9.4 (8.5)      11.0 (13.6)
4                17.6 (15.8)    11.5 (8.6)       13.3 (13.5)    13.4 (10.8)
5 top 20%        13.2 (14.0)    8.5 (6.4)        10.0 (11.0)    10.0 (9.2)

when needed" with the "Ratio of customers to employees" number of groups might fail to capture the true differences
from the internal records (III-5). From Table 5, it can be in response across individuals. We used the CAlC statistic to
seen that the association of 1-5 and III-5 is significant for the select the proper number of imputation groups. However,
actual but not for the imputed tables. The association of 1-9 this is only a heuristic procedure, because the asymptotic
and III-5 is not significant for either actual or imputed ta- properties of likelihood-based tests, including CAlC, do not
bles. The left-hand panel of Table 7 reveals intuitive results: hold when testing for T versus T + I classes. Therefore, we
When the ratio of customers to employees increases, the further investigated the sensitivity of the procedure to the
percentage of customers who rate clerks as "fast and agile" number of imputation groups. For that purpose, we applied
decreases for both the imputed and the actual data. The per- our imputation procedure to the same empirical data, with T
centages calculated from the imputed and actual data in the = 4, T = 5, and T = 6 imputation groups. The imputed cell
left-hand panel of Table 7 are relatively close. Although the frequencies fit fairly well to the actual tables, with correla-
average imputed cell counts show some association between tions of .96, .97, and .96 for 4, 5, and 6 groups, respective-
the two variables, the high variance observed across all 100 ly. After transformation into deviations from the indepen-
imputations leads to an inconclusive test for this association. dence hypothesis, the correlations were .75, .78, and .72, re-
The right-hand panel of Table 7 shows that the imputed and spectively. The LR (from the complete tables) and D-tests
actual percentages in the cross-table of "Clerks are avail- (from the imputed tables) agree in 59, 59, and 52 of all 80
able" and customer/employee ratio are close and that there tables (at a 5% significance level), respectively. These re-
is no association between these two variables, which was sults show that the T = 5 solution indeed presents the opti-
confirmed by the tests on both the actual and imputed tables. mal value in this application.
We also compared our data-fusion method to several im-
Sensitivity, Model Comparisons, and Statistical Power plementations of the hot-deck approach. Although the
We investigate the sensitivity of the proposed data-fusion choice of the critical matching variables is somewhat arbi-
method to the number of imputation groups and compare it trary, gender and age are used as such in virtually all pub-
to the traditional hot-deck approach. We address the issue of lished applications of the hot-deck procedure. For that rea-
the number of imputation groups. As noted previously, the son they also were used as critical matching variables in this
quality of the data fusion does not necessarily increase with study. Donor subjects in sample B were matched to the re-
the number of imputation groups; a too-large number of im- cipients in sample A to replace their missing observations,
putation groups, might capitalize on chance and lead to un- and vice versa. Matches were found for subjects who be-
stable and extreme estimates of response probabilities with- longed to the same age and gender groups and for whom the
in classes (so called boundary effects), whereas a too-small number of equal responses was the highest across the vari-
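A sketch of this matching rule follows; it is an illustration only, not the implementation used in the study, and the record layout (dicts with keys "age_group", "gender", "set2", and "set3") is assumed by us:

    def hot_deck_match(recipient, donors):
        # Critical matching variables: same age group and gender; among those donors,
        # pick the one with the largest number of equal responses on the common Set II.
        candidates = [d for d in donors
                      if d["age_group"] == recipient["age_group"]
                      and d["gender"] == recipient["gender"]]
        if not candidates:
            return None
        return max(candidates,
                   key=lambda d: sum(a == b for a, b in zip(d["set2"], recipient["set2"])))

    # The matched donor's unique (Set III) responses are imputed for the recipient;
    # the same donor may serve several recipients.
    donors = [{"age_group": 2, "gender": "F", "set2": [1, 3, 2, 2], "set3": [5, 1]},
              {"age_group": 2, "gender": "F", "set2": [1, 2, 2, 2], "set3": [4, 2]}]
    recipient = {"age_group": 2, "gender": "F", "set2": [1, 3, 2, 1]}
    donor = hot_deck_match(recipient, donors)
    imputed_set3 = donor["set3"] if donor else None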

We first implemented the hot-deck procedure using all variables in the common Set II as matching variables. This criterion produced a correlation of .66 between actual and imputed cell deviations (from expected frequencies under independence) across all 980 cells in the 80 cross-tabulations. When age was added as a critical matching variable, the correlation obtained was still .66. When gender and age were used as critical matching variables (along with the remaining variables in the common Set II as matching variables), the correlation was slightly lower, r = .63. The same correlation was obtained when education was added to gender and age as critical matching variables. All results from the hot-deck implementations were lower than the correlation of .78 obtained with our (T = 5) model-based approach.

In Figure 2, we present the scatterplots of the actual and imputed frequencies, calculated as deviations from independence, for the proposed model and the three hot-deck procedures. These scatterplots underline the superior performance of the proposed model. Note that the correlations are somewhat inflated because of the presence of two extreme observations (column variables 3, funds in the bank, and 7, volume per transaction), and that the relative performance of our model over the hot-deck procedure increases even further if these observations are deleted. These results support the proposed mixture procedure over the hot-deck method.

From our empirical application, it appears that the tests on the data with missing observations are more conservative than the tests on the original data. This is caused by the fact that a good fraction of the information is missing. The reduction in power can be observed from Equation 8. If the fraction of missing information, γ, is known, the test statistic D_s = D_B/[H(1 + γ)] ~ χ²_H/H. In addition, this approximately holds as S → ∞; that is, H·D_s ~ χ²_H. Consequently, in our application we can approximate the test statistic D_s by a chi-square distribution on H degrees of freedom, because we use a large number of imputations. Because the average of the S imputed estimates converges to the complete-data estimate according to the strong law of large numbers (Li, Raghunathan, and Rubin 1991), D_B converges to the complete-data LR statistic. An indication of the effect of increasing fractions of missing information on the power of the test can be evaluated as follows: The power of the complete-data LR test statistic is P_c = Pr{χ²_H(λ) ≥ c_α}, with χ²_H(λ) a noncentral chi-square distribution with noncentrality parameter λ and c_α the critical value of the chi-square distribution with size α. The power of the Meng-Rubin test statistic is approximately P_s = Pr{χ²_H(λ) ≥ (1 + γ)c_α}. This expression shows the effect of the fraction of missing information. To assess that effect, we evaluate the power of the Meng-Rubin test as a function of λ for two values of H (H = 4 and H = 12), consistent with the degrees of freedom in our empirical application, and for several values of γ in the range .0-.5. Figure 3 shows the results and enables us to evaluate the power as a function of the fraction of missing information. For example, if H = 4, λ = 12, and γ = .2, the power loss relative to the complete data, defined as Δ = (P_c − P_s)/P_c, is 20%. If H = 12, λ = 30, and γ = .3, the relative power loss is Δ = 31%. Figure 3 shows that the loss in power is modest for small fractions of missing information, in the range .0-.3, but increases disproportionately and is substantial for higher fractions of missing information. In addition, the effect of missing information is much larger in tests for which the degrees of freedom are larger (H = 12 versus H = 4). Therefore, it is advisable to concentrate the tests on a small number of degrees of freedom by recoding the variables into a small number of categories if possible.

Figure 3
POWER PLOTS
[Two panels, not reproduced here: "Power of the Meng-Rubin Test With Missing Data (4 degrees of freedom)" and "Power of the Meng-Rubin Test With Missing Data (12 degrees of freedom)." Each panel plots power against the noncentrality parameter λ (0 to 96) for fractions of missing information between .0 and .5.]
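Curves such as those in Figure 3 follow from the two power expressions above. A minimal sketch, assuming a 5% test size and SciPy's parameterization of the noncentral chi-square for χ²_H(λ); because the exact conventions behind Figure 3 are not reproduced here, the resulting values are indicative only:

    import numpy as np
    from scipy.stats import chi2, ncx2

    def lr_power(H, lam, gamma=0.0, alpha=0.05):
        # Pr{chi2_H(lam) >= (1 + gamma) * c_alpha}; gamma = 0 gives the complete-data
        # LR test, gamma > 0 the Meng-Rubin test with fraction gamma of missing information.
        c = chi2.ppf(1 - alpha, H)
        return ncx2.sf((1 + gamma) * c, H, lam)

    lam = np.arange(4, 97, 4)                  # noncentrality values, as on the x-axis of Figure 3
    for H in (4, 12):                          # degrees of freedom used in the application
        for gamma in (0.0, 0.1, 0.3, 0.5):     # fractions of missing information
            print(H, gamma, np.round(lr_power(H, lam, gamma), 2))

    # Relative power loss (P_c - P_s) / P_c for a single configuration:
    loss = 1 - lr_power(4, 12, gamma=0.2) / lr_power(4, 12)
    print(round(float(loss), 2))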

CONCLUSIONS AND DISCUSSION

Data fusion is a recent innovation for market researchers and has been applied and developed predominantly in practice over the last ten years. Problems that lend themselves to data fusion are relatively common in marketing research. The demands placed on the market research industry with respect to the collection of data are constantly increasing. Managers need insights not only into the buying behavior of their potential customers, but also into their demographic and socioeconomic profiles, lifestyles, attitudes, media exposure, and so on. The burden on consumers and businesses of extensive mail and telephone interviews is constantly increasing. Consequently, the response rates and the quality of the data are declining. Conversely, the costs of collecting complete single-source data are often prohibitive. Therefore, the need to combine data from various sources is increasing, and data fusion presents a potential solution to the problem. Fused databases have been used frequently in practice, particularly in geo-demographic analyses and media planning (Adamek 1994). Nevertheless, data fusion is not superior to single-source data but can be used as a viable alternative when single-source data are unavailable or too expensive to collect.

We have proposed an approach to data fusion that presents the possibility of performing appropriate statistical tests on fused databases, specifically for the common situation in which nominal variables, collected from different samples, are to be cross-tabulated. The procedure is model-based and formulated within a finite-mixture framework. It is a multivariate approach to file concatenation, which takes into account all of the available information in imputing the missing observations. It enables us to cross-classify directly variables that are unique to two different samples and to calculate simple statistical tests of association while accounting for the uncertainty caused by the data-fusion process. Because of the relatively simple computational procedures for imputation, several imputations can be performed, which is required given the large amount of missing information. Moreover, given the closed-form expressions for the parameter estimates in the EM estimation algorithm, the procedure is computationally feasible for the large data sets that commonly occur in data-fusion problems. The proposed procedure overcomes some of the limitations of the traditional hot-deck approach, such as the subjective definition of matching criteria and the lack of statistical properties of the fused data. Using complete data, we provide some evidence of the validity of the proposed procedure. The tests of association in each of the 80 cross-tabulations show the same conclusions for the complete and imputed tables in 74% of the tables. A comparison of the 80 imputed tables with those obtained from complete data showed a correlation of .97 across all 980 cell frequencies. In addition, the procedure outperformed the previously used hot-deck approach, which produced a correlation of .91 in the same comparison.

The imputation groups in the proposed model could possibly be interpreted as unobserved segments. However, the purpose of our analysis is not to identify and interpret segments but to derive a convenient set of imputation groups. If the three sets of variables are considered distinct segmentation bases, a latent Markov mixture model such as the one proposed by Ramaswamy (1997) is theoretically preferable to our model, because it accounts for the fact that different but related segments underlie the three sets of segmentation bases, whereas our model assumes that one single set of imputation groups underlies all variables. If there were no missing data for two of the segmentation bases for all members of one sample, our model would be a special case of Ramaswamy's model and could be obtained from it by imposing the restriction that the segment-switching matrix is diagonal. The model proposed by Ramaswamy (1997), as derived from earlier work by Poulsen (1990), is a useful one for dealing with multiple segmentation bases. However, in the context of data fusion, there are two fundamental differences between our approach and Ramaswamy's. First, our approach considers three sets of variables (i.e., segmentation bases). Second, two of the segmentation bases are observed only in one of the samples. This means that Ramaswamy's model must be extended to accommodate more than two segmentation bases and to incorporate the ignorable missing data structure, along the lines discussed here, before it can be applied in the context of data fusion. We leave this extension for further research, because several problems must be resolved. In particular, as noted by Ramaswamy (1997), the extension to more than two bases complicates the identification of the optimal numbers of segments in his approach. For example, if the number of segments underlying the three types of variables varies from one to seven, then in the latent Markov model 343 (= 7³) different sets of solutions must be inspected for the optimal numbers of segments (compared to only 7 for the imputation mixture model).

Theoretically, an important determinant of the quality of a data fusion is the number of common variables and the strength of their relationships with the variables in the two samples. The final quality of the data fusion therefore may be strongly dependent on these common variables. This holds for all data-fusion methods. In fact, no statistical procedure for data fusion can assess the partial correlation of the two sets of unique variables given the common variables, because the information on their partial association is not present in the data. Therefore, any procedure for data fusion will break down when the two sets of unique variables show a high degree of residual association when the common variables are accounted for. This appears to be an important avenue for further research.

Finally, the statistical literature has not provided a completely satisfactory treatment of the problem of selecting the optimal number of mixture components. In our empirical application, it appears that the CAIC statistic performed satisfactorily and that the results were not sensitive to small deviations from the optimal number of components. Further research may resolve the problem of the selection of the number of components in mixture models. We suggest that further research should address the problems of the choice of the common variables, the optimal number of imputation groups, and the power of the test statistic as a function of the amount of missing information for small numbers of imputations.

APPENDIX

Here we provide the main steps of the EM algorithm:

1. At the first step of the iteration, h = 0, the algorithm is initialized by fixing the number of imputation groups, T, and generating a (random) starting partition of posterior probabilities, π_it^(0).

2. Given the π_it^(h), estimates of the prior probabilities are obtained from

(A1)    \hat{\eta}_t = \sum_{i=1}^{N+M} \pi_{it} \big/ (N + M).

3. The θ_jkt^(h) are estimated in three steps for the set of variables unique to sample A, the common variables, and the set of variables unique to sample B, from

(A2)    \hat{\theta}_{jkt} = \frac{\sum_{i=1}^{N} \pi_{it} x_{ijk}}{\sum_{i=1}^{N} \pi_{it}}, \qquad j = 1, \ldots, P,

(A3)    \hat{\theta}_{jkt} = \frac{\sum_{i=1}^{N+M} \pi_{it} x_{ijk}}{\sum_{i=1}^{N+M} \pi_{it}}, \qquad j = P + 1, \ldots, P + Q,

(A4)    \hat{\theta}_{jkt} = \frac{\sum_{i=N+1}^{N+M} \pi_{it} x_{ijk}}{\sum_{i=N+1}^{N+M} \pi_{it}}, \qquad j = P + Q + 1, \ldots, P + Q + R.

4. Convergence test: stop if the change in the log-likelihood from iteration (h − 1) to iteration (h) is sufficiently small; otherwise:

5. Increment the iteration index, h ← h + 1, and calculate new estimates of the posterior membership probabilities according to

(A5)    \pi_{it} = \frac{\eta_t \prod_{j=1}^{P+Q} f_{ij|t}(x_{ij} \mid \theta_t)}{\sum_{t'=1}^{T} \eta_{t'} \prod_{j=1}^{P+Q} f_{ij|t'}(x_{ij} \mid \theta_{t'})}, \qquad i = 1, \ldots, N,

and

(A6)    \pi_{it} = \frac{\eta_t \prod_{j=P+1}^{P+Q+R} f_{ij|t}(x_{ij} \mid \theta_t)}{\sum_{t'=1}^{T} \eta_{t'} \prod_{j=P+1}^{P+Q+R} f_{ij|t'}(x_{ij} \mid \theta_{t'})}, \qquad i = N + 1, \ldots, N + M.

Go to step 2.
REFERENCES

Adamek, James (1994), "Fusion: Combining Data From Separate Sources," Marketing Research: A Magazine of Management and Applications, 6 (Summer), 48-50.
Aitkin, M. and Donald B. Rubin (1985), "Estimation and Hypothesis Testing in Finite Mixture Distributions," Journal of the Royal Statistical Society, Series B, 47, 67-75.
Antoine, Jacques and Gilles Santini (1987), "Fusion Techniques: Alternative to Single Source Methods?" European Research, 15 (August), 178-87.
Baker, Ken, Paul Harris, and John O'Brien (1989), "Data Fusion: An Appraisal and Experimental Evaluation," Journal of the Market Research Society, 31 (2), 152-212.
Bozdogan, Hamparsum (1987), "Model Selection and Akaike's Information Criterion (AIC): The General Theory and Its Analytical Extensions," Psychometrika, 52 (3), 345-70.
Buck, Stephan (1989), "Single Source Data-The Theory and the Practice," Journal of the Market Research Society, 31 (4), 489-500.
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B, 39 (1), 1-38.
DeSarbo, Wayne S. and William L. Cron (1988), "A Maximum Likelihood Methodology for Clusterwise Linear Regression," Journal of Classification, 5 (2), 249-82.
——— , Michel Wedel, Marco Vriens, and Venkatram Ramaswamy (1992), "Latent Class Metric Conjoint Analysis," Marketing Letters, 3 (3), 273-88.
Dillon, William and A. Kumar (1994), "Latent Structure and Other Mixture Models in Marketing: An Integrative Survey and Review," in Advanced Methods of Marketing Research, Richard P. Bagozzi, ed. Cambridge, MA: Blackwell Publishers, 295-351.
Ford, Barry (1983), "An Overview of Hot-Deck Procedures," in Incomplete Data in Sample Surveys, Vol. II, Part 2: Theory and Bibliographies, W. G. Madow, I. Olkin, and D. B. Rubin, eds. New York: Academic Press, 185-207.
Li, K. H., T. E. Raghunathan, and Donald B. Rubin (1991), "Large-Sample Significance Levels From Multiply Imputed Data Using Moment-Based Statistics and an F Reference Distribution," Journal of the American Statistical Association, 86 (416), 1065-73.
Little, R. J. A. and Donald B. Rubin (1987), Statistical Analysis with Missing Data. New York: John Wiley & Sons.
Meng, Xiao-Li and Donald B. Rubin (1992), "Performing Likelihood-Ratio Tests With Multiply Imputed Data Sets," Biometrika, 79 (1), 103-11.
O'Brien, Sarah (1991), "The Role of Data Fusion in Actionable Media Targeting in the 1990s," Marketing & Research Today, 19 (February), 15-22.
Poulsen, Carsten Stig (1990), "Mixed Markov and Latent Markov Modeling Applied to Brand Choice Behavior," International Journal of Research in Marketing, 7, 5-19.
Ramaswamy, Venkatram (1997), "Evolutionary Preference Segmentation With Panel Survey Data: An Application to New Products," International Journal of Research in Marketing, 14 (1), 57-80.
Roberts, Andrew (1994), "Media Exposure and Consumer Purchasing: An Improved Data Fusion Technique," Marketing and Research Today, 22 (August), 159-72.
Rogers, Willard L. (1984), "An Evaluation of Statistical Matching," Journal of Business and Economic Statistics, 2 (January), 91-105.
Rubin, Donald B. (1976), "Inference and Missing Data," Biometrika, 63, 581-92.
——— (1986), "Statistical Matching and File Concatenation With Adjusted Weights and Multiple Imputations," Journal of Business and Economic Statistics, 4 (1), 87-94.
——— and Dorothy Thayer (1978), "Relating Tests Given to Different Samples," Psychometrika, 43 (1), 3-10.
Titterington, D. M., A. F. M. Smith, and U. E. Makov (1985), Statistical Analysis of Finite Mixture Distributions. New York: John Wiley & Sons.
Wedel, Michel and Wayne S. DeSarbo (1994), "A Review of Recent Developments in Latent Class Regression Models," in Advanced Methods of Marketing Research, Richard P. Bagozzi, ed. Cambridge, MA: Blackwell Publishers, 352-88.
——— and ——— (1995), "A Mixture Likelihood Approach for Generalized Linear Models," Journal of Classification, 12 (1), 21-56.