Parallel Fractional Hot Deck Imputation and Variance Estimation for Big
Incomplete Data Curing
Abstract—The fractional hot-deck imputation (FHDI) is a general-purpose, assumption-free imputation method for handling multivariate missing data by filling each missing item with multiple observed values, without resorting to artificially created values. The corresponding R package FHDI [J. Im, I. Cho, and J. K. Kim, "An R package for fractional hot deck imputation," R J., vol. 10, no. 1, pp. 140-154, 2018] retains this generality and efficiency, but it is not adequate for tackling big incomplete data because of its excessive memory requirement and long running time. As a first step toward curing big incomplete data with the FHDI, we developed a new parallel version of the fractional hot-deck imputation program (named P-FHDI) suitable for large incomplete datasets. Results show a favorable speedup when the P-FHDI is applied to big datasets with up to millions of instances or 10,000 variables. This paper explains the detailed parallel algorithms of the P-FHDI for datasets with large numbers of instances (big-n) or high dimensionality (big-p) and confirms its favorable scalability. The proposed program inherits all the advantages of the serial FHDI and enables parallel variance estimation, which will benefit a broad audience in science and engineering.

Index Terms—Parallel fractional hot-deck imputation, incomplete big data, multivariate missing data curing, parallel Jackknife variance estimation
Iowa State University, Ames, IA 50011. E-mail: jkim@iastate.edu.
Manuscript received 2 Dec. 2019; revised 13 July 2020; accepted 30 Sept. 2020.
(Corresponding author: In Ho Cho.)
Recommended for acceptance by M. A. Cheema.
Digital Object Identifier no. 10.1109/TKDE.2020.3029146
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

…authors of this study [1] to perform the fractional hot-deck imputation and also the fully efficient fractional imputation (FEFI) [17]. The FHDI can cure multivariate, general incomplete datasets with irregular missing patterns without resorting to distributional assumptions or expert statistical…
2 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
…prior investigation [2] underpins the positive impact of the P-FHDI on big-data oriented machine learning and statistical learning.

The outline of the paper is as follows. We briefly review the backbone theories of the serial FHDI, segmented into (i) cell construction, (ii) estimation of cell probability, (iii) imputation, and (iv) variance estimation. After an introduction to the adopted parallel computing techniques, we explain the key parallel algorithms of the P-FHDI. We validate and evaluate the performance of the P-FHDI with synthetic datasets and propose a cost model. Finally, we introduce an updated approach embedded in the P-FHDI, particularly for imputing big-p datasets. Comprehensive examples in the Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2020.3029146, illustrate how to use the program with simple and practical data.

2 KEY ALGORITHMS OF THE SERIAL FHDI

For a comprehensive description of the FHDI and P-FHDI, we use a common basic setup throughout. Suppose a finite population U with p-dimensional continuous variables y = {y_1, y_2, ..., y_p}, where y_il represents the ith instance of the lth variable, i ∈ {1, 2, ..., N} and l ∈ {1, 2, ..., p}. The continuous variables y are categorized into discrete variables z, so-called "imputation cells," where z takes values in the categories {1, 2, ..., K} for each variable. We write y_{i,obs} and y_{i,mis} for the observed and missing parts of the ith row of y, respectively, and likewise z_{i,obs} and z_{i,mis} for the observed and missing parts of the ith row of z.

Let A be the index set of size n selected from the finite population U. Let A_M denote the set of sample indices with missing values, A_M = {i ∈ A; ∏_{l=1}^{p} δ_il = 0}; conversely, the index set with fully observed values is A_R = {i ∈ A; ∏_{l=1}^{p} δ_il = 1}. The response indicator δ_il takes the value 1 if y_il is observed and 0 otherwise. Let z be the discrete values of the continuous variables y ∈ R^{n×p} forming the imputation cells; z consists of the missing patterns z_M = {z_i | i ∈ A_M} and the observed patterns z_R = {z_i | i ∈ A_R}. Let the index set of unique missing patterns be Ã_M = {i, j ∈ A_M | ∀ i ≠ j, z_i ≠ z_j} of size ñ_M; likewise, Ã_R = {i, j ∈ A_R | ∀ i ≠ j, z_i ≠ z_j} of size ñ_R is the index set of unique observed patterns, so that ñ = ñ_M + ñ_R is the total number of unique patterns. Accordingly, z̃_M ∈ N^{ñ_M×p} denotes the unique missing patterns, z̃_M = {z_i | i ∈ Ã_M}, and z̃_R ∈ N^{ñ_R×p} denotes the unique observed patterns, z̃_R = {z_i | i ∈ Ã_R}. Note that if y_il is missing, an imputed value y*_il will be selected from the same jth donor to fill in.

Suppose variable y_l is partitioned into H_l groups such that z_l takes values in {1, ..., H_l}. The cross-classification of all variables forms the imputation cells z, and we assume a cell mean model such that

  y | (z_1 = H_1, ..., z_p = H_p) ~ (μ_{H_1,...,H_p}, Σ_{H_1,...,H_p}),   (1)

where μ_{H_1,...,H_p} is the vector of cell means and Σ_{H_1,...,H_p} is the variance-covariance matrix of y in cell (H_1, ..., H_p). Then, utilizing a finite mixture model under the missing-at-random (MAR) condition, the conditional distribution f(y_{i,mis} | y_{i,obs}) is approximated by

  f(y_{i,mis} | y_{i,obs}) ≈ Σ_{g=1}^{H} p(z_i = g | y_{i,obs}) f(y_{i,mis} | y_{i,obs}, z_i = g),   (2)

where p(z_i = g | y_{i,obs}) is the conditional cell probability and f(y_{i,mis} | y_{i,obs}, z_i = g) is the within-cell conditional density of y_{i,mis} given y_{i,obs} derived from (1). The FHDI consists of four steps: (i) cell construction, (ii) estimation of cell probability, (iii) imputation, and (iv) variance estimation. Each step is described in turn below.

2.1 Cell Construction
The pre-determination of the imputation cells was not discussed by Kim [15]. Im et al. [29] proposed an approach to generate the imputation cells z using estimated sample quantiles. Suppose we wish to categorize y_l into G categories. Consider the estimated distribution function of y_l,

  F̂_l(t) = Σ_{i∈A} δ_il w_i I(y_il ≤ t) / Σ_{i∈A} δ_il w_i,   (3)

where I is an indicator function taking the value one if its argument is true and zero otherwise, and w_i is the sampling weight of unit i. Given a set of proportions {a_1, ..., a_G} satisfying 0 < a_1 < ... < a_G, the estimated sampling quantile of y_l for a_g is defined as

  q̂_{a_g} = min{t; F̂(t) > a_g}.   (4)

Hence, y_il is categorized into imputation cell g if q̂_{a_{g-1}} < y_il < q̂_{a_g}.

We propose a new mathematical notation for iterated operations to facilitate the description of the FHDI and P-FHDI. The new symbol ⟳ denotes a loop that repeats the same operation S(x) with discrete input arguments within a fixed range. The simplest form of the loop symbol for an operation S is

  ⟳_{i=a}^{b} S(x_i) ≡ {S(x_a), S(x_{a+1}), ..., S(x_{b-1}), S(x_b)},   (5)

where i = a, ..., b. Its extension to a double loop is

  ⟳_{i=a}^{b} ⟳_{j=c}^{d} S(x_ij) = [ S(x_ac)  S(x_a(c+1))  ...  S(x_ad) ;  S(x_(a+1)c)  S(x_(a+1)(c+1))  ...  S(x_(a+1)d) ;  ... ;  S(x_bc)  S(x_b(c+1))  ...  S(x_bd) ],   (6)

where the integer indices run over i = a, ..., b and j = c, ..., d.

However, the initial discretization may not give at least two donors for each recipient to capture the variability from imputation. Identification of the unique observed patterns z̃_R and the unique missing patterns z̃_M is crucial for the selection of donors for each recipient. First, we sort the missing patterns z_M ∈ N^{n_M×p} such that z_M = {z_i | i ∈ A_M; ‖z_i‖ ≤ ‖z_{i+1}‖}, where ‖z_i‖ takes the string format of z_i, which is comparable by its numerical value. Thereby, the unique missing patterns z̃_M will be obtained by
  z̃_M = ⟳_{i=1}^{n_M} z_Mi I(‖z_Mi‖ < ‖z_M(i+1)‖),   (7)

where z_Mi ∈ z_M. Similarly, sort the observed patterns z_R ∈ N^{n_R×p} such that z_R = {z_i | i ∈ A_R; ‖z_i‖ ≤ ‖z_{i+1}‖}. So we have the unique observed patterns z̃_R by

  z̃_R = ⟳_{i=1}^{n_R} z_Ri I(‖z_Ri‖ < ‖z_R(i+1)‖),   (8)

where z_Ri ∈ z_R. Suppose D_i ∈ N^{M_i} is the set recording the indices of donors of the ith recipient z_i = {z_{i,obs}, z_{i,mis}}, where D_i = {j ∈ Ã_R | z_{i,obs} = z_{j,obs}} with size M_i; M_i represents the number of donors of the ith recipient. By z̃_M and z̃_R, we determine the set of numbers of total donors of all recipients…

TABLE 1
An Illustrative Example for Cell Collapsing With Initial Categories of 3 for p1 and p2

TABLE 2
The Number of Unique Observed Patterns (ñ_R), the Number of Unique Missing Patterns (ñ_M), and the Number of Iterations of the Cell Construction With a Growing Number of Categories

…threshold. Meantime, it requires a larger number of iterations to guarantee at least two donors for all unique missing patterns. In addition, the number of categories will affect subsequent regression analyses. The regression coefficient estimates of y1 given y2 in Fig. 3a show that 12 categories optimize the effect of the FHDI on regression analyses. Note that the true values of the slope and intercept are 0.5 and 0, respectively. Fig. 3b shows that a growing number of categories increases the standard errors of the intercept and slope. A similar discussion of the effect of the number of categories on regression in [30] agrees well with our results.

2.2 Estimation of Cell Probability
To estimate the conditional cell probability using a modified… in case of convergence of p̂(z_{g,obs}, z^{(h)}_{g,mis}) or on reaching the maximum iteration t_max defined by the user. Note that Σ_{h=1}^{H_g} p̂_h = 1.

  w*_ij = p̂^{(t)}(z*_{i,mis} | z_{i,obs}) · w_j I{z_{i,obs} = z_{j,obs}, z*_{i,mis} = z_{j,mis}} / Σ_{j=1}^{M_i} w_j I{z_{i,obs} = z_{j,obs}, z*_{i,mis} = z_{j,mis}},   (13)

where z*_{j,mis} is the jth imputed value for the ith recipient. In addition, we can verify the computation of w*_ij by Σ_{j=1}^{M_i} w*_ij = 1, where M_i denotes the number of donors for the ith recipient. In FEFI, all respondents are employed as donors for each recipient and assigned the fractional weights. Once the donors and corresponding fractional weights are determined, the FEFI estimator of Y_l can be written as

  Ŷ_{l,FEFI} = (Σ_{i∈A} w_i)^{-1} Σ_{i∈A} w_i { δ_il y_il + (1 − δ_il) Σ_{j∈A_R} w*_{ij,FEFI} y*_jl },   (14)

and the fractional weight of the jth donor for the ith recipient is defined as

  w*_{ij,FEFI} = Σ_{g=1}^{G} Σ_{h=1}^{H} a_ig p̂_{h|g} · w_j δ_j a_jgh / Σ_{l∈A} w_l δ_l a_lgh,   (15)

where a_jgh = 1 if (z_{j,obs}, z_{j,mis}) = (z_{g,obs}, z^{(h)}_{g,mis}), and 0 otherwise.

…denotes the index set of the sorted A_p in the ascending order. We can record the index mapping c ∈ N^{n_A} from A… with regard to the index mapping c easily as outputs.

2.4 Variance Estimation
The Jackknife method is implemented for the variance estimation of the FHDI estimator. The Jackknife variance estimator of Ŷ_{l,FHDI} is determined by the following steps:

S1: Identification of the joint probability p̂(z̃_M) of the unique missing patterns z̃_M by

  p̂(z̃_M) = ⟳_{j∈Ã_M} p̂(z_{j,obs}, z_{j,mis}) = (Σ_{i=1}^{n} w_i)^{-1} ⟳_{j∈Ã_M} { Σ_{i∈A} w_i p̂_j I(z_{i,obs} = z_{j,obs}) },   (20)

where p̂_j represents the conditional probability of the jth recipient's donors.

S2: Delete unit k ∈ A. If k ∈ A_M, then w*_ij will not be updated. If the size of p̂(z_{k,obs}, z_{k,mis}) is 0, skip to the next iteration.

S3: If k ∉ A_M, then w*_ij for the replicate k will be updated by…
TABLE 3
Monte Carlo Results of Relative Bias of Standard Error, the Average Length of the 95 Percent Confidence Interval, and Its Coverage Rate Based on 600 Simulation Runs
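The quantile-based discretization of Eqs. (3)-(4) and the unique-pattern bookkeeping of Section 2.1 can be sketched in a few lines. This is a simplified serial illustration written for this text, not the package's implementation; all function names are ours, and ties at quantile boundaries are resolved arbitrarily.

```python
import numpy as np

def weighted_quantile_cells(y, delta, w, G):
    """Discretize one variable y into G imputation cells using the
    weighted ECDF of Eq. (3) and the quantile rule of Eq. (4).
    y: values (n,); delta: response indicators (n,); w: weights (n,).
    Returns integer cells 1..G for observed entries, 0 for missing."""
    obs = delta == 1
    order = np.argsort(y[obs])
    ys, ws = y[obs][order], w[obs][order]
    cdf = np.cumsum(ws) / ws.sum()            # F_hat at each sorted y
    props = np.arange(1, G) / G               # interior proportions a_g
    # q_hat(a) = min{t : F_hat(t) > a}
    cuts = [ys[np.searchsorted(cdf, a, side="right")] for a in props]
    z = np.zeros(len(y), dtype=int)
    z[obs] = 1 + np.searchsorted(cuts, y[obs], side="left")
    return z

def unique_patterns(Z):
    """Split rows of the cell matrix Z into unique observed / missing
    patterns (Eqs. (7)-(8)); a row is 'missing' if any entry is 0."""
    miss = (Z == 0).any(axis=1)
    z_R = {tuple(r) for r in Z[~miss]}
    z_M = {tuple(r) for r in Z[miss]}
    return z_R, z_M

def donor_count(recipient, z_R):
    """M_i of Section 2.1: a donor matches the recipient on every
    observed (non-zero) position."""
    return sum(all(r == d for r, d in zip(recipient, donor) if r != 0)
               for donor in z_R)
```

In the real cell construction, categories are collapsed and the discretization repeated until every unique missing pattern has at least two donors.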
…ical divide-and-conquer scheme. Suppose we have in total Q processors indexed by {0, 1, 2, ..., Q−1}. The first processor, indexed by 0 (called the master processor), collects work from all other processors (called slave processors) via communication with the basic operations MPI_Send and…

Algorithm 1. Uniform Job Distribution
Input: number of total instances n, number of available processors Q, index of current processor q
Output: boundary indices s and e
1: N1 = floor(n / (Q − 1))
2: N2 = n − N1(Q − 2)
3: s = (q − 1)N1
4: e = (q − 1)N1 + N2
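A minimal sketch of the boundary computation above. The extracted listing gives the same formula for every slave, so we read N2 as the chunk size of the last slave only; that conditional is our assumption, since it appears to have been lost in typesetting.

```python
def uniform_distr(n, Q, q):
    """Boundary indices (s, e) of the work chunk of slave processor q
    (q = 1..Q-1; processor 0 is the master), in the spirit of
    Algorithm 1: the first Q-2 slaves each take N1 = floor(n/(Q-1))
    instances and the last slave takes the remainder N2 (assumption)."""
    N1 = n // (Q - 1)
    N2 = n - N1 * (Q - 2)
    s = (q - 1) * N1
    e = s + N2 if q == Q - 1 else s + N1
    return s, e
```

With this reading the slave chunks tile the index range [0, n) without overlap.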
…Right: uniform job distribution after adjustments by function tunePoints with slave processors 1, 2 and 3 (denoted as S1, S2 and S3).
…a successful remedy to this problem, cyclic distribution can effectively balance the work domain among the slave processors. Fig. 5b shows such a balanced situation with all three slave processors being busy. Depending upon the parallel task, we adopt the best choice of parallel job distribution.

Algorithm 2. Cyclic Job Distribution
Input: number of total instances n, number of available processors Q, index of current processor q, and split processor q_s
Output: boundary indices s and e
1: N3 = floor(n / (Q − 1))
2: N4 = (n − N3 q_s) / (Q − q_s − 1)
3: if q ≤ q_s then
4:   s = (q − 1)N3
5:   e = q N3
6: end if
7: if q > q_s then
8:   s = q_s N3 + (q − q_s − 1)N4
9:   e = q_s N3 + (q − q_s)N4
10: end if

3.1 Parallel Cell Construction and Estimation of Cell Probability
The determination of the initial imputation cells is computationally cheap, i.e., Eqs. (3) and (4). However, it may take considerably many iterations for the cell collapsing process to guarantee at least two donors for each recipient, i.e., Eqs. (7) to (10). The current algorithm of cell collapsing is an…

Algorithm 3. Parallel Cell Construction
Input: raw data y
Output: imputation cells z
1: for ∀ i in 0 : max_iteration do
2:   Categorize raw data y to imputation cells z
3:   Invoke function ZMAT(z) → (z̃_R, z̃_M)
4:   Invoke function nDAU(z̃_R, z̃_M) → (M, m, L)
5:   if m < 2 then
6:     Perform cell collapsing on z as described in Table 1
7:   end if
8:   if m ≥ 2 ∥ i = max_iteration then
9:     break
10:  end if
11: end for

Algorithm 4. Parallel Function ZMAT
Input: imputation cells z
Output: unique observed pattern z̃_R and missing pattern z̃_M
1: CyclicDistr(n_R) → (s, e)
2: tunePoints(s, e) → (s_a, e_a)
3: for ∀ i in s_a : e_a do
4:   Identify z̃_M^(q)
5: end for
6: MPI_SR(z̃_M^(q))
7: z̃_M = ⋁_{q=1}^{Q−1} z̃_M^(q)
8: CyclicDistr(n_M) → (s, e)
9: tunePoints(s, e) → (s_a, e_a)
10: for ∀ i in s_a : e_a do
11:   Identify z̃_R^(q)
12: end for
13: MPI_SR(z̃_R^(q))
14: z̃_R = ⋁_{q=1}^{Q−1} z̃_R^(q)
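The boundaries of Algorithm 2 can be exercised with a small sketch; the closed-form expressions for the q > q_s branch are our reading of the garbled listing (the q_s subscript appears to have been detached in extraction).

```python
def cyclic_distr(n, Q, q, qs):
    """Boundary indices (s, e) for slave processor q under the two-block
    split of Algorithm 2 (our reading): slaves 1..qs take chunks of size
    N3 = floor(n/(Q-1)); the remaining Q-1-qs slaves split the rest into
    chunks of size N4."""
    N3 = n // (Q - 1)
    N4 = (n - N3 * qs) // (Q - qs - 1)
    if q <= qs:
        s, e = (q - 1) * N3, q * N3
    else:
        s = qs * N3 + (q - qs - 1) * N4
        e = qs * N3 + (q - qs) * N4
    return s, e
```

For sizes that divide evenly, the slave chunks again tile [0, n) exactly; in the package, tunePoints then adjusts the boundaries.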
The set of numbers of total donors for all recipients is obtained in line 3 by

  M^(q) = ⟳_{i=s}^{e} { Σ_{j=1}^{ñ_R} I(‖z_{i,obs}‖ = ‖z_{j,obs}‖) }.   (28)

Meantime, the actual indices of donors for all recipients are stored in line 4 by

  L^(q) = ⟳_{i=s}^{e} ⟳_{j=1}^{ñ_R} j I(‖z_{i,obs}‖ = ‖z_{j,obs}‖).   (29)

Then communication proceeds between lines 6 and 8 to assemble M and L. Note that all processors require the results of cell construction to continue with cell probability estimation. Thus we broadcast the results from the master processor to the slave processors in line 9.

After this explicit explanation of function nDAU in line 4 of Algorithm 3, it will perform cell collapsing in case of m < 2…

Note that p̂^(t) requires reorganization with regard to the index set A*_p = {A_pi ∈ A_p | ∀ i, A_p(i+1) > A_pi} in ascending order. Before proceeding to line 4 of Algorithm 6, we explain the function Order in Algorithm 7, which generates the index mapping c of A_p in lines 4 to 10 on each processor by

  c^(q) = ⟳_{i=s_a}^{e_a} Σ_{j=1}^{n_p} j I(A*_pi = A_pj).   (31)

After communication in line 11, c will be assembled in line 12. By leveraging c in line 4 of Algorithm 6, p̂^(t) can be rearranged to update p̂^(t+1)(z*) in line 5 by

  p̂^(t+1)(z*) = (Σ_{i=1}^{n} w_i)^{-1} Σ_{i=1}^{n} w_i p̂^(t) I(z_{i,obs} = z_{g,obs}).   (32)

In line 6, the EM algorithm terminates if p̂^(t+1)(z*) converges within a threshold (e.g., 10^-6).
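The EM-style update of Eqs. (30)-(32) can be illustrated with a serial toy version in which each partially observed row spreads its weight over the compatible fully observed cells in proportion to the current estimate. This is a sketch under our simplifications (joint cell probabilities, hash-map bookkeeping), not the parallel implementation.

```python
from collections import defaultdict

def em_cell_prob(Z, w, max_iter=50, tol=1e-6):
    """EM-style estimate of joint cell probabilities p(z) from
    discretized rows Z (0 = missing entry) with weights w, in the
    spirit of Eqs. (30)-(32). Candidate cells are the fully observed
    patterns; a simplified serial sketch."""
    cells = sorted({tuple(r) for r in Z if 0 not in r})
    p = {c: 1.0 / len(cells) for c in cells}          # uniform start
    W = sum(w)
    for _ in range(max_iter):
        new = defaultdict(float)
        for row, wi in zip(Z, w):
            # cells compatible with the row's observed positions
            compat = [c for c in cells
                      if all(r == 0 or r == cv for r, cv in zip(row, c))]
            tot = sum(p.get(c, 0.0) for c in compat)
            if tot == 0.0:
                continue
            for c in compat:                          # E-step allocation
                new[c] += wi * p.get(c, 0.0) / tot
        new = {c: v / W for c, v in new.items()}      # M-step normalize
        delta = max(abs(new[c] - p.get(c, 0.0)) for c in new)
        p = new
        if delta < tol:                               # Eq. (32)-style stop
            break
    return p
```

A fully missing row simply distributes its weight across all cells, and rows matching no observed cell are skipped in this sketch.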
8: L = ⋁_{q=1}^{Q−1} L^(q)
9: MPI_Bcast(M, L)

Algorithm 6. Parallel Estimation of Cell Probability
Input: imputation cells z, sampling weight w
Output: cell probability p̂(z*)
1: for t in 0 : t_max do
2:   Update w* and p̂^(t)
3:   Order(A_p) → c
4:   c → p̂^(t)
5:   Compute p̂^(t+1)(z*)
…

11: MPI_SR(c^(q))
12: c = ⋁_{q=1}^{Q−1} c^(q)

Algorithm 8. Parallel Imputation
Input: raw data y, imputation cells z, number set of donors M, cell probability p̂(z_obs)
Output: imputed values Ŷ
1: for i in 0 : ñ_M do
2:   Compute w*_ij
3:   Sort FEFI donors
4:   Construct (L_j, L_nRg)
…

…ism. The EM iterations start in line 1 of Algorithm 6 and terminate if the joint probability of z* = (z_obs, z_mis) converges, in lines 6 to 8. Let A_p = {1, ..., H^G} be the local index set of the initial conditional probabilities p̂^(t) with size n_p. We update w* ∈ N^{n_p} and p̂^(t) ∈ R^{n_p} in line 2, where p̂^(t) is

  p̂^(t) = ⟳_{h=1}^{H_g} p̂^(t)(z_{g,obs}, z*_{i,mis} = z^{(h)}_{g,mis}) / Σ_{h'=1}^{H_g} p̂^(t)(z_{g,obs}, z*_{i,mis} = z^{(h')}_{g,mis}).   (30)

Imputation in the P-FHDI aims at selecting M donors for each recipient in z̃_M, in lines 1 to 6 of Algorithm 8. The FEFI fractional weights for all possible donors assigned to each recipient are computed in line 2. We employ the PPS method to randomly select M donors in lines 3 to 5 such that w*_ij = {w*_ij | i ∈ Ã_M, j ∈ {1, ..., M_i}}. In particular, it sorts all FEFI donors by their y values in half-ascending and half-descending order to construct successive intervals, in lines 3 and 4. To prepare the results of the fractional imputed
YANG ET AL.: PARALLEL FRACTIONAL HOT-DECK IMPUTATION AND VARIANCE ESTIMATION FOR BIG INCOMPLETE DATA CURING

values Ŷ, we obtain the index mapping c of the sorted A by the Order function in line 7 by

  c = ⋁_{q=1}^{Q−1} { ⟳_{i=s}^{e} Σ_{j=1}^{n_A} j I(A*_i = A_j) }^(q).   (33)

Note that the majority of the computational cost occurs in line 7. …mapping c in line 8.

3.3 Parallel Variance Estimation
The majority of the expensive computation happens in variance estimation because of the Jackknife estimation method of the FHDI. The parallelized variance estimation is summarized in Algorithm 9.

Algorithm 9. Parallel Variance Estimation
Input: raw data y, imputation cells z, sampling weights w, index set A, number of donors M, imputed values Ŷ
Output: variance V̂(Ŷ_FHDI)
1: Rep_CellP(z) → p̂(z̃_M)
2: UniformDistr(n) → (s, e)
3: for ∀ k in s : e do
4:   if p̂(z_k) = 0 then
5:     Skip
6:   end if
7:   if n_p > 0 then
8:     FW(w*_ij) → w*_ij^(k)
9:   end if
10:  Compute Ŷ_FHDI^{K,q}
11: end for
12: MPI_SR(Ŷ_FHDI^{K,q})
13: Ŷ_FHDI^K = ⋁_{q=1}^{Q−1}(Ŷ_FHDI^{K,q})
14: Compute Ŷ_FHDI
15: Compute V̂(Ŷ_FHDI)

Algorithm 10. Function Rep_CellP
Input: imputation cells z
…

where p̂_j represents the conditional probability of the jth recipient's donors. Note that each p̂_j has to be rearranged in advance according to its index mapping in Eq. (31). For instance, let A_p be the index set of p̂ and A*_p denote the index set of the sorted A_p. Because of the heavy computational cost of linear searching, we implement binary search (denoted as BS) [33] to record the mapping of A_p after sorting. …m denotes the middle element of A*_p. We set a left index L to 1 and a right index R to n_p. If A*_pm ≤ A_pi, where i ∈ {1, 2, ..., n_p}, we set L to m; otherwise, we set R to m. If A*_pm = A_pi, m is the output of the BS function. Thus, we have c^(q) by

  c^(q) = ⟳_{i=s_a}^{e_a} BS(A*_pi).   (35)

After communication in line 5, p̂(z̃_M) will be assembled in line 6 and broadcast to all slave processors in line 7. After this demonstration of function Rep_CellP in line 1 of Algorithm 9, by leveraging the uniform job distribution in line 2, we can compute the replicate estimator matrix Ŷ_FHDI^{K,q} ∈ R^{(e−s)×p} on processor q (note that the superscript K is a simple notation for distinction) in line 10 by

  Ŷ_FHDI^{K,q} = ⟳_{k=s}^{e} Ŷ_FHDI^k = ⟳_{k=s}^{e} { ⟳_{l=1}^{p} Ŷ_{l,FHDI}^k },   (36)

where Ŷ_{l,FHDI}^k is the kth replicate estimator of y_l. By substituting Eq. (24), we have

  Ŷ_FHDI^{K,q} = ⟳_{k=s}^{e} ⟳_{l=1}^{p} Σ_{i∈A} w_i^(k) { δ_il y_il + (1 − δ_il) Σ_{j=1}^{M} w*_ij^(k) y_il^(j) }.   (37)

The function FW in Eq. (21) updates w*_ij^(k) if k ∉ A_M. After communication of Ŷ_FHDI^{K,q} in line 12, Ŷ_FHDI^K is assembled in line 13. Also, the FHDI estimator Ŷ_FHDI ∈ R^p is computed in line 14 by substituting Eq. (18)…
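The replicate logic that Algorithm 9 parallelizes is the classical delete-one Jackknife. A serial sketch for a plain weighted mean follows; the FHDI replicates additionally update the fractional weights through FW, which we omit here for clarity.

```python
import numpy as np

def jackknife_variance(y, w):
    """Delete-one Jackknife variance of the weighted mean: replicate k
    recomputes the estimator with unit k removed (the loop that
    Algorithm 9 splits over processors via UniformDistr), then the
    replicates are combined as (n-1)/n * sum (theta_k - theta_bar)^2."""
    n = len(y)
    reps = np.empty(n)
    for k in range(n):                    # replicate k deletes unit k
        mask = np.arange(n) != k
        reps[k] = np.average(y[mask], weights=w[mask])
    return (n - 1) / n * np.sum((reps - reps.mean()) ** 2)
```

For the unweighted mean this combination reduces exactly to the classical s²/n, which makes it easy to sanity-check a parallel implementation.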
TABLE 5
Standard Errors of Naive, Serial and Parallel Estimators

  Estimator   y1      y2      y3      y4
  Naive       0.0419  0.0390  0.0490  0.0472
  FEFI        0.0367  0.0378  0.0460  0.0457
  P-FEFI      0.0367  0.0378  0.0460  0.0457
  FHDI        0.0383  0.0372  0.0451  0.0453
  P-FHDI      0.0384  0.0370  0.0449  0.0450

TABLE 6
Standard Errors of the Naive and the P-FHDI Estimators With Datasets U(Instances, Variables, Missing Rate)

  Data              Method   y1      y2      y3      y4
  U(0.1M, 4, 0.25)  Naive    0.0029  0.0039  0.0047  0.0043
                    P-FHDI   0.0037  0.0034  0.0045  0.0045
  U(0.5M, 4, 0.25)  Naive    0.0013  0.0018  0.0021  0.0019
                    P-FHDI   0.0017  0.0015  0.0020  0.0020
  U(1M, 4, 0.25)    Naive    0.0009  0.0012  0.0015  0.0013
                    P-FHDI   0.0012  0.0011  0.0014  0.0014

TABLE 7
Summary of the Adopted Datasets U(Instances, Variables, Missing Rate)

  Dataset           Variable type  Dimension
  Synthetic data 1  Continuous     U(1000, 4, 0.25)
  Synthetic data 2  Continuous     U(10^6, 4, 0.25)
  Air Quality       Hybrid         U(41757, 4, 0.1)
  Nursery           Categorical    U(12960, 5, 0.3)
  Synthetic data 3  Continuous     U(15000, 12, 0.15)
  Synthetic data 4  Continuous     U(15000, 16, 0.15)
  Synthetic data 5  Continuous     U(15000, 100, 0.15)
  Synthetic data 6  Continuous     U(1000, 100, 0.3)
  Synthetic data 7  Continuous     U(1000, 1000, 0.3)
  Synthetic data 8  Continuous     U(1000, 10000, 0.3)
  Appliance Energy  Continuous     U(19735, 26, 0.15)

…servers. Each server has two 8-core Intel Haswell processors, 128 GB of memory, and a 2.5 TB local disk. For a convenient description of the synthetic datasets, let U(n, p, h) denote a finite population with n instances and p variables issued with missing rate h in proportion.

We generate y_i = (y_i1, y_i2, y_i3, y_i4), i = 1, ..., 1000, from y1 = 1 + e1, y2 = 2 + 0.5e1 + 0.866e2, y3 = y1 + e3, and y4 = 1 + 0.5y3 + e4, where e1, e2, e4 are randomly generated from the standard normal distribution and e3 from the gamma distribution. Missingness is addressed by the Bernoulli distribution as δ_i1 ~ Ber(0.6), δ_i2 ~ Ber(0.7), δ_i3 ~ Ber(0.8), δ_i4 ~ Ber(0.9), independently for each variable. This creates 25 percent missingness in the synthetic dataset.

On the basis of the dataset U(1000, 4, 0.25), we contrast the naive estimator, the fractional imputation estimator, and the parallel fractional imputation estimator. The naive estimator is a simple mean estimator computed using only the observed values. The results in Table 5 show that the P-FEFI produces the same standard errors of y as the serial FEFI. Results generated by the P-FHDI differ slightly from those of the serial FHDI, in a tolerable manner. The random number generators used for the selection of donors in the FHDI and P-FHDI are library-dependent, which is the main reason for the tiny residuals. Note that the standard errors of the P-FHDI decrease asymptotically as n increases in Table 6. As expected, both the P-FHDI and the P-FEFI often outperform the naive method in terms of standard errors. Some exceptions (e.g., y1 and y4 of Table 6) may be attributed to the large missing rate and the simple synthetic model used in this study case. We provide a step-by-step illustration of the use of the P-FHDI in Appendix I, available in the online supplemental material. All the developed codes are shared under GPL-2, and the codes, examples, and documents are available in [35].

To maximize the benefit to researchers, we provide the eleven adopted datasets in the supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2020.3029146, as summarized in Table 7. Synthetic data 1 is the dataset used to validate the P-FHDI. Synthetic data 2 is a big-n dataset with millions of instances. Synthetic data 3 to 5 are used to compare the performance of the big-n and big-p algorithms. We apply Synthetic data 6 to 8 to test extreme cases of the P-FHDI on high-dimensional datasets. Three practical incomplete datasets, i.e., Air Quality, Nursery, and Appliance Energy, are included [28]. The hybrid dataset (Air Quality) is used to illustrate how to apply the P-FHDI to real data in Appendix I, available in the online supplemental material. Note that Air Quality consists of a non-collapsible categorical variable (y1) and three continuous variables (y2-y4). We test the extension of the P-FHDI to categorical variables with Nursery. We investigate the behavior of the big-p algorithm with a growing number of selected variables using Appliance Energy. Users can practice using the P-FHDI with these example datasets.

5 COST ANALYSIS AND SCALABILITY
It is crucial to build a cost model of the P-FHDI: it will not only evaluate the performance of the P-FHDI but also reveal potential memory risks. Note that when it comes to the Big-O notation of a function g(x), we usually refer to the worst-case scenario leading to the upper time bound O(g(x)) such that g(x) < O(g(x)) as x → ∞. Considering a constant time for a unit operation in the algorithm, we can estimate the time complexity by counting the number of elementary operations. It is essential to define the elementary costs involved in computation and communication. Let α be the computational cost per unit, β the transfer cost per unit of communication, and L the communication startup cost. The total running time T(Q) with Q available processors consists of the computational cost C and the communication cost H, which reads

  T(Q) ≈ α′/Q + β′Q,   (40)

where α′ = α c1(n, p) O(n²) + α O(n³) and β′ = β c1(n, p) O(n). The quantity c1(n, p) is the total number of iterations in cell construction, which guarantees at least two donors for each recipient. Because of the implicit property of cell construction and probability estimation, we can only make an empirical approximation that c1(n, p) ∝ n ln(p). In conjunction
TABLE 8
Datasets U(Instances, Variables, Missing Rate) Prepared for Parametric Studies

with Eq. (40), we have the scalability T(Q)/T(c_p Q) by

  T(Q) / T(c_p Q) = c_p (α′ + Q²β′) / (α′ + c_p² Q² β′),   (41)

where c_p ∈ N⁺. Detailed derivations of Eqs. (40) and (41) are presented in Appendix II, available in the online supplemental material.

Fig. 7. Impact of the number of instances n on the running time of the entire P-FHDI (i.e., imputation and variance estimation) with datasets U(n, 4, 0.25): 4 variables and 25 percent missing rate, varying n.
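Eqs. (40) and (41) can be exercised numerically. In the sketch below the unit costs α and β are arbitrary placeholders, and c1(n, p) uses the empirical n ln(p) approximation stated above; the point is only the qualitative shape, namely that speedup grows while α′/Q dominates and collapses once the β′Q communication term takes over.

```python
import math

def running_time(Q, n, p, alpha=1e-9, beta=1e-8):
    """Predicted running time of Eq. (40): T(Q) ~ alpha'/Q + beta'*Q,
    with alpha' = alpha*c1*n^2 + alpha*n^3 and beta' = beta*c1*n,
    where c1(n, p) ~ n*ln(p) (empirical approximation). The unit
    costs alpha and beta are machine-dependent placeholders."""
    c1 = n * math.log(p)
    a = alpha * c1 * n**2 + alpha * n**3   # computational cost alpha'
    b = beta * c1 * n                      # communication cost beta'
    return a / Q + b * Q

def speedup(Q, n, p, **kw):
    """Speedup relative to a single processor; saturates and then
    degrades as communication dominates (cf. Eq. (41))."""
    return running_time(1, n, p, **kw) / running_time(Q, n, p, **kw)
```

The model predicts an optimal processor count near Q* = sqrt(α′/β′), beyond which adding processors slows the job down.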
To evaluate the performance of the P-FHDI in practice, we perform parametric studies to investigate the impacts of the number of instances n, the number of variables p, and the missing rate h on speedup and running time. We generate synthetic datasets of the form y_i = (y_i1, ..., y_ip), i = 1, ..., n, using the same method presented in Section 4, and the missingness is dynamically issued by δ_p ~ Ber(h_p). Fixing the other two parameters, nine synthetic datasets are generated by varying the target parameter in Table 8.

Note that the P-FHDI is particularly powerful for big data with massive instances and high missing rates. The parametric studies on the number of variables will be discussed in Section 6. An increasing number of instances n and missing rate h increase the running time of the P-FHDI in Figs. 7 and 9, but n and h are not likely to significantly affect the speedup of the P-FHDI in Figs. 6 and 8. We provide the speedups and running times of imputation and variance estimation of the P-FHDI separately in Appendix II, available in the online supplemental material. Overall, the number of instances is the dominating parameter positively affecting speedup and running time. All of these parametric studies agree well with the cost models of speedup and running time.

6 VARIABLE REDUCTION FOR BIG-p DATASETS
The current algorithm of the P-FHDI may not be adequate for imputing big-p data with too many variables (e.g., p > 100). We performed a parametric study with increasing p, and the results show weak speedups with the big-n algorithm in Fig. 10. A gradual increment of p significantly increases the total running time, which may be attributed to the fact that many variables can lead to exponentially increasing iterations to guarantee at least two donors during the current cell construction algorithm. This issue is theoretically related to the so-called curse of dimensionality. There is a huge literature on the problem of variable selection. To name a few, the least absolute shrinkage and selection operator…

Fig. 6. Impact of the number of instances n on speedups of the entire P-FHDI (i.e., imputation and variance estimation) with datasets U(n, 4, 0.25): 4 variables and 25 percent missing rate, varying n.

Fig. 8. Impact of the missing rate h on speedups of the entire P-FHDI (i.e., imputation and variance estimation) with datasets U(1M, 4, h): 1 million instances and 4 variables, varying h.

Fig. 9. Impact of the missing rate h on running time of the entire P-FHDI (i.e., imputation and variance estimation) with datasets U(1M, 4, h): 1 million instances and 4 variables, varying h.
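The synthetic populations used in these parametric studies follow the recipe of Section 4. A sketch is given below; the gamma shape parameter is our placeholder, as the paper does not state it, and the function name is ours.

```python
import numpy as np

def make_synthetic(n=1000, response=(0.6, 0.7, 0.8, 0.9), seed=1):
    """Synthetic U(n, 4, ~0.25) population of Section 4:
    y1 = 1 + e1, y2 = 2 + 0.5*e1 + 0.866*e2, y3 = y1 + e3,
    y4 = 1 + 0.5*y3 + e4, with e1, e2, e4 standard normal and e3
    gamma (shape = 1 is our placeholder). delta_l ~ Ber(rate_l)
    marks responses (1 = observed), independently per variable."""
    rng = np.random.default_rng(seed)
    e1, e2, e4 = rng.standard_normal((3, n))
    e3 = rng.gamma(shape=1.0, size=n)
    y1 = 1 + e1
    y2 = 2 + 0.5 * e1 + 0.866 * e2
    y3 = y1 + e3
    y4 = 1 + 0.5 * y3 + e4
    Y = np.column_stack([y1, y2, y3, y4])
    delta = np.column_stack([rng.binomial(1, r, n) for r in response])
    return Y, delta
```

Averaging the four response rates gives the overall 25 percent missingness quoted in the text, since 1 − (0.6 + 0.7 + 0.8 + 0.9)/4 = 0.25.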
  P(M* ⊇ M_g) → 1,   (44)

as n → ∞, for some given γ. Inspired by the screening step of SIS, we introduce a variable reduction method with multivariate responses embedded into the P-FHDI algorithm (the so-called big-p algorithm). By the assumption of a cell mean model, an instance {z_i | i ∈ A_R} serves as a donor of {z_j | j ∈ A_M} unless all observed variables of z_j are identical to…

  [ E(Y_N) + E{ Σ_{g=1}^{G_v} Σ_{i∈U_g} (R_g^{-1} δ_i − 1)(y_i − μ_g) } ] / E(Y_N) → 1,   (48)

as v → p, where R_g = N_Rg/N_g, N_Rg = Σ_{i∈U_g} δ_i, Y_N = Σ_{i=1}^{N} y_i, G_v is the number of imputation groups using v selected variables, and μ_g is the weighted average of the y_i's in…
YANG ET AL.: PARALLEL FRACTIONAL HOT-DECK IMPUTATION AND VARIANCE ESTIMATION FOR BIG INCOMPLETE DATA CURING 13
Fig. 11. The average ratios of (a) mean and (b) Jackknife variance estimator of the practical dataset U(19735, 26, 0.15) using a growing number of selected variables.
of V̂(ŶFHDI) less than 1. The distance W measures how far yi^(r) is away from the mean of M − 1 donors of the ith recipient. ŶFHDI has additional variance due to the donor selection procedure [15], which leads to Wp being slightly smaller than Wv; the difference is trivial if v is much smaller than p. As v → p, wir,FEFI,v ≈ wir,FEFI,p. Sequentially, Wp < Wv will play an important role such that the ratio of V̂(ŶFHDI) can be slightly greater than 1. That is how the trend of the average ratio of V̂(ŶFHDI) in Fig. 11b is explained.

[…] measurement proposed in [41] other than correlation learning will also be a good candidate criterion to filter out unimportant categorical variables. Departing from the current program, the next theoretical advancements (e.g., cell construction with k-nearest neighbors or fractional imputation using a density ratio model) and computational advancements (e.g., a prudent data distribution algorithm) shall focus on ultra datasets and high-dimensional categorical data.
The extremely high-dimensional data we have tested with the big-p algorithm so far has 10,000 variables. Fig. 12 shows that the speedups of the P-FHDI on extremely high-dimensional data can scale up slightly lower than the linear […]

8 CONCLUSION

As we enter into the new era of big data, it is of paramount importance to establish the big data-oriented imputation paradigm. By inheriting the strengths of the general-purpose,
assumption-free serial fractional hot-deck imputation program FHDI, this paper presents the first version of the parallel fractional hot-deck imputation program, named P-FHDI. We document the full details regarding the algorithm-oriented parallelisms, computational implementations, and various examples of the P-FHDI to benefit a broad audience in the science and engineering domain. The developed P-FHDI is suitable for tackling large-instance (so-called big-n) or large-variable (so-called big-p) missing data with irregular and complex missing patterns. The validations and analytical cost models confirm that the P-FHDI exhibits promising scalability for big incomplete datasets regardless of various missing rates. This program will help to impute incomplete data, enable parallel variance estimation, and ultimately improve the subsequent statistical inference and machine learning with the cured big datasets. The next version of P-FHDI will focus on ultra datasets with both large instances and many variables, which will call for specialized theoretical and computational advancements. Toward that future extension, this first version of P-FHDI will serve as a concrete foundation. All the developed codes are shared under GPL-2, and codes, examples, and documents are available in [35].

ACKNOWLEDGMENTS

This work was supported by the research funding of the Department of Civil, Construction, and Environmental Engineering of Iowa State University. The high-performance computing facility used for this work was partially supported by the HPC@ISU equipment at ISU, some of which has been purchased through funding provided by the National Science Foundation under MRI grant number CNS 1229081 and CRI grant number 1205413. This work was also supported by the National Science Foundation under grants CBET-1605275 and OAC-1931380.

REFERENCES

[1] J. Im, I. Cho, and J. K. Kim, "An R package for fractional hot deck imputation," R J., vol. 10, no. 1, pp. 140–154, 2018.
[2] I. Song, Y. Yang, J. Im, T. Tong, H. Ceylan, and I.-H. Cho, "Impacts of fractional hot-deck imputation on learning and prediction of engineering data," IEEE Trans. Knowl. Data Eng., to be published, doi: 10.1109/TKDE.2019.2922638.
[3] T. A. Myers, "Goodbye, listwise deletion: Presenting hot deck imputation as an easy and efficient tool for handling missing data," Commun. Methods Measures, vol. 5, no. 4, pp. 297–310, 1999.
[4] J. W. Graham, "Missing data analysis: Making it work in the real world," Annu. Rev. Psychol., vol. 60, pp. 549–576, 2009.
[5] L. Wilkinson, "Statistical methods in psychology journals: Guidelines and explanations," Amer. Psychol., vol. 54, no. 8, pp. 594–604, 1999.
[6] A. J. Smola, S. Vishwanathan, and H. Thomas, "Kernel methods for missing variables," in Proc. Int. Conf. Artif. Intell. Statist., 2005, pp. 325–332.
[7] D. Williams, X. Liao, Y. Xue, and L. Carin, "Incomplete-data classification using logistic regression," in Proc. 22nd Int. Conf. Mach. Learn., 2005, pp. 972–979.
[8] D. Rubin, "Multiple imputation after 18+ years," J. Amer. Statist. Assoc., vol. 91, no. 434, pp. 473–489, 1996.
[9] I. White, P. Royston, and A. Wood, "Multiple imputation using chained equations: Issues and guidance for practice," Statist. Med., vol. 30, no. 4, pp. 377–399, 2011.
[10] N. J. Horton and S. R. Lipsitz, "Multiple imputation in practice," Amer. Statist., vol. 55, no. 3, pp. 244–254, 2001.
[11] S. Yang and J. K. Kim, "A note on multiple imputation for method of moments estimation," Biometrika, vol. 103, no. 1, pp. 244–251, 2016.
[12] X. Xie and X.-L. Meng, "Dissecting multiple imputation from a multi-phase inference perspective: What happens when God's, imputer's and analyst's models are uncongenial?" Statist. Sinica, vol. 27, no. 4, pp. 1485–1594, 2017.
[13] K. Graham and L. Kish, "Some efficient random imputation methods," Commun. Statist.-Theory Methods, vol. 13, no. 16, pp. 1919–1939, 1984.
[14] R. E. Fay, "Alternative paradigms for the analysis of imputed survey data," J. Amer. Statist. Assoc., vol. 91, no. 434, pp. 490–498, 1996.
[15] J. K. Kim and W. Fuller, "Fractional hot deck imputation," Biometrika, vol. 91, no. 3, pp. 559–578, 2004.
[16] S. Yang and J. K. Kim, "Fractional imputation in survey sampling: A comparative review," Statist. Sci., vol. 31, no. 3, pp. 415–432, 2016.
[17] W. A. Fuller and J. K. Kim, "Hot deck imputation for the response model," Surv. Methodol., vol. 31, no. 2, pp. 139–149, 2005.
[18] E. F. Akmam, T. Siswantining, S. M. Soemartojo, and D. Sarwinda, "Multiple imputation with predictive mean matching method for numerical missing data," in Proc. Int. Conf. Informat. Comput. Sci., 2019, pp. 1–6.
[19] Nurzaman, T. Siswantining, S. M. Soemartojo, and D. Sarwinda, "Application of sequential regression multivariate imputation method on multivariate normal missing data," in Proc. Int. Conf. Informat. Comput. Sci., 2019, pp. 1–6.
[20] K. Aristiawati, T. Siswantining, D. Sarwinda, and S. M. Soemartojo, "Missing values imputation based on fuzzy C-means algorithm for classification of chronic obstructive pulmonary disease," in Proc. 8th SEAMS-UGM Int. Conf. Math. Appl., 2019, Art. no. 060003.
[21] T. Anwar, T. Siswantining, D. Sarwinda, S. M. Soemartojo, and A. Bustamam, "A study on missing values imputation using K-harmonic means algorithm: Mixed datasets," in Proc. Int. Conf. Sci. Appl. Sci., 2019, Art. no. 020038.
[22] IBM, 2012. [Online]. Available: http://www-01.ibm.com/software/data/bigdata/
[23] B. Caffo, R. Peng, F. Dominici, T. A. Louis, and S. Zeger, "Parallel MCMC imputation for multiple distributed lag models: A case study in environmental epidemiology," in The Handbook of Markov Chain Monte Carlo. Boca Raton, FL, USA: CRC Press, 2011, pp. 493–512.
[24] T. J. Durham, M. W. Libbrecht, J. Howbert, J. Bilmes, and W. S. Noble, "PREDICTD PaRallel epigenomics data imputation with cloud-based tensor decomposition," Nat. Commun., vol. 9, no. 1, Art. no. 1402, 2018.
[25] B. Zhang, D. Zhi, K. Zhang, G. Gao, N. N. Limdi, and N. Liu, "Practical consideration of genotype imputation: Sample size, window size, reference choice, and untyped rate," Statist. Interface, vol. 4, no. 3, pp. 339–352, 2011.
[26] E. Porcu, S. Sanna, C. Fuchsberger, and L. G. Fritsche, "Genotype imputation in genome-wide association studies," Current Protocols Hum. Genetics, vol. 78, no. 1, pp. 1–25, 2013.
[27] X. Hu, "Acceleration genotype imputation for large dataset on GPU," Procedia Environ. Sci., vol. 8, pp. 457–463, 2011.
[28] D. Dheeru and G. Casey, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[29] J. Im, J. K. Kim, and W. A. Fuller, "Two-phase sampling approach to fractional hot deck imputation," in Proc. Surv. Res. Methodol., 2015, pp. 1030–1043.
[30] S. Z. Christopher, T. Siswantining, D. Sarwinda, and A. Bustaman, "Missing value analysis of numerical data using fractional hot deck imputation," in Proc. Int. Conf. Informat. Comput. Sci., 2019, pp. 1–6.
[31] J. G. Ibrahim, "Incomplete data in generalized linear models," J. Amer. Statist. Assoc., vol. 85, no. 411, pp. 765–769, 1990.
[32] I. Cho and K. A. Porter, "Multilayered grouping parallel algorithm for multiple-level multiscale analyses," Int. J. Numer. Methods Eng., vol. 100, no. 12, pp. 914–932, 2014.
[33] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Eds., Introduction to Algorithms. London, U.K.: MIT Press, 2009.
[34] Condo, "Condo2017: Iowa State University high-performance computing cluster system," 2017. [Online]. Available: https://www.hpc.iastate.edu/guides/condo-2017/slurm-job-script-generator-for-condo
[35] I. Cho, "Data-driven, computational science and engineering platforms repository," 2017. [Online]. Available: https://sites.google.com/site/ichoddcse2017/home/type-of-trainings/parallel-fhdi-p-fhdi-1
[36] R. Tibshirani, "Regression shrinkage and selection via the Lasso," J. Roy. Statist. Soc. Ser. B Methodol., vol. 58, no. 1, pp. 267–288, 1996.
[37] A. E. Hoerl and R. W. Kennard, "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
[38] I. T. Jolliffe, Principal Component Analysis. Berlin, Germany: Springer, 2011.
[39] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," J. Amer. Statist. Assoc., vol. 96, no. 456, pp. 1348–1360, 2001.
[40] J. Fan and J. Lv, "Sure independence screening for ultrahigh dimensional feature space," J. Roy. Statist. Soc. Ser. B Statist. Methodol., vol. 70, no. 5, pp. 849–911, 2008.
[41] D. Huang, R. Li, and H. Wang, "Feature screening for ultrahigh dimensional categorical data with applications," J. Bus. Econ. Statist., vol. 32, no. 2, pp. 237–244, 2014.

Yicheng Yang received the MS degree in civil engineering with a minor in computer science from Iowa State University (ISU), Ames, Iowa, in 2018. He is currently working toward the PhD degree in civil engineering (specialization: intelligent infrastructure engineering) in the Department of Civil, Construction and Environmental Engineering (CCEE), ISU, Ames, Iowa. His research interests include parallel imputation, machine learning, and data-driven engineering.

Jae Kwang Kim received the PhD degree in statistics from Iowa State University (ISU), Ames, Iowa, in 2000. He is a fellow of the American Statistical Association and the Institute of Mathematical Statistics and currently a LAS dean's professor with the Department of Statistics, ISU. His research interests include survey sampling, statistical analysis with missing data, measurement error models, multi-level models, causal inference, data integration, and machine learning.

In Ho Cho (Member, IEEE) received the PhD degree in civil engineering with a minor in computational science and engineering from the California Institute of Technology, Pasadena, California, in 2012. He is currently an associate professor with the Department of CCEE, ISU. His research interests include data-driven engineering and science, computational statistics, parallel computing, parallel multi-scale finite element analysis, and computational and engineering mechanics for soft micro-robotics and nano-scale materials.
SUPPLEMENTARY MATERIAL
Following the initial settings, we adopt the same dataset y used in [3] as an example. Data y delimit each element in a row by a single space and start a new row with a line break, keeping the consistent format shown below:
daty
1.479632863 2.150860465 0 1.894211796
0 1.141495781 1.602529616 −1.036946859
0.70870936 1.885673226 1.250689441 0
0 2.753839552 0 1.211049509
0.862735721 2.42554872 1.887549247 −0.539284732
... ... ... ...
It is worthwhile to note that missing values in y can be marked with any numerical value. The locations of missing units are identified by the matrix r, which has the same dimensions as y. The response index (1 = observed; 0 = missing) in r is essential for the program to locate missing values. For instance, the missing units of y are marked as
datr
1 1 0 1
0 1 1 1
1 1 1 0
0 1 0 1
1 1 1 1
... ... ... ...
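Since the scalability studies denote datasets as U(n, p, η) with missing rate η, the r matrix can be sanity-checked against the intended η. A small numpy sketch (illustrative only, not part of the P-FHDI code) using the datr rows shown above:

```python
import numpy as np

# Response indicator matrix r (1 = observed, 0 = missing), mirroring the
# first rows of datr shown above.
r = np.array([[1, 1, 0, 1],
              [0, 1, 1, 1],
              [1, 1, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 1]])

eta = 1.0 - r.mean()                           # overall missing rate: 0.25
miss_rows = np.where((r == 0).any(axis=1))[0]  # rows that need imputation
```

Here 5 of the 20 cells are missing, so eta evaluates to 0.25, and rows 0 to 3 each contain at least one missing unit.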
Next, the category vector K defines the number of categories for each variable to be created in the cell construction process. The maximum value of a category is 35. [2] suggests setting K of each variable between 30 and 35 after detailed case studies with continuous variables. Still, the gain with such a large K is not always substantial, and the computational cost increases with K. Thus, it is reasonable to start with a small K. For instance, the definition below assigns 3 categories to all variables:
category
3 3 3 3
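The category vector controls how cell construction discretizes each continuous variable. The exact binning rule is internal to the P-FHDI; purely as an illustration, a quantile-based binning (an assumption, not the program's actual rule) into K = 3 categories could look like:

```python
import numpy as np

def to_categories(col, K):
    # Quantile-based binning into categories 1..K -- an illustrative
    # assumption, not the exact P-FHDI cell-construction rule.
    cuts = np.quantile(col, np.linspace(0, 1, K + 1)[1:-1])
    return np.searchsorted(cuts, col, side="right") + 1

rng = np.random.default_rng(0)
col = rng.normal(size=100)   # one fully observed continuous variable
z = to_categories(col, 3)    # roughly balanced categories 1, 2, 3
```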
Each variable can have a different value of category. The P-FHDI also provides an option to reserve non-collapsible categorical variables:
NonCollapsible categorical
0 0 0 0
The default option is a zero index vector of size p. For instance, when ”NonCollapsible categorical” = c(1,0,0,0), the first variable is strictly treated as a non-collapsible categorical variable, and all the other variables are treated as continuous or collapsible categorical variables. Consequently, cell collapsing will not happen if any non-collapsible categorical variables are issued. On top of categories, users should include a sampling weight for each row of y. The length of sampling weights is consistent with the number of instances in y; it has a length of 100 in the example:
weight
1
1
1
1
1
...
The choice of imputation cells z is optional. This matrix is meaningful only when ”i user defined datz” = 1:
datz
3 2 0 3
0 1 1 1
2 2 2 0
0 2 0 3
2 3 2 2
... ... ... ...
To compile C++ MPI programs using the Intel MPI, run the command:
mpiicc -o main_MPI main_MPI.cpp
where the -o flag specifies the name of the executable file and ”**.cpp” represents the actual source file name with its extension. Then a job script should be created and submitted to the queue by issuing:
sbatch run.sbatch
where ”**.sbatch” is the actual name of the script file with its extension. To submit a parallel-computing task to a high-performance computing (HPC) system, users need to prepare a script file for a workload manager. Depending upon the HPC facility, there are many workload managers. For instance, ”Slurm” in our HPC facility is a highly scalable cluster job management system. A job script can be generated by the Slurm Job Script Generator for Condo in [4], and an example script file is shown below:
#!/bin/bash
# The number of nodes being requested
#SBATCH --nodes=1
# Total number of processors being requested (across all nodes).
#SBATCH --ntasks=4
# The maximum time of the job
#SBATCH --time=00:12:00
# Load the module
module load intel/18.3
# Run the MPI program
mpirun -np 4 ./main_MPI
The script file starts with a line identifying the Unix shell to be used, i.e., #!/bin/bash. Every script should include the ”--nodes”, ”--ntasks”, and ”--time” directives, which inform Slurm of the total number of nodes assigned to the job, the total number of processes run across all nodes, and the maximum time limit for the job in HH:MM:SS format, respectively. Afterward, load the compiler and MPI modules by ”module load” and run the executable MPI program by ”mpirun” with four total processors (i.e., ”-np 4”). Note that the number of total processors (denoted as np) should be identical to the number of tasks (denoted as ntasks). The script file may differ depending upon the local HPC platform. Each parallel example requires a different number of processors and wall-time; thus we provide the corresponding script files for all examples in full detail in the supplementary material.
The P-FHDI prepares three output files that cover all results available in the serial FHDI in R. The first output file, named fimp.data.txt, presents the imputation results with fractional weights. A part of fimp.data.txt of the P-FEFI is given below, where FID denotes the local serial index and FWT denotes the fractional weight assigned to each donor:
ID FID WT FWT y1 y2 y3 y4
1 1 1 0.5 1.47963 2.15086 2.49344 1.89421
1 2 1 0.5 1.47963 2.15086 2.88165 1.89421
2 1 1 0.2 −0.09087 1.1415 1.60253 −1.03695
2 2 1 0.2 −1.67006 1.1415 1.60253 −1.03695
2 3 1 0.2 −0.39303 1.1415 1.60253 −1.03695
A part of fimp.data.txt of the P-FHDI is given below for comparison. The P-FEFI takes all possible imputed values as donors, whereas the P-FHDI takes M selected donors from all possible imputed values.
ID FID WT FWT y1 y2 y3 y4
1 1 1 0.5 1.47963 2.15086 2.49344 1.89421
1 2 1 0.5 1.47963 2.15086 2.88165 1.89421
2 1 1 0.2 −0.09087 1.1415 1.60253 −1.03695
2 2 1 0.2 −1.67006 1.1415 1.60253 −1.03695
2 3 1 0.2 −0.39303 1.1415 1.60253 −1.03695
The second output file, named simp.data.txt, presents the final imputed matrix in the same size as y. For instance, y[1, 3] has a value of 0, indicating a missing unit. There exist two imputed values for y[1, 3] in the fimp.data.txt file, with FIDs of 1 and 2 and imputed values of 2.49344 and 2.88165. Considering sampling weights and fractional weights, the final imputed value compensated for y[1, 3] will be:
1 × 0.5 × 2.88165 + 1 × 0.5 × 2.49344 = 2.68754
as shown in the simp.data.txt file of the P-FEFI below:
Mean estimates:
0.904923 1.866880 1.818840 −0.031939
Variance Result:
0.016269 0.014567 0.018769 0.016793
s op imputation:
1
i option merge:
0
M:
5
For the P-FHDI, it is:
Mean estimates:
0.899718 1.86427 1.81198 −0.0278538
Variance Result:
0.016730 0.014725 0.019323 0.017598
s op imputation:
2
i option merge:
0
M:
5
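The compensation rule above (the sum of sampling weight × fractional weight × donor value over a recipient's donors, whose fractional weights sum to one) can be reproduced directly from the fractional-weight output rows for y[1, 3]:

```python
# Donor rows for recipient ID 1, variable y3, from the fractional-weight
# output above: (WT, FWT, imputed value).
donors = [(1.0, 0.5, 2.49344),
          (1.0, 0.5, 2.88165)]

# Fractional weights of a recipient sum to 1, so this weighted sum is the
# final value written to the imputed matrix (2.68754 in the text).
final = sum(wt * fwt * y for wt, fwt, y in donors)
```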
If the program does not successfully converge in a limited time, a debug.txt file will be provided with possible reasons and
corresponding suggestions.
daty
1 138 1022 6.25
1 109 1022 0
1 105 1023 8.93
1 0 1024 10.72
0 140 1026 17.43
... ... ... ...
The ”datr.txt” file containing the response indicator matrix of Air Quality:
datr
1 1 1 1
1 1 1 0
1 1 1 1
1 0 1 1
0 1 1 1
... ... ... ...
INPUT INFORMATION
i option read data 1
i option perform 1
nrow 41757
ncol 4
i option imputation 2
i option variance 1
i option merge 0
donor 5
i user defined datz 0
i option collapsing 0
i option SIS type 3
END INPUT INFORMATION
category
3 3 3 3
NonCollapsible categorical
1 0 0 0
weight
1
1
1
1
...
Subsection A explicitly explains the available options for the required settings in the input files. The number of categories for each variable is specified after ”category”. Note that the first variable of Air Quality is a categorical variable; thus one can set the corresponding position to 1 after ”NonCollapsible categorical” to declare its special variable type. Sampling weights of all observations are set to ones by default after ”weight”. We recommend that users download an input file from the supplementary material and build their own on top of it.
Secondly, one needs to prepare a script file (named ”run.sbatch”) for the workload manager of the HPC system to submit a parallel-computing task. The job script can be either modified on top of the example scripts in the supplementary material or generated by the Slurm Job Script Generator for Condo in [4]. Finally, users need to copy the input files, the script file, and the C++ files of the P-FHDI to the same directory. One can follow the procedures in Subsection B to compile the program and submit a parallel-computing task via the HPC system.
where ψ1(n, p) ∈ N is a monotonically increasing function dependent on n and p if v = 0. Since ψ1(n, p) is implicit, we make an empirical approximation of ψ1(n, p) ∝ n ln(p) regarding Fig. 13. Note that ψ1(n, p) will be significantly reduced with an appropriate choice of v. Similarly, the communication cost corresponding to Ci^(1) and Ci^(2) is Hi = ψ1(n, p)(Hi^(1) + Hi^(2)), where

Hi^(1) = (Q − 1)(2L + βñp) + 2L + βñp    (II.5)

Hi^(2) = (Q − 1)(3L + 4βñ) + 3L + 4βñ,   if v = 0
Hi^(2) = (Q − 1)(4L + 4βñ + 2βnv) + 4L + 4βñ,   if v ≠ 0    (II.6)

Thus, we have

Hi = ψ1(n, p){(Q − 1)(5L + βñp + 4βñ) + 5L + 4βñ + βñp},   if v = 0
Hi = ψ1(n, p){(Q − 1)(6L + βñp + 4βñ + 2βnv) + 6L + 4βñ + βñp},   if v ≠ 0    (II.7)
(ii) Estimation of cell probability: We analyze the computational cost of the estimation of cell probability, which mainly consists of the identification of unique patterns (denoted as Cii^(1)) and the iterations of updating conditional probability and sampling weight in line 2 of Algorithm 6 (denoted as Cii^(2)) such that Cii = Cii^(1) + ψ2(η)Cii^(2), where

Cii^(1) = αn²p    (II.8)

Cii^(2) = α(nñp + nñ + n²/(Q − 1))    (II.9)
Fig. 13: (a) Impact of the number of variables on the number of iterations (i.e., ψ1 (n, p)) in cell construction of the big-n
algorithm with datasets U(10000, p, 0.15) by varying p and (b) impact of the number of instances on the number of iterations
in cell construction of the big-n algorithm with datasets U(n, 12, 0.15) by varying n.
so we have

Cii = αn²p + αψ2(η)(nñp + nñ + n²/(Q − 1))    (II.10)
where ψ2 (η) is a quadratic and monotonically increasing function dependent on η. We make an empirical approximation
of ψ2 (η) ∝ η 2 regarding Fig. 14. Then the communication cost corresponding to updating conditional probability and
sampling weight turns out to be
Hii = ψ2 (η){(Q − 1)(L + βn) + L + βn} (II.11)
Fig. 14: Impact of the missing rate on the number of iterations (i.e., ψ2 (η)) in the estimation of cell probability of the big-n
algorithm with datasets U (106 , 4, η) by varying η.
(iii) Imputation: The majority of the computational cost of imputation lies in preparing the outputs of imputed values described in line 8 of Algorithm 8 as

Ciii = αn²/(Q − 1)    (II.12)

then its communication cost turns out to be

Hiii = 2(Q − 1)(L + βn) + 2L + 2βn    (II.13)
(iv) Variance estimation: The variance estimation consists of the function Rep CellP (denoted as Civ^(1)) in Algorithm 10 and the computation of Jackknife variance (denoted as Civ^(2)) in lines 3 to 11 of Algorithm 9 such that Civ = Civ^(1) + Civ^(2). We have Civ^(1) as

Civ^(1) = (αñ/(Q − 1)){n²p + ψ2(η)(nñp + nñ + n log n)},   if v = 0
Civ^(1) = 0,   if v ≠ 0    (II.14)

Function Rep CellP computes ñ/(Q − 1) iterations of the estimation of cell probability. The first part inside the curly bracket corresponds to the computational cost of the identification of unique patterns, and the second part is the computational cost of the iterations of estimating cell probability in line 3 of Algorithm 10. Note that the function Rep CellP is embedded into the computation of Jackknife variance such that Civ^(1) = 0 if the big-p algorithm is activated by setting v ≠ 0. The Civ^(2) is given by
Civ^(2) = (αn/(Q − 1))[ń log ń + n²p],   if v = 0
Civ^(2) = (αn/(Q − 1))[ń log ń + n²p + n²p + ψ2(η)(nñp + nñ + n log n)],   if v ≠ 0    (II.15)

where ń is the size of imputed values in line 8 of Algorithm 8. The first part inside the square bracket corresponds to updating the fractional weight in line 8 of Algorithm 9. The second part comes from the computation of the FHDI estimator in line 10 of Algorithm 9. In the case of v ≠ 0, the function Rep CellP is embedded into the main loop. From (II.14) and (II.15), we have

Civ = (α/(Q − 1))[ñ{n²p + ψ2(η)(nñp + nñ + n log n)} + n{ń log ń + n²p}],   if v = 0
Civ = (αn/(Q − 1))[ń log ń + 2n²p + ψ2(η)(nñp + nñ + n log n)],   if v ≠ 0    (II.16)

Similarly, we have the corresponding communication cost Hiv = Hiv^(1) + Hiv^(2), where

Hiv^(1) = (Q − 1)(2L + 2βñ²) + 2L + 2βñ²,   if v = 0
Hiv^(1) = 0,   if v ≠ 0    (II.17)

and

Hiv^(2) = (Q − 1)L + βnp    (II.18)

Therefore,

Hiv = (Q − 1)(3L + 2βñ²) + 2L + 2βñ² + βnp,   if v = 0
Hiv = (Q − 1)L + βnp,   if v ≠ 0    (II.19)
By substituting C and H of each segment of the P-FHDI derived above, we have a lengthy expression of the total running time T. Considering the dominating terms O(n³) and ψ1(n, p) in the expression of T, we can simplify the total running time T in terms of the worst scenario for a better interpretation as

T(Q) = Ci + Cii + Ciii + Civ + Hi + Hii + Hiii + Hiv    (II.20)
     ≈ α′/Q + β′Q    (II.21)

where α′ = αψ1(n, p)O(n²) + αO(n³) and β′ = βψ1(n, p)O(n). By substituting Q with cpQ, where cp ∈ N⁺, the scalability with total cpQ cores is rephrased as

T(Q)/T(cpQ) = cp × (α′ + Q²β′)/(α′ + cp²Q²β′)    (II.22)
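The simplified model (II.21) captures the classic tradeoff behind speedup saturation: the computation term shrinks as 1/Q while the communication term grows linearly in Q. A short Python sketch (alpha_p and beta_p are arbitrary placeholder constants, not measured values) that also cross-checks the closed form (II.22):

```python
# Simplified running-time model of Eq. (II.21): T(Q) ~ alpha'/Q + beta'*Q.
def T(Q, alpha_p, beta_p):
    return alpha_p / Q + beta_p * Q

# Scalability of Eq. (II.22) when the core count grows from Q to cp*Q.
def ratio(Q, cp, alpha_p, beta_p):
    return cp * (alpha_p + Q**2 * beta_p) / (alpha_p + cp**2 * Q**2 * beta_p)

alpha_p, beta_p = 1.0e6, 10.0   # placeholder constants for illustration
Q, cp = 16, 4

# The closed form (II.22) agrees with evaluating T directly.
assert abs(T(Q, alpha_p, beta_p) / T(cp * Q, alpha_p, beta_p)
           - ratio(Q, cp, alpha_p, beta_p)) < 1e-9

# The model balances computation and communication at Q* = sqrt(alpha'/beta'),
# beyond which communication dominates and speedup saturates.
Q_star = (alpha_p / beta_p) ** 0.5
```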
As a supplement to Section 5 of the paper, we provide the speedups and running times of the imputation or variance estimation of the P-FHDI separately in Figs. 15 to 18. Note that cell construction, estimation of cell probability, and imputation are compressed into a single part named the imputation herein.
Fig. 15: Impact of the number of instances n on (a) speedup and (b) running time of the imputation only with datasets
U(n, 4, 0.25): 4 variables and 25% missing rate by varying n.
Fig. 16: Impact of the number of instances n on (a) speedup and (b) running time of the variance estimation only with datasets
U(n, 4, 0.25): 4 variables and 25% missing rate by varying n.
Fig. 17: Impact of the missing rate η on (a) speedup and (b) running time of the imputation only with datasets U(1M, 4, η):
1 million instances and 4 variables by varying η.
Fig. 18: Impact of the missing rate η on (a) speedup and (b) running time of the variance estimation only with datasets
U(1M, 4, η): 1 million instances and 4 variables by varying η.
Let EI(ỸFHDI) be an expectation with respect to the imputation mechanism of FHDI; we have EI(ỸFHDI) = ỸFEFI. Taking the expectation of ỸFEFI, we have

E(ỸFEFI) = E{E(ỸFEFI | FN)}
         = E(YN) + E{Σ_{g=1}^G Σ_{i∈Ug} (Rg⁻¹δi − 1)(yi − µg)}    (III.6)

where Rg = N_Rg/Ng, FN is the set of the finite population, and the second term of Eq. (III.6) is 0. Therefore,

E(ỸFHDI) = E(YN) + E{Σ_{g=1}^G Σ_{i∈Ug} (Rg⁻¹δi − 1)(yi − µg)}    (III.7)
Considering Eq. (22) in the paper, we have V̂(ŶFHDI) for each variable as

V̂(ŶFHDI) = Σ_{k=1}^n ck (ŶFHDI^(k) − ŶFHDI)²    (III.8)

where ck = (n − 1)/n and ŶFHDI^(k) is

ŶFHDI^(k) = Σ_{g=1}^G Σ_{i∈Ag} wi^(k) {δi yi + (1 − δi) Σ_{j=1}^M wij*(k) yi*(j)}    (III.9)
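Eq. (III.8) is a delete-one Jackknife with ck = (n − 1)/n. A generic numeric sketch applied to the plain sample mean (for illustration only, not the FHDI estimator itself); for the mean, the Jackknife variance reproduces s²/n exactly:

```python
import numpy as np

def jackknife_variance(y):
    # Delete-one Jackknife of Eq. (III.8) with c_k = (n - 1)/n, applied
    # here to the plain sample mean for illustration only.
    n = len(y)
    theta = y.mean()
    theta_k = np.array([np.delete(y, k).mean() for k in range(n)])
    return ((n - 1) / n) * np.sum((theta_k - theta) ** 2)

rng = np.random.default_rng(1)
y = rng.normal(size=200)
# For the sample mean, the Jackknife reproduces s^2/n exactly.
assert np.isclose(jackknife_variance(y), y.var(ddof=1) / len(y))
```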
By substituting Eqs. (III.9) and (III.10), we have V̂(ŶFHDI), assuming wi^(k) ≈ wi if n ≫ 1:

V̂(ŶFHDI) = Σ_{k=1}^n ck [Σ_{g=1}^G Σ_{i∈Ag} wi (1 − δi) Σ_{j=1}^M {wij*(k) − wij*} yi*(j)]²
          = Σ_{k=1}^n ck [Σ_{i∈AM} wi Σ_{j=1}^M {wij*(k) − wij*} yi*(j)]²    (III.11)
Note that the wij*(k) for replicate k ∈ AR will be updated by

wij*(k) = wij* − wij,FEFI*,   if j = r
wij*(k) = wij* + wir,FEFI* · wij,FEFI*/Σ_{j≠r} wij,FEFI*,   if j ≠ r
wij*(k) = wij*,   if k ∈ AM    (III.12)

where r is the index of the closest donor to k and wij,FEFI* is defined in Eq. (16) of the paper. Considering the update of wij*(k), we have
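The update in (III.12) removes the FEFI weight of the deleted donor r and redistributes it over the remaining donors in proportion to their FEFI weights, so each recipient's total fractional weight is preserved. A small numpy sketch with hypothetical weights (not from the paper's example):

```python
import numpy as np

def replicate_weights(w, w_fefi, r):
    # Update of Eq. (III.12) for one recipient when donor r is deleted in
    # replicate k: subtract w_fefi[r] from donor r and redistribute it over
    # the other donors in proportion to their FEFI weights.
    wk = w.copy()
    wk[r] -= w_fefi[r]
    others = np.arange(len(w)) != r
    wk[others] += w_fefi[r] * w_fefi[others] / w_fefi[others].sum()
    return wk

w      = np.array([0.4, 0.3, 0.2, 0.1])      # hypothetical FHDI fractional weights
w_fefi = np.array([0.25, 0.25, 0.25, 0.25])  # hypothetical FEFI weights
wk = replicate_weights(w, w_fefi, r=0)
# Total fractional weight of the recipient is preserved.
assert np.isclose(wk.sum(), w.sum())
```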
V̂(ŶFHDI) = Σ_{k=1}^n ck [Σ_{i∈AM} wi {(wir* − wir,FEFI* − wir*) yi*(r) + Σ_{j≠r}^M (wij* + wir,FEFI* · wij,FEFI*/Σ_{j≠r} wij,FEFI* − wij*) yi*(j)}]²

          = Σ_{k=1}^n ck [Σ_{i∈AM} wi {−wir,FEFI* yi*(r) + Σ_{j≠r}^M (wir,FEFI* · wij,FEFI*/Σ_{j≠r} wij,FEFI*) yi*(j)}]²

          = Σ_{k=1}^n ck [Σ_{i∈AM} wi wir,FEFI* {Σ_{j≠r}^M (wij,FEFI*/Σ_{j≠r} wij,FEFI*) yi*(j) − yi*(r)}]²    (III.13)
Fig. 19: Average errors of (a) the normalized mean estimator (b) the normalized SE of the mean estimator versus an increasing
proportion of outliers in fully observed observations by 0.5%, 1%, 2.5%.
Fig. 20: Average errors of (a) the normalized mean estimator (b) the normalized SE of the mean estimator versus an increasing
coefficient of MD threshold by 2, 4, and 8.
A variant based on the Minimum Covariance Determinant was proposed in [7] to detect multivariate outliers robustly. PCout is a computationally fast procedure that utilizes simple properties of principal components to identify outliers in the transformed space [8]. We apply the Mahalanobis distance (denoted as MD) to each observation to measure standard deviations away from the mean of the data distribution in Eq. (22). The criterion for outlier detection can be formulated as follows: an observation is considered an outlier if

√{(x − µ)ᵀ Σ⁻¹ (x − µ)} > εMD    (IV.1)

where εMD is an MD threshold depending on the input data. Filzmoser [9] proposed an adaptive method to find the MD threshold such that extreme values should not be regarded as outliers.

There may be differences if outlier treatments (i.e., trimming, winsorization, and robust estimation) are carried out. Trimming discards the observations whose MD is higher than εMD. Winsorization replaces observations whose MD is higher than εMD with the observation that has the maximum MD such that MD ≤ εMD. The goal of robust estimation is to find a fit that is robust to outliers. An overview of several robust methods for handling outliers, e.g., the least trimmed squares estimator and the minimum covariance determinant, was given in [10], which also discussed robust linear regression, robust principal component analysis, and robust classification individually.
We focus on the effect of trimming and winsorization on the P-FHDI in the presence of outliers. Let the maximum MD of the fully observed observations be εMD, and let the coefficient of the MD threshold be the ratio of the average MD of all outliers to εMD. The simulations consist of the following steps:
(1) Generate a synthetic dataset that has 1000 observations and 4 variables with 10% missingness.
(2) Assign outliers to the fully observed observations.
(3) Apply the different outlier treatments to the dataset: no outlier treatment, trimming, and winsorization.
(4) Perform the P-FHDI on datasets with the different outlier treatments, respectively.
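The detection and treatment steps above can be sketched in a few lines of numpy; the fixed threshold εMD = 8 and the synthetic data are illustrative assumptions (the simulations use the paper's own data generator and threshold definition):

```python
import numpy as np

def mahalanobis(X, mu, Sinv):
    # MD of Eq. (IV.1): sqrt((x - mu)^T Sigma^{-1} (x - mu)) per observation.
    d = X - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", d, Sinv, d))

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))
X[0] = 25.0                       # inject one gross outlier

mu = X.mean(axis=0)
Sinv = np.linalg.inv(np.cov(X, rowvar=False))
md = mahalanobis(X, mu, Sinv)
eps_md = 8.0                      # illustrative fixed threshold, not the adaptive one of [9]

outliers = md > eps_md
X_trim = X[~outliers]             # trimming: discard flagged observations

# Winsorization: replace flagged observations with the observation that
# has the maximum MD such that MD <= eps_md.
cap_row = X[~outliers][md[~outliers].argmax()]
X_win = X.copy()
X_win[outliers] = cap_row
```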
Figs. 19(a) and 20(a) show the average errors of the mean estimator normalized by the mean of the full data. Figs. 19(b) and 20(b) show the average errors of the SE of the mean estimator normalized by the SE of the full data. The average errors of the estimates of the P-FHDI without outlier treatment increase slightly with a growing proportion of outliers, whereas trimming and winsorization show sharply increasing errors. Likewise, the average errors of the estimates of the P-FHDI without outlier treatment are stable against a growing coefficient of the MD threshold, whereas trimming and winsorization again show sharply increasing errors. Overall, the P-FHDI without an outlier treatment is quite robust against an increasing proportion of outliers or coefficient of the MD threshold compared with trimming and winsorization. The P-FHDI with trimming appears to be more sensitive to outliers than with winsorization.
References
[1] I. Cho, “Data-driven, computational science and engineering platforms repository,” 2017. [Online]. Available:
https://sites.google.com/site/ichoddcse2017/home/type-of-trainings/parallel-fhdi-p-fhdi-1
[2] I. Song, Y. Yang, J. Im, T. Tong, C. Halil, and I. Cho, “Impacts of fractional hot-deck imputation on learning and prediction of engineering data,” IEEE
Transactions on Knowledge and Data Engineering, 2019.
[3] J. Im, I. Cho, and J. K. Kim, “An R package for fractional hot deck imputation,” The R Journal, vol. 10, no. 1, pp. 140–154, 2018.
[4] Condo, “Condo2017: Iowa state university high-performance computing cluster system,” 2017. [Online]. Available:
https://www.hpc.iastate.edu/guides/condo-2017/slurm-job-script-generator-for-condo
[5] J. Im, J. K. Kim, and W. A. Fuller, “Two-phase sampling approach to fractional hot deck imputation,” In Proceedings of Survey Research Methodology
Section, pp. 1030–1043, 2015.
[6] J. K. Kim and W. Fuller, “Fractional hot deck imputation,” Biometrika, vol. 91, no. 3, pp. 559–578, 2004.
[7] C. Leys, O. Klein, Y. Dominicy, and C. Ley, “Detecting multivariate outliers: Use a robust variant of the Mahalanobis distance,” Journal of Experimental
Social Psychology, vol. 74, pp. 150–156, 2018.
[8] P. Filzmoser, R. Maronna, and M. Werner, “Outlier identification in high dimensions,” Computational Statistics & Data Analysis, vol. 52, no. 3, pp.
1694–1711, 2008.
[9] P. Filzmoser, “A multivariate outlier detection method,” Proceedings of the Seventh International Conference on Computer Data Analysis and Modeling,
vol. 1, pp. 18–22, 2004.
[10] P. J. Rousseeuw and M. Hubert, “Robust statistics for outlier detection,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1,
no. 1, pp. 73–79, 2011.