


(in-press) IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, DOI: 10.1109/TKDE.2020.3029146

Parallel Fractional Hot-Deck Imputation and Variance Estimation for Big Incomplete Data Curing

Yicheng Yang, Jae Kwang Kim, and In Ho Cho, Member, IEEE

Abstract—The fractional hot-deck imputation (FHDI) is a general-purpose, assumption-free imputation method for handling multivariate missing data by filling each missing item with multiple observed values without resorting to artificially created values. The corresponding R package FHDI (J. Im, I. Cho, and J. K. Kim, "An R package for fractional hot deck imputation," R J., vol. 10, no. 1, pp. 140-154, 2018) is general and efficient, but it is not adequate for tackling big incomplete data because of its excessive memory requirement and long running time. As a first step toward curing big incomplete data with the FHDI, we developed a new, parallel version of the fractional hot-deck imputation program (named P-FHDI) suitable for curing large incomplete datasets. Results show a favorable speedup when the P-FHDI is applied to big datasets with up to millions of instances or 10,000 variables. This paper explains the detailed parallel algorithms of the P-FHDI for datasets with many instances (big-n) or high dimensionality (big-p) and confirms the favorable scalability. The proposed program inherits all the advantages of the serial FHDI and enables parallel variance estimation, which will benefit a broad audience in science and engineering.

Index Terms—Parallel fractional hot-deck imputation, incomplete big data, multivariate missing data curing, parallel Jackknife variance estimation

1 INTRODUCTION

The incomplete data problem is pandemic in nearly all scientific and engineering domains. Inadequate handling of missing data may lead to biased or incorrect statistical inference and harm subsequent machine learning [2]. In "imputation" methods, an active research area of missing-data curing, two major questions have arisen and been addressed over the past decades. These questions involve "accuracy" and "computational efficiency": how can missing values be handled while minimizing the loss of accuracy, and is there software available for handling missing values?

There is a variety of approaches regarding the first question. Simple approaches available in the literature include removal of instances with missing values [3] and pairwise deletion [4]. Yet, the report by the American Psychological Association strongly discourages the removal of missing values, which seriously biases sample statistics [5]. A relatively better simple strategy is to replace the missing values with the conditional expected values obtained from a probabilistic model of the incomplete data, which is subsequently fed into particular learning models [6]. Recently, theoretical approaches such as model-based methods [7] or the use of imputation theory have received great attention. Imputation theory is essential for replacing a missing value with statistically plausible values. In terms of the number of plausible values for each missing item, there are two distinct branches: single imputation and repeated imputation. Among the many methods of the "repeated imputation" paradigm, multiple imputation and fractional imputation have been widely used and investigated in the literature. Multiple imputation (MI), proposed by Rubin [8], retains the advantages of single imputation but overcomes its downside by replacing each missing item with several values representing the distribution of the possible values. MI can handle multivariate missing data using chained equations (MICE), where the imputed values are generated from a set of conditional densities, one for each variable with missing values [9]. Many existing packages support MI and MICE on different platforms, e.g., SOLAS and SAS [10]. However, an inappropriate choice of model for MI may be harmful to its performance. Furthermore, the so-called "congeniality" and "self-efficiency" conditions required for the validity of MI can be quite restrictive in practice [11], [12].

Fractional hot-deck imputation (FHDI), proposed by [13] and extensively discussed by [14] and [15], is a relatively new method to handle item non-response in survey sampling; it creates several imputed values assigned with fractional weights for each missing instance [16]. A serial-version R package FHDI was developed by some of the authors of this study [1] to perform fractional hot-deck imputation as well as the fully efficient fractional imputation (FEFI) [17]. The FHDI can cure multivariate, general incomplete datasets with irregular missing patterns without resorting to distributional assumptions or expert statistical models.

Yicheng Yang and In Ho Cho are with the Department of Civil Engineering, Iowa State University, Ames, IA 50011. E-mail: {yicheng, icho}@iastate.edu.
Jae Kwang Kim is with the Department of Statistics, Iowa State University, Ames, IA 50011. E-mail: jkim@iastate.edu.

Manuscript received 2 Dec. 2019; revised 13 July 2020; accepted 30 Sept. 2020. (Corresponding author: In Ho Cho.) Recommended for acceptance by M. A. Cheema. Digital Object Identifier no. 10.1109/TKDE.2020.3029146.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

The FHDI also offers variance estimation using the Jackknife replication method. Besides, other popular imputation methods include multiple imputation with predictive mean matching (PMM) [18], sequential regression multivariate imputation [19], Fuzzy C-Means (FCM) imputation [20], and K-Harmonic mean imputation [21].

As we enter the new era of big data [22], harnessing the potential benefits of analyzing large amounts of data will positively influence science, engineering, and humanity. However, curing big incomplete data remains a formidable challenge. To the best of our knowledge, a powerful, general-purpose imputation software for big incomplete data is still beyond the reach of global researchers. Recently, in some specific areas, parallel computing techniques have gradually been adopted for imputation. As a first attempt, [23] parallelized the single-chain rules of the sampler by using parallel Markov chain Monte Carlo for Bayesian imputation of missing data. The computational methodology of disk-based shared memory leads to decent scalability. However, a negative aspect of this approach is the need for highly customized, setting-specific, parallelized software. [24] developed the so-called PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition (PREDICTD) to impute missing experiments for characterizing the epigenome in diverse cell types. Because of the immensely large genome dimension, it employs a parallel algorithm to distribute the tensor across multiple cluster nodes. However, PREDICTD is restricted to imputation for bio-informatics, and its authors did not report strong parallel efficiency.

Parallelized imputation methods, particularly for untyped genotypes in genetic studies, were investigated by [25]. They applied parallelism to break the entire region into blocks and separately imputed the sub-blocks when the chromosomal regions were larger than a size criterion. They showed that the efficacy of the parallel imputation is significantly better than that of the whole-region imputation. [26] introduced genotype imputation using the parallel processing tool ChunkChromosome, which automatically splits each chromosome into overlapping chunks, allowing the imputation of chromosomes to be run in multiple lower-memory jobs. However, this simple parallelism decreases imputation accuracy at the chunk borders. [27] used a GPU to accelerate parallel genotype imputation. The GPU implementation reaches a ten-times speedup for a small sample, and the speedup gradually increases with the size of the dataset. The method had a limitation on the maximum number of sites to impute due to an excessive memory issue. All of these recent parallel imputations appear to be restricted to applications in particular bio-informatics domains. The Microsoft R Server 2016 has parallelized many predictive models as well as statistical algorithms; still, the latest release does not yet implement any parallel imputation algorithm, and missing values are simply omitted during computation.

Despite its generality and efficiency, the serial version of the FHDI may have poor performance for curing "big" incomplete data due to the required excessive memory and prohibitively long execution time. To transform the FHDI into big data-oriented imputation software, we developed a first version of the parallel fractional hot-deck imputation (named P-FHDI) program, which inherits all the advantages of the serial FHDI and overcomes its computational limitations by leveraging algorithm-oriented parallel computing techniques. Since the algorithmic foundation is the same as that of the serial FHDI, this first version focuses on large data with many instances (the so-called "big-n" data) and data with many variables (the so-called "big-p" data) in Fig. 1. Our program also implements a parallel fully efficient fractional imputation (denoted P-FEFI), but it is not recommended in practical big-data applications since P-FEFI essentially uses all possible donors for each missing value, which may be prohibitively expensive for big data.

Fig. 1. Different types of datasets: (a) n >> K^p; (b) n <= K^p; (c) n and p are both very large, where K is the number of categories of imputation cells.

The significance of this study in the context of data science and machine learning is noteworthy. Reliable imputation may play an important role in potential data-driven research. The impacts of the FHDI on subsequent machine learning have been investigated in [2]. Following the same methodology, we compare the impact of different data-curing methods on machine learning and statistical learning in Fig. 2. The input data are the Energy Efficiency dataset from the UCI Machine Learning Repository [28], which has 768 observations and 9 variables with a 30 percent response rate. Note that all parameters of the machine learning models are default settings. Fig. 2 shows that the FHDI, PMM, and FCM imputations have a noticeable positive influence on the subsequent machine learning and statistical learning compared with the simple naive method, which fills in missing data using the mean values of attributes. The FHDI slightly outperforms the PMM and FCM imputations. Although this relative performance of the FHDI depends on the adopted incomplete data, this result, along with a similar prior investigation [2], underpins the positive impact of the P-FHDI on big data-oriented machine learning and statistical learning.

Fig. 2. The impact of data-curing methods on the subsequent machine learning methods [artificial neural networks (ANN) and extremely randomized trees (ERT)] and statistical learning [generalized additive model (GAM)].
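For context on the baseline used in the comparison above: the "simple naive method" fills each missing entry with the mean of its attribute, while hot-deck methods fill each missing entry with values actually observed in the data. A minimal sketch of the two ideas follows (illustrative only, not the FHDI algorithm; the function names and the toy data are ours):

```python
import random

def mean_fill(rows):
    """Naive baseline in Fig. 2: replace each missing entry (None)
    with the mean of the observed values in its column."""
    p = len(rows[0])
    means = []
    for l in range(p):
        obs = [r[l] for r in rows if r[l] is not None]
        means.append(sum(obs) / len(obs))
    return [[means[l] if r[l] is None else r[l] for l in range(p)]
            for r in rows]

def hot_deck_fill(rows, seed=0):
    """Hot-deck idea: fill a missing entry with a value actually observed
    in the same column (a random donor), never an artificial value."""
    rng = random.Random(seed)
    p = len(rows[0])
    donors = [[r[l] for r in rows if r[l] is not None] for l in range(p)]
    return [[rng.choice(donors[l]) if r[l] is None else r[l]
             for l in range(p)] for r in rows]

data = [[1.0, 2.0], [None, 4.0], [3.0, None]]
print(mean_fill(data))  # [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]
```

The key contrast is that the hot-deck fill can only ever produce values that occur in the observed data, which is the property the FHDI builds on with multiple donors and fractional weights.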

The outline of the paper is structured as follows: we briefly review the backbone theories of the serial FHDI, segmented into (i) cell construction, (ii) estimation of cell probability, (iii) imputation, and (iv) variance estimation. After an instructive introduction to the adopted parallel computing techniques, we explain the key parallel algorithms of the P-FHDI. We validate and evaluate the performance of the P-FHDI with synthetic datasets and propose a cost model. Finally, we introduce an updated approach embedded in the P-FHDI, particularly for imputing big-p datasets. Comprehensive examples in the Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2020.3029146, illustrate how to use the program easily with simple and practical data.

2 KEY ALGORITHMS OF THE SERIAL FHDI

For a comprehensive description of the FHDI and P-FHDI, we provide a universal basic setup throughout. Suppose a finite population $U$ with $p$-dimensional continuous variables $y = \{y_1, y_2, \ldots, y_p\}$, where $y_{il}$ represents the $i$th instance of the $l$th variable, $i \in \{1, 2, \ldots, N\}$ and $l \in \{1, 2, \ldots, p\}$. We then categorize the continuous variables $y$ into discrete variables $z$, so-called "imputation cells," where $z$ takes values within the categories $\{1, 2, \ldots, K\}$ for each variable. Write $y_{i,obs}$ and $y_{i,mis}$ to denote the observed and missing parts of the $i$th row of $y$, respectively; similarly, $z_{i,obs}$ and $z_{i,mis}$ denote the observed and missing parts of the $i$th row of $z$.

Let $A$ be the index set of size $n$ selected from the finite population $U$. Let $A_M$ denote the set of sample indices with missing values, $A_M = \{i \in A : \prod_{l=1}^{p} \delta_{il} = 0\}$; alternatively, the index set with fully observed values is $A_R = \{i \in A : \prod_{l=1}^{p} \delta_{il} = 1\}$. The response indicator $\delta_{il}$ takes the value 1 if $y_{il}$ is observed, and $\delta_{il} = 0$ otherwise. Let $z$ be the discrete values of the continuous variables $y \in \mathbb{R}^{n \times p}$ forming the imputation cells. It consists of the missing patterns $z_M = \{z_i \mid i \in A_M\}$ and the observed patterns $z_R = \{z_i \mid i \in A_R\}$. Let the index set of unique missing patterns be $\tilde{A}_M = \{i, j \in A_M \mid \forall i \ne j,\ z_i \ne z_j\}$ of size $\tilde{n}_M$; alternatively, $\tilde{A}_R = \{i, j \in A_R \mid \forall i \ne j,\ z_i \ne z_j\}$ of size $\tilde{n}_R$ represents the index set of unique observed patterns, such that $\tilde{n} = \tilde{n}_M + \tilde{n}_R$ is the total number of unique patterns. Sequentially, $\tilde{z}_M \in \mathbb{N}^{\tilde{n}_M \times p}$ denotes the unique missing patterns, $\tilde{z}_M = \{z_i \mid i \in \tilde{A}_M\}$, and $\tilde{z}_R \in \mathbb{N}^{\tilde{n}_R \times p}$ denotes the unique observed patterns, $\tilde{z}_R = \{z_i \mid i \in \tilde{A}_R\}$. Note that if $y_{il}$ is missing, an imputed value $y^*_{il}$ will be selected from the $j$th donor to fill it in.

Suppose variable $y_l$ is partitioned into $H_l$ groups such that $z_l$ takes values in $\{1, \ldots, H_l\}$. The cross-classification of all variables forms the imputation cells $z$, and we assume a cell mean model such that

$y \mid (z_1 = H_1, \ldots, z_p = H_p) \sim (\mu_{H_1,\ldots,H_p}, \Sigma_{H_1,\ldots,H_p})$, (1)

where $\mu_{H_1,\ldots,H_p}$ is the vector of cell means and $\Sigma_{H_1,\ldots,H_p}$ is the variance-covariance matrix of $y$ in cell $(H_1 \ldots H_p)$. Then, utilizing a finite mixture model under the missing-at-random (MAR) condition, the conditional distribution $f(y_{i,mis} \mid y_{i,obs})$ is approximated by

$f(y_{i,mis} \mid y_{i,obs}) \approx \sum_{g=1}^{H} p(z_i = g \mid y_{i,obs})\, f(y_{i,mis} \mid y_{i,obs}, z_i = g)$, (2)

where $p(z_i = g \mid y_{i,obs})$ is the conditional cell probability and $f(y_{i,mis} \mid y_{i,obs}, z_i = g)$ is the within-cell conditional density of $y_{i,mis}$ given $y_{i,obs}$ derived from (1). The FHDI consists of four steps: (i) cell construction, (ii) estimation of cell probability, (iii) imputation, and (iv) variance estimation. Each step is illustrated in the following subsections.

2.1 Cell Construction

The pre-determination of the imputation cells had not been discussed by Kim [15]. Im et al. [29] proposed an approach to generate imputation cells $z$ using the estimated sample quantiles. Suppose we wish to categorize $y_l$ into $G$ categories. Consider the estimated distribution function of $y_l$,

$\hat{F}_l(t) = \dfrac{\sum_{i \in A} \delta_{il} w_i I(y_{il} \le t)}{\sum_{i \in A} \delta_{il} w_i}$, (3)

where $I$ is an indicator function taking the value one if true and zero if false, and $w_i$ represents the sampling weight of unit $i$. Given a set of proportions $\{a_1, \ldots, a_G\}$ satisfying $0 < a_1 < \cdots < a_G$, the estimated sampling quantile of $y_l$ for $a_g$ is defined as

$\hat{q}_{a_g} = \min\{t;\ \hat{F}(t) > a_g\}$. (4)

Hence, $y_{il}$ will be categorized into imputation cell $g$ if $\hat{q}_{a_{g-1}} < y_{il} < \hat{q}_{a_g}$.

We propose a new mathematical notation for iterated operations to facilitate the description of the FHDI and P-FHDI. A new mathematical symbol, $\circlearrowleft$, denotes a loop that repeats a sequence of the same operation $S(x)$ with discrete input arguments within a fixed range. The simplest form of the loop symbol for an operation $S$ is

$\mathop{\circlearrowleft}_{i=a}^{b} S(x_i) \equiv \{S(x_a), S(x_{a+1}), \ldots, S(x_{b-1}), S(x_b)\}$, (5)

where $i = a, \ldots, b$. Its two-index extension is

$\mathop{\circlearrowleft}_{i=a}^{b}\ \mathop{\circlearrowleft}_{j=c}^{d} S(x_{ij}) = \begin{bmatrix} S(x_{ac}) & S(x_{a(c+1)}) & \cdots & S(x_{ad}) \\ S(x_{(a+1)c}) & S(x_{(a+1)(c+1)}) & \cdots & S(x_{(a+1)d}) \\ \vdots & \vdots & \ddots & \vdots \\ S(x_{bc}) & S(x_{b(c+1)}) & \cdots & S(x_{bd}) \end{bmatrix}$, (6)

where the integer indices run over $i = a, \ldots, b$ and $j = c, \ldots, d$.

However, the initial discretization may not give at least two donors to each recipient to capture the variability from imputation. Identification of the unique observed patterns $\tilde{z}_R$ and unique missing patterns $\tilde{z}_M$ is crucial for the selection of donors for each recipient. First, we sort the missing patterns $z_M \in \mathbb{N}^{n_M \times p}$ such that $z_M = \{z_i \mid i \in A_M,\ \|z_i\| \le \|z_{i+1}\|\}$. Note that $\|z_i\|$ takes the string format of $z_i$, which is comparable by its numerical value. Thereby, the unique missing patterns $\tilde{z}_M$ are obtained by Eq. (7).
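The pattern sorting, unique-pattern extraction, and donor counting just described can be sketched in a few lines. This is an illustrative, simplified rendition: Python tuples with None marking missing entries stand in for the string encoding $\|z_i\|$, unit weights are assumed, and all names are ours, not those of the P-FHDI code:

```python
def sort_patterns(patterns):
    """Sort categorized rows so that equal patterns become adjacent,
    mirroring the ||z_i|| <= ||z_{i+1}|| string ordering (None = missing)."""
    key = lambda z: "".join("0" if v is None else str(v) for v in z)
    return sorted(patterns, key=key)

def unique_patterns(sorted_patterns):
    """Keep one representative per run of equal neighbors,
    as in the unique-pattern construction."""
    out = []
    for z in sorted_patterns:
        if not out or z != out[-1]:
            out.append(z)
    return out

def count_donors(recipient, unique_observed):
    """Donor count for one recipient: fully observed unique patterns
    that agree with the recipient on its observed entries."""
    return sum(
        all(r is None or r == d for r, d in zip(recipient, donor))
        for donor in unique_observed
    )

zR = unique_patterns(sort_patterns([(1, 1), (2, 1), (1, 1), (3, 1)]))
print(count_donors((None, 1), zR))  # prints 3
```

A recipient such as (None, 1) matches every unique observed pattern whose second entry is 1, which is exactly the matching condition formalized next.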

$\tilde{z}_M = \mathop{\circlearrowleft}_{i=1}^{n_M} z_{Mi}\, I(\|z_{Mi}\| < \|z_{M(i+1)}\|)$, (7)

where $z_{Mi} \in z_M$. Similarly, sort the observed patterns $z_R \in \mathbb{N}^{n_R \times p}$ such that $z_R = \{z_i \mid i \in A_R,\ \|z_i\| \le \|z_{i+1}\|\}$, so that the unique observed patterns $\tilde{z}_R$ follow as

$\tilde{z}_R = \mathop{\circlearrowleft}_{i=1}^{n_R} z_{Ri}\, I(\|z_{Ri}\| < \|z_{R(i+1)}\|)$, (8)

where $z_{Ri} \in z_R$. Let $D_i \in \mathbb{N}^{M_i}$ be the set recording the indices of donors of the $i$th recipient $z_i = \{z_{i,obs}, z_{i,mis}\}$, where $D_i = \{j \in \tilde{A}_R \mid z_{i,obs} = z_{j,obs}\}$ with size $M_i$; $M_i$ represents the number of donors of the $i$th recipient. From $\tilde{z}_M$ and $\tilde{z}_R$, we determine the set of total donor counts of all recipients, $M = \{M_i \mid i \in \tilde{A}_M\}$, by

$M = \mathop{\circlearrowleft}_{i=1}^{\tilde{n}_M} \left\{ \sum_{j=1}^{\tilde{n}_R} I(\|z_{i,obs}\| = \|z_{j,obs}\|) \right\}$. (9)

Note that $\|z_{i,obs}\| = \|z_{j,obs}\|$ requires every entity of the strings $\|z_{i,obs}\|$ and $\|z_{j,obs}\|$ to be identical. Further, the actual index set of donors of all recipients, $L = \{D_i \mid i \in \tilde{A}_M\}$, can be written as

$L = \mathop{\circlearrowleft}_{i=1}^{\tilde{n}_M}\ \mathop{\circlearrowleft}_{j=1}^{\tilde{n}_R} j\, I(\|z_{i,obs}\| = \|z_{j,obs}\|)$. (10)

The minimum entity of $M \in \mathbb{N}^{\tilde{n}_M}$ is denoted by $m$. If $m \ge 2$, the initial generation of imputation cells gives at least two donors as candidates for each recipient. Otherwise, we apply a cell collapsing procedure that adjusts the categories of the imputation cells to produce more donors, stopping only when every recipient has at least two donors. For instance, a dummy $z$ has a recipient (NA, 1), and all donors after the initial data categorization are listed in the left panel of Table 1. The recipient has only one possible donor, (3, 1). Hence, cell collapsing merges the original categories 1 and 2 of p2 into category 1 to complement the deficient donors. Now, (3, 1) and (2, 1) both serve as donors for the recipient (NA, 1).

TABLE 1
An Illustrative Example of Cell Collapsing With Initial Categories of 3 for p1 and p2
Left: initial imputation cells. Right: final imputation cells after cell collapsing.

The number of categories of imputation cells affects the number of unique observed patterns $\tilde{n}_R$, the number of unique missing patterns $\tilde{n}_M$, and the number of iterations of the cell construction. The input dataset in this simulation has 500 observations and 4 variables with 20 percent missingness. Table 2 shows that a growing number of categories increases the number of unique patterns up to some threshold. Meanwhile, it requires a larger number of iterations to guarantee at least two donors for all unique missing patterns. In addition, the number of categories affects subsequent regression analyses. The regression coefficient estimates of y1 given y2 in Fig. 3a show that 12 categories optimize the effect of the FHDI on regression analyses; note that the true values of the slope and intercept are 0.5 and 0, respectively. Fig. 3b shows that a growing number of categories increases the standard errors of the intercept and slope. A similar discussion of the effect of the number of categories on regression in [30] agrees well with our results.

TABLE 2
The Number of Unique Observed Patterns ($\tilde{n}_R$), the Number of Unique Missing Patterns ($\tilde{n}_M$), and the Number of Iterations of the Cell Construction With a Growing Number of Categories

Categories   $\tilde{n}_R$   $\tilde{n}_M$   Iterations
3            54              130             9
6            103             215             112
12           97              243             227
24           90              243             275
35           88              240             286

Fig. 3. Effect of a growing number of categories on (a) regression coefficient estimates and (b) their standard errors of y1 given y2.

2.2 Estimation of Cell Probability

To estimate the conditional cell probability using a modified EM algorithm by weighting [31], we partition $z$ into $G$ groups on the basis of $A_M$, denoted by $z_1, \ldots, z_G$. Let the number of all possible donors of $z_{g,mis}$ be $H_g$. Given a possible imputed value $z^{(h)}_{g,mis}$ for $z_{g,mis}$, the estimated conditional cell probability is defined in the E step as

$\hat{p}^{(t)}_h = \dfrac{\hat{p}^{(t)}(z_{g,obs},\ z_{i,mis} = z^{(h)}_{g,mis})}{\sum_{h=1}^{H_g} \hat{p}^{(t)}(z_{g,obs},\ z_{i,mis} = z^{(h)}_{g,mis})}$, (11)

where $\hat{p}^{(t)}(z)$ is the estimated joint cell probability computed from the full respondents at iteration $t$. The joint cell probability is updated in the M step using the weighting

$\hat{p}^{(t+1)}(z_{g,obs}, z^{(h)}_{g,mis}) = \left(\sum_{i=1}^{n} w_i\right)^{-1} \sum_{i=1}^{n} w_i\, \hat{p}^{(t)}_h\, I(z_{i,obs} = z_{g,obs})$, (12)

where $n$ is the sample size and $w_i$ is the sampling weight of unit $i$. The indicator function $I$ takes the value one if $z_{i,obs} = z_{g,obs}$ and zero otherwise. The EM algorithm terminates upon convergence of $\hat{p}(z_{g,obs}, z^{(h)}_{g,mis})$ or upon reaching the user-defined maximum iteration $\epsilon_{max}$; note that $\sum_{h=1}^{H_g} \hat{p}_h = 1$.
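The E and M steps of Eqs. (11) and (12) can be illustrated on a tiny categorized dataset. The sketch below assumes unit sampling weights ($w_i = 1$) and that every missing pattern has at least one fully observed donor; it is a simplified rendition with names of our choosing, not the P-FHDI implementation:

```python
def em_cell_probabilities(cells, iters=50):
    """EM sketch for joint cell probabilities p_hat(z) when some entries
    of z are missing (None), assuming unit weights w_i = 1.
    cells: list of tuples, e.g. (1, None) means z2 is unobserved."""
    # Candidate complete cells come from the fully observed rows.
    complete = [z for z in cells if None not in z]
    support = sorted(set(complete))
    p = {z: 1.0 / len(support) for z in support}   # uniform start
    n = len(cells)
    for _ in range(iters):
        counts = {z: 0.0 for z in support}
        for z in cells:
            # Donors: complete cells matching the observed part of z.
            donors = [d for d in support
                      if all(a is None or a == b for a, b in zip(z, d))]
            total = sum(p[d] for d in donors)
            for d in donors:
                # E step: fractional membership of z in donor cell d.
                counts[d] += p[d] / total
            # M step below averages these fractional counts over n.
        p = {z: c / n for z, c in counts.items()}
    return p

p = em_cell_probabilities([(1, 1), (1, 2), (2, 2), (1, None), (None, 2)])
```

In this toy example the cell (1, 2) receives fractional mass from both partially observed rows, so its estimated probability ends up the largest, and the estimates always sum to one, mirroring the normalization in Eq. (11).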

2.3 Imputation

It is necessary to compute the fractional weights in advance. Let $w^*_{ij}$ be the fractional weight of the $j$th donor for the $i$th recipient, given by

$w^*_{ij} = \hat{p}^{(t)}(z_{i,mis} \mid z_{i,obs})\, \dfrac{w_j\, I\{z_{i,obs} = z_{j,obs},\ z_{i,mis} = z^*_{j,mis}\}}{\sum_{j=1}^{M_i} w_j\, I\{z_{i,obs} = z_{j,obs},\ z_{i,mis} = z^*_{j,mis}\}}$, (13)

where $z^*_{j,mis}$ is the $j$th imputed value for the $i$th recipient. In addition, we can verify the computation of $w^*_{ij}$ by $\sum_{j=1}^{M_i} w^*_{ij} = 1$, where $M_i$ denotes the number of donors for the $i$th recipient. In FEFI, all respondents are employed as donors for each recipient and assigned fractional weights. Once the donors and the corresponding fractional weights are determined, the FEFI estimator of $Y_l$ can be written as

$\hat{Y}_{l,FEFI} = \sum_{i \in A} w_i \left\{ \delta_{il} y_{il} + (1 - \delta_{il}) \sum_{j \in A} w^*_{ij,FEFI}\, y_{jl} \right\}$, (14)

and the fractional weight of the $j$th donor for the $i$th recipient is defined as

$w^*_{ij,FEFI} = \sum_{g=1}^{G} \sum_{h=1}^{H} a_{ig}\, \hat{p}_h\, \dfrac{w_j \delta_j a_{jgh}}{\sum_{l \in A} w_l \delta_l a_{lgh}}$, (15)

where $a_{jgh} = 1$ if $(z_{j,obs}, z_{j,mis}) = (z_{g,obs}, z^{(h)}_{g,mis})$, and 0 otherwise. However, this requires too much computation for large input data.

Instead of using all the respondents, the FHDI selects $M$ donors among all FEFI donors using a systematic probability-proportional-to-size (PPS) method to compensate for the recipient. First, it sorts all FEFI donors in $y$ values in the half-ascending and half-descending order. Taking $L_1 = 0$, one can construct the interval $(L_1, L_{n_{Rg}})$ by

$L_{j+1} = L_j + M \cdot w^*_{ij,FEFI}$, (16)

for all $j \in [1 : n_{Rg} - 1]$; successive intervals are computed by Eq. (16). Let $RN_g$ be a random number from a uniform distribution $U(0, 1)$. Then, for $i \in A_{Mg}$, the $M$ donors are selected such that

$L_{[j^*]+l-1} \le \dfrac{RN_g + i - 1}{n_{Mg}} \le L_{[j^*]+l}$, (17)

for $l = 1, \ldots, M$. In case $n_{Rg} \le M$, we take all FEFI donors for the recipient. The FHDI estimator of $y_l$ with $M$ selected donors is

$\hat{Y}_{l,FHDI} = \sum_{i \in A} w_i \left\{ \delta_{il} y_{il} + (1 - \delta_{il}) \sum_{j=1}^{M} w^*_{ij}\, y^{(j)}_{il} \right\}$, (18)

where $w^*_{ij} = M^{-1}$ and $y^{(j)}_{il}$ is the $j$th imputed value for $y_{il}$. Afterward, we can prepare the imputation results $\hat{Y} = \{y^{(j)}_{il} \mid i \in \{1, \ldots, n\},\ l \in \{1, \ldots, p\},\ j \in \{1, \ldots, M_i\}\}$ along with the fractional weights $w^*_{ij,FHDI}$ as outputs. Let $\bar{A} \in \mathbb{N}^{n_{\bar{A}}}$ be the index set of $\hat{Y}$, where $n_{\bar{A}} = n_R + \sum_{i=1}^{n_M} M_i$, and let $\bar{A}^* \in \mathbb{N}^{n_{\bar{A}}}$ denote the index set of $\bar{A}$ sorted in ascending order. We can record the index mapping $c \in \mathbb{N}^{n_{\bar{A}}}$ from $\bar{A}$ to $\bar{A}^*$ by

$c = \mathop{\circlearrowleft}_{i=1}^{n_{\bar{A}}} \sum_{j=1}^{n_{\bar{A}}} j\, I(\bar{A}_i = \bar{A}^*_j)$. (19)

Eventually, the imputation results $\hat{Y}$ can easily be reorganized as outputs with regard to the index mapping $c$.

2.4 Variance Estimation

The Jackknife method is implemented for the variance estimation of the FHDI estimator. The Jackknife variance estimator of $\hat{Y}_{l,FHDI}$ is determined by the following steps:

S1: Identify the joint probability $\hat{p}(\tilde{z}_M)$ of the unique missing patterns $\tilde{z}_M$ by

$\hat{p}(\tilde{z}_M) = \mathop{\circlearrowleft}_{j \in \tilde{A}_M} \hat{p}(z_{j,obs}, z_{j,mis}) = \left(\sum_{i=1}^{n} w_i\right)^{-1} \sum_{i \in A}\ \mathop{\circlearrowleft}_{j \in \tilde{A}_M} w_i\, \hat{p}_j\, I(z_{i,obs} = z_{j,obs})$, (20)

where $\hat{p}_j$ represents the conditional probability of the $j$th recipient's donors.

S2: Delete unit $k \in A$. If $k \in A_M$, then $w^*_{ij}$ will not be updated. If the size of $\hat{p}(z_{k,obs}, z_{k,mis})$ is 0, skip to the next iteration.

S3: If $k \notin A_M$, then $w^*_{ij}$ for the replicate $k$ is updated by

$w^{*(k)}_{ij} = \begin{cases} w^*_{ij} - w^*_{ij,FEFI} & \text{if } j = r \\ w^*_{ij} + w^*_{ir,FEFI}\, \dfrac{w^*_{ij,FEFI}}{\sum_{j \ne r} w^*_{ij,FEFI}} & \text{if } j \ne r \\ w^*_{ij} & \text{if } k \in A_M \end{cases}$, (21)

where $i \in A_R$ and $r$ is the index of the donor closest to $k$. Eq. (21) updates the fractional weights and will be used later in the parallel algorithm; for ease of use, it is denoted as FW. However, measuring closeness in terms of the scale of the data is challenging. We employ the Mahalanobis distance (denoted MD) to measure the distance from a replicate to the other $M$ donors by

$MD = \sqrt{(y - \mu)^T S^{-1} (y - \mu)}$, (22)

where $y = \{y_{i1}, \ldots, y_{ip}\}^T$ is the vector of the $i$th observation of $y$, $\mu = \{\mu_1, \ldots, \mu_p\}^T$ is the vector of the mean values of the $p$ variables, and $S$ is the covariance matrix. Hence, $r$ is the index of the smallest value among the MDs.

S4: Considering the updated fractional weight $w^{*(k)}_{ij}$, we can compute the variance $\hat{V}(\hat{Y}_{l,FHDI})$ by

$\hat{V}(\hat{Y}_{l,FHDI}) = \sum_{k=1}^{n} c_k\, (\hat{Y}^k_{l,FHDI} - \hat{Y}_{l,FHDI})^2$, (23)

where $c_k$ denotes the replicate factor associated with $\hat{Y}^k_{l,FHDI}$ or $\hat{Y}^k_{l,FEFI}$, the $k$th replicate estimator of $y_l$ defined in Eq. (24).
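The delete-one structure of Eq. (23) can be made concrete on the simplest possible estimator, a sample mean, with the standard delete-1 replicate factor $c_k = (n-1)/n$. This is a generic Jackknife illustration under those assumptions, not the FHDI replicate machinery:

```python
def jackknife_variance(values):
    """Delete-1 Jackknife variance of the sample mean:
    V_hat = sum_k c_k (theta_hat_k - theta_hat)^2 with c_k = (n-1)/n,
    where theta_hat_k is the mean with the k-th unit deleted."""
    n = len(values)
    theta_hat = sum(values) / n
    total = sum(values)
    v = 0.0
    for k in range(n):
        theta_k = (total - values[k]) / (n - 1)  # k-th replicate estimate
        v += (n - 1) / n * (theta_k - theta_hat) ** 2
    return v

v = jackknife_variance([1.0, 2.0, 3.0, 4.0])  # equals s^2 / n = 5/12 here
```

For the sample mean, this delete-1 Jackknife reproduces the usual variance estimate $s^2/n$; the FHDI replicates in Eq. (24) follow the same delete-one pattern but with the fractionally weighted estimator in place of the mean.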

$\hat{Y}^k_{l,FHDI} = \sum_{i \in A} w^k_i \left\{ \delta_{il} y_{il} + (1 - \delta_{il}) \sum_{j=1}^{M} w^{*(k)}_{ij}\, y^{(j)}_{il} \right\}$, (24)

where $w^k_i$ is the $k$th replicate of the fractional weight $w_{ij}$, and $y^{(j)}_{il}$ is the $j$th imputed value of $y_{il}$; the FHDI estimator itself is defined in Eq. (18). Similarly, we can determine $\hat{V}(\hat{Y}_{l,FEFI})$ using the same equations.

We evaluate the variance estimation via the relative bias of the standard error and the confidence interval. In the simulations, the input data have 500 observations and 4 variables with 30 percent missingness, and the Monte Carlo simulation size is 600. An average relative bias of less than 5 percent often indicates a good variance estimation in Table 3. The average coverage rate of the 95 percent confidence interval is close to 95 percent. Overall, the variance estimation of the FHDI performs in a favorable manner.

TABLE 3
Monte Carlo Results of the Relative Bias of the Standard Error, the Average Length of the 95 Percent Confidence Interval, and Its Coverage Rate Based on 600 Simulation Runs

Variable     Rel. Bias (%)   Ave. Length of CI   Coverage rate
Variable 1   -3.62           0.1025              0.95
Variable 2   -2.83           0.1040              0.93
Variable 3   -7.88           0.1362              0.93
Variable 4   -4.28           0.1356              0.95

3 PARALLEL ALGORITHMS FOR THE FHDI

Fig. 4 shows the two parallel schemes adopted for this study. The two distinct schemes are needed since the primary global loops of the majority of tasks are "implicit," so a direct divide-and-conquer scheme is not applicable; parallelization therefore focuses on the separately parallelizable internal tasks without breaking the implicit loop. In contrast, some embarrassingly parallelizable tasks, such as the Jackknife variance estimation, are tackled by the typical divide-and-conquer scheme. Suppose we have in total $Q$ processors indexed by $\{0, 1, 2, \ldots, Q-1\}$. The first processor, indexed by 0 (called the master processor), collects work from all other processors (called slave processors) via communication with the basic operations MPI_Send and MPI_Recv (denoted MPI_SR in conjunction). The pieces of distributed work are assembled in the master processor by MPI_Gather (denoted as V). The notation $\mathrm{V}_i^j$ ($i \ne 0$) represents the master processor gathering the pieces of work distributed to all slave processors ranking $i$ to $j$. Generally, if the entire domain $\Omega$ of a problem is divided into disjoint domains $\Omega_q$ and no memory overlapping is required, all work is done concurrently by

$\Omega = \mathrm{V}_{q=1}^{Q} \left( \mathop{\circlearrowleft}_{x_i \in \Omega_q} S(x_i) \right)$. (25)

Another challenge of parallel computing is how a problem can be partitioned efficiently so that it is tackled concurrently, and how we can maintain balanced computing. To this end, we employ two work distribution schemes in the P-FHDI, i.e., uniform distribution (denoted UniformDistr) and cyclic distribution (denoted CyclicDistr). They compute the beginning index (denoted $s$) and ending index (denoted $e$) of the distributed work on processor $q$. Fig. 5 illustrates the contrast between the two work distribution schemes visually; they are summarized in Algorithms 1 and 2.

Fig. 5. Different methods of distributing work to the slave processors. A work domain $\Omega$ (represented by an irregular area enclosed by a dashed line) is handled by three slave processors: Slave 1 (downward hatching), Slave 2 (upward hatching), and Slave 3 (horizontal hatching).

Algorithm 1. Uniform Job Distribution
Input: number of total instances $n$, number of available processors $Q$, index of current processor $q$
Output: boundary indices $s$ and $e$
1: $N_1 = \mathrm{floor}(n / (Q - 1))$
2: $N_2 = n - N_1 (Q - 2)$
3: $s = (q - 1) N_1$
4: $e = (q - 1) N_1 + N_2$

The uniform distribution scheme in Algorithm 1 averages the work into $N_1$ pieces over the slave processors and squeezes the residual into the last available processor as $N_2$. The boundary indices $s$ and $e$ of the distributed work on the current processor $q$ are then computed in lines 3 and 4, respectively. Alternatively, the core of the cyclic distribution scheme in Algorithm 2 is to assign the work to the $Q - 1$ slave processors recursively.
Although it is easy to implement the uniform distribu- 506
tion, this uniform decomposition will lead to considerable 507
work imbalance when the problem domain  is severely 508
irregular [32]. See Fig. 5a for an illustration; an irregular 509
Fig. 4. Two parallel schemes: (a) Internal parallelization within the
unbreakable implicit loop; (b) Typical divide and conquer for embarrass- work domain  causes heavy workload in slave processors 510
ingly parallelizable explicit loop. 1 and 2 while slave processor 3 is almost idle. As a 511
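The two job-distribution schemes of Algorithms 1 and 2 can be sketched as follows. This is a minimal Python illustration, not the package's C++/MPI implementation: the function names are ours, and the cyclic scheme is simplified to plain round-robin assignment rather than the split-processor variant of Algorithm 2.

```python
# Illustrative sketch of the two work-distribution ideas (not the authors' MPI code).
# uniform_distr follows Algorithm 1: each of the Q-1 slave processors receives a
# contiguous chunk of N1 = floor(n/(Q-1)) items; the residual items are squeezed
# into the last slave processor's chunk. cyclic_distr is a simplified round-robin
# stand-in for the cyclic scheme: item i goes to slave processor (i % (Q-1)) + 1.

def uniform_distr(n, Q, q):
    """Boundary indices [s, e) of the work assigned to slave processor q (1-based)."""
    N1 = n // (Q - 1)                  # average chunk size over Q-1 slave processors
    s = (q - 1) * N1
    # the last slave processor absorbs the residual n - N1*(Q-2) items
    e = n if q == Q - 1 else q * N1
    return s, e

def cyclic_distr(n, Q, q):
    """Indices assigned to slave processor q under round-robin distribution."""
    return list(range(q - 1, n, Q - 1))

# Example: n = 10 items over Q = 4 processors (1 master + 3 slaves).
chunks = [uniform_distr(10, 4, q) for q in (1, 2, 3)]   # [(0, 3), (3, 6), (6, 10)]
cyclic = [cyclic_distr(10, 4, q) for q in (1, 2, 3)]
```

Both schemes cover every item exactly once; they differ only in how evenly an irregular workload spreads across the slave processors.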

As a successful remedy to this problem, cyclic distribution can effectively balance the work domain among the slave processors. Fig. 5b shows such a balanced situation with all three slave processors being busy. Depending upon the parallel task, we adopt the best choice of the parallel job distributions.

Algorithm 2. Cyclic Job Distribution
Input: number of total instances n, number of available processors Q, index of current processor q, and split processor qs
Output: boundary indices s and e
1: N3 = floor(n / (Q - 1))
2: N4 = (n - N3 qs) / (Q - qs - 1)
3: if q <= qs then
4:   s = (q - 1) N3
5:   e = q N3
6: end if
7: if q > qs then
8:   s = qs N3 + (q - qs - 1) N4
9:   e = qs N3 + (q - qs) N4
10: end if

3.1 Parallel Cell Construction and Estimation of Cell Probability

The determination of the initial imputation cells is computationally cheap, i.e., Eqs. (3) and (4). However, the cell collapsing process may take considerably many iterations to guarantee at least two donors for each recipient, i.e., Eqs. (7) to (10). The current algorithm of cell collapsing is an implicit process which is non-parallelizable. Considering this inevitable obstacle, we employ internal parallelization within the unbreakable implicit iterations in Algorithm 3. In line 2 of Algorithm 3, we categorize the input data y into imputation cells z using the estimated sample quantiles defined in Eq. (4). Before proceeding to line 4 of Algorithm 3, the function ZMAT is explained explicitly in Algorithm 4; it identifies the unique patterns z~M and z~R for the selection of donors. In lines 1 and 2 of Algorithm 4, we employ the cyclic distribution for job division and adjust the boundary indices accordingly. The function tunePoints is frequently used in the P-FHDI to slightly adjust the border indices (s and e) of the work in each processor into the adjusted indices (sa and ea). The current parallel work distribution scheme may otherwise lead to a boundary mismatch issue. For example, suppose we have a sample set {4, 2, 3, 3, 2, 4, 1} indexed by {1, ..., 7} and sorted as {1, 2, 2, 3, 3, 4, 4}. The index mapping from the original sample set to the sorted sample set will be c = {7, 2, 5, 3, 4, 1, 6}. However, the uniform distribution upon four processors in the left panel of Table 4 will result in the incorrect mappings c1 = {7, 2}, c2 = {2, 3}, and c3 = {3, 1, 6}. In contrast, the job distribution after adjustment will guarantee that identical indices are assigned to the same processor, as shown in the right panel of Table 4. Hence the updated mapping will be c1 = {7, 2, 5}, c2 = {3, 4}, and c3 = {1, 6}.

TABLE 4
Uniform Job Distribution (Left) and Uniform Job Distribution After Adjustment by the Function tunePoints (Right), With Slave Processors 1, 2, and 3 (Denoted as S1, S2, and S3)

Left:   S1: 1, 2      S2: 2, 3    S3: 3, 4, 4
Right:  S1: 1, 2, 2   S2: 3, 3    S3: 4, 4

Algorithm 3. Parallel Cell Construction
Input: raw data y
Output: imputation cells z
1: for all i in 0 : max_iteration do
2:   Categorize raw data y into imputation cells z
3:   Invoke function ZMAT(z) -> (z~R, z~M)
4:   Invoke function nDAU(z~R, z~M) -> (M, m, L)
5:   if m < 2 then
6:     Perform cell collapsing on z as described in Table 1
7:   end if
8:   if m >= 2 or i = max_iteration then
9:     break
10:  end if
11: end for

Algorithm 4. Parallel Function ZMAT
Input: imputation cells z
Output: unique observed pattern z~R and missing pattern z~M
1: CyclicDistr(nR) -> (s, e)
2: tunePoints(s, e) -> (sa, ea)
3: for all i in sa : ea do
4:   Identify z~M^(q)
5: end for
6: MPI_SR(z~M^(q))
7: z~M = Union_{q=1}^{Q-1} z~M^(q)
8: CyclicDistr(nM) -> (s, e)
9: tunePoints(s, e) -> (sa, ea)
10: for all i in sa : ea do
11:   Identify z~R^(q)
12: end for
13: MPI_SR(z~R^(q))
14: z~R = Union_{q=1}^{Q-1} z~R^(q)
15: MPI_Bcast(z~M, z~R)

We identify z~M^(q) in each processor q in line 4 of Algorithm 4 by

$$\tilde{z}_M^{(q)} = \sum_{i=s_a}^{e_a} z_{Mi}\, I(\lVert z_{Mi} \rVert < \lVert z_{M(i+1)} \rVert), \quad (26)$$

where z_{Mi} is an element of z_M. After master-slave communications in line 6, we obtain z~M by the gathering operation in line 7. Similarly, we obtain z~R^(q) in each processor in lines 10 to 12 by

$$\tilde{z}_R^{(q)} = \sum_{i=s_a}^{e_a} z_{Ri}\, I(\lVert z_{Ri} \rVert < \lVert z_{R(i+1)} \rVert), \quad (27)$$

where z_{Ri} is an element of z_R. We assemble z~R after communication in lines 13 and 14 and broadcast it to all slave processors in line 15.

Before proceeding to line 5 of Algorithm 3, the function nDAU is explicitly demonstrated in Algorithm 5 to obtain the minimum number of donors m over all recipients. We distribute n~M to each processor cyclically in line 1 of Algorithm 5.
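The boundary adjustment behind Table 4 can be emulated in a few lines of Python. This is an illustrative sketch only: `tune_points` merely mimics the role of the package's tunePoints on plain lists, shifting chunk borders so that equal values of the sorted sample never straddle two processors.

```python
# Sketch of the boundary-mismatch fix illustrated in Table 4 (illustrative only).
# Equal values in the sorted sample must stay on one processor so that each
# processor builds a complete index mapping for its values.

def tune_points(sorted_vals, s, e):
    """Shift chunk borders [s, e) right until they no longer split tied values."""
    n = len(sorted_vals)
    while 0 < s < n and sorted_vals[s] == sorted_vals[s - 1]:
        s += 1
    while 0 < e < n and sorted_vals[e] == sorted_vals[e - 1]:
        e += 1
    return s, e

sample = [4, 2, 3, 3, 2, 4, 1]                       # 1-based indices 1..7 in the paper
order = sorted(range(len(sample)), key=lambda i: sample[i])
sorted_vals = [sample[i] for i in order]             # [1, 2, 2, 3, 3, 4, 4]

# naive uniform borders for 3 slave processors: [0,2), [2,4), [4,7)
borders = [(0, 2), (2, 4), (4, 7)]
adjusted = []
prev_e = 0
for s, e in borders:
    s, e = prev_e, tune_points(sorted_vals, s, e)[1]
    adjusted.append((s, e))
    prev_e = e
adjusted[-1] = (adjusted[-1][0], len(sample))        # last chunk absorbs the tail

# per-processor index mappings (1-based, matching the right panel of Table 4)
mappings = [[order[i] + 1 for i in range(s, e)] for s, e in adjusted]
# mappings == [[7, 2, 5], [3, 4], [1, 6]]
```

With the adjusted borders, each processor holds all occurrences of its values, so the gathered mapping is identical to the serial one.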

The set of the numbers of total donors for all recipients is obtained in line 3 by

$$M^{(q)} = \sum_{i=s}^{e} \Big\{ \sum_{j=1}^{\tilde{n}_R} I(\lVert z_{i,\mathrm{obs}} \rVert = \lVert z_{j,\mathrm{obs}} \rVert) \Big\}. \quad (28)$$

Meanwhile, the actual indices of the donors for all recipients are stored in line 4 by

$$L^{(q)} = \sum_{i=s}^{e} \sum_{j=1}^{\tilde{n}_R} j\, I(\lVert z_{i,\mathrm{obs}} \rVert = \lVert z_{j,\mathrm{obs}} \rVert). \quad (29)$$

Then communication is processed between lines 6 and 8 to assemble M and L. Note that all processors require the results of cell construction to continue with cell probability estimation. Thus we broadcast the results from the master processor to the slave processors in line 9.

After this explicit explanation of the function nDAU in line 4 of Algorithm 3, cell collapsing is performed in case of m < 2 in lines 5 to 7. The recursive iterations terminate if m >= 2 in lines 8 to 10. Otherwise, the iterations continue up to max_iteration (i.e., 2n).

Algorithm 5. Parallel Function nDAU
Input: unique observed pattern z~R and missing pattern z~M
Output: number set of donors M and donor indices L
1: CyclicDistr(n~M) -> (s, e)
2: for all i in s : e do
3:   M^(q).add(M_i)
4:   L^(q).add(D_i)
5: end for
6: MPI_SR(M^(q), L^(q))
7: M = Union_{q=1}^{Q-1} M^(q)
8: L = Union_{q=1}^{Q-1} L^(q)
9: MPI_Bcast(M, L)

Algorithm 6. Parallel Estimation of Cell Probability
Input: imputation cells z, sampling weight w
Output: cell probability p^(z)
1: for t in 0 : max do
2:   Update w and p^(t)
3:   Order(A_p) -> gamma
4:   Rearrange p^(t) by gamma
5:   Compute p^(t+1)(z*)
6:   if p^(t+1) converged then
7:     Stop
8:   end if
9: end for

Similarly, the estimation of cell probability is an implicit and iterative process, which does not support simple parallelism. The EM iterations start in line 1 of Algorithm 6 and terminate if the joint probability of z = (z_obs, z_mis) converges in lines 6 to 8. Let A_p = {1, ..., H^G} be the local index set of the initial conditional probabilities p^(t) with size n_p. We update w in N^{n_p} and p^(t) in R^{n_p} in line 2, where p^(t) is

$$\hat{p}^{(t)} = \frac{\hat{p}^{(t)}(z_{g,\mathrm{obs}},\, z_{i,\mathrm{mis}} = z_{g,\mathrm{mis}}^{(h)})}{\sum_{h=1}^{H_g} \hat{p}^{(t)}(z_{g,\mathrm{obs}},\, z_{i,\mathrm{mis}} = z_{g,\mathrm{mis}}^{(h)})}. \quad (30)$$

Note that p^(t) requires reorganization with respect to the index set A_p = {A_{pi} in A_p | for all i, A_{p(i+1)} > A_{pi}} in the ascending order. Before proceeding to line 4 of Algorithm 6, we explain the function Order in Algorithm 7, which generates the index mapping gamma of A_p in lines 4 to 10 on each processor by

$$\gamma^{(q)} = \sum_{i=s_a}^{e_a} \sum_{j=1}^{n_p} j\, I(A_{pi} = A_{pj}). \quad (31)$$

After communication in line 11, gamma is assembled in line 12. By leveraging gamma in line 4 of Algorithm 6, p^(t) can be rearranged to update p^(t+1)(z*) in line 5 by

$$\hat{p}^{(t+1)}(z^{*}) = \Big(\sum_{i=1}^{n} w_i\Big)^{-1} \sum_{i=1}^{n} w_i\, \hat{p}^{(t)}\, I(z_{i,\mathrm{obs}} = z_{g,\mathrm{obs}}). \quad (32)$$

In line 6, the EM algorithm terminates if p^(t+1)(z*) converges within a threshold epsilon (e.g., 10^-6).

Algorithm 7. Function Order
Input: index set A_p with size n_p
Output: index mapping gamma
1: UniformDistr(n_p) -> (s, e)
2: tunePoints(s, e) -> (sa, ea)
3: Sort(A_p^(q)) -> A*_p^(q)
4: for i in sa : ea do
5:   for j in 0 : n_p do
6:     if A*_{pi} = A_{pj} then
7:       Record j to gamma^(q)
8:     end if
9:   end for
10: end for
11: MPI_SR(gamma^(q))
12: gamma = Union_{q=1}^{Q-1} gamma^(q)

Algorithm 8. Parallel Imputation
Input: raw data y, imputation cells z, number set of donors M, cell probability p^(z_obs)
Output: imputed values Y^
1: for i in 0 : n~M do
2:   Compute w*_{ij}
3:   Sort the FEFI donors
4:   Construct (L_j, L_{nRg})
5:   Select M_i donors
6: end for
7: Order(A) -> c
8: Prepare Y^

3.2 Parallel Imputation

Imputation in the P-FHDI aims at selecting M donors for each recipient in z~M in lines 1 to 6 of Algorithm 8. The FEFI fractional weights for all possible donors assigned to each recipient are computed in line 2. We employ the PPS method to randomly select M donors in lines 3 to 5 such that w*_{ij} = {w*_{ij} | i in A~_M, j in {1, ..., M_i}}. In particular, it sorts all FEFI donors in y values in the half-ascending and half-descending order to construct successive intervals in lines 3 and 4.
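A toy version of the weighted cell-probability update of Eq. (32) may look as follows. This is illustrative only: the real EM in Algorithm 6 iterates over joint cell probabilities with donor-based conditional terms and MPI-distributed index mappings, while here the conditional terms and weights are fixed toy values.

```python
# Toy sketch of the weighted update in Eq. (32) (illustrative; names are ours).
# Cells are encoded as tuples over the OBSERVED part of each instance; w_i are
# sampling weights and p_cond holds the p^(t) terms for each instance.

def update_cell_prob(z_obs, w, p_cond):
    """One pass of Eq. (32): weighted average of p^(t) over matching observed cells."""
    wsum = sum(w)
    p_new = {}
    for g in set(z_obs):
        num = sum(wi * pj for zi, wi, pj in zip(z_obs, w, p_cond) if zi == g)
        p_new[g] = num / wsum
    return p_new

z_obs = [(1,), (1,), (2,), (2,), (2,)]   # observed cell of each instance
w = [1.0, 1.0, 1.0, 1.0, 1.0]            # equal sampling weights
p_cond = [1.0, 1.0, 1.0, 1.0, 1.0]       # p^(t) terms, all 1 in this toy case

p = update_cell_prob(z_obs, w, p_cond)
# with equal weights this reduces to cell frequencies: {(1,): 0.4, (2,): 0.6}
```

In the parallel version, each processor computes this sum over its own index range and the master assembles and broadcasts the result, exactly as in Algorithm 6.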

To prepare the results of the fractionally imputed values Y^, we obtain the index mapping c of the sorted A by the Order function in line 7 by

$$c = \bigcup_{q=1}^{Q-1} \sum_{i=s}^{e} \sum_{j=1}^{n_A} j\, I(A_i = A_j^{*}). \quad (33)$$

Note that the majority of the computational cost occurs in line 7. Finally, Y^ is enumerated in accordance with the index mapping c in line 8.

3.3 Parallel Variance Estimation

The majority of the expensive computation happens in variance estimation because of the Jackknife estimation method of the FHDI. The parallelized variance estimation is summarized in Algorithm 9.

Algorithm 9. Parallel Variance Estimation
Input: raw data y, imputation cells z, sampling weights w, index set A, number of donors M, imputed values Y^
Output: variance V^(Y^_FHDI)
1: Rep_CellP(z) -> p^(z~M)
2: UniformDistr(n) -> (s, e)
3: for all k in s : e do
4:   if p^(z_k) = 0 then
5:     Skip
6:   end if
7:   if n_p > 0 then
8:     FW(w*_{ij}) -> w*_{ij}^(k)
9:   end if
10:  Compute Y^_FHDI^{K,q}
11: end for
12: MPI_SR(Y^_FHDI^{K,q})
13: Y^_FHDI^{K} = Union_{q=1}^{Q-1} Y^_FHDI^{K,q}
14: Compute Y^_FHDI
15: Compute V^(Y^_FHDI)

Algorithm 10. Function Rep_CellP
Input: imputation cells z
Output: cell probability p^(z~M)
1: CyclicDistr(n~M) -> (s, e)
2: for all j in s : e do
3:   Compute p^(q)(z~M)
4: end for
5: MPI_SR(p^(q)(z~M))
6: p^(z~M) = Union_{q=1}^{Q-1} p^(q)(z~M)
7: MPI_Bcast(p^(z~M))

Before proceeding to line 2 of Algorithm 9, the pre-processing function Rep_CellP is explicitly explained; it computes the cell probability for the unique missing patterns recursively. By parallelizing over n~M in line 1 of Algorithm 10, we can determine p^(q)(z~M) on each processor q in line 3 by

$$\hat{p}^{(q)}(\tilde{z}_M) = \sum_{j=s}^{e} \Big\{ \Big(\sum_{i=1}^{n} w_i\Big)^{-1} \sum_{i=1}^{n} w_i\, \hat{p}_j^{*}\, I(z_{i,\mathrm{obs}} = z_{j,\mathrm{obs}}) \Big\}, \quad (34)$$

where p^*_j represents the conditional probability of the jth recipient's donors. Note that each p^*_j has to be rearranged according to its index mapping in Eq. (31) in advance. For instance, let A_p be the index set of p^* and A*_p denote the index set of the sorted A_p. Because of the heavy computational cost of linear search, we implement binary search (denoted as BS) [33] to record the mapping gamma of A_p after sorting. Considering m (i.e., n_p / 2) as a middle index, A*_{pm} denotes the middle element of A*_p. We set a left index L to 1 and a right index R to n_p. If A*_{pm} <= A_{pi}, where i in {1, 2, ..., n_p}, we set L to m; otherwise, we set R to m. If A*_{pm} = A_{pi}, m will be the output of the BS function. Thus, we have gamma^(q) by

$$\gamma^{(q)} = \sum_{i=s_a}^{e_a} \mathrm{BS}(A_{pi}). \quad (35)$$

After communication in line 5, p^(z~M) is assembled in line 6 and broadcast to all slave processors in line 7. After this demonstration of the function Rep_CellP in line 1 of Algorithm 9, by leveraging the uniform job distribution in line 2, we can compute the replicate estimator matrix Y^_FHDI^{K,q} in R^{(e-s) x p} on processor q (note that the superscript K is a simple notation for distinction) in line 10 by

$$\hat{Y}_{\mathrm{FHDI}}^{K,q} = \sum_{k=s}^{e} \hat{Y}_{\mathrm{FHDI}}^{k} = \sum_{k=s}^{e} \Big( \sum_{l=1}^{p} \hat{Y}_{l,\mathrm{FHDI}}^{k} \Big), \quad (36)$$

where Y^k_{l,FHDI} is the kth replicate estimator of y_l. By substituting Eq. (24), we have

$$\hat{Y}_{\mathrm{FHDI}}^{K,q} = \sum_{k=s}^{e} \sum_{l=1}^{p} \Big[ \sum_{i \in A} w_i^{(k)} \Big\{ d_{il} y_{il} + (1 - d_{il}) \sum_{j=1}^{M} w_{ij}^{*(k)} y_{il}^{(j)} \Big\} \Big]. \quad (37)$$

The function FW in Eq. (21) updates w*_{ij}^(k) if k is not in A_M. After communication of Y^_FHDI^{K,q} in line 12, Y^K_FHDI is assembled in line 13. Also, the FHDI estimator Y^_FHDI in R^p is computed in line 14 by substituting Eq. (18):

$$\hat{Y}_{\mathrm{FHDI}} = \sum_{l=1}^{p} \hat{Y}_{l,\mathrm{FHDI}} = \sum_{l=1}^{p} \sum_{i \in A} w_i \Big\{ d_{il} y_{il} + (1 - d_{il}) \sum_{j=1}^{M} w_{ij}^{*} y_{il}^{(j)} \Big\}. \quad (38)$$

Eventually, we determine V^(Y^_FHDI) by

$$\hat{V}(\hat{Y}_{\mathrm{FHDI}}) = \sum_{k=1}^{n} c_k (\hat{Y}_{\mathrm{FHDI}}^{k} - \hat{Y}_{\mathrm{FHDI}})^2, \quad (39)$$

where c_k = (n - 1) / n. Similarly, we can determine V^(Y^_FEFI) using the same algorithm.

4 VALIDATION

To validate the P-FHDI, it should produce the same outputs as the FHDI (e.g., standard errors) on an identical dataset.
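For a simple mean estimator, the Jackknife recipe of Eq. (39) reduces to the following sketch. This is a hedged illustration: the P-FHDI forms each replicate with the fractional weights updated by the FW function rather than by simple deletion.

```python
# Minimal sketch of the Jackknife variance formula in Eq. (39), applied to the
# ordinary sample mean (illustrative; the paper applies it to the FHDI estimator
# with replicate fractional weights).

def jackknife_variance(y):
    n = len(y)
    theta_hat = sum(y) / n                                      # full-sample estimator
    replicates = [(sum(y) - y[k]) / (n - 1) for k in range(n)]  # k-th delete-one estimate
    c_k = (n - 1) / n                                           # Jackknife factor of Eq. (39)
    return c_k * sum((t - theta_hat) ** 2 for t in replicates)

y = [2.0, 4.0, 6.0, 8.0]
v = jackknife_variance(y)   # 5/3: equals s^2/n, the usual variance of the sample mean
```

For the mean this recovers the textbook estimate s^2/n exactly, which is a convenient sanity check before moving to the weighted FHDI replicates.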

The platform used in this paper for all the parallel computations is Condo 2017 of Iowa State University [34], consisting of 192 SuperMicro servers and expandable to 324 servers. Each server has two 8-core Intel Haswell processors, 128 GB of memory, and 2.5 TB of local disk. For a convenient description of the synthetic datasets, let U(n, p, h) denote a finite population with n instances and p variables subject to a missing rate h in proportion.

We generate y_i = (y_{i1}, y_{i2}, y_{i3}, y_{i4}), i = 1, ..., 1000, from y1 = 1 + e1, y2 = 2 + 0.5 e1 + 0.866 e2, y3 = y1 + e3, and y4 = 1 + 0.5 y3 + e4, where e1, e2, e4 are randomly generated from the standard normal distribution, and e3 from the gamma distribution. Missingness is introduced by the Bernoulli distribution as d_{i1} ~ Ber(0.6), d_{i2} ~ Ber(0.7), d_{i3} ~ Ber(0.8), and d_{i4} ~ Ber(0.9), independently for each variable. We create 25 percent missingness in the synthetic dataset.

On the basis of the dataset U(1000, 4, 0.25), we apply the naive estimator, the fractional imputation estimators, and the parallel fractional imputation estimators for contrast. The naive estimator is a simple mean estimator computed using only the observed values. The results in Table 5 show that the P-FEFI produces the same standard errors of y as the serial FEFI. Results generated by the P-FHDI differ slightly from those of the serial FHDI in a tolerable manner. The random number generators used for the selection of donors in the FHDI and P-FHDI are library-dependent, which is the main reason for the tiny residuals. Note that the standard errors of the P-FHDI decrease asymptotically as n increases in Table 6. As expected, both the P-FHDI and P-FEFI often outperform the naive method in terms of standard errors. Some exceptions (e.g., y1 and y4 of Table 6) may be attributed to the large missing rate and the simple synthetic model used in this study. We provide a step-by-step illustration of the use of the P-FHDI in Appendix I, available in the online supplemental material. All the developed codes are shared under GPL-2, and the codes, examples, and documents are available in [35].

TABLE 5
Standard Errors of Naive, Serial, and Parallel Estimators

Estimator   y1       y2       y3       y4
Naive       0.0419   0.0390   0.0490   0.0472
FEFI        0.0367   0.0378   0.0460   0.0457
P-FEFI      0.0367   0.0378   0.0460   0.0457
FHDI        0.0383   0.0372   0.0451   0.0453
P-FHDI      0.0384   0.0370   0.0449   0.0450

TABLE 6
Standard Errors of the Naive and the P-FHDI Estimators With Datasets U(Instances, Variables, Missing Rate)

Data              Method   y1       y2       y3       y4
U(0.1M, 4, 0.25)  Naive    0.0029   0.0039   0.0047   0.0043
                  P-FHDI   0.0037   0.0034   0.0045   0.0045
U(0.5M, 4, 0.25)  Naive    0.0013   0.0018   0.0021   0.0019
                  P-FHDI   0.0017   0.0015   0.0020   0.0020
U(1M, 4, 0.25)    Naive    0.0009   0.0012   0.0015   0.0013
                  P-FHDI   0.0012   0.0011   0.0014   0.0014

To maximize the benefit to researchers, we provide the eleven adopted datasets in the supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2020.3029146, as summarized in Table 7. Synthetic data 1 is the dataset used to validate the P-FHDI. Synthetic data 2 is a big-n dataset with millions of instances. Synthetic data 3 to 5 are used to compare the performance of the big-n and big-p algorithms. We apply Synthetic data 6 to 8 to test the extreme cases of the P-FHDI on high-dimensional datasets. Three practical incomplete datasets, i.e., Air Quality, Nursery, and Appliance Energy, are included [28]. The hybrid dataset (Air Quality) is used to illustrate how to apply the P-FHDI to real data in Appendix I, available in the online supplemental material. Note that Air Quality consists of a non-collapsible categorical variable (y1) and three continuous variables (y2-y4). We test the extension of the P-FHDI to categorical variables with Nursery. We investigate the behavior of the big-p algorithm with a growing number of selected variables using Appliance Energy. Users can practice using the P-FHDI with these example datasets.

TABLE 7
Summary of the Adopted Datasets U(Instances, Variables, Missing Rate)

Dataset            Variable type   Dimension
Synthetic data 1   Continuous      U(1000, 4, 0.25)
Synthetic data 2   Continuous      U(10^6, 4, 0.25)
Air Quality        Hybrid          U(41757, 4, 0.1)
Nursery            Categorical     U(12960, 5, 0.3)
Synthetic data 3   Continuous      U(15000, 12, 0.15)
Synthetic data 4   Continuous      U(15000, 16, 0.15)
Synthetic data 5   Continuous      U(15000, 100, 0.15)
Synthetic data 6   Continuous      U(1000, 100, 0.3)
Synthetic data 7   Continuous      U(1000, 1000, 0.3)
Synthetic data 8   Continuous      U(1000, 10000, 0.3)
Appliance Energy   Continuous      U(19735, 26, 0.15)

5 COST ANALYSIS AND SCALABILITY

It is crucial to build a cost model of the P-FHDI. It will not only evaluate the performance of the P-FHDI but also reveal potential memory risks. Note that when it comes to the big-O notation of a function g(x), we usually refer to the worst-case scenario leading to the upper time bound O(g(x)) such that g(x) < O(g(x)) as x approaches infinity. Considering a constant time for a unit operation in the algorithm, we can estimate the time complexity by counting the number of elementary operations. It is essential to define the elementary costs involved in computation and communication. Let a be the computational cost per unit, b the transfer cost per unit of communication, and L the communication startup cost. The total running time T(Q) with Q available processors consists of the computational cost C and the communication cost H, which reads

$$T(Q) \approx \frac{a'}{Q} + b' Q, \quad (40)$$

where a' = a c1(n, p) O(n^2) + a O(n^3) and b' = b c1(n, p) O(n). The quantity c1(n, p) is the number of total iterations in cell construction, which guarantees at least two donors for each recipient. Because of the implicit nature of cell construction and probability estimation, we can only make the empirical approximation that c1(n, p) is proportional to n ln(p).
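The cost model of Eq. (40) can be explored numerically. The constants a' and b' below are arbitrary placeholders, not measured values; the point is the qualitative shape of T(Q).

```python
# Numerical sketch of the cost model in Eq. (40): T(Q) ~ a'/Q + b'Q, where the
# computational term shrinks with Q and the communication term grows with Q.
# The constants below are arbitrary illustrations, not measured values.

def total_time(Q, a_prime, b_prime):
    return a_prime / Q + b_prime * Q

a_prime, b_prime = 1000.0, 1.0
times = {Q: total_time(Q, a_prime, b_prime) for Q in (1, 2, 4, 8, 16, 32, 64)}
speedups = {Q: times[1] / t for Q, t in times.items()}

# The model predicts diminishing returns: T(Q) is minimized near Q = sqrt(a'/b'),
# after which communication dominates and the speedup degrades.
best_Q = min(times, key=times.get)
```

With these placeholder constants the best processor count among the tested powers of two is 32, close to sqrt(a'/b') = 31.6, illustrating why adding processors beyond the communication-dominated regime hurts rather than helps.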

In conjunction with Eq. (40), we obtain the scalability of T(Q) by

$$\frac{T(Q)}{T(c_p Q)} = c_p \cdot \frac{a' + Q^2 b'}{a' + c_p^2 Q^2 b'}, \quad (41)$$

where c_p is a positive integer. Detailed derivations of Eqs. (40) and (41) are presented in Appendix II, available in the online supplemental material.

To evaluate the performance of the P-FHDI in practice, we perform parametric studies to investigate the impacts of the number of instances n, the number of variables p, and the missing rate h on speedup and running time. We generate synthetic datasets in the form of y_i = (y_{i1}, ..., y_{ip}), i = 1, ..., n, using the same method presented in Section 4, and the missingness is dynamically issued by d_p ~ Ber(h_p). Fixing the other two parameters, nine synthetic datasets are generated by varying the target parameter in Table 8.

TABLE 8
Datasets U(Instances, Variables, Missing Rate) Prepared for Parametric Studies

Parameter   Dataset 1           Dataset 2           Dataset 3
n           U(0.5M, 4, 0.25)    U(0.8M, 4, 0.25)    U(1M, 4, 0.25)
h           U(1M, 4, 0.15)      U(1M, 4, 0.25)      U(1M, 4, 0.35)
p           U(15K, 12, 0.15)    U(15K, 16, 0.15)    U(15K, 100, 0.15)

Note that the P-FHDI is particularly powerful for big data with massive instances and high missing rates. The parametric studies on the number of variables will be discussed in Section 6. An increasing number of instances n and missing rate h increase the running time of the P-FHDI in Figs. 7 and 9, but n and h are not likely to significantly affect the speedup of the P-FHDI in Figs. 6 and 8. We provide the speedups and running times of the imputation and variance estimation of the P-FHDI separately in Appendix II, available in the online supplemental material. Overall, the number of instances is the dominating parameter positively affecting speedup and running time. All of these parametric studies show good agreement with the cost models of speedup and running time.

Fig. 6. Impact of the number of instances n on the speedups of the entire P-FHDI (i.e., imputation and variance estimation) with datasets U(n, 4, 0.25): 4 variables and 25 percent missing rate, varying n.

Fig. 7. Impact of the number of instances n on the running time of the entire P-FHDI (i.e., imputation and variance estimation) with datasets U(n, 4, 0.25): 4 variables and 25 percent missing rate, varying n.

Fig. 8. Impact of the missing rate h on the speedups of the entire P-FHDI (i.e., imputation and variance estimation) with datasets U(1M, 4, h): 1 million instances and 4 variables, varying h.

Fig. 9. Impact of the missing rate h on the running time of the entire P-FHDI (i.e., imputation and variance estimation) with datasets U(1M, 4, h): 1 million instances and 4 variables, varying h.

6 VARIABLE REDUCTION FOR BIG-p DATASETS

The current algorithm of the P-FHDI may not be adequate for imputing big-p data with too many variables (e.g., p > 100). We performed a parametric study with increasing p, and the results show weak speedups with the big-n algorithm in Fig. 10. A gradual increment of p significantly increases the total running time, which may be attributed to the fact that many variables can lead to exponentially increasing iterations to guarantee at least two donors under the current cell construction algorithm. This issue is theoretically related to the so-called curse of dimensionality. There is a huge literature on the problem of variable selection.
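A simplified version of the synthetic-data recipe used in these parametric studies can be sketched as follows. Assumptions are flagged explicitly: NumPy, a simplified correlation structure, and Ber(1 - h) observation indicators; the paper's exact generator from Section 4 (including the gamma-distributed error) differs in detail.

```python
# Sketch of a U(n, p, h)-style synthetic dataset: correlated continuous columns
# with Bernoulli observation indicators switching each entry to missing.
# Simplified relative to the paper's Section 4 generator (illustrative only).
import numpy as np

def make_synthetic(n, p, h, seed=0):
    rng = np.random.default_rng(seed)
    e = rng.standard_normal((n, p))
    y = np.cumsum(e, axis=1)              # simple correlated columns (illustrative)
    observed = rng.random((n, p)) >= h    # d_ij ~ Ber(1 - h): True means observed
    return np.where(observed, y, np.nan)  # NaN marks a missing item

data = make_synthetic(n=1000, p=4, h=0.25, seed=42)
missing_rate = float(np.isnan(data).mean())   # close to h = 0.25 on average
```

Varying one of (n, p, h) while fixing the other two reproduces the structure of the nine parametric datasets in Table 8.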

To name a few: the least absolute shrinkage and selection operator (LASSO) in [36], ridge in [37], principal component analysis (PCA) in [38], and the smoothly clipped absolute deviation (SCAD) method in [39]. The sure independence screening (SIS) proposed in [40] is popular for ultrahigh-dimensional variable reduction. It filters out the variables that have weak correlations with the response based on correlation learning. To explain the idea, consider a linear regression model

$$Y = X\beta + e, \quad (42)$$

where Y = (y1, y2, ..., yn)^T is the vector of responses, X = (X1, X2, ..., Xp) is an n x p random design matrix with independent and identically distributed (IID) elements, beta = (beta1, beta2, ..., betap)^T is a vector of parameters, and e = (e1, e2, ..., en)^T is a vector of IID random errors. Let M* = {1 <= i <= p : beta_i != 0} be the true sparse model. The covariates X_i with beta_i != 0 are the so-called important variables, and the others are noise variables. SIS consists of two steps:

1) Screening step: choose a subset of v variables such that v < p. For any given gamma in (0, 1), sort the correlations in descending order and define the sub-models

$$M_{\gamma} = \{1 \le i \le p : |r_i| \text{ is among the top largest ones}\}, \quad (43)$$

where r_i = corr(X_i, Y) is the sample correlation between X_i and Y.

2) Selection step: using the covariates in M_gamma, apply a penalized regression method to obtain the best model.

By the "sure screening property", we know that all important variables are retained after applying a variable screening procedure with probability tending to 1 by

$$P(M^{*} \subset M_{\gamma}) \to 1, \quad (44)$$

as n approaches infinity for some given gamma.

Fig. 10. Impact of the number of variables p on the speedups of the P-FHDI using the big-n algorithm and the big-p algorithm, with datasets U(15000, p, 0.15): 15000 instances with a 15 percent missing rate, varying p. Note that we adopt 4, 5, and 6 selected variables for U(15000, 12, 0.15), U(15000, 16, 0.15), and U(15000, 100, 0.15) with the big-p algorithm, respectively.

Inspired by the screening step of SIS, we introduce a variable reduction method with multivariate responses embedded into the P-FHDI algorithm (the so-called big-p algorithm). By the assumption of a cell mean model, an instance {z_i | i in A_R} cannot serve as a donor of {z_j | j in A_M} unless all observed variables of z_j are identical to the corresponding variables in z_i. It will be difficult to guarantee at least two donors for a recipient if p is large. We apply a correlation-based screening step like SIS to filter out those variables that have weak correlations with the response variables (i.e., the missing variables). Consequently, an instance {z_i | i in A_R} serves as a donor of {z_j | j in A_M} if the selected variables of z_j are identical to the corresponding variables in z_i. Suppose we select v variables for a recipient such that v < p. Let X = {X1, ..., Xq} be always observed and Y = {Y1, ..., Yw} be subject to missingness such that p = q + w. Let r_k = (r_{1k}, r_{2k}, ..., r_{qk}) be a vector of sample correlations of X_i, i in {1, ..., q}, with Y_k. The proposed big-p algorithm consists of four steps:

1) Compute the correlation vectors r_k, where k in {1, ..., w}, and sort them in descending order.
2) Define the sub-covariate set M_k for imputing Y_k, k in {1, ..., w}, such that

$$M_k = \{1 \le i \le q : |r_{ik}| \text{ is among the top largest } v, \text{ where } r_{ik} \in r_k\}. \quad (45)$$

3) We implicitly assume that

$$P(Y_1, \ldots, Y_w \mid X_1, \ldots, X_q) = P(Y_1, \ldots, Y_w \mid X^{*}), \quad (46)$$

where X* is the set of selected covariates such that M is the intersection of M_1, ..., M_w or M is the union of M_1, ..., M_w.
4) If the number of selected covariates in M equals v, stop. Otherwise, repeat steps (2) and (3) with v = v + 1 until we obtain v selected variables.

By iterating these steps for each recipient, we obtain the selected variables for all recipients. Following the sure screening property, the probability of the true model being contained in the built model is assumed to satisfy

$$P(M^{*} \subset M) \to 1, \quad (47)$$

as n approaches infinity.

Besides, the temporary storage of p^(z~M) in line 1 of Algorithm 9 may need excessive memory for big-p data. Hence, we jointly embed the function Rep_CellP into the L-Replication process of variance estimation to avoid excessive storage of p^(z~M). As a result of the implementation of the big-p algorithm, the weak speedups have been enhanced to the linear speedups in Fig. 10.

We further investigate the behaviors of the estimates of the big-p algorithm. Let U be a finite population of size N and U_g be subsets of the finite population of size N_g, where g = 1, ..., G. From Eq. (III.7) in Appendix III, available in the online supplemental material, the ratio of the means is given as

$$\frac{E(Y_N) + E\Big(\sum_{g=1}^{G_v} \sum_{i \in U_g} (\bar{R}_g^{-1} d_i - 1)(y_i - \mu_g)\Big)}{E(Y_N)} \to 1, \quad (48)$$

as v approaches p, where R-bar_g = N_{Rg} / N_g, N_{Rg} is the sum of d_i over i in U_g, Y_N is the sum of y_i over i = 1, ..., N, G_v is the number of imputation groups using v selected variables, and mu_g is the weighted average of the y_i's in the subgroup g of the population.
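The correlation-screening step shared by Eqs. (43) and (45) can be sketched as follows. This is illustrative: the big-p algorithm additionally applies the union/intersection rule of step 3 recipient by recipient, while the toy below screens covariates for a single response.

```python
# Sketch of correlation-based screening in the spirit of Eqs. (43) and (45):
# keep the v covariates most correlated (in absolute value) with the response.
import numpy as np

def top_v_covariates(X, y, v):
    """Indices of the v covariates with the largest |sample correlation| with y."""
    r = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
    return set(np.argsort(-np.abs(r))[:v])

rng = np.random.default_rng(1)
n, q = 500, 10
X = rng.standard_normal((n, q))
# only covariates 0 and 3 actually drive the response
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.standard_normal(n)

M = top_v_covariates(X, y, v=2)
# the two truly important covariates {0, 3} are recovered
```

Because the noise correlations concentrate near zero at rate 1/sqrt(n), the truly important covariates dominate the ranking, which is the intuition behind the sure screening property of Eq. (44).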

Fig. 11. The average ratios of (a) the mean and (b) the Jackknife variance estimator for the practical dataset U(19735, 26, 0.15) using a growing number of selected variables.

The parametric studies of the number of selected variables in Fig. 11a show that the adoption of the reduced variables does not significantly affect the average ratio of the means over all variables. The choice of the reduced variables will slightly change G_v for each variable, leading to additional bias in the second term of the numerator; this explains the slight fluctuation in Fig. 11a. In future work, a theoretical number of v for a desired level of bias will be addressed. Now consider the behavior of the variance using the big-p algorithm. Let W be

$$W = \sum_{j \ne r}^{M} \Big( \frac{w_{ij,\mathrm{FEFI}}^{*}}{\sum_{j \ne r} w_{ij,\mathrm{FEFI}}^{*}} \Big) \big( y_i^{(j)} - y_i^{(r)} \big), \quad (49)$$

where r is the index of the closest donor to the kth replicate and w*_{ij,FEFI} is defined in Eq. (15). Considering the update of w*_{ij}^(k) in Eq. (21), the ratio of V^(Y^_FHDI) using v selected variables to that using all observed variables can be derived by Eq. (III.13) in Appendix III, available in the online supplemental material, as

$$\frac{\sum_{k=1}^{n} \sum_{i \in A_M} w_{ir,\mathrm{FEFI},v}^{*2} W_v^2}{\sum_{k=1}^{n} \sum_{i \in A_M} w_{ir,\mathrm{FEFI},p}^{*2} W_p^2} \to 1, \quad (50)$$

where w*_{ir,FEFI,v} and W_v are built upon the v selected variables and w*_{ir,FEFI,p} and W_p are built upon all observed variables, respectively. The ratio will approach 1 as v approaches p. Fig. 11b shows that the average ratio of the Jackknife variance over all variables incrementally approaches 1 with a growing number of the reduced variables. The sum of w*_{ij,FEFI} over the donors of the ith recipient equals one. If v is much smaller than p, a recipient will have more donors with the v reduced variables such that w*_{ir,FEFI,v} <= w*_{ir,FEFI,p}, which is the dominating factor leading to a ratio of V^(Y^_FHDI) less than 1. The distance W measures how far y_i^(r) is away from the mean of the M - 1 donors of the ith recipient. Though Y^_FHDI has additional variance due to the donor

Fig. 12. Speedups of the P-FHDI with an extremely high-dimensional dataset U(1000, 10000, 0.3): 1000 instances and 10,000 variables with 30 percent missingness. Note that we adopt 3 selected variables with the big-p algorithm.

speedup. This performance may be attributed to the fact that both the parallel cell construction and the estimation of cell probability use internal parallelization within the unbreakable implicit iterations. The running times of the parallel cell construction and the estimation of cell probability take the majority of the total execution time with the dataset U(1000, 10000, 0.3). Also, the adoption of the uniform distribution scheme in the parallel variance estimation results in fluctuation of the speedup due to considerable workload imbalance when n is too small. Our HPC environment (8 GB of memory per core) had some difficulty in handling datasets with more than 10,000 variables because of several "out of memory" issues, which may be overcome by a large-memory HPC environment. As a compromise, the present P-FHDI gives users warning messages if the memory requirement is about to exceed the limitations.

7 FUTURE RESEARCH

In future work, a theoretical choice of the number of selected variables v in the big-p algorithm which minimizes the Jackknife estimate of the variance will be addressed. Also, the present big-p algorithm of the P-FHDI is not adequate for extremely high-dimensional categorical variables. Categorical variables violate the assumptions for computing correlation, by which all variables should be continuous. One set of possible solutions relies on distance metrics that can find associations between categorical variables. Other possible proposals span various statistical metrics (e.g., chi-squared statistics). The marginal association measurement proposed in [41], rather than correlation learning, would also be a good candidate criterion to filter out unimportant categorical variables. Departing from the current
1076
program, the next theoretical (e.g., cell construction with k- 1077
selection procedure [15]. It leads to Wp slightly smaller than
nearest neighbors or fractional imputation using density 1078
Wv , which is trivial if v is much smaller than p. As v ! p,
ratio model) and computational advancements (e.g., pru- 1079
wir;FEFI;v ’ wir;FEFI;p . Sequentially, Wp < Wv will play an
dent data distribution algorithm) shall be focusing on the 1080
important role such that the ratio of V^ðY^FHDI Þ can be slightly ultra datasets and high-dimensional categorical data. 1081
greater than 1. That is, the trend of the average ratio of
V^ðY^FHDI Þ in Fig. 11b is explained.
1042 The extremely high-dimensional data we have tested 8 CONCLUSION 1082

1043 with the big-p algorithm so far has 10,000 variables. Fig. 12 As we enter into the new era of big data, it is of paramount 1083
1044 shows that the speedups of the P-FHDI on extremely high- importance to establish the big data-oriented imputation par- 1084
1045 dimensional data can scale up slightly lower than the linear adigm. By inheriting the strengths of the general-purpose, 1085
assumption-free serial fractional hot-deck imputation program FHDI, this paper presents the first version of the parallel fractional hot-deck imputation program, named P-FHDI. We document the full details regarding the algorithm-oriented parallelisms, computational implementations, and various examples of the P-FHDI to benefit a broad audience in the science and engineering domain. The developed P-FHDI is suitable to tackle large-instance (so-called big-n) or large-variable (so-called big-p) missing data with irregular and complex missing patterns. The validations and analytical cost models confirm that the P-FHDI exhibits promising scalability for big incomplete datasets regardless of various missing rates. This program will help to impute incomplete data, enable parallel variance estimation, and ultimately improve the subsequent statistical inference and machine learning with the cured big datasets. The next version of P-FHDI will focus on ultra datasets with both large instances and many variables, which will call for specialized theoretical and computational advancements. Toward that future extension, this first version of P-FHDI will serve as a concrete foundation. All the developed codes are shared under GPL-2, and codes, examples, and documents are available in [35].

ACKNOWLEDGMENTS

This work was supported by the research funding of the Department of Civil, Construction, and Environmental Engineering of Iowa State University. The high-performance computing facility used for this work was partially supported by the HPC@ISU equipment at ISU, some of which has been purchased through funding provided by the National Science Foundation under MRI grant number CNS 1229081 and CRI grant number 1205413. This work was also supported by the National Science Foundation under grants CBET-1605275 and OAC-1931380.

REFERENCES

[1] J. Im, I. Cho, and J. K. Kim, "An R package for fractional hot deck imputation," R J., vol. 10, no. 1, pp. 140–154, 2018.
[2] I. Song, Y. Yang, J. Im, T. Tong, H. Ceylan, and I.-H. Cho, "Impacts of fractional hot-deck imputation on learning and prediction of engineering data," IEEE Trans. Knowl. Data Eng., to be published, doi: 10.1109/TKDE.2019.2922638.
[3] T. A. Myers, "Goodbye, listwise deletion: Presenting hot deck imputation as an easy and efficient tool for handling missing data," Commun. Methods Measures, vol. 5, no. 4, pp. 297–310, 1999.
[4] J. W. Graham, "Missing data analysis: Making it work in the real world," Annu. Rev. Psychol., vol. 60, pp. 549–576, 2009.
[5] L. Wilkinson, "Statistical methods in psychology journals: Guidelines and explanations," Amer. Psychol., vol. 54, no. 8, pp. 594–604, 1999.
[6] A. J. Smola, S. Vishwanathan, and H. Thomas, "Kernel methods for missing variables," in Proc. Int. Conf. Artif. Intell. Statist., 2005, pp. 325–332.
[7] D. Williams, X. Liao, Y. Xue, and L. Carin, "Incomplete-data classification using logistic regression," in Proc. 22nd Int. Conf. Mach. Learn., 2005, pp. 972–979.
[8] D. Rubin, "Multiple imputation after 18+ years," J. Amer. Statist. Assoc., vol. 91, no. 434, pp. 473–489, 1996.
[9] I. White, P. Royston, and A. Wood, "Multiple imputation using chained equations: Issues and guidance for practice," J. Amer. Statist. Assoc., vol. 30, no. 4, pp. 377–399, 2011.
[10] N. J. Horton and S. R. Lipsitz, "Multiple imputation in practice," Amer. Statist., vol. 55, no. 3, pp. 244–254, 2001.
[11] S. Yang and J. K. Kim, "A note on multiple imputation for method of moments estimation," Biometrika, vol. 103, no. 1, pp. 244–251, 2016.
[12] X. Xie and X.-L. Meng, "Dissecting multiple imputation from a multi-phase inference perspective: What happens when God's, imputer's and analyst's models are uncongenial?" Statist. Sinica, vol. 27, no. 4, pp. 1485–1594, 2017.
[13] K. Graham and L. Kish, "Some efficient random imputation methods," Commun. Statist.-Theory Methods, vol. 13, no. 16, pp. 1919–1939, 1984.
[14] R. E. Fay, "Alternative paradigms for the analysis of imputed survey data," J. Amer. Statist. Assoc., vol. 91, no. 434, pp. 490–498, 1996.
[15] J. K. Kim and W. Fuller, "Fractional hot deck imputation," Biometrika, vol. 91, no. 3, pp. 559–578, 2004.
[16] S. Yang and J. K. Kim, "Fractional imputation in survey sampling: A comparative review," Statist. Sci., vol. 31, no. 3, pp. 415–432, 2016.
[17] W. A. Fuller and J. K. Kim, "Hot deck imputation for the response model," Surv. Methodol., vol. 31, no. 2, pp. 139–149, 2005.
[18] E. F. Akmam, T. Siswantining, S. M. Soemartojo, and D. Sarwinda, "Multiple imputation with predictive mean matching method for numerical missing data," in Proc. Int. Conf. Informat. Comput. Sci., 2019, pp. 1–6.
[19] Nurzaman, T. Siswantining, S. M. Soemartojo, and D. Sarwinda, "Application of sequential regression multivariate imputation method on multivariate normal missing data," in Proc. Int. Conf. Informat. Comput. Sci., 2019, pp. 1–6.
[20] K. Aristiawati, T. Siswantining, D. Sarwinda, and S. M. Soemartojo, "Missing values imputation based on fuzzy C-means algorithm for classification of chronic obstructive pulmonary disease," in Proc. 8th SEAMS-UGM Int. Conf. Math. Appl., 2019, Art. no. 060003.
[21] T. Anwar, T. Siswantining, D. Sarwinda, S. M. Soemartojo, and A. Bustamam, "A study on missing values imputation using K-harmonic means algorithm: Mixed datasets," in Proc. Int. Conf. Sci. Appl. Sci., 2019, Art. no. 020038.
[22] IBM, 2012. [Online]. Available: http://www-01.ibm.com/software/data/bigdata/
[23] B. Caffo, R. Peng, F. Dominici, T. A. Louis, and S. Zeger, "Parallel MCMC imputation for multiple distributed lag models: A case study in environmental epidemiology," in The Handbook of Markov Chain Monte Carlo. Boca Raton, FL, USA: CRC Press, 2011, pp. 493–512.
[24] T. J. Durham, M. W. Libbrecht, J. Howbert, J. Bilmes, and W. S. Noble, "PREDICTD PaRallel epigenomics data imputation with cloud-based tensor decomposition," Nat. Commun., vol. 9, no. 1, pp. 1402–1402, 2018.
[25] B. Zhang, D. Zhi, K. Zhang, G. Gao, N. N. Limdi, and N. Liu, "Practical consideration of genotype imputation: Sample size, window size, reference choice, and untyped rate," Statist. Interface, vol. 4, no. 3, pp. 339–352, 2011.
[26] E. Porcu, S. Sanna, C. Fuchsberger, and L. G. Fritsche, "Genotype imputation in genome-wide association studies," Current Protocols Hum. Genetics, vol. 78, no. 1, pp. 1–25, 2013.
[27] X. Hu, "Acceleration genotype imputation for large dataset on GPU," Procedia Environ. Sci., vol. 8, pp. 457–463, 2011.
[28] D. Dheeru and G. Casey, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[29] J. Im, J. K. Kim, and W. A. Fuller, "Two-phase sampling approach to fractional hot deck imputation," in Proc. Surv. Res. Methodol., 2015, pp. 1030–1043.
[30] S. Z. Christopher, T. Siswantining, D. Sarwinda, and A. Bustaman, "Missing value analysis of numerical data using fractional hot deck imputation," in Proc. Int. Conf. Informat. Comput. Sci., 2019, pp. 1–6.
[31] J. G. Ibrahim, "Incomplete data in generalized linear models," J. Amer. Statist. Assoc., vol. 85, no. 411, pp. 765–769, 1990.
[32] I. Cho and K. A. Porter, "Multilayered grouping parallel algorithm for multiple-level multiscale analyses," Int. J. Numer. Methods Eng., vol. 100, no. 12, pp. 914–932, 2014.
[33] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Eds., Introduction to Algorithms. London, U.K.: MIT Press, 2009.
[34] Condo, "Condo2017: Iowa State University high-performance computing cluster system," 2017. [Online]. Available: https://www.hpc.iastate.edu/guides/condo-2017/slurm-job-script-generator-for-condo
[35] I. Cho, "Data-driven, computational science and engineering platforms repository," 2017. [Online]. Available: https://sites.google.com/site/ichoddcse2017/home/type-of-trainings/parallel-fhdi-p-fhdi-1
[36] R. Tibshirani, "Regression shrinkage and selection via the Lasso," J. Roy. Statist. Soc. Ser. B Methodol., vol. 58, no. 1, pp. 267–288, 1996.
[37] A. E. Hoerl and R. W. Kennard, "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
[38] I. T. Jolliffe, Ed., Principal Component Analysis. Berlin, Germany: Springer, 2011.
[39] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its Oracle properties," J. Amer. Statist. Assoc., vol. 96, no. 456, pp. 1348–1360, 2001.
[40] J. Fan and J. Lv, "Sure independence screening for ultrahigh dimensional feature space," J. Roy. Statist. Soc. Ser. B Statist. Methodol., vol. 70, no. 5, pp. 849–911, 2008.
[41] D. Huang, R. Li, and H. Wang, "Feature screening for ultrahigh dimensional categorical data with applications," J. Bus. Econ. Statist., vol. 32, no. 2, pp. 237–244, 2014.

Yicheng Yang received the MS degree in civil engineering with a minor in computer science from Iowa State University (ISU), Ames, Iowa, in 2018. He is currently working toward the PhD degree in civil engineering (specialization: intelligent infrastructure engineering) in the Department of Civil, Construction and Environmental Engineering (CCEE), ISU, Ames, Iowa. His research interests include parallel imputation, machine learning, and data-driven engineering.

Jae Kwang Kim received the PhD degree in statistics from Iowa State University (ISU), Ames, Iowa, in 2000. He is a fellow of the American Statistical Association and the Institute of Mathematical Statistics and is currently a LAS dean's professor with the Department of Statistics, ISU. His research interests include survey sampling, statistical analysis with missing data, measurement error models, multi-level models, causal inference, data integration, and machine learning.

In Ho Cho (Member, IEEE) received the PhD degree in civil engineering with a minor in computational science and engineering from the California Institute of Technology, Pasadena, California, in 2012. He is currently an associate professor with the Department of CCEE, ISU. His research interests include data-driven engineering and science, computational statistics, parallel computing, parallel multi-scale finite element analysis, and computational and engineering mechanics for soft micro-robotics and nano-scale materials.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.
Appendices for Parallel Fractional Hot-Deck Imputation and Variance Estimation for Big Incomplete Data Curing

APPENDIX I: Detailed Explanations and Examples of the P-FHDI
The platform used in this paper for all the parallel computations is the Condo 2017 cluster of Iowa State University, consisting of 192 SuperMicro servers and expandable to 324 servers. Each server has two 8-core Intel Haswell processors, 128 GB of memory, and a 2.5 TB local disk. The following illustration of the P-FHDI covers the preparation of input files, compiling, and the explanation of output files. As long as the user's cluster system has a compatible compiler module (i.e., intel/15.0.2 through intel/19.4), the developed software can be deployed, compiled, and run for parallel data curing. The software is shared under GPL-2, and codes, examples, and documents are available in [1].

A. Preparation of an input file

All C++ files of the program should be placed in the same directory. The P-FHDI requires one input file specifying (1) the initialization of program settings (as summarized in Table I), (2) the raw data matrix y with missing values, (3) the response indicator matrix r reflecting observed and missing values, and (4) an optional imputation cell matrix z, if the default option i option read data = 0 is used. For ease of use, the matrix y, the matrix r, and the rest of the setups can be read from separate TXT files by setting i option read data = 1, which is convenient when the input datasets are extremely large.
An illustrative summary of options for the initial settings is provided in Table I. i option perform accepts four options: with option 1, the program performs all steps of the P-FHDI or P-FEFI using imputation cells z automatically generated by cell construction (or user-defined imputation cells z with option 4). Note that users must import z if option 4 is chosen and i user defined datz = 1. Besides, the program can perform cell construction or estimation of cell probability separately with option 2 or 3. i option imputation provides two options: 1 for running the P-FEFI and 2 for running the P-FHDI. Users can skip the variance estimation by setting i option variance = 0; otherwise, the P-FHDI performs the variance estimation by default. i option merge controls the random seed during cell collapsing and random donor selection. A choice of 1 will use the C++ standard random number generator for the donor selection; an option of 0 will use the same random number seed, which is helpful for validations when the same imputed values are desired. donor specifies the number of total donors (meaningful when the P-FHDI is used); [2] recommends 5 as the default number of donors after detailed case studies. i user defined datz is determined in conjunction with the choice of i option perform, as illustrated before. The default choice i option collapsing = 0 enables the big-n algorithm, whereas i option collapsing ≠ 0 activates the big-p algorithm with sure independence screening by specifying the number of selected variables. For example, i option collapsing = 4 will utilize four selected variables for the big-p algorithm to identify donors. i option SIS type selects among the different methods for the sure independence screening. An example of the initial settings is shown as:

i option read data 0
i option perform 1
nrow 100
ncol 4
i option imputation 2
i option variance 1
i option merge 0
donor 5
i user defined datz 0
i option collapsing 0
i option SIS type 3

Following the initial settings, we adopt the same dataset y used in [3] as an example. Data y delimit each element in a row by a single space and start a new row with a line break, keeping a consistent format as shown below:
TABLE I: Summary of options for the initial settings.

• i option read data (default: 0):
  0: Read all inputs from the input.txt file. 1: Read the raw data from the daty.txt file, the response indicator matrix from the datr.txt file, and the rest of the settings from the input.txt file, separately.
• i option perform (default: 1):
  1: Perform all steps using the automatic z. 2: Perform cell construction only. 3: Perform estimation of cell probability only. 4: Perform all steps using a user-defined z.
• nrow: The number of instances of the target dataset.
• ncol: The number of variables of the target dataset.
• i option imputation (default: 2):
  1: Perform the P-FEFI. 2: Perform the P-FHDI.
• i option variance (default: 1):
  0: Skip variance estimation. 1: Perform the Jackknife variance estimation.
• i option merge (default: 0):
  0: Turn on the fixed seed. 1: Turn on the random seed generator.
• donor (default: 5):
  Adopt a user-defined integer as the number of donors used to fill in each missing item in the P-FHDI.
• i user defined datz (default: 0):
  0: Adopt the automatic z matrix. 1: Adopt a user-defined z matrix.
• i option collapsing (default: 0):
  Activate the big-p algorithm by specifying the number of selected variables.
• i option SIS type (default: 3):
  Sure independence screening with 1: an intersection of simple correlations; 2: a union of simple correlations; 3: a global ranking of simple correlations.

daty
1.479632863 2.150860465 0 1.894211796
0 1.141495781 1.602529616 −1.036946859
0.70870936 1.885673226 1.250689441 0
0 2.753839552 0 1.211049509
0.862735721 2.42554872 1.887549247 −0.539284732
... ... ... ...

It is worthwhile to note that missing values in y can be marked with any numerical value. The locations of missing units are identified by the matrix r, which has the same dimensions as y. The response index (1 = observed; 0 = missing) in r is essential for the program to locate missing values. For instance, the missing units of y are marked as

datr
1 1 0 1
0 1 1 1
1 1 1 0
0 1 0 1
1 1 1 1
... ... ... ...

Next, the category vector K defines the number of categories of each variable to be created in the cell construction process. The maximum value of a category is 35, and [2] suggests K between 30 and 35 for each continuous variable after detailed case studies. Still, the gain with such a large K is not always substantial, and the computational cost increases with K. Thus, it is reasonable to start with a small K. For instance, the definition below assigns 3 categories to all variables:

category
3 3 3 3
Each variable can have a different value of category. The P-FHDI also provides an option to reserve non-collapsible categorical variables:

NonCollapsible categorical
0 0 0 0

The default option is a zero index vector of size p. For instance, when "NonCollapsible categorical" = c(1,0,0,0), the first variable is strictly considered a non-collapsible categorical variable, and all the other variables are considered continuous or collapsible categorical variables. Consequently, cell collapsing will not happen for any non-collapsible categorical variable. On top of the categories, users should include a sampling weight for each row of y. The length of the sampling weights is consistent with the number of instances in y; it has a length of 100 in this example:

weight
1
1
1
1
1
...

The choice of imputation cells z is optional. This matrix is meaningful only when i user defined datz = 1:

datz
3 2 0 3
0 1 1 1
2 2 2 0
0 2 0 3
2 3 2 2
... ... ... ...

B. Compiling of the program

Before compiling the program, it is necessary to load the corresponding module. This study uses intel/18.3 for C++ and OpenMPI:

module load intel/18.3

where "18.3" represents the version of the loaded module. Table II lists all the available versions of modules in Condo 2017 that have been tested for the P-FHDI. As a default, we recommend intel/18.3, and all examples are provided with that setting.

TABLE II: List of all available modules in Condo 2017.

15.0.2  16.0  16.4  16.5  17.0.1
17.4    18.1  18.2  18.3  19.4

To compile C++ MPI programs using the Intel MPI, run the command:

mpiicc -o main_MPI main_MPI.cpp

where -o ** specifies the name of the executable file and "**.cpp" represents the actual source file name with its extension. Then a job script should be created and submitted to the queue by issuing:

sbatch run.sbatch

where "**.sbatch" is the actual name of a script file with its extension. To submit a parallel-computing task to a high-performance computing (HPC) system, users need to prepare a script file for a workload manager. Depending upon the HPC facility, there are many workload managers. For instance, "Slurm" in our HPC facility is a highly scalable cluster job
management system. A job script can be generated by the Slurm Job Script Generator for Condo in [4], and an example script
file is shown below:
#!/bin/bash
# The number of nodes being requested
#SBATCH --nodes=1
# Total number of processors being requested (across all nodes).
#SBATCH --ntasks=4
# The maximum time of the job
#SBATCH --time=00:12:00
# Load the module
module load intel/18.3
# Run the MPI program
mpirun -np 4 ./main_MPI
The script file starts with a line identifying the Unix shell to be used, i.e., #!/bin/bash. Every script should include the "--nodes", "--ntasks", and "--time" directives, which inform Slurm of the number of total nodes assigned to the job, the number of total processes run across all nodes, and the maximum time limit of the job in HH:MM:SS format, respectively. Afterward, load the compiler and OpenMPI module with "module load" and run the executable MPI program with "mpirun" using four total processors (i.e., "-np 4"). Note that the number of total processors (denoted as np) should be identical to the number of tasks (denoted as ntasks). The script file may differ depending upon the local HPC platform. Each parallel example requires a different number of processors and wall-time; thus, we provide the corresponding script files for all examples in full detail in the supplementary material.

C. Explanation of output files

TABLE III: Descriptions of output files.

Output files    | Description                     | P-FEFI (bytes)       | P-FHDI (bytes)
f imp.data.txt  | Results with fractional weights | (2n + n²)(10 + 2.5p) | 41Mn + 10Mpn
simp.data.txt   | Final imputation matrix         | 15 + n + 20pn        | 15 + n + 20pn
summary.txt     | Mean and variance estimates     | 60 + 10p             | 60 + 10p

The P-FHDI prepares three output files that cover all the results available in the serial FHDI in R. The first output file, named with the extension f imp.data.txt, presents the imputation results with fractional weights. A part of f imp.data.txt of the P-FEFI is given below, where FID denotes the local serial index and FWT denotes the fractional weight assigned to each donor:

ID FID WT FWT y1 y2 y3 y4
1 1 1 0.5 1.47963 2.15086 2.49344 1.89421
1 2 1 0.5 1.47963 2.15086 2.88165 1.89421
2 1 1 0.2 −0.09087 1.1415 1.60253 −1.03695
2 2 1 0.2 −1.67006 1.1415 1.60253 −1.03695
2 3 1 0.2 −0.39303 1.1415 1.60253 −1.03695

A part of f imp.data.txt of the P-FHDI is given for comparison. The P-FEFI takes all possible imputed values as donors, whereas the P-FHDI takes M selected donors from all possible imputed values.
ID FID WT FWT y1 y2 y3 y4
1 1 1 0.5 1.47963 2.15086 2.49344 1.89421
1 2 1 0.5 1.47963 2.15086 2.88165 1.89421
2 1 1 0.2 −0.09087 1.1415 1.60253 −1.03695
2 2 1 0.2 −1.67006 1.1415 1.60253 −1.03695
2 3 1 0.2 −0.39303 1.1415 1.60253 −1.03695
The second output file, simp.data.txt, presents the final imputed matrix in the same size as y. For instance, y[1, 3] has a value of 0, indicating absence. There exist two imputed values for y[1, 3] in the f imp.data.txt file, with FIDs of 1 and 2 and imputed values of 2.49344 and 2.88165. Considering the sampling weights and fractional weights, the final imputed value compensated for y[1, 3] will be

1 × 0.5 × 2.88165 + 1 × 0.5 × 2.49344 = 2.68754,

as shown in the simp.data.txt file of the P-FEFI below:
1.47963 2.1509 2.68754 1.89421
−0.19263 1.1415 1.60253 −1.03695
0.70871 1.8857 1.25069 0.77238
1.44471 2.7538 2.33727 1.21105
0.86274 2.4256 1.88755 −0.53929
... ... ... ...
Meanwhile, the simp.data.txt of the P-FHDI is provided:

1.47963 2.1509 2.68754 1.89421
−0.19263 1.1415 1.60253 −1.03695
0.70871 1.8857 1.25069 0.80067
1.40139 2.7538 2.24754 1.21105
0.86274 2.4256 1.88755 −0.53929
... ... ... ...
The last output file, summary.txt, comprises the mean estimates, the variance estimates, and a confirmation of some input information:
Mean estimates:
0.904923 1.866880 1.818840 −0.031939
Variance Result:
0.016269 0.014567 0.018769 0.016793
s op imputation:
1
i option merge:
0
M:
5

For the P-FHDI, it is:
Mean estimates:
0.899718 1.86427 1.81198 −0.0278538
Variance Result:
0.016730 0.014725 0.019323 0.017598
s op imputation:
2
i option merge:
0
M:
5
If the program does not successfully converge within the time limit, a debug.txt file will be produced with possible reasons and corresponding suggestions.

D. Application of real data

We provide an application of the P-FHDI to a real dataset (i.e., Air Quality) from the example database in the supplementary material to instruct users. Following the same procedures as in subsections A and B, we first need to prepare three TXT input files by setting i option read data = 1. The "daty.txt" file contains the raw data of Air Quality:
daty
1 138 1022 6.25
1 109 1022 0
1 105 1023 8.93
1 0 1024 10.72
0 140 1026 17.43
... ... ... ...

The "datr.txt" file contains the response indicator matrix of Air Quality:

datr
1 1 1 1
1 1 1 0
1 1 1 1
1 0 1 1
0 1 1 1
... ... ... ...

The "input.txt" file contains the basic setups:

INPUT INFORMATION
i option read data 1
i option perform 1
nrow 41757
ncol 4
i option imputation 2
i option variance 1
i option merge 0
donor 5
i user defined datz 0
i option collapsing 0
i option SIS type 3
END INPUT INFORMATION

category
3 3 3 3

NonCollapsible categorical
1 0 0 0

weight
1
1
1
1
...

END DATA INPUT

Subsection A explicitly explains the available options for the required settings in the input files. The number of categories for each variable is specified after "category". Note that the first variable of Air Quality is a categorical variable; thus, one can set the corresponding position to 1 after "NonCollapsible categorical" to declare its special variable type. The sampling weights of all observations are set to ones by default after "weight". We recommend that users download an input file from the supplementary material and build their own on top of it.

Secondly, one needs to prepare a script file (named "run.sbatch") for the workload manager of the HPC system to submit a parallel-computing task. The job script can be either modified on top of the example scripts in the supplementary material or generated by the Slurm Job Script Generator for Condo in [4]. Finally, users need to copy the input files, the script file, and the C++ files of the P-FHDI to the same directory. One can then follow the procedures in subsection B to compile the program and submit a parallel-computing task via the HPC system.

APPENDIX II: Cost Analysis of the P-FHDI

We provide an extension of Section 5 with the detailed derivations of Eqs. (40) and (41). In general, we classify the total cost $T$ of the P-FHDI into the computational cost (denoted as $C$) and the communication cost (denoted as $H$). We derive these two costs at four primary stages: (i) cell construction, (ii) estimation of cell probability, (iii) imputation, and (iv) variance estimation. For the convenience of readers, we summarize the notation used in this section as follows. A finite sampled population is denoted by $U(\text{instances}, \text{variables}, \text{missing rate}) \in \mathbb{R}^{n \times p}$ with $n$ instances, $p$ variables, and the associated missing rate $\eta$. The number of unique patterns of $U$ is denoted by $\tilde{n}$. We can activate the big-p algorithm by specifying the number of selected variables $v \neq 0$, or the default big-n algorithm by setting $v = 0$. Given $Q$ total available processors, the computational cost per element is denoted by $\alpha$, the communication transfer cost per element by $\beta$, and $L$ is the communication startup cost. Therefore, the total running time $T$ is written as

\[
T = C + H, \tag{II.1}
\]

where $C = C_i + C_{ii} + C_{iii} + C_{iv}$ and $H = H_i + H_{ii} + H_{iii} + H_{iv}$, respectively.
(i) Cell construction: The majority of the computational cost of cell construction comes from iterations of function ZMAT (de-
(1) (2) (1)
noted as Ci ) described in Algorithm 4 and function nDAU (denoted as Ci ) in Algorithm 5. Hence Ci = ψ1 (n, p)(Ci +
(2) (1)
Ci ) where ψ1 (n, p) is the number of total iterations in line 1 of Algorithm 3. The Ci is
(1) n2 p
Ci =α (II.2)
Q−1
(2)
And the Ci is
ñ2 p
(
(2) α(n2 + Q−1 ) if v = 0
Ci = ñ2 p
(II.3)
α(n2 + Q−1 + 2wp2 + p2 wv) if v 6= 0
where w is the number of missing values in the current missing pattern. From (II.2) and (II.3), we have
  
 αψ1 (n, p) n2 p + n2 + ñ2 p if v = 0
Ci =  Q−1
2
Q−1
2
 (II.4)
n p 2 ñ p 2 2
 αψ1 (n, p)
Q−1 + n + Q−1 + 2wp + p wv if v 6= 0

where ψ₁(n, p) ∈ N is a monotonically increasing function of n and p if v = 0. Since ψ₁(n, p) is implicit, we make the empirical approximation ψ₁(n, p) ∝ n ln(p) based on Fig. 13. Note that ψ₁(n, p) will be significantly reduced with an appropriate choice of v. Similarly, the communication cost corresponding to C_i^(1) and C_i^(2) is H_i = ψ₁(n, p)(H_i^(1) + H_i^(2)), where
H_i^(1) = (Q − 1)(2L + βñp) + 2L + βñp    (II.5)

H_i^(2) = { (Q − 1)(3L + 4βñ) + 3L + 4βñ             if v = 0
          { (Q − 1)(4L + 4βñ + 2βnv) + 4L + 4βñ      if v ≠ 0    (II.6)

Thus, we have

H_i = { ψ₁(n, p){(Q − 1)(5L + βñp + 4βñ) + 5L + 4βñ + βñp}             if v = 0
      { ψ₁(n, p){(Q − 1)(6L + βñp + 4βñ + 2βnv) + 6L + 4βñ + βñp}      if v ≠ 0    (II.7)

(ii) Estimation of cell probability: We analyze the computational cost of the estimation of cell probability, which mainly consists of the identification of unique patterns (denoted as C_ii^(1)) and iterations of updating the conditional probability and sampling weight in line 2 of Algorithm 6 (denoted as C_ii^(2)), such that C_ii = C_ii^(1) + ψ₂(η)C_ii^(2), where

C_ii^(1) = αn²p    (II.8)

C_ii^(2) = α(nñp + nñ + n²/(Q − 1))    (II.9)

Fig. 13: (a) Impact of the number of variables on the number of iterations (i.e., ψ1 (n, p)) in cell construction of the big-n
algorithm with datasets U(10000, p, 0.15) by varying p and (b) impact of the number of instances on the number of iterations
in cell construction of the big-n algorithm with datasets U(n, 12, 0.15) by varying n.

so we have

C_ii = αn²p + αψ₂(η)(nñp + nñ + n²/(Q − 1))    (II.10)

where ψ₂(η) is a quadratic, monotonically increasing function of η. We make the empirical approximation ψ₂(η) ∝ η² based on Fig. 14. Then the communication cost corresponding to updating the conditional probability and sampling weight turns out to be

H_ii = ψ₂(η){(Q − 1)(L + βn) + L + βn}    (II.11)

Fig. 14: Impact of the missing rate on the number of iterations (i.e., ψ₂(η)) in the estimation of cell probability of the big-n algorithm with datasets U(10⁶, 4, η) by varying η.

(iii) Imputation: The majority of the computational cost of imputation focuses on preparing outputs of imputed values described in line 8 of Algorithm 8 as

C_iii = α n²/(Q − 1)    (II.12)

and its communication cost turns out to be

H_iii = 2(Q − 1)(L + βn) + 2L + 2βn    (II.13)
(iv) Variance estimation: The variance estimation consists of function Rep_CellP (denoted as C_iv^(1)) in Algorithm 10 and the computation of the Jackknife variance (denoted as C_iv^(2)) in lines 3 to 11 of Algorithm 9, such that C_iv = C_iv^(1) + C_iv^(2). We have C_iv^(1) as

C_iv^(1) = { (αñ/(Q − 1))[n²p + ψ₂(η)(nñp + nñ + n log n)]    if v = 0
           { 0                                                 if v ≠ 0    (II.14)

Function Rep_CellP computes ñ/(Q − 1) iterations of the estimation of cell probability. The first term inside the square brackets corresponds to the computational cost of the identification of unique patterns, and the second term is the computational cost of iterations of estimating the cell probability in line 3 of Algorithm 10. Note that the function Rep_CellP is embedded into the computation of the Jackknife variance, such that C_iv^(1) = 0 if the big-p algorithm is activated by setting v ≠ 0. The C_iv^(2) is given by

C_iv^(2) = { (αn/(Q − 1))[ń log ń + n²p]                               if v = 0
           { (αn/(Q − 1))[ń log ń + 2n²p + ψ₂(η)(nñp + nñ + n log n)]  if v ≠ 0    (II.15)

where ń is the size of the imputed values in line 8 of Algorithm 8. The first term inside the square brackets corresponds to updating the fractional weight in line 8 of Algorithm 9. The second term comes from the computation of the FHDI estimator in line 10 of Algorithm 9. In the case of v ≠ 0, the function Rep_CellP is embedded into the main loop. From (II.14) and (II.15), we have

C_iv = { (α/(Q − 1))[ñ{n²p + ψ₂(η)(nñp + nñ + n log n)} + n{ń log ń + n²p}]    if v = 0
       { (αn/(Q − 1))[ń log ń + 2n²p + ψ₂(η)(nñp + nñ + n log n)]              if v ≠ 0    (II.16)
Similarly, we have the corresponding communication cost H_iv = H_iv^(1) + H_iv^(2), where

H_iv^(1) = { (Q − 1)(2L + 2βñ²) + 2L + 2βñ²    if v = 0
           { 0                                  if v ≠ 0    (II.17)

and

H_iv^(2) = (Q − 1)L + βnp    (II.18)

Therefore,

H_iv = { (Q − 1)(3L + 2βñ²) + 2L + 2βñ² + βnp    if v = 0
       { (Q − 1)L + βnp                           if v ≠ 0    (II.19)
By substituting C and H of each segment of the P-FHDI derived above, we have a lengthy expression of the total running time T. Considering the dominating terms O(n³) and ψ₁(n, p) in the expression of T, we can simplify the total running time T in terms of the worst scenario for a better interpretation as

T(Q) = C_i + C_ii + C_iii + C_iv + H_i + H_ii + H_iii + H_iv    (II.20)
     ≈ α′/Q + β′Q    (II.21)

where α′ = αψ₁(n, p)O(n²) + αO(n³) and β′ = βψ₁(n, p)O(n). By substituting Q with c_p Q, where c_p ∈ N⁺, the scalability with total c_p Q cores is rephrased as

T(Q)/T(c_p Q) = c_p × (α′ + Q²β′)/(α′ + c_p²Q²β′)    (II.22)
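The scalability ratio in Eq. (II.22) reduces to a one-line model. In this minimal Python sketch, α′ and β′ are treated as given aggregated coefficients (placeholders); it recovers the two limiting behaviors: ideal speedup c_p when computation dominates, and 1/c_p when communication dominates.

```python
def predicted_scalability(Q, cp, alpha_prime, beta_prime):
    """Eq. (II.22): the ratio T(Q)/T(cp*Q) under the simplified cost
    model T(Q) ~ alpha'/Q + beta'*Q."""
    return (cp * (alpha_prime + Q**2 * beta_prime)
            / (alpha_prime + cp**2 * Q**2 * beta_prime))
```

When β′ = 0 the ratio equals c_p (perfect scaling); when α′ = 0 it equals 1/c_p, i.e., adding cores only adds communication overhead.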
As a supplement to Section 5 of the paper, we provide the speedups and running times of the imputation and variance estimation of the P-FHDI separately in Figs. 15 to 18. Note that cell construction, estimation of cell probability, and imputation are compressed into a single part, referred to as imputation herein.

Fig. 15: Impact of the number of instances n on (a) speedup and (b) running time of the imputation only with datasets
U(n, 4, 0.25): 4 variables and 25% missing rate by varying n.

Fig. 16: Impact of the number of instances n on (a) speedup and (b) running time of the variance estimation only with datasets
U(n, 4, 0.25): 4 variables and 25% missing rate by varying n.

Fig. 17: Impact of the missing rate η on (a) speedup and (b) running time of the imputation only with datasets U(1M, 4, η):
1 million instances and 4 variables by varying η.

Fig. 18: Impact of the missing rate η on (a) speedup and (b) running time of the variance estimation only with datasets
U(1M, 4, η): 1 million instances and 4 variables by varying η.

APPENDIX III: Behaviors of the Big-p Algorithm of the P-FHDI

This appendix is based on the authors' dedicated work in [2] and [5] to explain the behaviors of the proposed big-p algorithm, and more details are available therein. Suppose that we have a finite population U of size N. Let U_g be subsets of the finite population with a size of N_g, where g = 1, ..., G. There are basic assumptions for the proofs (denoted by A1 and A2):
• [A1] (N̂_g − N_g, N̂_Rg − N_Rg, Ŷ_Rg − Y_Rg) = O_p(n^{−1/2}N), where (N̂_g, N̂_Rg, Ŷ_Rg) = Σ_{i∈A_g} w_i(1, δ_i, δ_i y_i) and (N_g, N_Rg, Y_Rg) = Σ_{i∈U_g}(1, δ_i, δ_i y_i).
• [A2] The cell mean estimator μ̂_g satisfies μ̂_g − μ_g = O_p(n^{−1/2}), where μ̂_g = Y_Rg/N_Rg and μ_g = Σ_{h=1}^{H} π_{h|g} μ_{gh}.

The FHDI estimator has three important asymptotic properties:

Ŷ_FHDI = Ỹ_FHDI + o_p(n^{−1/2}N)    (III.1)

E(Ỹ_FHDI − Y_p) = 0    (III.2)

V(Ỹ_FHDI) = V(Ỹ_FEFI) + E{Σ_{g=1}^{G} Σ_{i∈A_Mg} w_i² V(ȳ_i* − μ̂_g | A_g)}    (III.3)

where Ỹ_FHDI = Ỹ_FEFI + Σ_{g=1}^{G} Σ_{i∈A_Mg} ω_i(ȳ_i* − μ̂_g); ȳ_i* = M_i^{−1} Σ_{j=1}^{M_i} y_i^{*(j)}; μ_g and μ̂_g are the weighted averages of the y_i's in subgroup g in the population and sample, respectively; Ỹ_FEFI = Σ_{g=1}^{G} Σ_{i∈U_g} ω_i{μ_g + R_g^{−1}δ_i(y_i − μ_g)}; and R_g is the ratio of observed responses in subgroup g. As n increases, the FHDI estimator asymptotically approaches the performance of Ỹ_FEFI. The variance estimator of FEFI is unbiased [6]. The additional variance of the FHDI, i.e., the second term of V(Ỹ_FHDI) in Eq. (III.3), gradually disappears as M_i grows large.
The FHDI mean estimator can be expressed in terms of the FEFI mean estimator as

Ỹ_FHDI = Ỹ_FEFI + Σ_{g=1}^{G} Σ_{i∈A_Mg} w_i(ȳ_i* − μ̂_g)    (III.4)

where ȳ_i* = M^{−1} Σ_{j=1}^{M} y_i^{*(j)}, μ̂_g = Σ_{j∈A_Rg} w_j y_j / Σ_{j∈A_Rg} w_j, and Ỹ_FEFI is

Ỹ_FEFI = Σ_{g=1}^{G} Σ_{i∈U_g} w_i{μ_g + R_g^{−1}δ_i(y_i − μ_g)}    (III.5)

Let E_I(Ỹ_FHDI) be the expectation over the imputation mechanism of the FHDI; then E_I(Ỹ_FHDI) = Ỹ_FEFI. Taking the expectation of Ỹ_FEFI, we have

E(Ỹ_FEFI) = E{E(Ỹ_FEFI | F_N)}
          = E(Y_N) + E{Σ_{g=1}^{G} Σ_{i∈U_g} (R_g^{−1}δ_i − 1)(y_i − μ_g)}    (III.6)

where R_g = N_Rg/N_g, F_N is a set of the finite population, and the second term of Eq. (III.6) is 0. Therefore,

E(Ỹ_FHDI) = E(Y_N) + E{Σ_{g=1}^{G} Σ_{i∈U_g} (R_g^{−1}δ_i − 1)(y_i − μ_g)}    (III.7)

Considering Eq. (22) in the paper, we have V̂(Ŷ_FHDI) for each variable as

V̂(Ŷ_FHDI) = Σ_{k=1}^{n} c_k (Ŷ_FHDI^(k) − Ŷ_FHDI)²    (III.8)

where c_k = (n − 1)/n and Ŷ_FHDI^(k) is

Ŷ_FHDI^(k) = Σ_{g=1}^{G} Σ_{i∈A_g} w_i^(k){δ_i y_i + (1 − δ_i) Σ_{j=1}^{M} w_ij^{*(k)} y_i^{*(j)}}    (III.9)

and Ŷ_FHDI is

Ŷ_FHDI = Σ_{g=1}^{G} Σ_{i∈A_g} w_i{δ_i y_i + (1 − δ_i) Σ_{j=1}^{M} w_ij* y_i^{*(j)}}    (III.10)

By substituting Eqs. (III.9) and (III.10) and assuming w_i^(k) ≈ w_i if n ≫ 1, we have

V̂(Ŷ_FHDI) = Σ_{k=1}^{n} c_k [Σ_{g=1}^{G} Σ_{i∈A_g} w_i (1 − δ_i) Σ_{j=1}^{M} (w_ij^{*(k)} − w_ij*) y_i^{*(j)}]²
           = Σ_{k=1}^{n} c_k [Σ_{i∈A_M} w_i Σ_{j=1}^{M} (w_ij^{*(k)} − w_ij*) y_i^{*(j)}]²    (III.11)
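The jackknife form in Eq. (III.8) is straightforward to evaluate once the replicate estimates Ŷ_FHDI^(k) are available. A minimal numpy sketch, with c_k = (n − 1)/n for every replicate:

```python
import numpy as np

def jackknife_variance(theta_replicates, theta_full):
    """Sketch of Eq. (III.8): V_hat = sum_k c_k (theta^(k) - theta)^2,
    with c_k = (n - 1)/n for the delete-one jackknife."""
    theta_k = np.asarray(theta_replicates, dtype=float)
    n = theta_k.size
    c_k = (n - 1) / n
    return c_k * np.sum((theta_k - theta_full) ** 2)
```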

Note that the w_ij^{*(k)} for replicate k ∈ A_R will be updated by

w_ij^{*(k)} = { w_ij* − w_ij,FEFI*                                          if j = r
              { w_ij* + w_ir,FEFI* · w_ij,FEFI* / Σ_{j≠r} w_ij,FEFI*        if j ≠ r
              { w_ij*                                                        if k ∈ A_M    (III.12)

where r is the index of the closest donor to k and w_ij,FEFI* is defined in Eq. (16) of the paper. Considering the update of w_ij^{*(k)}, we have
  2
n M ∗
X X 
∗ ∗ ∗ ∗(r)
X
∗ ∗
w ij,F EF I ∗

∗(j) 
V̂ (ŶF HDI ) = ck  wi (wir − wir,F EF I − wir )yi + (wij + wir,F EF I P ∗ − wij )yi
k=1 i∈AM

j6=r j6=r wij,F EF I 
  2
n M ∗
X X 
∗ ∗(r)
X

w ij,F EF I ∗(j)

= ck  wi −wir,F EF I yi + (wir,F EF I P ∗ )yi 
 j6 = r wij,F EF I 
k=1 i∈AM j6=r
  2
n M ∗
X X

X w ij,F EF I ∗(j)

∗(r) 
= ck  wi wir,F EF I (P ∗ yi ) − y i
k=1 i∈AM

j6=r j6=r wij,F EF I 
(III.13)
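The replicate-weight update in Eq. (III.12) can be sketched for a single recipient: when the deleted unit k is the closest donor r, that donor loses its FEFI weight, which is redistributed to the remaining donors proportionally to their FEFI weights. The array and argument names here are illustrative.

```python
import numpy as np

def update_replicate_weights(w_star, w_fefi, r, k_in_AM=False):
    """Sketch of Eq. (III.12) for one recipient with M donors.
    w_star: FHDI fractional weights w*_ij; w_fefi: FEFI weights w*_ij,FEFI."""
    w_star = np.asarray(w_star, dtype=float)
    w_fefi = np.asarray(w_fefi, dtype=float)
    if k_in_AM:                       # deleted unit is a missing unit: no change
        return w_star.copy()
    w_new = w_star.copy()
    others = np.arange(w_star.size) != r
    w_new[r] = w_star[r] - w_fefi[r]  # donor r loses its FEFI weight ...
    # ... which is redistributed proportionally to the other donors
    w_new[others] = (w_star[others]
                     + w_fefi[r] * w_fefi[others] / w_fefi[others].sum())
    return w_new
```

Note that the update preserves the recipient's total fractional weight, since the mass removed at r equals the mass redistributed over j ≠ r.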

APPENDIX IV: Treatments of Outliers

Outliers are often deliberately retained to preserve the authenticity of the data distribution. Persistent efforts have been made to detect multivariate outliers over the past decades.

Fig. 19: Average errors of (a) the normalized mean estimator and (b) the normalized SE of the mean estimator versus an increasing proportion of outliers in fully observed observations by 0.5%, 1%, and 2.5%.

Fig. 20: Average errors of (a) the normalized mean estimator and (b) the normalized SE of the mean estimator versus an increasing coefficient of the MD threshold by 2, 4, and 8.

A robust variant based on the Minimum Covariance Determinant was proposed in [7] to detect multivariate outliers. PCout is a computationally fast procedure that utilizes simple properties of principal components to identify outliers in the transformed space [8]. We apply the Mahalanobis distance (denoted as MD) to each observation to measure how many standard deviations it lies away from the mean of the data distribution in Eq. (22). The criterion for outlier detection can be formulated as: an observation is considered an outlier if

√((x − μ)ᵀ Σ⁻¹ (x − μ)) > ε_MD    (IV.1)

where ε_MD is an MD threshold depending on the input data. Filzmoser [9] proposed an adaptive method to find the MD threshold such that extreme values should not be regarded as outliers.
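A direct implementation of the criterion in Eq. (IV.1), using the classical sample mean and covariance rather than the robust variants of [7]-[9], might look as follows; ε_MD is supplied by the caller since it is data-dependent.

```python
import numpy as np

def mahalanobis_outliers(X, eps_md):
    """Flag rows of X as outliers per Eq. (IV.1):
    sqrt((x - mu)^T Sigma^{-1} (x - mu)) > eps_md."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = X - mu
    # row-wise quadratic form d_i^T Sigma^{-1} d_i
    md = np.sqrt(np.einsum('ij,jk,ik->i', d, Sigma_inv, d))
    return md, md > eps_md
```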
Results may differ if outlier treatments (i.e., trimming, winsorization, and robust estimation) are carried out. Trimming discards the observations whose MD is higher than ε_MD. Winsorization replaces observations whose MD is higher than ε_MD with the observation that has the maximum MD satisfying MD ≤ ε_MD. The goal of robust estimation is to find a fit that is robust to outliers. An overview of several robust methods for handling outliers, e.g., the least trimmed squares estimator and the minimum covariance determinant, was given in [10], which also discussed robust linear regression, robust principal component analysis, and robust classification individually.
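Given MDs from the criterion in Eq. (IV.1), the trimming and winsorization treatments described above might be sketched as follows; the function and argument names are illustrative and not part of the P-FHDI interface.

```python
import numpy as np

def treat_outliers(X, md, eps_md, method="trimming"):
    """Trimming drops rows with MD > eps_md; winsorization replaces them
    with the row attaining the largest MD still satisfying MD <= eps_md."""
    X = np.asarray(X, dtype=float)
    md = np.asarray(md, dtype=float)
    is_out = md > eps_md
    if method == "trimming":
        return X[~is_out]
    if method == "winsorization":
        keep = np.flatnonzero(~is_out)
        cap_row = X[keep[np.argmax(md[keep])]]  # largest admissible MD
        Xw = X.copy()
        Xw[is_out] = cap_row
        return Xw
    raise ValueError("unknown method: " + method)
```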
We focus on the effect of trimming and winsorization on the P-FHDI in the presence of outliers. Let the maximum MD of the fully observed observations be ε_MD, and let the coefficient of the MD threshold be M̄D/ε_MD, where M̄D is the average MD of all outliers. The simulations consist of the following steps:
(1) Generate a synthetic dataset with 1000 observations and 4 variables with 10% missingness.
(2) Assign outliers to the fully observed observations.
(3) Apply the different outlier treatments to the dataset: no outlier treatment, trimming, and winsorization.
(4) Perform the P-FHDI on the datasets with the different outlier treatments, respectively.
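Steps (1) and (2) might be sketched as below. The Gaussian population, the missing-completely-at-random mask, and the additive shift used to create outliers are illustrative choices, not the exact settings of the paper, and the P-FHDI call in steps (3)-(4) is omitted.

```python
import numpy as np

def build_simulation_data(n=1000, p=4, miss_rate=0.10, outlier_prop=0.01, seed=0):
    """Generate a synthetic dataset with random missingness, then inject
    outliers into a fraction of the fully observed rows."""
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
    mask = rng.random((n, p)) < miss_rate            # True marks a missing item
    fully_obs = np.flatnonzero(~mask.any(axis=1))
    n_out = max(1, int(outlier_prop * fully_obs.size))
    out_idx = rng.choice(fully_obs, size=n_out, replace=False)
    X[out_idx] += 10.0                               # shift rows to create outliers
    X_miss = X.copy()
    X_miss[mask] = np.nan                            # apply missingness
    return X_miss, mask, out_idx
```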
Figs. 19(a) and 20(a) show the average errors of the mean estimator normalized by the mean of the full data, and Figs. 19(b) and 20(b) show the average errors of the SE of the mean estimator normalized by the SE of the full data. The average errors of the estimates of the P-FHDI without outlier treatment slightly increase with a growing proportion of outliers and remain stable against a growing coefficient of the MD threshold. In contrast, trimming and winsorization show sharply increasing errors of estimates with either a growing proportion of outliers or a growing coefficient of the MD threshold. Overall, the P-FHDI without an outlier treatment is quite robust against an increasing proportion of outliers or coefficient of the MD threshold compared with trimming and winsorization, and the P-FHDI with trimming appears to be more sensitive to outliers than with winsorization.

References
[1] I. Cho, “Data-driven, computational science and engineering platforms repository,” 2017. [Online]. Available:
https://sites.google.com/site/ichoddcse2017/home/type-of-trainings/parallel-fhdi-p-fhdi-1
[2] I. Song, Y. Yang, J. Im, T. Tong, C. Halil, and I. Cho, “Impacts of fractional hot-deck imputation on learning and prediction of engineering data,” IEEE
Transactions on Knowledge and Data Engineering, 2019.
[3] J. Im, I. Cho, and J. K. Kim, “An R package for fractional hot deck imputation,” The R Journal, vol. 10, no. 1, pp. 140–154, 2018.
[4] Condo, “Condo2017: Iowa state university high-performance computing cluster system,” 2017. [Online]. Available:
https://www.hpc.iastate.edu/guides/condo-2017/slurm-job-script-generator-for-condo
[5] J. Im, J. K. Kim, and W. A. Fuller, “Two-phase sampling approach to fractional hot deck imputation,” In Proceedings of Survey Research Methodology
Section, pp. 1030–1043, 2015.
[6] J. K. Kim and W. Fuller, “Fractional hot deck imputation,” Biometrika, vol. 91, no. 3, pp. 559–578, 2004.
[7] C. Leys, O. Klein, Y. Dominicy, and C. Ley, “Detecting multivariate outliers: Use a robust variant of the mahalanobis distance,” Journal of Experimental
Social Psychology, vol. 74, pp. 150–156, 2018.
[8] P. Filzmoser, R. Maronna, and M. Werner, “Outlier identification in high dimensions,” Computational Statistics & Data Analysis, vol. 52, no. 3, pp. 1694–1711, 2008.
[9] P. Filzmoser, “A multivariate outlier detection method,” Proceedings of the Seventh International Conference on Computer Data Analysis and Modeling,
vol. 1, pp. 18–22, 2004.
[10] P. J. Rousseeuw and M. Hubert, “Robust statistics for outlier detection,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1,
no. 1, pp. 73–79, 2011.

