Efficient Skyline Computation on Massive Incomplete Data
https://doi.org/10.1007/s41019-022-00183-7
RESEARCH PAPERS
Received: 6 March 2021 / Revised: 9 February 2022 / Accepted: 9 March 2022 / Published online: 3 April 2022
© The Author(s) 2022
Abstract
The incomplete skyline query is an important operation to filter out the Pareto-optimal tuples on incomplete data. It is harder than the traditional skyline query due to the intransitivity and cyclic dominance. It is analyzed that the existing algorithms cannot process the incomplete skyline on massive data efficiently. This paper proposes a novel table-scan-based TSI algorithm to deal with the incomplete skyline on massive data with high efficiency. The TSI algorithm solves the issues of intransitivity and cyclic dominance in two separate stages. In stage 1, TSI computes the candidates by a sequential scan on the table; the tuples dominated by others are discarded directly in this stage. In stage 2, TSI refines the candidates by another sequential scan. A pruning operation is devised in this paper to reduce the execution cost of TSI. With the auxiliary structures, TSI can skip the majority of the tuples in stage 1 without actually retrieving them. The extensive experimental results, which are conducted on synthetic and real-life data sets, show that TSI can compute the skyline on massive incomplete data efficiently.
bucket, are the extension of the existing skyline algorithms to accommodate the incomplete data. The replacement algorithm first replaces the incomplete attributes by a special value to transform the incomplete data to complete data. A traditional skyline algorithm can then be used to compute the skyline results SKYcomp on the transformed complete data, which is also a superset of the skyline results on the incomplete data. Finally, the tuples in SKYcomp are transformed into their original incomplete form, and the exhaustive pairwise comparison between all tuples in SKYcomp is performed to compute the final results. The bucket algorithm first divides all the tuples of the incomplete data into different buckets so that all tuples in the same bucket have the same bitmap representation. The dominance relationship within the same bucket is now transitive since the tuples in a bucket have the same bitmap representation. The traditional skyline algorithm is utilized to compute the skyline results within each bucket, which are called the local skyline. The local skyline results of all buckets are merged as the candidate skyline. The exhaustive pairwise comparison is performed on the candidate skyline to compute the query answer. The ISkyline algorithm employs two new concepts, virtual points and shadow skylines, to improve the bucket algorithm. The execution of ISkyline consists of three phases. In phase I, each newly retrieved tuple is compared against the local skyline and the virtual points to determine whether the tuple needs to be (a) stored in the local skyline, (b) stored in the shadow skyline, or (c) discarded directly. In phase II, the tuples newly inserted into the local skyline are compared with the current candidate skyline; ISkyline updates the candidate skyline and the virtual points correspondingly. Every time t tuples are kept in the candidate skyline, ISkyline enters phase III, updates the global skyline, and clears the current candidate skyline. The similar processing continues until the end of the input is reached and ISkyline returns the global skyline.

The replace-based algorithms first replace the incomplete attributes with a specific value, then compute the traditional skyline on the transformed data, and finally refine the candidates to compute the results by pairwise comparison. Normally, the number of candidates on massive data is large and the pairwise comparison is significantly expensive.

2.2 Sorted-Based Algorithms

Bharuka et al. [1] propose a sort-based skyline algorithm SIDS to evaluate the skyline over incomplete data. SIDS first sorts the incomplete data D in non-descending order for each attribute. Let Di be the sorted list with respect to the ith attribute. Only the ids of the tuples are kept in Di, and the ids of the tuples whose ith attributes are incomplete are not stored. SIDS performs a round-robin retrieval on the sorted lists. For each retrieved data p, if it is not retrieved before, it is compared with each data q in the candidate set, which is initialized to be the whole incomplete data. If p and q are compared already, the next data in the candidate set are retrieved and processed. Otherwise, if p dominates q, q is removed from the candidate set. And if q dominates p and p is in the candidate set, p is removed from the candidate set also. If the number of times p is retrieved during the round-robin retrieval is equal to the number of its complete attributes and p is not pruned yet, p can be reported to be one of the skyline results. SIDS terminates when the candidate set becomes empty or all points in the sorted lists are processed at least once.

Sorted-based algorithms utilize the selected tuples with possibly high dominance, obtained via pre-sorted structures, one by one to prune the non-skyline tuples. They usually perform many passes of scan on the table and will incur a high I/O cost on massive data.

2.3 Bucket-Based Algorithms

Lee et al. [10] propose a sorting-based SOBA algorithm to optimize the bucket algorithm. Similar to the bucket algorithm, SOBA also first divides the incomplete data into a set of buckets according to their bitmap representation, then computes the local skyline of tuples in each bucket, and finally performs the pairwise comparison for the skyline candidates (the collection of all local skylines). SOBA uses two techniques to reduce the dominance tests for the skyline candidates. The first technique is to sort the buckets in ascending order of the decimal numbers of their bitmap representations. This can identify the non-skyline points as early as possible. The second technique is to rearrange the order of tuples within the bucket. By sorting tuples in ascending order of the sum of the complete attributes, the tuples accessed earlier have a higher probability to dominate other tuples, and this can help reduce the number of dominance tests further.

Bucket-based algorithms first split the tuples into different buckets according to their attribute encoding so that the tuples in the same bucket have the same encoding and hold the transitivity rule, then compute the local skyline results on every bucket, and finally merge the local skyline results to obtain the results. In incomplete skyline computation, the skyline criteria size usually is greater than that on complete data due to the cyclic dominance, the bucket number involved in bucket-based algorithms often is large, and the number of local skyline results is relatively great. The computation operation and merge operation of local skyline
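The round-robin pruning of SIDS described in Sect. 2.2 can be sketched as follows. This is a simplified illustration, not the authors' implementation: tuples are Python tuples with None marking an incomplete attribute, and dominance is checked on the common complete attributes with smaller values preferred.

```python
def dominates(t1, t2, criteria):
    """t1 dominates t2 on their common complete attributes
    (smaller preferred); None marks an incomplete attribute."""
    common = [i for i in criteria
              if t1[i] is not None and t2[i] is not None]
    return (bool(common)
            and all(t1[i] <= t2[i] for i in common)
            and any(t1[i] < t2[i] for i in common))

def sids(data, criteria):
    """Round-robin retrieval over per-attribute sorted id lists;
    returns the ids of the incomplete-skyline tuples."""
    # Sorted lists keep only the ids of tuples whose attribute is complete.
    sorted_lists = [
        sorted((j for j in range(len(data)) if data[j][i] is not None),
               key=lambda j, i=i: data[j][i])
        for i in criteria
    ]
    candidates = set(range(len(data)))   # initialized to the whole data
    times = [0] * len(data)              # per-tuple retrieval counts
    n_complete = [sum(t[i] is not None for i in criteria) for t in data]
    skyline, pos = [], 0
    while candidates and any(pos < len(sl) for sl in sorted_lists):
        for sl in sorted_lists:          # round-robin over the lists
            if pos >= len(sl):
                continue
            p = sl[pos]
            times[p] += 1
            if times[p] == 1:            # first retrieval: prune with p
                for q in list(candidates):
                    if q != p and dominates(data[p], data[q], criteria):
                        candidates.discard(q)
                    elif q != p and dominates(data[q], data[p], criteria):
                        candidates.discard(p)
            # p was retrieved once per complete attribute and never pruned.
            if times[p] == n_complete[p] and p in candidates:
                skyline.append(p)
        pos += 1
    return skyline
```

The reporting rule mirrors the description above: a tuple is output once its retrieval count equals its number of complete attributes and it still survives in the candidate set.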
Table 1 The frequently used symbols

T: an incomplete table
Tpart: the current part of T loaded in memory
t: a tuple in T
C: the common complete attribute(s)
PIT: the positional index of a tuple in T
S: the size of the memory allocated for storing tuples of T each time
Scnd: the set maintaining the candidate tuples
SLi: the sorted list built for the i-th attribute
MCR: the bit-vector representing the membership checking result of SLi
RIA: the bit-vector representing whether an attribute is complete
Sc: the set of the complete attributes of t
NUMc: the number of complete attributes of each tuple
results often incur high computation cost and I/O cost on massive data.

For the algorithms mentioned above, the dominance over incomplete data is defined on the common complete attributes. There are also other definitions of dominance over incomplete data. Zhang et al. [21] propose a general framework to extend the skyline query. For each attribute, they first retrieve the probability distribution function of the values in the attribute from all the non-missing values on the attribute and then convert incomplete tuples to complete data by estimating all missing attributes. A mapping dominance is then defined on the converted data. Zhang et al. [18] propose PISkyline to compute the probabilistic skyline on incomplete data. It is considered in [18] that each missing attribute value can be described by a probability density function. The probability is used to measure the preference condition between missing values and the valid values. Then, the probability of a tuple being skyline can be computed. PISkyline returns the K tuples with the highest skyline probability.

Discussion Throughout this paper, we use the definition of dominance over incomplete data as in [7]. Firstly, this dominance notion is commonly used in most skyline algorithms over incomplete data. Secondly, the estimation of the incomplete attribute values may be undesirable in some cases. Therefore, we do not guess the incomplete attributes in this paper and do not consider such algorithms further.

In this paper, we consider the skyline over massive incomplete data, i.e., the data set cannot be kept in memory entirely. It is found that the existing algorithms, including [1, 7, 10], all assume that they process in-memory data. Their performance will be seriously degraded on massive data. Since the cardinality of the skyline query increases exponentially with respect to the size of the skyline criteria [4], the replacement algorithm often generates a large number of skyline candidates, and the pairwise comparison among the candidates incurs a prohibitively expensive cost. Bucket-based algorithms, such as the bucket algorithm, ISkyline and SOBA, have the problem that they have to divide the data set into different buckets. Given the size m of the skyline criteria, the number of the buckets can be as high as 2^m − 1; this will cause a serious performance issue when m is not small. SIDS utilizes one selected tuple at a time to prune the non-skyline tuples in the candidate set, and each such step incurs a pass of sequential scan on the data. Thus, it requires many passes of scan on the data to finish its execution, and this will incur a high I/O cost on massive data.

3 Preliminaries

Given an incomplete table T of n tuples with attributes A1, …, AM, some attributes of the tuples in T are incomplete. The attributes in T are considered to be of numerical type; let A1, …, Am be the specified skyline criteria. Throughout the paper, it is assumed that smaller attribute values are preferred. In this paper, the attributes with known values are called complete attributes, while the attributes with unknown values are called incomplete attributes. ∀t ∈ T, t has at least one complete attribute among A1, A2, …, Am, while all other attributes have a probability p (0 < p ≤ 1) of being incomplete. The frequently used symbols in this paper are listed in Table 1.

The dominance over incomplete data is given in Definition 1. The incomplete skyline returns the tuples in T which are not dominated by any other tuples.
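Definition 1 itself is not reproduced in this excerpt; the sketch below encodes the commonly used notion from [7]: t1 dominates t2 if, on their common complete attributes, t1 is no worse on all of them and strictly better on at least one (smaller preferred). The three example tuples are illustrative assumptions, chosen to exhibit the cyclic dominance mentioned above.

```python
MISSING = None  # marker for an incomplete attribute value

def dominates(t1, t2, criteria):
    """Dominance over incomplete data, evaluated only on the
    attributes that are complete in both tuples (smaller preferred)."""
    common = [i for i in criteria
              if t1[i] is not MISSING and t2[i] is not MISSING]
    if not common:                       # no common complete attribute:
        return False                     # the tuples are incomparable
    return (all(t1[i] <= t2[i] for i in common)
            and any(t1[i] < t2[i] for i in common))

# Cyclic dominance, which breaks transitivity on incomplete data:
a, b, c = (1, 2, MISSING), (MISSING, 1, 2), (2, MISSING, 1)
# b dominates a (on A2), c dominates b (on A3), a dominates c (on A1)
```

The cycle among a, b and c is exactly why a single elimination pass cannot produce the final answer: discarding a tuple may also discard the only witness against another tuple.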
The existing algorithms, as mentioned in Sects. 2 and 4, have rather poor performance and very long execution times on massive incomplete data. Therefore, this section first devises a baseline algorithm BA which can be used as a benchmark against the algorithm proposed in this paper. Different from the existing methods, BA adopts a block-nested-loop-like execution. It first retrieves T from the beginning and loads a part of T into the memory, compares the tuples in memory with all tuples in T, and removes the dominated tuples from the memory. Each time, the tuples left in memory have been compared with all other tuples and can be reported as part of the incomplete skyline results. Then, the next part of T is loaded and the similar processing is executed; the iteration continues until all tuples in T have been loaded into memory once and compared with all other tuples.

In this paper, we propose a new algorithm TSI (Table-scan-based Skyline over Incomplete data) to process the skyline over massive incomplete data efficiently. TSI performs two passes of scan on the table to compute the skyline results. Section 6.1 describes the basic execution of the TSI algorithm. The pruning operation is presented in Sect. 6.2.

6.1 Basic Process

The basic process of TSI consists of two stages. In stage 1, TSI performs the first-pass scan on T to find the candidate tuples, while in stage 2, TSI scans T again to discard the candidates which are dominated by some tuple. Algorithm 1 is the pseudo-code of the basic process.
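The block-nested-loop execution of BA can be sketched as follows. The block size, the generator-based table scan and the `dominates` helper (dominance on common complete attributes, smaller preferred, None marking an incomplete value) are illustrative assumptions, not the paper's implementation.

```python
from itertools import islice

def dominates(t1, t2, criteria):
    common = [i for i in criteria
              if t1[i] is not None and t2[i] is not None]
    return (bool(common)
            and all(t1[i] <= t2[i] for i in common)
            and any(t1[i] < t2[i] for i in common))

def ba(scan, criteria, block_size):
    """BA: load one memory-sized block of T, compare it against every
    tuple of T, report the survivors, then load the next block."""
    results, offset = [], 0
    while True:
        block = list(islice(scan(), offset, offset + block_size))
        if not block:
            return results
        for t in scan():                 # one full table scan per block
            block = [c for c in block if not dominates(t, c, criteria)]
        results.extend(block)            # survivors are skyline tuples
        offset += block_size
```

One full scan of T per memory-sized block is what makes BA's I/O cost grow quickly with n, which motivates the two-pass design of TSI.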
In stage 1, TSI retrieves the tuples in T sequentially and maintains the candidate tuples in a set Scnd (empty initially) (line 1). Let t be the currently retrieved tuple. If Scnd is empty, TSI keeps t in Scnd (line 5-6). Otherwise, Scnd is iterated over, and any candidate which is dominated by t is removed from Scnd (line 10-11). At the end of the iteration, if t is dominated by some candidate in Scnd, t is discarded (line 14-15); otherwise, TSI keeps t in Scnd (line 16-17). In stage 1, TSI does not consider the intransitivity and cyclic dominance of the skyline on incomplete data. Any candidate is discarded if it is dominated by some tuple, even though the candidate may dominate the following tuples. In this way, TSI does not need to maintain the dominated tuples and reduces the in-memory maintenance cost significantly. It is proved in Theorem 1 that Scnd contains a superset of the query results at the end of stage 1.

Theorem 1 When the first-pass scan of TSI is over, Scnd maintains a superset of the skyline results over T.

Proof ∀t1 = T(pi1), if t1 is a skyline tuple, there is no other tuple in T which can dominate t1. At the end of stage 1, t1 obviously will be kept in Scnd. If t1 is not a skyline tuple, there is another tuple t2 = T(pi2) which can dominate t1. If pi1 < pi2, t2 will be retrieved after t1 and remove t1 from Scnd. If pi1 > pi2, t2 is retrieved before t1; if t2 is dominated by some tuple and discarded, t1 still will be kept in Scnd at the end of stage 1. Q.E.D.

In stage 2, TSI performs another sequential scan on T. Let t be the currently retrieved tuple (line 22-23); any candidates are removed from Scnd if they are dominated by t (line 26-27). It is proved in Theorem 2 that the candidates in Scnd are the skyline results at the end of stage 2.
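The two-stage process above can be sketched as follows. This is a minimal in-memory illustration, not the disk-resident implementation; the `dominates` helper (dominance on common complete attributes, smaller preferred, None marking an incomplete value) is an assumption of the sketch.

```python
def dominates(t1, t2, criteria):
    common = [i for i in criteria
              if t1[i] is not None and t2[i] is not None]
    return (bool(common)
            and all(t1[i] <= t2[i] for i in common)
            and any(t1[i] < t2[i] for i in common))

def tsi_basic(scan, criteria):
    """Two sequential passes over T; scan() yields the tuples of T."""
    # Stage 1: build the candidate set S_cnd (a superset of the skyline).
    s_cnd = []
    for t in scan():
        # Drop candidates dominated by t, then keep t unless dominated.
        s_cnd = [c for c in s_cnd if not dominates(t, c, criteria)]
        if not any(dominates(c, t, criteria) for c in s_cnd):
            s_cnd.append(t)
    # Stage 2: a second pass removes candidates dominated by any tuple,
    # which repairs the effects of intransitivity and cyclic dominance.
    for t in scan():
        s_cnd = [c for c in s_cnd if not dominates(t, c, criteria)]
    return s_cnd
```

Note that a candidate can never be removed by its own copy in stage 2, because dominance requires a strictly smaller value on at least one common attribute.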
Example 3 The required data structures mentioned above are illustrated in Fig. 7. SL1, SL2, SL3 are three sorted lists, whose elements are arranged in the ascending order of A1, A2, A3, respectively. MCR1,1 is a 16-bit bit-vector representing the membership checking results of SL1(1, 2^1).PIT, i.e., 12 and 8. Therefore, the 8th bit and the 12th bit in MCR1,1 are 1, i.e., MCR1,1 = 0000000100010000. ITV1 keeps the attribute values at exponential gaps in SL1, i.e., SL1(2^1).A1, SL1(2^2).A1, SL1(2^3).A1, SL1(2^4).A1; ITV1 = {26, 47, 65, +∞}. The other MCR bit-vectors and the other ITVs can be obtained similarly. The structure RIAi represents the incomplete values of Ai.

Proof As mentioned above, the value bi is determined as the minimum integer value satisfying ITVi[bi] > t1.Ai. Therefore, the bit 1s of ¬MCR_{i,b_i} represent the tuples whose Ai values are greater than t1.Ai. Since we treat the incomplete attribute values as positive infinity, ⋀_{i=1}^{|Sc|} ¬MCR_{i,b_i} represents the tuples whose values of A1, …, A|Sc| are all greater than those of t1. Given t2 among these tuples, if at least one of A1, …, A|Sc| of t2 is a complete attribute, t2 is dominated by t1 according to the dominance definition over incomplete data. If all of A1, …, A|Sc| of t2 are incomplete, t1 and t2 are not comparable from the perspective of the dominance relationship. The bit 1s of ⋁_{i=1}^{|Sc|} RIA_i mean that at least one of A1, …, A|Sc| is complete, and the bit 0s of ⋁_{i=1}^{|Sc|} RIA_i indicate that all of A1, …, A|Sc| are incomplete. Consequently, the bit 1s of DBV_{t1} = (⋀_{i=1}^{|Sc|} ¬MCR_{i,b_i}) ∧ (⋁_{i=1}^{|Sc|} RIA_i) represent the tuples which are dominated by t1. Q.E.D.
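The dominance-checking bit-vector DBV_{t1} from the proof above can be assembled with plain integer bitwise operations. In this sketch, bit-vectors are Python ints with bit j standing for tuple j of T; the dict-based layout of the MCR and RIA structures and the toy values are illustrative assumptions.

```python
def dominated_bitvector(mcr_prefix, ria, complete_attrs, n):
    """DBV_t1 = (AND_i ~MCR_{i,b_i}) & (OR_i RIA_i), taken over t1's
    complete attributes; bit j is 1 iff tuple j of T is dominated
    by t1 (bit-vectors are Python ints, bit j = tuple j)."""
    mask = (1 << n) - 1          # n tuples in T
    conj = mask                  # tuples with every A_i greater than t1.A_i
    disj = 0                     # tuples with at least one A_i complete
    for i in complete_attrs:
        conj &= ~mcr_prefix[i] & mask   # MCR_{i,b_i}: prefix membership
        disj |= ria[i]                  # RIA_i: attribute A_i is complete
    return conj & disj

# Toy instance with n = 4 tuples and two complete attributes of t1:
mcr = {0: 0b0011, 1: 0b0101}   # prefix-membership bit-vectors
ria = {0: 0b1111, 1: 0b1010}   # completeness bit-vectors
dbv = dominated_bitvector(mcr, ria, [0, 1], 4)   # -> 0b1000
```

Using machine-word bitwise AND/OR keeps the per-pruning-tuple cost proportional to n divided by the word width, which is what makes the pruning operation cheap in stage 1.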
Algorithm 2 is the pseudo-code of the execution of the pruning operation. At the beginning of stage 1, TSI determines the involved pruning tuple files PT1, PT2, …, PTm according to the current skyline criteria and retrieves pruning tuples from them. In the process of retrieving PT1, PT2, …, PTm, TSI maintains a min-heap MH in memory to keep the m pruning tuples with the highest dominance capability (line 4). Given a pruning tuple pt, let Sc be its complete attributes. Likewise, assume that Sc = {A1, …, A|Sc|} (line 5-6). ∀1 ≤ i ≤ |Sc|, we determine the first value ITVi[bi] of ITVi which is greater than pt.Ai; its dominance capability is computed to be ∏_{i=1}^{|Sc|} (2^{b_i}/n) (line 11-13). For the retrieved pruning tuple pt, TSI sets the (pt.PIT)th bit of PRB to 1, since it is retrieved already (line 14). Besides, for each pruning tuple pt, TSI removes any candidates in Scnd which are dominated by pt (line 18-23). If pt is not dominated by any candidate in Scnd, TSI keeps it in Scnd (line 26-27). ∀ptb ∈ MH (1 ≤ b ≤ m), TSI computes its corresponding bit-vector DBV_{ptb} of dominance checking as in Sect. 6.2.2 (line 29). The final pruning bit-vector PRB is PRB = PRB ∨ (⋁_{b=1}^{m} DBV_{ptb}) (line 30).
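Maintaining the min-heap MH of the m most capable pruning tuples can be sketched with heapq. The `capability` function here is a stand-in for the dominance capability computed above, and treating larger values as stronger pruning power is an assumption of this sketch.

```python
import heapq

def top_m_pruning_tuples(pruning_tuples, capability, m):
    """One pass over the pruning tuples with a size-m min-heap, so the
    heap ends up holding the m tuples of highest dominance capability."""
    heap = []                    # entries: (capability, seq, tuple)
    for seq, pt in enumerate(pruning_tuples):
        entry = (capability(pt), seq, pt)
        if len(heap) < m:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:      # better than the current worst
            heapq.heapreplace(heap, entry)
    return [pt for _, _, pt in heap]
```

The `seq` counter only breaks ties so that heap comparisons never fall through to the tuples themselves.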
Fig. 10 The comparison of TSIB and TSI on varying tuple number (×10^6): (a) execution time, (b) candidate size, (c) TSIB decomposition (S1, S2), (d) TSI decomposition (S1, S2, BM, PT), (e) the I/O cost, (f) comparison number
7.2 The Comparison of TSI with and Without Pruning

The performance of TSIB and TSI is compared in different aspects, where TSIB is the TSI algorithm without the pruning operation. As depicted in Fig. 10a, TSI runs 18.84 times faster than TSIB, and the speedup ratio increases with a greater value of n. This significant advantage is due to the effective pruning operation. The numbers of the candidates after stage 1 are illustrated in Fig. 10b. TSI maintains more candidates than TSIB after stage 1. This is because the pruning operation skips most of the tuples in stage 1 and, therefore, many candidates which should have been removed by some tuples are left. But the pruning operation reduces the cost in stage 1 significantly. Figure 10c reports the time decomposition of TSIB. Obviously, the execution time of stage 1 dominates its overall time; we cannot even see the time of stage 2 due to its rather small proportion. Figure 10d gives the time decomposition of TSI, which consists of four parts: the time to retrieve pruning tuples, the time to load the required bit-vectors, the time in stage 1, and the time in stage 2. The time in stage 2 of TSI is longer than that of TSIB due to the greater number of candidates left. However, the time reduction in stage 1 of TSI is much more significant, and TSI runs one order of magnitude faster than TSIB on average. As shown in Fig. 10(e and f), the pruning operation makes TSI incur less I/O cost and perform a smaller number of dominance checks.

7.3 Experiment 1: the Effect of Tuple Number

Given m = 20, M = 60, p = 0.3 and c = 0, experiment 1 evaluates the performance of TSI on varying tuple numbers. As shown in Fig. 11a, TSI runs 60.42 times faster than BA on average. The speedup ratio of TSI over BA increases with a greater value of n, from 8.31 at n = 5 × 10^6 to 166.58 at
Fig. 11 The effect of tuple number (×10^6): (a) execution time, (b) the I/O cost, (c) comparison number, (d) the pruning ratio

Fig. 12 The effect of skyline criteria size: (a) execution time, (b) the I/O cost, (c) comparison number, (d) the pruning ratio
n = 500 × 10^6. Figure 11b depicts that TSI incurs 6.73 times less I/O cost than BA. And as illustrated in Fig. 11c, TSI performs a 38.17 times smaller number of dominance checks than BA. The performance advantage of TSI over BA is widened with a greater value of n. At n = 5 × 10^6, BA can load all of T into memory and perform another table scan on
Fig. 13 The effect of incomplete ratio: (a) execution time, (b) the I/O cost, (c) comparison number, (d) the pruning ratio

Fig. 14 The effect of correlation coefficient: (a) execution time, (b) the I/O cost, (c) comparison number, (d) the pruning ratio
T to compute the incomplete skyline results. At n = 500 × 10^6, BA needs to execute 56 iterations, each loading a part of T and then performing a table scan on T to remove the dominated tuples. On the contrary, TSI shows a slower growing trend on tuple number due to its execution process and pruning operation. As illustrated in Fig. 11d, the pruning operation of TSI can skip the vast majority of tuples in stage 1. The pruning ratio in the experiments is computed by the
Fig. 15 The performance on the real data (HIGGS) with varying incomplete ratio: (a) execution time, (b) the I/O cost, (c) comparison number, (d) the pruning ratio

Fig. 16 The comparison with SOBA and SIDS on varying skyline criteria size: (a) execution time, (b) the I/O cost
formula nskip/n, where nskip is the number of tuples skipped in stage 1.

7.4 Experiment 2: the Effect of Skyline Criteria Size

Given M = 60, n = 50 × 10^6, p = 0.3 and c = 0, experiment 2 evaluates the performance of TSI on varying skyline criteria sizes. As illustrated in Fig. 12a, with a greater value of m, the execution times of BA and TSI both increase significantly; TSI still runs 85.79 times faster than BA on average. For BA, its I/O cost depends on two parts. For one thing, BA needs to retrieve T once to load it into memory. For another, BA performs a sequential scan on T in each iteration to discard the candidates in memory which are dominated by some tuples. For the first part, BA may not retrieve all tuples into memory, since the current tuples may be dominated by tuples in the previous iterations. For the second part, if the current candidates all are discarded, BA does not have to continue the sequential scan but just performs the next iteration directly. When the value of m increases, given that the other parameters are fixed, the probability that a tuple is dominated by another tuple becomes lower. Therefore, the I/O cost increases in both parts. This is reported in Fig. 12b. For TSI, its I/O cost also consists of two parts. In stage 1, TSI performs a selective scan on T to obtain the candidates of the incomplete skyline results. In stage 2, TSI does another sequential scan on T to compute the results, in which, if all candidates are removed, TSI can terminate directly. As the value of m increases, the pruning effect of TSI becomes worse in stage 1, which is also verified in Fig. 12d, and TSI has to retrieve more tuples before it terminates in stage 2. This makes a higher I/O cost for TSI with a greater value of m, as illustrated in Fig. 12b. With the similar explanation, as shown in Fig. 12c, the numbers of dominance checking for both algorithms increase with a greater value of m.

7.5 Experiment 3: the Effect of Incomplete Ratio

Given m = 20, M = 60, n = 50 × 10^6 and c = 0, experiment 3 evaluates the performance of TSI on varying incomplete ratios. As the value of p increases, the execution time of BA decreases quickly, while the execution time of TSI first decreases and then increases gradually. For BA, the decline of execution time is easy to understand. With a greater value of p, the probability that any tuple is dominated by other tuples increases. This makes more in-memory candidates in each iteration dominated by some tuples during the sequential scan, and can reduce the I/O cost and the dominance checking cost. As illustrated in Fig. 13c, with a greater value of p, the number of dominance checking in BA decreases constantly. And as shown in Fig. 13b, the I/O cost of BA first decreases significantly when p increases from 0.3 to 0.4, then remains basically unchanged ever since. When p increases from 0.3 to 0.4, the number of in-memory candidates is reduced during the sequential scan and, in each iteration, BA terminates earlier. This makes less I/O cost for BA. When the value of p is greater than 0.4, the number of in-memory candidates is reduced also, but in each iteration BA reaches an approximately equal scan depth before it terminates. For TSI, the effect of the pruning operation depends on two factors. One is the probability that one tuple can be dominated by other tuples. The other is whether all common attributes of two tuples are incomplete. The two factors have different effects in different cases. With a greater value of p, the probability of a tuple being dominated by some tuples increases, and so does the probability that the common attributes of two tuples are all incomplete. When p increases from 0.3 to 0.5, the first factor has a greater impact; after that, the second factor plays a larger role. This explains the trend of the execution time of TSI. Similarly, this can explain the variation trends of TSI in I/O cost (Fig. 13b), the number of dominance checking (Fig. 13c), and the pruning ratio (Fig. 13d).

7.6 Experiment 4: the Effect of Correlation Coefficient

Given m = 20, M = 60, n = 50 × 10^6 and p = 0.3, experiment 4 evaluates the performance of TSI on varying correlation coefficients. As illustrated in Fig. 14a, TSI runs 47.72 times faster than BA. The correlation coefficients considered range from -0.8 to 0.8. A negative correlation means that there is an inverse relationship between two variables: when one variable decreases, the other increases. A positive correlation means that the variables tend to move in the same direction. Therefore, the skyline computation on negatively correlated data usually is more expensive than that on positively correlated data. The variations of TSI and BA both show a downward trend in experiment 4. Here, the trend is not significant because the incomplete attributes in the data set reduce the impact of correlation. The I/O cost and the number of dominance checking are depicted in Fig. 14(b and c), respectively, and they have similar variation trends. The effect of the pruning operation of TSI is illustrated in Fig. 14d. Due to the impact of incomplete attributes, the pruning ratio shows considerable change, but it still shows an upward trend overall.

7.7 Experiment 5: Real Data

The real data, the HIGGS Data Set, are obtained from the UCI Machine Learning Repository. It contains 11,000,000 tuples with 28 attributes. We select the first 20 attributes as the skyline criteria and evaluate the performance of TSI with varying incomplete ratios. Before the experiment is executed, one attribute first is chosen to be complete, and the other (m − 1) attributes in the skyline criteria have a probability p of being incomplete independently. As depicted in Fig. 15a, TSI runs 40.46 times faster than BA. The variation trends of the execution times of BA and TSI are very close to those in Sect. 7.5 and can be explained similarly. The I/O cost and the number of dominance checking are depicted in Fig. 15(b and c), respectively. The pruning ratio of TSI is illustrated in Fig. 15d. The variation in these figures can be explained similarly as in Sect. 7.5.

7.8 Experiment 6: the Comparison with SOBA and SIDS

In this part, we evaluate the performance of TSI against BA, SOBA and SIDS on a relatively small data set with a relatively small skyline criteria size. Given n = 10 × 10^6, p = 0.3 and c = 0, in order to acquire a better performance for SOBA and SIDS, we set the value of m to be from 6 to 10, and the value of M equal to that of m. This can reduce the length of each tuple and also lower the cost of bucket partitioning for SOBA and SIDS.

As illustrated in Fig. 16a, SIDS is the slowest among the four algorithms while TSI is the fastest at various skyline criteria sizes, and the execution time of SOBA increases significantly with the value of m. When m = 10, SOBA runs 10.96 times slower than BA, the baseline algorithm in this paper, and runs 200.91 times slower than TSI. As for SIDS, it runs 21.11 times slower than BA and runs 386.84 times slower than TSI. On disk-resident data, SOBA and SIDS cannot process the incomplete skyline efficiently. The bucket
partitioning of SOBA involves two passes of table scan, not 3. Chomicki J, Godfrey P, Gryz J, Liang D (2003) Skyline with pre-
to mention the maintenance cost of the large number of par- sorting. In: Proceedings of the 19th international conference on
data engineering, pp 717–719
titions in the disk if the number of m is not small. Then, the 4. Godfrey P (2004) Skyline cardinality for relational processing. In:
computation of local skyline involves another pass of tuple foundations of information and knowledge systems, Third Inter-
retrieval. On the relatively large value of m, the number of national Symposium, FoIKS 2004:78–97
local skyline is great also. As depicted in Fig. 16b, the local 5. Godfrey Parke, Shipley Ryan, Gryz Jarek (2007) Algorithms and
analyses for maximal vector computation. VLDB J 16(1):5–28
skyline makes up 11.7% of the total tuples at m = 10 . The 6. Xixian H, Jianzhong L, Donghua Y, Jinbao W (2013) Efficient
I/O cost of SIDS is much larger than others, SIDS and BA skyline computation on big data. IEEE Trans Knowl Data Eng
are much close in I/O cost. The growth trend of the execu- 25(11):2521–2535
tion time of SIDS is fast with respect to skyline criteria size. 7. Khalefa ME, Mokbel MF, Levandoski JJ (2008) Skyline query
processing for incomplete data. In: Proceedings of the 24th inter-
The performance of TSI is efficient not only for the in-mem- national conference on data engineering, pp 556–565
ory data set with small size of skyline criteria, but also for 8. Kossmann D, Ramsak F, Rost S (2002) Shooting stars in the
the disk-resident data with not small size of skyline criteria. sky: an online algorithm for skyline queries. In: Proceedings of
the 28th international conference on very large data bases, pp
275–286
9. Lee Jongwuk, Hwang Seung-Won (January 2014) Scalable skyline
8 Conclusion

This paper considers the problem of incomplete skyline computation on massive data. It is analyzed that the existing algorithms cannot process the problem efficiently. A table-scan-based algorithm, TSI, is devised in this paper to deal with the problem efficiently. Its execution consists of two stages. In stage 1, TSI maintains the candidates by a sequential scan; in stage 2, TSI performs another sequential scan to refine the candidates and acquire the final result. To reduce the cost of stage 1, which dominates the overall cost of TSI, a pruning operation is utilized to skip unnecessary tuples. The experimental results show that TSI outperforms the existing algorithms significantly.
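The two-stage scheme can be sketched in a few lines of Python. This is a simplified illustration of the stage-1/stage-2 split without the pruning structures, not the paper's actual TSI implementation. Missing attributes are modeled as None, and dominance is evaluated only on the dimensions observed in both tuples (smaller is better), which is exactly why dominance is intransitive here and a second scan is needed.

```python
from typing import List, Optional

Row = List[Optional[float]]  # None marks a missing attribute

def dominates(p: Row, q: Row) -> bool:
    """p dominates q: no worse on every dimension observed in both
    tuples, and strictly better on at least one of them."""
    strict = False
    for a, b in zip(p, q):
        if a is None or b is None:
            continue  # skip dimensions missing in either tuple
        if a > b:
            return False
        if a < b:
            strict = True
    return strict

def skyline_incomplete(table: List[Row]) -> List[Row]:
    # Stage 1: one sequential scan keeps a candidate set; a tuple
    # dominated by some current candidate is discarded immediately.
    candidates: List[Row] = []
    for t in table:
        if not any(dominates(c, t) for c in candidates):
            candidates.append(t)
    # Stage 2: dominance on incomplete data is not transitive, so a
    # surviving candidate may still be dominated by a tuple discarded
    # in stage 1; a second scan over the table refines the candidates.
    return [c for c in candidates
            if not any(dominates(t, c) for t in table if t is not c)]
```

Note that a tuple discarded in stage 1 is dominated by some tuple and can never belong to the result, so stage 1 only produces false positives, which stage 2 removes; with cyclic dominance (a dominates c, c dominates b, b dominates a) the result can even be empty.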
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References

1. Bharuka R, Sreenivasa Kumar P (2013) Finding skylines for incomplete data. In: Proceedings of the 24th Australasian database conference, vol 137, pp 109–117
2. Börzsönyi S, Kossmann D, Stocker K (2001) The skyline operator. In: Proceedings of the 17th international conference on data engineering, pp 421–430
3. Chomicki J, Godfrey P, Gryz J, Liang D (2003) Skyline with presorting. In: Proceedings of the 19th international conference on data engineering, pp 717–719
4. Godfrey P (2004) Skyline cardinality for relational processing. In: Foundations of information and knowledge systems, third international symposium, FoIKS 2004, pp 78–97
5. Godfrey P, Shipley R, Gryz J (2007) Algorithms and analyses for maximal vector computation. VLDB J 16(1):5–28
6. Han X, Li J, Yang D, Wang J (2013) Efficient skyline computation on big data. IEEE Trans Knowl Data Eng 25(11):2521–2535
7. Khalefa ME, Mokbel MF, Levandoski JJ (2008) Skyline query processing for incomplete data. In: Proceedings of the 24th international conference on data engineering, pp 556–565
8. Kossmann D, Ramsak F, Rost S (2002) Shooting stars in the sky: an online algorithm for skyline queries. In: Proceedings of the 28th international conference on very large data bases, pp 275–286
9. Lee J, Hwang S-W (2014) Scalable skyline computation using a balanced pivot selection technique. Inf Syst 39:1–21
10. Lee J, Im H, You G (2016) Optimizing skyline queries over incomplete data. Inf Sci 361–362:14–28
11. Lee KCK, Lee W-C, Zheng B, Li H, Tian Y (2010) Z-sky: an efficient skyline query processing framework based on z-order. VLDB J 19(3):333–362
12. Luo C, Jiang Z, Hou W-C, He S, Zhu Q (2012) A sampling approach for skyline query cardinality estimation. Knowl Inf Syst 32(2):281–301
13. Miao X, Gao Y, Guo S, Liu W (2018) Incomplete data management: a survey. Front Comput Sci 12(1):4–25
14. Papadias D, Tao Y, Fu G, Seeger B (2005) Progressive skyline computation in database systems. ACM Trans Database Syst 30(1):41–82
15. Sheng C, Tao Y (2011) On finding skylines in external memory. In: Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, pp 107–116
16. Tan K-L, Eng P-K, Ooi BC (2001) Efficient progressive skyline computation. In: Proceedings of the 27th international conference on very large data bases, pp 301–310
17. Tao Y, Xiao X, Pei J (2007) Efficient skyline and top-k retrieval in subspaces. IEEE Trans Knowl Data Eng 19(8):1072–1088
18. Zhang K, Gao H, Han X, Cai Z, Li J (2017) Probabilistic skyline on incomplete data. In: Proceedings of the 2017 ACM conference on information and knowledge management, pp 427–436
19. Zhang K, Gao H, Han X, Cai Z, Li J (2020) Modeling and computing probabilistic skyline on incomplete data. IEEE Trans Knowl Data Eng 32(7):1405–1418
20. Zhang S, Mamoulis N, Cheung DW (2009) Scalable skyline computation using object-based space partitioning. In: Proceedings of the 2009 ACM SIGMOD international conference on management of data, pp 483–494
21. Zhang Z, Lu H, Ooi BC, Tung AKH (2010) Understanding the meaning of a shifted sky: a general framework on extending skyline query. VLDB J 19(2):181–201
22. Zhang Z, Yang Y, Cai R, Papadias D, Tung AKH (2009) Kernel-based skyline cardinality estimation. In: Proceedings of the ACM SIGMOD international conference on management of data, pp 509–522