Skyline Queries Project Report
        O1     O2     O3     O4
D1      1      5      NULL   3
D2      NULL   4      6      1
D3      8      NULL   3      2
e       0.4    0.3    0.6    0.8
PSKY    0.4    0.12   0.14   1

Table 1: Example of Uncertain Incomplete Data

feasible options to the user. In this scenario, there is incomplete data in the form of missing amenities, and all data is present in a probabilistic setting. So our approach to finding skyline points in this domain would offer optimal solutions.

1.2 Problem Formulation

As outlined in section 1, there are n tuples O1, O2, ..., On and m dimensions D1, D2, ..., Dm. The j-th dimension of tuple Oi is denoted by Oij. Each tuple Oi has an existential probability eOi and has Mk missing dimensions, where 0 ≤ Mk ≤ m. NULL values are filled in on all these Mk dimensions for each object.

Definition 1 (Dominance Relation). An object Oi is said to dominate an object Oj, denoted by Oi ≺ Oj, if Oik ≤ Ojk for every k, 0 ≤ k ≤ m, such that both Oik ≠ NULL and Ojk ≠ NULL, and there exists a k′ where Oik′ < Ojk′.

According to this dominance relation, two objects are only compared on their common dimensions, i.e. the dimensions in which both of them have non-NULL values.

Definition 2 (Skyline Probability). Each object Oi has a probability PSKYOi associated with it which gives the chance of this object being a skyline:

PSKYOi = Π_{j : Oj ≺ Oi} (1 − eOj)

In Table 1, objects O2 and O3 have only D2 in common, and D2 does not have missing data for them. Here O22 < O32, and so O2 dominates O3. Similarly, it can be shown that O4 ≺ O3, O4 ≺ O2 and O3 ≺ O1. The table also shows the cycle of dominance which can happen in the case of incomplete data: O1 ≺ O2, O2 ≺ O3 and O3 ≺ O1.

PSKYO1 = (1 − e3) = 0.4
PSKYO2 = (1 − e1)(1 − e4) = 0.6 × 0.2 = 0.12
PSKYO3 = (1 − e2)(1 − e4) = 0.7 × 0.2 = 0.14
PSKYO4 = 1

So with τ = 0.30, O1 and O4 are skylines.

For experimentation, since no readily available dataset matched our needs, we synthetically generated points for independent, correlated and anti-correlated data using http://pgfoundry.org/projects/randdataset. We then made certain dimensions NULL (missing) at random and ran our experiments for different densities of incomplete data. We tried two different methods of generating the existential probabilities of the data: a uniform random distribution, and a normal distribution with mean 0.5 and standard deviation 0.2.

2 Related Work

[KML08] explores incomplete data and the Iskyline algorithm, which reduces the number of exhaustive comparisons of the bucket algorithm using virtual points and shadow skylines. [BK13] devised the Sort-based Incomplete Data Skyline (SIDS) algorithm, which improves upon the efficiency of the Iskyline algorithm. Approaches based on completing the incomplete dataset using interpolation have also been explored in [ZLOT10]. But since filling in missing values is problematic, particularly if there is a high amount of data sparsity or the tolerance to false positives is low, we do not pursue this line any further.
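As a sanity check, Definition 1 and Definition 2 can be replayed in a few lines of Python on the data of Table 1 (a minimal sketch; the helper names `dominates` and `psky` are ours, not part of any library):

```python
def dominates(a, b):
    # Definition 1: compare only the common non-missing dimensions;
    # a must be <= b on all of them and strictly smaller in at least one.
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return bool(common) and all(x <= y for x, y in common) and any(x < y for x, y in common)

# Table 1: dimension values (None = NULL) and existential probabilities e.
values = {"O1": (1, None, 8), "O2": (5, 4, None), "O3": (None, 6, 3), "O4": (3, 1, 2)}
e = {"O1": 0.4, "O2": 0.3, "O3": 0.6, "O4": 0.8}

# Definition 2: PSKY(Oi) is the product of (1 - e(Oj)) over all dominators Oj.
psky = {}
for oi, vi in values.items():
    p = 1.0
    for oj, vj in values.items():
        if oj != oi and dominates(vj, vi):
            p *= 1 - e[oj]
    psky[oi] = round(p, 2)

tau = 0.30
skylines = sorted(o for o, p in psky.items() if p >= tau)
print(psky)      # {'O1': 0.4, 'O2': 0.12, 'O3': 0.14, 'O4': 1.0}
print(skylines)  # ['O1', 'O4']
```

The output reproduces the PSKY row of Table 1, including the dominance cycle O1 ≺ O2 ≺ O3 ≺ O1.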
We have continued along the lines where incomplete data is tackled with comparisons only in the dimensions where both entries have finite values, as defined in the dominance relation (Definition 1).

As a baseline, a naive algorithm is implemented which compares every object with every other object and computes the skyline probability of each object. Those objects Oi whose PSKYOi < τ are pruned, while the remaining objects are output as the skyline set SKY.

Two different algorithms are studied, while a third algorithm is outlined as future work.

3.1 Exclusive Filtering Skylines (EFS)

The EFS algorithm optimizes over the naive algorithm in the same way as SFS optimizes over the BNL algorithm. In order to decrease the number of comparisons, we must avoid comparing shadowed objects with each other. The EFS algorithm is devised so that whenever the skyline probability of an object falls below the threshold τ, the object is removed from the dataset D.

However, this object cannot be pruned entirely, as it can still prune some other objects whose PSKY is yet above the threshold. So these objects are inserted into ShadowedSet. After each object has been processed at least once, it is inserted into ComparedSet. The non-shadowed objects are compared exhaustively with all shadowed objects from ShadowedSet with which they were not compared before. The skyline probability is updated accordingly, and the objects which remain above the threshold after all possible comparisons are inserted into the SKY set. The algorithm is described in Algorithm 1.

Algorithm 1 Exclusive Filtering
Input: Dataset D
Output: skyline set SKY
 1: SKY = ∅, ShadowedSet = ∅, ComparedSet = ∅
 2: for each object Oi ∈ D do
 3:   for each object Oj ≠ Oi ∈ D do
 4:     if Oi, Oj not compared before then
 5:       if Oi ≺ Oj then
 6:         Update PSKYOj
 7:       else if Oj ≺ Oi then
 8:         Update PSKYOi
 9:       end if
10:     end if
11:     if PSKYOj < τ then
12:       Remove Oj from D and insert in ShadowedSet
13:     end if
14:   end for
15:   Remove Oi from D and put in ComparedSet
16: end for
17: for each object Oi ∈ ComparedSet do
18:   if PSKYOi > τ then
19:     if Oi has been compared n times then
20:       Insert Oi in SKY
21:     else
22:       Compare Oi with ShadowedSet and update PSKYOi
23:       if PSKYOi > τ then
24:         Insert Oi in SKY
25:       end if
26:     end if
27:   end if
28: end for
29: return SKY

In order to check whether an object has been compared with all objects in the shadowed set, we would need to maintain an N² bit matrix, which is not space efficient. Instead, we keep a timestamp with each object; during the first pass the timestamp is updated only for the object which is being checked against the rest. In the second round of checking, a tuple is checked against a shadowed object only if the shadow's timestamp is less than that of the object.

3.2 Bucketed-SFS algorithm

The complete algorithm is described in Algorithm 2. Partition all objects into separate buckets where each bucket contains the objects which have the same non-NULL dimensions [KML08]. We apply filtering on each bucket to find shadowed objects. These objects are by definition not in the skyline set, so all non-shadowed objects from each bucket are possible candidates for being a global skyline.

Now compute the entropy of each object based on all the non-NULL dimensions of this bucket:

E(Oi) = Σ_{k=1, Oik ≠ NULL}^{m} (1 + Oik)

Sort each bucket according to this entropy function.
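The bucketing and entropy scoring can be sketched as follows (illustrative only; the data points and the helper names `bucket_key` and `entropy` are our own, with E computed as the sum of (1 + Oik) over non-NULL dimensions as above):

```python
from collections import defaultdict

def bucket_key(obj):
    # Objects with the same pattern of non-NULL dimensions share a bucket.
    return tuple(v is not None for v in obj)

def entropy(obj):
    # E(Oi) = sum over non-NULL dimensions of (1 + Oik).
    return sum(1 + v for v in obj if v is not None)

points = [(2, None, 3), (4, None, 6), (1, None, 8), (5, 4, None)]
buckets = defaultdict(list)
for p in points:
    buckets[bucket_key(p)].append(p)

# Within a bucket all objects share the same non-NULL dimensions, so a
# dominator's entropy is strictly smaller: sorting by entropy puts any
# potential dominator before the objects it dominates.
b = sorted(buckets[(True, False, True)], key=entropy)
print(b)  # [(2, None, 3), (1, None, 8), (4, None, 6)]
```

Here (2, None, 3) dominates (4, None, 6) and, as expected, precedes it in the entropy order.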
Algorithm 2 Bucketed SFS
Input: Database D
Output: skyline set SKY
 1: Initialize PSKYO with 1 for all objects
 2: Partition the data points in D into 2^m buckets {B1, B2, ..., B2^m} based on their NULL dimensions
 3: for each bucket Bi do
 4:   pruneSelf(Bi)
 5: end for
 6: for each bucket Bi do
 7:   for each bucket Bj ≠ Bi do
 8:     mergeBuckets(Bi, Bj)
 9:   end for
10: end for
11: for each bucket Bi do
12:   for each non shadowed object Ok ∈ Bi do
13:     for each bucket Bj ≠ Bi do
14:       for each shadowed object Ol ∈ Bj not compared with Ok before do
15:         if Ol ≺ Ok then
16:           PSKYOk = PSKYOk × (1 − eOl)
17:         end if
18:       end for
19:     end for
20:     if PSKYOk > τ then
21:       SKY.insert(Ok)
22:     end if
23:   end for
24: end for
25: return SKY

From Sort Filtering Skylines (SFS) [CGGL03], we know that if an object Oi ≺ Oj, then E(Oi) < E(Oj), and that if E(Oi) < E(Oj), then Oj cannot dominate Oi.

So we start from the object Oi with the lowest entropy in a bucket and iterate down the sorted order. An inner loop is run to compare Oi with all objects having smaller entropy. Two options are used based on the number of buckets filled.

When the degree of incompleteness is between 20% and 70%, most of the buckets are somewhat uniformly occupied. So Algorithm 3 is applied, which runs a loop backwards where each object Oi finds the nearest object Oj such that Oj ≺ Oi and no other Ok dominating Oi exists with j < k < i. Since all objects in a bucket are compared on the same common dimensions, transitivity of the dominance relation holds within the bucket. Therefore we store a dominator pointer for each object, which stores this nearest object dominating it, and use it to avoid unnecessary comparisons.

But when most of the buckets are empty, we either have a huge amount of missing data or very little of it, since the existing data accumulate in a few buckets. Here, the pruneSelf algorithm with the usual top-down iteration as in SFS is to be used, since the pointer method would also result in O(sizeof(bucket)²) comparisons.

Algorithm 3 PruneSelf
Input: Bucket Bi
 1: for each object Ok ∈ Bi do
 2:   Calculate entropy in the non-NULL dimensions
 3: end for
 4: Sort the bucket based on entropy in non-decreasing order
 5: Initialize the parent pointer array to track dominators
 6: for k = 0 to Bi.size() − 1 do
 7:   Initialize bitmap to false
 8:   for l = k − 1 down to 0 do
 9:     if parent pointer marked as dominator in bitmap then
10:       avoid the comparison and update PSKYOk
11:     else if Ol ≺ Ok then
12:       PSKYOk = PSKYOk × (1 − eOl)
13:       Update bitmap to mark any dominator of Ol as Ok's dominator
14:     end if
15:   end for
16:   Update parent pointer as the nearest object with highest entropy which dominates Ok
17: end for
18: All objects in bucket Bi whose PSKY < τ are marked as shadowed

After this analysis, all objects whose skyline probability fell below the threshold are marked as shadowed. Since they can no longer become a skyline object, there is no need to further refine their computation of PSKY.

Now buckets are taken pairwise and their non-shadowed objects are compared. To make the execution faster, they are again sorted on their entropy value, but this time the entropy is calculated based on the common non-NULL dimensions of the two buckets. The pseudocode is in Algorithm 4. The only difference from pruneSelf is that we have to make sure that PSKY is not updated when an object is dominated by another object of the same bucket while merging, because objects of the same bucket were already compared and their PSKY bounded in pruneSelf.
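A stripped-down version of pruneSelf can illustrate the within-bucket pass (the dominator-pointer and bitmap shortcut of Algorithm 3 is deliberately omitted here; the data, names and threshold are illustrative):

```python
def entropy(obj):
    return sum(1 + v for v in obj if v is not None)

def dominates(a, b):
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return bool(common) and all(x <= y for x, y in common) and any(x < y for x, y in common)

def prune_self(bucket, e, tau):
    # Sort by entropy so that only earlier objects can dominate later ones,
    # multiply (1 - e) into PSKY for every dominator found, and mark objects
    # whose skyline probability falls below tau as shadowed.
    order = sorted(range(len(bucket)), key=lambda i: entropy(bucket[i]))
    psky = [1.0] * len(bucket)
    shadowed = set()
    for pos, i in enumerate(order):
        for j in order[:pos]:
            if dominates(bucket[j], bucket[i]):
                psky[i] *= 1 - e[j]
        if psky[i] < tau:
            shadowed.add(i)
    return [round(p, 2) for p in psky], shadowed

bucket = [(4, None, 5), (2, None, 3), (3, None, 4)]
e      = [0.5, 0.9, 0.6]
psky, shadowed = prune_self(bucket, e, tau=0.30)
print(psky, shadowed)  # [0.04, 1.0, 0.1] {0, 2}
```

Only (2, None, 3) survives: it dominates both other objects, whose probabilities fall below τ.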
Algorithm 4 MergeBuckets
Input: Buckets Bi, Bj
 1: for each non shadowed object Ok ∈ Bi do
 2:   Calculate entropy based on the common non-NULL dimensions of Bi, Bj
 3: end for
 4: for each non shadowed object Ok ∈ Bj do
 5:   Calculate entropy based on the common non-NULL dimensions of Bi, Bj
 6: end for
 7: Sort Bi and Bj based on entropy in non-decreasing order
 8: Emulate the merge process of merge-sort using the sorted lists Bi and Bj
 9: for k = 0 to Bi.size() − 1 do
10:   l = 0. Denote the Bi element by Ok and the Bj element by Ol
11:   while Ol.entropy < Ok.entropy do
12:     if Ol ≺ Ok then
13:       PSKYOk = PSKYOk × (1 − eOl)
14:     end if
15:     l = l + 1
16:   end while
17: end for
18: for l = 0 to Bj.size() − 1 do
19:   k = 0. Denote the Bi element by Ok and the Bj element by Ol
20:   while Ok.entropy < Ol.entropy do
21:     if Ok ≺ Ol then
22:       PSKYOl = PSKYOl × (1 − eOk)
23:     end if
24:     k = k + 1
25:   end while
26: end for
27: All objects in both buckets whose PSKY < τ are marked as shadowed

Figure 1: Ratio of Comparisons vs Dataset Size for 20% Incomplete 10 dimensional Correlated Data

Figure 2: Ratio of Comparisons vs Dataset Size for 20% Incomplete 10 dimensional Independent Data

Algorithm 2 can be optimized if dimension-based sorting and early stopping are used. Instead of sorting by entropy, we would sort by individual dimensions and store the sorted orders for future lookups. All dimensions would be considered in a round-robin fashion to prune other objects and update their PSKY. This algorithm forms a part of our future investigation.
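The merge-style cross-bucket pass of Algorithm 4 can be sketched in one direction as follows (a simplification: it assumes every object in a bucket shares that bucket's non-NULL pattern, and the function and variable names and the data are ours):

```python
def common_dims(a, b):
    # Indices where both representative objects have non-NULL values.
    return [k for k in range(len(a)) if a[k] is not None and b[k] is not None]

def merge_one_direction(bi, bj, e_j, psky_i, dims):
    # Entropy over the shared non-NULL dimensions of the two buckets.
    ent = lambda obj: sum(1 + obj[k] for k in dims)
    dom = lambda a, b: all(a[k] <= b[k] for k in dims) and any(a[k] < b[k] for k in dims)
    order_i = sorted(range(len(bi)), key=lambda i: ent(bi[i]))
    order_j = sorted(range(len(bj)), key=lambda j: ent(bj[j]))
    for i in order_i:
        for j in order_j:
            if ent(bj[j]) >= ent(bi[i]):
                break  # no later Bj object can dominate Oi: its entropy is too large
            if dom(bj[j], bi[i]):
                psky_i[i] *= 1 - e_j[j]
    return psky_i

bi = [(4, None, 6), (2, None, 1)]
bj = [(1, 2, None), (5, 3, None)]
dims = common_dims(bi[0], bj[0])        # only D1 is shared: [0]
print(merge_one_direction(bi, bj, [0.5, 0.2], [1.0, 1.0], dims))  # [0.5, 0.5]
```

The early break mirrors the `while Ol.entropy < Ok.entropy` guard of Algorithm 4: once a Bj object's entropy reaches that of the current Bi object, it can no longer dominate it.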
Figure 3: Ratio of Comparisons vs Dataset Size for 20% Incomplete 10 dimensional Anti-Correlated Data

Figure 4: Ratio of Comparisons vs Dataset Size for 40% Incomplete 10 dimensional Correlated Data

Figure 6: Ratio of Comparisons vs Dataset Size for 40% Incomplete 10 dimensional Anti-Correlated Data

The reason for this sudden decrease in the number of comparisons is the even spreading of the data into all possible 2^m buckets when more data is missing. When less data is missing, most of the data accumulates in the buckets whose indices are at the end, i.e. the indices corresponding to most dimensions having non-NULL data. This uniform spreading of the data makes both the pruneSelf and MergeBuckets procedures run on many smaller buckets instead of a few larger ones. The time for separately sorting many smaller lists is also less than that for their combined list.
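The last claim is easy to check with a back-of-the-envelope n·log₂n model of sorting cost (the 16-bucket split below is an illustrative assumption, not a measured figure):

```python
import math

N, parts = 10000, 16                     # 10k points, hypothetically spread over 16 buckets
combined = N * math.log2(N)              # cost model for sorting one combined list
split = parts * (N // parts) * math.log2(N // parts)  # sorting 16 lists of 625
print(f"combined: {combined:.0f}, split: {split:.0f}")
```

The split sort is cheaper by a factor of log₂(N)/log₂(N/parts), i.e. roughly 30% fewer comparisons in this setting.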
Figure 7: Ratio of Comparisons vs Degree of Incompleteness for 10k 10 dimensional Correlated Data
Figure 8: Ratio of Comparisons vs Degree of Incompleteness for 10k 10 dimensional Independent Data
Figure 9: Ratio of Comparisons vs Degree of Incompleteness for 10k 10 dimensional Anti-Correlated Data
Figure 10: Ratio of Comparisons vs Number of Dimensions for Correlated Data

Figure 12: Ratio of Comparisons vs Number of Dimensions for Anti-Correlated Data
Data              Percentage of   #Dimensions   #Objects    #Dimensions compared /
                  missing data    compared      compared    #Objects compared
Correlated        0               185509661     47393466    3.91
Independent       0               149266285     49990682    2.99
Anti-correlated   0               131461868     49995000    2.63
Correlated        20              38849105      8721242     4.45
Independent       20              83140890      27902713    2.98
Anti-correlated   20              102804469     38038064    2.70

Table 2: Dimension comparisons and object comparisons for Bucketed-SFS on 10000 sized 10 dimensional data
5 Conclusions and Future Work

In this project, we have explored two new algorithms to tackle skyline queries in uncertain datasets with incomplete data.

- Bucketed-SFS outperforms EFS when there is a considerable amount of missing data (> 20% of the total data). In other cases, EFS performs slightly better.
- Bucketed-SFS performs 20 times faster than the naive algorithm on average for all kinds of data.
- The EFS algorithm is best suited for correlated data, whereas Bucketed-SFS is appropriate for all kinds of data - it is virtually independent of the nature of the data when compared with the other two algorithms.

References

[AQ09] Mikhail J. Atallah and Yinian Qi. Computing all skyline probabilities for uncertain data. In Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 279-287. ACM, 2009.

[BKS01] S. Borzsony, Donald Kossmann, and Konrad Stocker. The skyline operator. In Proceedings of the 17th International Conference on Data Engineering, pages 421-430. IEEE, 2001.

[CGGL03] Jan Chomicki, Parke Godfrey, Jarek Gryz, and Dongming Liang. Skyline with presorting. In ICDE, volume 3, pages 717-719, 2003.

[KML08] Mohamed E. Khalefa, Mohamed F. Mokbel, and Justin J. Levandoski. Skyline query processing for incomplete data. In Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE 2008), pages 556-565. IEEE, 2008.