Similarity Join in Top

Similarity Join in Top-k Spatio-Textual
1
SAMPATH KUMAR TATIPALLY , 2Dr. K.SHAHU CHATRAPATHI
1
M.tech, Department Of CSE, School Of Information Technology (JNTUH), Village Kukatpally,
Dist Ranga reddy, Telangana, India.
2
Professor, Department of CSE, JNTUH College of Engineering Manthani, Village Manthnai, Mandal
Manthni, District Pedhapelli, Telangana, India
1 INTRODUCTION : With the rapid development

of mobile Internet technology, Internet users are
Abstract—With the development of location- shifting from desktop to mobile devices. Modern
based services (LBS), LBS users are generating mobile devices (e.g., smartphones and tablets) are
more and more spatio-textual data, e.g., checkins equipped with GPS, which can help users to easily
and attraction reviews. Since a spatio-textual obtain their locations, and location-based services
entity may have different representations, possibly (LBS) have been widely deployed and well accepted,
due to GPS deviations or typographical errors, it e.g. Foursquare and Google Map Search. LBS users
calls for effective methods to integrate the spatio- are generating more and more spatio-textual data
textual data from different data sources. In this which contains both textual descriptions and
paper, we study the problem of top-k spatio- geographical locations, e.g., checkins and attraction
textual similarity join (TOPK-STJOIN), which reviews. In user-generated data, a spatio-textual
identifies the k most similar pairs from two spatio- entity may have different representations, possibly
textual data sets. One big challenge in TOPK- due to GPS deviations or typographical errors [18],
STJOIN is to efficiently identify the top-k similar [3], and it calls for effective methods to integrate the
pairs by considering both textual relevancy and spatio-textual data from different data sources. A
spatial proximity. Traditional join algorithms that spatio-textual similarity join is an important
consider only one dimension (textual or spatial) operation in spatio-textual data integration, which,
are inefficient because they cannot utilize the given two sets of spatio-textual objects, finds all
pruning ability on the other dimension. To similar pairs from the two sets, where the similarity
address this challenge, we propose a signature- can be quantified by combining spatial proximity and
based top-k join framework. We first generate a textual relevancy (see Section 2.1). There are many
spatio-textual signature set for each object such applications in spatio-textual similarity joins, e.g.,
that if two objects are in the top-k similar pairs, user recommendation in location-based social
their signature sets must overlap. With this networks, image duplication detection using spatio-
property we can prune large numbers of textual tags, spatio-textual advertising, and location-
dissimilar pairs without common signatures. We based market analysis [18], [3]. For example, a house
find that the order of accessing the signatures has rental agency (e.g., rent.com) wants to perform a
a significant effect on the performance. So we similarity join on the spatio-textual data of house
compute an upper bound for each signature and requirements from renters and the data of house
propose a best-first accessing method that properties from owners. For another example, a
preferentially accesses signatures with large upper startup company, e.g., Factual (factual.com), crawls
bounds while those pairs with small upper bounds spatio-textual records to generate points of interest
can be pruned. We prove the optimality of our (POIs). As the records are from multiple sources and
best-first accessing method. Next we optimize the may contain many duplicates, It needs to run
spatio-textual signatures and propose progressive similarity joins to remove the duplicates. Bouros et
signatures to further improve the pruning power. al. [3] studied the threshold-based spatiotextual
Experimental results on real-world datasets show similarity join, which asks users to input a textual
that our algorithm achieves high performance and threshold and a spatial threshold and identifies the
good scalability, and significantly outperforms pairs whose textual similarity exceeds the textual
baseline approaches. threshold and spatial distance is within the spatial
threshold. However, in some application, it is rather
hard to obtain appropriate thresholds, because a loose Definition 2 (Spatial Similarity): The spatial
threshold involves a large number of answers while a similarity between r and s is defined as SIMS(r,s) =
tight threshold generates few results. To address this max0,1 − DIST(r.L,s.L) DISTmax , where
problem, we study the top-k spatio-textual similarity DIST(r.L,s.L) is the distance between r.L and s.L,
join (TOPK-STJOIN) to identify the k most similar and DISTmax is a user tolerant distance, which can
pairs, which avoids the process of tuning the be set as the maximal distance between any two
thresholds. In the house rental example, the agency records. Then, we combine the textual similarity and
has overhead to take renters to show their interested spatial similarity to quantify the spatio-textual
houses. Given a budget (e.g., a limited number of similarity.
agents who show houses to renters), to maximize the
profit, the agency aims to find top-k pairs where the 2.2 Related Works Spatio-Textual Similarity Join.
renters in these pairs have large possibilities to rent To the best of our knowledge, this is the first study on
the corresponding houses. In the duplicate detection top-k spatiotextual similarity join. There are two
example, to remove duplicated POIs, the machine works on threshold-basedspatio-
only algorithms may introduce incorrect results, and textualsimilarityjoins[3],[19]. Both require users to
a widely-adopted method is to ask the human to input two thresholds, one for similarity and the other
check these pairs [13]. However, since the crowd is for spatial similarity, and identify the similar pairs
not free, it is expensive to ask every pair, and thus it satisfying the two threshold-based constraints.
aims to select the top-k most similar pairs under a Bouros et al. [3] combined existing string similarity
budget. join techniques with grid partitions, and we extend
this method to support our problem (see Section 7).
2 PRELIMINARIES: However, their method uses seperate textual and
spatial thresholds and it is hard to find approximate
2.1 Problem Formulation Each spatio-textual record thresholds for the top-k similarity join problem. Liu
r = hr.T,r.Li includes a spatial location r.L with et al. [19], [18] utilized spatial regions as the spatial
latitude and longitude and a textual description r.T = part and evaluated the spatial similarity based on the
{t1,t2 ...t|r.T|} with a set of terms. For simplicity, we overlap of two spatial regions. This problem is
interchangeably use r,r.T,r.L to denote the record, its different from ours and cannot be used to address our
term set and its location if the context is clear. Given problem. Top-k Spatio-Textual Search. Spatial
two sets of spatio-textual records, a spatiotextual join keyword search [17], [6], [22], [31], [30], [5], [26],
returns all similar records from the two sets. To [9], [29], [15] has been widely studied and some
quantify the similarity between two spatiotextual works focused on top-k spatio-textual search [17],
records, existing methods usually employ a [6], [22], [31], [30]. Cong et al. [6] combined
similarity-based metric [18], [3]. First, to measure the inverted lists with R-trees to compute top-k answers.
textual similarity, we can adopt any textual similarity Zhang et al. [30] combined inverted index and
function, e.g., Jaccard and Cosine. Here we take the Quadtree. We utlize them to support our problem by
well-known Jaccard function as an example and our maintaining the current top-k answers, deducing a
method can support other textual similarity functions. bound based on the top-k answers, and pruning
Definition 1 (Textual Similarity): The textual dissimilar pairs based on the bound. The method is
similarity between records r and s is defined as rather inefficient as it had to find top-k answers for
SIMT(r,s) = |r.T∩s.T| |r.T∪s.T| , where |r.T∩s.T| is every record.
the size of set r.T∩s.T. The spatial similarity is
evaluated by the spatial distance between two records 3 A SIGNATURE-BASED FRAMEWORK: In this
and is defined as below section, we propose a signature-based framework to
address the top-k spatio-textual join problem. We
first introduce the framework (Section 3.1) and then
propose spatio-textual signatures (Section 3.2).
Finally we devise a signature-based join algorithm
(Section 3.3).
3.1 Framework To answer top-k queries, we first

initialize a priority queueQwith k
randomlyselectedpairs.Let τk denote the smallest
similarity of pairs in Q. Obviously we can prune the
dissimilar pairs whose similarities are smaller than
τk. Thus we only need to access each pair whose
similarity is larger than τk, and use such pair to the top-k answers. Thus we compute a score for each
update Q and τk. If the similarities of remaining pairs triple by considering both spatial proximity and
are smaller than τk, we can terminate and the k pairs textual relevancy, and access the triples by sorting on
in Q are the answers. An important step in this the score.
method is to efficiently prune dissimilar pairs and
identify the pairs with similarity larger than τk. To 5 PROGRESSIVE SIGNATURE Given a triple
address this issue, we propose a filter-verification hr,t,ni, for each record r0 on L(ht,ni), the signature-
framework. The filter step generates a signature set based method has to verify hr,r0i. If there are many
for each record such that if the similarity of two records on L(ht,ni), the verification cost will be high.
records is larger than τk, their signature sets must If we can reduce the size ofL(ht,ni), we can further
share at least one common signature. Based on this improve the performance. To address this issue, we
property, if two records have no common signature, design a progressive signature to improve the pruning
we prune such pair; otherwise we take the pair as a power. We first introduce the basic idea and discuss
candidate. The verification step computes the how to incorporate progressive signatures into our
spatiotextual similarity of each candidate. If the framework
similarity is larger than τk, we use this candidate pair
to updateQ and τk. There are two challenges in this 6 DISCUSSION: Spatial Index Selection. To
framework. evaluate the spatial ability of an index, we sum up the
average spatial upper bounds of all the records as the
4 ACCESSING ORDER : The order of accessing following function. SpatialBound =X r∈RPn|r∈n
the records has a significant effect on the MAXSIMS(r,n−ni) |n| , (16) where n is an ancestor of
performance. If we can first access the highly similar r and |n| is the number of r’ ancestors. We aim to
record pairs, we can increase the threshold τk quickly select the spatial index with the minimum
and prune more dissimilar pairs. For example, SpatialBound. As we use hierarchical spatial indexes,
consider τk=SIMST(r1,r2)=0.697. For r5 and r6 from we only compare two well-known hierarchical
a leaf node n8, SIG(r5|n8,τk)={t1,t2,t4,t6} and indexes, Quadtree and R-tree. Quadtree will be better
SIG(r6|n8,τk)={t3,t4,t6}.hr5,r6ishould be a candidate in our method, because there exist lots of nodes with
pair as SIG(r5|n8,τk) ∩ SIG(r6|n8,τk)={t4,t6}. On the overlaps in the R-tree index, and for these overlapped
other hand, if we first access hr1,r9i and nodes, MAXSIMS(r,n−ni) is equal to the largest
τk=SIMST(r1,r9)=0.82, we can avoid value 1. Therefore, the bound of R-tree is looser than
verifyinghr5,r6i as SIG(r5|n8,τk)={t1,t2}, SIG(r6| that of Quadtree, and accordingly R-tree has larger
n8,τk)={t3,t4}, and SIG(r5|n8,τk)∩SIG(r6|n8,τk)=∅. cost than Quadtree. We also verify this observation
To evaluate different accessing orders, we model the by experiments as discussed in Section 7.4. R6= S.
records and their signatures as triples hr,t,ni, where r We generate the spatio-textual signatures for each
is a record and ht,ni is a signature of r. Our objective dataset. For each ht,ni, we maintain LR(ht,ni) for R
is to determine the accessing orders of triples to and LS(ht,ni) for S. The pairs on the two inverted
achieve high performance. Based on the cost lists (LR(ht,ni)×LS(ht,ni)) are candidates. Out-of-
complexity in Section 3.3, an optimal accessing order Core Setting. A common approach for the case that
is to minimize the total cost. Intuitively, we have the dataset cannot be loaded into memory is to
three accessing strategies. (1) Textual-first: We partition the dataset into small partitions, use our
access the triples by sorting on term t. Obviously, the algorithms to compute the answers on the small
term with the smallest frequency has the least partitions, and then combine these answers to
probability to match. If two records share infrequent generate the final results. Based on this idea, we can
terms, they have large textual similarity. Following devise diskbased or MapReduce-based algorithms
this intuition, we access the triples sorted on terms by [8]. However, in this paper we focus on the in-
document frequency (df) in an ascending order. (2) memory setting and leave devising disk-based or
Spatial-first: We access the triples by sorting on node MapReduce-based algorithms as a future work.
n. Obviously the records in the same deep-level node
(e.g., leaf nodes) have large spatial similarity. Thus 7 CONCLUSION: We proposed a signature-
we want to first access the triples in the deeplevel based framework for top-k spatio-textual similarity
nodes. Following this observation, we access the join. We discussed different accessing orders of
triples by sorting on nodes in a bottom-up manner. signatures proposed a best-first method, and proved
(3) Best-first: The aforementioned two methods sort the best-first accessing order is optimal. We proposed
the triples on one dimension but cannot utilize the progressive signatures to improve pruning power.
other dimension. To achieve high performance, we Experimental results showed that our method
first access the record with high possibility to be in significantly outperformed baselines
Nobari, F. Tauheed, T. Heinis, P. Karras, S. Bressan,
and A. Ailamaki. Touch: in-memory spatial join by
REFERENCES hierarchical data-oriented partitioning. In SIGMOD,
pages 701–712, 2013. [21] S. Qi, P. Bouros, and N.
[1] A. Arasu, V. Ganti, and R. Kaushik. Efficient Mamoulis. Efficient top-k spatial distance joins. In
exact setsimilarity joins. In VLDB, pages 918–929, SSTD, pages 1–18, 2013. [22] J. B. Rocha-Junior, O.
2006. [2] R. J. Bayardo, Y. Ma, and R. Srikant. Gkorgkas, S. Jonassen, and K. Nørvr ag. Efficient
Scaling up all pairs similarity search. In WWW, processing of top-k spatial keyword queries. In
pages 131–140, 2007. [3] P. Bouros, S. Ge, and N. SSTD, pages 205–222, 2011.
Mamoulis. Spatio-textual similarity joins. PVLDB,
6(1):1–12, 2012. [4] S. Chaudhuri, V. Ganti, and R. [23] S. Sahni. Data structures, algorithms, and
Kaushik. A primitive operator for similarity joins in applications in c++. Universities Press, 1999. [24] H.
data cleaning. In ICDE, page 5, 2006. [5] L. Chen, G. Shin, B. Moon, and S. Lee. Adaptive multi-stage
Cong, C. S. Jensen, and D. Wu. Spatial keyword distance join processing. In SIGMOD Conference,
query processing: An experimental evaluation. pages 343–354, 2000. [25] J. Wang, G. Li, and J.
PVLDB, 6(3):217–228, 2013. [6] G. Cong, C. S. Feng. Can we beat the prefix filtering?: an adaptive
Jensen, and D. Wu. Efficient retrieval of the top-k framework for similarity join and search. In
most relevant spatial web objects. PVLDB, 2(1):337– SIGMOD, pages 85–96, 2012. [26] D. Wu, M. L.
348, 2009. [7] A. Corral, Y. Manolopoulos, Y. Yiu, G. Cong, and C. S. Jensen. Joint top-k spatial
Theodoridis, and M. Vassilakopoulos. Closest pair keyword query processing. IEEE Trans. Knowl. Data
queries in spatial databases. In SIGMOD Conference, Eng., 24(10):1889–1903, 2012. [27] C. Xiao, W.
pages 189–200, 2000. [8] D. Deng, G. Li, S. Hao, J. Wang, X. Lin, and H. Shang. Top-k set similarity
Wang, and J. Feng. Massjoin: A mapreduce-based joins. In ICDE, pages 916–927, 2009. [28] C. Xiao,
method for scalable string similarity joins. In ICDE, W. Wang, X. Lin, and J. X. Yu. Efficient similarity
pages 340–351, 2014. [9] I. D. Felipe, V. Hristidis, joins for near duplicate detection. In WWW, pages
and N. Rishe. Keyword search on spatial databases. 131–140, 2008. [29] M. Yu, G. Li, T. Wang, J. Feng,
In ICDE, pages 656–665, 2008. [10] G. R. Hjaltason and Z. Gong. Efficient filtering algorithms for
and H. Samet. Incremental distance join algorithms location-aware publish/subscribe. IEEE Trans.
for spatial databases. In SIGMOD Conference, pages Knowl. Data Eng., 27(4):950–963, 2015. [30] C.
237–248, 1998. [11] H. Hu, Y. Liu, G. Li, J. Feng, Zhang, Y. Zhang, W. Zhang, and X. Lin. Inverted
and K. lee Tan. A locationaware publish/subscribe linear quadtree: Efficient top k spatial keyword
framework for parameterized spatiotextual search. In ICDE, pages 901–912, 2013. [31] D.
subscriptions. In ICDE, pages 1–12, 2015. [12] Y. Zhang, K.-L. Tan, and A. K. H. Tung. Scalable top-k
Jiang, G. Li, J. Feng, and W. Li. String similarity spatial keyword search. In EDBT, pages 359–370,
joins: An experimental evaluation. PVLDB, 2013.
7(8):625–636, 2014. [13]
G.Lamprianidis,D.Skoutas,G.Papatheodorou,andD.Pf
oser. Extraction, integration and exploration of
crowdsourced geospatial content from multiple web
sources. In SIGSPATIAL, pages 553–556, 2014. [14]
G. Li, D. Deng, J. Wang, and J. Feng. PASS-JOIN: A
partitionbased method for similarity joins. PVLDB,
5(3):253–264, 2011. [15] G. Li, J. Feng, and J. Xu.
Desks: Direction-aware spatial keyword search. In
ICDE, pages 474–485, 2012. [16] G. Li, Y. Wang, T.
Wang, and J. Feng. Location-aware
publish/subscribe. In KDD, pages 802–810, 2013.
[17] Z.Li,K.C.K.Lee,B.Zheng,W.-
C.Lee,D.L.Lee,andX.Wang. Ir-tree: An efficient
index for geographic document search. IEEE Trans.
Knowl. Data Eng., 23(4):585–599, 2011. [18] S. Liu,
G. Li, and J. Feng. Star-join: spatio-textual similarity
join. In CIKM, pages 2194–2198, 2012. [19] S. Liu,
G. Li, and J. Feng. A prefix-filter based method for
spatio-textual similarity join. IEEE Transactions on
Knowledge and Data Engineering, 99:1, 2013. [20] S.

Similarity Join in Top

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Similarity Join in Top

Uploaded by

Copyright:

Available Formats

Similarity Join in Top-k Spatio-Textual

1 INTRODUCTION : With the rapid development

3.1 Framework To answer top-k queries, we ﬁrst

You might also like