Scaling Distributional Similarity to Large Corpora
James Gorman and James R. Curran
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 361–368, Sydney, July 2006.
© 2006 Association for Computational Linguistics
2.1 Extraction

A context relation is defined as a tuple (w, r, w′) where w is a term, which occurs in some grammatical relation r with another word w′ in some sentence. We refer to the tuple (r, w′) as an attribute of w. For example, (dog, direct-obj, walk) indicates that dog was the direct object of walk in a sentence.

In our experiments context extraction begins with a Maximum Entropy POS tagger and chunker. The SEXTANT relation extractor (Grefenstette, 1994) produces context relations that are then lemmatised. The relations for each term are collected together and counted, producing a vector of attributes and their frequencies in the corpus.

2.2 Measures and Weights

Both nearest-neighbour and cluster analysis methods require a distance measure to calculate the similarity between context vectors. Curran (2004) decomposes this into measure and weight functions. The measure calculates the similarity between two weighted context vectors and the weight calculates the informativeness of each context relation from the raw frequencies.

For these experiments we use the Jaccard (1) measure and the TTest (2) weight functions, found by Curran (2004) to have the best performance.

\[
\frac{\sum_{(r,w')} \min(\mathrm{w}(w_m,r,w'),\, \mathrm{w}(w_n,r,w'))}{\sum_{(r,w')} \max(\mathrm{w}(w_m,r,w'),\, \mathrm{w}(w_n,r,w'))} \tag{1}
\]

\[
\frac{p(w,r,w') - p({*},r,w')\,p(w,{*},{*})}{\sqrt{p({*},r,w')\,p(w,{*},{*})}} \tag{2}
\]

2.3 Nearest-neighbour Search

The simplest algorithm for finding synonyms is a k-nearest-neighbour (k-NN) search, which involves pair-wise vector comparison of the target term with every term in the vocabulary. Given an n term vocabulary and up to m attributes for each term, the asymptotic time complexity of nearest-neighbour search is O(n²m). This is very expensive, with even a moderate vocabulary making the use of huge datasets infeasible. Our largest experiments used a vocabulary of over 184,000 words.

3 Dimensionality Reduction

Using a cut-off to remove low frequency terms can significantly reduce the value of n. Unfortunately, reducing m by eliminating low frequency contexts has a significant impact on the quality of the results. There are many techniques to reduce dimensionality while avoiding this problem. The simplest methods use feature selection techniques, such as information gain, to remove the attributes that are less informative. Other techniques smooth the data while reducing dimensionality.

Latent Semantic Analysis (LSA, Landauer and Dumais, 1997) is a smoothing and dimensionality reduction technique based on the intuition that the true dimensionality of data is latent in the surface dimensionality. Landauer and Dumais admit that, from a pragmatic perspective, the same effect as LSA can be generated by using large volumes of data with very long attribute vectors. Experiments with LSA typically use attribute vectors of a dimensionality of around 1,000. Our experiments have a dimensionality of 500,000 to 1,500,000. Decompositions on data this size are computationally difficult. Dimensionality reduction is often used before using LSA to improve its scalability.

3.1 Heuristics

Another technique is to use an initial heuristic comparison to reduce the number of full O(m) vector comparisons that are performed. If the heuristic comparison is sufficiently fast and a sufficient number of full comparisons are avoided, the cost of an additional check will be easily absorbed by the savings made.

Curran and Moens (2002) introduce a vector of canonical attributes (of bounded length k ≪ m), selected from the full vector, to represent the term. These attributes are the most strongly weighted verb attributes, chosen because they constrain the semantics of the term more and partake in fewer idiomatic collocations. If a pair of terms share at least one canonical attribute then a full similarity comparison is performed, otherwise the terms are not compared. They show an 89% reduction in search time, with only a 3.9% loss in accuracy.

There is a significant improvement in the computational complexity. If a maximum of p positive results are returned, our complexity becomes O(n²k + npm). When p ≪ n, the system will be faster as many fewer full comparisons will be made, but at the cost of accuracy as more possibly near results will be discarded out of hand.

4 Randomised Techniques

Conventional dimensionality reduction techniques can be computationally expensive: a more scalable solution is required to handle the volumes of data we propose to use. Randomised techniques provide a possible solution to this.
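As a concrete point of reference, the O(n²m) brute-force search of Section 2.3, using the Jaccard measure of Section 2.2, can be sketched as follows. This is a minimal illustration only: the function names are ours, and the toy vectors use raw counts where the paper would use TTest weights.

```python
def jaccard(a, b):
    """Weighted Jaccard measure (Equation 1) over attribute -> weight dicts."""
    attrs = set(a) | set(b)
    num = sum(min(a.get(x, 0.0), b.get(x, 0.0)) for x in attrs)
    den = sum(max(a.get(x, 0.0), b.get(x, 0.0)) for x in attrs)
    return num / den if den else 0.0

def naive_knn(target, vectors, k):
    """Brute-force k-NN: compare the target against every other term."""
    scored = sorted(((jaccard(vectors[target], v), t)
                     for t, v in vectors.items() if t != target),
                    reverse=True)
    return scored[:k]

# Toy vocabulary of (relation, word) attributes with raw counts as weights.
vectors = {
    "dog": {("direct-obj", "walk"): 2.0, ("subj", "bark"): 1.0},
    "cat": {("direct-obj", "walk"): 1.0, ("subj", "purr"): 1.0},
    "car": {("direct-obj", "drive"): 3.0},
}
print(naive_knn("dog", vectors, 1))  # [(0.25, 'cat')]
```

At scale, the quadratic number of pair-wise comparisons dominates this computation, which is what the randomised techniques aim to avoid.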
We present two techniques that have been used recently for distributional similarity: Random Indexing (Kanerva et al., 2000) and Locality Sensitive Hashing (LSH, Broder, 1997).

4.1 Random Indexing

Random Indexing (RI) is a hashing technique based on Sparse Distributed Memory (Kanerva, 1993). Karlgren and Sahlgren (2001) showed RI produces results similar to LSA using the Test of English as a Foreign Language (TOEFL) evaluation. Sahlgren and Karlgren (2005) showed the technique to be successful in generating bilingual lexicons from parallel corpora.

In RI, we first allocate a d length index vector to each unique attribute. The vectors consist of a large number of 0s and a small number (ε) of randomly distributed ±1s. Context vectors, identifying terms, are generated by summing the index vectors of the attributes for each non-unique context in which a term appears. The context vector for a term t appearing in contexts c1 = [1, 0, 0, −1] and c2 = [0, 1, 0, −1] would be [1, 1, 0, −2]. The distance between these context vectors is then measured using the cosine measure:

\[
\cos(\theta(\vec{u},\vec{v})) = \frac{\vec{u}\cdot\vec{v}}{|\vec{u}|\,|\vec{v}|} \tag{3}
\]

This technique allows for incremental sampling, where the index vector for an attribute is only generated when the attribute is encountered. Construction complexity is O(nmd) and search complexity is O(n²d).

4.2 Locality Sensitive Hashing

LSH is a probabilistic technique that allows the approximation of a similarity function. Broder (1997) proposed an approximation of the Jaccard similarity function using min-wise independent functions. Charikar (2002) proposed an approximation of the cosine measure using random hyperplanes. Ravichandran et al. (2005) used this cosine variant and showed it to produce over 70% accuracy in extracting synonyms when compared against Pantel and Lin (2002).

Given we have n terms in an m′ dimensional space, we create d ≪ m′ unit random vectors, also of m′ dimensions, labelled {r⃗1, r⃗2, ..., r⃗d}. Each vector is created by sampling a Gaussian function m′ times, with a mean of 0 and a variance of 1. For each term w we construct its bit signature using the function

\[
h_{\vec{r}}(\vec{w}) =
\begin{cases}
1 & \vec{r}\cdot\vec{w} \ge 0 \\
0 & \vec{r}\cdot\vec{w} < 0
\end{cases}
\]

where r⃗ is a spherically symmetric random vector of length d. The signature, w̄, is the d length bit vector:

\[
\bar{w} = \{\, h_{\vec{r}_1}(\vec{w}),\ h_{\vec{r}_2}(\vec{w}),\ \ldots,\ h_{\vec{r}_d}(\vec{w}) \,\}
\]

The cost to build all n signatures is O(nm′d). For terms u and v, Goemans and Williamson (1995) approximate the angular similarity by

\[
p(h_{\vec{r}}(\vec{u}) = h_{\vec{r}}(\vec{v})) = 1 - \frac{\theta(\vec{u},\vec{v})}{\pi} \tag{4}
\]

where θ(u⃗, v⃗) is the angle between u⃗ and v⃗. The angular similarity gives the cosine by

\[
\cos(\theta(\vec{u},\vec{v})) = \cos\big(\,(1 - p(h_{\vec{r}}(\vec{u}) = h_{\vec{r}}(\vec{v})))\,\pi\,\big) \tag{5}
\]

The probability can be derived from the Hamming distance:

\[
p(h_{\vec{r}}(\vec{u}) = h_{\vec{r}}(\vec{v})) = 1 - \frac{H(\bar{u},\bar{v})}{d} \tag{6}
\]

By combining equations 5 and 6 we get the following approximation of the cosine distance:

\[
\cos(\theta(\vec{u},\vec{v})) = \cos\left(\frac{H(\bar{u},\bar{v})}{d}\,\pi\right) \tag{7}
\]

That is, the cosine of two context vectors is approximated by the cosine of the Hamming distance between their two signatures normalised by the size of the signatures. Search is performed using Equation 7 and scales to O(n²d).

5 Data Structures

The methods presented above fail to address the n² component of the search complexity. Many data structures have been proposed that can be used to address this problem in similarity searching. We present three data structures: the vantage point tree (VPT, Yianilos, 1993), which indexes points in a metric space; Point Location in Equal Balls (PLEB, Indyk and Motwani, 1998), a probabilistic structure that uses the bit signatures generated by LSH; and the Spatial Approximation Sample Hierarchy (SASH, Houle and Sakuma, 2005), which approximates a k-NN search.
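Before turning to these structures, the random hyperplane construction of Section 4.2 (Equations 4–7) is compact enough to sketch directly. This is a minimal illustration under our own naming, using dense list vectors and the Python standard library rather than an optimised implementation:

```python
import math
import random

def bit_signatures(vectors, d, seed=42):
    """Build d-bit signatures: bit i is 1 iff r_i . w >= 0 for Gaussian r_i."""
    m = len(next(iter(vectors.values())))
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(d)]
    return {t: tuple(1 if sum(r[j] * w[j] for j in range(m)) >= 0 else 0
                     for r in planes)
            for t, w in vectors.items()}

def approx_cosine(sig_u, sig_v):
    """Equation 7: cosine approximated via the normalised Hamming distance."""
    h = sum(a != b for a, b in zip(sig_u, sig_v))
    return math.cos(h / len(sig_u) * math.pi)

vectors = {"u": [1.0, 2.0, 0.5], "v": [1.0, 2.0, 0.5], "w": [-1.0, -2.0, -0.5]}
sigs = bit_signatures(vectors, d=64)
print(approx_cosine(sigs["u"], sigs["v"]))  # 1.0  (identical vectors)
print(approx_cosine(sigs["u"], sigs["w"]))  # -1.0 (opposite vectors)
```

With d ≪ m′, comparing short bit signatures by Hamming distance is far cheaper than a full cosine over the 500,000+ dimensional context vectors, which is the point of the technique.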
Another option, inspired by IR, is attribute indexing (INDEX). In this technique, in addition to each term having a reference to its attributes, each attribute has a reference to the terms referencing it. Each term is then only compared with the terms with which it shares attributes. We will give a theoretical comparison against the other techniques.

5.1 Vantage Point Tree

Metric space data structures provide a solution to near-neighbour searches in very high dimensions. These rely solely on the existence of a comparison function that satisfies the conditions of metricality: non-negativity, equality, symmetry and the triangle inequality.

VPT is typical of these structures and has been used successfully in many applications. The VPT is a binary tree designed for range searches. These are searches limited to some distance from the target term, but can be modified for k-NN search.

VPT is constructed recursively. Beginning with a set U of terms, we take any term to be our vantage point p. This becomes our root. We now find the median distance mp of all other terms to p: mp = median{dist(p, u) | u ∈ U}. Those terms u such that dist(p, u) ≤ mp are inserted into the left sub-tree, and the remainder into the right sub-tree. Each sub-tree is then constructed as a new VPT, choosing a new vantage point from within its terms, until all terms are exhausted.

Searching a VPT is also recursive. Given a term q and radius r, we begin by measuring the distance to the root term p. If dist(q, p) ≤ r we enter p into our list of near terms. If dist(q, p) − r ≤ mp we enter the left sub-tree, and if dist(q, p) + r > mp we enter the right sub-tree. Both sub-trees may be entered. The process is repeated for each entered sub-tree, taking the vantage point of the sub-tree to be the new root term.

To perform a k-NN search we use a backtracking decreasing radius search (Burkhard and Keller, 1973). The search begins with r = ∞, and terms are added to a list of the closest k terms. When the kth closest term is found, the radius is set to the distance between this term and the target. Each time a new, closer element is added to the list, the radius is updated to the distance from the target to the new kth closest term.

Construction complexity is O(n log n). Search complexity is claimed to be O(log n) for small radius searches. This does not hold for our decreasing radius search, whose worst case complexity is O(n).

5.2 Point Location in Equal Balls

PLEB is a randomised structure that uses the bit signatures generated by LSH. It was used by Ravichandran et al. (2005) to improve the efficiency of distributional similarity calculations.

Having generated our d length bit signatures for each of our n terms, we take these signatures and randomly permute the bits. Each vector has the same permutation applied. This is equivalent to a column reordering in a matrix where the rows are the terms and the columns the bits. After applying the permutation, the list of terms is sorted lexicographically based on the bit signatures. The list is scanned sequentially, and each term is compared to its B nearest neighbours in the list. The choice of B will affect the accuracy/efficiency trade-off, and need not be related to the choice of k. This is performed q times, using a different random permutation function each time. After each iteration, the current closest k terms are stored.

For a fixed d, the complexity of the permutation step is O(qn), the sorting O(qn log n) and the search O(qBn).

5.3 Spatial Approximation Sample Hierarchy

SASH approximates a k-NN search by precomputing some near neighbours for each node (terms in our case). This produces multiple paths between terms, allowing SASH to shape itself to the data set (Houle, 2003). The following description is adapted from Houle and Sakuma (2005).

The SASH is a directed, edge-weighted graph with the following properties (see Figure 1):

• Each term corresponds to a unique node.

• The nodes are arranged into a hierarchy of levels, with the bottom level containing n/2 nodes and the top containing a single root node. Each level, except the top, will contain half as many nodes as the level below.

• Edges between nodes link consecutive levels. Each node will have at most p parent nodes in the level above, and c child nodes in the level below.
Figure 1: A SASH, where p = 2, c = 3 and k = 2
• Every node must have at least one parent, so that all nodes are reachable from the root.

Construction begins with the nodes being randomly distributed between the levels. SASH is then constructed iteratively by each node finding its closest p parents in the level above. The parent will keep the closest c of these children, forming edges in the graph, and reject the rest. Any nodes without parents after being rejected are then assigned as children of the nearest node in the previous level with fewer than c children.

Searching is performed by finding the k nearest nodes at each level, which are added to a set of near nodes. To limit the search, only those nodes whose parents were found to be nearest at the previous level are searched. The k closest nodes from the set of near nodes are then returned. The search complexity is O(ck log n).

In Figure 1, the filled nodes demonstrate a search for the near-neighbours of some node q, using k = 2. Our search begins with the root node A. As we are using k = 2, we must find the two nearest children of A using our similarity measure. In this case, C and D are closer than B. We now find the closest two children of C and D. E is not checked as it is only a child of B. All other nodes are checked, including F and G, which are shared as children by B and C. From this level we choose G and H. The final levels are considered similarly. At this point we have the list of near nodes A, C, D, G, H, I, J, K and L. From this we choose the two nodes nearest q, H and I, marked in black, which are then returned.

k can be varied at each level to force a larger number of elements to be tested at the base of the SASH using, for instance, the equation:

\[
k_i = \max\Big\{\, k^{1-\frac{h-i}{\log n}},\ \tfrac{1}{2}\,pc \,\Big\} \tag{8}
\]

This changes our search complexity to:

\[
\frac{k^{1+\frac{1}{\log n}}}{k^{\frac{1}{\log n}} - 1} + \frac{pc^2}{2}\,\log n \tag{9}
\]

We use this geometric function in our experiments.

Gorman and Curran (2005a; 2005b) found the performance of SASH for distributional similarity could be improved by replacing the initial random ordering with a frequency based ordering. In accordance with Zipf's law, the majority of terms have low frequencies. Comparisons made with these low frequency terms are unreliable (Curran and Moens, 2002). Creating the SASH with high frequency terms near the root produces more reliable initial paths, but comparisons against these terms are more expensive.

The best accuracy/efficiency trade-off was found when using more reliable initial paths rather than the most reliable. This is done by folding the data around some mean number of relations. For each term, if its number of relations m_i is greater than some chosen number of relations M, it is given a new ranking based on the score M²/m_i. Otherwise its ranking is based on its number of relations. This has the effect of pushing very high and very low frequency terms away from the root.

6 Evaluation Measures

The simplest method for evaluation is the direct comparison of extracted synonyms with a manually created gold standard (Grefenstette, 1994). To reduce the problem of limited coverage, our evaluation combines three electronic thesauri: the Macquarie, Roget's and Moby thesauri.

We follow Curran (2004) and use two performance measures: direct matches (DIRECT) and inverse rank (INVR). DIRECT is the percentage of returned synonyms found in the gold standard. INVR is the sum of the inverse rank of each matching synonym, e.g. matches at ranks 3, 5 and 28
CORPUS   CUT-OFF   TERMS     AVERAGE RELATIONS PER TERM
BNC      0         246,067   43
         5         88,926    116
         100       14,862    617
LARGE    0         541,722   97
         5         184,494   281
         100       35,618    1,400

Table 1: Extracted context information

give an inverse rank score of 1/3 + 1/5 + 1/28. With at most 100 matching synonyms, the maximum INVR is 5.187. This is more fine-grained as it incorporates both the number of matches and their ranking. The same 300 single word nouns were used for evaluation as used by Curran (2004) for his large scale evaluation. These were chosen randomly from WordNet such that they covered a range over the following properties: frequency, number of senses, specificity and concreteness. For each of these terms, the closest 100 terms and their similarity scores were extracted.

7 Experiments

We use two corpora in our experiments: the smaller is the non-speech portion of the British National Corpus (BNC), 90 million words covering a wide range of domains and formats; the larger consists of the BNC, the Reuters Corpus Volume 1 and most of the English news holdings of the LDC in 2003, representing over 2 billion words of text (LARGE, Curran, 2004).

The semantic similarity system implemented by Curran (2004) provides our baseline. This performs a brute-force k-NN search (NAIVE). We present results for the canonical attribute heuristic (HEURISTIC), RI, LSH, PLEB, VPT and SASH.

We take the optimal canonical attribute vector length of 30 for HEURISTIC from Curran (2004). For SASH we take optimal values of p = 4 and c = 16, and use the folded ordering taking M = 1000 from Gorman and Curran (2005b).

For RI, LSH and PLEB we found optimal values experimentally using the BNC. For LSH we chose d = 3,000 (LSH 3,000) and 10,000 (LSH 10,000), showing the effect of changing the dimensionality. The frequency statistics were weighted using mutual information, as in Ravichandran et al. (2005):

\[
\log\left(\frac{p(w,r,w')}{p(w,{*},{*})\;p({*},r,w')}\right) \tag{10}
\]

PLEB used the values q = 500 and B = 100.

             CUT-OFF
             5       100
NAIVE        1.72    1.71
HEURISTIC    1.65    1.66
RI           0.80    0.93
LSH 10,000   1.26    1.31
SASH         1.73    1.71

Table 2: INVR vs frequency cut-off

The initial experiments on RI produced quite poor results. The intuition was that this was caused by the lack of smoothing in the algorithm. Experiments were performed using the weights given in Curran (2004). Of these, mutual information (10), evaluated with an extra log₂(f(w, r, w′) + 1) factor and limited to positive values, produced the best results (RI_MI). The values d = 1000 and ε = 5 were found to produce the best results.

All experiments were performed on 3.2GHz Xeon P4 machines with 4GB of RAM.

8 Results

As the accuracy of comparisons between terms increases with frequency (Curran, 2004), applying a frequency cut-off will both reduce the size of the vocabulary (n) and increase the average accuracy of comparisons. Table 1 shows the reduction in vocabulary and the increase in average context relations per term as the cut-off increases. For LARGE, the initial 541,722 word vocabulary is reduced by 66% when a cut-off of 5 is applied, and by 86% when the cut-off is increased to 100. The average number of relations increases from 97 to 1,400.

The work by Curran (2004) largely uses a frequency cut-off of 5. When this cut-off was used with the randomised techniques RI and LSH, it produced quite poor results. When the cut-off was increased to 100, as used by Ravichandran et al. (2005), the results improved significantly. Table 2 shows the INVR scores for our various techniques using the BNC with cut-offs of 5 and 100.

Table 3 shows the results of a full thesaurus extraction using the BNC and LARGE corpora with a cut-off of 100. The average DIRECT score and INVR are from the 300 test words. The total execution time is extrapolated from the average search time of these test words and includes the setup time. For LARGE, extraction using NAIVE takes 444 hours: over 18 days. If the full 184,494 word vocabulary were used, it would take over 7,000 hours, or nearly 300 days. This gives some indication of
             BNC                        LARGE
             DIRECT  INVR   Time       DIRECT  INVR   Time
NAIVE        5.23    1.71   38.0hr     5.70    1.93   444.3hr
HEURISTIC    4.94    1.66   2.0hr      5.51    1.93   30.2hr
RI           2.97    0.93   0.4hr      2.42    0.85   1.9hr
RI_MI        3.49    1.41   0.4hr      4.58    1.75   1.9hr
LSH 3,000    2.00    0.76   0.7hr      2.92    1.07   3.6hr
LSH 10,000   3.68    1.31   2.3hr      3.77    1.40   8.4hr
PLEB 3,000   2.00    0.76   1.2hr      2.85    1.07   4.1hr
PLEB 10,000  3.66    1.30   3.9hr      3.63    1.37   11.8hr
VPT          5.23    1.71   15.9hr     5.70    1.93   336.1hr
SASH         5.17    1.71   2.0hr      5.29    1.89   23.7hr

Table 3: Full thesaurus extraction results (cut-off of 100)
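The InvR figures in Tables 2 and 3 follow the definition in Section 6. A minimal sketch (function name ours) reproduces both the worked example and the stated 5.187 maximum:

```python
def inv_r(match_ranks):
    """Inverse rank: sum of 1/rank over gold-standard matches (Section 6)."""
    return sum(1.0 / r for r in match_ranks)

print(round(inv_r([3, 5, 28]), 3))     # 0.569 -- matches at ranks 3, 5 and 28
print(round(inv_r(range(1, 101)), 3))  # 5.187 -- maximum with 100 matches
```

The maximum of 5.187 is simply the 100th harmonic number, i.e. matches at every one of the first 100 ranks.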
9 Conclusion

We have evaluated several state-of-the-art techniques for improving the efficiency of distributional similarity measurements. We found that, in terms of raw efficiency, Random Indexing (RI) was significantly faster than any other technique, but at the cost of accuracy. Even after our modifications to the RI algorithm to significantly improve its accuracy, SASH still provides a better accuracy/efficiency trade-off. This is more evident when considering the time to extract context information from the raw text. SASH, unlike RI, also allows us to choose both the weight and the measure used. LSH and PLEB could not match either the efficiency of RI or the accuracy of SASH.

We intend to use this knowledge to process even larger corpora to produce more accurate results. Having set out to improve the efficiency of distributional similarity searches while limiting any loss in accuracy, we are producing full nearest-neighbour searches 18 times faster, with only a 2% loss in accuracy.

Acknowledgements

We would like to thank our reviewers for their helpful feedback and corrections. This work has been supported by the Australian Research Council under Discovery Project DP0453131.

References

Andrei Broder. 1997. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21–29, Salerno, Italy.

James Curran and Marc Moens. 2002. Improvements in automatic thesaurus extraction. In Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon, pages 59–66, Philadelphia, PA, USA, 12 July.

James Gorman and James Curran. 2005a. Approximate searching for distributional similarity. In ACL-SIGLEX 2005 Workshop on Deep Lexical Acquisition, Ann Arbor, MI, USA, 30 June.

James Gorman and James Curran. 2005b. Augmenting approximate similarity searching with lexical information. In Australasian Language Technology Workshop, Sydney, Australia, 9–11 November.

Gregory Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Boston.

Michael E. Houle and Jun Sakuma. 2005. Fast approximate similarity search in extremely high-dimensional data sets. In Proceedings of the 21st International Conference on Data Engineering, pages 619–630, Tokyo, Japan.

Michael E. Houle. 2003. Navigating massive data sets via local clustering. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547–552, Washington, DC, USA.

Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pages 604–613, New York, NY, USA, 24–26 May. ACM Press.

Pentti Kanerva, Jan Kristoferson, and Anders Holst. 2000. Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, page 1036, Mahwah, NJ, USA.

Pentti Kanerva. 1993. Sparse distributed memory and related models. In M.H. Hassoun, editor, Associative Neural Memories: Theory and Implementation, pages 50–76. Oxford University Press, New York, NY, USA.

Jussi Karlgren and Magnus Sahlgren. 2001. From words to understanding. In Y. Uesaka, P. Kanerva, and H. Asoh, editors, Foundations of Real-World Intelligence, pages 294–308. CSLI Publications, Stanford, CA, USA.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240, April.

Magnus Sahlgren and Jussi Karlgren. 2005. Automatic bilingual lexicon acquisition using random indexing of parallel corpora. Journal of Natural Language Engineering, Special Issue on Parallel Texts, 11(3), June.