Scaling Distributional Similarity to Large Corpora
James Gorman and James R. Curran
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 361–368, Sydney, July 2006.
© 2006 Association for Computational Linguistics
2.1 Extraction

A context relation is defined as a tuple (w, r, w′) where w is a term, which occurs in some grammatical relation r with another word w′ in some sentence. We refer to the tuple (r, w′) as an attribute of w. For example, (dog, direct-obj, walk) indicates that dog was the direct object of walk in a sentence.

In our experiments context extraction begins with a Maximum Entropy POS tagger and chunker. The SEXTANT relation extractor (Grefenstette, 1994) produces context relations that are then lemmatised. The relations for each term are collected together and counted, producing a vector of attributes and their frequencies in the corpus.

2.2 Measures and Weights

Both nearest-neighbour and cluster analysis methods require a distance measure to calculate the similarity between context vectors. Curran (2004) decomposes this into measure and weight functions. The measure calculates the similarity between two weighted context vectors and the weight calculates the informativeness of each context relation from the raw frequencies.

For these experiments we use the Jaccard (1) measure and the TTest (2) weight functions, found by Curran (2004) to have the best performance.

\[
\frac{\sum_{(r,w')} \min(\mathrm{w}(w_m,r,w'),\, \mathrm{w}(w_n,r,w'))}{\sum_{(r,w')} \max(\mathrm{w}(w_m,r,w'),\, \mathrm{w}(w_n,r,w'))} \tag{1}
\]

\[
\frac{p(w,r,w') - p({*},r,w')\,p(w,{*},{*})}{\sqrt{p({*},r,w')\,p(w,{*},{*})}} \tag{2}
\]

2.3 Nearest-neighbour Search

The simplest algorithm for finding synonyms is a k-nearest-neighbour (k-NN) search, which involves pair-wise vector comparison of the target term with every term in the vocabulary. Given an n term vocabulary and up to m attributes for each term, the asymptotic time complexity of nearest-neighbour search is O(n²m). This is very expensive, with even a moderate vocabulary making the use of huge datasets infeasible. Our largest experiments used a vocabulary of over 184,000 words.

3 Dimensionality Reduction

Using a cut-off to remove low frequency terms can significantly reduce the value of n. Unfortunately, reducing m by eliminating low frequency contexts has a significant impact on the quality of the results. There are many techniques to reduce dimensionality while avoiding this problem. The simplest methods use feature selection techniques, such as information gain, to remove the attributes that are less informative. Other techniques smooth the data while reducing dimensionality.

Latent Semantic Analysis (LSA, Landauer and Dumais, 1997) is a smoothing and dimensionality reduction technique based on the intuition that the true dimensionality of data is latent in the surface dimensionality. Landauer and Dumais admit that, from a pragmatic perspective, the same effect as LSA can be generated by using large volumes of data with very long attribute vectors. Experiments with LSA typically use attribute vectors of a dimensionality of around 1,000. Our experiments have a dimensionality of 500,000 to 1,500,000. Decompositions on data this size are computationally difficult. Dimensionality reduction is often used before using LSA to improve its scalability.

3.1 Heuristics

Another technique is to use an initial heuristic comparison to reduce the number of full O(m) vector comparisons that are performed. If the heuristic comparison is sufficiently fast and a sufficient number of full comparisons are avoided, the cost of an additional check will be easily absorbed by the savings made.

Curran and Moens (2002) introduce a vector of canonical attributes (of bounded length k ≪ m), selected from the full vector, to represent the term. These attributes are the most strongly weighted verb attributes, chosen because they constrain the semantics of the term more and partake in fewer idiomatic collocations. If a pair of terms share at least one canonical attribute then a full similarity comparison is performed, otherwise the terms are not compared. They show an 89% reduction in search time, with only a 3.9% loss in accuracy.

There is a significant improvement in the computational complexity. If a maximum of p positive results are returned, our complexity becomes O(n²k + npm). When p ≪ n, the system will be faster as many fewer full comparisons will be made, but at the cost of accuracy as more possibly near results will be discarded out of hand.

4 Randomised Techniques

Conventional dimensionality reduction techniques can be computationally expensive: a more scalable solution is required to handle the volumes of data we propose to use. Randomised techniques provide a possible solution to this.
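As a concrete point of reference, the O(n²m) brute-force search of Section 2.3, using the Jaccard measure of Section 2.2, can be sketched as follows. This is a minimal illustration only: the function names are ours, and the toy vectors use raw counts where the paper would use TTest weights.

```python
def jaccard(a, b):
    """Weighted Jaccard measure (Equation 1) over attribute -> weight dicts."""
    attrs = set(a) | set(b)
    num = sum(min(a.get(x, 0.0), b.get(x, 0.0)) for x in attrs)
    den = sum(max(a.get(x, 0.0), b.get(x, 0.0)) for x in attrs)
    return num / den if den else 0.0

def naive_knn(target, vectors, k):
    """Brute-force k-NN: compare the target against every other term."""
    scored = sorted(((jaccard(vectors[target], v), t)
                     for t, v in vectors.items() if t != target),
                    reverse=True)
    return scored[:k]

# Toy vocabulary of (relation, word) attributes with raw counts as weights.
vectors = {
    "dog": {("direct-obj", "walk"): 2.0, ("subj", "bark"): 1.0},
    "cat": {("direct-obj", "walk"): 1.0, ("subj", "purr"): 1.0},
    "car": {("direct-obj", "drive"): 3.0},
}
print(naive_knn("dog", vectors, 1))  # [(0.25, 'cat')]
```

At scale, the quadratic number of pair-wise comparisons dominates this computation, which is what the randomised techniques aim to avoid.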
We present two techniques that have been used recently for distributional similarity: Random Indexing (Kanerva et al., 2000) and Locality Sensitive Hashing (LSH, Broder, 1997).

4.1 Random Indexing

Random Indexing (RI) is a hashing technique based on Sparse Distributed Memory (Kanerva, 1993). Karlgren and Sahlgren (2001) showed RI produces results similar to LSA using the Test of English as a Foreign Language (TOEFL) evaluation. Sahlgren and Karlgren (2005) showed the technique to be successful in generating bilingual lexicons from parallel corpora.

In RI, we first allocate a d length index vector to each unique attribute. The vectors consist of a large number of 0s and a small number (ε) of randomly distributed ±1s. Context vectors, identifying terms, are generated by summing the index vectors of the attributes for each non-unique context in which a term appears. The context vector for a term t appearing in contexts c1 = [1, 0, 0, −1] and c2 = [0, 1, 0, −1] would be [1, 1, 0, −2]. The distance between these context vectors is then measured using the cosine measure:

\[
\cos(\theta(\vec{u},\vec{v})) = \frac{\vec{u}\cdot\vec{v}}{|\vec{u}|\,|\vec{v}|} \tag{3}
\]

This technique allows for incremental sampling, where the index vector for an attribute is only generated when the attribute is encountered. Construction complexity is O(nmd) and search complexity is O(n²d).

4.2 Locality Sensitive Hashing

LSH is a probabilistic technique that allows the approximation of a similarity function. Broder (1997) proposed an approximation of the Jaccard similarity function using min-wise independent functions. Charikar (2002) proposed an approximation of the cosine measure using random hyperplanes. Ravichandran et al. (2005) used this cosine variant and showed it to produce over 70% accuracy in extracting synonyms when compared against Pantel and Lin (2002).

Given we have n terms in an m′ dimensional space, we create d ≪ m′ unit random vectors, also of m′ dimensions, labelled {r⃗1, r⃗2, ..., r⃗d}. Each vector is created by sampling a Gaussian function m′ times, with a mean of 0 and a variance of 1. For each term w we construct its bit signature using the function

\[
h_{\vec{r}}(\vec{w}) =
\begin{cases}
1 & \vec{r}\cdot\vec{w} \ge 0 \\
0 & \vec{r}\cdot\vec{w} < 0
\end{cases}
\]

where r⃗ is a spherically symmetric random vector of length d. The signature, w̄, is the d length bit vector:

\[
\bar{w} = \{\, h_{\vec{r}_1}(\vec{w}),\ h_{\vec{r}_2}(\vec{w}),\ \ldots,\ h_{\vec{r}_d}(\vec{w}) \,\}
\]

The cost to build all n signatures is O(nm′d). For terms u and v, Goemans and Williamson (1995) approximate the angular similarity by

\[
p(h_{\vec{r}}(\vec{u}) = h_{\vec{r}}(\vec{v})) = 1 - \frac{\theta(\vec{u},\vec{v})}{\pi} \tag{4}
\]

where θ(u⃗, v⃗) is the angle between u⃗ and v⃗. The angular similarity gives the cosine by

\[
\cos(\theta(\vec{u},\vec{v})) = \cos\big(\,(1 - p(h_{\vec{r}}(\vec{u}) = h_{\vec{r}}(\vec{v})))\,\pi\,\big) \tag{5}
\]

The probability can be derived from the Hamming distance:

\[
p(h_{\vec{r}}(\vec{u}) = h_{\vec{r}}(\vec{v})) = 1 - \frac{H(\bar{u},\bar{v})}{d} \tag{6}
\]

By combining equations 5 and 6 we get the following approximation of the cosine distance:

\[
\cos(\theta(\vec{u},\vec{v})) = \cos\left(\frac{H(\bar{u},\bar{v})}{d}\,\pi\right) \tag{7}
\]

That is, the cosine of two context vectors is approximated by the cosine of the Hamming distance between their two signatures normalised by the size of the signatures. Search is performed using Equation 7 and scales to O(n²d).

5 Data Structures

The methods presented above fail to address the n² component of the search complexity. Many data structures have been proposed that can be used to address this problem in similarity searching. We present three data structures: the vantage point tree (VPT, Yianilos, 1993), which indexes points in a metric space; Point Location in Equal Balls (PLEB, Indyk and Motwani, 1998), a probabilistic structure that uses the bit signatures generated by LSH; and the Spatial Approximation Sample Hierarchy (SASH, Houle and Sakuma, 2005), which approximates a k-NN search.
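Before turning to these structures, the random hyperplane construction of Section 4.2 (Equations 4–7) is compact enough to sketch directly. This is a minimal illustration under our own naming, using dense list vectors and the Python standard library rather than an optimised implementation:

```python
import math
import random

def bit_signatures(vectors, d, seed=42):
    """Build d-bit signatures: bit i is 1 iff r_i . w >= 0 for Gaussian r_i."""
    m = len(next(iter(vectors.values())))
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(d)]
    return {t: tuple(1 if sum(r[j] * w[j] for j in range(m)) >= 0 else 0
                     for r in planes)
            for t, w in vectors.items()}

def approx_cosine(sig_u, sig_v):
    """Equation 7: cosine approximated via the normalised Hamming distance."""
    h = sum(a != b for a, b in zip(sig_u, sig_v))
    return math.cos(h / len(sig_u) * math.pi)

vectors = {"u": [1.0, 2.0, 0.5], "v": [1.0, 2.0, 0.5], "w": [-1.0, -2.0, -0.5]}
sigs = bit_signatures(vectors, d=64)
print(approx_cosine(sigs["u"], sigs["v"]))  # 1.0  (identical vectors)
print(approx_cosine(sigs["u"], sigs["w"]))  # -1.0 (opposite vectors)
```

With d ≪ m′, comparing short bit signatures by Hamming distance is far cheaper than a full cosine over the 500,000+ dimensional context vectors, which is the point of the technique.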
Another option, inspired by IR, is attribute indexing (INDEX). In this technique, in addition to each term having a reference to its attributes, each attribute has a reference to the terms referencing it. Each term is then only compared with the terms with which it shares attributes. We will give a theoretical comparison against the other techniques.

5.1 Vantage Point Tree

Metric space data structures provide a solution to near-neighbour searches in very high dimensions. These rely solely on the existence of a comparison function that satisfies the conditions of metricality: non-negativity, equality, symmetry and the triangle inequality.

VPT is typical of these structures and has been used successfully in many applications. The VPT is a binary tree designed for range searches. These are searches limited to some distance from the target term, but can be modified for k-NN search.

VPT is constructed recursively. Beginning with a set U of terms, we take any term to be our vantage point p. This becomes our root. We now find the median distance mp of all other terms to p: mp = median{dist(p, u) | u ∈ U}. Those terms u such that dist(p, u) ≤ mp are inserted into the left sub-tree, and the remainder into the right sub-tree. Each sub-tree is then constructed as a new VPT, choosing a new vantage point from within its terms, until all terms are exhausted.

Searching a VPT is also recursive. Given a term q and radius r, we begin by measuring the distance to the root term p. If dist(q, p) ≤ r we enter p into our list of near terms. If dist(q, p) − r ≤ mp we enter the left sub-tree, and if dist(q, p) + r > mp we enter the right sub-tree. Both sub-trees may be entered. The process is repeated for each entered sub-tree, taking the vantage point of the sub-tree to be the new root term.

To perform a k-NN search we use a backtracking decreasing radius search (Burkhard and Keller, 1973). The search begins with r = ∞, and terms are added to a list of the closest k terms. When the kth closest term is found, the radius is set to the distance between this term and the target. Each time a new, closer element is added to the list, the radius is updated to the distance from the target to the new kth closest term.

Construction complexity is O(n log n). Search complexity is claimed to be O(log n) for small radius searches. This does not hold for our decreasing radius search, whose worst case complexity is O(n).

5.2 Point Location in Equal Balls

PLEB is a randomised structure that uses the bit signatures generated by LSH. It was used by Ravichandran et al. (2005) to improve the efficiency of distributional similarity calculations.

Having generated our d length bit signatures for each of our n terms, we take these signatures and randomly permute the bits. Each vector has the same permutation applied. This is equivalent to a column reordering in a matrix where the rows are the terms and the columns the bits. After applying the permutation, the list of terms is sorted lexicographically based on the bit signatures. The list is scanned sequentially, and each term is compared to its B nearest neighbours in the list. The choice of B will affect the accuracy/efficiency trade-off, and need not be related to the choice of k. This is performed q times, using a different random permutation function each time. After each iteration, the current closest k terms are stored.

For a fixed d, the complexity of the permutation step is O(qn), the sorting O(qn log n) and the search O(qBn).

5.3 Spatial Approximation Sample Hierarchy

SASH approximates a k-NN search by precomputing some near neighbours for each node (terms in our case). This produces multiple paths between terms, allowing SASH to shape itself to the data set (Houle, 2003). The following description is adapted from Houle and Sakuma (2005).

The SASH is a directed, edge-weighted graph with the following properties (see Figure 1):

• Each term corresponds to a unique node.

• The nodes are arranged into a hierarchy of levels, with the bottom level containing n/2 nodes and the top containing a single root node. Each level, except the top, will contain half as many nodes as the level below.

• Edges between nodes link consecutive levels. Each node will have at most p parent nodes in the level above, and c child nodes in the level below.
Figure 1: A SASH, where p = 2, c = 3 and k = 2
• Every node must have at least one parent, so that all nodes are reachable from the root.

Construction begins with the nodes being randomly distributed between the levels. SASH is then constructed iteratively by each node finding its closest p parents in the level above. The parent will keep the closest c of these children, forming edges in the graph, and reject the rest. Any nodes without parents after being rejected are then assigned as children of the nearest node in the previous level with fewer than c children.

Searching is performed by finding the k nearest nodes at each level, which are added to a set of near nodes. To limit the search, only those nodes whose parents were found to be nearest at the previous level are searched. The k closest nodes from the set of near nodes are then returned. The search complexity is O(ck log n).

In Figure 1, the filled nodes demonstrate a search for the near-neighbours of some node q, using k = 2. Our search begins with the root node A. As we are using k = 2, we must find the two nearest children of A using our similarity measure. In this case, C and D are closer than B. We now find the closest two children of C and D. E is not checked as it is only a child of B. All other nodes are checked, including F and G, which are shared as children by B and C. From this level we choose G and H. The final levels are considered similarly. At this point we have the list of near nodes A, C, D, G, H, I, J, K and L. From this we choose the two nodes nearest q, H and I, marked in black, which are then returned.

k can be varied at each level to force a larger number of elements to be tested at the base of the SASH using, for instance, the equation:

\[
k_i = \max\Big\{\, k^{1-\frac{h-i}{\log n}},\ \tfrac{1}{2}\,pc \,\Big\} \tag{8}
\]

This changes our search complexity to:

\[
\frac{k^{1+\frac{1}{\log n}}}{k^{\frac{1}{\log n}} - 1} + \frac{pc^2}{2}\,\log n \tag{9}
\]

We use this geometric function in our experiments.

Gorman and Curran (2005a; 2005b) found the performance of SASH for distributional similarity could be improved by replacing the initial random ordering with a frequency based ordering. In accordance with Zipf's law, the majority of terms have low frequencies. Comparisons made with these low frequency terms are unreliable (Curran and Moens, 2002). Creating the SASH with high frequency terms near the root produces more reliable initial paths, but comparisons against these terms are more expensive.

The best accuracy/efficiency trade-off was found when using more reliable initial paths rather than the most reliable. This is done by folding the data around some mean number of relations. For each term, if its number of relations m_i is greater than some chosen number of relations M, it is given a new ranking based on the score M²/m_i. Otherwise its ranking is based on its number of relations. This has the effect of pushing very high and very low frequency terms away from the root.

6 Evaluation Measures

The simplest method for evaluation is the direct comparison of extracted synonyms with a manually created gold standard (Grefenstette, 1994). To reduce the problem of limited coverage, our evaluation combines three electronic thesauri: the Macquarie, Roget's and Moby thesauri.

We follow Curran (2004) and use two performance measures: direct matches (DIRECT) and inverse rank (INVR). DIRECT is the percentage of returned synonyms found in the gold standard. INVR is the sum of the inverse rank of each matching synonym, e.g. matches at ranks 3, 5 and 28
CORPUS   CUT-OFF   TERMS     AVERAGE RELATIONS PER TERM
BNC      0         246,067   43
         5         88,926    116
         100       14,862    617
LARGE    0         541,722   97
         5         184,494   281
         100       35,618    1,400

Table 1: Extracted context information

give an inverse rank score of 1/3 + 1/5 + 1/28. With at most 100 matching synonyms, the maximum INVR is 5.187. This is more fine-grained as it incorporates both the number of matches and their ranking. The same 300 single word nouns were used for evaluation as used by Curran (2004) for his large scale evaluation. These were chosen randomly from WordNet such that they covered a range over the following properties: frequency, number of senses, specificity and concreteness. For each of these terms, the closest 100 terms and their similarity scores were extracted.

7 Experiments

We use two corpora in our experiments: the smaller is the non-speech portion of the British National Corpus (BNC), 90 million words covering a wide range of domains and formats; the larger consists of the BNC, the Reuters Corpus Volume 1 and most of the English news holdings of the LDC in 2003, representing over 2 billion words of text (LARGE, Curran, 2004).

The semantic similarity system implemented by Curran (2004) provides our baseline. This performs a brute-force k-NN search (NAIVE). We present results for the canonical attribute heuristic (HEURISTIC), RI, LSH, PLEB, VPT and SASH.

We take the optimal canonical attribute vector length of 30 for HEURISTIC from Curran (2004). For SASH we take optimal values of p = 4 and c = 16, and use the folded ordering taking M = 1000 from Gorman and Curran (2005b).

For RI, LSH and PLEB we found optimal values experimentally using the BNC. For LSH we chose d = 3,000 (LSH 3,000) and 10,000 (LSH 10,000), showing the effect of changing the dimensionality. The frequency statistics were weighted using mutual information, as in Ravichandran et al. (2005):

\[
\log\left(\frac{p(w,r,w')}{p(w,{*},{*})\;p({*},r,w')}\right) \tag{10}
\]

PLEB used the values q = 500 and B = 100.

             CUT-OFF
             5       100
NAIVE        1.72    1.71
HEURISTIC    1.65    1.66
RI           0.80    0.93
LSH 10,000   1.26    1.31
SASH         1.73    1.71

Table 2: INVR vs frequency cut-off

The initial experiments on RI produced quite poor results. The intuition was that this was caused by the lack of smoothing in the algorithm. Experiments were performed using the weights given in Curran (2004). Of these, mutual information (10), evaluated with an extra log₂(f(w, r, w′) + 1) factor and limited to positive values, produced the best results (RI_MI). The values d = 1000 and ε = 5 were found to produce the best results.

All experiments were performed on 3.2GHz Xeon P4 machines with 4GB of RAM.

8 Results

As the accuracy of comparisons between terms increases with frequency (Curran, 2004), applying a frequency cut-off will both reduce the size of the vocabulary (n) and increase the average accuracy of comparisons. Table 1 shows the reduction in vocabulary and the increase in average context relations per term as the cut-off increases. For LARGE, the initial 541,722 word vocabulary is reduced by 66% when a cut-off of 5 is applied, and by 86% when the cut-off is increased to 100. The average number of relations increases from 97 to 1,400.

The work by Curran (2004) largely uses a frequency cut-off of 5. When this cut-off was used with the randomised techniques RI and LSH, it produced quite poor results. When the cut-off was increased to 100, as used by Ravichandran et al. (2005), the results improved significantly. Table 2 shows the INVR scores for our various techniques using the BNC with cut-offs of 5 and 100.

Table 3 shows the results of a full thesaurus extraction using the BNC and LARGE corpora with a cut-off of 100. The average DIRECT score and INVR are from the 300 test words. The total execution time is extrapolated from the average search time of these test words and includes the setup time. For LARGE, extraction using NAIVE takes 444 hours: over 18 days. If the full 184,494 word vocabulary were used, it would take over 7,000 hours, or nearly 300 days. This gives some indication of
             BNC                        LARGE
             DIRECT  INVR   Time       DIRECT  INVR   Time
NAIVE        5.23    1.71   38.0hr     5.70    1.93   444.3hr
HEURISTIC    4.94    1.66   2.0hr      5.51    1.93   30.2hr
RI           2.97    0.93   0.4hr      2.42    0.85   1.9hr
RI_MI        3.49    1.41   0.4hr      4.58    1.75   1.9hr
LSH 3,000    2.00    0.76   0.7hr      2.92    1.07   3.6hr
LSH 10,000   3.68    1.31   2.3hr      3.77    1.40   8.4hr
PLEB 3,000   2.00    0.76   1.2hr      2.85    1.07   4.1hr
PLEB 10,000  3.66    1.30   3.9hr      3.63    1.37   11.8hr
VPT          5.23    1.71   15.9hr     5.70    1.93   336.1hr
SASH         5.17    1.71   2.0hr      5.29    1.89   23.7hr

Table 3: Full thesaurus extraction results (cut-off of 100)
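The InvR figures in Tables 2 and 3 follow the definition in Section 6. A minimal sketch (function name ours) reproduces both the worked example and the stated 5.187 maximum:

```python
def inv_r(match_ranks):
    """Inverse rank: sum of 1/rank over gold-standard matches (Section 6)."""
    return sum(1.0 / r for r in match_ranks)

print(round(inv_r([3, 5, 28]), 3))     # 0.569 -- matches at ranks 3, 5 and 28
print(round(inv_r(range(1, 101)), 3))  # 5.187 -- maximum with 100 matches
```

The maximum of 5.187 is simply the 100th harmonic number, i.e. matches at every one of the first 100 ranks.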
9 Conclusion

We have evaluated several state-of-the-art techniques for improving the efficiency of distributional similarity measurements. We found that, in terms of raw efficiency, Random Indexing (RI) was significantly faster than any other technique, but at the cost of accuracy. Even after our modifications to the RI algorithm to significantly improve its accuracy, SASH still provides a better accuracy/efficiency trade-off. This is more evident when considering the time to extract context information from the raw text. SASH, unlike RI, also allows us to choose both the weight and the measure used. LSH and PLEB could not match either the efficiency of RI or the accuracy of SASH.

We intend to use this knowledge to process even larger corpora to produce more accurate results. Having set out to improve the efficiency of distributional similarity searches while limiting any loss in accuracy, we are producing full nearest-neighbour searches 18 times faster, with only a 2% loss in accuracy.

Acknowledgements

We would like to thank our reviewers for their helpful feedback and corrections. This work has been supported by the Australian Research Council under Discovery Project DP0453131.

References

Andrei Broder. 1997. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21–29, Salerno, Italy.

James Curran and Marc Moens. 2002. Improvements in automatic thesaurus extraction. In Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon, pages 59–66, Philadelphia, PA, USA, 12 July.

James Gorman and James Curran. 2005a. Approximate searching for distributional similarity. In ACL-SIGLEX 2005 Workshop on Deep Lexical Acquisition, Ann Arbor, MI, USA, 30 June.

James Gorman and James Curran. 2005b. Augmenting approximate similarity searching with lexical information. In Australasian Language Technology Workshop, Sydney, Australia, 9–11 November.

Gregory Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Boston.

Michael E. Houle and Jun Sakuma. 2005. Fast approximate similarity search in extremely high-dimensional data sets. In Proceedings of the 21st International Conference on Data Engineering, pages 619–630, Tokyo, Japan.

Michael E. Houle. 2003. Navigating massive data sets via local clustering. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547–552, Washington, DC, USA.

Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pages 604–613, New York, NY, USA, 24–26 May. ACM Press.

Pentti Kanerva, Jan Kristoferson, and Anders Holst. 2000. Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, page 1036, Mahwah, NJ, USA.

Pentti Kanerva. 1993. Sparse distributed memory and related models. In M.H. Hassoun, editor, Associative Neural Memories: Theory and Implementation, pages 50–76. Oxford University Press, New York, NY, USA.

Jussi Karlgren and Magnus Sahlgren. 2001. From words to understanding. In Y. Uesaka, P. Kanerva, and H. Asoh, editors, Foundations of Real-World Intelligence, pages 294–308. CSLI Publications, Stanford, CA, USA.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240, April.

Magnus Sahlgren and Jussi Karlgren. 2005. Automatic bilingual lexicon acquisition using random indexing of parallel corpora. Journal of Natural Language Engineering, Special Issue on Parallel Texts, 11(3), June.