
Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing

Genevieve Gorrell
Department of Computer and Information Science
Linköping University
581 83 LINKÖPING
Sweden
gengo@ida.liu.se

Abstract

An algorithm based on the Generalized Hebbian Algorithm is described that allows the singular value decomposition of a dataset to be learned based on single observation pairs presented serially. The algorithm has minimal memory requirements, and is therefore interesting in the natural language domain, where very large datasets are often used, and datasets quickly become intractable. The technique is demonstrated on the task of learning word and letter bigram pairs from text.

1 Introduction

Dimensionality reduction techniques are of great relevance within the field of natural language processing. A persistent problem within language processing is the over-specificity of language, and the sparsity of data. Corpus-based techniques depend on a sufficiency of examples in order to model human language use, but the Zipfian nature of frequency behaviour in language means that this approach has diminishing returns with corpus size. In short, there are a large number of ways to say the same thing, and no matter how large your corpus is, you will never cover all the things that might reasonably be said. Language is often too rich for the task being performed; for example it can be difficult to establish that two documents are discussing the same topic. Likewise no matter how much data your system has seen during training, it will invariably see something new at run-time in a domain of any complexity. Any approach to automatic natural language processing will encounter this problem on several levels, creating a need for techniques which compensate for this.

Imagine we have a set of data stored as a matrix. Techniques based on eigen decomposition allow such a matrix to be transformed into a set of orthogonal vectors, each with an associated “strength”, or eigenvalue. This transformation allows the data contained in the matrix to be compressed; by discarding the less significant vectors (dimensions) the matrix can be approximated with fewer numbers. This is what is meant by dimensionality reduction. The technique is guaranteed to return the closest (least squared error) approximation possible for a given number of numbers (Golub and Reinsch, 1970). In certain domains, however, the technique has even greater significance. It is effectively forcing the data through a bottleneck; requiring it to describe itself using an impoverished construct set. This can allow the critical underlying features to reveal themselves. In language, for example, these features might be semantic constructs. It can also improve the data, in the case that the detail is noise, or richness not relevant to the task.

Singular value decomposition (SVD) is a near relative of eigen decomposition, appropriate to domains where input is asymmetrical. The best known application of singular value decomposition within natural language processing is Latent Semantic Analysis (Deerwester et al., 1990). Latent Semantic Analysis (LSA) allows passages of text to be compared to each other in a reduced-dimensionality semantic space, based on the words they contain.

The technique has been successfully applied to information retrieval, where the overspecificity of language is particularly problematic; text searches often miss relevant documents where different vocabulary has been chosen in the search terms to that used in the document (for example, the user searches on “eigen decomposition” and fails to retrieve documents on factor analysis). LSA has also been applied in language modelling (Bellegarda, 2000), where it has been used to incorporate long-span semantic dependencies.

Much research has been done on optimising eigen decomposition algorithms, and the extent to which they can be optimised depends on the area of application. Most natural language problems involve sparse matrices, since there are many words in a natural language and the great majority do not appear in, for example, any one document. Domains in which matrices are less sparse lend themselves to such techniques as Golub-Kahan-Reinsch (Golub and Reinsch, 1970) and Jacobi-like approaches. Techniques such as those described in (Berry, 1992) are more appropriate in the natural language domain.

Optimisation is an important way to increase the applicability of eigen and singular value decomposition. Designing algorithms that accommodate different requirements is another. For example, another drawback to Jacobi-like approaches is that they calculate all the singular triplets (singular vector pairs with associated values) simultaneously, which may not be practical in a situation where only the top few are required. Consider also that the methods mentioned so far assume that the entire matrix is available from the start. There are many situations in which data may continue to become available.

(Berry et al., 1995) describe a number of techniques for including new data in an existing decomposition. Their techniques apply to a situation in which SVD has been performed on a collection of data, then new data becomes available. However, these techniques are either expensive, or else they are approximations which degrade in quality over time. They are useful in the context of updating an existing batch decomposition with a second batch of data, but are less applicable in the case where data are presented serially, for example, in the context of a learning system. Furthermore, there are limits to the size of matrix that can feasibly be processed using batch decomposition techniques. This is especially relevant within natural language processing, where very large corpora are common. Random Indexing (Kanerva et al., 2000) provides a less principled, though very simple and efficient, alternative to SVD for dimensionality reduction over large corpora.

This paper describes an approach to singular value decomposition based on the Generalized Hebbian Algorithm (Sanger, 1989). GHA calculates the eigen decomposition of a matrix based on single observations presented serially. The algorithm presented here differs in that where GHA produces the eigen decomposition of symmetrical data, our algorithm produces the singular value decomposition of asymmetrical data. It allows singular vectors to be learned from paired inputs presented serially using no more memory than is required to store the singular vector pairs themselves. It is therefore relevant in situations where the size of the dataset makes conventional batch approaches infeasible. It is also of interest in the context of adaptivity, since it has the potential to adapt to changing input. The learning update operation is very cheap computationally. Assuming a stable vector length, each update operation takes exactly as long as each previous one; the time taken per update does not increase with corpus size. Matrix dimensions may increase during processing. The algorithm produces singular vector pairs one at a time, starting with the most significant, which means that useful data becomes available quickly; many standard techniques produce the entire decomposition simultaneously. Since it is a learning technique, however, it differs from what would normally be considered an incremental technique, in that the algorithm converges on the singular value decomposition of the dataset, rather than at any one point having the best solution possible for the data it has seen so far. The method is potentially most appropriate in situations where the dataset is very large or unbounded: smaller, bounded datasets may be more efficiently processed by other methods.

Furthermore, our approach is limited to cases where the final matrix is expressible as the linear sum of outer products of the data vectors. Note in particular that Latent Semantic Analysis, as usually implemented, is not an example of this, because LSA takes the log of the final sums in each cell (Dumais, 1990). LSA, however, does not depend on singular value decomposition; Gorrell and Webb (Gorrell and Webb, 2005) discuss using eigen decomposition to perform LSA, and demonstrate LSA using the Generalized Hebbian Algorithm in its unmodified form. Sanger (Sanger, 1993) presents similar work, and future work will involve more detailed comparison of this approach to his.

The next section describes the algorithm. Section 3 describes implementation in practical terms. Section 4 illustrates, using word n-gram and letter n-gram tasks as examples, and Section 5 concludes.

2 The Algorithm

This section introduces the Generalized Hebbian Algorithm, and shows how the technique can be adapted to the rectangular matrix form of singular value decomposition. Eigen decomposition requires as input a square diagonally-symmetrical matrix, that is to say, one in which the cell value at row x, column y is the same as that at row y, column x. The kind of data described by such a matrix is the correlation between data in a particular space with other data in the same space. For example, we might wish to describe how often a particular word appears with a particular other word. The data therefore are symmetrical relations between items in the same space; word a appears with word b exactly as often as word b appears with word a. In singular value decomposition, rectangular input matrices are handled. Ordered word bigrams are an example of this; imagine a matrix in which rows correspond to the first word in a bigram, and columns to the second. The number of times that word b appears after word a is by no means the same as the number of times that word a appears after word b. Rows and columns are different spaces; rows are the space of first words in the bigrams, and columns are the space of second words.

The singular value decomposition of a rectangular data matrix, A, can be presented as:

A = UΣV^T    (1)

where U and V are matrices of orthogonal left and right singular vectors (columns) respectively, and Σ is a diagonal matrix of the corresponding singular values. The U and V matrices can be seen as a matched set of orthogonal basis vectors in their corresponding spaces, while the singular values specify the effective magnitude of each vector pair. By convention, these matrices are sorted such that the diagonal of Σ is monotonically decreasing, and it is a property of SVD that preserving only the first (largest) N of these (and hence also only the first N columns of U and V) provides a least-squared error, rank-N approximation to the original matrix A.

Singular Value Decomposition is intimately related to eigenvalue decomposition in that the singular vectors, U and V, of the data matrix, A, are simply the eigenvectors of AA^T and A^TA, respectively, and the singular values, Σ, are the square roots of the corresponding eigenvalues.

2.1 Generalised Hebbian Algorithm

Oja and Karhunen (Oja and Karhunen, 1985) demonstrated an incremental solution to finding the first eigenvector from data arriving in the form of serial data items presented as vectors, and Sanger (Sanger, 1989) later generalized this to finding the first N eigenvectors with the Generalized Hebbian Algorithm. The algorithm converges on the exact eigen decomposition of the data with a probability of one. The essence of these algorithms is a simple Hebbian learning rule:

U_n(t+1) = U_n(t) + λ (U_n^T A_j) A_j    (2)

U_n is the n'th column of U (i.e., the n'th eigenvector, see equation 1), λ is the learning rate and A_j is the j'th column of training matrix A. t is the timestep. The only modification to this required in order to extend it to multiple eigenvectors is that each U_n needs to shadow any lower-ranked U_m (m > n) by removing its projection from the input A_j in order to assure both orthogonality and an ordered ranking of the resulting eigenvectors.

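To make the update concrete, the following is a minimal NumPy sketch of the single-eigenvector rule of equation 2 applied to serially presented vectors. It is an illustration rather than the implementation used in this work: the learning rate, the toy data, and the explicit renormalisation (which stands in for the normalising subtraction term in Sanger's full formulation, given next) are choices made for the example.

```python
import numpy as np

def first_eigenvector(observations, dim, learning_rate=0.01, seed=0):
    """Estimate the first eigenvector of the data's correlation matrix from
    serially presented observation vectors, per the Hebbian rule of equation 2."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(dim)
    u /= np.linalg.norm(u)
    for a in observations:             # a plays the role of A_j in equation 2
        activation = u @ a             # U_n^T A_j
        u += learning_rate * activation * a
        u /= np.linalg.norm(u)         # keep the estimate at unit length
    return u

# Toy usage: noisy observations with a dominant direction along the first axis.
rng = np.random.default_rng(1)
dim = 10

def sample():
    a = rng.standard_normal(dim)
    a[0] *= 3.0                        # make the first axis dominant
    return a

print(first_eigenvector((sample() for _ in range(5000)), dim))  # ~ +/- (1, 0, ..., 0)
```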
Sanger's final formulation (Sanger, 1989) is:

c_ij(t+1) = c_ij(t) + γ(t)(y_i(t) x_j(t) − y_i(t) Σ_{k≤i} c_kj(t) y_k(t))    (3)

In the above, c_ij is an individual element in the current eigenvector, x_j is the input vector and y_i is the activation (that is to say, c_i · x_j, the dot product of the input vector with the i'th eigenvector). γ is the learning rate.

To summarise, the formula updates the current eigenvector by adding to it the input vector multiplied by the activation minus the projection of the input vector on all the eigenvectors so far, including the current eigenvector, multiplied by the activation. Including the current eigenvector in the projection subtraction step has the effect of keeping the eigenvectors normalised. Note that Sanger includes an explicit learning rate, γ. The formula can be varied slightly by not including the current eigenvector in the projection subtraction step. In the absence of the autonormalisation influence, the vector is allowed to grow long. This has the effect of introducing an implicit learning rate, since the vector only begins to grow long when it settles in the right direction, and since further learning has less impact once the vector has become long. Weng et al. (Weng et al., 2003) demonstrate the efficacy of this approach. So, in vector form, assuming c to be the eigenvector currently being trained, expanding y out and using the implicit learning rate:

Δc_i = c_i · x (x − Σ_{j<i} (x · c_j) c_j)    (4)

Delta notation is used to describe the update here, for further readability. The subtracted element is responsible for removing from the training update any projection on previous singular vectors, thereby ensuring orthogonality. Let us assume for the moment that we are calculating only the first eigenvector. The training update, that is, the vector to be added to the eigenvector, can then be more simply described as follows, making the next steps more readable:

Δc = c · x (x)    (5)

2.2 Extension to Paired Data

Let us begin with a simplification of 5:

Δc = (1/n) cX(X)    (6)

Here, the upper case X is the entire data matrix. n is the number of training items. The simplification is valid in the case that c is stabilised; a simplification that in our case will become more valid with time. Extension to paired data initially appears to present a problem. As mentioned earlier, the singular vectors of a rectangular matrix are the eigenvectors of the matrix multiplied by its transpose, and the eigenvectors of the transpose of the matrix multiplied by itself. Running GHA on a non-square non-symmetrical matrix M, i.e. paired data, would therefore be achievable using standard GHA as follows:

Δc^a = (1/n) c^a MM^T (MM^T)    (7)

Δc^b = (1/n) c^b M^TM (M^TM)    (8)

In the above, c^a and c^b are left and right singular vectors. However, to be able to feed the algorithm with rows of the matrices MM^T and M^TM, we would need to have the entire training corpus available simultaneously, and square it, which we hoped to avoid. This makes it impossible to use GHA for singular value decomposition of serially-presented paired input in this way without some further transformation. Equation 1, however, gives:

σc^a = c^b M^T = Σ_x (c^b · b_x) a_x    (9)

σc^b = c^a M = Σ_x (c^a · a_x) b_x    (10)

Here, σ is the singular value and a and b are left and right data vectors. The above is valid in the case that left and right singular vectors c^a and c^b have settled (which will become more accurate over time) and that the data vectors a and b outer-product and sum to M.

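The expansion on the right-hand side of equations 9 and 10 requires only that the data vectors outer-product and sum to M, so it is straightforward to check numerically. The sketch below (dimensions and item counts are arbitrary choices for the example) confirms the expansion for an arbitrary vector, and confirms equation 9 itself for the true singular vectors obtained by a batch SVD.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_dim, n_items = 6, 4, 50       # left dim, right dim, number of (a, b) pairs

# Data presented as pairs of vectors; M is their summed outer product.
pairs = [(rng.standard_normal(m), rng.standard_normal(n_dim)) for _ in range(n_items)]
M = sum(np.outer(a, b) for a, b in pairs)

# Expansion used in equations 9/10: for any vector c_b,
#   c_b M^T  ==  sum over data items of (c_b . b_x) a_x
c_b = rng.standard_normal(n_dim)
lhs = M @ c_b                                    # c^b M^T in column-vector form
rhs = sum((c_b @ b) * a for a, b in pairs)
print(np.allclose(lhs, rhs))                     # True

# With the true singular vectors, equation 9 itself holds: sigma * c_a = c_b M^T
U, S, Vt = np.linalg.svd(M)
print(np.allclose(M @ Vt[0], S[0] * U[:, 0]))    # True
```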
Inserting 9 and 10 into 7 and 8 allows them to be reduced as follows:

Δc^a = (σ/n) c^b M^T MM^T    (11)

Δc^b = (σ/n) c^a M M^TM    (12)

Δc^a = (σ²/n) c^a MM^T    (13)

Δc^b = (σ²/n) c^b M^TM    (14)

Δc^a = (σ³/n) c^b M^T    (15)

Δc^b = (σ³/n) c^a M    (16)

Δc^a = σ³ (c^b · b) a    (17)

Δc^b = σ³ (c^a · a) b    (18)

This element can then be reinserted into GHA. To summarise, where GHA dotted the input with the eigenvector and multiplied the result by the input vector to form the training update (thereby adding the input vector to the eigenvector with a length proportional to the extent to which it reflects the current direction of the eigenvector), our formulation dots the right input vector with the right singular vector and multiplies the left input vector by this quantity before adding it to the left singular vector, and vice versa. In this way, the two sides cross-train each other. Below is the final modification of GHA extended to cover multiple vector pairs. The original GHA is given beneath it for comparison.

Δc^a_i = c^b_i · b (a − Σ_{j<i} (a · c^a_j) c^a_j)    (19)

Δc^b_i = c^a_i · a (b − Σ_{j<i} (b · c^b_j) c^b_j)    (20)

Δc_i = c_i · x (x − Σ_{j<i} (x · c_j) c_j)    (21)

In equations 6 and 9/10 we introduced approximations that become accurate as the direction of the singular vectors settles. These approximations will therefore not interfere with the accuracy of the final result, though they might interfere with the rate of convergence. The constant σ³ has been dropped in 19 and 20. Its relevance is purely with respect to the calculation of the singular value. Recall that in (Weng et al., 2003) the eigenvalue is calculable as the average magnitude of the training update Δc. In our formulation, according to 17 and 18, the singular value would be Δc divided by σ³. Dropping the σ³ in 19 and 20 achieves that implicitly; the singular value is once more the average length of the training update.

The next section discusses practical aspects of implementation. The following section illustrates usage, with English language word and letter bigram data as test domains.

3 Implementation

Within the framework of the algorithm outlined above, there is still room for some implementation decisions to be made. The naive implementation can be summarised as follows: the first datum is used to train the first singular vector pair; the projection of the first singular vector pair onto this datum is subtracted from the datum; the datum is then used to train the second singular vector pair, and so on for all the vector pairs; ensuing data items are processed similarly. The main problem with this approach is as follows. At the beginning of the training process, the singular vectors are close to the values they were initialised with, and far away from the values they will settle on. The second singular vector pair is trained on the datum minus its projection onto the first singular vector pair in order to prevent the second singular vector pair from becoming the same as the first. But if the first pair is far away from its eventual direction, then the second has a chance to move in the direction that the first will eventually take on. In fact, all the vectors, such as they can whilst remaining orthogonal to each other, will move in the strongest direction. Then, when the first pair eventually takes on the right direction, the others have difficulty recovering, since they start to receive data that they have very little projection on, meaning that they learn very slowly.

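As a concrete reading of the naive scheme just described, the sketch below applies the cross-training update of equations 19 and 20 to k vector pairs. It is not the author's implementation: the class name, the initialisation, and the use of unit-normalised vectors for the activations and projections (so that the stored, unnormalised vectors grow roughly linearly rather than exponentially) are choices made for this example.

```python
import numpy as np

class PairedGHA:
    """Sketch of the paired update of equations 19 and 20: incremental SVD
    estimates from serially presented (a, b) vector pairs (naive scheme)."""

    def __init__(self, dim_a, dim_b, k, seed=0):
        rng = np.random.default_rng(seed)
        # k unnormalised left/right singular vector estimates; their growing
        # length plays the role of the implicit learning rate.
        self.Ca = 1e-4 * rng.standard_normal((k, dim_a))
        self.Cb = 1e-4 * rng.standard_normal((k, dim_b))

    def train(self, a, b):
        a = np.asarray(a, dtype=float).copy()
        b = np.asarray(b, dtype=float).copy()
        for i in range(len(self.Ca)):
            ua = self.Ca[i] / np.linalg.norm(self.Ca[i])
            ub = self.Cb[i] / np.linalg.norm(self.Cb[i])
            # Equations 19/20: each side is trained using the other side's
            # activation, so the two sides cross-train each other.  The
            # activations and projections use unit-normalised vectors here,
            # an example choice rather than a detail taken from the paper.
            self.Ca[i] += (ub @ b) * a
            self.Cb[i] += (ua @ a) * b
            # Naive scheme: remove this pair's projection from the datum
            # before it is used to train the next, less significant pair.
            a -= (a @ ua) * ua
            b -= (b @ ub) * ub

    def singular_pairs(self):
        """Current unit-normalised estimates of the singular vector pairs."""
        return [(ca / np.linalg.norm(ca), cb / np.linalg.norm(cb))
                for ca, cb in zip(self.Ca, self.Cb)]

# Usage: pairs with a dominant rank-1 structure; the first pair should settle
# near that structure's left and right directions.
rng = np.random.default_rng(1)
model = PairedGHA(dim_a=4, dim_b=3, k=2)
for _ in range(2000):
    a = np.array([1.0, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(4)
    b = np.array([0.0, 1.0, 0.0]) + 0.1 * rng.standard_normal(3)
    model.train(a, b)
print(model.singular_pairs()[0])
```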
The problem can be addressed by waiting until each singular vector pair is relatively stable before beginning to train the next. By “stable”, we mean that the vector is changing little in its direction, such as to suggest it is very close to its target. Measures of stability might include the average variation in position of the endpoint of the (normalised) vector over a number of training iterations, or simply the length of the (unnormalised) vector, since a long vector is one that is being reinforced by the training data, such as it would be if it were settled on the dominant feature. Termination criteria might include that a target number of singular vector pairs has been reached, or that the last vector is increasing in length only very slowly.

4 Application

The task of relating linguistic bigrams to each other, as mentioned earlier, is an example of a task appropriate to singular value decomposition, in that the data is paired data, in which each item is in a different space to the other. Consider word bigrams, for example. First word space is in a non-symmetrical relationship to second word space; indeed, the spaces are not even necessarily of the same dimensionality, since there could conceivably be words in the corpus that never appear in the first word slot (they might never appear at the start of a sentence) or in the second word slot (they might never appear at the end). So a matrix containing word counts, in which each unique first word forms a row and each unique second word forms a column, will not be a square symmetrical matrix; the value at row a, column b, will not be the same as the value at row b, column a, except by coincidence.

The significance of performing dimensionality reduction on word bigrams could be thought of as follows. Language clearly adheres to some extent to a rule system less rich than the individual instances that form its surface manifestation. Those rules govern which words might follow which other words; although the rule system is more complex and of longer range than word bigrams can hope to illustrate, nonetheless the rule system governs the surface form of word bigrams, and we might hope that it would be possible to discern from word bigrams something of the nature of the rules. In performing dimensionality reduction on word bigram data, we force the rules to describe themselves through a more impoverished form than via the collection of instances that form the training corpus. The hope is that the resulting simplified description will be a generalisable system that applies even to instances not encountered at training time.

On a practical level, the outcome has applications in automatic language acquisition. For example, the result might be applicable in language modelling. Use of the learning algorithm presented in this paper is appropriate given the very large dimensions of any realistic corpus of language. The corpus chosen for this demonstration is Margaret Mitchell's “Gone with the Wind”, which contains 19,296 unique words (421,373 in total). Fully realized as a correlation matrix of, for example, 4-byte floats, this would consume 1.5 gigabytes, and yet within natural language processing it would not be considered a particularly large corpus. Results on the word bigram task are presented in the next section.

Letter bigrams provide a useful contrasting illustration in this context; an input dimensionality of 26 allows the result to be more easily visualised. Practical applications might include automatic handwriting recognition, where an estimate of the likelihood of a particular letter following another would be useful information. The fact that there are only twenty-something letters in most western alphabets, though, makes the usefulness of the incremental approach, and indeed of dimensionality reduction techniques in general, less obvious in this domain. However, extending the space to letter trigrams and even four-grams would change the requirements. Section 4.2 discusses results on a letter bigram task.

4.1 Word Bigram Task

“Gone with the Wind” was presented to the algorithm as word bigrams. Each word was mapped to a vector containing all zeros but for a one in the slot corresponding to the unique word index assigned to that word. This had the effect of making input to the algorithm a normalised vector, and of making word vectors orthogonal to each other.

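A sketch of this data preparation might look as follows; the tokenisation and the names are assumptions made for the example, not details from the paper. Each (a, b) pair would then be passed, one at a time, to the incremental update (for instance the PairedGHA sketch given earlier).

```python
import numpy as np

def one_hot_bigram_pairs(tokens):
    """Yield (a, b) one-hot vector pairs for consecutive words: each word maps
    to a vector that is all zeros except for a one at that word's index, so
    inputs are normalised and distinct word vectors are mutually orthogonal."""
    vocab = {}
    for w in tokens:                   # assign a unique index to each word
        vocab.setdefault(w, len(vocab))
    dim = len(vocab)                   # in a truly unbounded setting this would
                                       # grow as new words appear
    for first, second in zip(tokens, tokens[1:]):
        a = np.zeros(dim); a[vocab[first]] = 1.0
        b = np.zeros(dim); b[vocab[second]] = 1.0
        yield a, b

# Toy usage with a short text standing in for the novel used in the paper.
tokens = "the cat sat on the mat".split()
for a, b in one_hot_bigram_pairs(tokens):
    print(a.argmax(), "->", b.argmax())   # each pair feeds one training update
```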
The singular vector pair's reaching a combined Euclidean magnitude of 2000 was given as the criterion for beginning to train the next vector pair, the reasoning being that since the singular vectors only start to grow long when they settle in approximately the right direction and the data starts to reinforce them, length forms a reasonable heuristic for deciding if they are settled enough to begin training the next vector pair. 2000 was chosen ad hoc based on observation of the behaviour of the algorithm during training.

The data presented are the words most representative of the top two singular vector pairs, that is to say, the directions these singular vectors mostly point in. Tables 1 and 2 show the words with the highest scores in the top two vector pairs. Table 1 says that in the first vector pair, the normalised left hand vector projected by 0.513 onto the vector for the word “of” (or in other words, these vectors have a dot product of 0.513). The normalised right hand vector has a projection of 0.876 onto the word “the”, etc.

This first table shows a left side dominated by prepositions, with a right side in which “the” is by far the most important word, but which also contains many pronouns. The fact that the first singular vector pair is effectively about “the” (the right hand side points far more in the direction of “the” than any other word) reflects its status as the most common word in the English language. What this result is saying is that were we to be allowed only one feature with which to describe English word bigrams, a feature describing words appearing before “the” and words behaving similarly to “the” would be the best we could choose. Other very common words in English are also prominent in this feature.

Table 1: Top words in 1st singular vector pair
Vector 1, Eigenvalue 0.00938
Left singular vector        Right singular vector
of      0.5125468           the     0.8755944
in      0.49723375          her     0.28781646
and     0.39370865          a       0.23318098
to      0.2748983           his     0.14336193
on      0.21759394          she     0.1128443
at      0.17932475          it      0.06529821
for     0.16905183          he      0.063333265
with    0.16042696          you     0.058997907
from    0.13463423          their   0.05517004

Table 2 puts “she”, “he” and “it” at the top on the left, and four common verbs on the right, indicating a pronoun-verb pattern as the second most dominant feature in the corpus.

Table 2: Top words in 2nd singular vector pair
Vector 2, Eigenvalue 0.00427
Left singular vector        Right singular vector
she     0.6633538           was     0.58067155
he      0.38005337          had     0.50169927
it      0.30800354          could   0.2315106
and     0.18958427          would   0.17589279

4.2 Letter Bigram Task

Running the algorithm on letter bigrams illustrates different properties. Because there are only 26 letters in the English alphabet, it is meaningful to examine the entire singular vector pair. Figure 1 shows the third singular vector pair derived by running the algorithm on letter bigrams. The y axis gives the projection of the vector for the given letter onto the singular vector. The left singular vector is given on the left, and the right on the right; that is to say, the first letter in the bigram is on the left and the second on the right. The first two singular vector pairs are dominated by letter frequency effects, but the third is interesting because it clearly shows that the method has identified vowels. It means that the third most useful feature for determining the likelihood of letter b following letter a is whether letter a is a vowel. If letter b is a vowel, letter a is less likely to be (vowels dominate the negative end of the right singular vector). (Later features could introduce subcases where a particular vowel is likely to follow another particular vowel, but this result suggests that the most dominant case is that this does not happen.) Interestingly, the letter 'h' also appears at the negative end of the right singular vector, suggesting that 'h' for the most part does not follow a vowel in English. Items near zero ('k', 'z' etc.) are not strongly represented in this singular vector pair; it tells us little about them.

Figure 1: Third Singular Vector Pair on Letter Bigram Task
[Plot: each letter's projection (y axis, roughly -0.6 to 0.6) onto the third singular vector pair; the left singular vector (first letter in the bigram) on the left, the right singular vector (second letter) on the right.]

5 Conclusion

An incremental approach to approximating the singular value decomposition of a correlation matrix has been presented. Use of the incremental approach means that singular value decomposition is an option in situations where data takes the form of single serially-presented observations from an unknown matrix. The method is particularly appropriate in natural language contexts, where datasets are often too large to be processed by traditional methods, and in situations where the dataset is unbounded, for example in systems that learn through use. The approach produces preliminary estimations of the top vectors, meaning that information becomes available early in the training process. By avoiding matrix multiplication, data of high dimensionality can be processed. Results of preliminary experiments have been discussed here on the task of modelling word and letter bigrams. Future work will include an evaluation on much larger corpora.

Acknowledgements: The author would like to thank Brandyn Webb for his contribution, and the Graduate School of Language Technology and Vinnova for their financial support.

References

J. Bellegarda. 2000. Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 88(8).

Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. 1995. Using linear algebra for intelligent information retrieval. SIAM Review, 34(4):573–595.

R. W. Berry. 1992. Large-scale sparse singular value computations. The International Journal of Supercomputer Applications, 6(1):13–49.

Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407.

S. Dumais. 1990. Enhancing performance in latent semantic indexing. Technical Report TM-ARH-017527, Bellcore.

G. H. Golub and C. Reinsch. 1970. Handbook series linear algebra: Singular value decomposition and least squares solutions. Numerical Mathematics, 14:403–420.

G. Gorrell and B. Webb. 2005. Generalized Hebbian algorithm for latent semantic analysis. In Proceedings of Interspeech 2005.

P. Kanerva, J. Kristoferson, and A. Holst. 2000. Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society.

E. Oja and J. Karhunen. 1985. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. J. Math. Analysis and Application, 106:69–84.

Terence D. Sanger. 1989. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2:459–473.

Terence D. Sanger. 1993. Two iterative algorithms for computing the singular value decomposition from input/output samples. NIPS, 6:144–151.

Juyang Weng, Yilu Zhang, and Wey-Shiuan Hwang. 2003. Candid covariance-free incremental principal component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):1034–1040.

