Indexing by Latent Semantic Analysis
Scott Deerwester
Center for Information and Language Studies, University of Chicago, Chicago, IL 60637
Richard Harshman
University of Western Ontario, London, Ontario, Canada
*To whom all correspondence should be addressed.

Received August 26, 1987; revised April 4, 1988; accepted April 5, 1988.

© 1990 by John Wiley & Sons, Inc.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, 41(6):391-407, 1990.

A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.

Introduction

We describe here a new approach to automatic indexing and retrieval. It is designed to overcome a fundamental problem that plagues existing retrieval techniques that try to match words of queries with words of documents. The problem is that users want to retrieve on the basis of conceptual content, and individual words provide unreliable evidence about the conceptual topic or meaning of a document. There are usually many ways to express a given concept, so the literal terms in a user's query may not match those of a relevant document. In addition, most words have multiple meanings, so terms in a user's query will literally match terms in documents that are not of interest to the user.

The proposed approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. We assume there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval. We use statistical techniques to estimate this latent structure, and get rid of the obscuring "noise." A description of terms and documents based on the latent semantic structure is used for indexing and retrieval.¹

¹By "semantic structure" we mean here only the correlation structure in the way in which individual words appear in documents; "semantic" implies only the fact that terms in a document may be taken as referents to the document itself or to its topic.

The particular "latent semantic indexing" (LSI) analysis that we have tried uses singular-value decomposition. We take a large matrix of term-document association data and construct a "semantic" space wherein terms and documents that are closely associated are placed near one another. Singular-value decomposition allows the arrangement of the space to reflect the major associative patterns in the data, and ignore the smaller, less important influences. As a result, terms that did not actually appear in a document may still end up close to the document, if that is consistent with the major patterns of association in the data. Position in the space then serves as the new kind of semantic indexing. Retrieval proceeds by using the terms in a query to identify a point in the space, and documents in its neighborhood are returned to the user.

Deficiencies of Current Automatic Indexing and Retrieval Methods

A fundamental deficiency of current information retrieval methods is that the words searchers use often are not the same as those by which the information they seek has been indexed. There are actually two sides to the issue; we will call them broadly synonymy and polysemy. We use synonymy in a very general sense to describe the fact that
there are many ways to refer to the same object. Users in different contexts, or with different needs, knowledge, or linguistic habits will describe the same information using different terms. Indeed, we have found that the degree of variability in descriptive term usage is much greater than is commonly suspected. For example, two people choose the same main key word for a single well-known object less than 20% of the time (Furnas, Landauer, Gomez, & Dumais, 1987). Comparably poor agreement has been reported in studies of interindexer consistency (Tarr & Borko, 1974) and in the generation of search terms by either expert intermediaries (Fidel, 1985) or less experienced searchers (Liley, 1954; Bates, 1986). The prevalence of synonyms tends to decrease the "recall" performance of retrieval systems. By polysemy we refer to the general fact that most words have more than one distinct meaning (homography). In different contexts or when used by different people the same term (e.g., "chip") takes on varying referential significance. Thus the use of a term in a search query does not necessarily mean that a document containing or labeled by the same term is of interest. Polysemy is one factor underlying poor "precision."

The failure of current automatic indexing to overcome these problems can be largely traced to three factors. The first factor is that the way index terms are identified is incomplete. The terms used to describe or index a document typically contain only a fraction of the terms that users as a group will try to look it up under. This is partly because the documents themselves do not contain all the terms users will apply, and sometimes because term selection procedures intentionally omit many of the terms in a document.

Attempts to deal with the synonymy problem have relied on intellectual or automatic term expansion, or the construction of a thesaurus. These are presumably advantageous for conscientious and knowledgeable searchers who can use such tools to suggest additional search terms. The drawback for fully automatic methods is that some added terms may have different meaning from that intended (the polysemy effect), leading to rapid degradation of precision (Sparck Jones, 1972).

It is worth noting in passing that experiments with small interactive data bases have shown monotonic improvements in recall rate without overall loss of precision as more indexing terms, either taken from the documents or from large samples of actual users' words, are added (Gomez, Lochbaum, & Landauer, in press; Furnas, 1985). Whether this "unlimited aliasing" method, which we have described elsewhere, will be effective in very large data bases remains to be determined. Not only is there a potential issue of ambiguity and lack of precision, but the problem of identifying index terms that are not in the text of documents grows cumbersome. This was one of the motives for the approach to be described here.

The second factor is the lack of an adequate automatic method for dealing with polysemy. One common approach is the use of controlled vocabularies and human intermediaries to act as translators. Not only is this solution extremely expensive, but it is not necessarily effective. Another approach is to allow Boolean intersection or coordination with other terms to disambiguate meaning. Success is severely hampered by users' inability to think of appropriate limiting terms if they do exist, and by the fact that such terms may not occur in the documents or may not have been included in the indexing.

The third factor is somewhat more technical, having to do with the way in which current automatic indexing and retrieval systems actually work. In such systems each word type is treated as independent of any other (see, for example, van Rijsbergen (1977)). Thus matching (or not) of two terms that almost always occur together is counted as heavily as matching two that are rarely found in the same document. Thus the scoring of success, in either straight Boolean or coordination level searches, fails to take redundancy into account, and as a result may distort results to an unknown degree. This problem exacerbates a user's difficulty in using compound-term queries effectively to expand or limit a search.

Rationale of the Latent Semantic Indexing (LSI) Method

Illustration of Retrieval Problems

We illustrate some of the problems with term-based information retrieval systems by means of a fictional matrix of terms by documents (Table 1). Below the table we give a fictional query that might have been passed against this database. An "R" in the column labeled REL (relevant) indicates that the user would have judged the document relevant to the query (here documents 1 and 3 are relevant). Terms occurring in both the query and a document (computer and information) are indicated by an asterisk in the appropriate cell; an "M" in the MATCH column indicates that the document matches the query and would have been returned to the user. Documents 1 and 2 illustrate common classes of problems with which the proposed method deals.
TABLE 1. Sample term by document matrix.

        Access  Document  Retrieval  Information  Theory  Database  Indexing  Computer  REL  MATCH
Doc 1     x        x          x                              x         x                  R
Doc 2                                     x*         x                            x*            M
Doc 3                         x           x*                                      x*      R      M

Query: "IDF in computer-based information look-up"
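For concreteness, here is a minimal Python sketch (ours, not the authors') that replays this example with literal term matching; the document term sets follow the reconstructed Table 1 above, and the query terms are "information" and "computer":

```python
# Toy term-matching retrieval over the fictional Table 1 matrix.
# A document is returned ("M") if it shares at least one term with the query.
docs = {
    "Doc 1": {"access", "document", "retrieval", "database", "indexing"},
    "Doc 2": {"information", "theory", "computer"},
    "Doc 3": {"retrieval", "information", "computer"},
}
relevant = {"Doc 1", "Doc 3"}        # the user's actual judgments ("R")
query = {"information", "computer"}  # literal terms of the query

for name, terms in docs.items():
    overlap = terms & query
    print(name,
          "returned" if overlap else "missed",
          "relevant" if name in relevant else "non-relevant",
          sorted(overlap))
# Doc 1 is relevant but missed (synonymy); Doc 2 is returned but
# non-relevant (polysemy); only Doc 3 is both returned and relevant.
```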
(3) Computational tractability for large datasets. Many of the existing models require computation that goes up with N⁴ or N⁵ (where N is the number of terms plus documents). Since we hoped to work with document sets that were at least in the thousands, models with efficient fitting techniques were needed.

The only model which satisfied all three criteria was two-mode factor analysis. The tree unfolding model was considered too representationally restrictive, and along with nonmetric multidimensional unfolding, too computationally expensive. Two-mode factor analysis is a generalization of the familiar factor analytic model based on singular value decomposition (SVD). (See Forsythe, Malcolm, & Moler (1977), Chapter 9, for an introduction to SVD and its applications.) SVD represents both terms and documents as vectors in a space of choosable dimensionality, and the dot product or cosine between points in the space gives their similarity. In addition, a program was available (Harshman & Lundy, 1984b) that fit the model in time of order N² × k³.

SVD or Two-Mode Factor Analysis

Overview

The latent semantic structure analysis starts with a matrix of terms by documents. This matrix is then analyzed by singular value decomposition (SVD) to derive our particular latent semantic structure model. Singular value decomposition is closely related to a number of mathematical and statistical techniques in a wide variety of other fields, including eigenvector decomposition, spectral analysis, and factor analysis. We will use the terminology of factor analysis, since that approach has some precedence in the information retrieval literature.

The traditional, one-mode factor analysis begins with a matrix of associations between all pairs of one type of object, e.g., documents (Borko & Bernick, 1963). This might be a matrix of human judgments of document to document similarity, or a measure of term overlap computed for each pair of documents from an original term by document matrix. This square symmetric matrix is decomposed, by a process called "eigen-analysis," into the product of two matrices of a very special form (containing "eigenvectors" and "eigenvalues"). These special matrices show a breakdown of the original data into linearly independent components or "factors." In general many of these components are very small, and may be ignored, leading to an approximate model that contains many fewer factors. Each of the original documents' similarity behavior is now approximated by its values on this smaller number of factors. The result can be represented geometrically by a spatial configuration in which the dot product or cosine between vectors representing two documents corresponds to their estimated similarity.

In two-mode factor analysis one begins not with a square symmetric matrix relating pairs of only one type of entity, but with an arbitrary rectangular matrix with different entities on the rows and columns, e.g., a matrix of terms and documents. This rectangular matrix is again decomposed into three other matrices of a very special form, this time by a process called "singular-value-decomposition" (SVD). (The resulting matrices contain "singular vectors" and "singular values.") As in the one-mode case these special matrices show a breakdown of the original relationships into linearly independent components or factors. Again, many of these components are very small, and may be ignored, leading to an approximate model that contains many fewer dimensions. In this reduced model all the term-term, document-document, and term-document similarities are now approximated by values on this smaller number of dimensions. The result can still be represented geometrically by a spatial configuration in which the dot product or cosine between vectors representing two objects corresponds to their estimated similarity.

Thus, for information retrieval purposes, SVD can be viewed as a technique for deriving a set of uncorrelated indexing variables or factors; each term and document is represented by its vector of factor values. Note that by virtue of the dimension reduction, it is possible for documents with somewhat different profiles of term usage to be mapped into the same vector of factor values. This is just the property we need to accomplish the improvement of unreliable data proposed earlier. Indeed, the SVD representation, by replacing individual terms with derived orthogonal factor values, can help to solve all three of the fundamental problems we have described.

In various problems, we have approximated the original term-document matrix using 50-100 orthogonal factors or derived dimensions. Roughly speaking, these factors may be thought of as artificial concepts; they represent extracted common meaning components of many different words and documents. Each term or document is then characterized by a vector of weights indicating its strength of association with each of these underlying concepts. That is, the "meaning" of a particular term, query, or document can be expressed by k factor values, or equivalently, by the location of a vector in the k-space defined by the factors. The meaning representation is economical, in the sense that N original index terms have been replaced by the k < N best surrogates by which they can be approximated. We make no attempt to interpret the underlying factors, nor to "rotate" them to some meaningful orientation. Our aim is not to be able to describe the factors verbally but merely to be able to represent terms, documents, and queries in a way that escapes the unreliability, ambiguity, and redundancy of individual terms as descriptors.

It is possible to reconstruct the original term by document matrix from its factor weights with reasonable but not perfect accuracy. It is important for the method that the derived k-dimensional factor space not reconstruct the original term space perfectly, because we believe the original term space to be unreliable. Rather we want a derived structure that expresses what is reliable and important in the underlying use of terms as document referents.
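As a concrete sketch of this reduced representation (ours, not the authors' program; the matrix values and the choice k = 2 are invented for illustration), the following numpy fragment derives k-dimensional factor vectors for documents and compares two of them by cosine in the reduced space:

```python
import numpy as np

# Invented term by document count matrix X (5 terms x 4 documents).
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Full SVD, then keep only the k largest singular values ("factors").
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
S, Dt = np.diag(s0[:k]), D0t[:k, :]

# Each document becomes a k-vector of factor values (a row of D*S).
# Documents with somewhat different term profiles can still receive
# similar vectors, which is the smoothing property described above.
doc_vecs = (S @ Dt).T

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(doc_vecs[0], doc_vecs[2]))  # document-document similarity
```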
FIG. 1. A two-dimensional plot of 12 terms and 9 documents from the sample TM set. Terms are represented by filled circles. Documents are shown as open squares, and component terms are indicated parenthetically. The query ("human computer interaction") is represented as a pseudo-document at point q. Axes are scaled for document-document or term-term comparisons. The dotted cone represents the region whose points are within a cosine of .9 from the query q. All documents about human-computer interaction (c1-c5) are "near" the query (i.e., within this cone), but none of the graph theory documents (m1-m4) are nearby. In this reduced space, even documents c3 and c5, which share no terms with the query, are near it.
Then, we simply look for documents which are near the query, q. In this case, documents c1-c5 (but not m1-m4) are "nearby" (within a cosine of .9, as indicated by the dashed lines). Notice that even documents c3 and c5, which share no index terms at all with the query, are near it in this representation. The relations among the documents expressed in the factor space depend on complex and indirect associations between terms and documents, ones that come from an analysis of the structure of the whole set of relations in the term by document matrix. This is the strength of using higher order structure in the term by document matrix to represent the underlying meaning of a single term, document, or query. It yields a more robust and economical representation than do term overlap or surface-level clustering methods.

Technical Details

The Singular Value Decomposition (SVD) Model. This section details the mathematics underlying the particular model of latent structure, singular value decomposition, that we currently use. The casual reader may wish to skip this section and proceed to the next section.

Any rectangular matrix, for example a t × d matrix of terms and documents, X, can be decomposed into the product of three other matrices:

X = T0 S0 D0',

such that T0 and D0 have orthonormal columns and S0 is diagonal. This is called the singular value decomposition of X. T0 and D0 are the matrices of left and right singular vectors and S0 is the diagonal matrix of singular values.⁴ Singular value decomposition (SVD) is unique up to certain row, column, and sign permutations,⁵ and by convention the diagonal elements of S0 are constructed to be all positive and ordered in decreasing magnitude.

⁴SVD is closely related to the standard eigenvalue-eigenvector or spectral decomposition of a square symmetric matrix, Y, into VLV', where V is orthonormal and L is diagonal. The relation between SVD and eigen analysis is more than one of analogy. In fact, T0 is the matrix of eigenvectors of the square symmetric matrix Y = XX', D0 is the matrix of eigenvectors of Y = X'X, and in both cases, S0² would be the matrix, L, of eigenvalues.

⁵Allowable permutations are those that leave S0 diagonal and maintain the correspondences with T0 and D0. That is, columns i and j of S0 may be interchanged iff rows i and j of S0 are interchanged, and columns i and j of T0 and D0 are interchanged.
X (t × d) = T0 (t × m) · S0 (m × m) · D0' (m × d)

FIG. 2. Schematic of the Singular Value Decomposition (SVD) of a rectangular term by document matrix. The original term by document matrix is decomposed into three matrices each with linearly independent components.
The first standard set we tried, MED, was the commonly studied collection of medical abstracts. It consists of 1033 documents and 30 queries. Our automatic indexing

Footnote: The value 39.7 is reported in Table 1 of the Voorhees article. We suspect this is in error, and that the correct value may be 9.7, which would be in line with the other measures of mean number of terms per query.
X̂ (t × d) = T (t × k) · S (k × k) · D' (k × d)

FIG. 3. Schematic of the reduced Singular Value Decomposition (SVD) of a term by document matrix. The original term by document matrix is approximated using the k largest singular values and their corresponding singular vectors.
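A minimal numpy sketch of this reduced decomposition (the matrix and the choice k = 2 are again invented):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((8, 5))                 # t x d term by document matrix
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

k = 2                                  # number of factors retained
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]
X_hat = T @ S @ Dt                     # X-hat = T S D', a rank-k matrix

# X-hat is the closest rank-k matrix to X in the least-squares
# (Frobenius) sense: it reconstructs X with reasonable but not
# perfect accuracy, exactly as the method requires.
print(np.linalg.norm(X - X_hat))
```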
the relevant documents. Thus the largest benefits of the LSI method should be observed at high recall, and, indeed, this is the case.

Up to this point, we have reported LSI results from a 100-factor representation (i.e., in a 100-dimensional space). This raises the important issue of choosing the dimensionality.
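One way to make that choice concrete is to sweep the number of factors and score the resulting rankings, as in the illustrative sketch below; the random matrix, query vector, and relevance judgments are all placeholders (not the MED collection), and the query is folded in on the same scaling as the document vectors:

```python
import numpy as np

def average_precision(ranking, relevant):
    """Mean of the precision values observed at each relevant document."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant)

rng = np.random.default_rng(2)
X = rng.random((50, 20))               # placeholder 50 terms x 20 documents
q = rng.random(50)                     # placeholder query term vector
relevant = {3, 7, 11}                  # placeholder relevance judgments

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
for k in (2, 5, 10, 20):
    doc_vecs = (np.diag(s0[:k]) @ D0t[:k, :]).T  # documents as rows of D*S
    q_vec = q @ T0[:, :k]              # query as a pseudo-document vector
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1)
                               * np.linalg.norm(q_vec))
    print(k, round(average_precision(np.argsort(-sims), relevant), 3))
```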
FIG. 4. Precision-recall curves for TERM matching, a 100-factor LSI, SMART, and Voorhees systems on the MED dataset. The data for the TERM matching, LSI, and SMART methods are obtained by measuring precision at each of nine levels of recall (approximately .10 increments) for each query separately and then averaging over queries. The two Voorhees data points are taken from Table 4b in her article.
FIG. 5. A plot of average precision (averaged over nine levels of recall) as a function of number of factors for the MED dataset. Precision more than doubles (from about .20 to .50) as the number of factors is increased from 10 to 100.
FIG. 6. Precision-recall curves for TERM matching, a 100-factor LSI, SMART, and Voorhees systems on the CISI dataset. The data for TERM matching, LSI, and SMART are obtained by measuring precision at each of nine levels of recall (approximately .10 increments) for each query separately and then averaging over queries. The two Voorhees data points are taken from Table 6b in her paper.