
Latent Semantic Analysis in Topic Modelling
By Anindya Dey

LATENT SEMANTIC ANALYSIS (LSA)

Overview:
All languages have their own intricacies and nuances which are quite difficult for a machine to capture (sometimes they are even misunderstood by us humans!). These include different words that mean the same thing, as well as words that have the same spelling but different meanings.

For example, consider the following two sentences:

1. I liked his last novel quite a lot.


2. We would like to go for a novel marketing campaign.

In the first sentence, the word ‘novel’ refers to a book, and in the second sentence it
means new or fresh.

We can easily distinguish between these words because we are able to understand
the context behind these words. However, a machine would not be able to capture
this concept as it cannot understand the context in which the words have been used.
This is where Latent Semantic Analysis (LSA) comes into play as it attempts to
leverage the context around the words to capture the hidden concepts, also known
as topics.

So, simply mapping words to documents won’t really help. What we really need is to
figure out the hidden concepts or topics behind the words. LSA takes meaningful text
documents and recreates them in n different parts where each part expresses a
different way of looking at meaning in the text. If you imagine the text data as an
idea, there would be n different ways of looking at that idea, or n different ways
of conceptualising the whole text. LSA reduces our table of data to a table of latent
(hidden) concepts.
Steps involved in the implementation of LSA:

Step – 1:

The first step is generating our document-term matrix. Given m documents and n words in our vocabulary, we can construct an m × n matrix A in which each row represents a document and each column represents a word.

In the simplest version of LSA, each entry can simply be a raw count of the number of
times the j-th word appeared in the i-th document. In practice, however, raw counts
do not work particularly well because they do not account for the significance of each
word in the document. Thus, LSA models typically replace raw counts in the
document-term matrix with a tf-idf score.
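As a concrete illustration of this step, here is a minimal sketch that builds a raw-count document-term matrix, assuming scikit-learn is available (the two-sentence corpus is just the example from the overview):

# Minimal sketch of Step 1: a raw-count document-term matrix.
# scikit-learn's CountVectorizer is used purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I liked his last novel quite a lot",
    "We would like to go for a novel marketing campaign",
]

vectorizer = CountVectorizer()
A = vectorizer.fit_transform(documents)    # m x n sparse matrix of raw counts

print(vectorizer.get_feature_names_out())  # the n words in our vocabulary
print(A.toarray())                         # entry (i, j) = count of word j in document i

As noted above, these raw counts are usually replaced with tf-idf scores, described next.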
What is Tf-idf ?

Tf-idf stands for term frequency-inverse document frequency. It is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus, and it is often used as a weighting factor in information retrieval and text mining. In topic modelling it is used to weight the document-term matrix from which the topics are extracted. The tf-idf score combines two weights. The first, Term Frequency (TF), measures how frequently a term occurs in a document; while computing TF, all terms are considered equally important. The second, Inverse Document Frequency (IDF), measures how important a term is across the corpus.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF(t) = log_e(Total number of documents / Number of documents containing term t)
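To make these two formulas concrete, here is a small worked example in plain Python (the three-document corpus is made up for illustration):

import math

# Tiny made-up corpus to illustrate the TF and IDF formulas above.
docs = [
    ["the", "novel", "was", "a", "great", "novel"],
    ["a", "novel", "marketing", "campaign"],
    ["the", "campaign", "was", "great"],
]

def tf(term, doc):
    # TF(t) = (times term t appears in the document) / (total terms in the document)
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # IDF(t) = log_e(total number of documents / number of documents containing term t)
    containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / containing)

print(tf("novel", docs[0]))   # 2 / 6 ≈ 0.333
print(idf("novel", docs))     # log(3 / 2) ≈ 0.405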

Tf-idf, or term frequency-inverse document frequency, assigns a weight for term j in document i as follows:

w(i, j) = TF of term j in document i × log_e(Total number of documents / Number of documents containing term j)

Intuitively, a term has a large weight when it occurs frequently in a document but infrequently across the corpus. The word “build” might appear often in a document, but because it is likely fairly common in the rest of the corpus, it will not have a high tf-idf score. However, if the word “gentrification” appears often in a document, because it is rarer in the rest of the corpus, it will have a higher tf-idf score.
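In practice these weights are rarely computed by hand. Here is a minimal sketch, assuming scikit-learn (whose TfidfVectorizer uses a slightly smoothed variant of the idf formula above), of building the tf-idf weighted document-term matrix for an invented corpus:

# Step 1 with tf-idf weighting instead of raw counts (illustrative corpus).
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "I liked his last novel quite a lot",
    "We would like to go for a novel marketing campaign",
    "The marketing team launched a new campaign",
]

vectorizer = TfidfVectorizer()
A_tfidf = vectorizer.fit_transform(documents)  # m x n matrix of tf-idf weights

# Terms shared by several documents (e.g. "campaign") receive lower weights
# than terms concentrated in a single document.
print(vectorizer.get_feature_names_out())
print(A_tfidf.toarray().round(2))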

Step – 2:

We will reduce the dimensions of the above matrix to k (the number of desired topics) using singular-value decomposition (SVD).

Singular-Value Decomposition:

Once we have our document-term matrix A, we can start thinking about our
latent topics. Here’s the thing: in all likelihood, A is very sparse, very noisy, and very
redundant across its many dimensions. As a result, to find the few latent topics that
capture the relationships among the words and documents, we want to perform
dimensionality reduction on A.

This dimensionality reduction can be performed using truncated SVD. SVD, or singular value decomposition, is a technique in linear algebra that factorizes any matrix M into the product of three separate matrices:

M = U * S * V*, where S is a diagonal matrix of the singular values of M.

Here we can represent the decomposition of the matrix as follows:

M = U Σ V*

 M is an m × n matrix (our document-term matrix A)
 U is an m × n matrix of left singular vectors
 Σ (the S above) is an n × n diagonal matrix with non-negative real numbers on its diagonal
 V is an n × n matrix of right singular vectors
 V* is the n × n conjugate transpose of V (equal to the transpose, since V is real)

Truncating the decomposition to the k largest singular values gives

M ≈ Uk Sk Vk*
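A small numerical check of this decomposition, as a sketch assuming NumPy (the matrix here is random, standing in for a real document-term matrix):

import numpy as np

# Illustrative stand-in for a document-term matrix: m = 6 documents, n = 4 terms.
np.random.seed(0)
M = np.random.rand(6, 4)

# Thin SVD: U is m x n, S holds the n singular values, Vt is V*.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
assert np.allclose(M, U @ np.diag(S) @ Vt)    # M = U * S * V*

# Truncated SVD: keep only the k largest singular values (k = number of topics).
k = 2
M_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # best rank-k approximation of M
print(np.linalg.norm(M - M_k))                # reconstruction error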
Geometrically, SVD is often pictured as mapping a unit disc to an ellipse; in that picture, the two vectors 𝝈1 and 𝝈2 are our singular values plotted in this space.

Now, just like with geometric transformations of points that you may remember from
school, we can reconsider this transformation M as three separate transformations:

1. The rotation (or reflection) caused by V*. Note that V* = V-transpose: since V is a real unitary matrix, its conjugate transpose is the same as its transpose. In vector terms, the transformation by V or V* keeps the length of the basis vectors the same;

2. 𝚺 has the effect of stretching or compressing all coordinate points along its axes by the corresponding singular values. Imagine our unit disc as we squeeze it vertically down in the direction of 𝝈2 and stretch it horizontally along the direction of 𝝈1. These two singular values can now be pictured as the major and minor semi-axes of an ellipse. You can of course generalise this to n dimensions.

3. Lastly, applying U rotates (or reflects) our feature space. We arrive at the same output as applying M directly, as the small numeric check below shows.
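The same three-step picture can be verified numerically on a small 2 × 2 example (a sketch assuming NumPy; the matrix and the point are arbitrary):

import numpy as np

# The three-step geometric picture of SVD on an arbitrary 2 x 2 matrix.
M = np.array([[3.0, 1.0],
              [1.0, 2.0]])
U, S, Vt = np.linalg.svd(M)

x = np.array([1.0, 1.0])           # an arbitrary point
step1 = Vt @ x                     # 1. rotate/reflect by V*
step2 = np.diag(S) @ step1         # 2. stretch along the axes by sigma_1 and sigma_2
step3 = U @ step2                  # 3. rotate/reflect by U

assert np.allclose(step3, M @ x)   # same result as applying M directly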

Each row of the matrix Uk (the document-topic matrix) is the vector representation of the corresponding document. The length of these vectors is k, which is the number of desired topics. The vector representation of the terms in our data can be found in the matrix Vk (the term-topic matrix).
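Putting Steps 1 and 2 together, here is a minimal sketch assuming scikit-learn (the four-document corpus is invented for illustration) that produces these document and term vectors:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "I liked his last novel quite a lot",
    "We would like to go for a novel marketing campaign",
    "The marketing team launched a new campaign",
    "Her latest novel is a bestseller",
]

vectorizer = TfidfVectorizer()
A = vectorizer.fit_transform(documents)     # tf-idf document-term matrix (Step 1)

k = 2                                       # number of desired topics
svd = TruncatedSVD(n_components=k, random_state=0)
doc_vectors = svd.fit_transform(A)          # rows of Uk, scaled by the singular values
term_vectors = svd.components_.T            # rows of Vk: one k-dimensional vector per term

print(doc_vectors.shape)                    # (4, k)  document-topic representation
print(term_vectors.shape)                   # (n_terms, k)  term-topic representation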

Step – 3:

With these document vectors and term vectors, we can now easily apply measures
such as cosine similarity to evaluate:

 the similarity of different documents

 the similarity of different words

 the similarity of terms (or “queries”) and documents (which becomes useful in information retrieval, when we want to retrieve the passages most relevant to our search query), as sketched below.
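Continuing the sketch above (scikit-learn assumed; the corpus and query string are invented), cosine similarity can be applied to each of these cases:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "I liked his last novel quite a lot",
    "We would like to go for a novel marketing campaign",
    "The marketing team launched a new campaign",
    "Her latest novel is a bestseller",
]

vectorizer = TfidfVectorizer()
A = vectorizer.fit_transform(documents)
svd = TruncatedSVD(n_components=2, random_state=0)

doc_vectors = svd.fit_transform(A)               # document-topic vectors
term_vectors = svd.components_.T                 # term-topic vectors

print(cosine_similarity(doc_vectors).round(2))   # document-document similarity
print(cosine_similarity(term_vectors).round(2))  # term-term similarity

# Query-document similarity: project the query into the same k-dimensional topic space.
query = svd.transform(vectorizer.transform(["a novel marketing idea"]))
print(cosine_similarity(query, doc_vectors).round(2))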

Pros and Cons of LSA:

Latent Semantic Analysis can be very useful, as we saw above, but it does have its limitations. It is important to understand both sides of LSA so you have an idea of when to leverage it and when to try something else.

Pros:

 LSA is fast and easy to implement.
 It gives decent results, much better than a plain vector space model.

Cons:

 Since it is a linear model, it might not do well on datasets with non-linear dependencies.
 LSA assumes a Gaussian distribution of the terms in the documents, which
may not be true for all problems.
 LSA involves SVD, which is computationally intensive and hard to update as
new data comes up.

