Vector Space Modeling With TFIDF
(TF-IDF)
a) Determine the term frequency (TF) of the words in each document and generate the
term-document matrix based on TF.
b) Compute the document similarity based on cosine similarity.
c) Determine the inverse document frequency (IDF) score of each word.
d) Generate the term-document matrix based on TF-IDF.
e) Pros and cons of TF-IDF.
Dr. Abu Nowshed Chy
Assistant Professor, Dept. of CSE, CU
D1 = [1 1 1 2 2 1 0 0 0 0]
D4 = [0 1 0 2 2 0 1 0 0 0]

\[
\mathrm{CosSim}(D1, D4) = \frac{(1\times0)+(1\times1)+(1\times0)+(2\times2)+(2\times2)+(1\times0)+(0\times1)+(0\times0)+(0\times0)+(0\times0)}{\sqrt{1^2+1^2+1^2+2^2+2^2+1^2+0^2+0^2+0^2+0^2}\times\sqrt{0^2+1^2+0^2+2^2+2^2+0^2+1^2+0^2+0^2+0^2}}
= \frac{9}{\sqrt{12}\times\sqrt{10}} = 0.82
\]
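The calculation above can be checked with a few lines of Python. This is a minimal sketch; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for illustration, not part of the original notes.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)

# TF vectors of D1 and D4 as given in the example
d1 = [1, 1, 1, 2, 2, 1, 0, 0, 0, 0]
d4 = [0, 1, 0, 2, 2, 0, 1, 0, 0, 0]

sim_d1_d4 = cosine_similarity(d1, d4)
print(round(sim_d1_d4, 2))  # 0.82
```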
From the above table, we see that the TF-IDF of the term “Italian” in D1 = 0.125 × 0.6 = 0.075
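The product above can be reproduced with a short sketch. The corpus below is hypothetical (it is not the set of documents from the table), and two conventions are assumed because they are consistent with 0.125 × 0.6 but not stated explicitly here: TF is the term count divided by the document length, and IDF uses a base-10 logarithm of N / df.

```python
import math

# Hypothetical toy corpus: "italian" occurs once among the 8 tokens of D1
# and in no other document.
corpus = {
    "D1": ["italian", "chef", "cooks", "pasta", "pasta", "sauce", "sauce", "daily"],
    "D2": ["chef", "bakes", "bread"],
    "D3": ["pasta", "sauce", "recipe"],
    "D4": ["fresh", "bread", "daily"],
}

def term_frequency(term, tokens):
    # Assumed convention: count normalized by document length
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, docs):
    # Assumed convention: log10(N / df); fails if the term is in no document
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log10(len(docs) / df)

tf = term_frequency("italian", corpus["D1"])
idf = inverse_document_frequency("italian", corpus)
tf_idf = tf * idf

print(tf)                 # 0.125
print(round(idf, 2))      # 0.6
print(round(tf_idf, 3))   # 0.075
```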
\[
\mathrm{CosSim}(D2, D4) = \frac{(0.05\times0.05)}{\sqrt{0.02^2+0.05^2+0.1^2}\times\sqrt{0.05^2}}
= \frac{0.0025}{0.00568} = 0.44
\]
Cosine Distance: While cosine similarity estimates the similarity between two documents, the
cosine distance denotes the dissimilarity between them.
In the above example, the cosine similarity between documents D2 and D4 is 0.44. Hence, the
cosine distance is 1 - 0.44 = 0.56
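As a quick check of the D2 and D4 figures, the sketch below recomputes the similarity and the distance. Only the non-zero TF-IDF weights from the calculation are used; their positions in the vectors are an assumption (the shared term is taken to be the one weighted 0.05 in both documents, and the zero entries are omitted).

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Non-zero TF-IDF weights from the worked example (ordering assumed)
d2 = [0.02, 0.05, 0.10]
d4 = [0.00, 0.05, 0.00]

similarity = cosine_similarity(d2, d4)
distance = 1.0 - similarity  # cosine distance = 1 - cosine similarity

print(round(similarity, 2))  # 0.44
print(round(distance, 2))    # 0.56
```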
TF-IDF is a weighting scheme that is commonly used in information retrieval tasks. The goal is to
represent each document as a vector in a vector space, ignoring the exact ordering of the words in
the document while retaining information about the occurrences of each word.
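To illustrate what "ignoring the exact ordering" means, the following sketch builds word-count vectors for two made-up sentences that contain the same words in a different order. The helper `bag_of_words` and the sentences are hypothetical examples.

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase, split on whitespace, and count each word
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a == b)    # True: same counts, so the same vector
print(a["the"])  # 2: occurrence information is retained
```

The two sentences mean different things, yet they map to the same vector, which is the price of discarding word order.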
Pros of TF-IDF:
Keeps the scores of relevant words high
Lowers the scores of frequently occurring words
Makes it easy to estimate document similarity
Cons of TF-IDF:
Considers only words
Weak at capturing a document's topic
Weak at handling synonyms
Cannot capture semantic similarity
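The synonym weakness can be seen in a small sketch with made-up phrases. Raw term counts are used instead of TF-IDF weights for brevity; this is a simplification, but reweighting would not change the zero score, because documents with no shared terms always have a zero dot product.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # a and b map terms to weights; only shared terms contribute to the dot product
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

doc_a = Counter("cheap car repair".split())
doc_b = Counter("inexpensive automobile maintenance".split())  # same meaning, no shared words
doc_c = Counter("cheap car wash".split())                      # different service, shared words

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)

print(sim_ab)            # 0.0
print(round(sim_ac, 2))  # 0.67
```

The paraphrase scores lower than the document about a different service, since only exact word overlap is counted.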
N.B.:
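If a query term does not appear in the corpus, its document frequency is zero, which leads to a
division by zero when computing IDF using the following formula, where N is the total number of
documents and df(t) is the number of documents containing the term t.

\[
\mathrm{IDF}(t) = \log\frac{N}{\mathrm{df}(t)}
\]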
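The problem and one common workaround can be sketched as follows, assuming the unsmoothed IDF is computed as log10(N / df). The smoothed variant adds one to both the numerator and the denominator and then adds one to the result; it is shown as an illustration of a widely used fix, not necessarily the one intended in these notes. The function names are hypothetical.

```python
import math

def idf_plain(doc_freq, n_docs):
    # Unsmoothed IDF: fails when the term appears in no document
    return math.log10(n_docs / doc_freq)

def idf_smoothed(doc_freq, n_docs):
    # Add-one smoothing keeps the denominator strictly positive
    return math.log10((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 4

try:
    idf_plain(0, n_docs)
    unseen_term_failed = False
except ZeroDivisionError:
    unseen_term_failed = True

print(unseen_term_failed)                  # True
print(round(idf_smoothed(0, n_docs), 2))   # 1.7
```

With smoothing, an unseen query term receives a finite, high IDF instead of raising an error.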