
Model the Vector Space of the Documents (TF-IDF)

Sample Problem Formulation:


Consider the following four text documents:
D1: the best Italian restaurant enjoy the best pasta
D2: American restaurant enjoy the best burger
D3: Thai restaurant enjoy the best soup
D4: the best the best American restaurant

a) Determine the term frequency (TF) of the words in each document and generate the
term-document matrix based on TF.
b) Compute the document similarity based on cosine similarity.
c) Determine the inverse document frequency (IDF) score of each word.
d) Generate the term-document matrix based on TF-IDF.
e) Discuss the pros and cons of TF-IDF.

Term Frequency (TF):


Term frequency (TF) is the number of times a term appears in a document. If a term appears
frequently in a document, it is considered important for that document.

Term-Document Matrix based on TF:


Unique Word      D1   D2   D3   D4
Italian           1    0    0    0
restaurant        1    1    1    1
enjoy             1    1    1    0
the               2    1    1    2
best              2    1    1    2
pasta             1    0    0    0
American          0    1    0    1
burger            0    1    0    0
Thai              0    0    1    0
soup              0    0    1    0
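
This term-document matrix can be reproduced with a short script. Below is a minimal sketch in Python; the docs dictionary, vocab list, and term_frequencies helper are illustrative names, not part of the original exercise.

from collections import Counter

# The four documents from the problem statement.
docs = {
    "D1": "the best Italian restaurant enjoy the best pasta",
    "D2": "American restaurant enjoy the best burger",
    "D3": "Thai restaurant enjoy the best soup",
    "D4": "the best the best American restaurant",
}

# Unique word list of the documents, in the order used in the table above.
vocab = ["Italian", "restaurant", "enjoy", "the", "best",
         "pasta", "American", "burger", "Thai", "soup"]

def term_frequencies(text):
    """Raw term frequency: how many times each word occurs in the document."""
    return Counter(text.split())

# Term-document matrix based on raw TF (one row per document).
tf_matrix = {name: [term_frequencies(text)[word] for word in vocab]
             for name, text in docs.items()}

for name, row in tf_matrix.items():
    print(name, row)
# e.g. D1 [1, 1, 1, 2, 2, 1, 0, 0, 0, 0]
#      D4 [0, 1, 0, 2, 2, 0, 1, 0, 0, 0]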

D1 = [ 1 1 1 2 2 1 0 0 0 0]
D4 = [ 0 1 0 2 2 0 1 0 0 0]

CosSim(D1, D4) = [(1×0)+(1×1)+(1×0)+(2×2)+(2×2)+(1×0)+(0×1)+(0×0)+(0×0)+(0×0)]
                 / [√(1²+1²+1²+2²+2²+1²+0²+0²+0²+0²) × √(0²+1²+0²+2²+2²+0²+1²+0²+0²+0²)]
               = 9 / (√12 × √10)
               = 0.82

Cosine Similarity of Each Document with D4 (based on TF):


Doc   Document Content                                    CosSim with D4
D1    the best Italian restaurant enjoy the best pasta         0.82
D2    American restaurant enjoy the best burger                0.77
D3    Thai restaurant enjoy the best soup                      0.65
D4    the best the best American restaurant                    1.00
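
For reference, the similarities in this table can be verified with a few lines of Python. This is a sketch over the raw TF vectors listed above; the cos_sim helper is an illustrative name.

import math

def cos_sim(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Raw TF vectors over the vocabulary
# [Italian, restaurant, enjoy, the, best, pasta, American, burger, Thai, soup]
D1 = [1, 1, 1, 2, 2, 1, 0, 0, 0, 0]
D2 = [0, 1, 1, 1, 1, 0, 1, 1, 0, 0]
D3 = [0, 1, 1, 1, 1, 0, 0, 0, 1, 1]
D4 = [0, 1, 0, 2, 2, 0, 1, 0, 0, 0]

for name, vec in [("D1", D1), ("D2", D2), ("D3", D3), ("D4", D4)]:
    print(name, round(cos_sim(vec, D4), 2))
# D1 0.82, D2 0.77, D3 0.65, D4 1.0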

Normalized Term Frequency (TF):


Normalized Term Frequency (TF) = (Frequency of a term in a document) / (Total number of words in that document)

Inverse Document Frequency (IDF):


IDF = log(Total number of documents / Number of documents containing the term)

If a term appears in many documents, it is not a distinctive identifier, so it is given a low score.

Normalized Term-Document Matrix with IDF Computation:


                      Normalized TF
Term            D1      D2      D3      D4      IDF
Italian        0.125    0       0       0       0.6
restaurant     0.125   0.167   0.167   0.167    0
enjoy          0.125   0.167   0.167    0       0.12
the            0.25    0.167   0.167   0.33     0
best           0.25    0.167   0.167   0.33     0
pasta          0.125    0       0       0       0.6
American        0      0.167    0      0.167    0.3
burger          0      0.167    0       0       0.6
Thai            0       0      0.167    0       0.6
soup            0       0      0.167    0       0.6
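
The IDF column of this table can be reproduced with the small sketch below, assuming log base 10 (which matches the values shown, e.g. log10(4/1) ≈ 0.6). The doc_freq counts are read off the TF matrix above; the names are illustrative.

import math

# Document frequency of each term: in how many of the 4 documents it appears.
N_DOCS = 4
doc_freq = {"Italian": 1, "restaurant": 4, "enjoy": 3, "the": 4, "best": 4,
            "pasta": 1, "American": 2, "burger": 1, "Thai": 1, "soup": 1}

def idf(term):
    """IDF = log10(total number of documents / number of documents containing the term)."""
    return math.log10(N_DOCS / doc_freq[term])

# Normalized TF example: "the" occurs twice in D4, which contains six words.
print(round(2 / 6, 2))            # 0.33
print(round(idf("Italian"), 2))   # 0.6
print(round(idf("enjoy"), 2))     # 0.12
print(round(idf("American"), 2))  # 0.3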

From the above table, we see that the TF-IDF of the term “Italian” in D1 = 0.125 × 0.6 = 0.075

Term-Document Matrix based on Normalized TF-IDF:


Term            D1      D2      D3      D4
Italian        0.075    0       0       0
restaurant      0       0       0       0
enjoy          0.015   0.02    0.02     0
the             0       0       0       0
best            0       0       0       0
pasta          0.075    0       0       0
American        0      0.05     0      0.05
burger          0      0.1      0       0
Thai            0       0      0.1      0
soup            0       0      0.1      0
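
Each entry above is the product of a term's normalized TF in a document and that term's IDF. A self-contained sketch of that computation follows; the tfidf helper is an illustrative name, and a few cells may differ in the last digit from the hand-computed table, which multiplies already-rounded TF and IDF values.

import math
from collections import Counter

docs = {
    "D1": "the best Italian restaurant enjoy the best pasta",
    "D2": "American restaurant enjoy the best burger",
    "D3": "Thai restaurant enjoy the best soup",
    "D4": "the best the best American restaurant",
}
vocab = ["Italian", "restaurant", "enjoy", "the", "best",
         "pasta", "American", "burger", "Thai", "soup"]

def tfidf(term, text):
    """Normalized TF of the term in the document, weighted by the term's IDF (log base 10)."""
    words = text.split()
    tf = Counter(words)[term] / len(words)
    df = sum(1 for d in docs.values() if term in d.split())
    return tf * math.log10(len(docs) / df)

# Term-document matrix based on normalized TF-IDF (rounded for display).
tfidf_matrix = {name: [round(tfidf(word, text), 2) for word in vocab]
                for name, text in docs.items()}

print(tfidf_matrix["D2"])
# [0.0, 0.0, 0.02, 0.0, 0.0, 0.0, 0.05, 0.1, 0.0, 0.0]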

D2 = [ 0 0 0.02 0 0 0 0.05 0.1 0 0]


D4 = [ 0 0 0 0 0 0 0.05 0 0 0]

CosSim(D2, D4) = (0.05×0.05) / [√(0.02²+0.05²+0.1²) × √(0.05²)]
               = 0.0025 / 0.00568
               = 0.44

Cosine Similarity of Each Document with D4 (based on TF-IDF):


Doc   Document Content                                    CosSim with D4
D1    the best Italian restaurant enjoy the best pasta          0
D2    American restaurant enjoy the best burger                0.44
D3    Thai restaurant enjoy the best soup                       0
D4    the best the best American restaurant                     1

Cosine Distance: While cosine similarity estimates the similarity between two documents, the
cosine distance denotes the dissimilarity between them.
In the above example, the cosine similarity between documents D2 and D4 is 0.44. Hence, the
cosine distance will be 1 − 0.44 = 0.56
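
If SciPy is available, the cosine distance can also be obtained directly: scipy.spatial.distance.cosine returns 1 minus the cosine similarity. A minimal sketch using the TF-IDF vectors of D2 and D4 listed above:

from scipy.spatial.distance import cosine  # cosine() returns the cosine *distance*

# TF-IDF vectors of D2 and D4 from the table above.
D2 = [0, 0, 0.02, 0, 0, 0, 0.05, 0.1, 0, 0]
D4 = [0, 0, 0,    0, 0, 0, 0.05, 0,   0, 0]

print(round(cosine(D2, D4), 2))      # 0.56  (cosine distance)
print(round(1 - cosine(D2, D4), 2))  # 0.44  (cosine similarity)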

TF-IDF is a weighting scheme that is commonly used in information retrieval tasks. The goal is to
model each document as a vector in a vector space, ignoring the exact ordering of the words in the
document while retaining information about how often each word occurs.

Pros of TF-IDF:
 Keeps the scores of relevant (distinctive) words high
 Lowers the scores of frequent words
 Makes it easy to estimate document similarity

Cons of TF-IDF:
 Considers only individual words
 Weak at capturing document topics
 Weak at handling synonyms
 Cannot capture semantic similarity

N.B.:
If a query term does not appear in the corpus, the denominator becomes zero, leading to a
division-by-zero while computing IDF using the following formula:

IDF = log(Total number of documents / Number of documents containing the term)

It is therefore common to adjust the denominator by adding a positive constant, 1:

IDF = log(Total number of documents / (1 + Number of documents containing the term))
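
A minimal sketch of the smoothed formula, applied to the four-document corpus of this example; the idf_smooth name and the unseen test term "pizza" are illustrative.

import math

docs = {
    "D1": "the best Italian restaurant enjoy the best pasta",
    "D2": "American restaurant enjoy the best burger",
    "D3": "Thai restaurant enjoy the best soup",
    "D4": "the best the best American restaurant",
}

def idf_smooth(term):
    """IDF with add-one smoothing in the denominator, so an unseen query term
    never causes a division by zero."""
    df = sum(1 for text in docs.values() if term in text.split())
    return math.log10(len(docs) / (1 + df))

print(round(idf_smooth("pizza"), 2))    # 0.6 -- unseen term, no crash (df = 0)
print(round(idf_smooth("Italian"), 2))  # 0.3 -- log10(4 / (1 + 1))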

Dr. Abu Nowshed Chy
Assistant Professor, Dept. of CSE, CU
