
Model the Vector Space of the Documents (TF-IDF)

Sample Problem Formulation:


Consider the following four text documents:
D1: the best Italian restaurant enjoy the best pasta
D2: American restaurant enjoy the best burger
D3: Thai restaurant enjoy the best soup
D4: the best the best American restaurant

a) Determine the term frequency (TF) of the words in each document and generate the
term-document matrix based on TF.
b) Compute the document similarity based on cosine similarity.
c) Determine the inverse document frequency (IDF) score of each word.
d) Generate the term-document matrix based on TF-IDF.
e) Discuss the pros and cons of TF-IDF.

Term Frequency (TF):


Term frequency (TF) is the number of times a term appears in a document. If a term appears
frequently in a document, it is considered important for that document.

Term-Document Matrix based on TF:


Unique Word      D1   D2   D3   D4
Italian           1    0    0    0
restaurant        1    1    1    1
enjoy             1    1    1    0
the               2    1    1    2
best              2    1    1    2
pasta             1    0    0    0
American          0    1    0    1
burger            0    1    0    0
Thai              0    0    1    0
soup              0    0    1    0
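
This term-document matrix can be reproduced with a short script. Below is a minimal sketch in Python; the docs dictionary, vocab list, and term_frequencies helper are illustrative names, not part of the original exercise.

from collections import Counter

# The four documents from the problem statement.
docs = {
    "D1": "the best Italian restaurant enjoy the best pasta",
    "D2": "American restaurant enjoy the best burger",
    "D3": "Thai restaurant enjoy the best soup",
    "D4": "the best the best American restaurant",
}

# Unique word list of the documents, in the order used in the table above.
vocab = ["Italian", "restaurant", "enjoy", "the", "best",
         "pasta", "American", "burger", "Thai", "soup"]

def term_frequencies(text):
    """Raw term frequency: how many times each word occurs in the document."""
    return Counter(text.split())

# Term-document matrix based on raw TF (one row per document).
tf_matrix = {name: [term_frequencies(text)[word] for word in vocab]
             for name, text in docs.items()}

for name, row in tf_matrix.items():
    print(name, row)
# e.g. D1 [1, 1, 1, 2, 2, 1, 0, 0, 0, 0]
#      D4 [0, 1, 0, 2, 2, 0, 1, 0, 0, 0]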

D1 = [ 1 1 1 2 2 1 0 0 0 0]
D4 = [ 0 1 0 2 2 0 1 0 0 0]

CosSim(D1, D4) = [(1×0)+(1×1)+(1×0)+(2×2)+(2×2)+(1×0)+(0×1)+(0×0)+(0×0)+(0×0)]
                 / [√(1²+1²+1²+2²+2²+1²+0²+0²+0²+0²) × √(0²+1²+0²+2²+2²+0²+1²+0²+0²+0²)]
               = 9 / (√12 × √10)
               = 0.82

Cosine Similarity of Each Document with D4 (based on TF):


Doc   Document Content                                    CosSim with D4
D1    the best Italian restaurant enjoy the best pasta         0.82
D2    American restaurant enjoy the best burger                0.77
D3    Thai restaurant enjoy the best soup                      0.65
D4    the best the best American restaurant                    1.00
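
For reference, the similarities in this table can be verified with a few lines of Python. This is a sketch over the raw TF vectors listed above; the cos_sim helper is an illustrative name.

import math

def cos_sim(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Raw TF vectors over the vocabulary
# [Italian, restaurant, enjoy, the, best, pasta, American, burger, Thai, soup]
D1 = [1, 1, 1, 2, 2, 1, 0, 0, 0, 0]
D2 = [0, 1, 1, 1, 1, 0, 1, 1, 0, 0]
D3 = [0, 1, 1, 1, 1, 0, 0, 0, 1, 1]
D4 = [0, 1, 0, 2, 2, 0, 1, 0, 0, 0]

for name, vec in [("D1", D1), ("D2", D2), ("D3", D3), ("D4", D4)]:
    print(name, round(cos_sim(vec, D4), 2))
# D1 0.82, D2 0.77, D3 0.65, D4 1.0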

Normalized Term Frequency (TF):


Normalized Term Frequency (TF) = (Frequency of a term in a document) / (Total number of words in that document)

Inverse Document Frequency (IDF):


IDF = log(Total number of documents / Number of documents containing the term)

If a term appears in many documents, it is not a distinctive identifier, so it is given a low score.

Normalized Term-Document Matrix with IDF Computation:


                      Normalized TF
Term            D1      D2      D3      D4      IDF
Italian        0.125    0       0       0       0.6
restaurant     0.125   0.167   0.167   0.167    0
enjoy          0.125   0.167   0.167    0       0.12
the            0.25    0.167   0.167   0.33     0
best           0.25    0.167   0.167   0.33     0
pasta          0.125    0       0       0       0.6
American        0      0.167    0      0.167    0.3
burger          0      0.167    0       0       0.6
Thai            0       0      0.167    0       0.6
soup            0       0      0.167    0       0.6
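
The IDF column of this table can be reproduced with the small sketch below, assuming log base 10 (which matches the values shown, e.g. log10(4/1) ≈ 0.6). The doc_freq counts are read off the TF matrix above; the names are illustrative.

import math

# Document frequency of each term: in how many of the 4 documents it appears.
N_DOCS = 4
doc_freq = {"Italian": 1, "restaurant": 4, "enjoy": 3, "the": 4, "best": 4,
            "pasta": 1, "American": 2, "burger": 1, "Thai": 1, "soup": 1}

def idf(term):
    """IDF = log10(total number of documents / number of documents containing the term)."""
    return math.log10(N_DOCS / doc_freq[term])

# Normalized TF example: "the" occurs twice in D4, which contains six words.
print(round(2 / 6, 2))            # 0.33
print(round(idf("Italian"), 2))   # 0.6
print(round(idf("enjoy"), 2))     # 0.12
print(round(idf("American"), 2))  # 0.3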

From the above table, we see that the TF-IDF of the term “Italian” in D1 = 0.125 × 0.6 = 0.075

Term-Document Matrix based on Normalized TF-IDF:


Term            D1      D2      D3      D4
Italian        0.075    0       0       0
restaurant      0       0       0       0
enjoy          0.015   0.02    0.02     0
the             0       0       0       0
best            0       0       0       0
pasta          0.075    0       0       0
American        0      0.05     0      0.05
burger          0      0.1      0       0
Thai            0       0      0.1      0
soup            0       0      0.1      0
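
Each entry above is the product of a term's normalized TF in a document and that term's IDF. A self-contained sketch of that computation follows; the tfidf helper is an illustrative name, and a few cells may differ in the last digit from the hand-computed table, which multiplies already-rounded TF and IDF values.

import math
from collections import Counter

docs = {
    "D1": "the best Italian restaurant enjoy the best pasta",
    "D2": "American restaurant enjoy the best burger",
    "D3": "Thai restaurant enjoy the best soup",
    "D4": "the best the best American restaurant",
}
vocab = ["Italian", "restaurant", "enjoy", "the", "best",
         "pasta", "American", "burger", "Thai", "soup"]

def tfidf(term, text):
    """Normalized TF of the term in the document, weighted by the term's IDF (log base 10)."""
    words = text.split()
    tf = Counter(words)[term] / len(words)
    df = sum(1 for d in docs.values() if term in d.split())
    return tf * math.log10(len(docs) / df)

# Term-document matrix based on normalized TF-IDF (rounded for display).
tfidf_matrix = {name: [round(tfidf(word, text), 2) for word in vocab]
                for name, text in docs.items()}

print(tfidf_matrix["D2"])
# [0.0, 0.0, 0.02, 0.0, 0.0, 0.0, 0.05, 0.1, 0.0, 0.0]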

D2 = [ 0 0 0.02 0 0 0 0.05 0.1 0 0]


D4 = [ 0 0 0 0 0 0 0.05 0 0 0]

CosSim(D2, D4) = (0.05×0.05) / [√(0.02²+0.05²+0.1²) × √(0.05²)]
               = 0.0025 / 0.00568
               = 0.44

Cosine Similarity of Each Document with D4 (based on TF-IDF):


Doc   Document Content                                    CosSim with D4
D1    the best Italian restaurant enjoy the best pasta          0
D2    American restaurant enjoy the best burger                0.44
D3    Thai restaurant enjoy the best soup                       0
D4    the best the best American restaurant                     1

Cosine Distance: While cosine similarity estimates the similarity between two documents, the
cosine distance denotes the dissimilarity between them.
In the above example, the cosine similarity between documents D2 and D4 is 0.44. Hence, the
cosine distance will be 1 − 0.44 = 0.56
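
If SciPy is available, the cosine distance can also be obtained directly: scipy.spatial.distance.cosine returns 1 minus the cosine similarity. A minimal sketch using the TF-IDF vectors of D2 and D4 listed above:

from scipy.spatial.distance import cosine  # cosine() returns the cosine *distance*

# TF-IDF vectors of D2 and D4 from the table above.
D2 = [0, 0, 0.02, 0, 0, 0, 0.05, 0.1, 0, 0]
D4 = [0, 0, 0,    0, 0, 0, 0.05, 0,   0, 0]

print(round(cosine(D2, D4), 2))      # 0.56  (cosine distance)
print(round(1 - cosine(D2, D4), 2))  # 0.44  (cosine similarity)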

TF-IDF is a weighting scheme that is commonly used in information retrieval tasks. The goal is to
model each document as a vector in a vector space, ignoring the exact ordering of the words in the
document while retaining information about how often each word occurs.

Pros of TF-IDF:
 Keeps the scores of relevant (distinctive) words high
 Lowers the scores of frequent words
 Makes it easy to estimate document similarity

Cons of TF-IDF:
 Considers only individual words
 Weak at capturing document topics
 Weak at handling synonyms
 Cannot capture semantic similarity

N.B.:
If a query term does not appear in the corpus, the denominator becomes zero, leading to a
division-by-zero while computing IDF using the following formula:

IDF = log(Total number of documents / Number of documents containing the term)

It is therefore common to adjust the denominator by adding a positive constant, 1:

IDF = log(Total number of documents / (1 + Number of documents containing the term))
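
A minimal sketch of the smoothed formula, applied to the four-document corpus of this example; the idf_smooth name and the unseen test term "pizza" are illustrative.

import math

docs = {
    "D1": "the best Italian restaurant enjoy the best pasta",
    "D2": "American restaurant enjoy the best burger",
    "D3": "Thai restaurant enjoy the best soup",
    "D4": "the best the best American restaurant",
}

def idf_smooth(term):
    """IDF with add-one smoothing in the denominator, so an unseen query term
    never causes a division by zero."""
    df = sum(1 for text in docs.values() if term in text.split())
    return math.log10(len(docs) / (1 + df))

print(round(idf_smooth("pizza"), 2))    # 0.6 -- unseen term, no crash (df = 0)
print(round(idf_smooth("Italian"), 2))  # 0.3 -- log10(4 / (1 + 1))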

Dr. Abu Nowshed Chy
Assistant Professor, Dept. of CSE, CU
