Doceng2012 Submission 19
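\[
Sim_{Pearson}(A, B) = \frac{n \sum a_i b_i - \sum a_i \sum b_i}{\sqrt{\left[ n \sum a_i^2 - \left( \sum a_i \right)^2 \right] \left[ n \sum b_i^2 - \left( \sum b_i \right)^2 \right]}} \qquad (6)
\]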
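As an illustration, Eq. (6) can be evaluated directly from the running sums it contains. The following is a minimal Python sketch, not the implementation used in the experiments; the function name `pearson_similarity` and the choice to return 0.0 when the denominator vanishes (a constant vector) are assumptions made for this example.

```python
import math


def pearson_similarity(a, b):
    """Pearson similarity of two equal-length feature vectors, following Eq. (6)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of features")
    n = len(a)
    sum_a = sum(a)
    sum_b = sum(b)
    sum_ab = sum(x * y for x, y in zip(a, b))
    sum_a2 = sum(x * x for x in a)
    sum_b2 = sum(y * y for y in b)

    numerator = n * sum_ab - sum_a * sum_b
    denominator = math.sqrt(
        (n * sum_a2 - sum_a ** 2) * (n * sum_b2 - sum_b ** 2)
    )
    if denominator == 0:
        # Assumption for this sketch: a constant vector has no defined
        # correlation, so the similarity is reported as 0.0.
        return 0.0
    return numerator / denominator
```

For example, `pearson_similarity([1, 2, 3], [2, 4, 6])` evaluates to 1.0, since the second vector is a positive multiple of the first.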
2.2 Similarity Measures
Many similarity measures have been used in both document
classification and document clustering [13] to estimate the
similarity between a document and a class prototype. In the VSM,
this similarity is computed by comparing a document vector with the
vector representing a class, or centroid. The five similarity
measures used in the Rocchio experiments that follow (Cosine,
Jaccard, Pearson, Kullback-Leibler, and Levenshtein) are introduced
next.

2.2.1 Cosine similarity measure
Cosine is the most popular similarity measure and is widely used in
information retrieval, document clustering, and document
classification research.
Given two vectors $A(a_1, a_2, \dots, a_n)$ and $B(b_1, b_2, \dots, b_n)$, the
similarity between these vectors is estimated by the cosine of the
angle between them:
\[
Sim_{Cosine}(A, B) = \frac{A \cdot B}{|A| \, |B|} \qquad (4)
\]
Where:
$A \cdot B = \sum_i a_i b_i$
$|A|^2 = \sum_i a_i^2$
$i \in [1, n]$; $n$: the number of features in the vector space.
In systems using this similarity measure, changing a document's
length has no influence on the result, because the angle between
the vectors stays the same.

2.2.4 Kullback-Leibler similarity measure
In probability and information theory, the Kullback-Leibler
divergence measures the dissimilarity between two probability
distributions. In the particular case of text processing, it
measures the divergence between the feature distributions of
documents. Given the vector representations of two feature
distributions, $A(a_1, a_2, \dots, a_n)$ and $B(b_1, b_2, \dots, b_n)$, the
divergence, which is also used to compute similarities, is
calculated as follows:
\[
Sim_{Kullback} = D_{AvgKL}(\vec{t_a} \,\|\, \vec{t_b}) = \sum_{t=1}^{n} \left( \pi_1 \, D(w_{t,a} \,\|\, w_t) + \pi_2 \, D(w_{t,b} \,\|\, w_t) \right) \qquad (7)
\]
Where:
\[
\pi_1 = \frac{w_{t,a}}{w_{t,a} + w_{t,b}}, \qquad
\pi_2 = \frac{w_{t,b}}{w_{t,a} + w_{t,b}}, \qquad
w_t = \pi_1 \, w_{t,a} + \pi_2 \, w_{t,b}
\]

2.2.5 Levenshtein similarity measure
The Levenshtein distance is used to compare two strings. A possible
extension to vector comparison is given by the following equation:
\[
Sim_{Levenshtein}(A, B) = 1 - \frac{Distance}{Max} \qquad (8)
\]
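Eq. (8) does not spell out how Distance and Max are obtained for vectors. The Python sketch below shows one possible reading, under two assumptions that are not stated in the text: Distance is taken to be the standard edit distance between the two sequences, with each vector component treated like a character, and Max is taken to be the length of the longer sequence. The function names are hypothetical.

```python
def levenshtein_distance(a, b):
    """Standard edit distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Similarity in [0, 1] following Eq. (8): 1 - Distance / Max."""
    # Assumption: Max is the length of the longer sequence, which is
    # also the largest value the edit distance can take.
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

With this reading, identical vectors score 1.0 and two vectors that differ in every position score 0.0. Because components are compared for exact equality, real-valued weights would likely need to be discretised first, which is a design choice left open here.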