Professional Documents
Culture Documents
Introduction To Text Categorization: Dr. Hatem 2020 KNN
Introduction To Text Categorization: Dr. Hatem 2020 KNN
Categorization
For example...
Rule-based Approach to TC
Text in a Web Page
“Saeco revolutionized espresso brewing a decade ago by introducing
Saeco SuperAutomatic machines, which go from bean to coffee at the
touch of a button. The all-new Saeco Vienna Super-Automatic home
coffee and cappucino machine combines top quality with low price!”
Rules
◼ Rule 1.
(espresso or coffee or cappucino ) and machine* Coffee Maker
◼ Rule 2.
automat* and answering and machine* Phone
◼ Rule ...
Defining Rules By Hand
◼ too difficult
For example, the features could be the set of all words and the
values, their number of occurrences in a particular document.
Farmland:1
Firm:1
Japan Firm Japan:1
Plans to Sell Japanese:1
Representation
U.S. Farmland Plans:1
to Japanese Sell:1
To:2
U.S.:1
Approaches to Text Categorization
(emphasizing the “k-Nearest Neighbor approach”)
1-Nearest Neighbor
Looking back at our example
Senate
Panel
◼ Did anyone try to find the Studies
most similar labeled item Loan Rate,
Set Aside
and then just guess the Plans
same color?
Citibank Japan Ministry Vieille
Amatil Jardine Matheson
Norway Unit Says Open Montagne Said It Sets Two-
Proposes Two-
Loses Six Mln Farm Trade Says 1986 for-Five Bonus
for-Five Bonus Issue Replacing “B”
Crowns in Would Hit Conditions
Anheuser-
1986
Italy’s La
U.S.
Isuzu Plans No
Unfavourable
Senator
Defends U.S.
Shares
Bowater
Neighbor
Key Components of Nearest Neighbor
Cosine similarity
cos( x , y ) =
xy i i i
x
i
2
i i
yi2
Euclidean distance