Wse Homework - Semantic Web 2 PDF
2. How many unique tokens (unique words and punctuation) does the text have?
13,953 unique tokens.
3. After lemmatizing the words, how many unique tokens does the text have?
11,270 unique lemmas (found in WordNet).
4. What are the 20 most frequently occurring (unique) tokens in the text? What are their frequencies?
[(',', 18109), ('the', 10664), ('and', 8347), ('to', 7187), ('of', 6859), ('that', 4165), ('in', 3659), ('a',
3400), ('he', 3155), ('I', 3086), ('.', 2904), ('it', 2844), (';', 2646), ('his', 2561), ('for', 2518), ('“',
2299), ('as', 2260), ('”', 2251), ('was', 2098), ('not', 1985)]
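A ranked list of this form comes straight out of NLTK's `FreqDist`. A minimal sketch on a toy token list (standing in for the book's full token stream):

```python
from nltk import FreqDist

tokens = ["the", "cat", "sat", "on", "the", "mat", ",", "the", "end", "."]
freq = FreqDist(tokens)
print(freq.most_common(3))   # top-3 (token, count) pairs
```

Running `freq.most_common(20)` over the book's tokens produces the list above.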
Word frequencies can also be used to learn more about the contents of a document, such as the book you are analysing. The idea is that the most frequent words should characterize what the book is about. A nice way to illustrate this is with a word cloud. Please provide a screenshot of the word cloud for each of the following settings:
2. No stopwords: You probably noticed that the most frequent words were mostly uninformative. To address this problem, a typical approach is to remove so-called stopwords, which carry little meaning.
3. NER: To filter further and keep only the named entities in the text, first extract the entities using the NLTK library, then build the word cloud from them.
Héctor Bállega Fernández
Web Science and Engineering