Héctor Bállega Fernández

Web Science and Engineering

WSE HOMEWORK - SEMANTIC WEB 2


Section 1 (basic analysis)

The selected book is Don Quixote by Miguel de Cervantes Saavedra:


http://www.gutenberg.org/ebooks/996

1. How many tokens (words and punctuation symbols) are in the text?


250,817 tokens.
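One way this count could have been obtained with NLTK is sketched below; the filename don_quixote.txt is a placeholder for the plain-text edition downloaded from the link above, and the Project Gutenberg header/footer would still need to be stripped before counting:

import nltk

nltk.download("punkt")  # tokenizer model used by word_tokenize

# Read the plain-text edition of the book.
with open("don_quixote.txt", encoding="utf-8") as f:
    text = f.read()

tokens = nltk.word_tokenize(text)
print(len(tokens))  # total number of word and punctuation tokens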

2. How many unique tokens (unique words and punctuation) does the text have?
13,953 unique tokens.
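Reusing the tokens list from the sketch above, the unique-token count is simply the size of the corresponding set:

unique_tokens = set(tokens)
print(len(unique_tokens))  # number of distinct word and punctuation tokens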

3. After lemmatizing the words, how many unique tokens does the text have?
11,270 lemmas (found in WordNet).
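A possible reading of this step uses NLTK's WordNet lemmatizer and keeps only lemmas that WordNet knows about; the exact normalisation in the original notebook (lower-casing, POS handling) is not shown, so this is only an assumption:

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

# Lemmatize every token and keep the lemmas that have a WordNet entry,
# matching the "(found in WordNet)" remark above.
lemmas = {lemmatizer.lemmatize(tok.lower()) for tok in tokens}
print(len({lemma for lemma in lemmas if wordnet.synsets(lemma)}))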

4. What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?
[(',', 18109), ('the', 10664), ('and', 8347), ('to', 7187), ('of', 6859),
 ('that', 4165), ('in', 3659), ('a', 3400), ('he', 3155), ('I', 3086),
 ('.', 2904), ('it', 2844), (';', 2646), ('his', 2561), ('for', 2518),
 ('“', 2299), ('as', 2260), ('”', 2251), ('was', 2098), ('not', 1985)]
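Such a top-20 list can be produced with an NLTK frequency distribution over the same tokens list:

from nltk import FreqDist

fdist = FreqDist(tokens)
print(fdist.most_common(20))  # (token, count) pairs, most frequent first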

Section 2 (Word frequency)

Word frequencies can also be used to learn more about the contents of a document, such as the book you are analysing. The idea is that the most frequent words should characterize what the book is about. A nice way to illustrate this is via a word cloud; please provide a screenshot of the word cloud for each of the following settings:

1. No Filter: Consider all the words



2. No Stopwords: Remove stopwords. You probably noticed that the most important
words were mostly uninformative. To address this problem, a typical approach is to
remove so-called stopwords, which don't carry a lot of meaning.

3. NER: To filter the words further and keep only the named entities in the text, first
extract the entities using the NLTK library and then generate the word cloud, as sketched below.
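The three clouds could be produced along the following lines, using the wordcloud package together with NLTK's stopword list and named-entity chunker (ne_chunk over POS-tagged sentences). This reuses the text and tokens variables from the Section 1 sketches, and the plotting details are assumptions rather than the exact notebook settings:

import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download(["punkt", "stopwords", "averaged_perceptron_tagger",
               "maxent_ne_chunker", "words"])

def show_cloud(words, title):
    # Build and display a word cloud from a list of words.
    cloud = WordCloud(width=800, height=400).generate(" ".join(words))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()

# 1. No filter: every alphabetic token.
word_tokens = [t for t in tokens if t.isalpha()]
show_cloud(word_tokens, "All words")

# 2. No stopwords: drop common function words first.
stops = set(stopwords.words("english"))
show_cloud([t for t in word_tokens if t.lower() not in stops], "Without stopwords")

# 3. NER: keep only the named entities found by NLTK's chunker.
entities = []
for sent in nltk.sent_tokenize(text):
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
    for subtree in tree:
        if hasattr(subtree, "label"):  # an entity chunk such as PERSON or GPE
            entities.append(" ".join(word for word, _ in subtree.leaves()))
show_cloud(entities, "Named entities")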

Section 3 (Word embedding)


Assume we want to find books similar to the one we chose. For this, please perform the
following steps:

- pick at least 7 additional books from Project Gutenberg

- extract their named entities

- measure the semantic similarity (using word embeddings) between the additional selected books and the initial book, based on their extracted entities

- rank the additional books by their similarity to the initial one (descending order). In this way you are able to find the book most similar to the initial one. (List the book names and their similarity to the initial book; a sketch of these steps follows below.)
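One way these steps could be carried out is sketched below, using spaCy's medium English model (en_core_web_md) both for entity extraction and for its bundled word vectors. The file names, book titles, and the document-level similarity measure are placeholders and may differ from the notebook shown in the screenshot:

import spacy

# Assumes `python -m spacy download en_core_web_md` has been run;
# the medium model ships with word vectors.
nlp = spacy.load("en_core_web_md")

def entity_text(path):
    # Return the named entities of a book as one space-separated string.
    with open(path, encoding="utf-8") as f:
        doc = nlp(f.read()[:1_000_000])  # stay under spaCy's default max_length
    return " ".join(ent.text for ent in doc.ents)

# Placeholder file names; replace with the actual Gutenberg downloads.
initial = nlp(entity_text("don_quixote.txt"))
others = {
    "Book A": "book_a.txt",
    "Book B": "book_b.txt",
    # ... at least seven additional books in total
}

# Similarity between the entity "documents" (average word vectors),
# ranked from most to least similar to the initial book.
ranking = sorted(
    ((title, initial.similarity(nlp(entity_text(path))))
     for title, path in others.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for title, score in ranking:
    print(f"{title}: {score:.3f}")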

Solution (screenshot from notebook):
