To begin (this assumes NLTK is installed and its book collection has been downloaded, e.g. via nltk.download('book')):

from nltk.book import *

Now we have access to new variables:

text1, […], text9 and sent1, […], sent9

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G. K. Chesterton 1908

Using the data above, complete the following tasks:


1. Make a list of all four-letter words from text1. How many are there?
2. In text1, find all words longer than 17 letters. How many are there?
3. Using the built-in functions set() and sorted(), create a dictionary (vocabulary) for each sentence
(sent1, […], sent9) and a joint dictionary for all the sentences.
4. Define a vocab_size() function which, for a given text, returns the size of its
dictionary (i.e., the number of unique words). How many are there in each
book?
5. Print the 10 most commonly occurring words in text1.
6. Check which words are the longest in each of the texts.
7. Check how many unique bigrams there are in text5. For the 10 most common, return
their joint number of occurrences and compare it with that of the 10 most commonly
occurring words.
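Tasks 1–4 come down to list comprehensions over a token sequence plus set() and sorted(). A minimal sketch follows, using a small hypothetical sample_tokens list as a stand-in for the NLTK texts (with nltk.book loaded, you would pass text1, sent1, and so on instead):

```python
# sample_tokens is a hypothetical stand-in for an NLTK text,
# which behaves like a sequence of word tokens.
sample_tokens = ["the", "whale", "hunt", "was", "long", "and", "the",
                 "ship", "sailed", "over", "deep", "dark", "seas"]

# Task 1: all four-letter words (duplicates removed with set()).
four_letter = sorted(set(w for w in sample_tokens if len(w) == 4))

# Task 2: words longer than a given threshold (17 for text1;
# 4 here so the tiny sample produces matches).
long_words = [w for w in sample_tokens if len(w) > 4]

# Task 3: a sorted vocabulary ("dictionary") built with set() and sorted().
vocab = sorted(set(sample_tokens))

# Task 4: vocabulary size = number of unique words.
def vocab_size(text):
    """Return the number of unique tokens in a text."""
    return len(set(text))

print(four_letter)              # ['dark', 'deep', 'hunt', 'long', 'over', 'seas', 'ship']
print(vocab_size(sample_tokens))  # 12
```

For the joint dictionary of task 3, the per-sentence sets can simply be unioned before sorting.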
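For tasks 5–7, NLTK provides FreqDist and nltk.bigrams directly; the same counting can be sketched with only the standard library's collections.Counter. Again, sample_tokens is a hypothetical stand-in for text1 or text5:

```python
from collections import Counter

# Hypothetical stand-in for an NLTK text's token sequence.
sample_tokens = ["the", "cat", "sat", "on", "the", "mat", "and",
                 "the", "cat", "ran"]

# Task 5: most common words (FreqDist(text1).most_common(10) in NLTK;
# 3 here to fit the tiny sample).
word_freq = Counter(sample_tokens)
top_words = word_freq.most_common(3)

# Task 6: the longest word(s) in a text.
max_len = max(len(w) for w in sample_tokens)
longest = sorted(w for w in set(sample_tokens) if len(w) == max_len)

# Task 7: unique bigrams and the joint count of the most common ones
# (nltk.bigrams(text5) yields the same adjacent pairs for a real text).
bigrams = list(zip(sample_tokens, sample_tokens[1:]))
bigram_freq = Counter(bigrams)
unique_bigrams = len(bigram_freq)
top_bigram_total = sum(count for _, count in bigram_freq.most_common(3))

print(top_words[0])                       # ('the', 3)
print(unique_bigrams, top_bigram_total)   # 8 4
```

The comparison asked for in task 7 is then just the analogous sum over word_freq.most_common(10) against the bigram total.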
