NLP Exercises

Exercises

1. Write a program to find all words that occur at least three times in the Brown Corpus.

2. Write a program to generate a table of lexical diversity scores (i.e., token/type ratios), as in
Table 1-1. Include the full set of Brown Corpus genres (nltk.corpus.brown.categories()). Which
genre has the lowest diversity (greatest number of tokens per type)? Is this what you would have
expected?
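
As a starting point for exercise 2, here is a minimal sketch of the token/type ratio and a small table built from it. It uses only the standard library and a toy dictionary of tokens (hypothetical data). With NLTK and the Brown Corpus installed, the dictionary could instead be built from nltk.corpus.brown.words(categories=genre) for each genre. The function names are illustrative, not taken from the book.

```python
def lexical_diversity(tokens):
    """Return the token/type ratio: total tokens divided by distinct tokens."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return len(tokens) / len(set(tokens))


def diversity_table(tokens_by_genre):
    """Return (genre, tokens, types, ratio) rows, highest ratio first."""
    rows = []
    for genre, tokens in tokens_by_genre.items():
        tokens = list(tokens)
        rows.append((genre, len(tokens), len(set(tokens)), lexical_diversity(tokens)))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Toy stand-in data (hypothetical); replace with real corpus tokens per genre.
toy = {
    "news": "the cat sat on the mat the end".split(),
    "fiction": "a quick brown fox".split(),
}

for genre, n_tokens, n_types, ratio in diversity_table(toy):
    print(f"{genre:<10}{n_tokens:>8}{n_types:>8}{ratio:>8.2f}")
```

Sorting by ratio puts the genre with the most tokens per type (the lowest diversity) in the first row.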

3. Write a function that finds the 50 most frequently occurring words of a text that are not
stopwords.
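
One possible shape for exercise 3 is sketched below. It assumes a tiny hand-written stopword set (illustrative only). With NLTK data installed, nltk.corpus.stopwords.words('english') gives a fuller list. Lowercasing and dropping non-alphabetic tokens are design choices of this sketch, not requirements of the exercise.

```python
from collections import Counter

# A tiny illustrative stopword set (an assumption of this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on"}


def top_content_words(tokens, n=50, stopwords=STOPWORDS):
    """Return the n most frequent alphabetic tokens that are not stopwords."""
    words = (t.lower() for t in tokens)
    counts = Counter(w for w in words if w.isalpha() and w not in stopwords)
    return counts.most_common(n)


sample = "The dog and the cat saw the dog .".split()
print(top_content_words(sample, n=3))
```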

4. Write a program to print the 50 most frequent bigrams (pairs of adjacent words) of a text,
omitting bigrams that contain stopwords.
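
For exercise 4, a minimal sketch might pair each token with its right-hand neighbour and then discard pairs that contain a stopword. The stopword set here is again a small hypothetical stand-in. Note that the pairs are formed before filtering, so removing a stopword never makes two non-adjacent words look adjacent.

```python
from collections import Counter

# A tiny illustrative stopword set (an assumption of this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on"}


def top_bigrams(tokens, n=50, stopwords=STOPWORDS):
    """Return the n most frequent adjacent word pairs with no stopword in them."""
    words = [t.lower() for t in tokens]
    pairs = zip(words, words[1:])  # adjacency is taken from the original text
    counts = Counter(
        (w1, w2)
        for w1, w2 in pairs
        if w1 not in stopwords and w2 not in stopwords
    )
    return counts.most_common(n)


sample = "New York is big and New York is old".split()
print(top_bigrams(sample))
```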

5. Write a function word_freq() that takes a word and the name of a section of the Brown Corpus
as arguments, and computes the frequency of the word in that section of the corpus.

6. Describe the class of strings matched by the following regular expressions:

a. [a-zA-Z]+

b. [A-Z][a-z]*

c. p[aeiou]{,2}t

d. \d+(\.\d+)?

e. ([^aeiou][aeiou][^aeiou])*

f. \w+|[^\w\s]+

Test your answers using nltk.re_show().
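
If NLTK is not at hand, the patterns in exercise 6 can also be probed with Python's standard re module. The sketch below checks whether a whole string is matched by each pattern. The sample strings and the helper name fully_matches are assumptions chosen for illustration. Unlike nltk.re_show(), which marks matches inside a longer string, this only tests complete strings.

```python
import re

# The six patterns from the exercise, keyed by their letters.
PATTERNS = {
    "a": r"[a-zA-Z]+",
    "b": r"[A-Z][a-z]*",
    "c": r"p[aeiou]{,2}t",
    "d": r"\d+(\.\d+)?",
    "e": r"([^aeiou][aeiou][^aeiou])*",
    "f": r"\w+|[^\w\s]+",
}


def fully_matches(label, text):
    """Return True if the labelled pattern matches the entire string."""
    return re.fullmatch(PATTERNS[label], text) is not None


# Illustrative probe strings (hypothetical); extend with your own guesses.
SAMPLES = ["Hello", "hello", "pt", "pout", "3.14", "42", "bandit", "!!!", ""]

for label in sorted(PATTERNS):
    hits = [s for s in SAMPLES if fully_matches(label, s)]
    print(label, PATTERNS[label], hits)
```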

7. Download some text from a language that has vowel harmony (e.g., Hungarian), extract the
vowel sequences of words, and create a vowel bigram table.
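
A rough sketch of the counting step in exercise 7 follows. It assumes the Hungarian vowel inventory listed in VOWELS and a handful of sample words typed in by hand. Downloading and tokenizing real text is left out. The function names are illustrative.

```python
import re
import unicodedata
from collections import Counter

# Hungarian vowel letters, short and long (an assumption of this sketch).
VOWELS = "aáeéiíoóöőuúüű"


def vowel_sequence(word):
    """Return the vowels of a word in order, with consonants dropped."""
    # Normalize so accented vowels are single precomposed characters.
    word = unicodedata.normalize("NFC", word.lower())
    return re.findall(f"[{VOWELS}]", word)


def vowel_bigram_counts(words):
    """Count adjacent vowel pairs within each word's vowel sequence."""
    counts = Counter()
    for word in words:
        seq = vowel_sequence(word)
        counts.update(zip(seq, seq[1:]))
    return counts


def print_table(counts):
    """Print first vowels as rows and second vowels as columns."""
    rows = sorted({v1 for v1, _ in counts})
    cols = sorted({v2 for _, v2 in counts})
    print("   " + " ".join(f"{c:>2}" for c in cols))
    for r in rows:
        print(f"{r:>2} " + " ".join(f"{counts[(r, c)]:>2}" for c in cols))


# A few hand-picked sample words (hypothetical input, not a real corpus).
words = ["város", "ember", "török", "köszönöm", "asztal"]
print_table(vowel_bigram_counts(words))
```

In a language with vowel harmony, most of the mass in such a table would be expected to fall in cells where both vowels belong to the same harmony class.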

8. Write code to convert nationality adjectives such as Canadian and Australian to their
corresponding nouns Canada and Australia (see
http://en.wikipedia.org/wiki/List_of_adjectival_forms_of_place_names).
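
One way to approach exercise 8 is a short lookup of irregular forms followed by ordered suffix rules. The sketch below is deliberately small: the exception table and the three rules are illustrative assumptions, and many adjectives (for example, ones that need a different vowel restored) will still come out wrong without more rules or a larger lookup.

```python
import re

# Irregular forms handled by lookup (a small illustrative sample).
EXCEPTIONS = {"Canadian": "Canada", "Chinese": "China", "Mexican": "Mexico"}

# Ordered suffix rewrites; the first matching rule wins.
SUFFIX_RULES = [
    (r"ian$", "ia"),  # Australian -> Australia
    (r"ese$", ""),    # Japanese -> Japan
    (r"an$", "a"),    # American -> America
]


def adjective_to_noun(adjective):
    """Map a nationality adjective to a place name using lookup, then rules."""
    if adjective in EXCEPTIONS:
        return EXCEPTIONS[adjective]
    for pattern, replacement in SUFFIX_RULES:
        if re.search(pattern, adjective):
            return re.sub(pattern, replacement, adjective)
    return adjective  # no rule applies; return the input unchanged


for word in ["Canadian", "Australian", "American", "Japanese"]:
    print(word, "->", adjective_to_noun(word))
```

Canadian and Australian share the same ending but need different rewrites, which is why the lookup is consulted before the rules.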

9. Study the lolcat version of the book of Genesis, accessible as
nltk.corpus.genesis.words('lolcat.txt'), and the rules for converting text into lolspeak at
http://www.lolcatbible.com/index.php?title=How_to_speak_lolcat. Define regular expressions to
convert English words into corresponding lolspeak words.

10. Use WordNet to create a semantic index for a text collection. Extend the concordance search
program in Example 3-1, indexing each word using the offset of its first synset, e.g.,
wn.synsets('dog')[0].offset (and optionally the offset of some of its ancestors in the hypernym
hierarchy).

11. Write pseudocode for each of the following tasks:

• In a news article, identify the parts of speech for each word to extract key
information such as the subject, object, and action of reported events.
• Extract named entities such as person names, organization names, dates,
and locations from a set of news articles to create a database of key entities
mentioned in the articles.
• Given a large corpus of customer reviews for a product, tokenize each
review into sentences and perform text preprocessing to remove special
characters, lowercase the text, and remove stop words before sentiment
analysis.
• Use pre-trained word embeddings to find similar phrases in a customer
support ticket database to identify common issues and provide automated
responses to frequently asked questions.
• Classify customer reviews of a restaurant as positive, negative, or neutral to
provide feedback to the restaurant owner and improve customer
satisfaction.
• Label each word in a medical transcription with its corresponding part of
speech (e.g., noun, verb, adjective) to assist in medical coding and billing
processes.
• Translate user-generated content on a social media platform from multiple
languages into English to facilitate communication and engagement among
users from different linguistic backgrounds.
• Train a language model to generate product descriptions for an e-commerce
website based on existing product specifications and customer reviews to
enhance product listings and attract potential buyers.
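
To show the level of detail the pseudocode might aim for, the review-preprocessing task above (sentence tokenization, special-character removal, lowercasing, stopword removal) can be sketched in a few lines of Python. The sentence splitter is a naive regular expression and the stopword set is a small hypothetical sample. A real pipeline would more likely use a trained sentence tokenizer and a fuller stopword list.

```python
import re

# A tiny illustrative stopword set (an assumption of this sketch).
STOPWORDS = {"the", "a", "an", "and", "but", "was", "is", "it", "to", "of"}


def split_sentences(review):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    return [p for p in parts if p]


def preprocess(sentence, stopwords=STOPWORDS):
    """Lowercase, replace special characters with spaces, drop stopwords."""
    lowered = sentence.lower()
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return [t for t in cleaned.split() if t not in stopwords]


def preprocess_review(review):
    """Return one list of cleaned tokens per sentence of the review."""
    return [preprocess(s) for s in split_sentences(review)]


print(preprocess_review("The food was great! Service was slow."))
```

The output of preprocess_review() is the kind of token list a later sentiment-analysis step would consume.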
