NLP Exercises
1. Write a program to find all words that occur at least three times in the Brown Corpus.
2. Write a program to generate a table of lexical diversity scores (i.e., token/type ratios), as in
Table 1-1. Include the full set of Brown Corpus genres (nltk.corpus.brown.categories()). Which
genre has the lowest diversity (greatest number of tokens per type)? Is this what you would have
expected?
3. Write a function that finds the 50 most frequently occurring words of a text that are not
stopwords.
4. Write a program to print the 50 most frequent bigrams (pairs of adjacent words) of a text,
omitting bigrams that contain stopwords.
5. Write a function word_freq() that takes a word and the name of a section of the Brown Corpus
as arguments, and computes the frequency of the word in that section of the corpus.
6. Describe the class of strings matched by each of the following regular expressions:
a. [a-zA-Z]+
b. [A-Z][a-z]*
c. p[aeiou]{,2}t
d. \d+(\.\d+)?
e. ([^aeiou][aeiou][^aeiou])*
f. \w+|[^\w\s]+
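One way to explore patterns a to f above is to try them against sample strings with Python's re module. The following is a minimal sketch, not a reference answer: the helper names matches_fully and tokenize are hypothetical, and the sample strings are chosen only for illustration. Note that {,2} means "between zero and two repetitions" in Python's regex syntax.

```python
import re

# The six patterns from the list above, keyed by their letter.
PATTERNS = {
    "a": r"[a-zA-Z]+",
    "b": r"[A-Z][a-z]*",
    "c": r"p[aeiou]{,2}t",
    "d": r"\d+(\.\d+)?",
    "e": r"([^aeiou][aeiou][^aeiou])*",
    "f": r"\w+|[^\w\s]+",
}


def matches_fully(label, text):
    """Return True if the whole string is matched by the labelled pattern."""
    return re.fullmatch(PATTERNS[label], text) is not None


def tokenize(text):
    """Use pattern f as a simple tokenizer: word runs or punctuation runs."""
    return re.findall(PATTERNS["f"], text)


if __name__ == "__main__":
    # Illustrative sample strings (assumptions, not from the exercise).
    for label, sample in [("a", "hello"), ("b", "Hello"), ("c", "pout"),
                          ("d", "3.14"), ("e", "catdog")]:
        print(label, repr(sample), matches_fully(label, sample))
    print(tokenize("Hello, world!"))
```

Testing both matching and non-matching strings (for example "peeet" for pattern c, or "3." for pattern d) is usually the quickest way to pin down exactly which class of strings a pattern describes.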
7. Download some text from a language that has vowel harmony (e.g., Hungarian), extract the
vowel sequences of words, and create a vowel bigram table.
8. Write code to convert nationality adjectives such as Canadian and Australian to their
corresponding nouns Canada and Australia (see http://en.wikipedia.org/wiki/List_of_adjectival_forms_of_place_names).
10. Use WordNet to create a semantic index for a text collection. Extend the concordance search
program in Example 3-1, indexing each word using the offset of its first synset, e.g.,
wn.synsets('dog')[0].offset() (and optionally the offset of some of its ancestors in the hypernym
hierarchy).
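Several of the frequency exercises above (words occurring at least three times, lexical diversity, the most frequent non-stopword words, and stopword-free bigrams) reduce to counting over a list of tokens. The sketch below shows one possible shape for those counts using only the standard library. The function names, the toy sentence, and the tiny STOPWORDS set are assumptions for illustration; with NLTK installed, the tokens would typically come from nltk.corpus.brown.words() and the stopwords from nltk.corpus.stopwords.words('english').

```python
from collections import Counter

# Toy stopword set: an assumption for illustration, not a real stopword list.
STOPWORDS = {"the", "a", "of", "and", "on"}


def words_at_least(tokens, n=3):
    """Return the sorted words that occur at least n times."""
    counts = Counter(tokens)
    return sorted(w for w, c in counts.items() if c >= n)


def lexical_diversity(tokens):
    """Token/type ratio: average number of tokens per distinct word."""
    return len(tokens) / len(set(tokens))


def top_content_words(tokens, stopwords=STOPWORDS, k=50):
    """Return the k most frequent alphabetic words that are not stopwords."""
    counts = Counter(
        t.lower() for t in tokens
        if t.isalpha() and t.lower() not in stopwords
    )
    return counts.most_common(k)


def top_bigrams(tokens, stopwords=STOPWORDS, k=50):
    """Return the k most frequent adjacent word pairs with no stopwords."""
    lowered = [t.lower() for t in tokens]
    pairs = zip(lowered, lowered[1:])
    counts = Counter(
        p for p in pairs
        if all(w.isalpha() and w not in stopwords for w in p)
    )
    return counts.most_common(k)


# Hypothetical stand-in for a corpus; replace with real corpus tokens.
tokens = "the cat sat on the mat and the cat saw the black cat".split()

if __name__ == "__main__":
    print(words_at_least(tokens, 3))
    print(lexical_diversity(tokens))
    print(top_content_words(tokens, k=5))
    print(top_bigrams(tokens, k=5))
```

For the genre table, the same lexical_diversity function could be applied once per category, for example to nltk.corpus.brown.words(categories=genre) for each genre in nltk.corpus.brown.categories().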
• In a news article, identify the parts of speech for each word to extract key
information such as the subject, object, and action of reported events.
• Extract named entities such as person names, organization names, dates,
and locations from a set of news articles to create a database of key entities
mentioned in the articles.
• Given a large corpus of customer reviews for a product, tokenize each
review into sentences and perform text preprocessing to remove special
characters, lowercase the text, and remove stop words before sentiment
analysis.
• Use pre-trained word embeddings to find similar phrases in a customer
support ticket database to identify common issues and provide automated
responses to frequently asked questions.
• Classify customer reviews of a restaurant as positive, negative, or neutral to
provide feedback to the restaurant owner and improve customer
satisfaction.
• Label each word in a medical transcription with its corresponding part of
speech (e.g., noun, verb, adjective) to assist in medical coding and billing
processes.
• Translate user-generated content on a social media platform from multiple
languages into English to facilitate communication and engagement among
users from different linguistic backgrounds.
• Train a language model to generate product descriptions for an
e-commerce website based on existing product specifications and customer
reviews to enhance product listings and attract potential buyers.
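The review-preprocessing task in the list above (split into sentences, remove special characters, lowercase, drop stop words) can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the sentence splitter is a naive regex rather than a trained tokenizer, the STOPWORDS set is a toy list, and the function names are hypothetical. A real pipeline would more likely use a library tokenizer and a full stopword list.

```python
import re

# Toy stopword set: an assumption for illustration only.
STOPWORDS = {"the", "a", "an", "is", "was", "it", "and", "but", "i"}


def split_sentences(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def preprocess(sentence, stopwords=STOPWORDS):
    """Lowercase, replace non-alphanumeric characters with spaces, drop stopwords."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", sentence.lower())
    return [w for w in cleaned.split() if w not in stopwords]


def preprocess_review(review):
    """Return one list of cleaned tokens per sentence in the review."""
    return [preprocess(s) for s in split_sentences(review)]


if __name__ == "__main__":
    # Hypothetical review text, not taken from any dataset.
    print(preprocess_review("The battery is GREAT!!! But it was pricey."))
```

One limitation of this simple approach is that contractions are split at the apostrophe (for example "don't" becomes "don" and "t"), which can matter for sentiment analysis because negation words may be lost.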