Small Project
Write a program that reports the five most frequent two-word sequences in a text file.
1. Find the beginning and the end of the text (look for the markers "*** START OF THE
EBOOK...") and discard everything before the beginning and after the end, including the markers.
3. Convert each word to lowercase and remove any punctuation. If a "word" consists
only of punctuation, discard it entirely. For example, "Huck Finn is drawn from life ; Tom Sawyer also,
but" becomes "huck finn is drawn from life tom sawyer also but".
4. Count all combinations of two consecutive words (they are known as bigrams -- e.g., "huck
finn," "finn is," "is drawn," "drawn from") and report the ten most frequent of them.
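
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference solution. All function names are hypothetical. The end marker prefix "*** END OF THE" is an assumption (the brief only shows the start marker, and only in part), words are assumed to be separated by whitespace, and only ASCII punctuation is removed.

```python
import string
from collections import Counter

# Assumption: each marker is a line starting with one of these prefixes.
# The brief elides the rest of the marker, so only the prefix is matched.
START_PREFIX = "*** START OF THE"
END_PREFIX = "*** END OF THE"  # assumed; not given in the brief

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def extract_body(lines, start_prefix=START_PREFIX, end_prefix=END_PREFIX):
    """Return the lines strictly between the start and end marker lines."""
    body = []
    inside = False
    for line in lines:
        stripped = line.strip()
        if not inside:
            # Skip everything up to and including the start marker.
            if stripped.startswith(start_prefix):
                inside = True
            continue
        if stripped.startswith(end_prefix):
            break  # drop the end marker and everything after it
        body.append(line)
    return body


def words_from(lines):
    """Split lines on whitespace, lowercase, strip punctuation, drop empties."""
    words = []
    for line in lines:
        for token in line.split():
            word = token.lower().translate(_PUNCT_TABLE)
            if word:  # a token made only of punctuation becomes ''
                words.append(word)
    return words


def top_bigrams(words, n):
    """Return the n most frequent pairs of consecutive words with counts."""
    counts = Counter(zip(words, words[1:]))
    return counts.most_common(n)


def top_bigrams_in_file(path, n):
    """Run the whole pipeline on a text file (UTF-8 is assumed)."""
    with open(path, encoding="utf-8") as handle:
        return top_bigrams(words_from(extract_body(handle)), n)
```

The number of bigrams to report is left as the parameter n, since the brief mentions both five and ten. Because the words are collected into one flat list, a bigram may span a line break, which matches the example in step 3.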
Deliverables: the Python file and the output of the program as a text file with the bigrams and their
counts, one result per line, in decreasing order of count (the most frequent bigram at
the top).
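
One possible way to produce the output file is sketched below. The function names and the line layout (the two words, then the count, separated by single spaces) are assumptions; the brief fixes only one result per line and the ordering.

```python
def format_report(bigram_counts):
    """Return one 'word1 word2 count' line per bigram, highest count first.

    bigram_counts is a list of ((word1, word2), count) pairs.
    """
    # sorted() is stable, so bigrams with equal counts keep their input order.
    ordered = sorted(bigram_counts, key=lambda item: item[1], reverse=True)
    return [f"{first} {second} {count}" for (first, second), count in ordered]


def write_report(bigram_counts, path):
    """Write the formatted report to a text file, one result per line."""
    with open(path, "w", encoding="utf-8") as out:
        for line in format_report(bigram_counts):
            out.write(line + "\n")
```

For example, write_report([(("finn", "is"), 1), (("huck", "finn"), 3)], "bigrams.txt") would put "huck finn 3" on the first line and "finn is 1" on the second; the file name here is only a placeholder.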