
This extended exercise is meant to help you practice using several of the ADTs you
have built up during the course, as well as reading from files.  Specifically, the program
that you will be writing should read in words from numerous files, keep track of where
they occurred, and produce output statistics for specific words.  The individual steps that
the program should perform are as follows:

Subtask 1 – Tokenization of Files


 Read in token information for all of the files located inside a folder called
“inputfiles”, which should itself be located in your project directory. The files that
are read should include those located in sub-folders of inputfiles, sub-sub-folders,
and so on.  When we test your program, we will provide our own files, so make
sure that your solution works for different folder layouts.
 The information that should be collected for each token occurrence should be:
1. The name of the file where it occurred
2. The line number of the occurrence
3. The column of the given line where the token started
 Here, we want our tokens to include only single words, with no sentence
punctuation. This can be accomplished by treating anything that isn’t a letter,
hyphen (-), or apostrophe (’) as a separator character, just as we treated
whitespace in previous lessons.
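
The steps above can be sketched as follows. This is only one possible approach, using nothing beyond the standard library: `collectFiles` recursively gathers every file under a directory, and `tokenize` splits one line of text into words, treating any character that is not a letter, hyphen, or apostrophe as a separator and recording the 1-based column where each word starts. The class and method names here are illustrative, not required.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch of Subtask 1: recursive folder traversal plus line tokenization.
class TokenizerSketch {

    // Recursively collect every file under dir, including files in
    // sub-folders, sub-sub-folders, etc.
    static void collectFiles(File dir, List<File> out) {
        File[] entries = dir.listFiles();
        if (entries == null) return;                 // not a directory, or unreadable
        for (File f : entries) {
            if (f.isDirectory()) collectFiles(f, out);  // recurse into sub-folders
            else out.add(f);
        }
    }

    // Tokenize one line: returns "word@column" strings, where column is
    // the 1-based position at which the word starts.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (Character.isLetter(c) || c == '-' || c == '\'') {
                int start = i;
                while (i < line.length() && (Character.isLetter(line.charAt(i))
                        || line.charAt(i) == '-' || line.charAt(i) == '\'')) {
                    i++;
                }
                tokens.add(line.substring(start, i) + "@" + (start + 1));
            } else {
                i++;                                 // skip separator characters
            }
        }
        return tokens;
    }
}
```

In the real program you would call `tokenize` on each line of each collected file, keeping count of line numbers as you read.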
Subtask 2 – Storing the Word Occurrence Information
 In order to store the required information for each word occurrence, create a new
class called OccurrenceRecord which has three fields for the filename, line
number, and column of a given occurrence.
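
A minimal sketch of what OccurrenceRecord might look like, with the three fields named in the bullet above (the accessor and `toString` choices are up to you):

```java
// One immutable record per token occurrence: where it was found.
class OccurrenceRecord {
    private final String filename;
    private final int lineNumber;
    private final int column;

    OccurrenceRecord(String filename, int lineNumber, int column) {
        this.filename = filename;
        this.lineNumber = lineNumber;
        this.column = column;
    }

    String getFilename()  { return filename; }
    int getLineNumber()   { return lineNumber; }
    int getColumn()       { return column; }

    @Override
    public String toString() {
        // e.g. "notes.txt:3:7" -- file, line, column
        return filename + ":" + lineNumber + ":" + column;
    }
}
```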
 Each word that is found in the files may occur multiple times, so you should use a
Map from Strings to Queues of OccurrenceRecords to keep track of all of the
occurrences for all of the words. You are required to use the
LLQHashTableMap<K,V> implementation as your Map for this task.  You can
use whatever Queue implementation you like to store the lists of
OccurrenceRecords (although it might be easiest to simply use LinkedListQueue
since you will need it anyway for the buckets.)
 We don’t want to distinguish between upper and lowercase letters in our words,
so before adding them or searching for them in our Map, be sure to use the
method toLowerCase() to convert your Strings to lower case. Here’s a quick
example of how to use it:
String str = “CraZy StrINGz!!!”;

String lcStr = str.toLowerCase();

 When using this Map, remember:


1. Every token read in, whether the word has been seen before or not,
requires you to create a new OccurrenceRecord for that particular
occurrence.
2. If the word hasn’t been seen before, don’t forget to create a new Queue
of OccurrenceRecords for it; enqueueing items onto a null Queue is
never a good idea.
3. However, don’t create a new Queue if the word has already been seen;
just enqueue the new OccurrenceRecord onto the Queue that the word
already maps to.
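
The three rules above can be sketched as a single insertion method. Note that java.util.HashMap and ArrayDeque stand in here for the course’s LLQHashTableMap and LinkedListQueue (whose APIs are not shown in this handout), and a plain String stands in for the OccurrenceRecord payload; substitute your own classes and their method names in the real program.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of the Map-update logic for Subtask 2, using standard-library
// stand-ins for the course ADTs.
class WordIndex {
    private final Map<String, Queue<String>> occurrences = new HashMap<>();
    private int totalTokens = 0;

    // Record one occurrence of a word; "record" is a placeholder for
    // the OccurrenceRecord you would actually create (rule 1).
    void recordToken(String word, String record) {
        word = word.toLowerCase();          // keys are case-insensitive
        totalTokens++;
        Queue<String> q = occurrences.get(word);
        if (q == null) {                    // rule 2: first sighting -> new Queue
            q = new ArrayDeque<>();
            occurrences.put(word, q);
        }
        q.add(record);                      // rule 3: otherwise just enqueue
    }

    Queue<String> lookup(String word) {
        return occurrences.get(word.toLowerCase());
    }

    int getTotalTokens() { return totalTokens; }
}
```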
 In addition to storing the occurrence information in a Map, also store just the
words that have occurred into a SortedQueue of Strings. Do not add multiple
copies of the same word to this SortedQueue.
 Lastly, keep track of the total number of tokens that were read in from all of the
files. Use an integer for this (of course).
Subtask 3 – Output Information
 Output the contents of the SortedQueue, which contains all of the words that have
been seen, to the console window. The words should appear in alphabetical order,
one word per line.
 Output load factor and bucket size standard deviation to the console window for
the LLQHashTableMap; make sure you label them as such when they are output.
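
The two statistics can be computed as follows, assuming you can obtain the number of records stored in each bucket of your LLQHashTableMap (how you expose that from your own class is up to you). The load factor is the average number of entries per bucket, and the standard deviation of the bucket sizes measures how evenly the hash function spreads keys across the buckets.

```java
// Sketch of the table statistics for Subtask 3, computed from an array
// of per-bucket entry counts.
class TableStats {

    // total entries divided by number of buckets
    static double loadFactor(int[] bucketSizes) {
        int total = 0;
        for (int s : bucketSizes) total += s;
        return (double) total / bucketSizes.length;
    }

    // population standard deviation of the bucket sizes
    static double bucketSizeStdDev(int[] bucketSizes) {
        double mean = loadFactor(bucketSizes);
        double sumSq = 0;
        for (int s : bucketSizes) {
            double d = s - mean;
            sumSq += d * d;
        }
        return Math.sqrt(sumSq / bucketSizes.length);
    }
}
```

A perfectly even table (every bucket the same size) has a standard deviation of 0; larger values indicate clustering.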
 A file called “getwordinfo.txt”, containing some of the words that have been
found in the input files, will reside in your project directory. Read in each of
the words from this file (perhaps using a simple Scanner object), and then output
the following to the console window:
1. The word itself
2. The list of occurrences of that word, or, if the word never occurred,
simply output “Not found”
3. The total number of occurrences of the word, and the usage frequency
of the word (as a percentage) relative to all word occurrences in the
input files
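
One way to format the three pieces of per-word output is sketched below. The occurrence list is passed in as a plain List of strings here; in the real program it would come from your Map lookup (and would be a Queue of OccurrenceRecords), and it may be null if the word never occurred. The method name and exact layout are illustrative only.

```java
import java.util.List;
import java.util.Locale;

// Sketch of the per-word report for Subtask 3: the word, its
// occurrences (or "Not found"), the occurrence count, and the usage
// frequency as a percentage of all tokens read.
class WordReport {

    static String report(String word, List<String> occurrences, int totalTokens) {
        StringBuilder sb = new StringBuilder(word).append('\n');
        if (occurrences == null || occurrences.isEmpty()) {
            sb.append("Not found\n");
            return sb.toString();
        }
        for (String occ : occurrences) {
            sb.append("  ").append(occ).append('\n');
        }
        double pct = 100.0 * occurrences.size() / totalTokens;
        sb.append(occurrences.size()).append(" occurrences, ")
          .append(String.format(Locale.US, "%.2f", pct)).append("%\n");
        return sb.toString();
    }
}
```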
