
CS 112 - Project 6: Dictionaries, File I/O

Due Sunday, November 24th, 11:59pm

If you have questions, use the Piazza forums (and professor/TA office hours) to obtain assistance.
Grading on this assignment will be ready by Wednesday, December 4th, due to the holiday break.

Note: we can't use late tokens on the last project; if you don't use them on this project, you'll get
S points per remaining token added to your project points before averaging all projects together.

Background Topics:
Dictionaries are a wonderful data structure that allows us to store mappings, or pairings, where unique keys
give us values. These can be used for word-definition dictionaries (naturally!), notions of simple databases, or
really anything that could be described mathematically as a partial function.
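
As a small illustration of the key-to-value idea, here is a minimal sketch that counts words with a dictionary. The word list and the variable names are made up for the example; they are not part of the assignment.

```python
# Hypothetical example data: count how often each word occurs in a list.
words = ["the", "cat", "saw", "the", "dog"]

counts = {}
for word in words:
    # get() returns the default (0) for a key we have not stored yet
    counts[word] = counts.get(word, 0) + 1

print(counts["the"])          # 2
print(counts.get("bird", 0))  # 0
```

Indexing with a missing key (counts["bird"]) would raise a KeyError, which is why the sketch uses get with a default; this matches the "partial function" view, where only some keys have values.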
As in any programming language, in Python we are able to read and write text files. We can read line by
line, or any number of characters at a time, or even just pull in the entire file as a string all at once for familiarity's
sake. We can also write to a file, sort of like print calls that go into a file instead of to the command prompt. Files
can be used to store large data sets between program runs or to keep multiple streams of output separate from
each other, or even for storing results from different stages of computation. Some text files are written with human
readers in mind, but we might also choose to write text files by a program, for a program!
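
The reading and writing styles mentioned above can be sketched as follows. The helper names (write_lines, read_whole, read_lines) are hypothetical and only for illustration; the project's required functions are described later.

```python
def write_lines(filename, lines):
    """Write each string in lines to the file, one per line."""
    with open(filename, "w") as f:
        for line in lines:
            f.write(line + "\n")   # like print, but into the file


def read_whole(filename):
    """Pull in the entire file as one string."""
    with open(filename) as f:
        return f.read()


def read_lines(filename):
    """Read line by line, dropping the newline at the end of each line."""
    result = []
    with open(filename) as f:
        for line in f:
            result.append(line.rstrip("\n"))
    return result
```

The with statement closes the file automatically, even if an error occurs partway through. A call such as write_lines(name, ["first", "second"]) followed by read_lines(name) should give back the same two strings.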

Requirements:
Turn in a file to BlackBoard with our naming convention of 2xx_first_last_Px.py.
Include the following header, and document your code (both with comments and variable names!)

#-------------------------------------------------------------------------------
# Name: George Mason.
# Project X
# Section XXX
# Due Date: MM/DD/YYYY
#-------------------------------------------------------------------------------
# Honor Code Statement: I received no assistance on this assignment that
# violates the ethical guidelines set forth by professor and class syllabus.
#-------------------------------------------------------------------------------
# References: (list any lecture slides, text book pages, any other resources)
# Note: you may not use code from websites, so don't bother looking any up.
#-------------------------------------------------------------------------------
# Comments and assumptions: A note to the grader as to any problems or
# uncompleted aspects of the assignment, as well as any assumptions about the
# meaning of the specification.
#-------------------------------------------------------------------------------
# NOTE: width of source code should be <= 80 characters to facilitate printing.
#2345678901234567890123456789012345678901234567890123456789012345678901234567890
# 10 20 30 40 50 60 70 80
#-------------------------------------------------------------------------------

Task
You will write a program that can read in textual files (such as one work by an author) to generate word-counts,
and store these results into a file. Furthermore, these result files can also be combined into larger result files,
giving us a way to look at word usage for an author's entire body of work. For instance, you might analyze
different sonnets by Shakespeare, report on each one, but then also get an analysis of all of his sonnets by
combining your results rather than trying to put all those sonnets into a single file and re-processing the whole
body of work again.

You must provide the following functions in your solution, with the described behaviors. You will also create a
simple menu structure (also described below in one of the functions), but we reserve the right to only test your
code via the functions, so again be sure you have the exact function names/parameters as provided in the
function definitions. As before, you're welcome to introduce any extra functions you want to use. It will
definitely be helpful!

Formats
A frequency is a dictionary, with words (strings) as keys and the frequency (number of occurrences of the
word) as the value.

A report is a dictionary, with the following key-value pairs. The keys happen to all be strings, and the values
are lists, numbers, or dictionaries.

"shorts" : list of all words that are co-shortest.
"longs" : list of all words that are co-longest.
"mosts" : list of all words that occur the most often (this is the mode).
"count" : total number of words
"avglen" : average length of a word in this document. Be sure to account for how often words
occurred!
"freqs" : a frequency dictionary of words-to-word counts.

When written to a file, a report must have this ordering/format: all the key-value pairs in the order shown, a
blank line, followed by all the words/counts from the original document. Lists of words must be stored in
alphabetical order; you may use the sorted function or the sort method. Note that words should be
separated by commas and spaces exactly as shown (with no ending comma).

shorts: in, of, me, us
longs: fantastic, calibrate
mosts: yet, more, words
count: anIntVal
avglen: aFloatVal

ab ;
and <
...
yours 2

Functions
You must define these functions as described in your solution. You're welcome to create as many extra
functions to support your project as you'd like, but we might need to rely on these functions for testing
purposes, so be sure you get the function names and parameter lists exactly right.

parse_document(filename): this function accepts a string that is assumed to be the name of a file worth
reading. This function will create and return a dictionary (a frequency, see above) with each word that is found
in the document appearing as a key, with the corresponding value being an integer value representing exactly
how many times it occurred.
For any word that has a contraction in it, we will remove the apostrophe and then treat the remaining letters as
the word; for instance, didn't will become didnt, we'll will become well (an unfortunate
coincidence in our simple little program), we're becomes were, and so on.
remove all of the following symbols when finding words: `~!@#$%^&*()-_=+[{]}\|;:'",<.>/?
lowercase everything, so that e.g. Car and car and cAR get counted together.
You might find string manipulation methods useful, such as split, join, startswith, endswith, strip, or
others.
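
The cleaning rules above can be sketched on a plain string (rather than a file) as follows. This is only one possible approach, not the required solution: the function name count_words is hypothetical, and SYMBOLS is an assumed stand-in for the symbol list given in the specification.

```python
# ASSUMPTION: stand-in symbol set; use the list from the specification.
SYMBOLS = "`~!@#$%^&*()-_=+[{]}\\|;:'\",<.>/?"


def count_words(text):
    """Return a frequency dictionary for the words in a string."""
    freq = {}
    for raw in text.split():          # split on any whitespace
        word = ""
        for ch in raw.lower():        # lowercase everything
            if ch not in SYMBOLS:     # drop apostrophes and other symbols
                word += ch
        if word != "":                # skip pieces that were only symbols
            freq[word] = freq.get(word, 0) + 1
    return freq


print(count_words("We'll see. Well, we're HERE; were you?"))
# {'well': 2, 'see': 1, 'were': 2, 'here': 1, 'you': 1}
```

Note how "We'll" and "Well," end up under the same key, which is the coincidence the specification warns about.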

build_report(freq): Given a frequency, create and return a report that correctly documents the details about
that particular document.
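
As a rough sketch of how the report fields relate to a frequency, the following computes each one directly. The name build_report_sketch is hypothetical, and the code assumes the frequency has at least one word, as the notes in this document allow.

```python
def build_report_sketch(freq):
    """Build a report dictionary from a non-empty frequency dictionary."""
    words = sorted(freq)                       # alphabetical order
    shortest = min(len(w) for w in words)
    longest = max(len(w) for w in words)
    highest = max(freq.values())
    count = sum(freq.values())                 # total words, with repeats
    # weight each word's length by how often it occurred
    total_letters = sum(len(w) * freq[w] for w in words)
    return {
        "shorts": [w for w in words if len(w) == shortest],
        "longs": [w for w in words if len(w) == longest],
        "mosts": [w for w in words if freq[w] == highest],
        "count": count,
        "avglen": total_letters / count,
        "freqs": dict(freq),                   # copy, so the input is not shared
    }
```

For example, with {"a": 3, "bb": 1, "ccc": 3, "ddd": 1} there are 8 words and 3 + 2 + 9 + 3 = 17 letters in total, so avglen is 17 / 8 = 2.125, not the unweighted (1 + 2 + 3 + 3) / 4.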

combine_reports(r1, r2): Given two reports (the output from build_report), correctly merge them into a single
report, and return it. Don't modify either of the originals; build a new one and return it. Be especially careful
with avglen!
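
The warning about avglen is because the two averages must be weighted by each report's word count; simply averaging the two numbers is wrong when the documents differ in size. A minimal sketch of that arithmetic, and of merging two frequencies without modifying the originals, follows. The helper names merge_freqs and combined_avglen are hypothetical.

```python
def merge_freqs(f1, f2):
    """Return a new frequency with the counts of f1 and f2 added together."""
    merged = dict(f1)                  # copy, so f1 is left untouched
    for word, n in f2.items():
        merged[word] = merged.get(word, 0) + n
    return merged


def combined_avglen(r1, r2):
    """Weighted average word length of two reports."""
    total = r1["count"] + r2["count"]
    letters = r1["avglen"] * r1["count"] + r2["avglen"] * r2["count"]
    return letters / total


# Hypothetical numbers: 2 words averaging 3.0 letters, 6 words averaging 5.0.
small = {"count": 2, "avglen": 3.0}
large = {"count": 6, "avglen": 5.0}
print(combined_avglen(small, large))   # 4.5, not the plain average 4.0
```

The remaining fields (shorts, longs, mosts) cannot be read off the two reports alone in every case, so recomputing them from the merged frequency is one straightforward option.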

write_report(r, filename): Given a report and a filename, write into the supplied file name all of the contents
of the report as shown in the sample above. You may use the sorted function or the sort method if you'd like.

read_report(filename): Given just a string for the filename, assuming the file actually contains a valid report,
read the contents of the file and recreate the report (the dictionary); return it.

run_menu(): Implements the following menu, with all necessary user interactions and printings. Breaking
down the actions into more functions is highly encouraged but not mandatory. As always, unless the user quits
we re-visit the menu after each time an option has been completed.

Choose an option:
1. read file, build report, save the report
2. combine two reports
3. quit

1. read/build/save: ask the user for a file name to read, and then ask them for a filename to write out the
report. Then you can read the file, build a frequency, build a report, and write the report to the output
file. If you'd like to also print the contents of the report file, that's okay.
2. combine two reports: ask the user for names of two files that contain reports in them. Then ask them for
an output filename as well. Read in both reports, create the merged version, write to the output file, and
again you can optionally print out the merged report.
3. quit: quit.

There was going to be a "combine multiple reports" option, but I nixed it for time considerations.

Extra Credit (+5)
Use exception handling to make the menu-based runs of your program robust in the following ways:

use exception handling to validate the user's menu options (don't let them crash your program
when they attempt to enter the menu number). Just repeatedly ask until they give an actual menu
option.
ensure that all files that are opened exist; in the menu, immediately ask again for an input file if it
didn't exist. This means immediately trying to open files in options 2 and 3, unlike the description.
if a report file is ill-formed, keep your menu-program from crashing by printing "bad report file." and
going back to the menu without writing to any files or printing reports that don't make sense.
Individual functions (other than the run_menu function) don't have to recover from bad files, as
they aren't equipped to make that decision.
if a user enters a blank string or only whitespace for an output file, don't accept it - ask again. For
simplicity's sake, anything else can be an output file name, though this might make for some
bizarre filenames if they get mischievous!
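
The "ask again until the answer is usable" pattern behind these items can be sketched as below. This is a sketch under assumptions, not the required design: the function names, the prompt text, and the ask parameter (which defaults to the built-in input and exists only so the loops can be tried without typing) are all hypothetical.

```python
def ask_menu_choice(ask=input, valid=(1, 2, 3)):
    """Keep asking until the answer is one of the valid menu numbers."""
    while True:
        try:
            choice = int(ask("Choose an option: "))
        except ValueError:        # not a number at all: ask again
            continue
        if choice in valid:
            return choice


def ask_existing_file(prompt, ask=input):
    """Keep asking until the named file can actually be opened for reading."""
    while True:
        name = ask(prompt)
        try:
            f = open(name)
        except OSError:           # missing file, directory, etc.: ask again
            continue
        f.close()
        return name


def ask_output_name(prompt, ask=input):
    """Keep asking until the name is not blank or only whitespace."""
    while True:
        name = ask(prompt)
        if name.strip() != "":
            return name
```

Trying to open the file right away, instead of checking for it separately, means the test and the later read cannot disagree about whether the file exists.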

Notes / Assumptions / Requirements

Don't import anything, and then you can use all the built-ins you want!

Like last time, please include the "import tester" block of code at the end of your program. Thoughts on
testing are below again. If this gave you trouble, please ask for help in lab or office hours!
You can assume each file has at least one word in it. There's one less corner case.
There is a zip file of some examples on Piazza (next to this specification) that should help clarify some
behaviors; it contains original text documents, reports generated about them, as well as some combined
reports.
Matching the format of your written report files is important! Any file intended to be read by a program
needs to be precise. Be careful where commas go and where whitespace goes (always single spaces
there, no tabs or extra spaces). There is no trailing whitespace on any line. Whitespace includes spaces,
tabs, newlines, and some other characters that we can ignore for now. Just remember details such as how
the strip() method of strings strips off whitespace by default. Sounds useful!
What file extensions should we use for our report files? Since we always give the entire file name,
including the extension, it doesn't actually matter what extensions you choose. Just naming everything
with .txt extensions is fine. Extensions really only give a hint about how we might successfully open the
file, but it doesn't affect the contents.
Your code that you submit shouldn't have any hardcoded filenames in it - we always either ask the user or
it's just a parameter to a function.



Testing Your Code

Again, since we have many functions, you might want to test your code interactively. You could write out your
'script' of function calls and printings into a function named something like test1_parse, and then call that
interactively to save yourself some typing. Or, put all your menu interactions in another file and use piping
again; remember piping? python3 myfile.py < all_user_interaction.txt

Think ahead of what exactly you're trying to test, and then make up specific documents to read that can help
you test that situation. It's probably best to have many smaller test files than one giant all-purpose test case.

We have the run_menu function, so even if your main function doesn't exist during testing (before adding the
import tester code), you can still easily call run_menu() to test the program from the interactive mode.

Here's the block of code that we need for testing purposes again. Please note those are all double-underscores!

# please put this at the end of your file before submission:
def main():
    import tester
    tester.runtests(__file__)

if __name__=="__main__":
    main()


Grading Rubric

-----------------------------------
submit/comment the project: 5
parse_document: 15
build_report: 20
combine_reports: 20
write_report: 15
read_report: 15
run_menu: 10
-----------------------------------
TOTAL: 100


Other penalties
Rather than building some requirements into dedicated points, here are rules that simply ought to be
followed, and failing to do so will incur a penalty off of your total score.

-5 to -15: use of global variables. Please only have function definitions and the block of
"boilerplate code" from above in your file that you turn in. In general, you should try to only have
function definitions and a main() call in any Python program you write from now on.
-50 points: Program can't be run (crashes before any user interaction). Code being able to run at
all is very important! Keep your code runnable, always.
-5 to -15 points: using an unauthorized module. Applies to each unauthorized thing.
-10 points: file contains zero comments after the required header comment.
