UBC Summer School in NLP - VSP 2019 Lecture 7

VANCOUVER SUMMER PROGRAM

Package G (Linguistics): Computation for Natural Language Processing


Class 7
Instructor: Michael Fry
PLAN TODAY
• Review:
• Dictionaries
• Finish of text-to-speech (TTS)
• Write to file
• Access webpage
• Introduce Errors/exceptions
• Practical application
• Introduce importing packages
• Learn to use command line
• Introduce installing packages with pip
• Introduce NLTK (Natural Language Toolkit)
NESTED DICTIONARIES
• Sometimes it’s useful to nest dictionaries
• i.e. a dictionary whose key accesses a value that is also a dictionary
• For example, someone might want to track the oldest son of every family through
generations (‘Tim is the father of Jake is the father of Pete’)
• How do we build a nested dictionary?
• How can we access entries?
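One possible answer to both questions, as a minimal sketch. The names oldest_sons and family are made up for illustration, and an empty dictionary is assumed to mean "no recorded son yet".

```python
# Written out all at once: Tim is the father of Jake is the father of Pete.
oldest_sons = {'Tim': {'Jake': {'Pete': {}}}}

# Built step by step: each new key gets its own (initially empty) dictionary.
family = {}
family['Tim'] = {}
family['Tim']['Jake'] = {}
family['Tim']['Jake']['Pete'] = {}

# Access entries by chaining square brackets, one key per level.
jakes_sons = list(oldest_sons['Tim']['Jake'])
print(jakes_sons)
```

Each extra pair of square brackets goes one level deeper into the nesting.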
TTS: OUR METHOD
• We’re going to use the CMU-Dictionary as a half-way point to get to IPA
• We’re going to use multiple python dictionaries to do our transcription from spelling
to ARPABET to IPA, like this:
• ‘hello’ -> HH AH0 L OW1 -> hɛlo
• Our dictionary types will look like this:
english2arpabet = {'hello': ['HH', 'AH0', 'L', 'OW1']}
arpabet2ipa = {'HH': 'h', 'AH0': 'ɛ', 'L': 'l', 'OW1': 'o'}
TTS: OUR METHOD
• Download
supplementary material → cmu_dict.txt
supplementary material → arpabet2ipa.txt
Datasets → three_little_pigs.txt
• Pseudocode (step 1):
read arpabet2ipa.txt into python
initialize an empty dictionary
for each line:
    strip the line
    split the line
    add an entry into our arpabet2ipa dictionary with arpabet as the key
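A minimal sketch of step 1. It assumes each line of arpabet2ipa.txt holds one ARPABET symbol and one IPA symbol separated by whitespace; the real file layout may differ. The function name build_arpabet2ipa and the sample lines are made up so the example runs without the file.

```python
def build_arpabet2ipa(lines):
    """Build an ARPABET -> IPA dictionary from an iterable of text lines."""
    arpabet2ipa = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: exactly two whitespace-separated columns per line.
        arpabet, ipa = line.split()
        arpabet2ipa[arpabet] = ipa
    return arpabet2ipa

# Stand-in for the file contents (hypothetical sample).
sample_lines = ['HH\th\n', 'AH0\tɛ\n', 'L\tl\n', 'OW1\to\n']
arpabet2ipa = build_arpabet2ipa(sample_lines)
print(arpabet2ipa)
```

With the real file, the same function can be given an open file object: open the file with encoding='utf-8' inside a with block and pass the file in place of sample_lines.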
TTS: OUR METHOD
• Pseudocode (Step 2):
read the cmu_dict.txt file
initialize empty dictionary
for each line in the file:
    strip the line
    split the line by double space ('  ')
    add entry into dictionary with spelling as key
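A minimal sketch of step 2. It assumes each entry looks like "HELLO  HH AH0 L OW1" (spelling, two spaces, then the ARPABET symbols) and that spellings should be lowercased to match keys such as 'hello'. Lines without a double space are skipped. The function name and sample lines are made up for illustration.

```python
def build_english2arpabet(lines):
    """Build a spelling -> list-of-ARPABET-symbols dictionary."""
    english2arpabet = {}
    for line in lines:
        line = line.strip()
        if '  ' not in line:
            continue  # not an entry (blank line, comment, ...)
        # Split only once, on the double space between spelling and phones.
        spelling, pronunciation = line.split('  ', 1)
        english2arpabet[spelling.lower()] = pronunciation.split()
    return english2arpabet

# Stand-in for the file contents (hypothetical sample).
sample_lines = [';;; a comment line\n', 'HELLO  HH AH0 L OW1\n', 'PIG  P IH1 G\n']
english2arpabet = build_english2arpabet(sample_lines)
print(english2arpabet['hello'])
```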
TTS: OUR METHOD
• Pseudocode (Step 3):
read three_little_pigs.txt
make an empty ipa_output list
for each line in the file:
    strip the line
    split the line
    make empty ipa_line list
    for each word in the line:
        strip off punctuation
        look up pronunciation in cmu_dict
        convert pronunciation to ipa
        add it to the ipa_line
    add ipa_line to ipa_output
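A minimal sketch of step 3, using two tiny hand-made dictionaries in place of the full ones. Keeping unknown words in their original spelling is an assumption made here, not something the pseudocode specifies; the function name transcribe is also made up.

```python
import string

# Tiny stand-ins for the dictionaries built in steps 1 and 2.
english2arpabet = {'hello': ['HH', 'AH0', 'L', 'OW1']}
arpabet2ipa = {'HH': 'h', 'AH0': 'ɛ', 'L': 'l', 'OW1': 'o'}

def transcribe(lines):
    """Return a list of lines, each a list of IPA-transcribed words."""
    ipa_output = []
    for line in lines:
        ipa_line = []
        for word in line.strip().split():
            # Strip punctuation from both ends and match the lowercase keys.
            word = word.strip(string.punctuation).lower()
            if word in english2arpabet:
                phones = english2arpabet[word]
                ipa_line.append(''.join(arpabet2ipa[p] for p in phones))
            else:
                ipa_line.append(word)  # assumption: keep unknown words as spelled
        ipa_output.append(ipa_line)
    return ipa_output

ipa_output = transcribe(['Hello, hello!\n'])
print(ipa_output)
```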
TTS: OUR METHOD
• Now, to make things easy to copy online, let’s write our data to a file
• Writing to a file is a lot like reading from a file; you just need another argument
out_file_path = 'C:/Users/Michael/Desktop/test.txt'
with open(out_file_path, mode='w', encoding='utf-8') as out_file:
    out_file.write('DINOSAURS!')
• Notice that we can only write out strings
• Note: with mode='w', Python automatically creates the file for you if it doesn't exist; if it
does exist, its old contents are erased and writing starts at the first line
• There are other modes (r, r+, w, w+, a, a+) which all allow different interactions with files
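A small sketch comparing two of these modes. It writes to a temporary folder (an assumption made here so the example runs anywhere) rather than to the Desktop path above.

```python
import os
import tempfile

# Hypothetical location: a fresh temporary folder instead of the Desktop.
out_file_path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# mode='w' creates the file (or erases an existing one) before writing.
with open(out_file_path, mode='w', encoding='utf-8') as out_file:
    out_file.write('DINOSAURS!\n')
    out_file.write(str(42) + '\n')  # non-strings must be converted first

# mode='a' keeps what is already there and adds to the end.
with open(out_file_path, mode='a', encoding='utf-8') as out_file:
    out_file.write('more dinosaurs\n')

with open(out_file_path, mode='r', encoding='utf-8') as in_file:
    contents = in_file.read()
print(contents)
```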
TTS: OUR METHOD
• Now that we have IPA transcriptions for our story, let’s go listen to some of it
• Go to: https://itinerarium.github.io/phoneme-synthesis/
• Enter a sub-portion of your IPA and listen away
ERROR HANDLING
• When the Python interpreter reaches a line of code that it can't execute, it raises an
exception or it throws an error
alphabet = ['a', 'b', 'c']
h = alphabet[7]
h = h.upper()
• Python can't find index 7 (because it doesn't exist), so there is no value to assign
to h
• Python can't "skip" a line of code that it can't execute. Even if it could skip line
2, what would happen on line 3?
• When there's nothing left for Python to do, it raises an exception
ERROR HANDLING
• Errors are fatal, meaning the program cannot continue running
• Exceptions are non-fatal, meaning the program could possibly recover and keep
running
• However, programmers commonly call everything an error
• Even more confusingly, Python names most of its built-in exceptions Errors
TYPES OF ERRORS
• Types of errors:
• SyntaxError (technically the only "error" on this list, because the program cannot run
with a SyntaxError)
• NameError
• IndexError
• KeyError
• AttributeError
• TypeError
• UnicodeError
• FileNotFoundError
• TabError
• ZeroDivisionError
• and more…
TYPES OF ERRORS
• SyntaxError:
• This is raised if you have typed code that doesn't conform to the Python syntax. For
example:
• A missing bracket - print('Some text'
• A missing quote - print('Some text)
• A missing colon - for line in file
• A missing comma - numbers = [1,2,3,4 5]
• SyntaxErrors are fatal, and they prevent your program from running
• Normally, IDLE will highlight the problem in red for you
TYPES OF ERRORS
• NameError:
• This is raised if the Python interpreter can't find a variable name. There are
several possible causes, but these two are very common:
• You misspelled a variable name (e.g. "woard" instead of "word")
• You defined a variable name inside an if-statement, then tried to access
that variable name, but the if-statement code was never executed in the
first place (example on the next slide).
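A minimal sketch of the if-statement case. The variable names are made up for illustration, and the try/except is only there so the example keeps running and can show the message.

```python
word = 'cat'
try:
    if word.startswith('d'):
        first_letter = 'd'  # only defined when the condition is true
    # 'cat' does not start with 'd', so first_letter was never created.
    print(first_letter)
except NameError as error:
    message = str(error)
    print('NameError:', message)
```

One common fix is to give the variable a default value before the if-statement.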
TYPES OF ERRORS
• IndexError:
• This is raised if you try to access an index in a list that does not exist. There are two
possible causes:
• The list has the correct length, and the index is wrong.
• Double-check the code that is generating the index number.

• The list is not the correct length, and the index you want should exist, but doesn't
• Make sure the code that builds the list is building one of the right length.
TYPES OF ERRORS
• KeyError:
• This is raised if you try to access a dictionary key that doesn't exist
• The key you want should be in the dictionary, but it isn't. Double-check the code that is
building the dictionary to make sure the key does exist before you try to access it.
• You've asked for an incorrect key name. Check for spelling mistakes if you typed key names
yourself. Otherwise, check that your code is not generating incorrect names.
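A short sketch of two ways to check a key before using it. The dictionary contents and the key 'ZZ' are made up for illustration.

```python
arpabet2ipa = {'HH': 'h', 'L': 'l'}
symbol = 'ZZ'  # hypothetical key that is not in the dictionary

# Option 1: test membership before looking the key up.
if symbol in arpabet2ipa:
    ipa = arpabet2ipa[symbol]
else:
    ipa = '?'

# Option 2: dict.get returns a default instead of raising a KeyError.
ipa_with_get = arpabet2ipa.get(symbol, '?')

# When debugging, printing the keys shows what the dictionary really holds.
print(sorted(arpabet2ipa))
```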
TYPES OF ERRORS
• AttributeError:
• This is raised if you try to use a method or attribute that doesn't exist
s = 'some string'
s.append('!')
• This raises an AttributeError because strings do not have an "append" method
• There is no general fix for an AttributeError. Follow the traceback to the line that
raised it
TYPES OF ERRORS
• TypeError
• This is raised if you try to do something with a variable of the wrong type.
numbers = [0,1,2,3,4]
print(numbers['0'])
• This raises a TypeError because the index is the wrong type. It has to be an integer, not a
string
• To debug, follow the traceback
ERROR HANDLING
• Python gives us a way to catch errors using a try/except/else block (kind of like an
if/else block)
try:
    pass  # some code
except KeyError:
    pass  # if a KeyError is raised, this happens
except (IndexError, ValueError):
    pass  # if an IndexError or a ValueError is raised, this happens
else:
    pass  # if no exception is raised, this happens
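A small working sketch of the same shape. The function name and data are made up for illustration. The else branch runs only when the try block finishes without raising anything.

```python
def describe_lookup(data, key, index):
    """Look up data[key][index] and report what happened."""
    try:
        value = data[key][index]
    except KeyError:
        return 'missing key'
    except (IndexError, TypeError):
        return 'bad index'
    else:
        # Reached only if no exception was raised in the try block.
        return 'found ' + str(value)

data = {'vowels': ['a', 'e', 'i']}
results = [
    describe_lookup(data, 'vowels', 1),      # works
    describe_lookup(data, 'consonants', 0),  # KeyError
    describe_lookup(data, 'vowels', 7),      # IndexError
    describe_lookup(data, 'vowels', '1'),    # TypeError
]
print(results)
```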
ERROR HANDLING
• Generally, you don't want errors, so you shouldn't write code that relies on them.
• Of course, just because you shouldn't doesn't mean you can't. Let's fix this:

words = ['shrubbery', 'coconut', 'witch', 'newt']
letter_count = dict()
for word in words:
    for letter in word:
        letter_count[letter] += 1  # raises KeyError
ERROR HANDLING
• Fix with try/except:
for letter in word:
    try:
        letter_count[letter] += 1
    except KeyError:
        letter_count[letter] = 1
ERROR HANDLING
• Better yet, avoid the error entirely: let's fix it with an if/else instead
for letter in word:
    if letter in letter_count:
        letter_count[letter] += 1
    else:
        letter_count[letter] = 1
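Two other common ways to write the same count, shown as a sketch rather than as the method used in class: dict.get with a default value, and collections.Counter from the standard library.

```python
from collections import Counter

words = ['shrubbery', 'coconut', 'witch', 'newt']

# dict.get returns 0 for letters we have not seen yet, so no KeyError.
letter_count = {}
for word in words:
    for letter in word:
        letter_count[letter] = letter_count.get(letter, 0) + 1

# Counter does the whole loop for us when given one long string.
counter = Counter(''.join(words))

print(letter_count['c'], counter['c'])
```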
ERROR HANDLING: PRACTICAL APPLICATION
• Lamaholot: an Austronesian (Malayo-Polynesian) language spoken by
around 150,000 people in Indonesia
• We're going to go through a wordlist and find all the environments of each
speech sound
• We'll have to deal with two types of errors, KeyError and IndexError
ERROR HANDLING: PRACTICAL APPLICATION
• Linguistic environments:
• An "environment" in linguistics means the left side and right side of a sound.
• The environments for the sound [o] in the word [polo]:
• it occurs between [p] and [l]
• it occurs between [l] and the end of the word
• In linguistics, we would write the environments of [o] this way:
• p_l
• l_#
• The underscore represents [o], and the # symbol means end of the word
ERROR HANDLING: PRACTICAL APPLICATION
• All of the environments for all the sounds in the word [kelopo]
• Environments of [k]
• #_e
• Environments of [e]
• k_l
• Environments of [l]
• e_o
• Environments of [o]
• l_p
• p_#
• Environments of [p]
• o_o
ERROR HANDLING: PRACTICAL APPLICATION
• Pseudocode:
open a file
make an empty words list
for each line in file:
    strip and split by tab
    append the Lamaholot word to a list
make an empty dictionary
for each word in the list:
    for each letter in the word:
        find out the letters on either side
        add those environments to the dictionary for that letter
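A minimal sketch of the second half of the pseudocode (the dictionary-building part). It assumes one character per sound and skips the file reading. The function name find_environments is made up for illustration. Note that word[-1] does not raise an IndexError (it wraps around to the last sound), so the left edge is checked explicitly.

```python
def find_environments(words):
    """Map each sound to the set of environments it occurs in."""
    environments = {}
    for word in words:
        for i, sound in enumerate(word):
            # A negative index would silently wrap around, so test for the
            # start of the word instead of relying on an exception.
            left = word[i - 1] if i > 0 else '#'
            try:
                right = word[i + 1]
            except IndexError:
                right = '#'  # ran off the end of the word
            environment = left + '_' + right
            try:
                environments[sound].add(environment)
            except KeyError:
                environments[sound] = {environment}  # first time we see this sound
    return environments

environments = find_environments(['kelopo'])
for sound in sorted(environments):
    print(sound, sorted(environments[sound]))
```

Sets are used here so that an environment seen twice is stored only once.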
NEW PYTHON BASIC: IMPORT
• Python has additional modules and packages that aren't immediately accessible in a
program; you need to import them
• To import a module, type import followed by the module name
• For example
import string
• Now you can access things in the module using "dot-notation"
print(string.punctuation)
print(string.ascii_lowercase)

• By convention, import statements go at the very top of your code.
IMPORTING MODULES
• Another useful package/module is the os module
• OS stands for "operating system". This module contains useful functions for dealing with
files and folders.
• Here are two very useful functions:
os.getcwd()
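• Returns the "current working directory": the folder Python is running from (often,
but not always, the folder where the Python file is)
os.path.join(string1, string2, ...)
• Takes the strings and joins them with the operating system's path separator (a slash
or backslash) to make a file path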
• These are often combined:
path = os.path.join(os.getcwd(), 'data', 'turkish_words.txt')
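A short sketch of what these calls produce. The data folder and turkish_words.txt may not exist on your machine, so the example checks with os.path.exists instead of opening the file.

```python
import os

cwd = os.getcwd()
path = os.path.join(cwd, 'data', 'turkish_words.txt')

print(cwd)                   # the folder Python is currently running from
print(path)                  # the pieces joined with this system's separator
print(os.path.exists(path))  # True only if that file is really there
```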
PYTHON PACKAGES
• Importing gives us access to the wide world of things you can do with Python
• Many developers have made Python packages available
• We'll be installing packages using pip
• pip is a tongue-in-cheek recursive acronym for "Pip Installs Packages"
• We use it from command prompt/terminal
• Let’s first have some fun with command prompt
• Using text input
NEW FUNCTION: INPUT()
• Since we'll be working with command prompt to install packages, I thought it'd be
fun to play around for a minute with a command line interface
• Remember, using Sublime Text, which is a text editor, we can't interact with our
script; we just write it and run it
• Now that we're using the command line, we can input text with the input() function
• Test it out in IDLE:
print('please input your name:')
name = input()
print(name)
INTERACTING WITH CMD LINE
• Early computer games had text-based interfaces
• Today, we can recreate these types of games easily
• Let’s write a fun little text-based game:
• We’ll preset a path that the user must figure out
• The basic interaction asks the user to make a move and lets them know if they made the
right move or not
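correct_move = 'up'  # hypothetical preset answer for this step
print('make a move (up, down, left, right):')
curr_move = input()
if curr_move == correct_move:
    print('you got it right, you can move on!')
else:
    print('wrong move, try again!')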
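One way the full game could look, as a sketch. The function name play, the preset path, and the "wrong way" message are made up for illustration. The moves come from a get_move function so the example can run with scripted moves instead of waiting at the keyboard.

```python
def play(path, get_move=input):
    """Ask for moves until the whole preset path has been followed."""
    position = 0
    attempts = 0
    while position < len(path):
        print('make a move (up, down, left, right):')
        curr_move = get_move().strip().lower()
        attempts += 1
        if curr_move == path[position]:
            print('you got it right, you can move on!')
            position += 1
        else:
            print('wrong way, try again')
    return attempts

secret_path = ['up', 'left', 'up']

# Scripted moves stand in for a player typing at the command line.
scripted_moves = iter(['up', 'down', 'left', 'up'])
attempts = play(secret_path, get_move=lambda: next(scripted_moves))
print('finished in', attempts, 'moves')
```

To play it yourself from the command line, call play(secret_path) with no second argument, and it will use input().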

INSTALLING NEW PYTHON PACKAGES
• We install Python packages using the pip install command
• Let's install the Natural Language Toolkit (NLTK)
• In command line (PC)
• pip install nltk
• In terminal (Mac)
• pip3 install nltk
• You'll see a bunch of stuff show up on the screen as it installs the package
• There are lots of packages you can install
• If there’s a specific project you’re working on, there’s a good chance someone has made
a package which can help!
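A small sketch for checking from Python whether a package is available, using the standard importlib module. The helper name is_installed is made up for illustration; the result for nltk depends on whether the install above succeeded on your machine.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can find a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None

has_string = is_installed('string')  # part of the standard library
has_nltk = is_installed('nltk')      # True only after pip install nltk
print('string:', has_string)
print('nltk:', has_nltk)
```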
