UBC Summer School in NLP - VSP 2019 Lecture 8


VANCOUVER SUMMER PROGRAM

Package G (Linguistics): Computation for Natural Language Processing


Class 8
Instructor: Michael Fry
PLAN TODAY
• Review
• Errors/Exception handling
• Command line interface
• Importing/Python Packages
• Introduction to Natural Language Processing
• Basics of the Natural Language Toolkit (NLTK)
• Building an n-gram model
• Assignment 2 overview
• Quiz 2
TYPES OF ERRORS
• Types of errors:
• SyntaxError (technically the only "error" on this list, because the program cannot run at all with a SyntaxError)
• NameError
• IndexError
• KeyError
• AttributeError
• TypeError
• UnicodeError
• FileNotFoundError
• TabError
• ZeroDivisionError
• and more…
ERROR HANDLING
• Python gives us a way to catch errors using a try/except/else block (kind of like an
if/else block)
try:
    # some code
except KeyError:
    # if a KeyError is raised, this happens
except (IndexError, ValueError):
    # if an IndexError or a ValueError is raised, this happens
else:
    # if no exception was raised, this happens
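As a runnable sketch of that structure (the inventory dictionary and messages here are invented for illustration):

```python
# A small demonstration: each except clause catches one kind of error,
# and else runs only when the try block raised nothing.
inventory = {'apples': 3, 'pears': 0}

def describe(key):
    try:
        count = inventory[key]               # may raise KeyError
        label = ['none', 'some'][count > 0]  # True/False index as 1/0
    except KeyError:
        return key + ' is not in the inventory'
    else:
        # reached only if no exception was raised above
        return key + ': ' + label

print(describe('apples'))   # apples: some
print(describe('mangoes'))  # mangoes is not in the inventory
```

Note that the else branch is the "no error" path; to catch *any remaining* exception you would instead add a final bare `except:` clause.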
NEW FUNCTION: INPUT()
• Since we’ll be working with the command prompt to install packages, I thought it’d be
fun to play around for a minute with a command line interface
• Remember, Sublime Text is just a text editor: we can’t interact with our script there,
we just write it and run it
• Now that we’re running from the command line, we can read typed text with the input() function
• Test it out in IDLE:
print('please input your name:')
name = input()
print(name)
• (input() can also take the prompt directly: name = input('please input your name: '))
INTERACTING WITH CMD LINE
• The earliest computer games used text-based interfaces
• Today, we can recreate these types of games easily
• Let’s write a fun little text-based game:
• We’ll preset a path that the user must figure out
• The basic interaction asks the user to make a move and lets them know if they made the
right move or not
print('make a move (up, down, left, right):')
curr_move = input()
if curr_move == correct_move:  # correct_move is the preset move for this step
    print('you got it right, you can move on!')
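One way to sketch the whole game (the path, messages, and function name here are invented for illustration). Since input() blocks waiting for typed text, this sketch takes the moves as a list, with the interactive line shown in a comment:

```python
# The player must guess each step of a preset path; a wrong move ends the game.
secret_path = ['up', 'left', 'down']

def play(moves):
    """Run a list of moves through the game; return how many were right."""
    for step, move in enumerate(moves):
        if move == secret_path[step]:
            print('you got it right, you can move on!')
        else:
            print('sorry, that was not the right move!')
            return step
    return len(moves)

# Interactively you would collect each move with:
#   curr_move = input('make a move (up, down, left, right): ')
play(['up', 'left', 'down'])
```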

NEW PYTHON BASIC: IMPORT
• Python has additional packages that aren’t immediately accessible in a program,
you need to import them
• To import a module, type import followed by the module name
• For example
import string
• Now you can access things in the module using "dot-notation"
print(string.punctuation)
print(string.ascii_lowercase)

• Import statements should always be the very first thing at the top of your code.
IMPORTING MODULES
• Another useful package/module is the os module
• OS stands for "operating system". This module contains useful functions for dealing with
files and folders.
• Here are two very useful functions:
os.getcwd()
• Returns the "current working directory": the folder Python was launched from (usually
the folder where your Python file is, if you ran it from there)
os.path.join(string1, string2, ...)
• Takes the strings and joins them with slashes to make a file path
• These are often combined:
path = os.path.join(os.getcwd(), 'data', 'turkish_words.txt')
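A small demonstration of the two functions together (the 'data'/'turkish_words.txt' path is just the slide's example; the file does not have to exist, since join only builds the string):

```python
import os

# The current working directory: the folder Python was launched from
cwd = os.getcwd()

# join() builds a path string with the right separator for your OS
path = os.path.join(cwd, 'data', 'turkish_words.txt')
print(path)

# os.sep is '\\' on Windows and '/' on Mac/Linux
print(os.path.join('data', 'turkish_words.txt') == 'data' + os.sep + 'turkish_words.txt')  # True
```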
PYTHON PACKAGES
• Importing gives us access to the wide world of things you can do with Python
• Many developers have made Python packages available
• We’ll be installing packages using pip
• pip is a tongue-in-cheek recursive acronym for "pip installs packages"
• We use it from command prompt/terminal
• Let’s first have some fun with command prompt
• Using text input
INSTALLING NEW PYTHON
PACKAGES
• We install python packages using the pip install call
• Let’s install the Natural Language Processing Toolkit
• In command line (PC)
• pip install nltk
• In terminal (MAC)
• pip3 install nltk
• You’ll see a bunch of stuff show up on the screen as it installs the package
• There are lots of packages you can install
• If there’s a specific project you’re working on, there’s a good chance someone has made
a package which can help!
NATURAL LANGUAGE
PROCESSING
• Natural Language Processing (NLP) requires a lot of skills because language is
complex and the tools to analyze it are also complex
• It pulls from many areas:
• Programming/Computer Science
• Linguistics
• Artificial intelligence/Machine Learning
• Mathematics
• Statistics
• Cognitive Science
• Philosophy
LANGUAGE MODELING
• A very common task in NLP is language modeling
• The goal of language modeling is to capture the patterns in language
• Word sequences
• Semantic embeddings
• Models are simplifications of a real system, and language modeling is no different
• We’ll work towards creating an n-gram language model, which we’ll then use to
generate new text
• But, before we do that, we need a method to build a language model in the first
place
• What language are we modeling?
LANGUAGE MODELING
• We generally build a model from a corpus (plural: corpora)
• a large collection of data
• In computational linguistics, corpora are usually collections of texts, and sometimes
collections of audio recordings.
• If we didn't have corpora, we would just be guessing at what language looks like
• So, where do we get corpora?
• Anywhere!
• the internet
• Python packages
• phone transcripts
• books
• etc.
PLAN TODAY
• Review
• Errors/Exception handling
• Command line interface
• Importing/Python Packages
• Introduction to Natural Language Processing
• Basics of the Natural Language Toolkit (NLTK)
• Building an n-gram model
• Assignment 2 overview
• Quiz 2
NLTK
• The Natural Language Toolkit contains a large amount of Python code that you can
use to analyze language data.
• It also comes with numerous corpora, in several languages.
• We can download it and import it like other packages
• The NLTK website has a very good walk-through of its different functions.
• the whole book is here: http://www.nltk.org/book/
• we will start on Chapter 2: http://www.nltk.org/book/ch02.html
• We’ll open it up in just a second, but first open up IDLE and type:
import nltk
nltk.download()
NLTK
• To download data, open up IDLE and type: nltk.download()
• Click the Corpora Tab; then, select which data packages you want
• For now, download
• Brown
• Gutenberg
• You can always run nltk.download() again to find new data, or you can use command
line
nltk.download('brown')
• I imagine there will be a few times we run a script, get an error, and need to download new
data.
• Open up chapter 2 and let’s do a few things: http://www.nltk.org/book/ch02.html
NLTK
• NLTK has functionality to do many interesting things, a few that match with our
focus on morphology so far are:
• Tokenizing
• Stemming
• Lemmatization
• A few that pull from the area of Semantics:
• Finding synonyms
• Finding antonyms
• Still, one of the most useful things in NLTK is the access to data which we can use
to build our own models
N-GRAM MODEL
• n-grams are a type of language model.
• In an n-gram model, language is represented as "chunks" of information, associated with
probabilities
• A "gram" is any linguistic unit: a phoneme, a letter, a word, a phrase, etc.
• A bigram is a sequence of two grams that follow each other in a text
• A trigram is a sequence of three grams
• Above three, no special names: 4-gram, 5-gram, etc.
N-GRAM MODEL
• An n-gram model is a type of "statistical model".
• Rather than modelling a language exactly, it models probabilities
• An n-gram model tries to answer this question: given a symbol A from a language,
what is the probability that the symbol B follows A?
• Here’s an example in English:
• I say, ‘You’re not allowed to ______’
• What comes next?
A. green B. leave C. under D. if
• Here’s an example in Mandarin:
• I start saying a word, ‘z__’
• What comes next?
A. /l/ B. /p/ C. /o/ D. /m/
N-GRAM MODEL
• These patterns work at abstract levels too:
• If you are reading a sentence in English, and you see a preposition, what is most likely to
follow?
A. noun B. preposition C. verb D. nothing

• To construct an n-gram model of a language, we need example data from that
language.
• A collection of language data is called a ____________ (haha)
N-GRAM MODEL
• So, how do we build an n-gram model?
• Steps:
• Read through the corpus
• Identify each gram that occurs
• For each gram, memorize the n-1 following grams
• When done, calculate the probability of each n-gram
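The counting steps above can be sketched as a short function. The toy corpus here is the Green Eggs and Ham text from the example that follows (lowercased and with punctuation dropped, so splitting on spaces is enough):

```python
def build_bigrams(words):
    """For each word, count how often each following word occurs."""
    bigrams = {}
    # zip pairs each word with the word that follows it
    for first, second in zip(words, words[1:]):
        if first not in bigrams:
            bigrams[first] = {}
        bigrams[first][second] = bigrams[first].get(second, 0) + 1
    return bigrams

corpus = ('do you like green eggs and ham '
          'I do not like them Sam-I-am '
          'I do not like green eggs and ham').split()

model = build_bigrams(corpus)
print(model['do'])   # {'you': 1, 'not': 2}
```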
N-GRAM MODEL
• Here’s a simple example using Green Eggs and Ham by Dr. Seuss

Corpus:
Do you like
Green eggs and ham
I do not like them,
Sam-I-am.
I do not like
Green eggs and ham.

Grams: do (3), you (1), like (3), green (2), eggs (2), and (2), ham (2), I (2), not (2),
them (1), Sam-I-am (1)

Bigrams: do you (1), you like (1), like green (2), green eggs (2), eggs and (2),
and ham (2), ham I (1), I do (2), do not (2), not like (2), like them (1),
them Sam-I-am (1), Sam-I-am I (1)
N-GRAM MODEL
• What’s a good way to track N-grams in Python? We need a word (key) that
connects to the next-n-words (value)
• Dictionary

bigrams = {'do': {'you': 1, 'not': 2},
           'you': {'like': 1},
           'like': {'green': 2, 'them': 1},
           'green': {'eggs': 2},
           'eggs': {'and': 2},
           'and': {'ham': 2},
           'ham': {'I': 1},
           'I': {'do': 2},
           'them': {'Sam-I-am': 1},
           'Sam-I-am': {'I': 1}}
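The last step of the recipe, turning counts into probabilities, can be sketched like this (two entries from the bigram dictionary above are enough to show the idea):

```python
# Turning bigram counts into probabilities: P(next word | current word).
bigrams = {'do': {'you': 1, 'not': 2},
           'like': {'green': 2, 'them': 1}}

probs = {}
for word, followers in bigrams.items():
    total = sum(followers.values())
    # divide each count by the total count for this word
    probs[word] = {nxt: count / total for nxt, count in followers.items()}

# after 'do', 'not' is twice as likely as 'you'
print(probs['do'])
```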
N-GRAM MODEL
• Let’s build one
• Choose a dataset from nltk.corpus.gutenberg.fileids()
• Go through the words (not all, say the first thousand) and build a bi-gram model
• To access words, use: nltk.corpus.gutenberg.words('austen-emma.txt')
• You don’t have to use austen-emma.txt, just pick one of the file ids
• Remember to match the format of the bigram dictionary:
• Each key is a word with a value that is another dictionary
• In the nested dictionary, the key is a word that followed the first key and value is the frequency with
which it followed it
GOOD RESTAURANTS IN
VANCOUVER
• Soup noodles: Deer Garden Signatures
• Standard Chinese: Congee Noodle house (Vancouver), No. 9 (Richmond)
• Decent HK café: i-Café (Vancouver), Copa (Vancouver), Lido (Richmond)
• Hot Pot: Dolar Shop (Richmond – expensive)
• Taiwanese: Corner 23 (Vancouver)
• BBtea: Xing Fu tang (Vancouver), ChaTime
• Dim Sum: Fisherman’s Terrace (Richmond), Sun Sui Wah (Richmond)
Red Star (Vancouver, Richmond)
• Mall: Aberdeen Centre, Yaohan Centre
