UBC Summer School in NLP - VSP 2019 Lecture 8
• Import statements should always be the very first thing at the top of your code.
IMPORTING MODULES
• Another useful package/module is the os module
• OS stands for "operating system". This module contains useful functions for dealing with
files and folders.
• Here are two very useful functions:
os.getcwd()
• Returns the "current working directory". This is the folder Python is currently running in, which is usually (but not always) the folder where the Python file is
os.path.join(string1, string2, ...)
• Takes the strings and joins them with your operating system's path separator (a slash on Mac, a backslash on PC) to make a file path
• These are often combined:
path = os.path.join(os.getcwd(), 'data', 'turkish_words.txt')
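A minimal sketch of how these two functions behave together. The folder and file names are simply the ones from the slide; the file does not have to exist for the path to be built, so the last line checks whether it is really there.

```python
import os

# The folder the Python process is currently running in
cwd = os.getcwd()

# Join the pieces with the separator used by this operating system
path = os.path.join(cwd, 'data', 'turkish_words.txt')

print(cwd)
print(path)

# Building a path does not create or open the file; this checks if it exists
print(os.path.exists(path))
```

Using os.path.join instead of typing the slashes by hand means the same script works on both PC and Mac.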
PYTHON PACKAGES
• Importing gives us access to the wide world of things you can do with Python
• Many developers have made Python packages available
• We’ll be installing packages using pip
• pip is a tongue-in-cheek recursive acronym for "Pip Installs Packages"
• We use it from command prompt/terminal
• Let’s first have some fun with command prompt
• Using text input
INSTALLING NEW PYTHON PACKAGES
• We install Python packages using the pip install command
• Let’s install the Natural Language Toolkit (NLTK)
• In command line (PC)
• pip install nltk
• In terminal (MAC)
• pip3 install nltk
• You’ll see a bunch of stuff show up on the screen as it installs the package
• There are lots of packages you can install
• If there’s a specific project you’re working on, there’s a good chance someone has made
a package which can help!
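After running pip, it can be handy to check from inside Python whether a package was actually installed. The sketch below is one possible way to do that with the standard library; the helper name is_installed is made up for this example and is not part of pip or NLTK.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can find a top-level package with this name."""
    # find_spec returns None when no package with that name can be found
    return importlib.util.find_spec(package_name) is not None


print(is_installed('os'))    # part of Python itself, so always available
print(is_installed('nltk'))  # True only after pip has installed nltk
```

If the second line prints False, run the pip install command again and look for error messages in the output.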
NATURAL LANGUAGE PROCESSING
• Natural Language Processing (NLP) requires a lot of skills because language is
complex and the tools to analyze it are also complex
• It pulls from many areas:
• Programming/Computer Science
• Linguistics
• Artificial intelligence/Machine Learning
• Mathematics
• Statistics
• Cognitive Science
• Philosophy
LANGUAGE MODELING
• A very common task in NLP is language modeling
• The goal of language modeling is to capture the patterns in language
• Word sequences
• Semantic embeddings
• Models are simplifications of a real system, and language modeling is no different
• We’ll work towards creating an n-gram language model, which we’ll then use to
generate new text
• But, before we do that, we need a method to build a language model in the first
place
• What language are we modeling?
LANGUAGE MODELING
• We generally build a model from a corpus (Plural is corpora)
• a large collection of data
• In computational linguistics, corpora are usually collections of texts, and sometimes
collections of audio recordings.
• If we didn't have corpora, we would just be guessing at what language looks like
• So, where do we get corpora?
• Anywhere!
• the internet
• Python packages
• phone transcripts
• books
• etc.
PLAN TODAY
• Review
• Errors/Exception handling
• Command line interface
• Importing/Python Packages
• Introduction to Natural Language Processing
• Basics of the Natural Language Toolkit (NLTK)
• Building an n-gram model
• Assignment 2 overview
• Quiz 2
NLTK
• The Natural Language Toolkit contains a large amount of Python code that you can
use to analyze language data.
• It also comes with numerous corpora, in several languages.
• We can download it and import it like other packages
• The NLTK website has a very good walk-through of its different functions.
• the whole book is here: http://www.nltk.org/book/
• we will start on Chapter 2: http://www.nltk.org/book/ch02.html
• We’ll open it up in just a second, but first open up IDLE and type:
import nltk
nltk.download()
NLTK
• To download data, open up IDLE and type: nltk.download()
• Click the Corpora Tab; then, select which data packages you want
• For now, download
• Brown
• Gutenberg
• You can always run nltk.download() again to find new data, or you can download a package directly by name:
nltk.download('brown')
• I imagine there will be a few times we run a script, get an error, and need to download new data.
• Open up chapter 2 and let’s do a few things: http://www.nltk.org/book/ch02.html
NLTK
• NLTK can do many interesting things. A few that match our focus on morphology so far are:
• Tokenizing
• Stemming
• Lemmatization
• A few that pull from the area of Semantics:
• Finding synonyms
• Finding antonyms
• Still, one of the most useful things in NLTK is the access to data which we can use
to build our own models
N-GRAM MODEL
• n-grams are a type of language model.
• In an n-gram model, language is represented as "chunks" of information, associated with
probabilities
• A "gram" is any linguistic unit: a phoneme, a letter, a word, a phrase, etc.
• A bigram is a sequence of two grams that follow each other in a text
• A trigram is a sequence of three grams
• Above three, no special names: 4-gram, 5-gram, etc.
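To make the definitions above concrete, here is a small sketch that pulls the n-grams out of any sequence. The function name ngrams and the example inputs are chosen for illustration; the same function handles words or letters because a "gram" can be any unit.

```python
def ngrams(units, n):
    """Return every run of n consecutive units as a tuple."""
    # The last possible starting position is len(units) - n
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


words = 'I do not like them'.split()

print(ngrams(words, 2))
# [('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'them')]

print(ngrams(words, 3))
# [('I', 'do', 'not'), ('do', 'not', 'like'), ('not', 'like', 'them')]

# The units can be letters instead of words
print(ngrams('ham', 2))
# [('h', 'a'), ('a', 'm')]
```

A sequence of k units always contains k - n + 1 n-grams, so five words give four bigrams and three trigrams.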
N-GRAM MODEL
• An n-gram model is a type of "statistical model".
• Rather than modelling a language exactly, it models probabilities
• An n-gram model tries to answer this question: given a symbol A from a language,
what is the probability that the symbol B follows A?
• Here’s an example in English:
• I say, ‘You’re not allowed to ______’
• What comes next?
A. green B. leave C. under D. if
• Here’s an example in Mandarin:
• I start saying a word, ‘z__’
• What comes next?
A. /l/ B. /p/ C. /o/ D. /m/
N-GRAM MODEL
• These patterns work at abstract levels too:
• If you are reading a sentence in English, and you see a preposition, what is most likely to
follow?
A. noun B. preposition C. verb D. nothing
Corpus:
Do you like
Green eggs and ham
I do not like them,
Sam-I-am.
I do not like
Green eggs and ham.

Grams:
do (3), you (1), like (3), green (2), eggs (2), and (2), ham (2), I (2), not (2), them (1), Sam-I-am (1)

Bigrams:
do you (1), you like (1), like green (2), green eggs (2), eggs and (2), and ham (2), ham I (1), I do (2), do not (2), not like (2), like them (1), them Sam-I-am (1), Sam-I-am I (1)
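The counts for the Green Eggs and Ham corpus can be reproduced with a few lines of Python. This is a sketch under some assumptions: the punctuation has been removed and the capitalization normalized by hand to match the slide, and the function name probability is made up for this example. It estimates the chance that one word follows another by dividing the bigram count by the number of times the first word is followed by anything.

```python
from collections import Counter

# Corpus from the slide, with punctuation removed and case normalized by hand
corpus = ('do you like green eggs and ham '
          'I do not like them Sam-I-am '
          'I do not like green eggs and ham')
words = corpus.split()

# Count single words (grams) and pairs of neighbouring words (bigrams)
grams = Counter(words)
bigrams = Counter(zip(words, words[1:]))


def probability(first, second):
    """Estimate how likely it is that `second` comes right after `first`."""
    # Total number of times `first` is followed by any word
    followers = sum(count for (a, _), count in bigrams.items() if a == first)
    if followers == 0:
        return 0.0
    return bigrams[(first, second)] / followers


print(grams['like'])                  # 3
print(bigrams[('like', 'green')])     # 2
print(probability('like', 'green'))   # 2 out of 3 times
print(probability('like', 'them'))    # 1 out of 3 times
```

In this tiny corpus, "like" is followed by "green" twice and by "them" once, so the model thinks "green" is twice as likely as "them" after "like".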
N-GRAM MODEL
• What’s a good way to track N-grams in Python? We need a word (key) that
connects to the next-n-words (value)
• Dictionary
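Here is one possible sketch of that idea, not necessarily the version we will build in class. It assumes a bigram model, so each key is one word and each value is a list of every word seen right after it; the names next_words and generate are made up for this example. Repeated words are kept in the list on purpose, so a random choice picks common followers more often.

```python
import random

# Same corpus as the Green Eggs and Ham slide, punctuation removed by hand
corpus = ('do you like green eggs and ham '
          'I do not like them Sam-I-am '
          'I do not like green eggs and ham')
words = corpus.split()

# key: a word, value: list of every word that appeared right after it
next_words = {}
for current, following in zip(words, words[1:]):
    next_words.setdefault(current, []).append(following)


def generate(start, length, seed=None):
    """Build a text of up to `length` words by repeatedly picking a follower."""
    rng = random.Random(seed)  # a seed makes the random choices repeatable
    output = [start]
    # Stop early if the last word was never followed by anything
    while len(output) < length and output[-1] in next_words:
        output.append(rng.choice(next_words[output[-1]]))
    return ' '.join(output)


print(next_words['like'])   # ['green', 'them', 'green']
print(generate('I', 8))     # a different 8-word text on most runs
```

For a model with larger n, the key could be a tuple of several words instead of a single word.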