
Motivation Video

https://www.youtube.com/watch?v=8478kLLQEG8

Mitsuku vs Cleverbot - AI (Artificial Intelligence)


What is Natural Language Processing (NLP)?
● A field of computer science, artificial
intelligence, and computational linguistics
● To get computers to perform useful tasks
involving human languages
– Human-Machine communication
– Improving human-human communication
● E.g., Machine Translation
● Extracting information from texts
Why NLP?
● Languages involve many human activities
– Reading, writing, speaking, listening
● Voice can be used as a user interface in many
applications
– Remote controls, virtual assistants like Siri, ...
● NLP is used to acquire insights from massive
amounts of textual data
– E.g., hypotheses from medical, health reports
● NLP has many applications
● NLP is hard!
Why NLP?
● News:
– AN EARTHQUAKE struck Indonesia today - a strapping 7.7 magnitude earthquake that struck
early today off the northern coast of the island of Sumatra. It caused minor damage and there
are no reports of any deaths, although electricity was interrupted in several places.
● Summary:
– Event: Earthquake
– Location : Indonesia
– Magnitude: 7.7
– Region: Sumatra (Northern Coast)
– Deaths: Nil
– Damage: Minor

● Tweet
– @nokia announces release of new PDA phones see is.gd/iuTuY
● Summary:
– Who: Nokia
– What: Product announcement
Why Now?
● Huge amounts of data
● Internet = at least 20 billion pages
– Text data – web sites, blogs, tweets, ...
– Audio data – speech, ...
● Applications for processing large amounts
of texts require NLP expertise
Why is NLP hard?
● Highly ambiguous
● The sentence "I made her duck" may have
different meanings (from the Jurafsky book)
– I cooked waterfowl for her.
– I cooked waterfowl belonging to her.
– I created the (plaster?) duck she owns.
– I caused her to quickly lower her head or body.
– I waved my magic wand and turned her into
undifferentiated waterfowl.
Why is NLP hard?
● Natural languages are highly ambiguous at all levels
– Lexical (word’s meaning)
– Syntactic
– Semantic
– Discourse
● Natural languages are fuzzy
● Natural languages involve reasoning about the world
E.g., it is unlikely that an elephant wears pajamas
Why is NLP hard?
● Morphology: What is a word?
● 奧林匹克運動會(希臘語: Ολυμπιακοί Αγώνες ,簡稱奧運會或奧運)是
國際奧林匹克委員會主辦的包含多種體育運動項目的國際性運動會,每四年舉行一次。
("The Olympic Games, organized by the International Olympic Committee, are an
international multi-sport event held every four years." – written with no spaces between words)

● كبيوتها = "to her houses" – a single Arabic token corresponds to several English words

● Lexicography: What does each word mean?


– He plays bass guitar.
– That bass was delicious!

● Syntax: How do the words relate to each other?


– The dog bit the man. ≠ The man bit the dog.
– But in Russian: человек собаку съел = человек съел
собаку (both word orders mean "the man ate the dog")
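
As a quick illustration of the "what is a word?" problem, here is a minimal Python sketch (the Chinese sentence is shortened from the example above): whitespace splitting gives reasonable tokens for English, but returns the entire Chinese sentence as a single token because Chinese is written without spaces.

# Whitespace tokenization works for English but not for Chinese,
# where word boundaries are not marked in the text.
english = "The dog bit the man."
chinese = "奧林匹克運動會每四年舉行一次"  # "The Olympic Games are held every four years"

print(english.split())   # ['The', 'dog', 'bit', 'the', 'man.']
print(chinese.split())   # the whole sentence comes back as one "token"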
Why is NLP hard?
● Semantics: How can we infer meaning from
sentences?
– I saw the man on the hill with the telescope.

– The iPod is so small!  (here "small" is praise)

– The monitor is so small!  (here it is a complaint)

● Discourse: How about across many sentences?


– President Bush met with President-Elect Obama today at the
White House. He welcomed him, and showed him around.

– Who is “he”? Who is “him”? How would a computer figure that out?
MAJOR Areas of
Research & Development
● Text Processing
● Morph Analyzer
● POS Tagging
● Parsing
● Machine Translation .........
● Speech Processing
● Text to Speech (TTS)
● Automatic Speech Recognition (ASR)
● Speech to Speech Translation
Text Processing
● Processing raw text
● Morphological Analysis
– Running --> run + ing
● POS Tagging
– Ram/NNP goes/VBZ to/TO school/NN ..
● Stemming
– running --> run
● Parsing
– Identifying sentence structure
S --> NP + VP
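
A minimal sketch of these text-processing steps in Python using NLTK (the toolkit choice is an assumption; the slides do not prescribe one). It tokenizes and POS-tags the example sentence and stems "running"; it needs nltk plus the 'punkt' and 'averaged_perceptron_tagger' resources.

import nltk
from nltk.stem import PorterStemmer

# POS tagging: each token gets a Penn Treebank tag,
# e.g. Ram/NNP goes/VBZ to/TO school/NN
tokens = nltk.word_tokenize("Ram goes to school")
print(nltk.pos_tag(tokens))

# Stemming: running --> run
stemmer = PorterStemmer()
print(stemmer.stem("running"))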
Text Processing
● Machine Translation
– Translating content in one natural language to
another natural language
– Example: translating an English sentence to
Malay with the help of software.
Speech Processing
● Text to speech
– Converting electronic text to digital speech
● Automatic Speech Recognition
– Automatic transcription of spoken content to
electronic text
● Speech to speech translation
– Translating spoken content from one language to
another in real time or offline.
MAJOR Areas of Research &
Development: Industrial Applications
● Search Engines
● Advanced Text Editors
● Commercial Machine Translation Systems
● Information Extraction
● Collaborative filtering
● Translation Memories
● Computational Advertising
● Fraud Detection
● Sentiment Analysis
● Opinion Mining .........
Document classification
Information extraction
Sentiment analysis
Collaborative filtering
Search engines
Semantic web/search
NLP Pipeline
Why is text representation important?
Raw text preprocessing pipeline
Legacy text representation techniques
● One hot encoding
● Bag of words
● N-gram
● TF-IDF
One hot encoding
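
A minimal sketch of one-hot encoding in plain Python; the small vocabulary is made up for illustration. Each word maps to a |V|-dimensional vector containing a single 1.

# Build a word-to-index mapping over a toy vocabulary.
vocab = ["the", "cat", "sat", "on", "hat", "dog"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # |V|-dimensional vector with a 1 at the word's index.
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("cat"))  # [0, 1, 0, 0, 0, 0]
print(one_hot("dog"))  # [0, 0, 0, 0, 0, 1]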
Drawbacks of One Hot Encoding
● Size of input vector scales with size of vocabulary
– Must pre-determine vocabulary size.
– Cannot scale to large or infinite vocabularies (Zipf’s law!)
– Computationally expensive - large input vector results in far
too many parameters to learn.
● “Out-of-Vocabulary” (OOV) problem
– How would you handle unseen words in the test set?
– One solution is to have an “UNKNOWN” symbol that
represents low-frequency or unseen words
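
A minimal sketch of the “UNKNOWN” symbol idea: any word outside the vocabulary is mapped to a shared <UNK> index (the vocabulary here is made up for illustration).

# Reserve index 0 for the unknown-word symbol.
vocab = ["<UNK>", "the", "cat", "sat", "on", "hat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def index_of(word):
    # Words never seen during vocabulary building fall back to <UNK>.
    return word_to_index.get(word, word_to_index["<UNK>"])

print(index_of("cat"))           # 2
print(index_of("refrigerator"))  # 0  (out of vocabulary -> <UNK>)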
Drawbacks of One Hot Encoding
● No relationship between words
– Each word is an independent unit vector
● D(“cat”, “refrigerator”) = D(“cat”, “cats”)
● D(“spoon”, “knife”) = D(“spoon”, “dog”)
● These vectors are sparse
– Vulnerable to overfitting: most computations go to zero,
so the resulting loss function has very few parameters to update.
Bag of Words
● Vocab = set of all the words in corpus
● Document = Words in document w.r.t vocab
with multiplicity
– Sentence 1: "The cat sat on the hat"
– Sentence 2: "The dog ate the cat and the hat”
– Vocab = { the, cat, sat, on, hat, dog, ate, and }
– Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
– Sentence 2 : { 3, 1, 0, 0, 1, 1, 1, 1}
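
A minimal plain-Python sketch that reproduces the two bag-of-words vectors above, using the same vocabulary order as the slide.

vocab = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]

def bag_of_words(sentence):
    # Count how often each vocabulary word occurs in the sentence.
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocab]

print(bag_of_words("The cat sat on the hat"))           # [2, 1, 1, 1, 1, 0, 0, 0]
print(bag_of_words("The dog ate the cat and the hat"))  # [3, 1, 0, 0, 1, 1, 1, 1]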
Bag of Words
● Pros
– Quick and simple
● Cons
– Too simple
– Orderless
– No notion of syntactic/semantic similarity
N-gram
● Vocab = set of all n-grams in corpus
● Document = n-grams in document w.r.t vocab with
multiplicity
– For bigram:
– Sentence 1: "The cat sat on the hat"
– Sentence 2: "The dog ate the cat and the hat”
– Vocab = { the cat, cat sat, sat on, on the, the hat, the dog, dog
ate, ate the, cat and, and the}
– Sentence 1: { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
– Sentence 2 : { 1, 0, 0, 0, 1, 1, 1, 1, 1, 1}
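
A minimal plain-Python sketch of bigram counting for the same two sentences, using the vocabulary order above.

def bigrams(sentence):
    # Pair each token with the one that follows it.
    tokens = sentence.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

vocab = ["the cat", "cat sat", "sat on", "on the", "the hat",
         "the dog", "dog ate", "ate the", "cat and", "and the"]

print([bigrams("The cat sat on the hat").count(b) for b in vocab])
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print([bigrams("The dog ate the cat and the hat").count(b) for b in vocab])
# [1, 0, 0, 0, 1, 1, 1, 1, 1, 1]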
N-gram
● Pros
– Tries to incorporate order of words
● Cons
– Very large vocab set
– No notion of syntactic/semantic similarity
Term Frequency–Inverse
Document Frequency (TF-IDF)

● Captures importance of a word to a document in a corpus.
● Importance increases proportionally to the number of times a word
appears in the document, but is inversely proportional to the
frequency of the word in the corpus.
● TF(t) = (Number of times term t appears in a document) /
(Total number of terms in the document)
● IDF(t) = log(Total number of documents / Number of
documents with term t in it)
● TF-IDF(t) = TF(t) * IDF(t)
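
A minimal plain-Python sketch of these TF and IDF definitions; the three-document corpus is made up for illustration.

import math

def tf(term, document):
    # (times term appears in document) / (total terms in document)
    tokens = document.lower().split()
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # log(total documents / documents containing the term)
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log10(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

corpus = ["The cat sat on the hat",
          "The dog ate the cat and the hat",
          "The dog chased the cat"]

print(tf_idf("the", corpus[0], corpus))  # 0.0   ("the" appears in every document)
print(tf_idf("sat", corpus[0], corpus))  # ~0.08 ("sat" appears in only one document)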
Example TF-IDF Model
One Hot Encoding

Each document is represented by a binary vector ∈ {0,1}^|V| !


Example TF-IDF Model
Term-document count matrices

Consider the number of occurrences of a
term in a document:

Each document is represented by a count vector ∈ ℕ^|V| !


Example TF-IDF Model
idf example, suppose N = 1 million
term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

idf_t = log10( N / df_t )


There is one idf value for each term t in a collection.
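
A quick Python check of the idf values in the table, using the same idf_t = log10(N/df_t) definition with N = 1,000,000.

import math

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(term, math.log10(N / df))   # 6, 4, 3, 2, 1, 0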
Example TF-IDF Model

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      5.25                  3.18           0            0       0        0.35
Brutus      1.21                  6.1            0            1       0        0
Caesar      8.59                  2.54           0            1.51    0.25     0
Calpurnia   0                     1.54           0            0       0        0
Cleopatra   2.85                  0              0            0       0        0
mercy       1.51                  0              1.9          0.12    5.25     0.88
worser      1.37                  0              0.11         4.15    0.25     1.95

TF-IDF(t) = TF(t) * IDF(t)

Each document is now represented by a real-valued
vector of tf-idf weights ∈ R^|V|
TF-IDF Model

● Pros
– Easy to compute
– Has some basic metric to extract the most descriptive
terms in a document
– Thus, can easily compute the similarity between 2
documents using it
● Cons
– Based on the bag-of-words (BoW) model, therefore it does
not capture position in text, semantics, co-occurrences in
different documents, etc.
– Thus, TF-IDF is only useful as a lexical level feature
– Cannot capture semantics (unlike topic models, word
embeddings)
Why distance is a bad idea
The Euclidean distance between q and d2 is large even though the
distribution of terms in the query q and the distribution of terms
in the document d2 are very similar.
Cosine similarity illustrated

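
A minimal plain-Python sketch contrasting the two measures: a vector with the same term distribution as the query but much larger counts is far away in Euclidean distance, yet has cosine similarity 1 (the counts are made up for illustration).

import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

q  = [1, 2, 0, 3]      # query term counts
d2 = [10, 20, 0, 30]   # same distribution of terms, much longer document

print(euclidean(q, d2))  # ~33.7  -- looks "far"
print(cosine(q, d2))     # 1.0    -- identical direction, i.e. very similar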
How to describe word meaning?

● One-hot encoding, bag of words, N-grams, and the
TF-IDF model are about document representation
● How do we represent a single word?
Python Exercise
