
Lecture 1: Introduction to NLP

Lecture Objectives:

• Students will be able to understand Natural Language Processing techniques.
• Basic concepts and applications.

CSC-441: Natural Language Processing


Marks Distribution
• Quizzes = 4 (each of 2.5 marks)
• Assignments
– Programming = 3 (each of 2 marks)
– Reading/writing = 3 (each of 2 marks)
– Term Report = 1 (of 8 marks)
• Midterm
• Final Term

Text: Speech and Language Processing, 3rd Edition, by Daniel Jurafsky and James H. Martin.

References:
1- Foundations of Statistical Natural Language Processing by Chris Manning and Hinrich Schütze
2- Neural Network Methods for Natural Language Processing, 1st Edition, by Yoav Goldberg
3- Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper
What is NLP?
NLP is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.
Neighboring Areas
[Diagram: NLP (or Computational Linguistics) shown among its neighboring areas: AI, CS, ML, DS, PoL, TM, and Linguistics.]
Why NLP is Important
• Getting computers to perform useful tasks
involving human languages:
o Enabling human-machine communication
o Improving human-human communication
o Processing and analyzing language objects
Applied NLP
The goal of applied NLP and NLU (NL Understanding) is to process and
make use of information from a large corpus of text with very little manual
intervention.
•Email filters
•Smart assistants
•Search results
•Predictive text
•Language translation
•Digital phone calls
•Data analysis
•Text analytics
•A communication device for people with disabilities.
•Question Answering
•Machine Translation
•Spoken Conversational Agents
Forms of Natural Language
• The input/output of an NLP system can be:
– written text
– speech
• We will mostly be concerned with written text (not
speech).
• To process written text, we need:
– lexical, syntactic, semantic knowledge about the language
– discourse information, real world knowledge
• To process spoken language, we need everything required
to process written text, plus the challenges of
speech recognition and speech synthesis.
Syntax, Semantics, Pragmatics
• Syntax concerns the proper ordering of words and its
effect on meaning.
– The dog bit the boy.
– The boy bit the dog.
• Semantics concerns the (literal) meaning of words,
phrases, and sentences.
– “plant” as a photosynthetic organism
– “plant” as a manufacturing facility
– “plant” as the act of sowing
• Pragmatics (Pragmatic means practical or logical)
concerns the overall communicative and social
context and its effect on interpretation.

– If you eat all of that food, it will make you bigger!
– Will you crack open the door? I am getting hot.
Ambiguity
• Natural language is highly ambiguous and must be disambiguated.
• I saw the man on the hill with a telescope.
• I saw the bird while flying.
• Time flies like an arrow.
• Horse flies like a sugar cube.
• I made her duck.
Possible interpretations of “I made her duck”:
– I cooked waterfowl for her benefit (to eat)
– I cooked waterfowl belonging to her
– I created the (plaster?) duck she owns
– I caused her to quickly lower her head or body
Ambiguity
• Phonetics!
– I mate or duck
– I’m eight or duck
– Eye maid; her duck
– Aye mate, her duck
– I maid her duck
– I’m aid her duck
– I mate her duck
– I’m ate her duck
– I’m ate or duck
– I mate or duck

05/12/22 10
Models and Algorithms
• Models: formalisms used to capture the various kinds of
linguistic structure.
– State machines (FSAs, regular expressions, Markov models)
– Formal rule systems (context-free grammars, feature systems)
– Logic (predicate calculus, inference)
– Probabilistic versions of all of these + others (Gaussian mixture
models, probabilistic relational models, etc.)
• Algorithms used to manipulate representations to create
structure.
– Search (A*, dynamic programming)
– Supervised learning, etc
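As an illustration of the "state machines" entry above, here is a minimal sketch of one tiny language written two ways: as a regular expression and as an explicit finite-state automaton. The toy "sheep language" (b, then two or more a's, then !) and all names in the code are assumptions chosen for the example, not part of the lecture.

```python
import re

# Regular-expression form of the toy language: "b", two or more "a", then "!".
SHEEP_RE = re.compile(r"baa+!")

# The same language as an explicit finite-state automaton.
# TRANSITIONS maps (current state, input symbol) to the next state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # self-loop: any number of additional a's
    (3, "!"): 4,
}
ACCEPTING = {4}


def fsa_accepts(text):
    """Run the automaton over text; accept only if it ends in a final state."""
    state = 0
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition defined: reject immediately
            return False
    return state in ACCEPTING


if __name__ == "__main__":
    for candidate in ["baa!", "baaaa!", "ba!", "baa"]:
        print(candidate, fsa_accepts(candidate), bool(SHEEP_RE.fullmatch(candidate)))
```

Both forms should agree on every input, which is the point of the sketch: a regular expression and a finite-state automaton are two notations for the same kind of model.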

Language, Thought, Understanding
• Turing Test
• The question “Can a machine think?” is not operational.
• Operational version:
– 2 people and a computer
– Interrogator talks to contestant and computer via teletype
– Task of machine is to convince interrogator it is human
– Task of contestant is to convince interrogator that she, and not
the machine, is human.

Humor and Ambiguity
• Many jokes rely on the ambiguity of language:

– She criticized my apartment, so I knocked her flat.
– Noah took all of the animals on the ark in pairs.
Except the worms, they came in apples.
– Policeman to little boy: “We are looking for a thief with
a bicycle.” Little boy: “Wouldn’t you be better using
your eyes?”
– Why is the teacher wearing sunglasses? Because the
class is so bright.
Natural Languages vs. Computer Languages
• Ambiguity is the primary difference between natural
and computer languages.
• Formal programming languages are designed to be
unambiguous, i.e. they can be defined by a grammar
that produces a unique parse for each sentence in
the language.
• Programming languages are also designed for
efficient (deterministic) parsing, i.e. they are
deterministic context-free languages (DCFLs).
– A sentence in a DCFL can be parsed in O(n) time where n
is the length of the string.
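To make the O(n) claim concrete, here is a minimal sketch of a deterministic, single-pass recognizer for balanced brackets, a standard example of a deterministic context-free language. The function name and the restriction to a bracket-only alphabet are assumptions made for the example.

```python
# Closing bracket -> the opening bracket it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def is_balanced(text):
    """Recognize balanced bracket strings in one left-to-right pass."""
    stack = []
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)  # remember the bracket we must close later
        elif ch in PAIRS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False  # symbol outside the toy alphabet
    return not stack  # every opener must have been closed


if __name__ == "__main__":
    for candidate in ["([]{})", "([)]", "(("]:
        print(candidate, is_balanced(candidate))
```

Each character is examined once and triggers at most one stack operation, so the running time grows linearly with the input length. At no point does the recognizer have to choose between alternative readings, which is exactly what natural language does not guarantee.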
Knowledge of Language
• Phonology – concerns how words are related to the sounds that
realize them.

• Morphology – concerns how words are constructed from more
basic meaning units called morphemes. A morpheme is the primitive
unit of meaning in a language (important).

• Syntax – concerns how words can be put together to form correct
sentences and determines what structural role each word plays in
the sentence and what phrases are subparts of other phrases.

• Semantics – concerns what words mean and how these meanings
combine in sentences to form sentence meaning. The study of
context-independent meaning.

Knowledge of Language (cont.)
• Pragmatics – concerns how sentences are used in different
situations and how use affects the interpretation of the
sentence.

• Discourse – concerns how the immediately preceding
sentences affect the interpretation of the next sentence. For
example, interpreting pronouns and interpreting the temporal
aspects of the information.

• World Knowledge – includes general knowledge about the
world. What each language user must know about the other’s
beliefs and goals.
Knowledge of Language

Dataset:
• Hahhahhaha
• Ye bat to manany wali h ap ki.
• Very well said, Irshad ;);D
• Seventy-seven days of friendship
• Beautiful humanity
• My son's US friends have left Paris to be with their
families. We had a long talk and decided that he'd
stay in Paris to finish his semester. He doesn't want
further disruption in his studies that have already
moved from regular classes to online ones. Solitude
he's fine with
Word Segmentation
• Breaking a string of characters (graphemes) into
a sequence of words.
• In some written languages (e.g. Chinese) words
are not separated by spaces.
• Even in English, characters other than whitespace
can be used to separate words [e.g. , ; . - : ( )]
• Examples from English URLs:
– jumptheshark.com → jump the shark .com
[Slide side label: Syntactic processing]
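The URL example can be sketched as a small dynamic-programming search over a word list. This is a minimal illustration, not a production segmenter: the lexicon below is a hypothetical toy set, and preferring the segmentation with the fewest words is just one simple heuristic.

```python
# Hypothetical toy lexicon; a real system would use a large word list.
LEXICON = {"jump", "the", "shark", "sh", "ark", "he"}


def segment(text, lexicon=LEXICON):
    """Return a segmentation of text into lexicon words, or None if impossible."""
    n = len(text)
    # best[i] holds the best segmentation found for text[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in lexicon:
                candidate = best[start] + [piece]
                # Prefer fewer (and therefore longer) words.
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


if __name__ == "__main__":
    print(segment("jumptheshark"))
```

With this lexicon the string also admits "jump the sh ark", so the fewest-words preference is what selects "jump the shark". A probabilistic model over word sequences would be the more usual way to rank competing segmentations.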
Morphology
• Morphology: words and their
composition
– cat, category
– child, children
– undo, union
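The word pairs above show why surface string rules are not enough: "undo" contains the prefix "un-", while "union" merely starts with the same letters. A minimal sketch of the contrast follows; the stem list is a hypothetical stand-in for a real lexicon.

```python
def naive_strip_un(word):
    """Strip a leading 'un' with no lexical knowledge (deliberately naive)."""
    return word[2:] if word.startswith("un") else word


# Hypothetical stem list standing in for a real lexicon.
KNOWN_STEMS = {"do", "tie", "lock"}


def strip_un_with_lexicon(word):
    """Strip 'un' only when what remains is a known stem."""
    if word.startswith("un") and word[2:] in KNOWN_STEMS:
        return word[2:]
    return word


if __name__ == "__main__":
    for w in ["undo", "union"]:
        print(w, naive_strip_un(w), strip_un_with_lexicon(w))
```

The naive rule turns "union" into "ion", while the lexicon-backed rule leaves it alone. Irregular forms such as child/children need stored exceptions rather than any affix rule.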
Part Of Speech (POS) Tagging
• Annotate each word in a sentence with
a part-of-speech.
I ate the spaghetti with meatballs.
Pro V Det N Prep N
John saw the saw and decided to take it to the table.
PN V Det N Con V Part V Pro Prep Det N

• Useful for subsequent syntactic parsing
and word sense disambiguation.
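A minimal sketch of how context resolves tag ambiguity in the second example sentence, using the tag abbreviations from the slide. The lexicon and the two hand-written rules are assumptions made for this illustration; real taggers learn such preferences from annotated data.

```python
# Toy lexicon: each word lists its possible tags, most likely first.
LEXICON = {
    "john": ["PN"], "saw": ["V", "N"], "the": ["Det"], "and": ["Con"],
    "decided": ["V"], "to": ["Part", "Prep"], "take": ["V"],
    "it": ["Pro"], "table": ["N"],
}


def tag(words):
    """Tag each word using the lexicon plus two simple context rules."""
    tags = []
    for i, word in enumerate(words):
        options = LEXICON.get(word.lower(), ["N"])  # unknown words default to N
        choice = options[0]
        prev_tag = tags[-1] if tags else None
        if i + 1 < len(words):
            next_options = LEXICON.get(words[i + 1].lower(), ["N"])
        else:
            next_options = []
        if "N" in options and prev_tag == "Det":
            choice = "N"  # rule 1: after a determiner, prefer the noun reading
        elif "Prep" in options and "Det" in next_options:
            choice = "Prep"  # rule 2: "to the ..." is prepositional
        tags.append(choice)
    return tags


if __name__ == "__main__":
    sentence = "John saw the saw and decided to take it to the table".split()
    print(" ".join(tag(sentence)))
```

The two occurrences of "saw" and of "to" receive different tags purely because of their neighbors, which is the core difficulty POS tagging has to solve.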
Phrase Chunking
• Find all non-recursive noun phrases (NPs)
and verb phrases (VPs) in a sentence.
– [NP I] [VP ate] [NP the spaghetti] [PP with]
[NP meatballs].
– [NP He ] [VP reckons ] [NP the current
account deficit ] [VP will narrow ] [PP to ] [NP
only # 1.8 billion ] [PP in ] [NP September ]
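The first chunking example can be reproduced with a small rule-based pass over POS-tagged words. This is a sketch under simple assumptions: the tags are the slide's abbreviations, a noun phrase is a pronoun, a proper noun, or an optional determiner followed by nouns, and every verb or preposition forms its own chunk.

```python
def chunk(tagged):
    """Group (word, tag) pairs into flat, non-recursive chunks."""
    chunks = []
    i = 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag in ("Pro", "PN"):
            chunks.append(("NP", [word]))
            i += 1
        elif tag in ("Det", "N"):
            words = []
            j = i
            if tagged[j][1] == "Det":  # optional determiner
                words.append(tagged[j][0])
                j += 1
            while j < len(tagged) and tagged[j][1] == "N":  # following nouns
                words.append(tagged[j][0])
                j += 1
            chunks.append(("NP", words))
            i = j
        elif tag == "V":
            chunks.append(("VP", [word]))
            i += 1
        elif tag == "Prep":
            chunks.append(("PP", [word]))
            i += 1
        else:
            chunks.append(("O", [word]))  # outside any chunk
            i += 1
    return chunks


if __name__ == "__main__":
    tagged = [("I", "Pro"), ("ate", "V"), ("the", "Det"),
              ("spaghetti", "N"), ("with", "Prep"), ("meatballs", "N")]
    print(" ".join("[%s %s]" % (label, " ".join(ws)) for label, ws in chunk(tagged)))
```

The output has the same bracket structure as the first example above. The second example would need extra rules (adjectives inside noun phrases, multi-word verb groups) that this sketch leaves out.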
Basic Process of NLU
[Pipeline diagram; each stage is listed with the knowledge source it uses]
Spoken input (for speech understanding)
→ Phonological / morphological analyser (phonological & morphological rules)
→ Sequence of words: “Cat hates rat.”
→ SYNTACTIC COMPONENT (grammatical knowledge)
→ Syntactic structure (parse tree) of “Cat hates rat”
→ SEMANTIC INTERPRETER (semantic rules, lexical semantics)
→ Logical form: ∃x hates(x, Cat)
→ CONTEXTUAL REASONER (pragmatic & world knowledge)
→ Meaning representation: hates(rat, Cat)
Applications
Information Extraction (IE)
• Identify phrases in language that refer to specific
types of entities and relations in text.
• Named entity recognition is the task of identifying
names of people, places, organizations, etc. in text.
(Entity types highlighted on the slide: people, organizations, places)
– Michael Dell is the CEO of Dell Computer Corporation
and lives in Austin Texas.
• Relation extraction identifies specific relations
between entities.
– Michael Dell is the CEO of Dell Computer Corporation
and lives in Austin Texas.
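The Michael Dell example can be sketched with hand-written patterns. This is a deliberately simplistic illustration: treating any run of capitalized words as an entity name is an assumption, the relation labels are invented for the example, and practical systems use trained models rather than two regular expressions.

```python
import re

# Assumption: an entity name is a run of capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

CEO_PATTERN = re.compile(rf"({NAME}) is the CEO of ({NAME})")
LIVES_PATTERN = re.compile(rf"({NAME}) .*?lives in ({NAME})")


def extract_relations(sentence):
    """Return (relation, subject, object) triples found by the two patterns."""
    relations = []
    ceo = CEO_PATTERN.search(sentence)
    if ceo:
        relations.append(("ceo_of", ceo.group(1), ceo.group(2)))
    lives = LIVES_PATTERN.search(sentence)
    if lives:
        relations.append(("lives_in", lives.group(1), lives.group(2)))
    return relations


if __name__ == "__main__":
    text = ("Michael Dell is the CEO of Dell Computer Corporation "
            "and lives in Austin Texas.")
    for triple in extract_relations(text):
        print(triple)
```

The named entities are the captured groups; the relations are the patterns that connect them. Patterns like these are brittle: a rephrasing such as "Dell Computer Corporation, led by Michael Dell" would be missed entirely.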

Question Answering
• Directly answer natural language
questions based on information presented
in a corpus of textual documents (e.g. the
web).
– When was Barack Obama born? (factoid)
• August 4, 1961
– Who was president when Barack Obama was
born?
• John F. Kennedy
– How many presidents have there been since
Barack Obama was born?
• 9
Reading Comprehension
• Read a passage of text and
answer questions about it.
• Example from Stanford
SQuAD dataset.

Text Summarization
• Produce a short summary of a longer document
or article.
– Article: With a split decision in the final two primaries and a flurry of
superdelegate endorsements, Sen. Barack Obama sealed the
Democratic presidential nomination last night after a grueling and
history-making campaign against Sen. Hillary Rodham Clinton that will
make him the first African American to head a major-party ticket. Before
a chanting and cheering audience in St. Paul, Minn., the first-term
senator from Illinois savored what once seemed an unlikely outcome to
the Democratic race with a nod to the marathon that was ending and to
what will be another hard-fought battle, against Sen. John McCain, the
presumptive Republican nominee….
– Summary: Senator Barack Obama was declared the
presumptive Democratic presidential nominee.
Machine Translation (MT)
• Translate a sentence from one natural
language to another.
– جب تک ہم ایک دوسرے کو دوبارہ نہیں دیکھیں دوست
Until we see each other again, friend.
Automatic Learning Approach
• Use machine learning methods to automatically
acquire the required knowledge from
appropriately annotated text corpora.
• Variously referred to as the “corpus-based,”
“statistical,” or “empirical” approach.
• Statistical learning methods were first applied to
speech recognition in the late 1970’s and
became the dominant approach in the 1980’s.
• During the 1990’s, the statistical training
approach expanded and came to dominate
almost all areas of NLP.
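As a minimal sketch of acquiring knowledge from annotated text, the code below learns each word's most frequent tag from a tiny hand-tagged corpus and then tags new text with it. The corpus, the function names, and the default tag for unknown words are all assumptions made for illustration; this is a common baseline idea, not the method of any particular system.

```python
from collections import Counter, defaultdict

# Hypothetical hand-annotated training corpus: sentences of (word, tag) pairs.
TRAINING = [
    [("the", "Det"), ("dog", "N"), ("bit", "V"), ("the", "Det"), ("boy", "N")],
    [("the", "Det"), ("boy", "N"), ("saw", "V"), ("the", "Det"), ("dog", "N")],
    [("john", "PN"), ("saw", "V"), ("the", "Det"), ("saw", "N")],
    [("the", "Det"), ("boy", "N"), ("saw", "V"), ("john", "PN")],
]


def train_tagger(corpus):
    """Count tags per word and keep the most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_with_model(words, model, default="N"):
    """Tag words with the learned model; unseen words get the default tag."""
    return [model.get(word, default) for word in words]


if __name__ == "__main__":
    model = train_tagger(TRAINING)
    print(tag_with_model(["the", "dog", "saw", "john"], model))
```

No tagging rule is written by hand here: the preference for "saw" as a verb comes entirely from its counts in the training data (three verb uses against one noun use).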
Learning Approach
[Diagram: Manually annotated training corpora → Machine learning → Linguistic knowledge → NLP system; Raw text → NLP system → Automatically annotated text]
Brief History of NLP
• 1940s –1950s: Foundations
– Development of formal language theory (Chomsky, Backus, Naur, Kleene)
– Probabilities and information theory (Shannon)
• 1957 – 1970s:
– Use of formal grammars as basis for natural language processing
(Chomsky, Kaplan)
– Use of logic and logic based programming (Minsky, Winograd, Colmerauer,
Kay)
• 1970s – 1983:
– Probabilistic methods for early speech recognition (Jelinek, Mercer)
– Discourse modeling (Grosz, Sidner, Hobbs)
• 1983 – 1993:
– Finite state models (morphology) (Kaplan, Kay)
• 1993 – present:
– Strong integration of different techniques, different areas.

BİL711 Natural Language Processing 31


Summary
NLP: Building computational models of natural language
comprehension and production
Other Names:
• Computational Linguistics (CL)
• Human Language Technology (HLT)
• Natural Language Engineering (NLE)
• Speech and Text Processing
• Processing natural language text involves a wide variety of
syntactic, semantic and pragmatic tasks in addition to
other problems.
