
What is Parsing?

• Parsing breaks a sentence down into its grammatical components, helping us understand its meaning by identifying its structure and the relationships between words.
• It forms the foundation for many NLP tasks like machine translation,
text summarization, and question answering.
• Parsing can help identify grammatical errors or ambiguities in a
sentence, allowing us to provide better feedback or corrections.

Parsing: Example
• Sentence: "The quick brown fox jumps over the lazy dog."
• Tokenization:
• First, we break down the sentence into individual words, called tokens:
• "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog".
• Parts of Speech (POS) Tagging:
• Next, we label each word with its part of speech:
• "The (Determiner)", "quick (Adjective)", "brown (Adjective)", "fox (Noun)",
"jumps (Verb)", "over (Preposition)", "the (Determiner)", "lazy (Adjective)",
"dog (Noun)".

Parsing: Example (Syntactic Parsing)

• Dependency Parsing:
• Now comes the fun part! Dependency parsing helps us understand the
relationships between words in the sentence. For example:
• "fox" is the subject of "jumps".
• "jumps" is connected to "over" by the relation "prep", indicating the
prepositional phrase "over the lazy dog".
• Constituency Parsing:
• Constituency parsing breaks the sentence down into its constituent phrases.
For example:
• The sentence can be parsed into phrases like "the quick brown fox" (noun
phrase) and "jumps over the lazy dog" (verb phrase).
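A minimal sketch of both analyses with spaCy: the dependency parse is built in, while full constituency parsing needs an add-on such as benepar (not shown); doc.noun_chunks gives a flat approximation of the noun phrases.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Dependency parsing: every token points at its head via a labeled relation.
for token in doc:
    print(token.text, token.dep_, "->", token.head.text)
# e.g. fox nsubj -> jumps   (fox is the subject of jumps)

# Noun chunks approximate the noun-phrase constituents.
print([chunk.text for chunk in doc.noun_chunks])
# ['The quick brown fox', 'the lazy dog']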

Parsing: Grammar
• Natural languages evolved by and for human communication. They're not in a form that can be easily processed or understood by computers.
• Therefore, natural language parsing is really about finding the
underlying structure given an input of text. In some sense, it's the
opposite of templating, where you start with a structure and then fill
in the data.
• With parsing, you figure out the structure from the data.

Parsing
• Natural languages follow certain rules of grammar. This helps the
parser extract the structure. Formally, we can define parsing as:
• the process of determining whether a string of tokens can be generated by a
grammar.
• There are a number of parsing algorithms. Statistical methods have
been popular since the 1980s. By the late 2010s, neural networks
were being increasingly used.
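The formal definition can be made concrete with NLTK; a small sketch where the toy grammar and sentence are made up for illustration. The parser yields at least one tree exactly when the grammar can generate the token string.

import nltk

# A toy grammar, made up for illustration.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'fox' | 'dog'
V -> 'sees'
""")

parser = nltk.ChartParser(grammar)
tokens = "the fox sees the dog".split()

# Parsing succeeds (yields at least one tree) exactly when the grammar
# can generate this string of tokens.
trees = list(parser.parse(tokens))
print(len(trees) > 0)  # True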

Parsing
• Text is a sequence of words. It can be represented simply as a "bag of words", but that representation discards word order and is of limited use (see the sketch below).
• We can obtain a more useful representation by making use of
syntactic structure. Such a structure exists because the sentence is
expected to follow rules of grammar.
• By extracting and representing this structure, we transform the
original plain input into something more useful for many downstream
NLP tasks. Beyond syntax, parsing is also about obtaining the
semantic structure.
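The limitation of the bag-of-words view is easy to demonstrate; a tiny sketch:

from collections import Counter

# Two sentences with opposite meanings...
a = "the dog chased the fox".split()
b = "the fox chased the dog".split()

# ...have exactly the same bag of words: order and structure are lost.
print(Counter(a) == Counter(b))  # True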

Parsing
• In a typical flow, input text goes into a lexical analyzer that produces individual tokens.
• These tokens are the input to a parser, which produces the syntactic
structure at the output.
• When this structure is graphically represented as a tree, it's called a
Parse Tree.
• A parse tree can be simplified into an intermediate representation
called Abstract Syntax Tree (AST).
• Structure can be represented either as a phrase structure tree or in
labeled bracketed notation.
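NLTK's Tree can convert between the two representations; a small sketch, with illustrative bracketing and tags:

from nltk import Tree

# Labeled bracketed notation for "The fox jumps" (illustrative tags).
bracketed = "(S (NP (DT The) (NN fox)) (VP (VBZ jumps)))"

# The same structure as a phrase structure tree.
tree = Tree.fromstring(bracketed)
tree.pretty_print()  # draws the tree in ASCII
print(tree)          # prints the labeled bracketed form back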

What are the common difficulties in natural language parsing?

Cont…
• Breaking an input into sentences is the first challenge.
• The input could be formatted as tables or may contain page breaks. While
punctuation is useful for this task, punctuation in abbreviations (such as "e.g." or "U.S.")
can cause problems.
• Given an input, a parser should be able to pick out the main phrases.
This is not a solved problem. A more difficult problem is to obtain the
correct semantic relationships and understand the context of
discussion.
• Word embeddings such as word2vec operate at word level. This work needs
to be extended to phrases.
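The abbreviation problem is easy to demonstrate; a small sketch comparing a naive split with spaCy's trained sentence segmenter (output may vary by model version):

import spacy

text = "The U.S. economy grew. Analysts were surprised."

# Naive approach: split on periods. "U.S." is broken apart.
print(text.split(". "))

# A trained segmenter uses more than punctuation to find boundaries.
nlp = spacy.load("en_core_web_sm")
print([sent.text for sent in nlp(text).sents])
# Expected: two sentences, with "U.S." left intact.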

Cont…
• Annotated corpora and neural network models are often about
newswire content. Applying them to specific domains such as
medicine is problematic.
• Unlike parsing computer languages, parsing natural languages is more
challenging because there's often ambiguity in human
communication.
• A well-known example is "I shot an elephant in my pajamas." Was I or was the
elephant wearing my pajamas? Humans also use sarcasm, colloquial phrases,
idioms, and metaphors. They may also communicate with grammatical or
spelling errors.

Compare constituency parsing with dependency parsing

Constituency Parsing and Dependency
Parsing
• Constituency parsing and dependency parsing are respectively based
on Phrase Structure Grammar (PSG) and Dependency Grammar (DG).
• Dependency parsing in particular is known to be useful in many NLP
applications.
• PSG breaks a sentence into its constituents or phrases. These phrases
are in turn broken into more phrases. Thus, the parse tree is
recursive. On the other hand, DG is not recursive, implying that
phrasal nodes are not produced. Rather, it identifies a network of
relations. Two lexical items are asymmetrically related. One of them is
the dependent word, the other is the head or governing word.
Relations are labeled.

Syntactic parsing
• Syntactic parsing deals with a sentence’s grammatical structure. It involves
looking at the sentence to determine parts of speech, sentence boundaries, and
word relationships. The two most common approaches are as follows:

• Constituency Parsing: Constituency Parsing builds parse trees that break down a
sentence into its constituents, such as noun phrases and verb phrases. It displays
a sentence’s hierarchical structure, demonstrating how words are arranged into
bigger grammatical units.
• Dependency Parsing: Dependency parsing depicts grammatical links between
words by constructing a tree structure in which each word in the sentence is
dependent on another. It is frequently used in tasks such as information
extraction and machine translation because it focuses on word relationships such
as subject-verb-object relations.
Semantic Parsing
• Semantic parsing goes beyond syntactic structure to extract a
sentence’s meaning or semantics. It attempts to understand the roles
of words in the context of a certain task and how they interact with
one another. Semantic parsing is utilized in a variety of NLP
applications, such as question answering, knowledge base population,
and text understanding. It is essential for activities requiring the
extraction of actionable information from text.

Shallow Parsing
• Constituency parsing is complex. Traditionally, such full parsing was not robust when the input was noisy.
• Some researchers therefore proposed partial parsing where completeness
and depth of analysis were sacrificed for efficiency and reliability. Chunking
or shallow parsing is a basic task in partial parsing.
• Chunking breaks up a sentence into syntactic constituents called chunks.
Thus, each chunk can be one or more adjacent tokens. Unlike full parsing,
chunks are not further analyzed. Chunking is thus non-recursive and fast.
Chunks alone can be useful for other NLP tasks such as named entity
recognition, text mining or terminology discovery. Chunks can also be a
useful input to a dependency parser.
Shallow Parsing
• POS tagging tags the words but doesn't bring out the syntactic
structure. Chunking can be seen as being somewhere between POS
tagging and full parsing. Chunking can include the POS tags.
Perceptron, SVM and bidirectional MEMM are some algorithms used
for chunking.
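Those are learned approaches; for illustration, a simple rule-based chunker over POS tags can be written with NLTK's RegexpParser (the tag pattern is illustrative):

import nltk

# POS-tagged input (as produced by a tagger such as nltk.pos_tag).
tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"),
          ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

# One chunk rule: an NP is an optional determiner, any adjectives, a noun.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
print(chunker.parse(tagged))
# (S (NP the/DT quick/JJ brown/JJ fox/NN) jumps/VBZ over/IN
#    (NP the/DT lazy/JJ dog/NN))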

Main Approaches to Text Parsing
• Parsing is really a search problem. The search space of possible parse trees is defined by a grammar. An example grammar rule is "VP → VP NP". Broadly, there are two parsing strategies:
• Top Down: Goal-driven. Starts from the root node and expands to the next
level of nodes using the grammar. Checks for left-hand side match of
grammar rules. Repeat this until we reach the POS tags at the leaves. Trees
that don't match the input are removed.
• Bottom Up: Data-driven. Starts from the input sequence of words and their
POS tags. Builds the tree upwards using the grammar. Checks for right-hand
side match of grammar rules.
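Both strategies are available off the shelf in NLTK; a small sketch on a toy grammar (grammar and sentence are made up for illustration):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'fox' | 'dog'
V -> 'sees'
""")
tokens = "the fox sees the dog".split()

# Top down: expand from S, matching left-hand sides of rules.
for tree in nltk.RecursiveDescentParser(grammar).parse(tokens):
    print(tree)

# Bottom up: build upward from the words, matching right-hand sides.
for tree in nltk.ShiftReduceParser(grammar).parse(tokens):
    print(tree)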

Main Approaches to Text Parsing
• While bottom-up can waste time searching trees that don't lead to the root sentence node, it's grounded in the input and therefore never suggests trees that don't match the input. These pros and cons are reversed with top-down.
• A recursive descent parser is top-down.
• A shift-reduce parser is bottom-up.
• A recursive descent parser can't handle left-recursive productions; a left-corner parser is a hybrid that solves this problem (see the sketch below).
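The left-recursion problem is easy to see: with a rule like NP -> NP PP, a top-down parser keeps expanding NP into NP PP without ever consuming input. The toy grammar below is illustrative; the recursive descent call is left commented out because it would not terminate.

import nltk

# "NP -> NP PP" is left-recursive: the rule's right-hand side starts
# with its own left-hand side. Grammar is made up for illustration.
grammar = nltk.CFG.fromstring("""
S -> NP V
NP -> NP PP | 'fox'
PP -> 'nearby'
V -> 'jumps'
""")
tokens = "fox nearby jumps".split()

# A recursive descent (top-down) parser would loop forever here:
# list(nltk.RecursiveDescentParser(grammar).parse(tokens))  # never returns

# A chart parser handles left recursion without trouble.
for tree in nltk.ChartParser(grammar).parse(tokens):
    print(tree)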

How Does the Parser Work?
• The first step is to identify the sentence's constituents. The parser divides the text sequence into groups of words that belong together; each such group of related words forms a phrase.
• Syntactic parsing and parts of speech are based on context-free grammar structures, which depend on the structure or arrangement of words, not on their context.
• The most important thing to remember is that a sentence can be syntactically valid even if it makes no contextual sense.

Probabilistic Parsing
• Probabilistic parsing is a powerful technique that combines principles
of probability theory with parsing algorithms to analyze and
understand the structure of sentences in a probabilistic manner.
• Probabilistic parsing is based on the idea that there can be multiple
valid interpretations or parses for a given sentence.
• Instead of just finding a single "best" parse, probabilistic parsing
assigns probabilities to different parses, representing the likelihood
that each parse is correct given the input sentence. This allows us to
capture the uncertainty inherent in natural language and choose the
most probable parse among all possible interpretations.

26
Probabilistic Parsing: Example
• Let's consider the sentence: "The man saw the dog with the telescope."
• Ambiguity:
• This sentence is ambiguous because it can be interpreted in multiple ways:
• Did the man see the dog using the telescope?
• Did the man see the dog that was holding the telescope?
• Probabilistic Parsing:
• Probabilistic parsing assigns probabilities to different parse trees representing these
interpretations. For example:
• Parse 1: "The man saw [the dog] [with the telescope]." (Telescope used by the man)
• Parse 2: "The man saw [the dog with the telescope]." (Telescope carried by the dog)
• The probabilistic parser would assign higher probabilities to more likely
parses, such as Parse 1, based on context and linguistic knowledge.
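A sketch of this idea with NLTK's PCFG and Viterbi parser; the rule probabilities below are made up for illustration:

import nltk

# Toy PCFG; the probabilities are made up. Rules sharing a left-hand
# side must have probabilities summing to 1.
grammar = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [0.7] | Det N PP [0.3]
VP -> V NP [0.5] | V NP PP [0.5]
PP -> P NP [1.0]
Det -> 'the' [1.0]
N -> 'man' [0.4] | 'dog' [0.4] | 'telescope' [0.2]
V -> 'saw' [1.0]
P -> 'with' [1.0]
""")

# The Viterbi parser returns the single most probable parse tree.
parser = nltk.ViterbiParser(grammar)
tokens = "the man saw the dog with the telescope".split()
for tree in parser.parse(tokens):
    print(tree, tree.prob())
# With these made-up numbers, Parse 1 (PP attached to the verb) wins.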
Sequence Labelling
• Sequence labeling is a fundamental task in Natural Language
Processing (NLP) where the goal is to assign a label to each token in a
sequence of tokens. This task is commonly used for various text
analysis tasks, such as named entity recognition (NER), part-of-speech
(POS) tagging, chunking, and sentiment analysis.
• Sequence labeling involves assigning a categorical label to each token
in a sequence of tokens. The sequence can be a sentence, paragraph,
document, or any other text segment. The labels can represent
different linguistic or semantic properties of the tokens, such as
named entities, grammatical categories, or semantic roles.

Sequence Labelling
• Let's consider the sentence: "John likes to play soccer."
• In sequence labeling, we might want to assign labels to each word in the sentence. For example:
• "John" → Person (named entity)
• "likes" → Verb (part-of-speech)
• "to" → Preposition (part-of-speech)
• "play" → Verb (part-of-speech)
• "soccer" → Activity (named entity)

Sequence Labelling
• Named Entity Recognition (NER):
• Identifying and classifying named entities (e.g., persons, organizations, locations) in
text.
• Part-of-Speech (POS) Tagging:
• Assigning grammatical categories (e.g., noun, verb, adjective) to words in a sentence.
• Chunking:
• Identifying syntactic phrases or "chunks" (e.g., noun phrases, verb phrases) in text.
• Sentiment Analysis:
• Assigning sentiment labels (e.g., positive, negative, neutral) to words or phrases in a
sentence.
• Information Extraction:
• Extracting structured information (e.g., relations, events) from unstructured text
data.

PCFG
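• A Probabilistic Context-Free Grammar (PCFG) is a context-free grammar in which every production rule carries a probability; the probabilities of all rules sharing the same left-hand side sum to 1.
• The probability of a parse tree is the product of the probabilities of the rules used in its derivation, so a PCFG parser can rank candidate parses and return the most probable one, as in the probabilistic parsing example above.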

Kaggle-based Text Classification
Assignment
• Link:
• https://www.kaggle.com/code/poonaml/text-classification-using-spacy

