CITS4012 Lecture02 PDF


NLP Pipeline
CITS4012 Natural Language Processing

A/Prof. Wei Liu

wei.liu@uwa.edu.au
Computer Science and Software Engineering
The University of Western Australia

March 10, 2022


What we are going to cover today


1 Sentences
Sentence Structure
Theories of Grammar

2 Document Level Concepts


Discourse
Co-reference resolution
Topic Modelling

3 Corpus

4 spaCy for NLP


Container Objects
Pipeline Components

5 Take-Aways

[see Statistical Machine Translation, Chapter 2]


Sentences


What’s a sentence

Sentence
A sentence may consist of one or more clauses, each of which consists of a verb with arguments and adjuncts. Clauses may themselves be arguments or adjuncts.

Jane bought the house.

The verb bought is the central element of the sentence.
It requires a buyer (the subject Jane) and a thing bought (the object the house).
Verbs may require different numbers of objects (e.g. Jane gave Joe a book.). How many and what kinds of objects a verb requires is called the valency of the verb.
Some verbs require none; these are called intransitive verbs.
Objects that are required by a verb are also called arguments.
Additional information may be added to a sentence in the form of adjuncts:
a prepositional phrase: from Jim, without hesitation;
adverbs such as yesterday, cheaply.

Recursion in Natural Languages

Recursion
Recursion is a striking feature of language, referring to the ability to create nested constructions of constituents, which can be extended by additional constituents.

the house → the (very) beautiful house

the house → the house in the posh neighbourhood (across the river)

Jane, who recently won the lottery, bought the house that was just on the market.


Structural Ambiguity

Prepositional phrase attachment and connectives introduce ambiguity for automatic NLP.

Jim eats steak with ketchup.
Joe eats steak with a knife.
Jane watches the man with the telescope.
Jim washes the dishes and watches TV with Jane.

How does a computer know that steak and knife do not make a tasty meal?
Does Jane use a telescope to watch the man, or does the man have it?
Is Jane helping with the dishes or is she just joining for TV?


Parse Tree of a sentence

[Figure: phrase-structure parse tree of "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.", with the sentence node branching into a subject noun phrase (NP-SBJ), a modal (MD: will) and a verb phrase (VP), and further constituents labelled NP, ADJP, PP-CLR and NP-TMP down to the word leaves.]



Parse Tree

Parse trees illustrate the recursive nature of the grammatical structure of a language.
Like trees in nature, the initial stem branches out recursively until we reach the leaves (words, the terminal nodes).
Unlike natural trees, syntactic trees grow from top to bottom.
The root is the sentence node, which branches out into the Subject Noun Phrase (NP-SBJ) and the Verb Phrase (VP).
After picking out the modal (will), the VP is further broken down into the main verb, its object and adjuncts.

Phrase Structure Grammar

From the parse tree, we can see that phrases provide the basis for talking about the levels in between the sentence root node (at the top) and the word leaf nodes (at the bottom):
noun phrases (NP)
prepositional phrases (PP)
verb phrases (VP)
adjective phrases (ADJP)

Phrases
Phrases are groups of words that introduce an additional level of abstraction in a sentence, which allows us to define relationships between word groups.

The concepts of subject and object refer to phrasal units, not single words.

Context-Free Grammar
In NLP, we are
mostly concerned with computational methods to deal with language, and
less with how the human mind uses language.
So in terms of grammar, we are interested in formalisms that allow us to
define all possible English sentences, but
rule out impossible word combinations.

Context Free Grammar (CFG)
A CFG consists of a set of nonterminal symbols (part-of-speech tags and phrase categories), a set of terminal symbols (words), and production rules such as:

S → NP VP
NP → NNP ADJP
NP → NNP NNP
VP → VB NP PP NP
VB → join
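
As a minimal sketch of how such rules become executable, here is a toy CFG in NLTK (illustrative only; it parses the earlier example Jane bought the house, not the Vinken sentence):

import nltk

# A toy CFG: S rewrites to NP VP, NPs are a proper noun or DT NN, etc.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> NNP | DT NN
VP -> VBD NP
NNP -> 'Jane'
VBD -> 'bought'
DT -> 'the'
NN -> 'house'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse('Jane bought the house'.split()):
    tree.pretty_print()  # draws the parse tree as ASCII art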


Extending CFGs

The formalism of context-free grammars can be extended in many ways.

Probabilistic Context Free Grammars (PCFG)
To assess the likelihood of different syntactic structures for a given sentence, we add probabilities to the rules; the result is referred to as a PCFG.

For instance, the structural ambiguity that arises from ambiguous sentences such as

Jane watches the man with the telescope.

can be resolved by assigning a probability distribution to the different syntactic structures of a sentence.
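
A minimal sketch of this idea with NLTK's PCFG and Viterbi parser (the toy probabilities below are hand-picked for illustration, not learned from data):

import nltk
from nltk.parse import ViterbiParser

# The grammar licenses both attachments of "with the telescope";
# the Viterbi parser returns the single most probable parse.
grammar = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
VP -> V NP [0.6] | VP PP [0.4]
NP -> DT N [0.5] | 'Jane' [0.3] | NP PP [0.2]
PP -> P NP [1.0]
DT -> 'the' [1.0]
N -> 'man' [0.5] | 'telescope' [0.5]
V -> 'watches' [1.0]
P -> 'with' [1.0]
""")

parser = ViterbiParser(grammar)
sent = 'Jane watches the man with the telescope'.split()
for tree in parser.parse(sent):
    print(tree)  # the most probable parse, annotated with its probability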


Dependency Structure

[Figure: dependency structure of "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.", with will at the root and each word attached directly to its head word (e.g. nonexecutive to director, 61 years old to Vinken).]

In a dependency structure, we only display the relationships of words to each other, without any reference to phrases or phrase structure.
The head word is explicit in a dependency structure; a parse tree (or syntax tree) lacks this information.
A syntax tree labels each constituent, e.g. the nonexecutive director is an NP.
A syntax tree preserves the ordering of the words and has more inherent structure.
Both may be extended with additional information.

Dependency Parsing using spaCy

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_,
          token.dep_, token.shape_, token.is_alpha, token.is_stop)

# Visualising the parse tree
displacy.render(doc, style='dep')

code/spacy–parsing.py

[Figure: displaCy rendering of the dependency parse, with labelled arcs (compound, npadvmod, nummod, nsubj, aux, det, dobj, prep, amod, pobj) over the tokens and their coarse POS tags (PROPN, NUM, NOUN, ADJ, VERB, DET, SCONJ).]

https://spacy.io/usage/linguistic-features#pos-tagging

Dependency Labels in spaCy

A syntactic dependency label describes the type of syntactic relation between two words in a sentence:
the syntactic governor - head or parent
the dependent - child

Dependency label   Description
acomp              Adjectival complement
amod               Adjectival modifier
aux                Auxiliary
compound           Compound
dative             Dative
det                Determiner
dobj               Direct object
nsubj              Nominal subject
pobj               Object of preposition
ROOT               Root
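
As a quick illustration of governor, label and dependent together, a minimal sketch (the sentence is a shortened form of the earlier example):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Pierre Vinken will join the board.')
for token in doc:
    # governor (head) --label--> dependent (child)
    print(token.head.text, '--' + token.dep_ + '-->', token.text)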


Using Dependency Labels for Question Answering

[Figure: how a question answering system works]


Lexical Functional Grammar (LFG)

Lexical Functional Grammar (LFG) draws a distinction between the surface structure of language and an underlying deep structure, which is more closely related to the expressed meaning, by having two representations of a sentence:
constituent structure (c-structure), and
functional structure (f-structure).

[Figure: c-structure and f-structure examples. Picture credit to the Centre for Linguistics and Philology, University of Oxford.]
An f-structure of the example sentence is shown below:

PRED  ‘join ⟨SUBJ, OBJ⟩’
TENSE past
SUBJ  [ PRED ‘pierre-vinken’
        ADJ  [ PRED ‘old’
               ADJ [ PRED ‘61 years’ ] ] ]
OBJ   [ PRED ‘board’
        DEF  + ]
ADJ   [ PRED ‘Nov. 29’ ]


Named Entity Recognition

Named Entity
A named entity is a real object that you can refer to by a proper name. It can be a person, organization, location, or other entity.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have flown to LA. Now I am flying to Frisco.')
for token in doc:
    if token.ent_type != 0:
        print(token.text, token.ent_type_)

code/spacy–ner.py

If the ent_type attribute of a token is not set to 0, then the token is a named entity.

LA GPE
Frisco GPE

GPE stands for “geopolitical entity” and includes countries, cities, states, and other place names.

Document Level Concepts


Discourse analysis - Detecting Intention

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have flown to LA. Now I am flying to Frisco.')
# The head property of a token object refers to the
# syntactic head of this token
for token in doc:
    print(token.head.text, token.dep_, token.text)
# ROOT with pobj indicates intent in this case
for sent in doc.sents:
    print([token.text for token in sent
           if token.dep_ == 'ROOT' or token.dep_ == 'pobj'])

code/spacy–intent.py

flown nsubj I
flown aux have
flown ROOT flown
flown prep to
to pobj LA
flown punct .
flying advmod Now
flying nsubj I
flying aux am
flying ROOT flying
flying prep to
to pobj Frisco
flying punct .


This example shows how to create a list of potential keywords for each
sentence based on specific dependency labels assigned to the tokens.

['flown', 'LA']
['flying', 'Frisco']


Co-references or Anaphora
The first mention of an entity is typically fleshed out (e.g. the 46th President Joe Biden); it may later be referred to only by a pronoun (he) or an abbreviated description (the president).

A later mention may not carry sufficient information on its own, so we need to backtrack to previous mentions. This is one of the core NLP tasks, called anaphora resolution (a.k.a. co-reference resolution).

spaCy NeuralCoref

https://huggingface.co/coref/


neuralcoref v4.0

"""
neuralcoref v4.0 only works with spacy==2.1.0 and Python 3.7
$ conda create -n neuralcoref python=3.7
$ pip install spacy==2.1.0
$ pip install neuralcoref
$ python -m spacy download en
"""
import spacy
nlp = spacy.load('en_core_web_sm')

# Add neural coref to spaCy's pipe
import neuralcoref
neuralcoref.add_to_pipe(nlp)

doc = nlp(u'My sister has a dog. She loves him.')
doc._.has_coref
doc._.coref_clusters

code/neural–coref.py

True
[My sister: [My sister, She], a dog: [a dog, him]]

Topic
Topics are the general subject matters of a text or a document.
In a sports article, the English word bat will be translated differently than in an article about cave animals.
It may be helpful to detect the topics of a document and use this to help translation and disambiguation.

Topic Modelling
Topic modelling is another NLP task, one that treats topics as latent variables that can be learned through word distributions.

Topic modelling is often used as an alternative to clustering - grouping words of the same "latent" topic together.
Popular techniques (to be covered in Lecture 5) include:
Non-negative matrix factorisation
Latent Dirichlet Allocation
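
As a minimal sketch of the idea (using scikit-learn, assumed available in version 1.0 or later; the toy corpus and topic count are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus with two latent topics (sport vs. caves), both using "bat"
docs = ["the batter swung the bat at the ball",
        "cricket players oil the bat before a match",
        "bats roost in dark caves during the day",
        "the cave bat hunts insects at night"]

vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Show the top words of each learned topic (word distributions per topic)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-4:][::-1]
    print('topic', k, [terms[i] for i in top])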


Corpus


Collections of Texts - Corpora

Many NLP systems, such as statistical machine translation systems, are trained on large collections of texts.

Corpora
A corpus (plural: corpora) is a collection of texts or documents.

The text-corpus method, namely corpus linguistics, uses bodies of text written in a natural language to derive the set of abstract rules which govern that language. Typical usage:
Explore the relationships between the subject language and other languages which have undergone a similar analysis.
Compile dictionaries and grammar guides.

A landmark in modern corpus linguistics was the publication of Computational Analysis of Present-Day American English in 1967 [Kučera and Francis].
The work was based on an analysis of the Brown Corpus, a contemporary compilation of about a million American English words, carefully selected from a wide variety of sources.
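
The Brown Corpus ships with NLTK, so it is easy to take a quick look (a minimal sketch; the download step is only needed once):

import nltk
nltk.download('brown')  # one-off download of the corpus data
from nltk.corpus import brown

print(brown.categories()[:5])               # text genres, e.g. 'news'
print(len(brown.words()))                   # on the order of a million words
print(brown.words(categories='news')[:10])  # first tokens of the news genre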

Domain and Topic of Texts

Domain: A system that works excellently on scientific neuroscience articles may perform poorly on online chats between teenagers.
It is very challenging to create general-purpose NLP systems.
A restricted domain makes it much easier to confine and contextualise the meanings of words.

Topic: A training corpus from computer science may be a bad source for training an NLP system that is to work on civil engineering or geology texts.
Take machine translation for example: much of the available translated text comes from international organizations, such as the United Nations or the European Union.
The European Parliament proceedings cover many political, economic, and cultural matters, but may still not be a good source to learn to translate texts in a specialized scientific domain.

Modality of Texts

Modality: Natural language comes in both written and spoken forms, and the modality of communication matters.
Spoken language is typically transcribed (either manually or through an automatic speech recognition system) into text, to make use of a textual system for meaning extraction.
Transcription implies possible polishing, such as removing restarts and filler words (I really believe, um, believe that we should do this.).
Spoken language is different from written text. It is often ungrammatical, full of unfinished sentences and (especially in the case of automatic speech recognition of dialogues) reliant on gestures and mutually understood knowledge.
Much of this is also true for informal uses of written text, such as Internet chat, email, and text messages.


spaCy for NLP


Objects in spaCy

spaCy objects fall into two groups:
containers: Doc, Token, Span (and views such as Doc.sents and Doc.noun_chunks)
pipeline components: e.g. the part-of-speech tagger and the named entity recogniser

A container object groups multiple elements into a single unit. It can be a collection of objects, like tokens or sentences, or a set of annotations related to a single object.

Pipeline components are objects that process the text input to create containers and fill them with relevant data, such as a part-of-speech tagger, a dependency parser and an entity recogniser.


Container Objects - Doc

from spacy.tokens.doc import Doc
from spacy.vocab import Vocab

"""
create a spacy.tokens.doc.Doc object
using its constructor
"""
doc = Doc(Vocab(), words=[u'Hello', u'World!'])
print(doc)
print(type(doc))

code/spacy–doc.py

Hello World!
<class 'spacy.tokens.doc.Doc'>

The Doc() constructor requires two parameters:
a vocab object, which is a storage container that provides vocabulary data, such as lexical types (adjective, verb, noun, ...);
a list of tokens to add to the Doc object being created.

Container Objects - Token

From a user's standpoint, Token, Span and Doc represent a token, a phrase or sentence, and a text, respectively. A container can contain other containers (for example, a Doc contains Tokens).

Getting the Index of a Token in a Doc Object
spaCy's Token object is a container for a set of annotations related to a single token, such as that token's part of speech. A Doc object contains the collection of Token objects generated by the tokenization performed on a submitted text. These tokens have indices, allowing you to access them based on their positions in the text.

[Figure 3-1: the tokens in a Doc object. For "I want a green apple." the Doc container holds tokens at indices [0]-[4], annotated PRON, VERB, DET, ADJ, NOUN.]

Tokens are indexed starting with 0, which makes the length of the document minus 1 the index of the last token. The two list comprehensions below produce the same result, one iterating over the tokens directly and one accessing them by index:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I want a green apple.')
# token_text1 and token_text2 produce the same results
token_text1 = [token.text for token in doc]
token_text2 = [doc[i].text for i in range(len(doc))]

code/spacy–token.py

Token.lefts, Token.rights and Token.children

[Figure 3-2: an example of leftward syntactic dependencies. In "I want a green apple." (PRON VERB DET ADJ NOUN), the tokens a and green are the leftward children of the token apple.]

To obtain the leftward syntactic children of the word "apple" in this sample sentence programmatically:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u'I want a green apple.')
print([t for t in doc[4].lefts])     # leftward children of "apple"
print([t for t in doc[4].children])  # all syntactic children of "apple"
print([t for t in doc[1].rights])    # rightward children of "want"

code/spacy–children.py

[a, green]
[a, green]
[apple, .]

It is interesting to note that in this example, the leftward syntactic children of "apple" represent the entire sequence of the token's syntactic children. In practice, this means that we might replace Token.lefts with Token.children, which finds all of a token's syntactic children.

Container Objects - Vocab

Internally, spaCy only “speaks” in hash values.
Whenever possible, spaCy tries to store data in a vocabulary, the Vocab storage class, which is shared by multiple documents.
To save memory, spaCy also encodes all strings to hash values. For example, “coffee” has the hash 3197928453018144401.
Entity labels like “ORG” and part-of-speech tags like “VERB” are also encoded.

https://spacy.io/usage/spacy-101#vocab


spaCy Code Demo for the Vocab class

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('I love coffee!')
for token in doc:
    lexeme = doc.vocab[token.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_,
          lexeme.prefix_, lexeme.suffix_, lexeme.is_alpha,
          lexeme.is_digit, lexeme.is_title, lexeme.lang_)

print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'

code/spacy–vocab.py

text   hash                 shape prefix suffix alpha digit title lang
I      4690420944186131903  X     I      I      True  False True  en
love   3702023516439754181  xxxx  l      ove    True  False False en
coffee 3197928453018144401  xxxx  c      fee    True  False False en
!      17494803046312582752 !     !      !      False False False en

3197928453018144401
'coffee'

Container Objects - Span

spaCy's Span object is a container that represents an arbitrary set of neighbouring tokens in the document, which could be an n-gram, a phrase, a noun_chunk, or a sentence.

A Span can be obtained as simply as doc[start:end], where start and end are the indices of the starting token and the ending token, respectively. The two indices can be
manually specified; or
computed through pattern matching.

doc[start:end] is a span; a noun phrase chunk is a span; an n-gram is a span.

span = Span(doc, start, end, label=match_id)

code/spacy–span–snippet.py
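
A minimal sketch of the manual case (the sentence is just for illustration):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'The Golden Gate Bridge is an iconic landmark.')
span = doc[1:4]   # manually specified token indices
print(span.text)  # Golden Gate Bridge
print(type(span)) # <class 'spacy.tokens.span.Span'>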


spaCy’s Pattern Matcher

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span, Token

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# A dependency label pattern that matches a word sequence
pattern = [{"DEP": "nsubj"}, {"DEP": "aux"}, {"DEP": "ROOT"}]
matcher.add("NsubjAuxRoot", [pattern])
doc = nlp(u"We can overtake them.")
# 1. Return (match_id, start, end) tuples
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print("Span: ", span.text)
    print("The positions in the doc are: ", start, "-", end)
# 2. Return Span objects directly
matches = matcher(doc, as_spans=True)
for span in matches:
    print(span.text, span.label_)

code/spacy–pattern–matcher.py

Span: We can overtake
The positions in the doc are: 0 - 3

Rule-based Matching
Steps for using the Matcher class:
1 Create a Matcher instance by passing in a shared Vocab object;
2 Specify the pattern as a list of dependency labels;
3 Add the pattern to the Matcher object;
4 Input a Doc object to the matcher;
5 Go through each match ⟨match_id, start, end⟩.
We have just seen a dependency matcher; spaCy offers more rule-based matching support:
Token Matcher: regex, and patterns such as [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
Phrase Matcher: PhraseMatcher class
Entity Ruler
Combining models with rules
https://spacy.io/usage/rule-based-matching#matcher
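
A minimal PhraseMatcher sketch (the suburb names are illustrative):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
# Match any of these phrases, case-insensitively
patterns = [nlp.make_doc(name) for name in ('Cottesloe', 'Subiaco')]
matcher.add('SUBURB', patterns)

doc = nlp(u'I need a taxi from Subiaco to Cottesloe.')
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)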


doc.noun_chunks and Retokenising

Noun Chunks
A noun chunk is a phrase that has a noun as its head.

doc = nlp(u'The Golden Gate Bridge is an iconic landmark in San Francisco.')
# Retokenize to treat each noun_chunk as a single token
for chunk in doc.noun_chunks:
    with doc.retokenize() as retokenizer:
        retokenizer.merge(chunk)
for token in doc:
    print(token)

code/spacy–noun–chunk.py

The Golden Gate Bridge
is
an iconic landmark
in
San Francisco
.

Exercise: define a function to extract noun phrases based on syntactic dependency parsing.
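
One possible sketch for this exercise (treating a noun's syntactic subtree as its phrase; the helper name noun_phrases is our own):

import spacy

nlp = spacy.load('en_core_web_sm')

def noun_phrases(doc):
    """Collect the subtree of each non-compound noun as a phrase."""
    phrases = []
    for token in doc:
        if token.pos_ in ('NOUN', 'PROPN') and token.dep_ != 'compound':
            subtree = list(token.subtree)
            # Slice the doc from the first to the last subtree token
            phrases.append(doc[subtree[0].i : subtree[-1].i + 1].text)
    return phrases

doc = nlp(u'The Golden Gate Bridge is an iconic landmark in San Francisco.')
print(noun_phrases(doc))  # note: nested phrases may appear more than once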

doc.sents
spaCy's Doc object represents a text, which may contain one or more sentences.
doc.sents is a generator object. You can iterate over it with a for loop, but you cannot index into it directly.
Each member of the generator is a Span of type spacy.tokens.span.Span.

doc = nlp(u'A storm hit the beach. It started to rain.')
for sent in doc.sents:
    print(type(sent))
    # Sentence level index
    [sent[i] for i in range(len(sent))]
# Doc level index
[doc[i] for i in range(len(doc))]

code/spacy–sents–snippet.py

Exercises:
1 Print the second sentence if it begins with a pronoun.
2 How many sentences end with a verb?
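
One possible solution sketch for the two exercises (reusing the doc from the snippet above; sentence-final punctuation is skipped when checking for a verb):

sents = list(doc.sents)

# Exercise 1: print the second sentence if it begins with a pronoun
if sents[1][0].pos_ == 'PRON':
    print(sents[1].text)

# Exercise 2: count sentences whose last non-punctuation token is a verb
count = 0
for sent in sents:
    tokens = [t for t in sent if not t.is_punct]
    if tokens and tokens[-1].pos_ == 'VERB':
        count += 1
print(count)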

Text-Processing Pipeline - Traditional in NLTK

Using Python @coroutine to create information extraction pipelines

[Figure: information extraction in NLTK]


An NLTK Pipeline Example

import nltk
nltk.download('punkt')  # Sentence Tokenize
nltk.download('averaged_perceptron_tagger')  # POS Tagging
nltk.download('maxent_ne_chunker')  # Named Entity Chunking
nltk.download('words')  # Word Tokenize

# texts is a collection of documents.
# Here is a single document with two sentences.
texts = [u"A storm hit the beach in Perth. It started to rain."]
for text in texts:
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        tagged_words = nltk.pos_tag(words)
        ne_tagged_words = nltk.ne_chunk(tagged_words)
        print(ne_tagged_words)

code/nltk–ie.py

(S A/DT storm/NN hit/VBD the/DT beach/NN in/IN (GPE Perth/NNP) ./.)
(S It/PRP started/VBD to/TO rain/VB ./.)


NLP Pipeline in spaCy

Recall that spaCy's container objects represent linguistic units, such as a text (i.e. document), a sentence and an individual token, with linguistic features already extracted for them.

How does spaCy create these containers and fill them with relevant data?

Processing Pipeline Components
A spaCy pipeline (v2.x) includes, by default, a part-of-speech tagger, a dependency parser and an entity recognizer:

>>> nlp.pipe_names
['tagger', 'parser', 'ner']


Disabling Pipeline Components

spaCy allows you to load a selected set of pipeline components, disabling those that aren't necessary.
You can do this when creating an nlp object by setting the disable parameter:

nlp = spacy.load('en_core_web_sm', disable=['parser'])

Or, you can disable components after the nlp object is created:

nlp.disable_pipes('tagger')
nlp.disable_pipes('parser')
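
disable_pipes can also be used as a context manager (spaCy v2.x behaviour), so the disabled components are restored afterwards; a minimal sketch:

import spacy

nlp = spacy.load('en_core_web_sm')
with nlp.disable_pipes('tagger', 'parser'):
    doc = nlp(u'I need a taxi to Cottesloe.')  # only the remaining pipes run
print(nlp.pipe_names)  # all components active again after the block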


Customising a NLP pipeline

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'I need a taxi to Cottesloe.')
for ent in doc.ents:
    print(ent.text, ent.label_)

code/spacy–customise–org.py

Cottesloe ORG

If we would like to introduce a new entity type SUBURB for Cottesloe and other suburb names, how should we inform the NER component about it?


Steps of Customising a spaCy NER pipe

1 Create a training example to show the entity recognizer, so it will learn what to apply the SUBURB label to;
2 Add a new label called SUBURB to the list of supported entity types;
3 Disable the other pipes to ensure that only the entity recogniser will be updated during training;
4 Start training;
5 Test your new NER pipe;
6 Serialise the pipe to disk;
7 Load the customised NER.


import spacy
nlp = spacy.load('en_core_web_sm')

# Specify new label and training data
LABEL = 'SUBURB'
TRAIN_DATA = [('I need a taxi to Cottesloe',
               {'entities': [(17, 26, 'SUBURB')]}),
              ('I like red oranges', {'entities': []})]

# Add new label to the ner pipe
ner = nlp.get_pipe('ner')
ner.add_label(LABEL)

# Disable the other two default pipes
nlp.disable_pipes('tagger')
nlp.disable_pipes('parser')

# Train
optimizer = nlp.entity.create_optimizer()
import random
for i in range(25):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
# Test

doc = nlp(u'I need a taxi to Cottesloe')
for ent in doc.ents:
    print(ent.text, ent.label_)  # Cottesloe SUBURB

# Serialize the pipe to disk
ner.to_disk('C:\\Users\\00051693\\CITS4012')  # Windows path

# Load spaCy without NER
import spacy
from spacy.pipeline import EntityRecognizer
nlp = spacy.load('en_core_web_sm', disable=['ner'])

# Load the ner pipe from disk
ner = EntityRecognizer(nlp.vocab)
ner.from_disk('C:\\Users\\00051693\\CITS4012')
# Add pipe to nlp
nlp.add_pipe(ner)

# Test
doc = nlp(u'I need a taxi to Western Australia')
for ent in doc.ents:
    print(ent.text, ent.label_)  # Western SUBURB

code/spacy–customised–ner.py


Take-Aways


The components of a language

Phonology: Science of language sounds
Morphology: Science of word form structure
Lexicon: Listing of analysed words
Syntax: Science of composing word forms
Semantics: Science of literal meaning
Pragmatics: Science of using language expressions


References

[1] Philipp Koehn. Statistical Machine Translation. Cambridge University Press, Cambridge, 2010. ISBN 9780521874151.
