CITS4012 Lecture02 PDF
NLP Pipeline
CITS4012 Natural Language Processing
wei.liu@uwa.edu.au
Computer Science and Software Engineering
The University of Western Australia
Sentences
What’s a sentence?
Sentence
A sentence may consist of one or more clauses, each of which consists of
a verb with arguments and adjuncts. Clauses may be adjuncts and
arguments themselves.
Recursion
Recursion is a striking feature of language, referring to the ability to
create nested constructions of constituents, which can be extended by
additional constituents.
James, who recently won the lottery, bought the house that was
just on the market.
Structural Ambiguity
[Parse-tree figure showing constituents NP-SBJ, VP, ADJP and MD; e.g. “a non-executive director” as an NP]
Parse Tree
Phrases
Phrases are groups of words that introduce an additional level of abstraction
in a sentence, which allows us to define relationships between word groups.
Context-Free Grammar
In NLP, we are
mostly concerned with computational methods to deal with language, and
less with how the human mind uses language.
So in terms of grammar, we are interested in formalisms that allow us to
define all possible English sentences, but
rule out impossible word combinations.
S → NP VP
NP → NNP ADJP
NP → NNP NNP
VP → VB NP PP NP
VB → join
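Productions like these can be checked mechanically. The sketch below is a toy top-down recogniser in plain Python; the GRAMMAR dictionary and the derives function are illustrative assumptions, not part of the lecture's code, and the grammar is deliberately simplified.

```python
# Toy CFG: each nonterminal maps to its possible right-hand sides.
# Symbols not in the dictionary are terminals (POS tags here).
GRAMMAR = {
    'S':    [['NP', 'VP']],
    'NP':   [['NNP', 'ADJP'], ['NNP', 'NNP'], ['DT', 'NN']],
    'VP':   [['VB', 'NP']],
    'ADJP': [['JJ']],
}

def derives(symbols, tags):
    """True if the symbol sequence can derive exactly this tag sequence."""
    if not symbols:
        return not tags
    head, rest = symbols[0], symbols[1:]
    if head not in GRAMMAR:  # terminal: must match the next tag
        return bool(tags) and tags[0] == head and derives(rest, tags[1:])
    return any(derives(rhs + rest, tags) for rhs in GRAMMAR[head])

# "Pierre Vinken joined the board" as (simplified) tags: NNP NNP VB DT NN
print(derives(['S'], ['NNP', 'NNP', 'VB', 'DT', 'NN']))  # True
print(derives(['S'], ['VB', 'NNP']))                     # False
```

A real parser would also recover the tree, not just accept or reject; this sketch only shows that a CFG cleanly separates grammatical from impossible tag sequences.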
Extending CFGs
Dependency Structure
In a dependency structure, we only display the relationships of words to
each other, without any reference to phrases or phrase structure.
The head word is explicit in a dependency structure; a parse tree or
syntax tree lacks this information.
A syntax tree labels each constituent, e.g. “the nonexecutive director” is
an NP.
Syntax trees preserve the ordering of the words and have more inherent
structure.
Both may be extended with additional information.
A/Prof. Wei Liu UWA Lecture 2 March 10, 2022 12 / 47
Sentences Document Level Concepts Corpus spaCy for NLP Take-Aways References
code/spacy-parsing.py
[Dependency visualisation (sentence.svg) with arc labels such as npadvmod, nsubj, dobj, det:]
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
PROPN PROPN NUM NOUN ADJ VERB VERB DET NOUN SCONJ DET ADJ NOUN PROPN NUM
https://spacy.io/usage/linguistic-features#pos-tagging
Picture credit: Centre for Linguistics and Philology, Oxford University.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have flown to LA. Now I am flying to Frisco.')
for token in doc:
    if token.ent_type != 0:
        print(token.text, token.ent_type_)
code/spacy-ner.py
code/spacy-intent.py
flown nsubj I
flown aux have
flown ROOT flown
flown prep to
to pobj LA
flown punct .
flying advmod Now
flying nsubj I
flying aux am
flying ROOT flying
flying prep to
to pobj Frisco
flying punct .
code/spacy-intent.py
This example shows how to create a list of potential keywords for each
sentence based on specific dependency labels assigned to the tokens.
['flown', 'LA']
['flying', 'Frisco']
Co-references or Anaphora
The first mention of an entity is typically fleshed out (e.g. the
46th President Joe Biden); later, it may be referred to only by a
pronoun (he) or an abbreviated description (the president).
spaCy NeuralCoref
https://huggingface.co/coref/
neuralcoref v4.0
"""
neuralcoref v4.0 only works with spacy==2.1.0 and python 3.7
$ conda create -n neuralcoref python=3.7
$ pip install spacy==2.1.0
$ pip install neuralcoref
$ python -m spacy download en
"""
import spacy
nlp = spacy.load('en_core_web_sm')

# Add neural coref to spaCy's pipe
import neuralcoref
neuralcoref.add_to_pipe(nlp)

doc = nlp(u'My sister has a dog. She loves him.')
doc._.has_coref
doc._.coref_clusters
code/neural-coref.py
True
[My sister: [My sister, She], a dog: [a dog, him]]
A/Prof. Wei Liu UWA Lecture 2 March 10, 2022 21 / 47
Sentences Document Level Concepts Corpus spaCy for NLP Take-Aways References
Topic
Topics are the general subject matter of a text or a document.
In a sports article, the English word bat will be translated differently
than in an article about cave animals.
It may be helpful to detect the topics of a document and use them to
help translation and disambiguation.
Topic Modelling
Topic modelling is another NLP task that treats topics as latent
variables which can be learned from word distributions.
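As a concrete illustration, here is a minimal topic-modelling sketch using scikit-learn's LatentDirichletAllocation. The toolkit choice, the tiny corpus, and the two-topic setting are assumptions for illustration; the lecture itself does not prescribe an implementation.

```python
# Topic-modelling sketch with scikit-learn's LDA (a toolkit assumption;
# the lecture itself does not prescribe an implementation).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    'the batter swung the bat at the ball',
    'the bat hung upside down in the cave',
    'the pitcher threw the ball to the batter',
    'bats use echolocation inside dark caves',
]

# Topics are latent variables: LDA infers them from word counts alone.
counts = CountVectorizer(stop_words='english').fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each document receives a probability distribution over the two topics.
doc_topics = lda.transform(counts)
print(doc_topics.shape)  # (4, 2)
```

The per-document topic distribution is exactly the kind of signal that could feed into downstream translation or disambiguation decisions.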
Corpus
The text-corpus method, namely corpus linguistics, uses bodies of text
written in a natural language to derive the set of abstract rules that
govern that language. Typical usage:
Explore the relationships between the subject language and other
languages which have undergone a similar analysis.
Compile dictionaries and grammar guides.
A landmark in modern corpus linguistics was the publication of
Computational Analysis of Present-Day American English in 1967
[Kučera and Francis].
The work was based on an analysis of the Brown Corpus, a
contemporary compilation of about a million American English
words, carefully selected from a wide variety of sources.
Modality of Texts
Objects in spaCy
A container object groups multiple elements into a single unit. It can be
a collection of objects, like tokens or sentences, or a set of annotations
related to a single object. Container objects include Doc, Token, Span,
Doc.sents and Doc.noun_chunks.
Pipeline components are objects that process the text input to create
containers and fill them with relevant data, such as a part-of-speech
tagger (POS), a dependency parser and an entity recogniser (NER).
code/spacy-doc.py
Hello World!
spacy.tokens.doc.Doc
The Doc() constructor requires two parameters:
a vocab object, which is a storage container that provides vocabulary
data, such as lexical types (adjective, verb, noun ...);
a list of tokens to add to the Doc object being created.
Doc container
Token objects
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I want a green apple.')
# token_text1 and token_text2 produce the same results
token_text1 = [token.text for token in doc]
token_text2 = [doc[i].text for i in range(len(doc))]
code/spacy-token.py

The tokens are indexed starting with 0, which makes the length of the
document minus 1 the index of the end position. To shred the Doc instance
into tokens, you derive the tokens into a Python list by iterating over the
Doc from the start token to the end token:
>>> [doc[i] for i in range(len(doc))]
[A, severe, storm, hit, the, beach, .]

It’s worth noting that we can create a Doc object using its constructor
explicitly, as illustrated in the following example:
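The original example did not survive extraction; below is a minimal sketch consistent with the two parameters described above (a vocab and a list of tokens). The use of spacy.blank('en') and the spaces flags are assumptions for illustration.

```python
# Sketch: constructing a Doc explicitly from a vocab and a token list.
import spacy
from spacy.tokens import Doc

nlp = spacy.blank('en')          # a blank pipeline; we only need its vocab
words = ['Hello', 'World', '!']  # the tokens to add to the Doc
spaces = [True, False, False]    # is each token followed by a space?
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)   # Hello World!
print(type(doc))  # <class 'spacy.tokens.doc.Doc'>
```

Because the Doc is built directly, no pipeline component runs: the tokens carry no POS tags or entity labels until components are applied.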
I want a green apple.

Token.lefts, Token.rights and Token.children

The diagram in Figure 3-2 highlights the syntactic dependencies of
interest: the leftward children of token “apple”.

To obtain the leftward syntactic children of the word “apple” in this
sample sentence programmatically, we might use the following code:
>>> doc = nlp(u'I want a green apple.')
>>> [w for w in doc[4].lefts]
[a, green]
In this script, we simply iterate through the apple’s children, outputting
them in a list.

It’s interesting to note that in this example, the leftward syntactic
children of the word “apple” represent the entire sequence of the token’s
syntactic children. In practice, this means that we might replace Token.lefts
with Token.children, which finds all of a token’s syntactic children:
>>> [w for w in doc[4].children]

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u'I want a green apple.')
print([t for t in doc[4].lefts])
print([t for t in doc[4].children])
print([t for t in doc[1].rights])
code/spacy-children.py

[a, green]
[a, green]
[apple, .]
code/spacy-vocab.py
3197928453018144401
'coffee'
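The two lines above are the output of spaCy's string store: every string is interned as a 64-bit hash, and the mapping works in both directions. A minimal sketch of the round trip (assuming the standard StringStore API):

```python
# spaCy interns strings as 64-bit hashes in a StringStore;
# the lookup works in both directions.
from spacy.strings import StringStore

stringstore = StringStore(['coffee'])
coffee_hash = stringstore['coffee']
print(coffee_hash)               # 3197928453018144401
print(stringstore[coffee_hash])  # coffee
```

Storing hashes instead of strings keeps Doc, Token and Span objects compact; the underscore-suffixed attributes (e.g. ent_type_) do this hash-to-string lookup for you.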
code/spacy-span-snippet.py
code/spacy-pattern-matcher.py
Span: We can overtake
The positions in the doc are: 0 - 3
Rule-based Matching
Steps for using the Matcher class:
1 Create a Matcher instance by passing in a shared Vocab object;
2 Specify the pattern as a list of token-attribute dictionaries;
3 Add the pattern to the Matcher object;
4 Input a Doc object to the matcher;
5 Go through each match (match_id, start, end).
We have just seen a Dependency Matcher; spaCy supports more rule-based
matching:
Token Matcher: regex, and patterns such as [{"LOWER": "hello"},
{"IS_PUNCT": True}, {"LOWER": "world"}]
Phrase Matcher: PhraseMatcher class
Entity Ruler
Combining models with rules
https://spacy.io/usage/rule-based-matching#matcher
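The Token Matcher pattern above can be run without any trained statistical model, since it matches on lexical attributes only. A minimal sketch (using the spaCy v3 matcher.add(name, [pattern]) signature; spaCy v2 instead uses matcher.add(name, None, pattern)):

```python
# Token Matcher sketch: needs only a tokenizer, no trained model.
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)                 # 1. shared Vocab
pattern = [{'LOWER': 'hello'},               # 2. token-attribute pattern
           {'IS_PUNCT': True},
           {'LOWER': 'world'}]
matcher.add('HelloWorld', [pattern])         # 3. register the pattern
doc = nlp(u'Hello, world! Hello world!')     # 4. input a Doc
for match_id, start, end in matcher(doc):    # 5. iterate over matches
    print(doc[start:end].text)               # Hello, world
```

Only the first occurrence matches: "Hello world!" has no punctuation token between the two words, so the three-token pattern fails there.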
doc.sents
spaCy’s Doc object represents a text, which may contain one or
more sentences.
doc.sents is a generator object. You can iterate over it with a for
loop, but you cannot index into it.
Each member of the generator object is a Span of type
spacy.tokens.span.Span.
doc = nlp(u'A storm hit the beach. It started to rain.')
for sent in doc.sents:
    print(type(sent))
# Sentence level index
[sent[i] for i in range(len(sent))]
# Doc level index
[doc[i] for i in range(len(doc))]
code/spacy-sents-snippet.py
code/nltk-ie.py
(S A/DT storm/NN hit/VBD the/DT beach/NN in/IN (GPE Perth/NNP)
./.)
(S It/PRP started/VBD to/TO rain/VB ./.)
How does spaCy create these containers and fill them with relevant data?
>>> nlp.pipe_names
['tagger', 'parser', 'ner']
nlp = spacy.load('en_core_web_sm',
disable=['parser'])
Or, you can disable it after the nlp object is created:
nlp.disable_pipes('tagger')
nlp.disable_pipes('parser')
import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'I need a taxi to Cottesloe.')
for ent in doc.ents:
    print(ent.text, ent.label_)
code/spacy-customise-org.py
Cottesloe ORG
If we would like to introduce a new entity type SUBURB for
Cottesloe and other suburb names, how should we inform the
NER component about it?
import spacy
nlp = spacy.load('en_core_web_sm')

# Specify new label and training data
LABEL = 'SUBURB'
TRAIN_DATA = [('I need a taxi to Cottesloe',
               {'entities': [(17, 26, 'SUBURB')]}),
              ('I like red oranges', {'entities': []})]

# Add new label to the ner pipe
ner = nlp.get_pipe('ner')
ner.add_label(LABEL)

# Disable other two default pipes
nlp.disable_pipes('tagger')
nlp.disable_pipes('parser')

# Train
optimizer = nlp.entity.create_optimizer()
import random
for i in range(25):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
# Test
code/spacy-customised-ner.py
Take-Aways
References