CITS4012 Lecture02 PDF


NLP Pipeline
CITS4012 Natural Language Processing

A/Prof. Wei Liu

wei.liu@uwa.edu.au
Computer Science and Software Engineering
The University of Western Australia

March 10, 2022


What we are going to cover today


1 Sentences
Sentence Structure
Theories of Grammar

2 Document Level Concepts


Discourse
Co-reference resolution
Topic Modelling

3 Corpus

4 spaCy for NLP


Container Objects
Pipeline Components

5 Take-Aways

[see Statistical Machine Translation, Chapter 2]


Sentences


What’s a sentence

Sentence
A sentence may consist of one or more clauses, each of which consists of a verb with arguments and adjuncts. Clauses may themselves be arguments or adjuncts.

Jane bought the house.

The verb bought is the central element of the sentence.
It requires a buyer (the subject Jane) and a thing bought (the object the house).
Verbs may require different numbers of objects (e.g. Jane gave Joe a book.). How many and what kinds of objects a verb requires is called the valency of the verb.
Some verbs require none; these are called intransitive verbs.
Objects that are required by a verb are also called arguments.
Additional information may be added to a sentence in the form of adjuncts:
a prepositional phrase: from Jim, without hesitation;
adverbs such as yesterday, cheaply.

Recursion in Natural Languages

Recursion
Recursion is a striking feature of language, referring to the ability to create nested constructions of constituents, which can be extended by additional constituents.

the house → the (very) beautiful house

the house → the house in the posh neighbourhood (across the river)

Jane, who recently won the lottery, bought the house that was just on the market.


Structural Ambiguity

Prepositional phrase attachment and connectives introduce ambiguity for automatic NLP.

Jim eats steak with ketchup.
Joe eats steak with a knife.
Jane watches the man with the telescope.
Jim washes the dishes and watches TV with Jane.

How does a computer know that steak and knife do not make a tasty meal?
Does Jane use a telescope to watch the man, or does the man have it?
Is Jane helping with the dishes or is she just joining for TV?


Parse Tree of a sentence

[Figure: phrase-structure parse tree of "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.", with the sentence node branching into a subject noun phrase (NP-SBJ), a modal (MD: will) and a verb phrase (VP), and further constituents labelled NP, ADJP, PP-CLR and NP-TMP down to the word leaves.]



Parse Tree

Parse trees illustrate the recursive nature of the grammatical structure of a language.
Like trees in nature, the initial stem branches out recursively until we reach the leaves (words, the terminal nodes).
Unlike natural trees, syntactic trees grow from top to bottom.
The root is the sentence node, which branches out into the Subject Noun Phrase (NP-SBJ) and the Verb Phrase (VP).
After picking out the modal (will), the VP is further broken down into the main verb, its object and adjuncts.

Phrase Structure Grammar

From the parse tree, we can see that phrases provide the basis for talking about the levels in between the sentence root node (at the top) and the word leaf nodes (at the bottom):
noun phrases (NP)
prepositional phrases (PP)
verb phrases (VP)
adjective phrases (ADJP)

Phrases
Phrases are groups of words that introduce an additional level of abstraction in a sentence, which allows us to define relationships between word groups.

The concepts of subject and object refer to phrasal units, not single words.

Context-Free Grammar
In NLP, we are
mostly concerned with computational methods to deal with language, and
less with how the human mind uses language.
So in terms of grammar, we are interested in formalisms that allow us to
define all possible English sentences, but
rule out impossible word combinations.

Context Free Grammar (CFG)
A CFG consists of a set of nonterminal symbols (part-of-speech tags and phrase categories), a set of terminal symbols (words), and production rules such as:

S → NP VP
NP → NNP ADJP
NP → NNP NNP
VP → VB NP PP NP
VB → join
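
As a minimal sketch of how such rules become executable, here is a toy CFG in NLTK (illustrative only; it parses the earlier example Jane bought the house, not the Vinken sentence):

import nltk

# A toy CFG: S rewrites to NP VP, NPs are a proper noun or DT NN, etc.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> NNP | DT NN
VP -> VBD NP
NNP -> 'Jane'
VBD -> 'bought'
DT -> 'the'
NN -> 'house'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse('Jane bought the house'.split()):
    tree.pretty_print()  # draws the parse tree as ASCII art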


Extending CFGs

The formalism of context-free grammars can be extended in many ways.

Probabilistic Context Free Grammars (PCFG)
To assess the likelihood of different syntactic structures for a given sentence, we add probabilities to the rules; the result is referred to as a PCFG.

For instance, the structural ambiguity that arises from ambiguous sentences such as

Jane watches the man with the telescope.

can be resolved by assigning a probability distribution to the different syntactic structures of a sentence.
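
A minimal sketch of this idea with NLTK's PCFG and Viterbi parser (the toy probabilities below are hand-picked for illustration, not learned from data):

import nltk
from nltk.parse import ViterbiParser

# The grammar licenses both attachments of "with the telescope";
# the Viterbi parser returns the single most probable parse.
grammar = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
VP -> V NP [0.6] | VP PP [0.4]
NP -> DT N [0.5] | 'Jane' [0.3] | NP PP [0.2]
PP -> P NP [1.0]
DT -> 'the' [1.0]
N -> 'man' [0.5] | 'telescope' [0.5]
V -> 'watches' [1.0]
P -> 'with' [1.0]
""")

parser = ViterbiParser(grammar)
sent = 'Jane watches the man with the telescope'.split()
for tree in parser.parse(sent):
    print(tree)  # the most probable parse, annotated with its probability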


Dependency Structure

[Figure: dependency structure of "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.", with will at the root and each word attached directly to its head word (e.g. nonexecutive to director, 61 years old to Vinken).]

In a dependency structure, we only display the relationships of words to each other, without any reference to phrases or phrase structure.
The head word is explicit in a dependency structure; a parse tree (or syntax tree) lacks this information.
A syntax tree labels each constituent, e.g. the nonexecutive director is an NP.
A syntax tree preserves the ordering of the words and has more inherent structure.
Both may be extended with additional information.

Dependency Parsing using spaCy

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_,
          token.dep_, token.shape_, token.is_alpha, token.is_stop)

# Visualising the parse tree
displacy.render(doc, style='dep')

code/spacy–parsing.py

[Figure: displaCy rendering of the dependency parse, with labelled arcs (compound, npadvmod, nummod, nsubj, aux, det, dobj, prep, amod, pobj) over the tokens and their coarse POS tags (PROPN, NUM, NOUN, ADJ, VERB, DET, SCONJ).]

https://spacy.io/usage/linguistic-features#pos-tagging

Dependency Labels in spaCy

A syntactic dependency label describes the type of syntactic relation between two words in a sentence:
the syntactic governor - head or parent
the dependent - child

Dependency label   Description
acomp              Adjectival complement
amod               Adjectival modifier
aux                Auxiliary
compound           Compound
dative             Dative
det                Determiner
dobj               Direct object
nsubj              Nominal subject
pobj               Object of preposition
ROOT               Root
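
As a quick illustration of governor, label and dependent together, a minimal sketch (the sentence is a shortened form of the earlier example):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Pierre Vinken will join the board.')
for token in doc:
    # governor (head) --label--> dependent (child)
    print(token.head.text, '--' + token.dep_ + '-->', token.text)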


Using Dependency Labels for Question Answering

[Figure: how a question answering system works]


Lexical Functional Grammar (LFG)

Lexical Functional Grammar (LFG) draws a distinction between the surface structure of language and an underlying deep structure, which is more closely related to the expressed meaning, by having two representations of a sentence:
constituent structure (c-structure), and
functional structure (f-structure).

[Figure: c-structure and f-structure examples. Picture credit to the Centre for Linguistics and Philology, University of Oxford.]
An f-structure of the example sentence is shown below:

PRED  ‘join ⟨SUBJ, OBJ⟩’
TENSE past
SUBJ  [ PRED ‘pierre-vinken’
        ADJ  [ PRED ‘old’
               ADJ [ PRED ‘61 years’ ] ] ]
OBJ   [ PRED ‘board’
        DEF  + ]
ADJ   [ PRED ‘Nov. 29’ ]


Named Entity Recognition

Named Entity
A named entity is a real object that you can refer to by a proper name. It can be a person, organization, location, or other entity.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have flown to LA. Now I am flying to Frisco.')
for token in doc:
    if token.ent_type != 0:
        print(token.text, token.ent_type_)

code/spacy–ner.py

If the ent_type attribute of a token is not set to 0, then the token is a named entity.

LA GPE
Frisco GPE

GPE stands for “geopolitical entity” and includes countries, cities, states, and other place names.

Document Level Concepts


Discourse analysis - Detecting Intention

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have flown to LA. Now I am flying to Frisco.')
# The head property of a token object refers to the
# syntactic head of this token
for token in doc:
    print(token.head.text, token.dep_, token.text)
# ROOT with pobj indicates intent in this case
for sent in doc.sents:
    print([token.text for token in sent
           if token.dep_ == 'ROOT' or token.dep_ == 'pobj'])

code/spacy–intent.py

flown nsubj I
flown aux have
flown ROOT flown
flown prep to
to pobj LA
flown punct .
flying advmod Now
flying nsubj I
flying aux am
flying ROOT flying
flying prep to
to pobj Frisco
flying punct .


This example shows how to create a list of potential keywords for each
sentence based on specific dependency labels assigned to the tokens.

['flown', 'LA']
['flying', 'Frisco']


Co-references or Anaphora
The first mention of an entity is typically fleshed out (e.g. the 46th President Joe Biden); it may later be referred to only by a pronoun (he) or an abbreviated description (the president).

A later mention may not carry sufficient information on its own, so we need to backtrack to previous mentions. This is one of the core NLP tasks, called anaphora resolution (a.k.a. co-reference resolution).

spaCy NeuralCoref

https://huggingface.co/coref/


neuralcoref v4.0

"""
neuralcoref v4.0 only works with spacy==2.1.0 and Python 3.7
$ conda create -n neuralcoref python=3.7
$ pip install spacy==2.1.0
$ pip install neuralcoref
$ python -m spacy download en
"""
import spacy
nlp = spacy.load('en_core_web_sm')

# Add neural coref to spaCy's pipe
import neuralcoref
neuralcoref.add_to_pipe(nlp)

doc = nlp(u'My sister has a dog. She loves him.')
doc._.has_coref
doc._.coref_clusters

code/neural–coref.py

True
[My sister: [My sister, She], a dog: [a dog, him]]

Topic
Topics are the general subject matters of a text or a document.
In a sports article, the English word bat will be translated differently than in an article about cave animals.
It may be helpful to detect the topics of a document and use this to help translation and disambiguation.

Topic Modelling
Topic modelling is another NLP task, one that treats topics as latent variables that can be learned through word distributions.

Topic modelling is often used as an alternative to clustering - grouping words of the same "latent" topic together.
Popular techniques (to be covered in Lecture 5) include:
Non-negative matrix factorisation
Latent Dirichlet Allocation
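
As a minimal sketch of the idea (using scikit-learn, assumed available in version 1.0 or later; the toy corpus and topic count are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus with two latent topics (sport vs. caves), both using "bat"
docs = ["the batter swung the bat at the ball",
        "cricket players oil the bat before a match",
        "bats roost in dark caves during the day",
        "the cave bat hunts insects at night"]

vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Show the top words of each learned topic (word distributions per topic)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-4:][::-1]
    print('topic', k, [terms[i] for i in top])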


Corpus


Collections of Texts - Corpora

Many NLP systems, such as statistical machine translation systems, are trained on large collections of texts.

Corpora
A corpus (plural: corpora) is a collection of texts or documents.

The text-corpus method, namely corpus linguistics, uses bodies of text written in a natural language to derive the set of abstract rules which govern that language. Typical usage:
Explore the relationships between the subject language and other languages which have undergone a similar analysis.
Compile dictionaries and grammar guides.

A landmark in modern corpus linguistics was the publication of Computational Analysis of Present-Day American English in 1967 [Kučera and Francis].
The work was based on an analysis of the Brown Corpus, a contemporary compilation of about a million American English words, carefully selected from a wide variety of sources.
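
The Brown Corpus ships with NLTK, so it is easy to take a quick look (a minimal sketch; the download step is only needed once):

import nltk
nltk.download('brown')  # one-off download of the corpus data
from nltk.corpus import brown

print(brown.categories()[:5])               # text genres, e.g. 'news'
print(len(brown.words()))                   # on the order of a million words
print(brown.words(categories='news')[:10])  # first tokens of the news genre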

Domain and Topic of Texts

Domain: A system that works excellently on scientific neuroscience articles may perform poorly on online chats between teenagers.
It is very challenging to create general-purpose NLP systems.
A restricted domain makes it much easier to confine and contextualise the meanings of words.

Topic: A training corpus from computer science may be a bad source for training an NLP system that is to work on civil engineering or geology texts.
Take machine translation for example: much of the available translated text comes from international organizations, such as the United Nations or the European Union.
The European Parliament proceedings cover many political, economic, and cultural matters, but may still not be a good source to learn to translate texts in a specialized scientific domain.

Modality of Texts

Modality: Natural language comes in both written and spoken forms, and the modality of communication matters.
Spoken language is typically transcribed (either manually or through an automatic speech recognition system) into text, to make use of a textual system for meaning extraction.
Transcription implies possible polishing, such as removing restarts and filler words (I really believe, um, believe that we should do this.).
Spoken language is different from written text. It is often ungrammatical, full of unfinished sentences and (especially in the case of automatic speech recognition of dialogues) reliant on gestures and mutually understood knowledge.
Much of this is also true for informal uses of written text, such as Internet chat, email, and text messages.


spaCy for NLP


Objects in spaCy

spaCy objects fall into two groups:
containers: Doc, Token, Span (and views such as Doc.sents and Doc.noun_chunks)
pipeline components: e.g. the part-of-speech tagger and the named entity recogniser

A container object groups multiple elements into a single unit. It can be a collection of objects, like tokens or sentences, or a set of annotations related to a single object.

Pipeline components are objects that process the text input to create containers and fill them with relevant data, such as a part-of-speech tagger, a dependency parser and an entity recogniser.


Container Objects - Doc

from spacy.tokens.doc import Doc
from spacy.vocab import Vocab

"""
create a spacy.tokens.doc.Doc object
using its constructor
"""
doc = Doc(Vocab(), words=[u'Hello', u'World!'])
print(doc)
print(type(doc))

code/spacy–doc.py

Hello World!
<class 'spacy.tokens.doc.Doc'>

The Doc() constructor requires two parameters:
a vocab object, which is a storage container that provides vocabulary data, such as lexical types (adjective, verb, noun, ...);
a list of tokens to add to the Doc object being created.

Container Objects - Token

From a user's standpoint, Token, Span and Doc represent a token, a phrase or sentence, and a text, respectively. A container can contain other containers (for example, a Doc contains Tokens).

Getting the Index of a Token in a Doc Object
spaCy's Token object is a container for a set of annotations related to a single token, such as that token's part of speech. A Doc object contains the collection of Token objects generated by the tokenization performed on a submitted text. These tokens have indices, allowing you to access them based on their positions in the text.

[Figure 3-1: the tokens in a Doc object. For "I want a green apple." the Doc container holds tokens at indices [0]-[4], annotated PRON, VERB, DET, ADJ, NOUN.]

Tokens are indexed starting with 0, which makes the length of the document minus 1 the index of the last token. The two list comprehensions below produce the same result, one iterating over the tokens directly and one accessing them by index:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I want a green apple.')
# token_text1 and token_text2 produce the same results
token_text1 = [token.text for token in doc]
token_text2 = [doc[i].text for i in range(len(doc))]

code/spacy–token.py

Token.lefts, Token.rights and Token.children

[Figure 3-2: an example of leftward syntactic dependencies. In "I want a green apple." (PRON VERB DET ADJ NOUN), the tokens a and green are the leftward children of the token apple.]

To obtain the leftward syntactic children of the word "apple" in this sample sentence programmatically:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u'I want a green apple.')
print([t for t in doc[4].lefts])     # leftward children of "apple"
print([t for t in doc[4].children])  # all syntactic children of "apple"
print([t for t in doc[1].rights])    # rightward children of "want"

code/spacy–children.py

[a, green]
[a, green]
[apple, .]

It is interesting to note that in this example, the leftward syntactic children of "apple" represent the entire sequence of the token's syntactic children. In practice, this means that we might replace Token.lefts with Token.children, which finds all of a token's syntactic children.

Container Objects - Vocab

Internally, spaCy only “speaks” in hash values.
Whenever possible, spaCy tries to store data in a vocabulary, the Vocab storage class, which is shared by multiple documents.
To save memory, spaCy also encodes all strings to hash values. For example, “coffee” has the hash 3197928453018144401.
Entity labels like “ORG” and part-of-speech tags like “VERB” are also encoded.

https://spacy.io/usage/spacy-101#vocab


spaCy Code Demo for the Vocab class

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('I love coffee!')
for token in doc:
    lexeme = doc.vocab[token.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_,
          lexeme.prefix_, lexeme.suffix_, lexeme.is_alpha,
          lexeme.is_digit, lexeme.is_title, lexeme.lang_)

print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'

code/spacy–vocab.py

text   hash                 shape prefix suffix alpha digit title lang
I      4690420944186131903  X     I      I      True  False True  en
love   3702023516439754181  xxxx  l      ove    True  False False en
coffee 3197928453018144401  xxxx  c      fee    True  False False en
!      17494803046312582752 !     !      !      False False False en

3197928453018144401
'coffee'

Container Objects - Span

spaCy's Span object is a container that represents an arbitrary set of neighbouring tokens in the document, which could be an n-gram, a phrase, a noun_chunk, or a sentence.

A Span can be obtained as simply as doc[start:end], where start and end are the indices of the starting token and the ending token, respectively. The two indices can be
manually specified; or
computed through pattern matching.

doc[start:end] is a span; a noun phrase chunk is a span; an n-gram is a span.

span = Span(doc, start, end, label=match_id)

code/spacy–span–snippet.py
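
A minimal sketch of the manual case (the sentence is just for illustration):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'The Golden Gate Bridge is an iconic landmark.')
span = doc[1:4]   # manually specified token indices
print(span.text)  # Golden Gate Bridge
print(type(span)) # <class 'spacy.tokens.span.Span'>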


spaCy’s Pattern Matcher

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span, Token

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# A dependency label pattern that matches a word sequence
pattern = [{"DEP": "nsubj"}, {"DEP": "aux"}, {"DEP": "ROOT"}]
matcher.add("NsubjAuxRoot", [pattern])
doc = nlp(u"We can overtake them.")
# 1. Return (match_id, start, end) tuples
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print("Span: ", span.text)
    print("The positions in the doc are: ", start, "-", end)
# 2. Return Span objects directly
matches = matcher(doc, as_spans=True)
for span in matches:
    print(span.text, span.label_)

code/spacy–pattern–matcher.py

Span: We can overtake
The positions in the doc are: 0 - 3

Rule-based Matching
Steps for using the Matcher class:
1 Create a Matcher instance by passing in a shared Vocab object;
2 Specify the pattern as a list of dependency labels;
3 Add the pattern to the Matcher object;
4 Input a Doc object to the matcher;
5 Go through each match ⟨match_id, start, end⟩.
We have just seen a dependency matcher; spaCy offers more rule-based matching support:
Token Matcher: regex, and patterns such as [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
Phrase Matcher: PhraseMatcher class
Entity Ruler
Combining models with rules
https://spacy.io/usage/rule-based-matching#matcher
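
A minimal PhraseMatcher sketch (the suburb names are illustrative):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
# Match any of these phrases, case-insensitively
patterns = [nlp.make_doc(name) for name in ('Cottesloe', 'Subiaco')]
matcher.add('SUBURB', patterns)

doc = nlp(u'I need a taxi from Subiaco to Cottesloe.')
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)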


doc.noun_chunks and Retokenising

Noun Chunks
A noun chunk is a phrase that has a noun as its head.

doc = nlp(u'The Golden Gate Bridge is an iconic landmark in San Francisco.')
# Retokenize to treat each noun_chunk as a single token
for chunk in doc.noun_chunks:
    with doc.retokenize() as retokenizer:
        retokenizer.merge(chunk)
for token in doc:
    print(token)

code/spacy–noun–chunk.py

The Golden Gate Bridge
is
an iconic landmark
in
San Francisco
.

Exercise: define a function to extract noun phrases based on syntactic dependency parsing.
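
One possible sketch for this exercise (treating a noun's syntactic subtree as its phrase; the helper name noun_phrases is our own):

import spacy

nlp = spacy.load('en_core_web_sm')

def noun_phrases(doc):
    """Collect the subtree of each non-compound noun as a phrase."""
    phrases = []
    for token in doc:
        if token.pos_ in ('NOUN', 'PROPN') and token.dep_ != 'compound':
            subtree = list(token.subtree)
            # Slice the doc from the first to the last subtree token
            phrases.append(doc[subtree[0].i : subtree[-1].i + 1].text)
    return phrases

doc = nlp(u'The Golden Gate Bridge is an iconic landmark in San Francisco.')
print(noun_phrases(doc))  # note: nested phrases may appear more than once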

doc.sents
spaCy's Doc object represents a text, which may contain one or more sentences.
doc.sents is a generator object. You can iterate over it with a for loop, but you cannot index into it directly.
Each member of the generator is a Span of type spacy.tokens.span.Span.

doc = nlp(u'A storm hit the beach. It started to rain.')
for sent in doc.sents:
    print(type(sent))
    # Sentence level index
    [sent[i] for i in range(len(sent))]
# Doc level index
[doc[i] for i in range(len(doc))]

code/spacy–sents–snippet.py

Exercises:
1 Print the second sentence if it begins with a pronoun.
2 How many sentences end with a verb?
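
One possible solution sketch for the two exercises (reusing the doc from the snippet above; sentence-final punctuation is skipped when checking for a verb):

sents = list(doc.sents)

# Exercise 1: print the second sentence if it begins with a pronoun
if sents[1][0].pos_ == 'PRON':
    print(sents[1].text)

# Exercise 2: count sentences whose last non-punctuation token is a verb
count = 0
for sent in sents:
    tokens = [t for t in sent if not t.is_punct]
    if tokens and tokens[-1].pos_ == 'VERB':
        count += 1
print(count)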

Text-Processing Pipeline - Traditional in NLTK

Using Python @coroutine to create information extraction pipelines

[Figure: information extraction in NLTK]


An NLTK Pipeline Example

import nltk
nltk.download('punkt')  # Sentence Tokenize
nltk.download('averaged_perceptron_tagger')  # POS Tagging
nltk.download('maxent_ne_chunker')  # Named Entity Chunking
nltk.download('words')  # Word Tokenize

# texts is a collection of documents.
# Here is a single document with two sentences.
texts = [u"A storm hit the beach in Perth. It started to rain."]
for text in texts:
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        tagged_words = nltk.pos_tag(words)
        ne_tagged_words = nltk.ne_chunk(tagged_words)
        print(ne_tagged_words)

code/nltk–ie.py

(S A/DT storm/NN hit/VBD the/DT beach/NN in/IN (GPE Perth/NNP) ./.)
(S It/PRP started/VBD to/TO rain/VB ./.)


NLP Pipeline in spaCy

Recall that spaCy's container objects represent linguistic units, such as a text (i.e. document), a sentence and an individual token, with linguistic features already extracted for them.

How does spaCy create these containers and fill them with relevant data?

Processing Pipeline Components
A spaCy pipeline (v2.x) includes, by default, a part-of-speech tagger, a dependency parser and an entity recognizer:

>>> nlp.pipe_names
['tagger', 'parser', 'ner']


Disabling Pipeline Components

spaCy allows you to load a selected set of pipeline components, disabling those that aren't necessary.
You can do this when creating an nlp object by setting the disable parameter:

nlp = spacy.load('en_core_web_sm', disable=['parser'])

Or, you can disable components after the nlp object is created:

nlp.disable_pipes('tagger')
nlp.disable_pipes('parser')
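
disable_pipes can also be used as a context manager (spaCy v2.x behaviour), so the disabled components are restored afterwards; a minimal sketch:

import spacy

nlp = spacy.load('en_core_web_sm')
with nlp.disable_pipes('tagger', 'parser'):
    doc = nlp(u'I need a taxi to Cottesloe.')  # only the remaining pipes run
print(nlp.pipe_names)  # all components active again after the block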


Customising a NLP pipeline

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'I need a taxi to Cottesloe.')
for ent in doc.ents:
    print(ent.text, ent.label_)

code/spacy–customise–org.py

Cottesloe ORG

If we would like to introduce a new entity type SUBURB for Cottesloe and other suburb names, how should we inform the NER component about it?


Steps of Customising a spaCy NER pipe

1 Create a training example to show the entity recognizer, so it will learn what to apply the SUBURB label to;
2 Add a new label called SUBURB to the list of supported entity types;
3 Disable the other pipes to ensure that only the entity recogniser will be updated during training;
4 Start training;
5 Test your new NER pipe;
6 Serialise the pipe to disk;
7 Load the customised NER.


import spacy
nlp = spacy.load('en_core_web_sm')

# Specify new label and training data
LABEL = 'SUBURB'
TRAIN_DATA = [('I need a taxi to Cottesloe',
               {'entities': [(17, 26, 'SUBURB')]}),
              ('I like red oranges', {'entities': []})]

# Add new label to the ner pipe
ner = nlp.get_pipe('ner')
ner.add_label(LABEL)

# Disable the other two default pipes
nlp.disable_pipes('tagger')
nlp.disable_pipes('parser')

# Train
optimizer = nlp.entity.create_optimizer()
import random
for i in range(25):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
# Test

doc = nlp(u'I need a taxi to Cottesloe')
for ent in doc.ents:
    print(ent.text, ent.label_)  # Cottesloe SUBURB

# Serialize the pipe to disk
ner.to_disk('C:\\Users\\00051693\\CITS4012')  # Windows path

# Load spaCy without NER
import spacy
from spacy.pipeline import EntityRecognizer
nlp = spacy.load('en_core_web_sm', disable=['ner'])

# Load the ner pipe from disk
ner = EntityRecognizer(nlp.vocab)
ner.from_disk('C:\\Users\\00051693\\CITS4012')
# Add pipe to nlp
nlp.add_pipe(ner)

# Test
doc = nlp(u'I need a taxi to Western Australia')
for ent in doc.ents:
    print(ent.text, ent.label_)  # Western SUBURB

code/spacy–customised–ner.py


Take-Aways


The components of a language

Phonology: Science of language sounds
Morphology: Science of word form structure
Lexicon: Listing of analysed words
Syntax: Science of composing word forms
Semantics: Science of literal meaning
Pragmatics: Science of using language expressions


References

[1] Philipp Koehn. Statistical Machine Translation. Cambridge University Press, Cambridge, 2010. ISBN 9780521874151.
