Phases of NLP (8 Files Merged)
Below are the main differences between Natural Language and Computer Language:
Redundancy: Natural languages employ lots of redundancy; formal languages are less redundant.
Literalness: Natural languages are full of idiom and metaphor; formal languages mean exactly what they say.
From the computer’s point of view, any natural language is free-form text. That means there
are no set keywords at set positions when providing an input, i.e., the text is unstructured.
Beyond the unstructured nature, there can also be multiple ways to express something using a
natural language. For example, consider these three sentences (all asking the same thing):
"What is the weather forecast for today?"
"How is the weather going to be today?"
"What will the weather be like today?"
All these sentences have the same underlying question, which is to enquire about today’s
weather forecast.
As humans, we can identify such underlying similarities almost effortlessly and respond
accordingly. But this is a problem for machines—any algorithm will need the input to be in a
set format, and these three sentences vary in their structure and format. And if we decide to
code rules for each and every combination of words in any natural language to help a machine
understand, then things will get very complicated very quickly.
NLP is a subset of AI tasked with enabling machines to interact using natural languages. The
domain of NLP also ensures that machines can receive natural-language input, understand it,
and generate an appropriate response.
The aim of NLP is to process the free form natural language text so that it gets transformed
into a standardized structure.
NLP is an umbrella term which encompasses any and everything related to making machines
able to process natural language—be it receiving the input, understanding the input, or
generating a response.
Natural Language Understanding (NLU) helps the machine to understand and analyze human
language by extracting information such as keywords, emotions, relations, and semantics from
the text. NLU is generally harder than NLG.
Natural Language Generation is the technology that analyzes, interprets, and organizes data
into comprehensible, written text. NLG aids the machine in sorting through many variables
and putting “text into context,” thus delivering natural-sounding sentences and paragraphs that
observe the rules of English grammar. It is the process of producing meaningful phrases and
sentences in the form of natural language from some internal representation.
It involves −
• Text planning − It includes retrieving the relevant content from knowledge base
(domain).
• Sentence planning − It includes choosing required words, forming meaningful
phrases, setting tone of the sentence.
• Text Realization − It is mapping sentence plan into sentence structure.
So, how do NLP & NLU differ?
In natural language, what is expressed (either via speech or text) is not always what is meant.
Let’s take an example sentence: "Crack the windows."
NLP will take the request to crack the windows in the literal sense, but it will be NLU which
will help draw the inference that the user may be intending to roll down the windows.
On our quest to make more robust autonomous machines, it is imperative that we are able to
not only process the input in the form of natural language, but also understand the meaning and
context—that’s the value of NLU. This enables machines to produce more accurate and
appropriate responses during interactions.
NLP Terminology
Phases of NLP
-Lexical Analysis:
Morphological or Lexical Analysis deals with text at the individual word level. It looks
for morphemes, the smallest unit of a word. For example, irrationally can be broken
into ir (prefix), rational (root, which is a morpheme with dictionary meaning) and -ly (suffix).
Lexical Analysis finds the relation between these morphemes and converts the word into its root
form. It takes into consideration the dictionary of the language and involves identifying and
analyzing the structure of words. The lexicon of a language means the collection of words and
phrases in that particular language. Lexical analysis divides the text into paragraphs, sentences,
and words.
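As a toy illustration of this morpheme splitting, here is a minimal sketch in Python; the prefix and suffix lists are purely illustrative, not a real morphological analyzer:

# toy morphological splitter; the affix lists below are illustrative only
PREFIXES = ["ir", "un", "re", "dis"]
SUFFIXES = ["ly", "ness", "ing", "ed"]

def split_morphemes(word):
    prefix, suffix = "", ""
    for p in PREFIXES:
        if word.startswith(p):
            prefix, word = p, word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s):
            suffix, word = s, word[:-len(s)]
            break
    return [m for m in (prefix, word, suffix) if m]

print(split_morphemes("irrationally"))   # ['ir', 'rational', 'ly']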
-Syntactic Analysis:
Syntactic Analysis is used to check grammar, the arrangement of words, and the interrelationships
between the words. Syntax refers to the principles and rules that govern the sentence
structure of any individual language. It focuses on the proper ordering of words, which can
affect the meaning of a sentence. This involves analysing the words in a sentence according to the
grammatical structure of the sentence. Given the possible parts of speech generated in the previous
step, a syntax analyzer assigns POS tags based on the sentence structure.
Example-1:
Correct Syntax: Sun rises in the east.
Incorrect Syntax: Rise in sun the east
-Semantic Analysis:
It draws the exact meaning or the dictionary meaning from the text. The text is checked for
meaningfulness. This is done by mapping syntactic structures onto objects in the task domain.
Consider the sentence: “The apple ate a banana”. Although the sentence is syntactically correct,
it doesn’t make sense because apples can’t eat. Semantic analysis looks for meaning in the
given sentence. It also deals with combining words into phrases. For example, “red apple”
provides information regarding one object; hence we treat it as a single phrase. Similarly, we
can group names referring to the same category, person, object or organisation: “Robert Hill”
refers to one person and not two separate names, “Robert” and “Hill”.
–Discourse Integration:
The meaning of any sentence depends upon the meaning of the sentence just before it, and it can
also contribute to the meaning of the sentence that immediately follows it. In the text, “Jack is a
bright student. He spends most of the time in the library.”, discourse integration resolves “he” to
refer to “Jack”.
Similarly, a pronoun such as “it” makes no sense on its own; it refers to something introduced in a
previous sentence. Only once we know what “it” refers to can the sentence be interpreted correctly.
–Pragmatic Analysis:
The final stage of NLP, pragmatic analysis interprets the given text using information from the
previous steps. Given the sentence “Turn off the lights”, it is understood as an order or request to
switch off the lights. During this phase, what was said is re-interpreted as what was actually meant.
It involves deriving those aspects of language which require real-world knowledge: who said what
to whom, in which context they are talking, and how people communicate with each other. The main
focus is always on re-interpreting what was said as what is actually meant.
Challenges of NLP:
1) Irony, sarcasm (the words may appear positive or negative but they convey opposite
meaning)
2) Colloquialism (difficult as they don’t have dictionary definitions) and slang
3) Synonyms
4) Homonyms and homophones (e.g., “their” and “there”), and the same word used with different
meanings, e.g., “I ran to the store because we ran out of milk”
5) Domain-specific language:
An NLP model needed for healthcare, for example, would be very different
from one used to process legal documents. These days, however, there are a number of
tools and pretrained models that make building such domain-specific pipelines easier.
Lexical Ambiguity: It happens when a word has different meanings. The same word could be
used as a verb, noun, or adjective. Lexical ambiguity can be resolved by using parts-of-speech
(POS) tagging techniques.
For example, the word “watch” can be a noun (“their watch”) or a verb (“they watch”).
Syntactic ambiguity or grammatical ambiguity: It happens when a sentence has multiple possible parses.
For example, in “The fish is ready to eat”, it is not clear whether the fish is ready to eat its food or the
fish is cooked and ready to be eaten.
Referential ambiguity: It happens when you are referring to something using a pronoun.
For example, Rima went to Gauri. She said, “I am tired.” − Exactly who is tired?
Advantages of NLP
• Users can ask questions about any subject and get a direct response within seconds.
• NLP system provides answers to the questions in natural language
• NLP system offers exact answers to the questions, no unnecessary or unwanted
information
• The accuracy of the answers increases with the amount of relevant information
provided in the question.
• NLP process helps computers communicate with humans in their language and scales
other language-related tasks
• Allows you to process more language-based data than a human being could, without
fatigue and in an unbiased and consistent way.
• Structuring a highly unstructured data source
Disadvantages of NLP
NLP applications
1) Spellcheck
2) Autocomplete
the incomplete terms entered by the user are compared to a dictionary to suggest
possible options of words.
Coding autocomplete: for programming languages
3) Spam filters
4) Voice text messaging
5) Virtual assistants-alexa, siri
6) Customer support bots or intelligent bots that offer personalized assistance (a conventional
chatbot answers basic customer queries and routine customer requests with canned
responses, but such bots cannot recognize more nuanced questions. So, support bots
are now equipped with artificial intelligence and machine learning technologies to
overcome these limitations. In addition to understanding and comparing user inputs,
they can generate answers to questions on their own without pre-written responses.)
7) Language Identifier:
The process of determining the language of a particular body of text involves
rummaging through different dialects, slangs, common words between different
languages, and the use of multiple languages in one page. But with machine learning,
this task becomes a lot simpler.
8) Sentiment analysis or media monitor
9) Speech to text (NPTel lectures)
10) Translators
11) Text summarization
12) Automated question answering
******************************
Parts of Speech (PoS) or word classes (or) morphological classes (or) lexical classes:
- Gives a significant amount of information about the word and its neighbors. Knowing that the
current word is a possessive pronoun (their) or a personal pronoun (they) can tell what words are
likely to occur in its vicinity.
Eg., ‘they watch’ and ‘their watch’: in the first phrase, watch is tagged as a verb because the previous
word is a personal pronoun; in the second phrase, watch is tagged as a noun because the previous
word is a possessive pronoun (a quick check with an off-the-shelf tagger is sketched after this list).
- It tells us about how the word is pronounced. Eg., Object (noun with more stress on ‘ob’) and
object (verb with more stress on ‘ject’)
- Can tell which affixes it can take
- Used for partial parsing texts to quickly find names or other phrases for information extraction
applications
- Corpora marked for pos are useful for linguistic research
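A quick way to see such tags assigned in practice is NLTK's off-the-shelf tagger. A minimal sketch follows; note that the pretrained tagger is an approximation and may not always reproduce exactly the tags discussed above:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

print(nltk.pos_tag(nltk.word_tokenize("They watch the game")))
print(nltk.pos_tag(nltk.word_tokenize("Their watch is new")))
# 'watch' should come out as a verb in the first sentence (personal pronoun before it)
# and as a noun in the second (possessive pronoun before it)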
Hidden Markov model is a stochastic technique that can be used to find the appropriate tag
sequence for a given sentence.
Transition Probability:
It gives the likelihood of a particular sequence. It tells how likely is that a noun is followed by a verb,
a verb followed by an adjective and an adjective followed by a noun and so on. This probability
should be high for a particular sequence to be correct.
Emission probability:
It gives the likelihood of a particular word being observed with a particular tag. E.g., it tells the
probability that ‘Mary’ is a noun, ‘will’ is a modal, and so on.
P(yk | yk-1) -> Given the current tag yk-1, it tells the probability for the next tag to be yk.
P(xk | yk) -> Given a specific tag yk, it tells the probability of observing the word xk with that tag.
Eg. Given the following set of sentences with their tagging, calculate the 2 probabilities and use
them to tag the test sentence, “Will can spot Mary”.
1) Mary Jane can see Will (N N M V N) where M denotes the modal verb
2) Spot will see Mary (N M V N)
3) Will Jane spot Mary? (M N V N)
4) Mary will pat Spot (N M V N)
Transition Probability:
Assume the cell in the first row and first column. That cell stores the probability of seeing a noun
after the start symbol, which is equal to (count of the times a noun follows the start symbol) /
(total count of start symbols) = 3/4. Likewise fill the other cells too.

              Noun   Modal   Verb   <e> (end)
<s>           3/4    1/4     0      0
Noun          1/9    3/9     1/9    4/9
Modal verb    1/4    0       3/4    0
Verb          4/4    0       0      0
Emission Probability:
The emission probability P(word | tag) for each word is estimated in the same way from the tagged
sentences (e.g., P(Mary | N) = 4/9, P(will | M) = 3/4).
The test sentence “Will can spot Mary” is manually tagged as N M V N.
If the model tags the sentence wrong, the computed probability will be zero. If it tags the sentence
correctly, the probability will be greater than 0.
Eg., let’s assume that the sentence is tagged as M V N N; then the product of the corresponding
transition and emission probabilities is
1/4 * 3/4 * 3/4 * 0 * 4/9 * 2/9 * 1/9 * 4/9 * 4/4 = 0
The number of hidden states, N = 3 (N, M, V). The number of observed words in the input, T = 4.
The total number of possible tag sequences to be explored = N^T = 3^4 = 81,
i.e., the probability of the words ‘Will’, ‘can’, ‘spot’, ‘Mary’ taking each of the 81 tag sequences
such as NNNN, NNNM, NNNV, NNMN, NNMV, … has to be calculated, and the tag sequence that
results in the maximum probability value should be taken as the output.
The Viterbi algorithm follows the dynamic programming technique to predict the output efficiently.
In this method, if there are multiple paths leading to the same node, then only the path with the
highest probability value is explored further; the other paths can be omitted. Most of the intermediate
path probabilities for this example evaluate to zero, e.g.
p(V | can) = 0.0833 * p(can | V) * p(V | N) = 0.0833 * 0 * 1/9 = 0
Therefore, the word sequence “Will can spot Mary” is tagged as “N M V N”.
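The tagging above can be reproduced by estimating the transition and emission counts from the four training sentences and running Viterbi over them. A minimal sketch, written from scratch rather than with an HMM library:

from collections import defaultdict

# training sentences with their tags (N = noun, M = modal, V = verb)
train = [
    [("mary", "N"), ("jane", "N"), ("can", "M"), ("see", "V"), ("will", "N")],
    [("spot", "N"), ("will", "M"), ("see", "V"), ("mary", "N")],
    [("will", "M"), ("jane", "N"), ("spot", "V"), ("mary", "N")],
    [("mary", "N"), ("will", "M"), ("pat", "V"), ("spot", "N")],
]

trans = defaultdict(lambda: defaultdict(int))   # trans[prev_tag][tag]
emit = defaultdict(lambda: defaultdict(int))    # emit[tag][word]
tag_count = defaultdict(int)
for sent in train:
    prev = "<s>"
    tag_count["<s>"] += 1
    for word, tag in sent:
        trans[prev][tag] += 1
        emit[tag][word] += 1
        tag_count[tag] += 1
        prev = tag
    trans[prev]["<e>"] += 1   # end-of-sentence transition

def p_trans(prev, tag):
    return trans[prev][tag] / tag_count[prev]

def p_emit(tag, word):
    return emit[tag][word] / tag_count[tag]

def viterbi(words, tags=("N", "M", "V")):
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (p_trans("<s>", t) * p_emit(t, words[0]), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            prob, path = max((best[s][0] * p_trans(s, t), best[s][1]) for s in tags)
            new[t] = (prob * p_emit(t, w), path + [t])
        best = new
    # fold in the probability of ending the sentence after the last tag
    return max((best[t][0] * p_trans(t, "<e>"), best[t][1]) for t in tags)[1]

print(viterbi("will can spot mary".split()))   # ['N', 'M', 'V', 'N']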
*******************
Sentence segmentation
• Symbols like ! and ? are relatively unambiguous, but a few like the dot (.) are ambiguous:
it could be an end-of-sentence marker, part of an abbreviation, or a decimal point.
• We could build a binary classifier to identify this.
o Few other options we can use are handwritten rules, regular
expressions & machine learning algorithms.
• A decision tree is the simplest form of such a classifier; we could draw a small decision tree
over features of the candidate symbol and its neighbouring words.
import nltk
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer
text = "Backgammon is one of the oldest known board games."
obj = TreebankWordTokenizer()
print(obj.tokenize(text))
obj = WordPunctTokenizer()
print(obj.tokenize(text))

word_tokenize, on the other hand, is based on a TreebankWordTokenizer. It basically tokenizes text like in the Penn Treebank.
words = nltk.word_tokenize(text)
A naive rule-based sentence splitter can also be written by hand:
current_position = 0
cursor = 0
sentences = []
for c in text:
    if c == "." or c == "!":
        sentences.append(text[current_position:cursor+1])
        current_position = cursor + 2
    cursor += 1
print(sentences)
import nltk
nltk.download('punkt')
text = "This is the 1.0.2 version;Backgammon is one of the oldest known board games. Its
history can be traced back nearly 5,000 years to archeological discoveries in the Middle
East. It is a two player game where each player has fifteen checkers which move between
listofsent = nltk.sent_tokenize(text)
print(listofsent)
Output:
['This is the 1.0.2 version;Backgammon is one of the oldest known
board games.', 'Its history can be traced back nearly 5,000 years to
archeological discoveries in the Middle East.', 'It is a two player
game where each player has fifteen checkers which move between
twenty-four points according to the roll of two dice.']
Naïve Bayes method for sentence segmentation:
If a symbol is any one of the following – (., ?, !, ;) – the function below checks whether the first
character of the next word is upper case, whether the previous word is lower case, and whether the
previous word is a single-character word, and returns the corresponding boolean values. A dataset
is prepared using these boolean values and a Naïve Bayes model is trained on it. When a new
document is provided as input, the model predicts whether each such symbol represents the end of
a sentence or not.
"Its history can be traced back nearly 5,000 years to archeological discoveries in the
Middle East.","It is a two player game where each player has fifteen checkers which move
between twenty-four points according to the roll of two dice."]
tokens = []
boundaries = set()
offset = 0
print(sentences)
print(sent)
tokens.extend(sent)
offset += len(sent)
boundaries.add(offset-1)
print("tokens",tokens)
print("boundaries",boundaries)
print(featuresets)
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
def segment_sentences(inputsent):
start = 0
sents = []
sents.append(inputsent[start:i+1])
start = i+1
sents.append(inputsent[start:])
return sents
#print(segment_sentences(sentences))
print(segment_sentences(sample))
import re
import string
import numpy as np
import nltk # Python library for NLP
from nltk.corpus import twitter_samples, stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
# nltk.download('twitter_samples'); nltk.download('stopwords')  # needed once

def process_tweet(tweet):
    #Remove hyperlinks
    tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    #Remove hashtags (only removing the hash # sign from the word)
    tweet2 = re.sub(r'#', '', tweet2)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet2)
    # remove stopwords and punctuation, then stem each remaining token
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    tweets_stem = [stemmer.stem(word) for word in tweet_tokens
                   if word not in stopwords_english and word not in string.punctuation]
    return tweets_stem
#Frequency generating function
def build_freqs(tweets, ys):
    yslist = np.squeeze(ys).tolist()
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs
def sigmoid(z):
    # compute the sigmoid of z
    h = 1 / (1 + np.exp(-z))
    return h
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''
    m = len(x)
    for i in range(num_iters):
        z = np.dot(x, theta)
        h = sigmoid(z)
        # cross-entropy cost
        J = -1.0 / m * (np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h)))
        # gradient descent update of the weights
        theta = theta - (alpha / m) * np.dot(x.T, (h - y))
    J = float(J)
    return J, theta
def extract_features(tweet, freqs):
    '''
    Input:
        tweet: a list of words for one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)
    x = np.zeros((1, 3))
    x[0, 0] = 1   # bias term
    for word in word_l:
        x[0, 1] += freqs.get((word, 1.0), 0)   # count of the word in positive tweets
        x[0, 2] += freqs.get((word, 0.0), 0)   # count of the word in negative tweets
    return x
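The function that evaluates the trained classifier is not present in this merged copy; below is a minimal sketch written to be consistent with how test_logistic_regression is called in the main module further down (the prediction rule, a sigmoid score above 0.5 meaning positive, is the usual convention rather than something stated in the source):

def test_logistic_regression(test_x, test_y, freqs, theta):
    # predict each tweet with the learned weights and compare with the gold labels
    y_hat = []
    for tweet in test_x:
        y_prob = sigmoid(np.dot(extract_features(tweet, freqs), theta))
        y_hat.append(1.0 if y_prob[0, 0] > 0.5 else 0.0)
    accuracy = np.mean(np.array(y_hat) == np.squeeze(test_y))
    return accuracy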
#main module
# split the data into two pieces, one for training and one for testing (validation set)
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
tweets = all_positive_tweets + all_negative_tweets
labels = np.append(np.ones((len(all_positive_tweets))), np.zeros((len(all_negative_tweets))))
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]
train_x = train_pos + train_neg
test_x = test_pos + test_neg
# combine positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)
freqs = build_freqs(tweets, labels)
# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
X[i, :]= extract_features(train_x[i], freqs)
# training labels corresponding to X
Y = train_y
# Apply gradient descent
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")
# Check your function
# test 1
# test on the held-out test data
test_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {test_accuracy:.4f}")
tmp1 = extract_features(train_x[0], freqs)
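With the trained weights, a new tweet can be scored in the same way; a small usage sketch (my_tweet is just an illustrative string, not from the dataset):

my_tweet = "I am learning NLP and it is a great experience"   # illustrative example
y_prob = sigmoid(np.dot(extract_features(my_tweet, freqs), theta))[0, 0]
print("positive" if y_prob > 0.5 else "negative", y_prob)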
Word Segmentation
1) Simpler Approach
Scan the characters one at a time from left to right and look up the accumulated characters in a
dictionary. If the series of characters is found in the dictionary, then we have a matched word
and we segment that sequence as a word. But this will always match the shortest possible word.
2) Maximal matching
This is a greedy technique used to avoid matching the shortest word by finding the longest
sequence of characters in the dictionary instead. This approach is called the longest
matching algorithm or maximal matching.
If no dictionary word can be matched at the current position, the single character is taken out as an
unknown word and matching resumes from the next character. So in “ndineh…”, we just take ‘n’ out
as an unknown word and start matching the next word from ‘d’. Assuming the words “din” and “re”
are not in our dictionary, we would get the series of words “theme n dine here”, with only one
unknown word. But as you can see, taking the longest word can produce an incorrect segmentation:
it overextends the first word into “theme”, which consumes part of the second word “men” and
leaves the unknown word “n”.
Another approach to solving the greedy nature of longest matching is an algorithm called
‘maximum matching’. This approach generates multiple candidate segmentations and chooses the
one with the fewest words in the sentence; it also prioritises segmentations with fewer unknown
words. Using this approach, we get the correct segmentation of the example text (see the sketch
after this paragraph).
The first word can be “the”, “them”, or “theme”. From each of these choices, there are
multiple further choices: after “the”, the next word can be “me”, “men”, or “mend”. The word
“mend” would result in an incorrect word after “in”. Only “men” gives us “dine” and then “here”
as a correct segmentation. This approach goes through all the different combinations based on our
dictionary; unknown words are still a problem that this approach alone does not solve.
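Both strategies can be sketched in a few lines of Python; the dictionary below is a tiny illustrative one, not a real lexicon:

DICTIONARY = {"the", "them", "theme", "me", "men", "mend", "dine", "in", "here", "he"}
MAXLEN = max(len(w) for w in DICTIONARY)

def seg_cost(words):
    # prefer fewer unknown words, then fewer words overall
    return (sum(w not in DICTIONARY for w in words), len(words))

def longest_matching(text):
    # greedy: always take the longest dictionary word starting at the current position
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAXLEN), i, -1):
            if text[i:j] in DICTIONARY:
                words.append(text[i:j])
                i = j
                break
        else:   # no match: emit a single character as an unknown word
            words.append(text[i])
            i += 1
    return words

def maximum_matching(text, memo=None):
    # explore all segmentations and keep the cheapest one
    if memo is None:
        memo = {}
    if text == "":
        return []
    if text in memo:
        return memo[text]
    firsts = [text[:j] for j in range(1, min(len(text), MAXLEN) + 1)
              if text[:j] in DICTIONARY] or [text[0]]   # fall back to an unknown character
    best = None
    for first in firsts:
        seg = [first] + maximum_matching(text[len(first):], memo)
        if best is None or seg_cost(seg) < seg_cost(best):
            best = seg
    memo[text] = best
    return best

print(longest_matching("themendinehere"))   # ['theme', 'n', 'dine', 'here']
print(maximum_matching("themendinehere"))   # ['the', 'men', 'dine', 'here']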
Probabilistic Approaches
1) Unigram
Instead of just the words from a training dataset, like Google Web Trillion Word Corpus, we
can add the count of all words. Then based on the probability (word count/total number of
words) we can predict the likelihood of each word. To calculate each option, you take the
product of the probability of each word.
Let’s see how we can segment the sequence “meneat”. It can be segmented as: “me neat”, “men
eat”, or “mene at”. But the frequency of these various words in the unigram list of Google trillion
corpus is
me 566,617,666
neat 4,643,885
men 174,058,407
eat 29,237,400
mene 77555
at 2272272772
t=1024908267229.0 #total number of words in the corpus
p(me_neat) = p(me)*p(neat)=(566617666/t * 4643885/t) = 2.5e-09
p(men_eat) = p(men)*p(eat)= (174058407/t * 29237400/t)= 4.8e-09
p(mene_at) = p(mene)*p(at)= (77555/t * 2272272772/t) = 1.7e-10
With these three possibilities, the option “men eat” has the highest score. So, the algorithm
would select the highest score option.
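The same numbers can be checked in a couple of lines of Python; the counts are the ones quoted above:

t = 1024908267229.0   # total number of words in the corpus
counts = {"me": 566617666, "neat": 4643885, "men": 174058407,
          "eat": 29237400, "mene": 77555, "at": 2272272772}

def unigram_score(words):
    prob = 1.0
    for w in words:
        prob *= counts[w] / t
    return prob

candidates = [["me", "neat"], ["men", "eat"], ["mene", "at"]]
for c in candidates:
    print(" ".join(c), unigram_score(c))
print("best:", " ".join(max(candidates, key=unigram_score)))   # men eat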
2) Bigram
As an example, “thenonprofit” can be segmented as “the non profit” or “then on profit”. From the
Google trillion-word bigram list we get the counts of the candidate bigrams, and the segmentation
whose bigram probabilities give the higher product (“the non profit”) is selected.
3) N-Gram
The ‘N-gram’ idea can be extended to tri-grams or bigger N. A larger N generates a very large list of
lookup terms, but it gives more context. Generally, bi-gram and tri-gram models are used.
4) Character tagging with Naive Bayes
This probabilistic approach assigns each character to a categorical label. We can label
each of the characters in the word as:
S: Single letter word
B: Beginning letter
M: Middle letter
E: Ending letter
As an example of a string “thisisatest” which should correspond to “this is a test”, it is tagged
as:
Feature: t h i s i s a t e s t
Tag: B M M E B E S B M M E
We are looking for the probability of each letter given its label. For example, calculate the
probability of the letter ‘t’ given that it has the label ‘B’, a beginning letter. We can use Naive
Bayes to estimate these probabilities.
So let’s go back to the example “thisisatest” above. We first need the label probabilities; the
string has a total of 11 labels. We have one single-character word label ‘S’, 3 beginning
letters ‘B’, 4 middle letters ‘M’, and 3 ending letters ‘E’. So the probability of each label is as
follows:
P(S) = 1/11
P(B) = 3/11
P(M) = 4/11
P(E) = 3/11
The conditional probability for the letter ‘t’ given the tag S is zero; given the tag B it is 2/3 (‘this’,
‘test’); given the tag M it is zero; and given the tag E it is 1/3 (‘test’).
P(t|S) = 0
P(t|B) = 2/3 -- 'this', 'test'
P(t|M) = 0
P(t|E) = 1/3 -- 'test'
To calculate the tag for the letter ‘t’ from our training sentence, we have.
P(y,x) = P(x|y)P(y)
p(S,t) = P(t|S)P(S) = 0 * 1/11 = 0
P(B,t) = P(t|B)P(B) = 2/3 * 3/11 = 2/11 ≈ 0.18
p(M,t) = P(t|M)P(M) = 0 * 4/11 = 0
p(E,t) = P(t|E)P(E) = 1/3 * 3/11 = 1/11 ≈ 0.09
For the given input letter ‘t’, the algorithm will predict the tag as B (beginning letter) since it
has the highest probability.
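These numbers can be computed directly from the tagged training string; a small sketch that reproduces the calculation:

from collections import Counter

letters = list("thisisatest")
tags = list("BMMEBESBMME")   # the labels given above

tag_counts = Counter(tags)               # for P(y)
pair_counts = Counter(zip(tags, letters))   # for P(x|y)

def joint(letter, tag):
    # P(y, x) = P(x|y) * P(y)
    p_x_given_y = pair_counts[(tag, letter)] / tag_counts[tag]
    p_y = tag_counts[tag] / len(tags)
    return p_x_given_y * p_y

for tag in "SBME":
    print(tag, round(joint("t", tag), 3))
# B gives the highest value, so the letter 't' is predicted to be a beginning letter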
This approach only looks at one character at a time and assumes the sequence is independent.
But, actually the input/tag pair are not independent. But one of the main use cases for Naive
Bayes is in text classification like predicting spam and not spam email.
But HMM incorporates the dependency between the states (i.e., the next hidden state Yk depends
on the current state Yk-1). Instead of using just the probability of the next state Yk, HMM uses
the conditional probability of the next state Yk given the current state Yk-1. This results in
using the term P(Yk|Yk-1) instead of P(Yk) in the joint-probability formula.
To illustrate in a graph format, we can think of Naive Bayes joint probability between label and
input but independence between each pair. It can be shown as:
Therefore, for the word segmentation problem, X denotes the given character and Y denotes the
tag (S,B,M,E).
In the formula,
i) P(xk | yk) = P(character | tag) is the emission probability.
ii) P(yk | yk-1) = P(tagk | tagk-1), which refers to the probability for the next tag to be tagk given
the current tag, tagk-1.
Eg., P(E|B) is the probability that the next tag is ‘E’ given that the current tag is ‘B’.

Emission probabilities (from “thisisatest” tagged B M M E B E S B M M E):
i) For the letter ‘a’: P(‘a’|S)=1/1=1, P(‘a’|B)=0/3=0, P(‘a’|M)=0/4=0, P(‘a’|E)=0/3=0
ii) For the letter ‘e’: P(‘e’|S)=0/1=0, P(‘e’|B)=0/3=0, P(‘e’|M)=1/4=0.25, P(‘e’|E)=0/3=0
(the remaining letters are computed in the same way)

Transition Probabilities: a matrix that represents the probability of transitioning from one tag to another, e.g.
P(B|S) = count of B tag after S tag / count of S tags = 1/1 = 1
P(M|S) = 0/1 = 0
P(E|S) = 0/1 = 0
Emission Matrix:
Tag \ char    a      e      h      i      s      t
S             1      0      0      0      0      0
B             0      0      0      0.33   0      0.66
M             0      0.25   0.25   0.25   0.25   0
E             0      0      0      0      0.66   0.33

Transition Matrix:
Tag \ tag     S      B      M      E
S             0      1      0      0
B             0      0      0.66   0.33
M             0      0      0.5    0.5
E             0.66   0.33   0      0
While starting, the transition probability from <start> to the various states is unknown; therefore, it
can be assumed to be uniform, 0.25 for each of the four tags.
Let us assume that we are given the input character sequence ‘test’ to be tagged with the labels
S, B, M, E, for which the joint probability P(t,e,s,t, S,B,M,E) has to be calculated.
The number of hidden states, N = 4. The number of observed characters in the input, T = 4.
The total number of possible tag sequences to be explored = N^T = 4^4 = 256,
i.e., the probability of the letters t, e, s, t taking each of the 256 tag sequences such as SSSS, SSSB,
SSSM, SSSE, … has to be calculated, and the sequence with the maximum probability is taken as the output.
Viterbi algorithm follows dynamic programming technique to predict the output efficiently. In
this method, if there are multiple paths leading to the same node, then only the path with the
highest probability value can be explored. Other paths can be omitted.
From the <start>, when the first input letter ‘t’ is given
p(B|t)=p(t|B)*p(B|<start>)
=0.66*0.25=0.165
p(M|t)=p(t|M)*p(M|<start>)
=0*0.25=0
p(E|t)=p(t|E)*p(E|<start>)
=0.33*0.25=0.0825
‘e’
i) p(B|e)=p(e|B)*p(B|B)*0.165 = 0*0*0.165=0
ii) p(B|e)=p(e|B)*p(B|E)*0.0825=0*0.33*0.0825=0
i) p(M|e)=p(e|M)*p(M|B)*0.165=0.25*0.66*0.165=0.027
ii) p(M|e)=p(e|M)*p(M|E)*0.0825=0.25*0*0.0825=0
i) p(E|e)=p(e|E)*p(E|B)*0.165=0*0.33*0.165=0
ii) p(E|e)=p(e|E)*p(E|E)*0.0825=0*0*0.0825=0
i) p(S|e)=p(e|S)*p(S|B)*0.165=0*0*0.165=0
ii) p(S|e)=p(e|S)*p(S|E)*0.0825=0*0.66*0.0825=0
‘s’
p(B|s)=p(s|B)*p(B|M)*0.027 = 0*0*0.027=0
p(M|s)=p(s|M)*p(M|M)*0.027= 0.25*0.5*0.027=0.0034
p(E|s)=p(s|E)*p(E|M)*0.027=0.66*0.5*0.027=0.00891
p(S|s)=p(s|S)*p(S|M)*0.027=0*0*0.027=0
‘t’
iv) p(B|t)=p(t|B)*p(B|E)*0.00891=0.66*0.33*0.00891=0.0019
*******************
The sequence of rule expansions is called a derivation of the string of words. A parse tree is generally
used to represent a derivation.
Eg.,
Syntactic parsing: It is the problem of mapping from a string of words to its parse tree. It determines
if the structure of a sentence is according to the grammar of the language.
There are several approaches to construct a parse tree -top-down, bottom-up.
Ambiguous sentences lead to construction of more than one parse tree.
CYK or CKY (Cocke-Kasami-Younger) algorithm is a DP (Dynamic Programming) technique used to
efficiently generate all possible parse trees for a given word sequence and a grammar.
Note: For the CYK algorithm to be applied, the grammar should be in CNF (Chomsky Normal Form).
A grammar is said to be in CNF if each rule derives either a single terminal or exactly two non-terminals.
Eg., S-> NP VP, VP->V NP, PP->P V, DET->a, P->with ---> all these are in CNF
NP->the noun, DET->a the , NP -> N P V ----> all these are not in CNF
Row 0 (the):      X01 = DET->the     X02 = NP->DET,N   X03 = ---   X04 = ---   X05 = S->NP,VP
Row 1 (flight):   X12 = N->flight    X13 = ---         X14 = ---   X15 = ---
Row 2 (includes): X23 = V->includes  X24 = ---         X25 = VP->V,NP
Row 3 (a):        X34 = DET->a       X35 = NP->DET,N
Row 4 (meals):    X45 = N->meals
I) X02=x01+x12= DET,N =NP (since there is a production in the grammar where NP derives
DET,N)
X13=x12+x23=N,V= null (as there is no production that derives N,V)
X24=x23+x34=V,DET=null
X35=x34+x45=DET,N =NP
II) X03=(x01+x13) or (x02+x23)
=(DET,null) or (NP,V)
=(null) or (null)
X14=(x12+x24) or (x13+x34)
=(N,null) or (null, DET)
=(null) or(null)
X25=(x23+x35) or (x24+x45)
=(V,NP) or (null,N)
=(VP->V,NP) or (null)
III) X04=(x01+x14) or (x02+x24) or (x03+x34)
=(DET,null) or (NP,null) or (null,DET)
=null
X15=(x12+x25) or (x13+x35) or (x14+x45)
=(N,VP) or (null,NP) or (null,N)
=null
IV) X05=(x01+x15) or (x02+x25) or (x03+x35) or (x04+x45)
=(DET,null) or (NP,VP) or (null,NP) or (null,N)
=S->NP,VP
Note: At last, if the start symbol is obtained, it infers that the sentence is correct according to the
given CNF grammar. If the start symbol is not reached, then the sentence is syntactically wrong.
Ie., arrangement of the words is not correct.
Now, to construct the parse tree, start with the cell ‘S->NP, VP’. Look for the cell in the same row
that has rule for NP and look for the cell in the same column that has rule for VP and proceed with
the same way until all the leaf nodes contain the terminal symbols.
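The chart-filling procedure can also be written as a small program. Below is a minimal CYK recognizer sketch in Python; it uses the CNF grammar of Example-2 (given just after this), with the VP1/VP2 labels collapsed into a single VP, and it only checks whether the start symbol is derivable, without building the parse trees:

from itertools import product

# CNF grammar of Example-2: binary rules (B, C) -> possible left-hand sides
binary_rules = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
    ("VP", "PP"): {"VP"},
    ("P", "NP"): {"PP"},
    ("NP", "PP"): {"NP"},
}
lexical_rules = {"we": {"NP"}, "eat": {"V"}, "fish": {"NP"}, "fork": {"NP"}, "with": {"P"}}

def cyk_recognize(words, start="S"):
    n = len(words)
    # chart[i][j] holds the set of non-terminals that derive words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical_rules.get(w, set()))
    for span in range(2, n + 1):              # length of the span
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):         # split point
                for B, C in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary_rules.get((B, C), set())
    return start in chart[0][n], chart

ok, chart = cyk_recognize("we eat fish with fork".split())
print(ok)           # True: the start symbol is derivable, so the sentence is grammatical
print(chart[0][5])  # contains 'S'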
Example-2:
Let’s consider an example grammar and sentence that lead to multiple parse trees.
Note: If there are 2 productions for the same non-terminal, let us label them as rule 1 and rule2.
Eg., in the CNF grammar given below, there are two rules for VP. So, label them as VP1 and VP2
S → NP VP
VP1 → V NP
VP2 → VP PP
PP → P NP
V → eat
NP → NP PP
NP → we
NP → fish
NP → fork
P → with
“We eat fish with fork”
Row 0 (we):    X01 = NP->we     X02 = ---        X03 = S->NP,VP1   X04 = ---   X05 = S->NP,VP1 and S->NP,VP2
Row 1 (eat):   X12 = V->eat     X13 = VP1->V,NP  X14 = ---         X15 = VP1->V,NP and VP2->VP,PP
Row 2 (fish):  X23 = NP->fish   X24 = ---        X25 = NP->NP,PP
Row 3 (with):  X34 = P->with    X35 = PP->P,NP
Row 4 (fork):  X45 = NP->fork

III) X15=(x12+x25) or (x13+x35)
=(V,NP) or (VP1,PP)
=(VP1->V,NP) or (VP2->VP,PP)
IV) X05=(x01+x15) or (x02+x25) or (x03+x35) or (x04+x45)
=(NP,VP1 and NP,VP2) or (null,NP) or (S,PP) or (null,NP)
=S->NP,VP1 and S->NP, VP2
Parse tree generation:
i) First construct the tree with the first production of ‘S’, S->NP,VP1 (follow the rules from the
corresponding cells in the table above).
This results in the interpretation “we eat, fish with fork”, i.e., we eat that fish which has a fork.
ii) Now construct the tree with the second production of ‘S’, S->NP,VP2 (again following the rules
from the corresponding cells).
This results in the interpretation “we eat fish, with fork”, i.e., we eat fish using a fork.
Example-3
Consider the sentence “ a pilot likes flying planes” and the following CNF grammar
CNF grammar:
S -> NP VP
VP1 -> VBG NNS
VP2 -> VBZ VP
VP3 -> VBZ NP
NP1 -> DET NN
NP2 -> JJ NNS
Lexical rules:
DET -> a
NN -> pilot
VBZ -> likes
VBG -> flying
JJ -> flying
NNS -> planes
(Here, NN = singular noun, NNS = plural noun, VBG = continuous-tense (gerund) verb,
VBZ = third-person singular present-tense verb, JJ = adjective.)
Row 0 (a):      X01 = DET->a      X02 = NP1->DET,NN   X03 = ---   X04 = ---   X05 = S->NP1,VP2 and S->NP1,VP3
Row 1 (pilot):  X12 = NN->pilot   X13 = ---           X14 = ---   X15 = ---
Row 2 (likes):  X23 = VBZ->likes  X24 = ---           X25 = VP2->VBZ,VP1 and VP3->VBZ,NP
Row 3 (flying): X34 = VBG->flying and JJ->flying      X35 = VP1->VBG,NNS and NP2->JJ,NNS
Row 4 (planes): X45 = NNS->planes
i) First construct the tree with the first production of ‘S’, S->NP1,VP2 (following the rules from the
corresponding cells in the table above).
This results in the interpretation “a pilot likes flying, planes”, i.e., a pilot likes to fly planes.
ii) Now construct the tree with the second production of ‘S’, S->NP1,VP3 (again following the rules
from the corresponding cells).
This results in the interpretation “a pilot likes, flying planes”, i.e., a pilot likes those planes which are
flying.
*****************************
Language Models
Models that assign probabilities to sequences of words are called language models.
An n-gram is a contiguous sequence of n items from a given sample of text or speech. The
items can be phonemes, syllables, letters, words or base pairs according to the application, and
they are typically collected from a text or speech corpus. N-grams are useful for spelling and
grammar correction and for translation.
A 2-gram is a two-word sequence of words like “turn off” and a 3-gram is a three-word
sequence of words like “take the test”.
n-gram models can be used to estimate the probability of the last word of an n-gram, given
the previous words and also to assign probabilities to entire sequences.
The term ‘n-gram’ can be used to mean either the word sequence itself or the predictive
model that assigns it a probability.
The joint probability of A and B can be written as
P(A,B) = P(A) . P(B|A)
This formula can be extended (the chain rule) to multiple variables:
P(w1, w2, …, wn) = P(w1) . P(w2|w1) . P(w3|w1,w2) . … . P(wn|w1,…,wn-1)
Eg., the probability of occurrence of the word sequence “school students do their homework”
is
P(school students do their homework)=p(school).p(students | school).p(do | school
students).p(their | school students do).p(homework | school students do their) ……(1)
But the drawback here is that the longer the sequence, the less likely we are to find it in a
training corpus.
This can be resolved using n-gram models.
The intuition of the n-gram model is that instead of computing the probability of a word
given its entire history, we can approximate the history by just the last few words.
The bigram model, for example, approximates the probability of a word given
all the previous words P(wn | w1:n-1) by using only the conditional probability of the
preceding word P(wn | wn-1).
In other words, instead of computing the probability, p(homework | school students do their),
we approximate it with the probability, p(homework | their).
This way of assuming that the probability of a word depends only on the previous word is
called a Markov assumption.
We can generalize the bigram (which looks one word into the past) to the trigram (which
looks two words into the past) and thus to the n-gram (which looks n-1 words into the past).
Thus, the general equation for this N-gram approximation to the conditional probability of the
next word in a sequence is
P(wn|w1:n-1) ≈ P(wn | wn-N+1:n-1)
In a unigram (1-gram) model, no history is used. In a bigram, one word history is used and in
a n-gram, n-1 words history is used.
Based on this, Equation (1) can be written as follows.
P(school students do their homework)=p(school)p(students | school).p(do | students).p(their |
do).p(homework | their) ……………….(2)
Maximum Likelihood Estimation (MLE) can be used to find the probabilities of n-grams that
uses the count of occurrences of the n-grams.
Eg, for the bigram model, the probability of a bigram is calculated by taking the count of the
number of times a given bigram (wn-1 wn) occurs in a corpus and normalizing it by the total
number of bigrams that share the same first word (wn-1) in the corpus:
P(wn | wn-1) = count(wn-1 wn) / Σw count(wn-1 w)
But the total count of the bigrams that start with wn-1 is equal to the unigram count of wn-1.
So, the formula can be simplified to
P(wn | wn-1) = count(wn-1 wn) / count(wn-1)
i.e., p(homework | their) = count("their homework") / count("their")
Likewise, the probability should be calculated for all the other components p(students |
school), p(do | students) and p(their | do) given in equation (2).
Next word estimation using bi-gram model:
Conditional probability is given by
P(B|A) = P(A,B) / P(A) ………….. (3)
Given the word sequence “school students do their homework”, if we wish to find the
probability for the next word to be “regularly”, based on eqn (3), the formula can be written
as
P(regularly | school students do their homework)
= p(school students do their homework regularly) / p(school students do their homework) ----- from eqn (3)
Using the bigram (Markov) approximation,
P(regularly | school students do their homework) ≈ p(regularly | homework)
= count("homework regularly") / count("homework")
Thus, to find the probability of the next word in the sequence “school students do their
homework” to be “regularly”, it is enough if we find the probability of “regularly” given just
the previous word, “homework”.
Trigram:
The same idea for bigram model can be extended to trigram, four-gram and in general, n-
gram models.
P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
P(regularly | school students do their homework)
= [p(school) . p(students | <s>, school) . p(do | school, students) . p(their | students, do) . p(homework | do, their) . p(regularly | their, homework)]
  / [p(school) . p(students | <s>, school) . p(do | school, students) . p(their | students, do) . p(homework | do, their)]
= p(regularly | their, homework)
Thus, in a trigram model, to find the probability of the next word in the sequence “school
students do their homework” to be “regularly”, it is enough if we find the probability of
“regularly” given just the previous two words, “their homework”.
In practice, unigram models are not commonly used because the next word will be predicted
as the highest frequency word in the corpus.
Eg., Assume the next word has to be predicted for the sequence “ covishield is the vaccine for
_______”. If the word “and” occurs, the highest number of times in a corpus, then the
unigram model will predict “and” as the next word resulting in “covishield is the vaccine for
and”.
Bigram, trigram and four-gram models are usually preferred. For models with higher values
of ‘n’, a larger corpus is required.
Examples:
Consider the following tiny corpus with seven sentences:
<s> I am Henry </s>
<s> I like college </s>
<s> Do Henry like college </s>
<s> Henry I am </s>
<s> Do I like Henry </s>
<s> Do I like college </s>
<s> I do like Henry </s>

Word frequencies: <s> 7, </s> 7, I 6, am 2, Henry 5, like 5, college 3, do 4
i) Using bigram model, predict the most probable next word for the sequence,
<s> do _______
Next word Probability of next word
P(</s> | do) 0/4
P(I | do) 2/4 -> the most probable next word is “I”
P(am | do) 0/4
P(Henry | do) 1/4
P( like | do) 1/4
P(college| do) 0/4
P(do | do) 0/4
ii) Using bigram model, predict the most probable next word for the sequence,
<s> I like Henry _______
(Using the bigram counts, p(</s> | Henry) = 3/5 is the highest among the candidates, so the most
probable next word is </s>.)
iii) Using trigram model, predict the most probable next word for the sequence,
<s> Do I like _______
Note: count(I like) = 3
Next word Probability of next word
P(</s> | I like) 0/3
P(I | I like) 0/3
P(am | I like) 0/3
P(Henry | I like) 1/3
P( like | I like) 0/3
P(college| I like) 2/3 -> college is more probable
P(do | I like) 0/3
iv) Using 4-gram model, predict the most probable next word for the sequence,
<s> Do I like college _______
Note: count(I like college) = 2
Next word Probability of next word
P(</s> | I like college) 2/2 -> </s> is more probable
P(I | I like college) 0/2
P(am | I like college) 0/2
P(Henry | I like college) 0/2
P( like | I like college) 0/2
P(college| I like college) 0/2
P(do | I like college) 0/2
v) Using bigram model and the corpus mentioned above, predict the most probable
sentence out of the following two.
a) <s> I like college </s>
b) <s> do I like Henry</s>
a) <s> I like college </s>
= p(I | <s>) * p(like | I) * p(college | like) * p(</s> | college)
= 3/7 * 3/6 * 3/5 * 3/3 = 9/70 ≈ 0.1286
b) <s> do I like Henry </s>
= p(do | <s>) * p(I | do) * p(like | I) * p(Henry | like) * p(</s> | Henry)
= 3/7 * 2/4 * 3/6 * 2/5 * 3/5 = 18/700 ≈ 0.0257
Since 0.1286 > 0.0257, sentence (a) is more probable.
Eg., in the calculation above (b), the result is 0.0257 and if we keep multiplying the
probabilities of few more bigrams, the result will become smaller and smaller leading to
underflow. To avoid that, the same two calculations can be done as follows.
a) <s> I like college </s>
= p(I | <s>) * p(like | I) * p(college | like) * p( </s> | college)
= 3/7 * 3/6 * 3/5 * 3/3
Taking (natural) logarithms instead: log(3/7) + log(3/6) + log(3/5) + log(3/3) = -2.0513
b) Similarly, log(3/7) + log(2/4) + log(3/6) + log(2/5) + log(3/5) = -3.6606
Even with logarithmic calculations, the first sentence (the one with the higher, i.e. less negative, log
probability) is found to be more probable.
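These calculations can be reproduced with a few lines of Python. A minimal sketch that builds bigram counts from the tiny corpus above and scores the two candidate sentences in log space:

import math
from collections import Counter

corpus = [
    "<s> I am Henry </s>", "<s> I like college </s>", "<s> Do Henry like college </s>",
    "<s> Henry I am </s>", "<s> Do I like Henry </s>", "<s> Do I like college </s>",
    "<s> I do like Henry </s>",
]
unigrams = Counter()
bigrams = Counter()
for line in corpus:
    words = [w.lower() for w in line.split()]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    # MLE estimate: count(prev word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_logprob(sentence):
    words = [w.lower() for w in sentence.split()]
    return sum(math.log(bigram_prob(a, b)) for a, b in zip(words, words[1:]))

print(sentence_logprob("<s> I like college </s>"))    # about -2.05
print(sentence_logprob("<s> do I like Henry </s>"))   # about -3.66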
Zero-Probability problem:
Let us assume that we have to calculate the probability of the word sequence
“<s> like college </s>”
= p(like | <s>) * p(college | like) * p( </s> | college)
= 0/7*3/5*3/3 = 0
Probability is evaluated as 0. Though “like college” is present thrice in the corpus, because
“<s> like” doesn’t occur even once, the first term becomes 0 and so the answer is 0.
This is termed as zero-probability problem.
Good-Turing smoothing can be used to handle this (the full table from the source is not reproduced
here). An unseen word such as “daily” has an MLE probability of 0 but receives a small non-zero
probability (about 0.04 in the source's table) after smoothing. For a word such as “tennis”, which
occurs once (c = 1) in a corpus of 18 tokens:
P_GT(tennis) = (c + 1) * (N_(c+1) / N_c) / N = (2 * 1) / (3 * 18) = 2/54 = 1/27
Therefore, the probability of the existing word “tennis” is discounted from 1/18 to 1/27 and
that extra probability mass is used to account for the unseen words.
Perplexity (PP) is an intrinsic evaluation measure for language models.
A language model is better if it assigns a higher probability to an unseen test set.
Perplexity is the inverse probability of the test data, normalized by the number of words:
PP(W) = P(w1 w2 … wn)^(-1/n)
The lower the perplexity, the better the model; the higher the perplexity, the more confused the
model is when predicting the next word.
Perplexity calculation for bigram and trigram models based on the sentence,
<s> I like college </s>:
i) Bigram model:
P = p(I | <s>) * p(like | I) * p(college | like) * p( </s> | college)
= 3/7 * 3/6 * 3/5 * 3/3 = 9/70 ≈ 0.13
PP(W) = (1/0.13)^(1/4) ≈ 1.67
ii) Trigram model:
P = p(like | <s> I) * p(college | I like) * p( </s> | like college)
= 1/3 * 2/3 * 3/3 = 2/9 ≈ 0.22
PP(W) = (1/0.22)^(1/3) ≈ 1.66 (since this value is smaller, the trigram model is better for
this example).
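The two perplexity values can be checked quickly in Python; the probabilities are the ones computed just above:

# perplexity = (1 / P(sentence)) ** (1 / N), with N predicted tokens
p_bigram = (3/7) * (3/6) * (3/5) * (3/3)   # ≈ 0.1286
p_trigram = (1/3) * (2/3) * (3/3)          # ≈ 0.2222
print((1 / p_bigram) ** (1 / 4))   # ≈ 1.67
print((1 / p_trigram) ** (1 / 3))  # ≈ 1.65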
****************************
Semantic Role Labeling:
In linguistics, predicate refers to the main verb in the sentence. Predicate takes arguments. The role
of Semantic Role Labelling (SRL) is to determine how these arguments are semantically related to
the predicate.
Consider the sentence "Mary loaded the truck with hay at the depot on Friday". 'Loaded' is the
predicate. Mary, truck and hay have respective semantic roles of loader, bearer and cargo. We can
identify additional roles of location (depot) and time (Friday). The job of SRL is to identify these roles
so that downstream NLP tasks can "understand" the sentence.
SRL is also known by other names such as thematic role labelling, case role assignment, or shallow
semantic parsing.
Often an idea can be expressed in multiple ways. Consider these sentences that all mean the same
thing: "Yesterday, Kristina hit Scott with a baseball"; "Scott was hit by Kristina yesterday with a
baseball"; "With a baseball, Kristina hit Scott yesterday"; "Kristina hit Scott with a baseball
yesterday".
Either constituent or dependency parsing will analyze these sentences syntactically. But syntactic
relations don't necessarily help in determining semantic roles. However, parsing is not completely
useless for SRL. In a traditional SRL pipeline, a parse tree helps in identifying the predicate
arguments.
But SRL performance can be impacted if the parse tree is wrong. This has motivated SRL approaches
that completely ignore syntax.
SRL is useful in any NLP application that requires semantic understanding: machine translation,
information extraction, text summarization, question answering, and more. For example, predicates
and heads of roles help in document summarization. For information extraction, SRL can be used to
construct extraction rules.
SRL can be seen as answering "who did what to whom". Obtaining semantic information thus
benefits many downstream NLP tasks such as question answering, dialogue systems, machine
reading, machine translation, text-to-scene generation, and social network analysis.
One of the oldest models of semantic roles is called thematic roles where roles are assigned to
subjects and objects in a sentence. Roles are based on the type of event. For example, if the verb is
'breaking', roles would be breaker and broken thing for subject and object respectively. Some
examples of thematic roles are agent, experiencer, result, content, instrument, and source. There's
no well-defined universal set of thematic roles.
A modern alternative from 1991 is proto-roles that defines only two roles: Proto-Agent and Proto-
Patient. Using heuristic features, algorithms can say if an argument is more agent-like (intentionality,
volitionality, causality, etc.) or patient-like (undergoing change, affected by, etc.).
Verbnet, PropBank and FrameNet help in semantic role labelling. VerbNet is a resource that groups
verbs into semantic classes and their alternations.
PropBank contains sentences annotated with proto-roles and verb-specific semantic roles.
Arguments to verbs are simply named Arg0, Arg1, etc. Typically, Arg0 is the Proto-Agent and Arg1 is
the Proto-Patient.
FrameNet is another lexical resource defined in terms of frames rather than verbs. For every
frame, core roles and non-core roles are defined. Frames can inherit from or causally link to other
frames.
Path feature:
This feature is designed to capture the syntactic relation of a constituent to the rest of the sentence.
However, the path feature describes the syntactic relation between the target word (that is, the
predicate invoking the semantic frame) and the constituent in question, whereas the previous
feature is independent of where the target word appears in the sentence; that is, it identifies all
subjects whether they are the subject of the target word or not. This feature is defined as the path
from the target word through the parse tree to the constituent in question, represented as a string
of parse tree non-terminals linked by symbols indicating upward or downward movement through
the tree, as shown in Figure. Although the path is composed as a string of symbols, our systems will
treat the string as an atomic value. The path includes, as the first element of the string, the part of
speech of the target word, and, as the last element, the phrase type or syntactic category of the
sentence constituent marked as a frame element.
In this example, the path from the target word (predicate) ‘ate’ to the frame element ‘He’ can be
represented as VB↑VP↑S↓NP, with ↑ indicating upward movement in the parse tree and ↓
downward movement. Our path feature is dependent on the syntactic representation used, and in
this case, the Treebank-2 annotation style is used.
The most common values of the path feature, along with interpretations, are shown in the table. As
per the table, VB↑VP↑S↓NP results in a subject. And so, ‘he’ should be labelled as subject for the
predicate ‘ate’.
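The path feature can be extracted mechanically from a parse tree. Below is a minimal sketch using an nltk.Tree built by hand for a sentence like "He ate some pancakes"; the tree string and the node positions are illustrative assumptions, not taken from the source figure:

from nltk import Tree

def path_feature(tree, pred_pos, arg_pos):
    # pred_pos / arg_pos are tree positions (tuples of child indices) of the
    # predicate's POS node and of the argument constituent node
    i = 0
    while i < min(len(pred_pos), len(arg_pos)) and pred_pos[i] == arg_pos[i]:
        i += 1   # length of the common prefix = lowest common ancestor
    up = [tree[pred_pos[:j]].label() for j in range(len(pred_pos), i, -1)]
    down = [tree[arg_pos[:j]].label() for j in range(i + 1, len(arg_pos) + 1)]
    return "↑".join(up) + "↑" + tree[pred_pos[:i]].label() + "↓" + "↓".join(down)

t = Tree.fromstring("(S (NP (PRP He)) (VP (VB ate) (NP (DT some) (NN pancakes))))")
print(path_feature(t, pred_pos=(1, 0), arg_pos=(0,)))   # VB↑VP↑S↓NP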
The following figure shows the annotation for the sentence “They expect him to cut costs
throughout the organization”, which exhibits the syntactic phenomenon known as subject-to-object
raising, in which the main verb’s object is interpreted as the embedded verb’s subject. The
Treebank-2 uses S nodes generously to indicate clauses, so as to make possible a relatively
straightforward mapping from S nodes to predications. In this example, the path from ‘cut’ to the
frame element ‘him’ would be VB↑VP↑VP↑S↓NP, which typically indicates a verb’s subject as per
the above table. But when we consider the accusative case of the pronoun ‘him’, with the target
word (predicate) to be ‘expect’ in the sentence, the path to him would be VB↑VP↓S↓NP, rather
than the typical direct object path of VB↑VP↓NP.
OVERLAP BASED APPROACHES
These require a Machine Readable Dictionary (MRD). The idea is to find the overlap between the
features of different senses of an ambiguous word (the sense bag) and the features of the words in
its context (the context bag).
LESK’S ALGORITHM
Sense Bag: contains the words in the definition of a candidate sense of the
ambiguous word.
Context Bag: contains the words in the definition of each sense of each context
word.
E.g. “On burning coal we get ash.”
From Wordnet
The noun ash has 3 senses (first 2 from tagged texts)
1. (2) ash -- (the residue that remains when something is burned)
2. (1) ash, ash tree -- (any of various deciduous pinnate-leaved
ornamental or timber trees of the genus Fraxinus)
3. ash -- (strong elastic wood of any of various ash trees; used for
furniture and tool handles and sporting goods such as baseball
bats)
The verb ash has 1 sense (no senses from tagged texts)
1. ash -- (convert into ashes)
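NLTK ships a simplified implementation of this overlap idea, which gives a quick way to experiment with it (the sense it picks depends on its overlap heuristic and may differ from a hand analysis):

import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

context = word_tokenize("On burning coal we get ash.".lower())
sense = lesk(context, 'ash', 'n')   # restrict to noun senses
print(sense, '-', sense.definition())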
CRITIQUE
Proper nouns in the context of an ambiguous word can act as
strong disambiguators.
E.g. “Sachin Tendulkar” will be a strong indicator of the
category “sports”.
Sachin Tendulkar plays cricket.
Proper nouns are not present in the thesaurus. Hence this
approach fails to capture the strong clues provided by proper
nouns.
Accuracy: 50% when tested on 10 highly polysemous English words.
Extended Lesk’s algorithm
The score of a candidate sense is computed from the overlap between its extended gloss and the context,
where
gloss(S) is the gloss of sense S from the lexical resource,
Context(W) is the gloss of each sense of each context word, and
rel(s) gives the senses related to s in WordNet under some relations (e.g., hypernymy, hyponymy, meronymy).
(WordNet sub-graph figure: the synset {house, home}, whose gloss is "a place that serves as the living
quarters of one or more families", is linked by hypernymy to {dwelling, abode} and, via
meronymy/hyponymy, to related nodes such as kitchen, backyard, bedroom, veranda and study.)
From Wordnet
The noun ash has 3 senses (first 2 from tagged texts)
1. (2) ash -- (the residue that remains when something is burned)
2. (1) ash, ash tree -- (any of various deciduous pinnate-leaved
ornamental or timber trees of the genus Fraxinus)
3. ash -- (strong elastic wood of any of various ash trees; used for
furniture and tool handles and sporting goods such as baseball
bats)
The verb ash has 1 sense (no senses from tagged texts)
1. ash -- (convert into ashes)
Example: Extended Lesk (cntd)