
Phases of NLP

Natural Language vs. Computer Language

Below are the main differences between Natural Language and Computer Language:

Parameter    | Natural Language (free form)                      | Computer Language
Ambiguity    | Natural languages are ambiguous in nature.        | They are designed to be unambiguous.
Redundancy   | Natural languages employ lots of redundancy.      | Formal languages are less redundant.
Literalness  | Natural languages are made of idiom and metaphor. | Formal languages mean exactly what they want to say.

From the computer's point of view, any natural language is free-form text: there are no set keywords at set positions in the input, i.e., it is unstructured.
Beyond the unstructured nature, there can also be multiple ways to express something using a natural language. For example, consider these three sentences:

• How is the weather today?


• Is it going to rain today?
• Do I need to take my umbrella today?

All these sentences have the same underlying question, which is to enquire about today’s
weather forecast.

As humans, we can identify such underlying similarities almost effortlessly and respond
accordingly. But this is a problem for machines—any algorithm will need the input to be in a
set format, and these three sentences vary in their structure and format. And if we decide to
code rules for each and every combination of words in any natural language to help a machine
understand, then things will get very complicated very quickly.

This is where NLP enters the picture.

NLP is a subset of AI tasked with enabling machines to interact using natural languages. The
domain of NLP also ensures that machines can:



• Process large amounts of natural language data
• Derive insights and information

The aim of NLP is to process the free form natural language text so that it gets transformed
into a standardized structure.

NLP is an umbrella term which encompasses any and everything related to making machines
able to process natural language—be it receiving the input, understanding the input, or
generating a response.

Natural Language Processing (NLP) is a combination of computer science, artificial intelligence, and computational linguistics, aimed at helping humans and machines communicate in natural language, just like a human-to-human conversation.

NLP is divided into two components.

• Natural Language Understanding

Natural Language Understanding (NLU) helps the machine understand and analyze human language by extracting information such as keywords, emotions, relations, and semantics from large amounts of text. NLU is harder than NLG.

• Natural Language Generation:

Natural Language Generation is the technology that analyzes, interprets, and organizes data into comprehensible, written text. NLG aids the machine in sorting through many variables and putting "text into context," thus delivering natural-sounding sentences and paragraphs that observe the rules of English grammar. It is the process of producing meaningful phrases and sentences in the form of natural language from some internal representation.
It involves:
• Text planning − retrieving the relevant content from the knowledge base (domain).
• Sentence planning − choosing the required words, forming meaningful phrases, and setting the tone of the sentence.
• Text realization − mapping the sentence plan into sentence structure.
So, how do NLP & NLU differ?
In natural language, what is expressed (either via speech or text) is not always what is meant.
Let’s take an example sentence:

• Please crack the windows, the car is getting hot.


NLP focuses on processing the text in a literal sense, like what was said. Conversely, NLU
focuses on extracting the context and intent, or in other words, what was meant.

NLP will take the request to crack the windows in the literal sense, but it will be NLU which
will help draw the inference that the user may be intending to roll down the windows.



NLP alone could result in literal damage. NLP can process text in terms of grammar, structure, typos, and point of view, but it is NLU that helps the machine infer the intent behind the text. So, even though there are many overlaps between NLP and NLU, this differentiation sets them distinctly apart.

On our quest to make more robust autonomous machines, it is imperative that we are able to
not only process the input in the form of natural language, but also understand the meaning and
context—that’s the value of NLU. This enables machines to produce more accurate and
appropriate responses during interactions.

Let’s take the example of ubiquitous chatbots.


Gone are the days when chatbots could only produce programmed, rule-based interactions with their users. Back then, the moment a user strayed from the set format, the chatbot either made the user start over or made the user wait while it found a human to take over the conversation.
Combining NLU and NLP, today’s chatbots are more robust. Using NLU methods, chatbots
are able to:

• Be aware of the conversation’s context


• Extract the conversation’s meaning based on that context
• Guide users on the topic of conversation

NLU

• Considered a subtopic of NLP, the main focus of natural language understanding is to make machines: interpret the natural language, derive meaning, identify context, and draw insights.
• Going back to our weather enquiry example, it is NLU which enables the machine to understand that those three different questions have the same underlying weather-forecast query. After all, different sentences can mean the same thing, and, vice versa, the same words can mean different things depending on how they are used.
• Let's take another example:
  - The banks will be closed for Thanksgiving.
  - The river will overflow the banks during floods.
  A task called word sense disambiguation, which sits under the NLU umbrella, makes sure that the machine is able to understand the two different senses in which the word "bank" is used.
• If we only talk about understanding text, then NLU is enough.
• NLU is a subset of NLP.
• It reads data and converts it to structured data.

NLP

• NLP is about how the machine processes the given data: making decisions, taking actions, and responding to the system. It contains the whole end-to-end process. NLP does not always need to contain NLU.
• If we want more than understanding, such as decision making, then NLP comes into play.
• NLP is a combination of NLU and NLG for conversational Artificial Intelligence problems.
• NLP converts unstructured data to structured data.

NLG

• Once a chatbot, smart device, or search function understands the language it's "hearing," it has to talk back to you in a way that you, in turn, will understand. That's where NLG comes in. It takes data from a search result, for example, and turns it into understandable language. In fact, chatbots have become so advanced that you may not even know you're talking to a machine.
• It generates human-like text based on the structured data.
• NLG is a subset of NLP.
• NLG turns structured data into human-understandable text.

NLP Terminology

• Phonology − It is the study of organizing sounds systematically.
• Morphology − It is the study of the construction of words from primitive meaningful units.
• Morpheme − It is the primitive unit of meaning in a language.
• Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of words in the sentence and in phrases.
• Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases and sentences.
• Pragmatics − It deals with using and understanding sentences in different situations and how the interpretation of the sentence is affected.
• Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence.
• World Knowledge − It includes general knowledge about the world.

Phases of NLP

-Lexical Analysis:

Morphological or Lexical Analysis deals with text at the individual word level. It looks for morphemes, the smallest meaningful units of a word. For example, "irrationally" can be broken into ir (prefix), rational (root, which is a morpheme with dictionary meaning) and -ly (suffix). Lexical Analysis finds the relation between these morphemes and converts the word into its root form. A lexical analyzer also assigns the possible Part-Of-Speech (POS) tags to the word, taking the dictionary of the language into consideration. It involves identifying and analyzing the structure of words. The lexicon of a language is the collection of words and phrases in that particular language. Lexical analysis divides the text into paragraphs, sentences, and words. So, we need to perform lexicon normalization.

-Syntactic Analysis:

Syntactic Analysis is used to check grammar, the arrangement of words, and the interrelationships between the words. Syntax refers to the principles and rules that govern the sentence structure of any individual language. It focuses on the proper ordering of words, which can affect the meaning. This involves analysis of the words in a sentence according to the grammatical structure of the sentence. Given the possible POS tags generated in the previous step, a syntax analyzer assigns the POS tags based on the sentence structure.

Example-1:
Correct syntax: Sun rises in the east.
Incorrect syntax: Rise in sun the east

Sentences rejected by a syntactic parser:

Mumbai goes to Sara

The school goes to boy

Simple methods for syntactic analysis:


• Context-Free Grammar
• Top-Down Parser
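As a small, hedged illustration of these two methods (the toy grammar and sentence below are assumed for demonstration only), NLTK lets us define a context-free grammar and parse with a top-down, recursive-descent parser:

# Minimal sketch: a toy CFG plus NLTK's top-down (recursive descent) parser.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DET N
VP -> V PP
PP -> P DET N
DET -> 'the'
N -> 'sun' | 'east'
V -> 'rises'
P -> 'in'
""")

parser = nltk.RecursiveDescentParser(grammar)
for tree in parser.parse("the sun rises in the east".split()):
    print(tree)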

-Semantic Analysis:

It draws the exact meaning or the dictionary meaning from the text. The text is checked for meaningfulness. It is done by mapping syntactic structures to objects in the task domain. The semantic analyzer disregards sentences such as "hot ice-cream".

Consider the sentence: “The apple ate a banana”. Although the sentence is syntactically correct,
it doesn’t make sense because apples can’t eat. Semantic analysis looks for meaning in the
given sentence. It also deals with combining words into phrases. For example, “red apple”
provides information regarding one object; hence we treat it as a single phrase. Similarly, we
can group names referring to the same category, person, object or organisation. “Robert Hill”
refers to the same person and not two separate names – “Robert” and “Hill”.

–Discourse Integration:



Discourse deals with the effect of a previous sentence on the sentence in consideration. The meaning of any sentence depends upon the meaning of the sentence just before it. In addition, it can also influence the meaning of the immediately succeeding sentence. In the text, "Jack is a bright student. He spends most of his time in the library.", discourse assigns "he" to refer to "Jack".

For example: "Ram wants it."

In the above statement, we can clearly see that the word "it" does not make sense on its own; it refers to something we do not know about. The reference of "it" depends upon the previous sentence, which is not given. So, once the previous sentence is known, we can easily resolve the reference.

Similarly, the word "that" in the sentence "He wanted that" depends upon the prior discourse context.

–Pragmatic Analysis:

The final stage of NLP, pragmatic analysis, interprets the given text using information from the previous steps together with real-world knowledge. For example, the sentence "Turn off the lights" is recognized as an order or request to switch off the lights. During this phase, what was said is re-interpreted as what was actually meant. It involves deriving those aspects of language which require real-world knowledge. It is the study of meaning in a given language and the process of extracting insights from the text, including aspects such as repetition of words and who said what to whom. It considers how people communicate with each other and in which context they are talking.

E.g., "Close the window?" should be interpreted as a request instead of an order.

Challenges of NLP:

1) Irony and sarcasm (the words may appear positive or negative but convey the opposite meaning)
2) Colloquialisms and slang (difficult as they don't have dictionary definitions)
3) Synonyms
4) Homonyms and homophones (their and there): the same word form with different meanings
   E.g., I ran to the store because we ran out of milk
5) Domain-specific language:
   An NLP model needed for healthcare, for example, would be very different from one used to process legal documents. These days, however, there are a number of analysis tools trained for specific fields, but extremely niche industries may need to build or train their own models.
6) Low-resource languages
7) Ambiguities

Lexical Ambiguity: It happens when a word has different meanings. The same word could be used as a verb, noun, or adjective. Lexical ambiguity can be resolved by using parts-of-speech (POS) tagging techniques.

For example:

• He is looking for a match.

The word 'match' may refer to a partner or to a cricket or football match.
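As a quick, hedged illustration (the two sentences below are assumed here), a pretrained POS tagger assigns different tags to the same word in different contexts, which resolves the verb/noun kind of lexical ambiguity (though not finer sense distinctions such as partner vs. sports match):

# Minimal sketch: the same word gets different POS tags in different contexts.
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

for sent in ["They watch the match.", "Their watch is broken."]:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))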

Syntactical ambiguity or Grammatical ambiguity: It happens when there are multiple meanings

in a sequence of words (sentence).

• The Fish is ready to eat.


• He lifted the bottle with red cap.

From the first sentence, it is not clear whether the fish is ready to eat its food or the fish is ready for someone to eat.

Referential ambiguity: It happens when you refer to something using a pronoun.

For example, Rima went to Gauri. She said, “I am tired.” − Exactly who is tired?

Advantages of NLP

• Users can ask questions about any subject and get a direct response within seconds.
• An NLP system provides answers to questions in natural language.
• An NLP system offers exact answers to questions, with no unnecessary or unwanted information.
• The accuracy of the answers increases with the amount of relevant information provided in the question.
• NLP helps computers communicate with humans in their own language and scales other language-related tasks.
• It allows you to process more language-based data than a human being could, without fatigue and in an unbiased and consistent way.
• It structures a highly unstructured data source.

Disadvantages of NLP



• Complex query language: the system may not be able to provide the correct answer to a question that is poorly worded or ambiguous.
• The system is often built for a single, specific task only; it is unable to adapt to new domains and problems because of its limited functions.
• An NLP system may lack a user interface with features that allow users to interact further with the system.

NLP applications
1) Spellcheck
2) Autocomplete: the incomplete terms entered by the user are compared to a dictionary to suggest possible word completions. Coding autocomplete does the same for programming languages.
3) Spam filters
4) Voice text messaging
5) Virtual assistants (Alexa, Siri)
6) Customer support bots or intelligent bots that offer personalized assistance (a conventional chatbot answers basic customer queries and routine requests with canned responses, but such bots cannot recognize more nuanced questions. So, support bots are now equipped with artificial intelligence and machine learning technologies to overcome these limitations. In addition to understanding and comparing user inputs, they can generate answers to questions on their own without pre-written responses.)
7) Language identification: determining the language of a particular body of text involves dealing with different dialects, slang, common words shared between languages, and the use of multiple languages in one page. With machine learning, this task becomes a lot simpler.
8) Sentiment analysis and media monitoring
9) Speech to text (e.g., NPTEL lectures)
10) Translators
11) Text summarization
12) Automated question answering

******************************



Parts of speech tagging

Parts of Speech (PoS), also called word classes, morphological classes, or lexical classes.

Significance of PoS tagging:

- It gives a significant amount of information about the word and its neighbors. Knowing whether the current word is a possessive pronoun (their) or a personal pronoun (they) tells what words are likely to occur in its vicinity.
Eg., 'they watch' and 'their watch': in the first phrase, watch is tagged as a verb because the previous word is a personal pronoun; in the second phrase, watch is tagged as a noun because the previous word is a possessive pronoun.
- It tells us how the word is pronounced. Eg., OBject (noun, with more stress on 'ob') and obJECT (verb, with more stress on 'ject').
- It can tell which affixes the word can take.
- It is used for partially parsing texts, to quickly find names or other phrases for information-extraction applications.
- Corpora marked for PoS are useful for linguistic research.

The Hidden Markov Model (HMM) is a stochastic technique that can be used to find the appropriate tag sequence for a given sentence.

Transition Probability:

It gives the likelihood of a particular tag sequence. It tells how likely it is that a noun is followed by a verb, a verb by an adjective, an adjective by a noun, and so on. This probability should be high for a particular sequence to be correct.

Emission probability:

It is the probability of seeing a specific observable (word) given a hidden state (tag).

Eg., it tells the probability of seeing the word 'Mary' given the tag noun, the word 'will' given the tag modal verb, and so on.

P(yk | yk-1) -> Given the current tag yk-1, it gives the probability that the next tag is yk.

P(xk | yk) -> Given a specific tag yk, it gives the probability that the observed word is xk.

Eg. Given the following set of sentences with their tagging, calculate the 2 probabilities and use
them to tag the test sentence, “Will can spot Mary”.

1) Mary Jane can see Will (N N M V N) where M denotes the modal verb
2) Spot will see Mary (N M V N)
3) Will Jane spot Mary? (M N V N)
4) Mary will pat Spot (N M V N)

Count(Noun)=9, Count(Verb)=4, Count(Modal verb)=4

Transition Probability:

Consider the cell in the first row and first column. That cell stores the probability of seeing a noun after the start symbol, which is equal to (the number of times a noun follows the start symbol) / (the total count of start symbols) = 3/4. Likewise, fill the other cells.

             Noun    Modal    Verb    <End>
<s>          3/4     1/4      0       0
Noun         1/9     3/9      1/9     4/9
Modal verb   1/4     0        3/4     0
Verb         4/4     0        0       0

Emission Probability:

Word    Noun    Verb    Modal verb
Mary    4/9     0       0
Jane    2/9     0       0
Will    1/9     0       3/4
Spot    2/9     1/4     0
Can     0       0       1/4
See     0       2/4     0
Pat     0       1/4     0
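Both tables can be reproduced programmatically. The sketch below (written only for these four training sentences; the variable names are illustrative) counts tag-to-tag transitions and word emissions and prints a few of the relative frequencies shown above:

# Minimal sketch: derive transition and emission probabilities from the tagged corpus above.
from collections import Counter

tagged_sents = [
    [("Mary","N"), ("Jane","N"), ("can","M"), ("see","V"), ("Will","N")],
    [("Spot","N"), ("will","M"), ("see","V"), ("Mary","N")],
    [("Will","M"), ("Jane","N"), ("spot","V"), ("Mary","N")],
    [("Mary","N"), ("will","M"), ("pat","V"), ("Spot","N")],
]

transition = Counter()   # (prev_tag, tag) counts, with <s> and <E> markers
emission = Counter()     # (tag, word) counts
tag_count = Counter()

for sent in tagged_sents:
    prev = "<s>"
    tag_count[prev] += 1
    for word, tag in sent:
        transition[(prev, tag)] += 1
        emission[(tag, word.lower())] += 1
        tag_count[tag] += 1
        prev = tag
    transition[(prev, "<E>")] += 1

# P(tag_k | tag_{k-1}) and P(word | tag) as relative frequencies
print("P(M|<s>) =", transition[("<s>", "M")] / tag_count["<s>"])   # 1/4
print("P(V|M)   =", transition[("M", "V")] / tag_count["M"])       # 3/4
print("P(Mary|N)=", emission[("N", "mary")] / tag_count["N"])      # 4/9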

Tagging the test sentence, “Will can spot Mary”:

It is manually tagged as N M V N.

If the model tags the sentence wrongly, the probability will be zero. If it tags the sentence correctly, the probability will be greater than 0.
E.g., let's assume that the sentence is tagged as M V N N:

=1/4 *3/4 * 3/4 * 0 * 4/9 * 2/9 * 1/9 * 4/9 * 4/4 =0

Let’s assume that the sentence is tagged as “ N M V N”

=3/4 * 1/9 * 3/9 * 1/4 * 3/4 * 1/4 * 4/4 * 4/9 * 4/9

=1728/6718464 = 2.572×10^-4 = 0.0002572, which is greater than 0.

The number of hidden states, N = 3 (N, V, M). The number of observed states in the input, T = 4.
The total number of possible combinations to be explored = N^T = 3^4 = 81.

i.e., the probabilities of the words 'Will', 'can', 'spot', 'Mary' getting any of the 81 tag sequences such as NNNN, NNNM, NNNV, NNMN, NNMV, ... have to be calculated, and the tag sequence that results in the maximum probability value should be taken as the output.
The Viterbi algorithm follows a dynamic programming technique to predict the output efficiently. In this method, if there are multiple paths leading to the same node, then only the path with the highest probability value is explored further. Other paths can be omitted.

i) For the first word ‘Will’:

p(N | Will) = p(Will | N) * p(N|<s>)

=1/9*3/4=0.0833 ✓ (should be explored in the next stage)

p(M | Will) = p(Will | M) * p(M|<s>)

=3/4*1/4=0.1875 ✓ (should be explored in the next stage)

p(V | Will) = p(Will | V) * p(V|<s>)

=0 * 0 = 0

ii) For the second word ‘can’:


From the 'N' branch (0.0833):
p(N | can) = 0.0833 * p(can|N) * p(N|N) = 0.0833*0*1/9 = 0
p(M | can) = 0.0833 * p(can|M) * p(M|N) = 0.0833*1/4*3/9 = 0.00694 ✓ (should be explored in the next stage)
p(V | can) = 0.0833 * p(can|V) * p(V|N) = 0.0833*0*1/9 = 0

From the 'M' branch (0.1875):
p(N | can) = 0.1875 * p(can|N) * p(N|M) = 0.1875*0*1/4 = 0
p(M | can) = 0.1875 * p(can|M) * p(M|M) = 0.1875*1/4*0 = 0
p(V | can) = 0.1875 * p(can|V) * p(V|M) = 0.1875*0*3/4 = 0

iii) For the third word “spot”:


From ‘M’ branch:
p(N | spot) =0.00694*p(spot | N) * p(N | M)

=0.00694*2/9*1/4 = 3.86×10^-4 ✓ (should be explored in the next stage)

p(M | spot) =0.00694*p(spot | M) * p(M| M)


=0.00694*0*0=0
p(V | spot) =0.00694* p(spot | V) * p(V| M)

=0.00694*1/4*3/4=0.00130 ✓ (should be explored in the next stage)

iv) For the fourth word “Mary”:


From the 'N' branch (0.000386):
p(N | Mary) = 0.000386 * p(Mary|N) * p(N|N) = 0.000386*4/9*1/9 = 1.906×10^-5
p(M | Mary) = 0.000386 * p(Mary|M) * p(M|N) = 0.000386*0*3/9 = 0
p(V | Mary) = 0.000386 * p(Mary|V) * p(V|N) = 0.000386*0*1/9 = 0

From the 'V' branch (0.00130):
p(N | Mary) = 0.00130 * p(Mary|N) * p(N|V) = 0.00130*4/9*4/4 = 5.78×10^-4
p(M | Mary) = 0.00130 * p(Mary|M) * p(M|V) = 0.00130*0*0 = 0
p(V | Mary) = 0.00130 * p(Mary|V) * p(V|V) = 0.00130*0*0 = 0

Therefore, the word sequence “Will can spot Mary” is tagged as “N M V N”.
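For reference, here is a compact, hedged sketch of the Viterbi pruning described above, with the transition and emission values of this example hard-coded. The end-of-sentence transition is omitted for brevity, so the final probability differs from the hand calculation by the factor P(<End>|N) = 4/9, but the chosen tag sequence is the same.

# Minimal Viterbi sketch using the hand-computed tables for "Will can spot Mary".
tags = ["N", "M", "V"]
start = {"N": 3/4, "M": 1/4, "V": 0}                       # P(tag | <s>)
trans = {"N": {"N": 1/9, "M": 3/9, "V": 1/9},              # P(next tag | tag)
         "M": {"N": 1/4, "M": 0,   "V": 3/4},
         "V": {"N": 4/4, "M": 0,   "V": 0}}
emit = {"N": {"will": 1/9, "mary": 4/9, "spot": 2/9},      # P(word | tag)
        "M": {"will": 3/4, "can": 1/4},
        "V": {"spot": 1/4, "see": 2/4, "pat": 1/4}}

def viterbi(words):
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (start[t] * emit[t].get(words[0], 0), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # keep only the highest-probability path into tag t
            p, path = max(((best[s][0] * trans[s][t] * emit[t].get(w, 0),
                            best[s][1] + [t]) for s in tags), key=lambda x: x[0])
            new[t] = (p, path)
        best = new
    return max(best.values(), key=lambda x: x[0])

print(viterbi(["will", "can", "spot", "mary"]))   # -> (5.78e-4, ['N', 'M', 'V', 'N'])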

*******************

Sentence segmentation

• Symbols like ! and ? are relatively unambiguous, but a few, like the dot (.), are ambiguous: a dot could be an end-of-sentence marker, part of an abbreviation, or a decimal point.
• We could build a binary classifier to identify this.
    o A few other options we can use are handwritten rules, regular expressions & machine learning algorithms.
• A decision tree is the simplest form of such a classifier. We could draw a decision tree something like this:

• A Decision tree is just an if-then-else statement. We can think of the questions in


a decision tree as features that could be exploited by any kind of classifier.
• The interesting research in a decision tree is choosing the features.
• Setting up the structure is often too hard to do by hand – hand-building is only possible for very simple features and domains. For numeric features, it's too hard to pick each threshold. So, the structure is usually learned by machine learning from a training corpus.
• These features could be exploited by any kind of classifier like Logistic
Regression, SVMs etc.

More sophisticated decision tree features

• Case of word with ".": Upper, Lower, Cap, Number
• Case of word after ".": Upper, Lower, Cap, Number
• Numeric features:
    • Length of word with "."
    • Probability(word with "." occurs at end-of-sentence)
    • Probability(word after "." occurs at beginning-of-sentence)

Word tokenization can be performed by instantiating the TreebankWordTokenizer class and calling its tokenize() function. This NLTK tokenizer has been pre-trained to split sentences into words based on spaces and punctuation. In the case of TreebankWordTokenizer, a full stop that follows a word in the middle of the text is kept attached to that word and tokenized accordingly.

from nltk.tokenize import TreebankWordTokenizer

obj = TreebankWordTokenizer()

obj.tokenize(" I am a Professor. I work in VIT")

WordPunctTokenizer is another type of word tokenizer present in NLTK. This works by


splitting punctuation separately from the words. It is based on a simple regexp
tokenization. It uses the regular expression \w+|[^\w\s]+ to split the input.

from nltk.tokenize import WordPunctTokenizer

obj = WordPunctTokenizer()
text = " I am a Professor. I work in VIT"
obj.tokenize(text)
word_tokenize on the other hand is based on a TreebankWordTokenizer. It basically
tokenizes text like in the Penn Treebank.

import nltk
words = nltk.word_tokenize(text)

#TweetTokenizer class is used to segment the words in tweets.
from nltk.tokenize import TweetTokenizer

obj = TweetTokenizer()
s = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
obj.tokenize(s)

Rule based sentence segmentation

text = "I love coding and programming. I also love sleeping!"



current_position = 0
cursor = 0
sentences = []

for c in text:
    if c == "." or c == "!":
        sentences.append(text[current_position:cursor+1])
        current_position = cursor + 2
    cursor += 1

print(sentences)

Output:

['I love coding and programming.', 'I also love sleeping!']

“””Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences


by using an unsupervised algorithm to build a model for abbreviation
words, collocations, and words that start sentences. It must be
trained on a large collection of plaintext in the target language
before it can be used.”””

import nltk

nltk.download('punkt')

text = "This is the 1.0.2 version;Backgammon is one of the oldest known board games. Its
history can be traced back nearly 5,000 years to archeological discoveries in the Middle
East. It is a two player game where each player has fifteen checkers which move between

twenty-four points according to the roll of two dice."

listofsent = nltk.sent_tokenize(text)

print(listofsent)
Output:
['This is the 1.0.2 version;Backgammon is one of the oldest known board games.', 'Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.', 'It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.']
Naïve Bayes method for sentence segmentation:

If a symbol is one of (., ?, !, ;), the function below records whether the first character of the next word is uppercase, the previous word (lower-cased), the punctuation symbol itself, and whether the previous word is a single-character word, and returns these feature values. The dataset is prepared using these values and a Naïve Bayes model is trained on it. When a new document is provided as input, the model predicts whether a period (or other symbol) represents the end of a sentence or not.

def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

sentences = ["This is the 1.0.2 version;",
             "Backgammon is one of the oldest known board games.",
             "Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.",
             "It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."]

sents = sentences  # nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
print(sentences)

for sent in sents:
    print(sent)
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset-1)

print("tokens", tokens)
print("boundaries", boundaries)

featuresets = [(punct_features(tokens, i), (i in boundaries)) for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!;']
print(featuresets)

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

def segment_sentences(inputsent):
    start = 0
    sents = []
    for i, character in enumerate(inputsent):
        if character in '.?!;' and classifier.classify(punct_features(inputsent, i)) == True:
            sents.append(inputsent[start:i+1])
            start = i+1
    if start < len(inputsent):
        sents.append(inputsent[start:])
    return sents

sample = "Wow!this is a sample 4.566.\"I know, it will perform better\".But, what is the accuracy?I do not know;Even no one knows"

#print(segment_sentences(sentences))
print(segment_sentences(sample))

Sentiment Analysis – Twitter Data


import re
import string
import nltk # Python library for NLP
import numpy as np
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

nltk.download('stopwords')
nltk.download('twitter_samples')

#Preprocessing tweets
def process_tweet(tweet):
    #Remove old style retweet text "RT"
    tweet2 = re.sub(r'^RT[\s]', '', tweet)

    #Remove hyperlinks
    tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)

    #Remove hashtags (only removing the hash # sign from the word)
    tweet2 = re.sub(r'#', '', tweet2)

    # instantiate tokenizer class
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

    # tokenize tweets
    tweet_tokens = tokenizer.tokenize(tweet2)

    #Import the english stop words list from NLTK
    stopwords_english = stopwords.words('english')

    #Creating a list of words without stopwords
    tweets_clean = []
    for word in tweet_tokens:
        if word not in stopwords_english and word not in string.punctuation:
            tweets_clean.append(word)

    #Instantiate stemming class
    stemmer = PorterStemmer()

    #Creating a list of stems of words in tweet
    tweets_stem = []
    for word in tweets_clean:
        stem_word = stemmer.stem(word)
        tweets_stem.append(stem_word)

    return tweets_stem

#Frequency generating function
def build_freqs(tweets, ys):
    yslist = np.squeeze(ys).tolist()
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs
def sigmoid(z):
    '''
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    # calculate the sigmoid of z
    h = 1/(1 + np.exp(-z))
    return h
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''
    m = len(x)

    for i in range(0, num_iters):
        # get z, the dot product of x and theta
        z = np.dot(x, theta)

        # get the sigmoid of z
        h = sigmoid(z)

        # calculate the cost function
        J = (-1/m)*(np.dot(y.T, np.log(h)) + np.dot((1-y).T, np.log(1-h)))

        # update the weights theta
        theta = theta - (alpha/m)*np.dot(x.T, h-y)

    J = float(J)
    return J, theta
def extract_features(tweet, freqs):
    '''
    Input:
        tweet: a list of words for one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)

    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3))

    # bias term is set to 1
    x[0, 0] = 1

    # loop through each word in the list of words
    for word in word_l:
        # increment the word count for the positive label 1
        x[0, 1] += freqs.get((word, 1), 0)

        # increment the word count for the negative label 0
        x[0, 2] += freqs.get((word, 0), 0)

    assert(x.shape == (1, 3))
    return x
def predict_tweet(tweet, freqs, theta):
    '''
    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output:
        y_pred: the probability of a tweet being positive or negative
    '''
    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)

    # make the prediction using x and theta
    z = np.dot(x, theta)
    y_pred = sigmoid(z)
    return y_pred
def test_logistic_regression(test_x, test_y, freqs, theta):
    """
    Input:
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    # the list for storing predictions
    y_hat = []

    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)

        if y_pred > 0.5:
            # append 1 to the list
            y_hat.append(1)
        else:
            # append 0 to the list
            y_hat.append(0)

    # With the above implementation, y_hat is a list, but test_y is an (m,1) array;
    # convert both to one-dimensional arrays in order to compare them using the '==' operator
    y_hat = np.array(y_hat)
    print("test_x", test_x)
    print("y_hat", y_hat)
    test_y = test_y.reshape(-1)
    accuracy = np.sum((test_y == y_hat).astype(int))/len(test_x)

    return accuracy
#main module
# split the data into two pieces, one for training and one for testing (validation set)
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
tweets = all_positive_tweets + all_negative_tweets
labels = np.append(np.ones((len(all_positive_tweets))), np.zeros((len(all_negative_tweets))))
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]
train_x = train_pos + train_neg
test_x = test_pos + test_neg
# combine positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)
freqs = build_freqs(tweets, labels)

# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)
# training labels corresponding to X
Y = train_y
# Apply gradient descent
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")
# Check your function
# test on the held-out test data
test_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {test_accuracy:.4f}")
tmp1 = extract_features(train_x[0], freqs)
print(tmp1)

# Expected output (features of the first training tweet):
# [[1.00e+00 3.02e+03 6.10e+01]]


Text Segmentation

Word Segmentation

I) Rule based Algorithms


If-else, condition-based algorithms which check for characters such as space, dot, or question mark as delimiters.

II) Using Dictionary Based Algorithms


Khmer (Cambodian) text does not have spaces to separate words. Khmer characters are written
from left to right consecutively without space between words. There are occasional spaces to
separate phrases to indicate pauses. Native speakers can separate words effortlessly but teaching
a computer to separate them is a different challenge.

1) Simpler Approach
Scan the characters one at a time from left to right and look them up in a dictionary. If the series of characters is found in the dictionary, then we have a matched word and segment that sequence as a word. But this tends to match the shortest possible word.
2) Maximal matching

This is a greedy technique used to avoid matching the shortest word by finding the longest
sequence of characters in the dictionary instead. This approach is called the longest
matching algorithm or maximal matching.

For example in English, consider the sequence of characters “themendinehere”


For the first word, we would find: the, them, theme and no longer word would match after that.
[Note: For each word, check till the end, so as to match the longest possible word].
Now we just choose the longest, which is "theme", then start again from 'n'. But now we don't have any word in this series "ndine…". When we can't match a word, we just mark the first character as unknown. So in "ndineh…", we take 'n' out as an unknown word and start matching the next word from 'd'. Assuming words such as "din" or "re" are not in our dictionary, we would get the series of words "theme n dine here". We only get one unknown word here. But as you can see, the longest word can produce an incorrect segmentation: it overextends the first word "theme" into part of the second word "men", making the next character an unknown word "n". (A small code sketch of this idea follows below.)



3) Bi-Directional Maximal Matching
One way to solve this issue is to also match backward. This approach is called bi-directional
maximal matching. It goes from left to right (forward matching) first; then from the end of the
sentence go from right to left (backward matching). Then choose the best result. As we have
seen earlier, the forward gave us incorrect segmentation. But the backward would give us the
correct result. Narin Bi et al. report an accuracy of 98% for the Bi-Directional Maximal Matching algorithm. [1]
4) Maximum Matching

Another approach to solving the greedy nature of longest matching is an algorithm called 'maximum matching'. This approach generates multiple possible segmentations and chooses the one with the fewest words in the sentence. It also prioritizes fewer unknown words. Using this approach, we would get the correct segmentation from the example text, as shown below.

Possible segmentations using Maximum Matching [2]

The first word can be "the", "them", or "theme". From each of these nodes, there are multiple choices. For "the", the next word can be "me", "men", or "mend". The word "mend" would result in an incorrect word after "in". Only "men" would give us "dine", then "here" as the correct segmentation. This approach goes through all the different combinations based on our dictionary. So unknown words are still a problem that we have not addressed yet. In addition, it can still result in errors with known words when the correct segmentation prefers more words instead of fewer. Previous research showed a performance of 90% precision and 94% recall for this approach on Khmer documents.

Disadvantages of dictionary-based approaches:


- They require a good dictionary list of words.
- Unknown words such as names tend to be the typical issues in these approaches.
- When there are multiple possible segmentations with the same number of words, these
algorithms will not be able to decide one over another.
- They do not take into account how frequently a word is used in the language.

Probabilistic Approaches

1) Unigram

Instead of just a list of words, we can take a training dataset, such as the Google Web Trillion Word Corpus, and record the count of every word. Then, based on the probability (word count / total number of words), we can estimate the likelihood of each word. To score each candidate segmentation, we take the product of the probabilities of its words.

Let’s see how we can segment the sequence “meneat”. It can be segmented as: “me neat”, “men

eat”, or “mene at”. But the frequency of these various words in the unigram list of Google trillion
corpus is
me 566,617,666
neat 4,643,885
men 174,058,407
eat 29,237,400
mene 77555
at 2272272772
t=1024908267229.0 #total number of words in the corpus
p(me_neat) = p(me)*p(neat)=(566617666/t * 4643885/t) = 2.5e-09
p(men_eat) = p(men)*p(eat)= (174058407/t * 29237400/t)= 4.8e-09
p(mene_at) = p(mene)*p(at)= (77555/t * 2272272772/t) = 1.7e-10

With these three possibilities, the option “men eat” has the highest score. So, the algorithm
would select the highest score option.
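A small, hedged sketch of this unigram scoring, using only the counts quoted above (any word outside this tiny table simply gets probability 0):

# Minimal sketch: score candidate segmentations of "meneat" with unigram probabilities.
counts = {"me": 566_617_666, "neat": 4_643_885, "men": 174_058_407,
          "eat": 29_237_400, "mene": 77_555, "at": 2_272_272_772}
t = 1_024_908_267_229.0          # total number of words in the corpus

def score(words):
    p = 1.0
    for w in words:
        p *= counts.get(w, 0) / t
    return p

candidates = [["me", "neat"], ["men", "eat"], ["mene", "at"]]
for c in candidates:
    print(" ".join(c), "->", f"{score(c):.2e}")
print("chosen:", " ".join(max(candidates, key=score)))    # -> "men eat"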



2. Bigram
The unigram model treats each word independently. It does not take into account any context between words. In the bigram model, we match pairs of terms that the algorithm has seen before. This involves collecting, from the training documents, every pair of adjacent words. So instead of a single dictionary word, we generate pairs of words and their frequencies. Generating bigrams from the example "The men dine here very often", you would get 7 bigrams:
<B> The
The men
men dine
dine here
here very
very often
often <E>
<B> is a token that marks the beginning of a sentence and <E> marks the end of a sentence.
To generate the bigram counts, we add a count for each pair and increment it as we add more training data.
In this case, we update the equation to a product of conditional probabilities:
P(w1, w2, ..., wn) ≈ P(w1 | <B>) * P(w2 | w1) * ... * P(wn | wn-1)

As an example: “thenonprofit” can be segmented as “the non profit” or “then on profit”. From
the Google trillion word bigram list we get:

<s> the      258483382        <s> then     11926082
the non      739031           then on      1263045
non profit   218392           on profit    105801

P(the non profit) = P(the|<s>) * P(non|the) * P(profit|non)
                  = P(<s> the) * P(the non) * P(non profit)
                  = 258483382/t * 739031/t * 218392/t
                  = 3.88E-17

P(then on profit) = P(then|<s>) * P(on|then) * P(profit|on)
                  = P(<s> then) * P(then on) * P(on profit)
                  = 11926082/t * 1263045/t * 105801/t
                  = 1.48E-18



This results in choosing "the non profit", since it has the higher probability and is the more natural segmentation. This approach can also fall back on unigram counts (single words, much like the dictionary but with frequencies), which handles cases where a pair was not seen in the training text.

3. N-Gram
The N-gram approach extends to trigrams or larger N. A larger N generates a very large list of lookup terms, but it gives more context as N grows. Generally, bigrams and trigrams work well in many cases.


4. Naïve Bayes:
It calculates the frequency of each letter and tries to infer the pattern from them. This probabilistic algorithm requires each character to be assigned one of several categorical labels. We can label each of the characters in a word as:
S: Single letter word
B: Beginning letter

M: Middle letter

E: Ending letter
As an example of a string “thisisatest” which should correspond to “this is a test”, it is tagged
as:
Feature: t h i s i s a t e s t
Tag: B M M E B E S B M M E
We are looking for the probability of each letter given its label. For example, calculate the
probability of letter ‘t’ given that it has a label ‘B’ — a beginning letter. We can use Naive

Bayes to calculate its probability.

We have x as a feature set consisting of the letters 'a' to 'z'. We have only one feature, so m = 1. Then y is the label in the set {S, B, M, E}. P(y|x) is the conditional probability of y given x. This algorithm is a generative approach since it models the joint probability of the input x and the label y. The model tries to predict the tag for a given character. For example, 't' would most likely be tagged as B (beginning letter) instead of E (ending letter), given our example data with the words 'this' and 'test'. Naive Bayes treats each letter/tag pair independently. It assumes that the input values (characters) are conditionally independent. That is why it is called a naive model.
Example

So let's go back to the example "thisisatest" above. We need the label probabilities; the training string has a total of 11 labels. We have one single-character word 'S', 3 beginning letters 'B', 4 middle letters 'M', and 3 ending letters 'E'. So the probability of each label is as follows:
P(S) = 1/11
P(B) = 3/11
P(M) = 4/11
P(E) = 3/11
The count of the letter 't' with tag S is zero, with tag B is two ('this', 'test'), with tag M is zero, and with tag E is one ('test'). So:
P(t|S) = 0
P(t|B) = 2/3 -- 'this', 'test'
P(t|M) = 0
P(t|E) = 1/3 -- 'test'
To calculate the tag for the letter 't' from our training sentence, we have
P(y,x) = P(x|y)P(y)
p(S,t) = P(t|S)P(S) = 0 * 1/11 = 0
P(B,t) = P(t|B)P(B) = 2/3 * 3/11 = 2/11 ≈ 0.18
p(M,t) = P(t|M)P(M) = 0 * 4/11 = 0
p(E,t) = P(t|E)P(E) = 1/3 * 3/11 = 1/11 ≈ 0.09

For the given input letter 't', the algorithm will predict the tag B (beginning letter), since it has the highest probability. This approach only looks at one character at a time and assumes the sequence is independent. In reality, the input/tag pairs are not independent. Still, one of the main use cases for Naive Bayes is text classification, such as predicting spam vs. non-spam email. (A short sketch of this calculation follows below.)
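A tiny, hedged sketch of this calculation (specific to the single training string used above):

# Minimal sketch: joint scores P(y, x) = P(x|y) * P(y) for the letter 't' in "thisisatest".
from collections import Counter

letters = "thisisatest"
tags    = "BMMEBESBMME"

tag_counts = Counter(tags)                      # S:1, B:3, M:4, E:3
pair_counts = Counter(zip(letters, tags))       # e.g. ('t','B'): 2

x = "t"
for y in "SBME":
    p_y = tag_counts[y] / len(tags)
    p_x_given_y = pair_counts[(x, y)] / tag_counts[y]
    print(f"P({y},{x}) = {p_x_given_y:.2f} * {p_y:.2f} = {p_x_given_y * p_y:.2f}")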



5. Hidden Markov Model and Viterbi Algorithm
In sequence tagging problems, given an input sequence, the model has to predict the output
sequence. Word segmentation is one such sequence tagging problem because, a sequence of
characters is given as input and the model has to produce a sequence of tags (S/B/M/E)
corresponding to those characters. It’s trickier than classification, as in classification, you only
have to make independent labelling decisions.
Unlike Naïve Bayes, the Hidden Markov Model provides a joint distribution over the letters/tags
with an assumption of the dependencies of variables x and y between adjacent tags.
In probability theory, a Markov model is a stochastic model used to model randomly changing
systems.
Markov property:
Future states depend only on the current state, not on the events that occurred before it.
A Markov model assumes the Markov property. Hidden Markov Model (HMM) is
a statistical Markov model in which the system being modelled is assumed to be a Markov
process with unobserved (i.e. hidden) states. (ie., we cannot/ do not observe the actual state
which we are interested in. They are hidden)
Eg-1., Let’s say the mood of your friend can be either happy or sad. Assuming that we are not
told directly if he is happy or sad. Instead, we are told about what activities he is doing such as
sitting in a corner, playing, cooking, using his laptop etc. So, mood (happy, sad) are the hidden
states whereas the activities are the observed states.
Eg-2., Let's say you are inside a room and not allowed to go out, so you have no information about the day's weather (hot/cold); hence 'hot' and 'cold' can be taken as the hidden states. Assume you observe the person who brings you food every day: he carries an umbrella on one day, wears a raincoat on some days, or drinks ice water on other days. These are the observed states. Seeing him wear a raincoat, you can predict the day as 'cold'; if he drinks ice water, the weather outside can be predicted as 'hot'; carrying an umbrella can happen on either a hot or a cold (rainy) day. Given these observed states, the hidden Markov model has to predict the hidden states (hot/cold).
Naïve Bayes assumes the states to be independent, which results in the formula
P(X, Y) = Π_k P(Xk | Yk) · P(Yk).
But HMM incorporates the dependency of the states (i.e., the next hidden state Yk is dependent on the current state Yk-1). Instead of using just the probability of the next state Yk, HMM uses the conditional probability of the next state Yk given the current state Yk-1. This results in using the term P(Yk | Yk-1) instead of P(Yk), as given in the formula
P(X, Y) = Π_k P(Xk | Yk) · P(Yk | Yk-1).

To illustrate in a graph format, we can think of Naive Bayes joint probability between label and
input but independence between each pair. It can be shown as:



For HMM, the graph shows the dependencies between states.

The general schematic for these two models is as follows.

Therefore, for the word segmentation problem, X denotes the given character and Y denotes the
tag (S,B,M,E).
In the formula,



i) P(Xk|Yk) = P(char|tag), which refers to the probability of seeing that character given the tag.
Eg., P('t'|B) is the probability of seeing the letter 't' given that the tag is 'B'.
Emission Probabilities: a matrix that represents the probability of seeing a specific observable
given a hidden state.
The probability for each character to have the various possible tags is calculated based on the
training data and stored in the emission matrix and this is the same as what is used in Naive
Bayes.

ii) P(Yk|Yk-1) =P(tagk|tagk-1) which refers to the probability for the next tag to be tagk given
the current tag, tagk-1.
Eg., P(E|B) is the probability that the next tag is ‘E’ given that the current tag is ‘B’.

Transition Probabilities: a matrix that represents the probability of transitioning to another state

given the current state.


Let’s construct the emission and the transition probability matrices for the following training
document.

“thisisatest” which is labelled as “BMMEBESBMME”

Emission Probability:

The distinct letters in the input are a, e, h, i, s, t. For each of these letters, calculate the probability for the various tags.

i) P('a'|S) = count of 'a' with tag S / total count of S tags = 1/1 = 1
P('a'|B) = 0/3 = 0
P('a'|M) = 0/4 = 0
P('a'|E) = 0/3 = 0

ii) For the letter 'e':
P('e'|S) = 0/1 = 0
P('e'|B) = 0/3 = 0
P('e'|M) = 1/4 = 0.25
P('e'|E) = 0/3 = 0

iii) For the letter 'h':
P('h'|S) = 0/1 = 0
P('h'|B) = 0/3 = 0
P('h'|M) = 1/4 = 0.25
P('h'|E) = 0/3 = 0

iv) For the letter 'i':
P('i'|S) = 0/1 = 0
P('i'|B) = 1/3 = 0.33
P('i'|M) = 1/4 = 0.25
P('i'|E) = 0/3 = 0

v) For the letter 's':
P('s'|S) = 0/1 = 0
P('s'|B) = 0/3 = 0
P('s'|M) = 1/4 = 0.25
P('s'|E) = 2/3 = 0.66

vi) For the letter 't':
P('t'|S) = 0/1 = 0
P('t'|B) = 2/3 = 0.66
P('t'|M) = 0/4 = 0
P('t'|E) = 1/3 = 0.33

Transition Probability:

For each state, calculate the probability of transitioning to every other state.

i) Given the state S, find the probability of going to S, B, M, E:
P(S|S) = count of S tag after S tag / count of S tags = 0/1 = 0
P(B|S) = count of B tag after S tag / count of S tags = 1/1 = 1
P(M|S) = 0/1 = 0
P(E|S) = 0/1 = 0

ii) For the state B:
P(S|B) = 0/3 = 0
P(B|B) = 0/3 = 0
P(M|B) = 2/3 = 0.66
P(E|B) = 1/3 = 0.33

iii) For the state M:
P(S|M) = 0/4 = 0
P(B|M) = 0/4 = 0
P(M|M) = 2/4 = 0.5
P(E|M) = 2/4 = 0.5

iv) For the state E:
P(S|E) = 1/3 = 0.33
P(B|E) = 1/3 = 0.33
P(M|E) = 0/3 = 0
P(E|E) = 0/3 = 0
P(<END OF INPUT>|E) = 1/3 = 0.33

Emission Matrix:

Tag \ char    a      e      h      i      s      t
S             1      0      0      0      0      0
B             0      0      0      0.33   0      0.66
M             0      0.25   0.25   0.25   0.25   0
E             0      0      0      0      0.66   0.33



Transition Matrix:

Tag \ tag    S      B      M      E      <End>
S            0      1      0      0      0
B            0      0      0.66   0.33   0
M            0      0      0.5    0.5    0
E            0.33   0.33   0      0      0.33

At the start, the transition probability from <start> to the various states is unknown. Therefore, it can be initialized uniformly as follows.

P(S|<start>) P(B|<start>) P(M|<start>) P(E|<start>)

0.25 0.25 0.25 0.25
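These matrices can be checked with a short, hedged sketch (specific to this single training string) that counts tag transitions and character emissions:

# Minimal sketch: recompute the emission and transition counts from "thisisatest" / "BMMEBESBMME".
from collections import Counter

chars = "thisisatest"
tags  = "BMMEBESBMME"

tag_counts = Counter(tags)
emission = Counter(zip(tags, chars))          # (tag, char) counts
transition = Counter(zip(tags, tags[1:]))     # (tag, next_tag) counts

print("P('t'|B) =", emission[("B", "t")] / tag_counts["B"])   # 2/3
print("P('s'|E) =", emission[("E", "s")] / tag_counts["E"])   # 2/3
print("P(M|B)   =", transition[("B", "M")] / tag_counts["B"]) # 2/3
print("P(E|M)   =", transition[("M", "E")] / tag_counts["M"]) # 2/4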

Let us assume that we are given the input character sequence ‘test’ to be tagged with the labels
S,B,M,E, for which the joint probability P(t,e,s,t,S,B,M,E) has to be calculated.
The number of hidden states, N = 4. The number of observed states in the input, T = 4.
The total number of possible combinations to be explored = N^T = 4^4 = 256.
i.e., the probabilities of the letters t, e, s, t getting any of the 256 tag sequences such as SSSS, SSSB, SSSM, SSSE, SSBS, SSBB, SSBM, SSBE, ... have to be calculated, and the tag sequence that results in the maximum probability value should be taken as the output.

Viterbi algorithm follows dynamic programming technique to predict the output efficiently. In
this method, if there are multiple paths leading to the same node, then only the path with the
highest probability value can be explored. Other paths can be omitted.

From the <start>, when the first input letter ‘t’ is given

p(B|t)=p(t|B)*p(B|<start>)
=0.66*0.25=0.165

p(M|t)=p(t|M)*p(M|<start>)
=0*0.25=0

p(E|t)=p(t|E)*p(E|<start>)
=0.33*0.25=0.0825



p(S|t)=p(t|S)*p(S|<start>)
=0*0.25=0

‘e’

i) p(B|e)=p(e|B)*p(B|B)*0.165 = 0*0*0.165 = 0
ii) p(B|e)=p(e|B)*p(B|E)*0.0825 = 0*0.33*0.0825 = 0

i) p(M|e)=p(e|M)*p(M|B)*0.165 = 0.25*0.66*0.165 = 0.027
ii) p(M|e)=p(e|M)*p(M|E)*0.0825 = 0.25*0*0.0825 = 0

i) p(E|e)=p(e|E)*p(E|B)*0.165 = 0*0.33*0.165 = 0
ii) p(E|e)=p(e|E)*p(E|E)*0.0825 = 0*0*0.0825 = 0

i) p(S|e)=p(e|S)*p(S|B)*0.165 = 0*0*0.165 = 0
ii) p(S|e)=p(e|S)*p(S|E)*0.0825 = 0*0.33*0.0825 = 0

‘s’

p(B|s)=p(s|B)*p(B|M)*0.027 = 0*0*0.027=0

p(M|s)=p(s|M)*p(M|M)*0.027= 0.25*0.5*0.027=0.0034

p(E|s)=p(s|E)*p(E|M)*0.027=0.66*0.5*0.027=0.00891

p(S|s)=p(s|S)*p(S|M)*0.027=0*0*0.027=0

‘t’

iii) p(B|t)=p(t|B)*p(B|M)*0.0034 = 0.66*0*0.0034=0

iv) p(B|t)=p(t|B)*p(B|E)*0.00891=0.66*0.33*0.00891=0.0019

iii) p(M|t)=p(t|M)*p(M|M)* 0.0034 =0*0.5*0.0034 =0


iv) p(M|t)=p(t|M)*p(M|E)* 0.00891=0*0*0.00891=0

iii) p(E|t)=p(t|E)*p(E|M)* 0.0034 =0.33*0.5*0.0034 =0.000561


iv) p(E|t)=p(t|E)*p(E|E)* 0.00891=0.33*0*0.00891=0

iii) p(S|t)=p(t|S)*p(S|M)* 0.0034 =0*0*0.0034 =0



iv) p(S|t)=p(t|S)*p(S|E)*0.00891 = 0*0.33*0.00891 = 0

Therefore, the output tag sequence for ‘test’ is ‘BMEB’.

*******************



Syntax: It refers to the way words are arranged together in a sentence.
Syntactic constituency: Group of words behaving as single units or constituents
Noun phrase: a sequence of words surrounding at least one noun, e.g., "John the servant", "the lighthouse".
An entire noun phrase can occur before a verb eg., John the servant came out.
Similarly, a prepositional phrase can be placed at the beginning (preposed), at the end (postposed)
or even placed in the middle of the sentence. But the individual words cannot be split up.
Eg., “on January 3rd”
On January 3rd, I would like to go for a trip. (preposed)
I would like to go for a trip on January 3rd. (postposed)
I would like to go on January 3rd for a trip (placed in the middle)
I would like to go on January for a trip 3rd (this is wrong: a prepositional phrase cannot be split up)

Context-free grammar (CFG) or Phrase-Structure Grammar:


- It is used to model constituent structure.
- Consists of a set of rules or productions, each of which expresses the ways that words of the
language can be grouped and ordered together
- Also consists of a lexicon of words and symbols
- Consists of 2 types of symbols – terminals (symbols that correspond to words in the
language eg., the, horse) and non-terminals (symbols that express abstractions over these
terminals)
- Consists of a Start symbol which is also a member of non-terminals
Eg., NP -> DET NOUN | ADJ NOUN
which means a noun phrase derives (consists of) a determiner followed by a noun, or an adjective followed by a noun.

The sequence of rule expansions is called a derivation of the string of words. A parse tree is generally
used to represent a derivation.
Eg.,

Syntactic parsing: It is the problem of mapping from a string of words to its parse tree. It determines
if the structure of a sentence is according to the grammar of the language.
There are several approaches to construct a parse tree -top-down, bottom-up.
Ambiguous sentences lead to construction of more than one parse tree.
The CYK or CKY (Cocke–Kasami–Younger) algorithm is a DP (Dynamic Programming) technique used to efficiently generate all possible parse trees for a given word sequence and grammar.

Note: For the CYK algorithm to be applied, the grammar should be in CNF (Chomsky Normal form).
A grammar is said to be in CNF if each rule derives either exactly two non-terminals or a single terminal.
Eg., S-> NP VP, VP->V NP, PP->P V, DET->a, P->with ---> all these are in CNF
NP->the noun, DET->a the , NP -> N P V ----> all these are not in CNF

How to convert a CFG into CNF grammar?


Introduce a dummy non-terminal wherever necessary.
Eg.,i) NP-> the NOUN
Can be written as
NP->DET N
DET-> the
ii) NP-> DET NOUN PP can be written as
NP->DET NOM
NOM->NOUN PP

Parse tree Construction:


Example-1:
Consider the following CNF and the sentence “the flight includes a meal”.
S->NP VP
NP->DET N
VP->V NP
V->includes
DET->the
DET->a
N->meals
N->flight

              the              flight             includes              a                meals
x01: DET->the   x02: NP->DET,N   x03: ----            x04: ----        x05: S->NP,VP
                x12: N->flight   x13: ----            x14: ----        x15: ----
                                 x23: V->includes     x24: ----        x25: VP->V,NP
                                                      x34: DET->a      x35: NP->DET,N
                                                                       x45: N->meals

I) X02=x01+x12= DET,N =NP (since there is a production in the grammar where NP derives
DET,N)
X13=x12+x23=N,V= null (as there is no production that derives N,V)

X24=x23+x34=V,DET=null
X35=x34+x45=DET,N =NP
II) X03=(x01+x13) or (x02+x23)
=(DET,null) or (NP,V)
=(null) or (null)
X14=(x12+x24) or (x13+x34)
=(N,null) or (null, DET)
=(null) or(null)
X25=(x23+x35) or (x24+x45)
=(V,NP) or (null,N)
=(VP->V,NP) or (null)
III) X04=(x01+x14) or (x02+x24) or (x03+x34)
=(DET,null) or (NP,null) or (null,DET)
=null
X15=(x12+x25) or (x13+x35) or (x14+x45)
=(N,VP) or (null,NP) or (null,N)
=null
IV) X05=(x01+x15) or (x02+x25) or (x03+x35) or (x04+x45)
=(DET,null) or (NP,VP) or (null,NP) or (null,N)
=S->NP,VP

Note: If, in the end, the start symbol is obtained in the top cell (x05), the sentence is syntactically correct according to the given CNF grammar. If the start symbol is not reached, then the sentence is syntactically wrong, i.e., the arrangement of the words is not correct.

Now, to construct the parse tree, start with the cell 'S->NP VP'. Look for the cell in the same row
that has the rule for NP and the cell in the same column that has the rule for VP, and proceed in
the same way until all the leaf nodes contain terminal symbols.
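The chart above can be reproduced with a short dynamic-programming routine. Below is a minimal CYK recognizer sketch in plain Python for the Example-1 grammar (the dictionary encoding of the rules is an assumption made here for illustration); reaching the start symbol S in the cell spanning the whole sentence means the sentence is accepted by the grammar.

# Minimal CYK recognizer for the Example-1 CNF grammar (illustrative sketch).
from itertools import product

lexical = {                       # terminal rules: word -> non-terminals
    "the": {"DET"}, "a": {"DET"},
    "flight": {"N"}, "meal": {"N"},
    "includes": {"V"},
}
binary = {                        # binary rules: (B, C) -> heads A for A -> B C
    ("DET", "N"): {"NP"},
    ("V", "NP"): {"VP"},
    ("NP", "VP"): {"S"},
}

def cyk(words):
    n = len(words)
    # table[i][j] holds the non-terminals that can span words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(lexical.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                      # split point
                for B, C in product(table[i][k], table[k][j]):
                    table[i][j] |= binary.get((B, C), set())
    return table

words = "the flight includes a meal".split()
print(cyk(words)[0][len(words)])   # {'S'} -> the sentence is grammatical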

Example-2:
Let’s consider an example grammar and sentence that lead to multiple parse trees.
Note: If there are 2 productions for the same non-terminal, let us label them as rule 1 and rule 2.
Eg., in the CNF grammar given below, there are two rules for VP. So, label them as VP1 and VP2

S → NP VP
VP1 → V NP

VP2 → VP PP
PP → P NP
V → eat
NP → NP PP
NP → we
NP → fish
NP → fork
P → with
“We eat fish with fork”

we eat fish with fork

Row 0: X01 = NP->we,  X02 = null,  X03 = S->NP VP1,  X04 = null,  X05 = S->NP VP1 and S->NP VP2
Row 1: X12 = V->eat,  X13 = VP1->V NP,  X14 = null,  X15 = VP1->V NP and VP2->VP PP
Row 2: X23 = NP->fish,  X24 = null,  X25 = NP->NP PP
Row 3: X34 = P->with,  X35 = PP->P NP
Row 4: X45 = NP->fork

I) X02=x01+x12= NP,V =null


X13=x12+x23=V,NP=VP1
X24=x23+x34=NP,P=null
X35=x34+x45=P,NP=PP
II) X03=(x01+x13) or (x02+x23)
=(NP,VP1) or (null,NP)
=(S->NP,VP1) or (null)
X14=(x12+x24) or (x13+x34)
=(V,null) or (VP1,P)
=(null) or (null)
X25=(x23+x35) or (x24+x45)
=(NP,PP) or (null,NP)
=(NP->NP,PP) or (null)
III) X04=(x01+x14) or (x02+x24) or (x03+x34)
=(NP,null) or (null,null) or (S,P)
=null
X15=(x12+x25) or (x13+x35) or (x14+x45)
=(V,NP) or (VP1,PP) or (null,NP)

=(VP1->V,NP) or (VP2->VP,PP)
IV) X05=(x01+x15) or (x02+x25) or (x03+x35) or (x04+x45)
=(NP,VP1 and NP,VP2) or (null,NP) or (S,PP) or (null,NP)
=S->NP,VP1 and S->NP, VP2
Parse tree generation:
i) First construct the tree with the first production of 'S', S->NP VP1.
(Note: follow the rules from the corresponding cells of the table above; the annotated parse-tree figure is omitted here.)
This results in the interpretation "we eat [fish with fork]", i.e., we eat that fish which has a fork.
ii) Now construct the tree with the second production of 'S', S->NP VP2.
(Note: follow the rules from the corresponding cells of the table above; the annotated parse-tree figure is omitted here.)
This results in the interpretation "[we eat fish] [with fork]", i.e., we eat fish using a fork.

Example-3
Consider the sentence "a pilot likes flying planes" and the following CNF grammar.
S -> NP VP        DET -> a
VP1 -> VBG NNS    NN -> pilot
VP2 -> VBZ VP     VBG -> flying
VP3 -> VBZ NP     VBZ -> likes
NP1 -> DET NN     JJ -> flying
NP2 -> JJ NNS     NNS -> planes
(Here, NN = singular noun, NNS = plural noun, VBG = continuous-tense verb, VBZ = third-person present-tense verb, JJ = adjective)

a pilot likes flying planes

Row 0: X01 = DET->a,  X02 = NP1->DET NN,  X03 = null,  X04 = null,  X05 = S->NP1 VP2 and S->NP1 VP3
Row 1: X12 = NN->pilot,  X13 = null,  X14 = null,  X15 = null
Row 2: X23 = VBZ->likes,  X24 = null,  X25 = VP2->VBZ VP1 and VP3->VBZ NP
Row 3: X34 = VBG->flying and JJ->flying,  X35 = VP1->VBG NNS and NP2->JJ NNS
Row 4: X45 = NNS->planes

I) X02=x01+x12= DET,NN =NP1->DET,NN


X13=x12+x23=NN, VBZ=null
X24=x23+x34=(VBZ,VBG) and (VBZ,JJ)=null
X35=x34+x45=(VBG,NNS) and (JJ, NNS)= VP1->VBG,NNS and NP2->JJ,NNS

II) X03=(x01+x13) or (x02+x23)


=(DET,null) or (NP1,VBZ)
=null
X14=(x12+x24) or (x13+x34)
=(NN,null) or (null,VBG) and (null,JJ)
=(null) or(null)
X25=(x23+x35) or (x24+x45)
=(VBZ,VP1) and (VBZ,NP2) or (null,NNS)
=(VP2->VBZ,VP1) and (VP3->VBZ,NP) or (null)
III) X04=(x01+x14) or (x02+x24) or (x03+x34)
=(DET,null) or (NP1,null) or (null)
=null
X15=(x12+x25) or (x13+x35) or (x14+x45)
=(NN,VP2) and (NN,VP3) or (null) or (null)
=null
IV) X05=(x01+x15) or (x02+x25) or (x03+x35) or (x04+x45)
=(null) or (NP1,VP2) and (NP1,VP3) or (null) or (null)
=S->NP1,VP2 and S->NP1, VP3
Parse tree generation:
i) First construct the tree with the first production of 'S', S->NP1 VP2.
(Note: follow the rules from the corresponding cells of the table above; the annotated parse-tree figure is omitted here.)
This results in the interpretation where "flying" is a verb (VBG), i.e., a pilot likes to fly planes.
ii) Now construct the tree with the second production of 'S', S->NP1 VP3.
(Note: follow the rules from the corresponding cells of the table above; the annotated parse-tree figure is omitted here.)
This results in the interpretation where "flying" is an adjective (JJ) modifying "planes", i.e., a pilot likes those planes which are flying.

*****************************

Language Models
Models that assign probabilities to sequences of words are called language models.
An n-gram is a contiguous sequence of n items from a given sample of text or speech. The
items can be phonemes, syllables, letters, words or base pairs according to the application,
and they are typically collected from a text or speech corpus. N-gram models are useful for
spelling and grammar correction and for translation.
A 2-gram is a two-word sequence of words like “turn off” and a 3-gram is a three-word
sequence of words like “take the test”.
n-gram models can be used to estimate the probability of the last word of an n-gram, given
the previous words and also to assign probabilities to entire sequences.
The term ‘n-gram’ can be used to mean either the word sequence itself or the predictive
model that assigns it a probability.
The joint probability of A, B can be written as
P(A,B)=p(A).p(B|A)
This joint probability formula can be extended for multiple variables as follows.

P(x1,x2,x3….xn) =p(x1).p(x2|x1).p(x3|x1x2).p(x4|x1.x2.x3)……..p(xn| x1.x2.x3….. xn-1)

Therefore, the joint probability of words in sentences is as follows.

P(w1,w2,w3….wn) =p(w1).p(w2|w1).p(w3|w1w2).p(w4|w1.w2.w3)……..p(wn| w1.w2.w3….. wn-1)

Eg., the probability of occurrence of the word sequence “school students do their homework”
is
P(school students do their homework)=p(school).p(students | school).p(do | school
students).p(their | school students do).p(homework | school students do their) ……(1)
But the drawback here is that the longer the sequence, the less likely we are to find it in a
training corpus.
This can be resolved using n-gram models.
The intuition of the n-gram model is that instead of computing the probability of a word
given its entire history, we can approximate the history by just the last few words.
The bigram model, for example, approximates the probability of a word given
all the previous words P(wn | w1:n-1) by using only the conditional probability of the
preceding word P(wn | wn-1).
In other words, instead of computing the probability, p(homework | school students do their),
we approximate it with the probability, p(homework | their).
This way of assuming that the probability of a word depends only on the previous word is
called a Markov assumption.
We can generalize the bigram (which looks one word into the past) to the trigram (which
looks two words into the past) and thus to the n-gram (which looks n-1 words into the past).

Thus, the general equation for this N-gram approximation to the conditional probability of the
next word in a sequence is
P(wn|w1:n-1) ≈ P(wn | wn-N+1:n-1)
In a unigram (1-gram) model, no history is used. In a bigram, one word history is used and in
a n-gram, n-1 words history is used.
Based on this, Equation (1) can be written as follows.
P(school students do their homework)=p(school)p(students | school).p(do | students).p(their |
do).p(homework | their) ……………….(2)

Maximum Likelihood Estimation (MLE) can be used to find the probabilities of n-grams; it uses
the counts of occurrences of the n-grams.
Eg., for the bigram model, the probability of a bigram is calculated by taking the count of the
number of times a given bigram (wn-1 wn) occurs in a corpus and normalizing it by the total
number of bigrams that share the same first word (wn-1) in the corpus:

P(wn | wn-1) = count(wn-1 wn) / Σw count(wn-1 w)

But the sum of the counts of all bigrams that begin with wn-1 is equal to the unigram count of
wn-1. So, the formula can be simplified to

P(wn | wn-1) = count(wn-1 wn) / count(wn-1)

ie., p(homework | their) = count("their homework") / count("their")

Likewise, the probability should be calculated for all the other components p(students |
school), p(do | students) and p(their | do) given in equation (2).
Next word estimation using bi-gram model:
Conditional probability is given by

P(B|A) = p(A,B) / p(A)   .............. (3)

Given the word sequence “school students do their homework”, if we wish to find the
probability for the next word to be “regularly”, based on eqn (3), the formula can be written
as
P(regularly | school students do their homework)
= p(school students do their homework regularly) / p(school students do their homework)   ----- from eqn (3)
= [p(school).p(students | school).p(do | students).p(their | do).p(homework | their).p(regularly | homework)] /
  [p(school).p(students | school).p(do | students).p(their | do).p(homework | their)]   ----- from eqn (2)
(after cancelling out the common terms in the numerator and denominator)

P(regularly | school students do their homework) = p(regularly | homework)
= count("homework regularly") / count("homework")

Thus, to find the probability of the next word in the sequence “school students do their
homework” to be “regularly”, it is enough if we find the probability of “regularly” given just
the previous word, “homework”.
Trigram:
The same idea used for the bigram model can be extended to trigram, four-gram and, in general,
n-gram models.

p(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)

P(school students do their homework) = p(school).p(students | <s> school).p(do | school
students).p(their | students do).p(homework | do their)
Given the word sequence “school students do their homework”, if we wish to find the
probability for the next word to be “regularly”, using trigram model, the formula can be
written as
P(regularly | school students do their homework)
= p(school students do their homework regularly) / p(school students do their homework)   ----- from eqn (3)
= [p(school).p(students | <s> school).p(do | school students).p(their | students do).p(homework | do their).p(regularly | their homework)] /
  [p(school).p(students | <s> school).p(do | school students).p(their | students do).p(homework | do their)]
(after cancelling out the common terms in the numerator and denominator)

P(regularly | school students do their homework) = p(regularly | their homework)
= count("their homework regularly") / count("their homework")

Thus, in a trigram model, to find the probability of the next word in the sequence “school
students do their homework” to be “regularly”, it is enough if we find the probability of
“regularly” given just the previous two words, “their homework”.

In practice, unigram models are not commonly used because the next word will be predicted
as the highest frequency word in the corpus.
Eg., assume the next word has to be predicted for the sequence "covishield is the vaccine for
_______". If the word "and" occurs the highest number of times in the corpus, then the
unigram model will predict "and" as the next word, resulting in "covishield is the vaccine for
and".
Bigram, trigram and four-gram models are usually preferred. For models with higher values
of 'n', a larger corpus is required.

Examples:
Consider the following tiny corpus with seven sentences.
<s> I am Henry </s>
<s> I like college </s>
<s> Do Henry like college </s>
<s> Henry I am </s>
<s> Do I like Henry </s>
<s> Do I like college </s>
<s> I do like Henry </s>

Word frequencies: <s> 7, </s> 7, I 6, am 2, Henry 5, like 5, college 3, do 4
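The counts above, and the next-word predictions worked out below, can be reproduced with a short script. A sketch in plain Python (words are lower-cased here for simplicity, which is an assumption not made in the hand calculations):

# Sketch: bigram counts and next-word prediction over the tiny corpus.
from collections import Counter

corpus = [
    "<s> i am henry </s>",
    "<s> i like college </s>",
    "<s> do henry like college </s>",
    "<s> henry i am </s>",
    "<s> do i like henry </s>",
    "<s> do i like college </s>",
    "<s> i do like henry </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p_bigram(w, prev):
    """MLE estimate P(w | prev) = count(prev w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

# Most probable next word after "<s> do ..." under the bigram model
candidates = set(unigrams) - {"<s>"}
print(max(candidates, key=lambda w: p_bigram(w, "do")))   # 'i', with P = 2/4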

i) Using bigram model, predict the most probable next word for the sequence,
<s> do _______
Next word Probability of next word
P(</s> | do) 0/4
P(I | do) 2/4 -> the most probable next word is “I”
P(am | do) 0/4
P(Henry | do) ¼
P( like | do) ¼
P(college| do) 0/4
P(do | do) 0/4

ii) Using bigram model, predict the most probable next word for the sequence,
<s> I like Henry _______

Next word Probability of next word


P(</s> | Henry) 3/5 -> the most probable next word is “</s>”
P(I | Henry) 1/5
P(am | Henry) 0
P(Henry | Henry) 0
P( like | Henry) 1/5
P(college| Henry) 0
P(do | Henry) 0

iii) Using trigram model, predict the most probable next word for the sequence,
<s> Do I like _______
Note: count(I like) = 3
Next word Probability of next word
P(</s> | I like) 0/3

P(I | I like) 0/3
P(am | I like) 0/3
P(Henry | I like) 1/3
P( like | I like) 0/3
P(college| I like) 2/3 -> college is more probable
P(do | I like) 0/3

iv) Using 4-gram model, predict the most probable next word for the sequence,
<s> Do I like college _______
Note: count(I like college) = 2
Next word Probability of next word
P(</s> | I like college) 2/2 -> </s> is more probable
P(I | I like college) 0/2
P(am | I like college) 0/2
P(Henry | I like college) 0/2
P( like | I like college) 0/2
P(college| I like college) 0/2
P(do | I like college) 0/2

v) Using bigram model and the corpus mentioned above, predict the most probable
sentence out of the following two.
a) <s> I like college </s>
b) <s> do I like Henry</s>

a) <s> I like college </s>


= p(I | <s>) * p(like | I) * p(college | like) * p( </s> | college)
= 3/7 * 3/6 * 3/5 * 3/3=9/70 = 0.13
b) <s> do I like Henry</s>
= p(do | <s>) * p(I | do) * p(like | I) * p( Henry | like) * p(</s> | Henry)
=3/7 * 2/4 *3/6*2/5*3/5=9/350 = 0.0257
From the above two values, it is inferred that <s> I like college </s> is more probable to occur.
Underflow Problem:
Since probabilities are (by definition) less than or equal to 1, the more probabilities we multiply
together, the smaller the product becomes. Multiplying enough n-grams together would result
in numerical underflow. By using log probabilities instead of raw probabilities, we get numbers
that are not as small. Adding in log space is equivalent to multiplying in linear space, so we
combine log probabilities by adding them. The result of doing all computation and storage in
log space is that we only need to convert back into probabilities if we need to report them at
the end; then we can just take the exp of the logprob:
p1 * p2* p3 * p4 = exp(log p1+log p2+log p3+log p4)

Eg., in the calculation above (b), the result is 0.0257 and if we keep multiplying the
probabilities of few more bigrams, the result will become smaller and smaller leading to
underflow. To avoid that, the same two calculations can be done as follows.
a) <s> I like college </s>
p = p(I | <s>) * p(like | I) * p(college | like) * p(</s> | college) = 3/7 * 3/6 * 3/5 * 3/3
log p = log(3/7) + log(3/6) + log(3/5) + log(3/3) = -2.0513

b) <s> do I like Henry </s>
p = p(do | <s>) * p(I | do) * p(like | I) * p(Henry | like) * p(</s> | Henry) = 3/7 * 2/4 * 3/6 * 2/5 * 3/5
log p = log(3/7) + log(2/4) + log(3/6) + log(2/5) + log(3/5) = -3.6607

Even with logarithmic calculations, the first sentence is found to be more probable.
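A short sketch of the same comparison done in log space (the per-bigram probabilities are the values from the hand calculation above):

# Sketch: scoring sentences with log probabilities to avoid underflow.
import math

probs_a = [3/7, 3/6, 3/5, 3/3]        # <s> I like college </s>
probs_b = [3/7, 2/4, 3/6, 2/5, 3/5]   # <s> do I like Henry </s>

def log_score(probs):
    # adding logs is equivalent to multiplying the raw probabilities
    return sum(math.log(p) for p in probs)

print(log_score(probs_a))              # about -2.0513
print(log_score(probs_b))              # about -3.6607
print(math.exp(log_score(probs_a)))    # back to a probability: about 0.1286 (= 9/70)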
Zero-Probability problem:
Let us assume that we have to calculate the probability of the word sequence
“<s> like college </s>”
= p(like | <s>) * p(college | like) * p( </s> | college)
= 0/7*3/5*3/3 = 0
The probability is evaluated as 0. Though "like college" is present thrice in the corpus, because
"<s> like" doesn't occur even once, the first term becomes 0 and so the whole product is 0.
This is termed the zero-probability problem.

This problem can be solved using smoothing algorithms.


I) Laplace or add-one smoothing:
Instead of calculating the probability as p = C/N, where C is the count of the bigram and
N is the count of the preceding word, add 1 to the numerator to make it non-zero and
add the total number of unique words, V, to the denominator. [Adding V to the
denominator normalizes the distribution so that the smoothed probabilities still sum to 1.]

p = (C + 1) / (N + V)
Add-one smoothing for the unigram model (assume a tiny corpus with N = 20 and V = 4):

Word     Freq   Unsmoothed p   New freq (c+1)   New p = (c+1)/(N+V)
Eat      10     10/20          11               11/24 = 0.46
citrus   4      4/20           5                5/24 = 0.21
Fruits   6      6/20           7                7/24 = 0.29
daily    0      0/20           1                1/24 = 0.04

Add-one smoothing for the bigram model:
Excluding </s>, as it never occurs as the preceding word of a bigram, the total number of unique
words in the aforementioned corpus is 7. So, V = 7.
P(<s> like college </s>)
= p(like | <s>) * p(college | like) * p(</s> | college)
= (0+1)/(7+7) * (3+1)/(5+7) * (3+1)/(3+7)
= 1/14 * 4/12 * 4/10 = 0.0095

Without add-one smoothing:
<s> do I like Henry </s>
= p(do | <s>) * p(I | do) * p(like | I) * p(Henry | like) * p(</s> | Henry)
= 3/7 * 2/4 * 3/6 * 2/5 * 3/5 = 9/350 = 0.0257

With add-one smoothing:
<s> do I like Henry </s>
= p(do | <s>) * p(I | do) * p(like | I) * p(Henry | like) * p(</s> | Henry)
= (3+1)/(7+7) * (2+1)/(4+7) * (3+1)/(6+7) * (2+1)/(5+7) * (3+1)/(5+7)
= 4/14 * 3/11 * 4/13 * 3/12 * 4/12
= 0.0020

Observe the sharp change in the probability after applying add-one smoothing: it has changed
from 0.0257 to 0.0020.

Drawbacks of add-one smoothing:


- Leads to sharp changes in the probability
- Too much probability mass goes to unseen events
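A sketch of the add-one computation shown above (the helper function and the V = 7 value follow the worked example; this is not a general-purpose implementation):

# Sketch: add-one (Laplace) smoothed bigram probability, p = (C + 1) / (N + V).
def p_laplace(bigram_count, prev_count, vocab_size):
    return (bigram_count + 1) / (prev_count + vocab_size)

V = 7   # unique words in the tiny corpus, excluding </s>

# P(like | <s>) is 0 under MLE (count(<s> like) = 0) but non-zero after smoothing:
print(p_laplace(0, 7, V))                                             # 1/14, about 0.071
# Smoothed probability of "<s> like college </s>":
print(p_laplace(0, 7, V) * p_laplace(3, 5, V) * p_laplace(3, 3, V))   # about 0.0095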

II) Good turing algorithm


Uses the count of things we have seen once to help estimate the count of things we have
never seen.
Let Nc denote the count of the things which occurred with frequency ‘c’ (ie., count of
things we have seen ‘c’ times)
N1 -> Number of words which occurred once in the corpus
N3 -> Number of words which occurred thrice in the corpus
Eg., “to be or not to be phrase”
c(to) = 2, c(be) =2, c(or)=1, c(not)=1, c(phrase)=1
Therefore, N1=3, N2=2 because there are 3 words which occur only once and 2 words
occurring twice.
Assume the following words in a corpus
C(cricket)=10, c(hockey)=3, c(chess)=2, c(badminton)=1, c(tennis)=1, c(volleyball)=1
Total count of words = 18
P(tennis)=1/18
P(basketball)=0/18=0
As “basketball” word is not present in the corpus, the probability of the words which
occurred once will be used to estimate the probability of the word “basketball”.
P_GT(unseen words) = N1/N
P_GT(existing words) = c*/N, where c* = (c+1) * N(c+1) / N(c)

Eg., P_MLE(basketball) = 0/18 = 0
But P_GT(basketball) = N1/N = 3/18 (N1 = 3, as there are 3 words with frequency 1)

P_MLE(tennis) = 1/18, with c(tennis) = 1
c*(tennis) = (c+1) * N(c+1) / N(c) = (1+1) * N2 / N1 = 2 * 1/3 = 2/3
P_GT(tennis) = c*/N = (2/3)/18 = 2/54 = 1/27

Therefore, the probability of the existing word "tennis" is discounted from 1/18 to 1/27 and
that extra probability mass is used to account for the unseen words.
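A sketch of the Good-Turing estimates for the small sports corpus above (counts hard-coded from the example; the discount formula is applied only to the low-frequency words, as in the notes):

# Sketch: Good-Turing discounting on the toy sports corpus.
from collections import Counter

counts = {"cricket": 10, "hockey": 3, "chess": 2,
          "badminton": 1, "tennis": 1, "volleyball": 1}
N = sum(counts.values())          # 18 tokens in total
Nc = Counter(counts.values())     # Nc[c] = number of word types seen c times

def p_gt_unseen():
    """Probability mass for an unseen word: N1 / N."""
    return Nc[1] / N

def p_gt(word):
    """Discounted probability of a seen word: c*/N with c* = (c+1) * N(c+1) / N(c)."""
    c = counts[word]
    c_star = (c + 1) * Nc[c + 1] / Nc[c]
    return c_star / N

print(p_gt_unseen())    # 3/18, used e.g. for "basketball"
print(p_gt("tennis"))   # ((1+1) * 1/3) / 18 = 1/27, about 0.037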

III) Witten Bell algorithm


This algorithm also models the probability of the unseen word by the probability of the word
occurring once.
For a bigram model, the probability of bigram is interpolated as
P(wi | wi-1) = λ2PMLE(wi | wi-1) +(1-λ2)P(wi)
λ2 refers to the λ for bigram model.

How to choose λ value?


For those n-grams that already exist, a large λ value will be better; for unseen n-grams, a small λ
will be better. So, an optimum λ value is chosen in the Witten-Bell method as follows.

λ(wi-1) = 1 - u(wi-1) / (u(wi-1) + c(wi-1)), where u(wi-1) denotes the number of unique words that follow wi-1 in the corpus.

Eg., let c(Delhi) = 3 and u(Delhi) = 2.
λ(Delhi) = 1 - 2/(2+3) = 3/5 = 0.6
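A sketch of the Witten-Bell weight and the interpolated bigram probability (the 'Delhi' counts come from the example; the MLE and unigram values passed in at the end are made-up numbers used only to show the interpolation):

# Sketch: Witten-Bell interpolation weight and interpolated bigram probability.
def witten_bell_lambda(unique_followers, count):
    """lambda(w_{i-1}) = 1 - u(w_{i-1}) / (u(w_{i-1}) + c(w_{i-1}))."""
    return 1 - unique_followers / (unique_followers + count)

def p_interpolated(p_mle_bigram, p_unigram, lam):
    """P(w_i | w_{i-1}) = lam * P_MLE(w_i | w_{i-1}) + (1 - lam) * P(w_i)."""
    return lam * p_mle_bigram + (1 - lam) * p_unigram

lam = witten_bell_lambda(unique_followers=2, count=3)               # the 'Delhi' example
print(lam)                                                          # 0.6
print(p_interpolated(p_mle_bigram=0.5, p_unigram=0.1, lam=lam))     # 0.34 (hypothetical inputs)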

Evaluating Language Models

Perplexity (PP) is an intrinsic evaluation measure used to evaluate language models.
A language model is better if it assigns a higher probability to an unseen test set.
Perplexity is the inverse probability of the test data, normalized by the number of words:
PP(W) = p(w1,w2,w3,...,wn)^(-1/n)
The lower the perplexity, the better the model.
The higher the perplexity, the more confused the model is when predicting the next word.

Perplexity calculation for bigram and trigram models based on the sentence,
<s> I like college </s>:
i) Bigram model:
p = p(I | <s>) * p(like | I) * p(college | like) * p(</s> | college)
= 3/7 * 3/6 * 3/5 * 3/3 = 9/70 = 0.13
PP(W) = (1/0.13)^(1/4) = 1.67
ii) Trigram model:
p = p(like | <s> I) * p(college | I like) * p(</s> | like college)
= 1/3 * 2/3 * 3/3 = 2/9 = 0.22
PP(W) = (1/0.22)^(1/3) = 1.66 (since this value is smaller, the trigram model is better for
this example).
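A sketch of the perplexity computation for the two models above (probabilities copied from the worked example; exact arithmetic gives roughly 1.65 for the trigram case, which the notes round to 1.66):

# Sketch: perplexity as the inverse probability normalized by the number of predictions.
def perplexity(probs):
    joint = 1.0
    for p in probs:
        joint *= p
    return joint ** (-1 / len(probs))

print(perplexity([3/7, 3/6, 3/5, 3/3]))   # bigram model, about 1.67
print(perplexity([1/3, 2/3, 3/3]))        # trigram model, about 1.65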

****************************

Semantic Role Labeling:
In linguistics, predicate refers to the main verb in the sentence. Predicate takes arguments. The role
of Semantic Role Labelling (SRL) is to determine how these arguments are semantically related to
the predicate.

Consider the sentence "Mary loaded the truck with hay at the depot on Friday". 'Loaded' is the
predicate. Mary, truck and hay have respective semantic roles of loader, bearer and cargo. We can
identify additional roles of location (depot) and time (Friday). The job of SRL is to identify these roles
so that downstream NLP tasks can "understand" the sentence.

SRL is also known by other names such as thematic role labelling, case role assignment, or shallow
semantic parsing.

Often an idea can be expressed in multiple ways. Consider these sentences that all mean the same
thing: "Yesterday, Kristina hit Scott with a baseball"; "Scott was hit by Kristina yesterday with a
baseball"; "With a baseball, Kristina hit Scott yesterday"; "Kristina hit Scott with a baseball
yesterday".

Either constituent or dependency parsing will analyze these sentences syntactically. But syntactic
relations don't necessarily help in determining semantic roles. However, parsing is not completely
useless for SRL. In a traditional SRL pipeline, a parse tree helps in identifying the predicate
arguments.

But SRL performance can be impacted if the parse tree is wrong. This has motivated SRL approaches
that completely ignore syntax.

SRL is useful in any NLP application that requires semantic understanding: machine translation,
information extraction, text summarization, question answering, and more. For example, predicates
and heads of roles help in document summarization. For information extraction, SRL can be used to
construct extraction rules.

SRL can be seen as answering "who did what to whom". Obtaining semantic information thus
benefits many downstream NLP tasks such as question answering, dialogue systems, machine
reading, machine translation, text-to-scene generation, and social network analysis.

One of the oldest models of semantic roles is called thematic roles where roles are assigned to
subjects and objects in a sentence. Roles are based on the type of event. For example, if the verb is
'breaking', roles would be breaker and broken thing for subject and object respectively. Some
examples of thematic roles are agent, experiencer, result, content, instrument, and source. There's
no well-defined universal set of thematic roles.

A modern alternative from 1991 is proto-roles that defines only two roles: Proto-Agent and Proto-
Patient. Using heuristic features, algorithms can say if an argument is more agent-like (intentionality,
volitionality, causality, etc.) or patient-like (undergoing change, affected by, etc.).

Verbnet, PropBank and FrameNet help in semantic role labelling. VerbNet is a resource that groups
verbs into semantic classes and their alternations.

PropBank contains sentences annotated with proto-roles and verb-specific semantic roles.
Arguments to verbs are simply named Arg0, Arg1, etc. Typically, Arg0 is the Proto-Agent and Arg1 is
the Proto-Patient.

FrameNet is another lexical resource defined in terms of frames rather than verbs. For every
frame, core roles and non-core roles are defined. Frames can inherit from or causally link to other
frames.

Parse tree path is one of the approaches to SRL.

Parse Tree Path:

Path feature:

This feature is designed to capture the syntactic relation of a constituent to the rest of the sentence.
However, the path feature describes the syntactic relation between the target word (that is, the
predicate invoking the semantic frame) and the constituent in question, whereas the previous
feature is independent of where the target word appears in the sentence; that is, it identifies all
subjects whether they are the subject of the target word or not. This feature is defined as the path
from the target word through the parse tree to the constituent in question, represented as a string
of parse tree non-terminals linked by symbols indicating upward or downward movement through
the tree, as shown in Figure. Although the path is composed as a string of symbols, our systems will
treat the string as an atomic value. The path includes, as the first element of the string, the part of
speech of the target word, and, as the last element, the phrase type or syntactic category of the
sentence constituent marked as a frame element.

In this example, the path from the target word (predicate) ‘ate’ to the frame element ‘He’ can be
represented as VB↑VP↑S↓NP, with ↑ indicating upward movement in the parse tree and ↓
downward movement. Our path feature is dependent on the syntactic representation used, and in
this case, the Treebank-2 annotation style is used.

The most common values of the path feature, along with interpretations, are shown in the table. As
per the table, VB↑VP↑S↓NP results in a subject. And so, ‘he’ should be labelled as subject for the
predicate ‘ate’.
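A sketch of how the path feature can be read off a bracketed parse using NLTK's Tree class (the tree string, the words "some pancakes", and the helper function below are illustrative assumptions; only the path VB↑VP↑S↓NP from 'ate' to the NP of 'He' comes from the discussion above):

# Sketch: compute the parse tree path from a predicate to an argument constituent.
from nltk import Tree

def path_feature(tree, pred_leaf_index, arg_position):
    """Path from the predicate's POS node up to the lowest common ancestor,
    then down to the argument constituent (given as a tree position)."""
    pred_pos = tree.leaf_treeposition(pred_leaf_index)[:-1]   # POS node above the target word
    lca = 0                                                   # longest shared prefix = lowest common ancestor
    while (lca < min(len(pred_pos), len(arg_position))
           and pred_pos[lca] == arg_position[lca]):
        lca += 1
    up = [tree[pred_pos[:i]].label() for i in range(len(pred_pos), lca - 1, -1)]
    down = [tree[arg_position[:i]].label() for i in range(lca + 1, len(arg_position) + 1)]
    return "↑".join(up) + "↓" + "↓".join(down)

t = Tree.fromstring("(S (NP (PRP He)) (VP (VB ate) (NP (DT some) (NN pancakes))))")
# Path from the predicate 'ate' (leaf index 1) to the frame element 'He' (the NP at position (0,))
print(path_feature(t, pred_leaf_index=1, arg_position=(0,)))   # VB↑VP↑S↓NP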

The following figure shows the annotation for the sentence “They expect him to cut costs
throughout the organization”, which exhibits the syntactic phenomenon known as subject-to-object
raising, in which the main verb’s object is interpreted as the embedded verb’s subject. The
Treebank-2 uses S nodes generously to indicate clauses, so as to make possible a relatively
straightforward mapping from S nodes to predications. In this example, the path from ‘cut’ to the
frame element ‘him’ would be VB↑VP↑VP↑S↓NP, which typically indicates a verb’s subject as per
the above table. But when we consider the accusative case of the pronoun ‘him’, with the target
word (predicate) to be ‘expect’ in the sentence, the path to him would be VB↑VP↓S↓NP, rather
than the typical direct object path of VB↑VP↓NP.

OVERLAP BASED APPROACHES
- Require a Machine Readable Dictionary (MRD).
- Find the overlap between the features of different senses of an ambiguous word (sense bag) and the features of the words in its context (context bag).
- These features could be sense definitions, example sentences, hypernyms etc.
- The features could also be given weights.
- The sense which has the maximum overlap is selected as the contextually appropriate sense.

LESK’S ALGORITHM
Sense Bag: contains the words in the definition of a candidate sense of the
ambiguous word.
Context Bag: contains the words in the definition of each sense of each context
word.
E.g. “On burning coal we get ash.”

From WordNet:
- The noun ash has 3 senses (first 2 from tagged texts):
  1. (2) ash -- (the residue that remains when something is burned)
  2. (1) ash, ash tree -- (any of various deciduous pinnate-leaved ornamental or timber trees of the genus Fraxinus)
  3. ash -- (strong elastic wood of any of various ash trees; used for furniture and tool handles and sporting goods such as baseball bats)
- The verb ash has 1 sense (no senses from tagged texts):
  1. ash -- (convert into ashes)
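A minimal sketch of the overlap computation using NLTK's WordNet interface (assumes the wordnet data has been downloaded; the stop-word list and the simplification of comparing glosses directly with the raw context words are assumptions made here):

# Sketch: simplified Lesk - pick the sense whose gloss overlaps most with the context.
from nltk.corpus import wordnet as wn

STOP = {"on", "of", "the", "we", "a", "an", "in", "to", "is", "get", "that", "when"}

def simplified_lesk(word, context_sentence):
    context = set(context_sentence.lower().split()) - STOP
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split()) - STOP
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = simplified_lesk("ash", "on burning coal we get ash")
print(sense.name(), "-", sense.definition())
# Because the overlap relies on exact string matches ("burning" vs. "burned"),
# the residue sense is not guaranteed to win; this sensitivity is one of the
# motivations for the extended Lesk algorithm discussed below.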
CRITIQUE
- Proper nouns in the context of an ambiguous word can act as strong disambiguators.
  E.g. "Sachin Tendulkar" will be a strong indicator of the category "sports": Sachin Tendulkar plays cricket.
- Proper nouns are not present in the thesaurus. Hence this approach fails to capture the strong clues provided by proper nouns.
- Accuracy: 50% when tested on 10 highly polysemous English words.

Extended Lesk's algorithm
- The original algorithm is sensitive towards the exact words in the definition.
- The extension includes glosses of semantically related senses from WordNet (e.g. hypernyms, hyponyms, etc.).
- The scoring function becomes:

  score_ext(S) = Σ over s' ∈ rel(s) or s' ≡ s of | context(w) ∩ gloss(s') |

  where,
  - gloss(S) is the gloss of sense S from the lexical resource.
  - context(w) is the gloss of each sense of each context word.
  - rel(s) gives the senses related to s in WordNet under some relations.
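A sketch of the extended scoring, enlarging each sense's gloss with the glosses of related senses (only hypernyms and hyponyms are used here as an assumption, and the context is taken as the raw sentence words rather than the glosses of the context words' senses; which sense wins still depends on the actual WordNet glosses):

# Sketch: extended Lesk - score a sense by overlap with its own gloss plus related glosses.
from nltk.corpus import wordnet as wn

def extended_gloss(sense):
    """Words from the gloss of the sense and of its hypernyms/hyponyms."""
    related = [sense] + sense.hypernyms() + sense.hyponyms()
    words = set()
    for s in related:
        words |= set(s.definition().lower().split())
    return words

def extended_lesk(word, context_words):
    best_sense, best_score = None, -1
    for sense in wn.synsets(word):
        score = len(extended_gloss(sense) & set(context_words))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

context = "on combustion of coal we get ash".lower().split()
sense = extended_lesk("ash", context)
print(sense.name(), "-", sense.definition())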
[Figure: WordNet sub-graph around "house, home" (gloss: "a place that serves as the living quarters of one or more families"), with nodes such as dwelling/abode, kitchen, backyard, bedroom, veranda, study, guestroom, hermitage and cottage connected by hypernymy, hyponymy and meronymy links.]
Example: Extended Lesk
"On combustion of coal we get ash"

From WordNet:
- The noun ash has 3 senses (first 2 from tagged texts):
  1. (2) ash -- (the residue that remains when something is burned)
  2. (1) ash, ash tree -- (any of various deciduous pinnate-leaved ornamental or timber trees of the genus Fraxinus)
  3. ash -- (strong elastic wood of any of various ash trees; used for furniture and tool handles and sporting goods such as baseball bats)
- The verb ash has 1 sense (no senses from tagged texts):
  1. ash -- (convert into ashes)
Example: Extended Lesk (contd.)
"On combustion of coal we get ash"

From WordNet (through hyponymy):
- ash -- (the residue that remains when something is burned)
  => fly ash -- (fine solid particles of ash that are carried into the air when fuel is combusted)
  => bone ash -- (ash left when bones burn; high in calcium phosphate; used as fertilizer and in bone china)
Critique of Extended Lesk
- Larger region of matching in WordNet
- Increased chance of matching
BUT
- Increased chance of topic drift