Notes of NLP - Unit-2

NLP NOTES
UNIT-2
1. Language Model:
Models that assign probabilities to sequences of words are called language models (LMs).
• The simplest language model that assigns probabilities to sentences and sequences of words is
the n-gram model.
• An n-gram is a sequence of n words:
– A 1-gram (unigram) is a single word, like “please” or “turn”.
– A 2-gram (bigram) is a two-word sequence, like “please turn”, “turn your”, or “your
homework”.
– A 3-gram (trigram) is a three-word sequence, like “please turn your” or “turn your
homework”.
• We can use n-gram models to estimate the probability of the last word of an n-gram given the
previous words, and also to assign probabilities to entire word sequences.
Probabilistic language models

Probabilistic language models can be used to assign a probability to a sentence in many NLP
tasks.
• Machine Translation:
– P(high winds tonight) > P(large winds tonight)
• Spell Correction:
– Thek office is about ten minutes from here
– P(The Office is) > P(Then office is)
• Speech Recognition:
– P(I saw a van) >> P(eyes awe of an)
• Summarization, question answering, etc.
Our goal is to compute the probability of a sentence or sequence of words W = (w1, w2, …, wn):
– P(W) = P(w1, w2, w3, w4, w5, …, wn)
• What is the probability of an upcoming word?
– P(w5 | w1, w2, w3, w4)
• A model that computes either of these, P(W) or P(wn | w1, w2, …, wn-1), is called a language model.
Chain Rule of Probability
How can we compute probabilities of entire word sequences like w1 ,w2 ,…wn ?
– The probability of the word sequence
w1 ,w2 ,…wn is P(w1 ,w2 ,…wn ).
• We can use the chain rule of the probability to decompose this probability:
P(w1…wn) = P(w1) P(w2|w1) P(w3|w1w2) … P(wn|w1…wn-1)
Example: P(the man from jupiter) = P(the) P(man|the) P(from|the man) P(jupiter|the man from)
Conditional Probabilities:
The chain rule shows the link between computing the joint probability of a sequence and
computing the conditional probability of a word given previous words.
• Definition of Conditional Probabilities:
P(B|A) = P(A,B) / P(A), and therefore P(A,B) = P(A) P(B|A)
• Conditional Probabilities with More Variables:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
• Chain Rule:
P(w1… wn ) = P(w1 ) P(w2 |w1 ) P(w3 |w1w2 ) … P(wn |w1…wn-1 )
To compute the exact probability of a word given a long sequence of preceding words is difficult
(sometimes impossible).
• We are trying to compute
P(wn |w1…wn-1 )
which is the probability of seeing wn after seeing w1…wn-1.
• We may try to compute
P(wn |w1…wn-1 )
exactly as follows:
P(wn |w1…wn-1 ) = count(w1…wn-1wn ) / count(w1…wn-1 )
• There are too many possible sentences, and we may never see enough data to estimate these
probability values.
• So, we need to compute P(wn |w1…wn-1 ) approximately.
2.N-gram Model
The intuition of the n-gram model (simplifying assumption):
– instead of computing the probability of a word given its entire history, we can approximate the
history by just the last few words.
P(wn |w1…wn-1 ) ≈ P(wn ) unigram
P(wn |w1…wn-1 ) ≈ P(wn |wn-1 ) bigram
P(wn |w1…wn-1 ) ≈ P(wn |wn-2wn-1 ) trigram
P(wn |w1…wn-1 ) ≈ P(wn |wn-3wn-2wn-1 ) 4-gram
P(wn |w1…wn-1 ) ≈ P(wn |wn-4wn-3wn-2wn-1 ) 5-gram
• In general, an N-gram model uses
P(wn |w1…wn-1 ) ≈ P(wn |wn-N+1…wn-1 )
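The bigram case of this approximation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-sentence corpus is made up, the helper names (train_bigram, bigram_prob, sentence_prob) are hypothetical, the sentence boundary markers <s> and </s> are an added convention, and the probabilities are plain count ratios with no smoothing.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) = count(prev word) / count(prev); 0.0 if prev was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence, unigrams, bigrams):
    """Chain rule with the bigram approximation: multiply P(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word, unigrams, bigrams)
    return prob

# Tiny made-up corpus (assumption, for illustration only).
corpus = ["please turn your homework", "please turn the page"]
uni, bi = train_bigram(corpus)
p_turn = bigram_prob("please", "turn", uni, bi)
p_your = bigram_prob("turn", "your", uni, bi)
p_sent = sentence_prob("please turn your homework", uni, bi)
```

Any word pair that never occurs in the corpus gets probability 0 here, which makes the whole sentence probability 0; that is the problem smoothing addresses.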
In general, an n-gram model is an insufficient model of a language because languages have long-distance
dependencies.
– “The computer(s) which I had just put into the machine room is (are) crashing.”
– But we can still effectively use N-Gram models to represent languages.
• Which N-Gram should be used as a language model?
– With a bigger N, the model will be more accurate.
• But we may not get good estimates for the N-gram probabilities.
• The N-gram tables will be more sparse.
– With a smaller N, the model will be less accurate.
• But we may get better estimates for the N-gram probabilities.
• The N-gram tables will be less sparse.
– In practice, we usually do not use anything higher than trigrams (and often nothing higher than bigrams).
– How big are N-gram tables with a vocabulary of 10,000 words?
• Unigram – 10,000
• Bigram – 10,000*10,000 = 100,000,000
• Trigram – 10,000*10,000*10,000 = 1,000,000,000,000
N-gram and Markov model
The assumption that the probability of a word depends only on the previous word(s) is called
the Markov assumption.
• Markov models are the class of probabilistic models that assume that we can predict the
probability of some future unit without looking too far into the past.
• A bigram model is called a first-order Markov model (because it looks one token into the past);
• A trigram model is called a second-order Markov model;
• In general, an N-gram model is called an (N-1)th-order Markov model.
3. Evaluating Language Models

Does our language model prefer good sentences to bad ones?
– Assign higher probability to “real” or “frequently observed” sentences than “ungrammatical”
or “rarely observed” sentences?
• We train parameters of our model on a training set.
• We test the model’s performance on data we haven’t seen.
– A test set is an unseen dataset that is different from our training set, totally unused.
• An evaluation metric tells us how well our model does on the test set.
Extrinsic Evaluation
Extrinsic evaluation of an N-gram language model means using it in an application and measuring how
much the application improves.
• To compare two language models A and B:
– Use each language model in a task such as a spelling corrector or an MT system.
– Get an accuracy for A and for B
• How many misspelled words were corrected properly
• How many words were translated correctly
– Compare the accuracy for A and B
• The model that produces the better accuracy is the better model.
• Extrinsic evaluation can be time-consuming.
Intrinsic Evaluation
An intrinsic evaluation metric is one that measures the quality of a model independent of any
application.
• When a corpus of text is given and we want to compare two different n-gram models:
– Divide the data into training and test sets,
– Train the parameters of both models on the training set, and
– Compare how well the two trained models fit the test set.
• Whichever model assigns a higher probability to the test set is the better model.
• In practice, a probability-based metric called perplexity is used instead of raw probability as our
metric for evaluating language models.
Perplexity can be seen as the weighted average branching factor of a language.
– The branching factor of a language is the number of possible next words that can follow any
word.
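• Suppose a sentence consists of random digits.
• What is the perplexity of this sentence according to a model that assigns P=1/10 to each digit?
– The perplexity is 10: each position has ten equally likely continuations, so the branching factor is 10.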
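The random-digit question can be checked numerically. The sketch below assumes the usual definition of perplexity, which these notes do not state explicitly: the inverse probability of the test sequence, normalized by its length N, i.e. PP(W) = P(w1…wN)^(-1/N). The function name perplexity and the 8-digit example are illustrative choices.

```python
import math

def perplexity(word_probs):
    """Perplexity of a sequence given the model probability of each word.

    Computed in log space: PP = exp(-(1/N) * sum(log p_i)),
    which equals (p_1 * ... * p_N) ** (-1/N).
    """
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# A "sentence" of 8 random digits; the model assigns P = 1/10 to every digit.
digit_probs = [1 / 10] * 8
pp = perplexity(digit_probs)
```

The result does not depend on the length of the digit string, because every position contributes the same factor.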
Generalization and Zeros

The n-gram model, like many statistical models, is dependent on the training corpus.
– the probabilities often encode specific facts about a given training corpus.
• N-grams only work well for word prediction if the test corpus looks like the training corpus.
– In real life, it often doesn’t.
– We need to train robust models that generalize!
– One kind of generalization: getting rid of zeros!
• Zeros are things that never occur in the training set but do occur in the test set. They cause
problems for two reasons.
– First, we underestimate the probability of all sorts of words that might occur.
– Second, if the probability of any word in the test set is 0, the entire probability of the test set is 0.
4. Smoothing
To keep a language model from assigning zero probability to these unseen events, we’ll have to
shave off a bit of probability mass from some more frequent events and give it to the events
we’ve never seen.
• This modification is called smoothing (or discounting).
• There are many ways to do smoothing, and some of them are:
– Add-1 smoothing (Laplace Smoothing)
– Add-k smoothing,
– Backoff
Add-1 smoothing (Laplace Smoothing)
The simplest way to do smoothing is to add one to all the counts, before we normalize them into
probabilities.
– All the counts that used to be zero will now have a count of 1, the counts of 1 will be 2, and so
on.
• This algorithm is called Laplace smoothing (Add-1 Smoothing).
• We pretend that we saw each word one more time than we did, and we just add one to all the
counts.
Laplace Smoothing for Unigrams:
• The unsmoothed maximum likelihood estimate of the unigram probability of the word wi is its
count ci normalized by the total number of word tokens N:
P(wi ) = ci / N
• Laplace smoothing adds one to each count. Since there are V words in the vocabulary and each
one was incremented, we also need to adjust the denominator to take into account the extra V
observations.
PLaplace(wi ) = (ci + 1) / (N + V)
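A minimal sketch of the add-1 unigram estimate, assuming a toy token list and a fixed vocabulary that contains every training word plus one unseen word. The function name laplace_unigram_probs is hypothetical.

```python
from collections import Counter

def laplace_unigram_probs(tokens, vocabulary):
    """Return P_Laplace(w) = (c_w + 1) / (N + V) for every word in the vocabulary."""
    counts = Counter(tokens)
    n = len(tokens)       # N: total number of word tokens
    v = len(vocabulary)   # V: vocabulary size
    return {w: (counts[w] + 1) / (n + v) for w in vocabulary}

tokens = "please turn your homework please".split()
# "page" is in the vocabulary but never occurs in the tokens.
vocab = {"please", "turn", "your", "homework", "page"}
probs = laplace_unigram_probs(tokens, vocab)
```

Here N = 5 and V = 5, so the unseen word "page" receives (0 + 1) / 10 instead of zero, and the smoothed probabilities still sum to 1.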
5. Classification and Categorization of Text Data:
Text classification is the process of labeling or organizing text data into groups. It forms a
fundamental part of Natural Language Processing. In the digital age that we live in, we are
surrounded by text on our social media accounts, in commercials, on websites, in e-books, etc. The
majority of this text data is unstructured, so classifying this data can be extremely useful.
Sentiment analysis is an important application of text classification.

Text classification is a common NLP task used to solve business problems in various fields. The
goal of text classification is to categorize or predict a class of unseen text documents, often with
the help of supervised machine learning.
Similar to a classification algorithm that has been trained on a tabular dataset to predict a class,
text classification also uses supervised machine learning. The fact that text is involved in text
classification is the main distinction between the two.
Applications
Text Classification has a wide array of applications. Some popular uses are:
• Spam detection in emails
• Sentiment analysis of online reviews
• Topic labeling of documents like research papers
• Language detection, as in Google Translate
• Age/gender identification of anonymous users
• Tagging online content
• Speech recognition used in virtual assistants like Siri and Alexa
Example : Spam classification
There are many practical use cases for text classification across many industries. For example, a
spam filter is a common application that uses text classification to sort emails into spam and non-
spam categories.
Types of Text Classification Systems
There are mainly two types of text classification systems: rule-based and machine learning-based
text classification.
1) Rule-based text classification
Rule-based techniques use a set of manually constructed language rules to categorize text into
categories or groups. These rules tell the system to classify text into a particular category based
on the content of a text by using semantically relevant textual elements. An antecedent or pattern
and a projected category make up each rule.
For example, imagine you have a large number of news articles, and your goal is to assign them to relevant
categories such as Sports, Politics, Economy, etc.
With a rule-based classification system, you would manually review a few documents to
come up with linguistic rules like this one:
• If the document contains words such as money, dollar, GDP, or inflation, it belongs to
the Economy group (class).
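A keyword rule of this kind can be sketched as follows. The category names, the keyword lists, and the tie-breaking choice (the category with the most matching keywords wins) are illustrative assumptions, not a prescribed rule set.

```python
import re

# Hypothetical hand-written rules: category -> trigger words (illustrative only).
RULES = {
    "Economy": {"money", "dollar", "gdp", "inflation"},
    "Sports": {"match", "goal", "tournament", "league"},
}

def classify_by_rules(text, rules=RULES, default="Unknown"):
    """Return the category whose keyword set matches the most distinct words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best_category, best_hits = default, 0
    for category, keywords in rules.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_category, best_hits = category, hits
    return best_category

label = classify_by_rules("Inflation rose while the dollar weakened and GDP shrank.")
```

Even in this small sketch, adding a new category means revisiting how ties and overlapping keywords are resolved, which hints at why such systems become hard to maintain.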
Rule-based systems can be refined over time and are understandable to humans. However, there
are certain drawbacks to this strategy.
These systems, to begin with, demand in-depth expertise in the field. They take a lot of time
since creating rules for a complicated system can be difficult and frequently necessitates
extensive study and testing.
Given that adding new rules can alter the outcomes of the pre-existing rules, rule-based systems
are also challenging to maintain and do not scale effectively.
2) Machine learning-based text classification
Machine learning-based text classification is a supervised machine learning problem. It learns
the mapping from input data (raw text) to the labels (also known as target variables). This is
similar to non-text classification problems where we train a supervised classification algorithm
on a tabular dataset to predict a class, with the exception that in text classification, the input data
is raw text instead of numeric features.
We can use machine learning to train models on large sets of text data to predict categories of
new text. To train models, we need to transform text data into numerical data; this is known
as feature extraction. Important feature extraction techniques include bag of words and n-grams.
There are several useful machine learning algorithms we can use for text classification. The most
popular ones are:
• Naive Bayes classifiers
• Support vector machines
• Deep learning algorithms
Like any other supervised machine learning task, text classification has two phases:
training and prediction.
Training phase
A supervised machine learning algorithm is trained on the input-labeled dataset during the
training phase. At the end of this process, we get a trained model that we can use to obtain
predictions (labels) on new and unseen data.
Prediction phase
Once a machine learning model is trained, it can be used to predict labels on new and unseen
data. This is usually done by deploying the best model from an earlier phase as an API on the
server.
Feature Extraction
The two most common methods for extracting features from text, or in other words converting text
data (strings) into numeric features so that a machine learning model can be trained, are Bag of Words
(implemented, for example, by scikit-learn's CountVectorizer) and TF-IDF.
Bag of Words
A bag of words (BoW) model is a simple way of representing text data as numeric features. It
involves creating a vocabulary of known words in the corpus and then creating a vector for each
document that contains counts of how often each word appears.
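A minimal bag-of-words sketch using only the standard library. It assumes whitespace tokenization and lowercasing, and the helper names build_vocabulary and bag_of_words are hypothetical. Real vectorizers handle punctuation and tokenization more carefully.

```python
from collections import Counter

def build_vocabulary(documents):
    """Sorted list of all distinct lowercase tokens in the corpus."""
    return sorted({tok for doc in documents for tok in doc.lower().split()})

def bag_of_words(document, vocabulary):
    """Count vector: one entry per vocabulary word, in vocabulary order."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

docs = ["the movie was great", "the movie was terrible terrible"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
```

With this corpus the vocabulary is great, movie, terrible, the, was, and each document becomes a vector of five counts in that order.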
TF-IDF
TF-IDF stands for term frequency-inverse document frequency, and it is another way of
representing text as numeric features. It overcomes some shortcomings of the Bag of Words (BoW)
model.
The TF-IDF model differs from the bag of words model in that it weights the
frequency of each word in a document (term frequency) by its inverse document frequency, so
words that appear in many documents of the corpus receive a lower weight. This means
that the TF-IDF model is more likely to identify the important words in a document than the bag
of words model.
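The weighting can be sketched with one common textbook variant, assumed here: tf = count / document length and idf = log(N / df), where N is the number of documents and df is the number of documents containing the word. Libraries typically use smoothed or normalized variants, so their numbers will differ. The function name tf_idf is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document, with tf = count/len and idf = log(N/df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

docs = ["the movie was great", "the movie was terrible"]
weights = tf_idf(docs)
```

Words that occur in every document ("the", "movie", "was") get weight 0, while the words that distinguish the two documents ("great", "terrible") get a positive weight.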
6. Naive Bayes algorithm:

We can use supervised machine learning:
Given:
• a document d
• a fixed set of classes C = { c1, c2, … , cn }
• a training set of m documents that we have pre-determined to belong to a specific
class
We train our classifier using the training set, which results in a learned classifier.
We can then use this learned classifier to classify new documents.
Notation: we use γ(d) = c to represent our classifier, where γ() is the classifier, d is the
document, and c is the class we assigned to the document.
(Google’s “mark as spam” button probably works this way.)
Naive Bayes Classifier

This is a simple (naive) classification method based on Bayes’ rule. It relies on a very simple
representation of the document (called the bag of words representation).
Imagine we have 2 classes (positive and negative), and our input is a text representing a review
of a movie. We want to know whether the review was positive or negative. So we may have a
bag of positive words (e.g. love, amazing, hilarious, great), and a bag of negative words
(e.g. hate, terrible).
We may then count the number of times each of those words appears in the document, in order to
classify the document as positive or negative.
This technique works well for topic classification; say we have a set of academic papers, and we
want to classify them into different topics (computer science, biology, mathematics).
Bayes’ Rule applied to Documents and Classes
For a document d and a class c, and using Bayes’ rule,
P( c | d ) = [ P( d | c ) x P( c ) ] / [ P( d ) ]
The class mapping for a given document is the class which has the maximum value of the above
probability.
Since all probabilities have P( d ) as their denominator, we can eliminate the denominator, and
simply compare the different values of the numerator:
P( c | d ) ∝ P( d | c ) x P( c )
Now, what do we mean by the term P( d | c ) ?
Let’s represent the document as a set of features (words or tokens) x1, x2, x3, …
We can then re-write P( d | c ) as:
P( x1, x2, x3, … , xn | c )
What about P( c ) ? How do we calculate it?
=> P( c ) is the total (prior) probability of a class.
=> How often does this class occur in total?
E.g. in the case of classes positive and negative, we would be calculating the probability that any
given review is positive or negative, without actually analyzing the current input document.
This is calculated by counting the relative frequencies of each class in a corpus.
E.g. out of 10 reviews we have seen, 3 have been classified as positive.
=> P ( positive ) = 3 / 10
Now let’s go back to the first term in the Naive Bayes equation:
P( d | c ), or P( x1, x2, x3, … , xn | c ).
We use some assumptions to simplify the computation of this probability:
• the Bag of Words assumption => assume the position of the words in the document
doesn’t matter.
• Conditional Independence => assume the feature probabilities P( xi | c ) are
independent given the class c.
It is important to note that neither of these assumptions is actually correct: of course, the order
of words matters, and they are not independent. A phrase like this movie was incredibly
terrible shows an example of how both of these assumptions don't hold up in regular English.
However, these assumptions greatly simplify the complexity of calculating the classification
probability. And in practice, we can calculate probabilities with a reasonable level of accuracy
given these assumptions.
So:
To calculate the Naive Bayes score, P( d | c ) x P( c ), we calculate P( xi | c ) for each xi
in d, and multiply them together.
Then we multiply the result by P( c ) for the current class. We do this for each of our classes, and
choose the class that has the maximum overall value.
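The procedure can be sketched as a small word-count Naive Bayes classifier. Two details are assumptions of this sketch rather than something the description above spells out: it sums log probabilities instead of multiplying raw probabilities (to avoid numeric underflow), and it applies add-1 smoothing to P( xi | c ) so that an unseen word does not zero out a class. The training reviews and function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (text, class). Returns priors, per-class word counts, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    total = sum(class_counts.values())
    priors = {c: n / total for c, n in class_counts.items()}  # P(c)
    return priors, word_counts, vocabulary

def classify(text, priors, word_counts, vocabulary):
    """Pick the class maximizing log P(c) + sum_i log P(x_i | c), with add-1 smoothing."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        total_words = sum(word_counts[c].values())
        score = math.log(prior)
        for tok in text.lower().split():
            if tok not in vocabulary:
                continue  # ignore words never seen in training
            score += math.log((word_counts[c][tok] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Made-up training reviews (assumption, for illustration only).
train = [
    ("love this amazing movie", "positive"),
    ("great hilarious film", "positive"),
    ("hate this terrible movie", "negative"),
]
priors, word_counts, vocab = train_naive_bayes(train)
prediction = classify("amazing great movie", priors, word_counts, vocab)
```

The prior here is P(positive) = 2/3, computed as a relative class frequency in the same way as the 3/10 example above.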
7. Support Vector Machines (SVM):
“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used
for both classification and regression. However, it is mostly used in classification
problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space
(where n is the number of features you have), with the value of each feature being the value of a
particular coordinate. Then, we perform classification by finding the hyperplane that
best separates the two classes.
In simpler terms, support vectors are the coordinates of the individual observations that lie closest to the
boundary, and the SVM classifier is a frontier (hyperplane/line) that best segregates the two classes.
Application to text: we can apply this algorithm to text classification and to sentiment analysis.
Sentiment analysis is an NLP technique that we perform on text to determine whether the
author’s attitude towards a particular topic, product, etc. is positive, negative, or neutral.
Note that there are audio-based techniques as well, which we will not cover here.
To run an SVM on text, we first have to convert each piece of text into a vector of numbers. In other words,
which features are needed to classify texts using SVM?
The most common answer is word frequencies. This means that we treat a text as a bag of words,
and for every word that appears in that bag we have a feature. The value of that feature will be
how frequent that word is in the text. This method boils down to just counting how many times
every word appears in a text and dividing it by the total number of words. So in the 22-word
sentence “The soccer World Cup will be held in November and the world will be hooked to
soccer games of the highest quality”, the word soccer has a frequency of 2/22 ≈ 0.09, and the
word and has a frequency of 1/22 ≈ 0.045.
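The word-frequency features for the example sentence can be computed directly. This sketch assumes whitespace tokenization and lowercasing (so "The" and "the" count as the same word). The function name relative_frequencies is hypothetical.

```python
from collections import Counter

def relative_frequencies(text):
    """Map each lowercase word to count / total number of words in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

sentence = ("The soccer World Cup will be held in November and the world "
            "will be hooked to soccer games of the highest quality")
freqs = relative_frequencies(sentence)
n_words = len(sentence.split())
```

The sentence has 22 words, so a word that occurs twice, such as "soccer", gets the value 2/22, and a word that occurs once gets 1/22.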
Thus, every text in the dataset is represented as a vector with thousands (or tens of thousands) of
dimensions, every one representing the frequency of one of the words of the text. This is fed into
the SVM for training. Now that we have the feature vectors, the only thing left to do is choosing
a kernel function for our model. A linear kernel is usually used for NLP classification tasks.
For the training, we take the labeled ('sentimented') texts, convert them to vectors using word
frequencies, and feed them to the SVM algorithm, which will use the chosen linear kernel
function, and so it produces a model. When we have a new unlabeled ('unsentimented') text that
we want to classify, we convert it into a vector and give it to the model, which will output the
sentiment of the text.
8. Decision Tree Classification Algorithm

o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes represent
the features of a dataset, branches represent the decision rules and each leaf node
represents the outcome.
o In a decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make a decision and have multiple branches,
whereas leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we can use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits
the tree into subtrees.
o [Diagram: general structure of a decision tree]
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Decision Tree Terminologies
• Root Node: The root node is where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further
after reaching a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub Tree: A subtree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: A node that is divided into sub-nodes is called the parent node of those
sub-nodes, and the sub-nodes are called its child nodes.
How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given record, the algorithm starts from the root
node of the tree. It compares the value of the root attribute with the record's
attribute and, based on the comparison, follows the branch and jumps to the next node.
At the next node, the algorithm again compares the attribute value with the sub-nodes and
moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using the algorithm below:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where you cannot further classify
the nodes; each such final node is called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether to
accept the offer or not. To solve this problem, the decision tree starts with the root
node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance
from the office) and one leaf node based on the corresponding labels. The next decision node
further gets split into one decision node (Cab facility) and one leaf node. Finally, the decision
node splits into two leaf nodes (Accepted offer and Declined offer).
[Diagram: decision tree for the job-offer example]
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute for
the root node and for the sub-nodes. To solve this problem, there is a technique called the
Attribute Selection Measure, or ASM. With this measure, we can select the best
attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the
node/attribute having the highest information gain is split first. It can be calculated using
the formula below:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
o S = the set of all samples
o P(yes) = probability of yes
o P(no) = probability of no
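The entropy and information gain formulas can be sketched as follows. The four job-offer records and the attribute names ("salary", "cab") are made up for illustration, and the entropy function is written for any number of classes, which reduces to the yes/no formula above in the two-class case.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_k p_k * log2(p_k) over the class proportions in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy(S) minus the size-weighted average entropy of the subsets split on attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical job-offer records (made up for illustration).
rows = [
    {"salary": "high", "cab": "yes"},
    {"salary": "high", "cab": "no"},
    {"salary": "low", "cab": "yes"},
    {"salary": "low", "cab": "no"},
]
labels = ["yes", "yes", "no", "no"]
gain_salary = information_gain(rows, labels, "salary")
gain_cab = information_gain(rows, labels, "cab")
```

In this toy data, splitting on salary separates the classes perfectly, while splitting on cab leaves each subset as mixed as before, so salary would be chosen for the root node.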
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini
index.
o The CART algorithm uses the Gini index to create splits, and it creates only binary
splits.
o Gini index can be calculated using the formula below:
Gini Index = 1 - ∑j (Pj)^2
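A short sketch of the Gini computation, assuming Pj is the proportion of class j among the labels at a node. The size-weighted score for a binary split is the usual way to compare candidate splits. The function names and label lists are illustrative.

```python
from collections import Counter

def gini_index(labels):
    """Gini = 1 - sum_j p_j^2, where p_j is the proportion of class j in labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_of_split(left_labels, right_labels):
    """Size-weighted Gini index of a binary split (lower is better)."""
    n = len(left_labels) + len(right_labels)
    left_part = (len(left_labels) / n) * gini_index(left_labels)
    right_part = (len(right_labels) / n) * gini_index(right_labels)
    return left_part + right_part

pure = gini_index(["yes", "yes", "yes"])
mixed = gini_index(["yes", "no"])
split = gini_of_split(["yes", "yes"], ["yes", "no", "no", "no"])
```

A pure node has Gini index 0, and an evenly mixed two-class node has 0.5, the maximum for two classes.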

Pruning: Getting an Optimal Decision tree
Pruning is the process of deleting unnecessary nodes from a tree in order to get the optimal
decision tree.
A tree that is too large increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. A technique that decreases the size of the learned
tree without reducing accuracy is known as pruning. There are mainly two types of
tree pruning techniques:
o Cost Complexity Pruning
o Reduced Error Pruning
Advantages of the Decision Tree
o It is simple to understand, as it follows the same process that a human follows when
making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o It requires less data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
o A decision tree can contain many layers, which makes it complex.
o It may have an overfitting issue, which can be mitigated using the Random Forest
algorithm.
o With more class labels, the computational complexity of the decision tree may increase.
9. Evaluation Measures:
Whenever we build Machine Learning models, we need some form of metric to measure the
goodness of the model. Bear in mind that the “goodness” of the model could have multiple
interpretations, but generally when we speak of it in a Machine Learning context we are talking of
the measure of a model's performance on new instances that weren’t a part of the training data.
Determining whether the model being used for a specific task is successful depends on 2 key
factors:
1. Whether the evaluation metric we have selected is the correct one for our problem
2. Whether we are following the correct evaluation process
These notes focus only on the first factor: selecting the correct evaluation metric.
Different Types of Evaluation Metrics
The evaluation metric we decide to use depends on the type of NLP task that we are doing. To
further add, the stage the project is at also affects the evaluation metric we are using. For instance,
during the model building and deployment phase, we’d more often than not use a different
evaluation metric to when the model is in production. In the first 2 scenarios, ML metrics would
suffice but in production, we care about business impact, therefore we’d rather use business
metrics to measure the goodness of our model.
With that being said, we could categorize evaluation metrics into 2 buckets.
• Intrinsic Evaluation: focuses on intermediary objectives (i.e. the performance of an
NLP component on a defined subtask)
• Extrinsic Evaluation: focuses on the performance of the final objective (i.e. the
performance of the component on the complete application)
Stakeholders typically care about extrinsic evaluation since they’d want to know how good the
model is at solving the business problem at hand. However, it’s still important to have intrinsic
evaluation metrics in order for the AI team to measure how they are doing. We will be focusing
more on intrinsic metrics for the remainder of these notes.
Classification metrics:
Defining the Metrics
Some common intrinsic metrics to evaluate NLP systems are as follows:
Accuracy
Accuracy measures how often the model's predictions match the known (gold) labels. It is
therefore typically used in instances where the output variable is categorical or
discrete, namely a classification task.
• accuracy – just the percentage of correctly predicted instances
accuracy = (correctly predicted instances / all instances) × 100%
• correctly predicted = identical with human decisions as stored in manually annotated data
• an example – a part-of-speech tagger
• an expert (trained annotator) chooses the correct POS value for each word in a corpus,
• the annotated data is split into two parts
• the first part is used for training a POS tagger
• the trained tagger is applied on the second part
• POS accuracy = the ratio of tokens with correctly predicted POS values to all tokens in the second part
Precision
In instances where we are concerned with how exact the model's predictions are, we would use
precision. Precision tells us what fraction of the instances that the classifier labeled
as positive are actually positive: precision = TP / (TP + FP).
– if our system gives a prediction, what is its average quality?
Recall
Recall measures how well the model can recall the positive class, i.e. what fraction of the actually
positive instances the model identified as positive: recall = TP / (TP + FN).
– what average proportion of all possible correct answers is given by our system?
F1 Score
Precision and Recall are complementary metrics that have an inverse relationship. If both are of
interest to us then we’d use the F1 score to combine precision and recall into a single metric.
• F-measure = weighted harmonic mean of P and R
• (note: if two quantities are to be weighted,
we need just 1 weighting parameter X, and the other one can be computed as 1-X)
F_β = ((β^2 + 1) ⋅ precision ⋅ recall) / (β^2 ⋅ precision + recall)
With β = 1 this gives the F1 score: F1 = 2 ⋅ precision ⋅ recall / (precision + recall).
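The four metrics can be computed together from gold and predicted labels. This is a sketch for a single positive class; the label names ("spam", "ham"), the example predictions, and the function name classification_metrics are assumptions for illustration, and accuracy is returned as a fraction rather than a percentage.

```python
def classification_metrics(gold, predicted, positive="spam", beta=1.0):
    """Accuracy, precision, recall and F-beta for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_beta": f_beta}

gold      = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
metrics = classification_metrics(gold, predicted)
```

In this example there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 while accuracy is 6/8.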
