Notes of NLP - Unit-2
UNIT-2
1.Language Model :
Models that assign probabilities to sequences of words are called language models (LMs).
• The simplest language model that assigns probabilities to sentences and sequences of words is
the n-gram.
• An n-gram is a sequence of N words:
– A 1-gram (unigram) is a single word, like “please” or “turn”.
– A 2-gram (bigram) is a two-word sequence, like “please turn”, “turn your”, or “your homework”.
– A 3-gram (trigram) is a three-word sequence, like “please turn your” or “turn your homework”.
• We can use n-gram models to estimate the probability of the last word of an n-gram given the
previous words, and also to assign probabilities to entire word sequences.
Example: P(the man from jupiter) = P(the) P(man|the) P(from|the man) P(jupiter|the man from)
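As a minimal sketch of the definition above, the n-grams of a token list can be read off with a sliding window. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration only.

```python
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace splitting is a simplification of real tokenization.
tokens = "please turn your homework".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```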
Conditional Probabilities:
The chain rule shows the link between computing the joint probability of a sequence and
computing the conditional probability of a word given previous words.
• Definition of Conditional Probabilities:
P(B|A) = P(A,B) / P(A)
P(A,B) = P(A) P(B|A)
• Conditional Probabilities with More Variables:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
• Chain Rule:
P(w1… wn ) = P(w1 ) P(w2 |w1 ) P(w3 |w1w2 ) … P(wn |w1…wn-1 )
To compute the exact probability of a word given a long sequence of preceding words is difficult
(sometimes impossible).
• We are trying to compute
P(wn |w1…wn-1 )
which is the probability of seeing wn after seeing w1…wn-1.
• We may try to compute
P(wn |w1…wn-1 )
exactly as follows:
P(wn |w1…wn-1 ) = count(w1…wn-1wn ) / count(w1…wn-1 )
• There are too many possible sentences, and we may never see enough data to estimate these probability values.
• So, we need to compute P(wn |w1…wn-1 ) approximately.
2.N-gram Model
The intuition of the n-gram model (simplifying assumption):
– instead of computing the probability of a word given its entire history, we can approximate the
history by just the last few words.
P(wn |w1…wn-1 ) ≈ P(wn ) unigram
P(wn |w1…wn-1 ) ≈ P(wn |wn-1 ) bigram
P(wn |w1…wn-1 ) ≈ P(wn |wn-2wn-1 ) trigram
P(wn |w1…wn-1 ) ≈ P(wn |wn-3wn-2wn-1 ) 4-gram
P(wn |w1…wn-1 ) ≈ P(wn |wn-4wn-3wn-2wn-1 ) 5-gram
• In general, the N-Gram approximation is
P(wn |w1…wn-1 ) ≈ P(wn |wn-N+1…wn-1 )
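The bigram case can be sketched with relative-frequency counts, P(wn | wn-1) = count(wn-1 wn) / count(wn-1), multiplied along the sentence as the chain rule suggests. The toy corpus and the `<s>` / `</s>` sentence-boundary markers below are assumptions added for illustration; they do not come from these notes.

```python
from collections import Counter
from typing import List

# Hypothetical toy corpus; <s> and </s> are assumed boundary markers.
corpus = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]
sentences = [line.split() for line in corpus]

unigram_counts = Counter(w for sent in sentences for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in sentences for i in range(len(sent) - 1)
)

def bigram_prob(prev: str, word: str) -> float:
    """Relative-frequency estimate: count(prev word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(tokens: List[str]) -> float:
    """Bigram approximation of the chain rule: product of P(w_i | w_{i-1})."""
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sentence_prob("<s> i am sam </s>".split()))
```

Any sentence containing a bigram that never occurs in the corpus gets probability zero under this estimate.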
In general, an n-gram model is an insufficient model of a language because languages have long-distance dependencies.
– “The computer(s) which I had just put into the machine room is (are) crashing.” Here the verb must agree with “computer(s)”, which lies well outside a bigram or trigram window.
– But we can still effectively use N-Gram models to represent languages.
• Which N-Gram should be used as a language model?
– With a bigger N, the model will be more accurate.
• But we may not get good estimates for the N-Gram probabilities.
• The N-Gram tables will be more sparse.
– With a smaller N, the model will be less accurate.
• But we may get better estimates for the N-Gram probabilities.
• The N-Gram table will be less sparse.
– In practice, we rarely go higher than trigrams, and often no higher than bigrams.
– How big are N-Gram tables with 10,000 words?
• Unigram – 10,000
• Bigram – 10,000*10,000 = 100,000,000
• Trigram – 10,000*10,000*10,000 = 1,000,000,000,000
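The table sizes above follow from V^n possible n-grams over a vocabulary of V words. A short check, with the helper name `table_size` chosen only for this sketch:

```python
def table_size(vocabulary_size: int, n: int) -> int:
    """Number of distinct n-grams that can be formed from the vocabulary."""
    return vocabulary_size ** n

for n, name in [(1, "Unigram"), (2, "Bigram"), (3, "Trigram")]:
    print(f"{name}: {table_size(10_000, n):,} possible entries")
```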
N-gram and Markov model
The assumption that the probability of a word depends only on the previous word(s) is called
Markov assumption.
• Markov models are the class of probabilistic models that assume that we can predict the
probability of some future unit without looking too far into the past.
• A bigram is called a first-order Markov model (because it looks one token into the past);
• A trigram is called a second-order Markov model;
• In general, an N-Gram is called an (N-1)th-order Markov model.
4. Smoothing
To keep a language model from assigning zero probability to unseen events (n-grams that never occur in the training data), we’ll have to shave off a bit of probability mass from some more frequent events and give it to the events we’ve never seen.
• This modification is called smoothing (or discounting).
• There are many ways to do smoothing, and some of them are:
– Add-1 smoothing (Laplace Smoothing)
– Add-k smoothing,
– Backoff
Add-1 smoothing (Laplace Smoothing)
The simplest way to do smoothing is to add one to all the counts, before we normalize them into
probabilities.
– All the counts that used to be zero will now have a count of 1, the counts of 1 will be 2, and so
on.
• This algorithm is called Laplace smoothing (Add-1 Smoothing).
• We pretend that we saw each word one more time than we did, and we just add one to all the counts.
Laplace Smoothing for Unigrams:
• The unsmoothed maximum likelihood estimate of the unigram probability of the word wi is its count ci normalized by the total number of word tokens N:
P(wi ) = ci / N
• Laplace smoothing adds one to each count. Since there are V words in the vocabulary and each
one was incremented, we also need to adjust the denominator to take into account the extra V
observations.
PLaplace(wi ) = (ci + 1) / (N + V)
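A small sketch of the two unigram estimates above, P(wi) = ci / N and PLaplace(wi) = (ci + 1) / (N + V). The token list and the vocabulary (which deliberately contains the unseen word "dog") are made-up illustration data.

```python
from collections import Counter

# Hypothetical data: "dog" is in the vocabulary but never occurs in the text.
tokens = "the cat sat on the mat".split()
vocabulary = {"the", "cat", "sat", "on", "mat", "dog"}

counts = Counter(tokens)
N = len(tokens)      # total number of word tokens
V = len(vocabulary)  # vocabulary size

def p_mle(word: str) -> float:
    """Unsmoothed estimate: c_i / N."""
    return counts[word] / N

def p_laplace(word: str) -> float:
    """Add-1 estimate: (c_i + 1) / (N + V)."""
    return (counts[word] + 1) / (N + V)

for word in sorted(vocabulary):
    print(word, round(p_mle(word), 3), round(p_laplace(word), 3))
```

The unseen word moves from probability 0 to 1 / (N + V), and the smoothed probabilities still sum to 1 over the vocabulary.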
5. Classification and categorization of Text Data :
Text Classification is the process of labeling or organizing text data into groups. It forms a
fundamental part of Natural Language Processing. In the digital age that we live in we are
surrounded by text on our social media accounts, in commercials, on websites, Ebooks, etc. The
majority of this text data is unstructured, so classifying this data can be extremely useful.
Similar to a classification algorithm that has been trained on a tabular dataset to predict a class,
text classification also uses supervised machine learning. The fact that text is involved in text
classification is the main distinction between the two.
Applications
Text classification has practical use cases across many industries. For example, a spam filter is a common application that uses text classification to sort emails into spam and non-spam categories.
There are mainly two types of text classification systems: rule-based and machine learning-based text classification.
1)Rule-based text classification
Rule-based techniques use a set of manually constructed language rules to categorize text into categories or groups. These rules tell the system to classify text into a particular category based on the content of a text by using semantically relevant textual elements. Each rule consists of an antecedent (a pattern) and a predicted category.
For example, imagine you have a large number of news articles, and your goal is to assign them to relevant categories such as Sports, Politics, Economy, etc.
With a rule-based classification system, you manually review a few documents to come up with linguistic rules like this one:
If the document contains words such as money, dollar, GDP, or inflation, it belongs to the Economy group (class).
Rule-based systems can be refined over time and are understandable to humans. However, there
are certain drawbacks to this strategy.
These systems, to begin with, demand in-depth expertise in the field. They take a lot of time
since creating rules for a complicated system can be difficult and frequently necessitates
extensive study and testing.
Given that adding new rules can alter the outcomes of the pre-existing rules, rule-based systems
are also challenging to maintain and do not scale effectively.
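A rule-based classifier of the kind described above can be sketched as a table of trigger words per category. The categories' keyword sets, the `classify` helper, and the "most hits wins" tie-breaking are all hypothetical choices for illustration, not a prescribed design.

```python
import re
from typing import Dict, Set

# Hypothetical hand-written rules: category -> trigger words.
RULES: Dict[str, Set[str]] = {
    "Sports": {"match", "goal", "team", "league"},
    "Politics": {"election", "parliament", "vote", "minister"},
}

def classify(document: str, default: str = "Unknown") -> str:
    """Return the category whose trigger words occur most often in the text."""
    words = re.findall(r"[a-z]+", document.lower())
    hits = {
        category: sum(word in triggers for word in words)
        for category, triggers in RULES.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else default

print(classify("The team scored a late goal in the match"))
print(classify("Parliament will vote after the election"))
```

Adding a new category means editing `RULES` by hand, which is where the maintenance cost mentioned above comes from.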
2)Machine learning-based text classification
We can use machine learning to train models on large sets of text data to predict categories of
new text. To train models, we need to transform text data into numerical data – this is known
as feature extraction. Important feature extraction techniques include bag of words and n-
grams.
There are several useful machine learning algorithms we can use for text classification. These notes cover Naive Bayes, Support Vector Machines, and Decision Trees.
Like any other supervised machine learning task, text classification has two phases: training and prediction.
Training phase
A supervised machine learning algorithm is trained on a labeled dataset during the training phase. At the end of this process, we get a trained model that we can use to obtain predictions (labels) on new and unseen data.
Prediction phase
Once a machine learning model is trained, it can be used to predict labels on new and unseen
data. This is usually done by deploying the best model from an earlier phase as an API on the
server.
Feature Extraction
The two most common methods for extracting features from text, that is, for converting text data (strings) into numeric features that a machine learning model can be trained on, are Bag of Words (a.k.a. CountVectorizer) and TF-IDF.
Bag of Words
A bag of words (BoW) model is a simple way of representing text data as numeric features. It
involves creating a vocabulary of known words in the corpus and then creating a vector for each
document that contains counts of how often each word appears.
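The bag-of-words construction can be sketched in a few lines of plain Python: build the vocabulary, then count each vocabulary word per document. The two sample documents and the alphabetical ordering of the vocabulary are assumptions for this illustration.

```python
from collections import Counter
from typing import List

# Hypothetical mini-corpus.
documents = ["the cat sat", "the dog sat on the cat"]
tokenized = [doc.split() for doc in documents]

# Vocabulary of known words; sorting just fixes the column order.
vocabulary = sorted({word for doc in tokenized for word in doc})

def bow_vector(tokens: List[str]) -> List[int]:
    """Count how often each vocabulary word appears in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(doc) for doc in tokenized]
print(vocabulary)
print(vectors)
```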
TF-IDF
TF-IDF stands for term frequency-inverse document frequency, and it is another way of representing text as numeric features. It overcomes some shortcomings of the Bag of Words (BoW) model.
The TF-IDF model differs from the bag of words model in that it weights the frequency of a word in the document by its inverse document frequency, which is high for words that appear in few documents of the corpus. Words common to almost every document are therefore down-weighted, so the TF-IDF model is more likely to identify the important words in a document than the bag of words model.
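One common variant of the weighting, assumed here for illustration, is tf = (count of term in document) / (document length) and idf = log(number of documents / number of documents containing the term). Libraries often use slightly different smoothing, so the numbers below are specific to this sketch.

```python
import math
from typing import List

# Hypothetical mini-corpus.
documents = ["the cat sat", "the dog sat", "the bird flew"]
tokenized = [doc.split() for doc in documents]

def tf(term: str, doc: List[str]) -> float:
    """Term frequency: share of the document's tokens equal to the term."""
    return doc.count(term) / len(doc)

def idf(term: str, docs: List[List[str]]) -> float:
    """Inverse document frequency: log(N / df), or 0 if the term is unseen."""
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term: str, doc: List[str], docs: List[List[str]]) -> float:
    return tf(term, doc) * idf(term, docs)

for term in ["the", "sat", "cat"]:
    print(term, round(tf_idf(term, tokenized[0], tokenized), 3))
```

In this corpus "the" occurs in every document and gets weight 0, while "cat" occurs in only one document and gets the highest weight.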
In the Naive Bayes classifier, Bayes' rule gives the probability of a class c given a document d:
P( c | d ) = [ P( d | c ) x P( c ) ] / [ P( d ) ]
The class mapping for a given document is the class which has the maximum value of the above
probability.
Since all probabilities have P( d ) as their denominator, we can eliminate the denominator, and simply compare the different values of the numerator:
P( c | d ) ∝ P( d | c ) x P( c )
Now, what do we mean by the term P( d | c ) ?
Let’s represent the document as a set of features (words or tokens) x1, x2, x3, …
We can then re-write P( d | c ) as:
P( x1, x2, x3, … , xn | c )
=> P( c ) is the total (prior) probability of a class: how often does this class occur in total?
E.g. in the case of classes positive and negative, we would be calculating the probability that any
given review is positive or negative, without actually analyzing the current input document.
This is calculated by counting the relative frequencies of each class in a corpus.
E.g. out of 10 reviews we have seen, 3 have been classified as positive.
=> P ( positive ) = 3 / 10
Now let’s go back to the first term in the Naive Bayes equation:
P( d | c ), or P( x1, x2, x3, … , xn | c ).
We use some assumptions to simplify the computation of this probability:
• The Bag of Words assumption => assume the position of the words in the document doesn’t matter.
• Conditional Independence => assume the feature probabilities P( xi | c ) are independent given the class c.
It is important to note that neither of these assumptions is actually correct: of course, the order of words matters, and the words are not independent. A phrase like “this movie was incredibly terrible” shows how both of these assumptions fail in regular English.
However, these assumptions greatly simplify the complexity of calculating the classification
probability. And in practice, we can calculate probabilities with a reasonable level of accuracy
given these assumptions.
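So, under these assumptions:
P( x1, x2, x3, … , xn | c ) = P( x1 | c ) x P( x2 | c ) x P( x3 | c ) x … x P( xn | c )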
Then we multiply the result by P( c ) for the current class. We do this for each of our classes, and
choose the class that has the maximum overall value.
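The procedure can be sketched as follows: estimate P( c ) from class frequencies, estimate each P( xi | c ) from word counts, multiply, and take the class with the maximum score. Two choices here are assumptions added for the sketch and are not stated in the derivation above: add-1 smoothing of the word probabilities, and summing logarithms instead of multiplying raw probabilities. The training reviews are made up.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training set of (text, label) pairs.
training_data = [
    ("great fun film", "positive"),
    ("boring dull film", "negative"),
    ("dull plot boring acting", "negative"),
]

class_counts = Counter(label for _, label in training_data)
word_counts = defaultdict(Counter)
for text, label in training_data:
    word_counts[label].update(text.split())
vocabulary = {word for counts in word_counts.values() for word in counts}

def log_score(text: str, label: str) -> float:
    """log P(c) + sum of log P(x_i | c), with add-1 smoothing (assumed)."""
    score = math.log(class_counts[label] / len(training_data))
    total = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocabulary:
            continue  # skip words never seen in training
        score += math.log(
            (word_counts[label][word] + 1) / (total + len(vocabulary))
        )
    return score

def predict(text: str) -> str:
    """Choose the class with the maximum overall score."""
    return max(class_counts, key=lambda label: log_score(text, label))

print(predict("great fun"))
print(predict("boring dull"))
```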
“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used for both classification and regression. However, it is mostly used in classification problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features) with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane that best separates the two classes.
In simpler terms, support vectors are the individual observations that lie closest to the boundary, and the SVM classifier is the frontier (hyper-plane/line) that best segregates the two classes.
We can apply this algorithm to text classification and to sentiment analysis.
Sentiment Analysis is an NLP technique that we perform on text to determine whether the author’s attitude towards a particular topic, product, etc. is positive, negative, or neutral. Note that there are audio-based techniques as well, which we'll not cover here.
To run an SVM on text, we first convert each piece of text into a vector of numbers. In other words, which features are needed to classify texts using SVM?
The most common answer is word frequencies. This means that we treat a text as a bag of words,
and for every word that appears in that bag we have a feature. The value of that feature will be
how frequent that word is in the text. This method boils down to just counting how many times
every word appears in a text and dividing it by the total number of words. So in the
sentence “The soccer World Cup will be held in November and the world will be hooked to
soccer games of the highest quality”, which has 22 words, the word soccer has a frequency of
2/22 ≈ 0.09, and the word and has a frequency of 1/22 ≈ 0.05.
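Counting the example sentence directly shows how the word-frequency features are obtained. Splitting on whitespace gives 22 tokens, so "soccer" has frequency 2/22 and "and" has 1/22. Case-sensitive whitespace tokenization is an assumption of this sketch.

```python
from collections import Counter

sentence = (
    "The soccer World Cup will be held in November and the world "
    "will be hooked to soccer games of the highest quality"
)
tokens = sentence.split()  # simple whitespace tokenization (assumed)
counts = Counter(tokens)

# Relative frequency of each word: occurrences / total number of words.
frequencies = {word: count / len(tokens) for word, count in counts.items()}

print(len(tokens))
print(round(frequencies["soccer"], 3))
print(round(frequencies["and"], 3))
```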
Thus, every text in the dataset is represented as a vector with thousands (or tens of thousands) of dimensions, each one representing the frequency of one of the words of the text. This is fed into the SVM for training. Now that we have the feature vectors, the only thing left to do is to choose a kernel function for our model. A linear kernel is usually used for NLP classification tasks.
For the training, we take the labeled ('sentimented') texts, convert them to vectors using word
frequencies, and feed them to the SVM algorithm, which will use the chosen linear kernel
function, and so it produces a model. When we have a new unlabeled ('unsentimented') text that
we want to classify, we convert it into a vector and give it to the model, which will output the
sentiment of the text.
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking while making a decision, so they are easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; call each such final node a leaf node.
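The five steps can be sketched as a short recursive function. This is an ID3-style illustration that assumes information gain as the Attribute Selection Measure and categorical attributes only; the job-offer rows, the attribute names, and the nested-dict tree representation are hypothetical.

```python
import math
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, str]

def entropy(rows: List[Row], target: str) -> float:
    counts = Counter(row[target] for row in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def best_attribute(rows: List[Row], attributes: List[str], target: str) -> str:
    """Step 2: pick the attribute with the highest information gain."""
    def gain(attr: str) -> float:
        remainder = 0.0
        for value in {row[attr] for row in rows}:
            subset = [row for row in rows if row[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset, target)
        return entropy(rows, target) - remainder
    return max(attributes, key=gain)

def build_tree(rows: List[Row], attributes: List[str], target: str) -> Any:
    labels = [row[target] for row in rows]
    # Leaf node: all rows agree, or no attribute is left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)        # Step 2
    remaining = [a for a in attributes if a != attr]
    node: Dict[str, Dict[str, Any]] = {attr: {}}           # Step 4
    for value in sorted({row[attr] for row in rows}):      # Step 3
        subset = [row for row in rows if row[attr] == value]
        node[attr][value] = build_tree(subset, remaining, target)  # Step 5
    return node

# Step 1: the root node S holds the complete (hypothetical) dataset.
rows = [
    {"salary": "high", "distance": "near", "offer": "accept"},
    {"salary": "high", "distance": "near", "offer": "accept"},
    {"salary": "high", "distance": "far", "offer": "decline"},
    {"salary": "low", "distance": "near", "offer": "decline"},
    {"salary": "low", "distance": "near", "offer": "decline"},
    {"salary": "low", "distance": "far", "offer": "decline"},
]
tree = build_tree(rows, ["salary", "distance"], "offer")
print(tree)
```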
Example: Suppose there is a candidate who has a job offer and wants to decide whether to accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (Cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer).
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve this problem, there is a technique called Attribute Selection Measure, or ASM. With this measure, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
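o A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute having the highest information gain is split first. It can be calculated using the formula below:
Information Gain = Entropy(S) − [ (weighted average) x Entropy(each subset) ]
Where,
Entropy(S) = − Σ pi log2 ( pi ), S is the set of samples at the node, and pi is the proportion of samples in S that belong to class i.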
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o The CART algorithm uses the Gini index to create splits, and it creates only binary splits.
o Gini index can be calculated using the formula below:
Gini Index = 1 − Σj ( Pj )^2, where Pj is the proportion of samples that belong to class j.
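Both attribute selection measures can be computed from class counts. The sketch below assumes the standard definitions: entropy = − Σ p log2 p, information gain = parent entropy minus the size-weighted entropy of the subsets, and Gini index = 1 − Σ p². The label lists are made-up illustration data.

```python
import math
from collections import Counter
from typing import List, Sequence

def entropy(labels: Sequence[str]) -> float:
    total = len(labels)
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(labels).values()
    )

def gini(labels: Sequence[str]) -> float:
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent: Sequence[str], subsets: List[Sequence[str]]) -> float:
    """Parent entropy minus the size-weighted average entropy of the subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Hypothetical node with two classes, and two candidate splits.
parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
print(gini(parent), gini(perfect_split[0]))
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves each subset as mixed as the parent gains nothing.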
Pruning is the process of deleting unnecessary nodes from a tree in order to get the optimal decision tree.
A tree that is too large increases the risk of overfitting, and a small tree may not capture all the important features of the dataset. Therefore, a technique that decreases the size of the learning tree without reducing accuracy is known as pruning. There are mainly two types of tree pruning techniques in use.
Advantages of the Decision Tree:
o It is simple to understand, as it follows the same process that a human follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
9. Evaluation Measures:
Whenever we build Machine Learning models, we need some form of metric to measure the
goodness of the model. Bear in mind that the “goodness” of the model could have multiple
interpretations, but generally when we speak of it in a Machine Learning context we are talking of
the measure of a model's performance on new instances that weren’t a part of the training data.
Determining whether the model being used for a specific task is successful depends on 2 key
factors:
1. Whether the evaluation metric we have selected is the correct one for our problem
2. If we are following the correct evaluation process
In this article, I will focus only on the first factor: selecting the correct evaluation metric.
The evaluation metric we decide to use depends on the type of NLP task that we are doing. The stage the project is at also affects the evaluation metric we use. For instance, during the model building and deployment phases, we’d more often than not use a different evaluation metric than when the model is in production. In the first 2 phases, ML metrics would suffice, but in production we care about business impact, so we’d rather use business metrics to measure the goodness of our model.
With that being said, we could categorize evaluation metrics into 2 buckets: intrinsic evaluation and extrinsic evaluation.
Stakeholders typically care about extrinsic evaluation since they’d want to know how good the
model is at solving the business problem at hand. However, it’s still important to have intrinsic
evaluation metrics in order for the AI team to measure how they are doing. We will be focusing
more on intrinsic metrics for the remainder of this article.
Classification metrics:
Accuracy
Whenever the accuracy metric is used, we aim to learn the closeness of a measured value to a known value. It’s therefore typically used in instances where the output variable is categorical or discrete, namely a classification task. Accuracy is the ratio of correctly predicted instances to all instances.
• correctly predicted = identical with human decisions as stored in manually annotated data
• an example: a part-of-speech tagger
• an expert (trained annotator) chooses the correct POS value for each word in a corpus
• POS accuracy = the ratio of tokens with correctly predicted POS values to all tokens
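For the part-of-speech example, accuracy is simply the share of tokens whose predicted tag matches the annotator's tag. The tag sequences below are invented for illustration.

```python
from typing import Sequence

def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Ratio of positions where the prediction equals the gold annotation."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Hypothetical annotator tags vs. tagger output for a five-token sentence.
gold_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
predicted_tags = ["DET", "NOUN", "NOUN", "DET", "NOUN"]
print(accuracy(gold_tags, predicted_tags))
```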
Precision
In instances where we are concerned with how exact the model's predictions are, we would use Precision. The precision metric tells us what proportion of the instances that the classifier labeled as positive are actually positive.
Recall
Recall measures how well the model can recall the positive class (i.e. the proportion of actual positive instances that the model identified as positive).
• what average proportion of all possible correct answers is given by our system
F1 Score
Precision and Recall are complementary metrics that often trade off against each other. If both are of interest to us then we’d use the F1 score to combine precision and recall into a single metric.
• F-measure = weighted harmonic mean of P and R: F = 1 / ( X / P + (1 − X) / R )
• (note: if two quantities are to be weighted, we need just 1 weighting parameter X, and the other one can be computed as 1 − X; X = 0.5 gives the F1 score, F1 = 2PR / (P + R))
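Precision, recall, and the weighted F-measure can be computed from true positives, false positives, and false negatives. The sketch assumes the weighted harmonic mean takes the form F = 1 / (X / P + (1 − X) / R), so that X = 0.5 gives F1; the spam/ham labels are invented.

```python
from typing import Sequence, Tuple

def precision_recall_f(
    gold: Sequence[str],
    predicted: Sequence[str],
    positive: str,
    weight: float = 0.5,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F); weight is X in the harmonic mean."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    f_measure = 1.0 / (weight / precision + (1.0 - weight) / recall)
    return precision, recall, f_measure

# Hypothetical gold labels vs. classifier output.
gold = ["spam", "spam", "spam", "spam", "ham", "ham"]
predicted = ["spam", "spam", "ham", "ham", "spam", "ham"]
p, r, f1 = precision_recall_f(gold, predicted, positive="spam")
print(round(p, 3), round(r, 3), round(f1, 3))
```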