Unit 6 Natural Language Processing AI Class 10
Natural Language Processing (NLP) is one of the branches of AI that helps machines understand,
interpret, and manipulate human languages such as English or Hindi, and derive meaning from them.
NLP takes as its input the spoken words and verbal commands that humans use in their daily lives,
often captured through speech recognition software, and operates on this data.
Automatic Summarization
Today the internet is a huge source of information, which makes it difficult to find specific
information on the web.
Automatic summarization condenses a text while retaining its key information; applied to sources
such as social media, it can also capture the emotional meaning of the information.
It is also helpful for providing an overview of a blog post or news story, avoiding redundancy across
multiple sources and maximising the diversity of the content obtained.
For example: newsletters, social media marketing, video scripting etc.
Sentiment Analysis
Companies often need to identify opinions and sentiments expressed online to understand what
customers think about their products and services.
Sentiment analysis can also help gauge a brand's overall reputation.
It helps customers purchase products or services based on the opinions others have expressed, and it
helps businesses understand sentiment in context so they can serve customers better.
For example, a customer may write: “I like the new smartphone, but it has weak battery backup”.
Typical applications include brand monitoring, customer support analysis, customer feedback analysis,
market research etc.
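As an illustration beyond the syllabus, here is a minimal sketch of sentiment analysis using the TextBlob Python library (our choice of tool, not one the unit prescribes); the review text is the example above.

```python
# A minimal sentiment analysis sketch using the TextBlob library
# (install with: pip install textblob). Not part of the syllabus.
from textblob import TextBlob

review = "I like the new smartphone, but it has weak battery backup"
sentiment = TextBlob(review).sentiment

# polarity ranges from -1 (negative) to +1 (positive); a mixed review
# like this one tends to land somewhere near neutral.
print(sentiment.polarity, sentiment.subjectivity)
```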
Text Classification
It helps to assign a predefined category to a document and to organize documents in a way that helps
users find the information they want.
It also helps to simplify related activities, as the sketch below illustrates.
For example: spam filtering in email, auto-tagging on social media, categorization of news articles
etc.
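Here is a small sketch of a spam filter using scikit-learn (a library choice of ours, not something the unit prescribes); the training texts and labels are made up for illustration.

```python
# A tiny text classification (spam filtering) sketch with scikit-learn.
# The training texts and labels below are made-up illustrative data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now",
    "meeting at 10 am tomorrow",
    "free offer click here",
    "please review the attached report",
]
labels = ["spam", "ham", "spam", "ham"]

# bag-of-words features + Naive Bayes classifier in one pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["claim your free prize"]))  # expected: ['spam']
```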
Virtual Assistants
We use virtual assistants such as Google Assistant, Cortana, Siri, and Alexa daily.
We can talk to them, and they make our tasks easier and more comfortable.
They can be used for keeping notes, setting reminders for scheduled tasks, making calls, sending
messages etc.
They can also recognise the words spoken by the user.
Problem Scoping
The CBT (Cognitive Behavioural Therapy) technique is used to treat depression and stress. But what
about people who do not want to consult a psychiatrist? Let us look at this problem with the 4Ws
problem canvas.
Who Canvas
Who? People who suffer from stress and depression.
What do we know about them? People who are going through stress are reluctant to consult a psychiatrist.
What Canvas
What is the problem? People who need help are reluctant to consult a psychiatrist and hence continue living with their stress.
How do you know it is a problem? Studies around mental stress and depression are available from various authentic sources.
Where Canvas
What is the context/situation in which the stakeholders experience this problem? When they are going through a stressful period of time.
Why Canvas
What would be of key value to the stakeholders?
People get a platform where they can talk and vent out their feelings anonymously.
People get a medium that can interact with them and applies primitive CBT.
How would it improve their situation?
People would be able to vent out their stress.
They would consider going to a psychiatrist whenever required.
After the 4Ws canvas, the problem statement template is filled in: an ideal solution would provide
people with a platform to share their thoughts anonymously and suggest help whenever required. This
gives the goal:
“To create a chatbot which can interact with people, help them to vent out their feelings and
take them through primitive CBT.”
Data Acquisition
To understand people's sentiments, data must be collected; with this data, the machine can interpret
the words people use and understand their meaning.
Data can be collected through various means:
o Surveys
o Observing therapists' sessions
o Databases available on the internet
o Interviews
Data Exploration
Once the data is collected, it needs to be processed and cleaned so that a simpler version can be fed
to the machine.
The text is normalised through various steps and reduced to a minimal vocabulary, since the machine
does not require grammatically correct statements but only their essence.
Modelling
Once the text has been normalised, it is fed to an NLP-based AI model.
Note that in NLP, the data must go through this pre-processing before it can be fed to the machine.
Depending on the type of chatbot we are trying to make, there are many AI models available that help
us build the foundation of our project.
Evaluation
The trained model is then evaluated, and its accuracy is calculated on the basis of how relevant the
machine's answers are to the user's responses.
To judge the efficiency of the model, the answers suggested by the chatbot are compared to the actual
answers.
In the accompanying diagrams, the blue line shows the model's output while the green line is the
actual output, plotted along with the data samples.
1. Figure 1: The model's output does not match the true function at all. The model is said to be
underfitting and its accuracy is low.
2. Figure 2: The model's performance matches the true function well, which means the model has
optimum accuracy and is called a perfect fit.
3. Figure 3: The model tries to cover every data sample, even those out of alignment with the true
function. This model is said to be overfitting, and it too has low accuracy.
Once the model is evaluated thoroughly, it is then deployed in the form of an app that people can use
easily.
In the next section of Unit 6 Natural Language Processing AI Class 10 we are going to discuss the
multiple meanings of words. Here it is!
“His face turned red after he found out that he took the wrong bag.”
This statement is syntactically correct, but what exactly does it mean? Was he embarrassed, or angry?
Syntax alone cannot tell us.
In human language, a perfect balance of syntax and semantics is important for proper understanding.
Chatbot
A chatbot is one of the most common and popular applications of NLP.
There are lots of chatbots available, and many of them use the same approach.
Some of the popular chatbots are as follows:
o Mitsuku Bot
o Clever Bot
o Jabberwacky
o Haptik
o Rose
o Ochatbot
Types of chatbot
There are two types of chatbots:
1. Script-bot:
o Script-bots are very easy to make
o They work around the script programmed into them
o Free to use
o Easy to integrate into a messaging platform
o Little or no language-processing skill is required
o Offer limited functionality
o Story Speaker is an example of a script-bot
2. Smart-bot
o Flexible and powerful
o Work on bigger databases and other resources directly
o Learn with more data
o Coding is required to bring a smart-bot on board
o Wide functionality
o Google Assistant, Alexa, Siri, Cortana etc. are examples of smart-bots
In the next section of Unit 6 Natural Language Processing AI Class 10, we will discuss how human
language is simplified for the computer, starting with text normalization.
Text Normalization
As human languages are too complex, they need to be simplified before a machine can understand them.
Text normalization helps clean up the textual data in such a way that its complexity comes down to a
level lower than that of the raw data.
In text normalization, we follow several steps to normalise the text to a lower level.
First, we need to collect the text to be normalised.
Corpus
The entire textual data collected from all the documents taken together is known as a corpus.
To work on the corpus, these steps are required:
Sentence Segmentation
Sentence segmentation divides the corpus into sentences.
Each sentence is treated as a separate piece of data, so the corpus is now reduced to sentences.
Tokenisation
After sentence segmentation, each sentence is further divided into tokens.
A token is any word, number or special character occurring in a sentence.
Under tokenisation, every word, number and special character is considered separately, and each of
them becomes a separate token.
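Both steps can be sketched with the NLTK library (an assumption of ours; the unit does not prescribe a tool):

```python
# Sentence segmentation and tokenisation with NLTK.
# One-time setup: pip install nltk, then download the tokeniser models.
import nltk
nltk.download('punkt')

from nltk.tokenize import sent_tokenize, word_tokenize

corpus = "Divya and Rani are stressed. Rani went to a therapist."

sentences = sent_tokenize(corpus)               # sentence segmentation
tokens = [word_tokenize(s) for s in sentences]  # tokenisation

print(sentences)  # ['Divya and Rani are stressed.', 'Rani went to a therapist.']
print(tokens)     # every word and punctuation mark becomes a separate token
```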
Removing Stopwords, Special Characters and Numbers
Stopwords are words that occur very frequently in the corpus but do not add any value to it.
In human language, certain words are needed only for grammar and add no essence to the corpus.
Some examples of stopwords are: a, an, and, the, is, to, of.
Such words have little or no meaning in the corpus, hence they are removed so that the focus stays on
the meaningful terms.
Along with stopwords, the corpus may contain special characters and numbers. Sometimes these are
meaningful, sometimes not; for example, in email IDs the symbol @ and certain numbers are very
important. If the special characters and numbers are not meaningful, they can be removed just like
stopwords.
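A short sketch of this step, again assuming NLTK and its built-in English stopword list:

```python
# Removing stopwords, special characters and numbers with NLTK.
import nltk
nltk.download('stopwords')  # one-time download of the stopword list

from nltk.corpus import stopwords

tokens = ['Rani', 'went', 'to', 'a', 'therapist', '@', '10', 'am']
stop_words = set(stopwords.words('english'))

# keep only alphabetic tokens that are not stopwords
filtered = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(filtered)  # ['Rani', 'went', 'therapist']
```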
Stemming
In this step, the words are reduced to their root words.
Stemming is the process in which the affixes of words are removed and the words are converted to
their base form.
Note that in stemming, the stemmed words (words which we get after removing the affixes) might not
be meaningful.
In the example below, healed, healing and healer are all reduced to heal, but studies is reduced to
studi after affix removal, which is not a meaningful word.
Stemming does not take into account whether the stemmed word is meaningful or not; it just removes
the affixes, and hence it is faster.
Word      Affix removed   Stem
healed    -ed             heal
healer    -er             heal
studies   -es             studi
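A quick way to try this out is NLTK's Porter stemmer (one stemming algorithm among several; the exact output can differ slightly between algorithms):

```python
# Stemming with NLTK's PorterStemmer: affixes are stripped without
# checking whether the result is a meaningful word.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['healed', 'healing', 'studies']:
    print(word, '->', stemmer.stem(word))
# healed -> heal, healing -> heal, studies -> studi
```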
In the next section of Unit 6 Natural Language Processing AI Class 10 we are going to discuss
lemmatization. Here we go!
Lemmatization
It is an alternative to stemming.
It also removes affixes from the words in the corpus.
The difference between lemmatization and stemming is that the output of lemmatization is always a
meaningful word.
The final output is known as a lemma.
Lemmatization takes longer to execute than stemming.
The following table shows the process:
Word      Affix removed   Lemma
healed    -ed             heal
healer    -er             heal
studies   -es             study
Comparing the stemming and lemmatization tables, you will find that the word studies is converted
into studi by stemming, whereas the lemma is study.
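Here is a matching sketch with NLTK's WordNet lemmatizer (again our library choice; note it needs to be told the part of speech, otherwise every word is treated as a noun):

```python
# Lemmatization with NLTK's WordNetLemmatizer: the output (the lemma)
# is always a meaningful word.
import nltk
nltk.download('wordnet')  # one-time download of the WordNet data

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('studies'))          # study (treated as a noun)
print(lemmatizer.lemmatize('healed', pos='v'))  # heal  (treated as a verb)
```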
Bag of words
Bag of words is an NLP model that extracts features from text which can be used in machine learning
algorithms.
It gives us the occurrences of each word and builds the vocabulary of the corpus.
The bag of words algorithm works like this: the text we start from is a normalised corpus, obtained
after going through all the steps of text processing; the algorithm then collects all the words from
the processed text and returns the unique words along with their number of occurrences in the text
corpus.
Eventually, the bag of words returns us two things:
o A vocabulary of words
o The frequency of each word
The name “bag” of words symbolises that the sequence of the sentences or tokens does not matter here;
all we need are the unique words and their frequencies.
The bag of words algorithm has four steps:
1. Text Normalization
2. Create the dictionary
3. Create document vectors
4. Create document vectors for all documents
Text Normalization:
This step collects the data and pre-processes it. For example, take three documents with one sentence
each:
Document 1: Divya and Rani are stressed
Document 2: Rani went to a therapist
Document 3: Divya went to download a health chatbot
After text normalization, the text would be:
Document 1: [divya, and, rani, are, stressed]
Document 2: [rani, went, to, a, therapist]
Document 3: [divya, went, to, download, a, health, chatbot]
Note that no tokens have been removed in the stopword removal step.
This is because we have very little data, and since the frequency of all the words is almost the same,
no word can be said to have less value than another.
Dictionary:
In this step, the repeated words are written just once and we create a list of unique words:
divya, and, rani, went, are, stressed, to, a, therapist, download, health, chatbot
Document vectors:
For each document, we record how often each word of the dictionary occurs in it:

Document     divya  and  rani  went  are  stressed  to  a  therapist  download  health  chatbot
Document 1     1     1    1     0     1      1      0   0      0         0         0        0
Document 2     0     0    1     1     0      0      1   1      1         0         0        0
Document 3     1     0    0     1     0      0      1   1      0         1         1        1

The sketch after this table shows these steps in code.
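The four steps can be written out in plain Python with no libraries at all; the documents below are the normalised ones from this example:

```python
# Bag of words in plain Python: build the dictionary of unique words,
# then one document vector per document (steps 2-4 above).
documents = [
    ['divya', 'and', 'rani', 'are', 'stressed'],
    ['rani', 'went', 'to', 'a', 'therapist'],
    ['divya', 'went', 'to', 'download', 'a', 'health', 'chatbot'],
]

# Step 2: create the dictionary (each word written just once)
dictionary = []
for doc in documents:
    for word in doc:
        if word not in dictionary:
            dictionary.append(word)

# Steps 3-4: count every dictionary word in every document
vectors = [[doc.count(word) for word in dictionary] for doc in documents]

print(dictionary)
for vector in vectors:
    print(vector)  # one row of the document vector table (columns follow the dictionary order)
```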
To assign a numerical value to these tokens, TFIDF (Term Frequency and Inverse Document Frequency)
is used.
Term Frequency
Term frequency is the frequency of a word in one document; it helps to identify the value of each
word in that document.
Term frequency can easily be read off the document vector table, where the frequency of each word of
the vocabulary in each document is recorded (refer to the table above). The numbers represent the
frequency of each word:

divya  and  rani  went  are  stressed  to  a  therapist  download  health  chatbot
  1     1    1     0     1      1      0   0      0         0         0        0
  0     0    1     1     0      0      1   1      1         0         0        0
  1     0    0     1     0      0      1   1      0         1         1        1

Next comes the document frequency: the number of documents in which each word occurs.

divya  and  rani  went  are  stressed  to  a  therapist  download  health  chatbot
  2     1    2     2     1      1      2   2      1         1         1        1

In the table above you can observe that the words “divya”, “rani”, “went”, “to” and “a” have a
document frequency of 2, as they occur in two documents, while the rest of the terms occur in just
one document.
Now, for the Inverse Document Frequency, we put the document frequency in the denominator while the
total number of documents goes in the numerator. Here the total number of documents is 3, so:

divya  and  rani  went  are  stressed  to   a   therapist  download  health  chatbot
 3/2   3/1  3/2   3/2   3/1    3/1    3/2  3/2     3/1       3/1       3/1      3/1
The formula for any word W is:
TFIDF(W) = TF(W) * log( IDF(W) )
where TF(W) is the term frequency of W in a document and IDF(W) is the total number of documents
divided by the number of documents in which W occurs.
Here, the log is to the base 10. Don’t worry! You don’t need to calculate the log values by yourself.
Simply use the log function on a calculator and find out!
Applying the formula to every word of every document gives the TFIDF values:

Document     divya       and       rani        went        are       stressed  to          a           therapist  download  health    chatbot
Document 1   1*log(3/2)  1*log(3)  1*log(3/2)  0*log(3/2)  1*log(3)  1*log(3)  0*log(3/2)  0*log(3/2)  0*log(3)   0*log(3)  0*log(3)  0*log(3)
Document 2   0*log(3/2)  0*log(3)  1*log(3/2)  1*log(3/2)  0*log(3)  0*log(3)  1*log(3/2)  1*log(3/2)  1*log(3)   0*log(3)  0*log(3)  0*log(3)
Document 3   1*log(3/2)  0*log(3)  0*log(3/2)  1*log(3/2)  0*log(3)  0*log(3)  1*log(3/2)  1*log(3/2)  0*log(3)   1*log(3)  1*log(3)  1*log(3)
The higher the TFIDF value, the more important the word is to that document. For instance, in a corpus
of 10 documents, a word that occurs in all 10 documents has IDF = 10/10 = 1 and log(1) = 0, so it
carries no value. On the other hand, suppose the word ‘pollution’ occurs in only 3 of the 10
documents: IDF(pollution) = 10/3 = 3.3333… and log(3.3333) = 0.522, which shows that the word
‘pollution’ has considerable value in the corpus.
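Finally, a short sketch that reproduces the TFIDF numbers of the worked example (plain Python, base-10 log):

```python
# TFIDF(W) = TF(W) * log10(total documents / documents containing W)
import math

documents = [
    ['divya', 'and', 'rani', 'are', 'stressed'],
    ['rani', 'went', 'to', 'a', 'therapist'],
    ['divya', 'went', 'to', 'download', 'a', 'health', 'chatbot'],
]
dictionary = sorted({word for doc in documents for word in doc})
n_docs = len(documents)

for doc in documents:
    row = []
    for word in dictionary:
        tf = doc.count(word)                                # term frequency
        df = sum(1 for d in documents if word in d)         # document frequency
        row.append(round(tf * math.log10(n_docs / df), 3))  # TFIDF value
    print(row)
```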