
Unit 6 Natural Language Processing AI Class 10

Natural Language Processing is one of the branches of AI that helps machines understand, interpret, and manipulate human languages such as English or Hindi, so that they can analyse the text and derive its meaning.

NLP takes as input the data from spoken words, verbal commands or speech-recognition software that humans use in their daily lives, and operates on it.

Just open the browser, play the game, and then answer a few questions like:

1. Were you able to guess the animal?


2. If yes, in how many questions were you able to guess it?
3. If no, how many times did you try playing this game?
4. What according to you was the task of the machine?
5. Were there any challenges that you faced while playing this game? If yes, list them down.
6. What approach must one follow to win this game?

Automatic Summarization
 Today the internet is a huge source of information, which makes it difficult to locate specific information on the web.
 Summarization helps to extract the important information and also to understand the emotional meaning within the information, for example on social media.
 It can also provide an overview of a blog post or news story while avoiding redundancy from multiple sources and maximising the diversity of the content obtained.
 For example: newsletters, social media marketing, video scripting, etc.

Sentiment Analysis
 Companies need to identify the opinions and sentiments expressed online to help them understand customers' thoughts about their products and services.
 Sentiment analysis also helps to determine a brand's overall reputation.
 It helps customers decide whether to purchase a product or service based on the opinions expressed, and helps to understand the sentiment in context.
 For example, a customer writes "I like the new smartphone, but it has weak battery backup"; sentiment analysis is used in brand monitoring, customer support analysis, customer feedback analysis, market research, etc.

Text Classification
 It helps to assign predefined categories to a document and organise it in such a way that customers can find the information they want.
 It also helps to simplify document-related activities.
 For example: spam filtering in email, auto-tagging on social media, categorisation of news articles, etc.
Virtual Assistants
 We use Google Assistant, Cortana, Siri, and Alexa daily.
 We can talk to them, and they make our tasks easy and comfortable.
 They can be used for keeping notes, setting reminders for scheduled tasks, making calls, sending messages, etc.
 They can also identify the words spoken by the user.

Revisiting the AI project cycle – The scenario


 The world is competitive nowadays.
 People face competition in every task and are expected to give their best at all times.
 When people are not able to meet these expectations, they get stressed and could even go into depression.
 We get to hear of a lot of cases where people are depressed due to many reasons.
 These reasons include pressure, studies, family issues, and relationships.
 To overcome this, Cognitive Behavioural Therapy (CBT) is used; it is considered one of the best methods to address stress, as it is easy to implement and gives good results.
 CBT includes understanding the behaviour and mindset of a person in their normal life.

Problem Scoping
The CBT technique is used to treat depression and stress, but what about people who do not want to consult a psychiatrist? Let us try to look at this problem with the 4Ws problem canvas.

Who Canvas

Who? People who suffer from stress and depression.

What do we know about them? People who are going through stress are reluctant to consult a psychiatrist.

What Canvas

What is the problem? People who need help are reluctant to consult a psychiatrist and hence are not able to share their feelings.

How do you know it is a problem? Studies around mental stress and depression are available from various sources.

Where Canvas

What is the context/situation in which the stakeholders experience this problem? When they are going through stress or depression.

Why Canvas

What would be of key value to the stakeholders?
People get a platform where they can talk and vent out their feelings anonymously.
People get a medium that can interact with them and apply primitive CBT.

How would it improve their situation?
People would be able to vent out their stress.
They would consider going to a psychiatrist whenever required.

After the 4Ws canvas, the problem statement template is created, which is as follows:

Our: People undergoing stress

Have a problem of: Not being able to share their feelings

While: They need help in venting out their emotions

An ideal solution would: Provide them with a platform to share their thoughts anonymously and suggest help whenever required

This leads us to the goal of our project, which is:

“To create a chatbot which can interact with people, help them to vent out their feelings and
take them through primitive CBT.”

Data Acquisition
 To understand the sentiments of people, data has to be collected.
 Once the data is collected, the machine can interpret the words people use and understand their meaning.
 Data can be collected through various means:
o Surveys
o Observing therapists' sessions
o Databases available on the internet
o Interviews

Data Exploration
 Once the data is collected, it needs to be processed and cleaned.
 After processing, a simpler version of the text is obtained.
 The text is normalised through various steps and is reduced to a minimal vocabulary, since the machine does not require grammatically correct statements but only the essence of the text.

Modelling
 Once the text has been normalised, it is then fed to an NLP-based AI model.
 Note that in NLP, the data must go through pre-processing first; only after that is it fed to the machine.
 Depending upon the type of chatbot we are trying to make, there are a lot of AI models available which help us build the foundation of our project.
Evaluation
 The model trained is then evaluated and the accuracy for the same is generated on the basis of the
relevance of the answers that the machine gives to the user’s responses.
 To understand the efficiency of the model, the suggested answers by the chatbot are compared to the
actual answers.

In these accuracy diagrams, the blue line represents the model's output while the green one is the actual output, along with the data samples.

1. Figure 1: The model's output does not match the true function at all. Hence the model is said to be underfitting and its accuracy is low.
2. Figure 2: The model's performance matches well with the true function, which shows that the model has optimum accuracy; such a model is called a perfect fit.
3. Figure 3: The model tries to cover all the data samples, even those that are out of alignment with the true function. This model is said to be overfitting, and it too has low accuracy.

Once the model is evaluated thoroughly, it is then deployed in the form of an app that people can use
easily.

 On the other hand, the computer understands the language of numbers.


 Everything that is sent to the machine has to be converted to numbers.
 And while typing, if a single mistake is made, the computer throws an error and does not process that
part.
 The communications made by the machines are very basic and simple.
 Now, if we want the machine to understand our language, how should this happen?
 What are the possible difficulties a machine would face in processing natural language?

Arrangement of the words and meaning


 Every language has its own rules.
 In human language, we have nouns, verbs, adverbs, adjectives, etc.
 A word can be a noun at one time and a verb or an adjective at another time.
 The rules provide a structure to the language.
 Every language has its own syntax. Syntax refers to the grammatical structure of a sentence.
 Similarly, rules are required for a computer to process a natural language.
 To do so, part-of-speech tagging is used, which allows the computer to identify the different parts of speech (see the sketch after this list).
 Human language has multiple characteristics that are easy for a human to understand but extremely difficult for a computer to understand.
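The snippet below is a minimal sketch of part-of-speech tagging using the NLTK library (the choice of NLTK, the download calls and the sample sentence are illustrative assumptions, not something prescribed by the unit):

# A minimal part-of-speech tagging sketch using NLTK (an assumed library choice).
import nltk

# One-time downloads of tokenizer and tagger data; newer NLTK versions may
# ask for 'punkt_tab' / 'averaged_perceptron_tagger_eng' instead.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "The red car zoomed past his nose"
tokens = nltk.word_tokenize(sentence)

# pos_tag labels every token with its part of speech, e.g. DT = determiner,
# JJ = adjective, NN = noun, VBD = past-tense verb.
print(nltk.pos_tag(tokens))

Tags such as JJ and NN tell the machine that 'red' is acting as an adjective and 'car' as a noun in this sentence.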
Analogy with programming
 Different syntax, same semantics: 2+3 = 3+2
o Here the way these two statements are written is different, but their meanings are the same, that is, 5.
 Different semantics, same syntax: 3/2 (Python 2.7) ≠ 3/2 (Python 3)
o Here the statements have the same syntax but their meanings are different. In Python 2.7, this statement would result in 1, while in Python 3 it would give an output of 1.5.
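A quick way to see this difference is to run the division in Python 3 yourself; the floor-division operator is shown only to reproduce what Python 2.7 used to do:

# Same syntax, different semantics: what 3/2 means depends on the Python version.
print(3 / 2)   # Python 3: true division  -> 1.5
print(3 // 2)  # floor division -> 1, which is what 3/2 produced in Python 2.7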

In the next section of Unit 6 Natural Language Processing AI Class 10 we are going to discuss the
multiple meanings of words. Here it is!

Multiple meanings of the word


Let’s consider these three sentences:

“His face turned red after he found out that he took the wrong bag”

 What does this mean?


 Is he feeling ashamed because he took another person’s bag instead of his?
 Is he feeling angry because he did not manage to steal the bag that he has been targeting?

“The red car zoomed past his nose”

 Probably talking about the colour of the car

“His face turns red after consuming the medicine”

 Is he having an allergic reaction?


 Or is he not able to bear the taste of that medicine?
 Here we can see that context is important.
 We understand a sentence almost intuitively, depending on our history of using the language, and the
memories that have been built within.
 In all three sentences, the word 'red' has been used in three different ways; the context of the statement changes its meaning completely.
 Thus, in natural language, it is important to understand that a word can have multiple meanings, and the meaning that applies is decided by the context of the statement.

Perfect Syntax, no meaning


 Sometimes a sentence has perfect syntax but no meaning. For example,

“Chickens feed extravagantly while the moon drinks tea.”

 This statement is correct in syntax but does this make any sense?
 In human language, a perfect balance of syntax and semantics is important for better understanding.

Chatbot
 A chatbot is one of the most common and popular applications of NLP.
 There are lots of chatbots available, and many of them use the same basic approach.
 Some of the popular chatbots are as follows:
o Mitsuku Bot
o Clever Bot
o Jabberwacky
o Haptik
o Rose
o Chatbot

Answer the following questions:

 Which chatbot did you try? Name anyone.


 What is the purpose of this chatbot?
 How was the interaction with the chatbot?
 Did the chat feel like talking to a human or a robot? Why do you think so?
 Do you feel that the chatbot has a certain personality?

Types of chatbot
There are two types of chatbots:

1. Script-bot:
o Script-bots are very easy to make.
o They work around the script that is programmed into them.
o They are free to use.
o They are easy to integrate into a messaging platform.
o No or little language-processing skill is required.
o They offer limited functionality.
o Story Speaker is an example of a script-bot.
2. Smart-bot:
o Smart-bots are flexible and powerful.
o They work on bigger databases and other resources directly.
o They learn with more data.
o Coding is required to take them on board.
o They offer wide functionality.
o Google Assistant, Alexa, Siri, Cortana, etc. are examples of smart-bots.

In the next section of Unit 6 Natural Language Processing AI Class 10, we will discuss human
language vs computer language.

Human Language vs Computer Language


 Humans communicate through language which we process all the time.
 Our brain keeps on processing the sounds that it hears around itself and tries to make sense of them all
the time.
 Even in the classroom, as the teacher delivers the session, our brain is continuously processing
everything and storing it in someplace. Also, while this is happening, when your friend whispers
something, the focus of your brain automatically shifts from the teacher’s speech to your friend’s
conversation.
 So now, the brain is processing both the sounds but is prioritising the one on which our interest lies.
 The sound reaches the brain through a long channel.
 As a person speaks, the sound travels from his mouth and goes to the listener’s eardrum.
 The sound striking the eardrum is converted into neuron impulse, gets transported to the brain and then
gets processed.
 After processing the signal, the brain gains an understanding of its meaning.
 If the meaning is clear, the signal gets stored; otherwise, the listener asks the speaker for clarity.
 This is how human languages are processed by humans.
Data Processing
 Humans communicate with each other very easily.
 Natural language can be spoken and understood very conveniently and efficiently by humans.
 At the same time, it is very difficult and complex for computers to process it.
 The question now is how machines can understand and speak natural language just like humans.
 Since the computer understands only numbers, the basic step is to convert each word or letter into numbers.
 This conversion requires text normalization.

Text Normalization
 As human languages are very complex, they need to be simplified before a machine can make sense of them.
 Text normalization helps in cleaning up the textual data in such a way that its complexity comes down to a level lower than that of the actual data.
 In text normalization, we follow several steps to normalise the text to a lower level.
 First, we need to collect the text on which text normalization will be performed.

What does text normalization include?


It is the process of transforming a text into a canonical (standard) form. For example, the words 'Well' and 'Wel' can both be transformed to the canonical form 'well'.

Why is text normalization important?


During text normalization, we reduce the randomness in the text and bring it closer to a predefined standard. This reduces the amount of different information that the computer has to deal with, and therefore improves efficiency.

Corpus
 The entire textual data collected from all the documents together is known as the corpus.
 To work on the corpus, the following steps are required:

Sentence Segmentation
 Sentence segmentation divides the corpus into sentences.
 Each sentence is then treated as a separate piece of data, so the corpus gets reduced to a list of sentences.

Tokenisation
 After sentence segmentation, each sentence is further divided into tokens.
 A 'token' is the term used for any word, number or special character occurring in a sentence.
 Under tokenisation, every word, number and special character is considered separately, and each of them becomes a separate token (see the sketch after this list).
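The following is a minimal sketch of sentence segmentation and tokenisation using NLTK (the library choice and the download call are assumptions; the corpus is the three-document example used later in this unit):

# Sentence segmentation and tokenisation with NLTK (an assumed library choice).
import nltk
nltk.download('punkt')   # newer NLTK versions may require 'punkt_tab' as well

corpus = ("Divya and Rani are stressed. Rani went to a therapist. "
          "Divya went to download a health chatbot.")

# Sentence segmentation: the corpus is divided into sentences.
sentences = nltk.sent_tokenize(corpus)
print(sentences)

# Tokenisation: each sentence is further divided into tokens.
for sentence in sentences:
    print(nltk.word_tokenize(sentence))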
Removing Stopwords, Special Characters and Numbers
 Stop words are words that occur very frequently in the corpus but do not add any value to it.
 In human language, certain words are needed only for grammar and do not add any essence to the corpus.
 Some examples of stop words are: a, an, and, are, is, the, to, of, etc.

Such words have little or no meaning in the corpus, hence they are removed so that the focus stays on the meaningful terms.

Along with these stopwords, the corpus may also contain special characters and numbers. Sometimes these are meaningful, sometimes not. For example, in email ids the symbol @ and certain numbers are very important. Special characters and numbers that are not meaningful can be removed just like stopwords (a sketch of this step follows below).
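Here is a minimal sketch of this step using NLTK's built-in English stopword list (the library, the download call and the sample token list are illustrative assumptions):

# Removing stopwords, special characters and numbers (NLTK is an assumed choice).
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# A sample token list; '@' or the digits would be kept if they were meaningful,
# as in an email id.
tokens = ['Divya', 'and', 'Rani', 'are', 'stressed', ',', 'so', 'they', 'chat', 'at', '9', '!']

# Keep only alphabetic tokens that are not in the stopword list.
cleaned = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(cleaned)   # -> ['Divya', 'Rani', 'stressed', 'chat']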

Converting text to a common case


 The next step after removing stopwords is to convert the whole text into the same case.
 The most preferred case is lower case.
 This ensures that the machine's case sensitivity does not cause the same word to be treated as different words just because of different cases.
For example, the word 'hello' may be written in many different forms (Hello, HELLO, hello, HeLLo, and so on); once converted to lower case, all of them are treated as the same word by the machine.

Stemming
 In this step, the words are reduced to their root words.
 Stemming is the process in which the affixes of words are removed and the words are converted to their base form.
 Note that in stemming, the stemmed words (the words we get after removing the affixes) might not be meaningful.
 In the example below, 'healed', 'healing' and 'healer' are all reduced to 'heal', but 'studies' is reduced to 'studi' after affix removal, which is not a meaningful word.
 Stemming does not take into account whether the stemmed word is meaningful or not.
 It simply removes the affixes, and hence it is faster.

Word       Affix   Stem
healed     ed      heal
healing    ing     heal
healer     er      heal
studies    es      studi
studying   ing     study


In the next section of Unit 6 Natural Language Processing AI Class 10 we are going to discuss
lemmatization. Here we go!

Lemmatization
 It is an alternative process to stemming.
 It also removes the affixes from the words in the corpus.
 The only difference between lemmatization and stemming is that the output of lemmatization is always a meaningful word.
 The final output is known as a lemma.
 Lemmatization takes longer to execute than stemming.
 The following table shows the process.

Word       Affix   Lemma
healed     ed      heal
healing    ing     heal
healer     er      heal
studies    es      study
studying   ing     study

Compare the stemming and lemmatization tables: you will find that the word 'studies' is converted into 'studi' by stemming, whereas its lemma is 'study'.

Observe the following example to understand stemming and lemmatization:
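Since the original figure for this example is not reproduced here, the sketch below compares stemming and lemmatisation using NLTK's PorterStemmer and WordNetLemmatizer (an assumed library choice; exact outputs can vary slightly between stemmers):

# Stemming vs lemmatisation with NLTK (an assumed library choice).
import nltk
nltk.download('wordnet')   # word database needed by the WordNet lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['healed', 'healing', 'studies', 'studying']:
    # The stemmer only chops affixes, so its output may not be a real word;
    # the lemmatizer looks the word up (pos='v' treats it as a verb),
    # so the lemma is always meaningful.
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word, pos='v'))

# Typically 'studies' is stemmed to 'studi' but lemmatised to 'study'.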


After normalisation of the corpus, let’s convert the tokens into numbers. To do so the bag of
words algorithm will be used.

Bag of words
 A bag of words is an NLP model that extracts the features of the text, which can be helpful in machine learning algorithms.
 We count the occurrences of each word and develop the vocabulary for the corpus.
 The bag of words algorithm takes as input the normalised corpus obtained after going through all the steps of text processing.
 It collects all the words obtained from text processing and returns the unique words along with their occurrences in the text corpus.
 Eventually, the bag of words returns us two things:
o A vocabulary of words
o The frequency of those words
 The name "bag" of words symbolises that the sequence of sentences or tokens does not matter in this case, as all we need are the unique words and their frequencies.

The step-by-step process to implement the bag of words algorithm

The following steps should be followed to implement the bag of words algorithm:

1. Text Normalization
2. Create dictionary
3. Create document vectors
4. Create document vectors for all documents

Text Normalization:
This step collects the data and pre-processes it. For example:

Document 1: Divya and Rani are stressed


Document 2: Rani went to a therapist

Document 3: Divya went to download a health chatbot

The above example consists of three documents having one sentence each. After text normalization,
the text would be:

Document 1: [Divya, and, Rani, are, stressed]

Document 2: [Rani, went, to, a, therapist]

Document 3:[Divya, went, to, download, a, health, chatbot]

 Note that no tokens have been removed in the stopwords removal step.
 It is because we have very little data and since the frequency of all the words is almost the same, no
word can be said to have lesser value than the other.

Step 2 Create a Dictionary


To create the dictionary, write down all the words that occur in the three documents.

Dictionary:

Divya and Rani went are stressed

to a therapist download health chatbot

In this step, the repeated words are written just once and we create a list of unique words.

Step 3 Create a document vector


 In this step, write all the words of the dictionary in the top row.
 Then check each document for every word and write 1 (one) under the word if it is found, otherwise write 0 (zero).
 If the same word appears more than once in a document, increment the count of that word.
            Divya  and  Rani  went  are  stressed  to  a  therapist  download  health  chatbot
Document 1    1     1     1     0    1      1       0  0      0          0        0       0
Document 2    0     0     1     1    0      0       1  1      1          0        0       0
Document 3    1     0     0     1    0      0       1  1      0          1        1       1

In the above table,


1. The header row contains the vocabulary of the corpus.
2. Three rows represent the three documents.
3. Analyse the 1s and 0s.
4. This gives the document vector table for the corpus.
5. Finally, the tokens should be converted into numbers.
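The whole walkthrough above can be reproduced in a few lines of plain Python; the code below is a minimal sketch (the variable names and lower-cased tokens are just illustrative choices):

# Bag of words in plain Python: build the dictionary, then one vector per document.
docs = [
    ['divya', 'and', 'rani', 'are', 'stressed'],
    ['rani', 'went', 'to', 'a', 'therapist'],
    ['divya', 'went', 'to', 'download', 'a', 'health', 'chatbot'],
]

# Step 2: create the dictionary - each unique word written just once.
vocabulary = []
for doc in docs:
    for word in doc:
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 3 and 4: create a document vector for every document.
# Each entry counts how many times that vocabulary word occurs in the document.
vectors = [[doc.count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
for row in vectors:
    print(row)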

To convert these tokens into numbers that also reflect the importance of each word, TFIDF (Term Frequency and Inverse Document Frequency) is used.

TFIDF (Term Frequency and Inverse Document Frequency)


There are two terms in TFIDF, one is Term Frequency and another one is Inverse Document
Frequency.

Term Frequency
 Term frequency is the frequency of a word within one document, and it helps to identify the value of each word in that document.
 Term frequency can easily be read from the document vector table, as that table records the frequency of each vocabulary word in each document (refer to the table above).
 The numbers in the table represent the frequency of each word in each document.

Divya  and  Rani  went  are  stressed  to  a  therapist  download  health  chatbot
  1     1     1     0    1      1       0  0      0          0        0       0
  0     0     1     1    0      0       1  1      1          0        0       0
  1     0     0     1    0      0       1  1      0          1        1       1

Inverse Document Frequency


 Now let’s discuss Inverse Document Frequency.
 It refers to the number of documents in which the word occurs irrespective of how many times it has
occurred in those documents.
 For example,

Divya  and  Rani  went  are  stressed  to  a  therapist  download  health  chatbot
  2     1     2     2    1      1       2  2      1          1        1       1

 In the above table, you can observe that the words 'Divya', 'Rani', 'went', 'to' and 'a' have a document frequency of 2, as they occur in two documents.
 The rest of the terms occur in just one document.
 Now, for the inverse document frequency, put the document frequency in the denominator while the total number of documents is the numerator.
 For example,
Divya  and  Rani  went  are  stressed  to   a   therapist  download  health  chatbot
 3/2   3/1   3/2   3/2  3/1    3/1     3/2  3/2     3/1       3/1       3/1     3/1

 The formula is

TFIDF(W) = TF(W) * log( IDF(W) )

 Here, the log is to the base of 10. Don’t worry! You don’t need to calculate the log values by yourself.
Simply use the log function in the calculator and find out!

 Now, let’s multiply the IDF values to the TF values.


 Note that the TF values are for each document while the IDF values are for the whole corpus.
 Hence, we need to multiply the IDF values to each row of the document vector table.

            Divya       and         Rani        went        are         stressed    to          a           therapist   download    health      chatbot
Document 1  1*log(3/2)  1*log(3/1)  1*log(3/2)  0*log(3/2)  1*log(3/1)  1*log(3/1)  0*log(3/2)  0*log(3/2)  0*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)
Document 2  0*log(3/2)  0*log(3/1)  1*log(3/2)  1*log(3/2)  0*log(3/1)  0*log(3/1)  1*log(3/2)  1*log(3/2)  1*log(3/1)  0*log(3/1)  0*log(3/1)  0*log(3/1)
Document 3  1*log(3/2)  0*log(3/1)  0*log(3/2)  1*log(3/2)  0*log(3/1)  0*log(3/1)  1*log(3/2)  1*log(3/2)  0*log(3/1)  1*log(3/1)  1*log(3/1)  1*log(3/1)

 The final TFIDF values for each word in each document are as follows:

            Divya  and    Rani   went   are    stressed  to     a      therapist  download  health  chatbot
Document 1  0.176  0.477  0.176  0      0.477  0.477     0      0      0          0         0       0
Document 2  0      0      0.176  0.176  0      0         0.176  0.176  0.477      0         0       0
Document 3  0.176  0      0      0.176  0      0         0.176  0.176  0          0.477     0.477   0.477

 Finally, the words have been converted to numbers.
 These numbers are the TFIDF values of each word in each document.
 Here, you can see that since we have a small amount of data, even words like 'are' and 'and' have a high value.
 But as a word occurs in more and more documents, its IDF ratio falls and the value of that word decreases.
 That is, for example:

Total Number of documents: 10

Number of documents in which ‘and’ occurs: 10

Therefore, IDF(and) = 10/10 = 1

Which means: log(1) = 0.

Hence, the value of ‘and’ becomes 0.

On the other hand, suppose the number of documents in which 'Artificial' occurs is 3. Then IDF(Artificial) = 10/3 = 3.3333…

This means log(3.3333) = 0.522, which shows that the word 'Artificial' has considerable value in the corpus.
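To tie the whole TFIDF walkthrough together, here is a minimal plain-Python sketch that reproduces the values calculated above for the three-document corpus (the variable names are illustrative; the formula is the one given in this section):

# TFIDF(W) = TF(W) * log10(IDF(W)), where IDF(W) = total documents / documents containing W.
import math

docs = [
    ['divya', 'and', 'rani', 'are', 'stressed'],
    ['rani', 'went', 'to', 'a', 'therapist'],
    ['divya', 'went', 'to', 'download', 'a', 'health', 'chatbot'],
]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

for doc in docs:
    row = []
    for word in vocabulary:
        tf = doc.count(word)                                # term frequency in this document
        df = sum(1 for d in docs if word in d)              # documents containing the word
        row.append(round(tf * math.log10(n_docs / df), 3))  # TF * log10(IDF)
    print(row)

# For example, 'and' occurs in only one document, so its value is 1 * log10(3/1) = 0.477,
# while 'went' occurs in two documents, giving 1 * log10(3/2) ≈ 0.176.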
