NLP - Worksheet Solved

NLP Worksheet
Text processing, bag of words, tf-idf activity

Suppose you have obtained this information and would like to analyse it. Let’s start by making it ready for the
computer!
Corpus
Document 1: We can use health chatbots for treating stress.
Document 2: We can use NLP to create chatbots and we will be making health chatbots now!
Document 3: Health Chatbots cannot replace human counsellors now. Yay! >#< @!@!nLtAY
Step 1: Sentence Segmentation
No. Sentence
1 We can use health chatbots for treating stress.
2 We can use NLP to create chatbots and we will be making health chatbots now!
3 Health Chatbots cannot replace human counsellors now. Yay! >#< @!@!nLtAY
Step 2: Tokenization
Separate your sentences into tokens. How many tokens do you have?
Tokens
We, can, use, health, chatbots, for, treating, stress, ., We, can, use, NLP, to, create, chatbots,
and, we, will, be, making, health, chatbots, now, !, Health, Chatbots, cannot, replace, human,
counsellors, now, ., Yay, !, >#<, @!@!nLtAY
Number of tokens: 37
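The tokenization step can be sketched in a few lines of Python. This is only one possible rule, assumed here for illustration: a run of letters or digits is one token, and any other run of non-space characters (punctuation or symbols) is one token. Real tokenizers use more elaborate rules, and the names `corpus` and `tokenize` are made up for this sketch.

```python
import re

corpus = [
    "We can use health chatbots for treating stress.",
    "We can use NLP to create chatbots and we will be making health chatbots now!",
    "Health Chatbots cannot replace human counsellors now. Yay! >#< @!@!nLtAY",
]

def tokenize(text):
    # A run of word characters is one token; any other run of
    # non-space characters (punctuation, symbols) is also one token.
    return re.findall(r"\w+|\S+", text)

# Flatten the tokens of all three documents into one list.
tokens = [token for document in corpus for token in tokenize(document)]

print(tokens)
print("Number of tokens:", len(tokens))
```

Counting `len(tokens)` rather than counting by hand avoids miscounting repeated words such as "chatbots".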
Step 3: Remove stopwords, special characters, numbers
List out the stopwords, special characters, and numbers that you want to remove!
Stopwords, special characters, and numbers
we, can, use, for, to, and, we, will, be, now, ., !, >#<, @!@!nLtAY
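A minimal sketch of this filtering step, assuming the stopword list is exactly the one written above (real stopword lists are much longer and differ between libraries). The helper name `remove_noise` is hypothetical, and the sample tokens are those of Document 2.

```python
# Stopword list copied from the answer above (an assumption: library
# stopword lists are longer and vary from one library to another).
STOPWORDS = {"we", "can", "use", "for", "to", "and", "will", "be", "now"}

tokens = ["We", "can", "use", "NLP", "to", "create", "chatbots", "and",
          "we", "will", "be", "making", "health", "chatbots", "now", "!"]

def remove_noise(tokens):
    kept = []
    for token in tokens:
        # Drop anything that is not purely alphabetic: punctuation,
        # symbols, numbers, and mixed tokens such as "@!@!nLtAY".
        if not token.isalpha():
            continue
        # Compare in lower case so that "We" and "we" are both removed.
        if token.lower() in STOPWORDS:
            continue
        kept.append(token)
    return kept

cleaned = remove_noise(tokens)
print(cleaned)
```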
Step 4: Converting text to a common case

Which text do you need to modify? What is the modified form?
Modified form
health chatbots, nlp, create chatbots, making health chatbots, health chatbots cannot replace
human counsellors, yay
Step 5: Stemming
List out the stem words.
Stem words

health, chatbot, nlp, creat, chatbot, make, health, chatbot, health, chatbot, cannot, replac,
human, counsellor, yay
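To show what "chopping off" suffixes means, here is a deliberately tiny suffix-stripping stemmer. It is a toy written for this corpus, not the Porter algorithm or any library's stemmer; the rules, the length thresholds, and the names `toy_stem` and `ends_cvc` are all assumptions made for illustration.

```python
VOWELS = set("aeiou")

def ends_cvc(stem):
    # True when the stem ends consonant-vowel-consonant, e.g. "mak".
    if len(stem) < 3:
        return False
    c1, v, c2 = stem[-3], stem[-2], stem[-1]
    return (c1 not in VOWELS and v in VOWELS
            and c2 not in VOWELS and c2 not in "wxy")

def toy_stem(word):
    if word.endswith("ss"):          # "stress" stays as it is
        return word
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]             # "chatbots" -> "chatbot"
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]             # "making" -> "mak"
        if len(stem) <= 3 and ends_cvc(stem):
            return stem + "e"        # "mak" -> "make"
        return stem
    if word.endswith("e") and len(word) > 4:
        return word[:-1]             # "create" -> "creat"
    return word

words = ["health", "chatbots", "nlp", "create", "chatbots", "making",
         "health", "chatbots", "health", "chatbots", "cannot", "replace",
         "human", "counsellors", "yay"]
stems = [toy_stem(word) for word in words]
print(stems)
```

Stems such as "creat" and "replac" are not dictionary words, which is the usual trade-off of stemming: it is fast but crude.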
Step 6: Lemmatization
List out the root words/ lemma.
Lemma
health, chatbot, nlp, create, chatbot, make, health, chatbot, health, chatbot, can, replace,
human, counsellor, yay
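Unlike stemming, lemmatization maps each word to a real dictionary form, so it needs some knowledge of the vocabulary. The sketch below fakes that knowledge with a small hand-written lookup table; `LEMMA_TABLE` and `lemmatize` are hypothetical names, and a real lemmatizer would rely on a full dictionary and part-of-speech information instead.

```python
# Hypothetical hand-written lookup table, for illustration only.
LEMMA_TABLE = {
    "chatbots": "chatbot",
    "making": "make",
    "counsellors": "counsellor",
    "cannot": "can",
}

def lemmatize(word):
    # Words missing from the table are assumed to be lemmas already.
    return LEMMA_TABLE.get(word, word)

words = ["health", "chatbots", "nlp", "create", "chatbots", "making",
         "health", "chatbots", "health", "chatbots", "cannot", "replace",
         "human", "counsellors", "yay"]
lemmas = [lemmatize(word) for word in words]
print(lemmas)
```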
Final data
List out the final, processed data.
Processed data
Congratulations, you’ve managed to process the data!
Bag of words
Step 1: Collect data and process it
For this exercise, we can use the sentences without processing them, so that they are easier to read.
No. Sentence
1 We can use health chatbots for treating stress
2 We can use NLP to create chatbots and we will be making health chatbots now
3 Health chatbots cannot replace human counsellors now
Step 2: Create dictionary

Make a list of all the different words in the text.
Dictionary
dictionary = ['we', 'can', 'use', 'health', 'chatbots', 'for', 'treating', 'stress', 'nlp', 'to', 'create',
'and', 'will', 'be', 'making', 'now', 'cannot', 'replace', 'human', 'counsellors']
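The same dictionary can be built programmatically. This is a minimal sketch, assuming that words are split on spaces, compared in lower case, and listed in the order in which they first appear; `build_dictionary` is a name chosen for this example.

```python
sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

def build_dictionary(sentences):
    dictionary = []
    seen = set()
    for sentence in sentences:
        for word in sentence.lower().split():
            # Keep only the first occurrence of each word.
            if word not in seen:
                seen.add(word)
                dictionary.append(word)
    return dictionary

dictionary = build_dictionary(sentences)
print(dictionary)
print("Dictionary size:", len(dictionary))
```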
Step 3: Create document vectors

Use the next page to create your document vector!
document_vectors = [
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # sentence 1
[2, 1, 1, 1, 2, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0], # sentence 2
[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1] # sentence 3
]
Each vector has the same length as the dictionary, and each element corresponds to the
count of a word in the sentence. For example, the first element of the first vector is 1, which
means that the word ‘we’ appears once in the first sentence. The last element of the last vector
is 1, which means that the word ‘counsellors’ appears once in the last sentence.
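As a cross-check, the count vectors can be computed rather than filled in by hand. The sketch below assumes the same lower-casing and space-splitting as the dictionary step; `to_vector` is a hypothetical helper name.

```python
from collections import Counter

dictionary = ["we", "can", "use", "health", "chatbots", "for", "treating",
              "stress", "nlp", "to", "create", "and", "will", "be", "making",
              "now", "cannot", "replace", "human", "counsellors"]

sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

def to_vector(sentence, dictionary):
    # Count every word of the sentence, then read the counts off in
    # dictionary order; a Counter returns 0 for words it has not seen.
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in dictionary]

document_vectors = [to_vector(sentence, dictionary) for sentence in sentences]
for vector in document_vectors:
    print(vector)
```

Because the elements are counts, a word that occurs twice in a sentence (such as "we" and "chatbots" in sentence 2) gets the value 2.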
Tf-idf
You’ve obtained your bag of words. Now let’s continue with the tf-idf!
Step 1 - 3: Count the number of documents where the word appears at least once & write that
number down next to the word in your vocabulary to get your document frequency. Draw your
own table for this!
Example of a document frequency:
Word         Document frequency
aman         2
and          1
anil         2
are          1
stressed     1
went         2
to           2
a            2
therapist    1
download     1
health       1
chatbot      1

Your document frequency:
Word         Document 1  Document 2  Document 3  Document frequency
we           1           1           0           2
can          1           1           0           2
use          1           1           0           2
health       1           1           1           3
chatbots     1           1           1           3
for          1           0           0           1
treating     1           0           0           1
stress       1           0           0           1
nlp          0           1           0           1
to           0           1           0           1
create       0           1           0           1
and          0           1           0           1
will         0           1           0           1
be           0           1           0           1
making       0           1           0           1
now          0           1           1           2
cannot       0           0           1           1
replace      0           0           1           1
human        0           0           1           1
counsellors  0           0           1           1
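The document frequency of a word is the number of documents that contain it at least once, so repeated occurrences inside one document must be counted only once. A minimal sketch, assuming lower-casing and space-splitting; `document_frequency` is a name chosen for this example.

```python
sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

def document_frequency(sentences):
    df = {}
    for sentence in sentences:
        # set() keeps each word once per document, so "we" appearing
        # twice in sentence 2 still adds only 1 to its document frequency.
        for word in set(sentence.lower().split()):
            df[word] = df.get(word, 0) + 1
    return df

df = document_frequency(sentences)
for word in sorted(df):
    print(word, df[word])
```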
Step 4: Get your inverse document frequency.

Example of an inverse document frequency:
Word         Inverse document frequency
aman         3/2
and          3/1
anil         3/2
are          3/1
stressed     3/1
went         3/2
to           3/2
a            3/2
therapist    3/1
download     3/1
health       3/1
chatbot      3/1

Your inverse document frequency:
Word         Document 1  Document 2  Document 3
we           0.025       0.016       0
can          0.025       0.016       0
use          0.025       0.016       0
health       0           0           0
chatbots     0           0           0
for          0.068       0           0
treating     0.068       0           0
stress       0.068       0           0
nlp          0           0.043       0
to           0           0.043       0
create       0           0.043       0
and          0           0.043       0
will         0           0.043       0
be           0           0.043       0
making       0           0.043       0
now          0           0.016       0.016
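The inverse document frequency puts the total number of documents over the document frequency, and a logarithm is usually applied afterwards. The sketch below assumes a base-10 logarithm; other bases and smoothed variants are also in common use, so values from other sources may differ. The variable names are chosen for this example.

```python
import math

sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

# Document frequency: in how many documents does each word appear?
df = {}
for sentence in sentences:
    for word in set(sentence.lower().split()):
        df[word] = df.get(word, 0) + 1

total_documents = len(sentences)

# Inverse document frequency as a ratio, then after the log operation.
idf_ratio = {word: total_documents / count for word, count in df.items()}
idf_log = {word: math.log10(ratio) for word, ratio in idf_ratio.items()}

for word in ("we", "health", "for"):
    print(word, idf_ratio[word], round(idf_log[word], 3))
```

A word that appears in every document, such as "health", has a ratio of 3/3 and therefore a log value of 0: it does not help to tell the documents apart.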
Step 5: Get your tf-idf
Example of a tf-idf:
After log operation:
Your tf-idf:
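A hedged sketch of the final step: multiply each word's term frequency in a document by the log of its inverse document frequency. It assumes raw counts for the term frequency and a base-10 logarithm. Other conventions divide the count by the document length, so numbers produced that way will differ from these. All names in the block are chosen for this example.

```python
import math
from collections import Counter

sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

documents = [sentence.lower().split() for sentence in sentences]

# Document frequency of every word in the corpus.
df = Counter(word for document in documents for word in set(document))

def tf_idf(document, df, total_documents):
    counts = Counter(document)  # raw term frequency (assumed convention)
    return {word: count * math.log10(total_documents / df[word])
            for word, count in counts.items()}

tfidf = [tf_idf(document, df, len(documents)) for document in documents]

for number, scores in enumerate(tfidf, start=1):
    rounded = {word: round(score, 3) for word, score in scores.items()}
    print("Document", number, rounded)
```

Words found in only one document (for example "stress" or "nlp") get the highest scores, while words found in all three ("health", "chatbots") score 0.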
