NLP Manual (1-12)
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 1
Team Members :
1. Sanika S. Bhatye (Roll Number : 14)
2. Nachiket S. Gaikwad (Roll Number : 35)
3. Priyanka A. Gupta (Roll Number : 45)
Experiment No. : 2
Tokenization : Given a character sequence and a defined document unit, tokenization is the
task of chopping it up into pieces, called tokens. Tokens can be individual words, keywords,
phrases, symbols or even whole sentences. In the process of tokenization, some characters,
such as punctuation marks, may be discarded.
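To make the definition concrete, here is a minimal sketch of a rule-based tokenizer written with Python's standard re module. The pattern and the function names (TOKEN_PATTERN, simple_tokenize, drop_punctuation) are our own illustrative assumptions, not part of NLTK; NLTK's tokenizers use more elaborate rules.

```python
import re

# Hypothetical pattern: a word (optionally hyphenated or with an internal
# apostrophe), or any single character that is neither a word character
# nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

def drop_punctuation(tokens):
    """Discard tokens that contain no letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

tokens = simple_tokenize("The sky is pinkish-blue. You shouldn't eat cardboard.")
print(tokens)
print(drop_punctuation(tokens))
```

Keeping punctuation as separate tokens and discarding it in a second step leaves the choice to the caller.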
Filtration : Many of the words in a phrase carry little meaning on their own.
For example: English is a subject.
Here, ‘English’ and ‘subject’ are the most significant words, while ‘is’ and ‘a’ contribute almost nothing.
‘English subject’ conveys essentially the same meaning even after we remove the
insignificant words (‘is’, ‘a’).
Using NLTK, we can remove the insignificant words by looking at their part-of-speech
tags. For that, we have to decide which part-of-speech tags are significant.
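The filtration idea can be sketched as follows. The tokens are tagged by hand so that the snippet runs without any NLTK data; in practice the (word, tag) pairs would come from a tagger such as nltk.pos_tag. Treating only nouns and adjectives as significant is an assumption made for this sketch, and the names SIGNIFICANT_PREFIXES and filter_by_pos are hypothetical.

```python
# Penn Treebank tag prefixes treated as significant in this sketch
# (an assumption): nouns (NN*) and adjectives (JJ*).
SIGNIFICANT_PREFIXES = ("NN", "JJ")

def filter_by_pos(tagged_tokens, prefixes=SIGNIFICANT_PREFIXES):
    """Keep only the words whose tag starts with one of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefixes)]

# Hand-tagged version of "English is a subject."
tagged = [("English", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("subject", "NN"), (".", ".")]

significant = filter_by_pos(tagged)
print(significant)
```

Changing the prefix tuple (for example adding "VB" for verbs) changes which words survive the filter.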
Steps: Tokenization
a. In order to get started, we need the NLTK module, as well as Python.
b. Download the latest version of Python from python.org if you are on Windows or Mac. On
Debian- or Ubuntu-based Linux, you should be able to run apt-get install python3.
c. Next, we need NLTK 3. The easiest way to install the NLTK module is
with pip. For all users, that is done by opening up cmd.exe, bash, or whatever
shell you use and typing: pip install nltk
d. Next, we need to install some of the components for NLTK.
Open python via whatever means you normally do, and type:
import nltk
nltk.download()
Unless you are operating headless, the NLTK downloader GUI will pop up (probably showing
red instead of green status markers):
Choose to download "all" for all packages, and then click 'download.' This will give you all of
the tokenizers, chunkers, other algorithms, and all of the corpora. If space is an issue, you
can instead select and download only the packages you need. The NLTK module will take up about
7MB, and the entire nltk_data directory will take up about 1.8GB, which includes your
chunkers, parsers, and the corpora. If you are operating headless, like on a VPS, you can
install everything by running Python and doing:
import nltk
nltk.download()
d (for download)
all (for download everything)
Now that you have all the things that you need, let's knock out some quick vocabulary:
Lexicon - Words and their meanings.
Example: English dictionary.
Consider, however, that various fields will have different lexicons. For example: To
a financial investor, the first meaning for the word "Bull" is someone who is
confident about the market, as compared to the common English lexicon, where the
first meaning for the word "Bull" is an animal. As such, there is a special lexicon for
financial investors, doctors, children, mechanics, and so on.
Token - Each "entity" that is a part of whatever was split up based on rules.
For example, each word is a token when a sentence is "tokenized" into
words. Each sentence can also be a token, if you tokenize the sentences
out of a paragraph.
These are the words you will most commonly hear upon entering the Natural
Language Processing (NLP) space. With that, let's show an example of how one
might actually tokenize something into tokens with the NLTK module.
from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = ("Hello Mr. Smith, how are you doing today? The "
                "weather is great, and Python is awesome. The sky is "
                "pinkish-blue. You shouldn't eat cardboard.")
print(sent_tokenize(EXAMPLE_TEXT))
The above code will output a list of sentences, which
you can iterate through with a for loop.
So there, we have created tokens, which are sentences. Let's tokenize by word instead this time:
print(word_tokenize(EXAMPLE_TEXT))
Now our output is:
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.',
 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat',
 'cardboard', '.']
OUTPUT :
Experiment No. : 3
AIM : Apply various other text preprocessing techniques for any given text :
StopWord Removal, Lemmatization / Stemming.
THEORY :
Stop Word Removal : One of the major forms of pre-processing is to filter out
useless data. In NLP, useless words (data) are referred to as stop words.
Stemming : Stemming reduces a word to its root (stem). In other words, there is one root
word, but there are many variations of the same word. For example, the root word is "eat"
and its variations are "eats", "eating", "eaten" and so on. With the help of stemming, we
can find the root word of any variation.
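The idea of stripping suffixes to reach a root can be sketched in a few lines. This is a deliberately naive suffix stripper, not the Porter algorithm used later in this experiment; the suffix list, the minimum stem length and the name naive_stem are assumptions chosen so that the "eat" example works.

```python
# Suffixes tried in order, longest first (a tiny, hypothetical rule set).
SUFFIXES = ("ing", "en", "ed", "s")
# Refuse to strip if fewer than this many characters would remain.
MIN_STEM_LENGTH = 3

def naive_stem(word):
    """Strip the first matching suffix, if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word

for w in ["eat", "eats", "eating", "eaten"]:
    print(w, "->", naive_stem(w))
```

A rule set this small over-strips and under-strips many real words, which is why practical stemmers such as Porter's apply several ordered rule phases with extra conditions.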
Lemmatization : Lemmatization is a text normalization technique used in Natural Language
Processing (NLP). It has been studied for a very long time, and lemmatization algorithms have
been developed since the 1960s. Essentially, lemmatization reduces a word to its base
dictionary form, called the lemma. It groups the different inflected forms of a word
under that base form, which carries the same meaning.
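Unlike stemming, lemmatization relies on a vocabulary lookup, and the result can depend on the part of speech. The toy lookup below illustrates this; the dictionary entries and the name toy_lemmatize are illustrative assumptions, whereas a real lemmatizer such as NLTK's WordNetLemmatizer consults WordNet.

```python
# Toy lemma dictionary keyed by (word, part of speech); entries are illustrative.
LEMMAS = {
    ("geese", "n"): "goose",
    ("cacti", "n"): "cactus",
    ("better", "a"): "good",
    ("ran", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("geese"))            # looked up as a noun
print(toy_lemmatize("better"))           # as a noun: no entry, unchanged
print(toy_lemmatize("better", pos="a"))  # as an adjective
```

The default part of speech is noun here, so "better" is only mapped to "good" when the adjective tag is passed explicitly.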
Stop words can be removed easily by storing a list of words that you consider to be stop
words. NLTK starts you off with a list of words that it considers to be stop
words; you can access it via the NLTK corpus with:
>>> from nltk.corpus import stopwords
>>> set(stopwords.words('english'))
{'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during',
'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such',
'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each',
'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor',
'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above',
'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same',
'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what',
'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just',
'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if',
'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'}
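Here is how we might use the stop_words set to remove the stop words from a text:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Illustrative sentence; the original example text is not shown in this manual.
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
# Keep only the tokens that are not stop words (compared in lower case).
filtered_sentence = []
for w in word_tokens:
    if w.lower() not in stop_words:
        filtered_sentence.append(w)
print(word_tokens)
print(filtered_sentence)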
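Steps : Stemming
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Word list inferred from the output below; the original list is cut off in this manual.
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
Next, we can easily stem by doing something like:
for w in example_words:
    print(ps.stem(w))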
Our output:
python
python
python
python
pythonli
Now let's try stemming a typical sentence, rather than some words:
It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
.
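Steps : Lemmatization
Lemmatization is very similar to stemming. The major difference is that stemming can
often create non-existent words, whereas lemmas are actual words.
So, your root stem, meaning the word you end up with, is not something you can
just look up in a dictionary, but you can look up a lemma.
Sometimes you will wind up with a very similar word, but sometimes, you will
wind up with a completely different word. Let's see some examples.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()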
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))
OUTPUT :
Stopwords Removal -
Stemming -
Lemmatization -
Experiment No. : 4
Consider the word "unhappiness". There are three morphemes (un, happy and ness), each
carrying a certain amount of meaning. un means "not", while ness means "being in a state
or condition". Happy is a free morpheme because it can appear on its own (as a "word" in
its own right). Bound morphemes have to be attached to a free morpheme, and so cannot be
words in their own right. Thus, you cannot have sentences in English such as
"Jason feels very un ness today".
Inflection:
Inflection is the process of changing the form of a word so that it expresses
information such as number, person, case, gender, tense, mood and aspect,
but the syntactic category of the word remains unchanged. As an example, the
plural form of the noun in English is usually formed from the singular form by
adding an s.
• car / cars
• table / tables
• dog / dogs
In each of these cases, the syntactic category of the word remains unchanged.
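Regular plural inflection can be written as a small rule function. The text above only mentions adding an s; the extra "es" and "ies" rules are common spelling rules added here as an assumption, and the name regular_plural is hypothetical. Irregular nouns (for example "goose") are outside the scope of this sketch.

```python
def regular_plural(noun):
    """Form the regular English plural of a singular noun."""
    # Sibilant endings take "es": box -> boxes, church -> churches.
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Consonant + y becomes "ies": city -> cities (but day -> days).
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    # Default rule from the text: add an s.
    return noun + "s"

for singular in ["car", "table", "dog", "box", "city"]:
    print(singular, "/", regular_plural(singular))
```

The output is still a noun in every case, which is the defining property of inflection described above.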
Derivation:
As was seen above, inflection does not change the syntactic category of a
word. Derivation does change the category. Linguists classify derivation in
English according to whether or not it induces a change of pronunciation. For
instance, adding the suffix ity changes the pronunciation of the root of active
so the stress is on the second syllable: activity. The addition of the suffix al to
approve doesn't change the pronunciation of the root: approval.
Code POS tagging :
Result :
Code TextSimilar() :
Result :
Code Stemming :
Result :
Code Stemming :
Result :
Code Lemmatization :
Result :
Experiment No. : 5
THEORY :
N – Grams : The general idea is that we can look at each pair (or triple, set of
four, etc.) of words that occur next to each other. In a sufficiently large corpus,
we are likely to see "the red" and "red apple" several times, but less likely to
see "apple red" and "red the". This is useful to know if, for example, we are
trying to figure out what someone is more likely to say, to help decide between
possible outputs for an automatic speech recognition system. These co-occurring
words are known as "n-grams", where "n" is a number saying how
long a string of words we considered. (Unigrams are single words, bigrams are
two words, trigrams are three words, 4-grams are four words, 5-grams are five
words, etc.) In particular, NLTK has the ngrams function, which returns an
iterator over the n-grams of a tokenized sentence.
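The sliding-window idea behind n-grams can be sketched in plain Python. The helper name make_ngrams and the sample sentence are our own; nltk.util.ngrams offers the same behaviour for a tokenized sentence.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    # zip over n shifted copies of the token list: the i-th copy starts at token i.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the red apple is on the table".split()
unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
# Counting n-grams is the first step towards estimating how likely a word pair is.
print(Counter(bigrams).most_common(2))
```

A sentence of k tokens yields k - n + 1 n-grams, so the seven-token sentence above gives six bigrams and five trigrams.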
An n-gram tagger is a generalization of a unigram tagger whose context is the
current word together with the part-of-speech tags of the n-1 preceding
tokens.
Generating Unigrams :
Result:
Generating Bigrams :
Result:
Generating Trigrams :
Result:
Experiment No. : 6
In rule-based POS tagging, the rules may be either context-pattern rules or
regular expressions compiled into finite-state automata, intersected with a
lexically ambiguous sentence representation.
We can also understand Rule-based POS tagging by its two-stage architecture −
First stage − Uses a dictionary to assign each word
a list of potential parts-of-speech.
Second stage − Uses large lists of hand-written
disambiguation rules to narrow the list down to
a single part-of-speech for each word.
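The two-stage architecture can be sketched as follows. The lexicon, the three disambiguation rules and all function names are hypothetical and far smaller than in a real rule-based tagger; the point is only to show a dictionary lookup followed by hand-written context rules.

```python
# Stage 1: a hypothetical lexicon mapping each word to its possible tags.
LEXICON = {
    "the": ["DT"],
    "can": ["MD", "NN", "VB"],
    "rusted": ["VBD"],
    "i": ["PRP"],
    "fish": ["NN", "VB"],
}

def candidate_tags(word):
    """Look the word up in the lexicon; unknown words default to noun."""
    return LEXICON.get(word.lower(), ["NN"])

# Stage 2: hand-written rules that use the previous tag as context.
def disambiguate(prev_tag, candidates):
    if len(candidates) == 1:
        return candidates[0]
    # Rule 1: after a determiner, prefer a noun reading.
    if prev_tag == "DT" and "NN" in candidates:
        return "NN"
    # Rule 2: after a pronoun, prefer a modal, then a verb.
    if prev_tag == "PRP":
        for tag in ("MD", "VB"):
            if tag in candidates:
                return tag
    # Rule 3: after a modal, prefer a base-form verb.
    if prev_tag == "MD" and "VB" in candidates:
        return "VB"
    # No rule applies: fall back to the first listed tag.
    return candidates[0]

def rule_based_tag(words):
    tagged, prev_tag = [], None
    for word in words:
        tag = disambiguate(prev_tag, candidate_tags(word))
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

print(rule_based_tag(["The", "can", "rusted"]))
print(rule_based_tag(["I", "can", "fish"]))
```

The ambiguous word "can" receives a different tag in each sentence purely because of the tag that precedes it.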
Properties of Rule-Based POS Tagging : Rule-based POS taggers possess the
following properties −
These taggers are knowledge-driven taggers.
The rules in Rule-based POS tagging are built manually.
The information is coded in the form of rules.
The number of rules is limited, approximately 1000.
Smoothing and language modeling are defined explicitly in rule-based taggers.
Tag Sequence Probabilities - This is another approach to stochastic tagging,
where the tagger calculates the probability of a given sequence of tags
occurring. It is also called the n-gram approach, because the best tag
for a given word is determined by the probability with which it occurs with the n
previous tags.
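For n = 1 (one previous tag), the tag sequence probability reduces to a table of tag-to-tag transition probabilities. The sketch below estimates them by counting over a tiny hand-made set of tag sequences; the data and the names transition_prob and best_tag are assumptions for illustration, and a real tagger would also combine these with word likelihoods and smoothing.

```python
from collections import Counter

# Tiny hand-made tag sequences standing in for a tagged training corpus.
TAG_SEQUENCES = [
    ["DT", "NN", "VBZ", "DT", "NN"],
    ["DT", "JJ", "NN", "VBZ"],
    ["PRP", "VBZ", "DT", "NN"],
]

bigram_counts = Counter()   # how often (previous tag, current tag) occurs
context_counts = Counter()  # how often a tag occurs as the previous tag
for seq in TAG_SEQUENCES:
    for prev, cur in zip(seq, seq[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def transition_prob(prev, cur):
    """Maximum-likelihood estimate of P(cur | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / context_counts[prev]

def best_tag(prev, candidates):
    """Pick the candidate tag that most often follows the previous tag."""
    return max(candidates, key=lambda tag: transition_prob(prev, tag))

print(transition_prob("DT", "NN"))
print(best_tag("DT", ["NN", "VB"]))
```

In this toy corpus a determiner is followed by a noun in three of its four occurrences, so a noun reading wins after "DT".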
One of the more powerful aspects of the NLTK module is the Part of
Speech tagging that it can do for you. This means labeling words in a
sentence as nouns, adjectives, verbs, etc. Even more impressive, it
also labels by tense, and more.
CODE :
RESULT :
Experiment No. : 7
+ = match 1 or more repetitions
? = match 0 or 1 repetitions
* = match 0 or more repetitions
. = any character except a new line
The last thing to note is that the part-of-speech tags are denoted
with "<" and ">", and we can also place regular expressions within
the tags themselves, to account for things like "all nouns" (<N.*>).
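To see how these quantifiers act on tags, we can write the tag sequence as a string of "<TAG>" items and match an ordinary regular expression against it. This is only a sketch of the idea using the standard re module; the chunk rule, the sentences and the name find_chunks are hypothetical, and in NLTK this job is done by nltk.RegexpParser.

```python
import re

# Hypothetical chunk rule in the notation above: <RB.?>* <VB.?>* <NNP>+ <NN>?
# i.e. optional adverbs, optional verbs, one or more proper nouns, optional noun.
CHUNK_RULE = re.compile(r"(?:<RB.?>)*(?:<VB.?>)*(?:<NNP>)+(?:<NN>)?")

def find_chunks(tagged):
    """Return the word groups whose tag sequence matches CHUNK_RULE."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in CHUNK_RULE.finditer(tag_string):
        # Convert character offsets back to token indices by counting "<".
        start = tag_string[: match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

sentence_1 = [("President", "NNP"), ("George", "NNP"), ("Bush", "NNP"),
              ("addressed", "VBD"), ("the", "DT"), ("nation", "NN")]
sentence_2 = [("quickly", "RB"), ("met", "VBD"), ("Obama", "NNP"), ("today", "NN")]

print(find_chunks(sentence_1))
print(find_chunks(sentence_2))
```

In the first sentence only the run of proper nouns forms a chunk; in the second, the adverb, verb, proper noun and noun are all absorbed into one chunk.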
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
The main line here in question is:
Cool, that helps us visually, but what if we want to access this data
via our program? Well, what is happening here is that our "chunked"
variable is an NLTK tree. Each "chunk" and "non chunk" is a "subtree"
of the tree. We can reference these by doing something like
chunked.subtrees(). We can then iterate through these subtrees like
so:
Now, we're filtering to only show the subtrees with the label of
"Chunk." Keep in mind, this isn't "Chunk" as in the NLTK chunk
attribute... this is "Chunk" literally because that's the label we gave it
here:
RESULT :
CONCLUSION : Hence, we have successfully implemented the experiment on
Chunking.
Experiment No. : 8
THEORY : In any text document, there are particular terms that represent
specific entities that are more informative and have a unique context. These
entities are known as named entities, which more specifically refer to terms
that represent real-world objects like people, places, organizations, and so on,
which are often denoted by proper names. A naive approach could be to find
these by looking at the noun phrases in text documents. Named entity
recognition (NER), also known as entity chunking/extraction, is a popular
technique used in information extraction to identify and segment the named
entities and classify or categorize them under various predefined classes. One
of the major forms of chunking in NLP is called "Named
Entity Recognition." The idea is to have the machine immediately be able to pull
out "entities" like people, places, things, locations, monetary figures, and
more. This can be a bit of a challenge, but NLTK has this built in for us. There are
two major options with NLTK's named entity recognition: either recognize all
named entities, or recognize named entities as their respective type, like
people, places, locations, etc.
Here's an example:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
# Train an unsupervised Punkt sentence tokenizer on the 2005 speech.
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            # binary=True labels every named entity simply as NE.
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()
    except Exception as e:
        print(str(e))

process_content()
Immediately, you can see a few things. When binary is False, it
picked up the same things, but wound up splitting up terms like
White House into "White" and "House" as if they were different,
whereas with the binary=True option, the named
entity recognition was correct to say White House was part of the
same named entity. Depending on your goals, you may use the
binary option how you see fit. Here are the types of named entities
that you can get if you have binary as False:
RESULT :
Binary = true
Binary = false
CONCLUSION : Thus, we have successfully implemented Named Entity Recognition.
Experiment No. :
THEORY :
1. Clinical Documentation.
NLP-based clinical documentation frees clinicians from the laborious manual entry that EHRs
demand and lets them invest more time in the patient; this is how NLP can help doctors.
Both speech-to-text dictation and structured data entry have been a blessing.
Nuance and M*Modal offer technology that works in tandem with speech
recognition to capture structured data at the point of care, along with formalised
vocabularies for future use.
NLP technologies extract relevant data from speech recognition output, which will
considerably modify the analytical data used to run value-based care (VBC) and population
health management (PHM) efforts. This leads to better outcomes for clinicians. In the
future, NLP tools will be applied to various public data
sets and social media to determine Social Determinants of Health (SDOH) and the usefulness
of wellness-based policies.
2. Speech Recognition.
NLP has matured its use case in speech recognition over the years by allowing clinicians to
transcribe notes for useful EHR data entry. Front-end speech recognition lets physicians
dictate notes at the point of care instead of having to sit and type them, while back-end
technology works to detect and correct any errors in the transcription before passing it on
for human proofing.
The market is almost saturated with speech recognition technologies, but a few start-ups
are disrupting the space with deep learning algorithms in mining applications, uncovering
more extensive possibilities.
Implementing Predictive Analytics in Healthcare :
CONCLUSION : Thus, we have successfully curated a case study on the applications of NLP.
Experiment No. :
THEORY :
Abstract: Grammatical Error Correction (GEC) systems aim to correct grammatical mistakes
in the text. Grammarly is an example of such a grammar correction product. Error correction
can improve the quality of written text in emails, blogs and chats. The GEC task can be thought
of as a sequence-to-sequence task where a Transformer model is trained to take an
ungrammatical sentence as input and return a grammatically correct sentence.
Implementation:
1. Dataset:
For the training of our Grammar Corrector, we have used the C4_200M dataset
recently released by Google. This dataset consists of 200 million examples of
synthetically generated grammatical corruptions along with the correct text.
One of the biggest challenges in GEC is getting a good variety of data that simulates
the errors typically made in written language. If the corruptions are random, then
they would not be representative of the distribution of errors encountered in real
use cases.
To generate the corruption, a tagged corruption model is first trained. This model is
trained on existing datasets by taking as input a clean text and generating a
corrupted text. This is represented in the figure below:
For the C4_200M dataset, the authors first determined the relative distribution of the types
of errors encountered in written language. When generating the corruptions, the model
was conditioned on the type of error. As shown in the figure below, the corruption
model was conditioned to generate a determiner type error.
This allows the C4_200M dataset to have a diverse set of errors reflecting their
relative frequency in real-world applications. For the purpose of this project, we
extracted 550K sentences from C4_200M. The C4_200M dataset is available on TF
datasets. We extracted the sentences we needed and saved them as a CSV.
2. Model Training:
T5 is a text-to-text model, meaning it can be trained to go from input text of one
format to output text of another format. This model can be used for many different
objectives such as summarization and text classification; it can also be used to build a trivia
bot that retrieves answers from memory without any provided context.
T5 is preferred for a lot of tasks for a few reasons :
1. Can be used for any text-to-text task.
2. Good accuracy on downstream tasks after fine-tuning.
Steps:
Code:
import random

import nltk
import numpy as np
import pandas as pd
import torch
import datasets
from nltk.tokenize import sent_tokenize
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from transformers import (
    T5ForConditionalGeneration,
    T5Tokenizer,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
)

nltk.download('punkt')

def set_seed(seed):
    # Seed every random number generator used during training for reproducibility.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(42)

def calc_token_len(example):
    # Number of tokens produced for one example; `tokenizer` must already be
    # loaded (its creation is not shown in this excerpt).
    return len(tokenizer(example).input_ids)
train_df, test_df = train_test_split(df, test_size=0.10, shuffle=True)
train_df.shape, test_df.shape
# tokenize inputs
tokenized_inputs = tokenizer(input_,
                             pad_to_max_length=self.pad_to_max_length,
                             max_length=self.max_len,
                             return_attention_mask=True)
tokenized_targets = tokenizer(target_,
                              pad_to_max_length=self.pad_to_max_length,
                              max_length=self.max_len,
                              return_attention_mask=True)
inputs = {"input_ids": tokenized_inputs['input_ids'],
          "attention_mask": tokenized_inputs['attention_mask'],
          "labels": tokenized_targets['input_ids']}
return inputs

if self.print_text:
    for k in inputs.keys():
        print(k, len(inputs[k]))
return inputs
# defining training related arguments
batch_size = 16
args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/c4_200m/weights",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    save_total_limit=2,
    predict_with_generate=True,
    fp16=True,
    gradient_accumulation_steps=6,
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    logging_dir="/logs",
    report_to="wandb")
import nltk
nltk.download('punkt')
import numpy as np

def compute_metrics(eval_pred):
    # `tokenizer` and `rouge_metric` are assumed to be created earlier
    # (their creation is not shown in this excerpt).
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_metric.compute(predictions=decoded_preds,
                                  references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    # The rest of the function is missing from this excerpt; returning the
    # scores is the minimal completion.
    return result
    train_dataset=GrammarDataset(train_dataset, tokenizer),
    eval_dataset=GrammarDataset(test_dataset, tokenizer),
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics)
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'deep-learning-analytics/GrammarCorrector'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(torch_device)

def correct_grammar(input_text, num_return_sequences):
    batch = tokenizer([input_text], truncation=True, padding='max_length',
                      max_length=64, return_tensors="pt").to(torch_device)
    translated = model.generate(**batch, max_length=64, num_beams=4,
                                num_return_sequences=num_return_sequences,
                                temperature=1.5)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text
Output:
Applications:
1. Can be used for Grammar error correction specific applications like Grammarly.
2. Can be implemented in paraphrasing software and applications.
3. Can be included in document or content writing software like Microsoft Word, LibreOffice
and Google Docs.
Results:
By fine-tuning the T5 Transformer for grammar error correction on 550K sentences from C4_200M,
we achieved a ROUGE score of 80%.
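For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a prediction and a reference. The sketch below computes a simplified ROUGE-1 F-measure for a single pair; it is an illustration only, with the hypothetical name rouge1_f, and it omits the stemming and the aggregation over the evaluation set that the training code above relies on. Which ROUGE variant the reported score refers to is not stated in the results.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Simplified ROUGE-1 F-measure: unigram overlap on whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("she goes to school every day",
                 "she goes to the school every day")
print(round(score * 100, 2))
```

Here every predicted word appears in the reference (precision 1), but the reference contains one extra word (recall 6/7), so the F-measure falls slightly below 100.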
Conclusion:
In this project, we proposed a new strategy for the Grammar Error Correction system based
on Deep Learning, and the experimental results show that the proposed method is effective.
It makes full use of the advantages of Deep Learning.