NLP Manual (1-12)

Name :

Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 1

AIM : Study various applications of NLP and formulate the Problem Statement for a Mini Project based on a chosen real-world NLP application.

PROBLEM STATEMENT : The mini-project focuses on Grammatical Error Correction (GEC) systems. Grammar is the study of words and how they can be combined to construct sentences; it can also cover a word's pronunciation, meaning, and linguistic background, in addition to the language's inflections and word formation. Grammatical errors can cause numerous communication problems, including ones that negatively affect both personal and professional interactions. GEC systems work to fix grammatical errors in text; Grammarly is one example of such a grammar checker. Correcting these errors can raise the quality of writing in chats, blogs, and emails.

Team Members :
1. Sanika S. Bhatye (Roll Number : 14)
2. Nachiket S. Gaikwad (Roll Number : 35)
3. Priyanka A. Gupta (Roll Number : 45)

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 2

AIM : Program on Preprocessing of Text (Tokenization, Filtration, Script Validation).

THEORY : Text preprocessing is traditionally an important step for Natural Language Processing (NLP) tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better. The process of converting data to something a computer can understand is referred to as pre-processing. Following is the list of text preprocessing steps:

• Remove HTML tags.
• Remove extra whitespaces.
• Convert accented characters to ASCII characters.
• Expand contractions.
• Remove special characters.
• Lowercase all texts.
• Convert number words to numeric form.
• Remove numbers.
• Remove stop words.
• Lemmatization.
• Stemming.
• Script validation, etc. (sketched below)
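Script validation is part of the aim but is not demonstrated in the later steps, so here is a minimal sketch, assuming a simple Unicode-range check is sufficient (the ranges and the helper name are illustrative, not part of the original program):

import re

# Keep only tokens written entirely in the expected script.
# The Unicode ranges below are assumptions; adjust them for the script you need.
SCRIPT_RANGES = {
    "latin": r"^[A-Za-z]+$",
    "devanagari": r"^[\u0900-\u097F]+$",
}

def validate_script(tokens, script="latin"):
    pattern = re.compile(SCRIPT_RANGES[script])
    return [tok for tok in tokens if pattern.match(tok)]

print(validate_script(["Hello", "दुनिया", "world", "123"], script="latin"))
# expected: ['Hello', 'world']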

Tokenization : Given a character sequence and a defined document unit, tokenization is the
task of chopping it up into pieces, called tokens. Tokenization is the act of breaking up a
sequence of strings into pieces such as words, keywords, phrases, symbols and other
elements called tokens. Tokens can be individual words, phrases or even whole sentences.
In the process of tokenization, some characters like punctuation marks are discarded.

Filtration : Many of the words used in a phrase are insignificant and carry little meaning. For example, in the sentence "English is a subject", the words 'English' and 'subject' are the most significant, while 'is' and 'a' contribute almost nothing: "English subject" conveys the same meaning even after the insignificant words ('is', 'a') are removed.
Using NLTK, we can filter out such insignificant words by looking at their part-of-speech tags. For that we have to decide which part-of-speech tags are significant, as sketched below.
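A minimal sketch of such POS-tag-based filtration, assuming we keep only nouns and adjectives (the tag whitelist and the sample sentence are illustrative choices, not part of the original manual):

import nltk
from nltk.tokenize import word_tokenize

# Assumed whitelist: keep tokens whose POS tag starts with NN (nouns) or JJ (adjectives).
SIGNIFICANT_TAGS = ("NN", "JJ")

def filter_by_pos(sentence):
    tagged = nltk.pos_tag(word_tokenize(sentence))
    return [word for word, tag in tagged if tag.startswith(SIGNIFICANT_TAGS)]

print(filter_by_pos("English is a subject"))
# expected: ['English', 'subject']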

Steps: Tokenization
a. In order to get started, we need the NLTK module, as well as Python.
b. Download the latest version of Python if you are on Windows. If you are on Mac or Linux, you should be able to install it with apt-get install python3.
c. Next, we need NLTK 3. The easiest way to install the NLTK module is with pip. For all users, that is done by opening cmd.exe, bash, or whatever shell you use and typing: pip install nltk
d. Next, we need to install some of the components for NLTK.

Open python via whatever means you normally do, and type:

import nltk
nltk.download()

Unless you are operating headless, a GUI (the NLTK Downloader window) will pop up, probably with red entries instead of green:

Choose to download "all" for all packages, and then click 'download.' This will give you all of
the tokenizers, chunkers, other algorithms, and all of the corpora. If space is an issue, you
can elect to selectively download everything manually. The NLTK module will take up about
7MB, and the entire nltk_data directory will take up about 1.8GB, which includes your
chunkers, parsers, and the corpora. If you are operating headless, like on a VPS, you can
install everything by running Python and doing:

import nltk
nltk.download()

then, at the downloader prompt, type:

d (for download)
all (to download everything)

Now that you have all the things that you need, let's knock out some quick vocabulary:

Corpus - Body of text, singular. Corpora is the plural of this.


Example: A collection of medical journals.

Lexicon - Words and their meanings.
Example: English dictionary.
Consider, however, that various fields will have different lexicons. For example: To
a financial investor, the first meaning for the word "Bull" is someone who is
confident about the market, as compared to the common English lexicon, where the
first meaning for the word "Bull" is an animal. As such, there is a special lexicon for
financial investors, doctors, children, mechanics, and so on.

Token - Each "entity" that is a part of whatever was split up based on rules.
For example, each word is a token when a sentence is "tokenized" into
words. Each sentence can also be a token, if you tokenized the sentences
out of a paragraph.

These are the words you will most commonly hear upon entering the Natural
Language Processing (NLP) space. With that, let's show an example of how one
might actually tokenize something into tokens with the NLTK module.

from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

print(sent_tokenize(EXAMPLE_TEXT))

The above code will output the sentences, split up into a list of sentences, which
you can do things like iterate through with a for loop.

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]

So there, we have created tokens, which are sentences. Let's tokenize by word instead this time:

print(word_tokenize(EXAMPLE_TEXT))

Now our output is:

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']

CONCLUSION : Thus, we have successfully performed an experiment on pre-processing of text.

OUTPUT :

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 3

AIM : Apply various other text preprocessing techniques to any given text: Stop Word Removal, Lemmatization / Stemming.

THEORY :
Stop Word Removal : One of the major forms of pre-processing is to filter out
useless data. In NLP, useless words (data) are referred to as stop words.

What are Stop words?


Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”,
“in”) that a search engine has been programmed to ignore, both when indexing
entries for searching and when retrieving them as the result of a search query.

Stemming : Stemming is a kind of normalization for words. Normalization is a technique in which a set of words in a sentence is converted to a normalized form to shorten its lookup. Words which have the same meaning but vary according to the context or sentence are normalized.

In other words, there is one root word, but there are many variations of the same word. For example, the root word is "eat" and its variations are "eats, eating, eaten" and so on. In the same way, with the help of stemming, we can find the root word of any of these variations.

Lemmatization : Lemmatization is a text normalization technique used in Natural Language
Processing (NLP). It has been studied for a very long time and lemmatization algorithms have
been made since the 1960s. Essentially, lemmatization is a technique that switches any kind of word to its base root form. Lemmatization is responsible for grouping different inflected forms of a word into its root form, which has the same meaning.

Steps: Stop word removal.

We can do this easily by storing a list of words that we consider to be stop words. NLTK starts you off with a set of words that it considers to be stop words; you can access it via the NLTK corpus with:

from nltk.corpus import stopwords

Here is the list:

>>> set(stopwords.words('english'))

{'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such',
'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor',
'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same',
'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just',
'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'}

Here is how we might use the stop_words set to remove the stop words from our text:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# one-line version:
filtered_sentence = [w for w in word_tokens if not w in stop_words]

# equivalent loop version:
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

Our output here:

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']

Steps : Stemming

First, we're going to grab and define our stemmer:

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

Now, let's choose some words with a similar stem, like:

example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]  # word list inferred from the stemmer output shown below
Next, we can easily stem by doing something like:

for w in example_words:
    print(ps.stem(w))

Our output:

python
python
python
python
pythonli

Now let's try stemming a typical sentence, rather than some words:

new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w))

Now our result is:

It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
.

Steps : Lemmatization

A very similar operation to stemming is called lemmatizing. The major difference between these is, as you saw earlier, that stemming can often create non-existent words, whereas lemmas are actual words.

So, your root stem, meaning the word you end up with, is not something you can just look up in a dictionary, but you can look up a lemma.

Sometimes you will wind up with a very similar word, but sometimes, you will
wind up with a completely different word. Let's see some examples.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

OUTPUT :
Stopwords Removal -

Stemming -

Lemmatization -

CONCLUSION : Thus, we have successfully performed an experiment on various text processing techniques.

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 4

AIM : Program to demonstrate Morphological Analysis.

THEORY : Morphology is the study of the structure and formation of words. Its most important unit is the morpheme, which is defined as the "minimal unit of meaning".

In linguistics, morphology refers to the mental system involved in word formation, or to the branch of linguistics that deals with words, their internal structure, and how they are formed. Morphological analysis is essential for various automatic natural language processing applications.

Consider a word like "unhappiness". This has three parts: un + happy + ness.

There are three morphemes, each carrying a certain amount of meaning. un
means "not", while ness means "being in a state or condition". Happy is a free
morpheme because it can appear on its own (as a "word" in its own right).
Bound morphemes have to be attached to a free morpheme, and so cannot be
words in their own right. Thus, you cannot have sentences in English such as
"Jason feels very un ness today".

Inflection:
Inflection is the process of changing the form of a word so that it expresses
information such as number, person, case, gender, tense, mood and aspect,
but the syntactic category of the word remains unchanged. As an example, the
plural form of the noun in English is usually formed from the singular form by
adding an s.

• car / cars
• table / tables
• dog / dogs

In each of these cases, the syntactic category of the word remains unchanged.

Derivation:
As was seen above, inflection does not change the syntactic category of a
word. Derivation does change the category. Linguists classify derivation in
English according to whether or not it induces a change of pronunciation. For
instance, adding the suffix ity changes the pronunciation of the root of active
so the stress is on the second syllable: activity. The addition of the suffix al to
approve doesn't change the pronunciation of the root: approval.

Code POS tagging :
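The POS tagging code for this experiment was attached as a screenshot; a minimal sketch of what it might have looked like with NLTK (the sample sentence is illustrative):

import nltk
from nltk.tokenize import word_tokenize

sentence = "Unhappiness is formed from the free morpheme happy."
tokens = word_tokenize(sentence)

# pos_tag labels each token with its part of speech (NN, JJ, VBZ, ...).
print(nltk.pos_tag(tokens))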

Result :

Code TextSimilar() :
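This code was also attached as a screenshot. Assuming TextSimilar() refers to the similar() method of nltk.text.Text, which lists words that occur in contexts similar to a given word, a sketch could look like this:

import nltk
from nltk.corpus import gutenberg
from nltk.text import Text

# Build a Text object from a corpus shipped with NLTK and print distributionally similar words.
moby = Text(gutenberg.words('melville-moby_dick.txt'))
moby.similar("monstrous")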

Result :

Code Stemming :

Result :

Code Stemming :

Result :

Code Lemmatization :

Result :

CONCLUSION : Hence, we have successfully implemented the program to demonstrate Morphological Analysis.

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 5

AIM : Program to implement N-gram model.

THEORY :

N – Grams : The general idea is that we can look at each pair (or triple, set of
four, etc.) of words that occur next to each other. In a sufficiently-large corpus,
we are likely to see "the red" and "red apple" several times, but less likely to
see "apple red" and "red the". This is useful to know if, for example, we are
trying to figure out what someone is more likely to say to help decide between
possible output for an automatic speech recognition system. These co-
occurring words are known as "n-grams", where "n" is a number saying how
long a string of words we considered. (Unigrams are single words, bigrams are
two words, trigrams are three words, 4-grams are four words, 5-grams are five
words, etc.) In particular, nltk has the ngrams function, which returns a generator of n-grams given a tokenized sentence.

An n-gram tagger is a generalization of a unigram tagger whose context is the
current word together with the part-of-speech tags of the n-1 preceding
tokens.
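The unigram, bigram and trigram generation code below was attached as screenshots; a minimal sketch using nltk's ngrams utility (the sample sentence is an assumption):

import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

sentence = "Natural language processing makes machines understand human text"
tokens = word_tokenize(sentence)

# n = 1, 2, 3 gives unigrams, bigrams and trigrams respectively.
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(unigrams)
print(bigrams)
print(trigrams)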

Generating Unigrams :

Result:

Generating Bigrams :

Result:

Generating Trigrams :

Result:

CONCLUSION : Hence, we have successfully implemented the program to demonstrate the N-gram model.

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 6

AIM : Program to implement POS tagging.

THEORY : Tagging is a kind of classification that may be defined as the automatic assignment of descriptors to tokens. Here, the descriptor is called a tag, which may represent a part of speech, semantic information, and so on. POS tagging may be defined as the process of assigning one of the parts of speech to a given word.

Rule-based POS Tagging : One of the oldest techniques of tagging is rule-based POS tagging. Rule-based taggers use a dictionary or lexicon for getting possible tags for each word. If a word has more than one possible tag, rule-based taggers use hand-written rules to identify the correct tag. Disambiguation can also be performed in rule-based tagging by analyzing the linguistic features of a word along with its preceding and following words. For example, if the preceding word is an article, then the word must be a noun. As the name suggests, all such information in rule-based POS tagging is coded in the form of rules. These rules may be either:

• Context-pattern rules, or
• Regular expressions compiled into finite-state automata, intersected with a lexically ambiguous sentence representation.

We can also understand rule-based POS tagging by its two-stage architecture −
• First stage − uses a dictionary to assign each word a list of potential parts of speech.
• Second stage − uses large lists of hand-written disambiguation rules to narrow the list down to a single part of speech for each word.

Properties of Rule-Based POS Tagging : Rule-based POS taggers possess the following properties −
• These taggers are knowledge-driven taggers.
• The rules in rule-based POS tagging are built manually.
• The information is coded in the form of rules.
• There is a limited number of rules, approximately around 1000.
• Smoothing and language modeling are defined explicitly in rule-based taggers.

Stochastic POS Tagging : Another technique of tagging is stochastic POS tagging. A model that includes frequency or probability (statistics) can be called stochastic. Any number of different approaches to the problem of part-of-speech tagging can be referred to as stochastic tagging. The simplest stochastic taggers apply the following approaches to POS tagging:

Word Frequency Approach − In this approach, the stochastic taggers disambiguate words based on the probability that a word occurs with a particular tag. In other words, the tag encountered most frequently with the word in the training set is the one assigned to an ambiguous instance of that word, as sketched below. The main issue with this approach is that it may yield an inadmissible sequence of tags.
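A minimal sketch of this word-frequency approach, assuming NLTK's UnigramTagger (which assigns each word the tag it carries most often in the training corpus) is an acceptable stand-in:

import nltk
from nltk.corpus import treebank

# Train on tagged sentences from the Penn Treebank sample shipped with NLTK.
train_sents = treebank.tagged_sents()[:3000]
unigram_tagger = nltk.UnigramTagger(train_sents)

# Each word gets its most frequent training-set tag; unseen words get None.
print(unigram_tagger.tag(["The", "race", "is", "on"]))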

Tag Sequence Probabilities - It is another approach of stochastic tagging,
where the tagger calculates the probability of a given sequence of tags
occurring. It is also called n-gram approach. It is called so because the best tag
for a given word is determined by the probability at which it occurs with the n
previous tags.

Properties of Stochastic POS Tagging : Stochastic POS taggers possess the following properties −
• This POS tagging is based on the probability of a tag occurring.
• It requires a training corpus.
• There is no probability for words that do not exist in the corpus.
• It uses a different testing corpus (other than the training corpus).
• It is the simplest POS tagging because it chooses the most frequent tag associated with a word in the training corpus.

Transformation-based Tagging : Transformation-based tagging is also called Brill tagging. It is an instance of transformation-based learning (TBL), which is a rule-based algorithm for automatically tagging POS in a given text. TBL allows us to have linguistic knowledge in a readable form; it transforms one state to another state by using transformation rules. It draws inspiration from both of the previously explained taggers − rule-based and stochastic. Like rule-based tagging, it is based on rules that specify what tags need to be assigned to what words; like stochastic tagging, it is a machine learning technique in which rules are automatically induced from data.

Working of Transformation-Based Learning (TBL) : In order to understand the working and concept of transformation-based taggers, we need to understand the working of transformation-based learning. Consider the following steps to understand the working of TBL −
• Start with a solution − TBL usually starts with some solution to the problem and works in cycles.
• Most beneficial transformation chosen − in each cycle, TBL will choose the most beneficial transformation.
• Apply to the problem − the transformation chosen in the last step is applied to the problem.
• The algorithm stops when the selected transformation in step 2 no longer adds value or there are no more transformations to be selected. This kind of learning is best suited to classification tasks.

One of the more powerful aspects of the NLTK module is the Part of
Speech tagging that it can do for you. This means labeling words in a
sentence as nouns, adjectives, verbs, etc. Even more impressive, it
also labels by tense, and more.

CODE :
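The original code was attached as a screenshot; a minimal sketch in the style used elsewhere in this manual (the State of the Union corpus files are the same ones used in the chunking and NER experiments):

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

# Train a sentence tokenizer on one speech and apply it to another.
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for sentence in tokenized[:5]:
            words = nltk.word_tokenize(sentence)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()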

RESULT :

CONCLUSION : Hence, we have successfully implemented the program to demonstrate POS tagging.

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 7

AIM : Program to implement Chunking.

THEORY : Text chunking, also referred to as shallow parsing, is a task that follows part-of-speech tagging and adds more structure to the sentence. The result is a grouping of the words into "chunks". Chunk extraction, or partial parsing, is the process of extracting meaningful short phrases from a sentence that has been tagged with parts of speech. Chunks are made up of words, and the kinds of words are defined using part-of-speech tags. A chunking activity involves breaking down a difficult text into more manageable pieces and having students rewrite these "chunks" in their own words. Now that we know the parts of speech, we can do what is called chunking and group words into hopefully meaningful chunks. One of the main goals of chunking is to group words into what are known as "noun phrases": phrases of one or more words that contain a noun, maybe some descriptive words, maybe a verb, and maybe something like an adverb. The idea is to group nouns with the words that are in relation to them. In order to chunk, we combine part-of-speech tags with regular expressions. From regular expressions, we are mainly going to use the following:

+ = match 1 or more
? = match 0 or 1 repetitions.
* = match 0 or MORE repetitions
. = Any character except a new line

The last thing to note is that the part-of-speech tags are denoted with "<" and ">", and we can also place regular expressions within the tags themselves to account for things like "all nouns" (<N.*>).

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
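The remainder of the chunking routine appeared as a screenshot; a sketch of what it likely contained, following the chunkGram pattern discussed below (the number of sentences processed is an illustrative choice):

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for sentence in tokenized[:5]:
            words = nltk.word_tokenize(sentence)
            tagged = nltk.pos_tag(words)

            # Group adverbs/verbs followed by proper nouns into a chunk labelled "Chunk".
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
    except Exception as e:
        print(str(e))

process_content()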

The result of this is something like:

The main line here in question is:

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

This line, broken down:

<RB.?>* = "0 or more of any tense of adverb," followed by:

<VB.?>* = "0 or more of any tense of verb," followed by:

<NNP>+ = "One or more proper nouns," followed by

<NN>? = "zero or one singular noun."

Try playing around with combinations to group various instances until you feel comfortable with chunking. If you print the chunks out, you are going to see output like:

Cool, that helps us visually, but what if we want to access this data
via our program? Well, what is happening here is our "chunked"
variable is an NLTK tree. Each "chunk" and "non chunk" is a "subtree"
of the tree. We can reference these by doing something like
chunked.subtrees(). We can then iterate through these subtrees like
so:

for subtree in chunked.subtrees():
    print(subtree)

Next, we might only be interested in getting just the chunks, ignoring the rest. We can use the filter parameter in the chunked.subtrees() call.

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)

Now, we're filtering to only show the subtrees with the label of
"Chunk." Keep in mind, this isn't "Chunk" as in the NLTK chunk
attribute... this is "Chunk" literally because that's the label we gave it
here:

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

Had we said instead something like chunkGram = r"""Pythons: {<RB.?>*<VB.?>*<NNP>+<NN>?}""", then we would filter by the label of "Pythons." The result here should be something like:

RESULT :

CONCLUSION : Hence, we have successfully implemented the experiment on
Chunking.

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 8

AIM : Program to implement Named Entity Recognition.

THEORY : In any text document, there are particular terms that represent
specific entities that are more informative and have a unique context. These
entities are known as named entities, which more specifically refer to terms
that represent real-world objects like people, places, organizations, and so on,
which are often denoted by proper names. A naive approach could be to find
these by looking at the noun phrases in text documents. Named entity
recognition (NER), also known as entity chunking/extraction, is a popular
technique used in information extraction to identify and segment the named
entities and classify or categorize them under various predefined classes. One of the major forms of chunking in NLP is called "Named Entity Recognition". The idea is to have the machine immediately be able to pull out "entities" like people, places, things, locations, monetary figures, and more. This can be a bit of a challenge, but NLTK has this built in for us. There are two major options with NLTK's named entity recognition: either recognize all named entities, or recognize named entities as their respective type, like people, places, locations, etc.

Here's an example:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()
    except Exception as e:
        print(str(e))

process_content()

Here, with the option of binary = True, this means either something is a named entity, or not.

The result is:

If you set binary = False, then the result is:

Immediately, you can see a few things. When Binary is False, it
picked up the same things, but wound up splitting up terms like
White House into "White" and "House" as if they were different,
whereas we could see in the binary = True option, the named
entity recognition was correct to say White House was part of the
same named entity. Depending on your goals, you may use the
binary option how you see fit. Here are the types of Named Entities
that you can get if you have binary as false:

RESULT :

Binary = true

Binary = false

CONCLUSION : Hence, we have successfully implemented Named Entity Recognition.

CONCLUSION : Thus, we have successfully implemented EDA (Exploratory Data Analysis).

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. :

AIM : Case study on applications of NLP.

THEORY :

Topic: NLP in Healthcare.


NLP illustrates the way in which artificial intelligence techniques gather and assess unstructured data from human language to extract patterns, derive meaning, and compose feedback. This is helping the healthcare industry to make the best use of unstructured data. The technology enables providers to automate administrative work, invest more time in taking care of patients, and enrich the patient experience using real-time data.

Best Use Cases of NLP in Healthcare :

1. Clinical Documentation.
NLP-driven clinical documentation helps free clinicians from the laborious manual systems of EHRs and permits them to invest more time in the patient; this is how NLP can help doctors. Both speech-to-text dictation and formulated data entry have been a blessing. Vendors such as Nuance and M*Modal offer technology that combines speech recognition at the point of care with formalised vocabularies for future use, so that structured data is captured as care is delivered. NLP technologies extract relevant data from speech recognition output, which can considerably improve the analytical data used to run value-based care (VBC) and population health management (PHM) efforts, with better outcomes for clinicians. In the future, NLP tools will also be applied to various public data sets and social media to determine Social Determinants of Health (SDOH) and the usefulness of wellness-based policies.

2. Speech Recognition.
NLP has matured its use case in speech recognition over the years by allowing clinicians to transcribe notes for useful EHR data entry. Front-end speech recognition lets physicians dictate notes at the point of care instead of typing them, while back-end technology works to detect and correct any errors in the transcription before passing it on for human proofing.
The market is almost saturated with speech recognition technologies, but a few start-ups
are disrupting the space with deep learning algorithms in mining applications, uncovering
more extensive possibilities.

3. Computer-Assisted Coding (CAC).


CAC captures data on procedures and treatments to grasp every possible code and maximise claims. It is one of the most popular uses of NLP, but unfortunately its adoption rate is just 30%. It has improved the speed of coding but falls short on accuracy.

4. Data Mining Research.


The integration of data mining in healthcare systems allows organizations to reduce the
levels of subjectivity in decision-making and provide useful medical know-how. Once
started, data mining can become a cyclic technology for knowledge discovery, which can
help any HCO create a good business strategy to deliver better care to patients.

5. Automated Registry Reporting.


An NLP use case is to extract values as needed by each use case. Many health IT systems are
burdened by regulatory reporting when measures such as ejection fraction are not stored as
discrete values. For automated reporting, health systems will have to identify when an
ejection fraction is documented as part of a note, and save each value in a form that can be
utilized by the organization’s analytics platform for automated registry reporting.

How can Healthcare Organizations leverage NLP?


Healthcare organizations can use NLP to transform the way they deliver care and manage
solutions. Organizations can use machine learning in healthcare to improve provider
workflows and patient outcomes.

Implementing Predictive Analytics in Healthcare :

Identification of high-risk patients, as well as improvement of the diagnosis process, can be achieved by deploying predictive analytics along with Natural Language Processing in healthcare.
It is vital for emergency departments to have complete data quickly at hand. For example, a delay in diagnosing Kawasaki disease leads to critical complications if it is missed or mistreated in any way. In one study, an NLP-based algorithm identified patients at risk of Kawasaki disease with a sensitivity of 93.6% and a specificity of 77.5% compared to manual review of clinicians' notes.
A set of researchers from France worked on developing another NLP based algorithm that
would monitor, detect and prevent hospital-acquired infections (HAI) among patients. NLP
helped in rendering unstructured data which was then used to identify early signs and
intimate clinicians accordingly.
Similarly, another experiment was carried out to automate identification and risk prediction for heart failure patients who were already hospitalized. Natural Language Processing was used to analyse free-text reports from the previous 24 hours and predict the patient's risk of hospital readmission and mortality over a period of 30 days. At the end of the experiment, the algorithm performed better than expected, with an overall positive predictive value of 97.45%.
The benefits of deploying NLP can be applied to other areas of interest and a myriad of
algorithms can be deployed to pick out and predict specified conditions amongst patients.
Even though the healthcare industry at large still needs to refine its data capabilities prior to
deploying NLP tools, it still has a massive potential to significantly improve care delivery as
well as streamline workflows. Down the line, Natural Language Processing and other ML
tools will be the key to superior clinical decision support & patient health outcomes.

CONCLUSION : Thus, we have successfully curated a case study on the applications of NLP.

Name :
Roll No. :

Class : BE – A / Computer Engineering


UID :

Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)

Submitted to : PROF. NAZIA SULTHANA

Experiment No. :

AIM : Mini project based on a real-life application of Natural Language Processing.

THEORY :

Title: GRAMMATICAL ERROR CORRECTION (GEC).

Abstract: Grammatical Error Correction (GEC) systems aim to correct grammatical mistakes
in the text. Grammarly is an example of such a grammar correction product. Error correction
can improve the quality of written text in emails, blogs and chats. The GEC task can be thought of as a sequence-to-sequence task in which a Transformer model is trained to take an ungrammatical sentence as input and return a grammatically correct sentence.

Implementation:

1. Dataset:
For the training of our grammar corrector, we have used the C4_200M dataset recently released by Google. This dataset consists of roughly 200 million examples of synthetically generated grammatical corruptions along with the corresponding correct text.

One of the biggest challenges in GEC is getting a good variety of data that simulates
the errors typically made in written language. If the corruptions are random, then
they would not be representative of the distribution of errors encountered in real
use cases.

To generate the corruption, a tagged corruption model is first trained. This model is
trained on existing datasets by taking as input a clean text and generating a
corrupted text. This is represented in the figure below:

For the C4_200M dataset, the authors first determined the distribution of the relative types of errors encountered in written language. When generating the corruptions, the corruption model was conditioned on the type of error. As shown in the figure below, the corruption model was conditioned to generate a determiner-type error.

This allows the C4_200M dataset to have a diverse set of errors reflecting their
relative frequency in real-world applications. For the purpose of this project, we
extracted 550K sentences from C4_200M. The C4_200M dataset is available on TF
datasets. We extracted the sentences we needed and saved them as a CSV.
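The extraction and loading code is not reproduced here; a minimal sketch of reading the saved CSV into the df dataframe used by the training code below (the file name is an assumption; the input/output column names match the dataset class defined later):

import pandas as pd

# Hypothetical file name for the 550K extracted sentence pairs.
df = pd.read_csv("c4_200m_550k.csv")
df.columns = ["input", "output"]   # incorrect sentence, corrected sentence

print(df.shape)
print(df.head())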

2. Model Training:
T5 is a text-to-text model, meaning it can be trained to map input text to output text. This model can be used for many different objectives such as summarization and text classification, and can even be used to build a trivia bot that retrieves answers from memory without any provided context.

T5 is preferred for a lot of tasks for a few reasons :
1. Can be used for any text-to-text task.
2. Good accuracy on downstream tasks after fine-tuning.

Steps:

1. Tokenizing the data


We set the incorrect sentence as the input and the corrected text as the label. Both
the inputs and targets are tokenized using the T5 tokenizer. The max length is set to
64 since most of the inputs in C4_200M are sentences and the assumption is that
this model will also be used on sentences.

2. Training the model using seq2seq trainer class


We use the Seq2SeqTrainer class in Hugging Face to instantiate the model, and we set up logging to wandb. Using Weights & Biases with Hugging Face is very simple: all that needs to be done is to set report_to="wandb" in the training arguments.

3. Monitoring and evaluating the data


We have used the Rouge score as the metric for evaluating the model. As seen in the
plots below from W&B, the model gets to a rouge score of 72 after 1 epoch of
training.

Code:

from datasets import load_dataset
from tqdm import tqdm
import argparse
import glob
import os
import json
import time
import logging
import random
import re
from itertools import chain
from string import punctuation

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import pandas as pd

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

from transformers import (
    AdamW,
    T5ForConditionalGeneration,
    T5Tokenizer,
    get_linear_schedule_with_warmup
)

import datasets

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(42)

from transformers import (
    T5ForConditionalGeneration,
    T5Tokenizer,
    Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
)

model_name = 't5-base'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def calc_token_len(example):
    return len(tokenizer(example).input_ids)

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.10, shuffle=True)
train_df.shape, test_df.shape

from torch.utils.data import Dataset, DataLoader

class GrammarDataset(Dataset):
    def __init__(self, dataset, tokenizer, print_text=False):
        self.dataset = dataset
        self.pad_to_max_length = False
        self.tokenizer = tokenizer
        self.print_text = print_text
        self.max_len = 64

    def __len__(self):
        return len(self.dataset)

    def tokenize_data(self, example):
        input_, target_ = example['input'], example['output']

        # tokenize inputs
        tokenized_inputs = tokenizer(input_,
                                     pad_to_max_length=self.pad_to_max_length,
                                     max_length=self.max_len,
                                     return_attention_mask=True)

        tokenized_targets = tokenizer(target_,
                                      pad_to_max_length=self.pad_to_max_length,
                                      max_length=self.max_len,
                                      return_attention_mask=True)

        inputs = {"input_ids": tokenized_inputs['input_ids'],
                  "attention_mask": tokenized_inputs['attention_mask'],
                  "labels": tokenized_targets['input_ids']}

        return inputs

    def __getitem__(self, index):
        inputs = self.tokenize_data(self.dataset[index])

        if self.print_text:
            for k in inputs.keys():
                print(k, len(inputs[k]))

        return inputs

from datasets import load_metric

rouge_metric = load_metric("rouge")

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model,
                                       padding='longest', return_tensors='pt')

# defining training related arguments
batch_size = 16
args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/c4_200m/weights",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    save_total_limit=2,
    predict_with_generate=True,
    fp16=True,
    gradient_accumulation_steps=6,
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    logging_dir="/logs",
    report_to="wandb")

import nltk
nltk.download('punkt')
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    result = rouge_metric.compute(predictions=decoded_preds,
                                  references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}

# defining trainer using huggingface
trainer = Seq2SeqTrainer(model=model,
                         args=args,
                         train_dataset=GrammarDataset(train_dataset, tokenizer),
                         eval_dataset=GrammarDataset(test_dataset, tokenizer),
                         tokenizer=tokenizer,
                         data_collator=data_collator,
                         compute_metrics=compute_metrics)

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'deep-learning-analytics/GrammarCorrector'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(torch_device)

def correct_grammar(input_text, num_return_sequences):
    batch = tokenizer([input_text], truncation=True, padding='max_length',
                      max_length=64, return_tensors="pt").to(torch_device)
    translated = model.generate(**batch, max_length=64, num_beams=4,
                                num_return_sequences=num_return_sequences, temperature=1.5)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text
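A brief usage example of the correct_grammar helper defined above (the sample sentence is illustrative):

# Ask the model for two candidate corrections of an ungrammatical sentence.
print(correct_grammar("He are moving here.", num_return_sequences=2))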

Output:

Applications:

1. Can be used for Grammar error correction specific applications like Grammarly.
2. Can be implemented in paraphrasing software and applications.
3. Can be included in document or content writing software like Microsoft Word, LibreOffice, and Google Docs.

Results:

By fine-tuning the T5 Transformer for Grammar Error Correction and training it on a 550K-sentence subset of C4_200M, we achieved a Rouge score of 80%.

Conclusion:
In this project, we proposed a new strategy for the Grammar Error Correction system based
on Deep Learning, and the experimental results show that the proposed method is effective.
It makes full use of the advantages of Deep Learning.
