
Beyond 🅼🅻

NLP CheatSheet

NLP
The goal of Natural Language Processing is to analyze and extract
meaningful information from natural human languages such as speech and
text.

Natural Language Processing is commonly divided into two sub-fields:

Natural Language Understanding (NLU)
Natural Language Generation (NLG)
Sample text used throughout the examples:
"You can do this. Don’t tell people your plans. Show them your results. No pressure, no diamonds. Try Again. Fail again. Fail better."
1. Tokenization
Tokenization is the process of splitting text, such as a paragraph or a sentence, into smaller units called tokens.

Sentence Tokenization

>>> import nltk
>>> nltk.download("punkt")    # tokenizer models used by sent_tokenize and word_tokenize
>>> from nltk.tokenize import sent_tokenize
>>> para = """You can do this. Don’t tell people your plans. Show them your results. No pressure, no diamonds. Try Again. Fail again. Fail better.""".lower()
>>> sent_tokenize(para)
['you can do this.', 'don’t tell people your plans.', 'show them your results.',
'no pressure, no diamonds.', 'try again.', 'fail again.', 'fail better.']
>>>

Word Tokenization

>>> from nltk.tokenize import word_tokenize
>>> para = """You can do this. Don’t tell people your plans. Show them your results. No pressure, no diamonds. Try Again. Fail again. Fail better.""".lower()
>>> word_tokenize(para)
['you', 'can', 'do', 'this', '.', 'don', '’', 't', 'tell', 'people', 'your', 'plans', '.',
'show', 'them', 'your', 'results', '.', 'no', 'pressure', ',', 'no', 'diamonds', '.',
'try', 'again', '.', 'fail', 'again', '.', 'fail', 'better', '.']
>>>
2. Filtering Stop Words
Stop words are very common words (such as "you", "can", "your") that appear in almost every sentence and carry little meaning on their own, so they are usually filtered out before further processing.

>>> import nltk
>>> nltk.download("stopwords")
>>> from nltk.corpus import stopwords
>>> tokenized_text = ['you', 'can', 'do', 'this', '.', 'don', '’', 't', 'tell', 'people', 'your',
'plans', '.', 'show', 'them', 'your', 'results', '.', 'no', 'pressure', ',', 'no', 'diamonds', '.',
'try', 'again', '.', 'fail', 'again', '.', 'fail', 'better', '.']
>>> stop_words = set(stopwords.words("english"))
>>> tokenized_text = [word for word in tokenized_text if word.casefold() not in stop_words]
>>> tokenized_text
['.', '’', 'tell', 'people', 'plans', '.', 'show', 'results', '.', 'pressure', ',', 'diamonds', '.', 'try', '.',
'fail', '.', 'fail', 'better', '.']
>>>
3. Stemming & Lemmatization
Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. For
example, the words “helping” and “helper” share the root “help.”
Lemmatization is a text processing task in which you reduce words to their meaningful root, which is the core
part of a word. For example, the words “helping” and “helper” share the root “help.”

Stemming

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> tokenized_text = ['.', '’', 'tell', 'people', 'plans', '.', 'show', 'results', '.', 'pressure', ',',
'diamonds', '.', 'try', '.', 'fail', '.', 'fail', 'better', '.']
>>> stemmed_text = [stemmer.stem(word) for word in tokenized_text if word not in '.,’']
>>> stemmed_text
['tell', 'peopl', 'plan', 'show', 'result', 'pressur', 'diamond', 'tri', 'fail', 'fail', 'better']
>>>

Lemmatization

>>> import nltk
>>> from nltk.stem import WordNetLemmatizer
>>> nltk.download('wordnet')
>>> nltk.download('omw-1.4')
>>> lemmatizer = WordNetLemmatizer()
>>> tokenized_text = ['.', '’', 'tell', 'people', 'plans', '.', 'show', 'results', '.', 'pressure', ',',
'diamonds', '.', 'try', '.', 'fail', '.', 'fail', 'better', '.']
>>> lemmatized_text = [lemmatizer.lemmatize(word) for word in tokenized_text if word not in '.,’']
>>> lemmatized_text
['tell', 'people', 'plan', 'show', 'result', 'pressure', 'diamond', 'try', 'fail', 'fail', 'better']
>>>
3.1 Bag Of Words (BOW)
A Bag-of-Words is a representation of extracted text features as a vector that describes the occurrence (count) of each word within a document.

Sent1 -> AI is about smart machines.
Sent2 -> Machine Learning is a technique using which machine learns.
Sent3 -> Deep Learning is powerful technique.

Word Frequency

machine    3
learns     3
technique  2
deep       1
smart      1
...        ...

Bag of Words Vector

        ai   machine  smart  technique  using  learns  deep  powerful
sent1   1    1        1      0          0      0       0     0
sent2   0    2        0      1          1      2       0     0
sent3   0    0        0      1          0      1       1     1
...     ...  ...      ...    ...        ...    ...     ...   ...
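As a hand-rolled illustration of how the table above is built (the sklearn version follows in 3.2), the counts can be computed directly from the token lists. The token lists below are an assumption: the three example sentences after lower-casing, stop-word removal and lemmatization, with "learning"/"learns" counted together as in the table.

>>> docs = {
...     "sent1": ["ai", "smart", "machine"],
...     "sent2": ["machine", "learns", "technique", "using", "machine", "learns"],
...     "sent3": ["deep", "learns", "powerful", "technique"],
... }
>>> vocab = ["ai", "machine", "smart", "technique", "using", "learns", "deep", "powerful"]
>>> bow = {name: [tokens.count(word) for word in vocab] for name, tokens in docs.items()}
>>> bow["sent2"]                     # matches the sent2 row of the table above
[0, 2, 0, 1, 1, 2, 0, 0]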
3.2 Bag Of Words (BOW)

>>> from nltk.corpus import stopwords
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.tokenize import word_tokenize, sent_tokenize
>>> from sklearn.feature_extraction.text import CountVectorizer
>>>
>>> stop_words = set(stopwords.words("english"))
>>> para = "You can do this. Don’t tell people your plans. Show them your results. No pressure, no diamonds. Try Again. Fail again. Fail better.".lower()
>>> tokenized_sent = sent_tokenize(para)
>>> tokenized_words_sent = [word_tokenize(sent) for sent in tokenized_sent]
>>> filtered_sent_token = [[word for word in sent if word.casefold() not in stop_words] for sent in tokenized_words_sent]
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatized_token = [' '.join([lemmatizer.lemmatize(word) for word in sent_token if word not in '.,’']) for sent_token in filtered_sent_token]
>>>
>>> vectorizer = CountVectorizer()
>>> bag_of_words = vectorizer.fit_transform(lemmatized_token).toarray()
>>> bag_of_words
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
>>>
3.3 TF-IDF
TF-IDF stands for term frequency-inverse document frequency. The TF-IDF weight is often used in information retrieval and text mining: it is a statistical measure of how important a word is to a document in a collection or corpus.

TF (term frequency per sentence)

        ai   machine  smart  technique  using  learns  deep  powerful
sent1   1/3  1/3      1/3    0          0      0       0     0
sent2   0    2/4      0      1/4        1/4    2/4     0     0
sent3   0    0        0      1/4        0      1/4     1/4   1/4
...     ...  ...      ...    ...        ...    ...     ...   ...

IDF (inverse document frequency over the 3 sentences)

Word       Freq   IDF
ai         1      log(3/1)
machine    3      log(3/2)
smart      1      log(3/1)
technique  2      log(3/2)
using      1      log(3/1)
learns     3      log(3/2)
deep       1      log(3/1)
powerful   1      log(3/1)

TF-IDF Vector = TF x IDF
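Continuing the hand-rolled sketch from 3.1, the TF and IDF factors from the table can be combined directly. The exact fractions depend on which tokens survive preprocessing, and sklearn's TfidfVectorizer (used in 3.4 below) additionally smooths the IDF and L2-normalises each row, so its numbers differ from this raw version.

>>> import math
>>> docs = {"sent1": ["ai", "smart", "machine"],
...         "sent2": ["machine", "learns", "technique", "using", "machine", "learns"],
...         "sent3": ["deep", "learns", "powerful", "technique"]}   # same token lists as in the 3.1 sketch
>>> vocab = ["ai", "machine", "smart", "technique", "using", "learns", "deep", "powerful"]
>>> def tf(word, tokens):
...     return tokens.count(word) / len(tokens)                     # term frequency within one sentence
...
>>> def idf(word):
...     df = sum(1 for tokens in docs.values() if word in tokens)   # number of sentences containing the word
...     return math.log(len(docs) / df)                             # idf = log(N / df)
...
>>> tf_idf = {name: [round(tf(w, tokens) * idf(w), 3) for w in vocab] for name, tokens in docs.items()}
>>> tf_idf["sent1"]
[0.366, 0.135, 0.366, 0.0, 0.0, 0.0, 0.0, 0.0]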
3.4 TF-IDF

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> tf_idf_vect = vectorizer.fit_transform(lemmatized_token).toarray()
>>> tf_idf_vect
array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0. , 0. , 0. , 0.57735027, 0.57735027, 0. , 0. , 0. , 0.57735027, 0. ],
       [0. , 0. , 0. , 0. , 0. , 0. , 0.70710678, 0.70710678, 0. , 0. ],
       [0. , 0.70710678, 0. , 0. , 0. , 0.70710678, 0. , 0. , 0. , 0. ],
       [0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. ],
       [0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0.76944876, 0. , 0.63870855, 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
>>>
4. POS Tagging
In grammar, part of speech refers to the roles words play in sentences.
POS tagging refers to labeling your words according to their parts of speech.

>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> nltk.download("averaged_perceptron_tagger")    # tagger model required by pos_tag
>>> para = "You can do this. Don’t tell people your plans. Show them your results. No pressure, no diamonds. Try Again. Fail again. Fail better.".lower()
>>> tokens = word_tokenize(para)
>>> nltk.pos_tag(tokens)
[('you', 'PRP'), ('can', 'MD'), ('do', 'VB'), ('this', 'DT'), ('.', '.'), ('don', 'VB'), ('’', 'JJ'), ('t',
'NN'), ('tell', 'VBP'), ('people', 'NNS'), ('your', 'PRP$'), ('plans', 'NNS'), ('.', '.'), ('show',
'VB'), ('them', 'PRP'), ('your', 'PRP$'), ('results', 'NNS'), ('.', '.'), ('no', 'DT'), ('pressure',
'NN'), (',', ','), ('no', 'DT'), ('diamonds', 'NNS'), ('.', '.'), ('try', 'VB'), ('again', 'RB'), ('.', '.'),
('fail', 'VB'), ('again', 'RB'), ('.', '.'), ('fail', 'VB'), ('better', 'JJR'), ('.', '.')]
>>>
5. Chunking
While tokenizing allows you to identify words and sentences, chunking allows you to identify phrases.

Here are some examples of noun phrases of increasing complexity:

“A horse”
“A white horse”
“A running white horse”

>>> from nltk.tokenize import word_tokenize
>>> import nltk
>>> para = "You can do this. Don’t tell people your plans. ".lower()
>>> tokens = word_tokenize(para)
>>> pos_tokens = nltk.pos_tag(tokens)
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"    # noun phrase: optional determiner, any adjectives, a noun
>>> chunk_parser = nltk.RegexpParser(grammar)
>>> tree = chunk_parser.parse(pos_tokens)
>>> tree.draw()
6. Chinking
Chinking is used together with chunking, but while chunking is used to include a pattern, chinking is
used to exclude a pattern.

>>> from nltk.tokenize import word_tokenize
>>> import nltk
>>> para = "You can do this. Don’t tell people your plans. ".lower()
>>> tokens = word_tokenize(para)
>>> pos_tokens = nltk.pos_tag(tokens)
>>> grammar = """
Chunk: {<.*>+}    # chunk everything ...
}<JJ>{            # ... then chink (exclude) adjectives out of the chunks
"""
>>> chunk_parser = nltk.RegexpParser(grammar)
>>> tree = chunk_parser.parse(pos_tokens)
>>> tree.draw()
7. Named Entity Recognition
Named entities are noun phrases that refer to specific locations, people, organizations, and so on. Named Entity Recognition (NER) locates them in text and labels their type.

>>> from nltk.tokenize import word_tokenize
>>> import nltk
>>> nltk.download("maxent_ne_chunker")
>>> nltk.download('words')
>>> para = "You can do this. Don’t tell people your plans. ".lower()
>>> tokens = word_tokenize(para)
>>> pos_tokens = nltk.pos_tag(tokens)
>>> tree = nltk.ne_chunk(pos_tokens)    # chunks named entities from POS-tagged tokens
>>> tree.draw()
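The lower-cased sample sentence above contains no named entities, so nothing gets labelled. As an illustrative sketch (the sentence below is my own example, not from the cheat sheet), ne_chunk labels people, organizations and places in a properly capitalised sentence; the averaged_perceptron_tagger resource is also needed for pos_tag.

>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> nltk.download("maxent_ne_chunker")
>>> nltk.download("words")
>>> nltk.download("averaged_perceptron_tagger")
>>> sent = "Guido van Rossum created Python at CWI in Amsterdam."   # illustrative sentence with real entities
>>> pos_tokens = nltk.pos_tag(word_tokenize(sent))
>>> tree = nltk.ne_chunk(pos_tokens)
>>> for subtree in tree:
...     if hasattr(subtree, "label"):               # named-entity subtrees carry a label
...         print(subtree.label(), " ".join(tok for tok, tag in subtree))
...
# prints entity labels such as PERSON, ORGANIZATION or GPE together with their text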
8. Word Embeddings
Word Embeddings are meaningful feature representations of the words in a document or vocabulary as dense numeric vectors. They try to preserve syntactic and semantic information, so words used in similar contexts end up close together in the vector space.
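The cheat sheet does not include embedding code, so here is a minimal sketch assuming the gensim library is available (an assumption; it is not used elsewhere in this document). A Word2Vec model is trained on the tokenised sample text purely for illustration; a corpus this small will not produce meaningful vectors.

>>> from gensim.models import Word2Vec
>>> from nltk.tokenize import word_tokenize, sent_tokenize
>>> para = "You can do this. Don’t tell people your plans. Show them your results. No pressure, no diamonds. Try Again. Fail again. Fail better.".lower()
>>> sentences = [word_tokenize(sent) for sent in sent_tokenize(para)]
>>> model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)   # tiny corpus, illustrative only
>>> model.wv["fail"].shape                   # each word is now a 50-dimensional dense vector
(50,)
>>> model.wv.most_similar("fail", topn=3)    # nearest neighbours in the embedding space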
9. Text Classification
Text classification is the process of assigning a label or class to a given text. Sentiment analysis, natural language inference, and grammatical correctness assessment are some of the use cases.

>>> from transformers import pipeline
>>> classifier = pipeline("sentiment-analysis")
>>> classifier("I loved Star Wars so much!")
[{'label': 'POSITIVE', 'score': 0.99}]
>>> # Natural language inference needs a checkpoint fine-tuned for entailment rather than
>>> # the default sentiment model, e.g. an MNLI model such as "roberta-large-mnli" (assumed here):
>>> nli_classifier = pipeline("text-classification", model="roberta-large-mnli")
>>> nli_classifier("Where is the capital of France?, Paris is the capital of France.")
## [{'label': 'entailment', 'score': 0.997}]
10. Text Summarization
Text summarization is the process of reducing a document to its essential elements while preserving its main points. Summaries can be extracted directly from the original text, or generated as entirely new text by some models.

>>> from transformers import pipeline
>>> summarizer = pipeline("summarization")
>>> summarizer(
...     "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, "
...     "and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on "
...     "each side. It was the first structure to reach a height of 300 metres. Excluding transmitters, "
...     "the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.")
## The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building.
## It was the first structure to reach a height of 300 metres.
Beyond 🅼🅻 being._.happy
/skhapijulhossen
