Text Preprocessing Techniques

Dataset:
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
(https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)
In [1]: import pandas as pd
In [3]: !unzip /content/IMDB_Dataset.zip
Archive: /content/IMDB_Dataset.zip
inflating: IMDB Dataset.csv
In [13]: data_path = "/content/IMDB Dataset.csv"
In [14]: df = pd.read_csv(data_path)
In [15]: df.shape
Out[15]: (50000, 2)
In [16]: df.head()
Out[16]: review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
In [17]: df = df.head(100)
In [18]: df.shape
Out[18]: (100, 2)
In [19]: df.head()
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
lower case
In [20]: df['review'][3]
Out[20]: "Basically there's a family where a little boy (Jake) thinks there's a zombie in his cl
oset & his parents are fighting all the time. This movie is slower than a so
ap opera... and suddenly, Jake decides to become Rambo and kill the zombie. 
OK, first of all when you're going to make a film you must Decide if its a thriller or
a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in rea
l life. And then we have Jake with his closet which totally ruins all the film! I expec
ted to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningle
ss thriller spots. 3 out of 10 just for the well playing parents & descent d
ialogs. As for the shots with Jake: just ignore them."
In [21]: df['review'] = df['review'].str.lower()
In [22]: df
0 one of the other reviewers has mentioned that ... positive
1 a wonderful little production. the... positive
2 i thought this was a wonderful way to spend ti... positive
3 basically there's a family where a little boy ... negative
4 petter mattei's "love in the time of money" is... positive
... ... ...
95 daniel day-lewis is the most versatile actor a... positive
96 my guess would be this was originally going to... negative
97 well, i like to watch bad horror b-movies, cau... negative
98 this is the worst movie i have ever seen, as w... negative
99 i have been a mario fan for as long as i can r... positive
100 rows × 2 columns

Out[23]: "basically there's a family where a little boy (jake) thinks there's a zombie in his cl
oset & his parents are fighting all the time. this movie is slower than a so
ap opera... and suddenly, jake decides to become rambo and kill the zombie. 
ok, first of all when you're going to make a film you must decide if its a thriller or
a drama! as a drama the movie is watchable. parents are divorcing & arguing like in rea
l life. and then we have jake with his closet which totally ruins all the film! i expec
ted to see a boogeyman similar movie, and instead i watched a drama with some meaningle
ss thriller spots. 3 out of 10 just for the well playing parents & descent d
ialogs. as for the shots with jake: just ignore them."
remove_html_tags
In [24]: import re
def remove_html_tags(text):
pattern = re.compile('<.*?>')
return pattern.sub(r'', text)
In [25]: text = "<html><body> Movie 1 Actor - Aamir Khan Click here to <a href='
In [26]: remove_html_tags(text)
Out[26]: ' Movie 1 Actor - Aamir Khan Click here to download'
In [27]: df['review'] = df['review'].apply(remove_html_tags)
Out[28]: 'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication
to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my h
aving seen it some 15 or more times in the last 25 years. paul lukas\' performance brin
gs tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, i
s a delight. the kids are, as grandma says, more like "dressed-up midgets" than childre
n, but that only makes them more fun to watch. and the mother\'s slow awakening to what
\'s happening in the world and under her own roof is believable and startling. if i had
a dozen thumbs, they\'d all be "up" for this movie.'
remove_url
In [29]: def remove_url(text):
pattern = re.compile(r'https?://\S+|www\.\S+')
return pattern.sub(r'', text)
In [32]: text1 = 'Check out my youtube https://www.youtube.com/dswithbappy dswithbappy'

text2 = 'Check out my linkedin https://www.linkedin.com/in/boktiarahmed73/'
text3 = 'Google search here www.google.com'
text4 = 'For data click https://www.kaggle.com/'
In [33]: remove_url(text1)
Out[33]: 'Check out my youtube dswithbappy'
punctuation handling
In [34]: import string,time
string.punctuation
Out[34]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [35]: exclude = string.punctuation
exclude
Out[35]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [36]: def remove_punc(text):

for char in exclude:
text = text.replace(char,'')
return text

In [37]: text = 'string. With. Punctuation?'
In [38]: start = time.time()

print(remove_punc(text))
time1 = time.time() - start
print(time1*50000)
string With Punctuation

10.466575622558594
In [39]: def remove_punc1(text):

return text.translate(str.maketrans('', '', exclude))
In [40]: start = time.time()

remove_punc1(text)
time2 = time.time() - start
print(time2*50000)
5.602836608886719
In [41]: time1/time2
Out[41]: 1.8680851063829786
Out[42]: 'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication
to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my h
aving seen it some 15 or more times in the last 25 years. paul lukas\' performance brin
gs tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, i
s a delight. the kids are, as grandma says, more like "dressed-up midgets" than childre
n, but that only makes them more fun to watch. and the mother\'s slow awakening to what
\'s happening in the world and under her own roof is believable and startling. if i had
a dozen thumbs, they\'d all be "up" for this movie.'
In [43]: remove_punc1(df['review'][5])
Out[43]: 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to
a noble cause but its not preachy or boring it just never gets old despite my having se
en it some 15 or more times in the last 25 years paul lukas performance brings tears to
my eyes and bette davis in one of her very few truly sympathetic roles is a delight the
kids are as grandma says more like dressedup midgets than children but that only makes
them more fun to watch and the mothers slow awakening to whats happening in the world a
nd under her own roof is believable and startling if i had a dozen thumbs theyd all be
up for this movie'
chat_conversion handle
In [44]: chat_words = {
'AFAIK':'As Far As I Know',
'AFK':'Away From Keyboard',
'ASAP':'As Soon As Possible'
}
In [45]: def chat_conversion(text):
new_text = []
for w in text.split():
if w.upper() in chat_words:
new_text.append(chat_words[w.upper()])
else:
new_text.append(w)
return " ".join(new_text)
In [46]: chat_conversion('Do this work ASAP')
Out[46]: 'Do this work As Soon As Possible'
incorrect_text handling
In [47]: from textblob import TextBlob
In [48]: incorrect_text = 'ceertain conditionas duriing seveal ggenerations aree moodified in the

textBlb = TextBlob(incorrect_text)

textBlb.correct().string
Out[48]: 'certain conditions during several generations are modified in the same manner.'
stopwords
In [49]: from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data...

[nltk_data] Unzipping corpora/stopwords.zip.
Out[49]: True
In [50]: stopwords.words('english')
Out[50]: ['i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',
"you're",
"you've",
"you'll",
"you'd",
'your',
'yours',
'yourself',
'yourselves',
'he',
'him',
'his'
In [51]: len(stopwords.words('english'))
Out[51]: 179
In [52]: def remove_stopwords(text):
new_text = []

for word in text.split():
if word in stopwords.words('english'):
new_text.append('')
else:
new_text.append(word)
x = new_text[:]
new_text.clear()
return " ".join(x)
In [53]: remove_stopwords('probably my all-time favorite movie, a story of selflessness, sacrific
Out[53]: 'probably all-time favorite movie, story selflessness, sacrifice dedication noble
cause, preachy boring. never gets old, despite seen 15 times'
In [54]: df.head()
1 a wonderful little production. the filming tec... positive
In [55]: df['review'].apply(remove_stopwords)
Out[55]: 0 one reviewers mentioned watching 1 oz e...

1 wonderful little production. filming techniq...
2 thought wonderful way spend time hot s...
3 basically there's family little boy (jake) ...
4 petter mattei's "love time money" visuall...
...
95 daniel day-lewis versatile actor alive. eng...
96 guess would originally going least two ...
97 well, like watch bad horror b-movies, cause ...
98 worst movie ever seen, well as, worst ...
99 mario fan long remember, fond memo...
Name: review, Length: 100, dtype: object
remove_emoji handle
In [56]: import re
def remove_emoji(text):
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
"]+", flags=re.UNICODE)
return emoji_pattern.sub(r'', text)
In [57]: remove_emoji("Loved the movie. It was 😘😘")

Out[57]: 'Loved the movie. It was '
In [58]: remove_emoji("Lmao 😂😂")

Out[58]: 'Lmao '
In [59]: !pip install emoji
Collecting emoji
Downloading emoji-2.8.0-py2.py3-none-any.whl (358 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 358.9/358.9 kB 4.7 MB/s eta 0:00:00
Installing collected packages: emoji
Successfully installed emoji-2.8.0
In [60]: import emoji

print(emoji.demojize('Python is 🔥'))
Python is :fire:
In [61]: print(emoji.demojize('Loved the movie. It was 😘'))

Loved the movie. It was :face_blowing_a_kiss:
Tokenization
1. Using the split function
In [62]: # word tokenization

sent1 = 'I am going to delhi'
sent1.split()
Out[62]: ['I', 'am', 'going', 'to', 'delhi']
In [63]: # sentence tokenization

sent2 = 'I am going to delhi. I will stay there for 3 days. Let\'s hope the trip to be g
sent2.split('.')
Out[63]: ['I am going to delhi',

' I will stay there for 3 days',
" Let's hope the trip to be great"]
In [64]: # Problems with split function

sent3 = 'I am going to delhi!'
sent3.split()
Out[64]: ['I', 'am', 'going', 'to', 'delhi!']
In [65]: sent4 = 'Where do think I should go? I have 3 day holiday'

sent4.split('.')
Out[65]: ['Where do think I should go? I have 3 day holiday']
2. Regular Expression
In [66]: import re
sent3 = 'I am going to delhi!'
tokens = re.findall("[\w']+", sent3)
tokens
Out[66]: ['I', 'am', 'going', 'to', 'delhi']
In [67]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen b
sentences = re.compile('[.!?] ').split(text)
sentences
Out[67]: ['Lorem Ipsum is simply dummy text of the printing and typesetting industry',
"\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhe
n an unknown printer took a galley of type and scrambled it to make a type specimen boo
k."]
3. NLTK
In [68]: from nltk.tokenize import word_tokenize,sent_tokenize

import nltk
nltk.download('punkt')
[nltk_data] Downloading package punkt to /root/nltk_data...

[nltk_data] Unzipping tokenizers/punkt.zip.
Out[68]: True
In [69]: sent1 = 'I am going to visit delhi!'

word_tokenize(sent1)
Out[69]: ['I', 'am', 'going', 'to', 'visit', 'delhi', '!']
In [70]: text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen b

sent_tokenize(text)
Out[70]: ['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
"Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen
an unknown printer took a galley of type and scrambled it to make a type specimen boo
k."]
In [71]: sent5 = 'I have a Ph.D in A.I'

sent6 = "We're here to help! mail us at nks@gmail.com"
sent7 = 'A 5km ride cost $10.50'

word_tokenize(sent5)
Out[71]: ['I', 'have', 'a', 'Ph.D', 'in', 'A.I']
In [72]: word_tokenize(sent6)
Out[72]: ['We',
"'re",
'here',
'to',
'help',
'!',
'mail',
'us',
'at',
'nks',
'@',
'gmail.com']
In [73]: word_tokenize(sent7)
Out[73]: ['A', '5km', 'ride', 'cost', '$', '10.50']
4. Spacy (good)
In [74]: import spacy

nlp = spacy.load('en_core_web_sm')
In [75]: doc1 = nlp(sent5)

doc2 = nlp(sent6)
doc3 = nlp(sent7)
doc4 = nlp(sent1)
In [76]: doc4 = nlp(sent1)

doc4
Out[76]: I am going to visit delhi!

In [77]: for token in doc4:
print(token)
I
am
going
to
visit
delhi
!
In [78]: df.head()
1 a wonderful little production. the filming tec... positive
Stemmer
In [79]: from nltk.stem.porter import PorterStemmer
In [80]: ps = PorterStemmer()
def stem_words(text):
return " ".join([ps.stem(word) for word in text.split()])
In [81]: sample = "walk walks walking walked"

stem_words(sample)
Out[81]: 'walk walk walk walk'
In [82]: text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedicat
print(text)
probably my alltime favorite movie a story of selflessness sacrifice and dedication to

a noble cause but its not preachy or boring it just never gets old despite my having se
en it some 15 or more times in the last 25 years paul lukas performance brings tears to
my eyes and bette davis in one of her very few truly sympathetic roles is a delight the
kids are as grandma says more like dressedup midgets than children but that only makes
them more fun to watch and the mothers slow awakening to whats happening in the world a
nd under her own roof is believable and startling if i had a dozen thumbs theyd all be
up for this movie
In [83]: stem_words(text)
Out[83]: 'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus b
ut it not preachi or bore it just never get old despit my have seen it some 15 or more
time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of
her veri few truli sympathet role is a delight the kid are as grandma say more like dre
ssedup midget than children but that onli make them more fun to watch and the mother sl
ow awaken to what happen in the world and under her own roof is believ and startl if i
had a dozen thumb theyd all be up for thi movi'
Lemmatization
In [84]: import nltk
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after p
punctuations="?:!.,;"
sentence_words = nltk.word_tokenize(sentence)
for word in sentence_words:
if word in punctuations:
sentence_words.remove(word)

sentence_words
print("{0:20}{1:20}".format("Word","Lemma"))
for word in sentence_words:
print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word,pos='v')))
[nltk_data] Downloading package wordnet to /root/nltk_data...

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
Word Lemma
He He
was be
running run
and and
eating eat
at at
same same
time time
He He
has have
bad bad
habit habit
of of
swimming swim
after after
playing play
long long
hours hours
in in
the the
Sun Sun
NOTE: Stemming & lamatization are same to retrieve root words but lamatization is worked good.
Lamatization is slow & stemming is fast
In [ ]:
Some common terms to remember:
1. Corpus - Paragrapgh
2. Vocabulary - Unique word
3. Document - one row
4. Word - one word
Bag of words
In [1]: import numpy as np
import pandas as pd
In [2]: df = pd.DataFrame({"text":["people watch dswithbappy",

"dswithbappy watch dswithbappy",
"people write comment",
"dswithbappy write comment"],"output":[1,1,0,0]})

df
Out[2]: text output
0 people watch dswithbappy 1
1 dswithbappy watch dswithbappy 1
2 people write comment 0
3 dswithbappy write comment 0
In [3]: from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
In [4]: bow = cv.fit_transform(df['text'])
In [5]: #vocabulary
print(cv.vocabulary_)
{'people': 2, 'watch': 3, 'dswithbappy': 1, 'write': 4, 'comment': 0}
In [10]: bow.toarray()
Out[10]: array([[0, 1, 1, 1, 0],

[0, 2, 0, 1, 0],
[1, 0, 1, 0, 1],
[1, 1, 0, 0, 1]])
In [ ]: print(bow[0].toarray())
print(bow[1].toarray())
[[0 1 1 1 0]]
[[0 2 0 1 0]]
[[1 0 1 0 1]]
In [8]: # new
cv.transform(['Bappy watch dswithbappy']).toarray()
Out[8]: array([[0, 1, 0, 1, 0]])
In [ ]: X = bow.toarray()
y = df['output']
In [ ]:
N-grams
In [11]: df = pd.DataFrame({"text":["people watch dswithbappy",

df
Out[11]: text output
In [12]: # BI grams
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2,2))
In [14]: print(cv.vocabulary_)
{'people watch': 2, 'watch dswithbappy': 4, 'dswithbappy watch': 0, 'people write': 3,

'write comment': 5, 'dswithbappy write': 1}
[[0 0 1 0 1 0]]
[[1 0 0 0 1 0]]
[[0 0 0 1 0 1]]
In [15]: #Ti gram

# BI grams
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(3,3))
In [ ]: print(cv.vocabulary_)
{'people watch dswithbappy': 2, 'dswithbappy watch dswithbappy': 0, 'people write comme

nt': 3, 'dswithbappy write comment': 1}
[[0 0 1 0]]
[[1 0 0 0]]
[[0 0 0 1]]
TF-IDF (Term frequency- Inverse document

frequency)
In [ ]: df = pd.DataFrame({"text":["people watch dswithbappy",

df
Out[20]: text output
In [ ]: from sklearn.feature_extraction.text import TfidfVectorizer

tfid= TfidfVectorizer()
In [ ]: tfid.fit_transform(df['text']).toarray()
Out[22]: array([[0. , 0.49681612, 0.61366674, 0.61366674, 0. ],

[0. , 0.8508161 , 0. , 0.52546357, 0. ],
[0.57735027, 0. , 0.57735027, 0. , 0.57735027],
[0.61366674, 0.49681612, 0. , 0. , 0.61366674]])
In [ ]: print(tfid.idf_)
[1.51082562 1.22314355 1.51082562 1.51082562 1.51082562]
In [ ]:

Text Preprocessing Techniques

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Text Preprocessing Techniques

Uploaded by

Copyright:

Available Formats

Dataset:

In [1]: import pandas as pd

In [3]: !unzip /content/IMDB_Dataset.zip

In [13]: data_path = "/content/IMDB Dataset.csv"

Out[16]: review sentiment

0 One of the other reviewers has mentioned that ... positive

1 A wonderful little production. <br /><br />The... positive

2 I thought this was a wonderful way to spend ti... positive

3 Basically there's a family where a little boy ... negative

4 Petter Mattei's "Love in the Time of Money" is... positive

Out[19]: review sentiment

0 One of the other reviewers has mentioned that ... positive

1 A wonderful little production. <br /><br />The... positive

2 I thought this was a wonderful way to spend ti... positive

3 Basically there's a family where a little boy ... negative

4 Petter Mattei's "Love in the Time of Money" is... positive

In [21]: df['review'] = df['review'].str.lower()

Out[22]: review sentiment

0 one of the other reviewers has mentioned that ... positive

1 a wonderful little production. <br /><br />the... positive

2 i thought this was a wonderful way to spend ti... positive

3 basically there's a family where a little boy ... negative

4 petter mattei's "love in the time of money" is... positive

... ... ...

95 daniel day-lewis is the most versatile actor a... positive

96 my guess would be this was originally going to... negative

97 well, i like to watch bad horror b-movies, cau... negative

98 this is the worst movie i have ever seen, as w... negative

99 i have been a mario fan for as long as i can r... positive

100 rows × 2 columns

Out[26]: ' Movie 1 Actor - Aamir Khan Click here to download'

In [27]: df['review'] = df['review'].apply(remove_html_tags)

In [32]: text1 = 'Check out my youtube https://www.youtube.com/dswithbappy dswithbappy'

Out[33]: 'Check out my youtube dswithbappy'

In [36]: def remove_punc(text):

In [37]: text = 'string. With. Punctuation?'

In [38]: start = time.time()

string With Punctuation

In [39]: def remove_punc1(text):

In [40]: start = time.time()

In [46]: chat_conversion('Do this work ASAP')

Out[46]: 'Do this work As Soon As Possible'

[nltk_data] Downloading package stopwords to /root/nltk_data...

In [53]: remove_stopwords('probably my all-time favorite movie, a story of selflessness, sacrific

Out[54]: review sentiment

0 one of the other reviewers has mentioned that ... positive

1 a wonderful little production. the filming tec... positive

2 i thought this was a wonderful way to spend ti... positive

3 basically there's a family where a little boy ... negative

4 petter mattei's "love in the time of money" is... positive

Out[55]: 0 one reviewers mentioned watching 1 oz e...

In [57]: remove_emoji("Loved the movie. It was 😘😘")

In [58]: remove_emoji("Lmao 😂😂")

In [60]: import emoji

In [61]: print(emoji.demojize('Loved the movie. It was 😘'))

1. Using the split function

In [62]: # word tokenization

Out[62]: ['I', 'am', 'going', 'to', 'delhi']

In [63]: # sentence tokenization

Out[63]: ['I am going to delhi',

In [64]: # Problems with split function

Out[64]: ['I', 'am', 'going', 'to', 'delhi!']

In [65]: sent4 = 'Where do think I should go? I have 3 day holiday'

Out[65]: ['Where do think I should go? I have 3 day holiday']