Approaching Almost Any NLP Problem
@abhi1thakur
AI is like an imaginary friend most enterprises claim to have these days
I like big data and I cannot lie
Agenda
➢ Not so much intro
➢ Where is NLP used
➢ Pre-processing
➢ Machine Learning Models
➢ Solving a problem
➢ Traditional approaches
➢ Deep Learning Models
➢ Muppets
Applications of natural language processing
➢ Sentiment classification
➢ Entity extraction
➢ Review rating prediction
Real-world text comes with abbreviations, symbols, emojis and unintentional characters.
Pre-processing the text data
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML

def remove_space(text):
    text = text.strip()
    text = text.split()
    return " ".join(text)
Pre-processing the text data
➢ Tokenization
    ➢ Very important step
    ➢ Is not always about spaces
    ➢ Converts words into tokens
    ➢ Might be different for different languages
    ➢ Simplest is to use `word_tokenize` from NLTK
    ➢ Write your own ;)
Pre-processing the text data
➢ Tokenization

import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

text = "hello, how are you?"
tokens = word_tokenize(text)
print(tokens)
# ['hello', ',', 'how', 'are', 'you', '?']
Pre-processing the text data
➢ Spelling correction
[Figure: misspelled inputs ("I ned a new car insuraance", "I needd a new carr insurance", "I need a neew car insurance") corrected to "I need a new car insurance" by a model with an embeddings layer, a bidirectional stacked char-LSTM, and an output layer]
Pre-processing the text data
➢ Spelling correction

def edits1(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
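These edit functions follow Peter Norvig's spelling corrector; a minimal sketch of how they are typically wired into a corrector, assuming a plain-text corpus file (the file name big.txt here is a placeholder) to estimate word frequencies:

from collections import Counter
import re

# word frequencies from a large plain-text corpus
WORDS = Counter(re.findall(r'\w+', open('big.txt').read().lower()))

def known(words):
    # keep only candidates that actually occur in the corpus
    return set(w for w in words if w in WORDS)

def candidates(word):
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    # pick the most frequent known candidate
    return max(candidates(word), key=lambda w: WORDS[w])

print(correction('insuraance'))  # -> 'insurance', given a reasonable corpus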
Pre-processing the text data
➢ Contraction mapping

contraction = {
    "'cause": 'because',
    ',cause': 'because',
    ';cause': 'because',
    "ain't": 'am not',
    'ain,t': 'am not',
    'ain;t': 'am not',
    'ain´t': 'am not',
    'ain’t': 'am not',
    "aren't": 'are not',
    'aren,t': 'are not',
    'aren;t': 'are not',
    'aren´t': 'are not',
    'aren’t': 'are not'
}
Pre-processing the text data
➢ Contraction mapping

def mapping_replacer(x, dic):
    # replaces only space-delimited occurrences, so pad the input with spaces
    for word in dic.keys():
        if " " + word + " " in x:
            x = x.replace(" " + word + " ", " " + dic[word] + " ")
    return x
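A quick usage sketch; because the helper only matches space-delimited occurrences, the input is padded with spaces:

text = "i ain't sure they aren't duplicates"
print(mapping_replacer(" " + text + " ", contraction).strip())
# i am not sure they are not duplicates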
Pre-processing the text data
➢ Stemming
    ➢ Reduces words to their root form
    ➢ Why is stemming important?
    ➢ NLTK stemmers
Pre-processing the text data
➢ Stemming
[Figure: "fishing", "fished" and "fishes" all stem to "fish"]

In [1]: from nltk.stem import SnowballStemmer
In [2]: s = SnowballStemmer('english')
In [3]: s.stem("fishing")
Out[3]: 'fish'
Pre-processing the text data
➢ Emoji handling

pip install emoji

import emoji
emojis = emoji.UNICODE_EMOJI
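A small sketch of putting the package to work, e.g. turning emojis into text aliases so a word-level model can see them (newer versions of the emoji package expose the lookup table differently, so treat the attribute above as version-dependent):

import emoji

text = "I need a new car insurance 🚗"
print(emoji.demojize(text))
# emojis become :aliases: that can be treated as ordinary tokens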
Pre-processing the text data
➢ Stopwords handling
[Figure: "I need new car insurance" becomes "I car insurance", with "need" and "new" removed as stopwords]
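A sketch using NLTK's English stopword list; what survives depends entirely on the list you choose (NLTK, spaCy and custom lists all differ):

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
text = "I need a new car insurance"
print(" ".join(w for w in text.split() if w.lower() not in stop_words))
# need new car insurance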
Pre-processing the text data
➢ Cleaning HTML
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
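A minimal sketch with BeautifulSoup (the bs4 package from the link above):

from bs4 import BeautifulSoup

html = "<p>I need a <b>new</b> car insurance</p>"
text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
print(text)
# "I need a  new  car insurance" -- note the doubled spaces,
# which remove_space from earlier cleans up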
What kind of models to use?
➢ SVM
➢ Logistic Regression
➢ Gradient Boosting
➢ Neural Networks
Let’s look at a problem
Quora duplicate question identification
➢ ~ 13 million questions
➢ Many duplicate questions
➢ Cluster and join duplicates together
➢ Remove clutter
Non-duplicate questions
➢ Who should I address my cover letter to if I'm applying for a big company
like Mozilla?
➢ Which car is better from safety view? "swift or grand i10". My first priority is safety?
Duplicate questions
➢ How does Quora quickly mark questions as needing improvement?
➢ Why does Quora mark my questions as needing
improvement/clarification before I have time to give it details? Literally
within seconds…
➢ What practical applications might evolve from the discovery of the Higgs
Boson?
➢ What are some practical benefits of discovery of the Higgs Boson?
Dataset
Dataset: basic exploration
Basic feature engineering
➢ Length of question1
➢ Length of question2
➢ Difference in the two lengths
➢ Character length of question1 without spaces
➢ Character length of question2 without spaces
➢ Number of words in question1
➢ Number of words in question2
➢ Number of common words in question1 and question2
Basic feature engineering

data['len_common_words'] = data.apply(
    lambda x: len(
        set(str(x['question1']).lower().split()).intersection(
            set(str(x['question2']).lower().split()))),
    axis=1)
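The remaining basic features from the list above can be built along the same lines (a sketch; the column names are assumptions, not the slides' exact ones):

data['len_q1'] = data.question1.astype(str).apply(len)
data['len_q2'] = data.question2.astype(str).apply(len)
data['diff_len'] = data.len_q1 - data.len_q2
data['len_char_q1'] = data.question1.astype(str).apply(lambda s: len(s.replace(' ', '')))
data['len_char_q2'] = data.question2.astype(str).apply(lambda s: len(s.replace(' ', '')))
data['len_word_q1'] = data.question1.astype(str).apply(lambda s: len(s.split()))
data['len_word_q2'] = data.question2.astype(str).apply(lambda s: len(s.split()))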
Basic modelling
[Diagram: tabular data (basic features) → normalization → training and validation sets]
➢ Logistic Regression: 0.658
➢ XGB: 0.721
Fuzzy features
➢ Also known as approximate string matching
➢ Number of “primitive” operations required to convert one string into an exact match of the other
➢ Primitive operations:
○ Insertion
○ Deletion
○ Substitution
➢ Typically used for:
○ Spell checking
○ Plagiarism detection
○ DNA sequence matching
○ Spam filtering
Fuzzy features
➢ QRatio
➢ WRatio
➢ Token set ratio
➢ Token sort ratio
➢ Partial token set ratio
➢ Partial token sort ratio
https://github.com/seatgeek/fuzzywuzzy
Fuzzy features

from fuzzywuzzy import fuzz

data['fuzz_qratio'] = data.apply(
    lambda x: fuzz.QRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_WRatio'] = data.apply(
    lambda x: fuzz.WRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_ratio'] = data.apply(
    lambda x: fuzz.partial_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_token_set_ratio'] = data.apply(
    lambda x: fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
Fuzzy features
data['fuzz_partial_token_sort_ratio'] = data.apply(
lambda x: fuzz.partial_token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_set_ratio'] = data.apply(
lambda x: fuzz.token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_sort_ratio'] = data.apply(
lambda x: fuzz.token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
Improving models
[Diagram: tabular data (basic features + fuzzy features) → normalization → training and validation sets]
➢ Logistic Regression: 0.658 → 0.660
➢ XGB: 0.721 → 0.738
Can we improve it further?
Traditional handling of text data
➢ Hashing of words
➢ Count vectorization
➢ TF-IDF
➢ SVD
TF-IDF

            Number of times a term t appears in a document
TF(t) = -----------------------------------------------------
                Total number of terms in the document
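For completeness (only the TF part is shown above), the IDF factor that TF is multiplied by is:

                     Total number of documents
IDF(t) = log( --------------------------------------------- )
               Number of documents containing the term t

TF-IDF(t) = TF(t) * IDF(t)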
TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    min_df=3,
    max_features=None,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 2),
    use_idf=1,
    smooth_idf=1,
    sublinear_tf=1,
    stop_words='english'
)
SVD
➢ Latent semantic analysis
➢ scikit-learn version of SVD
➢ 120 components

from sklearn import decomposition

svd = decomposition.TruncatedSVD(n_components=120)
xtrain_svd = svd.fit_transform(xtrain)
xtest_svd = svd.transform(xtest)
Simply using TF-IDF: method-1
[Diagram: a separate TF-IDF vectorizer for question1 and for question2, features stacked]
➢ Logistic Regression: 0.777
➢ XGB: 0.749
(previously: LR 0.658 → 0.660, XGB 0.721 → 0.738)
Simply using TF-IDF: method-2
[Diagram: one TF-IDF vectorizer fitted on both questions, features stacked]
➢ Logistic Regression: 0.804
➢ XGB: 0.748
(previously: LR 0.658 → 0.660, XGB 0.721 → 0.738)
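A hedged sketch of the difference between the two methods (variable names are assumptions; is_duplicate is the target column of the Quora dataset):

from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

q1 = data.question1.astype(str)
q2 = data.question2.astype(str)
y = data.is_duplicate.values

# method-1: one vectorizer per question column
X_m1 = sparse.hstack([TfidfVectorizer(min_df=3).fit_transform(q1),
                      TfidfVectorizer(min_df=3).fit_transform(q2)])

# method-2: a single vectorizer fitted on both columns, applied to each
tfidf = TfidfVectorizer(min_df=3).fit(list(q1) + list(q2))
X_m2 = sparse.hstack([tfidf.transform(q1), tfidf.transform(q2)])

lr = LogisticRegression().fit(X_m2, y)  # train/validation splitting omitted for brevity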
Simply using TF-IDF + SVD: method-1
[Diagram: a separate TF-IDF and a separate SVD for question1 and for question2]
➢ Logistic Regression: 0.706
➢ XGB: 0.763
(previously: LR 0.658 → 0.660, XGB 0.721 → 0.738)
Simply using TF-IDF + SVD: method-2
[Diagram: a separate TF-IDF for each question, one SVD on the stacked features]
➢ Logistic Regression: 0.700
➢ XGB: 0.753
(previously: LR 0.658 → 0.660, XGB 0.721 → 0.738)
Simply using TF-IDF + SVD: method-3
[Diagram: one TF-IDF for both questions, one SVD on the stacked features]
➢ Logistic Regression: 0.714
➢ XGB: 0.759
(previously: LR 0.658 → 0.660, XGB 0.721 → 0.738)
Word embeddings
[Figure: word vectors in space — Berlin − Germany + France ≈ Paris]
Every word gets a position in space
Word embeddings
Spatial distances between embedding vectors:
➢ Cosine
➢ Manhattan
➢ Canberra
➢ Minkowski
➢ Braycurtis
Word embeddings features
Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K. (2015). From Word Embeddings to Document Distances.
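A sketch of turning the distances above into features, assuming a word → vector lookup such as the embeddings_index built from GloVe later in these slides, and simply averaging word vectors per question:

import numpy as np
from scipy.spatial import distance

def sent2vec(s, embeddings_index, dim=300):
    # average of the word vectors found in the sentence
    vecs = [embeddings_index[w] for w in str(s).lower().split() if w in embeddings_index]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v1 = sent2vec("How does Quora quickly mark questions as needing improvement?", embeddings_index)
v2 = sent2vec("Why does Quora mark my questions as needing improvement?", embeddings_index)

features = [distance.cosine(v1, v2),
            distance.cityblock(v1, v2),   # Manhattan
            distance.canberra(v1, v2),
            distance.minkowski(v1, v2, 3),
            distance.braycurtis(v1, v2)]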
Results comparison

# naive 1-D convolution of a signal x with a kernel h (assumes i >= j, no boundary handling)
for i in range(sample_length):
    y[i] = 0
    for j in range(kernel_length):
        y[i] += x[i - j] * h[j]
Embedding layer
➢ Simple layer
➢ Converts indexes to vectors
➢ [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
LSTM
Time distributed dense layer
Handling text data before training

from keras.preprocessing import text, sequence

tk = text.Tokenizer(nb_words=200000)
max_len = 40
tk.fit_on_texts(list(data.question1.values) + list(data.question2.values.astype(str)))

x1 = tk.texts_to_sequences(data.question1.values)
x1 = sequence.pad_sequences(x1, maxlen=max_len)
x2 = tk.texts_to_sequences(data.question2.values.astype(str))
x2 = sequence.pad_sequences(x2, maxlen=max_len)

word_index = tk.word_index

Handling text data before training

import numpy as np
from tqdm import tqdm

embeddings_index = {}
f = open('glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
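The models below expect an embedding_matrix built from these GloVe vectors and the tokenizer's word_index; a standard sketch:

import numpy as np

embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector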
Model 1 and Model 2

from keras.models import Sequential
from keras.layers import Embedding, TimeDistributed, Dense, Lambda
from keras import backend as K

# model 1: embedding -> time-distributed dense -> sum over time
model1 = Sequential()
model1.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix],
                     input_length=40, trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix],
                     input_length=40, trainable=False))
model2.add(TimeDistributed(Dense(300, activation='relu')))
model2.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
Final Deep Learning Model
Model 3 and Model 4
model3 = Sequential()
model3.add(Embedding(len(word_index) + 1,
300,
weights=[embedding_matrix],
input_length=40,
trainable=False))
model3.add(Convolution1D(nb_filter=nb_filter,
filter_length=filter_length,
border_mode='valid',
activation='relu',
subsample_length=1))
model3.add(Dropout(0.2))
.
.
.
model3.add(Dense(300))
model3.add(Dropout(0.2))
model3.add(BatchNormalization())
Final Deep Learning Model
Model 5 and Model 6
model5 = Sequential()
model5.add(Embedding(len(word_index) + 1, 300, input_length=40,
dropout=0.2))
model5.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))
model6 = Sequential()
model6.add(Embedding(len(word_index) + 1, 300, input_length=40,
dropout=0.2))
model6.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))
Final Deep Learning Model
Merged Model
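A sketch of how the six branches might plug together, assuming the same Keras 1.x Merge API used elsewhere in these slides; the layer sizes and the final head are assumptions, not the slides' exact code:

from keras.models import Sequential
from keras.layers import Merge, Dense, Dropout, BatchNormalization

# concatenate the six branch outputs and classify duplicate / not duplicate
merged_model = Sequential()
merged_model.add(Merge([model1, model2, model3, model4, model5, model6], mode='concat'))
merged_model.add(BatchNormalization())
merged_model.add(Dense(300, activation='relu'))
merged_model.add(Dropout(0.2))
merged_model.add(BatchNormalization())
merged_model.add(Dense(1, activation='sigmoid'))
merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])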
Time to Train the DeepNet
➢ Total params: 174,913,917
➢ Trainable params: 60,172,917
➢ Non-trainable params: 114,741,000
➢ NVIDIA Titan X
Time to Train the DeepNet
➢ The deep network was trained on an NVIDIA Titan X; each epoch took approximately 300 seconds, and training took 10-15 hours in total. The network achieved an accuracy of 0.848 (~0.85).
➢ The SOTA at that time was around 0.88 (the Bi-MPM model).
Can we end without talking about the muppets?
Of course!
Just kidding, no!
BERT
[Figure: an input sequence of up to 512 tokens feeding into a stack of Encoder Blocks]
How does BERT learn?
Example of tokenization:
https://github.com/huggingface/tokenizers
Approaching duplicate questions using BERT
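A minimal sketch of treating duplicate-question detection as sentence-pair classification with the HuggingFace transformers library (the model name, sequence length and training loop are assumptions, not the slides' exact code):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# question pair from earlier in the deck; label 1 = duplicate
enc = tokenizer(
    "What practical applications might evolve from the discovery of the Higgs Boson?",
    "What are some practical benefits of discovery of the Higgs Boson?",
    truncation=True, padding="max_length", max_length=128, return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**enc, labels=labels)
outputs.loss.backward()  # then step an optimizer (e.g. AdamW) over the full training set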
There is a lot more….
Maybe next time!
A few things to remember...
Fine-tuning often gives good results
➢ It is faster
➢ It is better (not always)
➢ Why reinvent the wheel?
Bigger isn’t always better
A good model has some key ingredients...
[Figure: the Powerpuff Girls recipe — sugar, all the things that are nice, and Chemical X — yielding A Good Machine Learning Model]
➢ Pre-processing
➢ Feature engineering
➢ Feature selection
➢ Post-processing
Approaching (almost) any machine learning problem: the book will be released in Summer 2020.