Approaching (almost) any NLP problem

@abhi1thakur
AI is like an imaginary friend most enterprises claim to have these days
3

I like big data and I cannot lie
5
Agenda
➢ Not so much intro
➢ Where is NLP used
➢ Pre-processing
➢ Machine Learning Models
➢ Solving a problem
➢ Traditional approaches
➢ Deep Learning Models
➢ Muppets

6
Applications of natural language processing

➢ Search engine
➢ Speech to text
➢ Sentiment classification
➢ Entity extraction
➢ Review rating prediction
➢ Topic extraction
➢ Autocomplete
➢ Question answering
➢ Chatbots / VAs
➢ Translation


Pre-processing the text data

can u he.lp me with loan? 😊

Abbreviations, unintentional characters, symbols, emojis

can you help me with loan ?


8
Pre-processing the text data
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML

9
Pre-processing the text data: Removing weird spaces

def remove_space(text):
    # strip leading/trailing whitespace and collapse internal runs of whitespace
    text = text.strip()
    text = text.split()
    return " ".join(text)

10
Pre-processing the text data: Tokenization

➢ Very important step
➢ Is not always about spaces
➢ Converts words into tokens
➢ Might be different for different languages
➢ Simplest is to use `word_tokenize` from NLTK
➢ Write your own ;)

11
Pre-processing the text data: Tokenization

import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

text = "hello, how are you?"
tokens = word_tokenize(text)
print(tokens)

Input:  hello, how are you?
Output: ['hello', ',', 'how', 'are', 'you', '?']


12
Pre-processing the text data: Spelling correction

➢ Very, very crucial step
➢ In chat: can u tel me abot new sim card pland?
➢ Most models will fail without spelling correction
➢ Peter Norvig’s spelling correction
➢ Make your own ;)

13
Pre-processing the text data: Spelling correction

[Figure: a character-level spelling-correction model (embeddings layer, bidirectional stacked char-LSTM, output layer) maps noisy inputs such as "I ned a new car insuraance", "I needd a new carr insurance", "I need a neew car insurance", "I need a new car insurancee", and "I need aa new car insurance" to the corrected sentence "I need a new car insurance".]
14
Pre-processing the text data: Spelling correction

def edits1(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

15
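The slide stops at the edit generators. For reference, a minimal sketch of the remaining pieces of a Norvig-style corrector, following his published approach (the corpus file `big.txt` and the `WORDS`, `P`, `known`, `candidates`, and `correction` names are assumptions, not part of the slides):

    import re
    from collections import Counter

    # word frequencies from any large plain-text corpus
    WORDS = Counter(re.findall(r'\w+', open('big.txt').read().lower()))

    def P(word, N=sum(WORDS.values())):
        # relative frequency of `word` in the corpus
        return WORDS[word] / N

    def known(words):
        # keep only candidates that actually occur in the corpus
        return set(w for w in words if w in WORDS)

    def candidates(word):
        # prefer the word itself, then 1-edit fixes, then 2-edit fixes
        return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

    def correction(word):
        # most probable correction, e.g. correction('pland') for the chat example
        return max(candidates(word), key=P)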
Pre-processing the text data: Contraction mapping

contraction = {
    "'cause": 'because',
    ',cause': 'because',
    ';cause': 'because',
    "ain't": 'am not',
    'ain,t': 'am not',
    'ain;t': 'am not',
    'ain´t': 'am not',
    'ain’t': 'am not',
    "aren't": 'are not',
    'aren,t': 'are not',
    'aren;t': 'are not',
    'aren´t': 'are not',
    'aren’t': 'are not'
}

16
Pre-processing the text data: Contraction mapping

def mapping_replacer(x, dic):
    for word in dic.keys():
        if " " + word + " " in x:
            x = x.replace(" " + word + " ", " " + dic[word] + " ")
    return x

17
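For reference, a quick usage example with the `contraction` dictionary from the previous slide (the sample sentence is illustrative; the function only matches space-delimited words, hence the padding):

    text = "i ain't sure this aren't covered"
    print(mapping_replacer(" " + text + " ", contraction).strip())
    # -> i am not sure this are not covered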
Pre-processing the text data: Stemming

➢ Reduces words to their root form
➢ Why is stemming important?
➢ NLTK stemmers

18
Pre-processing the text data: Stemming

fishing, fished, fishes → fish

In [1]: from nltk.stem import SnowballStemmer
In [2]: s = SnowballStemmer('english')
In [3]: s.stem("fishing")
Out[3]: 'fish'

19
Pre-processing the text data: Emoji handling

pip install emoji

import emoji
emojis = emoji.UNICODE_EMOJI

20
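A minimal sketch of two common ways to handle emojis with the `emoji` package (note: recent versions of the package replaced `emoji.UNICODE_EMOJI` with `emoji.EMOJI_DATA`, and `replace_emoji` is only available in emoji >= 2.0):

    import emoji

    text = "can u help me with loan? 😊"

    # convert emojis to a textual description the model can read
    print(emoji.demojize(text))
    # -> can u help me with loan? :smiling_face_with_smiling_eyes:

    # or strip them out entirely
    print(emoji.replace_emoji(text, replace=""))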
Pre-processing the text data: Stopwords handling

I need new car insurance  →  I car insurance
(stopword removal dropped "need" and "new")

21
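A minimal sketch of stopword removal with NLTK (which words get dropped depends entirely on the stopword list you choose, so treat such lists with care):

    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    stop_words = set(stopwords.words('english'))

    text = "I need new car insurance"
    tokens = word_tokenize(text.lower())

    # keep only tokens that are not in the stopword list
    filtered = [t for t in tokens if t not in stop_words]
    print(filtered)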
Pre-processing the text data: Cleaning HTML

➢ Use BeautifulSoup to strip HTML markup from the text
➢ https://www.crummy.com/software/BeautifulSoup/bs4/doc/

22
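A minimal sketch of HTML cleaning with BeautifulSoup (the HTML snippet is illustrative):

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    html = "<p>can <b>you</b> help me with <a href='#'>loan</a>?</p>"

    # get_text() drops the markup and keeps only the visible text
    soup = BeautifulSoup(html, "html.parser")
    print(soup.get_text(separator=" ", strip=True))
    # -> can you help me with loan ?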
What kind of models to use?

➢ SVM
➢ Logistic Regression
➢ Gradient Boosting
➢ Neural Networks

26
Let’s look at a problem

27
Quora duplicate question identification

➢ ~ 13 million questions
➢ Many duplicate questions
➢ Cluster and join duplicates together
➢ Remove clutter

28
Non-duplicate questions
➢ Who should I address my cover letter to if I'm applying for a big company like Mozilla?
➢ Which car is better from safety view? "swift or grand i10". My first priority is safety?

➢ How can I start an online shopping (e-commerce) website?
➢ Which web technology is best suitable for building a big E-Commerce website?

29
Duplicate questions
➢ How does Quora quickly mark questions as needing improvement?
➢ Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…

➢ What practical applications might evolve from the discovery of the Higgs Boson?
➢ What are some practical benefits of discovery of the Higgs Boson?

30
Dataset

➢ 400,000+ pairs of questions


➢ Initially data was very skewed
➢ Negative sampling
➢ Noise exists (as usual)

31
Dataset

➢ 255045 negative samples (non-duplicates)
➢ 149306 positive samples (duplicates)
➢ ~37% positive samples

32
Dataset: basic exploration

➢ Average number of characters in question1: 59.57


➢ Minimum number of characters in question1: 1
➢ Maximum number of characters in question1: 623

➢ Average number of characters in question2: 60.14


➢ Minimum number of characters in question2: 1
➢ Maximum number of characters in question2: 1169

33
Basic feature engineering
➢ Length of question1
➢ Length of question2
➢ Difference in the two lengths
➢ Character length of question1 without spaces
➢ Character length of question2 without spaces
➢ Number of words in question1
➢ Number of words in question2
➢ Number of common words in question1 and question2

34
Basic feature engineering

data['len_q1'] = data.question1.apply(lambda x: len(str(x)))
data['len_q2'] = data.question2.apply(lambda x: len(str(x)))
data['diff_len'] = data.len_q1 - data.len_q2

# character length without spaces
data['len_char_q1'] = data.question1.apply(lambda x: len(str(x).replace(' ', '')))
data['len_char_q2'] = data.question2.apply(lambda x: len(str(x).replace(' ', '')))

data['len_word_q1'] = data.question1.apply(lambda x: len(str(x).split()))
data['len_word_q2'] = data.question2.apply(lambda x: len(str(x).split()))

35
Basic feature engineering

data['len_common_words'] = data.apply(lambda x: len(
    set(str(x['question1']).lower().split()).intersection(
        set(str(x['question2']).lower().split()))), axis=1)
Basic modelling

➢ Tabular data (basic features) with normalization, split into a training set and a validation set
➢ Logistic Regression: 0.658
➢ XGB: 0.721
Fuzzy features
➢ Also known as approximate string matching
➢ Number of “primitive” operations required to convert one string into the other
➢ Primitive operations:
○ Insertion
○ Deletion
○ Substitution
➢ Typically used for:
○ Spell checking
○ Plagiarism detection
○ DNA sequence matching
○ Spam filtering
38
Fuzzy features

➢ pip install fuzzywuzzy


➢ Uses Levenshtein distance

➢ QRatio
➢ WRatio
➢ Token set ratio
➢ Token sort ratio
➢ Partial token set ratio
➢ Partial token sort ratio
39
https://github.com/seatgeek/fuzzywuzzy
Fuzzy features

from fuzzywuzzy import fuzz

data['fuzz_qratio'] = data.apply(
    lambda x: fuzz.QRatio(str(x['question1']), str(x['question2'])), axis=1)

data['fuzz_WRatio'] = data.apply(
    lambda x: fuzz.WRatio(str(x['question1']), str(x['question2'])), axis=1)

data['fuzz_partial_ratio'] = data.apply(
    lambda x: fuzz.partial_ratio(str(x['question1']), str(x['question2'])), axis=1)

data['fuzz_partial_token_set_ratio'] = data.apply(
    lambda x: fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
40
Fuzzy features

data['fuzz_partial_token_sort_ratio'] = data.apply(
    lambda x: fuzz.partial_token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)

data['fuzz_token_set_ratio'] = data.apply(
    lambda x: fuzz.token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)

data['fuzz_token_sort_ratio'] = data.apply(
    lambda x: fuzz.token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)

41
Improving models

➢ Tabular data (basic features + fuzzy features) with normalization, split into a training set and a validation set
➢ Logistic Regression: 0.658 → 0.660
➢ XGB: 0.721 → 0.738
Can we improve it further?

43
Traditional handling of text data

➢ Hashing of words
➢ Count vectorization
➢ TF-IDF
➢ SVD

46
TF-IDF
TF(t) = (number of times term t appears in a document) / (total number of terms in the document)

IDF(t) = log( (total number of documents) / (number of documents containing term t) )

TF-IDF(t) = TF(t) * IDF(t)
47
TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    min_df=3,
    max_features=None,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 2),
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=True,
    stop_words='english'
)
48
SVD
➢ Latent semantic analysis
➢ scikit-learn version of SVD
➢ 120 components

from sklearn import decomposition

svd = decomposition.TruncatedSVD(n_components=120)
xtrain_svd = svd.fit_transform(xtrain)
xtest_svd = svd.transform(xtest)

49
Simply using TF-IDF: method-1

➢ Fit a separate TF-IDF vectorizer on question-1 and on question-2, and feed both vectors to the model
➢ Logistic Regression: 0.777, XGB: 0.749 (previous best: LR 0.660, XGB 0.738)

Simply using TF-IDF: method-2

➢ Fit a single TF-IDF vectorizer on both questions and transform each question with it
➢ Logistic Regression: 0.804, XGB: 0.748

Simply using TF-IDF + SVD: method-1

➢ Separate TF-IDF for each question, followed by a separate SVD for each
➢ Logistic Regression: 0.706, XGB: 0.763

Simply using TF-IDF + SVD: method-2

➢ Separate TF-IDF for each question, followed by a single SVD fit on the stacked output
➢ Logistic Regression: 0.700, XGB: 0.753

Simply using TF-IDF + SVD: method-3

➢ A single TF-IDF over both questions, followed by a single SVD
➢ Logistic Regression: 0.714, XGB: 0.759
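As a reference, a minimal sketch of a method-1 style pipeline (the `train` dataframe, column names, and vectorizer settings here are assumptions, not the exact code behind the numbers above):

    import scipy.sparse as sp
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # method-1: one TF-IDF vectorizer per question column
    tfidf_q1 = TfidfVectorizer(min_df=3, ngram_range=(1, 2), stop_words='english')
    tfidf_q2 = TfidfVectorizer(min_df=3, ngram_range=(1, 2), stop_words='english')

    q1 = tfidf_q1.fit_transform(train.question1.astype(str))
    q2 = tfidf_q2.fit_transform(train.question2.astype(str))

    # stack the two sparse matrices side by side and train on the result
    xtrain = sp.hstack([q1, q2]).tocsr()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(xtrain, train.is_duplicate.values)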
Word embeddings

WORD → [ d1, d2, …, dn ]

➢ A multi-dimensional vector for every word in the vocabulary
➢ Often gives great insights
➢ Very popular in natural language processing tasks
➢ Google News vectors (300d)
➢ GloVe
➢ FastText
Word embeddings

➢ Every word gets a position in space
➢ Berlin - Germany + France ~ Paris
Word embeddings

➢ Embeddings for words


➢ Embeddings for whole sentence
Word embeddings

import numpy as np

def sent2vec(s, model, stop_words, tokenizer):
    words = str(s).lower()
    words = tokenizer(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        # note: model[w] raises KeyError for out-of-vocabulary words
        M.append(model[w])
    M = np.array(M)
    v = M.sum(axis=0)
    # L2-normalize the summed vector
    return v / np.sqrt((v ** 2).sum())
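A minimal usage sketch, assuming the pre-trained Google News vectors and NLTK (the file name and tokenizer choice are assumptions):

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from gensim.models import KeyedVectors

    nltk.download('stopwords')
    nltk.download('punkt')

    # 300d Google News vectors (large download, several GB in memory)
    model = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin.gz', binary=True)

    stop_words = set(stopwords.words('english'))
    vec = sent2vec("I need a new car insurance", model, stop_words, word_tokenize)
    print(vec.shape)  # (300,)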
Word embeddings features
Spatial distances:
➢ Euclidean
➢ Manhattan
➢ Cosine
➢ Canberra
➢ Minkowski
➢ Braycurtis
Word embeddings features

Statistical features:
➢ Skew: equals 0 for a normal distribution
➢ Skew > 0: the right tail is longer (mass concentrated on the left)
➢ Kurtosis: 4th central moment over the square of the variance
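A minimal sketch of how these features could be computed from two sentence vectors (function and feature names are assumptions, not the talk's exact feature code):

    import numpy as np
    from scipy.spatial import distance
    from scipy.stats import skew, kurtosis

    def embedding_features(v1, v2):
        # pairwise distance features between the two sentence vectors
        feats = {
            'euclidean': distance.euclidean(v1, v2),
            'manhattan': distance.cityblock(v1, v2),
            'cosine': distance.cosine(v1, v2),
            'canberra': distance.canberra(v1, v2),
            'minkowski': distance.minkowski(v1, v2, 3),
            'braycurtis': distance.braycurtis(v1, v2),
        }
        # per-vector statistical features
        feats.update({
            'skew_q1': skew(v1), 'skew_q2': skew(v2),
            'kurtosis_q1': kurtosis(v1), 'kurtosis_q2': kurtosis(v2),
        })
        return feats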
Word mover’s distance: WMD

Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K.. (2015). From Word Embeddings To Document Distances.
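For reference, gensim exposes WMD directly on its word vectors; a minimal sketch (the tokenized sentences are illustrative, and depending on the gensim version you may need the pyemd or POT package installed):

    # assumes `model` is a gensim KeyedVectors instance, e.g. the Google News vectors above
    doc1 = "i need a new car insurance".split()
    doc2 = "looking for new car insurance".split()

    wmd = model.wmdistance(doc1, doc2)  # lower distance = more similar documents
    print(wmd)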
Results comparison

Features                                                     Logistic Regression accuracy    XGBoost accuracy
Basic Features                                               0.658                           0.721
Basic Features + Fuzzy Features                              0.660                           0.738
Basic + Fuzzy + Word2Vec Features                            0.676                           0.766
Word2Vec Features                                            X                               0.78
Basic + Fuzzy + Word2Vec Features + Full Word2Vec Vectors    X                               0.814
TFIDF + SVD (Best Combination)                               0.804                           0.763


What can deep learning do?

➢ Natural language processing


➢ Speech processing
➢ Computer vision
➢ And much more
1-D CNN

➢ One dimensional convolutional layer
➢ Temporal convolution
➢ Simple to implement:

for i in range(sample_length):
    y[i] = 0
    for j in range(kernel_length):
        if i - j >= 0:
            y[i] += x[i - j] * h[j]
LSTM

➢ Long short-term memory


➢ A type of RNN
➢ Used two LSTM layers
Embedding layers

➢ Simple layer
➢ Converts indexes to vectors
➢ [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
Time distributed dense layer

➢ TimeDistributed wrapper around dense layer


➢ TimeDistributed applies the layer to every temporal slice of input
➢ Followed by Lambda layer
➢ Implements “translation” layer used by Stephen Merity (keras snli
model)
model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
Handling text data before training

tk = text.Tokenizer(nb_words=200000)
max_len = 40
tk.fit_on_texts(list(data.question1.values) + list(data.question2.values.astype(str)))
x1 = tk.texts_to_sequences(data.question1.values)
x1 = sequence.pad_sequences(x1, maxlen=max_len)
x2 = tk.texts_to_sequences(data.question2.values.astype(str))
x2 = sequence.pad_sequences(x2, maxlen=max_len)
word_index = tk.word_index
Handling text data before training

embeddings_index = {}
f = open('glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
Handling text data before training

embedding_matrix = np.zeros((len(word_index) + 1, 300))

for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
Basis of deep learning model
➢ Keras-snli model: https://github.com/Smerity/keras_snli
Creating the deep learning model
Final Deep Learning Model
Model 1 and Model 2

model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model2.add(TimeDistributed(Dense(300, activation='relu')))
model2.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
Final Deep Learning Model
Model 3 and Model 4
model3 = Sequential()
model3.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model3.add(Convolution1D(nb_filter=nb_filter,
                         filter_length=filter_length,
                         border_mode='valid',
                         activation='relu',
                         subsample_length=1))
model3.add(Dropout(0.2))
.
.
.
model3.add(Dense(300))
model3.add(Dropout(0.2))
model3.add(BatchNormalization())
Final Deep Learning Model
Model 5 and Model 6
model5 = Sequential()
model5.add(Embedding(len(word_index) + 1, 300, input_length=40, dropout=0.2))
model5.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model6 = Sequential()
model6.add(Embedding(len(word_index) + 1, 300, input_length=40, dropout=0.2))
model6.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))
Final Deep Learning Model
Merged Model
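The merged architecture itself appears only as a figure in the deck. A minimal sketch in the same Keras 1.x style as the code above (the dense size, dropout, batch size, and epoch count are assumptions, not the talk's exact values; it assumes `x1`, `x2` from the tokenization step and the dataset's `is_duplicate` labels):

    from keras.layers import Merge  # Keras 1.x API, matching the rest of the code

    merged_model = Sequential()
    merged_model.add(Merge([model1, model2, model3, model4, model5, model6], mode='concat'))
    merged_model.add(BatchNormalization())
    merged_model.add(Dense(300, activation='relu'))
    merged_model.add(Dropout(0.2))
    merged_model.add(Dense(1, activation='sigmoid'))

    merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # each tower consumes its own copy of the padded question sequences
    merged_model.fit([x1, x2, x1, x2, x1, x2], data.is_duplicate.values,
                     batch_size=384, nb_epoch=10, validation_split=0.1, shuffle=True)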
Time to Train the DeepNet
➢ Total params: 174,913,917
➢ Trainable params: 60,172,917
➢ Non-trainable params: 114,741,000

➢ NVIDIA Titan X
Time to Train the DeepNet

➢ The deep network was trained on an NVIDIA Titan X; each epoch took roughly 300 seconds, and full training took 10-15 hours. The network achieved an accuracy of 0.848 (~0.85).
➢ The SOTA at that time was around 0.88 (the BiMPM model).
Can we end without talking about the muppets?
Of course!
Just kidding, no!
BERT

➢ Based on transformer encoder


➢ Each encoder block has self-attention
➢ Encoder blocks: 12 or 24
➢ Feed forward hidden units: 768 or 1024
➢ Attention heads: 12 or 16
BERT encoder block

[Figure: an encoder block takes a sequence of up to 512 input tokens and produces, for each position, a vector of size 768 or 1024.]
How does BERT learn?

➢ BERT has a fixed vocab
➢ BERT has encoder blocks (transformer blocks)
➢ A word is masked and BERT tries to predict that word
➢ BERT training also tries to predict the next sentence
➢ By combining the losses from the two objectives above, BERT learns
BERT tokenization
➢ [CLS] TOKENS [SEP]
➢ [CLS] TOKENS_A [SEP] TOKENS_B [SEP]

Example of tokenization:

hi, everyone! this is tokenization example

[CLS] hi , everyone ! this is token ##ization example [SEP]



https://github.com/huggingface/tokenizers
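A minimal sketch reproducing the example with the Hugging Face tokenizer (the `bert-base-uncased` checkpoint is an assumption):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    text = "hi, everyone! this is tokenization example"
    print(tokenizer.tokenize(text))
    # ['hi', ',', 'everyone', '!', 'this', 'is', 'token', '##ization', 'example']

    # encoding adds the special tokens around the word pieces
    ids = tokenizer(text)["input_ids"]
    print(tokenizer.convert_ids_to_tokens(ids))
    # ['[CLS]', 'hi', ',', 'everyone', '!', 'this', 'is', 'token', '##ization', 'example', '[SEP]']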
Approaching duplicate questions using BERT
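The walkthrough in the deck is a series of screenshots. As a hedged sketch of the idea — treating the question pair as a [CLS] question1 [SEP] question2 [SEP] sequence-pair classification problem with the transformers library (checkpoint, max length, and training loop are assumptions, not the talk's exact code):

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # the question pair is encoded as [CLS] question1 [SEP] question2 [SEP]
    enc = tokenizer(
        "How do I read a book faster?",
        "What can I do to read books more quickly?",
        truncation=True, max_length=128, return_tensors="pt")

    labels = torch.tensor([1])  # 1 = duplicate, 0 = not duplicate
    out = model(**enc, labels=labels)

    out.loss.backward()  # plug this into an optimizer / training loop over the dataset
    print(out.logits.shape)  # (1, 2)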
There is a lot more…
Maybe next time!
A few things to remember...
Fine-tuning often gives good results

➢ It is faster
➢ It is better (not always)
➢ Why reinvent the wheel?
Bigger isn’t always better
A good model has some key ingredients...
Sugar
➢ Understanding the data
➢ Exploring the data

Spice
➢ Pre-processing
➢ Feature engineering
➢ Feature selection

All the things that are nice
➢ A good cross validation
➢ Low error rate
➢ A simple model, or a combination of models
➢ Post-processing

Chemical X

= A Good Machine Learning Model
Approaching (almost) any machine learning problem: the book will release in Summer 2020.

Fill out the form here to be the first one to know when it’s ready to buy:
http://bit.ly/approachingalmost

➢ e-mail: abhishek4@gmail.com
➢ LinkedIn: linkedin.com/in/abhi1thakur
➢ kaggle: kaggle.com/abhishek
➢ tweet me: @abhi1thakur
➢ YouTube: youtube.com/AbhishekThakurAbhi