Approaching Almost Any NLP Problem
@abhi1thakur
AI is like an imaginary friend most enterprises claim to have these days
I like big data and I cannot lie
Agenda
➢ Not so much intro
➢ Where is NLP used
➢ Pre-processing
➢ Machine Learning Models
➢ Solving a problem
➢ Traditional approaches
➢ Deep Learning Models
➢ Muppets
Applications of natural language processing
➢ Sentiment classification
➢ Entity extraction
➢ Review rating prediction
Real-world text comes with abbreviations, symbols, emojis and unintentional characters.
Pre-processing the text data
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML

def remove_space(text):
    text = text.strip()
    text = text.split()
    return " ".join(text)
Pre-processing the text data
➢ Tokenization
    ➢ Very important step
    ➢ Is not always about spaces
    ➢ Converts words into tokens
    ➢ Might be different for different languages
    ➢ Simplest is to use `word_tokenize` from NLTK
    ➢ Write your own ;)
Pre-processing the text data
➢ Tokenization

import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

text = "hello, how are you?"
tokens = word_tokenize(text)
print(tokens)
# ['hello', ',', 'how', 'are', 'you', '?']
Pre-processing the text data
➢ Spelling correction
[Figure: misspelled inputs ("I ned a new car insuraance", "I needd a new carr insurance", "I need a neew car insurance") corrected to "I need a new car insurance" by a model with an embeddings layer, a bidirectional stacked char-LSTM, and an output layer]
Pre-processing the text data
➢ Spelling correction

def edits1(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
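These edit functions follow Peter Norvig's spelling corrector; a minimal sketch of how they are typically wired into a corrector, assuming a plain-text corpus file (the file name big.txt here is a placeholder) to estimate word frequencies:

from collections import Counter
import re

# word frequencies from a large plain-text corpus
WORDS = Counter(re.findall(r'\w+', open('big.txt').read().lower()))

def known(words):
    # keep only candidates that actually occur in the corpus
    return set(w for w in words if w in WORDS)

def candidates(word):
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    # pick the most frequent known candidate
    return max(candidates(word), key=lambda w: WORDS[w])

print(correction('insuraance'))  # -> 'insurance', given a reasonable corpus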
Pre-processing the text data
➢ Contraction mapping

contraction = {
    "'cause": 'because',
    ',cause': 'because',
    ';cause': 'because',
    "ain't": 'am not',
    'ain,t': 'am not',
    'ain;t': 'am not',
    'ain´t': 'am not',
    'ain’t': 'am not',
    "aren't": 'are not',
    'aren,t': 'are not',
    'aren;t': 'are not',
    'aren´t': 'are not',
    'aren’t': 'are not'
}
Pre-processing the text data
➢ Contraction mapping

def mapping_replacer(x, dic):
    # replaces only space-delimited occurrences, so pad the input with spaces
    for word in dic.keys():
        if " " + word + " " in x:
            x = x.replace(" " + word + " ", " " + dic[word] + " ")
    return x
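A quick usage sketch; because the helper only matches space-delimited occurrences, the input is padded with spaces:

text = "i ain't sure they aren't duplicates"
print(mapping_replacer(" " + text + " ", contraction).strip())
# i am not sure they are not duplicates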
Pre-processing the text data
➢ Stemming
    ➢ Reduces words to their root form
    ➢ Why is stemming important?
    ➢ NLTK stemmers
Pre-processing the text data
➢ Stemming
[Figure: "fishing", "fished" and "fishes" all stem to "fish"]

In [1]: from nltk.stem import SnowballStemmer
In [2]: s = SnowballStemmer('english')
In [3]: s.stem("fishing")
Out[3]: 'fish'
Pre-processing the text data
➢ Emoji handling

pip install emoji

import emoji
emojis = emoji.UNICODE_EMOJI
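A small sketch of putting the package to work, e.g. turning emojis into text aliases so a word-level model can see them (newer versions of the emoji package expose the lookup table differently, so treat the attribute above as version-dependent):

import emoji

text = "I need a new car insurance 🚗"
print(emoji.demojize(text))
# emojis become :aliases: that can be treated as ordinary tokens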
Pre-processing the text data
➢ Stopwords handling
[Figure: "I need new car insurance" becomes "I car insurance", with "need" and "new" removed as stopwords]
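A sketch using NLTK's English stopword list; what survives depends entirely on the list you choose (NLTK, spaCy and custom lists all differ):

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
text = "I need a new car insurance"
print(" ".join(w for w in text.split() if w.lower() not in stop_words))
# need new car insurance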
Pre-processing the text data
➢ Cleaning HTML
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
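A minimal sketch with BeautifulSoup (the bs4 package from the link above):

from bs4 import BeautifulSoup

html = "<p>I need a <b>new</b> car insurance</p>"
text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
print(text)
# "I need a  new  car insurance" -- note the doubled spaces,
# which remove_space from earlier cleans up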
What kind of models to use?
➢ SVM
➢ Logistic Regression
➢ Gradient Boosting
➢ Neural Networks
Let’s look at a problem
Quora duplicate question identification
➢ ~ 13 million questions
➢ Many duplicate questions
➢ Cluster and join duplicates together
➢ Remove clutter
Non-duplicate questions
➢ Who should I address my cover letter to if I'm applying for a big company
like Mozilla?
➢ Which car is better from safety view? "swift or grand i10". My first priority is safety?
Duplicate questions
➢ How does Quora quickly mark questions as needing improvement?
➢ Why does Quora mark my questions as needing
improvement/clarification before I have time to give it details? Literally
within seconds…
➢ What practical applications might evolve from the discovery of the Higgs
Boson?
➢ What are some practical benefits of discovery of the Higgs Boson?
Dataset
Dataset: basic exploration
Basic feature engineering
➢ Length of question1
➢ Length of question2
➢ Difference in the two lengths
➢ Character length of question1 without spaces
➢ Character length of question2 without spaces
➢ Number of words in question1
➢ Number of words in question2
➢ Number of common words in question1 and question2
Basic feature engineering

data['len_common_words'] = data.apply(
    lambda x: len(
        set(str(x['question1']).lower().split()).intersection(
            set(str(x['question2']).lower().split()))),
    axis=1)
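The remaining basic features from the list above can be built along the same lines (a sketch; the column names are assumptions, not the slides' exact ones):

data['len_q1'] = data.question1.astype(str).apply(len)
data['len_q2'] = data.question2.astype(str).apply(len)
data['diff_len'] = data.len_q1 - data.len_q2
data['len_char_q1'] = data.question1.astype(str).apply(lambda s: len(s.replace(' ', '')))
data['len_char_q2'] = data.question2.astype(str).apply(lambda s: len(s.replace(' ', '')))
data['len_word_q1'] = data.question1.astype(str).apply(lambda s: len(s.split()))
data['len_word_q2'] = data.question2.astype(str).apply(lambda s: len(s.split()))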
Basic modelling
[Diagram: tabular data (basic features) → normalization → training and validation sets]
➢ Logistic Regression: 0.658
➢ XGB: 0.721
Fuzzy features
➢ Also known as approximate string matching
➢ Number of “primitive” operations required to convert one string into an exact match of the other
➢ Primitive operations:
○ Insertion
○ Deletion
○ Substitution
➢ Typically used for:
○ Spell checking
○ Plagiarism detection
○ DNA sequence matching
○ Spam filtering
Fuzzy features
➢ QRatio
➢ WRatio
➢ Token set ratio
➢ Token sort ratio
➢ Partial token set ratio
➢ Partial token sort ratio
https://github.com/seatgeek/fuzzywuzzy
Fuzzy features

from fuzzywuzzy import fuzz

data['fuzz_qratio'] = data.apply(
    lambda x: fuzz.QRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_WRatio'] = data.apply(
    lambda x: fuzz.WRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_ratio'] = data.apply(
    lambda x: fuzz.partial_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_token_set_ratio'] = data.apply(
    lambda x: fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
Fuzzy features
data['fuzz_partial_token_sort_ratio'] = data.apply(
lambda x: fuzz.partial_token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_set_ratio'] = data.apply(
lambda x: fuzz.token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_sort_ratio'] = data.apply(
lambda x: fuzz.token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
Improving models
[Diagram: tabular data (basic features + fuzzy features) → normalization → training and validation sets]
➢ Logistic Regression: 0.658 → 0.660
➢ XGB: 0.721 → 0.738
Can we improve it further?
Traditional handling of text data
➢ Hashing of words
➢ Count vectorization
➢ TF-IDF
➢ SVD
TF-IDF

            Number of times a term t appears in a document
TF(t) = -----------------------------------------------------
                Total number of terms in the document
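For completeness (only the TF part is shown above), the IDF factor that TF is multiplied by is:

                     Total number of documents
IDF(t) = log( --------------------------------------------- )
               Number of documents containing the term t

TF-IDF(t) = TF(t) * IDF(t)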
TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    min_df=3,
    max_features=None,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 2),
    use_idf=1,
    smooth_idf=1,
    sublinear_tf=1,
    stop_words='english'
)
SVD
➢ Latent semantic analysis
➢ scikit-learn version of SVD
➢ 120 components

from sklearn import decomposition

svd = decomposition.TruncatedSVD(n_components=120)
xtrain_svd = svd.fit_transform(xtrain)
xtest_svd = svd.transform(xtest)
Simply using TF-IDF: method-1
[Diagram: a separate TF-IDF vectorizer for question1 and for question2, features stacked]
➢ Logistic Regression: 0.777
➢ XGB: 0.749
(previously: LR 0.658 → 0.660, XGB 0.721 → 0.738)
Simply using TF-IDF: method-2
[Diagram: one TF-IDF vectorizer fitted on both questions, features stacked]
➢ Logistic Regression: 0.804
➢ XGB: 0.748
(previously: LR 0.658 → 0.660, XGB 0.721 → 0.738)
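A hedged sketch of the difference between the two methods (variable names are assumptions; is_duplicate is the target column of the Quora dataset):

from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

q1 = data.question1.astype(str)
q2 = data.question2.astype(str)
y = data.is_duplicate.values

# method-1: one vectorizer per question column
X_m1 = sparse.hstack([TfidfVectorizer(min_df=3).fit_transform(q1),
                      TfidfVectorizer(min_df=3).fit_transform(q2)])

# method-2: a single vectorizer fitted on both columns, applied to each
tfidf = TfidfVectorizer(min_df=3).fit(list(q1) + list(q2))
X_m2 = sparse.hstack([tfidf.transform(q1), tfidf.transform(q2)])

lr = LogisticRegression().fit(X_m2, y)  # train/validation splitting omitted for brevity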
Simply using TF-IDF + SVD: method-1
[Diagram: a separate TF-IDF and a separate SVD for question1 and for question2]
➢ Logistic Regression: 0.706
➢ XGB: 0.763
(previously: LR 0.658 → 0.660, XGB 0.721 → 0.738)
Simply using TF-IDF + SVD: method-2
[Diagram: a separate TF-IDF for each question, one SVD on the stacked features]
➢ Logistic Regression: 0.700
➢ XGB: 0.753
(previously: LR 0.658 → 0.660, XGB 0.721 → 0.738)
Simply using TF-IDF + SVD: method-3
[Diagram: one TF-IDF for both questions, one SVD on the stacked features]
➢ Logistic Regression: 0.714
➢ XGB: 0.759
(previously: LR 0.658 → 0.660, XGB 0.721 → 0.738)
Word embeddings
[Figure: word vectors in space — Berlin − Germany + France ≈ Paris]
Every word gets a position in space
Word embeddings
Spatial distances between embedding vectors:
➢ Cosine
➢ Manhattan
➢ Canberra
➢ Minkowski
➢ Braycurtis
Word embeddings features
Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K. (2015). From Word Embeddings to Document Distances.
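A sketch of turning the distances above into features, assuming a word → vector lookup such as the embeddings_index built from GloVe later in these slides, and simply averaging word vectors per question:

import numpy as np
from scipy.spatial import distance

def sent2vec(s, embeddings_index, dim=300):
    # average of the word vectors found in the sentence
    vecs = [embeddings_index[w] for w in str(s).lower().split() if w in embeddings_index]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v1 = sent2vec("How does Quora quickly mark questions as needing improvement?", embeddings_index)
v2 = sent2vec("Why does Quora mark my questions as needing improvement?", embeddings_index)

features = [distance.cosine(v1, v2),
            distance.cityblock(v1, v2),   # Manhattan
            distance.canberra(v1, v2),
            distance.minkowski(v1, v2, 3),
            distance.braycurtis(v1, v2)]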
Results comparison

# naive 1-D convolution of a signal x with a kernel h (assumes i >= j, no boundary handling)
for i in range(sample_length):
    y[i] = 0
    for j in range(kernel_length):
        y[i] += x[i - j] * h[j]
Embedding layer
➢ Simple layer
➢ Converts indexes to vectors
➢ [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
LSTM
Time distributed dense layer
Handling text data before training

from keras.preprocessing import text, sequence

tk = text.Tokenizer(nb_words=200000)
max_len = 40
tk.fit_on_texts(list(data.question1.values) + list(data.question2.values.astype(str)))

x1 = tk.texts_to_sequences(data.question1.values)
x1 = sequence.pad_sequences(x1, maxlen=max_len)
x2 = tk.texts_to_sequences(data.question2.values.astype(str))
x2 = sequence.pad_sequences(x2, maxlen=max_len)

word_index = tk.word_index

Handling text data before training

import numpy as np
from tqdm import tqdm

embeddings_index = {}
f = open('glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
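The models below expect an embedding_matrix built from these GloVe vectors and the tokenizer's word_index; a standard sketch:

import numpy as np

embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector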
Model 1 and Model 2

from keras.models import Sequential
from keras.layers import Embedding, TimeDistributed, Dense, Lambda
from keras import backend as K

# model 1: embedding -> time-distributed dense -> sum over time
model1 = Sequential()
model1.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix],
                     input_length=40, trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix],
                     input_length=40, trainable=False))
model2.add(TimeDistributed(Dense(300, activation='relu')))
model2.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
Final Deep Learning Model
Model 3 and Model 4
model3 = Sequential()
model3.add(Embedding(len(word_index) + 1,
300,
weights=[embedding_matrix],
input_length=40,
trainable=False))
model3.add(Convolution1D(nb_filter=nb_filter,
filter_length=filter_length,
border_mode='valid',
activation='relu',
subsample_length=1))
model3.add(Dropout(0.2))
.
.
.
model3.add(Dense(300))
model3.add(Dropout(0.2))
model3.add(BatchNormalization())
Final Deep Learning Model
Model 5 and Model 6
model5 = Sequential()
model5.add(Embedding(len(word_index) + 1, 300, input_length=40,
dropout=0.2))
model5.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))
model6 = Sequential()
model6.add(Embedding(len(word_index) + 1, 300, input_length=40,
dropout=0.2))
model6.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))
Final Deep Learning Model
Merged Model
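A sketch of how the six branches might plug together, assuming the same Keras 1.x Merge API used elsewhere in these slides; the layer sizes and the final head are assumptions, not the slides' exact code:

from keras.models import Sequential
from keras.layers import Merge, Dense, Dropout, BatchNormalization

# concatenate the six branch outputs and classify duplicate / not duplicate
merged_model = Sequential()
merged_model.add(Merge([model1, model2, model3, model4, model5, model6], mode='concat'))
merged_model.add(BatchNormalization())
merged_model.add(Dense(300, activation='relu'))
merged_model.add(Dropout(0.2))
merged_model.add(BatchNormalization())
merged_model.add(Dense(1, activation='sigmoid'))
merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])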
Time to Train the DeepNet
➢ Total params: 174,913,917
➢ Trainable params: 60,172,917
➢ Non-trainable params: 114,741,000
➢ NVIDIA Titan X
Time to Train the DeepNet
➢ The deep network was trained on an NVIDIA Titan X; each epoch took approximately 300 seconds, and training took 10-15 hours in total. The network achieved an accuracy of 0.848 (~0.85).
➢ The SOTA at that time was around 0.88 (the Bi-MPM model).
Can we end without talking about the muppets?
Of course!
Just kidding, no!
BERT
[Figure: an input sequence of up to 512 tokens feeding into a stack of Encoder Blocks]
How does BERT learn?
Example of tokenization:
https://github.com/huggingface/tokenizers
Approaching duplicate questions using BERT
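A minimal sketch of treating duplicate-question detection as sentence-pair classification with the HuggingFace transformers library (the model name, sequence length and training loop are assumptions, not the slides' exact code):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# question pair from earlier in the deck; label 1 = duplicate
enc = tokenizer(
    "What practical applications might evolve from the discovery of the Higgs Boson?",
    "What are some practical benefits of discovery of the Higgs Boson?",
    truncation=True, padding="max_length", max_length=128, return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**enc, labels=labels)
outputs.loss.backward()  # then step an optimizer (e.g. AdamW) over the full training set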
There is a lot more….
Maybe next time!
A few things to remember...
Fine-tuning often gives good results
➢ It is faster
➢ It is better (not always)
➢ Why reinvent the wheel?
Bigger isn’t always better
A good model has some key ingredients...
[Figure: the Powerpuff Girls recipe — sugar, all the things that are nice, and Chemical X — yielding A Good Machine Learning Model]
➢ Pre-processing
➢ Feature engineering
➢ Feature selection
➢ Post-processing
Approaching (almost) any machine learning problem: the book will be released in Summer 2020.