Unit 2a
What Is A Language Model?
Language Model Applications
Word Embedding/Vectorization
Types
• Bag of Words
• TF-IDF
• One hot encoding
• Word2Vec
• GloVe
• FastText
Bag-of-Words (BoW)
“It was the best of times”
“It was the worst of times”
“It was the age of wisdom”
“It was the age of foolishness”
Unique words: ‘It’, ‘was’, ‘the’, ‘best’, ‘of’, ‘times’, ‘worst’, ‘age’, ‘wisdom’, ‘foolishness’
We take the first document — “It was the best of times” — and
check the frequency of each of the 10 unique words:
“it” = 1, “was” = 1, “the” = 1, “best” = 1, “of” = 1, “times” = 1,
“worst” = 0, “age” = 0, “wisdom” = 0, “foolishness” = 0
Doing this for every document gives the vectors:
“It was the best of times” = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
“It was the worst of times” = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
“It was the age of wisdom” = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
“It was the age of foolishness” = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
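The counting above can be sketched in a few lines of pure Python, using the fixed vocabulary order from the slide:

```python
# Bag-of-Words by hand: count each vocabulary word in each document.
vocab = ["it", "was", "the", "best", "of", "times",
         "worst", "age", "wisdom", "foolishness"]

docs = [
    "It was the best of times",
    "It was the worst of times",
    "It was the age of wisdom",
    "It was the age of foolishness",
]

def bow_vector(doc, vocab):
    words = doc.lower().split()
    return [words.count(term) for term in vocab]

for d in docs:
    print(d, "->", bow_vector(d, vocab))
```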
CountVectorizer
Instead of counting by hand, scikit-learn’s CountVectorizer builds the document-term matrix directly:
# Bag of Words with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This pasta is very tasty and affordable.",
          "This pasta is not tasty and affordable",
          "This pasta is very very delicious"]

countvectorizer = CountVectorizer()
X = countvectorizer.fit_transform(corpus)
result = X.toarray()
print(result)
N-grams
• Similar to the count-vectorization technique, the N-gram method generates a document-term matrix where each cell represents a count.
• N-grams consider sequences of n words in the text, where n is 1, 2, 3, ... (1-gram, 2-gram, and so on). Unlike BoW, N-grams maintain word order.
• For example, for the sentence “I am studying NLP”:
• if n = 2, i.e. bigram, the columns would be [“I am”, “am studying”, “studying NLP”]
• if n = 3, i.e. trigram, the columns would be [“I am studying”, “am studying NLP”]
• if n = 4, i.e. four-gram, the column would be [“I am studying NLP”]
The term-document matrix for four words in four Shakespeare plays. Each cell
contains the number of times the (row) word occurs in the (column) document.
If two documents have similar words, their column vectors will tend to be similar.
The vectors for the comedies As You Like It [1,114,36,20] and Twelfth Night [0,80,58,15] look a lot more like each other
(more fools and wit than battles) than they look like Julius Caesar [7,62,1,2] or Henry V [13,89,4,3].
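As a quick check of that claim, the slide’s vectors can be compared with cosine similarity (one common measure of vector likeness; with raw counts the differences are modest, and weighting schemes like tf-idf sharpen them):

```python
import math

# Raw counts for the dimensions [battle, good, fool, wit],
# read off the term-document table above.
plays = {
    "As You Like It": [1, 114, 36, 20],
    "Twelfth Night": [0, 80, 58, 15],
    "Julius Caesar": [7, 62, 1, 2],
    "Henry V": [13, 89, 4, 3],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_comedies = cosine(plays["As You Like It"], plays["Twelfth Night"])
sim_tragedy = cosine(plays["As You Like It"], plays["Julius Caesar"])
print(sim_comedies, sim_tragedy)
```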
Vectors and documents
A spatial visualization of the document vectors for the four Shakespeare play documents, showing just
two of the dimensions, corresponding to the words battle and fool. The comedies have high values for
the fool dimension and low values for the battle dimension.
Words as vectors: document dimensions
The term-document matrix for four words in four Shakespeare plays. The red boxes show that each
word is represented as a row vector of length four.
For documents, we saw that similar documents had similar vectors, because similar documents
tend to have similar words. This same principle applies to words: similar words have similar vectors
because they tend to occur in similar documents.
term-term matrix or term-context matrix
The two words cherry and strawberry are more similar to each other (both pie and
sugar tend to occur in their window) than they are to other words like digital;
conversely, digital and information are more similar to each other than, say, to
strawberry.
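A term-context matrix of this kind can be built by sliding a window over a token stream and counting co-occurrences; a minimal sketch (toy corpus assumed for illustration):

```python
from collections import defaultdict

# Count, for every target word, the words appearing within
# +/- `window` positions of it.
def term_context_counts(tokens, window):
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts

tokens = "i like deep learning i like nlp".split()
counts = term_context_counts(tokens, window=2)
print(dict(counts["like"]))
```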
Finding Text/Doc Similarities
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
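These definitions can be illustrated with Euclidean distance as the dissimilarity; the 1/(1+d) conversion to a similarity in (0, 1] is one common convention (an assumption here, not the only choice):

```python
import math

# Euclidean distance as a dissimilarity: 0 for identical objects,
# growing without bound for unlike ones.
def dissimilarity(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Map distance in [0, inf) to a similarity in (0, 1]:
# higher when objects are more alike.
def similarity(x, y):
    return 1.0 / (1.0 + dissimilarity(x, y))

a, b = [0, 0], [3, 4]
print(dissimilarity(a, b), similarity(a, b))
```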
Euclidean distance between vectors
• Vectors x, y
• x = [x1, x2, ..., xn] and y = [y1, y2, ..., yn]
• d(x, y) = sqrt((x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²)
Euclidean distance NLP example
Text 1: I love ice cream
[Worked example: pairwise Euclidean distances between short texts; the computed values on the slide were 2, 2.24, 2, and 2.64, but the remaining texts were lost in extraction.]
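A sketch of this kind of example: Text 1 comes from the slide; Texts 2 and 3 are assumed for illustration, since the originals were lost.

```python
import math

texts = [
    "I love ice cream",   # Text 1 (from the slide)
    "I love chocolate",   # assumed for illustration
    "I hate broccoli",    # assumed for illustration
]

def tokenize(text):
    return text.lower().split()

# Shared vocabulary across the texts.
vocab = sorted({w for t in texts for w in tokenize(t)})

def bow(text):
    words = tokenize(text)
    return [words.count(term) for term in vocab]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d12 = euclidean(bow(texts[0]), bow(texts[1]))
d13 = euclidean(bow(texts[0]), bow(texts[2]))
print(d12, d13)  # more word overlap -> smaller distance
```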
A tf-idf weighted term-document matrix for four words in four Shakespeare plays. The 0.049 value for wit in
As You Like It is the product of tf = log10(20+1) = 1.322 and idf = 0.037. Note that the idf weighting has
eliminated the importance of the ubiquitous word good and vastly reduced the impact of the almost-ubiquitous
word fool.
code
• The concept of n-grams is applicable here as well: we can combine words in
groups of 2, 3, 4, and so on to build our final set of features.
• Along with n-grams, there are also a number of parameters such as min_df,
max_df, max_features, sublinear_tf, etc. to play around with. Carefully
tuning these parameters can do wonders for your model’s capabilities.
• Despite being so simple, TF-IDF is extensively used in tasks like
Information Retrieval, to judge which response best matches a query
(especially useful in a chatbot), and Keyword Extraction, to determine which
word in a document is most relevant. You’ll often find yourself banking on
the intuitive wisdom of TF-IDF.
N-gram code
One Hot Encoding
• A 2-dimensional matrix representing a document.
Doc1: The quick brown fox jumped over the lazy dog.
Doc2: She sells seashells by the seashore.
Doc3: Peter Piper picked a peck of pickled peppers.
• Shape: words in sentence × unique words
[Matrix columns: the 21 unique words across Doc1–Doc3; the column labels were garbled in extraction.]
The 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
quick 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
brown 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
fox 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
jumped 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
over 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
the 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
lazy 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
dog 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
Word2Vec
• In Bag of Words and TF-IDF, every word was treated as an individual
entity, and semantics were completely ignored. With the introduction of
Word2Vec, the vector representation of words became contextually aware,
arguably for the first time.
• Dense vectors tend to work better than sparse vectors in most NLP tasks.
• Word2Vec is one of the most popular techniques for learning word embeddings
using a shallow neural network. It was developed by Tomas Mikolov's team at Google in 2013.
• Since every word is represented as an n-dimensional vector, one can imagine that
all of the words are mapped to this n-dimensional space in such a manner that
words having similar meanings exist in close proximity to one another in this
hyperspace.
• Two methods
• Skip Gram
• Continuous Bag of Words (CBOW)
Skip Gram
• Skip-gram predicts the surrounding context words within a specific window,
given the current word. The input layer contains the current word and the
output layer contains the context words. The hidden layer size is the number
of dimensions in which we want to represent the current word present at the
input layer.
• skip-gram trains a probabilistic
classifier that, given a test target
word w and its context window of
L words c1:L, assigns a probability
based on how similar this context
window is to the target word.
• Skip-gram actually stores two
embeddings for each word, one
for the word as a target, and one
for the word considered as
context.
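The (target, context) training pairs that skip-gram learns from can be generated as a sketch: for each word, every word within the window becomes one pair.

```python
# Generate skip-gram (target, context) pairs from a token list.
def skipgram_pairs(tokens, window):
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["i", "am", "studying", "nlp"], window=1))
```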
Input using one hot encoding
CBOW (Continuous Bag of Words)
• The CBOW model predicts the current word given the context words within a
specific window. The input layer contains the context words and the output
layer contains the current word. The hidden layer size is the number of
dimensions in which we want to represent the current word present at the
output layer.
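CBOW’s training examples are the mirror image of skip-gram’s: all context words inside the window are paired with the single target word to predict, as in this sketch.

```python
# Generate CBOW (context words -> target word) training examples.
def cbow_examples(tokens, window):
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

print(cbow_examples(["i", "am", "studying", "nlp"], window=1))
```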
code
pip install nltk
pip install gensim