
Language modelling

Unit 2
What Is A Language Model?
Language Model Applications
Word Embedding/Vectorization
Types
• Bag of Words
• TF-IDF
• One hot encoding
• Word2Vec
• GloVe
• FastText
Bag-of-Words (BoW)

Corpus:
"It was the best of times"
"It was the worst of times"
"It was the age of wisdom"
"It was the age of foolishness"

Vocabulary (10 unique words): 'It', 'was', 'the', 'best', 'of', 'times', 'worst', 'age', 'wisdom', 'foolishness'

We take the first document, "It was the best of times", and count the frequency of each of the 10 unique words:

"it" = 1, "was" = 1, "the" = 1, "best" = 1, "of" = 1, "times" = 1, "worst" = 0, "age" = 0, "wisdom" = 0, "foolishness" = 0

Doing the same for every document gives the vectors:

"It was the best of times" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
"It was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"It was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"It was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
CountVectorizer
The same Bag-of-Words vectors can be built directly with scikit-learn's CountVectorizer:
# Bag of Words
# count vectorizer
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This pasta is very tasty and affordable.",
          "This pasta is not tasty and affordable",
          "This pasta is very very delicious"]

countvectorizer = CountVectorizer()
X = countvectorizer.fit_transform(corpus)
result = X.toarray()
print(result)
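To see which column of the output corresponds to which word, the learned vocabulary can be printed as well. A minimal follow-up, continuing the snippet above (get_feature_names_out is available in scikit-learn 1.0 and later):

# Column labels for the count matrix above
print(countvectorizer.get_feature_names_out())
# Each row of `result` counts these vocabulary words for one document in `corpus`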
N-grams
• 1. Similar to the count vectorization technique, the N-gram method generates a document-term matrix in which each cell represents a count.

• 2. The columns represent all sequences of adjacent words of length n.

• 3. Count vectorization is a special case of N-grams where n = 1.

• 4. N-grams consider sequences of n words in the text, where n = 1, 2, 3, ... (1-gram, 2-gram, and so on). Unlike BoW, this preserves word order.

• For example, "I am studying NLP" has four words, so n can be at most 4.

• If n = 2 (bigrams), the columns would be ["I am", "am studying", "studying NLP"].
• If n = 3 (trigrams), the columns would be ["I am studying", "am studying NLP"].
• If n = 4 (four-gram), the column would be ["I am studying NLP"].

• The value of n is chosen based on performance; a short sketch follows below.
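A minimal sketch of extracting the bigrams above with scikit-learn's CountVectorizer; the ngram_range setting and the custom token_pattern (needed only to keep the one-letter word "I") are illustrative choices:

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) keeps only bigrams; the token_pattern keeps 1-letter words
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b")
bigram_vectorizer.fit(["I am studying NLP"])
print(bigram_vectorizer.get_feature_names_out())
# ['am studying' 'i am' 'studying nlp']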


Bag-of-Words with N-grams
N-gram Models
Vectors and documents

The term-document matrix for four words in four Shakespeare plays. Each cell
contains the number of times the (row) word occurs in the (column) document.
If two documents have similar words, their column vectors will tend to be similar.
The vectors for the comedies As You Like It [1,114,36,20] and Twelfth Night [0,80,58,15] look a lot more like each other
(more fools and wit than battles) than they look like Julius Caesar [7,62,1,2] or Henry V [13,89,4,3].
Vectors and documents

A spatial visualization of the document vectors for the four Shakespeare play documents, showing just
two of the dimensions, corresponding to the words battle and fool. The comedies have high values for
the fool dimension and low values for the battle dimension.
Words as vectors: document dimensions

The term-document matrix for four words in four Shakespeare plays. The red boxes show that each
word is represented as a row vector of length four.
For documents, we saw that similar documents had similar vectors, because similar documents
tend to have similar words. This same principle applies to words: similar words have similar vectors
because they tend to occur in similar documents.
term-term matrix or term-context matrix

The two words cherry and strawberry are more similar to each other (both pie and
sugar tend to occur in their window) than they are to other words like digital;
conversely, digital and information are more similar to each other than, say, to
strawberry.
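A minimal sketch of how such a term-term (co-occurrence) matrix can be built; the toy corpus and the window size of 4 are illustrative assumptions, not the data behind the counts quoted above:

from collections import defaultdict

corpus = [
    "cherry pie with sugar",
    "strawberry pie with sugar",
    "digital information system",
]
window = 4  # context = words within +/- 4 positions of the target word

cooc = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    tokens = sentence.lower().split()
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooc[target][tokens[j]] += 1

print(dict(cooc["cherry"]))   # {'pie': 1, 'with': 1, 'sugar': 1}
print(dict(cooc["digital"]))  # {'information': 1, 'system': 1}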
term-term matrix or term-context matrix
Finding Text/Doc Similarities
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to either a similarity or a dissimilarity
Euclidean distance between vectors
• Vectors x, y
• x = [x1, x2, ..., xn] and y = [y1, y2, ..., yn]
• d(x, y) = sqrt( (x1 - y1)² + (x2 - y2)² + ... + (xn - yn)² )
Euclidean distance NLP example
Text 1: I love ice cream
Text 2: I like ice cream
Text 3: I offer ice cream to the lady that I love
Bag of words
Compare the sentences using the Euclidean distance to find the two most similar sentences. Using binary (presence/absence) bag-of-words vectors:

d(text1, text2) = √2 ≈ 1.41
d(text1, text3) = √5 ≈ 2.24
d(text2, text3) = √7 ≈ 2.65

Thus text1 is the most similar (nearest) to text2.
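A minimal sketch of the same comparison in code, assuming binary bag-of-words vectors built with scikit-learn; the custom token_pattern is needed only to keep the one-letter word "I", which the default pattern would drop:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

texts = ["I love ice cream",
         "I like ice cream",
         "I offer ice cream to the lady that I love"]

vectorizer = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(texts)

# Pairwise distances: text1-text2 ~ 1.41, text1-text3 ~ 2.24, text2-text3 ~ 2.65
print(euclidean_distances(X).round(2))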


Cosine Similarity
• Cosine similarity is one of the metrics used to measure the text similarity between two documents, irrespective of their size, in Natural Language Processing.
• Mathematically, the cosine similarity metric measures the cosine of the angle between two n-dimensional vectors projected in a multi-dimensional space.
• For term-frequency vectors (which are non-negative), the cosine similarity of two documents ranges from 0 to 1. A score of 1 means the two vectors have the same orientation; a value closer to 0 indicates that the two documents are less similar.
Cosine similarity of vectors
• An alternative measure of distance is cosine similarity.
• Cosine similarity measures the angle between two vectors; it considers only the direction of the vectors, not their magnitudes.

Cos(x, y) = x . y / (||x|| * ||y||)

where,

x . y = dot product of the vectors 'x' and 'y'
||x|| and ||y|| = lengths (magnitudes) of the two vectors 'x' and 'y'
||x|| * ||y|| = product of the lengths of the two vectors 'x' and 'y'
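A minimal sketch of this formula with NumPy; the two toy count vectors are illustrative:

import numpy as np

def cosine(x, y):
    # dot product divided by the product of the vector lengths
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1, 1, 1, 1, 0])  # e.g. counts for "I love ice cream"
y = np.array([1, 0, 1, 1, 1])  # e.g. counts for "I like ice cream"
print(cosine(x, y))            # 3 / (2 * 2) = 0.75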
Cosine Similarity

doc_1 = "Data is the oil of the digital economy"


doc_2 = "Data is a new oil"
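A minimal sketch for these two documents, vectorizing with CountVectorizer and scoring with scikit-learn's cosine_similarity; the exact value depends on the default tokenization, which drops the one-letter word "a":

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_1 = "Data is the oil of the digital economy"
doc_2 = "Data is a new oil"

vectors = CountVectorizer().fit_transform([doc_1, doc_2])
print(cosine_similarity(vectors[0], vectors[1]))  # roughly 0.47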
Cosine Similarity
Advantages of cosine similarity
• Cosine similarity is beneficial because even if two similar data objects are far apart by Euclidean distance because of their size, they can still have a small angle between them. The smaller the angle, the higher the similarity.
• When plotted in a multi-dimensional space, cosine similarity captures the orientation (the angle) of the data objects and not the magnitude.
Cosine Similarity NLP example
Text 1: I love ice cream
Text 2: I like ice cream
Text 3: I offer ice cream to the lady that I love
Bag of words
Using count vectors over the vocabulary (I, love, ice, cream, like, offer, to, the, lady, that):

text1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0], ||text1|| = 2
text2 = [1, 0, 1, 1, 1, 0, 0, 0, 0, 0], ||text2|| = 2
text3 = [2, 1, 1, 1, 0, 1, 1, 1, 1, 1], ||text3|| = √12 ≈ 3.46

cos(text1, text2) = 3 / (2 × 2) = 0.75
cos(text1, text3) = 5 / (2 × 3.46) ≈ 0.72
cos(text2, text3) = 4 / (2 × 3.46) ≈ 0.58

Thus text1 is the most similar (nearest) to text2.


Jaccard Similarity
• Jaccard Similarity is defined as the size of the intersection of two documents divided by the size of their union, i.e. the number of common words over the total number of unique words.
• Here, we use the set of words in each document to find the intersection and the union.
Jaccard Similarity
doc_1 = "Data is the oil of the digital economy"
doc_2 = "Data is a new oil"
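A minimal sketch of Jaccard similarity for these two documents, assuming simple lower-cased whitespace tokenization:

def jaccard_similarity(doc_a, doc_b):
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

doc_1 = "Data is the oil of the digital economy"
doc_2 = "Data is a new oil"

# 3 common words {data, is, oil} over 9 unique words in the union -> 0.33
print(jaccard_similarity(doc_1, doc_2))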
Jaccard Similarity
• Jaccard similarity is a good choice when duplication (word frequency) does not matter; cosine similarity is a better choice when duplication matters while analyzing text similarity.
Term Frequency-Inverse Document Frequency (TF-IDF)
• TF-IDF, or Term Frequency-Inverse Document Frequency, is a numerical statistic that is intended to reflect how important a word is to a document.
• How does TF-IDF improve over Bag of Words?
• In Bag of Words, we witnessed how vectorization was just concerned with the
frequency of vocabulary words in a given document. As a result, articles,
prepositions, and conjunctions which don’t contribute a lot to the meaning get as
much importance as, say, adjectives.
• TF-IDF helps us to overcome this issue. Words that get repeated too often don’t
overpower less frequent but important words.
• It has two parts:
• TF
• IDF
TF
• TF stands for Term Frequency. It can be understood as a normalized frequency score, calculated via the following formula:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
IDF
• DF stands for Document Frequency. It is given by the following formula:
DF(t) = number of documents in the corpus that contain the term t

• IDF stands for Inverse Document Frequency:
IDF(t) = log(N / DF(t)), where N is the total number of documents

• A word can also be called a 'term'.


TF-IDF = TF*IDF

A tf-idf weighted term-document matrix for four words in four Shakespeare plays. The 0.049 value for wit in As You Like It is the product of tf = log10(20+1) = 1.322 and idf = 0.037. Note that the idf weighting has eliminated the importance of the ubiquitous word good and vastly reduced the impact of the almost-ubiquitous word fool.
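A quick check of the 0.049 value, assuming the Jurafsky & Martin conventions tf = log10(count + 1) and idf = log10(N / df), with N = 37 Shakespeare plays and wit appearing in 34 of them (these corpus figures come from the textbook example, not from this slide):

import math

tf = math.log10(20 + 1)    # 20 occurrences of "wit" in As You Like It
idf = math.log10(37 / 34)  # ~0.037
print(round(tf, 3), round(idf, 3), round(tf * idf, 3))  # 1.322 0.037 0.049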
code
• The concept of n-grams is applicable here as well, we can combine words in
groups of 2,3,4, and so on to build our final set of features.
• Along with n-grams, there are also a number of parameters such as min_df,
max_df, max_features, sublinear_tf, etc. to play around with. Carefully
tuning these parameters can do wonders for your model’s capabilities.
• Despite being so simple, TF-IDF is extensively used in tasks like Information Retrieval, to judge which response is the best for a query (especially useful in a chatbot), and Keyword Extraction, to determine which word is the most relevant in a document. Thus, you'll often find yourself banking on the intuitive wisdom of TF-IDF. A minimal sketch follows below.
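A minimal sketch with scikit-learn's TfidfVectorizer, reusing the pasta corpus from earlier; the parameter values (ngram_range, min_df, sublinear_tf) are illustrative, not prescriptive:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This pasta is very tasty and affordable.",
          "This pasta is not tasty and affordable",
          "This pasta is very very delicious"]

tfidf = TfidfVectorizer(
    ngram_range=(1, 2),  # unigrams and bigrams
    min_df=1,            # keep terms that appear in at least 1 document
    sublinear_tf=True,   # use 1 + log(tf) instead of the raw count
)
X = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())
print(X.toarray().round(3))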
N-gram code
One Hot Encoding
• A 2-dimensional matrix representing a document, with one one-hot row vector per word.
Doc1: The quick brown fox jumped over the lazy dog.
Doc2: She sells seashells by the seashore.
Doc3: Peter Piper picked a peck of pickled peppers.
• Rows: words in the sentence × Columns: unique words across the documents

Columns: the 21 unique words across Doc1, Doc2 and Doc3 (the, quick, brown, fox, jumped, over, lazy, dog, she, sells, seashells, by, seashore, peter, piper, picked, a, peck, of, pickled, peppers). Each row below is the one-hot vector of one word of Doc1.
The 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
quick 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
brown 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
fox 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
jumped 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
over 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
the 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
lazy 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
dog 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
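A minimal sketch of building these one-hot rows for Doc1 against the combined 21-word vocabulary; the alphabetical column order is an assumption of this sketch, not the order used in the slide:

import numpy as np

docs = ["The quick brown fox jumped over the lazy dog",
        "She sells seashells by the seashore",
        "Peter Piper picked a peck of pickled peppers"]

vocab = sorted({w.lower() for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

doc1_words = docs[0].lower().split()
one_hot = np.zeros((len(doc1_words), len(vocab)), dtype=int)
for row, word in enumerate(doc1_words):
    one_hot[row, index[word]] = 1  # a single 1 in the column for that word

print(one_hot.shape)  # (9, 21): 9 words in Doc1 x 21 unique words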
Word2Vec
• In Bag of Words and TF-IDF, we saw how every word was treated as an individual entity, and semantics were completely ignored. With the introduction of Word2Vec, the vector representation of words became contextually aware, arguably for the first time.
• Dense vectors tend to work better than sparse vectors in nearly every NLP task.
• Word2Vec is one of the most popular techniques for learning word embeddings using a shallow neural network. It was developed by Tomas Mikolov in 2013 at Google.
• Since every word is represented as an n-dimensional vector, one can imagine that all of the words are mapped into this n-dimensional space in such a manner that words with similar meanings lie in close proximity to one another in this hyperspace.
• Two methods
• Skip Gram
• Common Bag Of Words (CBOW)
Skip Gram
• Skip-gram predicts the surrounding context words within a specific window, given the current word. The input layer contains the current word and the output layer contains the context words. The hidden layer's size is the number of dimensions in which we want to represent the current word presented at the input layer.
• skip-gram trains a probabilistic
classifier that, given a test target
word w and its context window of
L words c1:L, assigns a probability
based on how similar this context
window is to the target word.
• Skip-gram actually stores two
embeddings for each word, one
for the word as a target, and one
for the word considered as
context.
Input using one hot encoding
CBOW (Continuous Bag of Words)
• The CBOW model predicts the current word given the context words within a specific window. The input layer contains the context words and the output layer contains the current word. The hidden layer's size is the number of dimensions in which we want to represent the current word produced at the output layer.
code
pip install nltk
pip install gensim
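A minimal sketch of training Word2Vec with gensim (the 4.x API is assumed); the toy sentences and the vector_size, window and epochs values are illustrative:

from gensim.models import Word2Vec

sentences = [["i", "love", "ice", "cream"],
             ["i", "like", "ice", "cream"],
             ["i", "offer", "ice", "cream", "to", "the", "lady", "that", "i", "love"]]

# sg=1 trains skip-gram; sg=0 trains CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["ice"].shape)         # (50,) dense vector for "ice"
print(model.wv.most_similar("ice"))  # nearest words in the embedding space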
