
Word2vec

Skip Gram
Today: Using word embeddings
• Word2vec
– Train distributed word representations to predict
observed words
– Using word co-occurrence within a window size
– Two basic models
• Continuous bag-of-words (CBOW)
– Uses the context words to predict the center word
– With a 2c window: P(w_i | w_{i-c}, w_{i-c+1}, …, w_{i+c-1}, w_{i+c})
• Skip-gram
– Uses each word to predict its context words (see the pair-extraction sketch below)
– With a 2c window: P(w_{i-c}, w_{i-c+1}, …, w_{i+c-1}, w_{i+c} | w_i)
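
For concreteness, a minimal sketch (not from the slides) of how skip-gram (center, context) training pairs can be extracted with a 2c window; the function name `skipgram_pairs` is illustrative:

```python
# Minimal sketch: generate (center, context) skip-gram pairs from a sentence.
def skipgram_pairs(tokens, c=2):
    """Return (center, context) pairs for a window of c words on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        # context = up to c words to the left and c words to the right
        for j in range(max(0, i - c), min(len(tokens), i + c + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the dog saw a cat".split(), c=2))
# e.g. ('the', 'dog'), ('the', 'saw'), ('dog', 'the'), ('dog', 'saw'), ...
```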
Moreover...
• Word2Vec
– a class of neural network models
– works with an unlabelled training corpus
– produces a vector for each word in the corpus that
encodes its semantic information
These vectors are useful for two main reasons.
1. We can measure the semantic similarity
between two words by calculating the cosine
similarity between their corresponding word
vectors (see the sketch after this list).
For instance, words that we know to be
synonyms tend to have similar vectors in terms
of cosine similarity, and antonyms tend to have
dissimilar vectors.
2. We can use these word vectors as features for
various supervised NLP tasks such as document
classification, named entity recognition, and
sentiment analysis.
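
For illustration, a minimal cosine-similarity sketch; the 3-dimensional word vectors below are made-up values, not trained embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|); close to 1 for similar vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical word vectors, just to show the call.
v_happy = np.array([0.9, 0.1, 0.3])
v_glad  = np.array([0.8, 0.2, 0.4])
print(cosine_similarity(v_happy, v_glad))   # high value -> semantically similar
```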
Main Idea of word2vec
• Instead of capturing co-occurrence counts
directly: predict the surrounding words of every
word
• Faster and can easily incorporate a new
sentence/document or add a word to the
vocabulary
Skip-Gram Model
Input to Skip Gram
The input to the skip-gram model is a single word w_I and the
output is the set of words in w_I's context, defined by a word window
of size C.

For example, consider the sentence "I drove my car to the store".
A potential training instance could be the word "car" as an input
and the words {"I","drove","my","to","the","store"} as
outputs.

All of these words are one-hot encoded, meaning they are
vectors of length V (the size of the vocabulary) with a value of 1 at
the index corresponding to the word and zeros at all other
indexes.
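
A minimal sketch of this one-hot encoding for the example sentence (lower-cased; illustrative, not the slides' code):

```python
import numpy as np

# One-hot encode the vocabulary of "I drove my car to the store".
vocab = sorted({"i", "drove", "my", "car", "to", "the", "store"})
index = {w: i for i, w in enumerate(vocab)}   # word -> position in the vector

def one_hot(word):
    v = np.zeros(len(vocab))   # length V, all zeros
    v[index[word]] = 1.0       # a single 1 at the word's index
    return v

print(one_hot("car"))   # [1. 0. 0. 0. 0. 0. 0.] -- "car" is first alphabetically
```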
Phases of Skip Gram
• Forward Propagation
– To predict the output (context) words and calculate the error
• Back Propagation
– To update the weight matrices to reduce the error
(A small sketch of both phases follows this list.)
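
A minimal sketch (illustrative, not from the slides) of one training step for a single (input, target) pair; WI and WO are the input-to-hidden and hidden-to-output weight matrices used in the examples below, and the function name `train_step` is assumed:

```python
import numpy as np

def train_step(x, target, WI, WO, lr=0.1):
    # Forward propagation: predict the output word and measure the error
    h = WI.T @ x                        # hidden layer (selects the input word's row of WI)
    u = WO.T @ h                        # output-layer activations
    y = np.exp(u) / np.exp(u).sum()     # softmax probabilities
    e = y - target                      # prediction error (predicted minus target)

    # Back propagation: compute both gradients, then update the weights
    dWO = np.outer(h, e)                # gradient for the hidden-to-output weights
    dWI = np.outer(x, WO @ e)           # gradient for the input-to-hidden weights
    WO -= lr * dWO
    WI -= lr * dWI
    return WI, WO
```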
Skip Gram
Architecture
Forward propagation
Example 1
• Consider the training corpus having the following
sentences:
– “the dog saw a cat”,
– “the dog chased the cat”,
– “the cat climbed a tree”
• The corpus vocabulary has eight words. Once
ordered alphabetically, each word can be
referenced by its index.
• Let us assume the feature size equals 3, so we
have three neurons in the hidden layer.
Example 1
• This means that WI and WO will be 8×3 and 3×8 matrices,
respectively. Before training begins, these matrices are
initialized to small random values.
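
As a sketch of this setup in numpy (the actual random values on the slide cannot be reproduced; the initialization range is an assumption):

```python
import numpy as np

vocab = sorted({"the", "dog", "saw", "a", "cat", "chased", "climbed", "tree"})
V, N = len(vocab), 3                      # vocabulary size 8, 3 hidden neurons

rng = np.random.default_rng(0)
WI = rng.uniform(-0.5, 0.5, size=(V, N))  # 8x3 input-to-hidden weights
WO = rng.uniform(-0.5, 0.5, size=(N, V))  # 3x8 hidden-to-output weights
```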
Example 1
• Let “cat” be the input word; as per our training data,
“climbed” should be the output.
• So, the input vector X will be [0 1 0 0 0 0 0 0]^T
• The target vector will look like [0 0 0 1 0 0 0 0]^T.

• H^T = X^T WI = [-0.490796  -0.229903  0.065460]

• H^T WO = [0.100934  -0.309331  -0.122361  -0.151399  0.143463  -0.051262  -0.079686  0.112928]
Example 1
• Since the goal is to produce probabilities for the
words in the output layer,
Pr(word_k | word_context) for k = 1, …, V,
reflecting their relationship with the context
word at the input, we need the outputs of the
output-layer neurons to sum to one.
• Word2vec achieves this by converting the
activation values of the output-layer neurons into
probabilities using the softmax function (a small sketch follows).
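
A minimal softmax sketch in numpy (illustrative; the input values are arbitrary):

```python
import numpy as np

def softmax(u):
    """Turn output-layer activations into probabilities that sum to one."""
    e = np.exp(u - np.max(u))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))        # approx. [0.659 0.242 0.099]
print(softmax(np.array([2.0, 1.0, 0.1])).sum())  # 1.0
```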
Example 1

• Thus, the probabilities for the eight words in the corpus are:

0.143073  0.094925  0.114441  0.111166  0.149289  0.122874  0.119431  0.144800

• The fourth probability, 0.111166, corresponds to the chosen target word “climbed”. Given the
target vector [0 0 0 1 0 0 0 0]^T, the error vector for the output layer is
easily computed by subtracting the target vector from the probability
vector (predicted minus target, as in Ex 3 below).
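
As a numerical check, a small numpy sketch that starts from the output-layer activations quoted above and reproduces these probabilities and the error vector (up to rounding):

```python
import numpy as np

# Output-layer activations H^T WO quoted in Example 1
u = np.array([0.100934, -0.309331, -0.122361, -0.151399,
              0.143463, -0.051262, -0.079686,  0.112928])
y = np.exp(u) / np.exp(u).sum()             # softmax -> approx. [0.143073 0.094925 ... 0.144800]
t = np.array([0, 0, 0, 1, 0, 0, 0, 0.0])    # target vector for "climbed"
e = y - t                                   # output-layer error (predicted minus target)
print(np.round(y, 6))
print(np.round(e, 6))
```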
Ex 2
• Suppose we have a corpus
C = “Hey, this is sample corpus using only one
context word.”
• Let the context window size be 1.
Ex 2
• Let the input-to-hidden layer weight matrix be as shown on the slide
• Let the hidden-to-output layer weight matrix be as shown on the slide
Back propagation
• The loss function (equation shown on the slide)
• The weight-update rules (equations shown on the slide)
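
The equation images on this slide did not survive extraction; as a hedged reconstruction, the standard skip-gram loss and gradient-descent updates (consistent with the vectorized steps worked through in Ex 3 below) are:

```latex
% Loss for one (input word w_I, output word w_O) pair:
E = -\log p(w_O \mid w_I)
  = -u_{j^*} + \log \sum_{j'=1}^{V} \exp(u_{j'}),
  \qquad u_j = {\mathbf{v}'_{w_j}}^{\top} \mathbf{h}

% Output-layer error (summed over the C context words in skip-gram):
e_j = y_j - t_j

% Gradient-descent updates with learning rate \eta
% (matching dW' = outer(h, e) and dW = outer(x, W'e) in Ex 3):
\mathbf{W}'_{\text{new}} = \mathbf{W}' - \eta\, \mathbf{h}\,\mathbf{e}^{\top},
\qquad
\mathbf{W}_{\text{new}} = \mathbf{W} - \eta\, \mathbf{x}\,(\mathbf{W}'\mathbf{e})^{\top}
```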
Vectorization form (SkipGram)
Ex 3
label (expected o/p) = [[ 1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0.]],

Center (i/p) x = [ 0. 1. 0. 0. 0. 0. 0.]

W1 = [[ 0.62199038 0.30934835]
[ 0.40213707 0.85600117]
[-0.00432069 0.14562168]
[ 0.68382212 0.81335847]
[ 0.11156295 0.53410073]
[ 0.87701927 0.18616941]
[ 0.17993324 0.08900217]]

W2 = [[ 0.18344406  0.99802854  0.85318831  0.16665383  0.80046283  0.26762708  0.43454347]
      [ 0.9294146   0.81215672  0.37164133  0.17961877  0.34761577  0.0164294   0.24327192]]
Step 1:

Transpose([[ 0.62199038  0.30934835]
           [ 0.40213707  0.85600117]
           [-0.00432069  0.14562168]
           [ 0.68382212  0.81335847]
           [ 0.11156295  0.53410073]
           [ 0.87701927  0.18616941]
           [ 0.17993324  0.08900217]])
* [ 0.  1.  0.  0.  0.  0.  0.]
= [ 0.40213707  0.85600117]
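
Note that Step 1 is just a row lookup: multiplying the transposed W1 by a one-hot vector selects the input word's row, giving the hidden-layer vector h used in the later steps. A small numpy check using the numbers above:

```python
import numpy as np

W1 = np.array([[ 0.62199038, 0.30934835],
               [ 0.40213707, 0.85600117],
               [-0.00432069, 0.14562168],
               [ 0.68382212, 0.81335847],
               [ 0.11156295, 0.53410073],
               [ 0.87701927, 0.18616941],
               [ 0.17993324, 0.08900217]])
x = np.array([0., 1., 0., 0., 0., 0., 0.])   # one-hot for the center word

h = W1.T @ x
print(h)        # [0.40213707 0.85600117]
print(W1[1])    # the same thing: row 1 of W1, no multiplication needed
```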
Step 2:

transpose([[ 0.18344406  0.99802854  0.85318831  0.16665383  0.80046283  0.26762708  0.43454347]
           [ 0.9294146   0.81215672  0.37164133  0.17961877  0.34761577  0.0164294   0.24327192]])
* [ 0.40213707  0.85600117]
= [0.869  1.097  0.661  0.221  0.619  0.122  0.383] T

Step 3:

Apply softmax: for the first element, exp(0.869) / [exp(0.869) + … + exp(0.383)]

[ 0.183182 0.230092 0.148782 0.095821 0.142662 0.086789 0.112672 ] T


Step 4:
Predicted o/p
[ 0.183182 0.230092 0.148782 0.095821 0.142662 0.086789 0.112672 ] T
[ 0.183182 0.230092 0.148782 0.095821 0.142662 0.086789 0.112672 ] T
[ 0.183182 0.230092 0.148782 0.095821 0.142662 0.086789 0.112672 ] T

Expected
[ 1. 0. 0. 0. 0. 0. 0.] T
[ 0. 0. 1. 0. 0. 0. 0.] T
[ 0. 0. 0. 1. 0. 0. 0.] T

Error = predicted – expected

[-0.81682   0.230092  0.148782  0.095821  0.142662  0.086789  0.112672] T
[ 0.183182  0.230092 -0.85122   0.095821  0.142662  0.086789  0.112672] T
[ 0.183182  0.230092  0.148782 -0.90418   0.142662  0.086789  0.112672] T

Summed error

[-0.45045  0.690277 -0.55365 -0.71254  0.427987  0.260367  0.338016] T
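
A small numpy check of Step 4, using the predicted distribution and the expected labels given above:

```python
import numpy as np

# The same predicted distribution is compared against each of the three
# expected context words, and the per-word errors are then summed.
y = np.array([0.183182, 0.230092, 0.148782, 0.095821, 0.142662, 0.086789, 0.112672])
labels = np.array([[1., 0., 0., 0., 0., 0., 0.],
                   [0., 0., 1., 0., 0., 0., 0.],
                   [0., 0., 0., 1., 0., 0., 0.]])

errors = y - labels                 # broadcasting: one error row per context word
summed_error = errors.sum(axis=0)
print(summed_error)                 # approx. [-0.45045 0.690277 -0.55365 -0.71254 ...]
```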


Step 5:

For dW’, calculate the outer product of h and the summed error:

[[0.402137]
 [0.856001]]
with
[-0.45045  0.690277 -0.55365 -0.71254  0.427987  0.260367  0.338016]

dW’ =
[[-0.18114 0.277586 -0.22265 -0.28654 0.172109 0.104703 0.135929 ]
[-0.38559 0.590878 -0.47393 -0.60993 0.366357 0.222875 0.289342]]
W’ = W’ – Learning_rate * dW’

Let Learning_rate = 0.1

Then
W’ = [[ 0.18344406  0.99802854  0.85318831  0.16665383  0.80046283  0.26762708  0.43454347]
      [ 0.9294146   0.81215672  0.37164133  0.17961877  0.34761577  0.0164294   0.24327192]]
   – 0.1 * [[-0.18114  0.277586 -0.22265 -0.28654  0.172109  0.104703  0.135929]
            [-0.38559  0.590878 -0.47393 -0.60993  0.366357  0.222875  0.289342]]
   = [[0.201558  0.97027   0.875453  0.195308  0.783252  0.257157  0.420951]
      [0.967974  0.753069  0.419034  0.240612  0.31098  -0.00586   0.214338]]
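
As a numerical check of this update, a short numpy sketch using the matrices quoted above (learning rate 0.1 as on the slides):

```python
import numpy as np

h = np.array([0.40213707, 0.85600117])
summed_error = np.array([-0.45045, 0.690277, -0.55365, -0.71254,
                          0.427987, 0.260367, 0.338016])
W2 = np.array([[0.18344406, 0.99802854, 0.85318831, 0.16665383, 0.80046283, 0.26762708, 0.43454347],
               [0.9294146 , 0.81215672, 0.37164133, 0.17961877, 0.34761577, 0.0164294 , 0.24327192]])

dW2 = np.outer(h, summed_error)   # 2x7 gradient for the hidden-to-output weights
W2_new = W2 - 0.1 * dW2           # learning rate 0.1
print(np.round(W2_new, 6))        # first entry approx. 0.201558, as computed above
```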
Next, to obtain Wnew, compute dW = outerProduct(x, (W’ * e)), where e is the
summed error and W’ is the pre-update hidden-to-output matrix.

So, (W’ * e) =
[[ 0.18344406  0.99802854  0.85318831  0.16665383  0.80046283  0.26762708  0.43454347]
 [ 0.9294146   0.81215672  0.37164133  0.17961877  0.34761577  0.0164294   0.24327192]]
* [-0.45045  0.690277 -0.55365 -0.71254  0.427987  0.260367  0.338016] T
= [[0.574]
   [0.043]]
dW = outerProduct(x, (W’ * e))
   = outerProduct([ 0.  1.  0.  0.  0.  0.  0.] T, [0.574  0.043])
   = [[0      0    ]
      [0.574  0.043]
      [0      0    ]
      [0      0    ]
      [0      0    ]
      [0      0    ]
      [0      0    ]]
W = W – Learning_rate * dW

  = [[ 0.62199038  0.30934835]
     [ 0.40213707  0.85600117]
     [-0.00432069  0.14562168]
     [ 0.68382212  0.81335847]
     [ 0.11156295  0.53410073]
     [ 0.87701927  0.18616941]
     [ 0.17993324  0.08900217]]
  – 0.1 * [[0      0    ]
           [0.574  0.043]
           [0      0    ]
           [0      0    ]
           [0      0    ]
           [0      0    ]
           [0      0    ]]
  = [[ 0.622  0.309]
     [ 0.345  0.852]
     [-0.004  0.146]
     [ 0.684  0.813]
     [ 0.112  0.534]
     [ 0.877  0.186]
     [ 0.180  0.089]]
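
And a matching numpy check for the input-to-hidden update, again with the values quoted above:

```python
import numpy as np

x  = np.array([0., 1., 0., 0., 0., 0., 0.])
W1 = np.array([[ 0.62199038, 0.30934835],
               [ 0.40213707, 0.85600117],
               [-0.00432069, 0.14562168],
               [ 0.68382212, 0.81335847],
               [ 0.11156295, 0.53410073],
               [ 0.87701927, 0.18616941],
               [ 0.17993324, 0.08900217]])
W2 = np.array([[0.18344406, 0.99802854, 0.85318831, 0.16665383, 0.80046283, 0.26762708, 0.43454347],
               [0.9294146 , 0.81215672, 0.37164133, 0.17961877, 0.34761577, 0.0164294 , 0.24327192]])
summed_error = np.array([-0.45045, 0.690277, -0.55365, -0.71254,
                          0.427987, 0.260367, 0.338016])

dW1 = np.outer(x, W2 @ summed_error)   # only the input word's row is non-zero
W1_new = W1 - 0.1 * dW1                # learning rate 0.1
print(np.round(W1_new, 3))             # row 1 becomes approx. [0.345 0.852]; other rows unchanged
```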
CBOW
Forward Pass - CBOW
Back Propagation: CBOW
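
The CBOW slides here are figure-only; as a rough, assumed sketch (not taken from the slides) of how CBOW differs from skip-gram, the context-word vectors are averaged to form the hidden layer and the error is shared among the context words. The function names `cbow_forward` and `cbow_backward` are illustrative:

```python
import numpy as np

def cbow_forward(context_ids, WI, WO):
    """Average the input vectors of the context words, then score all words."""
    h = WI[context_ids].mean(axis=0)    # hidden layer: mean of the context rows of WI
    u = WO.T @ h                        # output-layer activations
    y = np.exp(u) / np.exp(u).sum()     # softmax over the vocabulary
    return h, y

def cbow_backward(context_ids, target, h, y, WI, WO, lr=0.1):
    """One gradient step; the hidden-layer gradient is split among context words."""
    e = y - target                      # predicted minus expected
    grad_h = WO @ e                     # gradient at the hidden layer (pre-update WO)
    WO -= lr * np.outer(h, e)           # hidden-to-output update
    for cid in context_ids:             # input-to-hidden update, shared equally
        WI[cid] -= lr * grad_h / len(context_ids)
    return WI, WO
```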
Acknowledgement
• Thanks to the various online resources that were
used to compile this presentation.
