Skip Gram
Today: Using word embeddings
• Word2vec
– Train distributed word representations to predict
observed words
– Using word co-occurrence within a window size
– Two basic models
• Continuous bag-of-words (CBOW)
– Using context words to predict each word
– A 2c window: P(w_i | w_{i-c}, w_{i-c+1}, …, w_{i+c-1}, w_{i+c})
• Skip-gram
– Using each word to predict its context words
– A 2c window: P(w_{i-c}, w_{i-c+1}, …, w_{i+c-1}, w_{i+c} | w_i)
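One standard way to write the skip-gram objective in full (following Mikolov et al.; v_w and v'_w denote the input and output vectors of word w, V the vocabulary size, T the corpus length) is to maximize the average log-probability of the context words, with the conditional probability given by a softmax over output vectors:

\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}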
Moreover...
• Word2Vec
– a class of neural network models
– works with an unlabelled training corpus
– produces a vector for each word in the corpus that
encodes its semantic information
These vectors are useful for two main reasons.
1. We can measure the semantic similarity
between two words by calculating the cosine
similarity between their corresponding word
vectors (a small sketch follows this list).
For instance, words that we know to be
synonyms tend to have similar vectors in terms
of cosine similarity, and antonyms tend to have
dissimilar vectors.
2. We can use these word vectors as features for
various supervised NLP tasks such as document
classification, named entity recognition, and
sentiment analysis.
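A minimal sketch of reason 1: cosine similarity between two word vectors can be computed as follows. The vectors and words below are made-up placeholders for illustration, not the output of a trained model.

import numpy as np

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional word vectors, for illustration only
vectors = {
    "car":    np.array([0.9, 0.1, 0.4, 0.0]),
    "truck":  np.array([0.8, 0.2, 0.5, 0.1]),
    "banana": np.array([0.0, 0.9, 0.1, 0.7]),
}
print(cosine_similarity(vectors["car"], vectors["truck"]))   # high: similar words
print(cosine_similarity(vectors["car"], vectors["banana"]))  # lower: dissimilar words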
Main Idea of word2vec
• Instead of capturing co-occurrence counts
directly: predict the surrounding words of every
word
• Faster, and can easily incorporate a new
sentence/document or add a word to the
vocabulary
Skip-Gram Model
Input to Skip Gram
The input of the skip-gram model is a single word WI and the
output is the words in WI 's context defined by a word window
of size C .
For example, consider the sentence "I drove my car to the store".
A potential training instance could be the word "car" as an input
and the words {"I","drove","my","to","the","store"} as
outputs.
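A small sketch of how such (input, output) pairs could be enumerated for this sentence. The window size of 3 per side is an assumption chosen so that "car" sees all six other words, and the variable names are illustrative.

sentence = "I drove my car to the store".split()
C = 3  # context window size per side (assumed for this illustration)

pairs = []
for i, center in enumerate(sentence):
    # every word within C positions of the center word becomes an output
    for j in range(max(0, i - C), min(len(sentence), i + C + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# Training pairs where "car" is the input word
print([ctx for inp, ctx in pairs if inp == "car"])
# ['I', 'drove', 'my', 'to', 'the', 'store']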
• The output probability of interest is the one for the chosen target word “climbed”. Given the
target vector [0 0 0 1 0 0 0 0]ᵀ, the error vector for the output layer is
easily computed by subtracting the probability vector from the target
vector.
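In symbols, with this slide's sign convention, the output-layer error is the target vector minus the softmax probability vector:

y_j = \frac{\exp(u_j)}{\sum_{k=1}^{V} \exp(u_k)}, \qquad e = t - y

where u is the vector of output-layer scores and t is the one-hot target (here [0 0 0 1 0 0 0 0]ᵀ for “climbed”). Other derivations, including the worked example (Ex 3) below, define e = y − t instead; only the sign in the update rule changes.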
Ex 2
• Suppose we have the corpus
C = “Hey, this is sample corpus using only one
context word.”
• Let the context window size be 1
• Let the input-to-hidden layer weight matrix be W
• Let the hidden-to-output layer weight matrix be W′
(a possible random initialization is sketched below)
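A minimal initialization sketch, assuming the corpus above (punctuation stripped), one-hot inputs of size V, and an embedding size N = 2 chosen purely for illustration; the names are placeholders, not values from the slides.

import numpy as np

corpus = "hey this is sample corpus using only one context word".split()
vocab = sorted(set(corpus))
V, N = len(vocab), 2                      # V = vocabulary size, N = embedding size (assumed)
word2idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # one-hot column vector for a vocabulary word
    x = np.zeros(V)
    x[word2idx[word]] = 1.0
    return x

rng = np.random.default_rng(0)
W = rng.random((V, N))                    # input-to-hidden weight matrix (V x N)
W_prime = rng.random((N, V))              # hidden-to-output weight matrix (N x V)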
Back propagation
• The loss function
• Weight update rule
(one standard formulation of both is given below)
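One standard formulation (following Rong's “word2vec Parameter Learning Explained”, with C context words, output scores u_j, softmax outputs y_j, hidden activations h_i, one-hot input x_k, and learning rate \eta) is

E = -\log p(w_{O,1}, \ldots, w_{O,C} \mid w_I) = -\sum_{c=1}^{C} u_{j_c^*} + C \cdot \log \sum_{j'=1}^{V} \exp(u_{j'})

where j_c^* is the index of the c-th actual context word. The weight updates are

e_j = \sum_{c=1}^{C} \left( y_j - t_{c,j} \right) \qquad \text{(output error, summed over the context words)}
w'_{ij} \leftarrow w'_{ij} - \eta \, e_j \, h_i
w_{ki} \leftarrow w_{ki} - \eta \, x_k \sum_{j=1}^{V} e_j \, w'_{ij}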
Vectorized form (Skip-Gram)
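A sketch of the usual matrix form, consistent with the worked example (Ex 3) that follows (W is V×N, W′ is N×V, x is the one-hot input, t_c the one-hot context targets, \eta the learning rate):

h = W^{\top} x, \qquad u = {W'}^{\top} h, \qquad y = \mathrm{softmax}(u)
e = \sum_{c=1}^{C} (y - t_c)
\Delta W' = h \, e^{\top}, \qquad W' \leftarrow W' - \eta \, \Delta W'
\Delta W = x \, (W' e)^{\top}, \qquad W \leftarrow W - \eta \, \Delta W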
Ex 3
label (expected o/p) =
[[ 1. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 1. 0. 0. 0. 0.]
 [ 0. 0. 0. 1. 0. 0. 0.]]
Step 1: hidden layer h = W1ᵀ · x
W1 (input-to-hidden weight matrix) =
[[ 0.62199038 0.30934835]
 [ 0.40213707 0.85600117]
 [-0.00432069 0.14562168]
 [ 0.68382212 0.81335847]
 [ 0.11156295 0.53410073]
 [ 0.87701927 0.18616941]
 [ 0.17993324 0.08900217]]
x = [ 0. 1. 0. 0. 0. 0. 0. ]ᵀ (one-hot input word)
h = W1ᵀ · x = [ 0.40213707 0.85600117 ]ᵀ (i.e., the input word's row of W1)
Step 2: output scores u = W′ᵀ · h
W′ (hidden-to-output weight matrix) =
[[ 0.18344406 0.99802854 0.85318831 0.16665383 0.80046283 0.2676270 0.43454347]
 [ 0.9294146 0.81215672 0.37164133 0.17961877 0.34761577 0.0164294 0.24327192]]
u = W′ᵀ · [ 0.40213707 0.85600117 ]ᵀ
  = [ 0.869 1.097 0.661 0.221 0.619 0.122 0.383 ]ᵀ
Step 3: output error and gradient for W′
Expected (target) one-hot vectors for the three context words:
[ 1. 0. 0. 0. 0. 0. 0.]ᵀ
[ 0. 0. 1. 0. 0. 0. 0.]ᵀ
[ 0. 0. 0. 1. 0. 0. 0.]ᵀ
Apply softmax to u to get the predicted distribution y, subtract each target vector from y, and sum the three error vectors to obtain the summed error e.
dW′ = h · eᵀ =
[[-0.18114 0.277586 -0.22265 -0.28654 0.172109 0.104703 0.135929 ]
 [-0.38559 0.590878 -0.47393 -0.60993 0.366357 0.222875 0.289342]]
W′ = W′ − learning_rate · dW′
Then
W′ =
[[ 0.202  0.970  0.875  0.195  0.783  0.257  0.421]
 [ 0.968  0.753  0.419  0.241  0.311 -0.006  0.214]]
(rounded to 3 decimals)
Step 4: error at the hidden layer and gradient for W
EH = W′ · e (using W′ from before its update, since all gradients come from the same forward pass):
[[ 0.18344406 0.99802854 0.85318831 0.16665383 0.80046283 0.2676270 0.43454347]
 [ 0.9294146 0.81215672 0.37164133 0.17961877 0.34761577 0.0164294 0.24327192]] · e
= [ 0.574 0.043 ]ᵀ
dW = x · EHᵀ, so only the input word's row (x = [ 0. 1. 0. 0. 0. 0. 0. ]ᵀ) is non-zero:
dW =
[[ 0     0     ]
 [ 0.574 0.043 ]
 [ 0     0     ]
 [ 0     0     ]
 [ 0     0     ]
 [ 0     0     ]
 [ 0     0     ]]
Step 5: update the input-to-hidden weights
W = W − learning_rate · dW, with learning_rate = 0.1:
[[ 0.62199038 0.30934835]           [[ 0     0     ]
 [ 0.40213707 0.85600117]            [ 0.574 0.043 ]
 [-0.00432069 0.14562168]            [ 0     0     ]
 [ 0.68382212 0.81335847]  − 0.1 ·   [ 0     0     ]
 [ 0.11156295 0.53410073]            [ 0     0     ]
 [ 0.87701927 0.18616941]            [ 0     0     ]
 [ 0.17993324 0.08900217]]           [ 0     0     ]]
=
[[ 0.622  0.309]
 [ 0.345  0.852]
 [-0.004  0.146]
 [ 0.684  0.813]
 [ 0.112  0.534]
 [ 0.877  0.186]
 [ 0.18   0.089]]
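The full Ex 3 update can be checked with a short NumPy sketch of one skip-gram training step (variable names are illustrative, not from the slides); it should reproduce, up to rounding, the [0.574 0.043] hidden-layer error and the updated second row [0.345 0.852] shown above.

import numpy as np

# Weights copied from the Ex 3 slides: W is V x N, W_prime is N x V
W = np.array([[ 0.62199038, 0.30934835],
              [ 0.40213707, 0.85600117],
              [-0.00432069, 0.14562168],
              [ 0.68382212, 0.81335847],
              [ 0.11156295, 0.53410073],
              [ 0.87701927, 0.18616941],
              [ 0.17993324, 0.08900217]])
W_prime = np.array(
    [[0.18344406, 0.99802854, 0.85318831, 0.16665383, 0.80046283, 0.2676270, 0.43454347],
     [0.9294146,  0.81215672, 0.37164133, 0.17961877, 0.34761577, 0.0164294, 0.24327192]])

x = np.zeros(7); x[1] = 1.0          # one-hot input word (second word of the vocabulary)
targets = np.eye(7)[[0, 2, 3]]       # one-hot targets of the three context words (the 'label' above)
lr = 0.1                             # learning rate used on the slides

# Forward pass
h = W.T @ x                          # hidden layer = the input word's row of W
u = W_prime.T @ h                    # output scores
y = np.exp(u) / np.exp(u).sum()      # softmax probabilities

# Backward pass
e = (y - targets).sum(axis=0)        # output error summed over the C = 3 context words
dW_prime = np.outer(h, e)            # gradient for the hidden-to-output weights
EH = W_prime @ e                     # error propagated back to the hidden layer
dW = np.outer(x, EH)                 # gradient for the input-to-hidden weights

W_prime -= lr * dW_prime
W -= lr * dW

print(np.round(EH, 3))               # ~[0.574 0.043]
print(np.round(W, 3))                # updated W; second row ~[0.345 0.852]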
CBOW
Forward Pass - CBOW
Back Propagation: CBOW
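The CBOW equations are not reproduced above; one standard formulation (again following Rong's notes, with C one-hot context inputs x_1, …, x_C, one-hot target t, and learning rate \eta) is

Forward pass:
h = \frac{1}{C} W^{\top} (x_1 + x_2 + \cdots + x_C), \qquad u = {W'}^{\top} h, \qquad y = \mathrm{softmax}(u)

Back propagation:
e = y - t, \qquad W' \leftarrow W' - \eta \, h \, e^{\top}, \qquad W \leftarrow W - \frac{\eta}{C} \left( \sum_{c=1}^{C} x_c \right) (W' e)^{\top}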
Acknowledgement
• Thanks to the various online resources that were
used to compile this presentation.