
Word Embedding

Terminology
• The term “Word Embedding” comes from the deep learning community

• The computational linguistics community prefers the term “Distributional Semantic Model”

• Other terms:
– Distributed Representation
– Semantic Vector Space
– Word Space
Representing words by their context

• Distributional semantics: A word’s meaning is given by the words that frequently appear close by
• “You shall know a word by the company it keeps” (J. R. Firth, 1957: 11)
• One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
• Use the many contexts of w to build up a representation of w

…government debt problems turning into banking crises as happened in 2009…


…saying that Europe needs unified banking regulation to replace the hodgepodge…
…India has just given its banking system a shot in the arm…

These context words will represent banking
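As a concrete illustration, here is a minimal Python sketch of collecting the context words of “banking”; the window size of ±2 and the use of the three example sentences above as a toy corpus are illustrative assumptions, not part of the slides.

```python
# Minimal sketch: collect the context words of "banking" within a
# fixed-size window (here assumed to be +/-2 words) from a toy corpus.
sentences = [
    "government debt problems turning into banking crises as happened in 2009",
    "saying that Europe needs unified banking regulation to replace the hodgepodge",
    "India has just given its banking system a shot in the arm",
]

window = 2
contexts = []
for sentence in sentences:
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        if token == "banking":
            # words to the left and right of the target, within the window
            contexts += tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

print(contexts)
# ['turning', 'into', 'crises', 'as', 'needs', 'unified', 'regulation', 'to',
#  'given', 'its', 'system', 'a']
```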


Word embedding

We will build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts

banking = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

Note: word vectors are sometimes called word embeddings or word representations. They are a distributed representation.
Why Word Embeddings?
• Can capture the rich relational structure of the lexicon
Vector Space Model
Term-Document Matrix
Each cell is the count of word t in document d
              D1   D2   D3   D4   D5
économie       0    1   40   38    1
vertigineux    4    5    1    3   30
finance        1    2   30   25    2
malade         4    6    0    4   25
inflation      8    1   15   14    1

Two documents are similar if they have similar vectors!


D3 = [40, 1, 30, 0, 15]
D4 = [38, 3, 25, 4, 14]
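As a sketch of how “similar vectors” can be quantified, here is the cosine similarity (the similarity measure used later in these slides) of the D3 and D4 columns; NumPy is assumed to be available.

```python
import numpy as np

# Document vectors: the D3 and D4 columns of the term-document matrix above
# (rows: économie, vertigineux, finance, malade, inflation).
d3 = np.array([40, 1, 30, 0, 15])
d4 = np.array([38, 3, 25, 4, 14])

# Cosine similarity: dot product divided by the product of the vector norms.
cos = d3 @ d4 / (np.linalg.norm(d3) * np.linalg.norm(d4))
print(round(cos, 3))  # ~0.994, i.e. D3 and D4 are very similar
```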
Vector Space Model
Term-Document Matrix
Each cell is the count of word t in document d
              D1   D2   D3   D4   D5
économie       0    1   40   38    1
vertigineux    4    5    1    3   30
finance        1    2   30   25    2
malade         4    6    0    4   25
inflation      8    1   15   14    1

Vector of word “malade” = [4, 6, 0, 4, 25]


Vector Space Model
Term-Document Matrix
Each cell is the count of word t in document d
              D1   D2   D3   D4   D5
économie       0    1   40   38    1
vertigineux    4    5    1    3   30
finance        1    2   30   25    2
malade         4    6    0    4   25
inflation      8    1   15   14    1

Two words are similar if they have similar vectors!


vertigineux = [4, 5, 1, 3, 30]
malade = [4, 6, 0, 4, 25]
Vector Space Model
Term-Context Matrix
• Previously, we used entire documents as the context of a word
– Document-based models capture semantic relatedness (e.g. “boat” – “water”), NOT semantic similarity (e.g. “boat” – “ship”)
• We can get a more precise vector representation of a word (for semantic similarity tasks) if we use a smaller context, i.e., words as our context!
– Window of N words

• A word is then defined by a vector of counts over context words.
Vector Space Model
Term-Context Matrix
• Sample context of 4 words ...
              économie  médicament  malade  céphalée  finance  ...
inflation        2          0         0        0         3     ...
vertigineux      0          1         6        6         1     ...
vertiges         0          2         6        6         0     ...
finance          2          0         0        0         4     ...
...

Two words are similar in meaning if they have similar context vectors!
Context vector of “vertigineux” = [0, 1, 6, 6, 1, ...]
Context vector of “vertiges” = [0, 2, 6, 6, 0, ...]
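A minimal sketch of building such a term-context count matrix; the pre-tokenized toy corpus and the symmetric window of 4 words below are illustrative assumptions.

```python
from collections import defaultdict

def term_context_counts(sentences, window=4):
    """Count how often each context word appears within +/- `window`
    positions of each target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

# Toy corpus (illustrative only); in practice this would be a large corpus.
corpus = [
    "le malade souffre de vertiges et de céphalée".split(),
    "l'inflation pèse sur l'économie et la finance".split(),
]
counts = term_context_counts(corpus, window=4)
print(dict(counts["malade"]))  # context counts for "malade"
```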
Term weighting
• Weighting the counts works better in practice than using raw counts.

• For the Term-Document matrix
– We usually use tf-idf instead of raw counts (tf only)

• For the Term-Context matrix
– We usually use Pointwise Mutual Information (PMI)
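A minimal sketch of PMI weighting for a term-context count matrix; this uses the common positive-PMI variant (negative values clipped to zero), which is an assumption beyond the slide, and takes the example table above as the count matrix.

```python
import numpy as np

def ppmi(C):
    """Positive PMI: PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) ),
    with negative and undefined values clipped to zero."""
    total = C.sum()
    p_wc = C / total                              # joint P(w, c)
    p_w = C.sum(axis=1, keepdims=True) / total    # marginal P(w)
    p_c = C.sum(axis=0, keepdims=True) / total    # marginal P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                  # zero counts -> 0
    return np.maximum(pmi, 0.0)

# Rows: inflation, vertigineux, vertiges, finance
# Columns: économie, médicament, malade, céphalée, finance
C = np.array([[2, 0, 0, 0, 3],
              [0, 1, 6, 6, 1],
              [0, 2, 6, 6, 0],
              [2, 0, 0, 0, 4]], dtype=float)
print(ppmi(C).round(2))
```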
Collection vs. Document frequency
The collection frequency of t is the number of occurrences of t in the
collection, counting multiple occurrences

Example:
word          collection frequency    document frequency
insurance           10440                   3997
try                 10422                   8760

Which word is a better search term (and should get a higher weight)?
How does “importance” or “informativeness” relate to document frequency?
Ideas?

Rare terms are more informative than frequent terms
– Recall stop words

Consider a term in the query that is rare in the collection (e.g., "centric")

A document containing this term is very likely to be relevant to the query "centric"

We want a high weight for rare terms like "centric"


Inverse document frequency
dft is the document frequency of t: the number of documents
that contain t
– df is a measure of the informativeness of t

We define the idf (inverse document frequency) of t by

    idft = log10(N / dft)

where N is the number of documents in the collection

What does the log do?


Inverse document frequency

Why do we have N here?

N normalizes for corpus size:
dft / N = proportion of documents containing term t,
so N / dft is large for rare terms and small for common ones
idf example, suppose N = 1 million
term               dft      idft
calpurnia                1     6
animal                 100     4
sunday               1,000     3
fly                 10,000     2
under              100,000     1
the              1,000,000     0

There is one idf value for each term t in a collection.
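A short computation reproducing the idf values above, using idft = log10(N / dft) with N = 1,000,000:

```python
import math

# idf_t = log10(N / df_t), with N = 1,000,000 documents.
N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, dft in df.items():
    print(term, math.log10(N / dft))
# calpurnia 6.0, animal 4.0, sunday 3.0, fly 2.0, under 1.0, the 0.0
```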


idf example, suppose N = 1 million
term               dft      idft
calpurnia                1
animal                 100
sunday               1,000
fly                 10,000
under              100,000
the              1,000,000

What if we didn’t use the log?


idf example, suppose N = 1 million
term               dft      N/dft (idf without the log)
calpurnia                1     1,000,000
animal                 100        10,000
sunday               1,000         1,000
fly                 10,000           100
under              100,000            10
the              1,000,000             1

The log dampens the scores


Putting it all together
We have a notion of term frequency overlap
We have a notion of term importance
We have a similarity measure (cosine similarity)

Can we put all of these together?


Define a weighting for each term

The tf-idf weight of a term is the product of its tf weight and its idf weight
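A minimal sketch of computing this weight; the log-damped tf variant used below is the one introduced later in these slides (raw tf could be used instead), so treat it as one possible instantiation rather than the only definition.

```python
import math

def tf_weight(tf):
    # log-damped term frequency: 1 + log10(tf) for tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf_weight(N, df_t):
    # inverse document frequency: log10(N / df_t)
    return math.log10(N / df_t)

def tf_idf(tf, N, df_t):
    # tf-idf weight = tf weight * idf weight
    return tf_weight(tf) * idf_weight(N, df_t)

# Example: a term occurring 10 times in a document and appearing in 100 of
# 1,000,000 documents gets weight 2 * 4 = 8.0.
print(tf_idf(tf=10, N=1_000_000, df_t=100))
```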
tf-idf weighting

Best known weighting scheme in information retrieval

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection

Works surprisingly well!

Works in many other application domains


weight matrix
            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony             5.25                3.18            0           0        0         0.35
Brutus             1.21                6.1             0           1        0         0
Caesar             8.59                2.54            0           1.51     0.25      0
Calpurnia          0                   1.54            0           0        0         0
Cleopatra          2.85                0               0           0        0         0
mercy              1.51                0               1.9         0.12     5.25      0.88
worser             1.37                0               0.11        4.15     0.25      1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R|V|

We then calculate the similarity between documents using cosine similarity on these vectors
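A sketch of that similarity computation, using (purely as an illustration) the “Julius Caesar” and “Hamlet” columns of the weight matrix above; NumPy is assumed.

```python
import numpy as np

def cosine(u, v):
    # cosine similarity = dot product / (product of vector lengths)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# tf-idf vectors (rows: Antony, Brutus, Caesar, Calpurnia, Cleopatra,
# mercy, worser) taken from the weight matrix above.
julius_caesar = np.array([3.18, 6.1, 2.54, 1.54, 0, 0, 0])
hamlet = np.array([0, 1, 1.51, 0, 0, 0.12, 4.15])

print(round(cosine(julius_caesar, hamlet), 2))  # ~0.29
```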
Log-frequency weighting
Want to reduce the effect of multiple occurrences of a term

A document about “Clinton” will have “Clinton” occurring many times

Rather than use the raw frequency, use the log of the frequency:

    wt,d = 1 + log10(tft,d) if tft,d > 0, and 0 otherwise

0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.


Cosine similarity with 3 documents
How similar are the three novels N1, N2, N3?

Term frequencies (counts)

term         N1    N2    N3
affection   115    58    20
jealous      10     7    11
gossip        2     0     6
3 documents example contd.
Log frequency weighting               After normalization

term        N1    N2    N3            term        N1      N2      N3
affection   3.06  2.76  2.30          affection   0.789   0.832   0.524
jealous     2.00  1.85  2.04          jealous     0.515   0.555   0.465
gossip      1.30  0     1.78          gossip      0.335   0       0.405
wuthering   0     0     2.58          wuthering   0       0       0.588

cos(N1,N2) ≈ 0.789 ∗ 0.832 + 0.515 ∗ 0.555 + 0.335 ∗ 0.0 + 0.0 ∗ 0.0 ≈ 0.94
cos(N1,N3) ≈ 0.79
cos(N2,N3) ≈ 0.69
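A short sketch reproducing these numbers from the log-frequency weights in the table above (length-normalize each vector, then cosine similarity reduces to a dot product):

```python
import math

# Log-frequency weights (affection, jealous, gossip, wuthering), from the
# table above.
w = {
    "N1": [3.06, 2.00, 1.30, 0.00],
    "N2": [2.76, 1.85, 0.00, 0.00],
    "N3": [2.30, 2.04, 1.78, 2.58],
}

def normalized(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cos(a, b):
    # After length normalization, cosine similarity is just a dot product.
    return sum(x * y for x, y in zip(normalized(a), normalized(b)))

for a, b in [("N1", "N2"), ("N1", "N3"), ("N2", "N3")]:
    print(a, b, round(cos(w[a], w[b]), 2))
# N1 N2 0.94, N1 N3 0.79, N2 N3 0.69
```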
tf-idf weighting has many variants
