
Word Embedding

Terminology
• The term “Word Embedding” comes from the deep learning community

• The computational linguistics community prefers the term “Distributional Semantic Model”

• Other terms:
– Distributed Representation
– Semantic Vector Space
– Word Space
Representing words by their context

• Distributional semantics: A word’s meaning is given by the words that frequently appear close by
• “You shall know a word by the company it keeps” (J. R. Firth, 1957: 11)
• One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
• Use the many contexts of w to build up a representation of w

…government debt problems turning into banking crises as happened in 2009…


…saying that Europe needs unified banking regulation to replace the hodgepodge…
…India has just given its banking system a shot in the arm…

These context words will represent banking
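As a concrete illustration, here is a minimal Python sketch of collecting the context words of “banking”; the window size of ±2 and the use of the three example sentences above as a toy corpus are illustrative assumptions, not part of the slides.

```python
# Minimal sketch: collect the context words of "banking" within a
# fixed-size window (here assumed to be +/-2 words) from a toy corpus.
sentences = [
    "government debt problems turning into banking crises as happened in 2009",
    "saying that Europe needs unified banking regulation to replace the hodgepodge",
    "India has just given its banking system a shot in the arm",
]

window = 2
contexts = []
for sentence in sentences:
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        if token == "banking":
            # words to the left and right of the target, within the window
            contexts += tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

print(contexts)
# ['turning', 'into', 'crises', 'as', 'needs', 'unified', 'regulation', 'to',
#  'given', 'its', 'system', 'a']
```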


Word embedding

We will build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts

banking = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

Note: word vectors are sometimes called word embeddings or word representations. They are a distributed representation.
Why Word Embeddings?
• Can capture the rich relational structure of the lexicon
Vector Space Model
Term-Document Matrix
Each cell is the count of word t in document d
              D1   D2   D3   D4   D5
économie       0    1   40   38    1
vertigineux    4    5    1    3   30
finance        1    2   30   25    2
malade         4    6    0    4   25
inflation      8    1   15   14    1

Two documents are similar if they have similar vectors!


D3 = [40, 1, 30, 0, 15]
D4 = [38, 3, 25, 4, 14]
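As a sketch of how “similar vectors” can be quantified, here is the cosine similarity (the similarity measure used later in these slides) of the D3 and D4 columns; NumPy is assumed to be available.

```python
import numpy as np

# Document vectors: the D3 and D4 columns of the term-document matrix above
# (rows: économie, vertigineux, finance, malade, inflation).
d3 = np.array([40, 1, 30, 0, 15])
d4 = np.array([38, 3, 25, 4, 14])

# Cosine similarity: dot product divided by the product of the vector norms.
cos = d3 @ d4 / (np.linalg.norm(d3) * np.linalg.norm(d4))
print(round(cos, 3))  # ~0.994, i.e. D3 and D4 are very similar
```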
Vector Space Model
Term-Document Matrix
Each cell is the count of word t in document d
              D1   D2   D3   D4   D5
économie       0    1   40   38    1
vertigineux    4    5    1    3   30
finance        1    2   30   25    2
malade         4    6    0    4   25
inflation      8    1   15   14    1

Vector of word “malade” = [4, 6, 0, 4, 25]


Vector Space Model
Term-Document Matrix
Each cell is the count of word t in document d
              D1   D2   D3   D4   D5
économie       0    1   40   38    1
vertigineux    4    5    1    3   30
finance        1    2   30   25    2
malade         4    6    0    4   25
inflation      8    1   15   14    1

Two words are similar if they have similar vectors!


vertigineux = [4, 5, 1, 3, 30]
malade = [4, 6, 0, 4, 25]
Vector Space Model
Term-Context Matrix
• Previously, we used entire documents as the context of a word
– Document-based models capture semantic relatedness (e.g. “boat” – “water”), NOT semantic similarity (e.g. “boat” – “ship”)
• We can get a more precise vector representation of a word (for semantic similarity tasks) if we use a smaller context, i.e., words as our context!
– Window of N words

• A word is then defined by a vector of counts over context words.
Vector Space Model
Term-Context Matrix
• Sample context of 4 words ...
              économie  médicament  malade  céphalée  finance  ...
inflation        2          0         0        0         3     ...
vertigineux      0          1         6        6         1     ...
vertiges         0          2         6        6         0     ...
finance          2          0         0        0         4     ...
...

Two words are similar in meaning if they have similar context vectors!
Context vector of “vertigineux” = [0, 1, 6, 6, 1, ...]
Context vector of “vertiges” = [0, 2, 6, 6, 0, ...]
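A minimal sketch of building such a term-context count matrix; the pre-tokenized toy corpus and the symmetric window of 4 words below are illustrative assumptions.

```python
from collections import defaultdict

def term_context_counts(sentences, window=4):
    """Count how often each context word appears within +/- `window`
    positions of each target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

# Toy corpus (illustrative only); in practice this would be a large corpus.
corpus = [
    "le malade souffre de vertiges et de céphalée".split(),
    "l'inflation pèse sur l'économie et la finance".split(),
]
counts = term_context_counts(corpus, window=4)
print(dict(counts["malade"]))  # context counts for "malade"
```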
Term weighting
• Weighting the counts works better in practice than using raw counts.

• For the Term-Document matrix
– We usually use tf-idf instead of raw counts (tf only)

• For the Term-Context matrix
– We usually use Pointwise Mutual Information (PMI)
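A minimal sketch of PMI weighting for a term-context count matrix; this uses the common positive-PMI variant (negative values clipped to zero), which is an assumption beyond the slide, and takes the example table above as the count matrix.

```python
import numpy as np

def ppmi(C):
    """Positive PMI: PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) ),
    with negative and undefined values clipped to zero."""
    total = C.sum()
    p_wc = C / total                              # joint P(w, c)
    p_w = C.sum(axis=1, keepdims=True) / total    # marginal P(w)
    p_c = C.sum(axis=0, keepdims=True) / total    # marginal P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                  # zero counts -> 0
    return np.maximum(pmi, 0.0)

# Rows: inflation, vertigineux, vertiges, finance
# Columns: économie, médicament, malade, céphalée, finance
C = np.array([[2, 0, 0, 0, 3],
              [0, 1, 6, 6, 1],
              [0, 2, 6, 6, 0],
              [2, 0, 0, 0, 4]], dtype=float)
print(ppmi(C).round(2))
```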
Collection vs. Document frequency
The collection frequency of t is the number of occurrences of t in the
collection, counting multiple occurrences

Example:
word          collection frequency    document frequency
insurance           10440                   3997
try                 10422                   8760

Which word is a better search term (and should get a higher weight)?
How does “importance” or “informativeness” relate to document frequency?
Ideas?

Rare terms are more informative than frequent terms
– Recall stop words

Consider a term in the query that is rare in the collection (e.g., "centric")

A document containing this term is very likely to be relevant to the query "centric"

We want a high weight for rare terms like "centric"


Inverse document frequency
dft is the document frequency of t: the number of documents
that contain t
– df is a measure of the informativeness of t

We define the idf (inverse document frequency) of t by

    idft = log10(N / dft)

where N is the number of documents in the collection

What does the log do?


Inverse document frequency

Why do we have N here?

N normalizes for corpus size:
dft / N = proportion of documents containing term t,
so N / dft is large for rare terms and small for common ones
idf example, suppose N = 1 million
term               dft      idft
calpurnia                1     6
animal                 100     4
sunday               1,000     3
fly                 10,000     2
under              100,000     1
the              1,000,000     0

There is one idf value for each term t in a collection.
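A short computation reproducing the idf values above, using idft = log10(N / dft) with N = 1,000,000:

```python
import math

# idf_t = log10(N / df_t), with N = 1,000,000 documents.
N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, dft in df.items():
    print(term, math.log10(N / dft))
# calpurnia 6.0, animal 4.0, sunday 3.0, fly 2.0, under 1.0, the 0.0
```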


idf example, suppose N = 1 million
term               dft      idft
calpurnia                1
animal                 100
sunday               1,000
fly                 10,000
under              100,000
the              1,000,000

What if we didn’t use the log?


idf example, suppose N = 1 million
term               dft      N/dft (idf without the log)
calpurnia                1     1,000,000
animal                 100        10,000
sunday               1,000         1,000
fly                 10,000           100
under              100,000            10
the              1,000,000             1

The log dampens the scores


Putting it all together
We have a notion of term frequency overlap
We have a notion of term importance
We have a similarity measure (cosine similarity)

Can we put all of these together?


Define a weighting for each term

The tf-idf weight of a term is the product of its tf weight and its idf weight
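A minimal sketch of computing this weight; the log-damped tf variant used below is the one introduced later in these slides (raw tf could be used instead), so treat it as one possible instantiation rather than the only definition.

```python
import math

def tf_weight(tf):
    # log-damped term frequency: 1 + log10(tf) for tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf_weight(N, df_t):
    # inverse document frequency: log10(N / df_t)
    return math.log10(N / df_t)

def tf_idf(tf, N, df_t):
    # tf-idf weight = tf weight * idf weight
    return tf_weight(tf) * idf_weight(N, df_t)

# Example: a term occurring 10 times in a document and appearing in 100 of
# 1,000,000 documents gets weight 2 * 4 = 8.0.
print(tf_idf(tf=10, N=1_000_000, df_t=100))
```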
tf-idf weighting

Best known weighting scheme in information retrieval

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection

Works surprisingly well!

Works in many other application domains


weight matrix
            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony             5.25                3.18            0           0        0         0.35
Brutus             1.21                6.1             0           1        0         0
Caesar             8.59                2.54            0           1.51     0.25      0
Calpurnia          0                   1.54            0           0        0         0
Cleopatra          2.85                0               0           0        0         0
mercy              1.51                0               1.9         0.12     5.25      0.88
worser             1.37                0               0.11        4.15     0.25      1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R|V|

We then calculate the similarity between documents using cosine similarity on these vectors
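A sketch of that similarity computation, using (purely as an illustration) the “Julius Caesar” and “Hamlet” columns of the weight matrix above; NumPy is assumed.

```python
import numpy as np

def cosine(u, v):
    # cosine similarity = dot product / (product of vector lengths)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# tf-idf vectors (rows: Antony, Brutus, Caesar, Calpurnia, Cleopatra,
# mercy, worser) taken from the weight matrix above.
julius_caesar = np.array([3.18, 6.1, 2.54, 1.54, 0, 0, 0])
hamlet = np.array([0, 1, 1.51, 0, 0, 0.12, 4.15])

print(round(cosine(julius_caesar, hamlet), 2))  # ~0.29
```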
Log-frequency weighting
Want to reduce the effect of multiple occurrences of a term

A document about “Clinton” will have “Clinton” occurring many times

Rather than use the raw frequency, use the log of the frequency:

    wt,d = 1 + log10(tft,d) if tft,d > 0, and 0 otherwise

0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.


Cosine similarity with 3 documents
How similar are the three novels N1, N2, N3?

Term frequencies (counts)

term         N1    N2    N3
affection   115    58    20
jealous      10     7    11
gossip        2     0     6
3 documents example contd.
Log frequency weighting               After normalization

term        N1    N2    N3            term        N1      N2      N3
affection   3.06  2.76  2.30          affection   0.789   0.832   0.524
jealous     2.00  1.85  2.04          jealous     0.515   0.555   0.465
gossip      1.30  0     1.78          gossip      0.335   0       0.405
wuthering   0     0     2.58          wuthering   0       0       0.588

cos(N1,N2) ≈ 0.789 ∗ 0.832 + 0.515 ∗ 0.555 + 0.335 ∗ 0.0 + 0.0 ∗ 0.0 ≈ 0.94
cos(N1,N3) ≈ 0.79
cos(N2,N3) ≈ 0.69
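A short sketch reproducing these numbers from the log-frequency weights in the table above (length-normalize each vector, then cosine similarity reduces to a dot product):

```python
import math

# Log-frequency weights (affection, jealous, gossip, wuthering), from the
# table above.
w = {
    "N1": [3.06, 2.00, 1.30, 0.00],
    "N2": [2.76, 1.85, 0.00, 0.00],
    "N3": [2.30, 2.04, 1.78, 2.58],
}

def normalized(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cos(a, b):
    # After length normalization, cosine similarity is just a dot product.
    return sum(x * y for x, y in zip(normalized(a), normalized(b)))

for a, b in [("N1", "N2"), ("N1", "N3"), ("N2", "N3")]:
    print(a, b, round(cos(w[a], w[b]), 2))
# N1 N2 0.94, N1 N3 0.79, N2 N3 0.69
```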
tf-idf weighting has many variants
