word2vec
We use words every single day, yet we almost never stop to think about how much information they
convey, and how easily we understand each other using what would seem to be just sequences of
characters. Every word we hear activates our associative machinery in such a way that we grasp its
meaning instantly, and each of our vocabularies contains a great many words. A question arises:
how can the meaning of a word be represented in a computer?
The most common way meaning has been represented in computers is through taxonomic
resources, WordNet being the best-known example. It is very popular among linguists and NLP
practitioners, it is free to use and copy, and it contains a great deal of taxonomic information; it has
been, and still is, very useful for solving language processing problems.
It encodes hypernym (is-a) relationships and synonym sets (synsets), and it is the largest resource of its kind.
Yet WordNet still fails to capture some relationships. For example, it labels the following words
as synonyms: adept, expert, good, practiced, proficient, skillful. We can see how treating these as
interchangeable would cause problems. It also commonly lacks new words: if we wanted to process
tweets, for instance, many of the words would be unknown to WordNet. Moreover, it requires human
labor to create and maintain, and in general it is hard to compute accurate word similarity from it.
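The kind of information WordNet stores, and the problems noted above, can be illustrated with a small sketch. This is a hand-built toy, not the real WordNet API; the synonym set and hypernym links are invented for illustration (only the "good" synset mirrors the example from the text).

```python
# Toy sketch of a taxonomic resource (NOT the real WordNet API).

# Hypothetical synonym sets; the "good" set mirrors the example in the text.
synsets = {
    "good": {"adept", "expert", "good", "practiced", "proficient", "skillful"},
}

# Hypothetical hypernym (is-a) links.
hypernyms = {
    "pizza": "food",
    "pasta": "food",
    "food": "substance",
}

def is_a(word, ancestor):
    """Walk the hypernym chain to test an is-a relationship."""
    while word in hypernyms:
        word = hypernyms[word]
        if word == ancestor:
            return True
    return False

print(is_a("pizza", "substance"))   # True: pizza -> food -> substance
print("expert" in synsets["good"])  # True: "expert" is treated as a synonym
                                    # of "good" -- exactly the problem above
print(is_a("hamburger", "food"))    # False: words absent from the resource
                                    # simply have no relationships at all
```

The last line shows the coverage problem: a word missing from the hand-maintained resource carries no information whatsoever.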
Most NLP work treats words as atomic symbols, sometimes adding statistical models over them
(assigning probabilities to words co-occurring with other words). The vectors used for words under
this atomic representation grow enormous as the vocabulary increases. Moreover, when we feed such
representations into neural networks, we capture almost no relationships between words (the
statistical models help a little, but far from enough). For example, if we implemented these
representations in a search engine, there would be no way to capture the meaning of the query:
results that only match words literally are not sufficient. We would want the results to include pages
containing words related to the ones we typed. If we search for "food near me", we would want
results for all kinds of foods: pizza, pasta, sandwiches, hamburgers, and so on. All of this said, the
need for some other, more useful word representation should be clear.
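The atomic-symbol view described above is usually realized as one-hot encoding. A minimal sketch, assuming a tiny invented vocabulary, shows both problems at once: the vector length equals the vocabulary size, and every pair of distinct words looks equally unrelated.

```python
# Minimal one-hot encoding sketch over a tiny hypothetical vocabulary.
vocab = ["food", "pizza", "pasta", "car"]

def one_hot(word):
    """Represent a word as a sparse vector with a single 1."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Inner product, used here as a crude similarity measure."""
    return sum(a * b for a, b in zip(u, v))

# Every pair of distinct words is equally dissimilar:
print(dot(one_hot("pizza"), one_hot("pasta")))  # 0, although both are foods
print(dot(one_hot("pizza"), one_hot("car")))    # 0, just the same
# The vector length grows with the vocabulary: a million-word vocabulary
# means million-dimensional vectors, almost all of them zero.
```

Nothing in these vectors says that pizza is closer to pasta than to a car, which is precisely why literal word matching fails for search.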
2. Word embedding
Word embedding is a kind of word representation that can be applied to all kinds of text,
depending on the corpus it is trained on. Similar words receive similar representations, and it is
precisely this property that solves the problem stated above. It is currently the most widely used
approach to representing words, and its benefits are numerous. One is computational: word
embedding techniques produce dense, low-dimensional vectors, which suit neural networks well,
since neural networks are known to work poorly with sparse, high-dimensional vectors. Perhaps the
greatest benefit of these techniques is the generalization we achieve precisely because the vectors
are so dense.
Words are represented as real-valued vectors in a predefined vector space. Each word (the
vector that represents it) is assigned initial values and then goes through a learning process similar
to the training of neural networks; it is for this reason that word embeddings mesh so well with
deep learning models.
Often the real-valued vectors have tens or hundreds of dimensions. In contrast to methods such as
one-hot encoding, this is a very small number: one-hot encoded words can have thousands or even
millions of dimensions.
In a sense, each dimension of a word embedding vector is a kind of feature, so the value in each
dimension is associated with some aspect of the word's meaning. Words are thus seen as points in
the vector space.
The values of the vectors are tweaked, i.e. learned, from the usage of the words, and this is exactly
what enables these methods to capture semantic meaning. Contrasted with a method like bag of
words, the amount of meaning captured is substantial: in BOW, unless managed in some way,
different words have totally different representations, regardless of how they are used.
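Once words are points in a vector space, similarity becomes geometry. The following sketch compares invented three-dimensional "embeddings" (real ones have tens to hundreds of dimensions, and the values here are made up purely for illustration) using cosine similarity, the standard measure for this purpose.

```python
import math

# Hypothetical 3-dimensional embeddings; values invented for illustration.
emb = {
    "pizza": [0.9, 0.8, 0.1],
    "pasta": [0.85, 0.75, 0.2],
    "car":   [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between two vectors: 1 = same direction."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

print(cosine(emb["pizza"], emb["pasta"]))  # close to 1: similar words
print(cosine(emb["pizza"], emb["car"]))    # noticeably smaller
```

Unlike the one-hot case, the dense representation lets related words score as related, which is what makes the generalization discussed above possible.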
There is a linguistic theory behind approaches of the word embedding kind: the distributional
hypothesis, due to Zellig Harris. It states that words that occur in similar contexts have similar
meanings. We can see how that makes sense: think of any word, and you can immediately conjure
up words similar to it. Our personal neural network, the associative machine in our heads, does this
without breaking a sweat. Just take a moment to appreciate that.
This notion of letting the usage of a word define its meaning is summarized in an oft-repeated
quip by John Firth:
You shall know a word by the company it keeps!
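The distributional hypothesis can be made concrete with a tiny sketch: count which words appear near a target word in a toy corpus (the three sentences below are invented for illustration) and compare the resulting context vectors. Words used in the same contexts end up with the same counts.

```python
from collections import Counter

# Invented toy corpus for illustrating the distributional hypothesis.
corpus = [
    "i ate pizza yesterday",
    "i ate pasta yesterday",
    "i drove a car yesterday",
]

def context_vector(target, window=1):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                lo, hi = max(0, i - window), min(len(words), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[words[j]] += 1
    return counts

# "pizza" and "pasta" occur in identical contexts ("ate ... yesterday"),
# so their context vectors coincide; "car" is kept company by other words.
print(context_vector("pizza"))
print(context_vector("pasta"))
print(context_vector("car"))
```

Word embedding methods such as word2vec can be seen as learned, compressed versions of exactly this kind of co-occurrence information: the company a word keeps.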