NLP UNIT-V
Semantics: Vector Semantics; Words and Vectors; Measuring Similarity; Semantics with Dense
Vectors; SVD and Latent Semantic Analysis; Embeddings from Prediction: Skip-gram and
CBOW; Concept of Word Sense; Introduction to WordNet.
Semantics refers to the study of the meaning of language. In natural language processing
(NLP), understanding the meaning of words and phrases is crucial for various
language-related tasks. Vector Semantics, also known as distributional semantics, is an
approach to understanding this meaning by representing words as vectors in a
high-dimensional space. This approach is based on the idea that words that appear in similar
contexts tend to have similar meanings.
Example: Let's create a simplified example. Suppose we have a small corpus of text (a
collection of sentences) containing the words "cat," "dog," "pet," and "animal," for instance:
"The cat is a pet," "The dog is a pet," and "A pet is an animal."
We can represent each word as a vector based on how often it appears together with the
other words (a small sketch of this co-occurrence counting follows the list below):
● The vector for "cat" might have high values in dimensions related to "pet" and
"animal."
● Similarly, the vector for "dog" will also have high values in those dimensions.
● The vectors for "pet" and "animal" might have overlapping values in the same
dimensions.
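A minimal sketch of this idea (the tiny corpus below is an assumption for illustration, not taken from these notes): count how often each vocabulary word co-occurs with the others in the same sentence, and use those counts as its vector.

# Co-occurrence sketch: each word's vector is its count of co-occurrences
# with every other vocabulary word within the same sentence.
from collections import Counter
from itertools import permutations

corpus = ["the cat is a pet", "the dog is a pet", "a pet is an animal"]
vocab = ["cat", "dog", "pet", "animal"]

counts = {w: Counter() for w in vocab}
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in permutations(words, 2):
        if w1 in counts and w2 in vocab:
            counts[w1][w2] += 1

for w in vocab:
    print(w, [counts[w][v] for v in vocab])
# "cat" and "dog" end up with similar vectors because both co-occur with "pet".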
Vector Operations: Once words are represented as vectors, you can perform various
mathematical operations on these vectors to derive meaning relationships. For instance, you
can calculate the cosine similarity between two word vectors. Words with vectors that are
close in the vector space (i.e., have a high cosine similarity) are considered to be similar in
meaning.
Applications: Vector semantics supports a range of NLP tasks:
● Word Similarity: It can determine how similar two words are in meaning. Words
with vectors close in space are similar.
● Document Retrieval: It helps match search queries to relevant documents.
● Sentiment Analysis: It can analyze the sentiment of text by considering the meanings
of words.
● Machine Translation: In translation systems, it helps find equivalent words or
phrases in different languages.
Training: To create these word vectors, large text corpora are used for training. Techniques
like Word2Vec, GloVe, and fastText are commonly used for generating word vectors based on
co-occurrence statistics in text.
Words: Words are fundamental units of language that carry meaning. In NLP, text is
composed of words, and understanding the meaning and relationships between
words is a key challenge. Words can represent entities, actions, emotions, and
more, and they are the building blocks of language.
Measuring Similarity
1. Cosine Similarity:
● Definition: Cosine similarity measures the cosine of the angle between two
vectors in a high-dimensional space.
● Example: Suppose we have word embeddings for two words, "king" and
"queen," represented as vectors in a high-dimensional space. Cosine similarity
calculates how similar their directions are in this space. If the cosine similarity
is close to 1, it indicates a high degree of similarity.
● Calculation: Cosine similarity between vectors A and B is calculated as (A
dot B) / (||A|| * ||B||).
2. Jaccard Similarity:
● Definition: Jaccard similarity measures the overlap between two sets of items
(for example, the sets of words in two documents), computed as the size of
their intersection divided by the size of their union.
● Example: Two sentences that share most of their words have a Jaccard
similarity close to 1; sentences with no words in common score 0.
3. Edit Distance:
● Definition: Edit distance (such as Levenshtein distance) counts the minimum
number of character-level insertions, deletions, and substitutions needed to
turn one string into another; smaller distances mean more similar strings.
● Example: "kitten" and "sitting" have an edit distance of 3.
4. WordNet Similarity:
● Definition: WordNet-based measures score similarity using the hierarchical
(is-a) relationships between word senses in the WordNet lexical database, for
example by the path length between their synsets.
● Example: "dog" and "cat" are close in the WordNet hierarchy because both
are kinds of animals.
5. Embedding-Based Similarity:
● Definition: Embedding-based similarity measures the similarity between
words or phrases by calculating the cosine similarity between their respective
word embeddings.
● Example: Using word embeddings, you can calculate the similarity between
"apple" and "orange" by computing the cosine similarity between their vectors.
High cosine similarity indicates that these words are related in meaning.
Choice of Measure: The choice of similarity measure depends on the specific NLP
task and the nature of the data. Cosine similarity is commonly used when working
with word embeddings. Jaccard similarity is suitable for comparing sets of words,
while edit distance quantifies the similarity in terms of character-level operations.
WordNet similarity is useful for capturing hierarchical relationships, and
embedding-based similarity leverages word vectors for semantic similarity.
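A short sketch of two of these measures (the example vectors and word sets are made up purely for illustration): cosine similarity over dense vectors, computed exactly as defined above, and Jaccard similarity over sets of words.

# Cosine similarity: (A dot B) / (||A|| * ||B||)
import numpy as np

A = np.array([0.9, 0.1, 0.3])    # e.g. a vector standing in for "king"
B = np.array([0.8, 0.2, 0.35])   # e.g. a vector standing in for "queen"
cosine = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))
print("cosine similarity:", cosine)    # close to 1 -> similar direction

# Jaccard similarity: |intersection| / |union| of two word sets
set1 = set("the cat sat on the mat".split())
set2 = set("the cat lay on a rug".split())
jaccard = len(set1 & set2) / len(set1 | set2)
print("Jaccard similarity:", jaccard)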
Semantics with Dense Vectors: Semantics with dense vectors, also known as
distributed or dense vector semantics, is an approach in NLP that represents words
as dense vectors in a high-dimensional space. These dense vectors are designed to
capture the meaning and relationships between words, and they are particularly
well-suited for computational efficiency and effectiveness.
Key Points: Each word is mapped to a short, dense vector (typically a few hundred
dimensions) in which most values are non-zero, unlike long, sparse count-based vectors
whose dimensions correspond to individual context words.
Advantages: Dense vectors are compact, capture fine-grained gradations of similarity
between words, and generalize well as features for downstream models such as classifiers
and neural networks.
Popular Algorithms: Word2Vec, GloVe, and fastText are popular algorithms that
learn dense vector representations of words from large text corpora.
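As a rough illustration of how such vectors are learned in practice, here is a minimal sketch using the gensim library (the toy corpus and parameter values are illustrative choices, not taken from these notes):

# Train Word2Vec on a tiny corpus and query the learned dense vectors.
from gensim.models import Word2Vec

sentences = [["the", "cat", "is", "a", "pet"],
             ["the", "dog", "is", "a", "pet"],
             ["a", "pet", "is", "an", "animal"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["cat"])                       # the 50-dimensional vector for "cat"
print(model.wv.similarity("cat", "dog"))     # cosine similarity between two words
print(model.wv.most_similar("pet", topn=2))  # nearest neighbours of "pet"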
SVD and Latent Semantic Analysis: In Latent Semantic Analysis (LSA), a word-context
co-occurrence matrix M is factorized with singular value decomposition, M = U ∑ VT, and
then truncated to the top k dimensions:
Mk = Uk ∑k VTk
where
Mk = the rank-k approximation of M
Uk, ∑k, VTk are the matrices containing only the top k components from U, ∑, VT respectively
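A minimal numpy sketch of this truncation (the small matrix and the choice k = 2 are made up for illustration):

# Truncated SVD: keep only the top-k singular values/vectors to obtain a
# low-rank approximation Mk of the co-occurrence matrix M.
import numpy as np

M = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 2.0]])

U, S, VT = np.linalg.svd(M, full_matrices=False)

k = 2
Mk = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]   # Uk * Sigma_k * VTk
print(Mk)

# The rows of U[:, :k], scaled by S[:k], can serve as dense word vectors.
word_vectors = U[:, :k] * S[:k]
print(word_vectors)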
# Input word to get its embedding (assumes `tokenizer` is a fitted Keras Tokenizer
# and `model` is a model whose first layer is a 3-dimensional Embedding layer;
# one possible setup is sketched below)
input_array = tokenizer.texts_to_sequences(["Data"])
output_array = model.predict(input_array)
print(output_array)
In this example, we create a simple word embedding model that condenses word
representations into a 3-dimensional vector space. The word "Data" is passed through the
model to obtain its embedding vector.
The output represents the word "Data" in the 3-dimensional embedding space, and similar
words should have similar embeddings in this space, capturing their contextual relationships.
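The definitions of tokenizer and model are not shown in these notes; a minimal sketch of one possible setup, assuming TensorFlow/Keras and an (untrained) 3-dimensional Embedding layer, is:

# Assumed setup for the snippet above: fit a Tokenizer on a tiny corpus and
# build a model whose only layer maps word indices to 3-dimensional vectors.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

corpus = ["data science uses data", "machine learning learns from data"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
vocab_size = len(tokenizer.word_index) + 1      # +1 for the reserved index 0

model = Sequential([Embedding(input_dim=vocab_size, output_dim=3)])

input_array = np.array(tokenizer.texts_to_sequences(["Data"]))  # "Data" -> [[index]]
output_array = model.predict(input_array)                       # shape (1, 1, 3)
print(output_array)

Here the Embedding layer is randomly initialized, so the printed vector only illustrates the mechanics; in practice the layer would be trained as part of a larger model, for example with the Skip-gram or CBOW objectives described next.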
Embeddings from Prediction: Skip-gram
Objective: The Skip-gram model aims to learn word embeddings by predicting the
context words surrounding a given target word.
Data Preparation:
Text Corpus: Take the example sentence: "the quick brown fox jumps over the
lazy dog."
Context Window: Define a context window size. In this example, we'll use a
context window of 1, meaning we consider one word to the left and one word
to the right of the target word.
Pairs Creation: For each word in the corpus, create training pairs. Pair the
target word with each word within its context window. This forms a dataset of
(target word, context word) pairs. Here are some pairs for our example (a short
sketch that generates them follows this list):
● ("quick", "the")
● ("quick", "brown")
● ("brown", "quick")
● ("brown", "fox")
● ("fox", "brown")
● ("fox", "jumps")
● ("jumps", "fox")
● ("jumps", "over")
● ("over", "jumps")
● ("over", "the")
● ("the", "over")
● ("the", "lazy")
● ("lazy", "the")
● ("lazy", "dog")
● ("dog", "lazy")
Training: During the training process, the neural network adjusts its weights
(word vectors) to maximize the probability of predicting the correct context
words for a given target word. The objective is to make the predicted context
words as close as possible to the actual context words observed in the
training data.
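In the standard Skip-gram formulation (stated here for reference; these notes do not spell it out), each word w has an input vector v_w and an output vector u_w, and the probability of a context word given the target word is modelled with a softmax over the whole vocabulary V:
P(context | target) = exp(u_context · v_target) / ∑_w exp(u_w · v_target), where the sum runs over every word w in V.
Training maximizes this probability for the observed (target word, context word) pairs, typically with approximations such as negative sampling to avoid summing over the full vocabulary.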
Embedding Space:
Word Vectors: After training is complete, the word vectors learned by the
model reside in a lower-dimensional vector space. Each word in the
vocabulary now has an associated vector that represents its semantic
meaning based on the contextual usage observed in the training corpus.
These word vectors can be used for various natural language processing tasks, such
as measuring word similarity, text classification, and machine translation. The
Skip-gram model captures semantic relationships between words by learning from
their contextual usage within the corpus.
Please note that in practice, the training data is usually much larger, and the
dimensionality of the word vectors is often set to a relatively small number, e.g.,
100-300 dimensions, to balance computational efficiency and meaningful semantic
representation.
Embeddings from Prediction: CBOW (Continuous Bag of Words)
Objective: The CBOW model aims to learn word embeddings by predicting a target
word based on the context words surrounding it.
Data Preparation:
Text Corpus: Use the example sentence: "It would be sad memory to watch it
would be unhappy memory to watch."
Context Window: Define a context window size. In this example, we'll use a
context window of 2, meaning we consider two words to the left and two
words to the right of the target word.
Pairs Creation: For each target word, pair the context words within its window
with the target word to be predicted. In this sentence, the targets "sad" and
"unhappy" each have the context words ["would", "be"] to the left and
["memory", "to"] to the right, giving one training example per target:
● For "sad":
● (context: ["would", "be", "memory", "to"], target: "sad")
● For "unhappy":
● (context: ["would", "be", "memory", "to"], target: "unhappy")
Note that both targets appear in identical contexts, which is exactly why the model
will learn similar vectors for "sad" and "unhappy."
Neural Network Setup: Create a neural network with an embedding layer that
learns word vectors. In the CBOW model, the context words are used as input,
and the model is trained to predict the target word.
● Input: Context words (e.g., ["would", "be", "memory", "to"] for predicting
"sad").
● Output: Probability distribution over the vocabulary for the target word
("sad").
Training: During the training process, the neural network adjusts its weights
(word vectors) to maximize the probability of predicting the correct target
word based on the context words.
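In the standard CBOW formulation (stated here for reference), the input vectors of the context words are averaged and the target word is scored with a softmax over the vocabulary:
v_avg = (v_would + v_be + v_memory + v_to) / 4
P("sad" | context) = exp(u_sad · v_avg) / ∑_w exp(u_w · v_avg), where the sum runs over every word w in the vocabulary.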
Embedding Space:
Word Vectors: After training, the word vectors learned by the model reside in a
lower-dimensional vector space. Each word in the vocabulary has an
associated vector representing its semantic meaning based on the contextual
usage observed in the training corpus.
These word vectors can be used for various NLP tasks. In this case, the word vectors
for "sad" and "unhappy" have been learned based on their contextual usage in the
given sentence.
The CBOW model is particularly useful for capturing semantic relationships between
words when the training data is limited or noisy, as it considers multiple context
words to predict the target word.
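Libraries such as gensim implement both architectures behind a single interface; as a rough sketch (the parameter values are illustrative, not from these notes), the sg flag switches between CBOW and Skip-gram:

# sg=0 trains CBOW (predict target from context); sg=1 trains Skip-gram
# (predict context from target). Everything else stays the same.
from gensim.models import Word2Vec

text = "it would be sad memory to watch it would be unhappy memory to watch"
corpus = [text.split()]

cbow = Word2Vec(corpus, vector_size=20, window=2, min_count=1, sg=0, epochs=100)
skipgram = Word2Vec(corpus, vector_size=20, window=2, min_count=1, sg=1, epochs=100)

# "sad" and "unhappy" share identical contexts, so their vectors should tend
# to drift together during training.
print(cbow.wv.similarity("sad", "unhappy"))
print(skipgram.wv.similarity("sad", "unhappy"))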
Word Sense: In linguistics and natural language processing (NLP), the term "word
sense" refers to the multiple meanings or interpretations that a single word can have
in different contexts. It acknowledges that words can carry distinct senses or
nuances based on their usage, and these senses are the specific ways in which a
word conveys meaning.
Importance of Word Sense: Word sense is a crucial concept in NLP because tasks
such as machine translation, information retrieval, and question answering can go
wrong when a word is interpreted in the wrong sense; resolving the intended sense
from context is known as word sense disambiguation (WSD).
Example: Let's consider the word "run," which has multiple senses:
1. "I run every morning to stay fit."
2. "She runs a small bakery in town."
In the first sentence, "run" refers to physical exercise, while in the second sentence, it
has a different sense, meaning managing a business. The context of the word helps
us disambiguate its meaning.
Introduction to WordNet
WordNet is a large lexical database of English in which nouns, verbs, adjectives, and
adverbs are grouped into sets of synonyms called synsets, each expressing a distinct
concept; the synsets are interlinked by semantic relations such as hypernymy (is-a),
hyponymy, and antonymy.
Example: The word "run" is linked to many distinct synsets in WordNet, including one
for running as physical movement and one for running in the sense of operating or
managing something (as in running a business); each synset has its own definition
and its own set of synonyms.
Sense Disambiguation: WordNet links not only word forms but specific
senses of words. This means that words are linked based on their distinct
meanings in different contexts, enabling precise sense disambiguation.
Semantic Relations: WordNet labels the semantic relationships among words
explicitly. These relationships provide deeper insights into word connections
beyond mere synonymy.
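A minimal NLTK sketch of these ideas (it assumes the WordNet corpus has already been downloaded with nltk.download("wordnet")):

# List a few senses (synsets) of "run", then explore explicit semantic
# relations such as hypernyms and synonyms for one sense of "dog".
from nltk.corpus import wordnet as wn

for synset in wn.synsets("run")[:3]:
    print(synset.name(), "-", synset.definition())   # each synset is one distinct sense

dog = wn.synset("dog.n.01")
print(dog.hypernyms())      # "is-a" parents, e.g. canine.n.02, domestic_animal.n.01
print(dog.lemma_names())    # synonyms grouped in the same synset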