
Entropy

Entropy is a metric that has been used in many fields to quantify the randomness of a process
and, in computational linguistics specifically, to compare languages around the world.
Definition for Entropy: The entropy (also called self-information) of a random variable is
the average level of information, surprise, or uncertainty inherent in the variable's
possible outcomes.
 The more certain or deterministic an event is, the less information it contains. In a
nutshell, information goes hand in hand with uncertainty or entropy: the more uncertain an
outcome is, the more information its observation conveys (the formula below makes this
precise).
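Formally (a standard formulation of Shannon entropy, with the base-2 logarithm giving the
result in bits), for a discrete random variable X with outcome probabilities p(x):

H(X) = -\sum_{x} p(x) \log_2 p(x)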
Entropy in Different Fields of NLP & AI
 In terms of probability theory and from a language perspective in NLP, entropy can
also be defined as a statistical parameter that measures how much information is produced,
on average, for each letter of a text in the language.
o If the language is translated into binary digits (0 or 1) in the most efficient
way, the entropy H is the average number of binary digits required per letter of
the original language.
 From a machine learning perspective, entropy is a measure of uncertainty, and the
objective of the machine learning model is to minimize uncertainty.
o Decision tree learning algorithms use relative entropy to determine the
decision rules that govern the data at each node.
o Classification algorithms in machine learning like logistic regression or
artificial neural networks often employ a standard loss function called cross
entropy loss that minimizes the average cross entropy between ground truth
and predicted distributions.
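As a concrete illustration of "bits per letter", here is a minimal Python sketch (the sample
string is a made-up example) that estimates the per-character entropy of a text from its
observed character frequencies:

from collections import Counter
from math import log2

def entropy_bits(text: str) -> float:
    """Estimate per-character entropy (bits/char) from observed frequencies."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog"
print(round(entropy_bits(sample), 3))  # average bits needed per character of the sample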
Historical Perspective for Entropy
 Entropy in Information Theory: It was introduced by Claude Shannon, who defined it
as a statistical parameter which measures, in a certain sense, how much information is
produced on average for each letter of a text in the language.
o If the language is translated into binary digits (0 or 1) in the most efficient
way, the entropy is the average number of binary digits required per
letter of the original language.
 Entropy for a natural language: The entropy of a natural language is the average
amount of information of one character in an infinite length of text, which
characterizes the complexity of natural language.
o Historically, there have been many proposals for experimentally estimating the
entropy rate, since the true probability distribution of natural language is unknown.
o Most of these approaches relied on the predictive power of humans or
computational models such as n-gram language models and compression
algorithms.
 Using entropy as a metric: The main idea is that if a model captures more of the
structure of a language, then the entropy of the model should be lower, so entropy can be
used as a measure of the quality of language models.
Cross Entropy
Because we cannot access an infinite amount of text in a language, and the true
distribution of the language is unknown, we define a more practical and usable metric
called Cross Entropy.
 Intuition for Cross entropy: It is often used to measure the closeness of two
distributions: one (Q) is the distribution learned by the language model from the sample
text, which it aims to match as closely as possible, and the other (P) is the empirical
distribution of the language.
 From the formulation below, we can see that the cross entropy of Q with respect to P is
the sum of two terms, the entropy and the relative entropy (KL divergence):
o H(P), the entropy of P, is the average number of bits needed to encode any
possible outcome of P.
o D_KL(P || Q), the relative entropy, is the number of extra bits required to encode
any possible outcome of P using a code optimized over Q.
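Written out (a standard formulation consistent with the two terms above), the cross entropy
of Q with respect to P decomposes as:

H(P, Q) = -\sum_{x} P(x) \log Q(x) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)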
The empirical entropy H(P) is fixed and cannot be optimized, so when we train a language
model with the objective of minimizing the cross-entropy loss, the true objective is to
minimize the KL divergence between the distribution learned by our language model and the
empirical distribution of the language.
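A minimal numerical sketch of this decomposition, using two small hypothetical distributions
(P standing in for the empirical distribution, Q for the model), is shown below:

from math import log2

P = [0.5, 0.3, 0.2]   # hypothetical empirical distribution
Q = [0.4, 0.4, 0.2]   # hypothetical model distribution

entropy_P = -sum(p * log2(p) for p in P)
cross_PQ  = -sum(p * log2(q) for p, q in zip(P, Q))
kl_PQ     = sum(p * log2(p / q) for p, q in zip(P, Q))

# Cross entropy = entropy + KL divergence; since H(P) is fixed,
# minimizing cross entropy minimizes the KL divergence.
assert abs(cross_PQ - (entropy_P + kl_PQ)) < 1e-9
print(round(cross_PQ, 4), round(entropy_P, 4), round(kl_PQ, 4))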
Handling Unknown Words
 Tokenizers in Language Models: Tokenization is the first and an essential step in any
NLP pipeline, especially for language models, which break unstructured natural language
text into chunks of information that can be treated as discrete elements.
o The token occurrences in a document can be used directly as
a vector representing that document.
o The goal when crafting the vocabulary with tokenizers is to do it in such a way
that the tokenizer tokenizes as few words as possible into the unknown token.
 Issue with unknown vocabulary/tokens: The general approach in most tokenizers is
to encode the rare words in your dataset using a special token, conventionally UNK, so
that any new out-of-vocabulary word is labeled as belonging to the rare-word category.
o We expect the model to learn how to deal with such unseen words through the
custom UNK token (a minimal lookup sketch follows below).
o It is also generally a bad sign if the tokenizer produces a lot of these unknown
tokens, as it means it was not able to retrieve a sensible representation of a
word and we are losing information along the way.
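The sketch below uses a hypothetical toy vocabulary and the common UNK-fallback convention;
it is not any particular tokenizer's API:

# Hypothetical toy vocabulary; real tokenizers build this from a corpus.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def encode(words):
    """Map each word to its ID, falling back to the <unk> ID for OOV words."""
    return [vocab.get(w, vocab["<unk>"]) for w in words]

print(encode("the cat sat on the zebra".split()))  # "zebra" maps to <unk> (0)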
 Methods to handle unknown tokens / OOV (out of vocabulary): Character-level
embeddings and sub-word tokenization are effective ways to handle unknown tokens.
o Under sub-word tokenization, WordPiece and BPE are the de facto
methods employed by successful language models such as BERT and GPT.
 Character level embeddings: Character and subword embeddings were introduced as
an attempt to limit the size of embedding matrices (as in BERT), and they have the
advantage of being able to handle new slang words, misspellings, and OOV words.
o The required embedding matrix is much smaller than what is required for
word-level embeddings; generally, the vectors represent each character in any
language.
o Example: Instead of a single vector for "king" like in word embeddings, there
would be a separate vector for each of the letters "k", "i", "n", and "g".
o Character embeddings do not encode the same type of information that word
embeddings contain and can be thought of as encoding lexical information and
may be used to enhance or enrich word-level embeddings.
o Character level embeddings are also generally shallow in meaning, but if we
have the character embeddings, a vector can be formed for every single word,
even if it is an out-of-vocabulary word (see the sketch below).
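A minimal sketch of this idea, assuming a hypothetical randomly initialized character
embedding table (in practice the table is learned), composes a word vector from its
character vectors, which also works for unseen or misspelled words:

import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8
# Hypothetical character embedding table: one small vector per character.
char_emb = {ch: rng.normal(size=EMB_DIM) for ch in "abcdefghijklmnopqrstuvwxyz"}

def word_vector(word: str) -> np.ndarray:
    """Compose a word vector as the mean of its character vectors."""
    vectors = [char_emb[ch] for ch in word.lower() if ch in char_emb]
    return np.mean(vectors, axis=0)

# Works for "king" and equally for an unseen or misspelled form like "kinng".
print(word_vector("king").shape, word_vector("kinng").shape)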
 Subword tokenization: Subword tokenization allows the model to have a reasonable
vocabulary size while being able to learn meaningful context-independent
representations and also enables the model to process words it has never seen before
by decomposing them into known subwords.
o Example: The word refactoring can be split into re, factor, and ing.
Subwords re, factor, and ing occur more frequently than the word refactoring,
and their overall meaning is also kept intact.
 Byte-Pair Encoding (BPE): BPE was initially developed as an algorithm
to compress texts and then used by OpenAI for tokenization when pretraining the
GPT model.
o It is used by a lot of Transformer models like GPT, GPT-2, RoBERTa, BART,
and DeBERTa.
o BPE strikes a balance between character-level and word-level
representations, which makes it capable of handling large corpora.
o This behavior also enables the encoding of rare words with appropriate
subword tokens without introducing any “unknown” tokens (a minimal merge
sketch follows below).
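The core of BPE training can be sketched in a few lines: count adjacent symbol pairs over a
toy word-frequency corpus (the corpus and the number of merges below are hypothetical) and
repeatedly merge the most frequent pair:

from collections import Counter

# Toy word-frequency corpus; each word is a sequence of symbols ending with </w>.
corpus = {("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2,
          ("n", "e", "w", "e", "s", "t", "</w>"): 6, ("w", "i", "d", "e", "s", "t", "</w>"): 3}

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs weighted by word frequency and return the top one."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(5):                       # learn 5 merges (hypothetical budget)
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged:", pair)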
What is N-grams in NLP?
N-grams in NLP are contiguous sequences of n words extracted from text for language
processing and analysis. An n-gram can be as short as a single word (unigram) or as long as
multiple words (bigram, trigram, etc.). These n-grams capture contextual information and
relationships between words in a given text.
How N-grams in NLP works
N-grams in NLP can be generated by sliding a window of n words across a sentence or text
corpus. By extracting these n-grams, it becomes possible to analyze the frequency of
occurrence of certain word sequences, identify collocations or commonly co-occurring
words, and model the language patterns in a text. N-grams can also be used as features for
training machine learning models in tasks like text classification or sentiment analysis.
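A minimal sketch of this sliding-window extraction and frequency counting in plain Python
(the sample sentence is arbitrary):

from collections import Counter

def ngrams(tokens, n):
    """Slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps over the lazy dog".split()
bigrams = ngrams(tokens, 2)
print(Counter(bigrams).most_common(3))   # most frequent bigrams in the sample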
Why N-grams in NLP is important
N-grams in NLP play a crucial role in various natural language processing tasks. By
considering the context of words, n-grams provide a more nuanced understanding of text and
enable more accurate language processing. Some key benefits of using n-grams include:
 Language modeling: N-grams help capture the probability distribution of words in a
given language, which is useful for tasks like machine translation, speech recognition,
and auto-completion.
 Information retrieval: N-grams can be used to index and search text efficiently,
providing relevant results even for partial word queries.
 Text prediction: By analyzing the most frequent n-grams, it becomes possible to predict
the next word in a sequence, aiding in applications like text generation and autocomplete.
The most important N-grams in NLP use cases
N-grams in NLP find applications across a wide range of domains, including:
 Sentiment analysis: Analyzing n-grams helps in understanding the sentiment expressed
in text by capturing the context of words and phrases.
 Named Entity Recognition (NER): NER systems utilize n-grams to identify and classify
named entities such as names, locations, organizations, dates, and more.
 Text classification: N-grams are used as features in machine learning models for
classifying text into predefined categories.
 Topic modeling: N-grams aid in uncovering latent topics within a collection of
documents, enabling clustering and categorization.
 Language generation: N-grams provide the foundation for generating realistic and
coherent text, such as in chatbots or language translation systems.

Part-of-Speech (POS) tagging is a fundamental task in Natural Language Processing (NLP)
that involves assigning a grammatical category (such as noun, verb, adjective, etc.) to each
word in a sentence. The goal is to understand the syntactic structure of a sentence and
identify the grammatical roles of individual words. POS tagging provides essential
information for various NLP applications, including text analysis, machine translation, and
information retrieval.
Key Concepts:
1. POS Tags: POS tags are short codes representing specific parts of speech. Common POS tags
include:
 Noun (NN)
 Verb (VB)
 Adjective (JJ)
 Adverb (RB)
 Pronoun (PRP)
 Preposition (IN)
 Conjunction (CC)
 Determiner (DT)
 Interjection (UH)
2. Tag Sets:
 Different tag sets may be used depending on the POS tagging system or language. For
example, the Penn Treebank POS tag set is widely used in English NLP tasks.
3. Ambiguity:
 Words may have multiple possible POS tags based on context. For example, “lead” can
be a noun (the metal) or a verb (to guide).
Importance and Applications:
1. Syntactic Analysis:
 POS tagging is crucial for understanding the grammatical structure of a sentence,
enabling syntactic analysis. It helps identify the subject, verb, object, and other syntactic
elements.
2. Semantic Analysis:
 POS tags contribute to understanding the meaning of words in context. For example,
distinguishing between a noun and a verb can significantly impact the interpretation of a
sentence.
3. Information Retrieval:
 POS tagging is used in information retrieval systems to improve the precision and
relevance of search results. For instance, searching for “NN” (noun) in a document can
prioritize nouns over other words.
4. Named Entity Recognition (NER):
 POS tags play a role in named entity recognition by providing information about the
grammatical category of words. For example, recognizing that “New York” is a proper
noun.
5. Machine Translation:
 POS tagging is essential in machine translation to ensure accurate translation based on
the grammatical structure of sentences.
Methods of POS Tagging:
1. Rule-Based Tagging:
 Based on handcrafted rules that consider word morphology, context, and syntactic
information. It can be effective but may struggle with ambiguity.
2. Statistical Tagging:
 Uses statistical models trained on large annotated corpora to predict POS tags. Hidden
Markov Models (HMMs) and Conditional Random Fields (CRFs) are common statistical
approaches.
3. Machine Learning-Based Tagging:
 Utilizes machine learning algorithms such as decision trees, support vector machines, or
neural networks to learn patterns from data. Particular emphasis is given to contextual
information.
4. Deep Learning-Based Tagging:
 Deep learning models, such as recurrent neural networks (RNNs) and long short-term
memory networks (LSTMs), are employed for POS tagging, capturing complex
contextual dependencies.
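To make the statistical approach concrete, here is a minimal Viterbi sketch over a tiny
hand-made HMM; the tags, transition and emission probabilities below are invented for
illustration rather than estimated from a real corpus:

# Tiny hypothetical HMM: tags, start/transition/emission probabilities.
tags = ["DT", "NN", "VB"]
start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans = {"DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
         "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
         "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1}}
emit = {"DT": {"the": 0.9, "dog": 0.0, "barks": 0.0},
        "NN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
        "VB": {"the": 0.0, "dog": 0.1, "barks": 0.9}}

def viterbi(words):
    """Return the most probable tag sequence for the observed words."""
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (start[t] * emit[t].get(words[0], 0.0), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            prob, path = max(
                (best[p][0] * trans[p][t] * emit[t].get(w, 0.0), best[p][1] + [t])
                for p in tags)
            new_best[t] = (prob, path)
        best = new_best
    return max(best.values())[1]

print(viterbi(["the", "dog", "barks"]))  # expected: ['DT', 'NN', 'VB']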
Challenges:
1. Ambiguity:
 Words often have multiple possible POS tags based on context, leading to challenges in
disambiguation.
2. Context Dependency:
 POS tags can depend on the surrounding words, making accurate tagging sensitive to
context.
3. Out-of-Vocabulary Words:
 Handling words not seen during training is a challenge, as their POS tags need to be
predicted based on context.
Example:
Consider the sentence: “The quick brown fox jumps over the lazy dog.”
POS tagging might yield:
 “The” (DT): Determiner
 “quick” (JJ): Adjective
 “brown” (JJ): Adjective
 “fox” (NN): Noun
 “jumps” (VBZ): Verb
 “over” (IN): Preposition
 “the” (DT): Determiner
 “lazy” (JJ): Adjective
 “dog” (NN): Noun
In this example, each word is assigned a POS tag indicating its grammatical category in the
sentence.
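For reference, a tagged output like the one above can be produced in a few lines with NLTK's
off-the-shelf tagger (resource names may vary slightly across NLTK versions, and the predicted
tags may differ a little from the hand-tagged example, e.g. for “brown”):

import nltk

nltk.download("punkt")                       # tokenizer models
nltk.download("averaged_perceptron_tagger")  # default English POS tagger

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))   # list of (word, Penn Treebank tag) pairs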

————————————————————————

Applications where POS tagging plays a crucial role:
1. Syntactic Parsing:
 Application: Understanding the grammatical structure of sentences.
 Role of POS Tagging: POS tags provide information about the syntactic role of each
word, aiding in syntactic parsing and tree construction.
2. Named Entity Recognition (NER):
 Application: Identifying and classifying entities (e.g., persons, organizations, locations)
in text.
 Role of POS Tagging: POS tags help in identifying proper nouns, which are often
indicative of named entities.
3. Information Retrieval:
 Application: Improving search and retrieval of relevant documents or information.
 Role of POS Tagging: Using POS tags, one can prioritize or filter search results based
on the grammatical category of words. For instance, focusing on nouns for certain
queries.
4. Text Summarization:
 Application: Generating concise summaries of longer texts.
 Role of POS Tagging: Understanding the syntactic structure helps in identifying key
elements and relationships in the text, aiding in the creation of coherent summaries.
5. Machine Translation:
 Application: Translating text from one language to another.
 Role of POS Tagging: POS tags provide information about the grammatical structure,
aiding in accurate translation by preserving the syntactic and grammatical nuances of the
source language.
6. Sentiment Analysis:
 Application: Determining the sentiment expressed in a piece of text (positive, negative,
neutral).
 Role of POS Tagging: Identifying adjectives and verbs in particular helps in capturing
the sentiment expressed by the author.
7. Question Answering Systems:
 Application: Generating accurate answers to user queries.
 Role of POS Tagging: Understanding the grammatical structure of questions helps in
extracting key information and formulating appropriate answers.
8. Text-to-Speech Synthesis:
 Application: Converting written text into spoken language.
 Role of POS Tagging: POS tags guide the synthesis process, ensuring that the spoken
output follows appropriate intonation and emphasis based on the grammatical structure.
9. Speech Recognition:
 Application: Converting spoken language into written text.
 Role of POS Tagging: POS tags contribute to language models used in speech
recognition, aiding in predicting the likely sequence of words based on their grammatical
roles.
10. Grammar Checking:
 Application: Identifying and correcting grammatical errors in written text.
 Role of POS Tagging: POS tags help in detecting errors related to word usage,
agreement, and syntactic structure.

What is Backoff?
The general idea of backoff is that using less context can help the model generalize better
for contexts that it doesn't know enough about.
 For example, we can use trigram probabilities if there is sufficient evidence; otherwise
we fall back to bigram or unigram probabilities (a minimal sketch follows below).
 It builds on the idea that sentences are generated in small steps (n-grams) that can be
recombined in other ways.
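A minimal count-based sketch of this fallback logic over a toy corpus (a simplification: it
does not apply the discounting and renormalization of full Katz backoff):

from collections import Counter

tokens = "the cat sat on the mat . the cat ate the fish .".split()
unigrams = Counter(tokens)
bigrams  = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

def backoff_prob(w1, w2, w3):
    """P(w3 | w1, w2): use the trigram if seen, else the bigram, else the unigram."""
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return bigrams[(w2, w3)] / unigrams[w2]
    return unigrams[w3] / sum(unigrams.values())

print(backoff_prob("the", "cat", "sat"))   # trigram evidence exists
print(backoff_prob("the", "dog", "sat"))   # falls back to the unigram estimate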
Why Do We Need Backoff in NLP?
We need techniques like smoothing (backoff is one of a family of smoothing techniques) to
tackle the problems of sparsity and improve the generalization power of NLP models.
Issues of Sparsity in NLP
 Language models use huge numbers of parameters estimated from data, so sparsity
becomes an issue; incorporating proper smoothing techniques into the models usually leads
to higher accuracy than the unsmoothed models.
 Data sparsity is an issue for language models, statistical modeling, and NLP techniques
in general, and smoothing can help with the resulting performance problems.
 Sometimes we come across the extreme case where there is so much training data that
all parameters can be accurately estimated without smoothing; even then, it is almost
always worth expanding the model, for example by moving to a higher-order n-gram
model, to achieve improved performance, and smoothing helps in such cases too.
Issues with Model Generalization in NLP
Language models, in general, use a huge amount of data for training, and the output of these
models is dependent on the N-grams appearing in the training corpus. The models work well
only if the training corpus is similar to the testing dataset; this is the risk of overfitting to the
training set.
 As with any machine learning method, we would like results that are generalizable to
new information.
 One of the harder problems with language models, and NLP in general, is how we deal
with words that do not appear in training but are present in the test data.
Thus, no matter how much data one has or how many parameters are there in the model,
smoothing can almost always help performance for a relatively small effort.

Interpolation
Interpolation is another smoothing technique that combines knowledge from multiple orders
of N-gram in the computation of the probabilities. Interpolation is similar to backoff: for
computing a trigram probability, we also use the bigram and unigram information of the
words in focus.
 We can say that we are linearly interpolating a bigram and a unigram model when
combining bigrams and unigrams with trigrams. We can generalize this to
interpolating an N-gram model using an (N-1)-gram model.
o We need to note that this leads to a recursive procedure if the lower-order N-
gram probability also doesn't exist; hence, if necessary, everything can be
estimated in terms of a unigram model.
 A scaling factor is also used to make sure that the conditional distribution will sum to
one.
o If an N-gram-specific weight is used, it leads to far too many parameters to
estimate. In such cases, we need to cluster such weights suitably, for example by
word classes, or in the extreme case fall back to a single weight (a minimal
sketch with fixed weights follows below).
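A minimal sketch of linear interpolation over the same kind of toy counts; the lambda weights
below are hypothetical and would normally be tuned on held-out data, and they must sum to one:

from collections import Counter

tokens = "the cat sat on the mat . the cat ate the fish .".split()
unigrams = Counter(tokens)
bigrams  = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = sum(unigrams.values())

# Hypothetical interpolation weights; they must sum to 1.
L3, L2, L1 = 0.6, 0.3, 0.1

def interp_prob(w1, w2, w3):
    """Linearly interpolate trigram, bigram, and unigram estimates of P(w3 | w1, w2)."""
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p2 = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p1 = unigrams[w3] / total
    return L3 * p3 + L2 * p2 + L1 * p1

print(interp_prob("the", "cat", "sat"))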
