NLP New
The Minimum Edit Distance (MED) algorithm is a classic dynamic programming algorithm used in
Natural Language Processing (NLP) for tasks such as spelling correction, OCR error detection, and
machine translation. It measures how different two sequences of symbols (typically strings) are
by counting the minimum number of operations required to transform one sequence into the other:
the smaller the distance, the more similar the sequences.
The standard operations are insertion, deletion, and substitution (replacement) of a single symbol.
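The dynamic programming idea can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name min_edit_distance and the sub_cost parameter are made up for this example, and the unit costs for insertion and deletion are an assumption (the notes do not fix any costs).

```python
def min_edit_distance(source: str, target: str, sub_cost: int = 1) -> int:
    """Minimum edit distance between source and target.

    Assumes insertion and deletion each cost 1; substitution costs
    sub_cost (1 gives the usual Levenshtein distance).
    """
    n, m = len(source), len(target)

    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]

    # Turning a prefix into the empty string takes only deletions.
    for i in range(1, n + 1):
        dist[i][0] = i
    # Building a prefix from the empty string takes only insertions.
    for j in range(1, m + 1):
        dist[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                              # deletion
                dist[i][j - 1] + 1,                              # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # substitution or match
            )

    return dist[n][m]


print(min_edit_distance("kitten", "sitting"))  # 3
```

Each table cell depends only on its three neighbours (above, left, diagonal), so the whole table is filled in time proportional to the product of the two string lengths.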
Homonymy:
Meaning: Words with the same spelling and pronunciation but completely unrelated meanings and
origins.
Example:
"Bat" (the flying mammal) and "bat" (the club used in baseball) are homonyms. They are spelled
and pronounced the same, but they have different etymological roots and unrelated meanings.
Polysemy:
Meaning: A single word with multiple related meanings that have evolved from a common origin.
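Example:
"Bank" can refer to a financial institution, the building that houses it, or a store of something
held in reserve (a blood bank). These meanings are all connected to the idea of a place where
something valuable is kept. By contrast, "bank" as the side of a river is usually treated as a
homonym of the financial sense rather than as a case of polysemy.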
1. Unigram (N = 1): a single word considered on its own.
2. Bigram (N = 2): a sequence of two consecutive words.
Benefits of N-grams:
EXAMPLE: "The cat sat on the mat because it was tired."
Unigrams:
Analyze each word independently: "the", "cat", "sat", "on", "the", "mat",
"because", "it", "was", "tired".
Unigrams tell us the frequency of individual words but don't capture how they
relate to each other.
Bigrams:
Look at sequences of two consecutive words: "the cat", "cat sat", "sat on", "on
the", "the mat", "mat because", "because it", "it was", "was tired".
Bigrams reveal how often words appear next to each other. For example, if
"the cat" occurs frequently in a corpus, a model can infer that "cat" is a
likely word to follow "the".
Trigrams:
Consider sequences of three words: "the cat sat", "cat sat on", "sat on the",
"on the mat", "the mat because", "mat because it", "because it was", "it was
tired".
Trigrams provide more context still. Here, "the cat sat" captures a subject
("the cat") together with its verb ("sat").
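The unigram, bigram, and trigram lists above can be reproduced with a short sliding-window sketch. The helper name ngrams is hypothetical, and tokenization is assumed to be plain whitespace splitting of the lowercased sentence "the cat sat on the mat because it was tired"; real text would need a proper tokenizer.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumption: whitespace tokenization is good enough for this toy sentence.
tokens = "the cat sat on the mat because it was tired".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Counting n-grams gives the frequencies that n-gram models are built on.
unigram_counts = Counter(unigrams)
bigram_counts = Counter(bigrams)

print(len(unigrams), len(bigrams), len(trigrams))  # 10 9 8
print(unigram_counts[("the",)])                    # 2
print(bigram_counts[("the", "cat")])               # 1
```

A sentence of 10 tokens yields 10 - N + 1 N-grams, which is why the lists get shorter as N grows. In this single sentence every bigram occurs only once, so meaningful frequencies need a much larger corpus.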