
Module 7

Natural Language
Processing
Natural Language Processing (NLP)

NLP is a field of computer science and artificial intelligence concerned with enabling computers to understand and manipulate human language. It bridges the gap between human communication and machine code, allowing computers to process information in the way we naturally use language.
Applications of NLP

NLP has a vast range of applications that are woven into our daily lives:

Machine Translation: Breaking down language barriers by translating text or speech from one language
to another [e.g., Google Translate].

Smart Assistants: Responding to voice commands and questions in a natural way [e.g., Siri, Alexa,
Google Assistant].

Chatbots: Providing customer service or information through automated chat conversations.


Cont...

• Sentiment Analysis: Extracting opinions and emotions from text data [e.g., social
media monitoring].
• Text Summarization: Condensing large amounts of text into key points.
• Autocorrect and Predictive Text: Suggesting corrections and completions as you
type.
• Spam Filtering: Identifying and blocking unwanted emails.
• Search Engines: Ranking search results based on relevance to your query.
Challenges in Processing Human Language

Human language is complex and nuanced, which presents several challenges for NLP:

Ambiguity: Words can have multiple meanings depending on context (e.g., "bat" can refer to a flying mammal or a piece of sports equipment).

Sarcasm and Irony: Computers struggle to understand the subtle cues that convey these forms of expression.
Cont...

• Slang and Informal Language: Keeping up with ever-evolving slang and informal language usage.
• Incomplete Sentences and Utterances: Human conversation often involves shortcuts and missing information that can be confusing for machines.

NLP researchers are constantly developing techniques to address these challenges and improve the accuracy and robustness of NLP systems.
Key NLP Tasks

Here's a glimpse into some fundamental NLP tasks that form the building blocks
for many applications:
• Tokenization: Breaking down text into smaller units like words, punctuation
marks, or phrases.
• Part-of-Speech (POS) tagging: Identifying the grammatical function of each
word in a sentence (e.g., noun, verb, adjective).
• Named Entity Recognition (NER): Recognizing and classifying named
entities in text, such as people, organizations, locations, dates, monetary
values, etc.
1. Tokenization:

Imagine you're dissecting a sentence. Tokenization is the first step, where you
break the sentence down into its individual building blocks. These blocks
can be:
• Words: "The", "quick", "brown", "fox"
• Punctuation marks: ".", ",", "?"
• Sometimes even phrases: "New York City" (depending on the application)
2. POS Tagging:

After you have your tokens, POS tagging assigns a grammatical role
(part-of-speech) to each one. Here's an example:

Sentence: "The quick brown fox jumps over the lazy dog."

POS Tags: The/Determiner, quick/Adjective, brown/Adjective, fox/Noun, jumps/Verb, over/Preposition, the/Determiner, lazy/Adjective, dog/Noun
3. Named Entity Recognition (NER):

This focuses on identifying and classifying specific entities within the tokens. Imagine
circling important names on a page. NER does something similar, recognizing entities like:
• People: "Albert Einstein"
• Organizations: "Google"
• Locations: "Paris"
• Dates: "July 4th, 2024"
• Monetary values: "$100"
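The sketch below shows how these three tasks might look in code using the spaCy library (mentioned later in this module); it assumes the small English model en_core_web_sm has been installed (python -m spacy download en_core_web_sm).

```python
# Tokenization, POS tagging and NER in one pass with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Albert Einstein visited Paris on July 4th, 2024 and paid $100 to Google.")

# Tokenization: each token is a word or punctuation mark
print([token.text for token in doc])

# POS tagging: grammatical role of each token
print([(token.text, token.pos_) for token in doc])

# NER: people, organizations, locations, dates, monetary values
print([(ent.text, ent.label_) for ent in doc.ents])
```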
Practical Examples

1. Search Engines:

Tokenization: When you search for "best restaurants NYC", the search engine
breaks it down into tokens like "best", "restaurants", "NYC".

POS Tagging: It can identify "best" as an adjective, "restaurants" as a noun, and


"NYC" as a proper noun (likely a location).

NER: This helps the search engine understand you're looking for highly-rated
restaurants in New York City and refines the search results accordingly.
2. Social Media Analysis:

Tokenization: Analyzing a tweet like "Feeling great after winning the game #GoTeam! #Champions".

POS Tagging: It can identify "Feeling" as a verb, "great" as an adjective, "winning" as a verb (participle), "game" as a noun, and hashtags as proper nouns.

NER: This might not be relevant here, but NER could be used to identify the team mentioned in the hashtags for further analysis.
3. Spam Filtering:
Tokenization: Breaking down a spam email with subject line "Free $$$ for you!".

POS Tagging: It can identify "Free" as an adjective, "$$$" as symbols, and "you" as a pronoun.

NER: This might not play much of a role here, but tokenization and POS tagging help identify the generic and promotional nature of the email, potentially flagging it as spam.
4. Machine Translation:

• Tokenization: Breaking down a sentence in one language (e.g., Spanish) into individual words.
• POS Tagging: Identifying the grammatical function of each word to understand the sentence structure.
• NER: Recognizing named entities to ensure accurate translation within the context.
• These tasks work together for the translation engine to understand the original sentence's meaning and produce a grammatically correct and meaningful translation in the target language.
Text Cleaning and Normalization for NLP

• Text data often comes in a raw and messy format. It can contain inconsistencies,
irrelevant information, and variations in how words are written.
• Cleaning and normalization are crucial steps in NLP to prepare the text for further
processing. Here's a breakdown of some common techniques:
1. Removing Stopwords:
Stopwords are very common words that carry little meaning on their own (e.g., "the", "a", "is"). Removing them can improve processing efficiency and focus the analysis on more content-rich words.

2. Removing Special Characters:
• Punctuation marks, symbols, and emojis can add noise to the data.
• Depending on the task, you might choose to remove them entirely or convert them to a standard format.

3. Lowercasing/Uppercasing:
Text data can be written in different cases (uppercase, lowercase). Converting everything to lowercase or uppercase ensures consistency and simplifies further processing.
4. Normalizing Text:

This can involve:


• Expanding Abbreviations: Converting abbreviations to their full forms
(e.g., "e.g." to "for example").
• Handling Emojis: Converting emojis to text descriptions or removing them
altogether.
• Handling Numbers: Converting numbers to text (e.g., "2023" to "two
thousand twenty-three") or leaving them as numerals depending on the task.
5. Lemmatization vs. Stemming:

These techniques aim to reduce words to their base forms. However, they have subtle
differences:

Lemmatization: This process tries to convert a word to its dictionary form (lemma),
considering its grammatical role in the sentence (e.g., "running" becomes "run", "better"
becomes "good"). It requires a morphological analysis of the word.

Stemming: This process chops off suffixes to arrive at a base form (stem) that might not
always be a real word (e.g., "running" becomes "run", "better" becomes "bet"). It's a
simpler and faster approach but can sometimes lead to incorrect base forms.
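The difference can be seen in a small NLTK sketch (assuming the WordNet corpus has been downloaded via nltk.download("wordnet")); exact outputs depend on the stemmer and lemmatizer used.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops suffixes; lemmatization uses the dictionary form and the word's POS
print(stemmer.stem("running"), lemmatizer.lemmatize("running", pos="v"))  # e.g. run / run
print(stemmer.stem("studies"), lemmatizer.lemmatize("studies", pos="v"))  # e.g. studi / study
print(stemmer.stem("better"),  lemmatizer.lemmatize("better",  pos="a"))  # lemma of "better" is "good"
```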
Cont...

The choice between lemmatization and stemming depends on your specific application. Lemmatization is generally preferred for tasks where preserving meaning and grammatical accuracy is crucial. Stemming can be faster and sufficient for simpler tasks where the exact meaning of the base form isn't critical.
Additional Considerations

• Text Normalization Libraries: Libraries like NLTK (Python) and spaCy (Python) offer functionalities for many of these text cleaning and normalization tasks.
• Context-Specific Normalization: The specific techniques you apply might vary depending on your NLP task and the nature of your text data.
• Trade-offs: There can be trade-offs between cleaning too aggressively and losing information, and cleaning too lightly and leaving noise in the data. Finding the right balance depends on your specific needs.
Some examples
1. Social Media Sentiment Analysis:
Imagine analyzing tweets to understand public sentiment towards a new
product launch. You'd want to clean the text by:
• Removing stopwords: Words like "a", "the", "is" don't contribute much to
sentiment.
• Removing special characters: Emojis, hashtags, and punctuation can be
removed or converted for consistency.
• Lowercasing: Case variations shouldn't affect sentiment analysis.
• Normalizing slang and abbreviations: "OMG" could be converted to "oh
my god" for better understanding.
2. Web Scraping and Text Summarization:
You might scrape news articles to summarize the main points.
Here, cleaning involves:
Removing HTML tags and code: Irrelevant for textual content.

Removing stopwords: Focus on the core information.

Normalizing text: Standardize dates, locations, etc.


3. Chatbot Development:
When building a chatbot, you need to understand user queries effectively.
Cleaning involves:

Correcting typos and misspellings: Users might make mistakes while typing.

Removing irrelevant information: Greetings, salutations, and emojis might not be crucial for understanding the intent.

Normalization: Standardize formats for dates, times, and measurements.


4. Machine Translation:
Machine translation systems need clean and normalized text for accurate translation.
Cleaning involves:
Removing special characters: Symbols and emojis might not translate well.

Handling named entities: Proper names (people, locations) should be preserved.

Normalization: Standardize date and time formats across languages.


5. Text Classification:
Classifying emails as spam or not-spam requires cleaned text.
Cleaning involves:
Removing email headers and footers: Irrelevant for
classification.
Removing URLs and attachments: Not useful for content
analysis.
Normalization: Standardize greetings and salutations.
1. Bag-of-Words (BoW) Model:

Concept: BoW is a simple way to represent documents as numerical vectors.

Process:
• Each document is treated as a "bag" of words, ignoring the order and grammar of the words.
• A vocabulary of unique words is created across all documents in the corpus.
• Each document is represented by a vector where each element corresponds to a word in the vocabulary.
• The value of each element indicates the frequency (count) of the corresponding word appearing in that document.

Example:
Document 1: "The cat sat on the mat." Document 2: "The dog chased the cat."
Vocabulary: {the, cat, sat, on, mat, dog, chased}
Document 1 vector: [2, 1, 1, 1, 1, 0, 0] (2 occurrences of "the", etc.) Document 2 vector: [2, 1, 0, 0, 0, 1, 1]
Limitations:

Ignores word order and context.

Doesn't capture the relationships between words.

Can be sensitive to high-frequency stopwords.
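A minimal sketch of the BoW representation is shown below using scikit-learn's CountVectorizer (an assumed tooling choice; any word-counting approach works the same way).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat.", "The dog chased the cat."]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(counts.toarray())                    # one count vector per document
```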


2. Term Frequency-Inverse Document Frequency (TF-IDF):

Concept: TF-IDF builds upon BoW but considers the importance of words within a document and across the entire corpus.

Process:
• TF (Term Frequency) for a word in a document is calculated as its count divided by the total number of words in that document.
• IDF (Inverse Document Frequency) for a word is calculated as the logarithm of the total number of documents in the corpus divided by the number of documents containing that word. High IDF means the word is less frequent across documents and potentially more informative.
• The TF-IDF weight for a word is then calculated by multiplying TF and IDF.
Benefits:
• Gives more weight to important words (rare but informative).
• Reduces the impact of stopwords.
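Below is a minimal TF-IDF sketch using scikit-learn's TfidfVectorizer; note that its exact weighting formula adds smoothing and normalization on top of the textbook definition above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The cat sat on the mat.", "The dog chased the cat."]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))  # higher weight = rarer, more informative word
```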
3. Word Embeddings and Distributed
Representations (Word2Vec, GloVe):
Concept: Word embeddings map words to numerical vectors, capturing semantic relationships between words. Similar
words will have similar vector representations in high-dimensional space.

Techniques:

Word2Vec: Two popular architectures are Skip-gram and CBOW. They predict surrounding words based on a given
word (Skip-gram) or vice versa (CBOW). Words used for prediction and the target word become closer in the vector
space.

GloVe: Analyzes word co-occurrence statistics from a large corpus to learn word vectors. Words that frequently co-occur
are positioned closer in the vector space.
Benefits:
• Captures semantic relationships between words.
• Enables tasks like word similarity detection and analogy completion.
• Can be used as input features for various NLP models.
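A minimal Word2Vec sketch using Gensim is shown below; a real model needs a far larger corpus than these toy sentences to learn meaningful vectors.

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects the Skip-gram architecture (sg=0 would be CBOW)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"][:5])                   # first few dimensions of the vector for "cat"
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours in the vector space
```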
4. Language Models and Pre-trained
Transformers:
Concept: Language models are statistical methods that predict the next word in a sequence based on the preceding
words. Pre-trained transformers are powerful language models trained on massive amounts of text data.

Techniques:

Traditional Language Models (e.g., n-grams): Predict the next word based on the n preceding words (e.g., bigrams,
trigrams).

Pre-trained Transformers (e.g., BERT, GPT-3): These are complex neural network architectures trained on massive text
corpora. They learn contextual representations of words and can be fine-tuned for various NLP tasks like text
classification, question answering, and summarization.
Benefits:
• Can handle complex relationships between words in a sentence.
• Achieve state-of-the-art performance on many NLP tasks.
• Offer flexibility for fine-tuning to specific domains.
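As a small illustration, the sketch below uses the Hugging Face transformers pipeline to do masked-word prediction with a BERT-style model; the model is downloaded on first run, and the exact predictions depend on the model chosen.

```python
from transformers import pipeline

# Masked-word prediction: the model fills in [MASK] using context
fill = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill("NLP bridges human language and [MASK] code."):
    print(pred["token_str"], round(pred["score"], 3))
```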
Here's an analogy:

• BoW and TF-IDF are like simple indexes in a library, listing all the words in each book (document).
• Word embeddings are like advanced search features that consider synonyms and related terms.
• Language models and pre-trained transformers are like highly knowledgeable librarians who can not only find relevant information but also understand the context and relationships between them.
Understanding Sentiment Analysis

Sentiment analysis, also known as opinion mining, is the process of computationally identifying and classifying the emotional tone behind a
piece of text. It aims to understand whether the sentiment expressed is positive, negative, or neutral.

Here's a breakdown of the concept:

Applications:
• Social media monitoring: Analyze public opinion towards brands, products, or events.
• Customer reviews: Understand customer satisfaction and identify areas for improvement.
• Market research: Gauge audience sentiment towards specific topics or products.
• Spam filtering: Identify and filter out spam emails with negative or promotional tones.
Techniques:
• Lexicon-based approach: Uses pre-defined dictionaries of words with positive, negative, and neutral sentiment scores. The overall sentiment is calculated based on the sentiment scores of the words in the text.
• Machine learning: Trains models on labeled data (text with known sentiment) to automatically classify new text. Popular algorithms include Naive Bayes, Support Vector Machines (SVM), and Logistic Regression.
• Deep learning: Utilizes neural networks like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks to capture complex relationships between words and improve sentiment classification accuracy.
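As a small illustration of the lexicon-based approach, the sketch below uses NLTK's VADER analyzer (assuming the vader_lexicon resource has been downloaded via nltk.download("vader_lexicon")).

```python
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

print(sia.polarity_scores("I absolutely love this product!"))
print(sia.polarity_scores("The service was slow and disappointing."))
# 'compound' ranges from -1 (most negative) to +1 (most positive)
```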
Building Sentiment Analysis Models

1. Data Preparation:

Collect a dataset of text samples with labeled sentiment (positive, negative, or neutral).

Preprocess the text by cleaning it (removing noise, punctuation, stop words) and
potentially normalizing it (lowercasing, stemming/lemmatization).
2. Feature Engineering:

For machine learning models, create features that represent the text. This could involve:
• Bag-of-Words (BoW): Represent the text as a vector where each element indicates the frequency of a word in the vocabulary.
• TF-IDF: Assigns weights to words based on their importance within the document and across the corpus.
• Word Embeddings: Represent words as numerical vectors capturing semantic relationships.
3. Model Training:
• Choose a suitable machine learning or deep learning algorithm for sentiment classification.
• Train the model on your labeled data.
• Evaluate the model's performance on a separate test dataset.

4. Evaluation:
• Use metrics like accuracy, precision, recall, and F1-score to assess the model's performance.
• Fine-tune the model or explore different algorithms if performance is not satisfactory.
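A compact end-to-end sketch of steps 2-4 with scikit-learn is shown below: TF-IDF features plus Logistic Regression, trained and evaluated on a tiny made-up dataset (illustrative only; real models need far more labeled data).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset of labeled sentiment
texts = ["I love it", "Great quality", "Terrible service", "I hate this",
         "Absolutely wonderful", "Worst purchase ever", "Very happy", "So disappointing"]
labels = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=0)

# Feature engineering (TF-IDF) + classifier in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Evaluation: precision, recall, F1-score on the held-out test set
print(classification_report(y_test, model.predict(X_test)))
```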
Interpreting Sentiment Analysis Results

Sentiment analysis models assign a sentiment score or class (positive, negative, neutral) to a piece of text.

It's crucial to understand the limitations:
• Models might misclassify sarcasm, irony, or complex emotions.
• Contextual information beyond the text itself might be needed for accurate interpretation.

Cont...

Use the results as an indicator of overall sentiment but don't rely solely on them for drawing definitive conclusions. Analyze the data with a critical eye and consider the context in which the text was written.
Theoretical Explanation

Sentiment analysis builds upon the field of Natural Language Processing (NLP) and leverages various techniques from machine learning and deep learning:
• Linguistics: Sentiment analysis relies on understanding the emotional connotation of words and phrases.
• Machine Learning: Algorithms learn patterns from labeled data to classify new text samples.
• Deep Learning: Deep neural networks can capture complex relationships between words and context, improving classification accuracy.
1. Introduction to Topic Modeling and Latent Dirichlet Allocation (LDA)

Topic modeling is a statistical method for uncovering hidden thematic structures within a collection of documents. It aims to identify groups of words (topics) that frequently appear together and describe the main subjects discussed in the corpus.

Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm. Here's the basic idea:
• Each document is assumed to be a mixture of various topics in different proportions.
• Each topic is represented by a probability distribution over words in the vocabulary.

Cont...

LDA analyzes the documents in a corpus and tries to discover these underlying topics and their distribution across documents.
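A minimal LDA sketch with Gensim on a toy corpus is shown below; real topic models need many more, and much longer, documents.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus: already tokenized documents
docs = [["cat", "dog", "pet", "food"],
        ["football", "game", "team", "score"],
        ["dog", "pet", "vet", "food"],
        ["team", "player", "game", "win"]]

dictionary = corpora.Dictionary(docs)                # vocabulary
corpus = [dictionary.doc2bow(doc) for doc in docs]   # bag-of-words per document

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

for topic_id, words in lda.print_topics():
    print(topic_id, words)  # each topic is a weighted mixture of words
```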
3. Evaluating Topic Models and Selecting the Optimal Number of Topics

There's no single "best" number of topics for LDA. Here are some approaches to guide your selection:
• Perplexity: LDA calculates perplexity, a measure of how well the model fits unseen data. Lower perplexity often indicates a better fit. However, it can be sensitive to model parameters.
• Topic Coherence: Evaluate how well the words within a topic are semantically related. Various metrics like coherence score (CoherenceModel in Gensim) can help assess this.
• Domain Knowledge: Consider your understanding of the domain and the expected number of relevant themes within the documents.
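The sketch below illustrates comparing coherence scores for different topic counts with Gensim's CoherenceModel on the same kind of toy corpus (illustrative only; scores on such a small corpus are not meaningful).

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

docs = [["cat", "dog", "pet", "food"],
        ["football", "game", "team", "score"],
        ["dog", "pet", "vet", "food"],
        ["team", "player", "game", "win"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Fit LDA with different numbers of topics and compare coherence
for k in (2, 3, 4):
    lda_k = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=10, random_state=0)
    cm = CoherenceModel(model=lda_k, texts=docs, dictionary=dictionary, coherence="c_v")
    print(k, round(cm.get_coherence(), 3))  # higher coherence suggests more interpretable topics
```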
4. Introduction to Text Generation Techniques

Text generation aims to create coherent and realistic sequences of words, similar to
human-written text. Here are two common approaches:

1. Markov Chains:

A Markov chain is a statistical model that predicts the next word based on the
probability of it appearing after a specific sequence of preceding words (n-grams).

Simple and computationally efficient, but generated text can be repetitive and lack
long-range coherence.
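A tiny bigram Markov chain sketch is shown below: it learns which words follow which in a sample text, then samples a short sequence from those observed transitions.

```python
import random
from collections import defaultdict

text = "the cat sat on the mat the cat chased the dog the dog sat on the rug"
words = text.split()

# Build a table mapping each word to the words observed to follow it
transitions = defaultdict(list)
for current, nxt in zip(words, words[1:]):
    transitions[current].append(nxt)

random.seed(0)
word = "the"
output = [word]
for _ in range(10):
    candidates = transitions.get(word)
    if not candidates:               # dead end: no observed successor
        break
    word = random.choice(candidates)  # sample the next word
    output.append(word)

print(" ".join(output))
```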
2. Recurrent Neural Networks (RNNs):

RNNs are a type of neural network architecture specifically designed for sequential data like text.

They can learn complex relationships between words across longer sequences, leading to more sophisticated and grammatically correct text generation.

However, training RNNs often requires large datasets and significant computational resources.
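For illustration only, the sketch below outlines an RNN-based next-word predictor in PyTorch (an assumed framework choice): an embedding layer feeds an LSTM that produces a probability distribution over the vocabulary at each position. It is untrained here, so its prediction is essentially random until fitted on a corpus.

```python
import torch
import torch.nn as nn

# Tiny illustrative vocabulary
vocab = ["<pad>", "the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

class NextWordRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        hidden, _ = self.rnn(x)     # (batch, seq_len, hidden_dim)
        return self.out(hidden)     # logits over the vocabulary at each position

model = NextWordRNN(len(vocab))
tokens = torch.tensor([[word_to_id[w] for w in ["the", "cat", "sat"]]])
logits = model(tokens)

# Prediction for the word after "sat" (untrained, so effectively a random guess)
print(vocab[logits[0, -1].argmax().item()])
```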
