
Madras Institute of Technology 12-04-2024

Artificial Intelligence - AZ5401

PRESENTATION ASSIGNMENT
Mastering Natural Language Processing: From Basics to Brilliance

By,
Srilakshmi H
2022510053
B. Tech Artificial Intelligence and Data Science
Topics:
Introduction to Natural Language Processing
Text Preprocessing Techniques
Tokenization and Text Normalization
Text Representation Methods
Introduction to NLP Tasks
Sentiment Analysis
Text Classification
Named Entity Recognition (NER)
Introduction to NLP Libraries: NLTK and spaCy
Building NLP Models with scikit-learn
Other Topics
Introduction to NLP
NLP is a field of artificial intelligence focused on enabling computers to understand, interpret, and generate human language in a manner that is both meaningful and contextually relevant.

It involves the development of algorithms and models that make this understanding, interpretation, and generation both possible and useful in practice.

The scope of NLP is vast and includes various tasks such as Text classification and
categorization, Sentiment analysis, Named entity recognition, Machine translation,
Speech recognition, Question answering, Text summarization, Language generation,
Dialogue systems, Language modeling, etc.

NLP plays a crucial role in enhancing human-computer interaction, improving information access, automating content creation, and driving innovation in various domains.
Importance & Applications of NLP
Enhanced information access and retrieval: extract meaningful insights from vast
amounts of unstructured text data.

Automated content generation: generate human-like text, useful for summarizing large documents, writing product descriptions, etc.

Improved customer service: NLP-powered chatbots and virtual assistants.

Language translation: translation of text from one language to another while preserving its
meaning.

Healthcare applications: analyze medical records, extract patient information, and assist
in diagnosis and treatment planning.

Financial analysis: analyze financial reports, identify market trends, and detect fraud.

Educational support: assist in grading essays, providing personalized learning experiences, and translating educational materials.
Challenges in NLP
Ambiguity: Words and phrases can have multiple meanings, making it challenging to
determine the intended interpretation.

Context Dependency: The meaning of language elements can change based on the
context in which they are used, requiring accurate contextual understanding.

Syntax and Grammar: Parsing complex sentence structures and grammar rules is
essential for accurate language comprehension.

Anaphora Resolution: Resolving references to previously mentioned entities in a text (anaphora) can be difficult without proper tracking and linking mechanisms.

Negation and Uncertainty: Understanding negation, uncertainty, and modalities in language is crucial for interpreting the intended meaning accurately.
Named Entity Recognition: Identifying and categorizing named entities accurately,
such as names, locations, and organizations, can be challenging in noisy or
unstructured text.

Domain Specificity: NLU systems may struggle with understanding language specific
to certain domains, requiring specialized knowledge for effective interpretation.

Data Sparsity: NLU models require large amounts of annotated data for training, and
data sparsity can hinder model performance, particularly in languages with limited
resources.

Ethical and Bias Concerns: Ensuring fairness, transparency, and bias mitigation in
NLU systems is crucial to prevent unintended consequences like discrimination or
misinformation propagation.

Figurative Language: Dealing with metaphors, idioms, sarcasm, and other figurative
language forms requires a deep understanding of cultural and contextual cues.
Text Preprocessing Techniques
Text preprocessing in NLP refers to the process of cleaning and preparing text data before it is fed into a machine learning or natural language processing model. The most common text preprocessing techniques are listed below (a minimal code sketch follows the list):

Lowercasing: Converting all text to lowercase to ensure consistency in the data.

Tokenization: Breaking text into individual words or tokens. E.g., the sentence "I love NLP" would be tokenized into ["I", "love", "NLP"].

Removing punctuation and special characters: Eliminating non-alphabetic characters that do not contribute to the meaning of the text.

Removing stop words: Eliminating common words (e.g., "and", "the", "is") that do not
carry much information.

Lemmatization or stemming: Reducing words to their base form to normalize variations (e.g., "running" to "run").
Removing numbers: This step involves eliminating numerical digits from the text as
they may not be relevant for certain NLP tasks.

Removing HTML tags: In text data extracted from web pages, HTML tags are often
present. Removing these tags ensures that only the text content is considered for
analysis.

Handling contractions: Expanding contractions involves converting words like "can't" to their full form, such as "cannot".

Removing URLs and email addresses: Text data may contain URLs and email
addresses that are not relevant for analysis. Removing them helps clean the text.

Removing whitespace and extra spaces: Cleaning up extra spaces and whitespace in
the text ensures consistency and readability in the data.
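A minimal sketch of these steps in Python using NLTK and the re module; it assumes the NLTK data packages can be downloaded (newer NLTK releases may also need punkt_tab), and the example sentence is illustrative:

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads; quiet no-ops if the data is already present
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def preprocess(text):
    text = text.lower()                                  # lowercasing
    text = re.sub(r"https?://\S+|\S+@\S+", " ", text)    # strip URLs and emails
    text = re.sub(r"<[^>]+>", " ", text)                 # strip HTML tags
    text = re.sub(r"\d+", " ", text)                     # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = nltk.word_tokenize(text)                    # tokenization
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]       # stop-word removal
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

print(preprocess("The 2 cats are running at https://example.com!"))
# -> ['cat', 'running'] (the lemmatizer defaults to noun POS, so 'running' stays)
```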
Tokenization & Text Normalization
Tokenization is a fundamental step in Natural Language Processing (NLP) that involves
breaking down text into smaller units called tokens. Text normalization focuses on
standardizing and transforming text data to ensure consistency and reduce redundancy.
It’s an essential step in NLP tasks such as text analysis, text mining, and machine learning.

Types:
Word Tokenization: Divides text into individual words. Eg., "Hello, how are you?"
would be tokenized into ["Hello", ",", "how", "are", "you", "?"].

Sentence Tokenization: Splits text into individual sentences. Eg., "He likes apples. She
likes oranges." would be tokenized into ["He likes apples.", "She likes oranges."].

Subword Tokenization: Breaks words into smaller units, often useful for handling
languages with complex morphology or for creating word embeddings.

Character Tokenization: Splits text into individual characters. Eg., "Hello" would be
tokenized into ["H", "e", "l", "l", "o"].
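For instance, NLTK's tokenizers reproduce the word and sentence examples above, and character tokenization falls out of Python itself:

```python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt", quiet=True)

print(sent_tokenize("He likes apples. She likes oranges."))
# ['He likes apples.', 'She likes oranges.']
print(word_tokenize("Hello, how are you?"))
# ['Hello', ',', 'how', 'are', 'you', '?']
print(list("Hello"))   # character tokenization: ['H', 'e', 'l', 'l', 'o']
```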
Methods:
Rule-based: Relies on predefined rules to determine how to split text into tokens. Eg., a
rule might specify that words are separated by whitespace or punctuation marks.

Statistical: Statistical models are used to determine token boundaries based on probabilities. These models analyze patterns in the text to decide where to split it into tokens.

Dictionary-based: Words are matched against a predefined dictionary to identify tokens. If a word is found in the dictionary, it is treated as a token; otherwise, it may be split or handled differently.

Hybrid: Combine multiple techniques to achieve more accurate tokenization. Eg., a hybrid
approach might use both rule-based and statistical methods to handle different aspects of
the text or to adapt to different languages or domains.

Neural: It’s a method that uses neural networks to segment text into meaningful units,
learning patterns directly from data for adaptability to different languages, domains, and
text styles.
Techniques:

Whitespace Tokenization: This technique involves splitting text based on whitespace characters such as spaces, tabs, and line breaks. It is a simple method but may not work well for all languages or text formats.

Punctuation-based Tokenization: Text can be tokenized based on punctuation marks such as periods, commas, and exclamation points. This technique can help separate sentences or phrases in the text.

Regular Expression Tokenization: Regular expressions can be used to define patterns for
tokenizing text. Eg., a regular expression can be used to split text based on specific
character sequences or patterns.

Language-specific Tokenization: Some languages have specific rules for tokenization. Eg.,
in languages like Chinese or Japanese, words are not separated by spaces, so special
techniques are needed to tokenize text in these languages.
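A short illustration of the whitespace and regular-expression techniques using NLTK's tokenizer classes (the sample sentence is made up):

```python
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer

text = "Mr. O'Brien paid $3.50, didn't he?"

# Whitespace tokenization: simple, but punctuation stays attached to words
print(WhitespaceTokenizer().tokenize(text))

# Regular-expression tokenization: here, keep runs of word characters only
print(RegexpTokenizer(r"\w+").tokenize(text))
```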
Using Machine Learning: Machine learning models can be trained to automatically tokenize text based on patterns in the data. This can be useful for complex tokenization tasks or for languages with unique tokenization rules.

Subword Tokenization: Instead of splitting text into words, it breaks down text into smaller units such as subword units or characters. This is useful for handling out-of-vocabulary words and improving the efficiency of language models.

Byte Pair Encoding (BPE): BPE is a subword tokenization technique that iteratively merges the most frequent pairs of characters in a corpus to create a vocabulary of subword units. This method is commonly used in transformer-based models like BERT.

Functions:
word_tokenize(): A function in libraries like NLTK and spaCy to tokenize text into words.
sent_tokenize(): A function to tokenize text into sentences.
char_tokenize(): A function to tokenize text into characters.
Tokenization APIs provided by various NLP libraries and tools.
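To make the BPE idea concrete, here is a toy sketch of the merge loop in plain Python; the three-word corpus and the number of merges are arbitrary, and production tokenizers use optimized implementations:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: each word is a tuple of characters with its frequency
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
for _ in range(3):                       # three merge iterations
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words))
```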
Text Representation Methods
Text representation methods involve converting text data into a format that can be
understood and processed by machine learning algorithms.

These text representation methods play a crucial role in various NLP tasks such as sentiment analysis, text classification, and machine translation by enabling algorithms to understand and process textual data effectively. The main methods are:

Bag of words (BoW)

N-gram

TF-IDF

Word embedding

Sentence embedding

Document embedding
Bag of words (BoW)
Represents text as a collection of unique words or tokens, ignoring grammar and word
order. Each word is assigned a numerical value based on its frequency in the document.

Example: Consider three sentences:
a. The cat in the hat
b. The dog in the home
c. The bird in the sky

Text                  dog  cat  bird  in  home  sky  the  hat
The cat in the hat     0    1    0    1    0    0    2    1
The dog in the home    1    0    0    1    1    0    2    0
The bird in the sky    0    0    1    1    0    1    2    0
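The same table can be reproduced with scikit-learn's CountVectorizer (note it sorts columns alphabetically, so the column order differs from the table above):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat in the hat", "The dog in the home", "The bird in the sky"]
vectorizer = CountVectorizer()          # lowercases and tokenizes by default
X = vectorizer.fit_transform(docs)      # sparse document-term count matrix

print(vectorizer.get_feature_names_out())
# ['bird' 'cat' 'dog' 'hat' 'home' 'in' 'sky' 'the']
print(X.toarray())                      # same counts as the table, columns reordered
```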


N-gram

An N-gram is a traditional text representation technique that involves breaking down the text into contiguous sequences of n words. A uni-gram gives all the words in a sentence, a bi-gram gives sets of two consecutive words, a tri-gram gives sets of three consecutive words, and so on.
Eg.: The dog in the house
Uni-gram: "The", "dog", "in", "the", "house"
Bi-gram: "The dog", "dog in", "in the", "the house"
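The same vectorizer can emit n-grams directly; a sketch extracting unigrams and bigrams from the example sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams together
vec.fit(["The dog in the house"])
print(vec.get_feature_names_out())
# ['dog' 'dog in' 'house' 'in' 'in the' 'the' 'the dog' 'the house']
```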

Word embedding

It represents each word as a dense vector of real numbers, such that similar or closely related words lie near each other in the vector space.
This is achieved by training a neural network model on a large corpus of text, where each word is represented as a unique input and the network learns to predict the surrounding words in the text. This is how the semantic meaning of the word is captured.
The dimensionality of these vectors can range from a few hundred (GloVe, Word2Vec) to thousands (large language models).
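A minimal Word2Vec sketch with gensim; the toy corpus is far too small for meaningful vectors and is only meant to show the API shape:

```python
from gensim.models import Word2Vec

# Toy corpus of pre-tokenized sentences; real training needs a large corpus
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"],
             ["cats", "and", "dogs", "are", "pets"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv["cat"].shape)                  # (50,) dense vector for 'cat'
print(model.wv.most_similar("cat", topn=2))   # nearest neighbours in vector space
```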
Sentence Embedding

It is similar to word embedding, except that a whole sentence, rather than a single word, is represented as a numerical vector in a high-dimensional space.

The goal of sentence embedding is to capture the meaning and semantic relationships between words in a sentence, as well as the context in which the sentence is used.
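One simple baseline, sketched below, is to mean-pool the word vectors of a sentence; dedicated sentence encoders learn better representations. Here `model` is assumed to be the Word2Vec model from the previous sketch:

```python
import numpy as np

def sentence_embedding(tokens, wv):
    """Mean-pool word vectors; words missing from the vocabulary are skipped."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

emb = sentence_embedding(["the", "cat", "sat"], model.wv)
print(emb.shape)   # (50,): one vector for the whole sentence
```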

Document embedding
Document embedding refers to the process of representing an entire document,
such as a paragraph, article, or book, as a single vector.

It captures not only the meaning and context of individual sentences but also the
relationships and coherence between sentences within the document.
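gensim's Doc2Vec is one way to learn such document vectors directly; a minimal sketch with a hypothetical two-document corpus:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["the", "cat", "sat"], tags=["doc0"]),
          TaggedDocument(words=["the", "dog", "ran"], tags=["doc1"])]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)
print(model.infer_vector(["a", "cat", "ran"]).shape)  # (50,) vector for a new document
```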
Term Frequency-Inverse Document Frequency
TF-IDF stands for Term Frequency-Inverse Document Frequency. It improves on BoW because it captures the importance of a word in a document. The idea is to weight words based on how often they appear in a document (the term frequency) and how common they are across all documents (the inverse document frequency).

The formula for calculating the TF-IDF score of a word in a document is:

TF-IDF = (term frequency in document) × log(total no. of documents / no. of documents containing the term)

Text                     bird    cat     flying  in      jumped  sky     the     tiger
The cat jumped           0       0.5844  0       0       0.5844  0       0.3452  0
The white tiger roared   0       0       0       0       0       0       0.3227  0.5464
Bird flying in the sky   0.5046  0       0.5046  0.3838  0       0.5046  0.2980  0
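scikit-learn's TfidfVectorizer computes these scores; note it uses a smoothed IDF, L2-normalises each row, and keeps every word (including ones the table above omits), so its numbers will differ slightly from the hand-computed table:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The cat jumped", "The white tiger roared", "Bird flying in the sky"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)             # rows are L2-normalised TF-IDF vectors

print(vec.get_feature_names_out())      # alphabetical vocabulary
print(X.toarray().round(4))
```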


Introduction to NLP Tasks
NLP encompasses a wide range of tasks aimed at enabling computers to understand,
interpret, and generate human language. Understanding these tasks is essential for
building NLP applications and solving real-world problems involving text data.

The tasks include,


Sentiment Analysis: Determining the sentiment or opinion expressed in text data.
Text Classification: Categorizing text into predefined classes or categories.
Named Entity Recognition: Identifying and classifying named entities such as people,
organizations, and locations in text.
Machine Translation: Automatically translating text from one language to another.
Question Answering: Generating answers to questions posed in natural language.
Text Summarization: Generating concise summaries of longer text documents.
Language Generation: Generating natural language text, including chatbots and text
completion.
Sentiment Analysis
It is a branch of NLP that focuses on understanding and extracting the sentiment or
opinion expressed in a piece of text. Its main goal is to determine whether the sentiment
conveyed by the text is positive, negative, or neutral.
This process involves analyzing the text to identify subjective information, such as emotions, attitudes, or opinions, expressed by the author.

It can be performed at various levels:
Document level (analyzing the sentiment of an entire document)
Sentence level (analyzing the sentiment of individual sentences)
Aspect level (analyzing the sentiment towards specific aspects or entities mentioned in the text)

Semantic Analysis?
Semantic analysis, also known as semantic parsing or semantic processing, is a crucial component in natural language processing (NLP) that aims to understand the meaning of words, phrases, sentences, or even entire documents in a computational manner.

Applications: Social media monitoring, customer feedback analysis, brand reputation management, market research & customer service automation.
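As a quick illustration, NLTK ships the rule-based VADER analyzer for sentence-level sentiment (the exact scores depend on the lexicon version, so the values shown are indicative):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

# 'compound' summarises polarity: > 0 positive, < 0 negative, near 0 neutral
print(sia.polarity_scores("I absolutely love this phone!"))
print(sia.polarity_scores("The battery life is terrible."))
```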
Text Classification
What? - Involves the categorization of text documents into predefined categories or
classes.

Goal - To automatically assign a label or category to a piece of text based on its content.

Where? - Spam detection, sentiment analysis, topic categorization, etc.

Process?
Data Preparation → Feature Extraction → Model Selection → Model Training → Model Evaluation → Hyperparameter Tuning → Model Deployment
Data Preparation: Collect and preprocess the text data by tokenizing and performing
other text cleaning techniques.
Feature Extraction: Convert the text data into numerical or vector representations that machine learning algorithms can understand. Common choices are bag-of-words, TF-IDF, and word embeddings like Word2Vec or GloVe.
Model Selection: Choose an appropriate model. Popular algorithms include Naive Bayes, SVM, Logistic Regression, and deep learning models such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs).
Model Training: Train the selected model using labeled examples (training data).
Model Evaluation: Assess the performance of the trained model using metrics such as
accuracy, precision, recall, and F1-score on a separate dataset (test data) to ensure its
effectiveness in classifying unseen text.
Hyperparameter Tuning: Fine-tune the model's hyperparameters (e.g., learning rate,
regularization strength) to optimize its performance. Involves techniques like grid search
or random search.
Model Deployment: Deploy the trained model to classify new, unseen text data in real-world applications.
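The whole process can be sketched end to end with scikit-learn; the six-message spam/ham dataset below is hypothetical and far too small for real use:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy dataset; real work needs far more labelled examples
texts = ["win a free prize now", "meeting moved to 3pm", "claim your reward",
         "lunch tomorrow?", "free cash offer", "see you at the office"]
labels = ["spam", "ham", "spam", "ham", "spam", "ham"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42)

clf = Pipeline([("tfidf", TfidfVectorizer()),         # feature extraction
                ("model", LogisticRegression())])     # model selection
clf.fit(X_train, y_train)                             # model training
print(classification_report(y_test, clf.predict(X_test)))  # model evaluation
```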
Named Entity Recognition
NER is a vital NLP task that involves identifying and classifying named entities in text data
into predefined categories such as names of people, organizations, locations, dates, and
more.

NER finds applications in information extraction, question answering, and entity linking.

Types:
Person: Names of people
Organization: Names of companies, institutions, etc.
Location: Names of places
Date: References to dates
Time: References to time
Money: References to monetary values
Percent: References to percentages
NER plays a crucial role in extracting structured information from unstructured text data,
enabling a wide range of downstream NLP tasks and applications.
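A minimal NER sketch with spaCy's pre-trained pipeline (assumes the en_core_web_sm model has been downloaded; the sentence is invented):

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple opened a new office in Chennai on 12 April 2024 for $2 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. ORG, GPE, DATE, MONEY labels
```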
Approaches to NER:
Rule-based: Using handcrafted rules and patterns to identify named entities.
Statistical: Training machine learning models on labeled data to predict named entities.
Deep Learning: Utilizing neural networks, such as Recurrent Neural Networks (RNNs) or Transformers, to learn patterns in text for NER.

Challenges in NER:
Ambiguity: Entities with multiple possible categories or meanings.
Named Entity Variation: Entities with different forms or spellings.
Context Dependency: Entities that change categories based on context.
Named Entity Novelty: Entities that are not present in the training data.
Out-of-Domain Entities: Entities that are not covered in the predefined categories.

Evaluation metrics:
Precision: The ratio of correctly identified named entities to the total number of named entities identified.
Recall: The ratio of correctly identified named entities to the total number of named entities in the text.
F1-score: The harmonic mean of precision and recall, providing a balance between the two metrics.
Introduction to NLP Libraries: NLTK and spaCy
NLTK and spaCy are two popular NLP libraries widely used for text processing, analysis,
and modeling.
1. NLTK (Natural Language Toolkit):
NLTK is a comprehensive library for NLP tasks, developed using Python.
It offers a wide range of functionalities for tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, parsing, and named entity recognition.
NLTK provides access to various corpora, lexical resources, and pre-trained models for different NLP tasks.
It is a great resource for educational purposes and research in NLP due to its extensive documentation and tutorials.

2. spaCy:
Designed for efficiency and production-level applications.
Known for its speed and performance, making it suitable for processing large volumes of text.
spaCy provides pre-trained models for various NLP tasks like tokenization, part-of-speech tagging, NER, dependency parsing, and more.
It offers seamless integration with deep learning frameworks like TensorFlow and PyTorch for advanced NLP tasks.
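A side-by-side sketch of the two styles: NLTK exposes per-task functions, while spaCy runs one pipeline that annotates everything at once (assumes the NLTK data and the spaCy model are installed):

```python
import nltk
import spacy

nltk.download("punkt", quiet=True)

text = "NLTK and spaCy both tokenize this sentence."

# NLTK: call a standalone function for each task
print(nltk.word_tokenize(text))

# spaCy: load a pipeline once; one call yields tokens, tags, entities, parses
nlp = spacy.load("en_core_web_sm")   # assumes the model is installed
doc = nlp(text)
print([token.text for token in doc])
```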
Building NLP Models with scikit-learn
Scikit-learn is a versatile and user-friendly library that provides efficient tools for data
mining and data analysis, including support for various NLP tasks.

Text Vectorization: Process of converting text data into numerical form so that it
can be used in machine learning algorithms.
Feature Extraction: Involves extracting important and relevant features from text
data using techniques such as Bag-of-Words and TF-IDF.
Model Training: Process of training machine learning classifiers using labeled data,
so that they can learn to make predictions or classifications.
Model Evaluation: Assessing the performance of natural language processing (NLP)
models using evaluation metrics to determine how well they are performing.
Hyperparameter Tuning: Process of optimizing the performance of a model by
adjusting its parameters to find the best possible configuration.
Pipelines: Building end-to-end workflows for text processing and modeling, which
can include all of the above steps in a coordinated and automated manner.
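A sketch tying pipelines and hyperparameter tuning together with GridSearchCV; the toy spam/ham data is hypothetical and only shows the mechanics:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; real tuning needs a much larger labelled set
texts = ["win a free prize", "meeting at noon", "free cash now",
         "quarterly report attached", "claim your free reward", "agenda for monday"]
labels = ["spam", "ham", "spam", "ham", "spam", "ham"]

# Pipeline: vectorization and classification as one estimator
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("model", LogisticRegression(max_iter=1000))])

# Grid search over vectorizer and classifier settings jointly
params = {"tfidf__ngram_range": [(1, 1), (1, 2)],
          "model__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, params, cv=3, scoring="f1_macro")
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```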
Other Concepts in NLP
Language Models
Models that predict the probability of a sequence of words or characters in a
language.
N-gram models: Predict the likelihood of a word based on the previous 'n' words in
the sequence.
Transformer models: Such as BERT, GPT, and RoBERTa - use self-attention
mechanisms to capture relationships between words in a sequence.
RNNs: Process sequences of words iteratively, capturing dependencies between
words.
Long Short-Term Memory (LSTM) networks: A type of RNN that can capture long-range dependencies in a sequence of words.
Statistical language models: Such as Hidden Markov Models (HMMs) and
Conditional Random Fields (CRFs), which use probabilistic methods to model
sequences of words.
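A toy maximum-likelihood bigram model makes the n-gram idea concrete; the three-sentence corpus is invented, and real models need smoothing for unseen pairs:

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

# Count bigrams and the contexts (previous words) they condition on
bigrams, contexts = Counter(), Counter()
for sent in corpus:
    for a, b in zip(sent, sent[1:]):
        bigrams[(a, b)] += 1
        contexts[a] += 1

def p(word, prev):
    """Maximum-likelihood bigram probability P(word | prev)."""
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0

print(p("cat", "the"))   # 2/3: 'the' is followed by 'cat' in 2 of 3 cases
print(p("dog", "the"))   # 1/3
```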
Part-of-Speech (POS) Tagging
POS Tagging is the process of assigning grammatical categories, such as noun or verb,
to each word in a sentence.
This helps in understanding the structure and meaning of the sentence.
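With NLTK, for example (the tagger data package name can vary across NLTK versions):

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]
```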

Dependency Parsing

Dependency parsing is the process of analyzing the grammatical structure of sentences by identifying and understanding the connections and dependencies between individual words.
This helps in understanding how different words in a sentence relate to each other and contribute to the overall meaning of the sentence.
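A quick look at a dependency parse with spaCy (assumes the en_core_web_sm model is installed; the sentence is illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the model is installed
doc = nlp("She gave the book to her friend.")
for token in doc:
    # Each token points to its syntactic head with a dependency label
    print(f"{token.text:8} {token.dep_:8} head={token.head.text}")
```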

Coreference Resolution

Coreference resolution is the process of identifying and linking together words or phrases in a text that refer to the same entity or thing.
This helps in understanding the relationships between different parts of a text and can
improve overall comprehension.
Discourse Analysis
Discourse analysis involves examining text beyond individual sentences to understand
how they are connected and relate to each other.
This approach looks at the larger context in which language is used to uncover
underlying meanings and patterns within a conversation or written piece.

Knowledge Graphs: Representing knowledge in the form of graphs to capture relationships between entities.
Emotion Analysis: Identifying and analyzing emotions expressed in text.
Irony Detection: Detecting instances of irony or sarcasm in text.
Sarcasm Detection: Identifying instances of sarcasm in text.
Style Transfer: Transforming the style or tone of text while preserving its content.
Anaphora Resolution: Resolving references to previously mentioned entities in text.
Text Annotation: Adding metadata or labels to text data for training machine learning
models.
Neural Machine Translation (NMT): Machine translation using neural network models.
Cross-lingual Learning in NLP: Learning representations that are applicable across
multiple languages.
Thank You!
