NLP - Srilakshmi H - PPT Assignment
PRESENTATION ASSIGNMENT
Mastering Natural Language Processing: From Basics to Brilliance
By,
Srilakshmi H
2022510053
B. Tech Artificial Intelligence and Data Science
Contents
Introduction to Natural Language Processing
Sentiment Analysis
Text Classification
Other Topics
Introduction to NLP
NLP is a field of artificial intelligence focused on enabling computers to understand,
interpret, and generate human language in a manner that is both meaningful and
contextually relevant.
The scope of NLP is vast and includes various tasks such as text classification and categorization, sentiment analysis, named entity recognition, machine translation, speech recognition, question answering, text summarization, language generation, dialogue systems, and language modeling.
Applications of NLP
Automated content generation: generate human-like text, useful for summarizing large documents, writing product descriptions, etc.
Language translation: translation of text from one language to another while preserving its
meaning.
Healthcare applications: analyze medical records, extract patient information, and assist
in diagnosis and treatment planning.
Financial analysis: analyze financial reports, identify market trends, and detect fraud.
Challenges in NLP
Context Dependency: The meaning of language elements can change based on the context in which they are used, requiring accurate contextual understanding.
Syntax and Grammar: Parsing complex sentence structures and grammar rules is
essential for accurate language comprehension.
Domain Specificity: NLU systems may struggle with understanding language specific
to certain domains, requiring specialized knowledge for effective interpretation.
Data Sparsity: NLU models require large amounts of annotated data for training, and
data sparsity can hinder model performance, particularly in languages with limited
resources.
Ethical and Bias Concerns: Ensuring fairness, transparency, and bias mitigation in
NLU systems is crucial to prevent unintended consequences like discrimination or
misinformation propagation.
Figurative Language: Dealing with metaphors, idioms, sarcasm, and other figurative
language forms requires a deep understanding of cultural and contextual cues.
Text Preprocessing Techniques
Text preprocessing in NLP refers to the process of cleaning and preparing text data
before it is fed into a machine learning or natural language processing model. The most
common text preprocessing techniques are:
Tokenization: Breaking text into individual words or tokens. E.g., the sentence "I love NLP" would be tokenized into ["I", "love", "NLP"].
Removing stop words: Eliminating common words (e.g., "and", "the", "is") that do not
carry much information.
Removing HTML tags: In text data extracted from web pages, HTML tags are often
present. Removing these tags ensures that only the text content is considered for
analysis.
Removing URLs and email addresses: Text data may contain URLs and email
addresses that are not relevant for analysis. Removing them helps clean the text.
Removing whitespace and extra spaces: Cleaning up extra spaces and whitespace in
the text ensures consistency and readability in the data.
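Below is a minimal sketch of these cleanup steps in Python, assuming NLTK (with its punkt and stopwords data) is available; the regular expressions are illustrative choices, not canonical ones.

```python
# A rough preprocessing sketch: strip HTML, URLs, emails and extra
# whitespace, then tokenize and drop English stop words.
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)        # tokenizer models
nltk.download("stopwords", quiet=True)    # stop-word lists

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                # remove HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\S+@\S+", " ", text)                # remove email addresses
    text = re.sub(r"\s+", " ", text).strip()            # collapse extra whitespace
    tokens = word_tokenize(text.lower())
    stops = set(stopwords.words("english"))
    return [t for t in tokens if t.isalpha() and t not in stops]

print(preprocess("<p>I love NLP: https://example.com</p>"))
# -> ['love', 'nlp']
```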
Tokenization & Text Normalization
Tokenization is a fundamental step in Natural Language Processing (NLP) that involves
breaking down text into smaller units called tokens. Text normalization focuses on
standardizing and transforming text data to ensure consistency and reduce redundancy.
It’s an essential step in NLP tasks such as text analysis, text mining, and machine learning.
Types:
Word Tokenization: Divides text into individual words. E.g., "Hello, how are you?" would be tokenized into ["Hello", ",", "how", "are", "you", "?"].
Sentence Tokenization: Splits text into individual sentences. E.g., "He likes apples. She likes oranges." would be tokenized into ["He likes apples.", "She likes oranges."].
Subword Tokenization: Breaks words into smaller units, often useful for handling languages with complex morphology or for creating word embeddings.
Character Tokenization: Splits text into individual characters. E.g., "Hello" would be tokenized into ["H", "e", "l", "l", "o"].
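The word, sentence, and character types are easy to see in code; a short example with NLTK (assuming its punkt tokenizer data is installed):

```python
# Word, sentence, and character tokenization on small examples.
from nltk.tokenize import word_tokenize, sent_tokenize

text = "He likes apples. She likes oranges."
print(sent_tokenize(text))  # ['He likes apples.', 'She likes oranges.']
print(word_tokenize("Hello, how are you?"))
# ['Hello', ',', 'how', 'are', 'you', '?']
print(list("Hello"))        # ['H', 'e', 'l', 'l', 'o']
```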
Methods:
Rule-based: Relies on predefined rules to determine how to split text into tokens. E.g., a rule might specify that words are separated by whitespace or punctuation marks.
Hybrid: Combines multiple techniques to achieve more accurate tokenization. E.g., a hybrid approach might use both rule-based and statistical methods to handle different aspects of the text or to adapt to different languages or domains.
Neural: Uses neural networks to segment text into meaningful units, learning patterns directly from data so that it adapts to different languages, domains, and text styles.
Techniques:
Regular Expression Tokenization: Regular expressions can be used to define patterns for tokenizing text. E.g., a regular expression can be used to split text based on specific character sequences or patterns.
Language-specific Tokenization: Some languages have specific rules for tokenization. E.g., in languages like Chinese or Japanese, words are not separated by spaces, so special techniques are needed to tokenize text in these languages.
Using Machine Learning: Machine learning models can be trained to automatically tokenize text based on patterns in the data. This can be useful for complex tokenization tasks or for languages with unique tokenization rules.
Subword Tokenization: Instead of splitting text into words, it breaks text down into smaller units such as subword units or characters. This is useful for handling out-of-vocabulary words and improving the efficiency of language models.
Byte Pair Encoding (BPE): BPE is a subword tokenization technique that iteratively merges the most frequent pairs of characters in a corpus to create a vocabulary of subword units. This method is commonly used in transformer-based models such as GPT and RoBERTa.
Functions:
word_tokenize(): A function in libraries like NLTK to tokenize text into words.
sent_tokenize(): A function to tokenize text into sentences.
char_tokenize(): A function to tokenize text into characters.
Tokenization APIs provided by various NLP libraries and tools.
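As a rough illustration of the BPE merge loop described above, here is a toy sketch; the corpus, frequencies, and number of merges are made up for demonstration.

```python
# Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(pair, vocab):
    # Replace the spaced pair with its concatenation in every word.
    return {w.replace(" ".join(pair), "".join(pair)): f for w, f in vocab.items()}

# Words as space-separated characters with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(5):
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge(best, vocab)
    print("merged:", best)
```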
Text Representation Methods
Text representation methods involve converting text data into a format that can be
understood and processed by machine learning algorithms.
These text representation methods play a crucial role in various NLP tasks such as
sentiment analysis, text classification, and machine translation by enabling algorithms
to understand and process textual data effectively. The methods are:
Bag of Words (BoW)
N-gram
TF-IDF
Word embedding
Sentence embedding
Document embedding
Bag of Words (BoW)
Represents text as a collection of unique words or tokens, ignoring grammar and word
order. Each word is assigned a numerical value based on its frequency in the document.
N-gram
An N-gram is a traditional text representation technique that involves breaking down the text into contiguous sequences of n words. A uni-gram gives all the words in a sentence, a bi-gram gives sets of two consecutive words, a tri-gram gives sets of three consecutive words, and so on.
E.g.: “The dog in the house”
Uni-gram: “The”, “dog”, “in”, “the”, “house”
Bi-gram: “The dog”, “dog in”, “in the”, “the house”
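scikit-learn's CountVectorizer produces both bag-of-words and n-gram counts; a small sketch (assuming scikit-learn is installed):

```python
# Uni-gram and bi-gram counts for the example sentence above.
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))  # uni-grams and bi-grams
X = vec.fit_transform(["The dog in the house"])
print(vec.get_feature_names_out())
# ['dog' 'dog in' 'house' 'in' 'in the' 'the' 'the dog' 'the house']
print(X.toarray())  # counts per feature, e.g. 'the' appears twice
```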
Word embedding
It represents each word as a dense vector of real numbers, such that the similar or
closely related words are nearer to each other in the vector space.
This is achieved by training a neural network model on a large corpus of text, where each
word is represented as a unique input and the network learns to predict the surrounding
words in the text. In this way, the semantic meaning of each word is captured.
The dimension of these vectors can range from a few hundred (GloVe, Word2Vec) to thousands (language models).
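A small sketch of training such a model with gensim's Word2Vec; the toy corpus is far too small to learn meaningful embeddings and only demonstrates the API.

```python
# Train a tiny Word2Vec model and inspect the dense vectors it learns.
from gensim.models import Word2Vec

sentences = [["i", "love", "nlp"],
             ["i", "love", "machine", "learning"],
             ["nlp", "is", "fun"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["nlp"][:5])                   # first 5 dimensions of the vector
print(model.wv.most_similar("nlp", topn=2))  # nearest words in vector space
```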
Sentence Embedding
It represents an entire sentence as a single dense vector that captures its overall meaning, so that semantically similar sentences lie close together in the vector space.
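One common way to obtain sentence embeddings in practice, assuming the sentence-transformers package and the pretrained all-MiniLM-L6-v2 model are available:

```python
# Encode whole sentences into fixed-size dense vectors.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["He likes apples.", "She likes oranges."])
print(vectors.shape)  # (2, 384): one 384-dimensional vector per sentence
```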
Document embedding
Document embedding refers to the process of representing an entire document,
such as a paragraph, article, or book, as a single vector.
It captures not only the meaning and context of individual sentences but also the
relationships and coherence between sentences within the document.
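A sketch of document embeddings with gensim's Doc2Vec, again on a toy corpus that only shows the API:

```python
# Learn one vector per document with Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["nlp", "is", "fun"], tags=[0]),
          TaggedDocument(words=["dogs", "like", "houses"], tags=[1])]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)
print(model.dv[0][:5])  # first 5 dimensions of document 0's vector
```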
Term Frequency-Inverse Document Frequency
TF-IDF stands for Term Frequency-Inverse Document Frequency. It improves on BoW because it captures how important a word is to a document. The idea is to weigh words based on how often they appear in a document (the term frequency) and how common they are across all documents (the inverse document frequency).
The formula for calculating the TF-IDF score of a word in a document is:
TF-IDF = (term frequency in document) × log(total no. of documents / no. of documents containing the term)
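The formula can be computed directly; this toy example shows the weighting at work (a word that appears in every document gets zero weight):

```python
# TF-IDF from the formula: tf * log(N / df).
import math

docs = [["the", "dog", "barks"], ["the", "cat", "meows"], ["the", "dog", "runs"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term)               # term frequency in this document
    df = sum(term in d for d in docs)  # no. of documents containing the term
    return tf * math.log(len(docs) / df)

print(tf_idf("dog", docs[0], docs))  # 1 * log(3/2) ≈ 0.405
print(tf_idf("the", docs[0], docs))  # 1 * log(3/3) = 0.0
```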
Text Classification
Goal: To automatically assign a label or category to a piece of text based on its content.
Process: Feature Extraction → Model Training → Hyperparameter Tuning
Named entity recognition (NER), which identifies and classifies entities such as people, organizations, and locations in text, finds applications in information extraction, question answering, and entity linking.
Text Vectorization: Process of converting text data into numerical form so that it
can be used in machine learning algorithms.
Feature Extraction: Involves extracting important and relevant features from text
data using techniques such as Bag-of-Words and TF-IDF.
Model Training: Process of training machine learning classifiers using labeled data,
so that they can learn to make predictions or classifications.
Model Evaluation: Assessing the performance of NLP models using evaluation metrics such as accuracy, precision, recall, and F1 score.
Hyperparameter Tuning: Process of optimizing the performance of a model by
adjusting its parameters to find the best possible configuration.
Pipelines: Building end-to-end workflows for text processing and modeling, which
can include all of the above steps in a coordinated and automated manner.
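These steps can be wired together end to end; here is a sketch with scikit-learn, using made-up toy texts and labels, that combines feature extraction, model training, hyperparameter tuning, and evaluation in a single Pipeline:

```python
# TF-IDF features + logistic regression, tuned with a small grid search.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

texts = ["great product", "terrible service", "love it", "awful experience"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

pipe = Pipeline([("tfidf", TfidfVectorizer()),    # feature extraction
                 ("clf", LogisticRegression())])  # classifier
grid = {"clf__C": [0.1, 1.0, 10.0]}               # hyperparameters to try
search = GridSearchCV(pipe, grid, cv=2)           # tuning with cross-validation
search.fit(texts, labels)                         # model training
print(search.best_params_, search.best_score_)    # evaluation results
print(search.predict(["great experience"]))       # classify a new text
```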
Other Concepts in NLP
Language Models
Models that predict the probability of a sequence of words or characters in a
language.
N-gram models: Predict the likelihood of a word based on the previous 'n' words in
the sequence.
Transformer models: Such as BERT, GPT, and RoBERTa - use self-attention
mechanisms to capture relationships between words in a sequence.
Recurrent Neural Networks (RNNs): Process sequences of words iteratively, capturing dependencies between words.
Long Short-Term Memory (LSTM) networks: A type of RNN that can capture long-
range dependencies in a sequence of words.
Statistical language models: Such as Hidden Markov Models (HMMs) and
Conditional Random Fields (CRFs), which use probabilistic methods to model
sequences of words.
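A tiny bigram model makes the n-gram idea concrete: estimate P(word | previous word) from counts in a (made-up) corpus.

```python
# Bigram probabilities: count(prev, word) / count(prev).
from collections import Counter

corpus = "the dog barks . the dog runs . the cat meows .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def prob(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

print(prob("the", "dog"))  # 2/3: 'dog' follows 'the' in 2 of 3 cases
print(prob("the", "cat"))  # 1/3
```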
Part-of-Speech (POS) Tagging
POS Tagging is the process of assigning grammatical categories, such as noun or verb,
to each word in a sentence.
This helps in understanding the structure and meaning of the sentence.
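A short example with NLTK's pos_tag, assuming the punkt and averaged perceptron tagger data are installed:

```python
# Assign a grammatical category to each word in a sentence.
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)
tokens = nltk.word_tokenize("The dog barks loudly")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB')]
```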
Dependency Parsing
Dependency parsing analyzes the grammatical structure of a sentence by identifying head-dependent relationships between words, e.g., which noun is the subject of a verb.
Coreference Resolution
Coreference resolution determines which expressions in a text refer to the same entity, e.g., linking the pronoun "she" to the person it mentions.