Natural Language Processing

Dr. Meghana Harsh Ghogare

• Introduction, Syntactic Processing, Semantic Analysis, Discourse and
Pragmatic Processing, Spell Checking, Stop words removal, Bag of words
technique, TF-IDF analysis, spaCy library

Introduction to NLP
• A field of artificial intelligence (AI) that focuses on the interaction
between computers and humans through human language.
• NLP plays a critical role in bridging the gap between human
communication and computer understanding.
• The ultimate goal of NLP is to help computers understand human language
as well as humans do.
• Speech recognition: the translation of spoken language into text.
• Natural language understanding: a computer’s ability to understand language.
• Natural language generation: the generation of natural language by a computer.

Key concepts and components of NLP:
• Language Understanding:
• Tokenization: The process of breaking down a text into individual words or tokens.
• Part-of-Speech (POS) Tagging: Assigning grammatical categories (e.g., noun,
verb, adjective) to each token in a sentence.
• Syntax Parsing: Analyzing the grammatical structure of a sentence to
understand the relationships between words and phrases.
• Language Generation:
• Text Generation: Creating relevant text based on given input, e.g., chatbots.
• Machine Translation: Translating text from one language to another using
algorithms and models that understand both languages.
• Sentiment Analysis:
• Determining the sentiment or emotion of text, such as positive,
negative, or neutral. This is useful for tasks like customer feedback
analysis and social media monitoring.
• Information Extraction:
• Extracting structured information from unstructured text. For
example, identifying entities (e.g., names of people, places,
organizations) and their relationships in a news article.
• Speech Recognition and Synthesis:
• Converting speech to text (speech recognition) and
• Converting text into speech (speech synthesis).
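
A minimal sketch of the tokenization step listed under Language Understanding, assuming a simple regular-expression rule (words and single punctuation marks become separate tokens). The function name `tokenize` is illustrative only; real tokenizers handle many more cases, such as contractions and URLs.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("NLP helps computers understand human language.")
print(tokens)
# ['NLP', 'helps', 'computers', 'understand', 'human', 'language', '.']
```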

Lexical (Morphological) Analyzer
• Morphological (lexical) analysis is a fundamental step in various NLP tasks.
• It focuses on how words are constructed from smaller units called
morphemes, which are the smallest units of meaning in a language.
• Lexicon: A lexicon is a dictionary or vocabulary of words in a
language, including their meanings.

Parser/Parsing/Syntactic Analysis

Semantic Analysis
• Word Sense Disambiguation: Many words have multiple meanings. Semantic
analysis aims to determine the correct meaning of a word in a particular context.
For example, in the sentence "I saw a bat," "bat" could refer to either the flying
mammal or a piece of sports equipment, and disambiguation is needed to determine
the correct meaning.
• Named Entity Recognition (NER): Identifying named entities like names of people,
organizations, locations, and dates.
• Eg: Person Names:
• Sentence: "Barack Obama was the 44th President of the United States."
• NER Output:
• "Barack Obama" - Person
• Semantic Role Labelling (SRL): Identifying the role each phrase plays in the
action. Sentence: "She ate the cake with a fork."
• The Agent (She) is the one performing the action of eating.
• The Predicate (ate) is the eating action itself.
• The Patient (the cake) is what is being eaten.
• The Instrument (with a fork) is the tool used to perform the action.
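• Sentiment Analysis: Classifying text as negative, positive, or neutral.
• Word Embeddings: Numerical representations of words that capture their
semantic meaning (e.g., "college is my second home").
• Semantic Parsing: Converting natural language into a structured form, such as a
database query.
• E.g.: "Find all books published by John Smith."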

Discourse Processing
• Refers to the analysis and understanding of language beyond individual sentences.
• Example: "John was feeling tired after a long day at work. He decided to take a nap.
However, his neighbor's dog kept barking loudly. John tried to ignore it, but the noise
was unbearable. Finally, he decided to close his window to block out the noise and took
a nap."
• Coherence and Cohesion: Linking words such as the conjunctions "However,"
"but," and "Finally" help connect one sentence to another.
• Coreference Resolution: "He" in the second sentence refers to "John."
• Discourse Parsing: Identifying the sequence of events.
• Connectives and Discourse Markers: The discourse markers "However," "but," and
"Finally" show the progression.

Pragmatic Processing
• "Do you know what time it is?" can be interpreted as a request.

Spell Checking
• Spell check in Natural Language Processing (NLP) is the process of identifying and
correcting spelling errors in text.
• Spell check in NLP involves several key steps:
• Error Detection: Identify potential spelling errors in the text.
• Candidate Generation: Candidates are words that are similar in spelling to the
misspelled word. Various techniques and algorithms (e.g., Levenshtein distance,
n-grams, or phonetic algorithms) can be used to generate them.
• Candidate Ranking: Candidates are ranked based on their likelihood of being the
correct word.
• Correction Selection:
• Contextual Analysis: Context can help disambiguate between homophones (cell/sell;
buy/by/bye; ate/eight; eye/I; know/no).
• User Feedback: User feedback can be used to further refine the corrections.
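
The detection, candidate generation, and ranking steps can be sketched as follows. This is only an illustration under stated assumptions: `LEXICON` is a hypothetical toy dictionary with made-up frequency counts, `suggest` and `max_distance` are illustrative names, and ranking by edit distance and then by frequency is one simple choice among many.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb (or match)
        prev = curr
    return prev[-1]


# Hypothetical toy lexicon: word -> made-up frequency count.
LEXICON = {"spelling": 50, "spilling": 5, "selling": 20, "check": 80, "text": 100}


def suggest(word, max_distance=2):
    """Return candidate corrections, closest first, then most frequent first."""
    word = word.lower()
    if word in LEXICON:  # error detection: known words need no correction
        return []
    candidates = [(levenshtein(word, w), -freq, w) for w, freq in LEXICON.items()]
    ranked = sorted(c for c in candidates if c[0] <= max_distance)
    return [w for _, _, w in ranked]


print(suggest("speling"))  # ['spelling', 'selling', 'spilling']
```

Contextual analysis is not covered here: choosing between homophones such as "cell" and "sell" needs information about the surrounding words, which edit distance alone cannot provide.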

Stop Words Removal
• Definition: Stop words are words considered to be of little value in most NLP
tasks because they appear frequently in a given language and don't carry
significant meaning on their own.
• Examples of stop words: "the," "and," "of," "in," "to," "is," "for," "on," "it," and
many others.
• Text Preprocessing: Before analysis, text typically goes through steps like
tokenization (splitting text into words or tokens), lowercasing (converting all text
to lowercase), and punctuation removal.
• Stop Words Removal:
• Original Sentence: "The quick brown fox jumps over the lazy dog."
• After Stop Words Removal: "quick brown fox jumps lazy dog."
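
A minimal sketch of the preprocessing and stop word removal steps above, applied to the same sentence. The `STOP_WORDS` set is an assumption: it contains the examples listed earlier plus "over" and "a", added here so the result matches the slide's output. NLP libraries ship much longer lists.

```python
import string

# Assumed stop word list: the slide's examples plus "over" and "a".
STOP_WORDS = {"the", "and", "of", "in", "to", "is", "for", "on", "it", "over", "a"}


def remove_stop_words(sentence):
    """Lowercase, strip punctuation, split on whitespace, and drop stop words."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in cleaned.split() if token not in STOP_WORDS]


print(remove_stop_words("The quick brown fox jumps over the lazy dog."))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```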

Advantages of Stop Word Removal in Natural Language Processing (NLP)
• Improved Computational Efficiency: Especially for large text corpora, it can
significantly speed up processing times.
• Reduced Dimensionality
• Focus on Content Words
• Improved Interpretability
• Reduced Storage Requirements
• Improved Model Performance
• Enhanced Visualization

Bag of Words Technique
• A fundamental concept in Natural Language Processing (NLP), used for text
analysis and text classification tasks.
• Steps:
1. Tokenization
2. Vocabulary Building
3. Counting Word Frequencies
4. Creating the Bag of Words: Each document is represented by a vector of
word counts over the vocabulary; stacking these vectors for all documents
in your corpus gives the Bag of Words representation of the corpus.
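
The four steps can be sketched as below. The three short documents are hypothetical and chosen only to keep the vectors readable; whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors): one word-count vector per document."""
    # Step 1: tokenization (lowercase, split on whitespace).
    tokenized = [doc.lower().split() for doc in documents]
    # Step 2: vocabulary building (sorted so the column order is stable).
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    # Steps 3 and 4: count word frequencies and lay them out over the vocabulary.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["the dog is lazy", "the fox is fast", "the dog chased the fox"]  # hypothetical
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['chased', 'dog', 'fast', 'fox', 'is', 'lazy', 'the']
for vector in vectors:
    print(vector)
# [0, 1, 0, 0, 1, 1, 1]
# [0, 0, 1, 1, 1, 0, 1]
# [1, 1, 0, 1, 0, 0, 2]
```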

Advantages & Disadvantages of Bag of Words
• Advantages:
• Simplicity: It's a straightforward technique.
• Versatility: It can be used for various NLP tasks.
• Interpretability: The vectors represent word frequencies, which are easy to
interpret.
• Disadvantages:
• Sparse Representation: The vectors are often sparse because most
documents use only a subset of the vocabulary.
• Ignores Context: Treats all words as independent.
• Loss of Sequence Information: Disregards word order and grammar.

TF-IDF Analysis
• TF-IDF is short for Term Frequency-Inverse Document Frequency.
• A step-by-step explanation of TF-IDF analysis with an example:
• Step 1: Corpus: A collection of documents is called a corpus.
• Document 1: "The quick brown fox jumps over the lazy dog."
• Document 2: "A brown fox is fast."
• Document 3: "The dog is lazy."
• Step 2: Tokenization
• Get a list of unique terms in the entire corpus.

• Step 3: Term Frequency (TF): For each document, calculate the term frequency.
• E.g.:
• Document 1: "The quick brown fox jumps over the lazy dog."
• The-2, quick-1, brown-1, fox-1, jumps-1, over-1, lazy-1, dog-1
• Document 2: "A brown fox is fast."
• A-1, brown-1, fox-1, is-1, fast-1
• Document 3: "The dog is lazy."
• The-1, dog-1, is-1, lazy-1
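
Steps 1 to 3 can be reproduced with a few lines of Python. This is a sketch that assumes lowercasing and a letters-only tokenizer, so "The" and "the" count as the same term; the helper name `term_counts` is illustrative.

```python
import re
from collections import Counter

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "A brown fox is fast.",
    "The dog is lazy.",
]


def term_counts(document):
    """Step 2 and Step 3: tokenize a document and count each term."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(tokens)


# Unique terms across the entire corpus (the vocabulary).
vocabulary = sorted(set().union(*(term_counts(doc) for doc in corpus)))
print(vocabulary)

for number, doc in enumerate(corpus, start=1):
    print(f"Document {number}:", dict(term_counts(doc)))
```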

• Example:
• Imagine you have a collection of articles about cats.
• Term Frequency (TF) (Common words)
• Term frequency measures how often a word appears in a document.
• Words that appear more frequently are likely to be important within that document.
• For example, in an article about cats, the word "cat" might appear 10 times,
"play" 5 times, and "food" 2 times.

• Inverse Document Frequency (IDF) (Rare words)

• IDF measures the uniqueness of a word across the entire corpus. Words that
are rare across the corpus are considered more important because they
provide specific information.
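
A sketch of how TF and IDF combine into a TF-IDF score, using the three-document corpus from Step 1. The formulas are an assumption: this uses one common variant, tf = count / document length and idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. Libraries often use smoothed or normalized variants that give different numbers.

```python
import math
import re
from collections import Counter


def tf_idf(corpus):
    """Return one {term: score} dict per document (assumed variant, see above)."""
    docs = [re.findall(r"[a-z]+", doc.lower()) for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in docs for term in set(tokens))
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "A brown fox is fast.",
    "The dog is lazy.",
]
scores = tf_idf(corpus)
# "the" is frequent in Document 1 but also occurs in Document 3,
# so the rarer term "quick" ends up with the higher score.
print(scores[0]["the"], scores[0]["quick"])
```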

spaCy Library
• spaCy:
• spaCy is a fast and efficient NLP library written in Python. It is
designed specifically for production use and is known for its speed
and accuracy in tasks like tokenization, part-of-speech tagging, and
named entity recognition.

Features of the spaCy Library
• Tokenization: Provides robust and efficient tokenization.
• Part-of-Speech (POS) Tagging
• Named Entity Recognition (NER): Provides pre-trained models for named entity
recognition, which identify and classify entities like names of people,
organizations, locations, dates, and more in text.
• Dependency Parsing: Determining the relationships between words.
• Lemmatization: The process of reducing words to their base or dictionary
form (lemmas).
• E.g.: "The quick brown foxes are jumping over the lazy dogs."
The --> the, quick --> quick, brown --> brown, foxes --> fox, are --> be,
jumping --> jump, over --> over, the --> the, lazy --> lazy, dogs --> dog

• Sentence Segmentation: Splits text into sentences by identifying different delimiters.
• E.g.: "This is the first sentence. This is the second sentence! And this is
the third sentence?"
• Text Classification: spaCy supports text classification, making it
suitable for tasks like sentiment analysis, spam detection, and topic classification.
• Word Vectors: Numerical representations of words in a vector space,
e.g., "king" and "queen", or "man" and "woman", are similar, so they have
somewhat similar vectors.
• Multilingual Support
• Integration with Deep Learning Libraries: spaCy can be integrated
with deep learning libraries like TensorFlow and PyTorch, allowing you
to combine its NLP capabilities with deep learning models.
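
A small sketch of the tokenization and sentence segmentation features, assuming spaCy v3 is installed (`pip install spacy`). It uses a blank English pipeline with the rule-based `sentencizer` component, so no trained model has to be downloaded. The helper name `split_sentences` is illustrative, and the import is guarded only so the snippet can be loaded where spaCy is absent.

```python
try:
    import spacy
except ImportError:  # spaCy is a third-party package and may not be installed
    spacy = None


def split_sentences(text):
    """Return a list of sentences, each given as a list of token strings."""
    nlp = spacy.blank("en")      # tokenizer only, no trained components
    nlp.add_pipe("sentencizer")  # rule-based sentence boundaries (., !, ?)
    doc = nlp(text)
    return [[token.text for token in sentence] for sentence in doc.sents]


if spacy is not None:
    text = ("This is the first sentence. This is the second sentence! "
            "And this is the third sentence?")
    for tokens in split_sentences(text):
        print(tokens)
```

POS tagging, NER, dependency parsing, and lemmatization need a trained pipeline, for example one loaded with `spacy.load("en_core_web_sm")` after that model has been downloaded.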

Advantages of spaCy Library
1. Efficiency and Speed
2. Pre-trained Models
3. Multilingual Support
4. Integration: It can be integrated with other popular Python
libraries like TensorFlow and PyTorch, giving a
combined benefit.

