
NATURAL LANGUAGE PROCESSING

MODULE-1

1.1 Introduction to text data and Natural Language Processing

1.2 History of NLP, Need of language processing

1.3 Applications of NLP, Components of NLP, NLP Phases

Introduction to text data

 Text data is a form of unstructured data that is represented as sequences of characters.


 In the context of computing, text data can take various forms, including documents,
articles, tweets, emails, and more.
 Analysing and extracting insights from text data is a crucial aspect of many
applications, and it forms the basis for Natural Language Processing (NLP).

Natural Language Processing


 Natural Language Processing is a field of artificial intelligence that focuses on the
interaction between computers and humans through natural language.
 The goal of NLP is to enable computers to understand, interpret, and generate human-
like text in a way that is both meaningful and contextually relevant.

History of Natural Language Processing

 1950s-1960s: The origins of NLP can be traced back to the mid-20th century. During
this time, researchers like Alan Turing and John McCarthy laid the groundwork for
artificial intelligence (AI) and computational linguistics. Turing proposed the famous
Turing Test in 1950, which became a benchmark for evaluating a machine's ability to
exhibit intelligent behavior indistinguishable from a human.
 1960s-1970s: Early NLP efforts focused on rule-based systems and symbolic
approaches. One of the pioneering systems was the Georgetown-IBM Experiment in
1954, which translated Russian sentences into English using an IBM 701 computer. In
the 1960s and 1970s, researchers like Roger Schank and Terry Winograd worked on
natural language understanding and dialogue systems, leading to developments such
as the SHRDLU program by Winograd.
 1980s-1990s: This era witnessed advancements in statistical NLP and machine
learning techniques. The advent of probabilistic models like Hidden Markov Models
(HMMs) and the use of corpora for training and evaluation marked a shift towards
data-driven approaches. Later systems such as IBM's "Watson" built on these statistical
foundations, showcasing the potential of NLP in real-world applications.
 2000s-Present: The 2000s saw a surge in research and development in NLP, fueled
by the availability of large datasets, computational power, and algorithmic
improvements. Key developments include the rise of deep learning methods such as
Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks,
and Transformer models like BERT and GPT. These models revolutionized tasks like
machine translation, sentiment analysis, question answering, and more.
 Recent Trends: In recent years, NLP has witnessed rapid progress in areas like
transfer learning, pre-trained language models, and multimodal AI, where systems can
understand and generate text, speech, and images. Models like GPT-3, developed by
OpenAI, have showcased the power of large-scale language models in performing
diverse NLP tasks with human-like fluency.
Need of Natural Language Processing

Natural Language Processing (NLP) is a crucial field within artificial intelligence and
computational linguistics, serving various purposes and addressing specific needs.

 Human-Computer Interaction: NLP enables more natural and intuitive


communication between humans and computers. Voice assistants, chatbots, and
virtual assistants leverage NLP to understand and respond to user queries in natural
language, making technology more accessible and user-friendly.
 Information Extraction and Retrieval: NLP is essential for extracting meaningful
information from large volumes of text data. It helps in identifying entities,
relationships, and sentiments, facilitating efficient information retrieval from
documents, articles, social media, and other textual sources.
 Language Translation: NLP plays a crucial role in machine translation systems.
Technologies like Google Translate and other translation services rely on NLP
algorithms to understand and translate text between different languages, breaking
down language barriers and fostering global communication.
 Sentiment Analysis: Businesses use sentiment analysis, a branch of NLP, to analyze
and understand the sentiments expressed in text data. This is valuable for monitoring
customer feedback, social media reactions, and online reviews, helping organizations
gauge public opinion and make data-driven decisions.
 Text Summarization: NLP assists in summarizing large volumes of text into concise
and coherent summaries. This is particularly useful for handling vast amounts of
information, such as news articles, research papers, and legal documents, providing
users with a quick overview of content.
 Search Engines: Search engines employ NLP techniques to understand user queries
and deliver relevant search results. Semantic search, which considers the meaning and
context of words, enhances the accuracy and relevance of search engine results.
 Speech Recognition: NLP is fundamental to speech recognition systems that convert
spoken language into text. Applications range from voice commands on smartphones
to transcription services and voice-activated devices, making human-computer
interaction more convenient.
 Healthcare Applications: In the healthcare sector, NLP is used for processing
clinical notes, medical literature, and patient records. It aids in information extraction,
coding, and analysis, contributing to improved healthcare management and research.
 Chatbots and Virtual Assistants: NLP powers chatbots and virtual assistants,
allowing them to understand and respond to user queries in natural language. This is
applied across various domains, including customer support, e-commerce, and
information retrieval.
 Content Generation: NLP models can be used for content generation, including
automatic article writing, summarization, and creative writing. This is particularly
useful in scenarios where generating human-like text is required.
Applications of Natural Language Processing
Natural Language Processing (NLP) is a field at the intersection of artificial
intelligence, computer science, and linguistics. Following are the applications of NLP:
1. Text and Document Processing
 Text Classification: Automating the categorization of text into predefined
categories (e.g., spam detection in emails).
 Sentiment Analysis: Determining the sentiment behind a text, useful in social
media monitoring and customer feedback analysis.
 Named Entity Recognition (NER): Identifying and classifying entities (e.g.,
names of people, organizations, locations) within a text.
 Language Translation: Translating text from one language to another, such as
Google Translate.
 Summarization: Generating a concise summary of a long document or article.
2. Chatbots and Virtual Assistants
 Customer Support: Providing automated customer service through chatbots,
which can handle common queries and issues.
 Personal Assistants: Virtual assistants like Siri, Alexa, and Google Assistant
that can perform tasks and answer questions.
3. Healthcare
 Medical Records Analysis: Extracting and interpreting information from patient
records for better diagnosis and treatment plans.
 Clinical Trial Matching: Matching patients with appropriate clinical trials based
on their medical history and conditions.
 Drug Discovery: Analyzing scientific literature and data to aid in the discovery
of new drugs.
4. Finance
 Fraud Detection: Identifying fraudulent activities by analyzing transaction records
and textual data.
 Market Analysis: Analyzing news, reports, and social media to gauge market
sentiment and make investment decisions.
 Customer Service Automation: Providing automated responses and support
through NLP-powered chatbots.
5. E-commerce
 Product Recommendations: Analyzing customer reviews and preferences to
recommend products.
 Search Optimization: Enhancing search functionality on e-commerce platforms
through better understanding of user queries.
 Customer Feedback Analysis: Extracting insights from reviews and feedback to
improve products and services.
6. Education
 Automated Grading: Using NLP to grade essays and assignments, providing
consistent and quick evaluations.
 Personalized Learning: Developing adaptive learning systems that tailor
educational content to individual student needs.
 Language Learning: Creating applications that help users learn new languages
through interactive and engaging content.
7. Legal and Compliance
 Document Review: Automating the review and analysis of legal documents to
identify relevant information and ensure compliance.
 Contract Analysis: Extracting key terms and conditions from contracts and legal
documents.
8. Human Resources
 Resume Screening: Automating the process of screening resumes to identify
suitable candidates for job positions.
 Employee Feedback Analysis: Analyzing employee surveys and feedback to
understand workplace sentiment and improve HR practices.
9. Entertainment
 Content Creation: Assisting in generating content for books, articles, and scripts.
 Media Monitoring: Tracking mentions and trends across various media outlets
and social media platforms.
10. Research and Development
 Literature Review: Assisting researchers in summarizing and synthesizing large
volumes of scientific literature.
 Patent Analysis: Analyzing patent documents to identify trends and innovations.

Components of Natural Language Processing

Natural Language Processing (NLP) is broadly divided into two main components: Natural
Language Understanding (NLU) and Natural Language Generation (NLG).

1. Natural Language Understanding (NLU)

Natural Language Understanding involves transforming human language into a machine-
readable format. It helps the machine understand and analyse human language by extracting
elements such as keywords, emotions, relations, and semantics from large volumes of text.
The key tasks and techniques in NLU include:

a. Tokenization

It is the process of breaking down a text into smaller units called tokens (e.g., words, phrases,
symbols). It facilitates further analysis by providing the basic units of language.
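
As an illustration, here is a minimal tokenization sketch using NLTK (this assumes the nltk package and its tokenizer data are available; the sample text is only illustrative):

import nltk

# Tokenizer models; depending on the NLTK version, 'punkt_tab' may also be required
nltk.download("punkt", quiet=True)

text = "Text data is unstructured. NLP helps computers understand it."

sentences = nltk.sent_tokenize(text)  # sentence-level tokens
words = nltk.word_tokenize(text)      # word-level tokens

print(sentences)  # ['Text data is unstructured.', 'NLP helps computers understand it.']
print(words)      # ['Text', 'data', 'is', 'unstructured', '.', 'NLP', 'helps', ...]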

b. Part-of-Speech (POS) Tagging

It identifies the grammatical parts of speech (e.g., nouns, verbs, adjectives) for each token in
a sentence. It helps in understanding the syntactic structure and meaning of sentences.

c. Named Entity Recognition (NER)

It identifies and classifies named entities (e.g., people, organizations, locations) within text.
It extracts important information and entities from text for further analysis.

d. Syntax and Parsing

 Syntax Analysis: Understanding the grammatical structure of a sentence by
identifying relationships between words.
 Parsing: Creating parse trees to represent the syntactic structure of sentences.
 Purpose: Provides structural representation of sentences to aid in deeper
understanding.

e. Semantic Analysis

 It extracts the meaning of words, phrases, and sentences, ensuring accurate
interpretation of language meaning.
 Techniques: Word sense disambiguation (determining the correct meaning of a word
in context), semantic role labelling (identifying the roles of entities in actions).

f. Coreference Resolution

 It identifies when different words or phrases refer to the same entity in a text. It
maintains coherence and consistency in understanding text.

g. Sentiment Analysis

 It determines the sentiment or emotional tone behind a text.


 Purpose: Gauges public opinion, customer satisfaction, and social media trends.

h. Discourse Analysis

 It analyses the structure and coherence of larger text segments (e.g., paragraphs,
documents). It also examines how sentences connect to form a meaningful whole.

2. Natural Language Generation (NLG)

Natural Language Generation involves the creation of human-like text by machines. NLG
focuses on producing coherent, contextually appropriate, and grammatically correct text. The
key tasks and techniques in NLG include:

a. Content Determination

 It decides what information should be included in the generated text. It ensures
relevant and useful content is selected for generation.
b. Document Structuring

 It organizes the information into a coherent structure (e.g., paragraphs, sections). It
enhances the readability and logical flow of the generated text.

c. Sentence Planning

 It determines how information should be expressed in individual sentences. It also
ensures clarity and cohesiveness in sentence construction.
 Techniques: Sentence aggregation (combining information into fewer sentences),
referring expression generation (choosing appropriate pronouns and noun phrases).

d. Lexicalization

 It chooses the appropriate words and phrases to convey the intended meaning.
 It enhances the fluency and naturalness of the generated text.

e. Surface Realization

 It converts the abstract representation of sentences into grammatically correct text. It
produces text that is syntactically and grammatically accurate.
 Techniques: Applying syntactic and morphological rules.

f. Text Summarization

 It creates a concise summary of a longer text document. It provides a quick overview
of the main points in a document.
 Techniques: Extractive summarization (selecting key sentences), abstractive
summarization (generating new sentences to summarize content).

g. Dialogue Generation

 It creates conversational responses in chatbots and virtual assistants. It facilitates
natural and coherent interactions with users.
 Techniques: Template-based generation, rule-based generation, and neural network-
based generation.
h. Story and Report Generation

 It creates narratives and reports from structured data. It also automates the creation of
detailed and informative narratives.

 Applications: Automated news reporting, financial report generation, and data-driven
storytelling.

Integration of NLU and NLG

While NLU focuses on understanding and interpreting text, NLG is concerned with
generating text. Both components often work together in applications such as:

 Chatbots: NLU interprets user input, and NLG generates appropriate responses.
 Machine Translation: NLU understands the source text, and NLG generates the
translated text in the target language.
 Text Summarization: NLU identifies key information, and NLG produces a concise
summary.

Together, NLU and NLG enable a wide range of applications that require both understanding
and generating human language, making NLP a powerful tool for interacting with and
processing natural language data.

Steps of Natural Language Processing


Fig 1: Phases of NLP

1. Lexical and Morphological Analysis:

Lexicon describes the understandable vocabulary that makes up a language. Lexical analysis
deciphers and segments language into units, or lexemes, such as paragraphs, sentences,
phrases, and words. NLP algorithms categorize words into parts of speech (POS) and split
lexemes into morphemes, the meaningful language units that cannot be divided further. There
are 2 types of morphemes:

1. Free morphemes function independently as words (like "cow" and "house").

2. Bound morphemes make up larger words. The word "unimaginable" contains the
morphemes "un-" (a bound morpheme signifying negation), "imagine" (the
free morpheme root of the whole word), and "-able" (a bound morpheme indicating
that the action of the root morpheme is possible).

2. Syntactic Analysis (Parsing)

Syntax describes how a language's words and phrases are arranged to form sentences.
Syntactic analysis checks word arrangements for proper grammar.
For instance, the sentence "Dave wrote the paper" passes a syntactic analysis check because
it is grammatically correct. Conversely, a syntactic analysis categorizes a sentence like "Dave
do jumps" as syntactically incorrect.

3. Semantic Analysis

Semantics describe the meaning of words, phrases, sentences, and paragraphs. Semantic
analysis attempts to understand the literal meaning of individual language selections, not
syntactic correctness. However, a semantic analysis does not check language data before and
after a selection to clarify its meaning.

For instance, "Manhattan calls out to Dave" passes a syntactic analysis because it is a
grammatically correct sentence. However, it fails a semantic analysis. Because Manhattan is
a place (and cannot literally call out to people), the sentence's meaning does not make sense.

4. Discourse Integration

Discourse describes communication between 2 or more individuals. Discourse integration
analyzes prior words and sentences to understand the meaning of ambiguous language.

For instance, if one sentence reads, “Manhattan speaks to all its people,” and the following
sentence reads, “It calls out to Dave,” discourse integration checks the first sentence for
context to understand that “It” in the latter sentence refers to Manhattan.

5. Pragmatic Analysis

Pragmatics describes the interpretation of language's intended meaning. For instance, a
pragmatic analysis can uncover the intended meaning of "Manhattan speaks to all its people."
Methods like neural networks assess the context to understand that the sentence is not literal,
and most people will not interpret it as such. A pragmatic analysis deduces that this sentence
is a metaphor for how people emotionally connect with places.
MODULE-2

2.1 NLP Libraries, Data Types: structured and unstructured

2.2 Linguistic resources, Word Level Analysis, Regular Expression and its applications.

2.3 Types of regular expression, regular expression function

NLP Libraries

1. NLTK (Natural Language Toolkit):

- Python library for working with human language data.

- Provides easy-to-use interfaces to perform tasks like tokenization, stemming, tagging,
parsing, and more.

- Suitable for education, research, and rapid prototyping.

2. Spacy:

- Designed for efficient and production-ready NLP.

- Features pre-trained models for various languages, supporting tasks such as POS tagging,
named entity recognition (NER), and dependency parsing.

- Emphasizes speed and usability.

3. Gensim:

 It is used for topic modeling and document similarity analysis.

 Implements algorithms like Latent Semantic Analysis (LSA) and Latent Dirichlet
Allocation (LDA).

 Well-suited for large text corpora and scalable solutions.

4. Stanford NLP:

 Suite of NLP tools developed by Stanford University.

 Offers a range of functionalities, including POS tagging, NER, sentiment


analysis, and coreference resolution.

 Java-based with available Python wrappers.


5. Transformers (Hugging Face):

 Known for its repository of pre-trained transformer models.

 Provides easy-to-use interfaces for various transformer-based architectures,


such as BERT, GPT, and others.

 Widely adopted for state-of-the-art performance in many NLP tasks.

6. TextBlob:

 Simplifies common NLP tasks with a high-level API.

 Built on NLTK and Pattern libraries.

 Supports tasks like part-of-speech tagging, noun phrase extraction, sentiment


analysis, classification, and more.

7. AllenNLP:

 Framework for building and evaluating NLP models.

 Provides pre-built components for tasks like text classification, named entity
recognition, and machine reading comprehension.

 Based on PyTorch and designed for flexibility and extensibility.

8. CoreNLP (Stanford):

 Comprehensive NLP toolkit offering various tools and pipelines.

 Supports tasks like tokenization, POS tagging, named entity recognition,


sentiment analysis, and more.

 Java-based with available Python wrappers.

Data Types

Structured Data:

 Tabular Information: In certain NLP applications, data may be organized in


structured tables or databases. For instance, a dataset might contain columns like
"Text," "Category," and "Sentiment Score." This structured format facilitates easy
integration with traditional machine learning models that are designed to work with
tabular data.

 Metadata: Structured information accompanying text data, such as timestamps,


author names, or geographic locations, can be considered structured data. Metadata
often helps contextualize and organize the unstructured text.
 Labeled Datasets: NLP models often require labeled datasets for training. These
labeled datasets, where each piece of text is associated with a specific category or
sentiment label, provide a structured foundation for supervised learning tasks.

Unstructured Data:

 Free-Form Text: The majority of natural language data is unstructured, consisting of


free-form text found in articles, social media posts, books, or other written content.
Unstructured text is rich in linguistic nuances and requires specialized techniques for
analysis.

 Speech and Audio Data: NLP extends beyond written text to include spoken
language. Transcriptions of speech, phone call recordings, or other audio data fall into
the category of unstructured data that requires processing to extract meaningful
information.

Linguistic resources:

Linguistic resources typically refer to various types of data or tools used in the study or
processing of language.

Following are the types of Linguistic resources:

1. Dictionaries: Lexical resources containing words or phrases along with their


definitions, pronunciations, and other linguistic information.

2. Corpora: Collections of texts or speech data used for linguistic analysis, including
written texts, transcripts, and audio recordings.

3. Thesauri: Resources providing synonyms and antonyms for words, often used for
expanding vocabulary or improving natural language processing tasks.

4. Word Lists: Lists of words categorized by various criteria, such as frequency, part of
speech, or semantic category.

5. Lexicons: Structured databases containing linguistic information about words,


including morphological, syntactic, and semantic properties.

6. Ontologies: Formal representations of knowledge or concepts in a specific domain,


often used in natural language understanding and semantic analysis.

7. Part-of-Speech (POS) Taggers: Software tools that assign grammatical categories


(e.g., noun, verb, adjective) to words in text or speech data.

8. Named Entity Recognizers (NER): Tools that identify and classify named entities
(e.g., persons, organizations, locations) in text data.
9. Semantic Networks: Graph-based representations of semantic relationships between
words or concepts, often used in natural language processing tasks.

10. Language Models: Statistical or machine learning models trained on large text
corpora to predict or generate text, commonly used in speech recognition, machine
translation, and text generation.

11. Parsing Tools: Software tools that analyze the grammatical structure of sentences,
often used in syntactic analysis and parsing.

12. Speech Recognition and Synthesis Systems: Tools for converting spoken language
into text and vice versa, commonly used in voice interfaces and speech-to-text
applications.

Word Level Analysis:

Word-level analysis in natural language processing (NLP) involves studying and processing
text data at the level of individual words. This type of analysis is fundamental in many NLP
tasks and applications. Some common techniques and tasks involved in word-level analysis include:

1. Tokenization: Tokenization is the process of breaking a text into individual words or


tokens. This step is often the first preprocessing step in NLP tasks. Tokenization can
be straightforward for languages like English, where words are typically separated by
spaces, but it can be more complex for languages with agglutinative or
morphologically rich structures.

2. Part-of-Speech (POS) Tagging: POS tagging involves assigning grammatical


categories (e.g., noun, verb, adjective) to each word in a text. POS tagging is essential
for many downstream NLP tasks, such as parsing, named entity recognition, and
machine translation.

3. Morphological Analysis: Morphological analysis involves studying the internal


structure of words and how they are formed from smaller units called morphemes.
Morphological analysis is crucial for tasks like stemming (reducing words to their
base or root forms) and lemmatization (reducing words to their dictionary forms).

4. Word Embeddings: Word embeddings are dense vector representations of words in a


continuous vector space. They capture semantic relationships between words based on
their contextual usage in large text corpora. Word embeddings are widely used in
NLP tasks such as text classification, sentiment analysis, and machine translation.

5. Named Entity Recognition (NER): NER is the task of identifying and classifying
named entities (e.g., persons, organizations, locations) mentioned in a text. Named
entities are often crucial for information extraction and knowledge discovery tasks.
6. Word Sense Disambiguation (WSD): WSD is the task of determining the correct
meaning of a word in context, particularly when a word has multiple possible
meanings. WSD is important for improving the accuracy of NLP applications such as
machine translation and question answering.

7. Word Frequency Analysis: Word frequency analysis involves counting the


occurrence of each word in a text or corpus. It helps identify common words, rare
words, and trends in language usage. Word frequency analysis is useful for tasks such
as keyword extraction, topic modeling, and text summarization.

8. Sentiment Analysis: Sentiment analysis involves determining the sentiment or


opinion expressed in a piece of text. Word-level sentiment analysis assigns sentiment
scores to individual words or phrases to classify the overall sentiment of the text as
positive, negative, or neutral.

Types of Regular Expression

Regular expressions (regex) are powerful tools used in Natural Language Processing (NLP)
for pattern matching and text manipulation.

Types of RE:

1. Basic Text Matching:

Example: pattern = "word"

Description: Matches the exact occurrence of the word "word" in the text.

2. Wildcards:

Example: pattern = "w.rd"

Description: The dot (.) represents any single character, so this pattern matches strings like
"ward," "word," and "w3rd."

3. Character Classes:

Example: pattern = "[aeiou]"

Description: Matches any single vowel. Square brackets [ ] denote a character class, and the
pattern matches any character within the specified set.

4. Negation in Character Classes:

Example: pattern = "[^aeiou]"

Description: Matches any single character that is not a vowel. The caret (^) inside the
character class negates the set.
5. Quantifiers:

Example: pattern = "go+l"

Description: Matches "gol," "gool," "gooool," and so on. The plus (+) indicates one or more
occurrences of the preceding character.

6. Optional Character:

Example: pattern = "colou?r"

Description: Matches both "color" and "colour." The question mark (?) makes the preceding
character optional.

7. Anchors:

Example: pattern = "^start"

Description: Matches the pattern only if it appears at the beginning of a line. The caret (^) is
an anchor for the start of a line.

8. Word Boundaries:

Example: pattern = r"\bword\b"

Description: Matches the whole word "word" and not a part of a larger word. The \b denotes
a word boundary.

9. Grouping:

Example: pattern = "(red|blue)car"

Description: Matches either "redcar" or "bluecar." The pipe (|) acts as an OR operator within
the parentheses.

10. Quantifier Modifiers:

Example: pattern = "a{2,4}"

Description: Matches "aa," "aaa," or "aaaa." The curly braces {} specify a range for the
number of occurrences.
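
To make the patterns above concrete, here is a short sketch applying a few of them with Python's re module (the sample strings are made up for illustration):

import re

text = "The goool was scored. colour or color? start here."

print(re.findall(r"w.rd", "ward word w3rd"))   # wildcard: ['ward', 'word', 'w3rd']
print(re.findall(r"go+l", text))               # quantifier: ['goool']
print(re.findall(r"colou?r", text))            # optional character: ['colour', 'color']
print(re.findall(r"[aeiou]", "sky blue"))      # character class: ['u', 'e']
print(bool(re.search(r"\bstart\b", text)))     # word boundary: True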

Applications of Regular Expression:

Regular expressions (regex) are a powerful tool in natural language processing (NLP) for
pattern matching and text manipulation. They are used to find, extract, or replace specific
patterns in text.
1. Tokenization:
o Splitting Text: Separating text into words, sentences, or other meaningful
units.

import re
text = "Hello, world! How are you?"
tokens = re.findall(r'\b\w+\b', text)

2. Text Cleaning:
o Removing Punctuation: Cleaning text by removing punctuation marks.

text = "Hello, world! How are you?"


cleaned_text = re.sub(r'[^\w\s]', '', text)

3. Finding Patterns:
o Extracting Email Addresses: Identifying and extracting email addresses from
text.

text = "Please contact us at support@example.com or sales@example.org"


emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)

4. Replacing Text:
o Censoring Words: Replacing specific words or phrases with asterisks.

text = "This is a bad example."


censored_text = re.sub(r'\bbad\b', '****', text)

5. Validation:
o Checking Phone Numbers: Validating phone numbers in a specific format.

phone = "123-456-7890"
valid = re.match(r'^\d{3}-\d{3}-\d{4}$', phone)

Python Code for Specific Regular Expressions


i) Any words that end with “t”

To search for words ending with "t", you can use the following regular expression pattern:

import re

text = "This is a test text to find words that end with t, such as heat, light, and flight."

# Regular expression to find words ending with "t"


pattern = r'\b\w*t\b'

# Find all matching words


words_ending_with_t = re.findall(pattern, text)
print(words_ending_with_t)

ii) Search the string to see if it starts with "I" and ends with "Asia"

To check if a string starts with "I" and ends with "Asia", you can use the following regular
expression pattern:

import re

text = "I love travelling across Asia"

# Regular expression to check if string starts with "I" and ends with "Asia"
pattern = r'^I.*Asia$'

# Check if the pattern matches the text


match = re.match(pattern, text)

if match:
    print("The string starts with 'I' and ends with 'Asia'")
else:
    print("The string does not match the criteria")
MODULE-3

3.1 Dependency Grammar: Named Entity Recognition, Question Answer System, Co-
reference resolution, text summarization, text classification

3.2 Tokenization, Text Normalization, Part of speech tagging: lexical syntax

3.3 Hidden Markov Models, Dependency Parsing, Corpus, Tokens and N-grams

3.4 Normalization: Stemming, Lemmatization, Processing with stop words

Dependency Grammar:

Dependency Grammar is a class of syntactic theories that regards the structure of a sentence
as based on the dependency relations between words. Each word is connected to another
word in the sentence, establishing a "head-dependent" relationship. The primary focus is on
the direct relationships between words, rather than constituent structure.

Key Concepts

 Head: The central word that determines the syntactic type of the phrase.
 Dependent: The word that modifies or complements the head.
 Dependency Relation: The link between a head and its dependent.

Example

In the sentence "She enjoys playing tennis":

"enjoys" is the head of "playing."

"playing" is the head of "tennis."

"She" is dependent on "enjoys."

Named Entity Recognition (NER):

Named Entity Recognition (NER) is a subtask of information extraction in natural language


processing (NLP) that involves identifying and classifying named entities mentioned in text
into predefined categories. These categories typically include entities such as:

Persons: Names of people (e.g., "Barack Obama").

Organizations: Names of companies, institutions, etc. (e.g., "Google").

Locations: Geographical locations (e.g., "New York").


Dates and Times: Specific dates or time expressions (e.g., "January 1, 2020").

Miscellaneous: Other types of entities like monetary values, percentages, product names, etc.

Process:

NER systems typically perform two main tasks:

 Detection: Identifying the boundaries of entities in the text.


 Classification: Assigning the identified entities to their respective categories.

Significance in Text Analysis:

NER is significant in various text analysis tasks due to the following reasons:

Information Extraction:

 Improved Search and Retrieval: By identifying named entities, search engines can
better understand the context of queries and documents, leading to more relevant
search results.
 Content Summarization: NER helps in summarizing documents by extracting key
entities, providing a concise overview of the content.

Knowledge Base Construction:

 Database Population: NER assists in populating knowledge bases and databases with
structured information extracted from unstructured text.
 Entity Linking: It helps in linking entities in text to their corresponding entries in a
knowledge base, enhancing the richness of information.

Customer Insights:

 Sentiment Analysis: By identifying entities in customer reviews or social media posts,


companies can gain insights into customer opinions about specific products or
services.
 Trend Analysis: NER enables the detection of trending topics by identifying
frequently mentioned entities over time.

Business Intelligence:

 Competitor Analysis: NER helps in monitoring mentions of competitors and related


entities in news articles, social media, and other sources.
 Market Research: It assists in extracting relevant market information, such as names
of companies, products, and key personnel.

Natural Language Understanding:

 Contextual Understanding: NER enhances the understanding of context by identifying


the entities involved in a text, improving the performance of various NLP applications
like machine translation, summarization, and question answering.

Challenges:

Despite its significance, NER faces several challenges:

 Ambiguity: Entities can be ambiguous (e.g., "Apple" can refer to the fruit or the
company).
 Variability: Entities can have various forms and spellings (e.g., "USA", "United
States", "America").
 Context-Dependence: The meaning of entities can depend on context (e.g., "Jordan"
can be a country or a person).

Python Code:

import spacy

# Load the pre-trained model

nlp = spacy.load('en_core_web_sm')

# Input sentence

sentence = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in April 1976 in
Cupertino, California."

# Process the sentence

doc = nlp(sentence)

# Extract and print named entities

for ent in doc.ents:

if ent.label_ == "ORG":

entity_type = "Organization"

elif ent.label_ == "PERSON":


entity_type = "Person"

elif ent.label_ == "DATE":

entity_type = "Date"

elif ent.label_ == "GPE":

entity_type = "Location"

else:

entity_type = ent.label_

print(f"{ent.text} ({entity_type})")

Question Answering System:

Question answering builds systems that automatically answer questions posed by humans in
natural language. Such a system constructs its answers by querying a structured database of
knowledge or information, usually a knowledge base. Examples of natural language document
collections used for Question Answering systems include:

 A local collection of reference texts


 Internal organization documents and web pages
 Compiled newswire reports
 A set of Wikipedia pages
 A subset of World Wide Web pages

It deals with fact, list, definition, How, Why, hypothetical, semantically constrained, and
cross-lingual questions.

1. Closed-domain question answering

2. Open-domain question answering

1. Closed-domain question answering:

It deals with questions under a specific domain, Natural Language Processing systems can
exploit domain-specific knowledge frequently formalized in ontologies.

2. Open-domain question answering: It deals with questions about nearly anything, and can
only rely on general ontologies and world knowledge.
Question Answering systems include a question classifier module that determines the type of
question and the type of answer. Data redundancy in massive collections, such as the web,
means that nuggets of information are likely to be phrased in many different ways in differing
contexts and documents, which leads to two benefits:

1. By having the right information appear in many forms, the burden on the Question

Answer system to perform complex NLP techniques to understand the text is lessened.

2. Correct answers can be filtered from false positives by relying on the correct answer to
appear more times in the documents than instances of incorrect ones.

For example, systems have been developed to automatically answer temporal and
geospatial questions, questions of definition and terminology, biographical questions,
multilingual questions, and questions about the content of audio, images, and video.

Coreference Resolution:

Coreference resolution is a crucial task in natural language processing (NLP). It involves
determining which words in a text refer to the same entity. This is essential for understanding
the meaning and context of a document, as well as for applications like information
extraction, summarization, and question answering.

Key Concepts in Coreference Resolution

1. Mention Detection:
o Identifying all the phrases or words (mentions) in a text that potentially refer

to entities.
2. Coreference Chains:
o Grouping mentions that refer to the same entity into clusters. Each cluster
represents one unique entity.
3. Types of Coreference:
o Pronouns: E.g., "John arrived. He was late." Here, "He" refers to "John".
o Proper Names: E.g., "Barack Obama" and "Obama".
o Common Nouns: E.g., "The president" and "Barack Obama".
o Synonyms and Hypernyms: E.g., "The car" and "The vehicle".

Steps in Coreference Resolution

1. Mention Extraction:
o Extract all potential mentions from the text. This can include noun phrases,
pronouns, and named entities.
2. Feature Extraction:
o Extract features for each pair of mentions. Features can include grammatical
roles, proximity, gender, number agreement, semantic compatibility, etc.
3. Mention Pair Classification:
o Classify whether pairs of mentions refer to the same entity using machine
learning models. Techniques can range from rule-based methods to neural
networks.
4. Cluster Formation:
o Form clusters of mentions that refer to the same entity based on the pairwise
classifications.

Techniques and Approaches

1. Rule-Based Methods:
o Utilize linguistic rules and heuristics, for example resolving pronouns based
on grammatical and syntactic cues (a toy sketch of this idea follows after this list).
2. Machine Learning Approaches:
o Use supervised learning with annotated datasets. Common models include
decision trees, support vector machines, and neural networks.
o Feature-Based Models: Handcrafted features are used to train classifiers.
o Deep Learning Models: End-to-end models, such as those using LSTM or
transformers, which learn representations automatically from data.
3. Recent Advances:
o Neural Networks: Use of RNNs, LSTMs, and transformers to capture context
and dependencies better.
o Pre-trained Language Models: Models like BERT, GPT-3, and their
derivatives fine-tuned for coreference tasks have shown state-of-the-art
performance.
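
As a toy illustration of the rule-based idea from item 1, the following sketch naively links each pronoun to the most recent preceding PERSON entity. It assumes spaCy and its en_core_web_sm model are installed, and it is only a heuristic sketch, not a real coreference resolver:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John arrived late. He apologized to Mary, and she smiled.")

PRONOUNS = {"he", "him", "his", "she", "her", "hers"}
people = [ent for ent in doc.ents if ent.label_ == "PERSON"]

for token in doc:
    if token.text.lower() in PRONOUNS:
        # Naive heuristic: pick the closest PERSON mention that appears before the pronoun
        candidates = [ent for ent in people if ent.end <= token.i]
        antecedent = candidates[-1].text if candidates else "UNKNOWN"
        print(f"{token.text} -> {antecedent}")

Subject to the model's NER predictions, this would link "He" to "John" and "she" to "Mary"; real resolvers also use gender, number, and semantic compatibility, as described above.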

Datasets

1. OntoNotes:
o A large corpus annotated for coreference resolution, commonly used for
training and evaluating models.
2. ACE (Automatic Content Extraction):
o Another dataset used for coreference resolution among other NLP tasks.

Evaluation Metrics

1. MUC (Message Understanding Conference) Score:


o Measures the number of links (pairs of mentions) correctly identified.
2. B^3 (Bagga and Baldwin):
o Evaluates precision, recall, and F1-score for coreference chains.
3. CEAF (Constrained Entity-Alignment F-Measure):
o Aligns coreference clusters in gold and system responses to evaluate
performance.
4. BLANC (BiLateral Assessment of Noun-Phrase Coreference):
o Averages precision and recall over both coreference and non-coreference
links.

Challenges

1. Ambiguity:
o Determining the correct antecedent for pronouns and other ambiguous
mentions can be difficult.
2. Variability in Language:
o Differences in writing style, use of synonyms, and indirect references add
complexity.
3. Context Understanding:
o Requires deep understanding of context and world knowledge.

Applications

1. Information Extraction:
o Improves the extraction of structured information from unstructured text.
2. Text Summarization:
o Enhances the coherence and cohesion of summaries by correctly linking
entities.
3. Question Answering:
o Ensures accurate linking of entities in questions and answers for better
performance.

Coreference resolution remains an active area of research in NLP, with continuous


improvements being driven by advancements in machine learning and the availability of large
annotated corpora.

Text Classification and Summarization

Text summarization and text classification are two fundamental tasks in natural language
processing (NLP), each with its distinct methodologies and applications.

Text Summarization

Text summarization involves creating a concise and coherent version of a longer document,
capturing its main points. There are two main types of text summarization:

1. Extractive Summarization:

Method: Selects important sentences, phrases, or sections directly from the source document
and combines them to form a summary.

Techniques:

1. Frequency-Based: Sentences containing frequently occurring terms are considered


important.
2. Graph-Based: Techniques like TextRank represent sentences as nodes in a graph,
with edges indicating similarities, and use algorithms to identify the most central
sentences.
3. Machine Learning: Supervised models trained on annotated data to score sentences
based on features like term frequency, position in text, and linguistic cues.
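
A minimal sketch of the frequency-based idea in plain Python (the scoring is deliberately simple and the sample text is illustrative): each sentence is scored by the frequency of the words it contains, and the highest-scoring sentence is kept as a one-sentence summary.

import re
from collections import Counter

text = ("NLP enables computers to process text. "
        "Extractive summarization selects important sentences from the text. "
        "The selected sentences together form the summary.")

# Split into sentences and count word frequencies across the whole text
sentences = re.split(r"(?<=[.!?])\s+", text)
freq = Counter(re.findall(r"\w+", text.lower()))

# Score a sentence by the total frequency of its words
def score(sentence):
    return sum(freq[w] for w in re.findall(r"\w+", sentence.lower()))

summary = max(sentences, key=score)
print(summary)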

2. Abstractive Summarization:

Method: Generates new sentences that capture the essence of the source text, rather than just
extracting parts of it.

Techniques:

1. Seq2Seq Models: Encoder-decoder frameworks where the encoder processes the


input text and the decoder generates the summary.
2. Attention Mechanisms: Improve seq2seq models by allowing the decoder to focus
on specific parts of the input text at each step of the summary generation.
3. Transformers: Pre-trained models like BERT, GPT-3, and T5 fine-tuned for
summarization tasks, leveraging their ability to understand context and generate
coherent text.

Text Classification

Text classification involves assigning predefined categories or labels to a given piece of text.
It has various applications such as sentiment analysis, spam detection, topic categorization,
and more.

Key Steps in Text Classification

Text Preprocessing:

1. Tokenization: Splitting text into words, phrases, or other meaningful elements.


2. Stopword Removal: Removing common but uninformative words (e.g., "and", "the").
3. Stemming and Lemmatization: Reducing words to their base or root forms.
4. Vectorization: Converting text into numerical representations (e.g., TF-IDF, word
embeddings).

Feature Extraction:
1. Bag-of-Words (BoW): Represents text as a set of word counts or binary indicators.
2. TF-IDF (Term Frequency-Inverse Document Frequency): Weighs words based on
their importance in the document and across the corpus.
3. Word Embeddings: Dense vector representations of words (e.g., Word2Vec, GloVe,
FastText).
4. Contextual Embeddings: Captures word meanings in context using models like
BERT, ELMo, and GPT.
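
For concreteness, here is a brief scikit-learn sketch of the Bag-of-Words and TF-IDF representations listed above (assuming scikit-learn is installed; the two-document corpus is only illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "The movie was great and the acting was great",
    "The movie was boring",
]

# Bag-of-Words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# TF-IDF: terms weighted by importance within and across documents
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray())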

Model Training:

1. Traditional Machine Learning Models: Naive Bayes, Support Vector Machines


(SVM), Logistic Regression, Decision Trees, etc.
2. Deep Learning Models: Convolutional Neural Networks (CNNs) for text, Recurrent
Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and
transformers.

Popular Text Classification Techniques

1. Naive Bayes:

 A probabilistic classifier based on Bayes' theorem, often used for spam detection and
sentiment analysis.
Example:

Find the tag for the sentence “She sings very sweet” using Naive Bayes algorithm.

Given Data:

Sentence                                   Category
A Great medicine                           Medicine
She is having sweet voice                  Music
Tulsi is a medicinal plant                 Medicine
Ginger is good for health                  Medicine
Yaman is one of the sweet raga.            Music
Listening to music is good for health      Music

To classify the sentence "She sings very sweet" using the Naive Bayes algorithm, we will
follow these steps:

1. Prepare the Data: Create a dataset with the given sentences and their corresponding
categories.
2. Preprocess the Data: Tokenize the sentences and convert them into a suitable format
for the Naive Bayes classifier.
3. Train the Naive Bayes Classifier: Using the training data, train a Naive Bayes
classifier.
4. Classify the New Sentence: Use the trained classifier to predict the category of the
new sentence.

Python Implementation:

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import make_pipeline

# Training data

data = {

'Sentence': [

'A Great medicine',

'She is having sweet voice',

'Tulsi is a medicinal plant',


'Ginger is good for health',

'Yaman is one of the sweet raga.',

'Listening to music is good for health'

],

'Category': [

'Medicine',

'Music',

'Medicine',

'Medicine',

'Music',

'Music'
]
}

# Create a DataFrame

df = pd.DataFrame(data)

# Prepare the data

X_train = df['Sentence']

y_train = df['Category']

# Create a pipeline with a CountVectorizer and a MultinomialNB

model = make_pipeline(CountVectorizer(), MultinomialNB())

# Train the model

model.fit(X_train, y_train)

# New sentence to classify

new_sentence = ["She sings very sweet"]

# Predict the category

predicted_category = model.predict(new_sentence)
print(predicted_category[0])

2. Support Vector Machines (SVM):

 Effective in high-dimensional spaces and suitable for text classification tasks.

3. Neural Networks:

 CNNs: Capture local patterns in text data.


 RNNs and LSTMs: Handle sequential data, capturing dependencies and order of
words.
 Transformers: Use self-attention mechanisms to capture global context and
relationships in text.

4. Pre-trained Language Models:

Fine-tuning models like BERT, RoBERTa, and GPT for specific classification tasks,
leveraging their deep understanding of language context and semantics.

Applications

1. Text Summarization:

 News Summarization: Summarizing long news articles for quick reading.


 Document Summarization: Creating abstracts for scientific papers, legal documents,
etc.
 Content Aggregation: Summarizing user-generated content from forums, reviews,
and social media.

2. Text Classification:

 Sentiment Analysis: Determining the sentiment (positive, negative, neutral) of a text,


commonly used in social media monitoring and customer feedback analysis.
 Spam Detection: Classifying emails or messages as spam or not.
 Topic Classification: Categorizing articles, blog posts, or documents into predefined
topics or categories.
 Intent Detection: Identifying the intent behind user queries in chatbots and virtual
assistants.

Both text summarization and text classification are integral to numerous real-world
applications, enabling better information retrieval, understanding, and decision-making
processes in various domains.
Tokenization

Tokenization is the process of breaking down a text into smaller units called tokens. These
tokens can be words, phrases, symbols, or other meaningful elements.

Tokenization is a fundamental step in text processing and analysis, as it converts raw text into
a structured form that can be easily analyzed.

Following are the types of tokenization:

 Word Tokenization: Splitting text into individual words.


 Sentence Tokenization: Splitting text into individual sentences.
 Subword Tokenization: Breaking down words into smaller units like prefixes,
suffixes, or even characters.

Techniques

 Whitespace Tokenization: Splitting text based on spaces.


 Punctuation-based Tokenization: Using punctuation marks as delimiters.
 Rule-based Tokenization: Applying language-specific rules to handle special cases
like contractions and hyphenated words.
 Machine Learning-based Tokenization: Using models trained to recognize token
boundaries, useful for complex languages.

Example

Consider the sentence: "Apple Inc. was founded by Steve Jobs and Steve Wozniak in April
1976 in Cupertino, California."

Tokens:

"Apple"

"Inc."

"was"

"founded"

"by"

"Steve"
"Jobs"

"and"

"Steve"

"Wozniak"

"in"

"April"

"1976"

"in"

"Cupertino"

","

"California"

"."

Python Code Implementation

import spacy

# Load the pre-trained spaCy model

nlp = spacy.load("en_core_web_sm")

# Input sentence

sentence = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in April 1976 in
Cupertino, California."

# Process the sentence using spaCy

doc = nlp(sentence)

# Tokenize the sentence

tokens = [token.text for token in doc]

# Print the tokens

print("Tokens:", tokens)
Text Normalization

Text normalization is the process of transforming text into a standard format. It is a crucial
pre-processing step in natural language processing (NLP) to ensure that the text is consistent
and clean before further analysis or model training.

Following are the steps in Text Normalization

 Lowercasing: Converting all characters in the text to lowercase to maintain


uniformity.
 Removing Punctuation: Stripping out punctuation marks which are often not useful
for analysis.
 Removing Stop Words: Eliminating common words (e.g., "and", "the") that may not
carry significant meaning in many contexts.
 Stemming: Reducing words to their root form (e.g., "running" to "run") using
algorithms like Porter or Snowball stemmer.
 Lemmatization: Reducing words to their base or dictionary form (e.g., "better" to
"good"), often more sophisticated than stemming as it considers the context.
 Handling Special Characters: Removing or replacing special characters and
symbols.
 Expanding Contractions: Converting contractions to their expanded forms (e.g.,
"can't" to "cannot").
 Removing Extra Whitespace: Removing unnecessary spaces to avoid irregular
spacing issues.

Example

Consider the sentence: "Running faster and faster, he couldn't believe it!"

Steps:

 Lowercasing: "running faster and faster, he couldn't believe it!"


 Removing Punctuation: "running faster and faster he couldnt believe it"
 Removing Stop Words: "running faster faster couldnt believe"
 Stemming/Lemmatization: "run fast fast could not believe"
 Handling Special Characters: (already handled in punctuation step)
 Expanding Contractions: "running faster and faster he could not believe it"
 Removing Extra Whitespace: (already handled)

Python Code Implementation

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import string

# Load the pre-trained spaCy model


nlp = spacy.load("en_core_web_sm")

# Function for text normalization


def normalize_text(text):
    # Process the text using spaCy
    doc = nlp(text)

    # Lowercase the text
    normalized_text = text.lower()

    # Remove punctuation and special characters
    normalized_text = ''.join([char for char in normalized_text
                               if char not in string.punctuation])

    # Tokenize the text
    tokens = [token.text for token in nlp(normalized_text)]

    # Remove stop words
    tokens = [token for token in tokens if token not in STOP_WORDS]

    # Lemmatize the tokens
    lemmatized_tokens = [token.lemma_ for token in nlp(' '.join(tokens))]

    # Remove extra whitespace
    lemmatized_tokens = [token.strip() for token in lemmatized_tokens if token.strip()]

    return ' '.join(lemmatized_tokens)

# Example sentence
sentence = "Running faster and faster, he couldn't believe it!"

# Normalize the sentence


normalized_sentence = normalize_text(sentence)

# Print the normalized sentence


print("Original Sentence:", sentence)
print("Normalized Sentence:", normalized_sentence)

Part of Speech Tagging

 Part of Speech (POS) tagging involves assigning each word in a sentence its
corresponding part of speech, such as noun, verb, adjective, etc. POS tagging is a
critical step in the syntactic analysis of a language, helping to understand the structure
and meaning of a sentence.
 Lexical syntax refers to the structure and formation of tokens within a language,
covering both their identification and syntactic roles. POS tagging is a part of this
broader area, as it deals with the syntactic roles of words.

Key Concepts

Parts of Speech:

 Noun (NN): A person, place, thing, or idea.


 Verb (VB): An action or state of being.
 Adjective (JJ): A word that describes a noun.
 Adverb (RB): A word that describes a verb, adjective, or other adverbs.
 Pronoun (PRP): A word that takes the place of a noun.
 Preposition (IN): A word that shows the relationship between a noun (or pronoun) and
other words in a sentence.
 Conjunction (CC): A word that joins words, phrases, or clauses.
 Interjection (UH): An exclamation.
 Determiner (DT): A word that introduces a noun.
 Tagset: A predefined set of POS tags used for annotation. One common tagset is the
Penn Treebank Tagset.

Techniques for POS Tagging

 Rule-Based Tagging: Uses a set of hand-crafted rules to assign POS tags based on
word patterns and context.
 Statistical Tagging: Uses probabilistic models like Hidden Markov Models (HMM) to
assign POS tags based on the likelihood of sequences of tags.
 Machine Learning Tagging: Utilizes supervised learning models such as Conditional
Random Fields (CRF) and neural networks to predict POS tags based on annotated
training data.

Example

 Consider the sentence: "Apple Inc. was founded by Steve Jobs and Steve Wozniak in
April 1976 in Cupertino, California."

POS Tags:

 "Apple" (NNP)
 "Inc." (NNP)
 "was" (VBD)
 "founded" (VBN)
 "by" (IN)
 "Steve" (NNP)
 "Jobs" (NNP)
 "and" (CC)
 "Steve" (NNP)
 "Wozniak" (NNP)
 "in" (IN)
 "April" (NNP)
 "1976" (CD)
 "in" (IN)
 "Cupertino" (NNP)
 "," (,)
 "California" (NNP)
 "." (.)

Python Code Implementation:

import spacy

# Load the pre-trained spaCy model

nlp = spacy.load("en_core_web_sm")

# Input sentence

sentence = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in April 1976 in
Cupertino, California."

# Process the sentence using spaCy

doc = nlp(sentence)

# Extract POS tags

pos_tags = [(token.text, token.pos_, token.tag_) for token in doc]

# Print the POS tags

for token, pos, tag in pos_tags:
    print(f"Token: {token}, POS: {pos}, Detailed POS: {tag}")


Hidden Markov Model

A Hidden Markov Model (HMM) is a statistical model used to describe systems that are
modeled by a Markov process with hidden states. HMMs are widely used in various fields
such as speech recognition, handwriting recognition, and natural language processing,
particularly in part-of-speech tagging, named entity recognition, and other sequence labeling
tasks.

Key Concepts

1. States: The possible conditions or positions that can be taken by the system (e.g.,
parts of speech in a sentence). In HMM, these states are not directly visible (hidden).
2. Observations: The data or outputs that can be directly observed (e.g., the words in a
sentence).
3. Transition Probabilities: The probabilities of transitioning from one state to another.
4. Emission Probabilities: The probabilities of an observation being generated from a
particular state.
5. Initial State Probabilities: The probabilities of the system starting in each state.

Components of HMM

1. Set of States (S): A finite set of hidden states.


2. Observations (O): A sequence of observations.
3. Transition Probability Matrix (A): A = {a_ij}, where a_ij = P(s_j | s_i) is the
probability of transitioning from state s_i to state s_j.
4. Emission Probability Matrix (B): B = {b_j(o_t)}, where b_j(o_t) = P(o_t | s_j) is the
probability of observing o_t given state s_j.
5. Initial State Distribution (π): π = {π_i}, where π_i = P(s_i) is the probability of
starting in state s_i.

Applications in NLP

1. Part-of-Speech Tagging: Assigning parts of speech to each word in a sentence.


2. Named Entity Recognition: Identifying and classifying named entities in text.
3. Speech Recognition: Decoding the sequence of phonemes from audio signals.

Example: Part-of-Speech Tagging

Consider the sequence: "the cat sat on mat" (the observation sequence used in the code below).

We aim to tag each word with its part of speech using an HMM.

Python Code Implementation:

import numpy as np
from hmmlearn import hmm

# Define states and observations
states = ["Noun", "Verb", "Preposition", "Determiner"]
n_states = len(states)

observations = ["the", "cat", "sat", "on", "mat"]
n_observations = len(observations)

# Transition matrix (probability of moving from one POS tag to another)
A = np.array([[0.3, 0.2, 0.1, 0.4],
              [0.1, 0.4, 0.2, 0.3],
              [0.3, 0.2, 0.4, 0.1],
              [0.25, 0.25, 0.25, 0.25]])

# Emission matrix (probability of each word being emitted by each POS tag)
B = np.array([[0.4, 0.1, 0.2, 0.2, 0.1],
              [0.1, 0.3, 0.4, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.1, 0.6],
              [0.5, 0.1, 0.1, 0.2, 0.1]])

# Initial probabilities
start_prob = np.array([0.4, 0.3, 0.2, 0.1])

# Create HMM model (hmm.CategoricalHMM in recent hmmlearn versions;
# older versions use hmm.MultinomialHMM for discrete observations)
model = hmm.CategoricalHMM(n_components=n_states)
model.startprob_ = start_prob
model.transmat_ = A
model.emissionprob_ = B

# Define observation sequence: "the cat sat on mat" mapped to indices [0, 1, 2, 3, 4]
obs_seq = np.array([[0, 1, 2, 3, 4]]).T

# Decode the hidden states with the Viterbi algorithm
logprob, hidden_states = model.decode(obs_seq, algorithm="viterbi")

# Print the result
print("Observation sequence:", observations)
print("Hidden states sequence:", [states[i] for i in hidden_states])


Explanation: The Viterbi decoding step finds the most probable sequence of hidden states (POS tags) for the observed word sequence, using the initial, transition, and emission probabilities defined above.

Dependency Parsing

Dependency parsing is a type of syntactic parsing that focuses on the relationships between
words in a sentence. Unlike phrase structure parsing, which emphasizes constituent structures
(like noun phrases and verb phrases), dependency parsing is concerned with how words are
connected through grammatical relationships.
Key Concepts

Dependencies:

 Head: A word that governs another word.


 Dependent: A word that is governed by another word.
 Dependency Relation: The grammatical relationship between a head and its
dependent (e.g., subject, object).

Root:

 The main verb or action in a sentence, which governs all other words either directly or
indirectly.

Arcs:

Directed edges in a dependency tree that point from heads to dependents.

Example

Consider the sentence: "She enjoys playing tennis."

The dependency relationships might be:

"enjoys" (head) → "She" (subject)

"enjoys" (head) → "playing" (object)

"playing" (head) → "tennis" (object)

Dependency Tree

enjoys
|-- She (nsubj)
|-- playing (dobj)
    |-- tennis (dobj)

Importance of Dependency Parsing

 Semantic Analysis: Understanding relationships between words helps in extracting the meaning.
 Information Retrieval: Improves search algorithms by understanding the syntactic structure of queries.
 Machine Translation: Ensures syntactically and semantically correct translations.

Python Code:

import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Input sentence
sentence = "She enjoys playing tennis."

# Process the sentence with spaCy
doc = nlp(sentence)

# Print dependency parsing results
for token in doc:
    print(f"Token: {token.text}\tHead: {token.head.text}\tDependency: {token.dep_}")

# Visualize the dependency tree (starts a local web server)
spacy.displacy.serve(doc, style="dep")

Corpus, Tokens and N-grams

1. Corpus:

A corpus (plural: corpora) is a large and structured set of texts. It serves as a dataset for
training and evaluating NLP models. A corpus can contain various types of text data such as
articles, books, emails, tweets, etc.

Example

 General Corpus: A collection of general texts, like Wikipedia articles.
 Domain-specific Corpus: Texts from a specific domain, such as medical records or legal documents.

2. Tokens

Tokenization is the process of splitting text into individual pieces, called tokens. Tokens can
be words, subwords, or characters. Tokenization is one of the first steps in text processing.

Example

Given the sentence: "She enjoys playing tennis."

 Tokens: ["She", "enjoys", "playing", "tennis", "."]

3. Contribution of n-grams to NLP in Language Modeling and Text Analysis:

An n-gram is a sequence of n successive items in a text document, which may include words, numbers, symbols, and punctuation. N-gram models are useful in many text analytics applications where sequences of words are relevant, such as sentiment analysis, text classification, and text generation. N-gram modeling is one of the many techniques used to convert text from an unstructured format to a structured format. An alternative to n-grams is word embedding techniques, such as word2vec.

Types of N-grams:

 Unigrams: Single words. Example: "She", "enjoys", "playing", "tennis".
 Bigrams: Pairs of consecutive words. Example: "She enjoys", "enjoys playing", "playing tennis".
 Trigrams: Triplets of consecutive words. Example: "She enjoys playing", "enjoys playing tennis".
Applications:
 Language Modeling: n-grams are used to predict the next item in a sequence,
improving the accuracy of machine translation, speech recognition, and text
prediction.
 Text Analysis: They help in understanding the context and frequency of phrases in
large text corpora, useful in information retrieval and text mining.
 Spell Checking and Autocorrect: n-grams help in identifying and correcting
misspelled words based on common usage patterns.
 Sentiment Analysis: n-grams aid in detecting sentiment by analyzing the
combination of words and phrases.

Advantages:

 Simplicity: Easy to implement and understand.
 Efficiency: Useful for capturing local dependencies within text.

Limitations:

 Data Sparsity: Higher-order n-grams require large amounts of data to capture meaningful patterns, leading to sparsity issues.
 Context Limitation: n-grams do not capture long-range dependencies well, as they only consider fixed-length sequences.

Language Modeling:

n-grams play a critical role in language modeling by predicting the likelihood of a word
given its preceding words.
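
As a minimal illustration (a sketch on a toy corpus, with probabilities estimated by simple counting), a bigram language model estimates the probability of a word given the word that precedes it:

from collections import Counter

# Toy corpus (already tokenized and lowercased)
tokens = "she enjoys playing tennis she plays tennis every weekend".split()

# Count unigrams and bigrams
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
def bigram_prob(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print("P('tennis' | 'playing') =", bigram_prob("playing", "tennis"))  # 1.0 in this toy corpus
print("P('plays' | 'she') =", bigram_prob("she", "plays"))            # 0.5 in this toy corpus
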
Text Analysis:

In text analysis, n-grams assist in understanding the structure and frequency of phrases within
large text corpora:

 Frequency Analysis: n-grams provide insights into common word combinations and
phrases by analyzing their frequency of occurrence. This is useful in various
applications like information retrieval, where identifying common phrases can
improve search results.
 Contextual Understanding: By examining n-grams, one can better understand the
local context of words within a text, aiding tasks like sentiment analysis and keyword
extraction.
Python Code:

import nltk

from nltk.tokenize import word_tokenize

from nltk.util import ngrams

from collections import Counter

# Download required NLTK data files

nltk.download('punkt')

# Sample text (corpus)

text = "She enjoys playing tennis. She plays it every weekend."

# Tokenize text

tokens = word_tokenize(text)

print("Tokens:", tokens)

# Generate Bigrams

bigrams = list(ngrams(tokens, 2))

print("Bigrams:", bigrams)

# Generate Trigrams

trigrams = list(ngrams(tokens, 3))

print("Trigrams:", trigrams)

# Count frequency of N-grams

bigram_freq = Counter(bigrams)

trigram_freq = Counter(trigrams)
print("Bigram Frequencies:", bigram_freq)

print("Trigram Frequencies:", trigram_freq)

Example 2: Using a TextBlob object in Python, construct trigrams for the following text:

"Nelson Rolihlahla Mandela was a South African anti-apartheid activist and politician who served as the first president of South Africa." Write suitable Python code for it.

from textblob import TextBlob

text = "Nelson Rolihlahla Mandela was a South African anti-apartheid activist and politician who served as the first president of South Africa."

blob = TextBlob(text)

trigrams = blob.ngrams(n=3)

for trigram in trigrams:
    print(trigram)

Stemming

Stemming is a method in text processing that eliminates prefixes and suffixes from words, transforming them into their fundamental or root form. The main objective of stemming is to streamline and standardize words, enhancing the effectiveness of natural language processing tasks.

Some examples of words that stem to the root word "like" include:

-> "likes"
-> "liked"
-> "likely"
-> "liking"
Types of stemmer:

1.Porter Stemmer: Proposed in 1980 by Martin Porter, it's one of the most popular
stemming methods. It's fast and widely used in English-based applications like data mining
and information retrieval. However, it can produce stems that are not real words.
Example: EED -> EE means "if the word has at least one vowel and consonant plus EED ending, change the ending to EE", as 'agreed' becomes 'agree'.

2.Lovins Stemmer: Developed by Lovins in 1968, this stemmer removes the longest suffix
from a word. It's fast and handles irregular plurals well, but it may not always produce valid
words from stems.

Example: sitting -> sitt -> sit

3.Dawson Stemmer: An extension of the Lovins stemmer, it stores suffixes in a reversed


order indexed by length and last letter. It's fast and covers more suffixes but can be complex
to implement.

4.Krovetz Stemmer: Proposed in 1993 by Robert Krovetz, this stemmer converts plural
forms to singular and past tense to present, removing 'ing' suffixes. It's light and can be used
as a pre-stemmer, but it may be inefficient for large documents.

Example: 'children' -> 'child'

5.Xerox Stemmer: Capable of processing extensive datasets and generating valid words, but
it can over-stem due to reliance on lexicons, making it language-dependent.

Example:

'children' -> 'child'

'understood' -> 'understand'

'whom' -> 'who'

'best' -> 'good'

6.N-Gram Stemmer: Breaks words into segments of length 'n' and applies statistical analysis
to identify patterns. It's language-dependent and requires space to create and index n-grams.
Example: 'INTRODUCTIONS' for n=2 becomes: *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S*

7.Snowball Stemmer (Porter2): Multi-lingual and more aggressive than Porter Stemmer, it's
faster and supports various languages. It's based on the Snowball programming language.

Example:

Input: running

Output: run

The Snowball Stemmer (for English) is based on the Porter Stemmer algorithm and operates
similarly by removing common suffixes. It recognizes "ing" as a suffix and removes it to
obtain the base form "run".

8.Lancaster Stemmer: More aggressive and dynamic, it's faster but can be confusing with
small words. It saves rules externally and uses an iterative algorithm.

Example: Input: running

Output: run

The Lancaster Stemmer is more aggressive than the Porter Stemmer. It removes common
suffixes based on a different set of rules. In this case, it identifies "ing" as a suffix and
removes it to produce the base form "run".

9.Regexp Stemmer: Uses regular expressions to define custom rules for stemming. It offers
flexibility and control over the stemming process for specific applications.

Example:

Input: running

Output: run

The regex-based stemmer uses regular expressions to identify and remove specific suffixes.
In this example, it recognizes "ing" as a suffix and removes it, resulting in the base form
"run".
Stemming Operations in Python:

import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer

nltk.download('punkt')

# Sample text
text = "The quick brown foxes were running and jumping over the lazy dogs."

# Tokenize the text into words
words = nltk.word_tokenize(text)

# Initialize the stemmers
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")
lancaster_stemmer = LancasterStemmer()
regexp_stemmer = RegexpStemmer('ing$|s$|ed$', min=4)

# Apply stemming using different stemmers
porter_stemmed_words = [porter_stemmer.stem(word) for word in words]
snowball_stemmed_words = [snowball_stemmer.stem(word) for word in words]
lancaster_stemmed_words = [lancaster_stemmer.stem(word) for word in words]
regexp_stemmed_words = [regexp_stemmer.stem(word) for word in words]

# Print the results
print("Original Words:", words)
print("Porter Stemmed Words:", porter_stemmed_words)
print("Snowball Stemmed Words:", snowball_stemmed_words)
print("Lancaster Stemmed Words:", lancaster_stemmed_words)
print("Regexp Stemmed Words:", regexp_stemmed_words)
Lemmatization

Lemmatization techniques in natural language processing (NLP) involve methods to identify and transform words into their base or root forms, known as lemmas. These approaches contribute to text normalization, facilitating more accurate language analysis and processing in various NLP applications.

Three types of lemmatization techniques are:

1. Rule Based Lemmatization


Rule-based lemmatization involves the application of predefined rules to derive the base or
root form of a word. Unlike machine learning-based approaches, which learn from data, rule-
based lemmatization relies on linguistic rules and patterns.
Here's a simplified example of rule-based lemmatization for English verbs:
Rule: For regular verbs ending in "-ed", remove the "-ed" suffix.
Example:
 Word: "walked"
 Rule Application: Remove "-ed"
 Result: "walk"
2. Dictionary-Based Lemmatization
Dictionary-based lemmatization relies on predefined dictionaries or lookup tables to map
words to their corresponding base forms or lemmas. Each word is matched against the
dictionary entries to find its lemma. This method is effective for languages with well-defined
rules.
Suppose we have a dictionary with lemmatized forms for some words (a toy lookup of this kind is used in the sketch below):
 'running' -> 'run'
 'better' -> 'good'
 'went' -> 'go'
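
To make the first two techniques concrete, the following minimal sketch (the suffix rule and the dictionary entries are illustrative toy examples, not a complete resource) first tries a dictionary lookup and then falls back to a simple "-ed" rule:

# Toy lemma dictionary (dictionary-based lemmatization)
lemma_dict = {"running": "run", "better": "good", "went": "go"}

def simple_lemmatize(word):
    # Dictionary-based: return the known lemma if the word is listed
    if word in lemma_dict:
        return lemma_dict[word]
    # Rule-based: strip a regular "-ed" suffix (e.g. "walked" -> "walk")
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    return word

for w in ["walked", "running", "went", "tennis"]:
    print(w, "->", simple_lemmatize(w))
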
3. Machine Learning-Based Lemmatization
Machine learning-based lemmatization leverages computational models to automatically
learn the relationships between words and their base forms. Unlike rule-based or dictionary-
based approaches, machine learning models, such as neural networks or statistical models,
are trained on large text datasets to generalize patterns in language.
Example:
Consider a machine learning-based lemmatizer trained on diverse texts. When encountering the word 'went', the model, having learned patterns, predicts the base form as 'go'. Similarly, for 'happier', the model deduces 'happy' as the lemma. The advantage lies in the model's ability to adapt to varied linguistic nuances and handle irregularities, making it robust for lemmatizing diverse vocabularies.
Python Code:

import nltk

from nltk.corpus import wordnet

from nltk.stem import WordNetLemmatizer

import spacy

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')  # needed for nltk.pos_tag below

# Sample text

text = "The quick brown foxes were running and jumping over the lazy dogs."

# Tokenize the text into words

words = nltk.word_tokenize(text)

# Initialize NLTK's WordNet Lemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
# Function to get the WordNet POS tag for a word
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Lemmatize using NLTK's WordNet Lemmatizer with POS tagging
nltk_lemmatized_words = [wordnet_lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]

# Initialize spaCy's English model

nlp = spacy.load('en_core_web_sm')

# Process the text with spaCy

doc = nlp(text)

# Lemmatize using spaCy

spacy_lemmatized_words = [token.lemma_ for token in doc]

# Print the results

print("Original Words:", words)

print("NLTK Lemmatized Words:", nltk_lemmatized_words)

print("spaCy Lemmatized Words:", spacy_lemmatized_words)


Processing with stop words

Stop words are common words such as "and", "the", "is", "in", etc., that are often removed
from text because they carry little meaningful information for tasks like text classification,
sentiment analysis, and more.

Python Code:

import nltk

from nltk.corpus import stopwords

import spacy

# Download necessary NLTK data

nltk.download('punkt')

nltk.download('stopwords')

# Sample text

text = "The quick brown foxes were running and jumping over the lazy dogs."

# Tokenize the text into words

words = nltk.word_tokenize(text)

# Initialize NLTK's stop words

nltk_stop_words = set(stopwords.words('english'))

# Remove stop words using NLTK

nltk_filtered_words = [word for word in words if word.lower() not in nltk_stop_words]

# Initialize spaCy's English model

nlp = spacy.load('en_core_web_sm')
# Process the text with spaCy

doc = nlp(text)

# Remove stop words using spaCy

spacy_filtered_words = [token.text for token in doc if not token.is_stop]

# Print the results

print("Original Words:", words)

print("NLTK Filtered Words:", nltk_filtered_words)

print("spaCy Filtered Words:", spacy_filtered_words)


MODULE-4

4.1 Introduction to term frequency (TF) and Inverse Document Frequency (IDF)

4.2 Word embedding: Word2Vec, Fast Text

Term Frequency and Inverse Document Frequency

Term Frequency (TF) and Inverse Document Frequency (IDF) are fundamental concepts in
information retrieval and text mining. They are used to evaluate how important a word is to a
document in a collection (or corpus). The combination of these two metrics forms the TF-IDF
score, which is widely used in search engines, text mining, and various NLP applications.

Term Frequency (TF)

 Term Frequency measures how frequently a term (word) appears in a document.
 The more times a term appears in a document, the higher its TF score. However, the raw frequency of a term is often normalized to prevent bias towards longer documents.

Formula:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

Python Code:
from collections import Counter
import numpy as np

def compute_tf(doc):
    tf_dict = {}
    word_counts = Counter(doc)
    total_terms = len(doc)
    for word, count in word_counts.items():
        tf_dict[word] = count / total_terms
    return tf_dict

# Example document
doc = "the quick brown fox jumps over the lazy dog".split()

tf = compute_tf(doc)
print("Term Frequency (TF):")
print(tf)

Inverse Document Frequency (IDF)

Inverse Document Frequency measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms like "is", "of", and "that" may appear frequently in many documents but carry little importance. IDF diminishes the weight of such common terms and increases the weight of rare terms.

Formula:

IDF(t) = log(N / (1 + df(t))), where N is the total number of documents and df(t) is the number of documents containing the term t (the 1 in the denominator avoids division by zero).
Python Code:

def compute_idf(docs):
    idf_dict = {}
    N = len(docs)
    all_words = set([word for doc in docs for word in doc])
    for word in all_words:
        containing_docs = sum(1 for doc in docs if word in doc)
        idf_dict[word] = np.log(N / (1 + containing_docs))
    return idf_dict

# Example documents
docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "never jump over the lazy dog quickly".split(),
    "a quick brown dog outpaces a quick fox".split()
]

idf = compute_idf(docs)
print("Inverse Document Frequency (IDF):")
print(idf)

TF-IDF

Definition: TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is the product of the two metrics: TF-IDF(t, d) = TF(t, d) × IDF(t).

Python Code:

def compute_tfidf(tf, idf):
    tfidf_dict = {}
    for word, tf_val in tf.items():
        tfidf_dict[word] = tf_val * idf.get(word, 0)
    return tfidf_dict

# Compute TF-IDF for each document
tfidf_docs = []
for doc in docs:
    tf = compute_tf(doc)
    tfidf = compute_tfidf(tf, idf)
    tfidf_docs.append(tfidf)

print("TF-IDF:")
for i, tfidf in enumerate(tfidf_docs):
    print(f"Document {i+1}:")
    print(tfidf)

Significance of TF and IDF in NLP:


Term Frequency (TF)

Significance:

 Local Importance: TF measures the frequency of a term in a single document, providing a sense of how significant that term is within the specific context of that document. This is crucial because frequently occurring words in a document are likely to be central to its meaning.
 Baseline for Weighting: TF serves as a baseline metric for many weighting schemes in text analysis and retrieval systems. It helps in identifying common terms that are essential to understanding the document's content.

Use Cases:

 Text Summarization: TF helps in identifying key terms that are important for
summarizing the content of a document.
 Document Clustering: High TF values indicate terms that can be used for clustering
similar documents together.

Inverse Document Frequency (IDF)

Significance:

 Global Importance: IDF measures the rarity of a term across all documents in the
corpus. Words that appear in many documents (common words) get lower IDF scores,
while words that appear in fewer documents (rare words) get higher IDF scores. This
helps in distinguishing important terms from common ones.
 Balancing Commonality: By giving less weight to common words and more weight to
rare ones, IDF helps in reducing the noise caused by frequent but uninformative terms
(e.g., "the", "is", "and").
 Enhancing Specificity: IDF boosts the significance of terms that are more informative
and specific to particular documents, making it easier to distinguish between
documents based on their unique content.

Use Cases:

 Information Retrieval: In search engines, IDF helps in ranking documents by ensuring that results with rare but significant terms are given higher importance.
 Text Classification: IDF helps in identifying unique features that can be used to classify documents into different categories.

Significance of TF-IDF:

 Relevance Scoring: TF-IDF scores help in ranking documents by their relevance to a query in information retrieval systems.
 Feature Extraction: TF-IDF is widely used for extracting features from text data in machine learning models. It helps in representing text in a way that highlights important terms.
 Dimensionality Reduction: By focusing on significant terms, TF-IDF reduces the dimensionality of text data, making it more manageable for computational models.
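
In practice, TF-IDF features are usually computed with a library rather than by hand. As a minimal sketch, scikit-learn's TfidfVectorizer builds the whole term-document matrix in a few lines; note that its exact formula differs slightly from the manual version above (it uses a smoothed IDF and L2-normalizes each document vector by default), so the numbers will not match exactly:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the quick brown fox jumps over the lazy dog",
    "never jump over the lazy dog quickly",
    "a quick brown dog outpaces a quick fox"
]

# Fit the vectorizer and transform the documents into a TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(tfidf_matrix.toarray().round(3))     # one row per document, one column per term
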

Word Embedding: Word2vec, FastText

Word embeddings are a type of word representation that allows words to be represented as
vectors in a continuous vector space. This representation helps capture the semantic meaning
of words based on their context in large corpora. Two popular models for generating word
embeddings are Word2Vec and FastText.

Word2Vec

Word2Vec is a popular word embedding technique that aims to represent words as continuous vectors in a high-dimensional space. It introduces two models: Continuous Bag of Words (CBOW) and Skip-gram, each contributing to the learning of vector representations.

1. Model Architecture:
 Continuous Bag of Words (CBOW): In CBOW, the model predicts a target word
based on its context. The context words are used as input, and the target word is the
output. The model is trained to minimize the difference between the predicted and
actual target words.
 Skip-gram: Conversely, the Skip-gram model predicts context words given a target
word. The target word serves as input, and the model aims to predict the words that
are likely to appear in its context. Like CBOW, the goal is to minimize the difference
between the predicted and actual context words.

2. Neural Network Training:

Both CBOW and Skip-gram models leverage neural networks to learn vector representations.
The neural network is trained on a large text corpus, adjusting the weights of connections to
minimize the prediction error. This process places similar words closer together in the
resulting vector space.

3. Vector Representations:

Once trained, Word2Vec assigns each word a unique vector in the high-dimensional space.
These vectors capture semantic relationships between words. Words with similar meanings or
those that often appear in similar contexts have vectors that are close to each other, indicating
their semantic similarity.
4. Advantages and Disadvantages:

Advantages:

 Captures semantic relationships effectively.
 Efficient for large datasets.
 Provides meaningful word representations.

Disadvantages:

 May struggle with rare words.
 Ignores word order.

Python Code:

from gensim.models import Word2Vec

# Sample corpus (each sentence is a list of tokens)
corpus = [
    ['apple', 'banana', 'orange', 'grape'],
    ['cat', 'dog', 'elephant', 'lion'],
    ['house', 'car', 'bicycle', 'train']
]

# Train Word2Vec model (sg=0 uses CBOW, the default; sg=1 would use Skip-gram)
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)

# Find similar words
similar_words = model.wv.most_similar('cat')
print("Words similar to 'cat':", similar_words)


FastText

FastText is an advanced word embedding technique developed by Facebook AI Research (FAIR) that extends the Word2Vec model. Unlike Word2Vec, FastText not only considers whole words but also incorporates subword information, i.e. parts of words such as character n-grams. This approach enables the handling of morphologically rich languages and captures information about word structure more effectively.

1. Subword Information:

FastText represents each word as a bag of character n-grams in addition to the whole word
itself. This means that the word “apple” is represented by the word itself and its constituent n-
grams like “ap”, “pp”, “pl”, “le”, etc. This approach helps capture the meanings of shorter
words and affords a better understanding of suffixes and prefixes.

2. Model Training:

Similar to Word2Vec, FastText can use either the CBOW or Skip-gram architecture.
However, it incorporates the subword information during training. The neural network in
FastText is trained to predict words (in CBOW) or context (in Skip-gram) not just based on
the target words but also based on these n-grams.

3. Handling Rare and Unknown Words:

A significant advantage of FastText is its ability to generate better word representations for
rare words or even words not seen during training. By breaking down words into n-grams,
FastText can construct meaningful representations for these words based on their subword
units.

4. Advantages and Disadvantages:

Advantages:

 Better representation of rare words.
 Capable of handling out-of-vocabulary words.
 Richer word representations due to subword information.

Disadvantages:

 Increased model size due to n-gram information.
 Longer training times compared to Word2Vec.

Python Code:

from gensim.models import FastText

# Example corpus
corpus = [
    "I like to eat apples",
    "Apples are delicious fruits",
    "I enjoy eating bananas",
    "Bananas are healthy snacks"
]

# Tokenize the corpus into words
tokenized_corpus = [sentence.split() for sentence in corpus]

# Train FastText model
model = FastText(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)

# Find similar words
similar_words = model.wv.most_similar('apples')
print("Words similar to 'apples':")
for word, similarity in similar_words:
    print(f"{word}: {similarity}")
MODULE-5

5.1 Introduction to deep learning models, ELMO (Embedding from Language Models)

5.2 BERT (Bidirectional Encoder Representations from Transformers), Basic Neural Networks

Introduction to Deep Learning Models

Deep learning models are a class of machine learning algorithms that are capable of learning
complex patterns and representations from large amounts of data. They have revolutionized
many fields, including Natural Language Processing (NLP), computer vision, speech
recognition, and more. Following are some common deep learning models:

1. Convolutional Neural Networks (CNNs)

Use Case: Computer Vision

Key Features:

 Consist of convolutional layers followed by pooling layers.
 Learn hierarchical representations of image features.
 Effective in tasks like image classification, object detection, and image segmentation.

Example Applications: Image recognition (e.g., classifying objects in images), image generation (e.g., deep dream), medical image analysis.

2. Recurrent Neural Networks (RNNs)

Use Case: Sequential Data Processing

Key Features:

 Designed to handle sequential data with temporal dependencies.
 Utilize feedback loops to process sequences of inputs.
 Suitable for tasks like sequence generation, time series prediction, and natural language processing.

Example Applications: Language modeling, machine translation, sentiment analysis, speech recognition.

3. Long Short-Term Memory (LSTM) Networks

Use Case: Sequential Data Processing (Improved RNNs)

Key Features:
 A type of RNN designed to address the vanishing gradient problem.
 Maintain long-term dependencies in sequential data.
 Include forget, input, and output gates to control information flow.

Example Applications: Language modeling, sentiment analysis, time series prediction, speech
recognition.
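
As a small illustrative sketch (not tied to any particular dataset; the vocabulary size, sequence length, and layer sizes are arbitrary assumptions), an LSTM-based binary sentiment classifier can be defined in Keras as follows:

import tensorflow as tf

vocab_size = 10000   # assumed vocabulary size
seq_len = 100        # assumed (padded) input sequence length

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),       # learn 64-dimensional word embeddings
    tf.keras.layers.LSTM(64),                        # LSTM layer with 64 hidden units
    tf.keras.layers.Dense(1, activation="sigmoid")   # binary sentiment output
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.build(input_shape=(None, seq_len))
model.summary()
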

4. Gated Recurrent Units (GRUs)

Use Case: Sequential Data Processing (Alternative to LSTMs)

Key Features:

 Similar to LSTMs but with a simplified architecture.
 Have update and reset gates to control information flow.
 Easier to train and computationally less expensive than LSTMs.

Example Applications: Language modeling, machine translation, sentiment analysis.

5. Transformer Models

Use Case: Natural Language Processing

Key Features:

 Attention mechanism to focus on relevant parts of input sequences.
 Self-attention and multi-head attention mechanisms.
 Used in state-of-the-art language models like BERT, GPT, and T5.

Example Applications: Language translation, text summarization, question answering, sentiment analysis.
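
As a quick illustration of how accessible these models have become, the Hugging Face transformers library exposes pre-trained Transformer models through a one-line pipeline (a default English sentiment model is downloaded on first use):

from transformers import pipeline

# Sentiment analysis with a pre-trained Transformer model
classifier = pipeline("sentiment-analysis")

result = classifier("Transformer models have revolutionized natural language processing.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
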

6. Generative Adversarial Networks (GANs)

Use Case: Generative Modeling

Key Features:

 Composed of a generator and a discriminator network.


 Learn to generate realistic data samples (e.g., images, text) from noise.
 Used in image generation, style transfer, and data augmentation.

Example Applications: Image generation (e.g., faces, artworks), data augmentation, anomaly
detection.
7. Variational Autoencoders (VAEs)

Use Case: Generative Modeling (Variational Approach)

Key Features:

 Composed of an encoder and a decoder network.


 Learn a probabilistic latent representation of input data.
 Used in generating new data samples and data reconstruction.

Example Applications: Image generation, data compression, anomaly detection.

8. Deep Reinforcement Learning (DRL)

Use Case: Decision Making in Complex Environments

Key Features:

 Learn to make sequential decisions to maximize a reward signal.


 Combine deep learning with reinforcement learning techniques.
 Used in game playing, robotics, and control systems.

Example Applications: Game playing (e.g., AlphaGo), robotics control, autonomous driving.

Each of these deep learning models has its strengths and is suited for specific tasks.
Understanding their principles and architectures can help in selecting the right model for a
given problem and designing effective machine learning solutions.

ELMO (Embedding from Language Models)

"Embedding from Language Models" (ELMo) is a contextualized word representation model


developed by researchers at the Allen Institute for Artificial Intelligence (AI2). ELMo
introduces a ground-breaking approach to word embeddings in Natural Language Processing
(NLP) by providing contextualized representations of words based on their surrounding
context. It is also called as Embedding input from Language Models(EIMO)

ELMo Overview:

 Contextualized Embeddings: ELMo generates word embeddings that are context-aware. This means that the embedding of a word can vary depending on its context within a sentence or document. For example, the word "bank" would have different embeddings in "river bank" versus "bank account".
 Deep Bidirectional Model: ELMo is based on a deep bidirectional language model
(biLM). It utilizes multiple layers of bidirectional LSTM (Long Short-Term Memory)
units, allowing it to capture complex linguistic patterns and dependencies in both
forward and backward directions.
 Layer-wise Combination: ELMo combines representations from different layers of
the biLM to generate embeddings. Each layer captures different levels of linguistic
information, such as syntactic structures and semantic meanings. By aggregating
information from multiple layers, ELMo produces rich and contextually sensitive
embeddings.
 Pre-trained Model: ELMo is typically pre-trained on a large corpus of text data
using a language modeling objective. This pre-training allows ELMo to learn general
language patterns and semantics before being fine-tuned on specific downstream
tasks.

Differences from Traditional Embedding Techniques:

 Contextual Sensitivity: Unlike traditional embedding techniques such as Word2Vec or GloVe, which provide static representations for words, ELMo's embeddings are contextually sensitive. They capture the varying meanings of words based on their context, leading to more nuanced and accurate representations.
 Deep Learning Architecture: ELMo employs a deep learning architecture with
multiple layers of bidirectional LSTMs. This architecture enables ELMo to capture
long-range dependencies and semantic relationships, which are often challenging for
traditional methods to handle.
 Layer-wise Combination: Traditional embedding techniques typically provide a
single vector representation for each word. In contrast, ELMo combines information
from multiple layers of its deep bidirectional model, allowing it to capture
hierarchical and nuanced linguistic features.
 Transfer Learning Capabilities: ELMo's pre-training and fine-tuning capabilities
make it suitable for transfer learning. It can leverage its pre-trained knowledge to
improve performance on various NLP tasks, even with limited task-specific data.
Advantages of ELMo:

 Improved Contextual Understanding: ELMo's contextualized embeddings excel in tasks requiring a deep understanding of language context, such as sentiment analysis, question answering, and natural language inference.
 Transfer Learning: ELMo's pre-trained model can be fine-tuned on specific tasks or
domains, leveraging the pre-learned linguistic knowledge for improved performance.
 State-of-the-Art Performance: ELMo embeddings, when incorporated into NLP
models like neural networks, have achieved state-of-the-art performance on various
benchmarks and tasks.
 Multilingual Applications: ELMo's contextualized embeddings can be applied to
multilingual NLP tasks, benefiting from its ability to capture context-specific
linguistic features across different languages.

BERT(Bidirectional Encoder Representations from Transformers)

BERT, or Bidirectional Encoder Representations from Transformers, is a powerful and widely-used natural language processing (NLP) model developed by Google AI. It has significantly advanced the field of NLP by introducing bidirectional context understanding and pre-training techniques.

Overview of BERT:

 Bidirectional Context Understanding: Unlike previous models like Word2Vec or GloVe, which process text in one direction (e.g., left-to-right or right-to-left), BERT is bidirectional. It considers the entire context of a word by processing it in both directions, capturing richer contextual information.
 Transformer Architecture: BERT is based on the Transformer architecture, which
uses self-attention mechanisms to process input data in parallel and capture long-
range dependencies effectively. This architecture allows BERT to handle complex
language structures and relationships.
 Pre-training and Fine-tuning: BERT is pre-trained on a large corpus of text using
unsupervised learning objectives, such as masked language modeling (MLM) and
next sentence prediction (NSP). After pre-training, BERT can be fine-tuned on
specific NLP tasks with labeled data, leading to improved performance.
 Contextual Word Embeddings: BERT generates contextualized word embeddings
that consider the surrounding context of each word. This results in more accurate
representations, especially for tasks requiring understanding of nuances and context.

Impact and Contributions of BERT:

 State-of-the-Art Performance: BERT has achieved state-of-the-art performance on a wide range of NLP tasks, including question answering, sentiment analysis, named entity recognition, text classification, and more. It has significantly surpassed previous models in terms of accuracy and generalization.
 Transfer Learning: BERT's pre-trained model can be fine-tuned on specific tasks
with relatively small amounts of task-specific data. This facilitates transfer learning
and reduces the need for large annotated datasets for each task.
 Multilingual Support: BERT has been extended to support multiple languages,
making it valuable for multilingual NLP applications. Multilingual BERT models can
handle diverse languages and tasks with consistent performance.
 Model Variants: Over time, several variants and improvements to BERT have been
developed, such as RoBERTa, DistilBERT, and ALBERT. These variants optimize
different aspects of BERT, such as training efficiency, model size, and performance
on specific tasks.

Python Code:

from transformers import BertTokenizer, BertForSequenceClassification

import torch

# Load pre-trained BERT model and tokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# Example text

text = "This is an example sentence."

# Tokenize the text

inputs = tokenizer(text, return_tensors='pt')

# Perform inference using the BERT model
# Note: the classification head is randomly initialized until the model is fine-tuned,
# so these output probabilities are not meaningful on their own.
outputs = model(**inputs)
predictions = torch.softmax(outputs.logits, dim=-1).detach().numpy()

print(predictions)

Zipf's Law

Zipf's Law states that the frequency of a word in a corpus is inversely proportional to its rank. In simpler terms, the most frequent word appears roughly twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. Mathematically, Zipf's Law can be expressed as:

f(r) ∝ 1/r, or equivalently P(r) = C / r

where r is the rank of the word, f(r) is its frequency, P(r) is its probability (relative frequency), and C is a corpus-dependent constant.
Example:
The total number of words in a document is 80. The word "Commendable" appeared 12 times in the document. Using Zipf's Law, calculate the probability of the word "Commendable" at rank 6.
Ans:
The observed probability of the word is its relative frequency: P = 12 / 80 = 0.15.
By Zipf's Law the probability at rank r is inversely proportional to r, so at rank 6:
P(6) = 0.15 / 6 = 0.025
Hence the probability of "Commendable" at rank 6 is approximately 0.025.
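
A small sketch (on any sample text; the sentence below is just an illustration) can check how closely a text follows Zipf's Law by multiplying each word's rank by its frequency; under the law this product stays roughly constant:

from collections import Counter

text = ("the quick brown fox jumps over the lazy dog "
        "the dog barks and the fox runs away the end")

# Count word frequencies and sort them in decreasing order
freqs = Counter(text.split()).most_common()
total = sum(count for _, count in freqs)

for rank, (word, count) in enumerate(freqs, start=1):
    # Under Zipf's Law, rank * frequency should be roughly constant
    print(f"rank {rank}: {word:<6} freq={count}  rank*freq={rank * count}  P={count / total:.3f}")
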
Heap's Law

 Heap's Law is an empirical law that describes the relationship between the vocabulary size V (number of unique words) in a document or corpus and the total number of words N. It is commonly used in linguistics and natural language processing to understand how the vocabulary grows as the document size increases.
 Heap's Law suggests that as the document size increases, the vocabulary size also increases, but at a decreasing rate. In other words, the larger the document or corpus, the slower the vocabulary grows relative to the document size.

Formula:

V = k × N^b

where V is the vocabulary size (number of unique words), N is the total number of words, and k and b are empirical constants (typically 10 ≤ k ≤ 100 and 0.4 ≤ b ≤ 0.6).

Example:

Using Heap's Law, calculate the number of unique words in a natural language document having 625 total words.

Ans:
For simplicity, let's assume k = 10 and b = 0.5, which are commonly used values within the typical range for these constants.

Using these values, we can calculate the vocabulary size (V) for a document with N = 625 words:

V = 10 × 625^0.5
V = 10 × 25
V = 250

So, according to Heap's Law with k = 10 and b = 0.5, the estimated number of unique words in a document with 625 total words would be approximately 250 words.
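
The same calculation can be expressed as a short sketch, which also makes it easy to see how the estimate changes for other document sizes or other assumed values of k and b:

# Heap's Law: V = k * N**b (k and b here are the assumed constants from the example)
def heaps_vocabulary_size(N, k=10, b=0.5):
    return k * N ** b

print(heaps_vocabulary_size(625))    # 250.0 unique words for a 625-word document
print(heaps_vocabulary_size(10000))  # estimate for a larger, 10,000-word document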
