
Natural Language Processing

Mahmmoud Mahdi
Introduction
Introduction to NLP

● What is NLP?
● How does it relate to ML and technology?
● What are the typical tasks?
What’s it all about?

speak, read, write, plan, think

All in natural language


What is Natural Language Processing?

In a way, if we unlock the key to understanding how language works, we unlock the key to understanding how the human brain works.
Where NLP Stands
Background
Corpora, Tokens, and Types

● All NLP methods, be they classic or modern, begin with a text dataset, also called a corpus (plural: corpora).
● A corpus usually contains raw text and any metadata associated with the text. The raw text is a sequence of characters (bytes), but most of the time it is useful to group those characters into contiguous units called tokens.
● The process of breaking a text down into tokens is called tokenization.
● Types are the unique tokens present in a corpus. The set of all types in a corpus is its vocabulary.
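The token/type distinction above can be sketched in a few lines of Python. The tokenizer here is deliberately simplistic (split on non-alphanumeric characters); real tokenizers handle contractions, URLs, and much more:

```python
import re

def tokenize(text):
    # Lowercase and split on runs of non-alphanumeric characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]

corpus = "the cat sat on the mat"
tokens = tokenize(corpus)     # 6 tokens
vocabulary = set(tokens)      # 5 types: the duplicate "the" collapses

print(len(tokens), len(vocabulary))  # 6 5
```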
In machine learning parlance, the text along with its metadata is called an instance or data point. The corpus, a collection of instances, is also known as a dataset.
Feature engineering

The process of understanding the linguistics of a language and applying it to solving NLP problems is called feature engineering.
Unigrams, Bigrams, Trigrams, …, N-grams

● N-grams are fixed-length (n) consecutive token sequences occurring in the text.
● A bigram has two tokens, a unigram one. Generating n-grams from a text is straightforward.
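Generating n-grams really is a one-liner over a sliding window; a minimal sketch (the function name is illustrative):

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["mary", "slapped", "the", "witch"]
print(ngrams(tokens, 2))
# [('mary', 'slapped'), ('slapped', 'the'), ('the', 'witch')]
```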
Lemmas and Stems

Lemmas are the root forms of words.

Consider the verb fly.

● It can be inflected into many different words: flew, flies, flown, flying, and so on.
● fly is the lemma for all of these seemingly different words.

This reduction is called lemmatization.


Lemmas and Stems

Stemming is the poor man’s lemmatization. It involves the use of handcrafted rules to strip the endings of words, reducing them to a common form called stems.
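A toy illustration of the rule-based idea in Python; the suffix list below is invented for the example, whereas a real stemmer such as the Porter stemmer uses a much larger, carefully ordered rule set:

```python
def crude_stem(word):
    # Strip a few common suffixes, keeping at least 3 characters of stem.
    # This is only a sketch of handcrafted-rule stemming.
    for suffix in ("ing", "ed", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["dancing", "danced", "dancer"]])
# ['danc', 'danc', 'danc']
```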
Categorizing Sentences and Documents

Typical problems (supervised document classification):

● Assigning topic labels,
● Predicting the sentiment of reviews,
● Filtering spam emails,
● Language identification, and
● Email triaging.
Categorizing Words: POS Tagging

A common example of categorizing words is part-of-speech (POS) tagging.
Categorizing Spans: Chunking and Named Entity Recognition

We might want to identify the noun phrases (NP) and verb phrases (VP) in text. This is called chunking or shallow parsing. Shallow parsing aims to derive higher-order units composed of the grammatical atoms, like nouns, verbs, adjectives, and so on.

A named entity is a string mention of a real-world concept like a person, location, organization, drug name, and so on.
Structure of Sentences

Whereas shallow parsing identifies phrasal units, the task of identifying the relationship between them is called parsing.
NLP with Classification and Vector Spaces

1. Sentiment Analysis with Logistic Regression


2. Sentiment Analysis with Naïve Bayes
3. Vector Space Models
4. Machine Translation and Document Search
Sentiment Analysis with Logistic Regression
Types of Machine Learning
Supervised ML (training)
Sentiment Analysis
Vocabulary
Feature Extraction
Problem with sparse representations
Positive and negative counts
Word frequency in classes
Feature extraction

Freqs: dictionary mapping from (word, class) to frequency
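A minimal sketch of building such a frequency dictionary, assuming class 1 means positive, class 0 means negative, and naive whitespace tokenization:

```python
from collections import defaultdict

def build_freqs(tweets, labels):
    # Map (word, class) pairs to how often the word appears
    # in tweets of that class.
    freqs = defaultdict(int)
    for tweet, y in zip(tweets, labels):
        for word in tweet.lower().split():
            freqs[(word, y)] += 1
    return freqs

freqs = build_freqs(["great movie", "great fun", "sad movie"], [1, 1, 0])
print(freqs[("great", 1)], freqs[("movie", 0)])  # 2 1
```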


Feature Extraction
Preprocessing

● Eliminate handles and URLs.
● Tokenize the string into words.
● Remove stop words like "and", "is", "a", "on", etc.
● Stem, i.e. convert every word to its stem: dancer, dancing, and danced all become "danc". You can use the Porter stemmer for this.
● Convert all your words to lowercase.
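The steps above can be sketched as one small pipeline; the stop-word list and the "ing"-stripping stand-in for the Porter stemmer are simplifications for the example:

```python
import re

# Tiny illustrative stop-word list; NLTK ships a much fuller one.
STOP_WORDS = {"and", "is", "a", "on", "the", "at"}

def preprocess(tweet):
    tweet = re.sub(r"https?://\S+", "", tweet)      # remove URLs
    tweet = re.sub(r"@\w+", "", tweet)              # remove handles
    tokens = re.findall(r"[a-z]+", tweet.lower())   # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # crude stemming stand-in for the Porter stemmer:
    return [t[:-3] if t.endswith("ing") else t for t in tokens]

print(preprocess("@bob Dancing at the party is fun! http://t.co/xyz"))
# ['danc', 'party', 'fun']
```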
Preprocessing: stop words and punctuation
Preprocessing: stemming and lowercasing
General Overview

● Your feature matrix X is of dimension (m, 3), as follows.
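A sketch of mapping one tweet to such a 3-dimensional row [bias, positive count sum, negative count sum], assuming a freqs dictionary keyed by (word, class) with class 1 positive and class 0 negative:

```python
def extract_features(tweet_tokens, freqs):
    # One row of X: [bias, sum of positive counts, sum of negative counts].
    pos = sum(freqs.get((w, 1), 0) for w in tweet_tokens)
    neg = sum(freqs.get((w, 0), 0) for w in tweet_tokens)
    return [1.0, float(pos), float(neg)]

freqs = {("happy", 1): 3, ("sad", 0): 4, ("happy", 0): 1}
print(extract_features(["happy", "sad"], freqs))
# [1.0, 3.0, 5.0]
```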


General Implementation
Logistic Regression
Overview of logistic regression
Logistic Regression Overview

Logistic regression makes use of the sigmoid function, which outputs a probability between 0 and 1.

The sigmoid function with weight parameter θ and input x^(i) is defined as follows:

h(x^(i); θ) = 1 / (1 + e^(−θ·x^(i)))
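A minimal plain-Python version of the sigmoid and the resulting prediction h(x; θ) (function names are illustrative):

```python
import math

def sigmoid(z):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # h(x; theta) = sigmoid(theta . x); theta and x are same-length lists.
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

print(sigmoid(0))  # 0.5
```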
Overview of logistic regression
LR: Training
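The training loop can be sketched as batch gradient descent on the logistic-regression cost; this is an illustrative plain-Python version (redefining sigmoid so the snippet is self-contained), not an optimized implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, alpha=0.1, iters=1000):
    # Batch gradient descent. X: list of feature rows (first column is
    # the bias feature 1.0), y: list of 0/1 labels, alpha: learning rate.
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(iters):
        preds = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) for x in X]
        for j in range(len(theta)):
            grad = sum((preds[i] - y[i]) * X[i][j] for i in range(m)) / m
            theta[j] -= alpha * grad
    return theta
```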
LR: Testing
Cost Function for LR
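The standard cost for logistic regression is the averaged binary cross-entropy, J(θ) = −(1/m) Σ [y log h + (1 − y) log(1 − h)]; a small sketch:

```python
import math

def cost(y, y_hat):
    # Binary cross-entropy averaged over m examples.
    # y: true 0/1 labels, y_hat: predicted probabilities in (0, 1).
    m = len(y)
    return -sum(
        yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
        for yi, hi in zip(y, y_hat)
    ) / m

print(round(cost([1, 0], [0.9, 0.1]), 4))  # 0.1054
```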
Your Turn

● Dataset: https://data.world/data-society/twitter-user-data
● Perform feature engineering
● Convert the text to vectors
● Apply logistic regression
Questions?
