Introduction to Natural Language Processing (NLP): ThS. Đặng Nhân Cách, Email: cach.dang@ut.edu.vn

INTRODUCTION TO NATURAL LANGUAGE PROCESSING (NLP)

ThS. Đặng Nhân Cách

Email: cach.dang@ut.edu.vn
Scope of discussion

- Language of focus: English
- Focus on: Text preprocessing
- Programming language: Python 3
What is NLP?

- Natural language processing (NLP) is a field of computer science.
- It is concerned with the interaction between computers and human (natural) languages.

https://en.wikipedia.org/wiki/Natural_language_processing
Flow chart of the NLP
CONCEPTS AND EXAMPLES
Raw text data cleaning stages

● The raw data needs to be cleaned.
● The accuracy of the final results depends on the quality of the input data.
● There are many steps in cleaning data:
○ Tokenizing
○ Stemming
○ Lemmatization
○ Stopwords removal
○ ...
Corpus

● A corpus is a collection of texts or text files.
● A corpus serves as the raw data for text analysis.
Lowercasing

● Lowercasing all your text data is a common first step: it makes differently capitalized variants of the same word match.
● E.g.:
○ Tomato -> tomato
○ ToMaTo -> tomato
○ TomATO -> tomato
○ TOmATO -> tomato
Lowercasing

In Python, lower() is a built-in string method that returns a lowercased copy of a string.
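A minimal sketch of lowercasing with lower(), reusing the "Tomato" variants from the previous slide. The variable names are illustrative only.

```python
# Differently capitalized variants collapse to one form after lowercasing.
words = ["Tomato", "ToMaTo", "TomATO", "TOmATO"]
lowered = [w.lower() for w in words]
print(lowered)

# lower() returns a new string; the original string is left unchanged.
text = "Today Is My Birthday."
lowered_text = text.lower()
print(lowered_text)
```

Because strings are immutable in Python, the result of lower() has to be assigned to a variable (or used directly) for the change to have any effect.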


Tokenization

● A token is any sequence of characters (letters, numbers, symbols, ...), typically delimited by spaces.
● Tokenization splits text into tokens according to a specific splitting rule.
● E.g.: "today is my birthday."

Word_tokenize: ['today', 'is', 'my', 'birthday', '.']

Sentence_tokenize: ['today is my birthday.']
Tokenization

NLTK (Natural Language Toolkit) is a Python library that covers most common NLP tasks.

Word tokenize

Sentence tokenize
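The two splitting rules can be sketched with the standard library alone. This is a simplified stand-in, not NLTK's implementation: the function names word_tokenize_simple and sent_tokenize_simple and both regular expressions are assumptions made for illustration. In NLTK itself, the corresponding functions are nltk.word_tokenize and nltk.sent_tokenize, which handle many more cases (abbreviations, contractions, and so on).

```python
import re

def word_tokenize_simple(text):
    """Split text into word tokens and single punctuation tokens."""
    # Either a run of letters/digits/apostrophes, or one other non-space symbol.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def sent_tokenize_simple(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokenize_simple("today is my birthday."))
print(sent_tokenize_simple("today is my birthday."))
print(sent_tokenize_simple("Today is my birthday. I am happy!"))
```

The naive sentence rule breaks on abbreviations such as "Dr. Smith", which is one reason to prefer a trained tokenizer for real data.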
Stop-words removal

● Stop-words are a set of commonly used words that appear in nearly every document.
● These words are not informative. E.g.: "is", "a", "the", …
● These words matter to human readers but contribute little to analysis.
● E.g.: "This is a sample text"

Stop-words removal: "sample text"


Stop-words removal
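A minimal sketch of stop-word filtering. The STOP_WORDS set below is a tiny hand-picked list assumed for illustration; NLTK ships a fuller English list via nltk.corpus.stopwords.words('english'), which requires downloading the corpus first.

```python
# Tiny illustrative stop-word list (an assumption, not NLTK's list).
STOP_WORDS = {"this", "is", "a", "an", "the", "of", "and"}

def remove_stop_words(text):
    """Return the tokens of text that are not stop-words."""
    tokens = text.split()
    # Compare in lowercase so "This" matches "this".
    return [t for t in tokens if t.lower() not in STOP_WORDS]

filtered = remove_stop_words("This is a sample text")
print(" ".join(filtered))
```

Lowercasing before the lookup matters here: without it, a capitalized "This" would slip through the filter.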
Stemming

● Several words in a document may share the same root but appear in different forms.
● Stemming is the process of removing suffixes and prefixes from a word to obtain its root (stem).
● E.g.: connect, connection, connected, connections, connects.
○ Stemmed_word: connect
Stemming

Sometimes stemming produces a string that is not a real word (e.g., the Porter stemmer reduces "studies" to "studi").
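The idea behind stemming can be sketched as a naive suffix stripper. This is a toy illustration, not the Porter algorithm: the SUFFIXES list, the simple_stem name, and the min_stem_len parameter are assumptions, and only suffixes (not prefixes) are handled. For real use, NLTK provides nltk.stem.PorterStemmer.

```python
# Illustrative suffix list, longest first so "ions" is tried before "ion" and "s".
SUFFIXES = ("ions", "ion", "ing", "ed", "es", "s")

def simple_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

forms = ["connect", "connection", "connected", "connections", "connects"]
stems = [simple_stem(w) for w in forms]
print(stems)

# Blind suffix stripping can yield a non-word.
print(simple_stem("studies"))
```

All five forms reduce to "connect", while "studies" becomes "studi", which shows how rule-based stripping can produce strings that are not dictionary words.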


Lemmatization

● Lemmatization is very similar to stemming, but its goal is to remove inflectional endings only and map a word to its dictionary form (lemma).
● Stemming can often create completely strange words, while lemmatization returns actual words.
● E.g.: better -> good, troubling -> trouble, studies -> study.
Lemmatization
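Lemmatization relies on a vocabulary lookup rather than on blind suffix stripping. The sketch below uses a three-entry table built from the examples in these slides; the LEMMAS table and the simple_lemmatize name are assumptions for illustration only. In NLTK, nltk.stem.WordNetLemmatizer plays this role, backed by the WordNet dictionary and an optional part-of-speech hint.

```python
# Hand-written lookup table (an assumption); a real lemmatizer uses a full dictionary.
LEMMAS = {
    "better": "good",
    "troubling": "trouble",
    "studies": "study",
}

def simple_lemmatize(word):
    """Return the lemma from the table, or the lowercased word if it is unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["better", "troubling", "studies", "day"]:
    print(w, "->", simple_lemmatize(w))
```

Unlike a stemmer, every output here is a real word, and irregular forms such as "better" can be mapped correctly because the mapping comes from a dictionary.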
Parts of Speech (POS) Tagging

● A part of speech (POS) describes how a word is used in a sentence.
● There are 8 major POS in English grammar: nouns, pronouns, verbs, adjectives, adverbs, conjunctions, prepositions, and interjections.
● E.g.: "Today is a good day" (articles such as "a" are traditionally grouped with adjectives).

Today   is     a         good        day
Noun    Verb   Article   Adjective   Noun


Parts of Speech (POS) Tagging
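A minimal sketch of POS tagging as a dictionary lookup, using the tags from the example sentence. The TAG_LEXICON table and the simple_pos_tag name are assumptions for illustration; a lookup tagger like this ignores context. NLTK offers a trained tagger through nltk.pos_tag, which takes a list of tokens and returns Penn Treebank tags.

```python
# Hand-written word-to-tag table (an assumption) covering only the example sentence.
TAG_LEXICON = {
    "today": "Noun",
    "is": "Verb",
    "a": "Article",
    "good": "Adjective",
    "day": "Noun",
}

def simple_pos_tag(sentence):
    """Tag each whitespace-separated token; unknown words get 'Unknown'."""
    tokens = sentence.rstrip(".").split()
    return [(t, TAG_LEXICON.get(t.lower(), "Unknown")) for t in tokens]

tagged = simple_pos_tag("Today is a good day")
for token, tag in tagged:
    print(f"{token}\t{tag}")
```

A pure lookup cannot disambiguate words whose tag depends on context (for example "book" as a noun or a verb), which is why practical taggers are trained on annotated text.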
ANY QUESTIONS?
THANK YOU FOR YOUR ATTENTION!
What is Named-entity recognition - NER?
