Tokenization

Date:- 18/07/2022

Tokenization
(Types & Methods)

Submitted By:
Name: Ayush Jain
Enroll. No.: 0103AL213D01
Branch: CSE-AIML
Sem.: 5th

Submitted To:
Prof. Tanushree Dholpuriya

Table of Contents:

• Write down the different ways to tokenize text in Python.
• Describe Python tokens and the character set.
• How can you implement string, character, numeric, Boolean, and special literals?
• Describe literal collections.
• Describe operators in tokens.

Different ways to tokenize text in Python

1. Tokenization using Python’s split() function


The split() method is the most basic way to tokenize text. It returns a list of strings obtained by breaking the
given string at the specified separator. When no separator is given, split() breaks the string at every run of
whitespace (spaces, tabs, and newlines).
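
As a minimal sketch of the behaviour described above (the sample sentence and variable names are made up for illustration), split() can be called with and without a separator:

```python
text = "Tokenization splits   text into smaller units. It is a first step in NLP."

# No argument: split on any run of whitespace, so the triple space is ignored.
word_tokens = text.split()

# Explicit separator: a rough sentence split on ". " (period followed by a space).
sentence_tokens = text.split(". ")

print(word_tokens)
print(sentence_tokens)
```

Note that punctuation stays attached to the words ("units." keeps its period), which is one reason the library tokenizers in the following sections are usually preferred for real text.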
2. Tokenization using NLTK
This is a library you will appreciate more the longer you work with text data. NLTK, short for Natural Language
Toolkit, is a library written in Python for symbolic and statistical Natural Language Processing.

NLTK contains a module called tokenize, which provides two main kinds of tokenization:
Word tokenize: We use the word_tokenize() function to split a sentence into tokens or words.
Sentence tokenize: We use the sent_tokenize() function to split a document or paragraph into sentences.
3. Tokenization using the spaCy library
spaCy is an open-source library for advanced Natural Language Processing (NLP). It supports more than 49
languages and is designed for state-of-the-art processing speed.
4. Tokenization using Keras

• By default, Keras converts all letters to lowercase before tokenizing the text.
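
The lowercasing step can be illustrated without installing Keras. The sketch below is not Keras code: it is a plain-Python approximation, under the assumption that the default behaviour is "lowercase, replace punctuation with spaces, split on spaces". The function name simple_word_sequence is hypothetical.

```python
import string

def simple_word_sequence(text):
    """Roughly imitate a lowercase-then-split word tokenizer (illustrative only)."""
    # Step 1: lowercase everything before tokenizing.
    text = text.lower()
    # Step 2: turn punctuation (except the apostrophe), tabs and newlines into spaces.
    filtered = string.punctuation.replace("'", "") + "\t\n"
    table = str.maketrans({ch: " " for ch in filtered})
    # Step 3: split on spaces and drop the empty strings left by repeated spaces.
    return [tok for tok in text.translate(table).split(" ") if tok]

tokens = simple_word_sequence("Keras Lowers The CASE, before tokenizing!")
print(tokens)
```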
Describe Python tokens and character set.
Every programming language uses a character set. A character set is a standard that defines which characters
(for example, English letters and the digits 0 to 9) can be used in a source program, and how each of them is
interpreted when the program runs (i.e. converted to its binary equivalent).

• Letters – A-Z, a-z (and letters from most other languages)

• Digits – 0 to 9

• Special Characters – Space, operators (+, -, *, %, = etc.), separators ((), [], {}, comma, full stop (.) etc.),
and all other symbols such as ' " / \ % ^ & @ ~ ! etc.

• White Space Characters – Blank space, tab, carriage return (↵), newline, form feed

• Other Characters – Python can process all ASCII and Unicode characters as part of data and
literals (constants)
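
The categories listed above can be checked with the built-in string methods. The following is a small sketch; the classify helper and the sample string are made up for illustration and only approximate the grouping given in the list.

```python
def classify(ch):
    """Return a rough category for a single character."""
    if ch.isalpha():
        return "letter"       # A-Z, a-z, and letters of other languages
    if ch.isdigit():
        return "digit"        # 0 to 9 (and other Unicode digits)
    if ch.isspace():
        return "whitespace"   # space, tab, newline, carriage return, form feed
    return "special"          # operators, separators and other symbols

sample = "x1 = ñ + 2\t# ok"
for ch in sample:
    print(repr(ch), classify(ch))
```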
Implementations of types of literals.
1. String Literals:
The text written in single, double, or triple quotes represents the string literals in Python.

2. Character Literals:
A character literal is also a string literal: it is a single character enclosed in single or double quotes.
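
A short sketch of string and character literals follows; the variable names are illustrative only.

```python
single = 'Hello'            # single quotes
double = "Hello"            # double quotes
multi_line = """Line one
Line two"""                 # triple quotes may span several lines

# Python has no separate character type: this is a string of length 1.
char = 'A'

print(single == double)
print(type(char).__name__, len(char))
print(multi_line.splitlines())
```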
3. Numeric Literals:
These are the literals written in the form of numbers. Python supports the following numeric literals:
Integer Literal: A whole number with no fractional part. It can be positive, negative, or zero, and it can be
written in decimal, binary, octal, or hexadecimal form.

Float Literal: A real number, positive or negative, that includes a fractional part.

Complex Literal: A number of the form a+bj, where a represents the real part and b represents the
imaginary part.
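
The numeric literal forms can be sketched as follows; the values are arbitrary examples chosen so that the four integer notations denote the same number.

```python
# Integer literals: the same value in four notations.
decimal_int = 42
binary_int = 0b101010
octal_int = 0o52
hex_int = 0x2A
negative_int = -42          # unary minus applied to the literal 42

# Float literals: a fractional part or an exponent.
float_value = 3.14
exp_float = 1.5e3           # 1.5 * 10**3

# Complex literal: Python writes the imaginary unit as j.
complex_value = 3 + 4j

print(decimal_int == binary_int == octal_int == hex_int)
print(exp_float)
print(complex_value.real, complex_value.imag)
```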

4. Boolean Literals:
Boolean literals have only two values in Python. These are True and False.

5. Special Literals:
Python has a special literal ‘None’. It is used to denote nothing, no values, or the absence of value.
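
A brief sketch of the Boolean and special literals; the variable names are illustrative.

```python
is_ready = True
is_done = False
result = None               # no value has been assigned yet

# bool is a subclass of int, so True behaves like 1 in arithmetic.
print(isinstance(is_ready, int))
print(True + True)

# None is a single object; the usual test is "is None".
print(result is None)
```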
Describe literal collections.
Literal collections in Python include lists, tuples, dictionaries, and sets.

1. List: A sequence of elements written in square brackets with commas in between. The elements can be of
any data type and can be changed after creation.

2. Tuple: A sequence of comma-separated elements or values in round brackets. The values can be of any data
type but can't be changed after creation.

3. Dictionary: A collection of key-value pairs written in curly braces. Keys must be unique, and since Python 3.7
a dictionary keeps its keys in insertion order.

4. Set: An unordered collection of unique elements written in curly braces '{}'.
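
The four collection literals can be sketched as below; the names and values are made up for illustration.

```python
# List: mutable, square brackets.
numbers = [1, 2, 3]
numbers[0] = 10

# Tuple: immutable, round brackets.
point = (1, 2)
tuple_is_immutable = False
try:
    point[0] = 5            # item assignment is not allowed on a tuple
except TypeError:
    tuple_is_immutable = True

# Dictionary: key-value pairs in curly braces.
ages = {"Ann": 30, "Bob": 25}
ages["Cy"] = 41

# Set: unique elements in curly braces; duplicates are dropped.
unique = {1, 2, 2, 3}
empty_set = set()           # note: {} creates an empty dictionary, not a set

print(numbers, point, ages, unique)
```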


Describe operators in tokens.
Operators: These are the tokens that perform an operation in an expression. The values or variables on which the
operation is applied are called operands. Operators can be unary or binary. Unary operators act on a
single operand (for example, the complement operator ~), while binary operators need two operands.
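
To see operators as tokens, Python's standard tokenize module can list them for a line of source code. This is a small sketch; the expression and variable names are arbitrary.

```python
import io
import tokenize

source = "result = -a + b * 2\n"

# Collect every token that the tokenizer reports as type OP.
# Note: tokenize also reports delimiters such as "=" under the OP type.
operator_tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type == tokenize.OP
]
print(operator_tokens)

a, b = 3, 4
unary_minus = -a            # unary: one operand
complement = ~a             # unary bitwise complement
binary_sum = a + b          # binary: two operands
print(unary_minus, complement, binary_sum)
```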
THANK YOU!
