
NLP Module: Feature Engineering & Text Representation
NLP Process

Text Processing: Clean up the text to make it easier to use and more consistent, increasing prediction accuracy later on.

Feature Engineering & Text Representation: Learn how to extract information from text and represent it numerically.

Learning Models: Use learning models to identify parts of speech, entities, sentiment, and other aspects of the text.
Feature Engineering & Text Representation

One-Hot Encoding: Transforms a categorical feature into many binary features.

Bag of Words Model (using CountVectorizer): Generalization of one-hot encoding for a string of text.

N-Gram Encoding: Captures word order in a vector model.

TFIDF: Converts a collection of raw documents to a matrix of TFIDF features.
What is Feature Engineering & What is Text Representation?
Feature engineering is the process of transforming raw data to improve the accuracy of models by creating new features from existing data

Text representation is numerically representing text to make it mathematically computable
One Hot Encoding
Encodes categorical data as real numbers such that the magnitude of each dimension is meaningful

For each distinct possible value, a new feature is created

Example: a 'kind' column with values ['cat', 'dog', 'cat'] becomes two binary columns, kind_cat = [1, 0, 1] and kind_dog = [0, 1, 0]
One Hot Encoding in Scikit-Learn

from sklearn.preprocessing import OneHotEncoder

# df is assumed to be a pandas DataFrame with categorical
# columns 'name' and 'kind' (see the sketch below)
oh_enc = OneHotEncoder()
oh_enc.fit(df[['name', 'kind']])

# transform() returns a sparse matrix; .todense() makes it viewable
oh_enc.transform(df[['name', 'kind']]).todense()

One Hot Encoding in Pandas

# get_dummies builds one binary column per distinct value
pd.get_dummies(df[['name', 'kind']])
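
Neither snippet above defines df. A minimal self-contained sketch, assuming a toy DataFrame (the 'name' and 'kind' column names come from the calls above; the row values are made up):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical pet data with two categorical columns
df = pd.DataFrame({'name': ['fluffy', 'rex', 'fluffy'],
                   'kind': ['cat', 'dog', 'cat']})

oh_enc = OneHotEncoder()
oh_enc.fit(df[['name', 'kind']])

# one binary column per distinct value: fluffy, rex, cat, dog
print(oh_enc.transform(df[['name', 'kind']]).todense())

# the pandas equivalent
print(pd.get_dummies(df[['name', 'kind']]))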
Bag of Words Model
Extracts features from text by counting word occurrences

Stop words are not included, word order is lost, and the encoding is sparse

Example:
Bag of Words Model using Scikit-Learn

from sklearn.feature_extraction.text import CountVectorizer

sample_text = ['This is the first document.',
               'This document is the second document.',
               'This is the third document.']

vectorizer = CountVectorizer(stop_words="english")
vectorizer.fit(sample_text)

# to see which words were kept (get_feature_names_out replaces the
# deprecated get_feature_names in recent scikit-learn versions)
print("Words:", list(enumerate(vectorizer.get_feature_names_out())))
N-gram Encoding
Extracts features from text while capturing local word order by defining counts over sliding windows

Example (n = 2): the bigrams of 'This is the first document' are 'this is', 'is the', 'the first', and 'first document'
N-gram Encoding using Scikit-Learn

from sklearn.feature_extraction.text import CountVectorizer

sample_text = ['This is the first document.',
               'This document is the second document.',
               'This is the third document.']

# ngram_range=(1, 2) keeps both unigrams and bigrams
bigram = CountVectorizer(ngram_range=(1, 2))
bigram.fit(sample_text)

# to see which features were kept
print("Words:", list(enumerate(bigram.get_feature_names_out())))
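
The n-gram counts can be inspected the same way; a sketch continuing the snippet above:

# rows are documents, columns are the unigram and bigram features
print(bigram.transform(sample_text).toarray())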
TFIDF Vectorizer
Converts a collection of raw documents to a matrix of TFIDF features

What are TFIDF Features?

TFIDF stands for term frequency-inverse document frequency; it represents text data by weighting each word by its importance relative to the other words in the text

2 parts:

TF = (# of times term t appears in a document) / (total # of terms in the document)

IDF = log10((total # of documents) / (# of documents containing term t))


TFIDF Vectorizer (cont.)

TFIDF = TF * IDF

TF represents how frequently the word shows up in a document

IDF represents how rare the word is across the corpus: rare words receive higher weights because they better distinguish documents
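
A worked example of the two parts (the numbers here are made up): suppose a term appears 3 times in a 100-word document and shows up in 1,000 of the 10,000,000 documents in a corpus:

import math

tf = 3 / 100                          # term frequency: 0.03
idf = math.log10(10_000_000 / 1_000)  # log10(10,000) = 4.0
print(tf * idf)                       # TFIDF weight: 0.12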
TFIDF Vectorizer Encoding using Scikit-Learn

from sklearn.feature_extraction.text import TfidfVectorizer

sample_text = ['This is the first document.',
               'This document is the second document.',
               'This is the third document.']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sample_text)

# vocabulary kept by the vectorizer
print(vectorizer.get_feature_names_out())
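
To read the weights per document, the matrix can be wrapped in a DataFrame (a sketch continuing the snippet above; the pandas import is assumed). Note that TfidfVectorizer smooths the IDF, uses the natural log, and L2-normalizes each row by default, so its weights differ from the plain log10 formula shown earlier.

import pandas as pd

# one row per document, one column per vocabulary word
weights = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(weights.round(2))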
