Data - Preprocessing - 3

Uploaded by

Madina Dates

0% found this document useful (0 votes)

1 views4 pages

Original Title

4. data_preprocessing_3

Copyright

Available Formats

PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pdf or txt

0% found this document useful (0 votes)

1 views4 pages

Data - Preprocessing - 3

Uploaded by

Madina Dates

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pdf or txt

Jump to Page

You are on page 1of 4

Search inside document

Big Data Analytics

Data Preprocessing

Prof. Dr. Fazlul Hasan Siddiqui

Dept. of CSE, DUET, Gazipur
BSc:IUT; MSc:BUET; PhD:ANU (Australia)
siddiqui@duet.ac.bd
Source: https://towardsdatascience.com/introduction-to-data-
preprocessing-in-machine-learning-a9fa83a5dc9d
Standardization
>> In the example below, a model could give more weightage to Salary, which is not the
ideal scenario as Age is also an integral factor here. In order to avoid this issue, we
perform Standardization.

>> Here we transform our values such that the mean of the values is 0 and 1. To do that
we calculate the mean and standard deviation of the values and then for each data
point we just subtract the mean and divide it by standard deviation.
Handling Nominal Categorical Variables
One-Hot Encoding
>> Create ’n’ columns where n is the number of unique values that the nominal variable
can take.

>> Ex — If color can take Blue, Green and White then we will just create three new
columns namely — color_blue, color_green and color_white and if the color is green
then the values of color_blue and color_white column will be 0 and value of color_green
column will be 1 .

>> So out of the n columns, only one column can have value = 1 and the rest all will
have value = 0.

>> Drawbacks: Multicollinearity

Multicollinearity
>> Multicollinearity occurs in our dataset when we have features that are strongly
dependent on each other.
>> Ex- color_blue, color_green and color_white which are all dependent on each other
and it can impact our model.
>> How to identify: Plot a pair plot and you can observe the relationships between
different features. If you get a linear relationship between 2 features then they are
strongly correlated with each other and there is multicollinearity in your dataset.
>> Avoid: drop the first column of color.

Simple Data Science (R)
From Everand
Simple Data Science (R)
Narayana Nemani
Rating: 5 out of 5 stars
5/5 (1)
Data Mining Graded Assignment: Problem 1: Clustering Analysis
Document39 pages
Data Mining Graded Assignment: Problem 1: Clustering Analysis
rakesh sandhyapogu
100% (3)
Data Mining Project
Document22 pages
Data Mining Project
Ranadip Guha
No ratings yet
Introduction To Data Science Project Report: 1 Preprocessing
Document10 pages
Introduction To Data Science Project Report: 1 Preprocessing
Bangtan gurl
No ratings yet
Explain Briefly The Stages in Data Processing
Document7 pages
Explain Briefly The Stages in Data Processing
Khundongbam Suresh
No ratings yet
Learneverythingai
Document14 pages
Learneverythingai
nasby18
No ratings yet
TEAM DS Final Report
Document14 pages
TEAM DS Final Report
Gurucharan Reddy
No ratings yet
Top 90+ Data Science Interview Questions and Answers (2024)
Document38 pages
Top 90+ Data Science Interview Questions and Answers (2024)
nithinjc
No ratings yet
Supervised Alogorithms
Document17 pages
Supervised Alogorithms
Suchismita Sahu
No ratings yet
Decision Tree Random Forest X20147503 Antanas Capskis
Document5 pages
Decision Tree Random Forest X20147503 Antanas Capskis
Antanas Capskis
No ratings yet
Measure of Central Tendancy 2
Document17 pages
Measure of Central Tendancy 2
MD HANJALAH
No ratings yet
Business Statistics Syallbus
Document10 pages
Business Statistics Syallbus
manish yadav
No ratings yet
Data Science
Document17 pages
Data Science
Nabajit
No ratings yet
Comparitive Study of Regression Models Paper
Document6 pages
Comparitive Study of Regression Models Paper
Varad
No ratings yet
Reading 11 - Programming End-to-End Solution
Document13 pages
Reading 11 - Programming End-to-End Solution
lussy
No ratings yet
Plugins Comments - PHP API Key 150283448
Document1 page
Plugins Comments - PHP API Key 150283448
domru
No ratings yet
Notes On Data Processing, Analysis, Presentation
Document63 pages
Notes On Data Processing, Analysis, Presentation
janeth pallangyo
No ratings yet
Notebook 4 - Machine Learning
Document17 pages
Notebook 4 - Machine Learning
faiza.alam0000
No ratings yet
Applied Statistics Outliers Chapter 2
Document12 pages
Applied Statistics Outliers Chapter 2
khadija
No ratings yet
Ai Hon 4
Document22 pages
Ai Hon 4
Gaurav Suryavanshi
No ratings yet
Script
Document5 pages
Script
Simo Jayat
No ratings yet
Data - Preprocessing - 2
Document10 pages
Data - Preprocessing - 2
Madina Dates
No ratings yet
MMW Statistic m6
Document5 pages
MMW Statistic m6
eunice tahil
No ratings yet
Data Analytics Course (IIFT Kolkata) Lectures 1 - 4 21072022
Document221 pages
Data Analytics Course (IIFT Kolkata) Lectures 1 - 4 21072022
omprakash
No ratings yet
Statistics For Management: Q.1 A) 'Statistics Is The Backbone of Decision Making'. Comment
Document10 pages
Statistics For Management: Q.1 A) 'Statistics Is The Backbone of Decision Making'. Comment
khanal_sandeep5696
No ratings yet
2020UCO1572
Document2 pages
2020UCO1572
Ram kumar
No ratings yet
Day04 Business Moments
Document10 pages
Day04 Business Moments
Divya
No ratings yet
How To Perform Monte Carlo Simulation
Document13 pages
How To Perform Monte Carlo Simulation
sarabjeethanspal
No ratings yet
Data Mining
Document34 pages
Data Mining
ShreyaPrakash
No ratings yet
Analytics Advanced Assignment Mubassir Surve
Document7 pages
Analytics Advanced Assignment Mubassir Surve
Mubassir Surve
No ratings yet
Decision Tree Using Sci-Kit Learn
Document9 pages
Decision Tree Using Sci-Kit Learn
sudeepvmenon
No ratings yet
New Microsoft Office Word Document
Document10 pages
New Microsoft Office Word Document
Shafique Rahman
No ratings yet
Midterm Reviews
Document4 pages
Midterm Reviews
Bách Nguyễn
No ratings yet
Measures of Central Tendency
Document48 pages
Measures of Central Tendency
Sheila Mae Guad
100% (1)
Chapter 1 RM
Document44 pages
Chapter 1 RM
Ankur Dharod
No ratings yet
29 K-Nearest Neighbor and Summing Up The End-To-End Workflow
Document6 pages
29 K-Nearest Neighbor and Summing Up The End-To-End Workflow
krishna
No ratings yet
Classification & Prediction
Document19 pages
Classification & Prediction
AKANKSHA GARG
No ratings yet
JaiGoutham Data Mining Report
Document29 pages
JaiGoutham Data Mining Report
hema aarthi
No ratings yet
Notebook 4 - Machine Learning
Document17 pages
Notebook 4 - Machine Learning
Blobby Hatchner
No ratings yet
Probability and Statistics Teacher S Edition Solution Key
Document104 pages
Probability and Statistics Teacher S Edition Solution Key
Deron Zierer
No ratings yet
Revised Clustering Business Report
Document5 pages
Revised Clustering Business Report
Pratigya pathak
No ratings yet
Module 4 MMW
Document26 pages
Module 4 MMW
Clash Of clan
No ratings yet
Decision Tree & Random Forest
Document16 pages
Decision Tree & Random Forest
reshma acharya
No ratings yet
Mba Semester 1 Mb0040 - Statistics For Management-4 Credits (Book ID: B1129) Assignment Set - 1 (60 Marks)
Document10 pages
Mba Semester 1 Mb0040 - Statistics For Management-4 Credits (Book ID: B1129) Assignment Set - 1 (60 Marks)
amarendrasuman
No ratings yet
Model Evaluation
Document29 pages
Model Evaluation
niti gupta
No ratings yet
Answer Report (Preditive Modelling)
Document29 pages
Answer Report (Preditive Modelling)
Shweta Lakhera
100% (1)
BSChem-Statistics in Chemical Analysis PDF
Document6 pages
BSChem-Statistics in Chemical Analysis PDF
KENT BENEDICT PERALES
No ratings yet
Jeffrey Williams (20221013) 4
Document27 pages
Jeffrey Williams (20221013) 4
JEFFREY WILLIAMS P M 20221013
No ratings yet
Metrices of The Model
Document9 pages
Metrices of The Model
Narmatha
No ratings yet
DMDW 03
Document25 pages
DMDW 03
Harsh Nag
No ratings yet
SEM 1 MB0040 1 Statistics For Management
Document8 pages
SEM 1 MB0040 1 Statistics For Management
Kumar Gollapudi
No ratings yet
Mid Term
Document12 pages
Mid Term
sree vishnupriyq
No ratings yet
1.1 - Statistical Analysis PDF
Document10 pages
1.1 - Statistical Analysis PDF
zoohyun91720
No ratings yet
Machine Learning: BY:Vatsal J. Gajera (09BCE010)
Document25 pages
Machine Learning: BY:Vatsal J. Gajera (09BCE010)
Riya Yadav
No ratings yet
Why It Is Necessary To Summarise Data? Explain The Approaches Available To Summarize The Data Distributions?
Document4 pages
Why It Is Necessary To Summarise Data? Explain The Approaches Available To Summarize The Data Distributions?
Chandan Kishore
No ratings yet
CH 2 - Measure of Central Tendency
Document9 pages
CH 2 - Measure of Central Tendency
hehehaswal
No ratings yet
Basic Interview Q's On ML PDF
Document243 pages
Basic Interview Q's On ML PDF
sourajit roy chowdhury
100% (2)
Sajjad DS
Document97 pages
Sajjad DS
Hey Buddy
100% (2)
Whole ML PDF 1614408656
Document214 pages
Whole ML PDF 1614408656
Kshatrapati Singh
100% (1)
DSBDL Asg 2 Write Up
Document4 pages
DSBDL Asg 2 Write Up
sdaradeyt
No ratings yet