ML (Supervised) - in General
What will we learn in this session? (Objective)
1 Introduction to Machine Learning
3 Data Preparation
Outline
Introduction to Machine Learning
Data Preparation
Hands-On
Introduction to Machine Learning
What is machine learning?
“The ability of a machine to do certain tasks performed
by humans without being explicitly programmed to do
those tasks”
Humans have the intelligence to learn things.
The ability of a machine to learn like a human
does is ...
Machine Learning
This is how a machine does that
Turing Test for AI
A Turing Test is a method of inquiry in artificial intelligence (AI) for determining whether or not a
computer is capable of thinking like a human being. The test is named after Alan Turing, who proposed
it. Turing was an English computer scientist, cryptanalyst, mathematician and theoretical
biologist.
Machine Learning vs Traditional Computing
Machine Learning Case
Gartner Analytic Ascendancy Model
Machine Learning Algorithm
Supervised Learning
Unsupervised Learning
Data Cleansing
About Cleaning and Preprocessing a Dataset
Cleaning your data should be the first step in your Data Science
(DS) or Machine Learning (ML) workflow. Without clean data, you will
have a much harder time seeing the parts that actually matter
in your exploration. According to CrowdFlower, data scientists
spend 60% of their time organizing and cleansing data!
Why are data cleaning and preprocessing important?
Reasons:
1. It is easier to visualize and analyze a cleaned dataset
2. Interpretations drawn from the data are valid
3. If the data is not cleaned, some functions will raise errors
4. Many data scientists can improve the accuracy of their models
just by cleaning and preprocessing the data
Common Problems in Data Cleansing
1. Duplicate Dataset
2. Missing Data
3. Outliers
4. Data Type
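Problems 1 and 4 can be illustrated with a minimal pandas sketch. The toy table, its column names, and the choice to keep the first copy of each duplicate are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data: one repeated row and a numeric column stored as text.
df = pd.DataFrame({
    "name": ["Ani", "Budi", "Budi", "Citra"],
    "age": ["23", "31", "31", "27"],
})

# 1. Duplicate Dataset: count fully repeated rows, then keep the first copy of each.
n_duplicates = int(df.duplicated().sum())
deduped = df.drop_duplicates().reset_index(drop=True)

# 4. Data Type: the "age" column holds strings, so convert it to integers.
deduped["age"] = deduped["age"].astype(int)

print(n_duplicates)
print(deduped.dtypes)
```

Whether a repeated row is really a duplicate or a legitimate second record depends on the dataset, so it is worth inspecting the duplicates before dropping them.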
Steps for handling missing values:
1. Identify the missing values
2. Analyze the number or proportion of missing values
3. Appropriately delete or impute the missing values
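The missing-value steps can be sketched with pandas as follows. The toy data is hypothetical, and the median and mode imputation shown here is only one possible choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
df = pd.DataFrame({
    "age": [23.0, np.nan, 31.0, 27.0],
    "city": ["Jakarta", "Jakarta", None, "Medan"],
})

# Identify missing values: True marks a missing cell.
missing_mask = df.isna()

# Analyze the number and the proportion of missing values per column.
missing_count = missing_mask.sum()
missing_share = missing_mask.mean()

# Option A: delete every row that contains a missing value.
dropped = df.dropna()

# Option B: impute (assumed strategy: median for numbers, mode for categories).
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(missing_count)
print(missing_share)
```

Deleting is usually reasonable when only a small share of rows is affected. Imputing keeps the rows but introduces estimated values.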
An outlier is a data point that lies an abnormal distance from other values
in the data.
Basic Outlier Formula:
1. IQR = Q3 - Q1
2. Lower Bound = Q1 - 1.5 x IQR
3. Upper Bound = Q3 + 1.5 x IQR
A value below the lower bound or above the upper bound is treated as an outlier.
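The outlier formula can be applied with pandas, as in this minimal sketch. The sample values are made up, and `quantile` uses linear interpolation by default, so other quartile conventions may give slightly different bounds.

```python
import pandas as pd

# Hypothetical sample with one obviously extreme value.
values = pd.Series([10, 12, 12, 13, 14, 15, 16, 18, 95])

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Bounds from the formula: Q1 - 1.5 x IQR and Q3 + 1.5 x IQR.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside the bounds are flagged as outliers.
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(lower_bound, upper_bound)
print(outliers.tolist())
```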
1. Cleansing
Checking for problems with the collected data, such as missing data,
measurement errors, or incorrect column data types
2. Defining questions
Identifying the relationship between the variables that are particularly
interesting or unexpected
3. Visualizations
Using effective visualizations to communicate the results
Data Preprocessing
Encode Data
Some Approaches
1. Find and Replace
2. Label Encoding
3. One Hot Encoding
4. With Scikit-Learn
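The four approaches can be compared on one small column. This is a sketch under assumptions: the `size` column and the mapping in `size_map` are hypothetical, and "find and replace" is done here with `Series.map`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical column.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# 1. Find and Replace: map each category to a number chosen by hand.
size_map = {"small": 0, "medium": 1, "large": 2}
df["size_replaced"] = df["size"].map(size_map)

# 2. Label Encoding: pandas assigns integer codes in alphabetical category order.
df["size_label"] = df["size"].astype("category").cat.codes

# 3. One Hot Encoding: one indicator column per category.
one_hot = pd.get_dummies(df["size"], prefix="size")

# 4. With Scikit-Learn: LabelEncoder is documented for target labels,
#    OneHotEncoder works on feature columns (2D input).
sk_labels = LabelEncoder().fit_transform(df["size"])
sk_one_hot = OneHotEncoder().fit_transform(df[["size"]]).toarray()

print(df)
print(one_hot)
print(sk_one_hot)
```

Hand mapping preserves a meaningful order (small < medium < large), while label encoding orders categories alphabetically. One hot encoding avoids implying any order at the cost of more columns.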
Scaling
Use MinMaxScaler
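A minimal sketch of scaling with scikit-learn's `MinMaxScaler`, which rescales each feature to the range 0 to 1 by default. The two-column array is made-up data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales.
X = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 800.0],
])

# Each column is transformed as (x - min) / (max - min).
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
```

In a train/test workflow, the scaler is typically fitted on the training data only and then applied to the test data with `scaler.transform`.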
Let’s go to Notebook