ML (Supervised) - in General


Machine Learning (Supervised) - In General
What will we learn in this session? (Objectives)
1 Introduction to Machine Learning

2 EDA & Data Preprocessing

3 Data Preparation
Outline
Introduction to Machine Learning

Data Cleansing with Pandas

Exploratory Data Analysis

Data Preparation

Hands-On
Introduction to Machine Learning
What is machine learning?
“The ability of a machine to do a certain task performed
by humans without being explicitly programmed to do
that task”
Humans have the intelligence to learn things.
The ability of a machine to learn like a human does is ...

Machine Learning
This is how a machine does that
Turing Test for AI

A Turing Test is a method of inquiry in artificial intelligence (AI) for determining whether or not a
computer is capable of thinking like a human being. The test is named after Alan Turing, the English
computer scientist, cryptanalyst, mathematician and theoretical biologist who proposed it.
Machine Learning vs Traditional Computing
Machine Learning Case
Gartner Analytic Ascendancy Model
Machine Learning Algorithm
Supervised Learning
Unsupervised Learning
Data Cleansing
About Cleaning and Preprocessing a Dataset
Cleaning your data should be the first step in your Data Science
(DS) or Machine Learning (ML) workflow. Without clean data, you will
have a much harder time seeing what actually matters during
exploration. According to CrowdFlower, data scientists
spend 60% of their time organizing and cleansing data.
Why are data cleaning and preprocessing important?
Reasons:
1. A cleaned dataset is easier to visualize and analyze
2. Interpretations drawn from the data are valid
3. If the data is not cleaned, some functions will raise errors
4. Many data scientists can improve model accuracy just by cleaning
and processing the data
Common Problems in Data Cleansing
1. Duplicate Dataset
2. Missing Data
3. Outliers
4. Data Type
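As a rough illustration of the first and last problems in this list, the sketch below removes duplicate rows and fixes a column stored with the wrong data type using pandas. The DataFrame, its column names, and its values are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: one repeated row and a numeric column stored as text.
df = pd.DataFrame({
    "name": ["Ana", "Budi", "Budi", "Citra"],
    "age": ["23", "31", "31", "n/a"],
})

# Duplicates: count fully repeated rows, then drop them.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Data type: convert text to numbers; entries that cannot be parsed become NaN.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

print(n_duplicates)
print(df.dtypes)
```

Using `errors="coerce"` turns unparseable entries into missing values, which then need to be handled as missing data.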

1. Identify missing values
2. Analyze the number or proportion of missing values
3. Appropriately delete or impute the missing values
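Identifying missing values, measuring how many there are, and then deleting or imputing them could look like this in pandas. This is a minimal sketch: the data is invented, and the 50% cut-off for dropping a column is an assumption chosen for illustration, not a rule from the slides.

```python
import pandas as pd

# Hypothetical toy data with gaps (None becomes NaN in numeric columns).
df = pd.DataFrame({
    "age": [23, None, 31, 40],
    "city": ["Jakarta", None, None, None],
    "income": [5.0, 6.5, 7.0, 8.0],
})

# Identify: number of missing values per column.
missing_count = df.isna().sum()

# Analyze: proportion of missing values per column.
missing_share = df.isna().mean()

# Delete: drop columns that are mostly empty (0.5 is an assumed threshold),
# then drop the remaining rows that still contain a missing value.
cols_to_drop = missing_share[missing_share > 0.5].index
cleaned = df.drop(columns=cols_to_drop).dropna()

print(missing_share)
print(cleaned)
```

Dropping rows is only one option; when too many rows would be lost, imputing the gaps is usually the alternative.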

Other Methods for Imputing Missing Values:
1. Median (used for skewed distributions)
2. Mode (used for categorical data)
3. Mean (used for normally distributed data)
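A small sketch of median, mode, and mean imputation with pandas `fillna`. The columns and values are hypothetical; `income` stands in for a skewed column, `city` for a categorical one, and `height` for a roughly normal one.

```python
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "income": [3.0, 4.0, 5.0, 100.0, None],   # skewed by one large value
    "city": ["Jakarta", "Bandung", "Jakarta", None, "Jakarta"],
    "height": [160.0, 170.0, None, 165.0, 175.0],
})

# Median for the skewed column: not pulled upward by the extreme value.
df["income"] = df["income"].fillna(df["income"].median())

# Mode (most frequent value) for the categorical column.
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Mean for the roughly symmetric column.
df["height"] = df["height"].fillna(df["height"].mean())

print(df)
```

In this toy data the mean of `income` would be 28.0, far from the typical values, which is why the median (4.5) is the safer fill for a skewed column.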

An outlier is a data point that lies an abnormal distance from other values
in the data.
Basic Outlier Formula:
1. Lower Bound = Q1 - 1.5 x IQR
2. Upper Bound = Q3 + 1.5 x IQR
3. IQR = Q3 - Q1
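The bounds above can be computed directly with pandas. This is a sketch on made-up numbers; it relies on pandas' default (linear interpolation) quartile method, and other quartile conventions can give slightly different bounds.

```python
import pandas as pd

# Hypothetical sample with one suspiciously large value.
values = pd.Series([10, 12, 12, 13, 14, 15, 16, 18, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Points outside [lower_bound, upper_bound] are flagged as outliers.
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(lower_bound, upper_bound)
print(outliers.tolist())
```

Flagging a point this way only marks it for inspection; whether to remove, cap, or keep it depends on the data.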

The box plot is a useful graphical display for describing the behavior of the
data in the middle as well as at the ends of the distribution.
Exploratory Data Analysis
What is EDA?

Exploratory Data Analysis refers to the critical process of performing
initial investigations on data so as to discover patterns, spot
anomalies, and check assumptions with the help of summary
statistics and graphical representations.
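For the statistical-summary side of EDA, a first pass in pandas might look like the sketch below. The dataset and column names are invented for illustration, and plotting is left out to keep the example dependency-free.

```python
import pandas as pd

# Hypothetical toy data: study time and exam score.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 70, 78],
})

# Summary statistics per column: count, mean, std, min, quartiles, max.
summary = df.describe()

# A simple pattern check: linear correlation between the two variables.
corr = df["hours_studied"].corr(df["score"])

print(summary)
print(round(corr, 3))
```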
3 Parts of EDA

1. Cleansing
Checking for problems with the collected data, such as missing data,
measurement errors, wrong column data types, etc.

2. Defining questions
Identifying the relationships between variables that are particularly
interesting or unexpected

3. Visualizations
Using effective visualizations to communicate the results
Data Preprocessing
Encode Data
Some Approaches
1. Find and Replace
2. Label Encoding
3. One Hot Encoding
4. With Scikit-Learn
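The four approaches can be sketched on a single categorical column as follows. The column, its values, and the ordering in `size_order` are assumptions made for the example; the scikit-learn step uses `OneHotEncoder`, which is one of several encoders the library offers.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# 1. Find and replace: substitute each category using a hand-written mapping.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_replaced"] = df["size"].map(size_order)

# 2. Label encoding: pandas assigns integer codes in alphabetical order.
df["size_label"] = df["size"].astype("category").cat.codes

# 3. One hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["size"], prefix="size")

# 4. With scikit-learn: the same one hot idea as a reusable, fitted encoder.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["size"]]).toarray()

print(df)
print(one_hot)
print(encoded)
```

The hand-written mapping preserves the natural order of the sizes, while the automatic label codes follow alphabetical order, so the two integer columns differ.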
Scaling
Use MinMaxScaler
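A minimal sketch of scikit-learn's `MinMaxScaler`, which rescales each column to the range 0 to 1 using (x - min) / (max - min). The data and column names are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric features on very different scales.
df = pd.DataFrame({
    "age": [20, 30, 40],
    "income": [1000, 3000, 2000],
})

scaler = MinMaxScaler()  # default feature_range is (0, 1)

# fit_transform learns each column's min and max, then rescales the values.
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled)
```

In a modeling workflow the scaler would typically be fitted on the training split only and then applied to the test split with `transform`.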
Let’s go to the Notebook
