Introduction Class
What is Machine Learning?
Machine Learning algorithms enable computers to learn from data, and even improve themselves, without being explicitly programmed.
Why Machine Learning?
● Voluminous Data
● Computational Power
● Powerful Algorithms
ENIAC
Machine Learning Evolution
Applications of Machine Learning
Machine Learning Types
Supervised Learning
In supervised learning, an AI system is presented with data that is labeled.
The Goal of Supervised Learning
To approximate the mapping function so well that, when you have new input data (X), you can predict the output variable (Y) for that data.
X → f(X) → Y
Types of Supervised Learning
● Classification
● Regression
Classification
It refers to a predictive modeling problem where a
class label is predicted for a given example of input
data
Classification Types
● Binary Classification
● Multiclass Classification
● Multilabel Classification
● Imbalanced Classification
Classification Modelling
Original Dataset → split into Training Set (IV & DV) and Test Set (IV)
Training Set (IV & DV) → ML Classification Model Training → Trained Model
Test Set (IV) → Trained Model → Predicted Labels for Test Set
Regression
Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y).
Size, Area, #Floors, #Bedrooms, Parking, … → Regression Model → Prediction (House Price)
Regression Modelling
Original Dataset → split into Training Set (IV & DV) and Test Set (IV)
Training Set (IV & DV) → ML Regression Model Training → Trained Model
Test Set (IV) → Trained Model → Predicted Real Values for Test Set
Unsupervised Learning
In unsupervised learning, an AI system is presented
with unlabeled, uncategorized data and the system’s
algorithms act on the data without prior training
Unsupervised Learning Types
● Clustering: discovering inherent groups in the data
● Association Rules: discovering rules that describe large portions of the data
● Dimensionality Reduction: reducing the number of input variables in a dataset
Clustering
● The method of identifying similar groups of data in a dataset is called clustering
○ Entities in each group are comparatively more similar to entities of the same group than to those of the other groups
Association Rules
● Association rule mining finds interesting associations and relationships among large sets of data items
○ These rules show how frequently an itemset occurs in a transaction (e.g. Market Basket Analysis)
Dimensionality Reduction
● It refers to techniques for reducing the number of
input variables in training data
○ Reducing the dimensionality by projecting the data to a
lower dimensional subspace which captures the
“essence” of the data
Semi-supervised Learning
● A learning problem that involves a small number of labeled examples and a large number of unlabeled examples
○ It is required while working with data where labeling instances is challenging or expensive
Reinforcement Learning
● A reinforcement learning algorithm, or agent,
learns by interacting with its environment
○ The agent receives rewards by performing correctly and
penalties for performing incorrectly
Machine Learning Terminologies
● ML Model
○ The learned program that maps inputs to predictions
○ Alternate names: Predictor / Classifier / Regression Model
Unseen Input → ML Model → Predictions
Machine Learning Terminologies
● Dataset
○ A table with the data from which the machine learns
○ The dataset contains the features and the target to predict
■ Features / Inputs / IV
■ Target / Output / DV
■ Each row is an Instance / Record
Machine Learning Terminologies
● Training and Test Sets
○ The dataset is split into a training set and a test set (e.g. an 80:20 split)
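The 80:20 split above can be sketched in plain Python (a minimal illustration; in practice a library helper such as scikit-learn's train_test_split is typically used, and the function below is a hypothetical stand-in with the same idea):

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle the records, then split them into a training and a test set."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    shuffled = data[:]               # copy so the original dataset is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

dataset = list(range(100))           # 100 toy records
train, test = train_test_split(dataset, test_ratio=0.2)
```

With 100 records and a 0.2 ratio this yields 80 training and 20 test records; shuffling first avoids any ordering bias in the original dataset leaking into the split.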
Training Vs. Validation Vs. Test Sets
● Training Set
○ The sample of data used to fit the model
● Validation Set
○ The sample of data used to tune the model’s hyperparameters during development
● Test Set
○ The sample of data used for an unbiased evaluation of the final model
Generalization
● Generalization is a term used to describe a model’s
ability to react to new data
○ after being trained on a training set, a model can digest
new data and make accurate predictions
○ a model’s ability to generalize is central to the success of
a model
Overfitting
● If a model has been trained too well on training
data, it will be unable to generalize
● It will make inaccurate predictions when given new
data, making the model useless even though it is
able to make accurate predictions for the training
data
Underfitting
● Underfitting happens when a model has not been
trained enough on the data
● In the case of underfitting, it makes the model just
as useless and it is not capable of making accurate
predictions, even with the training data
Underfit Vs. Balanced Vs. Overfit
Variance
● Variance measures the amount that the outputs of our model will change if a different dataset is used
○ Models with high variance perform well on training data but have high error rates on test data
○ A high-variance model leads to overfitting
Bias – Variance Tradeoff
[Figure series: a Height vs. Weight dataset is split into a training set and a test set; a linear model and a complex model are each fit to the training set and then evaluated on the test set]
ML – Performance Metrics
● The metrics that you choose to evaluate your machine learning algorithms are very important
○ They influence
■ How you weigh the importance of different characteristics in the results and
■ Your ultimate choice of algorithm
Classification Metrics
● Confusion Matrix
● Classification Accuracy
● Classification Report
● Log Loss
Confusion Matrix
● A confusion matrix is a correlation between the predictions of a model and the actual class labels of the data points

                        Actual Positive           Actual Negative
Predicted Positive      TP (True Positives)       FP (False Positives)
Predicted Negative      FN (False Negatives)      TN (True Negatives)
Important Ratios from the Confusion Matrix

TPR = True Positives / Actual Positives
TNR = True Negatives / Actual Negatives
Accuracy = Correct Predictions / Total Predictions

For a model to be smart, TP and TN should be as high as possible. Accuracy is algorithm and dataset specific.

● Accuracy from the Confusion Matrix

Accuracy = (TP + TN) / (TP + TN + FP + FN)
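The ratios above follow directly from the four counts. A minimal sketch, using made-up counts purely for illustration:

```python
def confusion_ratios(tp, fp, fn, tn):
    """Derive TPR, TNR and accuracy from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)                        # TP / actual positives
    tnr = tn / (tn + fp)                        # TN / actual negatives
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct / total predictions
    return tpr, tnr, accuracy

# Hypothetical counts: 40 TP, 10 FP, 5 FN, 45 TN (100 predictions in total)
tpr, tnr, acc = confusion_ratios(tp=40, fp=10, fn=5, tn=45)
```

Here accuracy is (40 + 45) / 100 = 0.85, while TPR and TNR separate performance on the positive and negative classes.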
Classification Metrics - Precision
● It is the ratio of True Positives (TP) to the total positive predictions
○ Basically, it tells us how many of your positive predictions were actually positive

Precision = TP / (TP + FP)
Classification Metrics - Recall
● It is nothing but the TPR (True Positive Rate explained earlier), also known as Sensitivity
○ It tells us, out of all the actual positive points, how many were predicted positive

Recall = TP / (TP + FN)
Classification Metrics - Specificity
● In contrast to Recall, Specificity measures the proportion of negatives that are correctly identified

Specificity = TN / (TN + FP)
Classification Metrics - F1 Score
● F1 Score is the Harmonic Mean of precision and recall
● The range for F1 Score is [0, 1] and it tells you
○ how precise your classifier is (how many instances it classifies correctly)
○ as well as how robust it is (it does not miss a significant number of instances)

F1 = 2 / (1/Precision + 1/Recall) = 2 * (Precision * Recall) / (Precision + Recall)
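Precision, recall and F1 computed from the same kind of confusion-matrix counts (a sketch; the counts are made up for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and their harmonic mean (the F1 score)."""
    precision = tp / (tp + fp)                  # of predicted positives, how many were right
    recall = tp / (tp + fn)                     # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 40 TP, 10 FP, 5 FN
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=5)
```

Because F1 is a harmonic mean, it sits closer to the smaller of precision and recall, so a classifier cannot hide a weak recall behind a strong precision (or vice versa).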
Harmonic Vs. Arithmetic Mean
● Unlike the arithmetic mean, the harmonic mean punishes extreme values more
○ It is calculated by dividing the number of observations by the sum of the reciprocals of the values in the series
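The difference is easy to see numerically. In this small sketch, one extreme low value pulls the harmonic mean far below the arithmetic mean:

```python
def arithmetic_mean(values):
    return sum(values) / len(values)

def harmonic_mean(values):
    # number of observations divided by the sum of the reciprocals
    return len(values) / sum(1 / v for v in values)

values = [1, 10, 10, 10]             # one extreme low value among three tens
am = arithmetic_mean(values)         # 7.75
hm = harmonic_mean(values)           # dragged down toward the extreme value
```

The arithmetic mean is 7.75, while the harmonic mean is about 3.08: exactly the behavior that makes F1 sensitive to whichever of precision or recall is weaker.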
Area Under the ROC Curve
● It is used for binary classification problems
○ The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example
Mean Absolute Error
● The average of the absolute difference between the original values and the predicted values
○ It gives us a measure of how far the predictions were from the actual output
○ However, it doesn’t give us any idea of the direction of the error, i.e. whether we are under-predicting or over-predicting the data

MAE = (1/N) * Σ |y_i − y_i′|, for i = 1, …, N
Mean Squared Error
● MSE takes the average of the square of the difference between the original values and the predicted values
○ The advantage of MSE is that it is easier to compute the gradient, whereas Mean Absolute Error requires complicated linear programming tools to compute the gradient

MSE = (1/N) * Σ (y_i − y_i′)², for i = 1, …, N

○ As we take the square of the error, the effect of larger errors becomes more pronounced than that of smaller errors, hence the model can now focus more on the larger errors
Root Mean Squared Error (RMSE)
● It follows the assumption that errors are unbiased and follow a normal distribution
○ The ‘square root’ empowers this metric to show large deviations
○ The ‘squared’ nature of this metric helps deliver more robust results, preventing positive and negative error values from cancelling out

RMSE = √( (1/N) * Σ (y_i − y_i′)² ), for i = 1, …, N
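The three regression metrics above, sketched for a toy set of predictions (the values are made up for illustration):

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute deviation."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average squared deviation."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of MSE, back in the target's units."""
    return math.sqrt(mse(y_true, y_pred))

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
```

On this data MAE is 0.875 while RMSE is about 1.15: the single large error (2 → 4) weighs more heavily once squared.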
Data Types
● Categorical
● Numerical
Stevens’ typology of measurement scales
● Nominal
● Ordinal
● Interval
○ Equal spaces between values, but no meaningful zero value
○ Examples: Temperature (Celsius/Fahrenheit), IQ, Credit Score
● Ratio
○ Equal spaces between values and a meaningful zero value; the mean makes sense
○ Examples: Age, Height, Weight

Nominal Vs. Ordinal Vs. Interval Vs. Ratio
A More Detailed Taxonomy
Types of Data
● Quantitative
○ Interval
○ Ratio
● Qualitative
○ Nominal
○ Ordinal
Quantitative Vs. Qualitative
● Quantitative data seems to be the easiest to explain and tries to find the answers to questions such as
○ “how many”, “how much” and “how often”
● It can be expressed as a number, so it can be quantified
Quantitative Vs. Qualitative
● Qualitative data can’t be expressed as a number, so it can’t be measured
○ It mainly consists of words, pictures, and symbols, but not numbers
● It can answer questions like:
○ “how did this happen”, or “why did this happen”
Categorical Data
● Categorical data represents characteristics.
○ Therefore it can represent things like a person’s gender,
language etc.
○ Categorical data can also take on numerical values
(Example: 1 for female and 0 for male)
● Two types of categorical data
○ Nominal
○ Ordinal
Categorical - Nominal
● Nominal values represent discrete units and are
used to label variables
○ Nominal data don’t have any order
○ Therefore, changing the order won’t change their value
Categorical - Ordinal
● Ordinal values represent discrete and ordered units
○ It is therefore nearly the same as nominal data, except that its ordering matters
Numerical - Discrete
● We speak of discrete data if its values are distinct and separate
○ In other words: we speak of discrete data if the data can only take on certain values
○ This type of data can’t be measured but it can be counted
○ It basically represents information that can be categorized into a classification
○ Examples:
■ The number of students in a class
■ The number of workers in a company
■ The number of test questions you answered correctly
Numerical - Continuous
● Continuous Data represents measurements and
therefore their values can’t be counted but they
can be measured
○ Interval
■ Interval values represent ordered units that have the same
difference
○ Ratio
■ Ratio values are also ordered units that have the same
difference
■ Ratio values are the same as interval values, with the
difference that they do have an absolute zero
Interval Vs. Ratio
Statistical Descriptions of Data
● They help us measure some very special properties of the data
● One such property is the central tendency
○ Measuring the central tendency helps us know where most of the data lies, taking the whole set of data into account
Central Tendency - Mean
● Mathematically, the mean of n values can be defined as:

mean = (x_1 + x_2 + … + x_n) / n
Central Tendency - Mode
● The mode for a set of data is the value that occurs most frequently in the set
○ Hence, it can be calculated for both qualitative and quantitative attributes
○ A dataset with two modes is known as bimodal
○ In general, a dataset with two or more modes is known as multimodal
Central Tendency – Mid Range
● This is defined as the average of the largest and
smallest values in the set of values
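The three measures of central tendency above, computed for a small sample (a plain-Python sketch; the standard-library statistics module offers mean and mode as well):

```python
from collections import Counter

data = [2, 3, 4, 5, 7, 9, 9, 12, 15]

mean = sum(data) / len(data)                  # arithmetic mean
mode = Counter(data).most_common(1)[0][0]     # most frequently occurring value
mid_range = (max(data) + min(data)) / 2       # average of largest and smallest
```

Note how the three can disagree: the mid-range only looks at the two extremes, so a single outlier moves it much more than it moves the mean or mode.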
Dispersion of the Data
● The dispersion of data means the spread of data
● Measuring the dispersion of data
○ Let x1, x2, x3, …, xn be a set of observations for some numeric attribute, X
○ The following terms are used for measuring the dispersion of data:
■ Range
■ Quantile
■ Interquartile Range (IQR)
■ Variance and Standard Deviation
Range
● It is defined as the difference between the largest and smallest values in the set
5 8 9 4 3 2 7 12 15 6
Range = 15 – 2 = 13
Quantiles
● These are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets
2 3 4 5 7 9 11 13 15 22 24 27 30 31 35
The kth q-quantile for a given data distribution is the value x such that at most k/q of the data values are less than x and at most (q−k)/q of the data values are more than x, where k is an integer such that 0 < k < q. There are in total (q−1) q-quantiles.
Quartile – 4 Quantiles
Quartiles are the values that divide a list of numbers into quarters:
• Put the list of numbers in order
• Then cut the list into four equal parts
• The Quartiles are at the "cuts"
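The steps above can be sketched in code. Several quartile conventions exist; this sketch uses the simple "median of halves" rule (sort the data, take the median, then the medians of the lower and upper halves), applied to the 15-value list from the quantiles slide:

```python
def median(sorted_vals):
    """Median of an already-sorted list."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def quartiles(values):
    """Q1, Q2, Q3 via the median-of-halves rule (median excluded for odd n)."""
    s = sorted(values)
    n = len(s)
    q2 = median(s)
    lower = s[: n // 2]                 # values below the median position
    upper = s[(n + 1) // 2 :]           # values above the median position
    return median(lower), q2, median(upper)

q1, q2, q3 = quartiles([2, 3, 4, 5, 7, 9, 11, 13, 15, 22, 24, 27, 30, 31, 35])
```

For this list the cuts land at 5, 13 and 27. Other conventions (e.g. the interpolation used by numpy.quantile) can give slightly different values on the same data.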
Interquartile Range (IQR)
● The distance between the first and third quartiles is a
simple measure of the spread that gives the range
covered by the middle half of the data
Variance & Standard Deviation
● The Standard Deviation is a measure of how spread out numbers are
● The Variance is defined as the average of the squared differences from the Mean
○ The variance of N observations, x_1, x_2, …, x_N, for a numeric attribute X is:

σ² = (1/N) * Σ (x_i − x̄)², for i = 1, …, N, where x̄ is the mean

○ The Standard Deviation, σ, is the square root of the variance
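The population formula above, sketched for a small sample:

```python
import math

def variance(values):
    """Population variance: average squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def std_dev(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]      # mean 5; variance 4; standard deviation 2
```

Note this is the population version (divide by N); the sample version divides by N − 1 instead, which the statistics module exposes as statistics.variance.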
Example – Mean, Variance and Standard Deviation
[Worked example figures, including a visual look at the standard deviation]
Outliers
● An Outlier is a data object that deviates significantly
from the rest of the objects as if it were generated by a
different mechanism
Outlier Example
What if we remove the outlier?
Outlier Detection using Box Plot
● A box and whisker plot (also called a box plot) displays the five-number summary of a set of data
● Five-number summary
○ Minimum
○ First quartile (Q1)
○ Median
○ Third quartile (Q3)
○ Maximum
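The usual box-plot rule flags values more than 1.5 × IQR below Q1 or above Q3 as outliers. A sketch, reusing the median-of-halves quartile convention (other quartile conventions can flag slightly different values):

```python
def quartiles(values):
    """Q1, Q2, Q3 via the median-of-halves rule."""
    s = sorted(values)
    n = len(s)

    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2

    return median(s[: n // 2]), median(s), median(s[(n + 1) // 2 :])

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

outliers = iqr_outliers([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])
```

For this sample, Q1 = 12 and Q3 = 14, so anything outside [9, 17] is flagged, which catches the value 102.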
Handling missing values in the dataset
● Real-world data will obviously have a lot of missing values
● Handling missing values:
○ Ignore the tuple with missing values
Binning Vs. Encoding
● For a machine learning model, the dataset needs to be
processed in the form of numerical vectors to train it
using an ML algorithm
○ Feature Binning: Conversion of a continuous variable to
categorical
○ Feature Encoding: Conversion of a categorical variable to
numerical features
Binning Technique
● The set of data values is sorted, grouped into “buckets” or “bins”, and then each value in a particular bin is smoothed using its neighbors
○ It is also said that the binning method does local smoothing because it consults the nearby values to smooth the values of the attribute
[4, 8, 15, 21, 21, 24, 25, 28, 34]
Smoothing by bin means
● In this method, all the values of a particular bin are replaced
by the mean of the values of that particular bin
○ Mean of 4, 8, 15 = 9
○ Mean of 21, 21, 24 = 22
○ Mean of 25, 28, 34 = 29
Smoothing by bin medians
● In this method, all the values of a particular bin are replaced
by the median of the values of that particular bin
○ Median of 4, 8, 15 = 8
○ Median of 21, 21, 24 = 21
○ Median of 25, 28, 34 = 28
Smoothing by bin boundaries
● In this method, all the values of a particular bin are replaced by the closest boundary value (the minimum or maximum) of that particular bin
○ For example, in the bin [4, 8, 15] the boundaries are 4 and 15, so 8 is replaced by 4, its closest boundary
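The smoothing variants above, sketched for the example [4, 8, 15, 21, 21, 24, 25, 28, 34] with equal-size bins of three (median smoothing works the same way, replacing the mean with the bin median):

```python
def smooth_by_means(values, bin_size=3):
    """Replace every value with the mean of its bin (values must be sorted)."""
    out = []
    for i in range(0, len(values), bin_size):
        bin_vals = values[i : i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        out.extend([mean] * len(bin_vals))
    return out

def smooth_by_boundaries(values, bin_size=3):
    """Replace every value with the closer of its bin's min/max boundary."""
    out = []
    for i in range(0, len(values), bin_size):
        bin_vals = values[i : i + bin_size]
        lo, hi = min(bin_vals), max(bin_vals)
        out.extend(lo if x - lo <= hi - x else hi for x in bin_vals)
    return out

data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
```

Mean smoothing reproduces the bin means from the slides (9, 22, 29), while boundary smoothing keeps each value at one of its bin's two endpoints.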
Encoding
● Most ML algorithms cannot handle categorical variables, hence it is important to do feature encoding
○ Label Encoding (used for the target variable)
○ Ordinal Encoding
○ Frequency Encoding
○ Binary Encoding
○ One-hot Encoding
○ Target Encoding
Ordinal Encoding
● An ordinal encoding involves mapping each unique label to an integer value (used for features)
○ This type of encoding is really only appropriate if there is a known relationship between the categories
Frequency Encoding
● It transforms an original categorical variable to a
numerical variable by considering the frequency
distribution of the data
○ It can be useful for nominal features
Binary Encoding
● Binary Encoding first maps each value to an integer, then takes the binary representation of that integer to build a binary table that encodes the data
One-hot Encoding
● The one-hot encoding technique splits each category into its own column
○ It creates n different columns, one per category, and puts 1 in the column corresponding to the category and 0 in the rest
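One-hot encoding in plain Python (a minimal sketch; in practice pandas.get_dummies or scikit-learn's OneHotEncoder do the same thing at scale):

```python
def one_hot_encode(values):
    """Map each distinct category to its own 0/1 column."""
    categories = sorted(set(values))    # one column per category, fixed order
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "green", "blue", "green"]
encoded = one_hot_encode(colors)        # columns: blue, green, red
```

Each row has exactly one 1, so no artificial ordering is imposed on the categories, which is exactly why one-hot is preferred over integer codes for nominal features.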
Target Encoding
● Target encoding is the process of replacing a categorical value
with the mean of the target variable
○ Any non-categorical columns are automatically dropped by the
target encoder model
Feature Scaling
● Feature scaling means adjusting data that has
different scales so as to avoid biases from big
outliers
○ It standardizes the independent features present in the
data in a fixed range
Why Feature Scaling?
● A machine learning algorithm works on numbers and has no knowledge of what those numbers represent
○ Many ML algorithms perform better when numerical input variables are scaled to a standard range
○ Feature scaling is a crucial part of the data preprocessing stage
Will Feature Scaling Work for all ML
Algorithms?
Will Feature Scaling Work for all ML Algorithms?
● Tree-Based Algorithms
○ They are fairly insensitive to the scale of the features
○ Think about it: a decision tree only splits a node based on a single feature
○ This split on a feature is not influenced by other features
Feature Scaling Categories
● Normalization
● Standardization
Normalization
● A scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1
○ It is also known as Min-Max scaling
○ Here’s the formula for normalization:

X′ = (X − X_min) / (X_max − X_min)
Standardization
● Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation
○ This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation
○ Here’s the formula for standardization:

X′ = (X − μ) / σ
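Both scaling formulas, sketched in plain Python (scikit-learn's MinMaxScaler and StandardScaler are the usual tools in practice):

```python
import math

def normalize(values):
    """Min-max scaling: rescale into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return [(x - mean) / std for x in values]

data = [10.0, 20.0, 30.0, 40.0, 50.0]
scaled = normalize(data)                # lands in [0, 1]
z = standardize(data)                   # mean 0, unit standard deviation
```

Normalization is bounded but sensitive to the extremes (a single outlier squashes everything else), while standardization is unbounded but keeps the mean at 0 and the spread at 1.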
Normalization or Standardization?
● Normalization is good to use when you know that the
distribution of your data does not follow a Gaussian
distribution
● Standardization, on the other hand, can be helpful in cases
where the data follows a Gaussian distribution
Covariance
● Variables may change in relation to each other
● Covariance measures how much the movement in one
variable predicts the movement in a corresponding
variable
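Covariance, and the Pearson correlation it feeds into, sketched in plain Python (numpy.cov and numpy.corrcoef are the usual tools; this sketch uses the sample version, dividing by n − 1):

```python
import math

def covariance(xs, ys):
    """Sample covariance: average co-movement about the two means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled into the range [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]        # perfectly linear in x
```

Covariance alone depends on the units of the variables, which is why correlation (covariance divided by the product of the standard deviations) is used to compare strength of association across datasets; here the perfectly linear pair gives a correlation of exactly 1.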
Smoking vs. Lung Capacity Data
[Worked example figures: calculating the covariance and correlation of the smoking vs. lung capacity data]
Correlation is not Causation
Correlation Is Not Good at Curves