Introduction Class
What is Machine Learning?
Machine Learning algorithms enable computers to learn from data, and even improve themselves, without being explicitly programmed.
Why Machine Learning?
● Voluminous Data
● Computational Power
● Powerful Algorithms
ENIAC
Machine Learning Evolution
Applications of Machine Learning
Machine Learning Types
Supervised Learning
In supervised learning, an AI system is presented with data that is labeled.
The Goal of Supervised Learning
To approximate the mapping function so well that, when you have new input data (X), you can predict the output variable (Y) for that data.
X → f(X) → Y
Types of Supervised Learning
● Classification
● Regression
Classification
It refers to a predictive modeling problem where a
class label is predicted for a given example of input
data
Classification Types
● Binary Classification
● Multiclass Classification
● Multilabel Classification
● Imbalanced Classification
Classification Modelling
Original Dataset → split into Training Set (IV & DV) and Test Set (IV)
Training Set (IV & DV) → ML Classification Model Training → Trained Model
Test Set (IV) → Trained Model → Predicted Labels for Test Set
Regression
Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y).
Size, Area, #Floors, #Bedrooms, Parking, … → Regression Model → Prediction (House Price)
Regression Modelling
Original Dataset → split into Training Set (IV & DV) and Test Set (IV)
Training Set (IV & DV) → ML Regression Model Training → Trained Model
Test Set (IV) → Trained Model → Predicted Real Values for Test Set
Unsupervised Learning
In unsupervised learning, an AI system is presented
with unlabeled, uncategorized data and the system’s
algorithms act on the data without prior training
Unsupervised Learning Types
● Clustering: discovering inherent groups in the data
● Association Rules: discovering rules that describe large portions of the data
● Dimensionality Reduction: reducing the number of input variables in a dataset
Clustering
● The method of identifying similar groups of data in a dataset is called clustering
○ Entities in each group are comparatively more similar to entities of the same group than to those of the other groups
Association Rules
● Association rule mining finds interesting associations and relationships among large sets of data items
○ These rules show how frequently an itemset occurs in a transaction (e.g. Market Basket Analysis)
Dimensionality Reduction
● It refers to techniques for reducing the number of
input variables in training data
○ Reducing the dimensionality by projecting the data to a
lower dimensional subspace which captures the
“essence” of the data
Semi-supervised Learning
● A learning problem that involves a small number of labeled examples and a large number of unlabeled examples
○ It is required while working with data where labeling instances is challenging or expensive
Reinforcement Learning
● A reinforcement learning algorithm, or agent,
learns by interacting with its environment
○ The agent receives rewards by performing correctly and
penalties for performing incorrectly
Machine Learning Terminologies
● ML Model
○ The learned program that maps inputs to predictions
○ Alternate names: Predictor / Classifier / Regression Model
Unseen Input → ML Model → Predictions
Machine Learning Terminologies
● Dataset
○ A table with the data from which the machine learns
○ The dataset contains the features and the target to predict
■ Features / Inputs / IV
■ Target / Output / DV
■ Each row is an Instance / Record
Machine Learning Terminologies
● Training and Test Sets
○ The dataset is split into a training set and a test set (e.g. an 80:20 split)
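The 80:20 split above can be sketched in plain Python (a minimal illustration; in practice a library helper such as scikit-learn's train_test_split is typically used, and the function below is a hypothetical stand-in with the same idea):

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle the records, then split them into a training and a test set."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    shuffled = data[:]               # copy so the original dataset is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

dataset = list(range(100))           # 100 toy records
train, test = train_test_split(dataset, test_ratio=0.2)
```

With 100 records and a 0.2 ratio this yields 80 training and 20 test records; shuffling first avoids any ordering bias in the original dataset leaking into the split.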
Training Vs. Validation Vs. Test Sets
● Training Set
○ The sample of data used to fit the model
● Validation Set
○ The sample of data used to tune the model’s hyperparameters during development
● Test Set
○ The sample of data used for an unbiased evaluation of the final model
Generalization
● Generalization is a term used to describe a model’s
ability to react to new data
○ after being trained on a training set, a model can digest
new data and make accurate predictions
○ a model’s ability to generalize is central to the success of
a model
Overfitting
● If a model has been trained too well on training
data, it will be unable to generalize
● It will make inaccurate predictions when given new
data, making the model useless even though it is
able to make accurate predictions for the training
data
Underfitting
● Underfitting happens when a model has not been
trained enough on the data
● In the case of underfitting, it makes the model just
as useless and it is not capable of making accurate
predictions, even with the training data
Underfit Vs. Balanced Vs. Overfit
Variance
● Variance measures the amount that the outputs of our model will change if a different dataset is used
○ Models with high variance perform well on training data but have high error rates on test data
○ A high-variance model leads to overfitting
Bias – Variance Tradeoff
[Figure series: a Height vs. Weight dataset is split into a training set and a test set; a linear model and a complex model are each fit to the training set and then evaluated on the test set]
ML – Performance Metrics
● The metrics that you choose to evaluate your machine learning algorithms are very important
○ They influence
■ How you weigh the importance of different characteristics in the results and
■ Your ultimate choice of algorithm
Classification Metrics
● Confusion Matrix
● Classification Accuracy
● Classification Report
● Log Loss
Confusion Matrix
● A confusion matrix is a correlation between the predictions of a model and the actual class labels of the data points

                        Actual Positive           Actual Negative
Predicted Positive      TP (True Positives)       FP (False Positives)
Predicted Negative      FN (False Negatives)      TN (True Negatives)
Important Ratios from the Confusion Matrix

TPR = True Positives / Actual Positives
TNR = True Negatives / Actual Negatives
Accuracy = Correct Predictions / Total Predictions

For a model to be smart, TP and TN should be as high as possible. Accuracy is algorithm and dataset specific.

● Accuracy from the Confusion Matrix

Accuracy = (TP + TN) / (TP + TN + FP + FN)
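The ratios above follow directly from the four counts. A minimal sketch, using made-up counts purely for illustration:

```python
def confusion_ratios(tp, fp, fn, tn):
    """Derive TPR, TNR and accuracy from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)                        # TP / actual positives
    tnr = tn / (tn + fp)                        # TN / actual negatives
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct / total predictions
    return tpr, tnr, accuracy

# Hypothetical counts: 40 TP, 10 FP, 5 FN, 45 TN (100 predictions in total)
tpr, tnr, acc = confusion_ratios(tp=40, fp=10, fn=5, tn=45)
```

Here accuracy is (40 + 45) / 100 = 0.85, while TPR and TNR separate performance on the positive and negative classes.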
Classification Metrics - Precision
● It is the ratio of True Positives (TP) to the total positive predictions
○ Basically, it tells us how many of your positive predictions were actually positive

Precision = TP / (TP + FP)
Classification Metrics - Recall
● It is nothing but the TPR (True Positive Rate explained earlier), also known as Sensitivity
○ It tells us, out of all the actual positive points, how many were predicted positive

Recall = TP / (TP + FN)
Classification Metrics - Specificity
● In contrast to Recall, Specificity measures the proportion of negatives that are correctly identified

Specificity = TN / (TN + FP)
Classification Metrics - F1 Score
● F1 Score is the Harmonic Mean of precision and recall
● The range for F1 Score is [0, 1] and it tells you
○ how precise your classifier is (how many instances it classifies correctly)
○ as well as how robust it is (it does not miss a significant number of instances)

F1 = 2 / (1/Precision + 1/Recall) = 2 * (Precision * Recall) / (Precision + Recall)
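Precision, recall and F1 computed from the same kind of confusion-matrix counts (a sketch; the counts are made up for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and their harmonic mean (the F1 score)."""
    precision = tp / (tp + fp)                  # of predicted positives, how many were right
    recall = tp / (tp + fn)                     # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 40 TP, 10 FP, 5 FN
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=5)
```

Because F1 is a harmonic mean, it sits closer to the smaller of precision and recall, so a classifier cannot hide a weak recall behind a strong precision (or vice versa).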
Harmonic Vs. Arithmetic Mean
● Unlike the arithmetic mean, the harmonic mean punishes extreme values more
○ It is calculated by dividing the number of observations by the sum of the reciprocals of the values in the series
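The difference is easy to see numerically. In this small sketch, one extreme low value pulls the harmonic mean far below the arithmetic mean:

```python
def arithmetic_mean(values):
    return sum(values) / len(values)

def harmonic_mean(values):
    # number of observations divided by the sum of the reciprocals
    return len(values) / sum(1 / v for v in values)

values = [1, 10, 10, 10]             # one extreme low value among three tens
am = arithmetic_mean(values)         # 7.75
hm = harmonic_mean(values)           # dragged down toward the extreme value
```

The arithmetic mean is 7.75, while the harmonic mean is about 3.08: exactly the behavior that makes F1 sensitive to whichever of precision or recall is weaker.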
Area Under the ROC Curve
● It is used for binary classification problems
○ The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example
Mean Absolute Error
● The average of the absolute difference between the original values and the predicted values
○ It gives us a measure of how far the predictions were from the actual output
○ However, it doesn’t give us any idea of the direction of the error, i.e. whether we are under-predicting or over-predicting the data

MAE = (1/N) * Σ |y_i − y_i′|, for i = 1, …, N
Mean Squared Error
● MSE takes the average of the square of the difference between the original values and the predicted values
○ The advantage of MSE is that it is easier to compute the gradient, whereas Mean Absolute Error requires complicated linear programming tools to compute the gradient

MSE = (1/N) * Σ (y_i − y_i′)², for i = 1, …, N

○ As we take the square of the error, the effect of larger errors becomes more pronounced than that of smaller errors, hence the model can now focus more on the larger errors
Root Mean Squared Error (RMSE)
● It follows the assumption that errors are unbiased and follow a normal distribution
○ The ‘square root’ empowers this metric to show large deviations
○ The ‘squared’ nature of this metric helps deliver more robust results, preventing positive and negative error values from cancelling out

RMSE = √( (1/N) * Σ (y_i − y_i′)² ), for i = 1, …, N
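The three regression metrics above, sketched for a toy set of predictions (the values are made up for illustration):

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute deviation."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average squared deviation."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of MSE, back in the target's units."""
    return math.sqrt(mse(y_true, y_pred))

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
```

On this data MAE is 0.875 while RMSE is about 1.15: the single large error (2 → 4) weighs more heavily once squared.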
Data Types
● Categorical
● Numerical
Stevens’ typology of measurement scales
● Nominal
● Ordinal
● Interval
○ Equal spaces between values, but no meaningful zero value
○ Examples: Temperature (Celsius/Fahrenheit), IQ, Credit Score
● Ratio
○ Equal spaces between values and a meaningful zero value; the mean makes sense
○ Examples: Age, Height, Weight

Nominal Vs. Ordinal Vs. Interval Vs. Ratio
A More Detailed Taxonomy
Types of Data
● Quantitative
○ Interval
○ Ratio
● Qualitative
○ Nominal
○ Ordinal
Quantitative Vs. Qualitative
● Quantitative data seems to be the easiest to explain and tries to find the answers to questions such as
○ “how many”, “how much” and “how often”
● It can be expressed as a number, so it can be quantified
Quantitative Vs. Qualitative
● Qualitative data can’t be expressed as a number, so it can’t be measured
○ It mainly consists of words, pictures, and symbols, but not numbers
● It can answer questions like:
○ “how did this happen”, or “why did this happen”
Categorical Data
● Categorical data represents characteristics.
○ Therefore it can represent things like a person’s gender,
language etc.
○ Categorical data can also take on numerical values
(Example: 1 for female and 0 for male)
● Two types of categorical data
○ Nominal
○ Ordinal
Categorical - Nominal
● Nominal values represent discrete units and are
used to label variables
○ Nominal data don’t have any order
○ Therefore, changing the order won’t change their value
Categorical - Ordinal
● Ordinal values represent discrete and ordered units
○ It is therefore nearly the same as nominal data, except that its ordering matters
Numerical - Discrete
● We speak of discrete data if its values are distinct and separate
○ In other words: we speak of discrete data if the data can only take on certain values
○ This type of data can’t be measured but it can be counted
○ It basically represents information that can be categorized into a classification
○ Examples:
■ The number of students in a class
■ The number of workers in a company
■ The number of test questions you answered correctly
Numerical - Continuous
● Continuous Data represents measurements and
therefore their values can’t be counted but they
can be measured
○ Interval
■ Interval values represent ordered units that have the same
difference
○ Ratio
■ Ratio values are also ordered units that have the same
difference
■ Ratio values are the same as interval values, with the
difference that they do have an absolute zero
Interval Vs. Ratio
Statistical Descriptions of Data
● They help us measure some very special properties of the data
● One such property is the central tendency
○ Measuring the central tendency helps us know where most of the data lies, taking the whole set of data into account
Central Tendency - Mean
● Mathematically, the mean of n values can be defined as:

mean = (x_1 + x_2 + … + x_n) / n
Central Tendency - Mode
● The mode for a set of data is the value that occurs most frequently in the set
○ Hence, it can be calculated for both qualitative and quantitative attributes
○ A dataset with two modes is known as bimodal
○ In general, a dataset with two or more modes is known as multimodal
Central Tendency – Mid Range
● This is defined as the average of the largest and
smallest values in the set of values
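The three measures of central tendency above, computed for a small sample (a plain-Python sketch; the standard-library statistics module offers mean and mode as well):

```python
from collections import Counter

data = [2, 3, 4, 5, 7, 9, 9, 12, 15]

mean = sum(data) / len(data)                  # arithmetic mean
mode = Counter(data).most_common(1)[0][0]     # most frequently occurring value
mid_range = (max(data) + min(data)) / 2       # average of largest and smallest
```

Note how the three can disagree: the mid-range only looks at the two extremes, so a single outlier moves it much more than it moves the mean or mode.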
Dispersion of the Data
● The dispersion of data means the spread of data
● Measuring the dispersion of data
○ Let x1, x2, x3, …, xn be a set of observations for some numeric attribute, X
○ The following terms are used for measuring the dispersion of data:
■ Range
■ Quantile
■ Interquartile Range (IQR)
■ Variance and Standard Deviation
Range
● It is defined as the difference between the largest and smallest values in the set
5 8 9 4 3 2 7 12 15 6
Range = 15 – 2 = 13
Quantiles
● These are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets
2 3 4 5 7 9 11 13 15 22 24 27 30 31 35
The kth q-quantile for a given data distribution is the value x such that at most k/q of the data values are less than x and at most (q−k)/q of the data values are more than x, where k is an integer such that 0 < k < q. There are in total (q−1) q-quantiles.
Quartile – 4 Quantiles
Quartiles are the values that divide a list of numbers into quarters:
• Put the list of numbers in order
• Then cut the list into four equal parts
• The Quartiles are at the "cuts"
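The steps above can be sketched in code. Several quartile conventions exist; this sketch uses the simple "median of halves" rule (sort the data, take the median, then the medians of the lower and upper halves), applied to the 15-value list from the quantiles slide:

```python
def median(sorted_vals):
    """Median of an already-sorted list."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def quartiles(values):
    """Q1, Q2, Q3 via the median-of-halves rule (median excluded for odd n)."""
    s = sorted(values)
    n = len(s)
    q2 = median(s)
    lower = s[: n // 2]                 # values below the median position
    upper = s[(n + 1) // 2 :]           # values above the median position
    return median(lower), q2, median(upper)

q1, q2, q3 = quartiles([2, 3, 4, 5, 7, 9, 11, 13, 15, 22, 24, 27, 30, 31, 35])
```

For this list the cuts land at 5, 13 and 27. Other conventions (e.g. the interpolation used by numpy.quantile) can give slightly different values on the same data.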
Interquartile Range (IQR)
● The distance between the first and third quartiles is a
simple measure of the spread that gives the range
covered by the middle half of the data
Variance & Standard Deviation
● The Standard Deviation is a measure of how spread out numbers are
● The Variance is defined as the average of the squared differences from the Mean
○ The variance of N observations, x_1, x_2, …, x_N, for a numeric attribute X is:

σ² = (1/N) * Σ (x_i − x̄)², for i = 1, …, N, where x̄ is the mean

○ The Standard Deviation, σ, is the square root of the variance
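The population formula above, sketched for a small sample:

```python
import math

def variance(values):
    """Population variance: average squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def std_dev(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]      # mean 5; variance 4; standard deviation 2
```

Note this is the population version (divide by N); the sample version divides by N − 1 instead, which the statistics module exposes as statistics.variance.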
Example – Mean, Variance and Standard Deviation
[Worked example figures, including a visual look at the standard deviation]
Outliers
● An Outlier is a data object that deviates significantly
from the rest of the objects as if it were generated by a
different mechanism
Outlier Example
What if we remove the outlier?
Outlier Detection using Box Plot
● A box and whisker plot (also called a box plot) displays the five-number summary of a set of data
● Five-number summary
○ Minimum
○ First quartile (Q1)
○ Median
○ Third quartile (Q3)
○ Maximum
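The usual box-plot rule flags values more than 1.5 × IQR below Q1 or above Q3 as outliers. A sketch, reusing the median-of-halves quartile convention (other quartile conventions can flag slightly different values):

```python
def quartiles(values):
    """Q1, Q2, Q3 via the median-of-halves rule."""
    s = sorted(values)
    n = len(s)

    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2

    return median(s[: n // 2]), median(s), median(s[(n + 1) // 2 :])

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

outliers = iqr_outliers([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])
```

For this sample, Q1 = 12 and Q3 = 14, so anything outside [9, 17] is flagged, which catches the value 102.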
Handling missing values in the dataset
● Real-world data will obviously have a lot of missing values
● Handling missing values:
○ Ignore the tuple with missing values
Binning Vs. Encoding
● For a machine learning model, the dataset needs to be
processed in the form of numerical vectors to train it
using an ML algorithm
○ Feature Binning: Conversion of a continuous variable to
categorical
○ Feature Encoding: Conversion of a categorical variable to
numerical features
Binning Technique
● The set of data values is sorted, grouped into “buckets” or “bins”, and then each value in a particular bin is smoothed using its neighbors
○ It is also said that the binning method does local smoothing because it consults the nearby values to smooth the values of the attribute
[4, 8, 15, 21, 21, 24, 25, 28, 34]
Smoothing by bin means
● In this method, all the values of a particular bin are replaced
by the mean of the values of that particular bin
○ Mean of 4, 8, 15 = 9
○ Mean of 21, 21, 24 = 22
○ Mean of 25, 28, 34 = 29
Smoothing by bin medians
● In this method, all the values of a particular bin are replaced
by the median of the values of that particular bin
○ Median of 4, 8, 15 = 8
○ Median of 21, 21, 24 = 21
○ Median of 25, 28, 34 = 28
Smoothing by bin boundaries
● In this method, all the values of a particular bin are replaced by the closest boundary value (the minimum or maximum) of that particular bin
○ For example, in the bin [4, 8, 15] the boundaries are 4 and 15, so 8 is replaced by 4, its closest boundary
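The smoothing variants above, sketched for the example [4, 8, 15, 21, 21, 24, 25, 28, 34] with equal-size bins of three (median smoothing works the same way, replacing the mean with the bin median):

```python
def smooth_by_means(values, bin_size=3):
    """Replace every value with the mean of its bin (values must be sorted)."""
    out = []
    for i in range(0, len(values), bin_size):
        bin_vals = values[i : i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        out.extend([mean] * len(bin_vals))
    return out

def smooth_by_boundaries(values, bin_size=3):
    """Replace every value with the closer of its bin's min/max boundary."""
    out = []
    for i in range(0, len(values), bin_size):
        bin_vals = values[i : i + bin_size]
        lo, hi = min(bin_vals), max(bin_vals)
        out.extend(lo if x - lo <= hi - x else hi for x in bin_vals)
    return out

data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
```

Mean smoothing reproduces the bin means from the slides (9, 22, 29), while boundary smoothing keeps each value at one of its bin's two endpoints.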
Encoding
● Most ML algorithms cannot handle categorical variables, hence it is important to do feature encoding
○ Label Encoding (used for the target variable)
○ Ordinal Encoding
○ Frequency Encoding
○ Binary Encoding
○ One-hot Encoding
○ Target Encoding
Ordinal Encoding
● An ordinal encoding involves mapping each unique label to an integer value (used for features)
○ This type of encoding is really only appropriate if there is a known relationship between the categories
Frequency Encoding
● It transforms an original categorical variable to a
numerical variable by considering the frequency
distribution of the data
○ It can be useful for nominal features
Binary Encoding
● Binary Encoding first maps each value to an integer, then takes the binary representation of that integer to build a binary table that encodes the data
One-hot Encoding
● The one-hot encoding technique splits each category into its own column
○ It creates n different columns, one per category, and puts 1 in the column corresponding to the category and 0 in the rest
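One-hot encoding in plain Python (a minimal sketch; in practice pandas.get_dummies or scikit-learn's OneHotEncoder do the same thing at scale):

```python
def one_hot_encode(values):
    """Map each distinct category to its own 0/1 column."""
    categories = sorted(set(values))    # one column per category, fixed order
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "green", "blue", "green"]
encoded = one_hot_encode(colors)        # columns: blue, green, red
```

Each row has exactly one 1, so no artificial ordering is imposed on the categories, which is exactly why one-hot is preferred over integer codes for nominal features.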
Target Encoding
● Target encoding is the process of replacing a categorical value
with the mean of the target variable
○ Any non-categorical columns are automatically dropped by the
target encoder model
Feature Scaling
● Feature scaling means adjusting data that has
different scales so as to avoid biases from big
outliers
○ It standardizes the independent features present in the
data in a fixed range
Why Feature Scaling?
● A machine learning algorithm works on numbers and has no knowledge of what those numbers represent
○ Many ML algorithms perform better when numerical input variables are scaled to a standard range
○ Feature scaling is a crucial part of the data preprocessing stage
Will Feature Scaling Work for all ML
Algorithms?
Will Feature Scaling Work for all ML Algorithms?
● Tree-Based Algorithms
○ They are fairly insensitive to the scale of the features
○ Think about it: a decision tree only splits a node based on a single feature
○ This split on a feature is not influenced by other features
Feature Scaling Categories
● Normalization
● Standardization
Normalization
● A scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1
○ It is also known as Min-Max scaling
○ Here’s the formula for normalization:

X′ = (X − X_min) / (X_max − X_min)
Standardization
● Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation
○ This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation
○ Here’s the formula for standardization:

X′ = (X − μ) / σ
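Both scaling formulas, sketched in plain Python (scikit-learn's MinMaxScaler and StandardScaler are the usual tools in practice):

```python
import math

def normalize(values):
    """Min-max scaling: rescale into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return [(x - mean) / std for x in values]

data = [10.0, 20.0, 30.0, 40.0, 50.0]
scaled = normalize(data)                # lands in [0, 1]
z = standardize(data)                   # mean 0, unit standard deviation
```

Normalization is bounded but sensitive to the extremes (a single outlier squashes everything else), while standardization is unbounded but keeps the mean at 0 and the spread at 1.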
Normalization or Standardization?
● Normalization is good to use when you know that the
distribution of your data does not follow a Gaussian
distribution
● Standardization, on the other hand, can be helpful in cases
where the data follows a Gaussian distribution
Covariance
● Variables may change in relation to each other
● Covariance measures how much the movement in one
variable predicts the movement in a corresponding
variable
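Covariance, and the Pearson correlation it feeds into, sketched in plain Python (numpy.cov and numpy.corrcoef are the usual tools; this sketch uses the sample version, dividing by n − 1):

```python
import math

def covariance(xs, ys):
    """Sample covariance: average co-movement about the two means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled into the range [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]        # perfectly linear in x
```

Covariance alone depends on the units of the variables, which is why correlation (covariance divided by the product of the standard deviations) is used to compare strength of association across datasets; here the perfectly linear pair gives a correlation of exactly 1.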
Smoking vs. Lung Capacity Data
[Worked example figures: calculating the covariance and correlation of the smoking vs. lung capacity data]
Correlation is not Causation
Correlation Is Not Good at Curves