
20ECE633T

Machine Learning in VLSI


https://meet.google.com/lookup/a3dn366ruy
 
Learning Resources:
✔ Ethem Alpaydin, Introduction to Machine Learning, Second Edition.
✔ Stephen Marsland, MACHINE LEARNING An Algorithmic Perspective
Chapman & Hall/CRC Machine Learning & Pattern Recognition Series.

Definition of Machine Learning

• ML is about extracting knowledge from data.


• ML is a branch of Computer Science in which a system improves its performance by
repeatedly performing tasks using data, rather than by being explicitly programmed by
programmers.
• A computer program is said to learn from experience E with respect to some task T and some
performance measure P, if its performance on T, as measured by P, improves with experience
E. —Tom Mitchell, 1997.
• Example : Given examples of spam emails (e.g., flagged by users) and examples of regular
(non-spam, also called “ham”) emails, a spam filter is a Machine Learning program that can
learn to flag spam mail.
– The examples the spam filter system uses to learn are called the training set.
– Each training example is called a training instance (or sample).
• In this case, the task T is to flag spam for new emails, the experience E is the training data,
and the performance measure P needs to be defined; for example, accuracy (the ratio of
correctly classified emails), which is often used in classification tasks.
• In ML the user provides the algorithm with a large number of emails (which are the input),
together with information about whether any of these emails are spam (which is the desired
output). Given a new email, the algorithm will then produce a prediction as to whether the
new email is spam.
WHY ML?
• In the early days of “intelligent” applications, many systems used hand-coded rules of “if”
and “else” decisions to process data or adjust to user input.
• Problem: a spam filter (whose job is to move the appropriate incoming email messages to a
spam folder).
• Solution: you could make up a blacklist of words that would result in an email being marked
as spam.
• Problem: detecting faces
• Attempted solution through a hand-coded system: the main problem is that the way in which pixels
(which make up an image in a computer) are “perceived” by the computer is very different
from how humans perceive a face. This difference in representation makes it basically
impossible for a human to come up with a good set of rules to describe what constitutes a
face in a digital image.
• Solution Using machine learning: simply presenting a program with a large collection of
images of faces is enough for an algorithm to determine what characteristics are needed to
identify a face.
• Major disadvantages of using hand-coded rules to make decisions:
1. The logic required to make a decision is specific to a single domain and task.
2. Changing the task even slightly might require a rewrite of the whole system.
3. Designing rules requires a deep understanding of how a decision should be made by a
human expert.
Traditional programming Vs ML
Traditional programming:
✔ long list of complex rules—pretty hard to maintain.
ML:
✔ program is much shorter, easier to maintain,
✔ most likely more accurate.

Problems Machine Learning Can Solve
– Identifying the zip code from handwritten digits on an envelope
Input : scan of the handwriting.
Output : actual digits in the zip code.
Dataset creation : you need to collect many envelopes. Then you can read the zip codes yourself and store the
digits as your desired outcomes.
Model: ????
– Determining whether a tumor is benign based on a medical image
Input : image
Output : whether the tumor is benign.
Dataset creation : you need a database of medical images. You also need an expert opinion, so a doctor needs
to look at all of the images and decide which tumors are benign and which are not. It might even be
necessary to do additional diagnosis beyond the content of the image to determine whether the tumor in the
image is cancerous or not.
– Detecting fraudulent activity in credit card transactions
Input : a record of the credit card transaction.
Output : whether it is likely to be fraudulent or not.
Dataset creation : storing all transactions and recording if a user reports any transaction as fraudulent.
In the above examples, although the inputs and outputs look fairly straightforward, the data collection
process for these three tasks is vastly different.

1. While reading envelopes is laborious, it is easy and cheap.


2. Obtaining medical imaging and diagnoses, on the other hand, requires not only expensive machinery but also rare and
expensive expert knowledge, not to mention the ethical concerns and privacy issues.
3. For detecting credit card fraud, data collection is much simpler: to obtain the input/output pairs of fraudulent
and non-fraudulent activity, all you have to do is store all transactions and record which ones are reported as fraudulent.
supervised learning
• Machine learning algorithms that learn from input/output pairs are
called supervised learning algorithms because a “teacher” provides
supervision to the algorithms in the form of the desired outputs for
each example that they learn from.
• supervised learning, the user provides the algorithm with pairs of
inputs and desired outputs (labeled dataset), and the algorithm finds
a way to produce the desired output given an input.
• In particular, the algorithm is able to create an output for an input it has never seen
before.
• While creating a dataset of inputs and outputs is often a laborious manual process,
supervised learning performance is easy to measure.

Relate slides 5 and 6?

Supervised learning –Block diagram
Major types of supervised ML
1. Classification
2. Regression.
• In classification, the goal is to predict a class label, which is a choice from a predefined list
of possibilities.
• Classification is sometimes separated into binary classification, which is the special case of
distinguishing between exactly two classes.
• You can think of binary classification as trying to answer a yes/no question. Classifying
emails as either spam or not spam is an example of a binary classification problem. In this
binary classification task, the yes/no question being asked would be “Is this email spam?”
• In binary classification we often speak of one class being the positive class and the other
class being the negative class. Here, positive doesn’t represent having benefit or value,
but rather what the object of the study is. So, when looking for spam, “positive” could
mean the spam class. Which of the two classes is called positive is often a subjective
choice, specific to the domain.

• multiclass classification, which is classification between more than two classes.


Case study: classifying irises into one of three possible species.
Classification
• Example- to segregate the images of objects based on the shape.
• If image is round object, it is put under one category, if it is triangular object, it is
put under another category.
• The machine should also categorize an image of unknown category, called test data,
based on the information the model gets from the past data (training data).
• As the training data has a label or category defined for each and every image, the
machine has to map a new image (test data) to the set of images to which it is most similar
and assign the same label or category to the test data.
• Classification is thus assigning a label or category or class to a test data point based on the
label or category or class information that is imparted by the training data.
• Target objective of classification is to assign a class label.
• The target categorical feature is known as class.
Regression
✔ In linear regression, the objective is to predict numerical features like real
estate or stock price, temperature, marks in exam ,sales revenues etc.,
✔ The predictor variable and target variable are continuous in nature.
✔ in linear regression, a straight line relationship is fitted between the
predictor and target variable using the concept of least square method.

• In the least square method, the sum of squared errors between the actual and predicted
values of the target variable is minimized.
• Typical linear regression model: y= a+bx, Where x is the predictor variable and y is the
target variable.
• Example: in the yearly budgeting exercise of sales manager- to give sales prediction for
the next year based on the investments.
• In this, the investment is the predictor variable and sales revenue is the target variable.
• In regression tasks, the goal is to predict a continuous number, or a floating-point
number in programming terms (or a real number in mathematical terms).
• Examples:
1. Predicting a person’s annual income from their education, their age. When predicting
income, the predicted value is an amount, and can be any number in a given range.
2. Predicting the yield of a corn farm given attributes such as previous yields, weather, and
number of employees working on the farm. The yield again can be an arbitrary number.
• How to distinguish between classification and regression tasks?
• Find whether there is some kind of continuity in the output. If there is continuity between
possible outcomes, then the problem is a regression problem. Think about predicting
annual income: there is a clear continuity in the output.
• Whether a person makes $40,000 or $40,001 a year does not make a tangible
difference, even though these are different amounts of money; if our algorithm
predicts $39,999 or $40,001 when it should have predicted $40,000, we don’t mind
that much.
• By contrast, for the task of recognizing the language of a website (which is a
classification problem), there is no matter of degree. A website is in one language, or
it is in another. There is no continuity between languages, and there is no language
that is between English and French.

Supervised learning algorithms
• k-Nearest Neighbors
• Linear Regression
• Logistic Regression
• Support Vector Machine (SVM)
• Decision Trees and Random Forests
• Neural networks
Numerical problem – Classification
k-Nearest Neighbors

• Consider the following table – it consists of the height, age and weight (target) value for 10
people. As you can see, the weight value of ID11 is missing. We need to predict the weight of
this person based on their height and age.
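Since the data table on this slide is an image, here is a minimal sketch of the idea in Python with hypothetical height/age/weight values (the numbers, and the choice k = 3, are illustrative assumptions, not the original table):

```python
import numpy as np

# Hypothetical training data (the slide's table is an image): [height_cm, age]
X_train = np.array([[167, 51], [182, 62], [176, 69], [173, 64],
                    [172, 65], [169, 58], [173, 57], [170, 55]])
y_train = np.array([69.0, 74.0, 70.0, 68.0, 67.0, 61.0, 64.0, 62.0])  # weight in kg

x_id11 = np.array([170, 57])  # the person with the missing weight (values assumed)
k = 3

dists = np.linalg.norm(X_train - x_id11, axis=1)  # Euclidean distance to each row
nearest = np.argsort(dists)[:k]                   # indices of the k closest people
print(y_train[nearest].mean())                    # predicted weight = mean of neighbours
```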

Methods of the calculating distance between points

• There are various methods for calculating this distance, of which the most commonly
known methods are Euclidean and Manhattan distance (for continuous features) and
Hamming distance (for categorical features).
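As a quick illustration, the three distance measures can be written in a few lines of Python (a sketch assuming NumPy):

```python
import numpy as np

def euclidean(a, b):
    # square root of the sum of squared differences, for continuous features
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # sum of absolute differences, also for continuous features
    return np.sum(np.abs(a - b))

def hamming(a, b):
    # number of positions where categorical values disagree
    return np.sum(a != b)

p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(p, q), manhattan(p, q))   # 5.0 7.0
```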

How to choose the k factor?

• This determines the number of neighbours we look at when we assign a value to any new observation.
• In this example, for a value k = 3, the closest points are ID1, ID5 and ID6.
• For the value of k=5, the closest point will be ID1, ID4, ID5, ID6, ID10.

Based on the k value, the final result tends to change.


Then how can we figure out the optimum value of k?
Let us decide it based on the error calculation for our train and validation sets (minimizing
the error is our final goal!).
Error performance

For a very low value of k (suppose k=1), the model overfits on the training data, which
leads to a high error rate on the validation set.
On the other hand, for a high value of k, the model performs poorly on both train and
validation set.
If you observe closely, the validation error curve reaches a minimum at a value of k = 9.
This value of k is the optimum value for the model (it will vary for different datasets). This
curve is known as an ‘elbow curve’ (because it has a shape like an elbow) and is usually
used to determine the k value.
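A minimal sketch of this procedure with scikit-learn (the library and the Iris data are assumptions for illustration): train on one split, measure the error on a held-out validation split for several k values, and look for the elbow.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    err = 1 - model.score(X_val, y_val)   # validation error rate
    print(k, round(err, 3))               # look for the "elbow" in these values
```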
Numerical example 2
• Suppose we have height, weight and T-shirt size of some customers and we need to predict the T-shirt size of a new customer given only
height and weight information we have.
• New customer named 'Monica' has height 161cm and weight 61kg.
• Data including height, weight and T-shirt size information is shown below -

• Step 1 : Calculate Similarity based on distance function
Euclidean is the most commonly used measure. It is mainly used when data is continuous. Manhattan distance
is also very common for continuous variables.

• Step 2 : Find K-Nearest Neighbors


Let k be 5. Then the algorithm searches for the 5 customers closest to Monica, i.e. most similar to Monica in
terms of attributes, and sees what categories those 5 customers were in. If 4 of them had ‘Medium’ T-shirt
sizes and 1 had ‘Large’, then your best guess for Monica is a ‘Medium’ T-shirt.

Standardization:
• When independent variables in training data are measured in different units, it is
important to standardize variables before calculating distance. For example, if one
variable is based on height in cms, and the other is based on weight in kgs then
height will influence more on the distance calculation. In order to make them
comparable we need to standardize them which can be done by any of the
following methods :

After standardization, the 5th closest value changed, because height was dominating the
distance calculation before standardization.
Hence, it is important to standardize predictors before running the k-nearest neighbour
algorithm.
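The slide's formula box is an image; the two methods usually listed here are z-score standardization and min-max scaling. A small sketch of both, assuming NumPy:

```python
import numpy as np

X = np.array([[158.0, 58.0], [163.0, 61.0],
              [170.0, 64.0], [175.0, 70.0]])  # hypothetical [height_cm, weight_kg]

# z-score standardization: zero mean, unit variance per column
X_z = (X - X.mean(axis=0)) / X.std(axis=0)

# min-max scaling: squeeze each column into [0, 1]
X_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_z, X_mm, sep="\n")
```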

Before standardization After standardization

KNN algorithm
Steps:
Do for all test data points
(i) Calculate the distance (usually Euclidean distance) of the test point
from the different training data points.
(ii) Find the closest ‘k’ training points, i.e., training data points whose distances
are least from the test point.
If k = 1
then assign the class label of that training point to the test data point
Else
assign the class label that is predominantly present among the ‘k’ training
points to the test data point.
End do
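A minimal Python sketch of these steps (the data is hypothetical; majority voting covers both the k = 1 and k > 1 branches):

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_test, k=5):
    # (i) distance of the test point from the different training points
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # (ii) labels of the closest 'k' training points
    top_k = y_train[np.argsort(dists)[:k]]
    # assign the predominant class label (for k=1 this is just the nearest label)
    return Counter(top_k).most_common(1)[0][0]

X = np.array([[158, 58], [160, 60], [170, 68], [172, 70]])  # hypothetical [height, weight]
y = np.array(["M", "M", "L", "L"])                          # hypothetical T-shirt sizes
print(knn_predict(X, y, np.array([161, 61]), k=3))          # -> "M"
```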
• Why KNN is non-parametric?
Non-parametric means not making any assumptions on the underlying data
distribution. Non-parametric methods do not have fixed numbers of parameters in the
model.
Similarly, in KNN the model parameters actually grow with the training data set - you can
imagine each training case as a "parameter" in the model.
• Can KNN be used for regression?
Yes, k-nearest neighbour can be used for regression. In other words, the k-nearest
neighbour algorithm can be applied when the dependent variable is continuous. We use k-NN
classification when predicting a categorical outcome, and k-NN regression when
predicting a continuous outcome; in k-NN regression, the predicted value is the average of
the values of the k nearest neighbours.
How to handle categorical variables in KNN?
Create dummy variables out of a categorical variable and include them instead of the
original categorical variable. Unlike regression, create k dummies instead of (k-1). For
example, if a categorical variable named "Department" has 5 unique levels / categories,
we will create 5 dummy variables. Each dummy variable has 1 against its
department and 0 elsewhere.
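A sketch of the k-dummies idea with pandas (the library is assumed; the "Department" column is the hypothetical example from the text):

```python
import pandas as pd

df = pd.DataFrame({"Department": ["HR", "Sales", "IT", "HR", "Ops"]})

# One 0/1 column per level: k dummies, not k-1, as described above
dummies = pd.get_dummies(df["Department"])
df = pd.concat([df.drop(columns="Department"), dummies], axis=1)
print(df)
```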

Why the kNN algorithm is called a lazy learner?
An eager learner follows the general steps of machine learning (performing an abstraction of
the information obtained from the input data and then following it through with a generalization
step).
In kNN these learning steps are skipped: it stores the training data and directly applies nearest-
neighbourhood finding to arrive at the classification. As there is no learning happening in
the real sense, it is called a lazy learner.
Strength of kNN
• Extremely simple algorithm - easy to understand
• Very effective in certain situations ex: recommender system design
• fast training phase (almost no training time is required)
Weakness of kNN
As it does not learn anything in the real sense, and classification is done entirely from the
training data, if the training data does not represent the problem domain comprehensively,
then the algorithm fails to classify effectively.
The classification procedure is very slow, as there is no model trained in the real sense.
A large amount of computational space is required to load the training data for classification.
Applications
1. Recommender system
2. Document searching (information retrieval, also known as concept search)

• How to find best K value?
Cross-validation is a smart way to find out the optimal K value. It estimates the validation error rate by
holding out a subset of the training set from the model building process. 

Cross-validation (let's say 10-fold validation) involves randomly dividing the training set into 10
groups, or folds, of approximately equal size. 90% of the data is used to train the model and the remaining 10%
to validate it. The misclassification rate is then computed on the 10% validation data. This procedure
repeats 10 times, with a different group of observations treated as the validation set each time.
This results in 10 estimates of the validation error, which are then averaged.
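A minimal sketch of this with scikit-learn (the library and dataset are assumptions): 10-fold cross-validation for a few candidate k values, averaging the 10 error estimates.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 7, 9, 11):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    print(k, round(1 - scores.mean(), 3))   # averaged 10-fold misclassification rate
```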

kNN Experience with data set

  

Bias
• Data bias in machine learning is a type of error in which certain elements
of a dataset are more heavily weighted and/or represented than others.
• A biased dataset does not accurately represent a model's use case,
resulting in skewed outcomes, low accuracy levels, and analytical errors.
• From Elite Data Science, bias is: “Bias occurs when an algorithm has
limited flexibility to learn the true signal from the dataset.”
• Wikipedia states, “… bias is an error from erroneous assumptions in the
learning algorithm. High bias can cause an algorithm to miss the relevant
relations between features and target outputs (underfitting).”
• “Bias is the algorithm’s tendency to consistently learn the wrong thing by
not taking into account all the information in the data (underfitting).”
• A high bias means the prediction will be inaccurate.
• Bias is the difference between the average prediction of our model and
the correct value which we are trying to predict. Model with high bias
pays very little attention to the training data and oversimplifies the model.
It always leads to high error on training and test data.

• Parametric algorithms are prone to high bias. A parametric
algorithm is defined as, “A learning model that summarizes
data with a set of parameters of fixed size (independent of
the number of training examples) is called a parametric
model. No matter how much data you throw at a
parametric model, it won’t change its mind about how
many parameters it needs.”
• A linear regression is an example of a parametric algorithm.
These are easy to understand but not flexible to learn the
underlying signal of the data. Thus, they are inaccurate for
complex datasets.
• Examples of high-bias algorithms include Linear Regression,
Linear Discriminant Analysis, and Logistic Regression.

Variance
• From EliteDataScience, the variance is: “Variance refers to an algorithm’s
sensitivity to specific sets of the training data.”
• Wikipedia states, “… variance is an error from sensitivity to small
fluctuations in the training set. High variance can cause an algorithm to
model the random noise in the training data, rather than the intended
outputs (overfitting).”
• Variance is the algorithm’s tendency to learn random things irrespective of
the real signal, by fitting highly flexible models that follow the error/noise in
the data too closely (overfitting).
• Variance is the variability of model prediction for a given data point or a
value which tells us spread of our data. Model with high variance pays a lot
of attention to training data and does not generalize on the data which it
hasn’t seen before. As a result, such models perform very well on training
data but have high error rates on test data.
• Variance leads to overfitting, in which small fluctuations in the training set
are magnified. A model with high variance may reflect random noise
in the training data set instead of the target function.

• Variance is the difference between many model’s
predictions.
• When we implement complicated models, any ‘noise’
in the dataset might be captured by the model.
High variance tends to occur when we use complicated
models that can overfit our training sets.
• For example: a complicated model might depict people’s
name as a good predictor of our hypothesis.
• However, names are random and should not have any
predictive power.
• In one dataset, people with the name ‘Alex’ can indicate
they are likely to be criminals.
• In another dataset, people with the name ‘Alex’ can indicate
they are likely to be graduates. Hence, names should not be
used as a predictive variable.

What is the TRADE-OFF?
• If you have a simple model, you might conclude that every “Alex”
is an amazing person.
• This presents a High Bias and Low Variance problem.
• Your dataset is ‘biased’ towards people with the name Alex. Thus,
most predictions will be similar, since you believe people with
‘Alex’ act a certain way.
• You attempt to fix the model. However, the model is too
complicated.
• Your model has different results for different groups. Thus, Alex
can be a wonderful person, a criminal, an athlete, and a scholar.
• You must find a balance! The good thing is, if you do Cross-Validation,
you can train on many datasets and average their predictions.
Unfortunately, you cannot minimize both bias and variance simultaneously.

Low Bias — High Variance:
A low bias and high variance problem is overfitting. Different models capture
insights specific to their respective datasets. Hence, the models will predict
differently. However, if we average the results, we will have a pretty accurate
prediction.
High Bias — Low Variance:
The predictions will be similar to one another but on average, they are
inaccurate.

bulls-eye diagram

Bias and Variance-bulls-eye diagram
• Goal of supervised learning is to learn the target
function, which can best determine the target variable
from the set of input variables.

underfitting and overfitting
In supervised learning, underfitting :
• happens when a model is unable to capture the underlying pattern of the data.
• These models usually have high bias and low variance.
• It happens when we have too little data to build an accurate model, or when we try to fit a linear model to
nonlinear data.
• Also, these kinds of models are too simple to capture the complex patterns in data, e.g. linear and logistic regression.
In supervised learning, overfitting:
• happens when our model captures the noise along with the underlying pattern in data.
• It happens when we train our model a lot over noisy dataset.
• These models have low bias and high variance.
• These models are very complex like Decision trees which are prone to overfitting.
Underfitting
Goal: supervised learning is to learn to derive the target function which can best determine
the target variable from the set of input variables.
The fitness of a target function learned by the algorithm determines how correctly it is able to
classify a set of data it has never seen.
Underfitting:
• If the target function is kept too simple, it may not be able to capture the essential nuances
(subtleties) and represent the underlying data well. Underfitting happens when a model is
unable to capture the underlying pattern of the data.
• It happens when we have too little training data to build an accurate model, or
when we try to represent nonlinear data with a linear model.
• Also, these kinds of models are too simple to capture the complex patterns in data, e.g.
linear and logistic regression.
• Underfitting results in poor performance on both test and training data; underfit models
usually have high bias and low variance.
Can be avoided by:
– Using more training data
– Reducing features by effective feature selection.

Overfitting

• This refers to a situation where the model has been designed in such a way
that it emulates the training data too closely.
• This occurs due to trying to fit an excessively complex model to match
the training data too closely.
• The target function tries to fit every training data point; overfitting happens when our model
captures the noise along with the underlying pattern in the data.
• Any specific deviation in the training data, like noise or outliers, gets embedded
in the model and affects the performance of the model on the test data.
It happens when we train our model a lot over a noisy dataset.
These models have low bias and high variance.
These models provide good performance on the training set but poor generalization.
To avoid overfitting:
1. Use resampling techniques like cross validation.
2. Remove nodes/features which have little or no predictive power.

How to use Learning Curves to Diagnose Machine
Learning Model Performance
• A learning curve is a plot of model learning performance
over experience or time.
• Learning curves are a widely used diagnostic tool in
machine learning for algorithms that learn from a training
dataset incrementally.
• Learning Curve: line plot of learning (y-axis) over
experience (x-axis).
• Train Learning Curve: learning curve calculated from the
training dataset that gives an idea of how well the model is learning.
• Validation Learning Curve: Learning curve calculated from
a hold-out validation dataset that gives an idea of how well
the model is generalizing.

Underfit Learning Curves
• Underfitting refers to a model that cannot learn the
training dataset.
• A plot of learning curves shows underfitting if either:
• The training loss remains flat regardless of training, or
• The training loss continues to decrease until the end of training.

Overfit Learning Curves
• Overfitting refers to a model that has learned the training dataset too well,
including the statistical noise or random fluctuations in the training dataset.
• A plot of learning curves shows overfitting if:
– The plot of training loss continues to decrease with experience.
– The plot of validation loss decreases to a point and begins increasing again.
• The inflection point in validation loss may be the point at which training
could be halted as experience after that point shows the dynamics of
overfitting.

Good Fit Learning Curves
• A good fit is the goal of the learning algorithm and exists between an overfit and
underfit model.
• A good fit is identified by a training and validation loss that decreases to a point
of stability with a minimal gap between the two final loss values.
• The loss of the model will almost always be lower on the training dataset than
the validation dataset. This means that we should expect some gap between the
train and validation loss learning curves. This gap is referred to as the
“generalization gap.”
• A plot of learning curves shows a good fit if:
• The plot of training loss decreases to a point of stability.
• The plot of validation loss decreases to a point of stability and has a small gap
with the training loss.

Training, Testing, and Validation Set
• Training set to actually train the algorithm
• Validation set to keep track of how well it is doing as it learns,
• Test set to produce the final results.
• This is becoming expensive in data, especially since for supervised
learning it all has to have target values attached (and even for
unsupervised learning, the validation and test sets need targets so that
you have something to compare to), and it is not always easy to get
accurate labels (which may well be why you want to learn about the
data).
• Clearly, each algorithm is going to need some reasonable amount of data
to learn from (precise needs vary, but the more data the algorithm sees,
the more likely it is to have seen examples of each possible type of input,
although more data also increases the computational time to learn).
• Generally, the exact proportion of training to testing to validation data is
up to you, but it is typical to do something like 50:25:25 if you have
plenty of data, and 60:20:20 if you don’t.
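A small NumPy sketch of a 60:20:20 split (random data stands in for a real dataset):

```python
import numpy as np

X = np.random.rand(100, 4)               # hypothetical dataset: 100 samples, 4 features
y = np.random.randint(0, 2, size=100)

idx = np.random.permutation(len(X))      # shuffle before splitting
cut1, cut2 = int(0.6 * len(X)), int(0.8 * len(X))
train, val, test = idx[:cut1], idx[cut1:cut2], idx[cut2:]

X_train, y_train = X[train], y[train]    # to actually train the algorithm
X_val, y_val = X[val], y[val]            # to track progress while learning
X_test, y_test = X[test], y[test]        # to produce the final results
```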

• Cross-validation ‘or’ k-fold cross-validation :
✔Cross-validation is when the dataset is randomly split up into ‘k’ groups.
✔ One of the groups is used as the test set and the rest are used as the training set.
✔The model is trained on the training set and scored on the test set. Then the process is
repeated until each unique group has been used as the test set.
✔For example: for 5-fold cross validation, the dataset would be split into 5 groups, and
the model would be trained and tested 5 separate times so each group would get a
chance to be the test set.
✔This can be seen in the graph below.

Model training ( supervised Learning)

❖ Hold-out method
❖ Cross validation method
• Hold-out is when you split up your dataset into a ‘train’ and ‘test’ set.
✔ The training set is what the model is trained on, and the test set is used to see how
well that model performs on unseen data.
✔ A common split when using the hold-out method is using 80% of data for training
and the remaining 20% of the data for testing.

The dataset is randomly partitioned into K subsets; one subset is used as a validation set,
while the algorithm is trained on all of the others. A different subset is then left out and a new
model is trained on the remaining data, repeating the same process for all of the different subsets.
Finally, the model that produced the lowest validation error is tested and used.
We’ve traded off data for computation time, since we’ve had to train K different models instead of
just one.

Hold-out Vs. Cross-validation

1. Cross-validation is usually the preferred method because it gives
your model the opportunity to train on multiple train-test splits.
This gives you a better indication of how well your model will
perform on unseen data.
Hold-out, on the other hand, is dependent on just one train-test
split. That makes the hold-out method score dependent on how
the data is split into train and test sets.
2. Hold-out method is good to use when you have a very large
dataset, you’re on a time crunch, or you are starting to build an
initial model in your data science project. 
Cross-validation uses multiple train-test splits, it takes more
computational power and time to run than using the holdout
method.

Classification Error and noise
• For classification: the Confusion Matrix. It is a square matrix,
containing all possible classes in both the horizontal and vertical
directions.
• List the classes along the top of the table as predicted outputs and
down the left side as targets.
• So, each element (i, j) of the matrix tells us how many input patterns
with target class i were put into class j by the algorithm.
The diagonal elements (c1,c1), (c2,c2), (c3,c3) count correct classifications.
• For class c3, two patterns were misclassified as c1.

            c1   c2   c3   (predicted)
  c1         5    1    0
  c2         1    4    1
  c3         2    0    4
  (target)

 
• Confusion Matrix
• A confusion matrix is an N X N matrix, where N is the
number of classes being predicted.
• For the problem at hand, we have N=2, and hence we get a
2 X 2 matrix. Here are a few definitions, you need to
remember for a confusion matrix :
• Accuracy : the proportion of the total number of
predictions that were correct.
• Positive Predictive Value or Precision : the proportion of
positive cases that were correctly identified.
• Negative Predictive Value : the proportion of negative
cases that were correctly identified.
• Sensitivity or Recall : the proportion of actual positive
cases which are correctly identified.
• Specificity : the proportion of actual negative cases which
are correctly identified.
Two primary types of errors.
– Type 1 errors (false positives) - rejection of a true null
hypothesis
– Type 2 errors (false negatives)- the non-rejection of a
false null hypothesis
• TP true positive - is an observation correctly put
into class 1.
• FP false positive -is an observation incorrectly put
into class 1

formulas

• Accuracy: Overall, how often is the classifier correct?
– (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate/Error Rate: Overall, how often is it wrong?
– (FP+FN)/total = (10+5)/165 = 0.09 (1-Accuracy)
• True Positive Rate or "Sensitivity" or "Recall": When
it's actually yes, how often does it predict yes?
– TP/actual yes = 100/105 = 0.95

✔ Class yes (positive) = 105
✔ Class no (negative) = 60
✔ Total = 165
✔ Correctly classified = 150 (100 positive cases and 50 negative cases)
✔ Wrongly classified = 15
✔ Actually yes but classified as no (FN) = 5
✔ Actually no but predicted yes (FP) = 10

              Predicted yes   Predicted no
  Actual yes       100              5
  Actual no         10             50
• False Positive Rate: When it's actually no, how often does it predict
yes?
– FP/actual no = 10/60 = 0.17
• True Negative Rate/ Specificity: When it's actually no, how often
does it predict no?
– TN/actual no = 50/60 = 0.83 (1-False Positive Rate)
• Precision: When it predicts yes, how often is it correct? proportion
of positive predictions which are truly positive.
– TP/predicted yes = 100/110 = 0.91
• Prevalence: How often does the yes condition actually occur in our
sample?
– actual yes/total = 105/165 = 0.64
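These definitions can be verified in a few lines of Python using the counts above (TP = 100, FP = 10, FN = 5, TN = 50):

```python
TP, FP, FN, TN = 100, 10, 5, 50
total = TP + FP + FN + TN                 # 165

accuracy = (TP + TN) / total              # 0.91
error_rate = (FP + FN) / total            # 0.09
sensitivity = TP / (TP + FN)              # recall / true positive rate = 0.95
specificity = TN / (TN + FP)              # 0.83
precision = TP / (TP + FP)                # 0.91
prevalence = (TP + FN) / total            # 0.64
print(accuracy, sensitivity, specificity, precision)
```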

Example: medical disease prediction (benign or malignant); prediction of
malignant is the class of interest.
• Sensitivity: the proportion of tumours that are actually
malignant and are predicted as malignant.
• Specificity: the proportion of benign tumours which are
correctly classified; it indicates whether the model strikes a
good balance between being excessively conservative and
excessively aggressive.
• Precision: the proportion of positive predictions which are
truly positive. It indicates the reliability of the model
in predicting a class of interest.
• A model with high specificity and sensitivity is often more
desirable than one judged on accuracy alone.

F-score: it combines precision and recall; it takes the harmonic
mean of precision and recall.
• F = (2 × precision × recall) / (precision + recall)
• Different models can be compared with F-score.
Receiver operating characteristics ROC:
• Visualization is an easier and effective way to understand the model
performance, also helps in comparing the 2 model efficiency.
• A ROC curve is constructed by plotting the true positive rate (TPR)
against the false positive rate (FPR).
• This is a plot of the percentage of true positives on the y axis against
false positives on the x axis
• The true positive rate is the proportion of observations that were
correctly predicted to be positive out of all positive observations
(TP/(TP + FN)).
• Similarly, the false positive rate is the proportion of observations
that are incorrectly predicted to be positive out of all negative
observations (FP/(TN + FP)). 

• The ROC curve shows the trade-off between sensitivity (or TPR)
and specificity (1 – FPR).
• Classifiers that give curves closer to the top-left corner indicate a
better performance.
• As a baseline, a random classifier is expected to give points lying
along the diagonal (FPR = TPR).
• The closer the curve comes to the 45-degree diagonal of the ROC
space, the less accurate the test.

Area under curve (AUC)
• To compare different classifiers, it can be useful to
summarize the performance of each classifier into a single
measure.
• One common approach is to calculate the area under the
ROC curve, which is abbreviated to AUC.
• It is equivalent to the probability that a randomly chosen
positive instance is ranked higher than a randomly chosen
negative instance
• A classifier with a high AUC can occasionally score worse in a
specific region than another classifier with a lower AUC.
• But in practice, the AUC performs well as a general measure
of predictive accuracy.
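A sketch of computing the ROC curve and AUC with scikit-learn (the library, dataset, and logistic-regression model are assumptions for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]        # scores for the positive class

fpr, tpr, thresholds = roc_curve(y_te, probs)  # points of the ROC curve
print(roc_auc_score(y_te, probs))              # area under that curve
```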

Tutorial : Program –kNN
• Iris Flower Species Dataset
• The Iris Flower Dataset involves predicting the flower species given
measurements of iris flowers. It is a multiclass classification problem.
• The number of observations for each class is balanced. There are 150
observations with 4 input variables and 1 output variable.
• variable names are as follows:
– Sepal length in cm.
– Sepal width in cm.
– Petal length in cm.
– Petal width in cm.
– Class
• This k-Nearest Neighbors tutorial is broken down into 3 parts:
• Step 1: Calculate Euclidean Distance.
• Step 2: Get Nearest Neighbors.
• Step 3: Make Predictions.
https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
Inference up to slide 64: classification with supervised learning.
MACHINE LEARNING PROCESS
• Data Collection and Preparation: Machine learning algorithms need significant amounts of data,
preferably without too much noise, but with increased dataset size comes increased
computational cost, and the sweet spot at which there is enough data without excessive
computational overhead is generally impossible to predict.
• Feature Selection: identifying the features that are most useful for the problem
under examination. This invariably requires prior knowledge of the problem and the data; our
common sense was used in the coins example above to identify some potentially useful features
and to exclude others.
• Algorithm Choice: given the dataset, choose an appropriate algorithm.
• Parameter and Model Selection: for many of the algorithms there are parameters that have to be
set manually, or that require experimentation to identify appropriate values.
• Training: training should simply be the use of computational resources in order to build a model of
the data.
• Evaluation: before a system can be deployed it needs to be tested and evaluated for accuracy on
data that it was not trained on. This can often include a comparison with human experts in the
field, and the selection of appropriate metrics for this comparison.

Regression
• What is Regression Analysis?
Regression analysis is a form of predictive modelling technique which
investigates the relationship between a dependent (target) variable and independent
variable(s) (predictors).
This technique is used for forecasting, time series modelling and finding
the causal effect relationship between the variables.
For example, relationship between rash driving and number of road accidents
by a driver is best studied through regression.
• Linear Regression
•  Logistic Regression

Linear regression
• Linear regression is usually among the first few topics which people pick while learning predictive
modeling. In this technique, the dependent variable is continuous, independent variable(s) can
be continuous or discrete, and nature of regression line is linear.
• Linear Regression establishes a relationship between dependent variable (Y) and one or
more independent variables (X) using a best fit straight line (also known as regression line).
• It is represented by an equation Y=a+b*X + e, where a is intercept, b is slope of the line and e is error term.
This equation can be used to predict the value of target variable based on given predictor variable(s).
• The difference between simple linear regression and multiple linear regression is that multiple linear
regression has more than one independent variable, whereas simple linear regression has only 1 independent variable.

How to obtain best fit line (Value of a and b)?
• This task can be easily accomplished by Least Square Method. It is the most common
method used for fitting a regression line.
• It calculates the best-fit line for the observed data by minimizing the sum of the
squares of the vertical deviations from each data point to the line.
• Because the deviations are first squared, when added, there is no cancelling out
between positive and negative values.
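The least-squares solution for y = a + b·x can be computed directly; a small sketch with hypothetical investment/sales numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical investment (predictor)
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])    # hypothetical sales revenue (target)

# Closed-form least squares: minimize the sum of squared vertical deviations
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
a = y.mean() - b * x.mean()                                                # intercept
print(a, b)                                 # fitted model: y = a + b*x
```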
Root Mean Squared Error (RMSE)
• RMSE is the most popular evaluation metric used in regression problems
• The power of ‘square root’  empowers this metric to show large number deviations.
• The ‘squared’ nature of this metric helps to deliver more robust results which
prevents cancelling the positive and negative error values. In other words, this metric aptly
displays the plausible magnitude of error term.
• It avoids the use of absolute error values which is highly undesirable in mathematical
calculations.
• When we have more samples, reconstructing the error distribution using RMSE is considered
to be more reliable.
• RMSE is highly affected by outlier values. Hence, make sure you’ve removed outliers from your
data set prior to using this metric.
• As compared to mean absolute error, RMSE gives higher weightage and punishes large errors.

• We learned that when the RMSE decreases, the model’s
performance will improve. But these values alone are not
intuitive.
• In the case of a classification problem, if the model has
an accuracy of 0.8, we could gauge how good our model
is against a random model, which has an accuracy of 0.5.
So the random model can be treated as a
benchmark. But when we talk about the RMSE metric,
we do not have a benchmark to compare against.
• This is where we can use the R-Squared metric. The formula
for R-Squared is: R² = 1 − (sum of squared residuals) / (total sum of squares).
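Both metrics are easy to compute by hand; a sketch with hypothetical predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # hypothetical actual values
y_pred = np.array([2.8, 5.4, 7.0, 10.5])   # hypothetical model predictions

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # 1.0 = perfect; 0.0 = mean-only model
print(rmse, r2)
```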
• There must be a linear relationship between independent and
dependent variables.
• Multiple regression suffers from multicollinearity, autocorrelation, and
heteroskedasticity.
• Linear regression is very sensitive to outliers. They can terribly affect
the regression line and eventually the forecasted values.
• Multicollinearity can increase the variance of the coefficient
estimates and make the estimates very sensitive to minor changes in
the model. The result is that the coefficient estimates are unstable.
• In case of multiple independent variables, we can go with forward
selection, backward elimination and step wise approach for
selection of most significant independent variables.

Root Mean Squared Logarithmic Error

• In case of Root mean squared logarithmic error, we take the log of the
predictions and actual values. So basically, what changes are the variance
that we are measuring. RMSLE is usually used when we don’t want to
penalize huge differences in the predicted and the actual values when both
predicted and true values are huge numbers.
• If both predicted and actual values are small: RMSE and RMSLE are same.
• If either predicted or the actual value is big: RMSE > RMSLE
• If both predicted and actual values are big: RMSE > RMSLE (RMSLE becomes
almost negligible)
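A quick numerical check of the last case (both values big), assuming NumPy; log1p(x) = log(1 + x):

```python
import numpy as np

def rmse(y, p):
    return np.sqrt(np.mean((y - p) ** 2))

def rmsle(y, p):
    return np.sqrt(np.mean((np.log1p(y) - np.log1p(p)) ** 2))

y = np.array([1_000_000.0])                 # hypothetical huge actual value
p = np.array([900_000.0])                   # hypothetical huge prediction
print(rmse(y, p), rmsle(y, p))              # RMSE is 100000; RMSLE is only ~0.105
```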

• Outliers
• Values distant from most other values. In machine learning,
any of the following are outliers:
• Weights with high absolute values.
• Predicted values relatively far away from the actual values.
• Input data whose values are more than roughly 3 standard
deviations from the mean.
• Outliers often cause problems in model training. Clipping is
one way of managing outliers.
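A one-line sketch of clipping with NumPy (the bounds are illustrative; in practice they might come from, e.g., the roughly-3-standard-deviations rule applied to clean data):

```python
import numpy as np

x = np.array([9.8, 10.1, 10.3, 9.9, 55.0])   # 55.0 is an outlier
x_clipped = np.clip(x, 0.0, 15.0)            # cap values at chosen bounds
print(x_clipped)                             # [ 9.8 10.1 10.3  9.9 15. ]
```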

Unsupervised Learning
• Unsupervised learning is a type of self-organized learning.
• In unsupervised learning, only the input data is known, and no known output data is given to the
algorithm.
• An unsupervised method takes a dataset as input and tries to find natural groupings or patterns within
the data elements or records. Hence it is also called a descriptive model.
• The process of unsupervised learning is referred to as pattern discovery or knowledge discovery.
-Identifying topics in a set of blog posts

If you have a large collection of text data, you might want to summarize it and
find prevalent themes in it. You might not know beforehand what these topics
are, or how many topics there might be. Therefore, there are no known outputs.

-Segmenting customers into groups with similar preferences

Given a set of customer records, you might want to identify which customers are
similar, and whether there are groups of customers with similar preferences.
For a shopping site, these might be “parents,” “bookworms,” or “gamers.”
Because you don’t know in advance what these groups might be, or even how many
there are, you have no known outputs.

-Detecting abnormal access patterns to a website

To identify abuse or bugs, it is often helpful to find access patterns that are different
from the norm. Each abnormal pattern might be very different, and you
might not have any recorded instances of abnormal behaviour.
Because in this example you only observe traffic, and you don’t know what
constitutes normal and abnormal behaviour, this is an unsupervised problem.
Clustering
• It intends to group or organize similar objects together.
• Objects belonging to the same cluster are quite similar to each other, while objects
belonging to different clusters are quite dissimilar.
• Hence the clustering objective is to discover the intrinsic grouping of unlabeled
data and form clusters.
• One of the most commonly adopted similarity measures is distance.
• Distance-based clustering: two data items are considered part of the same
cluster if the distance between them is small. In the same way, if the distance
between the data items is large, the items do not generally belong to the same
cluster.
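A minimal k-means sketch of distance-based clustering (NumPy only; the toy data and fixed iteration count are assumptions, and empty-cluster handling is omitted):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]     # random initial centroids
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                         # nearest centroid per point
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],        # one tight group
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])       # another tight group
print(kmeans(X, k=2))                                    # two clusters recovered
```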
Associative analysis
• Objective of associative analysis is to identify the association
between the data elements .
• Example: market basket analysis.
• From past transaction data in a grocery store:

Trans ID | Items bought
1        | [butter, bread]
2        | [diaper, bread, milk, beer]
3        | [milk, chicken, beer, diaper]
4        | [bread, diaper, chicken, beer]
5        | [diaper, beer, cookies, ice cream]

✔ Frequent itemset -> (diaper, beer)
✔ Possible association: diaper -> beer
✔ This helps in boosting the sales pipeline; it provides critical input to the sales group.
✔ Critical applications: market basket analysis, recommender systems
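Counting itemset support over the transactions above is the first step toward algorithms like Apriori; a small sketch in Python:

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"butter", "bread"},
    {"diaper", "bread", "milk", "beer"},
    {"milk", "chicken", "beer", "diaper"},
    {"bread", "diaper", "chicken", "beer"},
    {"diaper", "beer", "cookies", "ice cream"},
]

pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# Support = fraction of transactions containing the itemset
best, count = pair_counts.most_common(1)[0]
print(best, count / len(transactions))   # ('beer', 'diaper') 0.8
```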
unsupervised learning algorithms

• Clustering
• —k-Means
• —Hierarchical Cluster Analysis (HCA)
• —Expectation Maximization
• Visualization and dimensionality reduction
• —Principal Component Analysis (PCA)
• —Kernel PCA
• —Locally-Linear Embedding (LLE)
• —t-distributed Stochastic Neighbor Embedding (t-SNE)
• Association rule learning
• —Apriori
• —Eclat
Know your dataset / data points
✔For both supervised and unsupervised learning tasks, it is important to have a
representation of your input data that a computer can understand.
✔Often it is helpful to think of your data as a table.
✔Each data point that you want to reason about (each email, each customer, each
transaction) is a row, and each property that describes that data point (say, the age of a
customer or the amount or location of a transaction) is a column.
✔You might describe users by their age, their gender, when they created an account, and
how often they have bought from your online shop.
✔You might describe the image of a tumor by the grayscale values of each pixel, or maybe
by using the size, shape, and color of the tumor.
✔Each entity or row here is known as a sample (or data point) in machine learning,
✔while the columns—the properties that describe these entities—are called features.
✔Keep in mind, however, that no machine learning algorithm will be able to
make a prediction on data for which it has no information.
✔For example, if the only feature that you have for a patient is their last name, no
algorithm will be able to predict their gender. This information is simply not contained in
your data. If you add another feature that contains the patient’s first name, you will
have much better luck, as it is often possible to tell the gender by a person’s first name.