Machine Learning


Course Contents:

Unit | Contents | Hours
CO1 | Introduction: Basic Definitions, Types of Learning - Supervised Learning, Unsupervised Learning, Reinforcement Learning, Hypothesis Space and Inductive Bias, Evaluation and Cross-validation. | 6
CO2 | Regressions: Linear Regression, Decision Trees, Overfitting, Instance-Based Learning, Feature Reduction, Collaborative Filtering-Based Recommendation | 8
CO3 | Probability, Bayes Learning, Logistic Regression, Support Vector Machine, Kernel Function and Kernel SVM | 8
CO4 | Computational Learning: Theory and Applications, PAC Learning Model, Sample Complexity, VC Dimension, Ensemble Learning | 6
CO5 | Clustering, k-Means, Adaptive Hierarchical Clustering, Gaussian Mixture Model | 6

Text Books:
1. Tom Mitchell, Machine Learning, McGraw Hill, 1997.
2. Ethem Alpaydin, Introduction to Machine Learning, 2nd ed., The MIT Press, Cambridge, Massachusetts, London, England.
3. Chris Bishop, Pattern Recognition and Machine Learning.

E-Books:
1. Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms.
   URL: https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/index.html
2. David Barber, Bayesian Reasoning and Machine Learning.
   URL: http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/091117.pdf

Reference Books:
1. Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
2. Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, Wiley, New York, 2001.
3. Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data, 1st Edition.

Online TL Material:
1. NPTEL: Introduction to Machine Learning by Prof. Sudeshna Sarkar, IIT Kharagpur.
   URL: https://onlinecourses.nptel.ac.in/noc21_cs85/preview
2. NPTEL: Essential Mathematics for Machine Learning by Prof. Sanjeev Kumar and Prof. S. K. Gupta, IIT Roorkee.
   URL: https://onlinecourses.nptel.ac.in/noc21_ma38/
3. DataCamp: Machine Learning for Everyone.
   URL: https://www.datacamp.com/courses/machine-learning-for-everyon
Types of Machine Learning
Machine learning is a subset of AI, which enables the machine to automatically learn
from data, improve performance from past experiences, and make predictions. Machine
learning contains a set of algorithms that work on a huge amount of data. Data is fed to these
algorithms to train them, and on the basis of training, they build the model & perform a
specific task.

These ML algorithms help to solve different business problems like Regression,


Classification, Forecasting, Clustering, and Associations, etc.

Based on the methods and way of learning, machine learning is divided into mainly four
types, which are:

1. Supervised Machine Learning


2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
1. Supervised Machine Learning
As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train the machine using a "labelled" dataset, and based on that training, the machine predicts the output. Here, labelled data means that some of the inputs are already mapped to their outputs. More precisely, we first train the machine with inputs and their corresponding outputs, and then we ask the machine to predict the output for a test dataset.
Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we train the machine to understand the images, using features such as the shape and size of the tail of a cat and a dog, the shape of the eyes, colour, height (dogs are taller, cats are smaller), etc. After completion of training, we input the picture of a cat and ask the machine to identify the object and predict the output. Since the machine is well trained, it will check all the features of the object, such as height, shape, colour, eyes, ears, tail, etc., and find that it is a cat, so it will put it in the Cat category. This is how the machine identifies objects in supervised learning.
The main goal of the supervised learning technique is to map the input variable(x) with
the output variable(y). Some real-world applications of supervised learning are Risk
Assessment, Fraud Detection, Spam filtering, etc.

Categories of Supervised Machine Learning


Supervised machine learning can be classified into two types of problems, which are given
below:
o Classification
o Regression

a) Classification
Classification algorithms are used to solve the classification problems in which the output
variable is categorical, such as "Yes" or No, Male or Female, Red or Blue, etc. The
classification algorithms predict the categories present in the dataset. Some real-world
examples of classification algorithms are Spam Detection, Email filtering, etc.

b) Regression

Regression algorithms are used to solve regression problems in which there is a linear
relationship between input and output variables. These are used to predict continuous output
variables, such as market trends, weather prediction, etc.

Some popular Regression algorithms are given below:

o Simple Linear Regression Algorithm


o Multivariate Regression Algorithm
o Decision Tree Algorithm
o Lasso Regression
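As a minimal illustration of the regression setting described above, the sketch below fits a simple linear regression with scikit-learn on a tiny made-up dataset; the feature and target values are invented purely for the example.

# A minimal sketch of simple linear regression with scikit-learn.
# The data below is invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# x: years of work experience, y: salary in thousands (toy values)
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([30, 35, 42, 48, 55])

model = LinearRegression()
model.fit(x, y)                       # learn intercept and coefficient

print(model.intercept_, model.coef_)  # fitted parameters
print(model.predict([[6]]))           # predict a continuous output for a new input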

Advantages and Disadvantages of Supervised Learning


Advantages:

o Since supervised learning works with a labelled dataset, we can have an exact idea
about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages:

o These algorithms are not able to solve complex tasks.


o It may predict the wrong output if the test data is different from the training data.
o It requires lots of computational time to train the algorithm.

Applications of Supervised Learning


Some common applications of Supervised Learning are given below:
o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process,
image classification is performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. It is
done by using medical images and past labelled data with labels for disease
conditions. With such a process, the machine can identify a disease for the new
patients.
o Fraud Detection - Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using historic data to
identify the patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used.
These algorithms classify an email as spam or not spam. The spam emails are sent to
the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications can
be done using the same, such as voice-activated passwords, voice commands, etc.

Difference between Regression and Classification


Regression Algorithm | Classification Algorithm
In Regression, the output variable must be of continuous nature or real value. | In Classification, the output variable must be a discrete value.
The task of the regression algorithm is to map the input value (x) with the continuous output variable (y). | The task of the classification algorithm is to map the input value (x) with the discrete output variable (y).
Regression algorithms are used with continuous data. | Classification algorithms are used with discrete data.
In Regression, we try to find the best fit line, which can predict the output more accurately. | In Classification, we try to find the decision boundary, which can divide the dataset into different classes.
Regression algorithms can be used to solve regression problems such as weather prediction, house price prediction, etc. | Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, identification of cancer cells, etc.
Regression algorithms can be further divided into Linear and Non-linear Regression. | Classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers.

2. Unsupervised Machine Learning


Unsupervised learning is different from the Supervised learning technique; as its name
suggests, there is no need for supervision. It means, in unsupervised machine learning, the
machine is trained using the unlabeled dataset, and the machine predicts the output without
any supervision.

In unsupervised learning, the models are trained with the data that is neither classified nor
labelled, and the model acts on that data without any supervision.

The main aim of the unsupervised learning algorithm is to group or categorize the
unsorted dataset according to similarities, patterns, and differences. Machines are
instructed to find the hidden patterns in the input dataset.

Let's take an example to understand it more precisely: suppose there is a basket of fruit
images, and we input it into the machine learning model. The images are totally unknown to
the model, and the task of the machine is to find the patterns and categories of the objects.

So the machine will discover its own patterns and differences, such as differences in colour
and shape, and predict the output when it is tested with the test dataset.

Categories of Unsupervised Machine Learning


Unsupervised Learning can be further classified into two types, which are given below:

o Clustering
o Association

1) Clustering

The clustering technique is used when we want to find the inherent groups from the data. It is
a way to group the objects into a cluster such that the objects with the most similarities
remain in one group and have fewer or no similarities with the objects of other groups. An
example of the clustering algorithm is grouping the customers by their purchasing behaviour.

Some of the popular clustering algorithms are given below:

o K-Means Clustering algorithm


o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
o Independent Component Analysis

2) Association
Association rule learning is an unsupervised learning technique, which finds interesting
relations among variables within a large dataset. The main aim of this learning algorithm is to
find the dependency of one data item on another data item and map those variables
accordingly so that it can generate maximum profit. This algorithm is mainly applied
in Market Basket analysis, Web usage mining, continuous production, etc.

Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat,
FP-growth algorithm.
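The sketch below is not a full Apriori, Eclat, or FP-growth implementation; it is a minimal plain-Python illustration of the support and confidence measures on which those association rule algorithms are built, using an invented set of market-basket transactions.

from itertools import combinations

# Invented market-basket transactions, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rules of the form {A} -> {B} with their support and confidence.
items = set().union(*transactions)
for a, b in combinations(sorted(items), 2):
    s = support({a, b})
    conf_ab = s / support({a})      # confidence of the rule a -> b
    print(f"{{{a}}} -> {{{b}}}: support={s:.2f}, confidence={conf_ab:.2f}")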

Advantages and Disadvantages of Unsupervised Learning Algorithm


Advantages:

o These algorithms can be used for complicated tasks compared to the supervised ones
because these algorithms work on the unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled
dataset is easier as compared to the labelled dataset.

Disadvantages:

o The output of an unsupervised algorithm can be less accurate as the dataset is not
labelled, and algorithms are not trained with the exact output in prior.
o Working with Unsupervised learning is more difficult as it works with the unlabelled
dataset that does not map with the output.

Applications of Unsupervised Learning


o Network Analysis: Unsupervised learning is used for identifying plagiarism and
copyright in document network analysis of text data for scholarly articles.
o Recommendation Systems: Recommendation systems widely use unsupervised
learning techniques for building recommendation applications for different web
applications and e-commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of unsupervised
learning, which can identify unusual data points within the dataset. It is used to
discover fraudulent transactions.
o Singular Value Decomposition: Singular Value Decomposition or SVD is used to
extract particular information from the database. For example, extracting information
of each user located at a particular location.

3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between
Supervised and Unsupervised machine learning. It represents the intermediate ground
between Supervised (With Labelled training data) and Unsupervised learning (with no
labelled training data) algorithms and uses the combination of labelled and unlabeled datasets
during the training period.

Although Semi-supervised learning is the middle ground between supervised and


unsupervised learning and operates on data that contains a few labels, it mostly consists
of unlabeled data. Labels are costly to obtain, so in practice only a few labels may be available.
This is different from supervised and unsupervised learning, which are defined by the
presence and absence of labels, respectively.

To overcome the drawbacks of supervised learning and unsupervised learning


algorithms, the concept of Semi-supervised learning is introduced. The main aim
of semi-supervised learning is to effectively use all the available data, rather than only
labelled data like in supervised learning. Initially, similar data is clustered along with an
unsupervised learning algorithm, and further, it helps to label the unlabeled data into labelled
data. It is because labelled data is a comparatively more expensive acquisition than unlabeled
data.

We can imagine these algorithms with an example. Supervised learning is like a student
learning under the supervision of an instructor at home and college. Unsupervised learning is
like that student analysing the same concept on their own without any help from the instructor.
Semi-supervised learning is like the student revising the concept by themselves after first
analysing it under the guidance of an instructor at college.

Advantages and disadvantages of Semi-supervised Learning


Advantages:

o It is simple and easy to understand the algorithm.


o It is highly efficient.
o It is used to solve drawbacks of Supervised and Unsupervised Learning algorithms.

Disadvantages:

o The results of the iterations may not be stable.


o We cannot apply these algorithms to network-level data.
o Accuracy is low.
4. Reinforcement Learning
Reinforcement learning works on a feedback-based process, in which an AI agent (a
software component) automatically explores its surroundings by trial and error, taking
actions, learning from experience, and improving its performance. The agent gets rewarded
for each good action and punished for each bad action; hence the goal of a reinforcement
learning agent is to maximize the rewards.
The agent's behaviour depends on changes in the environment.
In reinforcement learning, there is no labelled data like supervised learning, and agents learn
from their experiences only.
The reinforcement learning process is similar to the way a human being learns; for example, a
child learns various things by experience in day-to-day life. An example of reinforcement
learning is playing a game, where the game is the environment, the moves of the agent at each
step define states, and the goal of the agent is to get a high score. The agent receives feedback
in terms of punishments and rewards.
Due to its way of working, reinforcement learning is employed in different fields such
as Game theory, Operation Research, Information theory, multi-agent systems.
A reinforcement learning problem can be formalized using Markov Decision
Process(MDP). In MDP, the agent constantly interacts with the environment and performs
actions; at each action, the environment responds and generates a new state.

Categories of Reinforcement Learning


Reinforcement learning is categorized mainly into two types of methods/algorithms:
o Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the
tendency that the required behaviour would occur again by adding something. It enhances the
strength of the behaviour of the agent and positively impacts it.
o Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite
to the positive RL. It increases the tendency that the specific behaviour would occur again by
avoiding the negative condition.

Real-world Use cases of Reinforcement Learning


o Video Games:
RL algorithms are very popular in gaming applications, where they are used to achieve
super-human performance. Some popular systems that use RL algorithms are AlphaGO and
AlphaGO Zero.
o Resource Management:
The paper "Resource Management with Deep Reinforcement Learning" showed how to use RL
in computers to automatically learn to schedule resources across waiting jobs in order to
minimize the average job slowdown.
o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement learning.
There are different industries that have their vision of building intelligent robots using AI and
Machine learning technology.
o Text Mining
Text-mining, one of the great applications of NLP, is now being implemented with the help of
Reinforcement Learning by Salesforce company.

Advantages and Disadvantages of Reinforcement Learning


Advantages
o It helps in solving complex real-world problems which are difficult to solve with general
techniques.
o The learning model of RL is similar to the learning of human beings; hence very accurate
results can be obtained.
o It helps in achieving long-term results.

Disadvantage
o RL algorithms are not preferred for simple problems.
o RL algorithms require huge data and computations.
o Too much reinforcement learning can lead to an overload of states which can weaken the
results.
ML | Understanding Hypothesis
In most supervised machine learning algorithms, our main goal is to find a possible
hypothesis from the hypothesis space that could map the inputs to the proper outputs.
The following figure shows the common method for finding a possible hypothesis
from the hypothesis space:

Hypothesis Space (H):

The hypothesis space is the set of all possible legal hypotheses. This is the set from
which the machine learning algorithm determines the single best hypothesis that
describes the target function or the outputs.
Hypothesis (h):

A hypothesis is a function that best describes the target in supervised machine
learning. The hypothesis that an algorithm comes up with depends upon the data and
also upon the restrictions and bias that we have imposed on the data. To better
understand the hypothesis space and hypothesis, consider the following coordinate
plot that shows the distribution of some data:
Suppose we have test data for which we have to determine the outputs or
results. The test data is as shown below:

We can predict the outcomes by dividing the coordinate as shown below:


So the test data would yield the following result:

Note, however, that we could also have divided the coordinate plane as:

The way in which the coordinate plane is divided depends on the data, the algorithm and
the constraints.
 All the legal possible ways in which we can divide the coordinate plane to predict the
outcome of the test data together compose the hypothesis space.
 Each individual possible way is known as a hypothesis.
Hence, in this example the hypothesis space would be like:

Inductive Learning Algorithm


Inductive Learning Algorithm (ILA) is an iterative and inductive machine
learning algorithm used for generating a set of classification rules of the form
"IF-THEN", producing rules at each iteration and appending them to the rule set.
Basic Idea:

There are basically two methods for knowledge extraction: from domain experts
and through machine learning.
For very large amounts of data, domain experts are not very useful or reliable,
so we move towards the machine learning approach for this work.
One way to use machine learning is to replicate the experts' logic in the form of
algorithms, but this work is tedious, time-consuming and expensive.
So we move towards inductive algorithms, which generate the strategy for
performing a task themselves and do not need to be instructed separately at each step.

Need of ILA in presence of other machine learning algorithms:

The ILA is a newer algorithm that was needed even when other inductive learning
algorithms like ID3 and AQ were available.
● The need was due to the pitfalls present in the previous algorithms;
one of the major pitfalls was the lack of generalisation of rules.

● ID3 and AQ used the decision tree production method, which was too specific,
difficult to analyse, and very slow for basic short classification problems.

● The decision tree-based algorithms were unable to work on a new problem if some
attributes were missing.

● ILA uses the method of producing a general set of rules instead of decision
trees, which overcomes the above problems.

THE ILA ALGORITHM:

General requirements at start of the algorithm:-


1. list the examples in the form of a table ‘T’ where each row corresponds to an
example and each column contains an attribute value.
2. Create a set of m training examples, each example composed of k attributes and a
class attribute with n possible decisions.
3. Create a rule set, R, having the initial value false.
4. Initially all rows in the table are unmarked.

Steps in the algorithm:-


Step 1:
Divide the table 'T' containing m examples into n sub-tables (t1, t2, ..., tn), one table for each
possible value of the class attribute. (Repeat steps 2-8 for each sub-table.)
Step 2:
Initialize the attribute combination count 'j' = 1.
Step 3:
For the sub-table on which work is going on, divide the attribute list into distinct
combinations, each combination with 'j' distinct attributes.
Step 4:
For each combination of attributes, count the number of occurrences of attribute values that
appear under the same combination of attributes in unmarked rows of the sub-table under
consideration, and at the same time do not appear under the same combination of attributes
in any other sub-table. Call the first combination with the maximum number of occurrences
the max-combination 'MAX'.
Step 5:
If 'MAX' == null, increase 'j' by 1 and go to Step 3.
Step 6:
Mark all rows of the sub-table being worked on in which the values of 'MAX' appear as
classified.
Step 7:
Add a rule (IF attribute = "XYZ" THEN decision is YES/NO) to R whose left-hand side
contains the attribute names of 'MAX' with their values, separated by AND, and whose
right-hand side contains the decision attribute value associated with the sub-table.
Step 8:
If all rows are marked as classified, then move on to process another sub-table and go to Step
2; else, go to Step 4. If no sub-tables are available, exit with the set of rules obtained so far.

An example showing the use of ILA


Suppose we have an example set with attributes place type, weather, location and decision,
and seven examples. Our task is to generate a set of rules describing under what conditions
each decision is made.

Example no. | Place type | Weather | Location | Decision
I   | Hilly    | Winter | Kullu  | Yes
II  | Mountain | Windy  | Mumbai | No
III | Mountain | Windy  | Shimla | Yes
IV  | Beach    | Windy  | Mumbai | No
V   | Beach    | Warm   | Goa    | Yes
VI  | Beach    | Windy  | Goa    | No
VII | Beach    | Warm   | Shimla | Yes

Step 1
Subset 1 (decision = Yes)
S.no | Place type | Weather | Location | Decision
1 | Hilly    | Winter | Kullu  | Yes
2 | Mountain | Windy  | Shimla | Yes
3 | Beach    | Warm   | Goa    | Yes
4 | Beach    | Warm   | Shimla | Yes

Subset 2 (decision = No)
S.no | Place type | Weather | Location | Decision
5 | Mountain | Windy | Mumbai | No
6 | Beach    | Windy | Mumbai | No
7 | Beach    | Windy | Goa    | No
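The following is a minimal Python sketch of the ILA steps applied to the example above. Attribute names and values are taken from the table; this is an illustrative implementation of the steps as described, not a reference one.

from itertools import combinations

# Example data from the table above.
examples = [
    {"place": "hilly", "weather": "winter", "location": "kullu", "decision": "Yes"},
    {"place": "mountain", "weather": "windy", "location": "mumbai", "decision": "No"},
    {"place": "mountain", "weather": "windy", "location": "shimla", "decision": "Yes"},
    {"place": "beach", "weather": "windy", "location": "mumbai", "decision": "No"},
    {"place": "beach", "weather": "warm", "location": "goa", "decision": "Yes"},
    {"place": "beach", "weather": "windy", "location": "goa", "decision": "No"},
    {"place": "beach", "weather": "warm", "location": "shimla", "decision": "Yes"},
]
attributes = ["place", "weather", "location"]

def ila(examples, attributes):
    rules = []
    for cls in {e["decision"] for e in examples}:          # Step 1: one sub-table per class
        sub = [e for e in examples if e["decision"] == cls]
        others = [e for e in examples if e["decision"] != cls]
        marked = [False] * len(sub)
        j = 1                                              # Step 2
        while not all(marked) and j <= len(attributes):
            best = None                                    # (combo, values, count)
            for combo in combinations(attributes, j):      # Step 3
                counts = {}
                for i, row in enumerate(sub):              # Step 4
                    if marked[i]:
                        continue
                    key = tuple(row[a] for a in combo)
                    if any(tuple(o[a] for a in combo) == key for o in others):
                        continue                           # value combo also occurs in another sub-table
                    counts[key] = counts.get(key, 0) + 1
                for key, c in counts.items():
                    if best is None or c > best[2]:
                        best = (combo, key, c)
            if best is None:                               # Step 5: MAX is null, enlarge combinations
                j += 1
                continue
            combo, key, _ = best                           # MAX
            for i, row in enumerate(sub):                  # Step 6: mark covered rows
                if tuple(row[a] for a in combo) == key:
                    marked[i] = True
            condition = " AND ".join(f"{a} = {v}" for a, v in zip(combo, key))
            rules.append(f"IF {condition} THEN decision = {cls}")   # Step 7
    return rules                                           # Step 8: all sub-tables processed

for rule in ila(examples, attributes):
    print(rule)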

Cross Validation in Machine Learning


In machine learning, we couldn’t fit the model on the training data and can’t
say that the model will work accurately for the real data. For this, we must assure that
our model got the correct patterns from the data, and it is not getting up too much
noise. For this purpose, we use the cross-validation technique.
Cross-Validation

Cross-validation is a technique in which we train our model using the subset of the
data-set and then evaluate using the complementary subset of the data-set.
The three steps involved in cross-validation are as follows :
1. Reserve some portion of sample data-set.
2. Using the rest data-set train the model.
3. Test the model using the reserve portion of the data-set.
Methods of Cross Validation
Validation
In this method, we perform training on 50% of the given data-set and the remaining 50% is
used for testing.
The major drawback of this method is that since we train on only 50% of the dataset, the
remaining 50% may contain important information that the model never sees while training,
which leads to higher bias.

LOOCV (Leave One Out Cross Validation)


In this method, we perform training on the whole data-set but leave out only one data point
of the available data-set, and we iterate this for each data point.

It has some advantages as well as disadvantages.


An advantage of using this method is that we make use of all data points, and hence the bias
is low.
The major drawback of this method is that it leads to higher variance in testing, as we test
against only one data point each time; if that data point is an outlier, it can lead to high
variation. Another drawback is that it takes a lot of execution time, as it iterates as many
times as there are data points.

K-Fold Cross Validation


In this method, we split the data-set into k subsets (known as folds), then we perform
training on k-1 of the subsets and leave one subset out for the evaluation of the trained
model. We iterate k times, with a different subset reserved for testing each time.
Note:
A value of k = 10 is commonly suggested; a lower value of k moves towards the simple
validation approach, while a higher value of k approaches the LOOCV method.

Example
The table below shows an example of the training and evaluation subsets generated in
k-fold cross-validation. Here, we have 25 instances in total. In the first iteration we use the
first 20 percent of the data (observations 0-4) for evaluation and the remaining 80 percent
(observations 5-24) for training, while in the second iteration we use the second subset of 20
percent (observations 5-9) for evaluation and the remaining observations for training, and so on.

Total instances: 25
Value of k :5
No. Iteration Training set observations Testing set observations
1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]

Comparison of train/test split to cross-validation

Advantages of train/test split:

1. It runs K times faster than K-fold cross-validation, because K-fold cross-validation
repeats the train/test split K times.
2. It is simpler to examine the detailed results of the testing process.
Advantages of cross-validation:
1. It gives a more accurate estimate of out-of-sample accuracy.
2. It makes more "efficient" use of data, as every observation is used for both training and
testing.

Python code for k-fold cross-validation.

# importing KFold from scikit-learn's model_selection module
from sklearn.model_selection import KFold

# value of K is 10; train_set is assumed to be an existing array-like of training examples
kf = KFold(n_splits=10)

# each iteration yields index arrays for the training and testing folds
for train_index, test_index in kf.split(train_set):
    pass  # slice train_set with train_index / test_index here
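For completeness, scikit-learn also provides cross_val_score, which runs the whole k-fold loop (splitting, fitting and scoring) in one call. The sketch below applies it to the built-in iris dataset with an arbitrarily chosen classifier (logistic regression) purely for illustration.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 10-fold cross-validation; returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean(), scores.std())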

Unit :2
Regression analysis is a fundamental concept in the field of machine
learning. It falls under supervised learning, wherein the algorithm is trained
with both input features and output labels. It helps in establishing a relationship
among the variables by estimating how one variable affects another.

Imagine you're car shopping and have decided that gas mileage is a
deciding factor in your decision to buy. If you wanted to predict the miles per
gallon of some promising rides, how would you do it? Well, since you know the
different features of the car (weight, horsepower, displacement, etc.), one
possible method is regression. By plotting the average MPG of each car against
its features, you can then use regression techniques to find the relationship between the
MPG and the input features. The regression function here could be represented
as $Y = f(X)$, where Y would be the MPG and X would be the input features
like the weight, displacement, horsepower, etc. The target function is $f$, and
this curve helps us predict whether it is beneficial to buy or not. This
mechanism is called regression.

Regression In Machine Learning


Regression in machine learning consists of mathematical methods that allow data scientists to predict
a continuous outcome (y) based on the value of one or more predictor variables (x). Linear regression
is probably the most popular form of regression analysis because of its ease-of-use in predicting and
forecasting.

Linear Regression
Linear Regression is a machine learning algorithm based on supervised learning. It
performs a regression task. Regression models a target prediction value based on
independent variables. It is mostly used for finding out the relationship between
variables and forecasting.
Different regression models differ based on – the kind of relationship between
dependent and independent variables, they are considering and the number of
independent variables being used.

Linear regression performs the task to predict a dependent variable value (y) based on
a given independent variable (x). So, this regression technique finds out a linear
relationship between x (input) and y(output). Hence, the name is Linear Regression.
In the figure above, X (input) is the work experience and Y (output) is the salary of a
person. The regression line is the best fit line for our model.
Hypothesis function for Linear Regression:

y = θ1 + θ2·x

While training the model we are given:

x: input training data (univariate - one input variable (parameter))
y: labels of the data (supervised learning)
When training the model, it fits the best line to predict the value of y for a given
value of x. The model gets the best regression fit line by finding the best θ1 and
θ2 values.
θ1: intercept
θ2: coefficient of x
Once we find the best θ1 and θ2 values, we get the best fit line. So when we finally
use our model for prediction, it will predict the value of y for an input value of x.
How do we update the θ1 and θ2 values to get the best fit line?
Cost Function (J):
To achieve the best-fit regression line, the model aims to predict a y value such that
the error difference between the predicted value and the true value is minimal. So it is very
important to update the θ1 and θ2 values to reach the values that minimize the error
between the predicted y value (pred) and the true y value (y).

The cost function (J) of Linear Regression is the Root Mean Squared Error
(RMSE) between the predicted y value (pred) and the true y value (y).
Gradient Descent:
To update the θ1 and θ2 values in order to reduce the cost function (minimizing the RMSE
value) and achieve the best fit line, the model uses Gradient Descent. The idea is to start
with random θ1 and θ2 values and then iteratively update them until the minimum cost
is reached.

Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a
tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the
tree into subtrees.
o The diagram below explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning
model. Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies


 Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
 Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.

 Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.

 Branch/Sub Tree: A tree formed by splitting the tree.

 Pruning: Pruning is the process of removing the unwanted branches from the tree.

 Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record
(real dataset) attribute and, based on the comparison, follows the branch and jumps to the
next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes
and moves further. It continues the process until it reaches a leaf node of the tree. The
complete process can be better understood using the algorithm below:


o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further;
such final nodes are called leaf nodes.

Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
Consider the below diagram:
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute
for the root node and for the sub-nodes. To solve such problems there is a technique called
the Attribute Selection Measure, or ASM. With this measurement, we can easily select
the best attribute for the nodes of the tree. There are two popular techniques for ASM, which
are:
o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:

1. Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]


Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 - ∑j Pj²
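A small sketch of the two measures, computed for a node from its class counts; the counts below are made up for illustration.

import math

def entropy(counts):
    total = sum(counts)
    # Entropy(S) = -sum p_i * log2(p_i), skipping empty classes.
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def gini(counts):
    total = sum(counts)
    # Gini Index = 1 - sum p_i^2
    return 1 - sum((c / total) ** 2 for c in counts)

# Example node with 9 "yes" and 5 "no" samples (invented counts).
print(entropy([9, 5]))   # about 0.94
print(gini([9, 5]))      # about 0.46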

Pruning: Getting an Optimal Decision tree


Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.
A tree that is too large increases the risk of overfitting, while a small tree may not capture all
the important features of the dataset. A technique that decreases the size of the learning tree
without reducing accuracy is known as pruning. There are mainly two types of
tree pruning technology used:
o Cost Complexity Pruning
o Reduced Error Pruning.

Advantages of the Decision Tree


o It is simple to understand as it follows the same process which a human follows while making
any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

What is Instance-Based and Model-Based Learning?

Most Machine Learning tasks are about making predictions. This means that given a number
of training examples, the system needs to be able to make good “predictions for” /
“generalize to” examples it has never seen before.

Having a good performance measure on the training data is good, but insufficient; the true
goal is to perform well on new instances.

There are two main approaches to generalization: instance-based learning and model-based
learning
1. Instance-based learning:

(sometimes called memory-based learning) is a family of learning algorithms that, instead of


performing explicit generalization, compares new problem instances with instances seen
in training, which have been stored in memory.

Ex- k-nearest neighbor, decision tree

2. Model-based learning:

Model-based learning uses machine learning models that are parameterized with a fixed
number of parameters that does not change as the size of the training data changes.

If you don't assume any distribution with a fixed number of parameters over your data (for
example, in k-nearest neighbor, or in a decision tree, where the number of parameters grows
with the size of the training data), then you are not model-based; such methods are called
nonparametric.

Overfitting in Machine Learning


Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training data to the extent
that it negatively impacts the performance of the model on new data. This means that the
noise or random fluctuations in the training data are picked up and learned as concepts by the
model. The problem is that these concepts do not apply to new data and negatively impact the
model's ability to generalize.

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility
when learning a target function. As such, many nonparametric machine learning algorithms
also include parameters or techniques to limit and constrain how much detail the model
learns.

For example, decision trees are a nonparametric machine learning algorithm that is very
flexible and is subject to overfitting training data. This problem can be addressed by pruning
a tree after it has learned in order to remove some of the detail it has picked up.
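As a minimal illustration of this, the sketch below compares an unrestricted decision tree with a depth-limited one on scikit-learn's breast cancer dataset. Limiting the depth is used here as a simple stand-in for pruning, and the exact scores will vary with the split.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unrestricted tree: tends to fit the training data perfectly (overfitting).
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Depth-limited tree: a crude form of pruning that constrains what the model can memorize.
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("full", full_tree), ("depth-limited", small_tree)]:
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))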

Underfitting in Machine Learning


Underfitting refers to a model that can neither model the training data nor generalize to new
data.

An underfit machine learning model is not a suitable model and will be obvious as it will
have poor performance on the training data.

Underfitting is often not discussed as it is easy to detect given a good performance metric.
The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it
does provide a good contrast to the problem of overfitting.

Overfitting vs. Underfitting

We can understand overfitting better by looking at the opposite problem, underfitting.


Underfitting occurs when a model is too simple – informed by too few features or regularized
too much – which makes it inflexible in learning from the dataset.

Simple learners tend to have less variance in their predictions but more bias towards wrong
outcomes.

On the other hand, complex learners tend to have more variance in their predictions.

Both bias and variance are forms of prediction error in machine learning.

Typically, we can reduce error from bias but might increase error from variance as a result, or
vice versa.

This trade-off between too simple (high bias) vs. too complex (high variance) is a key
concept in statistics and machine learning, and one that affects all supervised learning
algorithms.

What is Feature Reduction?


● Feature reduction, also known as dimensionality reduction, is the process of
reducing the number of features in a resource-heavy computation without
losing important information.
● Reducing the number of features means the number of variables is reduced,
making the computer's work easier and faster.
Feature reduction can be divided into two processes:
feature selection and feature extraction.

● There are many techniques by which feature reduction is accomplished.


Some of the most popular are generalized discriminant
analysis, autoencoders, non-negative matrix factorization, and principal
component analysis.
Why is this Useful?
● The purpose of using feature reduction is to reduce the number of features
(or variables) that the computer must process to perform its function.
● Feature reduction leads to the need for fewer resources to complete
computations or tasks. Less computation time and less storage capacity
needed means the computer can do more work.
● During machine learning, feature reduction removes multicollinearity
resulting in improvement of the machine learning model in use.
Another benefit of feature reduction is that it makes data easier to visualize for
humans, particularly when the data is reduced to two or three dimensions which
can be easily displayed graphically. An interesting problem that feature reduction
can help with is called the curse of dimensionality. This refers to a group of
phenomena in which a problem will have so many dimensions that the data
becomes sparse. Feature reduction is used to decrease the number of dimensions,
making the data less sparse and more statistically significant for machine learning
applications.
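A minimal sketch of feature extraction with principal component analysis in scikit-learn, reducing the 4-feature iris data to 2 components so that it could, for example, be plotted:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)        # 150 samples, 4 features

pca = PCA(n_components=2)                # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (150, 2)
print(pca.explained_variance_ratio_)     # how much variance each component retains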

Collaborative Filtering – ML
In Collaborative Filtering, we tend to find similar users and recommend what similar
users like.
In this type of recommendation system, we don’t use the features of the item to
recommend it, rather we classify the users into the clusters of similar types, and
recommend each user according to the preference of its cluster.
Measuring Similarity:
A simple example of the movie recommendation system will help us in explaining:

In this type of scenario, we can see that User 1 and User 2 give nearly similar ratings to the
movies, so we can conclude that Movie 3 is also likely to be liked reasonably well by User 1,
while Movie 4 will be a good recommendation for User 2. We can also see that there are users
with different tastes: User 1 and User 3 are opposites of each other.
One can see that User 3 and User 4 have a common interest in the movies; on that basis we
can say that Movie 4 is also likely to be disliked by User 4. This is Collaborative Filtering:
we recommend to users the items which are liked by users with similar interests.

Cosine Distance:

We can also use the cosine distance between the users to find users with similar interests;
a larger cosine implies a smaller angle between two users, and hence more similar interests.
We can compute the cosine between two users from the utility matrix, giving the value zero
to all the unfilled columns to make the calculation easy. If we get a smaller cosine, then there
is a larger distance between the users, and if the cosine is larger, then we have a small angle
between the users, and we can recommend them similar things.
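A minimal sketch of this computation on an invented utility matrix (rows are users, columns are items, 0 marks an unrated item):

import numpy as np

# Invented utility matrix: rows = users, columns = movies, 0 = not rated.
ratings = np.array([
    [5, 4, 0, 1],   # User 1
    [4, 5, 4, 0],   # User 2
    [1, 2, 0, 5],   # User 3
    [0, 1, 2, 4],   # User 4
])

def cosine(u, v):
    # Cosine of the angle between two rating vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(ratings[0], ratings[1]))   # larger value -> more similar users
print(cosine(ratings[0], ratings[2]))   # smaller value -> less similar users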
Rounding the Data:

In collaborative filtering we round off the data to compare it more easily; for example, we
can assign ratings below 3 as 0 and ratings of 3 and above as 1, which helps us compare the
data more easily.
Taking the previous example and applying this rounding-off process, you can see how much
more readable the data has become: User 1 and User 2 are more similar, and User 3 and
User 4 are more alike.
Normalizing Rating:

In the process of normalizing, we take the average rating of a user and subtract it from each
of the user's ratings, so we get either positive or negative values as ratings, which can then
simply be classified further into similar groups. By normalizing the data we can form clusters
of the users who give similar ratings to similar items, and then we can use these clusters to
recommend items to the users.

Unit 3:

Probability for Machine Learning


Know how Probability strongly influences the way you understand and
implement Machine Learning

When implementing machine learning algorithms, you may have come across situations
where the environment your algorithm is in is non-deterministic, i.e., you cannot
guarantee the same output for the same input every time. Similarly, in the real world there are
scenarios where the behaviour can vary even though the input remains the same.
Uncertainty exists no matter what. As machine learning involves huge amounts of
data, multiple hyperparameters, and a complex environment, uncertainties are bound to exist.
They can be in the form of missing variables, incomplete modeling, or the data itself being
probabilistic.

Bayes’ Theorem
Bayes’ theorem describes how the conditional probability of an event or a
hypothesis can be computed using evidence and prior knowledge. It is similar to
concluding that our code has no bugs given the evidence that it has passed all the
test cases, including our prior belief that we have rarely observed any bugs in our
code. However, this intuition goes beyond that simple hypothesis test where there
are multiple events or hypotheses involved (let us not worry about this for the
moment).

The Bayes’ theorem is given by:

$P(\theta|X) = \frac{P(X|\theta)P(\theta)}{P(X)}$

I will now explain each term in Bayes’ theorem using the above example. Consider
the hypothesis that there are no bugs in our code. $\theta$ and $X$ denote that our
code is bug free and passes all the test cases respectively.

Naïve Bayes Classifier Algorithm


o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of
an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.

Why is it called Naïve Bayes?


The Naïve Bayes algorithm is made up of the two words Naïve and Bayes, which can be
described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis
of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence
each feature individually contributes to identifying it as an apple, without depending on the
others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) P(A) / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
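A minimal sketch of a Naïve Bayes text classifier in scikit-learn, in the spirit of the spam-filtering example mentioned above; the tiny message set and labels are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy messages: 1 = spam, 0 = not spam.
messages = ["win a free prize now", "meeting at noon tomorrow",
            "free offer claim prize", "project report attached"]
labels = [1, 0, 1, 0]

# Word counts as features, then a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["claim your free prize"]))    # expected to lean towards spam
print(model.predict_proba(["see you at the meeting"]))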

Logistic Regression in Machine Learning


o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, true or false,
etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie
between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how they are used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is
used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The below image
is showing the logistic function:

Note: Logistic regression uses the concept of predictive modeling as regression; therefore, it is
called logistic regression, but it is used to classify samples and therefore falls under the
classification algorithms.

Logistic Function (Sigmoid Function):


o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function
or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the probability
of either 0 or 1: values above the threshold tend towards 1, and values below the threshold
tend towards 0.

Assumptions for Logistic Regression:


o The dependent variable must be categorical in nature.
o The independent variable should not have multi-collinearity.
Logistic Regression Equation:
The Logistic Regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get the Logistic Regression equation are given below:
o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression y can be between 0 and 1 only, so let's divide the above
equation by (1-y):

y / (1-y), which is 0 for y = 0 and infinity for y = 1

o But we need a range between -[infinity] and +[infinity], so taking the logarithm of the
equation, it becomes:

log[y / (1-y)] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called as support vectors, and hence algorithm is termed as Support Vector
Machine. Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs; if we want a model that
can accurately identify whether it is a cat or a dog, such a model can be created using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it
can learn about the different features of cats and dogs, and then we test it with this strange
creature. Since the support vector machine creates a decision boundary between these two
classes (cat and dog) and chooses the extreme cases (support vectors), it will consider the
extreme cases of cat and dog. On the basis of the support vectors, it will classify it as a cat.
Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can
be classified into two classes by using a single straight line, then such data is termed as
linearly separable data, and classifier is used called as Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which means if a
dataset cannot be classified by using a straight line, then such data is termed as non-linear
data and classifier used is called as Non-linear SVM classifier.
Major Kernel Functions in Support Vector
Machine (SVM)
A kernel function is a method used to take data as input and transform it into the required form for
processing. The term "kernel" is used because a set of mathematical functions in the Support Vector
Machine provides a window for manipulating the data. The kernel function generally transforms the
training data so that a non-linear decision surface becomes a linear equation in a higher-dimensional
space. Basically, it returns the inner product between two points in a standard feature dimension.
Standard Kernel Function Equation:

In general form, K(x, y) = <φ(x), φ(y)>, the inner product of the two points after mapping them with
φ into the higher-dimensional feature space.

Major Kernel Functions :-


For Implementing Kernel Functions, first of all, we have to install the “scikit-learn”
library using the command prompt terminal:

pip install scikit-learn


● Gaussian Kernel: It is used to perform transformation when there is no prior
knowledge about the data; K(x, y) = exp(-||x - y||^2 / (2*sigma^2)).

● Gaussian Kernel Radial Basis Function (RBF): Same as the above kernel function, with the
radial basis method added to improve the transformation; K(x, y) = exp(-gamma * ||x - y||^2), gamma > 0.
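
As an illustration, here is a minimal sketch, assuming scikit-learn is installed, that compares a linear kernel with the Gaussian/RBF kernel; the synthetic concentric-circles dataset is an illustrative choice:

# Comparing a linear kernel with the Gaussian/RBF kernel on non-linearly separable data.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale")
    clf.fit(X_train, y_train)
    print(kernel, "accuracy:", round(clf.score(X_test, y_test), 3))

# The RBF kernel typically separates the concentric circles almost perfectly,
# while the linear kernel cannot, which illustrates the effect of the kernel trick.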
Unit:4

Introduction to Computational Learning Theory


Computational learning theory, or statistical learning theory, refers to mathematical frameworks
for quantifying learning tasks and algorithms.

These are sub-fields of machine learning that a machine learning practitioner does not need to
know in great depth in order to achieve good results on a wide range of problems. Nevertheless, it
is a sub-field where having a high-level understanding of some of the more prominent methods
may provide insight into the broader task of learning from data.

● Computational learning theory uses formal methods to study learning tasks and learning
algorithms.
● PAC learning provides a way to quantify the computational difficulty of a machine learning task.
● VC Dimension provides a way to quantify the computational capacity of a machine learning
algorithm.
This tutorial is divided into three parts; they are:

1. Computational Learning Theory


2. PAC Learning (Theory of Learning Problems)
3. VC Dimension (Theory of Learning Algorithms)
Computational Learning Theory
Computational learning theory, or CoLT for short, is a field of study concerned with the use of
formal mathematical methods applied to learning systems.
It seeks to use the tools of theoretical computer science to quantify learning problems. This
includes characterizing the difficulty of learning specific tasks.

Computational learning theory may be thought of as an extension or sibling of statistical learning theory, or SLT for short, which uses formal methods to quantify learning algorithms.
Computational Learning Theory (CoLT): Formal study of learning tasks.
Statistical Learning Theory (SLT): Formal study of learning algorithms.
Together, these fields seek to answer questions such as:
● How do we know a model has a good approximation for the target function?
● What hypothesis space should be used?
● How do we know if we have a local or globally good solution?
● How do we avoid overfitting?
● How many data examples are needed?
As a machine learning practitioner, it can be useful to know about computational learning theory
and some of the main areas of investigation. The field provides a useful grounding for what we
are trying to achieve when fitting models on data, and it may provide insight into the methods.

There are many subfields of study, although perhaps two of the most widely discussed areas of
study from computational learning theory are:

● PAC Learning.
● VC Dimension.
Tersely, we can say that PAC Learning is the theory of machine learning problems and VC
dimension is the theory of machine learning algorithms.

You may encounter the topics as a practitioner and it is useful to have a thumbnail idea of what
they are about. Let’s take a closer look at each.


PAC Learning (Theory of Learning Problems)


Probably approximately correct learning, or PAC learning, refers to a theoretical machine
learning framework developed by Leslie Valiant.
PAC learning seeks to quantify the difficulty of a learning task and might be considered the
premier sub-field of computational learning theory.

Consider that in supervised learning, we are trying to approximate an unknown underlying mapping function from inputs to outputs. We don't know what this mapping function looks like, but we suspect it exists, and we have examples of data produced by the function.

PAC learning is concerned with how much computational effort is required to find a hypothesis
(fit model) that is a close match for the unknown target function.

The idea is that a bad hypothesis will be found out based on the predictions it makes on new data,
e.g. based on its generalization error.

A hypothesis that gets most or a large number of predictions correct, e.g. has a small
generalization error, is probably a good approximation for the target function.

Using formal methods, a maximum acceptable generalization error can be specified for a supervised learning task. The framework can then be used to estimate the expected number of samples from the problem domain that would be required to determine whether a hypothesis is PAC or not. That is, it provides a way to estimate the number of samples required to find a PAC hypothesis.

The goal of the PAC framework is to understand how large a data set needs to be in order to give
good generalization. It also gives bounds for the computational cost of learning …
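
As an illustration of such a bound (a standard result for a finite, realizable hypothesis space, not derived in this text), roughly m >= (1/epsilon) * (ln|H| + ln(1/delta)) training examples suffice to find, with probability at least 1 - delta, a hypothesis whose generalization error is at most epsilon. A small sketch with purely illustrative numbers:

# Standard PAC sample-complexity bound for a finite, realizable hypothesis space:
#   m >= (1/eps) * (ln|H| + ln(1/delta))
import math

def pac_sample_size(hypothesis_space_size, epsilon, delta):
    """Examples sufficient to find a hypothesis with error <= epsilon,
    with probability >= 1 - delta (finite, consistent-learner case)."""
    return math.ceil((math.log(hypothesis_space_size) + math.log(1 / delta)) / epsilon)

print(pac_sample_size(hypothesis_space_size=10_000, epsilon=0.05, delta=0.01))
# -> 277 examples under these illustrative assumptions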

VC Dimension (Theory of Learning Algorithms)


Vapnik-Chervonenkis theory, or VC theory for short, is a theoretical framework developed by Vladimir Vapnik and Alexey Chervonenkis.

VC theory is composed of many elements, most notably the VC dimension.


The VC dimension quantifies the complexity of a hypothesis space, e.g. the models that could be
fit given a representation and learning algorithm.

One way to consider the complexity of a hypothesis space (space of models that could be fit) is
based on the number of distinct hypotheses it contains and perhaps how the space might be
navigated. The VC dimension is a clever approach that instead measures the number of examples
from the target problem that can be discriminated by hypotheses in the space.

For example, a line (hypothesis space) can be used to shatter three points, but not four points.

Any labelling of three non-collinear points on a 2D plane with class labels 0 or 1 can be "correctly" split by a line, i.e. the points can be shattered. But for four points on the plane, there is always some assignment of binary class labels that cannot be correctly split by a line, i.e. four points cannot be shattered. Instead, a more flexible hypothesis class, such as ovals, would be needed.
The figure below makes this clear.

Therefore, the VC dimension of a machine learning algorithm is the largest number of data points
in a dataset that a specific configuration of the algorithm (hyperparameters) or specific fit model
can shatter.

A classifier that predicts the same value in all cases will have a VC dimension of 0, since it cannot shatter even a single point. A large VC dimension indicates that an algorithm is very flexible, although this flexibility may come at the cost of an additional risk of overfitting.
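
The shattering argument can also be checked empirically. The sketch below is an illustration of ours (assuming scikit-learn; the point placements are arbitrary choices): it tries every labelling of a point set with a linear SVM and reports whether the set can be shattered.

# Empirical check that a line shatters 3 points but not this 4-point placement.
from itertools import product
import numpy as np
from sklearn.svm import SVC

def can_shatter(points):
    """Return True if a linear classifier realizes every labelling of `points`."""
    for labels in product([0, 1], repeat=len(points)):
        if len(set(labels)) == 1:
            continue  # constant labellings are trivially realizable by a line
        clf = SVC(kernel="linear", C=1e6)  # (almost) hard-margin linear SVM
        clf.fit(points, labels)
        if not np.array_equal(clf.predict(points), labels):
            return False
    return True

three = np.array([[0, 0], [1, 0], [0, 1]])
four = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])  # XOR-style placement
print(can_shatter(three))  # True  -> three points can be shattered by a line
print(can_shatter(four))   # False -> this four-point placement cannot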

Sample complexity

The sample complexity of a machine learning algorithm is the number of training samples it needs in order to successfully learn a target function.
More precisely, the sample complexity is the number of training samples we need to supply to the algorithm so that the function it returns is within an arbitrarily small error of the best possible function, with probability arbitrarily close to 1.
There are two variants of sample complexity:

● The weak variant fixes a particular input-output distribution;


● The strong variant takes the worst-case sample complexity over all input-output distributions.
Ensemble learning:

1. Introduction to Ensemble Learning


Let’s understand the concept of ensemble learning with an example. Suppose you
are a movie director and you have created a short movie on a very important and
interesting topic. Now, you want to take preliminary feedback (ratings) on the
movie before making it public. What are the possible ways by which you can do
that?

A: You may ask one of your friends to rate the movie for you.
Now it’s entirely possible that the person you have chosen loves you very much
and doesn’t want to break your heart by providing a 1-star rating to the horrible
work you have created.

B: Another way could be by asking 5 colleagues of yours to rate the movie.


This should provide a better idea of the movie. This method may provide honest
ratings for your movie. But a problem still exists. These 5 people may not be
“Subject Matter Experts” on the topic of your movie. Sure, they might understand
the cinematography, the shots, or the audio, but at the same time may not be the
best judges of dark humour.

C: How about asking 50 people to rate the movie?


Some of which can be your friends, some of them can be your colleagues and some
may even be total strangers.

The responses, in this case, would be more generalized and diversified since now
you have people with different sets of skills. And as it turns out – this is a better
approach to get honest ratings than the previous cases we saw.

With these examples, you can infer that a diverse group of people is likely to make better decisions than individuals. The same is true for a diverse set of models in comparison with single models. This diversification in Machine Learning is achieved by a technique called Ensemble Learning.

Now that you have got a gist of what ensemble learning is – let us look at the
various techniques in ensemble learning along with their implementations.

2. Simple Ensemble Techniques


In this section, we will look at a few simple but powerful techniques, namely:

1. Max Voting
2. Averaging
3. Weighted Averaging
2.1 Max Voting
The max voting method is generally used for classification problems. In this
technique, multiple models are used to make predictions for each data point. The
predictions by each model are considered as a ‘vote’. The predictions which we get
from the majority of the models are used as the final prediction.

For example, when you asked 5 of your colleagues to rate your movie (out of 5);
we’ll assume three of them rated it as 4 while two of them gave it a 5. Since the
majority gave a rating of 4, the final rating will be taken as 4. You can consider
this as taking the mode of all the predictions.

The result of max voting would be something like this:

Colleague 1 Colleague 2 Colleague 3 Colleague 4 Colleague 5 Final rating

5 4 5 4 4 4
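
A minimal sketch of max (hard) voting, assuming scikit-learn; the built-in breast-cancer dataset and the three base models are illustrative choices:

# Hard/max voting: each model casts one vote and the majority class wins.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=5000)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    voting="hard",  # take the mode of the individual predictions
)
vote.fit(X_train, y_train)
print("Max-voting accuracy:", vote.score(X_test, y_test))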

2.2 Averaging
Similar to the max voting technique, multiple predictions are made for each data
point in averaging. In this method, we take an average of predictions from all the
models and use it to make the final prediction. Averaging can be used for making
predictions in regression problems or while calculating probabilities for
classification problems.

For example, in the below case, the averaging method would take the average of
all the values.

i.e. (5+4+5+4+4)/5 = 4.4

Colleague 1 Colleague 2 Colleague 3 Colleague 4 Colleague 5 Final rating

5 4 5 4 4 4.4
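
A minimal sketch of averaging predicted class probabilities from two models, again assuming scikit-learn; the dataset and the two models are illustrative choices:

# Averaging: take the mean of the predicted probabilities of several models.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

p1 = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict_proba(X_test)
p2 = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).predict_proba(X_test)

avg = (p1 + p2) / 2                      # simple average of the probabilities
final_pred = np.argmax(avg, axis=1)      # pick the class with the higher average
print("Averaging accuracy:", (final_pred == y_test).mean())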

2.3 Weighted Average


This is an extension of the averaging method. All models are assigned different
weights defining the importance of each model for prediction. For instance, if two
of your colleagues are critics, while others have no prior experience in this field,
then the answers by these two friends are given more importance as compared to
the other people.

The result is calculated as [(5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18)] = 4.41.

          Colleague 1   Colleague 2   Colleague 3   Colleague 4   Colleague 5   Final rating

weight    0.23          0.23          0.18          0.18          0.18

rating    5             4             5             4             4             4.41
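
The same weighted-average calculation can be reproduced with numpy (assuming numpy is installed):

# Weighted average of the five ratings with the weights from the table above.
import numpy as np

ratings = np.array([5, 4, 5, 4, 4])
weights = np.array([0.23, 0.23, 0.18, 0.18, 0.18])   # weights sum to 1
print(np.average(ratings, weights=weights))          # -> approximately 4.41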

3. Advanced Ensemble techniques


Now that we have covered the basic ensemble techniques, let’s move on to
understanding the advanced techniques.

3.1 Stacking
Stacking is an ensemble learning technique that uses predictions from multiple
models (for example decision tree, knn or svm) to build a new model. This model
is used for making predictions on the test set. Below is a step-wise explanation for
a simple stacked ensemble:

1. The train set is split into 10 parts.

2. A base model (suppose a decision tree) is fitted on 9 parts and predictions are made for the 10th part. This is done for each part of the train set.

3. The base model (in this case, decision tree) is then fitted on the whole train
dataset.
4. Using this model, predictions are made on the test set.
5. Steps 2 to 4 are repeated for another base model (say knn) resulting in
another set of predictions for the train set and test set.

6. The predictions from the train set are used as features to build a new model.

7. This model is used to make the final predictions on the test set, as shown in the sketch below.
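
A minimal sketch of these steps, assuming scikit-learn, whose StackingClassifier carries out the out-of-fold prediction procedure internally; the base models, meta-model and dataset are illustrative choices:

# Stacking: a decision tree and a KNN model as base learners, with logistic
# regression as the meta-model trained on their out-of-fold predictions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=5000),
    cv=10,  # 10-fold out-of-fold predictions, matching the 10 parts above
)
stack.fit(X_train, y_train)
print("Stacking accuracy:", stack.score(X_test, y_test))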

UNIT 5

Clustering in Machine Learning


Introduction to Clustering
It is basically a type of unsupervised learning method. An unsupervised learning
method is a method in which we draw references from datasets consisting of input
data without labeled responses. Generally, it is used as a process to find meaningful
structure, explanatory underlying processes, generative features, and groupings
inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other and dissimilar to the data points in other groups. It is basically a grouping of objects on the basis of similarity and dissimilarity between them.
For example, the data points in the graph below that are clustered together can be classified into one single group; we can distinguish the clusters and identify that there are 3 clusters in the picture.
It is not necessary for clusters to be spherical, for example:

Clustering Methods :
● Density-Based Methods: These methods consider clusters as dense regions having some similarity, separated from the lower-density regions of the space. These methods have good accuracy and the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
● Hierarchical Based Methods: The clusters formed in this method form a tree-type structure based on the hierarchy. New clusters are formed using the previously formed ones. It is divided into two categories:
● Agglomerative (bottom-up approach)
● Divisive (top-down approach)
Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), etc.
● Partitioning Methods: These methods partition the objects into k clusters, and each partition forms one cluster. They optimize an objective criterion (a similarity function), for example when distance is the major parameter. Examples: K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.
● Grid-based Methods: In these methods, the data space is formulated into a finite number of cells that form a grid-like structure. All the clustering operations done on these grids are fast and independent of the number of data objects. Examples: STING (Statistical Information Grid), WaveCluster, CLIQUE (Clustering In QUEst), etc.
Clustering Algorithms:
K-means clustering algorithm – It is the simplest unsupervised learning algorithm that solves the clustering problem. The K-means algorithm partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster.
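
A minimal sketch of k-means, assuming scikit-learn; the synthetic blob data and k = 3 are illustrative choices:

# K-means on synthetic data: each point is assigned to the nearest cluster mean.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)           # cluster index for every observation
print(km.cluster_centers_)           # the 3 learned cluster means (prototypes)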

Applications of Clustering in different fields


● Marketing: It can be used to characterize and discover customer segments for marketing purposes.
● Biology: It can be used for classification among different species of plants and animals.
● Libraries: It is used for clustering different books on the basis of topics and information.
● Insurance: It is used to understand customers and their policies and to identify fraud.
● City Planning: It is used to group houses and to study their values based on their geographical locations and other factors.
● Earthquake Studies: By learning the earthquake-affected areas, we can determine the dangerous zones.
Key Features of K-means Clustering

Find below some key features of k-means clustering;

1. It is easy to interpret and to apply.
2. For a large number of variables in the dataset, K-means operates faster than hierarchical clustering.
3. When the cluster centres are recomputed, an instance can move to a different cluster.
4. K-means tends to form compact clusters.
5. It can work on unlabeled numerical data.
6. Moreover, it is fast, robust and uncomplicated to understand, and it yields the best results when the clusters in the dataset are distinct (well separated) from each other.
Limitations of K-means Clustering

The following are a few limitations with K-Means clustering;

1. It is sometimes quite difficult to determine the number of clusters, i.e. the value of k, in advance.
2. The output is highly influenced by the initial input, for example the chosen number of clusters.
3. The arrangement (ordering) of the data can substantially affect the final results.
4. When clusters have complex spatial shapes, k-means is not a good choice.
5. K-means is also sensitive to rescaling: if the data are rescaled by normalization or standardization, the output can change entirely.


Disadvantages of K-means Clustering

1. The algorithm requires the number of clusters/centres to be specified in advance.
2. The algorithm performs poorly on non-linearly separable data and is unable to deal with noisy data and outliers.
3. It is not directly applicable to categorical data, since it only operates where a mean can be defined.
4. Euclidean distance can weight the underlying factors unequally.
5. The algorithm is not invariant to non-linear transformations, i.e. it gives different results for different representations of the data.

Adaptive Hierarchical Clustering

Hierarchical clustering, also known as hierarchical cluster analysis or HCA, is another unsupervised machine learning approach for grouping unlabeled datasets into clusters.

The hierarchy of clusters is developed in the form of a tree in this technique, and this tree-shaped structure is known as a dendrogram.

Simply speaking, hierarchical clustering is about separating data into groups based on some measure of similarity, finding a way to quantify how the groups are alike and how they differ, and narrowing down the data.

Approaches of Hierarchical Clustering


1. Agglomerative clustering:

Agglomerative clustering is a bottom-up strategy in which each data point initially forms a cluster of its own, and pairs of clusters are merged as one moves up the hierarchy. At each step, the two nearest clusters are taken and joined to form a single cluster.

2. Divisive clustering:

The divisive clustering algorithm is a top-down strategy in which all points in the dataset are initially assigned to one cluster, which is then split iteratively as one moves down the hierarchy. At each step, the points that differ the most are separated into different clusters, and this process continues until the desired number of clusters is obtained.
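
A minimal sketch of agglomerative (bottom-up) clustering, assuming scikit-learn and scipy are installed; the synthetic data and the "ward" linkage are illustrative choices:

# Agglomerative clustering plus the linkage matrix that encodes the dendrogram.
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Flat clustering: cut the hierarchy at 3 clusters
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
print(agg.fit_predict(X)[:10])       # cluster labels for the first 10 points

# The linkage matrix Z encodes the full tree of merges (the dendrogram)
Z = linkage(X, method="ward")
tree = dendrogram(Z, no_plot=True)   # build the dendrogram structure without plotting
print(Z.shape)                       # (n_samples - 1, 4): one row per merge step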

Gaussian Mixture Model

Suppose there is a set of data points that needs to be grouped into several parts or clusters based on their similarity. In machine learning, this is known as clustering.
There are several methods available for clustering, such as:

● K Means Clustering
● Hierarchical Clustering
● Gaussian Mixture Models

What are Gaussian mixture models (GMM)?

The Gaussian mixture model is a clustering algorithm that is used to discover the underlying groups in data. It can be understood as a probabilistic model in which a Gaussian distribution is assumed for each group, with a mean and a covariance that define its parameters. The parameters of a GMM therefore consist of mean vectors (μ) and covariance matrices (Σ), together with the mixing weights of the components. A Gaussian distribution is a continuous probability distribution that takes on a bell-shaped curve; another name for the Gaussian distribution is the normal distribution. Here is a picture of Gaussian mixture models:
What is expectation-maximization (EM) method in
relation to GMM?
In Gaussian mixture models, the expectation-maximization (EM) method is used to find the Gaussian mixture model parameters. Expectation is termed E and maximization is termed M. In the expectation step, the current parameter estimates are used to compute, for each data point, the probability (responsibility) that it belongs to each Gaussian component. In the maximization step, the means, covariances and mixing weights of the components are updated using maximum likelihood estimates based on those responsibilities.
Expectation-maximization is therefore a two-step iterative algorithm that alternates between the expectation step and the maximization step, and this iterative process is repeated until the Gaussian parameters converge. Here is a picture representing the two-step iterative aspect of the algorithm:
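
A minimal sketch of fitting a Gaussian mixture model with EM, assuming scikit-learn; the synthetic data and the choice of three components are illustrative:

# Fit a GMM with EM and inspect its parameters.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                              # runs EM until the parameters converge

print(gmm.means_)                       # mean vector (mu) of each component
print(gmm.covariances_.shape)           # one covariance matrix (Sigma) per component
labels = gmm.predict(X)                 # most probable component for each point
probs = gmm.predict_proba(X)            # soft responsibilities from the E step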
