
B.TECH (SEM-V) THEORY EXAMINATION 2020-21


MACHINE LEARNING TECHNIQUES

Time: 3 Hours Total Marks: 100

Section A

a) Explain the concept of machine learning.

Ans: Machine learning is getting computers to program themselves. If programming is automation, then
machine learning is automating the process of automation.

Writing software is the bottleneck: we don't have enough good developers. Let the data do the work
instead of people. Machine learning is the way to make programming scalable.

• Traditional Programming: Data and a program are run on the computer to produce the output.

• Machine Learning: Data and the output are run on the computer to create a program. This program
can be used in traditional programming.

Machine learning is like farming or gardening. The seeds are the algorithms, the nutrients are the data,
the gardener is you, and the plants are the programs.

b) Compare ANN and Bayesian Network

Ans:

S.No. | ANN | BNN
1. | Short for Artificial Neural Network. | Short for Biological Neural Network.
2. | Processing speed is fast compared to a Biological Neural Network. | Slow in processing information.
3. | Allocating storage to a new process is strictly irreplaceable, as the old location is saved for the previous process. | Allocating storage to a new process is easy, as it is added just by adjusting the interconnection strengths.
4. | Processes operate in sequential mode. | Processes can operate in massively parallel fashion.
5. | If any information gets corrupted in memory, it cannot be retrieved. | Information is distributed throughout the network into sub-nodes, so even if it gets corrupted it can be retrieved.
6. | Activities are continuously monitored by a control unit. | There is no control unit to monitor the information being processed in the network.

c) What is the difference between linear and logistic regression?

Ans:

Linear Regression | Logistic Regression
1. Target is an interval variable. | 1. Target is a discrete (binary or ordinal) variable.
2. Input variables have any measurement level. | 2. Input variables have any measurement level.
3. Predicted values are the mean of the target variable at the given values of the input variables. | 3. Predicted values are the probability of a particular level (or levels) of the target variable.

d) Discuss support vectors in SVM?

Ans: Support vectors are the data points that determine the optimal hyperplane. They lie closest to
the hyperplane and are the most difficult to classify, and the position of the decision hyperplane
depends on them. A Support Vector Machine (SVM) uses these points to maximize the margin, i.e. the
space around the hyperplane. The inputs, outputs and weights of an SVM are similar to those of a
neural network, as listed below.

Inputs: The SVM network can contain n inputs, say x1, x2, ..., xi, ..., xn.

Outputs: The target output t.

Weights: As in a neural network, weights w1, w2, ..., wn are associated with the inputs, and their
linear combination with the inputs predicts the output y.

e) Discuss overfitting and underfitting situation in decision tree learning.

Ans: Overfitting occurs when a machine learning model tries to cover all the data points, or more than
the required data points, present in the given dataset. Because of this, the model starts capturing the
noise and inaccurate values present in the dataset, and this reduces the efficiency and accuracy of
the model on new data. An overfitted model has low bias and high variance. The chance of overfitting
increases the more we train the model. In decision tree learning this happens when the tree is grown
so deep that its leaves fit individual training examples; pruning or limiting the depth counters it.

Overfitting is the main problem that occurs in supervised learning.

Underfitting occurs when a machine learning model is not able to capture the underlying trend of the
data. If training is stopped too early in an attempt to avoid overfitting, the model may not learn
enough from the training data. As a result, it may fail to find the best fit of the dominant trend in
the data. A decision tree that is too shallow underfits in this way.

In the case of underfitting, the model is not able to learn enough from the training data, and hence it
has reduced accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.

f) What is the task of the E-step of the EM-algorithm?

Ans: The E step starts with a fixed θ(t), and attempts to maximize the lower bound(LB) function F(q(z), θ)
with respect to q(z). Intuitively, this happens when the LB function meets the objective likelihood
function. Mathematically, this is the case because the likelihood function is independent of q(z), and so
maximizing lower bound is equivalent to minimizing the KL divergence of q(z) and p(z|y, θ(t)). Therefore,
the E step gives q(z) = p(z|y, θ(t)).
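
As a rough illustration of the result q(z) = p(z|y, θ(t)), the sketch below computes that posterior for a one-dimensional Gaussian mixture. The mixture model, the function names and the parameter values are assumptions made for the example; the answer above does not fix a particular model.

```python
import math


def normal_pdf(y, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    coeff = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((y - mean) ** 2) / (2.0 * std ** 2))


def e_step(data, mix_weights, means, stds):
    """Return q(z) = p(z | y, theta) for every observation y."""
    posteriors = []
    for y in data:
        # Joint p(y, z = k | theta) for each mixture component k.
        joint = [w * normal_pdf(y, m, s)
                 for w, m, s in zip(mix_weights, means, stds)]
        evidence = sum(joint)  # p(y | theta)
        posteriors.append([j / evidence for j in joint])
    return posteriors


# Hypothetical current parameters theta(t) and observations.
q = e_step(data=[0.0, 10.0, 5.0], mix_weights=[0.5, 0.5],
           means=[0.0, 10.0], stds=[1.0, 1.0])
```

Each row of `q` sums to 1; the M step would then re-estimate the parameters using these posteriors as soft assignments.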

g) Define the learning classifiers.

Ans: Learning classifier systems, or LCS, are a paradigm of rule-based machine learning methods that
combine a discovery component (typically a genetic algorithm) with a learning component
(performing either supervised learning, reinforcement learning, or unsupervised learning). Learning
classifier systems seek to identify a set of context-dependent rules that collectively store and apply
knowledge in a piecewise manner in order to make predictions (e.g. behavior modeling, classification,
data mining, regression, function approximation, or game strategy).

h) What is the difference between machine learning and deep learning?

Ans:

S.No. | Machine Learning | Deep Learning
1. | Machine Learning is a superset of Deep Learning. | Deep Learning is a subset of Machine Learning.
2. | Data representation is quite different from Deep Learning, as it uses structured data. | Data representation is quite different, as it uses neural networks (ANN).
3. | Machine Learning is an evolution of AI. | Deep Learning is an evolution of Machine Learning. Basically, it is how deep the machine learning is.
4. | Works with thousands of data points. | Big data: millions of data points.
5. | Outputs: numerical values, like a classification or a score. | Outputs: anything from numerical values to free-form elements, such as free text and sound.
6. | Uses various types of automated algorithms that turn into model functions and predict future actions from data. | Uses a neural network that passes data through processing layers to interpret data features and relations.

i)What objective function do regression trees minimize?

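Ans: Regression trees minimize the sum of squared errors (SSE): each split is chosen so that the total
squared difference between the target values and the mean of the leaf they fall into is as small as
possible. There are several advantages to regression trees:

• They are very interpretable.
• Making predictions is fast (no complicated calculations, just looking up constants in the tree).
• It's easy to understand which variables are important in making the prediction. The internal
nodes (splits) are the variables that reduced the SSE the most.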
 If some data is missing, we might not be able to go all the way down the tree to a leaf, but we
can still make a prediction by averaging all the leaves in the sub-tree we do reach.
 The model provides a non-linear “jagged” response, so it can work when the true regression
surface is not smooth. If it is smooth, though, the piecewise-constant surface can approximate it
arbitrarily closely (with enough leaves).
 There are fast, reliable algorithms to learn these trees.
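
To make the SSE objective concrete, here is a minimal sketch of how a single split on one numeric feature could be chosen. The function names and the toy data are assumptions for illustration; real tree learners repeat this search over every feature and recurse on the two halves.

```python
def sse(values):
    """Sum of squared errors of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) of the split on x that minimises the SSE."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_total = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between two equal x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_total:
            best_threshold, best_total = threshold, total
    return best_threshold, best_total


# Hypothetical data: two flat groups, so one split removes all the error.
threshold, total = best_split([1, 2, 3, 10, 11, 12], [5, 5, 5, 20, 20, 20])
```

Each leaf then predicts the mean of its training targets, which is the constant that minimises the squared error inside that leaf.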
j) What is the difference between Q learning and deep Q learning?
Ans: Deep Q-learning is Q-learning in which the table of Q-values is replaced by a neural network,
the deep Q-network (DQN). The second column below describes that network.

Features | Q-learning (QL) | Deep Q-network (DQN)
Is it an RL algorithm? | Yes | No (unless you use DQN to refer to DQL, which is done often!)
Does it use neural networks? | No. It uses a table. | No. DQN is the neural network.
Is it a model? | No | Yes (but usually not in the RL sense)
Can it deal with continuous state spaces? | No (unless you discretize them) | Yes (in the sense that it can get real-valued inputs for the states)
Can it deal with continuous action spaces? | Yes (but maybe not a good idea) | Yes (but only in the sense that it can produce real-valued outputs for actions)
Does it converge? | Yes | Not necessarily
Is it an online learning algorithm? | Yes | No, but it can be used in an online learning setting

Section B
Q2 a) Apply KNN to the following dataset and predict the class of the test example (A1=3, A2=7).
Assume K=3.

A1 A2 Class
7 7 True
7 4 True
3 4 False
1 4 True
5 3 False
6 3 True

Soln.

A1   A2   Squared distance to (3,7) = (A1-3)^2 + (A2-7)^2   Rank of distance
7    7    16                                                3
7    4    25                                                5
3    4    9                                                 1
1    4    13                                                2
5    3    20                                                4
6    3    25                                                5

Finding whether an entry is included in the K nearest neighbours or not:

A1   A2   Rank of distance   Included?   Category of input
7    7    3                  Yes         True
7    4    5                  No          -
3    4    1                  Yes         False
1    4    2                  Yes         True
5    3    4                  No          -
6    3    5                  No          -

Since we have 2 True and 1 False among the 3 nearest neighbours, the final answer for (3, 7) is True.
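
The hand calculation above can be checked with a short sketch. The function name `knn_predict` is an assumption for the example; the data and K=3 come from the question.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority class among the k training points closest to `query`."""
    # Squared Euclidean distance preserves the ranking, so no sqrt is needed.
    ranked = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train = [
    ((7, 7), True), ((7, 4), True), ((3, 4), False),
    ((1, 4), True), ((5, 3), False), ((6, 3), True),
]
prediction = knn_predict(train, query=(3, 7), k=3)
```

With these six rows the three nearest points are (3,4), (1,4) and (7,7), so the vote is 2 True against 1 False.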

2.
b) Describe the Kohonen Self-Organizing maps and its algorithm.
Ans: A Self-Organizing Map (also called a Kohonen map or SOM) is a type of artificial neural network
inspired by biological models of neural systems from the 1970s. It follows an
unsupervised learning approach and trains its network through a competitive learning
algorithm. SOM is used for clustering and mapping (dimensionality reduction): it maps
multidimensional data onto a lower-dimensional space, which reduces complex
problems for easier interpretation. A SOM has two layers, an input layer and an
output layer. With two clusters and n input features, the output layer has two units, each
connected to all n inputs.

Suppose the input data has size (m, n), where m is the number of training examples and n is the
number of features in each example. First, the weights of size (n, C) are initialized, where C is the
number of clusters. Then, iterating over the input data, for each training example the algorithm
updates the winning vector (the weight vector with the shortest distance, e.g. Euclidean distance,
from the training example). The weight update rule is:
wij = wij(old) + alpha(t) * (xik - wij(old))
where alpha is the learning rate at time t, j denotes the winning vector, i denotes the ith feature of
the training example and k denotes the kth training example from the input data. After training the
SOM network, the trained weights are used for clustering new examples. A new example falls in the
cluster of its winning vector.
Algorithm
Steps involved are:
1. Weight initialization
2. For 1 to N number of epochs
3. Select a training example
4. Compute the winning vector
5. Update the winning vector
6. Repeat steps 3, 4, 5 for all training examples.
7. Clustering the test sample
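
The steps above can be sketched in plain Python as follows. This is a simplified version under stated assumptions: only the winning vector is updated (no neighbourhood function), the learning rate decays linearly, weights are stored as one vector per cluster, and the toy data and names are made up for the example.

```python
import random


def winner(weights, x):
    """Index of the weight vector closest to x (squared Euclidean distance)."""
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
    return dists.index(min(dists))


def train_som(data, n_clusters, epochs=50, alpha=0.5, seed=0):
    rng = random.Random(seed)
    n_features = len(data[0])
    # Step 1: random weight initialisation, one vector per cluster.
    weights = [[rng.random() for _ in range(n_features)]
               for _ in range(n_clusters)]
    for epoch in range(epochs):                  # step 2
        rate = alpha * (1 - epoch / epochs)      # decaying alpha(t)
        for x in data:                           # steps 3 and 6
            j = winner(weights, x)               # step 4
            # Step 5: w = w(old) + alpha(t) * (x - w(old))
            weights[j] = [w + rate * (xi - w) for w, xi in zip(weights[j], x)]
    return weights


data = [[0.0, 0.0], [0.1, 0.1], [0.9, 1.0], [1.0, 0.9]]
weights = train_som(data, n_clusters=2)
cluster_low = winner(weights, [0.0, 0.0])    # step 7: cluster a test sample
cluster_high = winner(weights, [1.0, 1.0])
```

The plus sign in the update moves the winner toward the example, which is what makes each weight vector settle near the centre of its cluster.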
c) Explain the various learning models for reinforcement learning.
Ans: Reinforcement learning is an area of Machine Learning. It is about taking suitable action to
maximize reward in a particular situation. It is employed by various software and machines to find the
best possible behavior or path it should take in a specific situation. Reinforcement learning differs from
the supervised learning in a way that in supervised learning the training data has the answer key with it
so the model is trained with the correct answer itself whereas in reinforcement learning, there is no
answer but the reinforcement agent decides what to do to perform the given task. In the absence of a
training dataset, it is bound to learn from its experience.
Reinforcement Learning Algorithms
There are three approaches to implementing a Reinforcement Learning algorithm.
1. Value-Based:
In a value-based Reinforcement Learning method, you try to maximize a value function V(s). In
this method, the agent estimates the expected long-term return of the current state under policy π.
2. Policy-based:
In a policy-based RL method, you try to come up with a policy such that the action performed in every
state helps you to gain maximum reward in the future.
Two types of policy-based methods are:
• Deterministic: For any state, the same action is produced by the policy π.
• Stochastic: Every action has a certain probability, which is determined by the following equation.
Stochastic policy:
π(a|s) = P[A_t = a | S_t = s]
3. Model-Based:
In this Reinforcement Learning method, you need to create a virtual model for each environment. The
agent learns to perform in that specific environment.
Learning Models of Reinforcement
There are two important learning models in reinforcement learning:
• Markov Decision Process
• Q learning
1. Markov Decision Process
The following parameters are used to get a solution:
• Set of actions - A
• Set of states - S
• Reward - R
• Policy - π
• Value - V
The mathematical approach for mapping a solution in Reinforcement Learning is known as a Markov
Decision Process (MDP).
2. Q-Learning
Q learning is a value-based method of supplying information to inform which action an agent should
take.
Let's understand this method by the following example:
 There are five rooms in a building which are connected by doors.
 Each room is numbered 0 to 4
 The outside of the building can be one big outside area (5)
 Doors number 1 and 4 lead into the building from room 5
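
A minimal tabular Q-learning sketch for this kind of room layout is given below. Only the doors between rooms 1 and 5 and between rooms 4 and 5 come from the text; the remaining doors, the reward of 100 for reaching the outside, the discount factor and the learning rate of 1 are assumptions made for illustration.

```python
import random

# Room -> rooms reachable through one door. Doors 1-5 and 4-5 are from
# the example; the other connections are assumed for this sketch.
DOORS = {0: [4], 1: [3, 5], 2: [3], 3: [1, 2, 4], 4: [0, 3, 5], 5: [1, 4]}
GOAL = 5  # the outside area


def train_q(episodes=500, gamma=0.8, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s, rooms in DOORS.items() for a in rooms}
    for _ in range(episodes):
        state = rng.randrange(5)               # start in a random room 0..4
        while state != GOAL:
            action = rng.choice(DOORS[state])  # explore at random
            reward = 100.0 if action == GOAL else 0.0
            # The goal is treated as terminal, so it has no future value.
            future = 0.0 if action == GOAL else max(
                q[(action, a)] for a in DOORS[action])
            # Q-learning update with learning rate 1 (deterministic moves).
            q[(state, action)] = reward + gamma * future
            state = action
    return q


def greedy_path(q, start):
    """Follow the highest-valued action from `start` until the goal."""
    path = [start]
    while path[-1] != GOAL and len(path) < 10:
        state = path[-1]
        path.append(max(DOORS[state], key=lambda a: q[(state, a)]))
    return path


q_table = train_q()
path = greedy_path(q_table, start=2)
```

After training, following the largest Q-value in each room leads from room 2 to the outside in three moves under the assumed layout.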
Q2 d) Explain the role of genetic algorithm? Discuss the various phases considered in genetic
algorithm.
Ans: The genetic algorithm is based on the genetic structure and behaviour of the chromosomes of a
population. The following things are the foundation of genetic algorithms.
• Each chromosome indicates a possible solution. Thus the population is a collection of
chromosomes.
• A fitness function characterizes each individual in the population. The greater the fitness, the
better the solution.
• Out of the available individuals in the population, the best individuals are used to reproduce the
next generation of offspring.
• The offspring produced will have features of both parents and may also be the result of mutation. A
mutation is a small change in the gene structure.
Phases of Genetic Algorithm
1. Initialization of Population(Coding)
 Every gene represents a parameter (variables) in the solution. This collection of parameters that
forms the solution is the chromosome. The population is a collection of chromosomes.
 Order of genes on the chromosome matters.
 Most of the time, chromosomes are depicted in binary as 0’s and 1’s, but there are also other
encodings possible.
2. Fitness Function
 Out of the available chromosomes, we have to select the best ones to reproduce offspring, so
each chromosome is given a fitness value.
 The fitness score helps to select the individuals who will be used for reproduction.
3. Selection
• This phase's main goal is to find the region where the chances of getting the best solution are
highest.
• Inspiration for this is from the survival of the fittest.
• It should be a balance between exploration and exploitation of the search space.
• GA tries to move the genotype to higher fitness in the search space.
• Too strong a fitness selection bias can lead to sub-optimal solutions.
• Too little fitness selection bias results in an unfocused search.
• Thus fitness proportionate selection, also known as roulette wheel selection, is used. It is
a genetic operator for selecting potentially useful solutions for
recombination.
4. Reproduction
Offspring are generated in 2 ways:
• Crossover
• Mutation
a) Crossover
Crossover is the most vital stage in the genetic algorithm. During crossover, a random point is selected
while mating a pair of parents to generate offspring.
There are 3 major types of crossover.
 Single Point Crossover: A point on both parents’ chromosomes is picked randomly and
designated a ‘crossover point’. Bits to the right of that point are exchanged between the two
parent chromosomes.
 Two-Point Crossover: Two crossover points are picked randomly from the parent chromosomes.
The bits in between the two points are swapped between the parent organisms.
 Uniform Crossover: In a uniform crossover, typically, each bit is chosen from either parent with
equal probability.
The new offspring are added to the population.
b) Mutation
In a few of the new offspring, some genes can be subjected to mutation with a low random
probability. This means that some of the bits in the bit chromosome can be flipped. Mutation
maintains diversity in the population and prevents premature convergence.
5. Convergence (when to stop)
A few rules that tell when to stop are as follows:
• When there is no improvement in the solution quality after completing a certain number of
generations set beforehand.
• When a fixed limit on the number of generations or on time is reached.
• When an acceptable solution is obtained.
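
The phases above can be put together in a small sketch. The problem (maximise the number of 1 bits in a binary chromosome), the parameter values and the function names are assumptions chosen only to keep the example short; it uses roulette wheel selection, single point crossover, bit-flip mutation and a fixed number of generations as the stopping rule.

```python
import random


def fitness(chromosome):
    """Toy fitness: the number of 1 bits in the chromosome."""
    return sum(chromosome)


def run_ga(n_bits=12, pop_size=20, generations=30, mutation_rate=0.02, seed=1):
    rng = random.Random(seed)
    # 1. Initialisation: random binary chromosomes.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):                  # 5. fixed generation limit
        # 2-3. Fitness proportionate (roulette wheel) selection weights.
        # The small constant keeps the total positive if all fitness is 0.
        wheel = [fitness(c) + 1e-6 for c in population]
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = rng.choices(population, weights=wheel, k=2)
            # 4a. Single point crossover.
            point = rng.randint(1, n_bits - 1)
            child = p1[:point] + p2[point:]
            # 4b. Mutation: flip each bit with a low probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            offspring.append(child)
        population = offspring
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
        history.append(fitness(best))
    return best, history


best, history = run_ga()
```

`history` records the best fitness seen so far after each generation, which is one simple way to watch for the "no improvement" stopping condition.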

e) Describe BPN algorithm in ANN along with an example.


Ans: The principle behind the back propagation algorithm is to reduce the error produced by randomly
initialized weights and biases so that the network produces the correct output. The system is trained
by supervised learning, where the error between the system's output and a known expected output is
presented to the system and used to modify its internal state. The weights are updated so as to
move toward a minimum of the loss. This is how back propagation in neural networks works.

When the gradient is negative, increase in weight decreases the error.


When the gradient is positive, decrease in weight decreases the error.
Working of Back Propagation Algorithm:
The back propagation algorithm computes the gradient of the loss function with respect to each
weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct
computation. It computes the gradient, but it does not define how the gradient is used. It generalizes
the computation in the delta rule.
The steps of the algorithm are:

1. Inputs X arrive through the pre-connected path.
2. The input is modeled using real weights W. The weights are usually randomly selected.
3. Calculate the output of every neuron, from the input layer, through the hidden layers, to the output
layer.
4. Calculate the error in the outputs:
ErrorB = Actual Output - Desired Output
5. Travel back from the output layer to the hidden layer to adjust the weights such that the error is
decreased.
Keep repeating the process until the desired output is achieved.
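
As an example, the sketch below runs these steps on the smallest possible network: one input, one hidden sigmoid neuron and one output sigmoid neuron, trained on a single example with squared error. The network shape, starting weights, learning rate and training pair are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(x=1.0, target=1.0, lr=0.5, steps=2000):
    # Step 2: starting weights (fixed here so the run is repeatable).
    w1, b1, w2, b2 = 0.5, 0.0, -0.3, 0.1
    losses = []
    for _ in range(steps):
        # Step 3: forward pass, input -> hidden -> output.
        h = sigmoid(w1 * x + b1)
        out = sigmoid(w2 * h + b2)
        # Step 4: error between actual and desired output.
        error = out - target
        losses.append(0.5 * error ** 2)
        # Step 5: backward pass, applying the chain rule layer by layer.
        delta_out = error * out * (1.0 - out)         # dLoss/dz at output
        delta_hid = delta_out * w2 * h * (1.0 - h)    # dLoss/dz at hidden
        # Gradient descent: move each weight against its gradient.
        w2 -= lr * delta_out * h
        b2 -= lr * delta_out
        w1 -= lr * delta_hid * x
        b1 -= lr * delta_hid
    return losses


losses = train()
```

Both deltas are computed before any weight changes, so the hidden layer gradient uses the same `w2` that produced the forward pass.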
Section C
3.
a) Why is SVM an example of a large margin classifier? Discuss the different kernel
functions used in SVM.
Ans:
• SVM is a type of classifier which separates positive and negative examples.
• The largest margin is found in order to avoid overfitting, i.e. the optimal
hyperplane is at the maximum distance from the positive and negative examples (equidistant
from the boundary lines).
• To satisfy this constraint, and also to classify the data points accurately, the margin is
maximized; that is why this is called the large margin classifier.
For a dataset containing m training examples and n features, the objective function of the maximum
margin classifier is:

maximize over θ0, θ1, ..., θn:  M

subject to  θ1^2 + θ2^2 + ... + θn^2 = 1
and  yi (θ0 + θ1 xi1 + θ2 xi2 + ... + θn xin) ≥ M,  i = 1, 2, ..., m.

where M is the margin, the distance of the hyperplane from the closest points of both classes. The
solution to the optimization problem chooses the parameters θ0, θ1, ..., θn to maximize M.

The Support Vector Machine is an extension of the Maximum Margin Classifier that can produce non-linear
decision boundaries between classes. Hence SVM too is a Maximum Margin Classifier.
Types of Kernel and methods in SVM:
1. Linear Kernel
For two vectors x1 and x2, the linear kernel is defined by the dot
product of these two vectors:
K(x1, x2) = x1 . x2
2. Polynomial Kernel
A polynomial kernel is defined by the following equation:
K(x1, x2) = (x1 . x2 + 1)^d,
where d is the degree of the polynomial and x1 and x2 are vectors.
3. Gaussian Kernel
This kernel is an example of a radial basis function kernel. Below is the equation for this:
K(x1, x2) = exp(-||x1 - x2||^2 / (2 σ^2))
The parameter sigma plays a very important role in the performance of the Gaussian kernel and should
be neither overestimated nor underestimated; it should be carefully tuned according to the
problem.
4. Exponential Kernel
This is closely related to the Gaussian kernel, the only difference being that the
square of the norm is removed.
The exponential kernel function is:
K(x1, x2) = exp(-||x1 - x2|| / (2 σ^2))
This is also a radial basis kernel function.
5. Laplacian Kernel
This kernel is equivalent to the exponential kernel discussed above, except that it is less
sensitive to changes in the sigma parameter. The equation of the Laplacian kernel is:
K(x1, x2) = exp(-||x1 - x2|| / σ)
6. Hyperbolic or Sigmoid Kernel
This kernel comes from the neural network area of machine learning. The activation function for the
sigmoid kernel is the bipolar sigmoid function. The equation for the hyperbolic kernel function is:
K(x1, x2) = tanh(α (x1 . x2) + c)
This kernel is widely used and popular among support vector machines.
7. ANOVA radial basis kernel
This kernel is known to perform very well in multidimensional regression problems, just like the Gaussian
and Laplacian kernels. It also comes under the category of radial basis kernels.
The equation for the ANOVA kernel, where x1k and x2k are the kth components of the two vectors, is:
K(x1, x2) = Σ_{k=1..n} exp(-σ (x1k - x2k)^2)^d
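
For reference, the first three kernels can be written directly as small functions. This is only a sketch: the function names and the default values of `degree` and `sigma` are assumptions, and the Gaussian kernel is written in its common form exp(-||x1 - x2||^2 / (2 sigma^2)).

```python
import math


def dot(x1, x2):
    return sum(a * b for a, b in zip(x1, x2))


def linear_kernel(x1, x2):
    """K(x1, x2) = x1 . x2"""
    return dot(x1, x2)


def polynomial_kernel(x1, x2, degree=2):
    """K(x1, x2) = (x1 . x2 + 1)^d"""
    return (dot(x1, x2) + 1) ** degree


def gaussian_kernel(x1, x2, sigma=1.0):
    """K(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2))"""
    squared_norm = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-squared_norm / (2.0 * sigma ** 2))


u, v = [1, 2], [3, 4]
k_lin = linear_kernel(u, v)        # 1*3 + 2*4 = 11
k_poly = polynomial_kernel(u, v)   # (11 + 1)^2 = 144
k_rbf = gaussian_kernel(u, u)      # identical vectors give 1.0
```

A larger `sigma` makes the Gaussian kernel decay more slowly with distance, which is why the text stresses tuning it.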

b) Explain the relevance of CBR. How CADET tool employs CBR?


Ans: Case-Based Reasoning (CBR) classifiers use a database of problem solutions to solve new problems.
They store the tuples or cases for problem-solving as complex symbolic descriptions.
Applications of CBR include:
1. Problem resolution for customer service help desks, where cases describe product-related
diagnostic problems.
2. Engineering and law, where cases are either technical designs
or legal rulings, respectively.
3. Medical education, where patient case histories and treatments are used to help diagnose and
treat new patients.
Case-based reasoning (CBR) is an experience-based approach to solving new problems by adapting
previously successful solutions to similar problems. Addressing memory, learning, planning and problem
solving, CBR provides a foundation for a new technology of intelligent computer systems that can solve
problems and adapt to new situations. In CBR, the “intelligent” reuse of knowledge from already-solved
problems, or cases, relies on the premise that the more similar two problems are, the more similar their
solutions will be.
Four step process for CBR
In general, the case-based reasoning process entails:
1. Retrieve- Gathering from memory an experience closest to the current problem.
2. Reuse- Suggesting a solution based on the experience and adapting it to meet the demands of
the new situation.
3. Revise- Evaluating the use of the solution in the new context.
4. Retain- Storing this new problem-solving method in the memory system.
Case-Based Reasoning in CADET
• Given a function specification for a new design, CADET searches its library for an exact match.
• If one is found, it returns that case.
• If not, it finds cases that match subgraphs of the specification (subgraph isomorphism search)
and pieces them together.
• It can also elaborate the original function graph so that it matches more cases.
4.
a) Discuss the applications, properties, issues, and disadvantages of SVM.
Ans: SVM Applications:
1. Inverse Geosounding Problem
2. Seismic Liquefaction Potential
3. Protein Fold and Remote Homology Detection
4. Data Classification using SSVM
5. Facial Expression Classification
6. Texture Classification using SVM
7. Text Classification
8. Speech Recognition
9. Steganography Detection in Digital Images
10. Cancer Diagnosis and Prognosis
Properties of SVM :
 Duality
 Kernels
 Margin
 Convexity
 Sparseness
Issues of SVM:
• Requires full labeling of input data.
• Uncalibrated class membership probabilities: SVM stems from Vapnik's theory, which avoids
estimating probabilities on finite data.
• The SVM is only directly applicable to two-class tasks. Therefore, algorithms that reduce the
multi-class task to several binary problems have to be applied.
• Parameters of a solved model are difficult to interpret.
Disadvantages of SVM:
 SVM algorithm is not suitable for large data sets.
 SVM does not perform very well when the data set has more noise i.e. target classes are
overlapping.
 In cases where the number of features for each data point exceeds the number of training data
samples, the SVM will underperform.
 As the support vector classifier works by putting data points, above and below the classifying
hyperplane there is no probabilistic explanation for the classification.
b) Explain the Confusion Matrix with respect to Machine Learning Algorithms.
Ans: A confusion matrix (or error matrix) is a specific table that is used to measure the performance of
an algorithm. It is mostly used in supervised learning; in unsupervised learning, it's called the matching
matrix.
The confusion matrix has two dimensions:
• Actual
• Predicted
Both dimensions have the same set of classes.
Consider the binary confusion matrix shown below:

               Predicted: Yes   Predicted: No
Actual: Yes         12               1
Actual: No           3               9

Here,
For actual values:
Total Yes = 12+1 = 13
Total No = 3+9 = 12
Similarly, for predicted values:
Total Yes = 12+3 = 15
Total No = 1+9 = 10
For a model to be accurate, the values across the diagonals should be high. The total sum of all the
values in the matrix equals the total observations in the test data set.
For the above matrix, total observations = 12+3+1+9 = 25
Now, accuracy = sum of the values across the diagonal/total dataset
= (12+9) / 25
= 21 / 25
= 84%
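
The same totals and accuracy can be reproduced in a few lines. The label lists below are built to match the counts in the example (12, 1, 3, 9); the function name is an assumption for this sketch.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Return (tp, fn, fp, tn) for a binary problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1      # actual Yes, predicted Yes
        elif a == positive:
            fn += 1      # actual Yes, predicted No
        elif p == positive:
            fp += 1      # actual No, predicted Yes
        else:
            tn += 1      # actual No, predicted No
    return tp, fn, fp, tn


# 13 actual Yes (12 predicted Yes, 1 No) and 12 actual No (3 Yes, 9 No).
actual = ["Yes"] * 13 + ["No"] * 12
predicted = ["Yes"] * 12 + ["No"] * 1 + ["Yes"] * 3 + ["No"] * 9

tp, fn, fp, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + fn + fp + tn)   # diagonal / total = 21 / 25
```

The diagonal entries are `tp` and `tn`, so a high accuracy corresponds to large values on the diagonal of the matrix.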
5. a) Illustrate the operation of the ID3 training example.
Consider information gain as attribute measure
b) Describe Markov Decision Process in reinforcement learning.
Ans: Reinforcement Learning is a type of Machine Learning. It allows machines and software agents to
automatically determine the ideal behavior within a specific context, in order to maximize its
performance. Simple reward feedback is required for the agent to learn its behavior; this is known as the
reinforcement signal.
There are many different algorithms that tackle this issue. As a matter of fact, Reinforcement Learning is
defined by a specific type of problem, and all its solutions are classed as Reinforcement Learning
algorithms. In the problem, an agent is supposed to decide the best action to select based on its current
state. When this step is repeated, the problem is known as a Markov Decision Process.
A Markov Decision Process (MDP) model contains:
• A set of possible world states S.
• A set of Models (transition models).
• A set of possible actions A.
• A real-valued reward function R(s,a).
• A policy, which is the solution of the Markov Decision Process.

The State set S is a set of tokens that represent every state that the agent can be in.
A Model (sometimes called Transition Model) gives an action's effect in a state. In particular, T(S, a, S')
defines a transition T where being in state S and taking an action 'a' takes us to state S' (S and S' may be
the same). For stochastic actions (noisy, non-deterministic) we also define a probability P(S'|S,a) which
represents the probability of reaching a state S' if action 'a' is taken in state S.
The Action set A is the set of all possible actions. A(s) defines the set of actions that can be taken in state S.
A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the state S.
R(S,a) indicates the reward for being in a state S and taking an action 'a'. R(S,a,S') indicates the reward
for being in a state S, taking an action 'a' and ending up in a state S'.
A Policy is a solution to the Markov Decision Process. A policy is a mapping from S to a. It indicates the
action 'a' to be taken while in state S.
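
The components above map naturally onto a few dictionaries. The sketch below defines a made-up two-state MDP and derives a policy with value iteration; value iteration and the discount factor `gamma` are not described in the answer and are used here only as one common way to obtain the policy.

```python
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

# Transition model P(S' | S, a), hypothetical numbers.
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 1.0},
}

# Reward function R(S, a): only staying in s1 pays off.
R = {("s0", "stay"): 0.0, ("s0", "go"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "go"): 0.0}


def q_value(values, state, action, gamma):
    """Immediate reward plus discounted expected value of the next state."""
    expected = sum(p * values[nxt] for nxt, p in P[(state, action)].items())
    return R[(state, action)] + gamma * expected


def value_iteration(gamma=0.9, sweeps=200):
    values = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        values = {s: max(q_value(values, s, a, gamma) for a in ACTIONS)
                  for s in STATES}
    # The policy maps each state S to the best action a.
    policy = {s: max(ACTIONS, key=lambda a: q_value(values, s, a, gamma))
              for s in STATES}
    return values, policy


values, policy = value_iteration()
```

The resulting `policy` dictionary is exactly the "mapping from S to a" described above: it names one action for each state.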
6.
a) What is instance based learning? How is Locally Weighted Regression different from
Radial basis function networks?
Ans: Machine learning systems categorized as instance-based learning are systems
that learn the training examples by heart and then generalize to new instances based on some
similarity measure. It is called instance-based because it builds the hypotheses from the training
instances. It is also known as memory-based learning or lazy learning. The time complexity of this
approach depends upon the size of the training data. The worst-case time complexity is O(n),
where n is the number of training instances.
Advantages:
1. Instead of estimating for the entire instance set, local approximations can be made to the target
function.
2. This algorithm can adapt to new data easily, one which is collected as we go .
Disadvantages:
1. Classification costs are high
2. Large amount of memory required to store the data, and each query involves starting the
identification of a local model from scratch.
Some of the instance-based learning algorithms are:
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
Locally weighted linear regression is a non-parametric algorithm, that is, the model does not learn a
fixed set of parameters as is done in ordinary linear regression. Rather, the parameters θ are computed
individually for each query point x. While computing θ, a higher "preference" is given to the points in
the training set lying in the vicinity of x than to the points lying far away from x.

The modified cost function is:
J(θ) = Σ_i w(i) (θ^T x(i) - y(i))^2
where w(i) is a non-negative "weight" associated with training point x(i).
For x(i) lying closer to the query point x, the value of w(i) is large, while for x(i) lying far away
from x the value of w(i) is small.

A typical choice of w(i) is:
w(i) = exp(-(x(i) - x)^2 / (2 τ^2))
where τ is called the bandwidth parameter and
controls the rate at which w(i) falls with distance from x. Clearly, if |x(i)-x| is small w(i) is close to 1 and
if |x(i)-x| is large w(i) is close to 0.
Thus, the training-set points lying closer to the query point x contribute more to the cost than the
points lying far away from x.
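
A minimal sketch of this procedure for a single input feature is shown below. It fits a weighted straight line a + b*x around each query point using the Gaussian weights described above; the function name, the bandwidth value and the toy data are assumptions for the example.

```python
import math


def lwr_predict(xs, ys, query, tau=1.0):
    """Predict y at `query` with a locally weighted straight-line fit."""
    # Gaussian weights: close to 1 near the query, close to 0 far away.
    ws = [math.exp(-((x - query) ** 2) / (2.0 * tau ** 2)) for x in xs]
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    # Closed-form minimiser of sum_i w(i) * (a + b*x(i) - y(i))^2.
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    intercept = (swy - slope * swx) / sw
    return intercept + slope * query


# Hypothetical training data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]
y_hat = lwr_predict(xs, ys, query=2.5)
```

The fit is redone for every query point, which is why the method stores the whole training set instead of a fixed parameter vector.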
Radial Basis Function Network
In the field of mathematical modeling, a radial basis function network is an artificial neural network that
uses radial basis functions as activation functions. The output of the network is a linear combination of
radial basis functions of the inputs and neuron parameters. Radial basis function networks have many
uses, including function approximation, time series prediction, classification, and system control.
Radial basis functions can be organized into the hidden layer of a
neural network, and this type of network is called an RBF network.
A radial basis function (RBF) ϕ(x) is a function of the distance from the origin or from a certain point c,
i.e. ϕ(x) = f(‖x − c‖), where the norm is usually the Euclidean norm but can be another type of
measure.
The RBF learning model assumes that the dataset D = (xn, yn), n = 1…N, influences the
hypothesis h(x), for a new observation x, in the following way:
h(x) = Σ_{n=1..N} wn × exp(−γ ‖x − xn‖^2)
which means that each xn of the dataset influences the observation in a Gaussian shape. Of course, if a
data point is far away from the observation its influence is residual (the exponential decay of the tails of
the Gaussian makes it so). It is an example of a localized function (‖x‖ → ∞ ⟹ ϕ(x) → 0).
Notice that other types of radial functions can be used.
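
The hypothesis h(x) can be evaluated with a few lines of code. In this sketch the centres, the weights wn and the value of γ are made-up numbers; in practice the weights would be fitted to the data.

```python
import math


def rbf_hypothesis(x, centers, weights, gamma=1.0):
    """h(x) = sum_n w_n * exp(-gamma * ||x - x_n||^2)"""
    total = 0.0
    for center, w in zip(centers, weights):
        squared_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        total += w * math.exp(-gamma * squared_dist)
    return total


# Hypothetical data points x_n used as centres, with hypothetical weights.
centers = [[0.0, 0.0], [3.0, 4.0]]
weights = [2.0, -1.0]

h_at_center = rbf_hypothesis([0.0, 0.0], centers, weights)
h_far_away = rbf_hypothesis([100.0, 100.0], centers, weights)
```

At a centre the output is dominated by that centre's weight, while far from every centre the output falls to zero, which is the localized behaviour described above.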
b) How is Bayes theorem used in machine learning? How is the naive Bayes algorithm
different from Bayes theorem?
Ans: Bayes Theorem is named for English mathematician Thomas Bayes, who worked extensively in
decision theory, the field of mathematics that involves probabilities. Bayes Theorem is also used widely
in machine learning, where it is a simple, effective way to predict classes with precision and accuracy.
The Bayesian method of calculating conditional probabilities is used in machine learning applications
that involve classification tasks.

A simplified version of the Bayes Theorem, known as Naive Bayes Classification, is used to reduce
computation time and costs. The Naive Bayes Classifier is used as a classification
algorithm to classify data into various classes with accuracy and speed.

Let's see how the Naive Bayes Classifier can be applied as a classification algorithm.

1. Consider a general example: X is a vector consisting of 'n' attributes, that is, X = {x1, x2, x3, …,
xn}.

2. Say we have 'm' classes {C1, C2, …, Cm}. Our classifier has to predict which class X belongs
to. The class delivering the highest posterior probability will be chosen as the best class. So
mathematically, the classifier will predict class Ci iff P(Ci | X) > P(Cj | X) for every other class Cj.
Applying Bayes Theorem:

P(Ci | X) = [ P(X | Ci) * P(Ci) ] / P(X)

3. P(X), being independent of the class, is constant for each class. So to maximize P(Ci | X), we must
maximize [P(X | Ci) * P(Ci)]. If every class is equally likely, we have P(C1) = P(C2) =
P(C3) … = P(Cm), and then we need to maximize only P(X | Ci).

4. Since the typical large dataset is likely to have several attributes, it is computationally expensive
to compute P(X | Ci) over all attributes jointly. This is where class-conditional
independence comes in to simplify the problem and reduce computation costs. By class-conditional
independence, we mean that we consider the attributes' values to be independent
of one another given the class. This is the Naive Bayes Classification.

P(X | Ci) = P(x1 | Ci) * P(x2 | Ci) * … * P(xn | Ci)

It is now easy to compute the smaller probabilities. One important thing to note here: since xk is the
value of attribute k, we also need to check whether the attribute we are dealing with
is categorical or continuous.

1. If we have a categorical attribute, things are simpler. We can just count the number of instances
of class Ci having the value xk for attribute k and then divide that by the number of
instances of class Ci.

2. If we have a continuous attribute, and we assume it follows a normal distribution, we apply
the following formula, with mean μ and standard deviation σ:

F(x, μ, σ) = (1 / (σ √(2π))) * exp(-(x - μ)^2 / (2 σ^2))

Ultimately, we will have P(xk | Ci) = F(xk, μk, σk), where μk and σk are the mean and standard
deviation of attribute k within class Ci.

Now, we have all the values we need to use Bayes Theorem for each class Ci. Our predicted class will be
the class achieving the highest probability P(X | Ci) * P(Ci).
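
The procedure for continuous attributes can be sketched as a small Gaussian Naive Bayes classifier. The training rows, class names and function names are made up for illustration, and log-probabilities are used instead of raw products to avoid numerical underflow; the small constant added to the standard deviation is also an assumption to avoid division by zero.

```python
import math
from collections import defaultdict


def log_gaussian(x, mean, std):
    """Log of the normal density F(x, mean, std)."""
    return (-((x - mean) ** 2) / (2.0 * std ** 2)
            - math.log(std * math.sqrt(2.0 * math.pi)))


def fit(rows, labels):
    """Estimate P(Ci) and per-attribute (mean, std) for every class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, math.sqrt(var) + 1e-9))
        model[label] = (prior, stats)
    return model


def predict(model, x):
    """Class with the highest log P(Ci) + sum_k log P(xk | Ci)."""
    scores = {}
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, std) in zip(x, stats):
            score += log_gaussian(value, mean, std)
        scores[label] = score
    return max(scores, key=scores.get)


rows = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
        [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
labels = ["A", "A", "A", "B", "B", "B"]
model = fit(rows, labels)
label_near_a = predict(model, [1.1, 2.0])
label_near_b = predict(model, [5.1, 6.1])
```

Summing logs is equivalent to multiplying the per-attribute probabilities, so the predicted class is the same one that maximises P(X | Ci) * P(Ci).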

Naive Bayes and Bayes Theorem are different things. Naive Bayes is a type of prediction model, one
which assumes that all of the features are mutually independent given the class.

Bayes Theorem, by contrast, is a theorem: a mathematical result. It tells us
that

P(A | B) = P(B | A) * P(A) / P(B)

Bayes Theorem allows us to infer the probability of a particular model given observed
data. There are many Machine Learning algorithms based on it. Naive Bayes is one of them.
7.
a) Compare regression, classification and clustering in machine learning along with
suitable real life applications.

Ans:

Regression | Classification | Clustering
Supervised learning | Supervised learning | Unsupervised learning
Output is a continuous quantity | Output is a categorical quantity | Assigns data points into clusters
Main aim is to forecast or predict | Main aim is to compute the category of the data | Main aim is to group similar items into clusters
Example: Predict stock market price | Example: Classify emails as spam or non-spam | Example: Find all transactions which are fraudulent in nature
Algorithm: Linear Regression | Algorithm: Logistic Regression | Algorithm: K-means
Need to learn to predict a particular outcome (no discrete options) in a context by looking at past experience | Need to learn decision making (choosing one decision among 'n' decisions) by looking at past experiences | Finding similarities/patterns in observed events (data)

b) Given below is an input matrix named I, kernel matrix, calculate the Convoluted
matrix C using stride =1 also apply max pooling on C.
