VI Sem Machine Learning CS 601 PDF
Experiment No. 1
Introduction
Decision trees are a type of supervised machine learning (that is, the training
data states both the input and the corresponding output) in which the data is
repeatedly split according to a certain parameter. A tree can be described by
two kinds of entity: decision nodes and leaves. The decision nodes are where the
data is split, and the leaves are the decisions or final outcomes.
A binary tree makes a convenient example of a decision tree. Let’s say you want
to predict whether a person is fit given information such as their age, eating
habits, and physical activity. The decision nodes here are questions like
‘What’s the age?’, ‘Does he exercise?’, and ‘Does he eat a lot of pizzas?’. The
leaves are outcomes such as ‘fit’ or ‘unfit’. This is a binary classification
problem (a yes/no type of problem).
There are two main types of decision trees:
Classification trees: what we’ve seen above is an example of a classification
tree, where the outcome was a variable like ‘fit’ or ‘unfit’. Here the decision
variable is categorical.
Regression trees: here the decision or outcome variable is continuous, e.g. a
number like 123.
Working
Now that we know what a decision tree is, we’ll see how it works internally.
Many algorithms construct decision trees; one of the best known is the ID3
algorithm. ID3 stands for Iterative Dichotomiser 3.
Before discussing the ID3 algorithm, we’ll go through a few definitions.
Entropy
Entropy, also called Shannon entropy and denoted by H(S) for a finite set S, is
a measure of the amount of uncertainty or randomness in the data:
$$H(S) = -\sum_{x} P(x) \log_2 P(x)$$
Information gain measures how much the entropy falls when S is split on a
feature A:
$$IG(S, A) = H(S) - \sum_{x} P(x)\, H(x)$$
where IG(S, A) is the information gain from applying feature A. H(S) is the
entropy of the entire set, while the second term calculates the entropy after
applying feature A, where P(x) is the probability of event x.
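The two quantities described above can be sketched in Python as follows. This is a minimal illustration and not part of the original experiment: the function names `entropy` and `information_gain` and the four-row toy data are assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(S), in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(S, A): H(S) minus the weighted entropy of the subsets produced by feature A."""
    total = len(labels)
    # Group the labels by the value that feature A takes on each row.
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four examples and two candidate features.
labels = ["Yes", "Yes", "No", "No"]
perfect_feature = ["a", "a", "b", "b"]   # separates the classes completely
useless_feature = ["a", "b", "a", "b"]   # tells us nothing about the class

print(entropy(labels))                            # 1.0
print(information_gain(perfect_feature, labels))  # 1.0
print(information_gain(useless_feature, labels))  # 0.0
```

ID3 would pick `perfect_feature` for the split here, because it has the larger information gain.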
Let’s understand this with the help of an example.
Consider data collected over the course of 14 days, where the features are
Outlook, Temperature, Humidity, and Wind, and the outcome variable is whether
golf was played on that day. Our job is to build a predictive model that takes
in these 4 parameters and predicts whether golf will be played on a given day.
We’ll build a decision tree to do that using the ID3 algorithm.
Day Outlook Temperature Humidity Wind Play Golf
Experiment No. 2
Demonstrate the FIND-S algorithm for finding the most specific hypothesis based on a given set
of training data samples.
1. Concept Learning
2. General Hypothesis
3. Specific Hypothesis
1. Concept Learning
Let’s try to understand concept learning with a real-life example. Most human learning is
based on past instances or experiences. For example, we are able to identify any type of
vehicle based on a certain set of features, like make and model, that are defined over a
large set of features.
These special features differentiate the set of cars, trucks, etc. from the larger set of vehicles.
The features that define the set of cars, trucks, etc. are known as concepts.
Similar to this, machines can also learn from concepts to identify whether an object belongs
to a specific category or not. Any algorithm that supports concept learning requires the
following:
Training Data
Target Concept
Actual Data Objects
2. General Hypothesis
3. Specific Hypothesis
The specific hypothesis fills in all the important details about the variables given in the
general hypothesis. A more specific version of the example given above would be: I want
a cheeseburger with a chicken pepperoni filling and a lot of lettuce.
S = {‘Φ’, ‘Φ’, ‘Φ’, ……, ‘Φ’}
Now that we are done with the basic explanation of the Find-S algorithm, let us take a look at
how it works.
1. The process starts by initializing ‘h’ with the most specific hypothesis; generally, this is
the first positive example in the data set.
2. We check each training example. If the example is negative, we move on to
the next example, but if it is positive, we consider it for the next step.
3. We will check if each attribute in the example is equal to the hypothesis value.
4. If the value matches, then no changes are made.
5. If the value does not match, the value is changed to ‘?’.
6. We do this until we reach the last positive example in the data set.
Now that we are aware of the limitations of the Find-S algorithm, let us take a look at a
practical implementation of the Find-S Algorithm.
The concept in this particular problem is on which days a person likes to go for a
walk.
This is our general hypothesis, and now we will consider each example one by one, but only
the positive examples.
We replaced all the different values in the general hypothesis to get a resultant hypothesis.
Now that we know how the Find-S algorithm works, let us take a look at an implementation
using Python.
Use Case
Let’s try to implement the above example using Python. The code to implement the Find-S
algorithm using the above data is given below.
import numpy as np
import pandas as pd

# Read the training data from the CSV file
data = pd.read_csv("data.csv")
print(data, "\n")

# Array of all the attributes (every column except the last)
d = np.array(data)[:, :-1]
print("\n The attributes are: ", d)

# Separate the target column, which holds the positive and negative examples
target = np.array(data)[:, -1]
print("\n The target is: ", target)


# Training function implementing the Find-S algorithm
def train(c, t):
    # Start from the first positive example as the most specific hypothesis
    specific_hypothesis = None
    for i, val in enumerate(t):
        if val == "Yes":
            specific_hypothesis = c[i].copy()
            break
    if specific_hypothesis is None:
        return None  # the data contains no positive example

    # Generalise every attribute that disagrees with a positive example
    for i, val in enumerate(c):
        if t[i] == "Yes":
            for x in range(len(specific_hypothesis)):
                if val[x] != specific_hypothesis[x]:
                    specific_hypothesis[x] = '?'

    return specific_hypothesis


# Obtain the final hypothesis
print("\n The final hypothesis is:", train(d, target))
Output:
This brings us to the end of this article, where we have learned the Find-S algorithm in
machine learning along with its implementation and use case. I hope you are clear about all
that has been shared with you in this tutorial.
Experiment No. 3
2. Objectives:
To become familiar with neural network learning algorithms from available
examples.
To provide knowledge of learning algorithms in neural networks.
5. Theory:
In the late 1950s, Frank Rosenblatt introduced a network composed of units that were an
enhanced version of the McCulloch-Pitts Threshold Logic Unit (TLU) model. Rosenblatt's
model of a neuron, the perceptron, was the result of a merger between two concepts from the
1940s: the McCulloch-Pitts model of an artificial neuron and the Hebbian learning rule for
adjusting weights. In addition to the variable weight values, the perceptron model added an
extra input that represents bias. Thus, the modified equation is now as follows:
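One common way to write the perceptron output with a bias term is sketched below. The symbols x_j (inputs), w_j (weights), b (bias) and the step function f are notation assumed for illustration, and may differ from the form this manual intends.

```latex
y = f\!\left(\sum_{j=1}^{n} w_j x_j + b\right), \qquad
f(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ 0 & \text{otherwise} \end{cases}
```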
6. Algorithm:
The perceptron learning rule was originally developed by Frank Rosenblatt in the late 1950s.
Training patterns are presented to the network's inputs and the output is computed. Then the
connection weights w_j are modified by an amount that is proportional to the product of
the difference between the desired output, d, and the actual output, y, and
the input pattern, x.
where
d is the desired output,
t is the iteration number, and
eta (η) is the gain or step size, where 0.0 < η < 1.0
4. Repeat steps 2 and 3 until:
1. the iteration error is less than a user-specified error threshold or
2. a predetermined number of iterations have been completed.
Learning only occurs when an error is made; otherwise the weights are left unchanged.
[Figure: Multilayer Perceptron, with labels for the output values, the output layer, and the adjustable weights]
X1 X2 Y
0 0 0
0 1 0
1 0 0
1 1 1
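The learning rule described in the Algorithm section can be sketched on the truth table above. This is a minimal illustration under stated assumptions, not the manual's own program: the update is taken to be w_j ← w_j + η(d − y)x_j, the weights and bias start at zero, the activation is a step function that fires at zero or above, and the names `step` and `train_perceptron` are hypothetical.

```python
def step(v):
    """Threshold activation: fire (1) when the net input is non-negative."""
    return 1 if v >= 0 else 0


def train_perceptron(samples, eta=0.5, max_epochs=100):
    """Train one perceptron with the rule w_j <- w_j + eta * (d - y) * x_j."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs    # assumed starting point: all zeros
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, d in samples:
            y = step(sum(w * xi for w, xi in zip(weights, x)) + bias)
            if y != d:            # learning only occurs when an error is made
                errors += 1
                for j in range(n_inputs):
                    weights[j] += eta * (d - y) * x[j]
                bias += eta * (d - y)
        if errors == 0:           # stop once a full pass makes no mistakes
            break
    return weights, bias


# Truth table of the AND function: ((X1, X2), Y)
and_table = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_table)
predictions = [step(sum(w * xi for w, xi in zip(weights, x)) + bias)
               for x, _ in and_table]
print(predictions)  # [0, 0, 0, 1]
```

Because AND is linearly separable, a single perceptron is enough; the loop ends as soon as one full pass over the four patterns produces no error.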
7. Conclusion:
A single-layer perceptron learning algorithm is implemented for the AND function. It is used
to train the neural network over successive iterations. A neural network mimics the human
brain, and the perceptron learning algorithm trains the network according to the inputs given.
8. Viva Questions:
Experiment No. 4
2. Objectives:
5. Theory:
These types of model are not provided with the correct results during training.
They can be used to cluster the input data into classes on the basis of their statistical
properties only.
The labelling can be carried out even if labels are available for only a small
number of objects representative of the desired classes. All similar input patterns are grouped
together as clusters. If a matching pattern is not found, a new cluster is formed.
number of different patterns & learns how to classify input data into appropriate
categories. Unsupervised learning tends to follow the neuro-biological organization of
brain. It aims to learn rapidly & can be used in real-time.
Hebbian Learning:
In 1949, Donald Hebb proposed one of the key ideas in biological learning,
commonly known as Hebb's Law. Hebb's Law states that if neuron i is near enough
to excite neuron j and repeatedly participates in its activation, the synaptic
connection between these two neurons is strengthened and neuron j becomes more
sensitive to stimuli from neuron i.
1. If two neurons on either side of a connection are activated synchronously, then the
weight of that connection is increased.
2. If two neurons on either side of a connection are activated asynchronously, then the
weight of that connection is decreased.
Hebb's Law provides the basis for learning without a teacher. Learning here is a local
phenomenon occurring without feedback from the environment.
Using Hebb's Law, we can express the adjustment applied to a weight at iteration
p in the following form:
Hebbian learning implies that weights can only increase. To resolve this problem,
we might impose a limit on the growth of synaptic weights. This can be done by
introducing a non-linear forgetting factor into Hebb's Law:
Where φ is the forgetting factor.
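A Hebbian update with a forgetting factor can be sketched as below. The rule used is one commonly quoted form, Δw_ij = α·y_j·x_i − φ·y_j·w_ij, and is an assumption rather than a formula taken from this manual. The learning rate `alpha`, the helper `hebbian_update` and the toy activations are hypothetical, and the outputs y are supplied directly instead of being computed by an activation step.

```python
def hebbian_update(weights, x, y, alpha=0.1, phi=0.02):
    """Return new weights after one Hebbian step with a forgetting factor.

    weights[i][j] connects input neuron i to output neuron j.
    Assumed rule: delta_w = alpha * y_j * x_i - phi * y_j * w_ij
    """
    return [
        [w_ij + alpha * y[j] * x[i] - phi * y[j] * w_ij
         for j, w_ij in enumerate(row)]
        for i, row in enumerate(weights)
    ]


# Hypothetical example: two inputs and one output. Input 0 and the output
# fire together on every step; input 1 stays silent.
w = [[0.5], [0.5]]
for _ in range(1000):
    w = hebbian_update(w, x=[1, 0], y=[1])

print(w)
```

With these settings the weight from the active input settles near alpha / phi = 5 instead of growing without bound, while the weight from the silent input decays towards zero.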
Step 1: Initialization
Set initial synaptic weights and thresholds to small random values, say in an interval [0,1].
Step 2: Activation
Step 3: Learning
Step 4: Iteration
7. Conclusion:
8. Viva Questions:
How do artificial neurons learn?
What is the difference between a neural network and fuzzy logic?
Experiment No. 5
Now obviously, we are not superhuman. So the weight values we select initially will not necessarily be correct
or fit our model best.
We have selected some weight values in the beginning, but our model output is far from our actual output,
i.e. the error value is huge.
What we need to do is somehow tell the model to change the parameters (weights) such that the error
becomes minimum.
One way to train our model is called backpropagation. The process has the following steps:
Calculate the error: how far your model output is from the actual output.
Minimum error: check whether the error is minimized or not.
Update the parameters: if the error is huge, update the parameters (weights and biases). After
that, check the error again. Repeat the process until the error becomes minimum.
Model is ready to make a prediction: once the error becomes minimum, you can feed some inputs
to your model and it will produce the output.
This is why we need backpropagation, and what it means to train a model.
What is Backpropagation?
The backpropagation algorithm looks for the minimum of the error function in weight space using a
technique called the delta rule or gradient descent. The weights that minimize the error function are then
considered to be a solution to the learning problem.
Input   Desired Output   Model Output (W=3)   Absolute Error   Square Error
0       0                0                    0                0
1       2                3                    1                1
2       4                6                    2                4
Let’s change the value of ‘W’. Notice the error when ‘W’ = ‘4’
2 4 6 2 4 4 0
Now, what we did here:
So, we are trying to find the value of the weight for which the error becomes minimum. Basically, we need to
figure out whether we need to increase or decrease the weight value. Once we know that, we keep updating the
weight value in that direction until the error becomes minimum. You might reach a point where updating the
weight further makes the error increase. At that point you need to stop, and that is your final weight value.
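The search for the weight can be sketched in a few lines of Python, using the inputs and desired outputs from the table above. The model is taken to be output = W × input, and the step size `learning_rate` and the number of iterations are assumptions chosen for the illustration.

```python
inputs = [0, 1, 2]
desired = [0, 2, 4]   # the desired output is twice the input


def squared_error(w):
    """Sum of squared differences between model output w * x and the desired output."""
    return sum((w * x - d) ** 2 for x, d in zip(inputs, desired))


def gradient(w):
    """Derivative of squared_error with respect to w."""
    return sum(2 * (w * x - d) * x for x, d in zip(inputs, desired))


# Total squared error for W = 3, W = 4 and W = 2
print(squared_error(3), squared_error(4), squared_error(2))  # 5 20 0

w = 3.0
learning_rate = 0.05   # assumed step size
for _ in range(50):
    w -= learning_rate * gradient(w)   # move against the slope of the error
print(round(w, 4))  # 2.0
```

Raising W from 3 to 4 makes the error larger, so the weight has to move the other way; the sign of the gradient gives that direction without trial and error.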
The above network contains the following:
two inputs
two hidden neurons
two output neurons
two biases
We will repeat this process for the output layer neurons, using the output from the hidden layer neurons as
inputs.
Now, let’s see what is the value of the error:
Consider W5, we will calculate the rate of change of error w.r.t change in weight W5.
Since we are propagating backwards, the first thing we need to do is calculate the change in the total error
w.r.t. the outputs O1 and O2.
Now, we will propagate further backwards and calculate the change in the output O1 w.r.t. its total net input.
Finally, let’s see how much the total net input of O1 changes w.r.t. W5.
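The three quantities just described combine through the chain rule. The notation below (E_total for the total error, out_O1 for the output of O1, net_O1 for its total net input) is assumed for illustration:

```latex
\frac{\partial E_{total}}{\partial W_5}
  = \frac{\partial E_{total}}{\partial out_{O1}}
    \cdot \frac{\partial out_{O1}}{\partial net_{O1}}
    \cdot \frac{\partial net_{O1}}{\partial W_5}
```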
Step 3: Putting all the values together and calculating the updated weight value
Now, let’s put all the values together:
Similarly, we can calculate the other weight values as well.
After that, we will again propagate forward and calculate the output, and again calculate the error.
If the error is at its minimum we will stop right there; otherwise we will again propagate backwards and
update the weight values.
This process keeps repeating until the error becomes minimum.
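One forward pass and one update of W5 can be sketched as follows for a network with two inputs, two hidden neurons and two output neurons. Every number here (inputs, targets, initial weights, biases, learning rate) is hypothetical and chosen only to make the sketch runnable. The sigmoid activation, the halved squared-error function and the assignment of W5 to the connection from the first hidden neuron to O1 are assumptions as well.

```python
from math import exp


def sigmoid(v):
    return 1.0 / (1.0 + exp(-v))


def forward(x, w_hidden, w_out, b1, b2):
    """Forward pass through a 2-2-2 network; returns hidden and output activations."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b1) for row in w_hidden]
    outputs = [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b2) for row in w_out]
    return hidden, outputs


def total_error(outputs, targets):
    """Sum of halved squared errors over the two output neurons."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(targets, outputs))


# Hypothetical values, for illustration only.
x = [0.05, 0.10]
targets = [0.01, 0.99]
w_hidden = [[0.15, 0.20], [0.25, 0.30]]   # input -> hidden weights
w_out = [[0.40, 0.45], [0.50, 0.55]]      # hidden -> output; w_out[0][0] plays the role of W5
b1, b2 = 0.35, 0.60
learning_rate = 0.5

hidden, outputs = forward(x, w_hidden, w_out, b1, b2)
error_before = total_error(outputs, targets)

# Chain rule for W5: dE/dW5 = dE/dout_O1 * dout_O1/dnet_O1 * dnet_O1/dW5
dE_dout = outputs[0] - targets[0]
dout_dnet = outputs[0] * (1.0 - outputs[0])   # derivative of the sigmoid
dnet_dw5 = hidden[0]
grad_w5 = dE_dout * dout_dnet * dnet_dw5

w_out[0][0] -= learning_rate * grad_w5        # update W5 only
_, outputs_after = forward(x, w_hidden, w_out, b1, b2)
error_after = total_error(outputs_after, targets)
print(error_before, error_after)
```

A full training step would apply the same chain-rule pattern to every weight before the next forward pass; only W5 is updated here to mirror the walkthrough.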
Conclusion:
Well, if I have to conclude Backpropagation, the best option is to write pseudo code for the same.
Experiment No. 6
Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among
predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class
is unrelated to the presence of any other feature.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if
these features depend on each other or upon the existence of the other features, all of these properties
are treated as contributing independently to the probability that this fruit is an apple, and that is why the method
is known as ‘Naive’.
A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive
Bayes can in some cases outperform more sophisticated classification methods.
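Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):
$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$
Above,
• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, i.e. the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.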
Let’s understand it using an example. Take a training data set of weather conditions and the corresponding
target variable ‘Play’ (indicating whether play took place). We need to classify whether players will play or not
based on the weather condition. Let’s follow the steps below.
Step 1: Convert the data set into a frequency table
Step 2: Create a likelihood table by finding probabilities such as P(Overcast) = 0.29 and P(Yes), the
probability of playing, = 0.64.
Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each class. The class with
the highest posterior probability is the outcome of the prediction.
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability.
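The calculation above can be checked with exact fractions. This short sketch uses only the counts quoted in the example; the variable names are illustrative, and taking P(No | Sunny) as the complement assumes that ‘Yes’ and ‘No’ are the only two classes.

```python
from fractions import Fraction

# Counts quoted in the worked example: 14 days in total.
p_sunny_given_yes = Fraction(3, 9)   # P(Sunny | Yes)
p_sunny = Fraction(5, 14)            # P(Sunny)
p_yes = Fraction(9, 14)              # P(Yes)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = 1 - p_yes_given_sunny   # assumes only two classes

print(p_yes_given_sunny, float(p_yes_given_sunny))  # 3/5 0.6
print(p_no_given_sunny)                             # 2/5
```

Working with the unrounded fractions gives exactly 3/5; multiplying the two-decimal values 0.33, 0.64 and 0.36 gives about 0.59 instead.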
Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This
algorithm is mostly used in text classification and in problems with multiple classes.
Pros:
It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models
like logistic regression, and you need less training data.
It performs well with categorical input variables compared to numerical variable(s). For numerical variables,
a normal distribution is assumed (bell curve, which is a strong assumption).
Cons:
If a categorical variable has a category in the test data set that was not observed in the training data set, the
model will assign it a zero probability and will be unable to make a prediction. This is often known as “Zero
Frequency”. To solve this, we can use a smoothing technique. One of the simplest smoothing techniques is
called Laplace estimation.
On the other hand, Naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba
are not to be taken too seriously.
Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost
impossible to get a set of predictors that are completely independent.
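The zero-frequency problem and Laplace estimation mentioned above can be illustrated with a small sketch. The helper `likelihood`, its parameter names and the counts are hypothetical; the formula is the usual add-alpha estimate (count + alpha) / (class_total + alpha × number of categories), with alpha = 1 giving Laplace smoothing.

```python
def likelihood(count, class_total, n_categories, alpha=0):
    """Estimate P(category | class); alpha=1 gives Laplace (add-one) smoothing."""
    return (count + alpha) / (class_total + alpha * n_categories)


# Hypothetical counts: a category never seen among 9 'Yes' training rows,
# for a feature that has 3 possible categories.
unsmoothed = likelihood(0, 9, 3)
smoothed = likelihood(0, 9, 3, alpha=1)

print(unsmoothed)  # 0.0
print(smoothed)    # 1/12, small but no longer zero
```

Since Naive Bayes multiplies the per-feature likelihoods, a single unsmoothed zero would wipe out the whole posterior for that class; the smoothed estimate avoids this.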
Applications:
Real-time prediction: Naive Bayes is an eager learning classifier and it is fast. Thus, it can be used for
making predictions in real time.
Multi-class prediction: This algorithm is also well known for its multi-class prediction feature. Here we can
predict the probability of multiple classes of the target variable.
Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are mostly used in text
classification, where (owing to good results in multi-class problems and the independence rule) they have a higher
success rate than many other algorithms. As a result, they are widely used in spam filtering (identifying spam
e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
Recommendation system: A Naive Bayes classifier and collaborative filtering together build a
recommendation system that uses machine learning and data mining techniques to filter unseen information
and predict whether a user would like a given resource or not.