
Machine Learning – 18CS71

Module 3

Decision Tree Learning and
Artificial Neural Networks
Module - 3
Decision Tree Learning
1. Introduction
2. Decision Tree Representation
3. Appropriate Problems for Decision Tree Learning
4. Basic Decision Tree Learning Algorithm (ID3)
Build a decision tree that predicts whether tennis will be played on a given day.
Step 1: Create a root node

• How to choose the root node?
• Choose the attribute that best classifies the training data, and use this attribute at the root of the tree.
How to choose the best attribute? This is where the ID3 algorithm begins.
Definition: Entropy
• Entropy is a measure of the homogeneity (impurity) of a collection of examples.
• Given a collection S, containing positive and negative examples of some target concept, the entropy of S is given by

Entropy(S) = -p+ log2(p+) - p- log2(p-)

• where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S.
• In all calculations involving entropy we define 0·log2(0) = 0.
• In general, for a c-class classification:

Entropy(S) = Σ (i = 1 to c) -pi log2(pi)
Entropy Function
• The entropy function relative to a boolean classification, as the proportion p+ of positive examples varies between 0 and 1: entropy is 0 when p+ is 0 or 1, and reaches its maximum of 1 at p+ = 0.5.
Calculate Entropy (the amount of uncertainty in the dataset):

Entropy(S) = Σ over classes c of -p(c) log2 p(c)

Calculate Average Information (the weighted entropy of the subsets produced by splitting on attribute A):

I(A) = Σ over values v of A of (|Sv| / |S|) · Entropy(Sv)

Calculate Information Gain (the difference in entropy before and after splitting the dataset on attribute A):

Gain(S, A) = Entropy(S) - I(A)
Steps
1. Compute the entropy for the whole dataset, Entropy(S).
2. For every attribute/feature:
   a. Calculate the entropy for each of its values, Entropy(Sv).
   b. Take the average information entropy for the current attribute.
   c. Calculate the gain for the current attribute.
3. Pick the highest-gain attribute.
4. Repeat on each branch until we get the tree we desire.
1. Compute the entropy for the dataset: P = 9 positive (Yes), N = 5 negative (No), Total = 14

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940
def entropy(target_col):
    # Count the occurrences of each distinct class label
    elements, counts = np.unique(target_col, return_counts=True)
    # Entropy = sum over classes of -p * log2(p)
    entropy = np.sum([(-counts[i] / np.sum(counts)) * np.log2(counts[i] / np.sum(counts))
                      for i in range(len(elements))])
    return entropy
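As a quick check, calling this on the class column should reproduce the entropy computed above (a sketch; it assumes the dataset DataFrame loaded in the complete program a few slides below):

entropy(dataset['class'])   # ≈ 0.940 for 9 Yes / 5 No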
• pd.unique(pd.Series([2, 1, 3, 3]))  # returns array([2, 1, 3]) – the distinct values, analogous to the np.unique call above
P = 9, N = 5, Total = 14

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940
def InfoGain(data,split_attribute_name,target_name="class"):
total_entropy = entropy(data[target_name])
vals,counts= np.unique(data[split_attribute_name],return_counts=True)
Weighted_Entropy =
np.sum([(counts[i]/np.sum(counts))*entropy(data.where(data[split_attribute_name]==vals[i]).dr
opna()[target_name])
for i in range(len(vals))])
Information_Gain = total_entropy - Weighted_Entropy
return Information_Gain
18
import pandas as pd
import numpy as np

dataset = pd.read_csv('PlayTennis.csv',
                      names=['outlook', 'temperature', 'humidity', 'wind', 'class'])

def entropy(target_col):
    elements, counts = np.unique(target_col, return_counts=True)
    e = np.sum([(-counts[i] / np.sum(counts)) * np.log2(counts[i] / np.sum(counts))
                for i in range(len(elements))])
    return e

def InfoGain(data, split_attribute_name, target_name="class"):
    total_entropy = entropy(data[target_name])
    vals, counts = np.unique(data[split_attribute_name], return_counts=True)
    Weighted_Entropy = np.sum([(counts[i] / np.sum(counts)) *
                               entropy(data.where(data[split_attribute_name] == vals[i]).dropna()[target_name])
                               for i in range(len(vals))])
    Information_Gain = total_entropy - Weighted_Entropy
    return Information_Gain

print(InfoGain(dataset, 'outlook', 'class'))
For each attribute (say, Outlook):
• Calculate the entropy for each of its values, i.e. 'Sunny', 'Rainy', and 'Overcast'.
Calculate Entropy(Outlook = value), i.e. the entropy of the subset of examples with each outlook value.
For each attribute (say, Temperature):
Calculate the entropy for each temperature value, i.e. 'Hot', 'Mild', and 'Cool'.

Calculate the average information entropy I(Temperature).

Calculate the gain for attribute Temperature: Gain(S, Temperature) = Entropy(S) - I(Temperature).
For each attribute (say, Humidity):
Calculate the entropy for each humidity value, i.e. 'High' and 'Normal'.

Calculate the average information entropy I(Humidity).

Calculate the gain for attribute Humidity: Gain(S, Humidity) = Entropy(S) - I(Humidity).
For each attribute (say, Windy):
Calculate the entropy for each windy value, i.e. 'Strong' and 'Weak'.

Calculate the average information entropy I(Windy).

Calculate the gain for attribute Windy: Gain(S, Windy) = Entropy(S) - I(Windy).
Pick the highest-gain attribute.

Root node: OUTLOOK
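To see why Outlook wins, loop the InfoGain function from the program above over all four attributes (a sketch; for the standard 14-example PlayTennis data the gains come out to about 0.246 for outlook, 0.151 for humidity, 0.048 for wind, and 0.029 for temperature):

for attr in ['outlook', 'temperature', 'humidity', 'wind']:
    print(attr, InfoGain(dataset, attr, 'class'))
# outlook has the largest information gain, so it becomes the root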
Repeat the same procedure for the sub-trees until we get the tree, splitting the remaining examples on:

Outlook = "Sunny"

Outlook = "Rainy"
Outlook = "Sunny": P = 2, N = 3, Total = 5

Entropy = -(2/5) log2(2/5) - (3/5) log2(3/5) ≈ 0.971
For each attribute (say, Temperature):
Calculate the entropy for each temperature value, i.e. 'Cool', 'Hot', and 'Mild'.

Calculate the average information entropy: I(Temperature) = 0.4

Calculate the gain: Gain = 0.971 - 0.4 = 0.571
For each attribute (say, Humidity):
Calculate the entropy for each humidity value, i.e. 'High' and 'Normal'.

Calculate the average information entropy: I(Humidity) = 0

Calculate the gain: Gain = 0.971 - 0 = 0.971
For each attribute (say, Windy):
Calculate the entropy for each windy value, i.e. 'Strong' and 'Weak'.

Calculate the average information entropy: I(Windy) = 0.951

Calculate the gain: Gain = 0.971 - 0.951 = 0.020
Pick the highest-gain attribute.

Next node under Sunny: HUMIDITY
Outlook = "Rainy": P = 3, N = 2, Total = 5

Entropy = -(3/5) log2(3/5) - (2/5) log2(2/5) ≈ 0.971
For each attribute (say, Humidity):
Calculate the entropy for each humidity value, i.e. 'High' and 'Normal'.

Calculate the average information entropy: I(Humidity) = 0.951

Calculate the gain: Gain = 0.971 - 0.951 = 0.020
For each attribute (say, Windy):
Calculate the entropy for each windy value, i.e. 'Strong' and 'Weak'.

Calculate the average information entropy: I(Windy) = 0

Calculate the gain: Gain = 0.971 - 0 = 0.971
For each attribute (say, Temperature):
Calculate the entropy for each temperature value, i.e. 'Cool', 'Hot', and 'Mild'.

Calculate the average information entropy: I(Temperature) = 0.951

Calculate the gain: Gain = 0.971 - 0.951 = 0.020
Pick the highest-gain attribute.

Next node under Rainy: WINDY

Leaf branches: Windy = "Weak" → Yes; Windy = "Strong" → No
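Putting the whole procedure together, a minimal recursive ID3 sketch built on the entropy and InfoGain functions from the earlier slides might look like this (the helper name build_id3 and the majority-label fallback are illustrative assumptions, not part of the original slides):

def build_id3(data, features, target_name="class"):
    labels = np.unique(data[target_name])
    # All examples agree: return the label as a leaf
    if len(labels) == 1:
        return labels[0]
    # No features left to split on: return the majority label
    if len(features) == 0:
        return data[target_name].mode()[0]
    # Split on the attribute with the highest information gain
    gains = [InfoGain(data, f, target_name) for f in features]
    best = features[int(np.argmax(gains))]
    tree = {best: {}}
    remaining = [f for f in features if f != best]
    # Grow one subtree per value of the chosen attribute
    for value in np.unique(data[best]):
        subset = data.where(data[best] == value).dropna()
        tree[best][value] = build_id3(subset, remaining, target_name)
    return tree

print(build_id3(dataset, ['outlook', 'temperature', 'humidity', 'wind']))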
Cont…
• How to learn a consistent tree with low expected cost?
• One approach is to replace Gain by a cost-normalized gain.
• Examples of normalization functions (Cost(A) is the cost of measuring attribute A):

Gain²(S, A) / Cost(A)   (Tan and Schlimmer, 1990)

(2^Gain(S, A) - 1) / (Cost(A) + 1)^w   (Nunez, 1988), where w ∈ [0, 1] weights the importance of cost
Neural Networks

INTRODUCTION

• Artificial neural networks (ANNs) provide a general, practical method for learning real-valued, discrete-valued, and vector-valued target functions from examples.
Biological Motivation

• The study of artificial neural networks (ANNs) has been inspired by the observation that biological learning systems are built of very complex webs of interconnected neurons.
• The human information processing system consists of the brain; the neuron is its basic building block, a cell that communicates information to and from various parts of the body.
• Simplest model of a neuron: a threshold unit, i.e. a processing element (PE).
• It collects inputs and produces an output if the sum of the inputs exceeds an internal threshold value.
Facts of Human Neurobiology
• Number of neurons: ~10^11
• Connections per neuron: ~10^4 – 10^5
• Neuron switching time: ~0.001 second (10^-3 s)
• Scene recognition time: ~0.1 second
• 100 inference steps doesn't seem like enough
• → Highly parallel computation based on distributed representation
Properties of Neural Networks
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically
• Input is high-dimensional, discrete or real-valued (e.g., sensor input)
• Examples:
  • Speech phoneme recognition
  • Image classification
  • Financial prediction
APPROPRIATE PROBLEMS FOR
NEURAL NETWORK LEARNING
• Instances are represented by many attribute-value pairs.
• The target function output may be discrete-valued, real-valued, or
a vector of several real- or discrete-valued attributes
• The training examples may contain errors.
• Long training times are acceptable.
• Fast evaluation of the learned target function may be required
• The ability of humans to understand the learned target function is
not important.

Perceptron
• A perceptron takes a vector of real-valued inputs,
• calculates a linear combination of these inputs, then outputs 1 if the result is greater than some threshold and -1 otherwise.
• More precisely, given inputs x1 through xn, the output o(x1, ..., xn) computed by the perceptron is

o(x1, ..., xn) = 1 if w0 + w1x1 + w2x2 + ... + wnxn > 0, and -1 otherwise
Cont…
• Each wi is a real-valued constant, or weight, that determines the contribution of input xi to the perceptron output.
• The quantity -w0 is a threshold that the weighted combination of inputs w1x1 + ... + wnxn must surpass in order for the perceptron to output 1.
• To simplify notation, we add a constant input x0 = 1, allowing us to write the above inequality as

Σ (i = 0 to n) wi xi > 0

• or in vector form as

w · x > 0
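A direct transcription of this definition into Python (a sketch; the helper name perceptron_output is an assumption, with the bias folded in as w0):

import numpy as np

def perceptron_output(w, x):
    # x excludes the constant input; prepend x0 = 1 so w[0] acts as the threshold weight w0
    x = np.concatenate(([1.0], x))
    return 1 if np.dot(w, x) > 0 else -1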
The perceptron function can therefore be written as

o(x) = sgn(w · x),   where sgn(y) = 1 if y > 0 and -1 otherwise
• Learning a perceptron involves choosing values for the weights w0, ..., wn.
• Therefore, the space H of candidate hypotheses considered in perceptron learning is the set of all possible real-valued weight vectors.
4.4.1 Representational Power of Perceptrons
• The perceptron represents a hyperplane decision surface in the n-dimensional space of instances (i.e., points).
• The perceptron outputs 1 for instances lying on one side of the hyperplane and -1 for instances lying on the other side.
• The decision surface represented by a two-input perceptron:
• (a) A set of training examples and the decision surface of a perceptron that classifies them correctly.
• (b) A set of training examples that is not linearly separable (i.e., that cannot be correctly classified by any straight line).
• x1 and x2 are the perceptron inputs.
• Positive examples are indicated by "+", negative by "-".
• The perceptron works in these simple steps:
• a. Each input x is multiplied by its weight w.
• b. All the products are added together; call this the weighted sum.
• c. The weighted sum is passed through the appropriate activation function (here, the sign/threshold function).
Representational Power of Perceptrons
• Some sets of positive and negative examples cannot be separated by any hyperplane.
• Those that can be separated are called linearly separable sets of examples.
Cont…
• A single perceptron can be used to represent many boolean functions.
• For example, if we assume boolean values of 1 (true) and -1 (false), then one way to use a two-input perceptron to implement the AND function is to set the weights w0 = -0.8 and w1 = w2 = 0.5. This perceptron can be made to represent the OR function instead by altering the threshold to w0 = -0.3.
• In fact, AND and OR can be viewed as special cases of m-of-n functions: that is, functions where at least m of the n inputs to the perceptron must be true.
• The OR function corresponds to m = 1 and the AND function to m = n.
• Any m-of-n function is easily represented using a perceptron by setting all input weights to the same value (e.g., 0.5) and then setting the threshold w0 accordingly, as checked in the sketch below.
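A quick truth-table check of these weights using the perceptron_output sketch from earlier (aside: with the 1/-1 input encoding the OR case needs a positive threshold, w0 = +0.3, which the sketch uses; the quoted w0 = -0.3 behaves as described under a 0/1 encoding):

for x1 in (1, -1):
    for x2 in (1, -1):
        and_out = perceptron_output(np.array([-0.8, 0.5, 0.5]), [x1, x2])  # AND
        or_out = perceptron_output(np.array([0.3, 0.5, 0.5]), [x1, x2])    # OR
        print(x1, x2, and_out, or_out)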
• Perceptrons can represent all of the primitive boolean functions AND, OR, NAND (¬AND), and NOR (¬OR).
• Some boolean functions cannot be represented by a single perceptron, such as the XOR function, whose value is 1 if and only if x1 ≠ x2.
• Note that the set of linearly non-separable training examples shown in Figure 4.3(b) corresponds to this XOR function.
• The ability of perceptrons to represent AND, OR, NAND, and NOR is important because every boolean function can be represented by some network of interconnected units based on these primitives.
• In fact, every boolean function can be represented by some network of perceptrons only two levels deep, in which
• the inputs are fed to multiple units, and the outputs of these units are then input to a second, final stage.
• One way is to represent the boolean function in disjunctive normal form (i.e., as the disjunction (OR) of a set of conjunctions (ANDs) of the inputs and their negations).
• Note that the input to an AND perceptron can be negated simply by changing the sign of the corresponding input weight.
• Because networks of threshold units can represent a rich variety of functions, and because single units alone cannot, we will generally be interested in learning multilayer networks of threshold units.
Recap
• Perceptrons
• Representation

Content
• The Perceptron Training Rule
• Gradient Descent and the Delta Rule

The Perceptron Training Rule
• We eventually want to learn networks of many interconnected units; we begin with: how to learn the weights for a single perceptron?
• The precise learning problem is to determine a weight vector that causes the perceptron to produce the correct ±1 output for each of the given training examples.
• Several algorithms are known to solve this learning problem:
• the perceptron rule
• the delta rule
• These two algorithms are guaranteed to converge to somewhat different acceptable hypotheses, under somewhat different conditions.
• They are important to ANNs because they provide the basis for learning networks of many units.
Cont…
• One way to learn an acceptable weight vector is to
• begin with random weights,
• then iteratively apply the perceptron to each training example,
• modifying the perceptron weights whenever it misclassifies an example.
• This process is repeated, iterating through the training examples as many times as needed, until the perceptron classifies all training examples correctly.
• Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi according to the rule

wi ← wi + Δwi,   where Δwi = η (t - o) xi
Cont…
• t is the target output for the current training example,
• o is the output generated by the perceptron,
• η is a positive constant called the learning rate.
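A compact sketch of this rule in Python (it assumes the perceptron_output helper defined earlier; the data layout and the epoch cap are illustrative assumptions):

def train_perceptron(X, t, eta=0.1, epochs=100):
    # X: list of input vectors, t: list of +1/-1 targets
    w = np.zeros(len(X[0]) + 1)               # +1 for the bias weight w0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, t):
            o = perceptron_output(w, x)
            if o != target:
                # Perceptron training rule: wi <- wi + eta * (t - o) * xi
                w += eta * (target - o) * np.concatenate(([1.0], x))
                errors += 1
        if errors == 0:                       # converged: all examples classified correctly
            break
    return w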
Cont…
• The role of the learning rate is to moderate the degree to which
weights are changed at each step.
• It is usually set to some small value (e.g., 0.1) and is sometimes
made to decay as the number of weight-tuning iterations increases.

Cont…
• Why should this update rule converge toward successful weight values?
• Suppose the perceptron outputs -1 when the target output is +1.
• To make the perceptron output +1 instead of -1 in this case, the weights must be altered to increase the value of w · x.
• For example, with η = 0.1, xi = 0.8, t = 1, and o = -1, the update is Δwi = η(t - o)xi = 0.1(1 - (-1))0.8 = 0.16, which increases wi.
Cont…
• The above learning procedure can be proven to converge, within a finite number of applications of the perceptron training rule, to a weight vector that correctly classifies all training examples, provided:
• the training examples are linearly separable, and
• a sufficiently small η is used.
• If the data are not linearly separable, convergence is not assured.
Recap
• The Perceptron Training Rule
• Gradient Descent and the Delta Rule

Content
• Gradient Descent and the Delta Rule
• Stochastic Approximation to Gradient Descent
• A differentiable threshold unit
• The backpropagation algorithm

4.4.3 Gradient Descent and the Delta Rule
• Disadvantage of the perceptron rule:
• It finds a successful weight vector when the training examples are linearly separable, but it can fail to converge if the examples are not linearly separable.
• Second rule: the delta rule.
• Aim: designed so that even if the training examples are not linearly separable, the delta rule converges toward a best-fit approximation to the target concept.
• Idea: use gradient descent to search the hypothesis space of possible weight vectors to find the weights that best fit the training examples.
• This rule is important because
• gradient descent provides the basis for the backpropagation algorithm, which can learn networks with many interconnected units, and
• gradient descent can serve as the basis for learning algorithms that must search through hypothesis spaces containing many different types of continuously parameterized hypotheses.
Cont…
• The delta training rule is best understood by considering the task of training an unthresholded perceptron; that is, a linear unit for which the output o is given by

o(x) = w · x

• Thus, a linear unit corresponds to the first stage of a perceptron, without the threshold.
Cont…
• To derive a weight learning rule for linear units,
1. begin by specifying a measure for the training error of a hypothesis (weight vector), relative to the training examples.
• There are many ways to define this error; one common measure that will turn out to be especially convenient is

E(w) ≡ (1/2) Σ (d ∈ D) (td - od)²

• where D is the set of training examples, td is the target output for training example d, and od is the output of the linear unit for example d.
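Differentiating E gives ∂E/∂wi = Σd (td - od)(-xid), so the gradient-descent update is Δwi = η Σd (td - od) xid. A minimal batch implementation for the linear unit (a sketch; the data layout is an assumption):

def gradient_descent_linear(X, t, eta=0.05, epochs=500):
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    X = np.hstack([np.ones((len(X), 1)), X])   # constant input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                              # linear-unit outputs for all examples
        # Batch update: weights change only after the error is summed over all examples
        w += eta * X.T @ (t - o)
    return w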
4.3.3 STOCHASTIC APPROXIMATION
TO GRADIENT DESCENT
• Gradient descent is an important general paradigm for learning.
• It is a strategy for searching through a large or infinite hypothesis
space that can be applied whenever
1. the hypothesis space contains continuously parameterized hypotheses
2. the error can be differentiated with respect to these hypothesis
parameters.

Disadvantages of Gradient Descent

(1) Converging to a local minimum can sometimes be quite slow (i.e., it can require many thousands of gradient descent steps).
(2) If there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global minimum.
Cont…
• To overcome these disadvantages, use incremental gradient descent, also called stochastic gradient descent.
• Idea of stochastic gradient descent:
• Approximate the gradient descent search by updating weights incrementally, following the calculation of the error for each individual example.
• The modified training rule: as we iterate through the training examples, we update the weight according to

Δwi = η (t - o) xi

• where t, o, and xi are the target value, unit output, and ith input for the training example in question.
Cont…
• To modify the gradient descent algorithm to implement this stochastic approximation, Equation (T4.2) is simply deleted and Equation (T4.1) is replaced by

wi ← wi + η (t - o) xi

• One way to view this stochastic gradient descent is to consider a distinct error function Ed(w) for each individual training example d, as follows:

Ed(w) = (1/2) (td - od)²
Cont…
• where td and od are the target value and the unit output value for training example d.
Cont…
• Stochastic gradient descent
• iterates over the training examples d in D,
• at each iteration altering the weights according to the gradient with respect to the distinct error function Ed(w).
• The sequence of these weight updates, when iterated over all training examples, provides a reasonable approximation to descending the gradient with respect to the original error function E(w).
• By making the value of η (the gradient descent step size) sufficiently small, stochastic gradient descent can be made to approximate true gradient descent arbitrarily closely.
The key differences between standard gradient descent and stochastic gradient descent are:

Standard gradient descent:
• The error is summed over all examples before weights are updated.
• Summing over multiple examples requires more computation per weight-update step.
• Because it uses the true gradient, it is often used with a larger step size per weight update than stochastic gradient descent.
• If there are multiple local minima in the error surface, there is no guarantee that the procedure will find the global minimum; it may fall into a local minimum.

Stochastic gradient descent:
• Weights are updated upon examining each training example.
• Less computation per weight-update step.
• The per-example updates can sometimes help avoid falling into local minima.
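The stochastic version of the linear-unit trainer above changes only where the update happens: inside the per-example loop (a sketch; same assumed data layout as before):

def stochastic_gradient_descent_linear(X, t, eta=0.05, epochs=500):
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    X = np.hstack([np.ones((len(X), 1)), X])   # constant input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xd, td in zip(X, t):
            od = xd @ w
            # Delta rule: update immediately after each example
            w += eta * (td - od) * xd
    return w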
• This training rule is known as the delta rule, the LMS (least-mean-square) rule, the Adaline rule, or the Widrow-Hoff rule.
• The delta rule is similar to the perceptron training rule:
• the two expressions appear to be identical. However, the rules are different, because o in the delta rule refers to the linear unit output o = w · x, whereas o in the perceptron rule refers to the thresholded output o = sgn(w · x).
4.5.1 A Differentiable Threshold Unit
• What type of unit shall we use as the basis for constructing multilayer networks?
• We prefer networks capable of representing highly nonlinear functions.
• The perceptron unit is one possible choice, but its discontinuous threshold makes it undifferentiable and hence unsuitable for gradient descent.
• What we need is a unit whose output is a nonlinear function of its inputs, but whose output is also a differentiable function of its inputs.
• One solution is the sigmoid unit, a unit very much like a perceptron, but based on a smoothed, differentiable threshold function.
Cont…
• Like the perceptron, the sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result.
• In the case of the sigmoid unit, however, the threshold output is a continuous function of its input.
• More precisely, the sigmoid unit computes its output o as

o = σ(w · x),   where σ(y) = 1 / (1 + e^(-y))
Cont…
• σ is often called the sigmoid function or, alternatively, the logistic function. Note that its output ranges between 0 and 1, increasing monotonically with its input.
• Its derivative is easily expressed in terms of its output, dσ(y)/dy = σ(y)(1 - σ(y)), a property that gradient descent exploits.
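A direct sketch of the sigmoid unit in Python (the helper names are illustrative assumptions; the derivative identity above is what backpropagation uses):

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_unit_output(w, x):
    # Linear combination followed by the smooth threshold
    return sigmoid(np.dot(w, np.concatenate(([1.0], x))))

def sigmoid_derivative_from_output(o):
    # d(sigma)/dy expressed purely in terms of the output o = sigma(y)
    return o * (1.0 - o)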
The Backpropagation Algorithm
• It learns the weights for a multilayer network, given a network with a
fixed set of units and interconnections.
• It employs gradient descent to attempt to minimize the squared error
between the network output values and the target values for these
outputs.

Backpropagation

https://www.youtube.com/watch?v=odlgtjXduVg