Lecture Slides 1 - Introduction, PLA, and Logistic Regression - 2021
Dr Etienne Pienaar
etienne.pienaar@uct.ac.za Room 5.43
October 3, 2021
1/48
Course Materials and Software
• The course material for this section consists of a set of lecture notes
which mainly draws from the following sources:
• Learning From Data by Abu-Mostafa et al. (2012).
• Neural Networks and Deep Learning by Nielsen (2015).
• The Elements of Statistical Learning by Friedman et al. (2001).
• Each section in these notes has homework problems which you can try
out. Though we use some of these problems as a basis for solving
problems in class, no formal solution set will be given.
• We will be using R, and R only, in this course. All tasks given for this
section will require you to code from scratch.
2/48
Neural Networks
3/48
Neural Networks
4/48
Objectives of Statistical Learning
5/48
Are Neural Networks Useful for These Purposes?
6/48
Why are Neural Networks Useful for Non-Linear Problems?
7/48
Example: Recognizing a Non-linear Shape
[Figure: `Observations' — a scatter plot of the labelled data in the (x1, x2) feature space, with both axes ranging from −1 to 1.]
8/48
Example: `Learning' a Non-linear Shape
9/48
Example: `Predicting' a Non-linear Shape
[Figure: `Model' — network diagram of a fitted 2-3-1 neural network (dim(theta) = 13), mapping inputs x1 and x2 to the output y.]
Prediction under the fitted model at various points in the feature space.
10/48
In conclusion, yes, they are useful. But in order to understand when they
are useful, one needs to understand the technology inside-out. So, for
purposes of the present course, we will cover the mathematics from the
ground up, code the relevant components of the model class from
scratch, and conduct analyses using what we've built.
Before we begin our journey, it's probably a good idea to set some goals
in the form of questions that we might like to be able to answer about
the model class:
• What does a neural network do? (Almost surely, this will be an
interview question.)
• How do neural networks learn patterns? Do they really perceive
structures in the feature space which are not obvious to the
statistician?
• Where is the analysis most likely to go wrong, and how do we diagnose
the errors?
11/48
The Perceptron
Based on Abu-Mostafa et al. (2012).
12/48
History (Slide credit: Dr Sebnem Er)
`Perceptron' may refer to either a simple neural network as we know it
or a primitive type of neural network. For our purposes, when referring to
a perceptron we take it to mean the latter, though the distinction is not
always clear in the literature:
• 1943: Warren McCulloch and Walter Pitts introduced one of the first
artificial neurons: a weighted sum of input signals is compared to a
threshold to determine the neuron output, such that if the
output ≥ threshold, return 1; if the output < threshold, return 0.
• In the late 1950s, Frank Rosenblatt and others developed a class of
neural networks called `perceptrons', which were similar to McCulloch
and Pitts' network. The key contribution was the introduction of a
learning rule to solve pattern recognition problems.
• The learning problem is to determine a weight vector that causes the
perceptron to produce the correct output for each of the given training
examples.
• Along the way, basic models were modified to have softer discriminants
for generating outputs (activation functions) and multiple layers,
leading to the so-called multi-layer perceptron, which we would now
refer to as a neural network.
13/48
Base Elements of a Learning Algorithm
As machine learners, you will be in the business of developing/using
`learning algorithms' under an appropriate statistical learning
paradigm. Whatever methodology you employ, the base elements usually
remain the same:
• You have data (and hopefully lots of it) consisting of some
response/target/output which you wish to relate/predict using an
appropriate set of predictors/inputs.
• The mechanism by which we relate these two components is the
learning algorithm.
• The learning algorithm typically consists of a mathematical model
and an appropriate training mechanism which selects an appropriate
configuration of the mathematical model based on the data.
• The training mechanism may follow implicitly from the mathematical
model (e.g., the PLA), or be carefully selected to deal with the
nuances of the particular model class (this is the case for NNs).
14/48
The Perceptron
where of course:
$$x_i^T w = \sum_{k=1}^{p} x_{ik} w_k.$$
15/48
The Perceptron
$$y_i = \begin{cases} +1 & \text{if } x_i^T w > b \\ -1 & \text{if } x_i^T w < b, \end{cases}$$
16/48
The Perceptron
[Figure: the activation $\mathrm{sign}\big(\sum_{k=1}^{p} w_k x_k - b\big)$ plotted against $\sum_{k=1}^{p} w_k x_k - b$, stepping from −1 to +1 at zero.]
17/48
The Perceptron
18/48
The Perceptron
Our model thus functions by simply evaluating the sign of the offset
`score'. Assuming that this model is a valid representation of the
underlying process, we can then predict, based on some estimated
parameter vector ŵ, a new observation with predictors, say x:
$$\hat{y} = \mathrm{sign}(x^T \hat{w} - \hat{b}),$$
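In R, this prediction rule is a one-liner; a minimal sketch (the function name predict_perceptron is our own illustrative choice):

# Perceptron prediction: x is an n x p matrix, w the weights, b the offset.
predict_perceptron = function(x, w, b) sign(x %*% w - b)  # Returns +1/-1 labels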
19/48
The Perceptron Learning Algorithm
• Initialize: Make an initial guess for the weight vector w, say $w^{(0)}$.
• Iterate:
1. Pick an example at random, say $i^\star$, from the set of labelled
observations which is misclassified by the expression:
$$\hat{y} = \mathrm{sign}(x_{i^\star}^T w^{(\text{old})}).$$
2. Update the weights so that the score moves towards the observed label
(the next slide motivates this step geometrically):
$$w^{(\text{new})} = w^{(\text{old})} + y_{i^\star} x_{i^\star}.$$
20/48
The Perceptron Learning Algorithm
The geometric intuition behind the PLA is that the updating algorithm
aims (after picking a misclassified point) to modify the inner product
between x and w in such a way that it reflects the observed classification:
21/48
PLA Logic
[Figure: two panels, each showing an observation together with the weight vector before (w_before) and after (w_after) a PLA update, with axes ranging from −2 to 2.]
22/48
PLA in action: 1D
23/48
PLA in action: 2D
24/48
Notes on the PLA
25/48
The Perceptron Learning Algorithm
The perceptron and the PLA represent perhaps the simplest marriage
of a model class and learning algorithm that resembles the functioning
of a neural network.
• Indeed, the literature often refers to neural networks as multi-layer
perceptrons (MLPs) where we will refer to standard feed-forward
networks.
• Though linear, the class can be used to model non-linear phenomena
by way of appropriate transformation of the features.
• In the present treatment, we have only dealt with fitting the model to
data which we have seen. Ideally, we would like to use the algorithm to
classify unseen observations; fortunately, we can easily calculate the
out-of-sample performance of the algorithm.
26/48
R: Code the PLA
Use the following skeleton code to complete the exercises for Section 2.
# x - Predictor
# y - Labels for each response
# w - A starting parameter vector
PLA = function(x,y,w)
{
# A while loop with appropriate escape:
while()
{
}
# Return a list of stuff:
return(list(weights = , predictions = ))
}
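For reference, a minimal sketch of one possible completion, with a toy usage example. It assumes x carries a leading column of ones (so the offset b is absorbed into w), y holds ±1 labels, and the classes are linearly separable (otherwise the while loop never escapes); it is not a prescribed solution to the exercises.

# One possible completion of the skeleton above (assumptions as stated):
PLA = function(x,y,w)
{
  yhat = sign(x %*% w)
  # A while loop with appropriate escape: stop once nothing is misclassified
  while (any(yhat != y))
  {
    # Pick a misclassified observation at random:
    mis = which(yhat != y)
    i   = mis[sample.int(length(mis), 1)]  # Safe even if only one remains
    # Update step: nudge x_i^T w towards the observed label y_i:
    w    = w + y[i] * x[i, ]
    yhat = sign(x %*% w)
  }
  # Return a list of stuff:
  return(list(weights = w, predictions = yhat))
}

# Toy usage on linearly separable data:
set.seed(2021)
X = cbind(1, matrix(runif(200, -1, 1), 100, 2))
y = sign(X %*% c(-0.2, 1, -1))
fit = PLA(X, y, w = rep(0, 3))
mean(fit$predictions == y)  # Should be 1 on separable data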
27/48
Wrap-Up
Why the hell do we care about the PLA? This is supposed to be the
ANNs section?! Well, there are a number of reasons:
• This is an unfamiliar model class built on linear elements passed
through a sigmoidal activation...
• The learning algorithm is a simple iterative numerical procedure...
Apart from the updating step, algorithms for fitting ANNs will have
similar structure and path-dependent stopping criteria.
• Perceptrons fit within the model + learning algorithm + learning
paradigm in exactly the same way as ANNs do... In fact, for any given
dichotomous learning problem, a priori, there is no reason to suspect
that a neural network should be able to outperform this simple model.
28/48
Logistic Regression
Based loosely on Dobson and Barnett (2008) and Neter et al.
(1996).
29/48
Logistic Regression: Response Distribution
where π denotes the probability that Y takes on the value 1, and the
model is characterised i.t.o. the probability mass function:
$$f(y, \pi) = \pi^y (1 - \pi)^{1-y}.$$
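As a quick sanity check, this pmf is simple to evaluate in R (it coincides with dbinom with size = 1):

# Bernoulli pmf; equivalent to dbinom(y, size = 1, prob = pi):
f = function(y, pi) pi^y * (1 - pi)^(1 - y)
f(1, 0.3)  # 0.3
f(0, 0.3)  # 0.7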
30/48
Logistic Regression: Predictors
For purposes of modeling the relationship between the response and the
predictors, we again make use of the linear component to capture the
information in the predictors:
$$\eta = \sum_{k=1}^{p} w_k x_k \in \mathbb{R}.$$
k=1
31/48
Logistic Regression: Linear Predictors
through which the linear component thus enters the model by way of the
inverse link:
$$\pi = g^{-1}\left(\sum_{k=1}^{p} w_k x_k\right).$$
32/48
Logistic Regression: Sigmoidal Response Functions
Function | g(π) | g⁻¹(η)
Logit | $\log\left(\frac{\pi}{1-\pi}\right)$ | $\frac{e^\eta}{1+e^\eta}$
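A small R sketch of this link/inverse-link pair (the names g and g_inv are our own), confirming that they undo one another:

# Logit link and its inverse:
g     = function(p) log(p / (1 - p))
g_inv = function(eta) exp(eta) / (1 + exp(eta))
g_inv(g(0.3))  # Recovers 0.3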
33/48
Logistic Regression: Sigmoidal Response Functions
[Figure: sigmoidal response functions f(z) plotted over z ∈ (−4, 4), comparing the Logistic, Probit, and C-LogLog links.]
34/48
Logistic Regression: Graphical Representation
[Figure: network diagram with input nodes X1, ..., Xp feeding a single output node Y.]
35/48
Fitting Logistic Regression: Maximum Likelihood
$$\hat{w} = \operatorname{argmax}_w (L(y, w)),$$
where
$$L(y, w) = \prod_{i=1}^{n} f(y_i, g^{-1}(x_i^T w)),$$
or equivalently:
$$\hat{w} = \operatorname{argmax}_w (\log(L(y, w))),$$
with log-likelihood
$$\log(L(y, w)) = \sum_{i=1}^{n} \left( y_i \log(g^{-1}(x_i^T w)) + (1 - y_i) \log(1 - g^{-1}(x_i^T w)) \right).$$
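A hedged R sketch of this fit by direct numerical optimisation: we minimise the negative log-likelihood with optim under the logit link. The toy data and the name neg_loglik are our own illustration; in practice glm(..., family = binomial) performs this fit for you.

# Negative log-likelihood for logistic regression (logit link):
neg_loglik = function(w, x, y)
{
  pi_hat = 1 / (1 + exp(-(x %*% w)))  # Inverse link applied to x_i^T w
  -sum(y * log(pi_hat) + (1 - y) * log(1 - pi_hat))
}

# Toy data with an intercept column:
set.seed(2021)
X = cbind(1, runif(200, -2, 2))
y = rbinom(200, 1, 1 / (1 + exp(-(X %*% c(-0.5, 1.5)))))

fit = optim(par = c(0, 0), fn = neg_loglik, x = X, y = y)
fit$par                                   # Close to the estimates from:
coef(glm(y ~ X[, 2], family = binomial))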
36/48
Fitting Logistic Regression: Maximum Likelihood
37/48
Logistic Regression in Action
38/48
Polytomous Regression/Nominal Logistic Regression
$$L(y, w) = \prod_{i=1}^{n} f(y_i, \pi),$$
where
$$f(y_i, \pi) = \prod_{j=1}^{q} \pi_j^{y_{ij}},$$
with $\pi = \{\pi_1, \pi_2, \ldots, \pi_q\}$ subject to the constraint $\sum_{j=1}^{q} \pi_j = 1$.
39/48
Nominal Logistic Regression
Definition
Nominal logistic regression applies the ideas behind logistic regression
to multinomial responses.
$$\log\left(\frac{\pi_j}{\pi_b}\right) = \eta_j = x_i^T \beta_j, \quad j \neq b.$$
• Since we have more than one category, the symmetry of `1 − π_i' is broken,
but we analogously measure odds w.r.t. a base category π_b.
• b is usually chosen as the first category, i.e., π_1.
40/48
Nominal Logistic Regression
Definition
Nominal logistic regression applies the ideas behind logistic regression to multinomial
responses.
$$\pi_j = \pi_1 \exp(x_i^T w_j), \quad j = 2, 3, \ldots, q.$$
• Since $\sum_{j=1}^{q} \pi_j = 1$, we have:
$$\pi_1 = \frac{1}{1 + \sum_{j=2}^{q} \exp(x_i^T w_j)},$$
and
$$\pi_j = \frac{\exp(x_i^T w_j)}{1 + \sum_{j=2}^{q} \exp(x_i^T w_j)}, \quad j = 2, 3, \ldots, q.$$
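A small R sketch of these two formulas (the helper name category_probs is our own); it takes the linear scores $x_i^T w_j$ for j = 2, ..., q and returns all q probabilities:

# Category probabilities with base category j = 1:
category_probs = function(eta)  # eta = (x_i^T w_2, ..., x_i^T w_q)
{
  c(1, exp(eta)) / (1 + sum(exp(eta)))
}
category_probs(c(0.5, -1))  # q = 3 probabilities, summing to one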
41/48
Viewing the log-likelihood as the appropriate objective function for such a
multi-class classification problem, we derive:
$$\log(L(y, w_1, w_2, \ldots)) = \sum_{i=1}^{n} \sum_{j=1}^{q} y_{ij} \log(\pi_j(w_j)).$$
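Viewed in code, this objective is a one-liner (a sketch assuming the labels $y_{ij}$ are one-hot encoded; the names Y and P are our own):

# Multi-class log-likelihood: Y is an n x q one-hot label matrix and
# P an n x q matrix of category probabilities with rows summing to one.
multiclass_loglik = function(Y, P) sum(Y * log(P))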
42/48
Nominal Logistic Regression
As before, the end-game with nominal logistic regression is to establish
the relative odds of an observation being distributed to a category given
a set of explanatory variables. Interpretation is almost invariably i.t.o. the
effect of a given explanatory variable on the odds of observing a category
of response.
• Consider comparing two categories with one continuous predictor x; then:
$$\log\left(\frac{\pi_2}{\pi_1}\right) = \eta_1 = w_{11} + w_{21} x,$$
$$\log\left(\frac{\pi_3}{\pi_1}\right) = \eta_2 = w_{12} + w_{22} x,$$
which leads to the odds-ratio (see the short sketch after this list):
$$\log(\mathrm{OR}) = \log\left(\frac{\pi_3}{\pi_1}\right) - \log\left(\frac{\pi_2}{\pi_1}\right).$$
• Alternatively, we could simply be interested in the most likely category
for a given set of predictors.
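A tiny numerical sketch of that odds-ratio calculation, using made-up coefficient values purely for illustration:

# Hypothetical coefficients for eta1 = log(pi2/pi1) and eta2 = log(pi3/pi1):
w1 = c(0.4, -1.1)  # (w11, w21)
w2 = c(0.2, 0.8)   # (w12, w22)
x  = 1.5
log_OR = (w2[1] + w2[2] * x) - (w1[1] + w1[2] * x)
exp(log_OR)  # Odds of category 3 relative to category 2 at this x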
43/48
Some examples of nominal logistic regression problems:
• iris data (a standard R dataset; see the brief sketch below).
• Car Preference Data.txt (Dobson and Barnett, 2008).
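A hedged sketch of such an analysis on the iris data, using multinom from the nnet package as one convenient fitting route (the choice of predictors is arbitrary):

# Nominal logistic regression on iris (base category: first level of Species):
library(nnet)
fit = multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
summary(fit)              # Coefficients are log-odds relative to the base category
predict(fit, head(iris))  # Most likely category for the first few rows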
44/48
Logistic Regression: Where to from here?
45/48
46/48
Logistic Regression: Graphical Representation
[Figure: network diagram with input nodes X1, ..., Xp feeding multiple output nodes Y1, Y2, ...]
47/48
References I
Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H.-T. (2012). Learning From Data. AMLBook.
Dobson, A. J. and Barnett, A. G. (2008). An Introduction to Generalized Linear Models. Chapman & Hall/CRC.
Friedman, J., Hastie, T., and Tibshirani, R. (2001). The Elements of Statistical Learning. Springer.
Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. (1996). Applied Linear Statistical Models. Irwin.
Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press.
48/48