Lecture Slides 1 - Introduction, PLA, and Logistic Regression - 2021


Lecture 4: Introduction to NN, PLA, and Logistic Regression

Dr Etienne Pienaar
etienne.pienaar@uct.ac.za  Room 5.43

University of Cape Town

October 3, 2021

1/48
Course Materials and Software

• The course material for this section consists of a set of lecture notes
which mainly draws from the following sources:
• Learning From Data by Abu-Mostafa et al. (2012).
• Neural Networks and Deep Learning by Nielsen (2015).
• The Elements of Statistical Learning by Friedman et al. (2001).

• Each section in these notes has homework problems which you can try
out. Though we use some of these problems as a basis for solving
problems in class, no formal solution set will be given.

• We will be using R and R only in this course. All tasks given for this
section will require you to code from scratch.

2/48
Neural Networks

Neural networks are a broad class of non-linear models


• Though there has been a large resurgence in research on neural
networks, the field is closely tied to traditional statistical models.
• Indeed, neural networks have been around for quite some time. The
origins can be traced back to the likes of Rosenblatt and Boltzmann.
• Interest has swayed significantly over time, and the present era of hype
has been preceded by similar periods of excitation and stagnation.
• Cheap (relatively) access to computing resources and large amounts of
data have made the class attractive for data scientists and analysts.
• As usual, the hype around such classes, warranted by astonishing
performance on particular statistical tasks, is accompanied by an air
of mystique.

3/48
Neural Networks

Standard Feed-forward Neural Networks describe a class of models


which are graphically represented as sets of pair-wise connected layers of
nodes (vertices) where edges are usually directed sequentially between
successive layers of the network.
• 'Neural' in the nomenclature supposedly refers to the structure and
functioning of such models mimicking biological systems of neurons.
• Although real-world neurological systems differ substantially in their
nature, it remains an attractive metaphor and much of the
nomenclature around neural networks reflects this.
• It is easy to contaminate the interpretation of the mathematical
functioning of an ANN with that of an actual network of neurons, so
we will build the model class up from simple mathematical
foundations, and hopefully by the end of this course the distinction
between convenient discourse and function will be less opaque.
• Regardless, it is an exciting and interesting class that is well worth your
time to master.

4/48
Objectives of Statistical Learning

Usually, in the domain of Statistical Learning problems, we are presented


with one of two tasks/goals for the analysis.
• Determining how predictors and responses are related. Statistical
learning problems in business often stem from the need to inform
decisions empirically, which means identifying and interpreting
statistical relationships in data.
• Given a set of observed predictors and responses, can we predict to a
satisfactory degree of accuracy what the associated response will be?
Often we don't care what the actual relationship is, we are only
concerned with predictive performance.
In either case we need to assess the quality of the inference that can be
made in this regard. No model is perfect and we often need to conduct a
reality check or validate assumptions.

5/48
Are Neural Networks Useful for These Purposes?

So are Neural Networks useful for achieving these goals?


• Neural networks are useful for modeling highly non-linear phenomena,
i.e. for tasks where the underlying relationship between the features
and response variables is not expected to be linear, or is potentially too
complex to recreate using functions/transformations that can be
expressed analytically (in terms of polynomials, radicals such as √x, etc.)
by a statistician.
• Neural networks are also easily amenable to different types of learning
problems: regression, classification, time series.
• We can, by careful application of response curves, interpret relationships
between predictors and responses. (There are also more advanced techniques.)
• Ultimately, the applications of neural networks are typically biased
towards the latter objective.
• Sanity checking for neural networks is difficult, which is not an excuse
for not doing it. But even seasoned statisticians seem to neglect it
entirely...

6/48
Why are Neural Networks Useful for Non-Linear Problems?

The so-called universal approximation theorem states that a standard
feed-forward network can approximate continuous functions on subsets of
the reals to arbitrary precision, provided that it has a sufficient number of
hidden nodes. As such, the model class can be viewed as a basis for
constructing functions on a bounded feature space.
• Note, however, that a priori we have no reason to believe that such a
basis should outperform any other basis on any particular task. See for
example Fourier series or Legendre polynomials.
• However, neural networks can often be more easily adapted to real-world
tasks and thus may have practical advantages where software
ecosystems are concerned; there are tonnes of libraries for neural networks.

7/48
Example - Recognizing a Non-linear Shape

[Figure: 'Observations' - scatter plot of the data in the (x1, x2) feature space.]

8/48
Example - `Learning' a Non-linear Shape

9/48
Example - `Predicting' a Non-linear Shape

[Figure: 'Model' - fitted network diagram with dim(theta) = 13, connecting the inputs x1 and x2 to the output y. Prediction under the fitted model at various points in the feature space.]

10/48
In conclusion, yes, they are useful. But in order to understand when they
are useful, one needs to understand the technology inside-out. So, for
purposes of the present course, we will cover the mathematics from the
ground up, code the relevant components of the model class from
scratch, and conduct analyses using what we've built.
Before we begin our journey, it's probably a good idea to set some goals
in the form of questions that we might like to be able to answer about
the model class:
• What does a neural network do? (Almost surely, this will be an
interview question.)
• How do neural networks learn patterns? Do they really perceive
structures in the feature space which are not obvious to the
statistician?
• Where is the analysis most likely to go wrong, and how do we diagnose
the errors?

11/48
The Perceptron
Based on Abu-Mostafa et al. (2012).

12/48
History - Slide credit: Dr Sebnem Er
`Perceptron' may refer to either a simple neural network as we know it
or a primitive type of neural network. For our purposes, when referring to
a perceptron we take it to mean the latter, though the distinction is not
always clear in the literature:
• 1943: Warren McCulloch and Walter Pitts introduced one of the first
artificial neurons: a weighted sum of input signals is compared to a
threshold to determine the neuron output, such that if the weighted
sum ≥ threshold, return 1; if the weighted sum < threshold, return 0.
• In the late 1950s, Frank Rosenblatt and others developed a class of
neural networks called 'perceptrons', which were similar to McCulloch
and Pitts' network. The key contribution was the introduction of a
learning rule to solve pattern recognition problems.
• The learning problem is to determine a weight vector that causes the
perceptron to produce the correct output for each of the given training
examples.
• Along the way, basic models were modified to have softer discriminants
for generating outputs (activation functions) and multiple layers,
leading to the so-called multi-layer perceptron, which we would now
refer to as a neural network.
13/48
Base Elements of a Learning Algorithm
As machine learners, you will be in the business of developing/using
`learning algorithms' under an appropriate statistical learning
paradigm. Whatever methodology you employ, the base elements usually
remain the same:
• You have data (and hopefully lots of it) consisting of some
response/target/output which you wish to relate to/predict using an
appropriate set of predictors/inputs.
• The mechanism by which we relate these two components is the
learning algorithm.
• The learning algorithm typically consists of a mathematical model
and an appropriate training mechanism which selects an appropriate
configuration of the mathematical model based on the data.
• The training mechanism may follow implicitly from the mathematical
model (e.g., the PLA), or be carefully selected to deal with the
nuances of the particular model class (this is the case for NNs).

Before we get to the details of the model and training mechanisms as


they pertain to NNs, we will explore these ideas at the hand of an
ancestor of sorts to NNs.

14/48
The Perceptron

The perceptron is a class of linear model that attempts to take a linear
combination of the feature set/inputs and assign the resulting signal to
one of two outcomes. That is, for some weight vector/parameters
$w = [w_1, w_2, \ldots, w_p]^T$, set

$$y_i \equiv \begin{cases} \text{Class A} & \text{if } x_i^T w \text{ exceeds the threshold,} \\ \text{Class B} & \text{if } x_i^T w \text{ does not exceed the threshold,} \end{cases}$$

where of course:

$$x_i^T w = \sum_{k=1}^{p} x_{ik} w_k.$$

• $x_i^T w$ can be thought of as a summative `score' for the inputs.
• We then associate the observation with a particular outcome of $y_i$ if
this score exceeds some threshold.

15/48
The Perceptron

Mathematically, the association is achieved by:

$$y_i = \begin{cases} +1 & \text{if } x_i^T w > b, \\ -1 & \text{if } x_i^T w < b, \end{cases}$$

where $b$ denotes a numerical threshold and $\{-1, +1\}$ encodes the
possible outcomes of $y_i$.
• We may equivalently consider $x_i^T w - b > 0$ etc., in which case we
subsume $b$ in the parameter vector and append an additional feature
(constant 1 for each observation) to the inputs.
• The above expression thus describes almost exactly the behaviour of
the signum function.

16/48
The Perceptron

[Figure: plot of $\mathrm{sign}\left(\sum_{k=1}^{p} w_k x_k - b\right)$ against $\sum_{k=1}^{p} w_k x_k - b$.]

Signum function as applied to the perceptron model.

17/48
The Perceptron

Some properties of the model thus far:


• Clearly, the model is aimed at binary classification tasks.
• It is also clear that the model is linear in nature, since assignment
occurs by way of a linear combination of the predictors: though the
mapping is discontinuous, the behaviour is to divide the feature space
into two regions linearly.
• Assignment is dictated entirely by the parameters $w$ and $b$...

18/48
The Perceptron

Our model thus functions by simply evaluating the sign of the offset
`score'. Assuming that this model is a valid representation of the
underlying process, we can then predict, based on some estimated
parameter vector $\hat{w}$, a new observation with predictors, say $x$:

$$\hat{y} = \mathrm{sign}(x^T \hat{w} - \hat{b}),$$

or just $\hat{y} = \mathrm{sign}(x^T \hat{w})$ if we have appended the constant feature ($x$ and
$w$ will then have one additional element).

19/48
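To make the prediction step concrete, here is a minimal R sketch; the weight values and the new observation below are arbitrary, hypothetical numbers, with the constant feature appended so that the threshold is absorbed into the weight vector.

# Hypothetical estimated weights; the first element plays the role of -b
w_hat = c(-0.5, 2.0, 1.0)

# A new observation, with a leading 1 matching the appended constant feature
x_new = c(1, 0.3, -0.8)

# Predicted class label in {-1, +1}
y_hat = sign(sum(x_new * w_hat))   # equivalently: sign(t(x_new) %*% w_hat)
y_hat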
The Perceptron Learning Algorithm

• Initialize: Make an initial guess for the weight vector $w$, say $w^{(0)}$.
• Iterate:
1. Pick an example at random, say $i^\star$, from the set of labelled
observations which is misclassified by the expression:

$$\hat{y} = \mathrm{sign}(x_{i^\star}^T w^{(\text{old})}).$$

2. Update the weights so as to move $x_{i^\star}^T w$ in a direction which
classifies the observation correctly. This can be achieved by the
updating equation:

$$w^{(\text{new})} = w^{(\text{old})} + y_{i^\star} x_{i^\star}.$$

3. Set $w^{(\text{old})} = w^{(\text{new})}$. If some stopping criterion is met, terminate.
Otherwise, go to step 1.

• Return: $\hat{w} = w^{(\text{old})}$ at the last iteration.

20/48
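A minimal R sketch of the procedure above, assuming x already contains the appended constant column and y holds labels in {-1, +1}; the function name, the max_iter safeguard, and its default value are my own additions rather than part of the algorithm as stated.

# x: n-by-p matrix of inputs (constant column included)
# y: vector of labels in {-1, +1}
# w: starting weight vector
pla_sketch = function(x, y, w, max_iter = 10000)
{
  for (iter in 1:max_iter)
  {
    yhat  = sign(x %*% w)                    # current predictions
    wrong = which(yhat != y)                 # misclassified observations
    if (length(wrong) == 0) break            # perfect in-sample classification
    i = wrong[sample(length(wrong), 1)]      # pick a misclassified example at random
    w = w + y[i] * x[i, ]                    # the PLA update step
  }
  list(weights = w, predictions = sign(x %*% w))
}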
The Perceptron Learning Algorithm

The geometric intuition behind the PLA is that the updating algorithm
aims (after picking a misclassified point) to modify the inner product
between $x$ and $w$ in such a way that it reflects the observed classification:

• If $x^T w < 0$ but $y > 0$, we need to modify $w$ s.t. $x^T w > 0$.
• If $x^T w > 0$ but $y < 0$, we need to modify $w$ s.t. $x^T w < 0$.

• A consequence of modifying $w$ is that the separating plane is
`rotated' at each iteration. This continues until the data are perfectly
classified.

21/48
PLA Logic

[Figure: two panels illustrating the update. Left: 'Observed Before = −1, Predicted Before = 1'; right: 'Observed Before = 1, Predicted Before = −1'. Each panel shows the observation together with w_before and w_after.]

22/48
PLA in action: 1D

23/48
PLA in action: 2D

24/48
Notes on the PLA

• The algorithm is guaranteed to converge to a solution (perfect
classification in-sample) if the data are linearly separable.

• The PLA can be quite slow, as should be evident from increasing
dimensions of the predictor space. The speed issue can be circumvented
by early termination of the algorithm, e.g. if a certain `budget' for
misclassification has been achieved. This needs to be included in your
stopping criterion.

• Although perfect classification can be achieved in-sample for separable
data, solutions are not unique! As it turns out, this is an issue for most
algorithms which seek to partition the data, including logistic regression
(numerical instability)!

25/48
The Perceptron Learning Algorithm

The perceptron and the PLA represent perhaps the simplest marriage
of a model class and learning algorithm that resembles the functioning
of a neural network.
• Indeed, the literature often refers to neural networks as multi-layer
perceptrons (MLPs) where we will refer to standard feed-forward
networks.
• Though linear, the class can be used to model non-linear phenomena
by way of appropriate transformation of the features.
• In the present treatment, we have only dealt with fitting the model to
data which we have seen. Ideally, we would like to use the algorithm to
classify unseen observations; fortunately, we can easily calculate the
out-of-sample performance of the algorithm.

26/48
R: Code the PLA

Use the following skeleton code to complete the exercises for Section 2.

# x - Matrix of predictors (one row per observation, constant column included)
# y - Labels in {-1, +1} for each response
# w - A starting parameter vector
PLA = function(x, y, w)
{
  # A while loop with an appropriate escape condition:
  while (TRUE)
  {
    # ... pick a misclassified point, update w, and break on a stopping criterion
  }
  # Return a list of stuff:
  return(list(weights = w, predictions = sign(x %*% w)))
}

27/48
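As a quick sanity check for your completed function, one possible approach is to simulate a linearly separable problem and verify that the returned weights classify it perfectly; the data-generating choices below are arbitrary, and the last line assumes your PLA returns the list components named in the skeleton.

set.seed(2021)
n = 100
x = cbind(1, matrix(runif(2 * n, -1, 1), ncol = 2))   # appended constant feature
w_true = c(0.2, -1, 2)                                # arbitrary separating weights
y = sign(x %*% w_true)                                # linearly separable labels

res = PLA(x, y, w = rep(0, 3))
mean(res$predictions == y)                            # should be 1 for separable data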
Wrap-Up

Why the hell do we care about the PLA? This is supposed to be the
ANNs section?! Well, there are a number of reasons:
• This is an unfamiliar model class built on linear elements passed
through a sigmoidal activation...
• The learning algorithm is a simple iterative numerical procedure...
Apart from the updating step, algorithms for fitting ANNs will have a
similar structure and path-dependent stopping criteria.
• Perceptrons fit within the model + learning algorithm + learning
paradigm in exactly the same way as ANNs do... In fact, for any given
dichotomous learning problem, a priori, there is no reason to suspect
that a neural network should be able to outperform this simple model.

28/48
Logistic Regression
Based loosely on Dobson and Barnett (2008) and Neter et al. (1996).

29/48
Logistic Regression: Response Distribution

Consider a response variable $Y \in \{0, 1\}$. An appropriate statistical model
for such a random variable would be the Bernoulli distribution:

$$Y \sim \text{Be}(\pi), \quad \pi \in [0, 1],$$

where $\pi$ denotes the probability that $Y$ takes on the value 1, and the
model is characterised i.t.o. the probability mass function:

$$f(y, \pi) = \pi^y (1 - \pi)^{1-y}.$$

• This model is useful in cases where the target variable is binary in
nature, e.g. default on a loan, or a fraudulent transaction.
• Naturally, in the present context, we are interested in the relationship
between the response and a set of predictors, the premise being that
the predictors might tell us something more about the likelihood of the
outcome being 0 or 1, e.g. credit score, annual income, or where a
transaction was processed, etc.

30/48
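As a small sanity check, the pmf above coincides with R's built-in dbinom() with size one; the values of π and y below are arbitrary.

pi_hat = 0.3
y = c(0, 1, 1, 0)
cbind(manual = pi_hat^y * (1 - pi_hat)^(1 - y),
      dbinom = dbinom(y, size = 1, prob = pi_hat))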
Logistic Regression: Predictors

For purposes of modeling the relationship between the response and the
predictors, we again make use of the linear component to capture the
information in the predictors:

$$\eta = \sum_{k=1}^{p} w_k x_k \in \mathbb{R}.$$

• All that remains is to marry this linear component to the
aforementioned model.
• The only clear course of action is for $\eta$ to be subsumed in the
parameter of the response distribution, i.e., $\pi$.
• We may posit that $\pi = \eta = \sum_{k=1}^{p} w_k x_k$?
• This is awkward, however, since unless the $w$ are constrained such that
$\pi \in [0, 1]$, such a strategy would fail.
• Alternatively, we could make use of a translation of $\eta$ that maps values
in $\mathbb{R}$ to $[0, 1]$.

31/48
Logistic Regression: Linear Predictors

In order to conserve the advantageous attributes of the linear component,
we thus define an appropriate link function:

$$g(\pi) = \sum_{k=1}^{p} w_k x_k,$$

through which the linear component thus enters the model by way of the
inverse link:

$$\pi = g^{-1}\left( \sum_{k=1}^{p} w_k x_k \right).$$

• Provided that $g(\cdot)$ is monotone and differentiable, we should be in the
money.
• Furthermore, we conserve the interpretability of the linear component.
• How does one choose $g(\cdot)$?

32/48
Logistic Regression - Sigmoidal Response Functions

Function      g(π)                    g⁻¹(η)
Logit         log(π / (1 − π))        e^η / (1 + e^η)
Probit        Φ⁻¹(π)                  Φ(η)
C-Log-Log     log(−log(1 − π))        1 − exp(−exp(η))

Common sigmoidal response functions for logistic regression.

33/48
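The three inverse links in the table are straightforward to evaluate in base R; a brief sketch (the function names are my own) over a grid of linear-predictor values follows.

eta = seq(-4, 4, by = 0.5)

inv_logit   = function(eta) exp(eta) / (1 + exp(eta))   # equivalently plogis(eta)
inv_probit  = function(eta) pnorm(eta)                  # standard normal CDF
inv_cloglog = function(eta) 1 - exp(-exp(eta))

# All three map the real line into [0, 1]:
cbind(eta, logit = inv_logit(eta), probit = inv_probit(eta),
      cloglog = inv_cloglog(eta))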
Logistic Regression: Sigmoidal Response Functions

[Figure: 'Sigmoidal Response Functions' - the Logistic, Probit, and C-LogLog inverse links plotted over a common range of the linear predictor.]

34/48
Logistic Regression: Graphical Representation

The logistic regression model can be represented graphically using
nodes and edges connecting features to the response node (Neter et al.,
1996):

[Diagram: input nodes X1, ..., Xp, each connected by a directed edge to the response node Y.]

35/48
Fitting Logistic Regression: Maximum Likelihood

An appropriate strategy for fitting a logistic regression model to binary
data would be to use maximum likelihood. That is, given a set of
observations $y = \{y_1, \ldots, y_n\}$ and predictors $\{x_i : i = 1, 2, \ldots, n\}$, we
maximize the joint probability of the observations w.r.t. the parameters:

$$\hat{w} = \operatorname{argmax}(L(y, w)), \quad \text{where} \quad L(y, w) = \prod_{i=1}^{n} f(y_i, g^{-1}(x_i^T w)),$$

or equivalently:

$$\hat{w} = \operatorname{argmax}(\log(L(y, w))),$$

with log-likelihood

$$\log(L(y, w)) = \sum_{i=1}^{n} \left( y_i \log(g^{-1}(x_i^T w)) + (1 - y_i) \log(1 - g^{-1}(x_i^T w)) \right).$$

36/48
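A minimal sketch, assuming the logit link and simulated data, of maximising this log-likelihood numerically: minimising the negative log-likelihood with optim() is equivalent, and the data-generating values are arbitrary. The result can be checked against glm() with a binomial family.

set.seed(1)
n = 200
X = cbind(1, rnorm(n))                     # constant feature plus one predictor
w_true = c(-0.5, 1.5)
y = rbinom(n, 1, plogis(X %*% w_true))     # Bernoulli responses under the logit link

negloglik = function(w)
{
  p = plogis(X %*% w)                      # g^{-1}(x_i' w) for the logit link
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

fit = optim(c(0, 0), negloglik)            # minimise the negative log-likelihood
fit$par                                    # compare with coef(glm(y ~ X[, 2], family = binomial))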
Fitting Logistic Regression: Maximum Likelihood

Over and above the obvious awkwardness that a sum-of-squares objective
function would incur with binary responses, maximum likelihood has a
number of theoretical advantages.
• One may derive asymptotic distributions for the parameters and hence
perform direct inference on the relationship between the predictors and
responses at the hand of these statistics.
• For our purposes, the main advantage pertains to the objective
function being explicitly formulated in terms of the probability content
of the model.
• As a consequence, the objective function is aligned directly with the
predictive applications of such models, i.e., classification tasks.

37/48
Logistic Regression in Action

Some examples of logistic regression problems:

• Credit Card Default Data - ISLR package, covered in James et al. (2013).
• Beetle Mortality - Beetle Mortality.txt, from Dobson and Barnett (2008).
• Titanic Survival Data - Titanic Set A.txt, from a Kaggle challenge.

38/48
Polytomous Regression/Nominal Logistic Regression

The likelihood for such a model can be constructed as:

$$L(y, w) = \prod_{i=1}^{n} f(y_i, \pi),$$

where

$$f(y_i, \pi) = \prod_{j=1}^{q} \pi_j^{y_{ij}},$$

with $\pi = \{\pi_1, \pi_2, \ldots, \pi_q\}$ subject to the constraint $\sum_{j=1}^{q} \pi_j = 1$.

39/48
Nominal Logistic Regression

Definition
Nominal logistic regression applies the ideas behind logistic regression
to multinomial responses.

• We impose the link structure on the probability of each category:

$$\log\left(\frac{\pi_j}{\pi_b}\right) = \eta_j = x_i^T w_j, \quad j \neq b.$$

• Since we have more than one category, the symmetry of `$1 - \pi$' is broken,
but we analogously measure odds w.r.t. a base category $\pi_b$.
• $b$ is usually chosen as the first category, i.e., $\pi_1$.

40/48
Nominal Logistic Regression

Definition
Nominal logistic regression applies the ideas behind logistic regression to multinomial
responses.

• A neat mathematical result that follows from this specification is:

$$\pi_j = \pi_1 \exp(x_i^T w_j), \quad j = 2, 3, \ldots, q.$$

• Since $\sum_{j=1}^{q} \pi_j = 1$, we have:

$$\pi_1 = \frac{1}{1 + \sum_{j=2}^{q} \exp(x_i^T w_j)},$$

and

$$\pi_j = \frac{\exp(x_i^T w_j)}{1 + \sum_{j=2}^{q} \exp(x_i^T w_j)}, \quad j = 2, 3, \ldots, q.$$

41/48
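A small R sketch (the helper name is my own) of how the linear predictors map to the category probabilities, with category 1 as the base so that its linear predictor is implicitly zero:

# eta: vector of linear predictors x_i' w_j for categories j = 2, ..., q
category_probs = function(eta)
{
  denom = 1 + sum(exp(eta))
  c(1, exp(eta)) / denom        # (pi_1, pi_2, ..., pi_q); sums to one
}

category_probs(c(0.5, -1.2, 2))  # example with q = 4 categories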
Viewing the log-likelihood as the appropriate objective function for such a
multi-class classification problem, we derive:

$$\log(L(y, w_1, w_2, \ldots)) = \sum_{i=1}^{n} \sum_{j=1}^{q} y_{ij} \log(\pi_j(w_j)),$$

where the aforementioned expressions for $\pi_1$ and $\pi_j : j = 2, 3, \ldots, q$ are
used to map the linear components into the model distribution (hence
the notation $\pi_j(w_j)$).
• As before, we treat $\log(L(y, w_1, w_2, \ldots))$ as an objective to be
maximised over the $w$'s.
• Note that we could collect all of the parameter vectors in the rows of a
(weight) matrix, say $W$.

42/48
Nominal Logistic Regression
As before, the end-game with nominal logistic regression is to establish
the relative odds of an observation being distributed to a category given
a set of explanatory variables. Interpretation is almost invariably i.t.o. the
effect of a given explanatory variable on the odds of observing a category
of response.
• Consider comparing two categories with one continuous predictor $x$,
then:

$$\log\left(\frac{\pi_2}{\pi_1}\right) = \eta_1 = w_{11} + w_{21} x, \qquad \log\left(\frac{\pi_3}{\pi_1}\right) = \eta_2 = w_{12} + w_{22} x,$$

which leads to the odds-ratio:

$$\log(\text{OR}) = \log\left(\frac{\pi_3}{\pi_1}\right) - \log\left(\frac{\pi_2}{\pi_1}\right) = \eta_2 - \eta_1 = (w_{12} - w_{11}) + (w_{22} - w_{21}) x.$$

• Alternatively, we could simply be interested in the most likely category
for a given set of predictors.

43/48
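Continuing this two-comparison setup with hypothetical coefficient values, a quick R illustration of how the odds-ratio between categories 3 and 2 follows from the difference of the two linear predictors:

# Hypothetical coefficients for the two comparisons above
w11 = 0.4; w21 = -0.8    # log(pi_2 / pi_1) = w11 + w21 * x
w12 = 1.1; w22 = 0.3     # log(pi_3 / pi_1) = w12 + w22 * x

x = 2
log_OR = (w12 + w22 * x) - (w11 + w21 * x)   # log(pi_3 / pi_2) at x = 2
exp(log_OR)                                  # odds of category 3 relative to category 2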
Some examples of nominal logistic regression problems:
• iris data - standard R dataset.
• Car Preference Data.txt - Dobson and Barnett (2008).

44/48
Logistic Regression: Where to from here?

• Nominal regression generalises logistic regression to handle
responses with multiple categories.
• Since the model is formulated in terms of a distribution, for a given set
of predictors the model returns predictions giving the estimated
probability that the response coincides with each category.
• Despite the required transformation on the linear component, the
model remains inherently linear, segmenting the feature space with
linear decision bounds.
• Though linearity is clearly advantageous for interpretive purposes, it
does not help us with highly non-linear problems, for example,
handwritten digits.

45/48
46/48
Logistic Regression: Graphical Representation

The nominal logistic regression model can be represented graphically
using nodes and edges connecting features to the response nodes (Neter
et al., 1996). Unsurprisingly, the resulting figure starts to resemble a simple
neural network...

[Diagram: input nodes X1, ..., Xp, each connected by directed edges to the response nodes Y1, Y2, ....]

47/48
References I

Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning From
Data. AMLBook, New York, NY, USA, 2012.
Annette J. Dobson and Adrian Barnett. An Introduction to Generalized Linear Models.
CRC Press, 2008.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical
Learning, volume 1. Springer Series in Statistics, New York, NY, USA, 2001.
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction
to Statistical Learning, volume 112. Springer, 2013.
John Neter, Michael H. Kutner, Christopher J. Nachtsheim, and William Wasserman.
Applied Linear Statistical Models, volume 4. Irwin, Chicago, 1996.
Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.

48/48
