Intelligent Systems


1) Intelligent Systems

Prof Hong Wang & Dr Martin Brown


Room: E1{q,k}
E: {hong.wang,martin.brown}@manchester.ac.uk
T: 0161 306 {4655,4672}
W: http://blackboard.manchester.ac.uk/

1) Intelligent Systems Outline


(1.1) Introduction to soft computing/machine learning: L1
(1.2) Learning systems: L2-6
• Simple artificial neural models
• (Fuzzy logic systems L10)
• Parameter estimation
• Applications: L7-9
(1.3) Genetic optimization: L10
• System design & optimization
• Evolutionary optimization framework
• Genetic algorithms and genetic search



Lectures 1-3:
Lecture 1
(1.1) Introduction to soft computing/machine learning
In this lecture we'll introduce basic regression and classification
learning problems and pose them as bio-inspired "machine learning"
problems.

Lectures 2 & 3
(1.2) Learning systems
• Linear artificial neural models
• Parameter estimation
In these 2 lectures we’ll introduce both the basic Perceptron
classifier and also the LMS algorithm. Both of these are
instantaneous/on-line learning algorithms.


Lectures 1-3: Resources


This set of slides is largely self-contained; however, it is based on the
following texts, which provide useful background/supplementary
information:

Chapters 1-3, Machine Learning (MIT OpenCourseWare), T Jaakkola,

http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/

Machine Learning, T Mitchell, McGraw Hill, 1997


An introduction to Support Vector Machines and other kernel-based
learning methods, N Cristianini, J Shawe-Taylor, Cambridge
University Press, 2000
Machine Learning, Neural and Statistical Classification, D Michie, DJ
Spiegelhalter and CC Taylor, 1994 (out of print, but available from:
http://www.amsta.leeds.ac.uk/~charles/statlog/)



Lecture 1: Outline
1. Introduction to machine learning and intelligent
systems
– Intelligence and learning
– Examples of machine learning
– Definition of machine learning

2. Application area
– Face detection (classification)


Examples of Machine Learning


The field of machine learning is concerned with the question of how to
construct computer programs that automatically improve with experience:
– Systems for credit card transaction approval. The system is
designed by showing it examples of good and bad (fraudulent)
transactions, and letting it learn the difference
– Learning to play Backgammon. The world's best computer
backgammon players are based on machine learning algorithms
– Systems for spam filtering. Decisions about whether or not an
email is regarded as junk can be built and refined during actual
usage
– Note that recursive model estimation can also be regarded as a
type of machine learning

This is a broad definition, covering everything from statistical regression
and system identification to genetic algorithms.
Definition of Machine Learning
A computer program M is said to learn from experience E with respect
to some tasks T and performance measure P, if M’s performance at
tasks in T, as measured by P, improves with experience E.
– The task T must be adequately described in terms of
measurable signals (inputs and outputs) and success criteria
– The computer program M could be a physical model, where the
parameters require tuning, a statistical distribution or a set of
rules
– The experience E may occur during design (off-line) or during
actual operation (on-line)
– The actual, on-line performance P may be difficult to estimate, if
learning takes place off-line
– The experience set E needs to be rich enough so that M is
sufficiently exercised, should be balanced to reflect the actual
usage and should contain enough examples to estimate M
sufficiently accurately.


Supervised Predict/Learn Cycle


Task T specify
measurable inputs ^
θ = f(y, y)
Learning: ∆θ
and outputs

Experience E is a
Model M which is Performance P
data set
parameterized by compares M’s
D = {X,y} ^
the unknown predictions y
of measurements
vector θ against y
and desired targets

Prediction: y^ = m(x,θ
θ)

This is the most commonly applied view of machine learning: supervised
learning.
It's also worthwhile remembering the statistician's view that:
"All models are wrong, just some are more useful than others …"

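
One way to make this cycle concrete is the sketch below, which uses a linear
model ŷ = m(x, θ) = x^T θ and the error-driven update Δθ = η(y − ŷ)x that
appears later in these notes; the data-generating rule and the learning rate
are made up for illustration.

import numpy as np

# Hypothetical illustration of the supervised predict/learn cycle.
# Experience E: a data set D = {X, y} of measurements and desired targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))               # measurements (inputs)
y = X @ np.array([1.5, -0.7]) + 0.3         # desired targets (made-up linear rule)
X = np.hstack([X, np.ones((100, 1))])       # append x0 = 1 so the bias is part of theta

theta = np.zeros(3)                         # Model M, parameterized by the unknown vector theta
eta = 0.05                                  # learning rate (illustrative value)

for x_k, y_k in zip(X, y):                  # cycle through the experience set
    y_hat = x_k @ theta                     # Prediction: y_hat = m(x, theta)
    theta += eta * (y_k - y_hat) * x_k      # Learning:   delta_theta = f(y, y_hat)

# Performance P: compare M's predictions against the desired targets
print("mean squared error:", np.mean((X @ theta - y) ** 2))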


History of Machine Learning
Rosenblatt 1956 – Perceptron
Minsky and Papert 1969 – Critique of basic Perceptron theory
Holland 1975 – Genetic algorithms
Barto & Sutton 1985 – Proposed basic reinforcement learning
algorithms
Rumelhart 1986 – Development of back-propagation: a gradient
descent procedure for multi-layer perceptrons
MacKay 1992 – Bayesian MLPs (energy prediction winner)
Tesauro 1992 – Backgammon application, "world class"
Heckerman 1994 – Bayesian network (medical, win95 applications)
Vapnik 1995 – Support vector machines
Neal 1996 – Gaussian processes
Jordan 1997 – Variational learning and Bayesian nets


Relationship to System Identification


This is obviously closely related to system identification, so it's
worthwhile considering what the differences are.

System identification is largely concerned with linear, dynamical
prediction, based on real-valued, time-delayed variables. The models are
described by their structures and their unknown parameter vectors, which
are tuned to "best fit" the data.

Intelligent systems is largely concerned with non-linear classification and
prediction problems, where the variables may be real-valued or categorical.
Again, the models are described by their structure and their unknown
parameter vectors, which are tuned to "best fit" the data.

In practice, there is a reasonable amount of overlap between the two areas.


Application 1: Face Detection
Task: Given an arbitrary image which could be a digitized video
signal or a scanned photograph, classify image regions
according to whether they contain human faces.
Data: Representative sample of images containing examples of
faces and non-faces. Input features are derived from the raw
pixel values; binary class labels are assigned by experts.
Rationale for machine learning: most pixel-based detection
problems are hard, due to significant pattern variations that
are hard to parameterise analytically.
Reference: Support Vector Machines: Training and Applications,
Osuna, Freund and Girosi, MIT AI-1602 Technical Report,
1997.


1) Faces: Feature Extraction


The aim is, given a 19×19 pixel window (~283 features), to determine whether
the normalized image window contains a face (+1) or not (−1).

1. Rescale the image several times (scale invariance)
2. Cut out 19×19 window patterns (location detection)
3. Preprocess: light correction and histogram equalization (brightness invariance)
4. Classify each window using an SVM
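
A minimal sketch of this pipeline is given below; the function and parameter
names are illustrative, the normalization is a crude stand-in for the light
correction and histogram equalization described above, and classify stands
for any trained window classifier returning +1 (face) or −1 (non-face).

import numpy as np

def detect_faces(image, classify, window=19, stride=2):
    """Hypothetical sliding-window detector sketch.

    image    : 2-D numpy array of grey-level pixel values
    classify : function mapping a normalized 19x19 window to +1 (face) or -1 (non-face)
    """
    detections = []
    img = image.astype(float)
    scale = 1.0
    while min(img.shape) >= window:                      # 1. rescale the image several times (scale invariance)
        for r in range(0, img.shape[0] - window + 1, stride):
            for c in range(0, img.shape[1] - window + 1, stride):
                patch = img[r:r + window, c:c + window]  # 2. cut out window patterns (location detection)
                patch = (patch - patch.mean()) / (patch.std() + 1e-8)  # 3. crude brightness normalization
                if classify(patch) == 1:                 # 4. classify the window
                    detections.append((scale, r, c))
        img = img[::2, ::2]                              # crude downsample; a real detector rescales smoothly
        scale *= 2.0
    return detections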


1) Faces: Classification Problem

Determine whether the 19×19 window contains a face.

Support vector machines only use the examples closest to the decision
boundary (the support vectors); the rest are ignored.

The training set contained ~2,000 examples.

(Figure: training examples plotted against two features x_1 and x_2, with the
decision boundary separating the two classes.)

1) Faces: Results and Comments


A support vector machine classifier with polynomial kernels of degree 2
was used. The number of parameters in the polynomial is ~40,000!
Tested on two data sets:
A: 313 images, 313 faces (one per image – mug shots), 4.5M windows
B: 23 images, 155 faces, 5.5M windows

               Test Set A                        Test Set B
        Detection rate  False detections  Detection rate  False detections
Ideal   100%            0                 100%            0
SVM     97.1%           4                 74.2%           20
NN      94.6%           2                 74.2%           11


1) Faces: Example Classifications


Lecture 1: Summary
Many design tasks require a machine learning approach:
• Classification, clustering, regression, reinforcement

Machine learning will always be a second-best solution

In this section, we're considering simple regression and
classification learning, although the concepts generalize.
(Figure: the supervised predict/learn cycle again – Model M with unknown
parameters θ, Experience E as the data set D = {X, y}, Prediction ŷ = m(x, θ),
Learning Δθ = f(y, ŷ), and Performance P comparing ŷ against y.)
Lectures 2 & 3:
Linear Machine Learning Algorithms

Lectures 2 & 3
(1.2) Learning systems
• Linear artificial neural models
• Parameter estimation
In these 2 lectures we’ll introduce both the basic Perceptron
classifier and also the LMS algorithm. Both of these are
instantaneous/on-line learning algorithms.

Lectures 2 & 3: Outline

Linear classification using the Perceptron
• Classification problem
• Linear classifier and decision boundary
• Perceptron learning rule
• Proof of convergence

Recursive linear regression using LMS
• Modelling and recursive parameter estimation
• Linear models and quadratic performance function
• LMS and NLMS learning rules
• Proof of convergence


Lectures 2 & 3: Learning Objectives
1. Understand what classification and regression machine
learning techniques are and their differences

2. Describe how linear models can be used for both
classification and regression problems

3. Prove convergence of the learning algorithms for
linear relationships, subject to restrictive conditions

4. Understand the restrictions of these basic proofs

Develop the basic framework that will be expanded on in
subsequent lectures

What is Classification?
Classification is also known as (statistical) pattern recognition

The aim is to build a machine/algorithm that can assign appropriate
qualitative labels to new, previously unseen quantitative data, using a
priori knowledge and/or information contained in a training set. The
patterns to be classified are usually groups of measurements/observations
that are believed to be informative for the classification task.

(Figure: the classifier design cycle – the training data D = {X, y} and prior
knowledge are used to design/learn the classifier m(θ̂, x); a new pattern x is
then presented and the classifier predicts the class label ŷ.)

Example: Face recognition


Classification Training Data
To supply training data for a classifier, examples must be collected that
contain both positive (examples of the class) and negative (examples
of other classes) instances. These are qualitative target class values
and are stored as +1 and −1 for the positive and negative instances
respectively. Labels are generated by an expert or by observation.

The quantitative input features should be informative

The training set should contain enough examples to be able to build


statistically significant decisions

How to encode
qualitative target
and input features?


Structure of a Linear Classifier


Given a set of quantitative features x, a linear classifier has the form:

  y = sgn(x^T θ + θ_0)

The sgn() function is used to produce the qualitative class label (+/-1).
The class/decision boundary is determined when:

  0 = x^T θ + θ_0

This is an (n-1)-dimensional hyperplane in feature space.
In a 2-dimensional feature space:

  0 = x_1 θ_1 + x_2 θ_2 + θ_0,   i.e.   x_2 = -(θ_1/θ_2) x_1 - (θ_0/θ_2)

(Figure: a 2-D feature space (x_1, x_2) with positive (+) examples lying on
one side of the linear decision boundary.)

How does the sign and magnitude of θ affect the decision boundary?
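
A minimal sketch of this structure, with illustrative parameter values only:

import numpy as np

def linear_classify(x, theta, theta0):
    """Linear classifier y = sgn(x^T theta + theta0), returning +1 or -1."""
    return 1 if x @ theta + theta0 >= 0 else -1

# Illustrative 2-D example (made-up parameter values)
theta = np.array([1.0, -2.0])
theta0 = 0.5

print(linear_classify(np.array([2.0, 0.0]), theta, theta0))   # +1 side of the boundary
print(linear_classify(np.array([0.0, 2.0]), theta, theta0))   # -1 side of the boundary

# The 2-D decision boundary: x2 = -(theta[0]/theta[1]) * x1 - theta0/theta[1]
x1 = np.linspace(-3.0, 3.0, 5)
x2 = -(theta[0] / theta[1]) * x1 - theta0 / theta[1]
print(np.c_[x1, x2])                                          # points lying on the boundary

Note, for instance, that scaling θ and θ_0 by the same positive constant
leaves the boundary (and hence the classifications) unchanged, while flipping
their signs swaps the two classes.
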
Simple Example: Fisher’s Iris Data
A famous example of building classifiers for a problem with 3 types of
Iris flowers and 4 measurements about each flower:
• Sepal length and width
• Petal length and width

150 examples were collected, 50 from each class.

Build 3 separate classifiers, one for recognizing examples of each class.

The data is shown plotted against the last two features, together with two
linear classifiers for the Setosa and Virginica classes.
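
A short sketch of this experiment, assuming scikit-learn is available (its
built-in copy of the Iris data and its Perceptron implementation stand in for
the classifiers built here):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, 2:]                             # the last two features: petal length and width

# Build 3 separate classifiers, one for recognizing examples of each class
for c, name in enumerate(iris.target_names):
    y = np.where(iris.target == c, 1, -1)        # one class vs. the rest
    clf = Perceptron().fit(X, y)
    print(name, "training accuracy:", clf.score(X, y))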

Perceptron Linear Classifier


The Perceptron linear classifier was devised by Rosenblatt in 1956.
It comprises a linear classifier (as just discussed) and a simple parameter
update rule of the form:
Cyclically present each training pattern {x_k, y_k} to the linear classifier.
When an error (misclassification) is made, update the parameters:

  θ̂_{k+1} = θ̂_k + η y_k x_k
  θ̂_{0,k+1} = θ̂_{0,k} + η y_k

where η > 0 is the learning rate.
The bias term can be included as θ_0 with an extra feature x_0 = 1:

  θ̂_{k+1} = θ̂_k + η y_k x_k

Continue until there are no prediction errors.

Perceptron convergence theorem: if the data set is linearly separable,
the Perceptron learning algorithm will converge to an optimal separator
in a finite time.
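
A minimal sketch of this learning rule (the function name and defaults are
illustrative; the feature matrix is assumed to already contain the x_0 = 1
column):

import numpy as np

def perceptron_train(X, y, eta=0.1, max_epochs=100):
    """Cyclic Perceptron training: update the parameters only on misclassifications.

    X : (l, n) array of feature vectors, including a constant x0 = 1 column for the bias
    y : (l,)   array of class labels in {-1, +1}
    """
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_k, y_k in zip(X, y):
            if np.sign(x_k @ theta) != y_k:      # misclassification
                theta += eta * y_k * x_k         # theta_{k+1} = theta_k + eta * y_k * x_k
                errors += 1
        if errors == 0:                          # continue until there are no prediction errors
            break
    return theta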


Instantaneous Parameter Update
What happens when the parameters are updated?

Error-driven update:

  θ̂_{k+1} = θ̂_k + η y_k x_k

The parameters are updated to make them more like the (misclassified)
feature vector. After updating:

  x_k^T θ̂_{k+1} = x_k^T θ̂_k + η y_k x_k^T x_k
               = x_k^T θ̂_k + η y_k ||x_k||^2

so the updated parameters are closer to making the correct decision on x_k.

(Figure: in parameter space (θ_1, θ_2), the update η y_k x_k moves θ̂_k to
θ̂_{k+1}; the corresponding output x_k^T θ̂ moves towards the correct side of
the sgn() threshold.)
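
A quick numerical check of this identity, using made-up values:

import numpy as np

eta, y_k = 0.1, 1.0
x_k = np.array([2.0, -1.0, 1.0])          # includes x0 = 1 for the bias
theta_k = np.array([-0.5, 0.3, 0.1])      # current parameters, which misclassify x_k

theta_next = theta_k + eta * y_k * x_k    # error-driven update

lhs = (x_k @ theta_next) - (x_k @ theta_k)
rhs = eta * y_k * np.dot(x_k, x_k)        # eta * y_k * ||x_k||^2
print(lhs, rhs, np.isclose(lhs, rhs))     # both 0.6 (up to rounding): True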

Perceptron Convergence Proof Preamble …


The basic aim is to minimise the number of misclassifications:
• This is generally an NP-complete problem
• We've assumed that there is an optimal solution with 0 errors
• This is similar to least squares recursive estimation:

  Performance = Σ_i (y_i − ŷ_i)^2 = 4 × (number of errors)

  except that the sgn() makes it a non-quadratic optimization problem.

Updating θ̂_{k+1} = θ̂_k + η y_k x_k only when there are errors is the same as
applying

  θ̂_{k+1} = θ̂_k + (η/2)(y_k − ŷ_k) x_k

with or without errors, since y_k − ŷ_k is 0 when the prediction is correct
and ±2 when it is wrong (the factor of 1/2 can be absorbed into η).

Sometimes drawn as a network ("error-driven parameter estimation"):
repeatedly cycle through the data set D, drawing out each sample {x_k, y_k},
compute ŷ_k = sgn(x_k^T θ̂_k) and apply Δθ_k = η (y_k − ŷ_k) x_k.
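
A tiny check that the error-driven form with the η/2 factor reproduces the
errors-only update for ±1 labels (illustrative numbers):

import numpy as np

eta = 0.2
x_k = np.array([1.0, 0.5, -1.5])
theta = np.array([0.3, -0.2, 0.1])

for y_k in (1.0, -1.0):
    y_hat = np.sign(x_k @ theta)                     # +1 here, so y_k = -1 is an error
    on_error_only = eta * y_k * x_k if y_hat != y_k else np.zeros_like(x_k)
    with_or_without = 0.5 * eta * (y_k - y_hat) * x_k
    print(y_k, np.allclose(on_error_only, with_or_without))   # True in both cases
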
Convergence Analysis of the Perceptron (i)
If a linearly separable data set D is repeatedly presented to a Perceptron,
then the learning procedure is guaranteed to converge (no errors) in a
finite time.

If the data set is linearly separable, there exist optimal parameters θ*
such that sgn(x_i^T θ*) = y_i for all i = 1, …, l.
Note that θ(α) = α θ*, α > 0, are also optimal parameter vectors.

Consider the positive quantity γ, defined (taking ||θ*|| = 1) by:

  γ = min_i y_i (x_i^T θ*)

This is the concept known as the "classification margin".
Assume also that the feature vectors are bounded by:

  R^2 = max_i ||x_i||^2

Convergence Analysis of the Perceptron (ii)


To show convergence, we need to establish that at the kth iteration, when
an error has occurred (writing θ(α) = α θ* for the scaled optimal parameters):

  ||θ(α) − θ̂_{k+1}||^2 < ||θ(α) − θ̂_k||^2

Using the update formula:

  ||θ(α) − θ̂_{k+1}||^2 = ||θ(α) − θ̂_k − η y_k x_k||^2
                      = ||θ(α) − θ̂_k||^2 + η^2 ||x_k||^2 − 2η y_k x_k^T (θ(α) − θ̂_k)
                      < ||θ(α) − θ̂_k||^2 + η^2 ||x_k||^2 − 2η y_k x_k^T θ(α)
                      ≤ ||θ(α) − θ̂_k||^2 + η^2 R^2 − 2ηαγ

where the strict inequality uses the fact that an error means y_k x_k^T θ̂_k < 0.
To finish the proof, select α > ηR^2/(2γ).

(Figure: in parameter space (θ_1, θ_2), each error-driven update moves θ̂_k to
θ̂_{k+1}, closer to the (scaled) optimal parameter vector θ*.)
Convergence Analysis of the Perceptron (iii)
To show this terminates in a finite number of iterations, simply note that:

  ε(α) = η^2 R^2 − 2ηαγ < 0

is independent of the current training sample, so the (squared) parameter
error must decrease by at least |ε(α)| at each update iteration. As the
initial error is finite (θ̂_0 = 0, say), there can only be a finite number of
update steps before the parameter error would be driven to zero.

Note also that α is proportional to the size of the feature vectors (R^2) and
inversely proportional to the size of the margin (γ). Both of these will
influence the number of update iterations the Perceptron requires while
learning.

Classification Margin
In this proof, we assumed that there exists a single, optimal parameter
vector. In practice, when the data is linearly separable, there are an
infinite number – simply requiring correct classification results in an
ill-posed problem.

The classification margin can be defined as the minimum distance of the
decision boundary to a point in that class:
– Used in deriving Support Vector Machines

(Figures: a linearly separable data set in (x_1, x_2), first with several
possible separating boundaries, then with the level sets x^T θ = +1, 0, −1
illustrating the classification margin.)
Example: Perceptron & Logical AND (i)
Consider modelling the logical AND data using a Perceptron:

  X = [0 0; 0 1; 1 0; 1 1],   y = [−1, −1, −1, +1]^T

Is the data linearly separable?

Parameter snapshots (with the corresponding decision boundaries plotted in
the (x_1, x_2) plane):
  k = 0:  θ̂ = [0.01, 0.1, 0.006]
  k = 5:  θ̂ = [−0.98, 1.11, 1.01]
  k = 18: θ̂ = [−2.98, 2.11, 1.01]
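
A self-contained sketch of this experiment; the learning rate is not stated
on the slide, so the value below is illustrative, and the parameter ordering
is assumed to be [θ_0, θ_1, θ_2] with the bias first:

import numpy as np

# Logical AND data, with x0 = 1 prepended for the bias term
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1], dtype=float)

eta = 0.5                                  # illustrative learning rate
theta = np.array([0.01, 0.1, 0.006])       # small initial parameters, as on the slide

for epoch in range(100):                   # cyclically present the training patterns
    errors = 0
    for x_k, y_k in zip(X, y):
        if np.sign(x_k @ theta) != y_k:    # misclassification
            theta += eta * y_k * x_k
            errors += 1
    if errors == 0:                        # no errors left: the AND data is linearly separable
        break

print(theta, "all correct:", bool(np.all(np.sign(X @ theta) == y)))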

Example: Perceptron (ii)

(Figure: the parameter estimates θ̂_0,k (bias), θ̂_1,k and θ̂_2,k plotted
against the data presentation index k.)


Regression LMS Learning
Regression is a (statistical) methodology that utilizes the
relation between two or more quantitative variables so
that one variable can be predicted from the other, or
others.
Examples:
• Sales of a product can be predicted by using the
relationship between sales volume and amount of
advertising
• The performance of an employee can be predicted by
using the relationship between performance and
aptitude tests
• The size of a child’s vocabulary can be predicted by
using the relationship between the vocabulary size, the
child’s age and the parents’ educational input.

Structure of a Linear Regression Model


Given a set of features x, a linear predictor has the form:

  y = x^T θ + b

The output is a real-valued, quantitative variable.

(Figure: the predictions ŷ plotted together with the targets y.)

The bias term can be included as an extra feature x_0 = 1. This renames the
bias parameter as θ_0.
Most linear control system models do not explicitly include a bias term –
why is this?


Least Mean Squares Learning
Least Mean Squares (LMS) was proposed by Widrow in 1962.
This is a (non-optimal) sequential parameter estimation procedure for a
linear model:

  θ̂_{k+1} = θ̂_k + η (y_k − ŷ_k) x_k

• compared to classification, both y_k and ŷ_k are quantitative variables,
  so the error/noise signal (y_k − ŷ_k) is generally non-zero.
• similar to the Perceptron, but with no threshold on x^T θ.
• η > 0 is again the positive learning rate.

Widely used in filtering/signal processing and adaptive control
applications – a "cheap" version of sequential/recursive parameter
estimation.
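
A minimal sketch of the LMS rule (the function name and defaults are
illustrative; X is assumed to include an x_0 = 1 column if a bias term is
wanted):

import numpy as np

def lms(X, y, eta=0.1, cycles=10):
    """Sequential LMS parameter estimation for a linear model y_hat = x^T theta."""
    theta = np.zeros(X.shape[1])
    for _ in range(cycles):                      # repeatedly cycle through the data set
        for x_k, y_k in zip(X, y):
            y_hat = x_k @ theta                  # prediction (no threshold, unlike the Perceptron)
            theta += eta * (y_k - y_hat) * x_k   # theta_{k+1} = theta_k + eta * (y_k - y_hat) * x_k
    return theta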

Proof of LMS Convergence (i)


If a noise-free data set containing a linear relationship x->y is
repeatedly presented to a linear model, then the LMS
algorithm is guaranteed to update the parameters so that
they converge to their optimal values, assuming the
learning rate is sufficiently small.

Note:
1. Assume there is no measurement noise in the target data
2. Assume the data is generated from a linear relationship
3. Parameter estimation will take an infinite time to converge
to the optimal values
4. Rate of convergence and stability depend on the learning
rate



Proof of Convergence (ii)
To show convergence, we need to establish that at the kth iteration, when
an error has occurred:

  ||θ* − θ̂_{k+1}||^2 < ||θ* − θ̂_k||^2

Using the update formula (and noting that, for noise-free linear data,
y_k − ŷ_k = x_k^T (θ* − θ̂_k)):

  ||θ* − θ̂_{k+1}||^2 = ||θ* − θ̂_k − η (y_k − ŷ_k) x_k||^2
                     = ||θ* − θ̂_k||^2 + η^2 (y_k − ŷ_k)^2 ||x_k||^2 − 2η (y_k − ŷ_k) x_k^T (θ* − θ̂_k)
                     = ||θ* − θ̂_k||^2 + η^2 (y_k − ŷ_k)^2 ||x_k||^2 − 2η (y_k − ŷ_k)^2
                     < ||θ* − θ̂_k||^2

  when  0 < η < min_k { 2 / ||x_k||^2 }

(Figure: in parameter space (θ_1, θ_2), each update moves θ̂_k to θ̂_{k+1},
closer to the optimal parameter vector θ*.)

Example: LMS Learning


Consider the "target" linear model y = 1 − 2x, where the inputs are drawn
from a normal distribution with zero mean and unit variance.
The data set consisted of 25 data points, and training involved 10 cycles
through the data set with η = 0.1.

(Figure: the model predictions ŷ plotted against x at k = 0, 5 and 100, and
the parameter estimates θ̂_0 and θ̂_1 plotted against the update index k.)
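
A sketch reproducing this setup (the random seed, and hence the exact
trajectories, are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=25)          # 25 inputs drawn from N(0, 1)
y = 1.0 - 2.0 * x                          # noise-free "target" linear model y = 1 - 2x

X = np.c_[np.ones_like(x), x]              # [x0 = 1, x], so theta = [theta_0, theta_1]
theta = np.zeros(2)
eta = 0.1

for _ in range(10):                        # 10 cycles through the data set
    for x_k, y_k in zip(X, y):
        theta += eta * (y_k - x_k @ theta) * x_k

print(theta)                               # converges towards the optimal values [1, -2]
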
Lectures 2 & 3: Summary
These lectures have looked at basic linear classification (Perceptron)
and regression (LMS) techniques:
– Investigated the basic linear model structure
– Proposed simple, "on-line" error-based learning rules

  θ̂_{k+1} = θ̂_k + η (y_k − ŷ_k) x_k

– Proved convergence for simple environments

While these algorithms are rarely used in this form, their structure has
strongly influenced the development of more advanced techniques:
– Support vector machines
– Multi-layer perceptrons
which will be studied in the coming weeks.
