
Data-driven modelling in water-related problems. PART 1

Dimitri P. Solomatine

UNESCO-IHE Institute for Water Education
Hydroinformatics Chair
Outline of the course

 Notion of data-driven modelling (DDM)
 Sources of technologies for DDM: machine learning, data mining, soft computing
 Introduction to methods: decision, regression and model trees, Bayesian approaches, neural networks, fuzzy systems, chaotic systems
 Demonstration of applications
– Rainfall-runoff modelling
– Reservoir optimization
– Real-time control of water levels in polder areas
– References to other applications (surge water level prediction in the North Sea, interpretation of aerial photos)

D.P. Solomatine. Data-driven modelling (part 1). 2


Modelling

 A model is …
– a simplified description of reality
– an encapsulation of knowledge about a particular physical or social process in electronic form
 Goals of modelling are:
– understand the studied system or domain (understand the past)
– predict the future
   predict the future values of some of the system variables, based on the knowledge about other variables
– use the results of modelling for making decisions (change the future)


Data-driven models vs knowledge-driven (physically-based) models (1)

 "Physically-based", or "knowledge-based", models are based on an understanding of the underlying processes in the system
– example: river models based on the main principles of water motion, expressed in differential equations and solved using finite-difference approximations
 A "data-driven" model is defined as a model connecting the system state variables (input, internal and output) without much knowledge about the "physical" behaviour of the system
– example: a regression model linking input and output
 Current trend: combination of both ("hybrid" models)


Why data-driven now?

 Measuring campaigns using automatic computerised equipment: a lot of data became available
 important breakthroughs in computational intelligence and machine learning methods
 penetration of "computer sciences" into civil engineering (e.g., hydroinformatics, geo-informatics, etc.)


Hydroinformatics system: typical architecture

[Diagram] Layers of a typical hydroinformatics system:
– User interface
– Judgement engines: decision support systems for management
– Fact engines (data, information, knowledge inference): physically-based models, knowledge-based systems, data-driven models
– Communications
– Real world


Classification of models

 specific - general
 model estimation (data-driven) - first principles (process) models
 numerical - analytical
 stochastic - deterministic
 microscopic - macroscopic
 discrete - continuous
 qualitative - quantitative


Example of a simple data-driven model: linear regression

 Linear regression model: Y = a0 + a1 X

[Figure: scatter of actual output values; for a new input value X, the fitted line predicts a new output value Y]

 Given measured (training) data: T vectors {x(t), y(t)}, t = 1,…,T.
 Unknown a0 and a1 are found by solving an optimization problem:

   E = Σt=1..T ( y(t) − (a0 + a1 x(t)) )²  →  min

 Then for the new V vectors {x(v)}, v = 1,…,V this equation can approximately reproduce the corresponding function values {y(v)}, v = 1,…,V
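The least-squares problem above has a closed-form solution. A minimal sketch in Python (the training data below is made up for illustration):

```python
# Closed-form least-squares fit for Y = a0 + a1*X,
# minimizing E = sum_t (y(t) - (a0 + a1*x(t)))^2.

def fit_linear(xs, ys):
    """Return (a0, a1) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1

# Training data: T vectors {x(t), y(t)} (illustrative values)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a0, a1 = fit_linear(xs, ys)

def predict(x):
    # for a new input x(v), approximately reproduce y(v)
    return a0 + a1 * x
```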
Data: attributes, inputs, outputs

 a set K of examples (or instances), represented by the tuple <xk, yk>, where k = 1,…,K,
   vector xk = {x1,…,xn}k , vector yk = {y1,…,ym}k ,
   n = number of inputs, m = number of outputs.
 The process of building a function (a 'model') y = f(x) is called training.
 Often only one output is considered, so m = 1.

Measured data (attributes) and model output y* = f(x):

             Inputs                Output   Model output
Instances    x1    x2    …   xn    y        y*
Instance 1   x11   x12   …   x1n   y1       y1*
Instance 2   x21   x22   …   x2n   y2       y2*
…            …
Instance K   xK1   xK2   …   xKn   yK       yK*
Data-driven model

[Diagram: input data X feeds both the actual (real) system, which produces the observed output Y, and the machine learning (data-driven) model, which produces the predicted output Y'. Learning is aimed at minimizing the difference between Y and Y'.]

 DDM "learns" the target function Y = f(X) describing how the real system behaves
 Learning = the process of minimizing the difference between observed data and model output. X and Y may be non-numeric
 After learning, being fed with new inputs, DDM can generate output close to what the real system would generate
Data-driven models vs physically-based models

[Figure: a physically-based (conceptual) hydrological rainfall-runoff model, with coefficients to be identified by calibration. P = precipitation, E = evapotranspiration, Q = runoff, SP = snow pack, SM = soil moisture, UZ = upper groundwater zone, LZ = lower groundwater zone, lakes = lake volume]

[Figure: a data-driven rainfall-runoff model (artificial neural network): inputs are precipitation P(t), moving average of precipitation MAP3(t-2), evapotranspiration E(t) and runoff Q(t); the output is runoff flow Q(t+1); neurons are non-linear functions, and connection weights are to be identified by training]
Using data-driven methods in rainfall-runoff modelling

 Available data:
– rainfalls Rt
– runoffs (flows) Qt, upstream flows Qtup
 Inputs: lagged rainfalls Rt, Rt-1, …, Rt-L
 Output to predict: Qt+T

 Model: Qt+T = F (Rt, Rt-1, …, Rt-L,  Qt, Qt-1, …, Qt-A,  Qtup, Qt-1up, …)
               (past rainfall)        (autocorrelation)   (routing)

 Questions:
– how to find the appropriate lags? (lags embody the physical properties of the catchment)
– how to build the non-linear regression function F?
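A sketch of how such lagged input-output samples can be assembled; the helper name make_samples and the short data series are illustrative, not from the slide:

```python
# Building input-output samples for Qt+T = F(Rt ... Rt-L, Qt ... Qt-A).

def make_samples(R, Q, L=2, A=1, T=1):
    """Each input: lagged rainfalls Rt..Rt-L and flows Qt..Qt-A;
    each target: the flow Q at time t+T."""
    inputs, targets = [], []
    start = max(L, A)                  # first t with all lags available
    for t in range(start, len(Q) - T):
        x = [R[t - i] for i in range(L + 1)]    # Rt, Rt-1, ..., Rt-L
        x += [Q[t - i] for i in range(A + 1)]   # Qt, Qt-1, ..., Qt-A
        inputs.append(x)
        targets.append(Q[t + T])                # Qt+T
    return inputs, targets

R = [0, 5, 12, 3, 0, 0, 8, 2]         # rainfall series (illustrative)
Q = [10, 11, 20, 35, 28, 18, 14, 25]  # flow series (illustrative)
X, y = make_samples(R, Q)
```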


So, what is then data-driven modelling in engineering?

 Data-driven modelling:
– oriented towards building predictive models
– follows general modelling guidelines adopted in engineering
– uses the apparatus of machine learning
 Proper data-driven modelling is impossible without
– good understanding of the modelled system,
– relation to the external environment and
– possible connections to other models and decision making processes
 It differs from data mining in:
– application area (engineering and natural processes)
– used terminology and relation to physically-based models
– usually smaller data sets (and hence possible different choice of methods)



Steps in modelling process: details

 State the problem (why do the modelling?)
 Evaluate data availability and data requirements
 Specify the modelling methods and choose the tools
 Build (identify) the model:
– Choose variables that reflect the physical processes
– Collect, analyse and prepare the data
– Build the model
– Choose an objective function for model performance evaluation
– Calibrate (identify, estimate) the model parameters:
   if possible, maximize model performance by comparing the model output to past measured data and adjusting parameters
 Evaluate the model:
– Evaluate the model uncertainty and sensitivity
– Test (validate) the model using the "unseen" measured data
 Apply the model (and possibly assimilate real-time data)
 Evaluate results, refine the model
Some “golden rules” in building a model (1)

 1. Select a clearly defined problem that the model will help to resolve.
 2. Specify the required solution to the problem.
 3. Define how the solution delivered is going to be used in practice.
 4. Learn the problem, collect the domain knowledge, understand it.
 5. Let the problem drive the modelling, including the tool selection, data preparation, etc. That is, take the best tool for the job, not just a job you can do with the available tool.
 6. Clearly define assumptions (do not just assume, but discuss them with the domain knowledge experts).
 ... ->



Some “golden rules” in building a model (2)

 7. Refine the model iteratively (try different things until the model seems as good as it is going to get).
 8. Make the model as simple as possible, but no simpler, formulated also as:
– KISS ("Keep It Sufficiently Simple", or "Keep It Simple, Stupid")
– the Minimum Description Length principle: "the best model is the one that is the smallest"
– the Occam's Razor principle, formulated by William of Occam in 1320 in the following form: shave off all the "unneeded philosophy off the explanation".
 9. Define instability in the model (critical areas where small changes in inputs lead to large changes in output).
 10. Define uncertainty in the model (critical areas and ranges in the data where the model produces low-confidence predictions).
Suppliers of methods for data-driven modelling (1) Machine learning (ML)

 ML = constructing computer programs that automatically improve with experience
 The most general paradigm for DDM
 ML draws on results from:
– statistics
– artificial neural networks (various types)
– artificial intelligence
– philosophy, psychology, cognitive science, biology
– information theory, computational complexity
– control theory
 For a long time ML concentrated on categorical (non-continuous) variables


Suppliers of methods for data-driven modelling (2) Soft computing

 Soft computing - tolerant of imprecision and uncertainty of data (Zadeh, 1991). Currently includes almost everything:
– fuzzy logic
– neural networks
– evolutionary computing
– probabilistic computing (incl. belief networks)
– chaotic systems
– parts of machine learning theory


Suppliers of methods for data-driven modelling (3) Data mining

 Data mining (preparation, reduction, finding new knowledge):
– automatic classification
– identification of trends (e.g. statistical methods like ARIMA)
– data normalization, smoothing, data restoration
– association rules and decision trees
   IF (WL>1.2 @3 h ago, Rainfall>50 @1 h ago) THEN (WL>1.5 @now)
– neural networks
– fuzzy systems
 Other methods oriented towards optimization:
– automatic calibration (with a lot of data involved, it makes a physically-driven model partly data-driven)


Machine learning: Learning from data



Data-driven modelling: (machine) learning

[Diagram: input data X feeds both the actual (real) system, which produces the observed output Y, and the machine learning (data-driven) model, which produces the predicted output Y'. Learning is aimed at minimizing the difference between Y and Y'.]

 DDM tries to "learn" the target function Y = f(X) describing how the real system behaves
 Learning = the process of minimizing the difference between observed data and model output. X and Y may be non-numeric
 After learning, being fed with new inputs, DDM can generate output close to what the real system would generate
Training (calibration), cross-validation, testing

 Ideally, the observed data has to be split into three data sets:
– training (calibration) set: used to build the model
– cross-validation set: imitates the test set and model operation; used to test model performance during the calibration process
– testing set: imitates model operation; used for the final model test after the model is built; should not be seen by the developer
 One should distinguish:
– minimizing the error during model calibration and cross-validation
– minimizing the error during model operation (or on the unseen test set)
 Ideally, we should aim at minimizing the cross-validation error, since this gives hope that the error on the test set will also be small.
 In practice, the process of training uses the training set, and the cross-validation set is used to periodically check the model error and stop training
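The three-way split above can be sketched as follows; the 60/20/20 proportions are an illustrative choice, not prescribed by the slide:

```python
# Splitting observed data into training, cross-validation and test sets.

def split_data(data, frac_train=0.6, frac_cv=0.2):
    n = len(data)
    n_train = int(round(n * frac_train))
    n_cv = int(round(n * frac_cv))
    train = data[:n_train]              # used to fit the model parameters
    cv = data[n_train:n_train + n_cv]   # checked periodically; stops training
    test = data[n_train + n_cv:]        # used once, for the final evaluation
    return train, cv, test

data = list(range(100))
train, cv, test = split_data(data)
```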
What is a "good" model?

[Figure: actual output values (e.g., flow) plotted against the input (e.g., rainfall); three models of increasing complexity each predict a new output value for a new input value]

 Consider models being progressively made more accurate (and complex) during training: Green → Red → Blue
 The Green (linear) model is simple, but it is not accurate enough
 The Blue model is the most accurate, but is it "the best"?
 The Red model is less accurate than the Blue one, but captures the trend in the data. It will generalise well.
 Question: how to determine during training when to stop improving the model? Which model is "better": green, red or blue?
Necessity of cross-validation during training

[Figure: the in-sample (training) error keeps decreasing during training, while the out-of-sample (cross-validation, or verification/testing) error first decreases and then starts to increase; the good moment to stop training is where the out-of-sample error is at its minimum]


Training, cross-validation, verification (2)

[Figure: the unknown relationship Y = F(X) to be modelled; data {xi, yi} collected about this relationship; a data-driven model fitted during training and evaluated on the cross-validation set]

 Training: iterative refinement of the model – with each iteration, model parameters are changed to reduce the error on the training set
 → the error on the training set gets lower and lower
 → the error on the test set gets lower, then starts to increase
 Continuing too long may lead to overfitting and high error on the cross-validation set
 The moment the cross-validation error starts to increase is the moment to stop training
Data-driven modelling: care is needed

 Difficulties with extrapolation (working outside the variables' range)
– A solution: exhaustive data collection, optimal construction of the calibration set
 Care is needed if the time series is not stationary
– A solution: build several models responsible for different regimes
 Need to ensure that the relevant physical variables are included
– A solution: use correlation and average mutual information analysis


Data



Data: attributes, inputs, outputs

 a set K of examples (or instances), represented by the tuple <xk, yk>, where k = 1,…,K,
   vector xk = {x1,…,xn}k , vector yk = {y1,…,ym}k ,
   n = number of inputs, m = number of outputs.
 The process of building a function (a 'model') y = f(x) is called training.
 Often only one output is considered, so m = 1.

Measured data (attributes) and model output y* = f(x):

             Inputs                Output   Model output
Instances    x1    x2    …   xn    y        y*
Instance 1   x11   x12   …   x1n   y1       y1*
Instance 2   x21   x22   …   x2n   y2       y2*
…            …
Instance K   xK1   xK2   …   xKn   yK       yK*
Types of data (roughly)

 Class (category, label)


 Ordinal (order)
 Numeric (real-valued)
 etc. (considered later)

 (Time) series data: numerical data whose values have an associated index variable.


Four styles of learning

 Classification
– on the basis of classified examples, a way of classifying unseen examples
is to be found
 Association
– association between features (which combinations of values are most
frequent) is to be identified
 Clustering
– groups of objects (examples) that are "close" are to be identified
 Numeric prediction
– outcome is not a class, but a numeric (real) value
– often called regression


Important notions in machine learning

 Hypotheses
– there are various possible functions Y=f (X) (hypotheses) relating input and
output
– machine learning is searching through this hypotheses space in order to
determine one that fits the observed data and the prior knowledge of the learner

 Concepts: the thing to be learned on the basis of available data. For


example:
– children learn how to read and write, what is sweet and salty
– conditions that lead to flood
– combinations of particular algae indicating poor water quality

 Instances (examples) representing concepts

 Attributes (describing instances)



Occam's razor: short hypotheses,
minimum description length (MDL)

 Principle: accept the simplest hypothesis (=model)


 Known as the Occam's razor principle: shave off all the "unneeded philosophy off the explanation" (William of Occam, 1320)
 In ML also known as the Minimum Description Length (MDL) principle


Instances (examples)

 Instances = examples of input data.


 Instances that can be stored in a simple rectangular table (only these will be mainly considered):
– individual unrelated customers described by a set of attributes
– records of rainfall, runoff, water level taken every hour
 Instances that cannot be stored in a table, but require more complex structures:
– instances of pairs that are sisters, taken from a family tree
– related tables in complex databases describing staff, their ownership, involvement in projects, borrowing of computers, etc.


Attributes: more detailed view at measured data (1)

 Nominal (also called class, category, labels). Examples:


– customers that tend to do shopping on Fridays
– combinations of hydrometeorological conditions leading to high surge
– combination of conditions leading to a high probability of flood

 Ordinal - categories that can be ordered (ranked)


– temperature expressed as cool, mild, hot
– water level expressed as low, medium, high

 ...
 ...


Attributes: more detailed view at measured data (2)

 ...
 ...
 Interval - ordered and expressed in fixed equal units. Examples:
– dates (cannot be however multiplied)
– temperature expressed in degrees Celsius

 Ratio (real numbers) - there is a zero point. Examples:


– distance, precipitation, water level, soil moisture, etc.
– but not temperature, since using another zero point (C/F) changes ratios
– the amount of money a customer tends to spend on a typical Friday


Attributes: main types of data in machine learning
(simplified)

 Nominal (also called class, category, labels, enumerated)
– discrete, but without ordering
– special case: dichotomy (Boolean)

 Ordinal (also called numeric, continuous)
– can also be discrete, but must be ordered
– prediction of real-valued output is also called regression

 more complex types, requiring additional descriptions (metadata), e.g.:
– dimensions (for building expressions that are dimensionally correct)
– partial ordering (like "same day next week")
– subsets (like "holidays", when water consumption is different), etc.


Data preparation and surveying

 prepare the data - this may include complex procedures of restoring the missing data, data transformation, etc.;
 survey the data - understand the nature of the data, get insight into the problem this data describes - includes identification and analysis of variability, sparsity, peaks and valleys, entropy, mutual information of inputs to outputs, etc. (this step is often merged with the previous one);
 build the model


Data preparation results in:

 training data set - raw data is presented in a form necessary to train the DDM;
 cross-validation data set - needed to detect overtraining;
 testing, or validation, data set - needed to validate (test) the model's predictive performance;
 algorithms and software to perform pre-processing (e.g., normalization);
 algorithms and software to perform post-processing (e.g., denormalization).


Important steps in data preparation

 Replace missing, empty, inaccurate values
 Handle issues of spatial and temporal resolution
 Linear scaling and normalization
 Non-linear transformations
 Transform the distributions
 Time series:
– Fourier and wavelet transforms
– Identification of trend, seasonality, cycles, noise
– Smoothing data
 Finding relationships between attributes (e.g. correlation, average mutual information - AMI)
 Discretizing numeric attributes into {low, medium, high}
 Data reduction (principal components analysis - PCA)
Example: raw data set



Example: normalization and redistribution



Replacing missing and empty values

 What to do with the outliers? How to reconstruct missing values?


 An estimator is a device (algorithm) used to make a justifiable guess about the value of some particular variable, that is, to produce an estimate
 An unbiased estimator is a method of guessing that does not change important characteristics of the data set when the estimates are included with the existing values
 Example: dataset 1 2 3 x 5
– Estimators:
   2.750, if the mean is to be unbiased;
   4.659, if the standard deviation is to be unbiased;
   4.000, if the step-wise change in the variable value (trend) is to be unbiased (that is, linear interpolation is used: xi = (xi+1 + xi-1) / 2)
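The three estimators from the example can be reproduced numerically. The bisection search used for the standard-deviation case is an implementation choice of this sketch, not given on the slide:

```python
# Estimators for the missing value x in the dataset 1 2 3 x 5.

def mean_of(v):
    return sum(v) / len(v)

def sample_std(v):
    m = mean_of(v)
    return (sum((x - m) ** 2 for x in v) / (len(v) - 1)) ** 0.5

known = [1.0, 2.0, 3.0, 5.0]

# 1. Mean-unbiased estimate: the mean of the known values
x_mean = mean_of(known)                       # 2.750

# 2. Std-unbiased estimate: x such that the completed set keeps the same std
target = sample_std(known)
lo, hi = 3.0, 10.0
for _ in range(60):                           # bisection search
    mid = (lo + hi) / 2.0
    if sample_std([1.0, 2.0, 3.0, mid, 5.0]) < target:
        lo = mid
    else:
        hi = mid
x_std = (lo + hi) / 2.0                       # about 4.659

# 3. Trend-unbiased estimate: linear interpolation between the neighbours
x_trend = (3.0 + 5.0) / 2.0                   # 4.000
```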


Issue of spatial and temporal resolution

 Examples:
– in a harbour, sedimentation is measured once in two weeks at one location and once a month at two other locations, and never at other (maybe important) locations
– in a catchment, rainfall data was manually collected at three gauging stations once a day for 20 years, and 3 years ago measurements started at 4 new automatic stations as well, with hourly frequency
 Solutions:
– filling in missing data
– introducing an artificial resolution equal to the maximum for all variables


Linear scaling and normalization

 General form:
   x'i = a xi + b
 To keep data positive:
   x'i = xi − min(x1…xn) + SmallConst
 Squashing data into the range [0, 1]:
   x'i = (xi − min(x1…xn)) / (max(x1…xn) − min(x1…xn))
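The squashing formula above, as a minimal sketch:

```python
# Squash data into [0, 1]: x'_i = (x_i - min) / (max - min).

def minmax_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

scaled = minmax_scale([10.0, 15.0, 20.0, 30.0])
```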


Non--linear transformations
Non

 Logarithmic:
   x'i = log(xi)
 Softmax scaling (described on the following slides)


Logistic function

   L(x) = 1 / (1 + e^(−x))

[Figure: the logistic function; the output value rises smoothly from 0 to 1 as the input value goes from −10 to 10]


Softmax function

 1. {x} should first be transformed linearly to vary around the mean mx:

   x'i = (xi − E(x)) / (λ (σx / 2π))

– where:
   E(x) is the mean value of variable x;
   σx is the standard deviation of variable x;
   λ is the linear response measured in standard deviations – for example, ±σ (that is, σ on either side of the central point of the distribution) covers 68% of the total range of x, ±2σ covers 95.5%, ±3σ covers 99.7%;
   π ≈ 3.14
 2. The logistic function L(x'i) is then applied

[Figure: the logistic function]
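A sketch of softmax scaling as described above (linear step followed by the logistic function); lam (the λ above) and the data are illustrative choices:

```python
import math

# Softmax scaling: linear transform around the mean, then logistic squashing.

def softmax_scale(xs, lam=2.0):
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    scale = lam * std / (2.0 * math.pi)
    # logistic function applied to the linearly transformed values
    return [1.0 / (1.0 + math.exp(-(x - mean) / scale)) for x in xs]

scaled = softmax_scale([1.0, 2.0, 3.0, 4.0, 100.0])  # note the outlier 100
```

Unlike plain min-max scaling, the outlier is squashed towards 1 without compressing the rest of the data onto a tiny interval.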
Transforming the distributions: Box-Cox transform

 The first step uses the power transform to adjust the changing variance:

   x'i = (xi^λ − 1) / λ

– where:
   xi is the original value,
   x'i is the transformed value,
   λ is a user-selected value.
 The second step balances the distribution by subtracting the mean and dividing the result by the standard deviation:

   x''i = (x'i − E(x')) / σx'

– where:
   x'i is the value after the first transform,
   x''i is the (final) standardized value,
   E(x') is the mean value of variable x',
   σx' is the standard deviation of variable x'.
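The two Box-Cox steps can be sketched as follows; λ = 0.5 and the discharge values are illustrative:

```python
# Two-step Box-Cox: power transform, then standardization.

def box_cox(xs, lam=0.5):
    # Step 1: x' = (x^lam - 1) / lam adjusts the changing variance
    step1 = [(x ** lam - 1.0) / lam for x in xs]
    # Step 2: x'' = (x' - E(x')) / sigma(x') balances the distribution
    n = len(step1)
    mean = sum(step1) / n
    std = (sum((v - mean) ** 2 for v in step1) / n) ** 0.5
    return [(v - mean) / std for v in step1]

discharge = [5.0, 8.0, 12.0, 40.0, 150.0, 400.0]   # skewed, like the figure
transformed = box_cox(discharge)
```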
Box-Cox transform of the discharge data

[Figure: original hourly discharge data (30 days), discharge in m3/s over 700 hours; below it, the Box-Cox transform of the discharge data after step 1 and after the final step]
Box-Cox transform: original and resulting histograms

[Figure: histogram of the original discharge (strongly skewed towards low values) and histogram of the Box-Cox transformed discharge (much closer to symmetric)]


Multistationary data: care needed

 transforming the distributions can be dangerous: such transformations actually change the "nature" of the data and the relationships between variables

[Figure: a) original data with two clusters (two "samples") visible; b) normalized data – the clusters cannot be identified]
Smooth data?
 Simple and weighted moving averages
 Savitzky–Golay filter: builds a local polynomial regression (of degree k) on a series of values
 other filters (Gaussian, Fourier, etc.)
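A sketch of the simple moving average; shrinking the window at the edges is one of several possible boundary treatments:

```python
# Simple (unweighted) moving average with window 2k+1.

def moving_average(xs, k=1):
    out = []
    for i in range(len(xs)):
        lo = max(0, i - k)
        hi = min(len(xs), i + k + 1)
        out.append(sum(xs[lo:hi]) / (hi - lo))   # average over the window
    return out

smooth = moving_average([1.0, 5.0, 3.0, 7.0, 5.0])
```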


Fourier transform can be used to smooth data (extract only low-frequency harmonics)

[Figure: original signal (time series) decomposed into six harmonic components]
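A sketch of such low-pass smoothing using a plain discrete Fourier series (no external libraries); the number of harmonics kept is a user choice:

```python
import math

# Reconstruct a signal from its first n_keep Fourier harmonics only,
# discarding the high-frequency ones.

def fourier_smooth(signal, n_keep=3):
    n = len(signal)
    out = [0.0] * n
    for k in range(n_keep):
        # Fourier coefficients of harmonic k
        a = 2.0 / n * sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        b = 2.0 / n * sum(signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        if k == 0:
            a, b = a / 2.0, 0.0   # the mean term is counted once
        for t in range(n):
            out[t] += a * math.cos(2 * math.pi * k * t / n) + b * math.sin(2 * math.pi * k * t / n)
    return out

n = 100
slow = [math.sin(2 * math.pi * t / n) for t in range(n)]                # harmonic 1
signal = [s + 0.3 * math.sin(2 * math.pi * 20 * t / n) for t, s in enumerate(slow)]
smooth = fourier_smooth(signal)   # keeps harmonics 0-2: the fast wave is gone
```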


Fourier transform: the same signal in the frequency domain

[Figure: the amplitude spectrum of the signal]


Is life linear?
Are outliers bad?



Finding relationships between i/o variables (used for Input Variable Selection, IVS)

 Correlation coefficient R:

   R = Σi (xi − x̄)(yi − ȳ) / √( Σi (xi − x̄)² · Σi (yi − ȳ)² )

 Average mutual information (AMI). It represents the measure of information that can be learned from one set of data, having knowledge of another set of data. AMI can be used to identify the optimal time lag for a data-driven rainfall-runoff model.

   I(X;Y) = Σx∈X Σy∈Y P(x,y) log2 [ P(x,y) / (P(x) P(y)) ]

– where P(x,y) is the joint probability of realisation x of X and y of Y; P(x) and P(y) are the individual probabilities of these realisations
 If X is completely independent of Y, then the AMI I(X;Y) is zero.
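A sketch of computing AMI from the formula above for discrete (already binned) data; continuous series would first be discretized:

```python
import math
from collections import Counter

# Average mutual information I(X;Y) from empirical probabilities.

def ami(xs, ys):
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    total = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        total += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return total

# Y fully determined by X -> AMI = 1 bit for a fair binary variable
dep = ami([0, 0, 1, 1], [0, 0, 1, 1])
# Y independent of X -> AMI = 0
ind = ami([0, 0, 1, 1], [0, 1, 0, 1])
```

Applied to Qt+1 and Rt-L for a range of lags L, the lag with the maximum AMI would be selected.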
Use of AMI: relatedness between flow and past rainfall

[Figure: FLOW1 data set – effective rainfall (mm) and discharge (m3/s) over 2500 hours, with a zoomed-in window at t = 90–150 hrs showing flow Q and rainfall R]

 Consider the future discharge Qt+1 and past rainfalls Rt-L. What is the lag L such that the relatedness is strongest?
 AMI between Qt+1 and the past lagged rainfalls Rt-L can be computed for each lag L; the maximum AMI indicates the optimal lag

Introducing classification: main ideas


k-Nearest neighbours method: a "common sense" method of classification

 instances are points in 2-dim. space; the output is boolean (+ or −)
 a new instance xq is classified w.r.t. the proximity of the nearest training instances:
– to class + (if 1 neighbour is considered)
– to class − (if 4 neighbours are considered)
 for discrete-valued outputs, assign the most common value

[Figure: Voronoi diagram for the 1-nearest-neighbour classifier]
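A sketch of the k-nearest-neighbours vote described above; the training points are made up:

```python
from collections import Counter

# k-NN: points in 2-D space, boolean classes '+' and '-', majority vote
# among the k closest training instances.

def knn_classify(train, xq, k=1):
    """train: list of ((x1, x2), label); xq: the new instance."""
    def sq_dist(item):
        (x1, x2), _ = item
        return (x1 - xq[0]) ** 2 + (x2 - xq[1]) ** 2
    nearest = sorted(train, key=sq_dist)[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]   # the most common value

train = [((0, 0), '-'), ((0, 1), '-'), ((1, 0), '-'),
         ((5, 5), '+'), ((6, 5), '+')]
```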
Discriminating surfaces: a traditional method of classification from statistics

 a surface (line, hyperplane), e.g. Y = a1 X + a2, separates examples of different classes

[Figure: left – linearly separable examples; right – a more difficult example where a linear function will misclassify several examples, so a non-linear function is needed (or a transformation of the space)]
Decision tree: example with 2 numeric input variables, 2 output classes: 0 and 1

[Figure: the (X1, X2) plane (0 ≤ X1 ≤ 6, 0 ≤ X2 ≤ 4) partitioned into rectangular regions of class 0 and class 1]

The corresponding tree:
• x2 > 2?
  Yes → x1 > 2.5?
    Yes → x2 < 3.5?  Yes → class 1;  No → class 0
    No → class 0
  No → x1 < 4?
    Yes → class 0
    No → x2 < 1?  Yes → class 1;  No → class 0
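The tree above, written directly as nested tests on the two numeric inputs:

```python
# The decision tree from the slide as nested if-tests on x1 and x2.

def classify(x1, x2):
    if x2 > 2:
        if x1 > 2.5:
            return 1 if x2 < 3.5 else 0
        return 0
    if x1 < 4:
        return 0
    return 1 if x2 < 1 else 0
```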
Classification rules and decision trees: "Play Tennis" example

Outlook   Temp  Humid.  Wind    Play?
sunny     hot   high    weak    no
sunny     hot   high    strong  no
overcast  hot   high    weak    yes
rainy     mild  high    weak    yes
rainy     cool  normal  weak    yes
rainy     cool  normal  strong  no
overcast  cool  normal  strong  yes
sunny     mild  high    weak    no
sunny     cool  normal  weak    yes
rainy     mild  normal  weak    yes
sunny     mild  normal  strong  yes
overcast  mild  high    strong  yes
overcast  hot   normal  weak    yes
rainy     mild  high    strong  no

 Objective: to represent this knowledge using ML techniques


Several ways to represent knowledge about a data set (1) Classification rules

 Classification rules - they predict the classification of examples in terms of whether to play or not. E.g.:
– if (Outlook=sunny) and (Humidity=high) then Play=No
– if (Outlook=rainy) and (Wind=strong) then Play=No
– if (Outlook=overcast) then Play=Yes
– if (Humidity=normal) then Play=Yes
– if (none of the above) then Play=Yes
 These rules are to be interpreted in order (also called a decision list)


Several ways to represent knowledge about a data set (2) Decision trees

 Decision trees - a treelike structure representing classification rules


Several ways to represent knowledge about data set
(3) Association rules

 Association rules - they associate different attribute values. E.g.:


– if (Temperature=cool) then Humidity=normal
– if (Humidity=normal)
(H idit l) and
d (Windy=weak)
(Wi d k) then
th Play=Yes
Pl Y
– if (Outlook=sunny) and (Play=No) then Humidity=high
– if (Windy
(Windy=false)
false) and (Play=No)
(Play No)
then (Outlook=sunny) and (Humidity=high)
 In total there are around 60 such rules that are 100% correct
 These rules can “predict” any of the attributes, not just the “Play”
attribute



Classification:
decision trees
and the ID3 and C4.5 algorithms



Example with categorical attributes:
14 of the 36 possible combinations

Outlook Temp Humid. Wind Play?


sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rainy mild high weak yes
rainy cool normal weak yes
rainy cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rainy mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rainy mild high strong no



Objective of classification

 to build a data-driven model that would
– be based on the available historical data on behaviour of a particular
person (14 instances)
– classify a new instance (observation) into a proper class – in this case
into “Yes” (play) or “No” (not to play).



Decision tree for ‘PlayTennis’
Outlook Temp Humid. Wind Played tennis?
sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rainy mild high weak yes
rainy cool normal weak yes
rainy cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rainy mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rainy mild high strong no
 each node = a test for an attribute
 leaves = classification of an instance
 How to construct such a tree? Algorithm ID3 (Quinlan, 1986), extended
later to C4.5 and C5
Verification (test) data set

 verification (test) set: observations that also happened:
Outlook Temp Humid. Wind Play? | Model prediction
rainy cool high strong yes | no (wrong - oops! our model is wrong here)
sunny hot normal strong yes | yes (correct)
overcast cool high strong yes | yes (correct)

 classification error for this verification data set of 3 examples is
1/3 = 33%.



Model in operation

 new instance that the model never “saw”:

Outlook Temp Humid. Wind Play? | Model prediction
rainy cool low weak ? | yes

Correct?
We do not know!



Algorithm for building decision trees

 1. A := the 'best' decision (split) attribute for the next node
('best' = giving max information gain)
 2. Assign A as the decision attribute for the node
 3. For each value of A, create a new descendant of the node
 4. Sort the training examples to the leaf nodes
 5. If the training examples are perfectly classified,
then STOP,
else iterate over the new leaf nodes

 But a question remains:
– how to identify the “best” split attribute, i.e. how to compute the
information gain?
– Let’s consider the Iterative Dichotomiser 3 (ID3) algorithm



Entropy: measure of the uncertainty (unpredictability)
associated with a random variable

 Consider a set S of members that can be + or –
 We sample a member randomly, and assume we get + with probability P+
 if P+=1 (all members are +), then E=0 (0 bits are needed to encode a
message about a sample, since they are all +)
 if P+=P-=0.5, then E=1 (1 bit needed since a sample is + or – with
equal probability)
 if P+=0.8 then E=0.72 (so < 1 bit per message about the class)
 Entropy(S) = expected information (number of bits) needed to encode the
class (+ or –) of a randomly drawn sample of S (under the shortest-length
"code"): E(S) = – P+ log2 P+ – P– log2 P–
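The quoted values follow from the binary entropy formula and can be checked with a small sketch (not part of the original slides):

```python
import math

def binary_entropy(p_plus):
    """E = -P+ log2 P+ - P- log2 P- (in bits); a pure set has entropy 0."""
    if p_plus in (0.0, 1.0):
        return 0.0
    p_minus = 1.0 - p_plus
    return -p_plus * math.log2(p_plus) - p_minus * math.log2(p_minus)
```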



Entropy: general case

 if members in S may belong to c classes:
E(S) = – Σ i=1..c  pi log2 pi

 the logarithm is still base 2 because entropy is the measure of the
expected encoding length measured in bits
 the max value of entropy is log2 c
 for example, if c=8 and p1=…=p8=0.125, then E(S) = 3
(3 bits needed to send a message about the class number)
 if all examples belong to one class, E=0. This is what we are aiming at.



Entropy and Information Gain: essence of
the Iterative Dichotomiser 3 (ID3) algorithm
for building decision trees
 S = (14 instances: 9+, 5–)
 Entropy E([9+, 5–]) = –(9/14) log2(9/14) – (5/14) log2(5/14) = 0.940
 we want to reduce total entropy by splitting the set S into subsets with
lower entropy (i.e. with higher share of examples of the same class)
 lower entropy = information gain
 if a split is made on the basis of attribute A:
– let Values(A) be the set of all possible values for attribute A
 e.g.: Values(Humidity) = {normal, high}
– Sv is the subset of S for which attribute A has value v
 e.g.: S Humidity=normal = {7 examples}, |S Humidity=normal| = 7
– then Information gain (expected reduction in entropy caused by knowing
the value of attribute A):
Gain(S, A) = E(S) – Σ v∈Values(A)  (|Sv| / |S|) E(Sv)
Selecting 'best' attributes for leaves

 for a node an attribute is selected that provides max Gain:


 Gain(S, Humidity) = 0.151
 Gain(S, Wind) = 0.048
 Gain(S, Temperature) = 0.029
 Gain(S, Outlook) = 0.246 (highest)



Selecting the next best attribute



Same example with continuous or mixed data
Outlook Temp Humid. Wind Play?
sunny 85 85 weak no
sunny 80 90 strong no
overcast 83 86 weak yes
rainy 70 96 weak yes
rainy 68 80 weak yes
rainy 65 70 strong no
overcast 64 65 strong yes
sunny 72 95 weak no
sunny 69 70 weak yes
rainy 75 80 weak yes
sunny 75 70 strong yes
overcast 72 90 strong yes
overcast 81 75 weak yes
rainy 71 91 strong no
 Temperature: 40 48 60 72 80 90
 Play? No No Yes Yes Yes No
 Thresholds: between 48 and 60, and between 80 and 90
(candidate splits lie where the class changes; the one giving max
information gain is chosen)
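For a numeric attribute the algorithm tries candidate thresholds (midpoints between adjacent values where the class changes) and keeps the one with the highest information gain. A sketch on the toy series above:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) of the best binary split value <= threshold."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_thr, best_gain = None, -1.0
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:
            continue  # only boundaries between classes can be optimal
        thr = (v1 + v2) / 2.0
        left = [l for v, l in pairs if v <= thr]
        right = [l for v, l in pairs if v > thr]
        g = base - (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if g > best_gain:
            best_thr, best_gain = thr, g
    return best_thr, best_gain

temps = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
```

On this series the two candidates are 54 (between 48 and 60) and 85 (between 80 and 90); the first gives the higher gain.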
Unknown (missing) values in decision trees

 If some examples do not have a value for attribute A:
– assign the most common value among the examples sorted to this node, or
– assign the most common value among the examples with the same target
value, or
– assign probability pi to each possible value vi of A, and then pass a
fraction pi of the examples to each descendant in the tree



Pruning Decision trees to improve interpretability and
remove overfitting

 Tree may appear to be too complex (hundreds of nodes):
– difficult to interpret
– "too accurate" (overfitting): following all (even noisy) examples
 Solution: the tree can be reduced (pruned)
– reduced error pruning (Quinlan, 1987): replace a subtree by a leaf node
and assign the most common class of the associated training examples; the
new tree must be no worse on the validation set
– algorithms can be forced to have a certain minimum number of instances
in a leaf
– the tree will inevitably misclassify some instances, so accuracy is
decreased - but this normally removes overfitting
– however, the understandability of smaller trees is better



Overfitting: can also happen in decision trees (1)

 Frequent situation: accuracy during learning increases, but on the test
examples it drops



Overfitting: can also happen in decision trees (2)

 Example of overfitting:
– consider adding a new, 15th, instance [sunny, hot, normal, strong, NO] -
this example is “noisy” - it could be just wrong
– the new tree will have to take into account this (wrong) example (NB: a
check on Temperature is added) - built by the ID3 algorithm in Weka
software:
 New tree built on 15 examples:
outlook = sunny
|  temperature = hot: no
|  temperature = mild
|  |  humidity = high: no
|  |  humidity = normal: yes
|  temperature = cool: yes
outlook = overcast: yes
outlook = rainy
|  windy = strong: no
|  windy = weak: yes
 Original tree built on 14 examples (it misclassifies the 15th example):
outlook = sunny
|  humidity = high: no (3.0)
|  humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|  windy = strong: no (2.0)
|  windy = weak: yes (3.0)
Problems appropriate for decision tree learning

 Instances describable by attribute-value pairs.


– Easiest is when an attribute takes a small number of disjoint possible
values (e.g. Temp={cool, mild, hot})
– however, with proper discretization real values are also possible
 Target function is discrete valued (i.e. it is a class).
 Possibly noisy (missing) training data are allowed



Classification: use of rules



Classification rules

 A popular alternative to decision trees


 General form (if-then or if-then-else):
– if (Antecedent) then (Consequent)
– Antecedent is normally several ANDed preconditions
 Several rules are often connected with the OR operator (“ORed”)
 Rules can be read off a decision tree, but they are then far more
complex than necessary. Well-constructed rules are often more
compact than trees
 Problems: rules can be interpreted in order (as a decision list) or
individually; what to do if different rules lead to different conclusions
for the same instance, etc.



Classification rules based on “representative instances”
(prototype-based rules)

 if there is a “representative example” Er that carries a lot of


information, it could be used as the basis for rules like:

– if NewExample is “close” to Er then its class is the same as of Er



Example classification rule
based on a pruned tree

 A decision tree is built, then pruned, and then rules are built (PART
algorithm). Example:
– if (outlook = overcast) then Play=YES (total=4 / errors=0)
– if (humidity = high) then Play=NO (total=5 / errors=1)
– otherwise Play=YES (total=5 / errors=1)
 Accuracy is reduced, but the set of rules is very simple



Example of using trees and rules:
influence of environmental factors on diseases

 Influence of environmental factors on respiratory diseases (Kontic,
Dzeroski, 2000) - decision trees (C4.5), classification rules (CN2)
– 760 patients
– 33 input attributes:
 gender, age, residence, practices: cooking, heating, smoking,
allergies, education, type of house, number of people in household,
exposition, month, etc.
– the trained classifier was able to predict the disease (4 classes):
 acute nasopharyngitis (common cold),
 acute infection of upper respiratory organs,
 influenza,
 acute bronchitis



Case Study Woudse: using decision trees to replicate the
pumping strategy for the Woudse water system (Delfland, NL)



Woudse: schematisation of the water system
(figure: schematisation of the polder)



Woudse Case Study: water levels and pumping
(full data set)

(Chart: “Water level & pump discharge for Woudse” - water level (M) and
pump discharge plotted against observation number, full data set)



Woudse Case Study: water levels and pumping
(fragment)

(Chart: “Water level & pump discharge for Woudse” - fragment,
observations 900-1150)



Woudse Case Study: problem description

 Input: water level(t) and pump discharge(t-1)
 Output: pump discharge(t)
 The pumping station has two pumps, each with capacity 0.133 m3/sec.
Possible pump discharge: 0, 0.133 or 0.266 m3/sec
 Pump discharge has been described as a ‘category variable’ and is
expressed as 0, 1 or 2
 If the water level goes up, pump(s) should be switched on to reduce the
water level; the target water level is -4.6 M
 At each time level we have to determine:
– the pump discharge (0, 1 or 2) based on two inputs: water level(t) and
pump discharge(t-1)
 The problem is posed as a classification problem.
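Preparing the classification instances from the two time series is then a matter of lagging; a sketch with made-up numbers (all names are hypothetical):

```python
def make_instances(water_level, pump):
    """Build ((WL_t, Pump_{t-1}) -> Pump_t) instances from parallel series."""
    instances = []
    for t in range(1, len(water_level)):
        inputs = (water_level[t], pump[t - 1])
        target = pump[t]  # the class label: 0, 1 or 2
        instances.append((inputs, target))
    return instances

# five made-up time steps
wl = [-4.62, -4.58, -4.55, -4.57, -4.60]
pump = [0, 0, 1, 2, 1]
```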



Woudse case study: resulting decision tree solving the
classification problem (trained on 5000 instances)
Pumpt = f (WLt, Pumpt-1) Pump = {0, 1, 2}
 PumpT-1 = 0
 | WLt <= -4.577: 0 (1084.0/1.0)
 | WLt > -4.577
 | | WLt <= -4.57: 0 (417.0/111.0)
 | | WLt > -4.57
 | | | WLt <= -4.551: 1 (125.0)
 | | | WLt > -4.551: 2 (12.0)
 PumpT-1 = 1
 | WLt <= -4.595
 | | WLt <= -4.601: 0 (129.0)
 | | WLt > -4.601: 1 (189.0/64.0)
 | WLt > -4.595
 | | WLt <= -4.55: 1 (680.0/3.0)
 | | WLt > -4.55: 2 (41.0)
 PumpT-1 = 2
 | WLt <= -4.593
 | | WLt <= -4.6: 0 (37.0)
 | | WLt > -4.6: 2 (39.0/16.0)
 | WLt > -4.593: 2 (2247.0/1.0)
Woudse case study: verification result
(full data set, 3759 instances)

(Chart: “Water level & pump discharge for Woudse” - observed water level,
known pump and predicted pump for all 3759 observations)



Woudse case study: verification result
(fragment with the 100% correct classification)

(Chart: “Water level & pump discharge for Woudse” - fragment, observations
950-1000; the predicted pump discharge matches the known one exactly)



Woudse case study: verification result
(fragment with some errors present)

(Chart: “Water level & pump discharge for Woudse” - fragment, observations
600-650; some classification errors are present)



Clustering and classification:
some applications

 SOFM in finding 12 groups of catchments based on their 12
characteristics, and then applying ANN to model the regional flood
frequency (Hall et al. 2000)
 Hannah et al. (2000) used clustering for finding groups of hydrographs
on the basis of their shape and magnitude;
clusters are then used for classification by experts
 similarly, identifying the classes of river regimes (Harris et al. 2000)
 fuzzy c-means in classifying shallow Dutch groundwater sites into
homogeneous groups (Frapporti et al. 1993)
 fuzzy classification in soil classification on the basis of cone
penetration tests (CPTs) (Zhang et al. 1999)



Clustering and classification:
some applications

 decision trees in classifying surge water levels in the coastal zone
depending on the hydrometeorological data, with the Dutch Ministry
for public works (Solomatine et al., 2000; Velickov, 2004);
 decision trees in classifying the river flows in the problem of flood
control
 self-organizing feature maps (Kohonen neural networks) as clustering
methods, and SVM as a classification method in aerial photo
interpretation (Velickov et al., 2000);
 using decision trees, Bayesian methods and neural networks in soil
classification on the basis of cone penetration tests (CPT)
(Bhattacharya and Solomatine, 2006)



Classification: conclusions

 there is a wide choice of methods
 classification methods are mainly applied in pattern recognition
problems
 engineering numerical problems can sometimes be posed as
classification problems. Using classification methods (decision trees)
often leads to simpler models and requires less accurate data



Numeric prediction (regression):
linear models
and their combinations in tree-like structures
(M5 model trees)



Models for numeric prediction

 Target function is real-valued


 There are many methods:
– Linear and non-linear regression
– ARMA (auto-regressive moving average) and ARIMA models
– Artificial Neural Networks (ANN)

  We will consider now:


– Linear regression
– Regression trees
– Model trees



Linear regression
Y = a1 X + a0
(Figure: scatter plot of the training points with the fitted line; for a
new input value x(v) the model predicts a new output value y(v))
 Given measured (training) data: T vectors {x(t), y(t)}, t = 1,…,T
 Unknown a0 and a1 are found by solving an optimization problem:
E = Σ t=1..T  ( y(t) – (a0 + a1 x(t)) )²   min
 Then for the new V vectors {x(v)}, v = 1,…,V this equation can
approximately reproduce the corresponding function values
{y(v)}, v = 1,…,V
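For this single-input case the minimization has a closed-form (normal-equation) solution; a sketch, with made-up training data generated from y = 1 + 2x:

```python
def fit_line(x, y):
    """Least-squares estimates of a0, a1 in y = a0 + a1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x
```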
Numeric prediction by averaging in subsets
(regression trees in 1D)

(Figure: input X1 is split into intervals; averaging of the output Y is
performed in each interval)
 the input space can be split according to the standard deviation in
subsets
Numeric prediction by piece-wise linear models
(model trees in 1D)
(Figure: input X1 is split into intervals; a separate linear model is
built for each of the intervals to predict Y (output))

 the question is: how to split the input space in an optimal way?
Regression and M5 model trees: building them in 2D
(Figure: the input space X1-X2 is split into regions; a separate
regression model (Model 1 … Model 6) is built for each of the regions)
 Tree structure where
– nodes are splitting conditions
– leaves are:
 constants ( regression tree)
 linear regression models ( M5 model tree)
 The splits of the example form a tree:
x2 > 2
|  yes: x1 > 2.5
|  |  yes: x2 < 3.5
|  |  |  yes: Model 1
|  |  |  no: Model 2
|  |  no: Model 3
|  no: x1 < 4
|  |  yes: Model 4
|  |  no: x2 < 1
|  |  |  yes: Model 5
|  |  |  no: Model 6
 Example of Model 1:
– in a regression tree: Y = 10.5
– in an M5 model tree: Y = 2.1*X1 + 0.3*X2
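Read as code, such a tree is nested conditionals ending in leaf models. A sketch of the example above, where only Model 1 uses the coefficients shown on the slide and the other leaf coefficients are made-up placeholders:

```python
def m5_predict(x1, x2):
    """Follow the example splits to a leaf, then apply its linear model."""
    def model(a0, a1, a2):  # generic linear leaf model
        return a0 + a1 * x1 + a2 * x2

    if x2 > 2:
        if x1 > 2.5:
            if x2 < 3.5:
                return model(0.0, 2.1, 0.3)  # Model 1 (from the slide)
            return model(1.0, 0.5, 0.2)      # Model 2 (made-up)
        return model(2.0, 0.1, 0.4)          # Model 3 (made-up)
    if x1 < 4:
        return model(0.5, 0.3, 0.1)          # Model 4 (made-up)
    if x2 < 1:
        return model(1.5, 0.2, 0.6)          # Model 5 (made-up)
    return model(0.8, 0.4, 0.2)              # Model 6 (made-up)
```

In a regression tree, each leaf would return a constant instead of evaluating a linear model.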
How to select an attribute for split in regression trees
and M5 model trees

 regression trees: same as in decision trees (information gain)
 main idea:
– choose the attribute that splits the portion T of the training data that
reaches a particular node into subsets T1, T2,…
– use the standard deviation sd(T) of the output values in T as a measure
of error at that node (in decision trees, the entropy E(T) was used)
– a split should result in subsets Ti with low standard deviation sd(Ti)
– so the model trees splitting criterion is SDR (standard deviation
reduction) that has to be maximized:
SDR = sd(T) – Σ i  (|Ti| / |T|) sd(Ti)
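A sketch of the SDR computation (using the population standard deviation):

```python
import math

def sd(values):
    """Population standard deviation of the output values."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def sdr(parent, subsets):
    """SDR = sd(T) - sum_i |Ti|/|T| * sd(Ti); the higher, the better."""
    n = len(parent)
    return sd(parent) - sum(len(s) / n * sd(s) for s in subsets)

# a split that separates low outputs from high ones scores well
T = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
good_split = [[1.0, 1.2, 0.9], [5.0, 5.3, 4.8]]
bad_split = [[1.0, 5.3, 0.9], [5.0, 1.2, 4.8]]
```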



Measures to improve performance of model trees

 smoothing:
– smoothing process is used to compensate for the sharp discontinuities
between adjacent linear models

 pruning (size reduction) - needed when a large tree overfits the data:
– a subtree is replaced by one linear model



Regression and model trees in numerical prediction:
some applications



M5 tree in Rainfall-runoff modelling (Huai river, China)
Model structure:
Qt+T = F (Rt, Rt-1, …, Rt-L, …, Qt, Qt-1, …, Qt-A, …, Qtup, Qt-1up, …)

Variables considered
Output: predicted discharge QXt+1
Inputs (with different time lags):
- daily areal rainfall (Pa)
- moving average of daily areal rainfall (PaMov)
- discharges (QX) and upstream (QC)

Smoothed variables have higher correlation coeff. with the output,
e.g. the 2-day moving average of rainfall (PaMov2t)

Data (1976-1996)
Daily discharges (QX, QC)
Daily rainfall at 17 stations
Daily evaporation for 14 years (1976-1989) at 3 stations
Training data: 1976-89
Cross-valid. & testing: 1990-96

 Final model for the flood season:
 Output:
– discharge QXt+1 the next day
 Inputs:
– Pat Pat-1 PaMov2t PaMov2t-1 QCt QCt-1 QXt
 Techniques used: M5 model trees, ANN
Resulting M5 model tree with 7 models (Huai river)
QXt <= 154 :
| PaMov2t <= 4.5 : LM1 (1499/4.86%)
| PaMov2t > 4.5 :
| | PaMov2t <= 18.5 : LM2 (315/15.9%)
| | PaMov2t > 18.5 : LM3 (91/86.9%)
QXt > 154 :
| PaMov2t-1 <= 13.5 :
| | PaMov2t <= 4.5 : LM4 (377/15.9%)
| | PaMov2t > 4.5 : LM5 (109/89.7%)
| PaMov2t-1 > 13.5 :
| | PaMov2t <= 26.5 : LM6 (135/73.1%)
| | PaMov2t > 26.5 : LM7 (49/270%)
Models at the leaves:
LM1: QXt+1 = 2.28 + 0.714PaMov2t-1 - 0.21PaMov2t + 1.02Pat-1 + 0.193Pat
- 0.0085QCt-1 + 0.336QCt + 0.771QXt
LM2: QXt+1 = -24.4 - 0.0481PaMov2t-1 - 4.96PaMov2t + 3.91Pat-1 + 4.51Pat
- 0.363QCt-1 + 0.712QCt + 1.05QXt
LM3: QXt+1 = -183 + 10.3PaMov2t-1 + 8.37PaMov2t - 5.32Pat-1 + 1.49Pat
- 0.0193QCt-1 + 0.106QCt + 2.16QXt
LM4: QXt+1 = 47.3 + 1.06PaMov2t-1 - 2.05PaMov2t + 1.91Pat-1 + 4.01Pat
- 0.3QCt-1 + 1.11QCt + 0.383QXt
LM5: QXt+1 = -151 - 0.277PaMov2t-1 - 37.8PaMov2t + 31.1Pat-1 + 30.3Pat
- 0.672QCt-1 + 0.746QCt + 0.842QXt
LM6: QXt+1 = 138 - 5.95PaMov2t-1 - 39.5PaMov2t + 29.6Pat-1 + 35.4Pat
- 0.303QCt-1 + 0.836QCt + 0.461QXt
LM7: QXt+1 = -131 - 27.2PaMov2t-1 + 51.9PaMov2t + 0.125Pat-1 - 5.29Pat
- 0.0941QCt-1 + 0.557QCt + 0.754QXt
Performance of M5 and ANN models (Huai river)
D.P. Solomatine and Y. Xue. M5 model trees compared to neural networks:
application to flood forecasting in the upper reach of the Huai River in
China. ASCE Journal of Hydrologic Engineering, 9(6), 2004, 491-501.

(Chart: observed (OBS), FS-M5 and FS-ANN discharge (m3/s) against time,
96-6-1 to 96-8-30)
M5 and ANN, flood season data (testing, fragment)



M5 model trees and ANNs in rainfall-runoff modelling:
predicting flow three hours ahead (Sieve catchment)
The model: Qt+3 = f (REt, REt-1, REt-2, REt-3, Qt, Qt-1 )
 Inputs: REt, REt-1, REt-2, REt-3, Qt, Qt-1
(rainfall for 3 past hours, runoff for 2)
(Chart: “Prediction of Qt+3: verification performance” - observed,
modelled (ANN) and modelled (MT) hydrographs, Q [m3/s] against t [hrs])
 ANN verification: RMSE=11.353, NRMSE=0.234, COE=0.9452
 MT verification: RMSE=12.548, NRMSE=0.258, COE=0.9331
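These measures are standard; a sketch of their computation (COE is the Nash-Sutcliffe coefficient of efficiency; NRMSE is assumed here to be RMSE normalized by the standard deviation of the observations, which makes COE = 1 - NRMSE² and is consistent with the quoted numbers, e.g. 0.9452 ≈ 1 - 0.234²):

```python
import math

def rmse(obs, sim):
    """Root mean squared error."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))

def nrmse(obs, sim):
    """RMSE normalized by the standard deviation of the observations."""
    m = sum(obs) / len(obs)
    sd = math.sqrt(sum((o - m) ** 2 for o in obs) / len(obs))
    return rmse(obs, sim) / sd

def coe(obs, sim):
    """Nash-Sutcliffe coefficient of efficiency: 1 - SSE/SST."""
    m = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - m) ** 2 for o in obs)
    return 1.0 - sse / sst

# made-up series, just to exercise the functions
obs = [1.0, 2.0, 3.0, 4.0]
sim = [1.1, 1.9, 3.2, 3.8]
```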



Numerical prediction by M5 model trees: conclusions

 Transparency of trees: model trees are easy to understand (even by
managers)
 An M5 model tree is a mixture of accurate local models
 Pruning (reducing size) allows:
– to prevent overfitting
– to generate a family of models of various accuracy and complexity

