Data-Driven Modelling in Water-Related Problems
PART 1
Dimitri P. Solomatine
A model is …
– a simplified description of reality
– an encapsulation of knowledge about a particular physical or social process in electronic form
[Figure: decision support systems for management - a user interface connects fact engines (data, information), judgement engines (knowledge, inference), physically-based models, knowledge-based systems and data-driven models, linked by communications to the real world.]
specific - general
model estimation (data-driven) - first principles (process) models
numerical - analytical
stochastic - deterministic
microscopic - macroscopic
discrete - continuous
qualitative - quantitative
[Figure: the model predicts a new output value y for a new input value x.]
Given measured (training) data, the linear model is fitted by minimizing

E = Σ_{t=1..T} ( y^(t) − (a0 + a1 x^(t)) )² → min
Then, for V new vectors {x^(v)}, v = 1,…,V, this equation can approximately reproduce the corresponding function values {y^(v)}, v = 1,…,V.
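A minimal runnable sketch (illustrative, not from the slides; the data values are made up) of fitting a0 and a1 by least squares with NumPy and applying the fitted equation to new inputs:

```python
# Fit y = a0 + a1*x by minimising E = sum_t (y_t - (a0 + a1*x_t))^2
import numpy as np

# Hypothetical training data {x^(t), y^(t)}, t = 1..T
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with a column of ones for the intercept a0
A = np.column_stack([np.ones_like(x), x])
(a0, a1), *_ = np.linalg.lstsq(A, y, rcond=None)

# For new inputs {x^(v)} the fitted equation reproduces y approximately
x_new = np.array([1.5, 3.5])
print(a0, a1, a0 + a1 * x_new)
```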
Data: attributes, inputs, outputs
[Figure: input data X is fed both to the real system, producing the actual (observed) output Y, and to the machine learning (data-driven) model, producing the predicted output Y'. Learning is aimed at minimizing this difference.]
The DDM “learns” the target function Y = f(X) describing how the real system behaves.
Learning = the process of minimizing the difference between observed data and model output. X and Y may be non-numeric.
After learning, when fed with new inputs, the DDM can generate output close to what the real system would generate.
Data-driven models vs physically-based models
Physically-based (conceptual) hydrological rainfall-runoff model:
P = precipitation, E = evapotranspiration, Q = runoff, SP = snow pack, SM = soil moisture, UZ = upper groundwater zone, LZ = lower groundwater zone, lakes = lake volume.
Coefficients are to be identified by calibration.
[Figure: data-driven rainfall-runoff model (artificial neural network). Inputs: precipitation P(t), moving average of precipitation MAP3(t-2), evapotranspiration E(t), runoff Q(t); output: runoff flow Q(t+1). The neurons (non-linear functions) and the connections with weights are to be identified by training.]
Using data-driven methods in rainfall-runoff modelling
[Figure: catchment sketch with rainfall Rt, flow Qt and upstream flow Qtupp.]
Available data:
– rainfalls Rt
– runoffs (flows) Qt
Inputs: lagged rainfalls Rt, Rt-1, …, Rt-L
Output to predict: Qt+T
Questions (see the sketch below):
– how to find the appropriate lags? (lags embody the physical properties of the catchment)
– how to build a non-linear regression function F?
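A sketch of the input preparation this implies (the helper make_lagged_inputs and the synthetic series are assumptions, not part of the lecture):

```python
# Build lagged-rainfall inputs R_t, R_{t-1}, ..., R_{t-L} (plus current flow)
# to predict Q_{t+T} with some regression function F.
import numpy as np

def make_lagged_inputs(R, Q, L, T):
    """Rows: [R_t, R_{t-1}, ..., R_{t-L}, Q_t]; target: Q_{t+T}."""
    rows, targets = [], []
    for t in range(L, len(R) - T):
        rows.append(np.r_[R[t - L:t + 1][::-1], Q[t]])
        targets.append(Q[t + T])
    return np.array(rows), np.array(targets)

R = np.random.rand(100)     # rainfall series (synthetic)
Q = np.random.rand(100)     # runoff series (synthetic)
X, y = make_lagged_inputs(R, Q, L=3, T=1)
print(X.shape, y.shape)     # (96, 5) (96,)
```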
Data-driven modelling:
– oriented towards building predictive models
– follows general modelling guidelines adopted in engineering
– uses the apparatus of machine learning
Proper data-driven modelling is impossible without
– good understanding of the modelled system,
– relation to the external environment and
– possible connections to other models and decision making processes
It differs from data mining in:
– application area (engineering and natural processes)
– the terminology used and the relation to physically-based models
– usually smaller data sets (and hence a possibly different choice of methods)
7. Refine the model iteratively (try different things until the model
seems as good as it is going to get).
8. Make the model as simple as possible, but no simpler; formulated also as:
– KISS ("Keep It Sufficiently Simple", or "Keep It Simple, Stupid")
– the Minimum Description Length principle: "the best model is the one that is the smallest"
– the Occam's Razor principle, formulated by William of Occam in 1320 in the following form: shave off all the "unneeded philosophy" off the explanation.
9. Define instability in the model (critical areas where small changes in inputs lead to large changes in output).
10. Define uncertainty in the model (critical areas and ranges in the data where the model produces low-confidence predictions).
Suppliers of methods for data-driven modelling
(1) Machine learning (ML)
[Figure, as before: input data X goes to the real system (actual observed output Y) and to the machine learning (data-driven) model (predicted output Y'); learning is aimed at minimizing this difference.]
The DDM tries to "learn" the target function Y = f(X) describing how the real system behaves.
Learning = the process of minimizing the difference between observed data and model output. X and Y may be non-numeric.
After learning, when fed with new inputs, the DDM can generate output close to what the real system would generate.
Training (calibration), cross-validation, testing
– training (calibration): used to build the model
– cross-validation: imitates the test set and model operation; used to test model performance during the calibration process
– testing: imitates model operation; used for the final model test after the model is built (should not be seen by the developer)
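A minimal sketch (an assumption, not the author's code) of this three-way split: training builds the model, cross-validation monitors it during calibration, and the test set stays unseen until the final check.

```python
import numpy as np

data = np.arange(1000)                        # placeholder time series
n_train, n_cv = int(0.6 * len(data)), int(0.2 * len(data))

train = data[:n_train]                        # training (calibration) set
cross_val = data[n_train:n_train + n_cv]      # cross-validation set
test = data[n_train + n_cv:]                  # final test set, kept unseen
print(len(train), len(cross_val), len(test))  # 600 200 200
```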
[Figure: three models (green, red, blue) fitted to the same data; x-axis: new input value (e.g. rainfall).]
Red model: less accurate than the blue one, but it captures the trend in the data. It will generalise well.
Question: how to determine during training when to stop improving the model? Which model is "better": green, red or blue?
Necessity of cross-validation during training
[Figure: in-sample (training) error and out-of-sample (cross-validation, or verification/testing) error vs training iterations; the good moment to stop training is where the out-of-sample error starts to rise.]
The relationship Y = F(X) to be modelled is unknown; the data {xi, yi} collected about this relationship are used to build the data-driven model.
Training: iterative refinement of the model – with each iteration, model parameters are changed to reduce the error on the training set:
– the error on the training set gets lower and lower
– the error on the test set gets lower, then starts to increase
This may lead to overfitting and a high error on the cross-validation set. The point where the cross-validation error starts to increase is the moment to stop training.
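A runnable sketch (assumed example, not from the slides) of using a cross-validation set to decide when to stop refining the model: polynomials of growing degree stand in for training iterations, and refinement stops once the cross-validation error keeps rising.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)  # noisy relationship
x_tr, y_tr = x[::2], y[::2]      # training set
x_cv, y_cv = x[1::2], y[1::2]    # cross-validation set

best_deg, best_err = None, np.inf
for degree in range(1, 15):      # "iterations" of refining the model
    coef = np.polyfit(x_tr, y_tr, degree)
    cv_err = np.mean((np.polyval(coef, x_cv) - y_cv) ** 2)
    if cv_err < best_err:
        best_deg, best_err = degree, cv_err
    elif degree > best_deg + 3:  # CV error kept rising: overfitting, stop
        break
print("stop at degree", best_deg)
```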
Data-Driven Modelling: care is needed
Classification
– on the basis of classified examples, a way of classifying unseen examples
is to be found
Association
– association between features (which combinations of values are most
frequent) is to be identified
Clustering
– groups of objects (examples) that are "close" are to be identified
Numeric prediction
– the outcome is not a class, but a numeric (real) value
– often called regression
Hypotheses
– there are various possible functions Y = f(X) (hypotheses) relating input and output
– machine learning searches through this hypothesis space to determine the one that fits the observed data and the prior knowledge of the learner
…
Interval - ordered and expressed in fixed equal units. Examples:
– dates (which cannot, however, be multiplied)
– temperature expressed in degrees Celsius
Prepare the data - this may include complex procedures of restoring missing data, data transformation, etc.
Survey the data - understand the nature of the data, get insight into the problem this data describes; this includes identification and analysis of variability, sparsity, peaks and valleys, entropy, mutual information between inputs and outputs, etc. (this step is often merged with the previous one).
Build the model.
Examples:
– in a harbour, sedimentation is measured once in two weeks at one location and once a month at two other locations, and never at other (maybe important) locations
– in a catchment, rainfall data was collected manually at three gauging stations once a day for 20 years; 3 years ago, measurements also started at 4 new automatic stations, with hourly frequency
Solutions:
– filling in missing data
– introducing an artificial resolution equal to the maximum for all variables
General form:
x'i = a xi + b
To keep data positive:
x'i = xi − min(x1 … xn) + SmallConst
Squashing data into the range [0, 1]:
x'i = ( xi − min(x1 … xn) ) / ( max(x1 … xn) − min(x1 … xn) )
Logarithmic:
x'i = log(xi)
Softmax scaling (logistic function):
L(x) = 1 / (1 + e^(−x))
[Figure: the logistic function plotted for input values from −10 to 10; output values lie between 0 and 1.]
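A short sketch (not from the slides; the sample array is made up) applying the scaling transforms listed above with NumPy:

```python
import numpy as np

x = np.array([-3.0, 0.5, 2.0, 7.0, 11.0])

shifted = x - x.min() + 1e-6                     # keep data positive
squashed = (x - x.min()) / (x.max() - x.min())   # squash into [0, 1]
logged = np.log(shifted)                         # logarithmic (positive data)
logistic = 1.0 / (1.0 + np.exp(-x))              # softmax (logistic) scaling
print(squashed, logistic)
```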
Transforming the distributions: Box-Cox transform
The first step uses the power transform to adjust the changing variance:
x'i = ( xi^λ − 1 ) / λ
where
xi is the original value,
x'i is the transformed value,
λ is a user-selected value.
The second step balances the distribution by subtracting the mean and dividing the result by the standard deviation:
x''i = ( x'i − E(x') ) / sd(x')
where
x'i is the value after the first transform,
x''i is the (final) standardized value.
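A minimal sketch of the two-step transform above (the synthetic discharge values and the choice λ = 0.3 are assumptions; λ = 0 is handled by the standard log limit of the power transform):

```python
import numpy as np

def box_cox(x, lam):
    """Step 1: power transform; step 2: standardise to zero mean, unit std."""
    x1 = (x ** lam - 1.0) / lam if lam != 0 else np.log(x)
    return (x1 - x1.mean()) / x1.std()

discharge = np.array([12.0, 35.0, 80.0, 210.0, 540.0])  # synthetic, positive
print(box_cox(discharge, lam=0.3))
```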
[Figure: original discharge hydrograph (discharge vs time [hrs]) and the transformed series after Box-Cox step 1 and the final Box-Cox transform.]
Box-Cox transform: original and resulting histograms
[Figure: frequency histogram of the original discharge [m3/s], by bins, and frequency histogram of the Box-Cox transformed discharge, by bins.]
Correlation coefficient:
R = Σ_{i=1..n} (xi − x̄)(yi − ȳ) / sqrt( Σ_{i=1..n} (xi − x̄)² · Σ_{i=1..n} (yi − ȳ)² )
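A sketch (synthetic data, not from the lecture) of computing R between the future discharge Qt+1 and rainfall lagged by L, for several candidate lags:

```python
import numpy as np

rng = np.random.default_rng(1)
R_rain = rng.random(500)
Q = np.convolve(R_rain, [0.1, 0.5, 0.3], mode="same")  # discharge lags rain

for L in range(5):
    x = R_rain[: len(R_rain) - L - 1]   # R_{t-L}
    y = Q[L + 1:]                       # Q_{t+1}
    r = np.corrcoef(x, y)[0, 1]         # same formula as above
    print(L, round(r, 3))
```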
Consider the future discharge Qt+1 and past rainfalls Rt-L. What is the lag L such that the relatedness is strongest?
AMI can be used to identify the optimal time lag: compute the AMI between Qt+1 and the past lagged rainfalls Rt-L.
[Figure: discharge [m3/s] and effective rainfall [mm] vs time [hrs], 0–2500 hrs.]
[Figure, zoom in: Q [m3/s] and R [mm] for t = 90–150 hrs, showing the lag between a rainfall event and the discharge response.]
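A sketch of average mutual information (AMI) between Qt+1 and lagged rainfall Rt-L; the 2-D histogram estimator and the synthetic series are assumptions (the slides do not prescribe an estimator):

```python
import numpy as np

def ami(x, y, bins=16):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)   # joint distribution estimate
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)     # marginals
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz]))

rng = np.random.default_rng(2)
R_rain = rng.random(2000)
Q = np.roll(R_rain, 3) + 0.1 * rng.random(2000)   # discharge lags rain

lags = {L: ami(R_rain[: -L - 1], Q[L + 1:]) for L in range(1, 7)}
print(max(lags, key=lags.get))                    # lag with strongest AMI
```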
Main ideas
[Figure: examples of class 0 and class 1 plotted in the (X1, X2) plane; a test such as "x2 > 2" (Yes/No) splits the examples into the two classes.]
Outlook   Temp  Humid.  Wind    Play?
sunny     hot   high    weak    no
sunny     hot   high    strong  no
overcast  hot   high    weak    yes
rainy     mild  high    weak    yes
rainy     cool  normal  weak    yes
rainy     cool  normal  strong  no
overcast  cool  normal  strong  yes
sunny     mild  high    weak    no
sunny     cool  normal  weak    yes
rainy     mild  normal  weak    yes
sunny     mild  normal  strong  yes
overcast  mild  high    strong  yes
overcast  hot   normal  weak    yes
rainy     mild  high    strong  no
Decision trees and the ID3 and C4.5 algorithms: each internal node is a test for an attribute; leaves give the classification of an instance.
Oops! Verification (test) set: observations that also happened - our model is wrong here:
Outlook   Temp  Humid.  Wind    Play?  | Model prediction
rainy     cool  high    strong  yes    | no (wrong)
sunny     hot   normal  strong  yes    | yes (correct)
overcast  cool  high    strong  yes    | yes (correct)
Correct? We do not know!
But a question remains:
– how to identify the "best" split attribute, i.e. how to compute the information gain?
– let's consider the Interactive Dichotomizer 3 (ID3) algorithm
Logarithms are still base 2, because entropy is the measure of the expected encoding length measured in bits.
The maximum value of the entropy is log2 c.
For example, if c = 8 and p1 = … = p8 = 0.125, then E(S) = 3 (3 bits are needed to send a message about the class number).
If all examples belong to one class, E = 0. This is what we are aiming at.
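A runnable sketch of entropy and information gain as ID3 uses them, computed for the Outlook attribute of the 14-example weather table above (the code itself is illustrative, not from the lecture):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))      # bits; 0 if only one class

play = ["no","no","yes","yes","yes","no","yes",
        "no","yes","yes","yes","yes","yes","no"]
outlook = ["sunny","sunny","overcast","rainy","rainy","rainy","overcast",
           "sunny","sunny","rainy","sunny","sunny","overcast","rainy"]

gain = entropy(play)                    # entropy before the split
for v in set(outlook):
    subset = [p for p, o in zip(play, outlook) if o == v]
    gain -= len(subset) / len(play) * entropy(subset)
print(round(gain, 3))                   # information gain of "outlook"
```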
Unknown values:
Tree may appear to be too complex (hundreds of nodes):
– difficult to interpret
– "too accurate" (overfitting) following all (even noisy) examples
Solution: tree can be reduced (pruned)
– reduced error pruning (Quinlan 87): replace a subtree by a leaf node and
assign the most common class of the associated training examples; the
new tree must be not worse on the validation set
– algorithms can be forced to have a certain minimum number of instances in a leaf
– the tree will inevitably misclassify some instances, so accuracy is decreased - but this normally removes overfitting
– moreover, smaller trees are easier to understand
Example of overfitting:
– consider adding a new, 15th, instance [sunny, hot, normal, strong, NO] - this example is "noisy" - it could be just wrong
– the new tree will have to take this (wrong) example into account (NB: a check on Temperature is added) - built by the ID3 algorithm in Weka software:

New tree (built on 15 examples):
outlook = sunny
| temperature = hot: no
| temperature = mild
| | humidity = high: no
| | humidity = normal: yes
| temperature = cool: yes
outlook = overcast: yes
outlook = rainy
| windy = strong: no
| windy = weak: yes

Original tree (built on 14 examples):
outlook = sunny
| humidity = high: no (3.0)
| humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = strong: no (2.0)
| windy = weak: yes (3.0)
(it misclassifies the 15th example)
Problems appropriate for decision tree learning
A decision tree is built, then pruned, and rules are built (PART algorithm). Example:
– if (outlook = overcast) then Play=YES (total=4 / errors=0)
– if (humidity = high) then Play=NO (total=5 / errors=1)
– otherwise Play=YES (total=5 / errors=1)
Accuracy is reduced, but the set of rules is very simple.
Polder example:
[Figure: observed water level (M) and pump discharge vs observation number, 0–2000; legend: WL, Pump.]
[Figure, zoom in: observations 900–1150.]
[Figure: water level, known pump status and pump status predicted by the model, observations 0–4000; legend: WL, Known Pump, Predicted Pump.]
[Figure, zoom in: observations 950–1000.]
[Figure, zoom in: observations 600–650.]
Linear models and their combinations in tree-like structures (M5 model trees)
[Figure: the model predicts a new output value y(v) for a new input value x(v).]
Given measured (training) data: T vectors {x^(t), y^(t)}, t = 1,…,T:

E = Σ_{t=1..T} ( y^(t) − (a0 + a1 x^(t)) )² → min
In a regression tree: input X1 is split into intervals; averaging is performed in each interval, so a leaf predicts a constant, e.g. Y = 10.5.
In an M5 model tree: input X1 is split into intervals; separate linear models can be built for each of the intervals, so a leaf predicts a linear model, e.g. Y = 2.1*X1 + 0.3*X2.
[Figure: trees with tests such as "x2 < 3.5" and "x2 < 1" (Yes/No) leading to leaves Model 1 … Model 6; Y (output) vs X1 plots of the piecewise-constant and piecewise-linear fits.]
How to select an attribute for split in regression trees and M5 model trees
Regression trees: same as in decision trees (information gain).
Main idea:
– choose the attribute that splits the portion T of the training data that reaches a particular node into subsets T1, T2, …
– use the standard deviation sd(T) of the output values in T as a measure of error at that node (in decision trees, the entropy E(T) was used)
– the split should result in subsets Ti with low standard deviation sd(Ti)
– so the model tree splitting criterion is SDR (standard deviation reduction), which has to be maximized:

SDR = sd(T) − Σ_i ( |Ti| / |T| ) · sd(Ti)

[Figure: the (X1, X2) plane partitioned into regions Model 2 … Model 6.]
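A minimal sketch of the SDR criterion above for one candidate split (the toy arrays are assumptions):

```python
# SDR = sd(T) - sum_i |T_i|/|T| * sd(T_i); maximise over candidate splits
import numpy as np

def sdr(y, split_masks):
    """y: outputs reaching the node; split_masks: boolean masks of subsets."""
    weighted = sum(m.sum() / len(y) * np.std(y[m]) for m in split_masks)
    return np.std(y) - weighted

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 2.1, 1.9, 8.0, 8.2, 7.9])
mask = x <= 3.5                       # candidate split "x <= 3.5"
print(sdr(y, [mask, ~mask]))          # large SDR -> good split
```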
smoothing:
– smoothing process is used to compensate for the sharp discontinuities
between adjacent linear models
pruning (size reduction) - needed when a large tree overfits the data:
– a subtree is replaced by one linear model
Variables considered
Output: predicted discharge QXt+1
Inputs (with different time lags):
– daily areal rainfall (Pa)
– moving average of daily areal rainfall (PaMov)
– discharges (QX) and upstream discharges (QC)
Smoothed variables have a higher correlation coefficient with the output, e.g. the 2-day moving average of rainfall (PaMov2t).
Data (1976-1996):
– daily discharges (QX, QC)
– daily rainfall at 17 stations
– daily evaporation at 3 stations for 14 years (1976-1989)
Final model for the flood season:
Output:
– discharge the next day, QXt+1
Inputs:
– Pat, Pat-1, PaMov2t, PaMov2t-1, QCt, QCt-1, QXt
Techniques used: M5 model trees, ANN
Training data: 1976-89
Cross-validation & testing: 1990-96
Resulting M5 model tree with 7 models (Huai river)
QXt <= 154 :
| PaMov2t <= 4.5 : LM1 (1499/4.86%)
| PaMov2t > 4.5 :
| | PaMov2t <= 18.5 : LM2 (315/15.9%)
| | PaMov2t > 18.5 : LM3 (91/86.9%)
QXt > 154 :
| PaMov2t-1 <= 13.5 :
| | PaMov2t <= 4.5 : LM4 (377/15.9%)
| | PaMov2t > 4.5 : LM5 (109/89.7%)
| PaMov2t-1 > 13.5 :
| | PaMov2t <= 26.5 : LM6 (135/73.1%)
| | PaMov2t > 26.5 : LM7 (49/270%)
Models at the leaves:
LM1: QXt+1 = 2.28 + 0.714PaMov2t-1 - 0.21PaMov2t + 1.02Pat-1 + 0.193Pat
- 0.0085QCt-1 + 0.336QCt + 0.771QXt
LM2: QXt+1 = -24.4 - 0.0481PaMov2t-1 - 4.96PaMov2t + 3.91Pat-1 + 4.51Pat
- 0.363QCt-1 + 0.712QCt + 1.05QXt
LM3: QXt+1 = -183 + 10.3PaMov2t-1 + 8.37PaMov2t - 5.32Pat-1 + 1.49Pat
- 0.0193QCt-1 + 0.106QCt + 2.16QXt
LM4: QXt+1 = 47.3 + 1.06PaMov2t-1 - 2.05PaMov2t + 1.91Pat-1 + 4.01Pat
- 0.3QCt-1 + 1.11QCt + 0.383QXt
LM5: QXt+1 = -151 - 0.277PaMov2t-1 - 37.8PaMov2t + 31.1Pat-1 + 30.3Pat
- 0.672QCt-1 + 0.746QCt + 0.842QXt
LM6: QXt+1 = 138 - 5.95PaMov2t-1 - 39.5PaMov2t + 29.6Pat-1 + 35.4Pat
- 0.303QCt-1 + 0.836QCt + 0.461QXt
LM7: QXt+1 = -131 - 27.2PaMov2t-1 + 51.9PaMov2t + 0.125Pat-1 - 5.29Pat
- 0.0941QCt-1 + 0.557QCt + 0.754QXt
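A sketch (hand-coded from the tree and LM1 listed above; the remaining leaves would be coded the same way) of how the M5 tree routes an input to one of its linear models:

```python
def predict_QX_next(QXt, PaMov2t, PaMov2t_1, Pat, Pat_1, QCt, QCt_1):
    """Follow the splits of the M5 tree, then evaluate the leaf's linear model."""
    if QXt <= 154:
        if PaMov2t <= 4.5:            # leaf LM1 (coefficients from the listing)
            return (2.28 + 0.714 * PaMov2t_1 - 0.21 * PaMov2t
                    + 1.02 * Pat_1 + 0.193 * Pat
                    - 0.0085 * QCt_1 + 0.336 * QCt + 0.771 * QXt)
        # ... LM2 / LM3 as in the listing above
    # ... LM4 - LM7 for QXt > 154
    raise NotImplementedError("remaining leaves omitted in this sketch")
```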
Performance of M5 and ANN models (Huai river)
D.P. Solomatine and Y. Xue. M5 model trees compared to neural networks: application to flood forecasting in the upper reach of the Huai River in China. ASCE Journal of Hydrologic Engineering, 9(6), 2004, 491-501.
[Figure: observed discharge (OBS) and the FS-M5 and FS-ANN forecasts, discharge (m3/s) vs time, 96-6-1 to 96-8-30.]
[Figure: forecast vs time, t = 0–180 hrs.]
Verification statistics:
– RMSE = 11.353, NRMSE = 0.234, COE = 0.9452
– MT verification: RMSE = 12.548, NRMSE = 0.258, COE = 0.9331