Data-Driven Modelling in Water-Related Problems
PART 1
Dimitri P. Solomatine
A model is …
– a simplified description of reality
– an encapsulation of knowledge about a particular physical or social process in electronic form
[Figure: decision support systems for management - a user interface connects fact engines (data, information), judgement engines (knowledge, inference), physically-based models, knowledge-based systems and data-driven models, linked by communications to the real world.]
specific - general
model estimation (data-driven) - first principles (process) models
numerical - analytical
stochastic - deterministic
microscopic - macroscopic
discrete - continuous
qualitative - quantitative
[Figure: the model predicts a new output value y for a new input value x.]
Given measured (training) data, the linear model is fitted by minimizing

E = Σ_{t=1..T} ( y^(t) − (a0 + a1 x^(t)) )² → min
Then, for V new vectors {x^(v)}, v = 1,…,V, this equation can approximately reproduce the corresponding function values {y^(v)}, v = 1,…,V.
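A minimal runnable sketch (illustrative, not from the slides; the data values are made up) of fitting a0 and a1 by least squares with NumPy and applying the fitted equation to new inputs:

```python
# Fit y = a0 + a1*x by minimising E = sum_t (y_t - (a0 + a1*x_t))^2
import numpy as np

# Hypothetical training data {x^(t), y^(t)}, t = 1..T
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with a column of ones for the intercept a0
A = np.column_stack([np.ones_like(x), x])
(a0, a1), *_ = np.linalg.lstsq(A, y, rcond=None)

# For new inputs {x^(v)} the fitted equation reproduces y approximately
x_new = np.array([1.5, 3.5])
print(a0, a1, a0 + a1 * x_new)
```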
Data: attributes, inputs, outputs
[Figure: input data X is fed both to the real system, producing the actual (observed) output Y, and to the machine learning (data-driven) model, producing the predicted output Y'. Learning is aimed at minimizing this difference.]
The DDM “learns” the target function Y = f(X) describing how the real system behaves.
Learning = the process of minimizing the difference between observed data and model output. X and Y may be non-numeric.
After learning, when fed with new inputs, the DDM can generate output close to what the real system would generate.
Data-driven models vs physically-based models
Physically-based (conceptual) hydrological rainfall-runoff model:
P = precipitation, E = evapotranspiration, Q = runoff, SP = snow pack, SM = soil moisture, UZ = upper groundwater zone, LZ = lower groundwater zone, lakes = lake volume.
Coefficients are to be identified by calibration.
[Figure: data-driven rainfall-runoff model (artificial neural network). Inputs: precipitation P(t), moving average of precipitation MAP3(t-2), evapotranspiration E(t), runoff Q(t); output: runoff flow Q(t+1). The neurons (non-linear functions) and the connections with weights are to be identified by training.]
Using data-driven methods in rainfall-runoff modelling
[Figure: catchment sketch with rainfall Rt, flow Qt and upstream flow Qtupp.]
Available data:
– rainfalls Rt
– runoffs (flows) Qt
Inputs: lagged rainfalls Rt, Rt-1, …, Rt-L
Output to predict: Qt+T
Questions (see the sketch below):
– how to find the appropriate lags? (lags embody the physical properties of the catchment)
– how to build a non-linear regression function F?
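A sketch of the input preparation this implies (the helper make_lagged_inputs and the synthetic series are assumptions, not part of the lecture):

```python
# Build lagged-rainfall inputs R_t, R_{t-1}, ..., R_{t-L} (plus current flow)
# to predict Q_{t+T} with some regression function F.
import numpy as np

def make_lagged_inputs(R, Q, L, T):
    """Rows: [R_t, R_{t-1}, ..., R_{t-L}, Q_t]; target: Q_{t+T}."""
    rows, targets = [], []
    for t in range(L, len(R) - T):
        rows.append(np.r_[R[t - L:t + 1][::-1], Q[t]])
        targets.append(Q[t + T])
    return np.array(rows), np.array(targets)

R = np.random.rand(100)     # rainfall series (synthetic)
Q = np.random.rand(100)     # runoff series (synthetic)
X, y = make_lagged_inputs(R, Q, L=3, T=1)
print(X.shape, y.shape)     # (96, 5) (96,)
```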
Data-driven modelling:
– oriented towards building predictive models
– follows general modelling guidelines adopted in engineering
– uses the apparatus of machine learning
Proper data-driven modelling is impossible without
– good understanding of the modelled system,
– relation to the external environment and
– possible connections to other models and decision making processes
It differs from data mining in:
– application area (engineering and natural processes)
– the terminology used and the relation to physically-based models
– usually smaller data sets (and hence a possibly different choice of methods)
7. Refine the model iteratively (try different things until the model
seems as good as it is going to get).
8. Make the model as simple as possible, but no simpler; formulated also as:
– KISS ("Keep It Sufficiently Simple", or "Keep It Simple, Stupid")
– the Minimum Description Length principle: "the best model is the one that is the smallest"
– the Occam's Razor principle, formulated by William of Occam in 1320 in the following form: shave off all the "unneeded philosophy" off the explanation.
9. Define instability in the model (critical areas where small changes in inputs lead to large changes in output).
10. Define uncertainty in the model (critical areas and ranges in the data where the model produces low-confidence predictions).
Suppliers of methods for data-driven modelling
(1) Machine learning (ML)
[Figure, as before: input data X goes to the real system (actual observed output Y) and to the machine learning (data-driven) model (predicted output Y'); learning is aimed at minimizing this difference.]
The DDM tries to "learn" the target function Y = f(X) describing how the real system behaves.
Learning = the process of minimizing the difference between observed data and model output. X and Y may be non-numeric.
After learning, when fed with new inputs, the DDM can generate output close to what the real system would generate.
Training (calibration), cross-validation, testing
– training (calibration): used to build the model
– cross-validation: imitates the test set and model operation; used to test model performance during the calibration process
– testing: imitates model operation; used for the final model test after the model is built (should not be seen by the developer)
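A minimal sketch (an assumption, not the author's code) of this three-way split: training builds the model, cross-validation monitors it during calibration, and the test set stays unseen until the final check.

```python
import numpy as np

data = np.arange(1000)                        # placeholder time series
n_train, n_cv = int(0.6 * len(data)), int(0.2 * len(data))

train = data[:n_train]                        # training (calibration) set
cross_val = data[n_train:n_train + n_cv]      # cross-validation set
test = data[n_train + n_cv:]                  # final test set, kept unseen
print(len(train), len(cross_val), len(test))  # 600 200 200
```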
[Figure: three models (green, red, blue) fitted to the same data; x-axis: new input value (e.g. rainfall).]
Red model: less accurate than the blue one, but it captures the trend in the data. It will generalise well.
Question: how to determine during training when to stop improving the model? Which model is "better": green, red or blue?
Necessity of cross-validation during training
[Figure: in-sample (training) error and out-of-sample (cross-validation, or verification/testing) error vs training iterations; the good moment to stop training is where the out-of-sample error starts to rise.]
The relationship Y = F(X) to be modelled is unknown; the data {xi, yi} collected about this relationship are used to build the data-driven model.
Training: iterative refinement of the model – with each iteration, model parameters are changed to reduce the error on the training set:
– the error on the training set gets lower and lower
– the error on the test set gets lower, then starts to increase
This may lead to overfitting and a high error on the cross-validation set. The point where the cross-validation error starts to increase is the moment to stop training.
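A runnable sketch (assumed example, not from the slides) of using a cross-validation set to decide when to stop refining the model: polynomials of growing degree stand in for training iterations, and refinement stops once the cross-validation error keeps rising.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)  # noisy relationship
x_tr, y_tr = x[::2], y[::2]      # training set
x_cv, y_cv = x[1::2], y[1::2]    # cross-validation set

best_deg, best_err = None, np.inf
for degree in range(1, 15):      # "iterations" of refining the model
    coef = np.polyfit(x_tr, y_tr, degree)
    cv_err = np.mean((np.polyval(coef, x_cv) - y_cv) ** 2)
    if cv_err < best_err:
        best_deg, best_err = degree, cv_err
    elif degree > best_deg + 3:  # CV error kept rising: overfitting, stop
        break
print("stop at degree", best_deg)
```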
Data-Driven Modelling: care is needed
Classification
– on the basis of classified examples, a way of classifying unseen examples
is to be found
Association
– association between features (which combinations of values are most
frequent) is to be identified
Clustering
– groups of objects (examples) that are "close" are to be identified
Numeric prediction
– the outcome is not a class, but a numeric (real) value
– often called regression
Hypotheses
– there are various possible functions Y = f(X) (hypotheses) relating input and output
– machine learning searches through this hypothesis space to determine the one that fits the observed data and the prior knowledge of the learner
…
Interval - ordered and expressed in fixed equal units. Examples:
– dates (which cannot, however, be multiplied)
– temperature expressed in degrees Celsius
Prepare the data - this may include complex procedures of restoring missing data, data transformation, etc.
Survey the data - understand the nature of the data, get insight into the problem this data describes; this includes identification and analysis of variability, sparsity, peaks and valleys, entropy, mutual information between inputs and outputs, etc. (this step is often merged with the previous one).
Build the model.
Examples:
– in a harbour, sedimentation is measured once in two weeks at one location and once a month at two other locations, and never at other (maybe important) locations
– in a catchment, rainfall data was collected manually at three gauging stations once a day for 20 years; 3 years ago, measurements also started at 4 new automatic stations, with hourly frequency
Solutions:
– filling in missing data
– introducing an artificial resolution equal to the maximum for all variables
General form:
x'i = a xi + b
To keep data positive:
x'i = xi − min(x1 … xn) + SmallConst
Squashing data into the range [0, 1]:
x'i = ( xi − min(x1 … xn) ) / ( max(x1 … xn) − min(x1 … xn) )
Logarithmic:
x'i = log(xi)
Softmax scaling (logistic function):
L(x) = 1 / (1 + e^(−x))
[Figure: the logistic function plotted for input values from −10 to 10; output values lie between 0 and 1.]
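A short sketch (not from the slides; the sample array is made up) applying the scaling transforms listed above with NumPy:

```python
import numpy as np

x = np.array([-3.0, 0.5, 2.0, 7.0, 11.0])

shifted = x - x.min() + 1e-6                     # keep data positive
squashed = (x - x.min()) / (x.max() - x.min())   # squash into [0, 1]
logged = np.log(shifted)                         # logarithmic (positive data)
logistic = 1.0 / (1.0 + np.exp(-x))              # softmax (logistic) scaling
print(squashed, logistic)
```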
Transforming the distributions: Box-Cox transform
The first step uses the power transform to adjust the changing variance:
x'i = ( xi^λ − 1 ) / λ
where
xi is the original value,
x'i is the transformed value,
λ is a user-selected value.
The second step balances the distribution by subtracting the mean and dividing the result by the standard deviation:
x''i = ( x'i − E(x') ) / sd(x')
where
x'i is the value after the first transform,
x''i is the (final) standardized value.
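A minimal sketch of the two-step transform above (the synthetic discharge values and the choice λ = 0.3 are assumptions; λ = 0 is handled by the standard log limit of the power transform):

```python
import numpy as np

def box_cox(x, lam):
    """Step 1: power transform; step 2: standardise to zero mean, unit std."""
    x1 = (x ** lam - 1.0) / lam if lam != 0 else np.log(x)
    return (x1 - x1.mean()) / x1.std()

discharge = np.array([12.0, 35.0, 80.0, 210.0, 540.0])  # synthetic, positive
print(box_cox(discharge, lam=0.3))
```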
[Figure: original discharge hydrograph (discharge vs time [hrs]) and the transformed series after Box-Cox step 1 and the final Box-Cox transform.]
Box-Cox transform: original and resulting histograms
[Figure: frequency histogram of the original discharge [m3/s], by bins, and frequency histogram of the Box-Cox transformed discharge, by bins.]
Correlation coefficient:
R = Σ_{i=1..n} (xi − x̄)(yi − ȳ) / sqrt( Σ_{i=1..n} (xi − x̄)² · Σ_{i=1..n} (yi − ȳ)² )
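A sketch (synthetic data, not from the lecture) of computing R between the future discharge Qt+1 and rainfall lagged by L, for several candidate lags:

```python
import numpy as np

rng = np.random.default_rng(1)
R_rain = rng.random(500)
Q = np.convolve(R_rain, [0.1, 0.5, 0.3], mode="same")  # discharge lags rain

for L in range(5):
    x = R_rain[: len(R_rain) - L - 1]   # R_{t-L}
    y = Q[L + 1:]                       # Q_{t+1}
    r = np.corrcoef(x, y)[0, 1]         # same formula as above
    print(L, round(r, 3))
```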
Consider the future discharge Qt+1 and past rainfalls Rt-L. What is the lag L such that the relatedness is strongest?
AMI can be used to identify the optimal time lag: compute the AMI between Qt+1 and the past lagged rainfalls Rt-L.
[Figure: discharge [m3/s] and effective rainfall [mm] vs time [hrs], 0–2500 hrs.]
[Figure, zoom in: Q [m3/s] and R [mm] for t = 90–150 hrs, showing the lag between a rainfall event and the discharge response.]
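A sketch of average mutual information (AMI) between Qt+1 and lagged rainfall Rt-L; the 2-D histogram estimator and the synthetic series are assumptions (the slides do not prescribe an estimator):

```python
import numpy as np

def ami(x, y, bins=16):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)   # joint distribution estimate
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)     # marginals
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz]))

rng = np.random.default_rng(2)
R_rain = rng.random(2000)
Q = np.roll(R_rain, 3) + 0.1 * rng.random(2000)   # discharge lags rain

lags = {L: ami(R_rain[: -L - 1], Q[L + 1:]) for L in range(1, 7)}
print(max(lags, key=lags.get))                    # lag with strongest AMI
```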
Main ideas
[Figure: examples of class 0 and class 1 plotted in the (X1, X2) plane; a test such as "x2 > 2" (Yes/No) splits the examples into the two classes.]
Outlook   Temp  Humid.  Wind    Play?
sunny     hot   high    weak    no
sunny     hot   high    strong  no
overcast  hot   high    weak    yes
rainy     mild  high    weak    yes
rainy     cool  normal  weak    yes
rainy     cool  normal  strong  no
overcast  cool  normal  strong  yes
sunny     mild  high    weak    no
sunny     cool  normal  weak    yes
rainy     mild  normal  weak    yes
sunny     mild  normal  strong  yes
overcast  mild  high    strong  yes
overcast  hot   normal  weak    yes
rainy     mild  high    strong  no
Decision trees and the ID3 and C4.5 algorithms: each internal node is a test for an attribute; leaves give the classification of an instance.
Oops! Verification (test) set: observations that also happened - our model is wrong here:
Outlook   Temp  Humid.  Wind    Play?  | Model prediction
rainy     cool  high    strong  yes    | no (wrong)
sunny     hot   normal  strong  yes    | yes (correct)
overcast  cool  high    strong  yes    | yes (correct)
Correct? We do not know!
But a question remains:
– how to identify the "best" split attribute, i.e. how to compute the information gain?
– let's consider the Interactive Dichotomizer 3 (ID3) algorithm
Logarithms are still base 2, because entropy is the measure of the expected encoding length measured in bits.
The maximum value of the entropy is log2 c.
For example, if c = 8 and p1 = … = p8 = 0.125, then E(S) = 3 (3 bits are needed to send a message about the class number).
If all examples belong to one class, E = 0. This is what we are aiming at.
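A runnable sketch of entropy and information gain as ID3 uses them, computed for the Outlook attribute of the 14-example weather table above (the code itself is illustrative, not from the lecture):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))      # bits; 0 if only one class

play = ["no","no","yes","yes","yes","no","yes",
        "no","yes","yes","yes","yes","yes","no"]
outlook = ["sunny","sunny","overcast","rainy","rainy","rainy","overcast",
           "sunny","sunny","rainy","sunny","sunny","overcast","rainy"]

gain = entropy(play)                    # entropy before the split
for v in set(outlook):
    subset = [p for p, o in zip(play, outlook) if o == v]
    gain -= len(subset) / len(play) * entropy(subset)
print(round(gain, 3))                   # information gain of "outlook"
```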
Unknown values:
Tree may appear to be too complex (hundreds of nodes):
– difficult to interpret
– "too accurate" (overfitting) following all (even noisy) examples
Solution: tree can be reduced (pruned)
– reduced error pruning (Quinlan 87): replace a subtree by a leaf node and
assign the most common class of the associated training examples; the
new tree must be not worse on the validation set
– algorithms can be forced to have a certain minimum number of instances in a leaf
– the tree will inevitably misclassify some instances, so accuracy is decreased - but this normally removes overfitting
– moreover, smaller trees are easier to understand
Example of overfitting:
– consider adding a new, 15th, instance [sunny, hot, normal, strong, NO] - this example is "noisy" - it could be just wrong
– the new tree will have to take this (wrong) example into account (NB: a check on Temperature is added) - built by the ID3 algorithm in Weka software:

New tree (built on 15 examples):
outlook = sunny
| temperature = hot: no
| temperature = mild
| | humidity = high: no
| | humidity = normal: yes
| temperature = cool: yes
outlook = overcast: yes
outlook = rainy
| windy = strong: no
| windy = weak: yes

Original tree (built on 14 examples):
outlook = sunny
| humidity = high: no (3.0)
| humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = strong: no (2.0)
| windy = weak: yes (3.0)
(it misclassifies the 15th example)
Problems appropriate for decision tree learning
A decision tree is built, then pruned, and rules are built (PART algorithm). Example:
– if (outlook = overcast) then Play=YES (total=4 / errors=0)
– if (humidity = high) then Play=NO (total=5 / errors=1)
– otherwise Play=YES (total=5 / errors=1)
Accuracy is reduced, but the set of rules is very simple.
Polder example:
[Figure: observed water level (M) and pump discharge vs observation number, 0–2000; legend: WL, Pump.]
[Figure, zoom in: observations 900–1150.]
[Figure: water level, known pump status and pump status predicted by the model, observations 0–4000; legend: WL, Known Pump, Predicted Pump.]
[Figure, zoom in: observations 950–1000.]
[Figure, zoom in: observations 600–650.]
Linear models and their combinations in tree-like structures (M5 model trees)
[Figure: the model predicts a new output value y(v) for a new input value x(v).]
Given measured (training) data: T vectors {x^(t), y^(t)}, t = 1,…,T:

E = Σ_{t=1..T} ( y^(t) − (a0 + a1 x^(t)) )² → min
In a regression tree: input X1 is split into intervals; averaging is performed in each interval, so a leaf predicts a constant, e.g. Y = 10.5.
In an M5 model tree: input X1 is split into intervals; separate linear models can be built for each of the intervals, so a leaf predicts a linear model, e.g. Y = 2.1*X1 + 0.3*X2.
[Figure: trees with tests such as "x2 < 3.5" and "x2 < 1" (Yes/No) leading to leaves Model 1 … Model 6; Y (output) vs X1 plots of the piecewise-constant and piecewise-linear fits.]
How to select an attribute for split in regression trees and M5 model trees
Regression trees: same as in decision trees (information gain).
Main idea:
– choose the attribute that splits the portion T of the training data that reaches a particular node into subsets T1, T2, …
– use the standard deviation sd(T) of the output values in T as a measure of error at that node (in decision trees, the entropy E(T) was used)
– the split should result in subsets Ti with low standard deviation sd(Ti)
– so the model tree splitting criterion is SDR (standard deviation reduction), which has to be maximized:

SDR = sd(T) − Σ_i ( |Ti| / |T| ) · sd(Ti)

[Figure: the (X1, X2) plane partitioned into regions Model 2 … Model 6.]
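A minimal sketch of the SDR criterion above for one candidate split (the toy arrays are assumptions):

```python
# SDR = sd(T) - sum_i |T_i|/|T| * sd(T_i); maximise over candidate splits
import numpy as np

def sdr(y, split_masks):
    """y: outputs reaching the node; split_masks: boolean masks of subsets."""
    weighted = sum(m.sum() / len(y) * np.std(y[m]) for m in split_masks)
    return np.std(y) - weighted

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 2.1, 1.9, 8.0, 8.2, 7.9])
mask = x <= 3.5                       # candidate split "x <= 3.5"
print(sdr(y, [mask, ~mask]))          # large SDR -> good split
```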
smoothing:
– smoothing process is used to compensate for the sharp discontinuities
between adjacent linear models
pruning (size reduction) - needed when a large tree overfits the data:
– a subtree is replaced by one linear model
Variables considered
Output: predicted discharge QXt+1
Inputs (with different time lags):
– daily areal rainfall (Pa)
– moving average of daily areal rainfall (PaMov)
– discharges (QX) and upstream discharges (QC)
Smoothed variables have a higher correlation coefficient with the output, e.g. the 2-day moving average of rainfall (PaMov2t).
Data (1976-1996):
– daily discharges (QX, QC)
– daily rainfall at 17 stations
– daily evaporation at 3 stations for 14 years (1976-1989)
Final model for the flood season:
Output:
– discharge the next day, QXt+1
Inputs:
– Pat, Pat-1, PaMov2t, PaMov2t-1, QCt, QCt-1, QXt
Techniques used: M5 model trees, ANN
Training data: 1976-89
Cross-validation & testing: 1990-96
Resulting M5 model tree with 7 models (Huai river)
QXt <= 154 :
| PaMov2t <= 4.5 : LM1 (1499/4.86%)
| PaMov2t > 4.5 :
| | PaMov2t <= 18.5 : LM2 (315/15.9%)
| | PaMov2t > 18.5 : LM3 (91/86.9%)
QXt > 154 :
| PaMov2t-1 <= 13.5 :
| | PaMov2t <= 4.5 : LM4 (377/15.9%)
| | PaMov2t > 4.5 : LM5 (109/89.7%)
| PaMov2t-1 > 13.5 :
| | PaMov2t <= 26.5 : LM6 (135/73.1%)
| | PaMov2t > 26.5 : LM7 (49/270%)
Models at the leaves:
LM1: QXt+1 = 2.28 + 0.714PaMov2t-1 - 0.21PaMov2t + 1.02Pat-1 + 0.193Pat
- 0.0085QCt-1 + 0.336QCt + 0.771QXt
LM2: QXt+1 = -24.4 - 0.0481PaMov2t-1 - 4.96PaMov2t + 3.91Pat-1 + 4.51Pat
- 0.363QCt-1 + 0.712QCt + 1.05QXt
LM3: QXt+1 = -183 + 10.3PaMov2t-1 + 8.37PaMov2t - 5.32Pat-1 + 1.49Pat
- 0.0193QCt-1 + 0.106QCt + 2.16QXt
LM4: QXt+1 = 47.3 + 1.06PaMov2t-1 - 2.05PaMov2t + 1.91Pat-1 + 4.01Pat
- 0.3QCt-1 + 1.11QCt + 0.383QXt
LM5: QXt+1 = -151 - 0.277PaMov2t-1 - 37.8PaMov2t + 31.1Pat-1 + 30.3Pat
- 0.672QCt-1 + 0.746QCt + 0.842QXt
LM6: QXt+1 = 138 - 5.95PaMov2t-1 - 39.5PaMov2t + 29.6Pat-1 + 35.4Pat
- 0.303QCt-1 + 0.836QCt + 0.461QXt
LM7: QXt+1 = -131 - 27.2PaMov2t-1 + 51.9PaMov2t + 0.125Pat-1 - 5.29Pat
- 0.0941QCt-1 + 0.557QCt + 0.754QXt
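A sketch (hand-coded from the tree and LM1 listed above; the remaining leaves would be coded the same way) of how the M5 tree routes an input to one of its linear models:

```python
def predict_QX_next(QXt, PaMov2t, PaMov2t_1, Pat, Pat_1, QCt, QCt_1):
    """Follow the splits of the M5 tree, then evaluate the leaf's linear model."""
    if QXt <= 154:
        if PaMov2t <= 4.5:            # leaf LM1 (coefficients from the listing)
            return (2.28 + 0.714 * PaMov2t_1 - 0.21 * PaMov2t
                    + 1.02 * Pat_1 + 0.193 * Pat
                    - 0.0085 * QCt_1 + 0.336 * QCt + 0.771 * QXt)
        # ... LM2 / LM3 as in the listing above
    # ... LM4 - LM7 for QXt > 154
    raise NotImplementedError("remaining leaves omitted in this sketch")
```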
Performance of M5 and ANN models (Huai river)
D.P. Solomatine and Y. Xue. M5 model trees compared to neural networks: application to flood forecasting in the upper reach of the Huai River in China. ASCE Journal of Hydrologic Engineering, 9(6), 2004, 491-501.
[Figure: observed discharge (OBS) and the FS-M5 and FS-ANN forecasts, discharge (m3/s) vs time, 96-6-1 to 96-8-30.]
[Figure: forecast vs time, t = 0–180 hrs.]
Verification statistics:
– RMSE = 11.353, NRMSE = 0.234, COE = 0.9452
– MT verification: RMSE = 12.548, NRMSE = 0.258, COE = 0.9331