
Stochastic generation of daily rainfall for catchment water

management studies

Author:
Harrold, Timothy Ives
Publication Date:
2002
DOI:
https://doi.org/10.26190/unsworks/20428
License:
https://creativecommons.org/licenses/by-nc-nd/3.0/au/
Link to license to see what you are allowed to do with this resource.

Downloaded from http://hdl.handle.net/1959.4/18640 in unsworks.unsw.edu.au on 2024-01-06
Stochastic Generation of Daily Rainfall for
Catchment Water Management Studies

Timothy Ives Harrold, B.E.(Hons) (Newcastle), M.Nat.Res. (U.N.E.)

A thesis submitted for the degree of Doctor of Philosophy


at the University of New South Wales, Sydney, Australia.

May 2002.
Abstract

This thesis presents an approach for generating long synthetic sequences of


single-site daily rainfall which can incorporate low-frequency features such as
drought, while still accurately representing the day-to-day variations in rainfall.
The approach is implemented in a two-stage process. The first stage is to generate
the entire sequence of rainfall occurrence (i.e. whether each day is dry or wet).
The second stage is to generate the rainfall amount on all wet days in the
sequence. The models used in both stages are nonparametric (they make minimal
general assumptions rather than specific assumptions about the distributional and
dependence characteristics of the variables involved), and ensure an appropriate
representation of the seasonal variations in rainfall. A key aspect in formulation of
the models is selection of the predictor variables used to represent the historical
features of the rainfall record. Methods for selection of the predictors are
presented here. The approach is applied to daily rainfall from Sydney and
Melbourne. The models that are developed use daily-level, seasonal-level, annual-
level, and multi-year predictors for rainfall occurrence, and daily-level and
annual-level predictors for rainfall amount. The resulting generated sequences
provide a better representation of the variability associated with droughts and
sustained wet periods than was previously possible. These sequences will be
useful in catchment water management studies as a tool for exploring the potential
response of catchments to possible future rainfall.

Dedication

This thesis is dedicated to the memory of my father, Dr. Ross Ives Harrold, 1939-
2001. Dad, I wish you could be here so we could celebrate and enjoy together the
achievement that this thesis represents.

“Lord, you have been our refuge from one generation to another. Satisfy us by your
loving-kindness in the morning; so shall we rejoice and be glad all the days of our life.
May the graciousness of the Lord our God be upon us. Prosper the work of our hands.”
Psalm 90:1,14,17.

Certificate of Originality

I hereby declare that this submission is my own work and to the best of my
knowledge it contains no material previously published or written by another
person, nor material which to a substantial extent has been accepted for the award
of any other degree or diploma at UNSW or any other educational institution,
except where due acknowledgement is made in this thesis. Any contribution made
to the research by others, with whom I have worked at UNSW or elsewhere, is
explicitly acknowledged in the thesis.

I also declare that the intellectual content of this thesis is the product of my own
work, except to the extent that assistance from others in the project’s design and
conception or in style, presentation and linguistic expression is acknowledged.

Timothy Ives Harrold

Acknowledgements

I gratefully acknowledge the Australian Research Council and the New South
Wales Department of Land and Water Conservation for funding this research.

I am very grateful for the help and support of my supervisor, Dr. Ashish Sharma,
who introduced me to this research area and persuaded me to take on the work.
Ashish, thank you for your encouragement, enthusiasm, availability, and many
helpful suggestions.

My co-supervisor, Professor Simon Sheather, is an inspiration to me. Simon, your


constant encouragement at our weekly meetings has meant a great deal to me.
Thank you for your input into this thesis.

My thanks also go to Associate Professor Ian Cordery for being there when I
needed advice, and to Dr. Dugald Black, my industry co-supervisor, for his input
into the direction of the research, and for his timely and insightful comments on
drafts of the chapters that make up this thesis. I also acknowledge the comments
of two anonymous reviewers, and my three examiners.

The staff support, computer facilities, and office space provided by the School of
Civil and Environmental Engineering at the University of New South Wales are
gratefully acknowledged. Thank you in particular to Les Brown, Ian Gilbert,
Karenne Irvine, Patricia McLaughlin, Julie O’Keefe, Jong Perng, Vir Sardana,
Angela Spano, Gareth Swarbrick, and Betty Wong. Thank you also to the helpful
staff in the University library.

Finally, I would like to thank my mother, Beryl, for proofreading this thesis and
providing helpful comments. And heartfelt thanks go to my wife, Sharon, for her
love, support and understanding. Without her, this thesis would not have been
possible. “I thank my God every time I think of you” (Philippians 1:3).

Table of Contents

ABSTRACT 2

DEDICATION 3

DECLARATION 3

TABLE OF CONTENTS 5

LIST OF TABLES 9

LIST OF FIGURES 11

1. INTRODUCTION 14

1.1 Motivation 14

1.2 Objectives of the Research 16

1.3 Outline of the Thesis 17

2. BACKGROUND TO THE RESEARCH 20

2.1 Characteristics of rainfall data, and characteristics of the Australian


rainfall record 20

2.2 Current practice in formulating hydrologic inputs for catchment


water management studies 21

2.3 Stochastic generation of hydrologic data 23


2.3.1 Generation of yearly and seasonal records 23
2.3.2 Models with long memory 25
2.3.3 Models for stochastic generation of daily rainfall 28
2.3.4 Problems in modelling daily rainfall 29

3. SELECTION OF PREDICTOR VARIABLES FOR A ONE-DAY-
AHEAD FORECAST OF DAILY RAINFALL OCCURRENCE 36

3.1 Introduction 36

3.2 Partial Information 38


3.2.1 Theoretical background 38
3.2.2 Sample estimates of partial information 40
3.2.3 A nonparametric measure of significance 42
3.2.4 An example to illustrate the applicability of MI and PI for
synthetically generated binary random variables 42
3.2.5 Partial informational correlation (PIC) 45

3.3 Selection of predictors for rainfall occurrence 46


3.3.1 Methodology 46
3.3.2 Stepwise predictor selection algorithm 48

3.4 Application to daily rainfall data 49


3.4.1 Testing with synthetically generated daily rainfall data 49
3.4.2 Identification of predictors for rainfall occurrence 54

3.5 Forecasting rainfall occurrence using the identified predictors 58

3.6 Conclusions 62

4. A NONPARAMETRIC MODEL FOR DAILY RAINFALL


OCCURRENCE THAT REPRODUCES LONG-TERM
VARIABILITY 65

4.1 Introduction 65

4.2 The resampling model for rainfall occurrence 67

4.3 A method for predictor selection 73

4.4 Application of the resampling model 76


4.4.1 Implementation Details 76

4.4.2 Results for Sydney 77
4.4.3 Results for Melbourne 94

4.5 Conclusions 101

5. A MODEL FOR STOCHASTIC GENERATION OF DAILY


RAINFALL AMOUNTS 106

5.1 Introduction 106

5.2 Incorporating longer-term variability into a daily rainfall model 108

5.3 A nonparametric model for rainfall amounts on wet days 109

5.4 Conditional modelling of wet day amounts 112

5.5 Implementation of the model for amounts 115

5.6 Results for the rainfall generator 119

5.7 Conclusions 128

6. CONCLUSIONS 132

7. REFERENCES 140

APPENDIX A. SELECTION OF A KERNEL BANDWIDTH FOR


MEASURING DEPENDENCE IN HYDROLOGIC TIME SERIES
USING THE MUTUAL INFORMATION CRITERION 146

A.1 Introduction 147

A.2 Background 150


A.2.1 The mutual information criterion as a measure of dependence 150
A.2.2 Kernel density estimation of the mutual information criterion 151

A.3 Empirical trials to find a practical choice of a 156

A.4 Results 163
A.4.1 Results of small-sample trials 163
A.4.2 MSE values and selection of the best a choices 163

A.5 Conclusions 168

A.6 References 170

APPENDIX B. SUPPLEMENT TO CHAPTER 4: A “PANEL OF


PLOTS” APPROACH TO PREDICTOR SELECTION 172

APPENDIX C. SUPPLEMENT TO CHAPTER 5: STOCHASTIC


GENERATION OF DAILY RAINFALL AMOUNTS 175

C.1 Bandwidth selection for kernel density estimation 175


C.1.1 Univariate case 175
C.1.2 Conditional case 178
C.1.3 Testing and implementation of the bandwidth selection rules 178

C.2 Long-term variability and wet day classes 180

C.3 Detailed results for Sydney 182

C.4 Detailed results for Melbourne 189

C.5 References 197

APPENDIX D. PUBLICATIONS LIST 198

List of Tables
Table 2.1. Average rainfall per year at Tenterfield 1936-1996 ........................... 33
Table 3.1 Test cases for trivariate two-state discrete (binary) data. .................... 44
Table 3.2 Description of the D-day or Y-year Wetness Index. ............................ 47
Table 3.3 Selected predictors for test sequences. ................................................ 51
Table 3.4 Selected predictors for rainfall occurrence .......................................... 55
Table 3.5 Mean Square Error for forecasts of Melbourne rainfall occurrence. ... 60
Table 3.6 Mean Square Error for forecasts of Sydney rainfall occurrence. ........ 61
Table 4.2 Resampling models selected for Sydney rainfall occurrence. ............. 86
Table 4.3 Predictors used in models for Sydney rainfall occurrence. ................. 87
Table 4.4 Summary of results for ROG models for Sydney................................ 89
Table 4.5 Summary table for various models for Sydney . ................................. 91
Table 4.6 Summary table for various models for Sydney. .................................. 92
Table 4.7 Resampling models selected for Melbourne rainfall occurrence. ....... 95
Table 4.8 Predictors used in models for Melbourne rainfall occurrence. ............ 96
Table 4.9 Summary results for ROG models for Melbourne .............................. 98
Table 4.10 Summary table for various models for Melbourne. ......................... 100
Table 4.11 Summary table for various models for Melbourne ......................... 100
Table 5.1 Average annual rainfall (mm) for El Niño and La Niña years at
Tenterfield 1936-1996 ...................................................................... 108
Table A.1 The Autoregressive (AR(1)) models.................................................. 159
Table A.2 The Autoregressive-Moving Average (ARMA(1,1)) models............ 159
Table A.3 Best a values obtained – AR(1) models. ........................................... 164
Table A.4 Best a values obtained – ARMA(1,1) models. ................................. 165
Table A.5 Efficiency loss (%) from selecting a by the suggested method –
AR(1) models. .................................................................................. 167
Table A.6 Efficiency loss (%) from selecting a by the suggested method –
ARMA(1,1) models. ......................................................................... 167
Table C.1 Average annual rainfall (mm) for El Niño and La Niña years at
Tenterfield 1936-1996 ...................................................................... 180

Table C.2 Average number of class 0, class 1, and class 2 wet days for El
Niño and La Niña years at Tenterfield 1936-1996........................... 181
Table C.3 Average rainfall amount (mm/day) on class 0, class 1, and class
2 wet days at Tenterfield 1936-1996 ................................................ 182

List of Figures
Figure 1.1a. Catchment modelling using historical rainfall data. ........................ 15
Figure 1.1b. Catchment modelling using multiple input sequences..................... 15
Figure 3.1 Components of Mutual Information (MI) for 5 test cases. . .............. 45
Figure 4.1 ROG(1): Daily statistics for Sydney rainfall occurrence. .................. 79
Figure 4.2 ROG(1): Distribution of wet days per year for Sydney. .................... 80
Figure 4.3 ROG(4): Mean, standard deviation, and longest dry and wet
spell lengths in each season for Sydney. ........................................ 81
Figure 4.4 ROG(4): Frequency-duration curves for dry spell lengths and
wet spell lengths in each season for Sydney. .................................. 82
Figure 4.5 ROG(4): Mean and standard deviation of wet days per season,
and standard deviation of wet days per year for Sydney.. .............. 83
Figure 4.6 ROG(4): Distribution of wet days per year for Sydney. .................... 84
Figure 4.7 ROG(4): Longer-term statistics for Sydney . ..................................... 85
Figure 5.1 Historical mean daily rainfall on class 0, 1a, 1b, and 2 wet days ..... 110
Figure 5.2 Illustration of the conditional probability density function .............. 113
Figure 5.3 RAG(1): Statistics of daily rainfall on class 2 wet days. .................. 121
Figure 5.4 Combined ROG(4)/RAG(1): Variability of Sydney rainfall totals
at several timescales.. ................................................................... 122
Figure 5.5 Combined ROG(4)/RAG(2): Variability of Sydney rainfall totals
at several timescales.. ................................................................... 123
Figure 5.6 Combined ROG(4)/RAG(2): Distribution of annual rainfall
amounts for Sydney. ..................................................................... 124
Figure 5.7 RAG(1): Statistics of daily rainfall for Melbourne on class 2 wet
days. .............................................................................................. 125
Figure 5.8 Combined ROG(4)/RAG(1): Variability of Melbourne rainfall
totals at several timescales.. .......................................................... 126
Figure 5.9 Combined ROG(4)/RAG(2): Variability of Melbourne rainfall
totals at several timescales.. .......................................................... 127
Figure 5.10 Combined ROG(4)/RAG(2): Distribution of annual rainfall
amounts for Melbourne................................................................. 128

Figure A.1 Example lag one dependence structure. ........................................... 161
Figure A.2 Estimated univariate PDF for the example in Figure 1. ................... 161
Figure A.3 Time series plot for the example in Figure 1. .................................. 162
Figure A.4 Best a choice vs. ? for small sample sizes (n=30, n=50) ................. 165
Figure A.5 Selection of a for Model no. 2. ........................................................ 166
Figure B.1 ROG(1): Panel of plots for Sydney rainfall occurrence. ................. 173
Figure B.2 ROG(4): Panel of plots for Sydney rainfall occurrence. ................. 174
Figure C.1 Skewness correction factor (s) for bandwidth selection.................. 177
Figure C.2 RAG(1): Statistics of daily rainfall on class 0 wet days. ................ 183
Figure C.3 RAG(1): Statistics of daily rainfall on class 1a wet days................ 184
Figure C.4 RAG(1): Statistics of daily rainfall on class 1b wet days. .............. 185
Figure C.5 ROG(4): Mean wet days per year for Sydney. a. All classes
combined. b. class 0. c. class 1a. d. class 1b. e. class 2. ............... 186
Figure C.6 ROG(4): Standard deviation of wet days per year for Sydney. a.
All classes combined. b. class 0. c. class 1a. d. class 1b. e.
class 2............................................................................................ 186
Figure C.7 Combined ROG(4)/RAG(1): Mean rainfall per year for
Sydney. a. All classes combined. b. class 0. c. class 1a. d.
class 1b. e. class 2. ........................................................................ 187
Figure C.8 Combined ROG(4)/RAG(1): Standard deviation of rainfall per
year. a. All classes combined. b. class 0. c. class 1a. d. class
1b. e. class 2. ................................................................................. 188
Figure C.9 Combined ROG(4)/RAG(2): Standard deviation of rainfall per
year. a. All classes combined. b. class 0. c. class 1a. d. class
1b. e. class 2. ................................................................................. 189
Figure C.10 RAG(1): Statistics of daily rainfall for Melbourne on class 0
wet days ........................................................................................ 190
Figure C.11 RAG(1): Statistics of daily rainfall for Melbourne on class 1a
wet days. ....................................................................................... 191
Figure C.12 RAG(1): Statistics of daily rainfall for Melbourne on class 1b
wet days. ....................................................................................... 192

Figure C.13 ROG(4): Mean wet days per year for Melbourne. a. All
classes combined. b. class 0. c. class 1a. d. class 1b. e. class 2. ... 193
Figure C.14 ROG(4): Standard deviation of wet days per year for
Melbourne. a. All classes combined. b. class 0. c. class 1a. d.
class 1b. e. class 2. ........................................................................ 194
Figure C.15 Combined ROG(4)/RAG(1): Mean rainfall per year for
Melbourne. a. All classes combined. b. class 0. c. class 1a. d.
class 1b. e. class 2. ........................................................................ 195
Figure C.16 Combined ROG(4)/RAG(1): Standard deviation of rainfall per
year for Melbourne. a. All classes combined. b. class 0. c.
class 1a. d. class 1b. e. class 2. ..................................................... 196
Figure C.17 Combined ROG(4)/RAG(2): Standard deviation of rainfall per
year for Melbourne. a. All classes combined. b. class 0. c.
class 1a. d. class 1b. e. class 2. ..................................................... 197

1. Introduction

1.1 Motivation
The 1997 CH Munro Oration [McMahon, 1997] referred to unfinished business in
hydrology. One of the five research areas discussed was stochastic data
generation, defined as “a statistical procedure that allows the hydrologist to
estimate synthetic sequences of rainfalls or streamflows that are as likely to occur
in the future as the observed historical record”. These synthetic sequences should
reproduce the statistical characteristics of the observed record, including the
observed variability at both short and long timescales; if this is true then the
sequences provide alternative realisations of rainfall that are consistent with the
climatic variability found in the observed record. These synthetic sequences can
then be used in catchment water management studies as a tool for exploring the
potential variability in the catchment response. An illustration of this is given in
Figure 1.1. Figure 1.1a shows that the historical record provides a single input
sequence to a catchment model, from which a single set of model outputs can be
obtained. Figure 1.1b shows that good quality synthetic sequences can provide
multiple inputs to a catchment model that supplement the historical record. These
multiple input sequences produce a range of model outputs, and risk-based
analysis of these outputs can help quantify the variability of the catchment
response.
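The risk-based analysis of Figure 1.1b can be illustrated with a minimal Monte Carlo loop. Everything in this sketch is an illustrative assumption (the toy `catchment_model`, the 40% wet-day chance, the gamma-distributed amounts); it only shows how many synthetic input sequences yield a distribution of outputs that can be summarised as a risk estimate.

```python
import random

random.seed(1)

def catchment_model(rain):
    # Hypothetical stand-in for a calibrated rainfall-runoff model:
    # annual outflow (GL) taken as a fixed fraction of total rainfall.
    return 0.2 * sum(rain)

def synthetic_year():
    # Illustrative daily rainfall: 40% chance of a wet day, gamma amounts (mm).
    return [random.gammavariate(2.0, 1.5) if random.random() < 0.4 else 0.0
            for _ in range(365)]

# Many synthetic input sequences -> a distribution of model outputs.
outflows = [catchment_model(synthetic_year()) for _ in range(1000)]

# Risk-based summary, as in Figure 1.1b.
p_low = sum(1 for q in outflows if q < 100.0) / len(outflows)
print(f"P(annual outflow < 100 GL) is roughly {p_low:.2f}")
```

A real study would replace the toy model with a calibrated catchment model and the synthetic years with sequences from a rainfall generator; the surrounding loop and risk summary are unchanged.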

Figure 1.1a. Catchment modelling using historical rainfall data: a single rainfall input drives the catchment model, producing a single set of model outputs (e.g. outflow = 200 GL).

Figure 1.1b. Catchment modelling using multiple input sequences (i.e. historical rainfall data, supplemented by generated synthetic sequences): a range of model outputs is produced (e.g. average outflow = 200 GL; probability of outflow of less than 100 GL = ?).

One reason why the field of stochastic data generation can be regarded as
“unfinished business” is that the statistical characteristics and features of the
historical record are complex and can be hard to reproduce in generated
sequences. Such features include droughts and sustained periods of high rainfall,
which are of great interest in catchment planning and management. If these
features are not reproduced in the synthetic sequences generated by a stochastic
model, then the generated sequences will not be of great value in catchment water
management studies. The use of generated sequences from such stochastic models
could lead to misrepresentation of the possible effects of climatic variability, and
to suboptimal policies for system management. For example, if the observed
record contains one or two major droughts, and the generated sequences used in a
catchment study do not contain any such droughts, then the results from the study
will neglect the effects of drought on the catchment. On the other hand, if a
stochastic rainfall model is available that adequately represents the process that
leads to drought, the variability in the response of the catchment to drought can
then be investigated using multiple input sequences, as illustrated in Figure 1.1b.

Unfortunately, existing methods for the generation of daily rainfall under-
represent the longer-term variability that is associated with drought [Buishand,
1978; Wilks and Wilby, 1999], and this deficiency limits the usefulness of these
methods in catchment studies.

The motivation of this research is, therefore, to develop methods for generating
synthetic sequences of daily rainfall that more closely reproduce the complex
short-term and longer-term features of the historical record. Given the assumption
of stationarity that is implied in McMahon’s [1997] definition of stochastic data
generation, such sequences would come closer to being “as likely to occur in the
future as the observed record”, compared to sequences that are generated by
existing methods. By providing improved methods for generating synthetic data
for catchment studies, it is hoped that this thesis will be a step down the path
towards improved risk-based management of water resources, and more reliable
evaluation of the hydrologic, environmental and socioeconomic impacts of
alternative water resource management plans.

1.2 Objectives of the Research


This thesis develops approaches for generation of long daily rainfall sequences at
a given location that can represent the variability in rainfall at both short (daily)
and long (seasonal, annual and inter-annual) time scales.

The objectives of the research are to:


1. Develop methods for the selection of predictors for use in simulation of daily
rainfall.
2. Develop a model for generating long sequences of rainfall occurrence (i.e.
whether particular days are “dry” or “wet”).
3. Develop a model for simulating rainfall amounts on the wet days.

For both the occurrence and amount models, the aim is to accurately reproduce
the distributional features, dependence features, and seasonal variations of the

observed record. The distributional features include the mean and the shape of the
underlying probability density function. The dependence features include both
day-to-day correlations, and longer-term features such as droughts and sustained
periods of high rainfall. The seasonal variations include changes in both
distributional and dependence characteristics with time of year.

1.3 Outline of the Thesis


Chapter 2 gives a background to the research, including a description of the
characteristics of Australian rainfall data, and of current practice in formulating
hydrologic inputs for large-scale catchment water management studies. Methods
for the stochastic generation of rainfall data are discussed, with emphasis placed
on problems with existing approaches for simulating daily rainfall.

The forecasting and generation of daily rainfall occurrence (whether a day is "dry"
or "wet") is the problem addressed in chapter 3 and chapter 4. Chapter 3
introduces a generic measure of partial dependence developed for discrete random
variables, and applies it to select relevant short and long-term predictors for a one-
day-ahead forecast of rainfall occurrence state. The method is tested using
synthetically generated data with known dependence attributes, and using long-
term rainfall data from 13 locations in Australia. The utility of the selected
predictors is then evaluated for Melbourne and Sydney, Australia, by forecasting
the rainfall occurrence for each day in the historical record in a leave-one-out
cross-validation mode.

Chapter 4 presents a nonparametric model for generating single-site daily rainfall


occurrence. The model is formulated to reproduce longer-term variability and
low-frequency features such as drought and sustained wet periods, while still
reproducing characteristics at daily time scales. The model is also designed to
ensure an accurate representation of the seasonal variations present in the rainfall
time series. The model is applied using historical daily rainfall occurrence from
Melbourne and Sydney, Australia. It is found that the use of multiple predictors

leads to sequences that more closely reproduce the longer-term variability present
in the historic records, compared to sequences produced by models incorporating
only one or two predictors.

Chapter 5 proposes a model for stochastic generation of rainfall amounts on wet


days that is nonparametric, accommodates seasonality, and reproduces a number
of key aspects of the distributional and dependence properties of observed rainfall.
The proposed rainfall amount model is shown to emulate the day-to-day features
that exist in the historical rainfall record, including the lag-one correlation
structure of rainfall amounts. This model is then linked with long sequences of
rainfall occurrence generated by the model of chapter 4. The approach is applied
to daily rainfall from Sydney and Melbourne, Australia, and the performance of
the approach is demonstrated by presentation of model results at daily, seasonal,
annual, and inter-annual timescales.

Chapter 6 puts forward the conclusions to the research, and chapter 7 presents a
list of references used in the body of the thesis.

Appendix A is a paper on the selection of a key smoothing parameter for


calculation of the mutual information (MI) criterion. MI is a measure of
dependence that can quantify both linear and non-linear dependence between any
two variables. The research that produced appendix A was conducted at an early
stage of the project, and provides a significant contribution to the literature, which
is published as [Harrold et al., 2001]. However, the decision was made to use
different methods for predictor selection when the rainfall occurrence and rainfall
amount models were being developed.
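For discrete variables, MI has a simple closed form; the thesis estimates it for continuous series via kernel density estimation (appendix A), but the discrete case illustrates the idea. This sketch (function and variable names are illustrative) computes MI in nats from a sample of observed pairs.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """MI (in nats) between two discrete variables, from a list of (x, y) pairs."""
    n = len(pairs)
    pxy = Counter(pairs)                   # joint frequencies
    px = Counter(x for x, _ in pairs)      # marginal frequencies of x
    py = Counter(y for _, y in pairs)      # marginal frequencies of y
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# A perfectly dependent binary pair versus an exactly independent one.
dependent = [(0, 0), (1, 1)] * 50
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
print(mutual_information(dependent))    # log(2), about 0.693 nats
print(mutual_information(independent))  # 0.0
```

MI is zero if and only if the variables are independent, which is what makes it attractive for detecting non-linear dependence that a linear correlation coefficient would miss.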

Appendix B introduces an additional tool for assessment of the quality of


generated sequences that supplements the predictor selection approaches given in
chapter 4. Appendix C presents material that is supplementary to chapter 5,
discussing rules for selection of a key smoothing parameter that is used in the
chapter, describing complex low- frequency features of an observed rainfall

record, and giving more detailed results for the rainfall occurrence/amount models
for both Sydney and Melbourne.

As a result of the research undertaken for this thesis, one journal paper [Harrold
et al., 2001] and six conference papers have been published, and two additional
conference presentations have been made. Note that Harrold et al. [2001] is
reproduced in this thesis as appendix A. A list of the conference papers and
conference presentations is included in appendix D. Also note that the material
contained in chapters 3 and 4 has been submitted as a two-paper series to Water
Resources Research, and the material contained in chapter 5 has also been
submitted as a paper to Water Resources Research. Each of these three papers is
co-authored by Harrold, Sharma, and Sheather.

2. Background to the Research

2.1 Characteristics of rainfall data, and characteristics of


the Australian rainfall record
As the purpose of this thesis is to develop methods for generating rainfall data that
can supplement the historical record for use in catchment studies, it is informative
to give a brief description of the historical record of rainfall data as it exists in
Australia, and to briefly describe the current practice for formulation of inputs for
large-scale catchment studies.

Rainfall is an intermittent process. When rain occurs, the intensity of the rainfall
changes with time, and the spatial distribution of rainfall also changes with time.
However, the most common method of measuring rainfall is by point estimates of
the rainfall depth (in mm) falling in a given time interval. This point estimate is
measured by a rainfall gauge. Daily read rainfall gauges are located at thousands
of locations across Australia, for example at many post offices. Because of this
extensive network, daily rainfall records are by far the most common records of
hydrologic data in Australia. The Bureau of Meteorology, which administers this
network, lists more than 16 000 daily rainfall sites on its website
(www.bom.gov.au). Because of the availability of data it is also relatively easy to
cross-check for consistency between rainfall stations. Note that Lavery et al.
[1997] have used cross-checking and compositing techniques to develop a high-
quality dataset of 379 daily rainfall records which covers most of Australia;
341 of these records commence before 1910.

Australia has lower average rainfall than any other continent apart from Antarctica,
and annual variability of rainfall is high, with linkages to large-scale quasi-
periodic ocean-atmosphere variations such as the El Niño Southern Oscillation
Index [Allan, 1991]. These features of Australia’s climate, along with factors such

as the continental distribution of vegetation, lead to a larger coefficient of
variation in runoff than for any other continent [Croke and Jakeman, 2001].

Rainfall is seasonal in nature. For example, summer rainfall tends to be higher


than winter rainfall in much of southeastern Australia. The distribution of rainfall
is also highly skewed, with a small number of extreme values in the record.

Dependence in rainfall records exists at several time scales. For example, rainfall
on the current day may be strongly influenced by rainfall on the previous day.
Longer-term dependence also exists. Periods of drought and sustained periods of
high rainfall, which are common in Australian hydrologic records, are related to
this longer-term dependence. Persistent periods of drought and high flows have
also been linked to known low frequency climatic anomalies such as the El Niño
Southern Oscillation [Simpson et al., 1993].
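The day-to-day dependence described above can be quantified by comparing conditional wet-day probabilities. A minimal sketch, assuming a synthetic occurrence series from a first-order Markov chain (the transition probabilities are illustrative, not fitted to any Australian record):

```python
import random

random.seed(0)

# Illustrative occurrence series (1 = wet, 0 = dry) with persistence:
# assumed P(wet | wet) = 0.6 and P(wet | dry) = 0.25.
occ = [0]
for _ in range(10000):
    p_wet = 0.6 if occ[-1] == 1 else 0.25
    occ.append(1 if random.random() < p_wet else 0)

# Day-to-day dependence: compare the chance of rain after a wet day
# with the chance of rain after a dry day.
pairs = list(zip(occ, occ[1:]))
wet_days = sum(1 for a, _ in pairs if a == 1)
dry_days = len(pairs) - wet_days
p_wet_given_wet = sum(1 for a, b in pairs if a == 1 and b == 1) / wet_days
p_wet_given_dry = sum(1 for a, b in pairs if a == 0 and b == 1) / dry_days
print(p_wet_given_wet, p_wet_given_dry)
```

A gap between the two estimated probabilities indicates lag-one dependence; longer-term dependence (droughts, sustained wet periods) requires statistics computed at seasonal, annual, and multi-year scales.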

2.2 Current practice in formulating hydrologic inputs for


catchment water management studies
One of the best tools for large-scale catchment water management studies in
Australian conditions is IQQM, the Integrated Quantity-Quality Model developed
by the New South Wales Department of Land and Water Conservation (DLWC)
[Department of Land and Water Conservation, 1998a,b]. IQQM requires daily
rainfall, evaporation, and streamflow data at several locations as input to the
model. The historical record is the only reliable indication of the possible values
for these inputs.

In a typical implementation of IQQM in a major river valley, around 10-40


rainfall gauging stations, 3-5 evaporation stations, and 10-30 streamflow gauging
stations are used, with the locations of these stations distributed around the
catchment [Harrold, 2000]. The longest and most reliable historical records are
identified and cross-checked with surrounding stations. The rainfall records are
gap-filled (not extrapolated) by cross-correlation with other rainfall stations that

are nearby. Small gaps in the streamflow record are also filled by cross-correlation
with nearby streamflow stations. The aim is to get as long a record as possible of
good quality gap-filled data. It is typically possible to get:

• around 100 years of gap-filled daily rainfall data at several locations,
• around 15-40 years of daily evaporation data at several locations, and
• up to 100 years of daily streamflow data at one or two locations on the main
river, with 30-60 years of daily streamflow for major tributaries.

The overall aim is to get around 100 years of data for all variables, with no
missing values in the records. It is therefore necessary to use the rainfall data as
input to calibrated rainfall-evaporation and rainfall-runoff models, to generate 100
years of evaporation and streamflow data based on the rainfall data, and then use
the generated data to gap-fill and extend the observed records. The methodology
that DLWC uses for the rainfall-evaporation model is explained in the
"Description of the Daily Weather Model" [CRC for Waste Management, 1995].
DLWC uses the Sacramento rainfall-runoff model [Burnash et al., 1973] for
generating daily streamflow based on historical daily rainfall records.

The methodology outlined above will produce around 100 years of daily
hydrological inputs at multiple sites, for use in a catchment water management
study. The multivariate dataset that is produced is based on the best available
historical daily rainfall, evaporation and streamflow records, with the evaporation
and streamflow records extended based on the longer rainfall records. Significant
spatial and temporal correlations exist in the multivariate dataset. The
distributional and dependence structures in the rainfall and the multivariate dataset
are complex - so much so that DLWC has not found a currently available,
practical and reliable method for stochastic generation of long sequences of
synthetic rainfall or multivariate data that meets their requirements for use in
catchment modelling. This thesis is a step towards this goal, the aim being to
develop improved methods for the generation of single-site daily rainfall. The
development of a model for generating spatially correlated multi-site rainfall is
beyond the scope of this current work.

2.3 Stochastic generation of rainfall data


Historical records for rainfall are hydrologic time series. A time series exists when
there is a random variable that is observed in a sequential manner in time [Salas et
al., 1980]. There is a large body of literature devoted to the study of hydrologic
time series, to the calculation of sample statistics, and in particular to the fitting of
stochastic models for generating synthetic data (for example, see Salas et al.
[1980]; Hirsch et al. [1993]; Salas [1993]; Yevjevich [1972,1984]; Bras and
Rodriguez-Iturbe [1985]). These models are formulated to reproduce some of the
properties of the original hydrologic time series; the form of the model and how
well it fits the patterns in the original data strongly influence the quality of the
generated synthetic time series. A review of relevant literature is included in this
section.

A simple time series model is introduced first, followed by a discussion of
traditional approaches to modelling long memory, which is of interest here
because long memory is related to low frequency features such as drought.
Finally, approaches for the stochastic generation of daily rainfall are presented.
This review of the literature is limited to approaches for generating a single time
series at a single location.

2.3.1 Generation of yearly and seasonal records


Common time intervals for a time series are yearly, seasonal (e.g. quarterly or
monthly), and daily. The complexity of the patterns in the data tends to increase as
the timestep reduces. For example, a time series of yearly rainfall has lower
correlation than a time series of daily rainfall, and the yearly time series is a
continuous process while the daily time series is an intermittent process. It
therefore follows that simpler models can be fitted to yearly and seasonal time
series than to daily data. Autoregressive (AR) and autoregressive moving average
(ARMA) models form a group of models that are applicable for many hydrologic
processes [Salas, 1993], provided they are fitted to a continuous, stationary,
nonperiodic time series such as annual rainfall, or to seasonal rainfall, if a separate
model is used for each season. In addition, if the data are skewed, a transformation
should be applied to create a dataset that is approximately normally distributed,
and the model is then applied to the transformed data.

An autoregressive model of order p (AR(p)) is expressed as:

X_t = \mu + \sum_{j=1}^{p} \phi_j (X_{t-j} - \mu) + e_t                (2.1)

where X_t is the value of the time series at time step t, \mu is the mean of the
time series, \phi_1, ..., \phi_p are the p autoregressive parameters, and e_t is an
uncorrelated normal random variable. The variance and autocorrelation functions of AR
models are easy to calculate [Salas, 1993]; for example, the lag-one correlation of
an AR(1) model is equal to the autoregressive parameter \phi_1.

Autoregressive models can be combined with a moving average component to
produce a slightly more complicated model that has an autocorrelation structure
that decays more slowly. An ARMA(p,q) model is expressed as:

X_t = \mu + \sum_{j=1}^{p} \phi_j (X_{t-j} - \mu) - \sum_{j=1}^{q} \theta_j e_{t-j} + e_t                (2.2)

where \theta_1, ..., \theta_q are the q moving average parameters.

Low-order AR and ARMA models are commonly used in hydrology. The lowest-order
autoregressive (AR(1)) model reproduces mainly short-term dependence, while an
ARMA(1,1) model can reproduce some longer-term dependence because of its
slowly decaying autocorrelation structure.
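As a brief aside illustrating equation (2.1), the sketch below simulates an AR(1) series and checks that the sample lag-one correlation is close to \phi_1. The parameter values are arbitrary illustrative choices, not fitted to any record, and the use of Python with NumPy is an assumption of this sketch rather than part of the methods discussed here.

```python
import numpy as np

def simulate_ar1(mu, phi1, sigma_e, n, seed=0):
    """Simulate X_t = mu + phi1 * (X_{t-1} - mu) + e_t, e_t ~ N(0, sigma_e^2)."""
    rng = np.random.default_rng(seed)
    e = rng.normal(0.0, sigma_e, n)
    x = np.empty(n)
    x[0] = mu + e[0]
    for t in range(1, n):
        x[t] = mu + phi1 * (x[t - 1] - mu) + e[t]
    return x

# Arbitrary illustrative parameters, on an annual-rainfall-like scale
x = simulate_ar1(mu=600.0, phi1=0.3, sigma_e=150.0, n=100_000)
r1 = np.corrcoef(x[:-1], x[1:])[0, 1]  # sample lag-one correlation, ~phi1
```

With a long simulated series, r1 converges on the autoregressive parameter, consistent with the lag-one property of the AR(1) model noted above.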
2.3.2 Modelling long-term persistence

Long memory, or long-term persistence, exists in a hydrologic record when there


is significant dependence between observations a long time apart. One key
characteristic of records with long memory is that there are persistent periods of
low values and persistent periods of high values in the record, a situation which is
common in Australia. Droughts will be more severe when long memory is
present, and sustained periods of high rainfall are also more likely. Long memory
also has an effect on the estimation of parameters such as the mean. In a
hydrologic record that has long memory, each additional observation is correlated
to previous values and therefore adds little new information about the underlying
process. This implies that the estimate of the population mean obtained from a
sample will be less reliable if long memory is present. In other words, the variance
of sample estimates of the mean and other parameters is very high in the presence
of long memory.

The Hurst coefficient (h) is a measure that is designed to quantify the extent of the
long memory that exists in a dataset. Hurst [1951,1956] found evidence for long
memory in discharge data from the Nile River and from many other hydrological
and geophysical records. He studied 120 long-term geophysical records, which
had periods of record of between 30 and 2000 years. The statistic that he
calculated (the "rescaled range" or R/S statistic) behaved similarly for all the
phenomena he studied, and the coefficient he calculated from this behaviour is
termed the Hurst coefficient (h). Independent and short memory models produce a
Hurst coefficient of 0.5. The tendency of geophysical time series to produce
values of h greater than 0.5 has become known as the Hurst phenomenon.
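The rescaled range underlying the Hurst coefficient can be sketched in a few lines. This is an illustrative simplification (no small-sample bias correction is applied, and the slope estimate for finite samples of independent noise tends to sit slightly above 0.5); Python and NumPy are assumed, and nothing here is drawn from Hurst's own procedure beyond the basic R/S definition.

```python
import numpy as np

def rescaled_range(x):
    """R/S statistic: range of the cumulative mean-adjusted partial sums,
    divided by the standard deviation of the series."""
    x = np.asarray(x, dtype=float)
    z = np.cumsum(x - x.mean())
    return (z.max() - z.min()) / x.std()

# For independent noise, log(R/S) grows roughly as 0.5 * log(n), so the
# fitted slope (an estimate of h) should come out near 0.5.
rng = np.random.default_rng(1)
sizes = [64, 128, 256, 512, 1024, 2048]
mean_rs = [np.mean([rescaled_range(rng.normal(size=n)) for _ in range(200)])
           for n in sizes]
h_est = np.polyfit(np.log(sizes), np.log(mean_rs), 1)[0]
```

A long-memory series would instead produce a slope well above 0.5, which is the Hurst phenomenon described above.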

There are some parametric models that can reproduce long-memory effects. One
of these is the fractionally differenced autoregressive integrated moving average
(FARIMA) model [Hosking, 1984]. Other long-memory models that can be used
to reproduce a value of h>0.5 are fractional Gaussian noise [Beran, 1994; Bras
and Rodriguez-Iturbe, 1985] and the broken-line process [Bras and Rodriguez-Iturbe, 1985].
However these models are all more complicated than ARMA
models, and they may not accurately reproduce short-term dependence
characteristics.

FARIMA models are described in Hosking [1984], Beran [1994], and Montanari
et al. [1997]. They are an extension of the autoregressive integrated moving
average models described by Box and Jenkins [1976]. The three parameters of a
FARIMA(p,d,q) model are:
p = the order of the autoregressive component;
q = the order of the moving average component;
d = the differencing order, which is allowed to be fractional.
FARIMA models are directly related to the Hurst coefficient by h=d+0.5. The
range of interest for d is 0 to 0.5. When d=0, the model reduces to an ARMA
model, and when d>0.5, the model is nonstationary. As d increases within this
range, the FARIMA model shows stronger long memory characteristics.

A very large dataset is required to fit and apply a FARIMA model. Montanari et
al. [1997] fit this model to seasonally standardised 1 daily inflows to Lake
Maggiore. They unsuccessfully attempted to fit the model to seasonally
standardised monthly inflows, a dataset of length 614. They consider that this was
too small a sample size to be able to conclude whether long memory is present.
Montanari et al. conclude that "when the sample size is small, many different models
perform approximately as well, even if long memory is present." Note that the
longest recorded hydrologic data in Australia is for a period of around 140 years,
so that FARIMA models and similar models cannot be applied to Australian data
without reverting to the use of seasonal standardisation.

Lettenmaier and Burges [1977] compared a long memory model (fractional
Gaussian noise) against an autoregressive moving average model (ARMA(1,1)).

1
Seasonal standardisation of daily data involves calculating the mean and standard deviation for
all the data from each of 366 calendar days in the record, then forming a standardised dataset by
subtracting the mean and dividing by the standard deviation for each data value. This procedure is
not recommended in this thesis because the resulting dataset can still be nonstationary in the
autocorrelation. Instead, the “moving window” approach is used, as described later in this chapter.
They undertook a modelling analysis of the storage required to satisfy a fixed
demand over the operating life of a reservoir. Their storage analysis effectively
showed that for a 40 year design life it was not worth considering using the long
memory model, because the simpler ARMA(1,1) model gave almost identical
results. They showed that use of the long memory model may be worthwhile
when a design life of 100 years is used.

It is also worth noting that an autoregressive moving average (ARMA) model is


easier to fit to a dataset than a FARIMA model. Even though the correlation
function for an ARMA model does not decay as slowly as the autocorrelation
functions for a FARIMA model, Salas [1993, page 19.22] writes that, in relation
to modelling hydrologic processes, one may say that AR processes are short-
memory and ARMA processes are long-memory. For an ARMA process h
approaches 0.5 asymptotically, and values of h greater than 0.5 may be obtained
for finite samples.

Thyer and Kuczera [2000] proposed a two-state hidden state Markov model to
represent the long-term persistence in annual rainfall records. They note that
Australian locations that are significantly influenced by tropical weather systems
are subject to alternating wet and dry regimes. Their model framework provides
an explicit mechanism to stochastically simulate these wet and dry periods. Thyer
and Kuczera successfully applied this model to datasets of around 140 years in
length, and under the assumptions of their model, strong persistence structures
were identified. When they fit ARMA-type models to this data, no statistically
significant evidence of persistence was obtained, i.e. under the assumptions of
ARMA processes, there was no significant difference between the fit of an AR(1)
and an AR(0) (white noise) model.

The influence of low-frequency features such as drought and sustained periods of
high rainfall on a hydrologic model depends on the timescale that is being used.
For example, a three-year drought affects three timesteps of an annual model, but
it affects many more timesteps of a daily model. Thus at the yearly timescale,
low-frequency features cause relatively high correlation between years; however,
at the daily timescale, the same low frequency features will result in very small
correlations between adjacent days. In recognition of this, Harrold and Sharma
[1999] proposed a new approach to characterising longer-term memory in daily
data, involving the use of aggregated values (i.e. values from a consecutive
sequence of days added together) as the basis of measuring dependence at long
time lags, while still working at a daily time step. The motivation for this
approach is that the dependence structure in daily hydrologic data is masked by
the presence of large amounts of noise. Smoothing out this noise will enable the
dependence structure to be identified more clearly. One way of smoothing the
noise is by aggregation, i.e. by adding together values from a consecutive
sequence of days. Harrold and Sharma's approach can capture the dependence
structure of the data in a way that involves fewer parameters than a multi-lag
ARMA model.

2.3.3 Models for stochastic generation of daily rainfall

None of the models discussed above can be applied directly to daily rainfall
because daily rainfall is an intermittent process. An algorithm is needed that deals
both with deciding whether a day will be wet or dry, and with simulating the
rainfall amount on a wet day. Also note that the concept of long
memory is not as useful at the daily timescale as at longer timescales, because the
influence of low-frequency features such as drought and sustained periods of high
rainfall depends on the timescale that is being used. At the daily timescale, the
proportion of the correlation between adjacent days that is due to multi-year
quasi-periodic cycles is so small that it would be almost impossible to measure. For
this reason, it is more common to talk about “longer-term variability” rather than
“long memory” when daily data is being discussed.

A common approach to modelling single-site daily rainfall has been to develop
models describing the rainfall occurrence (dry-wet) process and to describe the
distribution of rainfall amounts on wet days independently [Woolhiser, 1992].
Rainfall occurrence is represented in two ways: either as a Markov process
[Gabriel and Neumann, 1962; Salas, 1993; Katz and Zheng, 1999], the
assumption being that the rainfall state on the next day is related to the state of
rainfall on a finite number of previous days; or as an alternating renewal process
for dry and wet sequences [Buishand, 1978; Sharma and Lall, 1999], the approach
being to stochastically generate the dry and wet spell lengths, either
unconditionally, or conditional to appropriately selected predictors. Rainfall
amount is generated once a day has been specified as wet. Rainfall amount on a
wet day follows a positively skewed probability distribution. This can be
modelled using parametric distributions such as the two-parameter gamma, mixed
exponential, and skewed normal distributions (see Woolhiser [1992], Wilks and
Wilby [1999], and Srikanthan and McMahon [2000] for reviews of the literature).
The rainfall amount and rainfall occurrence can also be modelled as a single
process using a multi-state Markov chain, the rainfall being treated as a mixed
discrete and continuous variable, and transition probabilities being prescribed to
model the dependence structure present [Haan et al., 1976; Srikanthan and
McMahon, 1985; Gregory et al., 1993].
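To make the two-part occurrence-plus-amounts structure concrete, the sketch below combines a first-order Markov occurrence chain with independent gamma-distributed wet-day amounts. The transition probabilities and gamma parameters are arbitrary illustrative values, not fitted to any station, and the sketch is one simple instance of the model class just described rather than the method developed in this thesis.

```python
import numpy as np

def generate_daily_rainfall(n_days, p01, p11, shape, scale, seed=0):
    """First-order Markov occurrence (p01 = P(wet | dry), p11 = P(wet | wet))
    with independent gamma-distributed amounts on wet days."""
    rng = np.random.default_rng(seed)
    rain = np.zeros(n_days)
    wet = False
    for t in range(n_days):
        wet = rng.random() < (p11 if wet else p01)
        if wet:
            rain[t] = rng.gamma(shape, scale)
    return rain

# Arbitrary illustrative parameters
r = generate_daily_rainfall(50_000, p01=0.2, p11=0.5, shape=0.8, scale=6.0)
wet_fraction = (r > 0).mean()  # stationary value is p01 / (1 - p11 + p01) = 2/7
```

Because p11 exceeds p01, wet days cluster into spells, which is the short-term dependence that the Markov occurrence component is intended to capture.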

2.3.4 Problems in modelling daily rainfall

Conventionally used methods for daily rainfall generation are found wanting in
several respects: in their representation of the distributional features of rainfall,
of short-term dependence, of longer-term variability, and of seasonal variations.

Representation of distributional characteristics.


The commonly used first-order Markov chain (which assumes that the probability
of rainfall occurrence depends only on the state of the previous day) can under-
represent extreme dry spell lengths [Buishand, 1978; Wilks and Wilby, 1999]. An
alternative approach is to use an alternating renewal process for dry and wet
sequences [Buishand, 1978]. Parametric distributions such as the truncated
negative binomial distribution can be fitted to the distributions of dry and wet
spell lengths. Rainfall amount on wet days can also be modelled using parametric
distributions such as the two-parameter gamma, the mixed exponential and the
skewed normal distributions (see Woolhiser [1992] for a review of the literature).
After the model is fitted to the data, a limited number of model parameters are
used in simulation to give generated values that reproduce the assumed
distribution. The model fitting process involves assumptions about the nature of
the underlying probability density. If the assumed distribution is correct a highly
efficient model can result. However, in this approach, both the choice of model
and the choice of model parameters are subject to uncertainty. Woolhiser [1992]
concluded that “it appears unlikely that a single distribution will provide a good
fit to daily rainfall data for all climatic regions”. Lall et al. [1996] reached a
similar conclusion. When the chosen distribution does not fit the underlying data
structure the resulting model may not be an accurate representation of reality.
Unfortunately, the choice of probability distribution can be difficult, especially for
small data sets.

An alternative approach is to avoid assumptions regarding the shape of the
underlying probability density, and to use nonparametric methods. These methods
make weak assumptions about the structure of the data, are able to reproduce a
broad class of underlying density functions, and use the observed data more
directly to characterise the probability density. Resampling (bootstrap) methods
and kernel density estimation methods are good examples of nonparametric
approaches [Lall et al., 1996; Rajagopalan et al., 1996; Rajagopalan and Lall,
1999; Sharma and Lall, 1999]. Resampling methods produce generated sequences
of values which have been conditionally selected from the historical record, while
kernel density estimates are obtained by considering the cumulative effect of
smooth functions called kernels which have been placed over each sample data
point. These methods minimise the assumptions that are made in fitting a
stochastic model to a dataset, and are designed to reproduce sample
characteristics. However, the methods work best for large sample sizes, since
small samples may not be representative of the underlying population.
Rainfall amounts on wet days are not identically distributed. Buishand [1978]
argues that the distribution of rainfall is different on solitary wet days, days at the
start or end of wet spells, and days in the middle of wet spells. Chapman [1998]
shows that stochastic models that treat rainfall amounts as separate classes, based
on the number of adjoining wet days, generally result in a better fit than stochastic
models that treat the data together.

Representation of short-term dependence features


Most daily rainfall amount models assume that precipitation amounts on wet days
are independent [Wilks and Wilby, 1999]. However, Buishand [1978] found
significant correlation between precipitation amounts on successive wet days.
Gregory et al. [1993] suggest that reproduction of the structure of daily
autocorrelation provides a crucial test for a stochastic rainfall generator. Multi-
state Markov chains [Haan et al., 1976; Srikanthan and McMahon, 1985; Gregory
et al., 1993] are one type of rainfall model that do not assume independence of
wet day amounts. Nonparametric models can also be formulated to condition the
simulated amounts on the value on the previous day [Sharma and Lall, 1999].

The multi-state Markov chain simulates both occurrence and amounts. Different
ranges of rainfall amounts are grouped into classes, with the lowest class
constituting zero rainfall (i.e. dry days), and transition probabilities among all
possible pairs of classes are calculated. Once the class is decided, the rainfall
within the class is calculated using a parametric distribution. It is the transition
probabilities of a multi-state Markov chain that allow the model to at least
partially reproduce the correlation structure of rainfall amounts. Gregory et al.
[1993] show that the daily serial correlation coefficients of area-average rainfall in
South-East England are better reproduced by multi-state Markov chains than by
models that assume the rainfall amounts on wet days are independent.
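A minimal sketch of how multi-state transition probabilities might be estimated from a daily record is given below. The class boundaries are arbitrary examples (class 0 contains the dry, zero-rainfall days), and the sketch illustrates the general idea only; it is not the Srikanthan and McMahon formulation.

```python
import numpy as np

def transition_matrix(rain, bounds):
    """Estimate transition probabilities among rainfall classes.
    `bounds` are increasing class boundaries; values below bounds[0]
    (e.g. zero rainfall) fall in class 0."""
    states = np.digitize(rain, bounds)
    k = len(bounds) + 1
    counts = np.zeros((k, k))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

# Tiny worked example: classes are dry, light (<5 mm), moderate (<20 mm), heavy
P = transition_matrix([0.0, 0.0, 3.0, 10.0, 0.0, 3.0], bounds=[1e-9, 5.0, 20.0])
```

Each row of P is the conditional distribution of tomorrow's class given today's class; it is these conditional distributions that let a multi-state chain partially reproduce the correlation of wet-day amounts.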

In a study covering 14 locations in Australia, 6 locations in South Africa, 24
locations in North America, and 22 locations in the Pacific islands, Chapman
[1998] compared the performance of commonly used parametric rainfall models
against the multi-state Markov chain of Srikanthan and McMahon [1985]. The
Srikanthan and McMahon approach performed better than the other models for 53
out of the 66 locations tested. The Akaike Information Criterion [Akaike, 1974] was
used for the comparisons. This is a measure of the quality of a one-day-ahead
forecast made using the model. The relatively good performance of the Srikanthan
and McMahon approach is due to the nonparametric nature of the approach and to
the ability of a multi-state Markov chain to partially reproduce the structure of
daily autocorrelation, and thus provide a more accurate forecast of the rainfall
amount on a day to day basis.

Representation of longer-term variability


Existing stochastic models for daily rainfall typically underrepresent the variance
of seasonal and annual historical rainfall totals [Buishand, 1978; Wilks, 1989;
Gregory et al., 1993; Katz and Zheng, 1999]. Such reduced variability affects the
representation of sustained droughts or periods of continuously high rainfall in the
generated sequences, features that are of great interest in catchment planning and
management. For example, the response of a catchment to persistent rainfall may
be of great interest in a catchment management study. Under-representation of
variability also potentially affects the quality of short-term forecasts of rainfall. A
forecast model may be mis-specified or sub-optimal due to the non-consideration
of the different dependence patterns that may apply in very dry years as compared
to very wet years. These problems are compounded in locations subject to
significant inter-annual variability, as is the case with the Australian rainfall data
analysed in this thesis.

Drought and persistent periods of rainfall have been linked to known low
frequency climatic anomalies such as the El Niño Southern Oscillation [Simpson
et al., 1993]. Table 2.1 presents the average rainfall per year at Tenterfield in
northern New South Wales, Australia, for periods of record associated with El
Niño Southern Oscillation (ENSO) and non-ENSO conditions. The El Niño and
La Niña year classifications used here are those of Allan [1997]. The differences
shown in the table are statistically significant. Climatic anomalies such as the
ENSO occur with a frequency of 3-7 years on average. The fact that
significantly different values occur during the El Niño (dry) and La Niña (wet)
phases of the ENSO is indicative of the longer-term variations that need to be
characterised in the rainfall record. If specific mechanisms for reproducing such
longer-term variability are not included in the rainfall generation model, then such
variability will not be captured in the generated synthetic series.

Table 2.1. Average rainfall per year (mm) at Tenterfield 1936-1996


El Niño years Non-ENSO years La Niña years
790 900 1020

One way of increasing the variability of generated longer-term totals is to
condition the daily rainfall model on a covariate. Wilks [1989] proposed a method
for generating a synthetic daily rainfall sequence where he conditioned on 30-day
probabilistic forecasts issued by the United States Climate Analysis Centre.
Woolhiser [1992] incorporated information from the Southern Oscillation Index
into a daily rainfall model. Katz and Zheng [1999] conditioned daily rainfall on a
hidden two-state annual index. Chapman [2001] conditioned the Srikanthan-
McMahon daily rainfall model [Srikanthan and McMahon, 1985] on separate
datasets for wet and dry years, and applied refinements to adjust the monthly and
annual standard deviations of the resulting generated sequences. However, these
approaches are limited because they require the generation of an exogenous
variable of the same length as the generated daily rainfall sequence, the value of
the conditioning variable may only be able to change at fixed times (for example
at the start of the water year), and each approach only considers a single
conditioning variable.

Representation of seasonal variations.


Seasonality is traditionally dealt with by either applying a model to distinct
segments of the year (seasons or months), or by imposing a seasonal trend on the
parameters that describe a model, for example with a Fourier representation. The
first approach is a step function approach, where the parameter values can change
abruptly from the end of one month to the start of the next. Woolhiser [1992]
notes that this is not intuitively satisfying because we would not expect the
precipitation characteristics on adjacent days to vary substantially. Using Fourier
coefficients to describe seasonal variation can result in a parsimonious and
smoothly varying set of model parameters. However if abrupt seasonal changes
are present in the record, it may be necessary to incorporate higher-order
harmonics. An alternative for modelling seasonality is the “moving window”
approach [Rajagopalan et al., 1996; Sharma and Lall, 1999]. A window of
specified length is centred at the current calendar day, and all days falling within
the moving window (from all historical years) form the local subset of data used
in the model for the current day. An overlapping sequence of 366 moving
windows provides coverage of all calendar days. These windows naturally
represent the seasonal variability present in the historical record, while providing
sufficiently large datasets for use in the model. Such datasets can then be used to
estimate parameters of an appropriately specified model, or they can be used in a
nonparametric modelling approach, as has been done by Sharma and Lall [1999].
This method is data-intensive, but it avoids the need to calculate the Fourier
coefficients, or to deal with discontinuities at the edge of each discrete season.
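The moving-window idea can be expressed in a few lines. The sketch below returns the calendar days (1 to 366) covered by a window centred on a given day, wrapping across the year boundary; the half-width of 7 days in the example is an arbitrary choice, not a value used in the studies cited, and Python with NumPy is assumed.

```python
import numpy as np

def window_days(day_of_year, half_width, year_length=366):
    """Calendar days (1..year_length) inside a moving window centred
    on day_of_year, wrapping around the ends of the year."""
    offsets = np.arange(-half_width, half_width + 1)
    return (day_of_year - 1 + offsets) % year_length + 1

# A 15-day window centred on 1 January wraps back into late December
w = window_days(day_of_year=1, half_width=7)
```

Pooling the historical data from all years falling on these calendar days gives the local subset used for the current day, as described above.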

2.4 Conclusion
This chapter has briefly described characteristics of Australian rainfall data, and
the requirements for formulating hydrologic inputs for large-scale catchment
water management studies in Australia. Methods for the stochastic generation of
single-site rainfall data have been discussed, with a focus on the problems
encountered in existing methods for generating daily data. It has been
demonstrated that existing daily rainfall models are lacking in their representation
of both short-term dependence features and longer-term variability, and that many
existing models are also limited in their ability to reproduce the distributional
features and seasonal variations of the observed rainfall record. There is a need to
develop new approaches for modelling daily rainfall, which attempt to overcome
these limitations. Such new approaches are developed in this thesis. The first stage
in the development of these new approaches is presented in the next chapter,
which focuses on the rainfall occurrence process and on the selection of
appropriate predictors to forecast the rainfall occurrence state.

3. Selection of Predictor Variables for a One-Day-Ahead Forecast of Daily Rainfall Occurrence

3.1 Introduction
The rainfall generation problem is approached in two stages: generation of rainfall
occurrence, and subsequent generation of rainfall amounts on the simulated wet
days. An approach for generation of daily rainfall occurrence (whether a day will
be “dry” or “wet”) is presented first, with this chapter concentrating on the task of
identifying predictors for forecasts of the daily rainfall occurrence state. The next
chapter (chapter 4) presents an approach for generating rainfall occurrence time
series that represent the day-to-day variations in the historical record, and can
represent variability at longer (seasonal, annual and inter-annual) time scales
through the use of appropriately specified predictors. The generation of the
rainfall amounts is presented in chapter 5.

The multi-order Markov chain model [Chin, 1977; Salas, 1993] is the best known
of the traditional methods for generating daily rainfall occurrence. A multi-order
Markov chain can only capture the longer-term structure of the data if the order of
the model (and hence the number of parameters) is made very large. An
alternative approach to describe longer-term structure in a hydrologic time series
was introduced by Sharma and O’Neill [2001], who used the total streamflow for
the previous twelve months as an "aggregate" streamflow that represents a longer-
term state of the catchment, and simulated monthly streamflow as a function of
both the previous month’s flow and the aggregate flow variable. This approach
ensured an accurate representation of month-to-month variability, as well as a
better representation of the dependence of monthly flow on the previous year's
flow. This chapter applies the aggregation approach to daily rainfall occurrence.
The approach adopted here is to model both short-term (day to day) and longer-
term (seasonal, annual and inter-annual) features through the use of appropriately
selected aggregate predictor variables that describe these features. Use is made of
“aggregate” variables that describe how many wet days have been observed over a
period of time, and which represent the rainfall state over these periods. These
aggregate variables are used in a set of candidate predictors for daily rainfall
occurrence. The candidate predictors are formed solely from previous values in
the rainfall occurrence sequence, and are formulated to represent the short-term
and longer-term variability that exists in the historical rainfall occurrence record.
This approach can capture the dependence structure of the data in a way that
involves fewer parameters than a traditional multi-order Markov chain model.
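As an illustration of the aggregate variables described above, the sketch below counts the wet days in the w days preceding each day, for several window lengths. The window lengths shown are arbitrary examples, not the predictor set identified later in this chapter, and the implementation choices (NaN for days with an incomplete history, a cumulative-sum formulation) are assumptions of this sketch.

```python
import numpy as np

def aggregate_predictors(occurrence, windows=(1, 30, 365)):
    """For each day t, count the wet days in the preceding w days.
    `occurrence` is a 0/1 sequence; days lacking a full w-day history
    are returned as NaN."""
    occ = np.asarray(occurrence, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(occ)))
    preds = {}
    for w in windows:
        p = np.full(occ.size, np.nan)
        t = np.arange(w, occ.size)
        p[t] = csum[t] - csum[t - w]  # wet days in (t-w, ..., t-1)
        preds[w] = p
    return preds

p2 = aggregate_predictors([1, 0, 1, 1, 0], windows=(2,))[2]
```

Because each predictor is formed solely from previous values of the occurrence sequence, such variables can be computed during simulation as well as during fitting.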

The presence of a number of possible predictor variables representing both short-term
and longer-term dependence necessitates the use of a predictor identification
criterion. Such a criterion should be capable of working with discrete variables,
since the predictand (rainfall occurrence) as well as the candidate predictors
(rainfall occurrence on the previous day, and a set of aggregate variables
representing longer-term dependence), all assume discrete values. Additionally,
the criterion should be such that no explicit or implicit assumptions about the
nature of variability or dependence are made in selecting the predictors.
Unfortunately, traditional methods used to select the model order, such as the
Akaike Information Criterion [Akaike, 1974; Wilks and Wilby, 1999] and the
Bayesian Information Criterion [Schwarz, 1978], make assumptions about the
probability distributions of the variables involved. Presented here is an alternative,
nonparametric procedure for measuring the dependence between discrete
variables, which avoids specification of probability distributions (such as
Binomial, Multinomial or Poisson). The procedure also avoids assuming linear or
a specified non-linear dependence, and is capable of quantifying both linear and
non-linear dependence between the discrete variables being considered. The
proposed procedure is termed partial informational correlation (PIC). This is a
"partial" measure of dependence, which allows it to be used to identify predictors
in a stepwise approach.

In this chapter, the relationships between daily rainfall occurrence and variables
formed from previous values in the rainfall occurrence sequence are examined
using PIC. The applicability of PIC as a generic measure of partial dependence is
tested using synthetically generated discrete variable data sets, and the best
predictors of daily rainfall occurrence are identified, using a stepwise
implementation of PIC. The proposed approach (which has been developed for
discrete variables) is related to the Partial Mutual Information (PMI) criterion for
a system of continuous random variables [Sharma, 2000].

The rest of this chapter is organised as follows. Section 3.2 develops the PIC criterion for
discrete data, describes a nonparametric confidence measure for it, and tests its
applicability with synthetically generated data with known dependence attributes.
Section 3.3 applies the PIC criterion to the problem of predictor selection for
rainfall occurrence, and defines the aggregate variables from which the predictors
are to be selected. Section 3.4 applies the PIC criterion to identify predictors in
synthetically generated rainfall occurrence datasets, and presents the results for
the application of the method to daily rainfall data from 13 Australian locations. A
more detailed look at the results for Melbourne and Sydney daily rainfall
occurrence is taken in Section 3.5, where identified predictors are used to forecast
the rainfall occurrence, and results are compared with what was observed.
Conclusions from the research are presented in Section 3.6.

3.2 Partial Information

3.2.1 Theoretical background


The mutual information (MI) criterion [Sharma, 2000; Fraser and Swinney, 1986;
Linfoot, 1957] is a measure of dependence that can detect and quantify both linear
and non-linear relationships. Estimation of MI requires accurate characterisation
of both the marginal and joint probability density functions of the variables whose
dependence is being measured. Nonparametric implementation of the MI criterion
ensures that the probability densities, and hence the MI scores, are estimated
without needing to assume the form of dependence between the variables. Mutual
information is related to entropy, and has also been referred to as transinformation
(see Chapman [1986] and Singh [1997]). Sharma [2000] shows that MI performs
better than correlation in detecting and quantifying a range of non-linear
dependence structures, and that it also performs well in quantifying linear
dependence. The MI criterion can therefore quantify a broader range of
underlying dependence structures than correlation-based measures. Further details
on the calculation of MI for two variables are given in Appendix A. The theory for
partial information that is developed in this chapter has its origin in the expression
of MI for three variables.

The mutual information between variables X, P1, and P2, where X denotes the
response and P1, P2 denote the predictor variables, is defined as:

$$MI(X,P_1,P_2) = \iiint f_{X,P_1,P_2}(x,p_1,p_2)\,\log\left(\frac{f_{X,P_1,P_2}(x,p_1,p_2)}{f_X(x)\,f_{P_1}(p_1)\,f_{P_2}(p_2)}\right)dx\,dp_1\,dp_2 \qquad (3.1)$$

where $f_X(x)$, $f_{P_1}(p_1)$ and $f_{P_2}(p_2)$ are the marginal probability density
functions (PDFs) of X, P1, and P2 respectively, and $f_{X,P_1,P_2}(x,p_1,p_2)$ is the
joint (trivariate) PDF of X, P1, and P2.

If X is independent of (P1, P2), then $f_{X,P_1,P_2}(x,p_1,p_2) = f_X(x)\,f_{P_1,P_2}(p_1,p_2)$, and:

$$MI(X,P_1,P_2) = \iiint f_X(x)\,f_{P_1,P_2}(p_1,p_2)\,\log\left(\frac{f_X(x)\,f_{P_1,P_2}(p_1,p_2)}{f_X(x)\,f_{P_1}(p_1)\,f_{P_2}(p_2)}\right)dx\,dp_1\,dp_2$$

$$= \int f_X(x)\,dx \iint f_{P_1,P_2}(p_1,p_2)\,\log\left(\frac{f_{P_1,P_2}(p_1,p_2)}{f_{P_1}(p_1)\,f_{P_2}(p_2)}\right)dp_1\,dp_2$$

$$= MI(P_1,P_2).$$

If X is not independent of (P1, P2), then $MI(X,P_1,P_2) > MI(P_1,P_2)$.

These properties of the MI are used to help develop a criterion to determine the
strength of the partial dependence between P2 and X, after accounting for the
effect of the existing predictor P1. This criterion is named partial information,
PI(X,P2|P1), and is defined as follows:

$$PI(X,P_2|P_1) = MI(X,P_1,P_2) - MI(X,P_1) - MI(P_1,P_2) \qquad (3.2)$$

The partial information in (3.2) represents the partial dependence of X on P2
conditional on the pre-selection of the first predictor P1. This is a measure of
association that represents the joint dependence of all three variables (X, P1, and
P2), from which the dependence of the pre-existing predictor P1 on the other two
variables (X and P2) has been removed.

3.2.2 Sample estimates of partial information

For any given trivariate sample, the MI score in (3.1) can be estimated as:

$$\hat{MI} = \frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{\hat{f}_{X,P_1,P_2}(x_i,p_{1i},p_{2i})}{\hat{f}_X(x_i)\,\hat{f}_{P_1}(p_{1i})\,\hat{f}_{P_2}(p_{2i})}\right) \qquad (3.3)$$

where $(x_i, p_{1i}, p_{2i})$ is the ith trivariate data triplet in a sample of size n, and
$\hat{f}_X(x_i)$, $\hat{f}_{P_1}(p_{1i})$, $\hat{f}_{P_2}(p_{2i})$, and $\hat{f}_{X,P_1,P_2}(x_i,p_{1i},p_{2i})$ are the respective marginal
and joint probability densities estimated at the sample data points.

The partial information in equation (3.2) can be estimated as:

$$\hat{PI}(X,P_2|P_1) = \frac{1}{n}\sum_{i=1}^{n}\left[\log\left(\frac{\hat{f}(x_i,p_{1i},p_{2i})}{\hat{f}(x_i)\,\hat{f}(p_{1i})\,\hat{f}(p_{2i})}\right) - \log\left(\frac{\hat{f}(x_i,p_{1i})}{\hat{f}(x_i)\,\hat{f}(p_{1i})}\right) - \log\left(\frac{\hat{f}(p_{1i},p_{2i})}{\hat{f}(p_{1i})\,\hat{f}(p_{2i})}\right)\right]$$

$$= \frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{\hat{f}(x_i,p_{1i},p_{2i})\,\hat{f}(x_i)\,\hat{f}(p_{1i})\,\hat{f}(p_{1i})\,\hat{f}(p_{2i})}{\hat{f}(x_i)\,\hat{f}(p_{1i})\,\hat{f}(p_{2i})\,\hat{f}(x_i,p_{1i})\,\hat{f}(p_{1i},p_{2i})}\right)$$

$$= \frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{\hat{f}(x_i,p_{1i},p_{2i})\,\hat{f}(p_{1i})}{\hat{f}(x_i,p_{1i})\,\hat{f}(p_{1i},p_{2i})}\right)$$

This leads to the result

$$\hat{PI}(X,P_2|P_1) = \frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{\hat{f}(x_i,p_{2i}|p_{1i})}{\hat{f}(x_i|p_{1i})\,\hat{f}(p_{2i}|p_{1i})}\right) \qquad (3.4)$$

P̂I(X,P2|P1) estimates the partial dependence between X and P2, after accounting
for the effect of the existing predictor P1. The rationale behind partial information
is the definition of dependence (or independence): the conditional joint
probability density function is equal to the product of the two conditional
probability densities if there exists no dependence between X and P2, after
accounting for the effect of P1. The PI score in (3.4) would, in that case, equal
zero (the ratio of the joint and marginal densities being one, and the log of one
being zero). A high value of the PI score indicates dependence between X and P2,
after accounting for the effect of P1. PI is scale invariant, and remains unchanged
if either variable undergoes any linear or non-linear one-to-one transformation.
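To make the plug-in estimate in equation (3.4) concrete, a minimal sketch for discrete data is given below. This code is illustrative only and is not part of the thesis; the function name and interface are my own. Conditional probabilities are estimated by relative frequencies within each stratum of P1.

```python
import math
from collections import Counter

def plugin_pi(x, p2, p1):
    """Plug-in estimate of PI(X, P2 | P1) from equation (3.4) for discrete
    series, with conditional probabilities estimated by relative
    frequencies within each stratum of P1."""
    n = len(x)
    c_xyz = Counter(zip(x, p2, p1))  # counts of (x, p2, p1) triplets
    c_xz = Counter(zip(x, p1))
    c_yz = Counter(zip(p2, p1))
    c_z = Counter(p1)
    total = 0.0
    for xi, yi, zi in zip(x, p2, p1):
        f_xy_z = c_xyz[(xi, yi, zi)] / c_z[zi]  # f(x, p2 | p1)
        f_x_z = c_xz[(xi, zi)] / c_z[zi]        # f(x | p1)
        f_y_z = c_yz[(yi, zi)] / c_z[zi]        # f(p2 | p1)
        total += math.log(f_xy_z / (f_x_z * f_y_z))
    return total / n
```

When X is an exact copy of P2 the estimate equals the conditional entropy H(X|P1) (log 2 nats for a balanced binary P2 within each P1 stratum); for series that are exactly independent in the sample it is zero, and for typical independent samples it fluctuates near zero.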

Equation (3.4) is now generalised to find the PI for the kth predictor given the
existing predictor set z, where z contains P1, P2, ..., Pk-1. For a multivariate
sample where the variables are discrete:

$$\hat{PI}(X,P_k|\mathbf{z}) = \frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{\hat{p}(x_i,p_{ki}|\mathbf{z}_i)}{\hat{p}(x_i|\mathbf{z}_i)\,\hat{p}(p_{ki}|\mathbf{z}_i)}\right) \qquad (3.5)$$

where $(x_i, p_{1i}, p_{2i}, \ldots, p_{ki})$ is the ith multivariate sample data point in a
sample of size n, $\mathbf{z}_i = (p_{1i}, p_{2i}, \ldots, p_{(k-1)i})$, and $\hat{p}(x_i|\mathbf{z}_i)$, $\hat{p}(p_{ki}|\mathbf{z}_i)$, and
$\hat{p}(x_i,p_{ki}|\mathbf{z}_i)$ are conditional probability mass functions estimated at the
sample data points.

When the first predictor is being investigated, the pre-existing predictor set z is
empty and equation (3.5) collapses to the equation for mutual information. If the
sum in equation (3.5) is broken down into separate terms for each possible value
of zi, equation (3.5) becomes a weighted sum of as many MI estimates as there are
possible states of zi (a finite number, as zi is discrete). This is a useful result
because PI becomes easier to calculate, and it allows a confidence measure to be
formed, as described in section 3.2.3.

3.2.3 A nonparametric measure of significance

When a measure of dependence is estimated from a small sample, there is some
uncertainty as to whether the calculated value represents dependence that is
significant, i.e. whether the underlying (population) dependence is greater than
zero. A randomisation test [Maritz, 1981] is used here to test significance: the X
variable is repeatedly resampled without replacement to form a number of
randomised samples that have no dependence between X and the other variables.
If the calculated PI value is greater than the upper 95th percentile of the PI values
from the randomised samples, the calculated PI value is taken to represent
significant dependence; this is expected to happen less than 5% of the time when
the variables are independent. The 95th percentile PI as described here is denoted
PI95 in the discussion that follows. The same definition is extended to mutual
information (MI95) and partial informational correlation (PIC95, described in
section 3.2.5).
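The randomisation test can be sketched as follows (illustrative code, with my own function names), here applied to the mutual information of two discrete series; the same scheme applies unchanged to PI and PIC.

```python
import math
import random
from collections import Counter

def plugin_mi(x, p):
    """Plug-in mutual information (in nats) between two discrete series."""
    n = len(x)
    cx, cp, cxp = Counter(x), Counter(p), Counter(zip(x, p))
    return sum(c / n * math.log(c * n / (cx[a] * cp[b]))
               for (a, b), c in cxp.items())

def significant(x, p, n_shuffles=200, seed=0):
    """Randomisation test of section 3.2.3: x is repeatedly shuffled
    (resampled without replacement), destroying any dependence on p; the
    observed score is deemed significant if it exceeds the 95th
    percentile of the shuffled scores."""
    rng = random.Random(seed)
    observed = plugin_mi(x, p)
    null = []
    for _ in range(n_shuffles):
        xs = list(x)
        rng.shuffle(xs)
        null.append(plugin_mi(xs, p))
    threshold = sorted(null)[int(0.95 * n_shuffles)]
    return observed > threshold, observed, threshold
```

For a noisy copy of p the observed MI is far above the shuffled threshold; for independent series the threshold is exceeded only around 5% of the time.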

3.2.4 An example to illustrate the applicability of MI and PI for
synthetically generated binary random variables

The aim of this section is to illustrate how well the partial information criterion
can identify partial dependence amongst selected discrete random variables. Using
synthetically generated samples of trivariate binary variables (X, P1 and P2 ), the
following three hypotheses are tested:

a. The partial information (PI) is negligible (close to zero) when all three
variables are mutually independent.
b. PI is negligible when X is dependent on P1 only, and independent of P2.
c. PI is non-negligible when X is dependent on both P1 and P2.

The previously defined significance level, equal to the 95th percentile PI based on
a randomised sample, is used as the basis for checking whether the sample
estimate of the PI arising from each of the above test cases is negligible, or is in
fact significant.

Consider equation (3.2). From this equation, MI for dependence between three
variables can be broken into three components:

i. Partial information between X and the second predictor, PI(X,P2|P1).
ii. Mutual information between X and the first predictor, MI(X,P1).
iii. Mutual information between the predictors, MI(P1,P2).

The hypotheses (a), (b) and (c) are tested using five sets of synthetically generated
data, each set consisting of 1000 trivariate data points, details for which are
presented in Table 3.1. Test case 1 represents independent random binary
variables (taking values of 0 or 1), which each have an expected value of 0.25, i.e.
25% of the datapoints are assigned a value of 1. This test case is used to test
condition (a) above. In test case 2, P1 and P2 are independent random binary
variables, each with an expected value of 0.25, and X is dependent on P1 only.
Test case 2 is used to test condition (b). Test case 3 is similar to case 2, except X
is dependent on both P1 and P2 . This case is used to test condition (c). Test case 4
has P1 independent, with P2 dependent on P1 and X dependent on P1 . This case is
used as another test of condition (b). Test case 5 is similar to case 4, except X is
dependent on both P1 and P2 . This case is used as another test of condition (c).

Table 3.1 Test cases for trivariate two-state discrete (binary) data.a

                                              E(P2|P1=·)        E(X|P1=·, P2=·)
Case  Description                    E(P1)    ·=0     ·=1      (0,0)  (0,1)  (1,0)  (1,1)
1     Independent variables          0.25     0.25    0.25     0.25   0.25   0.25   0.25
2     X dependent on P1 only         0.25     0.25    0.25     0.10   0.10   0.80   0.80
      (P1, P2 independent)
3     X dependent on P1 and P2       0.25     0.25    0.25     0.05   0.40   0.70   0.80
      (P1, P2 independent)
4     X dependent on P1 only         0.25     0.10    0.60     0.10   0.10   0.60   0.60
      (P2 dependent on P1)
5     X dependent on P1 and P2       0.25     0.10    0.60     0.10   0.20   0.30   0.70
      (P2 dependent on P1)

a The conditional expected values for each of the three variables (P1, P2, and X) are
shown for each case.
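The test cases of Table 3.1 can be simulated as follows (an illustrative sketch; the function name and interface are my own). Each variable is drawn as a Bernoulli variable with the tabulated conditional expectation.

```python
import numpy as np

def simulate_case(n, e_p1, e_p2, e_x, seed=0):
    """Generate trivariate binary samples from Table 3.1-style conditional
    expectations: e_p2[p1] gives E(P2|P1) and e_x[(p1, p2)] gives
    E(X|P1, P2)."""
    rng = np.random.default_rng(seed)
    p1 = (rng.random(n) < e_p1).astype(int)
    p2 = (rng.random(n) < np.array([e_p2[v] for v in p1])).astype(int)
    x = (rng.random(n) < np.array([e_x[(a, b)]
                                   for a, b in zip(p1, p2)])).astype(int)
    return x, p1, p2
```

For example, test case 2 (X dependent on P1 only) is obtained with e_p2 = {0: 0.25, 1: 0.25} and e_x = {(0,0): 0.1, (0,1): 0.1, (1,0): 0.8, (1,1): 0.8}.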

The results from this analysis are presented in Figure 3.1. It can be seen from the
figure that the estimated MI is negligible for case 1, but is about 0.19 for the other
test cases. PI is a non-negligible component of the three-dimensional MI for cases
3 and 5 only. This result shows that PI meets the requirements of conditions (a),
(b), and (c), and is thus a valid measure of partial dependence. Randomisation
tests showed that the PI in test cases 3 and 5 represents statistically significant
dependence.

[Figure 3.1: bar chart of the components of MI (PI(X,P2|P1), MI(X,P1), and
MI(P1,P2)) for test cases 1 to 5; values range from 0.00 to 0.20.]
Figure 3.1 Components of Mutual Information (MI) between three variables, for
5 test cases. The test cases are described in Table 3.1. The partial information
between X and P2 (PI(X,P2 |P1 )) is non-negligible in test cases 3 and 5.

The randomisation test procedure was also checked, using the null-dependence
cases (cases 1, 2 and 4) from Table 3.1. This verification was carried out on a
large number of synthetic datasets generated randomly from the parent
distributions. Sample sizes of 50, 600, and 1500 were used, with 10 000 trials for
each case/sample size combination. In these trials, the frequency of obtaining a
“false positive” result was between 4% and 4.5% for all the null-dependence test
cases, regardless of sample size. This verified that the randomisation test works
for binary data, with a significance level actually closer to 4% than 5%.

3.2.5 Partial informational correlation (PIC)

Mutual information can be transformed to give a statistic that lies in the range 0 to
1, where 0 represents no dependence and 1 represents perfect dependence. The
rescaled statistic is called informational correlation [Linfoot, 1957], and is
expressed as:

$$\hat{IC} = \sqrt{1 - \exp(-2\,\hat{MI})}$$

This transformation can also be applied to the partial information in (3.5); the
resulting dependence measure for discrete variables is the partial informational
correlation (PIC):

$$\hat{PIC} = \sqrt{1 - \exp(-2\,\hat{PI})} \qquad (3.6)$$

As PIC assumes a range of 0 to 1, and can be thought of as a generic measure of
correlation independent of distributional specifications, all results presented in the
sections that follow use PIC as the measure of dependence. Similarly, PIC95 (see
description in section 3.2.3) is used to indicate whether the estimated value of PIC
is statistically significant or not.
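In Linfoot's [1957] definition the transform includes a square root, which makes IC coincide with |ρ| for a bivariate normal with correlation ρ, since in that case MI = −½ ln(1 − ρ²). A small illustrative sketch (the function name is my own):

```python
import math

def informational_correlation(mi):
    """Linfoot's transform of a mutual information score (in nats) to a
    correlation-like value in the range [0, 1)."""
    return math.sqrt(1.0 - math.exp(-2.0 * mi))

# For a bivariate normal with correlation rho, MI = -0.5*ln(1 - rho**2),
# so the transform recovers |rho| exactly:
rho = 0.8
mi_gauss = -0.5 * math.log(1.0 - rho ** 2)
print(informational_correlation(mi_gauss))  # close to 0.8
```

The transform is monotone in MI, so ranking predictors by IC (or PIC) gives the same ordering as ranking them by MI (or PI).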

3.3 Selection of predictors for rainfall occurrence

3.3.1 Methodology

The partial information criterion is now applied to the problem of predictor
selection for a daily rainfall occurrence model. The aim is to choose a set of
statistically significant predictors from a pre-specified set of candidate predictor
variables. The set of candidate predictors is specified to represent a range of short,
medium and longer-term features in the daily rainfall occurrence sequence. The
candidate predictors for rainfall occurrence adopted are: the length (L) of the
previous dry/wet spell leading up to the current day; the number of wet days in the
last M days where M = 1, 2, 3; a “wetness index” for the last D days where D = 7,
14, 30, 60, and 183 days; and a “wetness index” for the last Y years where Y = 1,
2, 3, 4, 5, 6, 7, 8, 9, and 10 years. Thus 19 candidate predictors are formulated.
These candidate predictors are denoted as L, 1d, 2d, 3d, 7d, 14d, 30d, 60d, 183d,
1y, 2y, 3y, 4y, 5y, 6y, 7y, 8y, 9y, and 10y in the discussion that follows. All except
the first two candidate predictors are “aggregate” variables that describe how wet
it has been over a period of time. To facilitate estimation of the probability mass
associated with these discrete variables, each of the “wetness indexes” (i.e.
candidate predictors 7d to 10y) was represented as a state variable that describes
how wet it has been over the previous period. Each wetness index can take integer
values between 1 and 5, where 1 = very dry, 2 = dry, 3 = average, 4 = wet, and 5 =
very wet. Values are assigned based on comparison with the ranked historical
values of the number of wet days in each period of length D days (or Y years)
ending in the sample being investigated, as shown in Table 3.2. Candidate
predictor L (the length of the previous dry/wet spell) is also a state variable. It can
take integer values between -3 and 3, where -3 = a very long wet spell, -2 = a long
wet spell, -1 = a short wet spell, 1 = a short dry spell, 2 = a long dry spell, and 3 =
a very long dry spell. Values are assigned based on the ranked historical lengths
of the dry/wet spells that end in the sample being investigated, in a way that is
similar to the specification of the values of the wetness index. Dry and wet spells
are ranked separately.

Table 3.2 Description of the D-day or Y-year Wetness Index.a

Wetness state   Description   Number of wet days in last D days or Y years
1               very dry      below the 15th percentile ranked historical value
2               dry           between the 15th and 37th percentile ranked historical values
3               average       between the 38th and 62nd percentile ranked historical values
4               wet           between the 63rd and 85th percentile ranked historical values
5               very wet      above the 85th percentile ranked historical value

a The table shows how the value of a D-day or Y-year Wetness Index is assigned to
a particular day, based on the number of wet days in the last D days or Y years,
where D = 7, 14, 30, 60, or 183 days, and Y = 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 years.
The ranked historical values for the number of wet days are calculated for all
D-day (or Y-year) periods ending in the historical sample (seasonal window) being
investigated.
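A sketch of how the wetness index of Table 3.2 might be computed is given below. This is illustrative only: the function names are my own, and the interior percentile edges (37.5 and 62.5, splitting the 37th/38th and 62nd/63rd percentile boundaries in the table) are assumptions of this sketch.

```python
import numpy as np

def trailing_wet_days(occ, window):
    """Wet-day count over the trailing `window` days, for each day from
    day `window - 1` onward (occ is a 0/1 occurrence series)."""
    return np.convolve(occ, np.ones(window, dtype=int), mode="valid")

def wetness_state(count, historical_counts):
    """Assign a wetness state (1 = very dry ... 5 = very wet, Table 3.2)
    by comparing a wet-day count with the ranked historical counts for
    the same window length."""
    # Band edges at the 15th, 37.5th, 62.5th and 85th percentiles of the
    # ranked historical counts (the interior edges are assumed here).
    edges = np.percentile(historical_counts, [15, 37.5, 62.5, 85])
    return int(np.searchsorted(edges, count, side="right")) + 1
```

In practice the historical counts would be drawn from all D-day (or Y-year) periods ending in the relevant seasonal window, as described in the table footnote.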

Predictors were identified based on seasonally representative samples drawn from
the historical record. Values for the predictand (the daily rainfall occurrence state)
and the corresponding candidate predictors are calculated for every day in these
samples; the first ten years of historical data are excluded from the samples, since
some of the candidate predictors are not defined for this period, ten years being
the largest lag considered in formulating the predictors. The samples consist of
observations falling in pre-specified seasonal windows. A total of 24 seasonal
windows (each window being either 15 or 16 days in width) were used. The
seasonal windows provide a sufficiently large sample for calculation purposes,
while keeping seasonal effects negligible in the sample formed by the window.
The six seasonal windows forming each of the four quarters (September-October-
November, December-January-February, March-April-May, and June-July-
August) have their PIC results averaged to give a single parsimonious predictor
set for each quarter, as described next. The analysis showed that this method
produces consistent results and allows the candidate predictors to be directly
compared against each other.

3.3.2 Stepwise predictor selection algorithm

The predictor identification process is implemented in a stepwise manner. The
algorithm used is a modified version of the algorithm in Sharma [2000], and is
implemented as follows for identification of the predictor set for each quarter.

1. Define z as the array that will store the final predictors of the system. z is a
null vector at the beginning of the calculations.
2. Repeat the following steps as many times as are needed:
i. Estimate the PIC for the dependent variable (rainfall occurrence on the
current day) and each of the plausible new predictors, conditional on the
pre-existing predictor set z. Estimate the value of PIC95, the 95th
percentile randomised sample PIC value. Do this step for each of the six
seasonal windows in the quarter.
ii. Identify the predictor having the highest value of PIC/PIC95 for the
quarter, calculated as (average PIC)/(average PIC95) from the six
seasonal windows that form the quarter.
iii. If the value of PIC/PIC95 for the identified predictor is one or greater,
include this predictor in the predictor set z. If the dependence is not
significant (i.e. the average value of PIC/PIC95 is less than one), go to
step 3.
3. This step will be reached only when all the predictors have been identified.

Note that the above algorithm calculates four different predictor sets, one for
each quarter of the year.
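The algorithm can be sketched for a single sample as follows (illustrative code with my own names; the seasonal-window averaging of step 2 is omitted). Because PIC is a monotone transform of PI, comparing PI against its randomised 95th percentile yields the same accept/reject decision as comparing PIC against PIC95.

```python
import math
import random
from collections import Counter

def pi_score(x, p, z_list):
    """Plug-in estimate of PI(X, P | z) for discrete series (equation 3.5);
    z_list holds the already-selected predictor series (possibly empty)."""
    n = len(x)
    z = list(zip(*z_list)) if z_list else [()] * n
    c_xpz, c_xz = Counter(zip(x, p, z)), Counter(zip(x, z))
    c_pz, c_z = Counter(zip(p, z)), Counter(z)
    # p(x,p|z)/(p(x|z)p(p|z)) reduces to c_xpz * c_z / (c_xz * c_pz)
    return sum(
        math.log(c_xpz[(xi, pi, zi)] * c_z[zi] / (c_xz[(xi, zi)] * c_pz[(pi, zi)]))
        for xi, pi, zi in zip(x, p, z)) / n

def stepwise_select(x, candidates, n_shuffles=100, seed=0):
    """Greedy sketch of the section 3.3.2 algorithm for a single sample:
    repeatedly add the candidate whose score most exceeds its randomised
    95th-percentile threshold; stop when no candidate is significant."""
    rng = random.Random(seed)
    selected, remaining = [], dict(candidates)
    while remaining:
        z_list = [candidates[s] for s in selected]
        best, best_ratio = None, -1.0
        for name, p in remaining.items():
            score = pi_score(x, p, z_list)
            null = []
            for _ in range(n_shuffles):
                xs = list(x)
                rng.shuffle(xs)  # resampling without replacement
                null.append(pi_score(xs, p, z_list))
            threshold = sorted(null)[int(0.95 * n_shuffles)]
            ratio = score / threshold if threshold > 0 else float("inf")
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best_ratio < 1.0:
            break
        selected.append(best)
        del remaining[best]
    return selected
```

With a predictand that is a noisy copy of one candidate, that candidate is selected first with a ratio far above one, and the loop terminates once the remaining candidates fall below their randomised thresholds.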

3.4 Application to daily rainfall data

3.4.1 Testing with synthetically generated daily rainfall data


Before applying the PIC predictor selection method to real rainfall occurrence
data, results using synthetically generated daily rainfall data with a known
dependence structure are presented. These tests were designed to show that: a) no
predictors are selected when all variables are known to be independent of each
other; and b) the correct predictor (rainfall occurrence on the previous day) is
selected when lag-one dependence exists.

Two sets of test simulations were performed. In the first set, the predictor
identification method was tested on synthetic binary data with no dependence. For
these independence cases, synthetic sequences of both 40 and 140 years in length
were formed, with a fraction of wet days in each pair of sequences of 5%, 10%,
20%, 30%, 40%, and 50%, respectively. Because this data is synthetic and non-
seasonal, and because the predictor identification procedure outlined in section 3.3
finds predictors for each of four quarters, the four quarters provide four
independent trials of the procedure for a given test case. In none of these trials
was a predictor identified as significant, according to the significance criterion
described earlier in section 3.2.3. This was as expected; however, the fact that no
“false positive” results occurred in a total of 48 trials (12 sequences × 4 quarters)
indicates that the true significance level given by the algorithm in section 3.3.2
(which involves both randomisation tests and averaging) could be less than 5%.

The second set of test sequences involved daily rainfall occurrence time series
generated so as to have lag-one dependence in the data. The results from this
testing are shown in Table 3.3. A range of lag-one dependence structures was
tested, with synthetic sequences of both 40 and 140 years in length. The columns
in the table that are titled “p(wet|wet)” and “p(wet|dry)” show the conditional
probabilities of rainfall occurrence, given the rainfall state on the previous day.
“Wet %” is the sample fraction of wet days in the synthetic sequence. The two
columns for each quarter show the first and second predictor choices of the
selection algorithm, along with the corresponding PIC/PIC95 value, which
represents the significance of the selected predictors. When this ratio is one or
greater, the predictor is significant at the 5% level, for each of the four quarters.
Again, because this test data is synthetic and non-seasonal, the four quarters
provide four independent trials of the procedure for a given test case. Note that
strong lag-one dependence exists if the conditional probability of a wet day
following a dry day is very different from the conditional probability of a wet day
following a wet day. If the two probabilities are roughly the same, the dependence
structure is weak, as a wet day is equally likely to be generated irrespective of
whether the previous day was dry or wet. Sequences 1, 6, 11 and 16 represent the
four test cases where lag-one dependence is strongest. Sequences 5, 10, 15 and 20
represent the weakest lag-one dependence tested. The notation used to denote the
predictors in the table is described in section 3.3.1.

Table 3.3 Selected predictors for test sequences formed from lag-one dependent
binary data.a

Case  p(wet|wet)  p(wet|dry)  Wet %  Length   Sep-Nov           Dec-Feb           Mar-May           Jun-Aug
                                     (years)
1     0.5         0.05        0.09    40      1d 3.5, 7y 0.7    1d 3.5, 5y 0.8    1d 3.5, 1y 0.8    1d 3.0, 30d 0.8
2     0.5         0.1         0.17    40      1d 3.6, 9y 0.9    1d 3.2, 30d 0.9   1d 3.3, 7y 0.8    1d 3.5, 183d 0.8
3     0.5         0.2         0.29    40      1d 3.0, 9y 0.9    1d 3.1, L 0.8     1d 3.5, 14d 0.8   1d 3.0, 7y 0.9
4     0.5         0.3         0.37    40      1d 2.5, 6y 0.9    1d 2.0, 1y 0.8    1d 2.2, 7d 0.8    1d 2.2, 9y 0.8
5     0.5         0.4         0.45    40      1d 1.1, 5y 0.9    1d 1.4, 183d 0.8  1d 0.8, 14d 0.7   1d 0.9, 7y 0.8
6     0.5         0.05        0.09   140      1d 8.1, 10y 0.8   1d 7.3, 7y 0.9    1d 7.1, 4y 0.8    1d 7.7, 8y 0.9
7     0.5         0.1         0.17   140      1d 8.1, 8y 0.8    1d 7.7, 5y 0.8    1d 7.7, 10y 0.9   1d 7.9, 2y 0.7
8     0.5         0.2         0.29   140      1d 6.5, 6y 0.8    1d 6.9, 14d 0.8   1d 6.1, 1y 0.8    1d 6.6, 5y 0.8
9     0.5         0.3         0.37   140      1d 4.3, 6y 0.7    1d 5.0, 60d 0.9   1d 4.1, 1y 0.8    1d 4.7, 7y 0.7
10    0.5         0.4         0.45   140      1d 2.4, 14d 0.8   1d 2.0, 60d 0.9   1d 1.9, 5y 0.8    1d 2.3, 183d 0.8
11    0.2         0.02        0.025   40      1d 1.9, 1y 0.8    1d 1.7, 2y 0.8    1d 1.3, 7d 0.8    2d 1.2, 1d 0.8
12    0.2         0.03        0.036   40      1d 1.6, 6y 0.8    1d 1.5, 6y 0.7    1d 1.7, 183d 0.7  1d 1.7, 3d 0.8
13    0.2         0.05        0.06    40      1d 1.2, 8y 0.8    1d 1.8, 183d 0.8  1d 1.4, 2y 0.8    1d 1.3, 6y 0.8
14    0.2         0.1         0.11    40      183d 0.7, 10y 0.9 1d 1.0, 7y 0.8    1d 1.0, 2y 0.8    1d 0.9, 10y 0.8
15    0.2         0.15        0.16    40      183d 0.8, 7y 0.9  7y 0.9, 1y 0.9    L 0.8, 5y 0.9     4y 0.7, 30d 0.9
16    0.2         0.02        0.025  140      1d 2.5, 9y 0.8    1d 3.3, 8y 0.9    1d 3.2, 8y 0.9    1d 3.2, 6y 0.8
17    0.2         0.03        0.036  140      1d 2.7, 3y 0.8    1d 2.3, 1y 0.8    1d 2.6, 10y 0.8   1d 2.5, 183d 0.8
18    0.2         0.05        0.06   140      1d 2.6, 2y 0.9    1d 2.3, 60d 0.8   1d 2.3, 8y 0.8    1d 2.5, 3y 0.9
19    0.2         0.1         0.11   140      1d 1.9, L 0.8     1d 2.4, 9y 0.9    1d 2.2, 8y 0.8    1d 2.3, 10y 0.8
20    0.2         0.15        0.16   140      1d 0.9, 30d 0.9   2d 0.9, 7y 0.8    1d 0.7, 60d 0.8   1d 1.3, 3d 0.8

a Each test sequence is defined by the conditional probabilities shown in the
second and third columns, and by its length. Each quarter cell lists the two most
significant predictors, each followed by the associated value of the PIC/PIC95
statistic. A PIC/PIC95 value greater than one indicates that the predictor is
significant at the 5% level.

There are three main “errors” of interest to look for when studying the results in
Table 3.3:

1. A significant predictor might not be found, when dependence does in fact
exist. This occurred in test sequences 5, 15, and 20, for some quarters. Note,
however, that these cases involve the weakest lag-one dependence of all the
cases evaluated.
2. If a predictor other than 1d is selected, this is an “error”. This occurred in test
sequence 11 of Table 3.3 (wet % = 0.025, length = 40 years), in the June-
August quarter. This result could be due to sampling variability, as the
selected predictor in this case is 2d (the number of wet days in the past 2
days), a variable that contains almost the same information as the variable 1d.
3. A second significant predictor might be found. This should not occur, as none
of the test sequences in Table 3.3 involve lag-two dependence. It can be seen
from the tables that this type of error did not occur.

Note that the occurrence of error type 1 does not invalidate the methodology. On
the contrary, the results in Table 3.3 show that the methodology works. There is a
threshold level of significant dependence, above which the correct predictors are
selected, and below which no predictors are selected. Near the threshold, the
dependence may or may not be detected; sampling variability accounts for the
variation between quarters for sequences 5, 15, and 20. The threshold level of
dependence depends on the sample size. It can be seen that the PIC/PIC95 ratios in
the table are greater for the 140-year sequences than for the 40-year sequences.
Thus the power of the method to identify the presence of dependence increases as
the length of record increases. Also note that random variations due to sampling
variability play a role in “choosing” the values in Table 3.3, especially for those
predictors where PIC/PIC95 is low. However (with the minor exception noted in
the description of error type 2), the algorithm of section 3.3.2 ensures that the
significant predictors (those where PIC/PIC95 is greater than one) are in fact
correctly identified. These random variations also affect the non-significant results
for the rainfall occurrence data that are given in the next section (Table 3.4).

It has been demonstrated that the predictor identification criterion was able to
identify correctly the known dependence attributes in the synthetic samples used.
While, for simplicity, the examples considered data sets with a Markov order 1
dependence structure, the dependence criterion is generic and should be able to
identify any significant dependence between the candidate predictors and rainfall
occurrence on the current day. This is demonstrated in the next section with
rainfall occurrence data from 13 Australian locations.

3.4.2 Identification of predictors for rainfall occurrence


This section presents the results from the application of this methodology to daily
rainfall occurrence from 13 locations in Australia. The locations selected were all
but two of those used by Srikanthan and McMahon [1985], plus Tenterfield in
northern New South Wales. Gap-free records were selected; however a small
amount of gap filling from nearby stations was done for some of the locations. A
threshold of 0.3 mm was used to define a “wet day” [after Buishand, 1978].

The results from this analysis are presented in Table 3.4. Shown are the three most
significant predictors for daily rainfall occurrence that were identified, for each of
four quarters. The PIC/PIC95 ratio is also shown; the predictors identified as
significant at the 5% level are those with PIC/PIC95 ≥ 1.0. In the table, “wet %”
is the historical fraction of days in each season with 0.3 mm or more of recorded
rainfall. From “wet %” one can see that Darwin has the wettest summer and driest
winter, Perth has the driest summer and wettest winter, Alice Springs is driest
overall, and Melbourne is wettest overall. Despite these differences, the PIC
predictor identification method produces sensible results at all these locations.

Table 3.4 Selected predictors for rainfall occurrence from 13 locations in Australia.a

Adelaide (1853-1977), 125 years:
  Sep-Nov (wet % 0.37): 1d 5.96, 9y 0.78, 5y 0.86
  Dec-Feb (wet % 0.16): 1d 5.72, 7y 0.77, 7d 0.88
  Mar-May (wet % 0.30): 1d 7.16, 2d 1.02, 4y 0.87
  Jun-Aug (wet % 0.51): 1d 6.93, 2y 0.86, 5y 0.93
Alice Springs (1908-1998), 91 years:
  Sep-Nov (wet % 0.11): 1d 4.85, 3d 0.98, 4y 0.90
  Dec-Feb (wet % 0.15): 1d 5.55, 2d 1.17, 2y 0.94
  Mar-May (wet % 0.08): 1d 5.63, 7d 1.12, 1y 0.87
  Jun-Aug (wet % 0.07): 1d 5.32, 2d 1.06, 30d 0.91
Brisbane (1887-1992), 106 years:
  Sep-Nov (wet % 0.28): 1d 5.06, 60d 0.99, 1y 0.94
  Dec-Feb (wet % 0.40): 1d 5.92, 2d 0.99, 2y 0.95
  Mar-May (wet % 0.36): 1d 7.36, 2d 1.39, 7y 0.92
  Jun-Aug (wet % 0.23): 1d 6.08, 4y 0.90, 30d 0.90
Broome (1941-1998), 58 years:
  Sep-Nov (wet % 0.03): 1d 1.83, 7d 1.05, 10y 0.89
  Dec-Feb (wet % 0.30): 1d 4.63, 2d 1.19, 1y 0.81
  Mar-May (wet % 0.12): 1d 3.96, 2d 1.13, 7d 0.88
  Jun-Aug (wet % 0.05): 1d 2.97, 6y 0.87, 183d 0.89
Cowra (1944-1990), 47 years:
  Sep-Nov (wet % 0.29): 1d 3.12, 2y 0.78, 9y 0.91
  Dec-Feb (wet % 0.21): 1d 2.97, 9y 0.77, 5y 0.90
  Mar-May (wet % 0.22): 1d 4.10, 10y 0.87, 30d 0.91
  Jun-Aug (wet % 0.34): 1d 3.62, 14d 0.79, 60d 0.89
Darwin (1872-1940), 69 years:
  Sep-Nov (wet % 0.18): 1d 2.34, 7d 0.93, 2y 0.93
  Dec-Feb (wet % 0.59): 1d 3.59, L 1.21, 60d 0.91
  Mar-May (wet % 0.29): 1d 3.96, 14d 1.11, 4y 0.92
  Jun-Aug (wet % 0.01): 1d 3.03, 60d 0.79, 7y 0.87
Kalgoorlie (1940-1998), 59 years:
  Sep-Nov (wet % 0.14): 1d 2.87, 183d 0.84, 3y 0.88
  Dec-Feb (wet % 0.12): 1d 3.23, 2d 0.88, 5y 0.87
  Mar-May (wet % 0.18): 1d 3.88, 60d 0.90, 1y 0.91
  Jun-Aug (wet % 0.30): 1d 3.43, 60d 1.25, 3d 0.94
Mackay (1889-1948), 60 years:
  Sep-Nov (wet % 0.18): 1d 4.13, 183d 0.93, 9y 0.94
  Dec-Feb (wet % 0.43): 1d 5.72, L 0.93, 183d 0.89
  Mar-May (wet % 0.42): 1d 6.10, 3d 1.01, 60d 0.88
  Jun-Aug (wet % 0.19): 1d 4.84, 2d 0.94, 7y 0.89
Melbourne (1855-1998), 144 years:
  Sep-Nov (wet % 0.45): 1d 5.68, 5y 0.87, 2d 0.92
  Dec-Feb (wet % 0.29): 1d 6.01, 6y 0.86, 183d 0.95
  Mar-May (wet % 0.39): 1d 6.30, 10y 0.99, 60d 0.89
  Jun-Aug (wet % 0.52): 1d 4.20, 4y 1.21, 30d 0.93
Monto (1930-1991), 62 years:
  Sep-Nov (wet % 0.18): 1d 3.08, 2d 0.87, 1y 0.87
  Dec-Feb (wet % 0.31): 1d 4.28, 4y 0.89, 14d 0.91
  Mar-May (wet % 0.23): 1d 4.68, 30d 0.96, 5y 0.91
  Jun-Aug (wet % 0.15): 1d 3.51, 6y 0.92, 2y 0.92
Perth (1880-1990), 111 years:
  Sep-Nov (wet % 0.37): 1d 5.99, 1y 0.87, 3y 0.87
  Dec-Feb (wet % 0.11): 1d 4.13, 1y 0.81, 4y 0.88
  Mar-May (wet % 0.27): 1d 6.42, 8y 0.73, 14d 0.90
  Jun-Aug (wet % 0.58): 1d 7.90, 3y 0.85, 2y 0.86
Sydney (1859-1998), 140 years:
  Sep-Nov (wet % 0.37): 1d 6.12, 183d 0.93, 60d 0.92
  Dec-Feb (wet % 0.39): 1d 5.93, 60d 0.88, 4y 0.89
  Mar-May (wet % 0.43): 1d 8.05, L 1.08, 2y 0.91
  Jun-Aug (wet % 0.37): 1d 7.85, 9y 1.15, 3d 0.97
Tenterfield (1935-1998), 64 years:
  Sep-Nov (wet % 0.27): 1d 3.41, 3y 0.88, L 0.87
  Dec-Feb (wet % 0.36): 1d 4.73, 3y 0.88, 1y 0.92
  Mar-May (wet % 0.28): 1d 4.58, 183d 0.88, 9y 0.99
  Jun-Aug (wet % 0.22): 1d 4.26, 183d 0.86, 6y 0.93

a The three most significant predictors are shown in order for each quarter, each
followed by the associated value of the PIC/PIC95 statistic. A PIC/PIC95 value
greater than one indicates that the predictor is significant at the 5% level.

Rainfall occurrence on the previous day (i.e. candidate predictor 1d) is identified
as a significant predictor for every location. The results showed that when the first
predictor was being chosen, the 1d predictor easily outperformed all other
candidate predictors at all the locations tested. However, when a second
predictor was chosen, the difference in performance between the candidate
predictors was not as clear-cut. Additionally, in no case were more than two
predictors identified as significant at the 5% level. The location that came closest
to having a third significant predictor was Sydney in winter (predictor 3d,
PIC/PIC95 = 0.97).

The selection of 1d (rainfall occurrence on the previous day, which can either be
“dry” or “wet”) means that the expected value of the rainfall occurrence state
depends on whether the previous day is dry or wet. If 4y is selected, as it is for
Melbourne in winter, then the expected value of the rainfall occurrence state
depends on whether the previous four years were very dry, dry, average, wet, or
very wet. This information is useful in formulating a conditional one-day-ahead
forecast of rainfall occurrence.

The results in Table 3.4 show that rainfall occurrence on the previous day is the
best single predictor for rainfall occurrence on the current day. This is an implicit
assumption in rainfall occurrence models that generate occurrence one day at a
time, such as a Markov chain. Note that a combination of candidate predictors 1d
and 2d is equivalent to the predictors used in a traditional second-order Markov
chain model. The results indicate that a first-order or second-order Markov chain
model should provide a reasonable fit to the short-term historical features of most
of the rainfall occurrence records. However, predictor combinations that were
identified for some locations are quite different to the predictors used in a
traditional Markov chain model. In all the cases where more than one predictor is
identified as significant, a model that uses these identified multiple predictors
should have the potential to better reproduce the short-term historical features of
the rainfall record. This is tested in the next section of this chapter, where mean
square errors for a one-day-ahead prediction for both Sydney and Melbourne are
calculated.
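The counting estimate that underlies a first-order Markov chain of occurrence can be made concrete. The sketch below is illustrative only (the 0/1 series is a toy example, not one of the rainfall records): it estimates the two conditional probabilities P(wet today | dry yesterday) and P(wet today | wet yesterday) by counting transitions.

```python
def markov1_transitions(x):
    """Estimate first-order Markov chain transition probabilities
    P(wet today | state yesterday) by counting transitions in a
    0/1 daily occurrence series (0 = dry, 1 = wet)."""
    counts = {0: [0, 0], 1: [0, 0]}  # counts[yesterday][today]
    for prev, today in zip(x, x[1:]):
        counts[prev][today] += 1
    return {prev: c[1] / sum(c) for prev, c in counts.items() if sum(c)}

# toy occurrence series: 1 = wet, 0 = dry
x = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0]
print(markov1_transitions(x))  # {0: 0.4, 1: 0.5}
```

A second-order chain (predictors 1d and 2d) extends the same idea by counting transitions conditional on the states of the previous two days.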

3.5 Forecasting rainfall occurrence using the identified predictors
In order to test the utility of the selected predictors in forecasting the rainfall
occurrence state, a leave-one-out cross validation analysis of rainfall occurrence
forecasts was conducted. This involved predicting the rainfall one day at a time
for the full historical record using a simple forecast approach, and comparing the
predicted rainfall state with what was observed in reality. The approach adopted
consisted of using the current state of the predictors to conditionally generate one
hundred forecasts of the rainfall occurrence state on the current day, the prediction
model being formulated based on all observations in a seasonal subset of the
historic record except the observations from the year corresponding to the day
being predicted. Such an approach, known as leave-one-out cross validation,
allows the model to be tested on data points not used in model development,
hence allowing one to assess both the accuracy as well as the parsimony of the
model. The model used to produce these forecasts is the resampling model that is
described in section 4.2.
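The structure of this calculation can be sketched for a single calendar day. The fragment below substitutes the simplest possible predictor (the unconditional mean over the other years) for the resampling model, so it illustrates only the leave-one-out mechanics, not the conditional forecasts themselves; the 0/1 series is a toy example.

```python
def loocv_mse_day(states_by_year):
    """Leave-one-out error for one calendar day: each year's state is
    'predicted' as the mean state over all the other years, the
    simplest unconditional stand-in for the resampling forecast."""
    n = len(states_by_year)
    total = 0.0
    for j, x in enumerate(states_by_year):
        others = states_by_year[:j] + states_by_year[j + 1:]
        pred = sum(others) / len(others)  # plays the role of E_(-j)
        total += (x - pred) ** 2
    return total / n

# occurrence (0 = dry, 1 = wet) of one calendar day over 5 years
print(loocv_mse_day([0, 1, 1, 0, 1]))  # 0.375
```

In the thesis the per-day prediction is instead the average of one hundred conditionally resampled states, but the leave-one-out bookkeeping is the same.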

With the dry state denoted as "0" and the wet state as "1", the leave-one-out
cross validation predictions were compared to the true rainfall state, as inferred
from the historical record. If the one-day-ahead predictions are working well, one
expects the difference between the actual state and the average predicted state to
be small. The measure of error used for the ith day of the year is:

MSE_i = (1/n) Σ_{j=1}^{n} ( x_{j,i} − E_{(−j)}( x | z_{j,i} ) )²        (3.7)

where MSE_i denotes the mean square error, n is the number of years of
historical data used in the calculation, x_{j,i} is the observed rainfall occurrence
state ("0" representing dry and "1" representing wet) on the ith day of year j, and
E_{(−j)}( x | z_{j,i} ) is the conditional expectation, estimated here as the average
value of the predicted rainfall state, conditional on z_{j,i}, the predictor variables
associated with x_{j,i}. The prediction model is formulated so as to use all
observations in a seasonal subset of the historic record except those from the year
for which the prediction is made (i.e. year j).

If the fraction of wet days on the ith day of the year is denoted w_i, an exact
expression can be written for the MSE expected when the rainfall occurrence
state is predicted unconditionally (i.e. by resampling a state at random from the
historical sample for that day). The resulting MSE (denoted MSE0 to distinguish
it from the MSE obtained with a conditional prediction approach) is:

MSE0_i = w_i(1 − w_i) + (1 − w_i)w_i = 2w_i(1 − w_i)        (3.8)
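A quick numerical check of this expression: predicting by an independent Bernoulli draw with wet probability w gives a mean square error of 2w(1−w), since an error of 1 occurs exactly when the observed and predicted states differ. The sketch below (illustrative values only) compares the closed form with a Monte Carlo estimate.

```python
import random

def mse0_exact(w):
    """Expected MSE when the state is predicted by an unconditional
    random draw with wet probability w (eq. 3.8)."""
    return 2.0 * w * (1.0 - w)

def mse0_simulated(w, trials=200000, seed=1):
    """Monte Carlo check: observed state and predicted state are both
    independent Bernoulli(w) draws; the squared error of 0/1 values
    is 1 exactly when the two draws disagree."""
    rng = random.Random(seed)
    err = sum((rng.random() < w) != (rng.random() < w)
              for _ in range(trials))
    return err / trials

print(round(mse0_exact(0.35), 3))   # 0.455
print(round(mse0_simulated(0.35), 2))
```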

The values of MSE0 and MSE obtained using various combinations of predictors
for the forecasts for Melbourne and Sydney are presented in Tables 3.5 and 3.6.
Values from all days in a season are averaged to give the results shown in the
tables. The predictor combinations are:

a. No predictors (i.e. unconditional resampling)
b. One predictor (i.e. rainfall occurrence yesterday, 1d)
c. Two predictors for each quarter, from Table 3.4
d. Three predictors for each quarter, from Table 3.4.

Table 3.5 Mean Square Error (MSE) for forecasts of Melbourne rainfall occurrence.a

Predictor combination       Sep-Nov   Dec-Feb   Mar-May   Jun-Aug   annual
MSE0                         0.490     0.403     0.463     0.499     0.464
Unconditional resampling     0.490     0.404     0.463     0.500     0.464
One predictor                0.459     0.372     0.427     0.477     0.434
Two predictors               0.459     0.371     0.426     0.474     0.432
Three predictors             0.458     0.371     0.428     0.475     0.433

a A lower number indicates that the associated predictor combination gives a better forecast.

Table 3.6 Mean Square Error (MSE) for forecasts of Sydney rainfall occurrence.a

Predictor combination       Sep-Nov   Dec-Feb   Mar-May   Jun-Aug   annual
MSE0                         0.467     0.475     0.491     0.465     0.475
Unconditional resampling     0.467     0.476     0.491     0.465     0.475
One predictor                0.433     0.439     0.427     0.397     0.424
Two predictors               0.433     0.438     0.427     0.395     0.423
Three predictors             0.434     0.439     0.427     0.393     0.423

a A lower number indicates that the associated predictor combination gives a better forecast.

The results show that the relative reduction in MSE due to using one predictor
(rainfall occurrence yesterday) instead of unconditional resampling is large. This
reinforces the conclusion from the previous section that rainfall occurrence on the
previous day is the single best predictor for rainfall occurrence on the current day.
Incorporating this predictor into the forecast model significantly reduces the error
of the forecast. The results for Melbourne show that two predictors produce a
lower prediction error than one predictor in June-August. For this quarter, the
predictor selection methods using PIC identify two predictors as significant (cf.
Table 3.4). Two predictors are also slightly better than one predictor in December-
February and March-May. For March-May, the predictor selection algorithm was
bordering on selection of two predictors. The use of more than two predictors is
not justified (except, possibly, in September-November), as the prediction errors
are not lower in the three predictors case than in the two predictors case. With the
exception of September-November, this agrees with the result from the predictor
selection algorithm.

The results for Sydney show that the use of more than two predictors is not
justified in September-November, December-February, and March-May, as the
MSE values are not lower in the three predictors case than in the two predictors case. In
June-August, the MSE results suggest that three predictors should be used. This
agrees with the result from the predictor selection algorithm. For June-August,
Table 3.4 shows results that are on the borderline of selecting three predictors (as
noted in the discussion after the table). The prediction errors suggest a possible
modification to the PIC predictor identification method: the significance level
used in the procedure could be relaxed from 5% to 10%, i.e. PIC90 could replace
PIC95 in the algorithm. This modification would pick out the same predictors as
shown in Table 3.4; only the PIC/PIC95 values would change. However, it is a
conservative choice to leave the algorithm as it is, which is the approach adopted
here. Also note that Tables 3.5 and 3.6 show that any improvements in MSE that
result from using more than one predictor are small. While the improvements in
MSE correspond with the PIC predictor selection results, in practice such
marginal improvements in MSE may not justify the use of higher-order models.

Alternative predictor sets were also tested. They were chosen to represent specific
time scales (daily, seasonal, annual and inter-annual). It was found that the
predictor combinations identified using PIC have equal or lower prediction errors
than the alternative predictor combinations. This result provides a limited
validation of the PIC predictor identification method for prediction of rainfall
occurrence one day ahead of the present.

3.6 Conclusions
This chapter has presented a new measure of dependence that is named partial
informational correlation (PIC), and applied PIC to discrete time series data. PIC
is a partial measure of dependence derived from mutual information theory, which
is sensitive to both linear and non-linear dependence. PIC is used here in an
approach that identifies predictors for forecasting rainfall occurrence. A set of
candidate predictors for daily rainfall occurrence is formulated, solely from
previous values in the sequence. The approach is tested at 13 locations from
around Australia. Despite major differences in the rainfall occurrence patterns, the
PIC predictor identification methods produce sensible results at all these
locations. The selected predictors are validated using a one-day-ahead forecast,
for the Melbourne and Sydney data.

The predictor identification methods in this chapter use PIC as the criterion for
selecting predictors of rainfall occurrence on the current day. What is being done,
in effect, is to use information drawn only from the historical time series of
rainfall occurrence to predict whether today is going to be dry or wet. It is
therefore appropriate to validate the performance of the selected predictors using
leave-one-out cross validation analysis of rainfall occurrence forecasts, as was
done here. The conclusion from this analysis is that the PIC predictor
identification method gives a valid predictor set for the short-term prediction of
daily rainfall occurrence. The method gives a conservative choice for the
predictor set; the 5% significance level used at a crucial step within the algorithm
means that if an "error" occurs, the method errs on the side of parsimony rather
than over-specification of predictors.
The method is a nonparametric alternative to the use of traditionally used order
selection techniques such as the Akaike Information Criterion or the Bayesian
Information Criterion, as no specific assumptions are made regarding the
probability distributions of the variables involved. The variables that are
considered include “aggregate” variables that represent the longer-term features of
the historic record. It should be noted, however, that the PIC criterion is only
formulated to identify the effects of these longer-term features on a one-day ahead
forecast of rainfall occurrence. The simulation of longer-term variability (which is
most visible at seasonal or annual time scales) is the subject of the next chapter.

There are a number of potential shortcomings of the predictor selection methods


using PIC:

1. The procedure is stepwise. There is a possibility that two (or more) predictors
may exist that, when combined, could capture more of the dependence
structure than the predictors identified by the stepwise procedure. However
this possibility is considered remote for two reasons. Firstly, the 1d predictor
(rainfall occurrence yesterday) is always identified as the first predictor to be
selected under the methodology. Common sense dictates that this predictor
should be included in the model anyway, and PIC allows assessment of the
additional benefit of other variables after taking into account the effect of 1d.
Secondly, across the diverse range of locations that were examined, no more
than two or three predictors are ever identified as significant.
2. The true significance level (α) of the procedure is hard to determine. The tests
with synthetic data in section 3.4.1 indicate that α for the procedure may be
slightly less than 5%.
3. The procedure works in choosing predictors for a forecasting model where
the focus is on the state of rainfall one day ahead. However it will be
demonstrated in the next chapter that it does not work well for a stochastic
generation model, designed to produce long synthetic sequences that
reproduce historical long-term variability and features such as drought or
extended wet periods.
4. The methods are data-intensive, and hence computational requirements can
be significant. This is a consequence of using data-driven nonparametric
methods, which avoid the need to make specific assumptions about the
distribution of the data.

The longer-term dependence that is associated with historical features such as
droughts and sustained periods of high rainfall, and generation of long sequences
of synthetic data that reproduce these features, are of particular interest in this
thesis. A different method for measuring the performance of a generation model
based on the assessment of a large number of long generated sequences is
proposed in the next chapter. The development of the model for stochastically
generating rainfall occurrences is also described.

4. A Nonparametric Model for Daily Rainfall
Occurrence that Reproduces Long-term Variability

4.1 Introduction
Chapter 3 looked at selection of predictors for a conditional daily rainfall
occurrence forecast model. A set of candidate predictors based on previous values
in the rainfall occurrence series was proposed. The candidate predictors were
selected so as to characterise short-term dependence using rainfall occurrence at
short time lags from the present, and longer-term dependence via predictors that
describe the state of wetness over a longer length of time. A method for selecting
predictors using partial informational correlation (PIC) was described, and applied
to select predictor sets for 13 Australian locations having long rainfall records.
The PIC predictor selection criterion was designed to give good one-day-ahead
forecasts of the rainfall occurrence state.

This chapter builds on the work of chapter 3, and presents a model for stochastic
generation of daily rainfall occurrence, formulated so as to reproduce longer-term
variability and low-frequency features such as drought and sustained wet periods,
while still reproducing characteristics at daily time scales. The model is termed
“ROG” to denote “Rainfall Occurrence Generator”. An alternative approach for
identifying predictors for use in the daily rainfall occurrence model is also
presented. This alternative approach is in contrast to the PIC predictor selection
criterion presented in chapter 3, and offers a less mathematical and more intuitive
approach designed specifically for representation of historical variability in the
simulations. This approach is based on a stepwise selection of short-term,
medium-term, long-term, and very-long-term predictors. The predictor selection is
made at each step of this procedure by analysis of a range of statistical
characteristics of sequences generated by a model incorporating the selected
predictors, compared to the statistical characteristics of the historical record. It
must be pointed out that the work presented here is the first stage of a two-stage
modelling process; it involves generating the entire sequence of rainfall
occurrence (characterisation of all days as either dry or wet). The second stage of
the modelling process, which is not discussed here, will involve generating the
rainfall amount on all wet days in the sequence.

The stochastic model for generation of daily rainfall occurrence that is presented
in this chapter removes many of the limitations of traditional approaches that have
been noted in chapter 2. The proposed model is nonparametric, i.e., no specific
assumptions are made about the probability distribution or the nature of
dependence between variables. The model uses a Markovian approach, assuming that
a finite number of previous values in the sequence are sufficient to characterise
the rainfall state on the next day. This approach is espoused by several other daily
rainfall occurrence models [e.g. Gabriel and Neumann, 1962; Chin, 1977; Roldan
and Woolhiser, 1982], except that the proposed model provides an improved
representation of both the rainfall occurrence process, when viewed at longer
(seasonal, annual and inter-annual) time scales, and the seasonal variation of
rainfall within a year. Seasonality is represented using the moving window
approach described in chapter 2. Improved representation of the occurrence
process is achieved through the use of three “aggregate” conditioning variables
chosen specifically to impart the longer-term variability that is of interest in this
application. These longer-term predictors all act as conditioning variables,
however they are not external to the generated rainfall record, as these variables
are formed solely from previous values in the sequence. These “internal”
conditioning variables contain low-frequency signals that are similar to the signal
contained in ENSO (see Table 2.1). The advantage in using such “internal”
conditioning variables is that the use of exogenous variables is avoided, and when
long sequences of synthetic data are generated, it is not necessary to first generate
a long sequence of the exogenous variable. The combination of short-term and
longer-term predictors proposed here is designed to capture the daily, seasonal,
annual, and very-long-term features of the historical record in a parsimonious
way.

This chapter is arranged as follows. Section 4.2 describes the rainfall occurrence
model. Section 4.3 describes the alternative method of predictor selection. Section
4.4 presents results from the application of the rainfall occurrence model to daily
rainfall data from Sydney and Melbourne, Australia. Section 4.5 presents
conclusions to this chapter.

4.2 The resampling model for rainfall occurrence


Stochastic models of hydrologic time series are an exercise in conditional
probability [Bras and Rodriguez-Iturbe, 1985]. It is assumed that the time series
{x 1 , x 2 ,..., x t,...} has a Markovian dependence structure, i.e. that the value of x t is
dependent only on a finite set of prior values. With this assumption, formation of
a stochastic model involves specification of the conditional probability
distribution f(x t|zt ), where zt is a vector of predictors formed from the prior values
of x t. In a traditional multi-order Markov chain [Chin, 1977; Salas 1993], zt is
formed from x t-1 , x t-2 , x t-3 , and x t-4 (a Markov chain of order greater than four is
rarely used in practice, for reasons of parsimony). An alternative approach to
specifying zt is to use a combination of short-term and long-term predictors, as
has been done in Sharma and O’Neill [2001]. Such an approach entails
representing zt as:

z_t ∈ { x_{t−1}, ..., x_{t−p}; a_t }        (4.1)

where

a_t ∈ { a_{m1,t}, a_{m2,t}, ..., a_{mq,t} }        (4.2)

and a_{mi,t} is referred to as an "aggregate" variable which represents a long-term
state of the hydrologic system. It is defined as:

a_{mi,t} = Σ_{j=1}^{mi} x_{t−j}        (4.3)

In the present study the choice of p (the number of short-term predictors) and q
(the number of long-term predictors) has been restricted to 1 and 3 respectively.
More detail on the rationale behind the use of aggregate variables is given in
chapter 3. Details on the choice of aggregate variables used in this study (i.e. the
chosen values of mi) are given in section 4.4.1.
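For 0/1 occurrence data, the aggregate variable of equation (4.3) is simply a count of wet days in the m_i days preceding day t. A minimal sketch with a toy series:

```python
def aggregate(x, m, t):
    """a_{m,t}: the number of wet days among the m days preceding
    day t in the 0/1 occurrence series x (eq. 4.3)."""
    return sum(x[t - j] for j in range(1, m + 1))

x = [0, 1, 1, 0, 1, 0, 1]        # toy occurrence series
print(aggregate(x, 3, 5))        # x[4] + x[3] + x[2] = 2
```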

Simulation proceeds by generating values of xt from the conditional probability
distribution f(xt|zt), defined as:

f(x_t | z_t) = f(x_t, z_t) / ∫ f(x_t, z_t) dx_t        (4.4)

Traditional parametric models specify (4.4) through assumed distributions,
expressed using parameters such as historical values of the mean, variance and
skewness, and historical measures of dependence such as correlation. The
assumptions involved in specifying these distributions may not always be valid, as
illustrated in Sharma et al. [1997]. Nonparametric methods, on the other hand,
specify (4.4) through more direct, empirical, data driven estimates of the
conditional and joint probability densities drawn from a local subset of the
historical data [Lall, 1995], with no prior assumptions about the form of
dependence or probability distribution. Use of the data-driven framework ensures
that the resulting simulations have similar dependence and distributional attributes
as observed in the historical record [Sharma and O’Neill, 2001].

The particular application of (4.4) is to daily rainfall occurrence data, where
seasonal effects are important. A moving window of length l is used, centred on
the Julian day that corresponds to the current day t, to form a local subset of the
historical data that naturally represents seasonal variability (as described in
chapter 2). Equation (4.4) is estimated using the sample formed by the moving
window. The moving window includes data from all historical years except the
first 10 (which are excluded since zt cannot be defined for the first few historical
values). After analysing the sensitivity of the proposed model to different choices
of l, a value of l = 15 days was chosen for use in this application, as a compromise
between minimising seasonal effects in the sample and providing as large a
sample as possible.
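The moving-window membership test can be sketched as below; a fixed 365-day year is assumed for simplicity (leap days would need special handling), and l = 15 is taken to mean 7 days either side of the centre day:

```python
def in_window(day, centre, l=15, year_len=365):
    """True if calendar day `day` lies within a moving window of
    length l (l = 15 means 7 days either side) centred on `centre`,
    wrapping across the year boundary."""
    half = l // 2
    d = abs(day - centre) % year_len
    return min(d, year_len - d) <= half

print(in_window(363, 1))   # True: the window wraps over New Year
print(in_window(20, 1))    # False: 19 days from the centre
```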

A practical problem was encountered in the implementation of (4.4) when it was
applied to daily rainfall occurrence (whether a day is "dry" or "wet"). Because
this data is discrete, it was found that all variables in zt should be
rescaled into discrete variables with a limited number of states, rather than dealing
with a mixture of discrete and nearly-continuous variables. The rescaled predictor
set is z ′t , defined as follows:

z′_t ∈ { x_{t−1}, b_{m1,t}, b_{m2,t}, b_{m3,t} }        (4.5)

where b_{mi,t} is termed a "wetness index" (as defined in section 3.3.1 of chapter 3),
and is a rescaled version of a_{mi,t}, transformed into a small number of discrete
categories based on the distribution of a_{mi} within the local subset of the
historical data formed by the moving window. Simulation then proceeds by
generating values of xt from the conditional probability distribution f(xt|z′t),
defined as:

f(x_t | z′_t) = f(x_t, z′_t) / ∫ f(x_t, z′_t) dx_t        (4.6)
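The discretisation step can be illustrated as follows. The exact definition of the wetness index is given in section 3.3.1; the sketch below assumes equal-probability (quantile) categories, with five categories corresponding to the "very dry" to "very wet" classes mentioned in chapter 3:

```python
def wetness_index(a_value, window_sample, n_cats=5):
    """Map an aggregate value to one of n_cats discrete wetness
    categories (0 = driest, n_cats-1 = wettest) by its rank within
    the moving-window sample of aggregate values. Equal-probability
    (quantile) categories are assumed here for illustration."""
    below = sum(1 for a in window_sample if a < a_value)
    frac = below / len(window_sample)
    return min(int(frac * n_cats), n_cats - 1)

sample = [3, 8, 5, 12, 7, 4, 10, 6, 9, 11]
print(wetness_index(12, sample))  # 4: the wettest category
print(wetness_index(3, sample))   # 0: the driest category
```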

The particular nonparametric method that has been chosen here for estimation of
f(x t| z ′t ) is a resampling model using nearest neighbour methods [Lall and Sharma,

1996; Sharma and Lall, 1999]. Instead of using parametric models to describe
(4.6), a data resampling strategy that approximates the “random” mechanism that
produced the historical data is implemented. This resampling strategy is a
conditional bootstrap [Efron and Tibshirani, 1993; Lall and Sharma, 1996]. In
essence, the k patterns in the historical data that are most similar to the current
pattern in the generated sequence are found, and a historical value is chosen from
the successors to the k historical patterns and placed as the successor to the current
pattern in the generated sequence. The value of k that is chosen acts as a
“smoothing parameter” for the procedure. A value of k that is too small or too
large can cause the resulting generated sequences to be biased. Lall and Sharma
[1996] propose that, for continuous variables, k may be chosen as the square root
of the length of the dataset.

In the application of this methodology, the patterns that are examined are those
that exist in the predictor set z′d, which is defined for every day d in the seasonal
subset of the historical record formed by the moving window centred at the Julian
day corresponding to t, which is the current day in the generated sequence. The
historical patterns z′d are compared to the current pattern in the generated
sequence z′t, and the k most similar patterns are chosen from z′d. The
corresponding k successors to z′d (which are chosen from xd) are termed nearest
neighbours, and they form an empirical estimate of f(xt|z′t) that is based directly
on the characteristics of the historical sample. Further details on selection of the
patterns using nearest neighbour methods and Euclidean distances are now given.

The k historical nearest neighbours to xt are defined to be those values of xd that
have the k smallest Euclidean distances between z′t and z′d. These Euclidean
distances are calculated as:

E_d = [ ( w_0 ( x_{t−1} − x_{d−1} ) )² + Σ_{i=1}^{3} ( w_i ( b_{mi,t} − b_{mi,d} ) )² ]^{1/2}        (4.7)

Scaling weights are used to ensure that each predictor has the same weighting in
selection of nearest neighbours. Localised estimates of the standard deviation are
used to account for seasonal variations. The scaling weights used are:

w_i = 1/s_i        (4.8)

where wi denotes the weight for predictor variable i, where i ranges from 0 to 3,
and si is the estimated local standard deviation. Values of si are seasonal. For this
study values of si are calculated on a daily basis, using a moving window centred
at the Julian day corresponding to t.
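Equations (4.7) and (4.8) together amount to a standardised Euclidean distance, with each predictor difference divided by its local standard deviation so that all predictors contribute on a comparable scale. A sketch, assuming the predictor vector is laid out as x_{t−1} followed by the three wetness indices:

```python
import math

def scaled_distance(z_t, z_d, s):
    """Weighted Euclidean distance of eqs. (4.7)-(4.8): each predictor
    difference is divided by the local standard deviation s_i of that
    predictor. The first entry of each vector is taken to be x_{t-1},
    the remaining entries the wetness indices b_{m_i}."""
    return math.sqrt(sum(((a - b) / si) ** 2
                         for a, b, si in zip(z_t, z_d, s)))

z_t = [1, 3, 2, 4]                 # current pattern
z_d = [1, 3, 2, 4]                 # an identical historical pattern
print(scaled_distance(z_t, z_d, [0.5, 1.2, 0.9, 1.1]))  # 0.0
```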

The algorithm for the rainfall occurrence model is presented next. The resampling
model (ROG) is based on Sharma and Lall [1999]. The differences between this
approach and the Sharma and Lall [1999] model are:

1. Rainfall occurrence is resampled one day at a time, rather than resampling
entire dry spells or entire wet spells. An advantage of resampling one day at a
time is that unprecedentedly long dry and wet spells may occur in the
generated sequences.
2. Multiple predictors for rainfall occurrence are used here. Sharma and Lall
used a single predictor. The use of multiple predictors enables the model to
capture both short and longer-term dependence structure, and results in
generated sequences that more closely reproduce the variability found in the
historical record, compared to sequences that are generated by a single-
predictor model.

ROG conditionally resamples rainfall occurrence from a seasonal subset of the
historic record. At every iteration of the algorithm, one of the k nearest neighbours
from the seasonal subset is resampled and inserted into the generated sequence.
Lall and Sharma [1996] developed a conditional probability density distribution
for use with this nearest-neighbour resampling method. This distribution is:

p(i) = (1/i) / Σ_{j=1}^{k} (1/j)        (4.9)

where p(i) is the probability that the ith nearest neighbour will be resampled, and k
is the number of neighbours considered. This distribution resamples most
frequently from the closest neighbours, with a reducing probability of selection as
i increases.
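Sampling from this discrete distribution can be done by inversion of the cumulative weights, as sketched below; the counts printed come from an arbitrary fixed seed and simply show the decreasing 1 : 1/2 : 1/3 : 1/4 proportions.

```python
import random

def neighbour_rank(k, rng):
    """Draw a neighbour rank i in {1, ..., k} from the discrete kernel
    p(i) = (1/i) / sum_{j=1..k} (1/j) of eq. (4.9), by inverting the
    cumulative sum of the unnormalised weights 1/i."""
    weights = [1.0 / i for i in range(1, k + 1)]
    u = rng.random() * sum(weights)
    cum = 0.0
    for i, w in enumerate(weights, start=1):
        cum += w
        if u <= cum:
            return i
    return k

rng = random.Random(0)
counts = [0] * 4
for _ in range(10000):
    counts[neighbour_rank(4, rng) - 1] += 1
print(counts)  # close to the 1 : 1/2 : 1/3 : 1/4 proportions
```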

Lall and Sharma [1996] suggested that k could be chosen as the square root of the
length (n) of the dataset used for resampling. This rule was tested, and it was
found to be inadequate for rainfall occurrence data; in many cases, the sequences
generated with this choice of k were biased (i.e. the fractions of wet days in the
generated sequences were consistently different to the historical value). This
problem occurred because the rule was developed for continuous variables
whereas ROG deals with discrete variables. Instead, k was obtained by trialling a
range of possible values, and selecting the value that gave the smallest bias in the

generated sequences. For simplicity, values of k that were multiples of √n were
trialled, using the formula k = α√n. The values of α obtained depended on
location, and on the number of predictors being used in the model. Note that the
choice of k was only problematic for ROG models with two or more predictors;
for a zero-predictor or one-predictor ROG model k was set equal to √n, since the
performance of such a model was not sensitive to this parameter.

At the pre-processing stage of the algorithm, the values of the predictors for all
historical days are calculated, and the scaling weights for each predictor are
calculated using equation (4.8). Then a short part of the historical sequence is
selected at random, to use for the initial specification of z ′t . The first day in the
generated sequence is the day immediately after the end of this startup sequence.
The main algorithm can then be implemented:

1. Calculate the values of the predictors for the current day in the generated
sequence ( z ′t ).
2. Randomly choose a value for i from the distribution specified in equation
(4.9).
3. Formulate an l-day moving window centred on the current date.
4. Select the ith historical nearest neighbour from the moving window using
equation (4.7). The ith nearest neighbour has the ith smallest value of Ed .
Whenever ties occur, randomly choose from all dates in the moving window
with Ed equal to the ith smallest value. (At this point in the algorithm, x t is
undefined, although z ′t is known.)
5. Put the chosen historical value x d (i.e. the ith nearest neighbour) into the
current position in the generated sequence, i.e. set x t equal to x d .
6. Move on to the next date in the generated sequence.
7. Repeat steps 1-6 until the desired length of generated sequence is obtained.
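Steps 2-5 can be sketched as a single function. The fragment below simplifies step 4 by breaking ties in list order rather than at random, and assumes the predictor vectors have already been rescaled by the weights of equation (4.8), so that a plain Euclidean distance applies:

```python
import math
import random

def rog_step(z_t, window, k, rng):
    """One ROG iteration (steps 2-5), sketched. `window` is a list of
    (z_d, x_d) pairs for the days in the seasonal moving window; the
    i-th nearest neighbour is drawn with probability proportional to
    1/i (eq. 4.9) and its successor x_d becomes the generated x_t."""
    dist = lambda z_d: math.sqrt(sum((a - b) ** 2
                                     for a, b in zip(z_t, z_d)))
    ranked = sorted(window, key=lambda pair: dist(pair[0]))[:k]
    weights = [1.0 / i for i in range(1, len(ranked) + 1)]
    return rng.choices(ranked, weights=weights)[0][1]

window = [((0, 1), 1), ((1, 2), 0), ((0, 2), 1), ((1, 1), 1), ((0, 1), 0)]
x_t = rog_step((0, 1), window, k=3, rng=random.Random(2))
print(x_t in (0, 1))  # True: a resampled occurrence state
```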

The performance of the model is evaluated by comparing a large number of the
generated synthetic sequences with the historical record. The approach for
assessing model performance is described in the next section of this chapter. Note
that separate predictor sets can be specified for each season; this is done for the
predictor choices found using PIC (see chapter 3). However, it was found that the
same set of predictors could be used throughout the year at Sydney and
Melbourne, and still give good results.

4.3 A method for predictor selection


The use of multiple predictors in the rainfall occurrence model necessitates the
use of a predictor identification criterion. Jimoh and Webster [1996] question the
use of traditionally used criteria, such as the Akaike Information Criterion (AIC)
[Akaike, 1974] or the Bayesian Information Criterion (BIC) [Schwarz, 1978], for
selecting the number of parameters used in a rainfall occurrence model. They
suggest that such criteria be used with caution. These criteria use maximum
likelihood (or a similar measure) to give an indication of the goodness of fit of the
model parameters to the data, based on one-day-ahead forecasts made using the
model. The predictor selection method described in chapter 3 provides a
nonparametric alternative to the use of criteria such as AIC or BIC. However, the
method of chapter 3 is still based on one-day-ahead forecasts. The quality of the
forecasts made using the selected predictors does not indicate whether the
sequences generated by the model will reproduce historical longer-term
variability. A different method is required to assess the longer-term behaviour of
the generated sequences.

Jimoh and Webster [1996] propose that the use of frequency-duration curves of
dry and wet spell lengths can provide a robust alternative method of model
identification. This method goes closer to assessing how well the goal of this
thesis is achieved: i.e. to get synthetic sequences that are representative of the
historical sequence. Others have used similar methods for assessing model
performance. Gregory et al. [1993] state that reproduction of seasonal variance
provides a crucial test of stochastic weather generators. Wilks [1999] tested
seasonal variance, extreme daily precipitation amounts, and long runs of
consecutive dry and wet days.

The rainfall occurrence model formed from the selected predictors should
generate sequences that are statistically similar to the historical sequence, with
short-term, medium-term, long-term, and very-long-term statistics reproduced. A
method of predictor selection, based upon this principle, is to first select a short-
term predictor, then a medium-term predictor, then a long-term predictor, and then
a very-long-term predictor if required. This approach captures the features of the
historical record in a parsimonious way. One predictor at a time is added to the
existing predictor set, and the resulting model is evaluated by generating 100
sequences from the model, of the same length as the historical record, and then
comparing the short-term, medium-term, and longer-term characteristics of the
generated sequences with the characteristics of the historical record. The best
performing predictor (as shown by these comparisons) is chosen at each step of
this procedure. This method of assessing model performance is a logical extension
of the assessment method proposed by Jimoh and Webster [1996]; it proposes a
more extensive and rigorous examination of the resulting sequences. This
assessment method becomes the basis for selecting the predictors to be used in the
best-fitting model, the value of the smoothing parameter k, and the number of
predictors to be used.

This new approach checks the properties of the generated sequences by examining
a range of different criteria:

a. Daily means and correlations are examined to verify that the short-term
statistics are being reproduced.
b. Seasonal means and standard deviations are examined to verify that the
seasonal (medium-term) statistics are being reproduced.
c. Annual mean, standard deviation, and probability plots of the distribution of
the number of wet days per year are examined to verify that the annual (long-
term) statistics are being reproduced.
d. Annual autocorrelation functions and standard deviation vs. timescale plots
are examined to verify that very long-term statistics are being reproduced.
e. The distributions of dry and wet spell lengths are also examined, along with
the observed and generated values of the longest dry spell and the longest wet
spell in each season.

A three-step checklist for quality of the generated sequences based on this
approach is:

1. Check that the seasonal and annual average numbers of wet days are not
biased in comparison to the historical values.
2. Look at the probability plot of the distribution of wet days per year. This
shows whether the variability in the annual wet days follows a pattern similar
to the historical record.
3. The chosen model/set of predictors can be further validated by checking the
goodness of fit of the other statistics, which present information about both
short and longer-term variability in the generated sequences.
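Steps 1 and 2 of the checklist can be expressed as simple computations on the annual wet-day counts. The sketch below uses our own function names, and the (i − 0.5)/n plotting position for the probability plot is an assumption; the thesis does not specify one.

```python
import numpy as np

def annual_bias(historical, generated):
    """Checklist step 1: bias of the generated mean annual wet-day count
    relative to the historical mean. `generated` is (n_sequences, n_years)."""
    return float(np.mean(generated) - np.mean(historical))

def probability_plot_points(counts):
    """Checklist step 2: annual counts sorted from wettest to driest,
    paired with exceedance probabilities (i - 0.5) / n. Plotting these
    points gives the probability plot of wet days per year."""
    c = np.sort(np.asarray(counts))[::-1]
    n = len(c)
    exceedance = (np.arange(1, n + 1) - 0.5) / n
    return exceedance, c
```

Step 3 then repeats the same kind of comparison for the remaining statistics (spell lengths, seasonal standard deviations, annual autocorrelations).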

This three-step checklist was used to assess the models chosen at each stage of the
stepwise predictor selection procedure described above. A large number of
predictor and smoothing parameter combinations were assessed (including the
predictor sets that were chosen using the PIC criterion of chapter 3), before
arriving at the results that are reported in this chapter. Examples of the practical
application of this checklist are given in sections 4.4.2 and 4.4.3.

4.4 Application of the resampling model

4.4.1 Implementation Details


This section applies the ROG model to 140 years of daily rainfall data from
Sydney and to 144 years of daily rainfall data from Melbourne, Australia. The
Sydney daily rainfall record is studied in detail in the results that follow. Results
from a ROG model that incorporates a single short-term predictor are presented
first, to show the ability of such a model to represent short-term features of the
record, and the inability of such a model to represent longer-term features of the
record. Results for a ROG model that incorporates both a short-term predictor and
three longer-term predictors are given next, to show how the inclusion of the
longer-term predictors improves the representation of the longer-term features of
the record. A summary of the results from alternative ROG models for Sydney
rainfall is presented, to demonstrate the predictor selection method. A similar
application to Melbourne rainfall is then made, which reinforces the conclusions
arrived at from the earlier application.

Short-term dependence is simulated using rainfall occurrence on the previous day,
and longer-term dependence using “aggregate” variables that describe how wet it
has been over a longer length of time. Simulation proceeds by resampling from a
local subset of the historical record of rainfall occurrence, conditional on the
current values of these predictors. The seasonal variations present in the rainfall
time series are modelled using a 15-day moving window, centred at the current
day. This forms a seasonally representative sample at any given time of year,
while giving a larger sample for calculation purposes. The moving window
includes days from all except the first few historical years (since the first few
years in the record do not have associated values of all the predictors). A
threshold value of 0.3mm is used to decide whether a day is dry or wet (after
Buishand [1978]).
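These two implementation details can be sketched as follows. The direction of the threshold comparison (at least 0.3 mm counts as wet) is our assumption, and the function names are ours.

```python
import numpy as np

WET_THRESHOLD_MM = 0.3  # after Buishand [1978]

def occurrence(rain_mm):
    """Classify each day as wet (1) or dry (0); here a day with rainfall of
    at least the 0.3 mm threshold is assumed to count as wet."""
    return (np.asarray(rain_mm, dtype=float) >= WET_THRESHOLD_MM).astype(int)

def window_days(julian_day, half_width=7, year_length=365):
    """Julian days (1..year_length) falling in a 15-day moving window
    centred on julian_day, wrapping across the year boundary."""
    return [(julian_day - 1 + off) % year_length + 1
            for off in range(-half_width, half_width + 1)]
```

The window wraps so that, for example, the window centred on 1 January includes late-December days, keeping the sample seasonally representative at the year boundary.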

The predictors used in this chapter are the rainfall occurrence on the previous day,
a wetness index for the last 90 days, a wetness index for the last 365 days, and a
wetness index for the last Y years where Y = 4 or 5 years. With reference to
equation (4.5), this corresponds to choosing m1 = 90, m2 = 365, and m3 = 1460 or
1825. The predictors are denoted as 1d, 90d, 1y, 4y, and 5y, in the discussion that
follows (note that this notation is consistent with that of chapter 3; using the
notation of equation (4.5) the predictors are x_{t-1}, b_{90,t}, b_{365,t}, b_{1460,t}, and b_{1825,t}). Note
that these predictors were chosen from among a range of candidate predictors by
testing the performance of ROG models incorporating the candidate predictors.
The “wetness indexes” are five-state aggregate variables with categories of very
dry, dry, average, wet, and very wet, with values assigned as described in section
3.3.1 of chapter 3.
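The mapping from an aggregate wetness value to one of the five states can be sketched as below. Equal-probability (quintile) class boundaries are our assumption for this illustration; the thesis assigns the states as described in section 3.3.1 of chapter 3.

```python
import numpy as np

STATES = ["very dry", "dry", "average", "wet", "very wet"]

def wetness_state(historical_aggregates, value):
    """Assign an aggregate wetness value (e.g. a 90-day wet-day total)
    to one of five states using the quintiles of the historical values."""
    bounds = np.percentile(historical_aggregates, [20, 40, 60, 80])
    return STATES[int(np.searchsorted(bounds, value, side="right"))]
```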

ROG(0) denotes a model that resamples unconditionally from all days in the
seasonally representative sample formed by the moving window. ROG(1) denotes
a model which uses rainfall occurrence on the previous day (1d) as the single
predictor. This model resamples from the days in the seasonally representative
sample that are preceded by dry/wet days if the current value of 1d is dry/wet.
ROG(2), ROG(3), and ROG(4) denote models which use two, three, or four
selected predictors, respectively. The predictors are selected using the stepwise
approach described in section 4.3. These more complicated models conditionally
resample from the seasonally representative sample based on the current values of
the predictors.
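One simulation step of such a conditional resampling model can be sketched as below. For simplicity this sketch resamples uniformly from days whose predictor state matches the current state exactly; the thesis instead resamples from the k nearest neighbours in predictor space, with k set by the smoothing parameter. All names are ours.

```python
import numpy as np

def resample_next_day(occ, pred_state, window_idx, current_state, rng):
    """Return the occurrence value (0 = dry, 1 = wet) of a day resampled
    from the seasonally representative window, conditional on the
    predictor state.

    occ          : historical occurrence values
    pred_state   : predictor state of each historical day
    window_idx   : indices of the days in the 15-day seasonal window
    current_state: predictor state of the simulated sequence right now
    """
    candidates = [i for i in window_idx if pred_state[i] == current_state]
    if not candidates:           # no match: fall back to the whole window
        candidates = list(window_idx)
    return int(occ[rng.choice(candidates)])
```

Repeating this step day by day, and updating the aggregate predictors from the simulated sequence itself, yields the generated occurrence series.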

4.4.2 Results for Sydney

The approach for stepwise selection of predictors, and for selection of an
appropriate number of nearest neighbours (k) to be used in the simulation model,
is applied to a 140 year long (1859-1998) daily rainfall occurrence record for
Sydney, Australia. Table 4.2 describes the ROG models that were selected at each
stage of the stepwise procedure described in section 4.3.

Results for ROG(1) are presented first, in Figures 4.1 and 4.2. Figure 4.1
demonstrates the ability of a model that incorporates a single short-term predictor
to reproduce the short-term (i.e. daily level) statistical characteristics of the
historical record. Figure 4.2 demonstrates the inability of such a model to
reproduce the longer-term characteristics of the historical record. Figure 4.1a
shows the fraction of wet days for the ROG(1) model for Sydney, as it varies with
time of year. The distribution of the statistic from 100 generated sequences is
shown by the 5th percentile, median, and 95th percentile lines (each of the 100
sequences generated by the given model is of the same length as the historical
sequence). Superimposed on this graph are the historical values (circles). Note
that the calculations for each day shown in Figure 4.1 have their sample sizes
increased by the use of a 15-day moving window centred on the day, for both the
historical sequence, and each of the generated sequences. It can be seen that
ROG(1) adequately reproduces the historical values. ROG(1) also adequately
reproduces the historical daily lag-one correlations, as shown in Figure 4.1b.
However, the longer-term features of the historical record are not reproduced by
this single-predictor model. This is illustrated in Figure 4.2, which shows a
probability plot of the distribution of wet days per year for Sydney. The figure
shows that the driest year on record for Sydney has only 90 wet days, but the
wettest year has approximately 220 wet days. The generated sequences from
ROG(1) do not reproduce this distribution. The standard deviation of the number
of wet days per year is directly related to this distribution, and thus generated
sequences from ROG(1) under-represent the historical annual-level standard
deviation. This is a failure of single-predictor stochastic daily generation models
that has been widely recognised [Buishand, 1978; Wilks, 1989; Gregory et al.,
1993; Katz and Zheng, 1999]. This failure of single-predictor models to reproduce
longer-term features of the historical record is the motivation for development of
the four-predictor model (ROG(4)). Results for this model are presented next.


Figure 4.1 ROG(1): Daily statistics for Sydney rainfall occurrence. a. Fraction of
wet days as a function of Julian day. b. Lag-one correlations as a function of
Julian day. This plot (and subsequent plots) shows the distribution of values
obtained from 100 generated sequences of the same length as the historical record,
with the historical values superimposed on the plot.

Figure 4.2 ROG(1): Distribution of wet days per year for Sydney.

The results for ROG(4) are presented in Figures 4.3 to 4.7. These figures show
how the use of a combination of short-term and longer-term predictors gives a
model that better reproduces the medium-term and longer-term characteristics of
the historical record, while still reproducing the short-term characteristics
adequately. Generated sequences from ROG(4) adequately reproduce the
historical fraction of wet days and the historical daily lag-one correlations, with
similar results to those for ROG(1) shown in Figure 4.1. Generated sequences
from ROG(4) also adequately reproduce both the mean dry and wet spell lengths
in each season, and the standard deviations of the spell lengths, as shown in
Figure 4.3, panels a, b, d, and e. The seasons are Spring (September-November),
Summer (December-February), Autumn (March-May), and Winter (June-August),
denoted as S, S, A, and W respectively in the Figures, and in Tables 4.4 and 4.9.
Values of each statistic were calculated for each of the 100 generated sequences,
and the distribution of these values is shown as a boxplot. In this and all
subsequent boxplots, the median, 25th percentile, and 75th percentile are the lines
forming the box, and the 5th percentile and 95th percentile values are shown by the
whiskers. The historical values are superimposed on the plot, connected by a line.
Panels c and f of Figure 4.3 show boxplots formed from the longest dry spell and
the longest wet spell ending in each season, for each of the 100 generated
sequences from ROG(4), with the historical values superimposed on the plot. The
longest dry and wet spell lengths are reasonably well matched by the generated
sequences. Wilks [1999] notes that a Markov chain model can under-represent
extreme dry spell lengths. The results presented here show that this was not a
problem for the ROG(4) model of Sydney rainfall occurrence.

Figure 4.3 ROG(4): Mean, standard deviation, and longest dry and wet spell
lengths in each season for Sydney. a. Mean dry spell lengths. b. Standard
deviation of dry spell lengths. c. Longest dry spell lengths. d. Mean wet spell
lengths. e. Standard deviation of wet spell lengths. f. Longest wet spell lengths.

Figure 4.4 shows frequency-duration curves for dry and wet spell lengths in each
season for ROG(4). The 5th percentile, median, and 95th percentile lines from the
generated sequences are shown for each of four seasons, with the historical values
superimposed on the plots. The generated sequences from ROG(4) match the
historical frequency-duration of spell lengths. Jimoh and Webster [1996]
recommended that these curves be used to provide a robust method of model
identification. Note, however, that a number of different statistics are used in the
predictor selection method proposed here, as described in section 4.3. The fact
that ROG(4) reproduces the frequency-duration curves validates the proposed
predictor selection method.


Figure 4.4 ROG(4): Frequency-duration curves for dry spell lengths (top panels)
and wet spell lengths (bottom panels) in each season for Sydney.

Figure 4.5a and Figure 4.5b show the results for ROG(4) for the mean number of
wet days per season, and the standard deviation of the number of wet days per
season. The reproduction of the means shows that the ROG(4) model is not
biased. The representation of the seasonal standard deviations is better than for
any alternative ROG model for Sydney, as shown in the summary of results for
Sydney, which is presented below. Figure 4.5c shows that the standard deviation
of the number of wet days per year is reproduced by ROG(4). The probability plot
of the distribution of wet days per year for ROG(4) is given in Figure 4.6. The
generated sequences from ROG(4) adequately reproduce this distribution. These
results are a major achievement of the proposed model. The two wettest years are
not perfectly reproduced; however, ROG(4) does better in reproducing these
values than existing models.

Figure 4.5 ROG(4): Mean and standard deviation of wet days per season, and
standard deviation of wet days per year for Sydney. a. Mean wet days per season.
b. Standard deviation of wet days per season. c. Standard deviation of wet days
per year.

Figure 4.6 ROG(4): Distribution of wet days per year for Sydney.

Statistics related to the very-long-term variability of the generated sequences from
ROG(4) are presented in Figure 4.7. Figure 4.7a shows the autocorrelation
function of the number of wet days per year, from lag one up to lag ten. A line
joins the ten historical values, and box plots of the values from the generated
sequences from ROG(4) are shown. Figure 4.7b is a plot of standard deviation of
total wet days versus timescale in years. Standard deviations are shown on this
plot for annual, biannual, triannual, 4-yearly, 5-yearly, 6-yearly, and 7-yearly
total wet days. The standard deviation of annual total wet days is identical
to the plot shown in Figure 4.5c. The other standard deviations are an indicator of
the very-long-term variability of the sequences. While the historical values for
both these statistics are not perfectly reproduced by ROG(4), this model gives a
better representation of these statistics than any other model tested (as shown
below), and the degree of long-term variability in the sequences produced by the
model is good, considering the complexity of the natural processes that contribute
to the historical variability. Again, this is a major achievement of the ROG(4)
model.

Figure 4.7 ROG(4): Longer-term statistics for Sydney rainfall occurrence. a.
Autocorrelation function of wet days per year. b. Standard deviation of total wet
days versus timescale (years).

It has just been shown that the ROG(4) model reproduces the statistical
characteristics of the historical record at several timescales. The predictor
selection method that led to the identification of this model will now be
demonstrated. To facilitate this exercise, a summary of model results for several
ROG models for Sydney will be given. The summary of results for each model is
based on the plot types that have already been presented. The models are:
• An unconditional resampling model, denoted ROG(0) because it does not use
any predictors.
• The models identified at each step of the method described in section 4.3, i.e.
ROG(1), ROG(2), ROG(3), and ROG(4). These models use one, two, three, or
four predictors, respectively, as described in Table 4.2.

• The model formed in chapter 3 from the (one or two) predictors identified as
significant in each season using PIC. This model is denoted ROG(1A).
• The models formed in chapter 3 from the (two or three) best predictors in each
season found using PIC. These models are denoted ROG(2A) and ROG(3A),
respectively.

Table 4.2 Resampling models selected for Sydney rainfall occurrence.

Number of predictors   Model name   α^a    Predictors
0                      ROG(0)       1      (unconditional resampling)
1                      ROG(1)       1      rainfall occurrence on the previous day (1d^b)
2                      ROG(2)       6      1d; wetness state for previous 90 days (90d)
3                      ROG(3)       2      1d; 90d; wetness state for previous 365 days (365d)
4                      ROG(4)       1      1d; 90d; 365d; wetness state for previous 5 years (5y)

^a α is a smoothing parameter used to specify the number of nearest neighbours k
used in each model; k = α√n, where n is the sample size.
^b “1d” refers to the rainfall occurrence state (dry or wet) on the previous day, and
“90d”, “365d”, and “5y” refer to the wetness state (very dry, dry, average, wet, or
very wet) for the previous 90 days, 365 days, and five years, respectively.

ROG(1A), ROG(2A), and ROG(3A) use different predictors depending on the
season; the specific predictors that were used are shown in Table 4.3. The
methodology used in selection of these predictors is based on the quality of one-
day-ahead forecasts, evaluated using the PIC criterion, as described in chapter 3.

Table 4.3 Predictors used in models ROG(1A), ROG(2A), ROG(3A) for Sydney
rainfall occurrence.

Model     Description                                      α^a   Sep-Nov         Dec-Feb        Mar-May     Jun-Aug
ROG(1A)   Predictors identified as significant using PIC   1     1d^b            1d             1d, L       1d, 9y
ROG(2A)   Two best predictors found using PIC              6     1d, 183d        1d, 60d        1d, L       1d, 9y
ROG(3A)   Three best predictors found using PIC            2     1d, 183d, 60d   1d, 60d, 9y    1d, L, 2y   1d, 9y, 3d

^a α is a smoothing parameter used to specify the number of nearest neighbours k
used in each model; k = α√n, where n is the sample size.
^b “1d” refers to the rainfall occurrence state (dry or wet) on the previous day, “L”
refers to the length (short, average, or long) of the preceding dry or wet spell
leading up to the current day, “3d” refers to the number of wet days in the past
three days (0, 1, 2, or 3), and “60d”, “183d”, “2y”, and “9y” refer to the wetness
state (very dry, dry, average, wet, or very wet) for the previous 60 days, 183 days,
two years, and nine years, respectively.

The summary of results for these models is shown in Tables 4.4, 4.5, and 4.6. The
columns in Table 4.4 reporting summary statistics for each model are, in order:
a. average of the seasonal mean dry spell lengths (cf. Figure 4.3a)
b. average of the seasonal standard deviations of the dry spell length (cf. Figure
4.3b)
c. longest dry spell (for the season containing the historically longest dry spell,
cf. Figure 4.3c)
d. average of the seasonal mean wet spell lengths (cf. Figure 4.3d)
e. average of the seasonal standard deviations of the wet spell lengths (cf. Figure
4.3e)
f. longest wet spell (for the season containing the historically longest wet spell,
cf. Figure 4.3f)
g. average number of wet days per year
h. standard deviation of the number of wet days in spring (cf. Figure 4.5b)
i. standard deviation of the number of wet days in summer (cf. Figure 4.5b)
j. standard deviation of the number of wet days in autumn (cf. Figure 4.5b)
k. standard deviation of the number of wet days in winter (cf. Figure 4.5b)
l. lag-1 autocorrelation of (wet days per year) (cf. Figure 4.7a)
m. lag-2 autocorrelation of (wet days per year) (cf. Figure 4.7a)
n. lag-3 autocorrelation of (wet days per year) (cf. Figure 4.7a).

Table 4.4 Summary of results for ROG models for Sydney: bias and standard
deviation in spell lengths and longer-term statistics^a

column              a      b      c        d      e      f        g        h      i      j      k       l      m      n
                    ------ dry spells ---  ------ wet spells ---  wet days  standard deviation of       autocorrelation of
                                                                  per year  wet days per season         wet days per year
Model               mean   SD     longest  mean   SD     longest  mean     S      S      A      W       lag 1  lag 2  lag 3
historical^b        3.80   3.61   47       2.46   2.09   29       143.5    7.91   8.47   9.60   10.25   0.50   0.33   0.36
ROG(0)   bias^c     -1.24  -1.60  -28      -0.80  -1.05  -18      0.1      -3.32  -3.84  -4.86  -5.63   -0.50  -0.34  -0.37
         SD^d       0.08   0.12   2.9      0.04   0.06   1.6      0.9      0.28   0.46   0.32   0.31    0.08   0.09   0.08
ROG(1)   bias       0.00   -0.33  -20      0.00   -0.18  -11      0.2      -1.80  -2.24  -2.77  -3.33   -0.51  -0.35  -0.36
         SD         0.15   0.21   4.9      0.08   0.13   2.7      1.1      0.40   0.36   0.40   0.40    0.09   0.09   0.08
ROG(2)   bias       0.01   -0.21  -16      0.00   -0.15  -10      -0.2     -0.49  -0.75  -1.20  -0.93   -0.44  -0.35  -0.38
         SD         0.21   0.28   6.7      0.11   0.17   3.1      1.8      0.43   0.48   0.44   0.53    0.08   0.09   0.09
ROG(3)   bias       0.04   -0.09  -13      0.02   -0.08  -8.4     -0.1     -0.11  -0.58  -0.66  -0.26   -0.27  -0.31  -0.38
         SD         0.28   0.34   6.7      0.16   0.21   3.4      2.8      0.47   0.51   0.58   0.57    0.09   0.09   0.08
ROG(4)   bias       0.04   0.07   -10      0.03   0.00   -6.5     -0.04    0.14   -0.27  -0.66  -0.08   -0.18  -0.14  -0.18
         SD         0.53   0.62   8.5      0.25   0.35   4.0      4.9      0.49   0.51   0.63   0.71    0.09   0.11   0.10
ROG(1A)  bias       -0.02  -0.26  -20      0.01   -0.13  -8.2     0.7      -1.84  -2.28  -2.07  -2.85   -0.45  -0.29  -0.32
         SD         0.20   0.29   4.0      0.11   0.16   3.2      1.7      0.35   0.39   0.47   0.52    0.09   0.09   0.10
ROG(2A)  bias       0.01   -0.18  -19      0.00   -0.12  -8.5     -0.1     -1.25  -0.97  -1.99  -2.78   -0.37  -0.26  -0.29
         SD         0.30   0.39   5.0      0.17   0.20   3.1      2.8      0.46   0.35   0.51   0.52    0.09   0.10   0.10
ROG(3A)  bias       0.01   -0.03  -15      0.02   -0.02  -5.9     0.6      -0.24  -0.60  -0.94  -1.89   -0.19  -0.12  -0.19
         SD         0.46   0.56   6.2      0.25   0.34   4.5      4.6      0.43   0.42   0.49   0.58    0.10   0.12   0.12

^a The units for all columns (except l-n) are days. Smaller biases indicate a better-fitting model.
^b Historical values of each statistic are shown in this row of the table.
^c “Bias” refers to the difference between the mean of 100 values of each statistic
(one from each of 100 sequences generated by the given model) and the historical
value of the statistic. Each generated sequence is of the same length as the
historical sequence.
^d “SD” refers to the standard deviation of the 100 generated values of each statistic.
In Table 4.5, the sum of squared residuals from the probability plot of the
distribution of wet days per year (cf. Figure 4.6) is reported; this is calculated as:

S = Σ_{i=1}^{n} [ H_i − ( Σ_{j=1}^{100} G_{j,i} ) / 100 ]²    (4.10)

where
n = number of years in the historical sequence − number of years at the start of the
historical record used solely for specification of predictors
= number of years in each of the 100 generated sequences.
The n years of wet day totals are in rank order, for both the historical years H_i and
the jth sequence of generated years G_{j,i}. A smaller sum of squared residuals
indicates a better-fitting model.
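Equation (4.10) can be computed directly from the ranked annual counts; a sketch (function name ours):

```python
import numpy as np

def ssr_ranked_distribution(historical, generated):
    """Sum of squared residuals between the ranked historical annual
    wet-day counts and the rank-wise mean of the generated counts
    (equation 4.10). `generated` has shape (n_sequences, n_years)."""
    h = np.sort(np.asarray(historical, dtype=float))
    g = np.sort(np.asarray(generated, dtype=float), axis=1).mean(axis=0)
    return float(np.sum((h - g) ** 2))
```

Sorting each sequence before averaging is what makes this a comparison of distributions (rank order) rather than of year-by-year values.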

Table 4.5 Summary table for various models for Sydney rainfall occurrence: sum
of squared residuals (SSR) for the distribution of wet days per year.^a

Model      SSR
ROG(0)     22373
ROG(1)     11710
ROG(2)     1657
ROG(3)     954
ROG(4)     950
ROG(1A)    9790
ROG(2A)    7035
ROG(3A)    2786

^a A smaller number indicates a better-fitting model.

In Table 4.6, the sum of squared residuals from both the plot of the daily fraction
of wet days (cf. Figure 4.1a) and the plot of the daily lag-one correlations (cf.
Figure 4.1b) are reported. Again, this is calculated based on the differences
between the historical values and the mean of the generated values:

S = Σ_{i=1}^{366} [ H_i − ( Σ_{j=1}^{100} G_{j,i} ) / 100 ]²    (4.11)

where
i = the Julian day being considered
H_i = the historical value of the statistic on day i
G_{j,i} = the value of the statistic on day i of the jth generated sequence
and the sample sizes for calculation of both H_i and G_{j,i} are increased using a
15-day moving window.
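A sketch of equation (4.11) applied to the daily fraction of wet days, with the sample enlarged by a 15-day moving window as described above (function names are ours; a 365-day year is assumed for simplicity):

```python
import numpy as np

def daily_wet_fraction(occ_by_year, half_width=7):
    """Fraction of wet days on each Julian day, pooled over all years and
    over a 15-day window centred on the day (wrapping the year boundary).
    `occ_by_year` has shape (n_years, 365)."""
    occ = np.asarray(occ_by_year, dtype=float)
    n_days = occ.shape[1]
    frac = np.empty(n_days)
    for d in range(n_days):
        idx = [(d + off) % n_days for off in range(-half_width, half_width + 1)]
        frac[d] = occ[:, idx].mean()
    return frac

def ssr_daily(historical_stat, generated_stats):
    """Equation (4.11): SSR between the historical daily values and the
    mean of the generated values, summed over all Julian days."""
    g = np.asarray(generated_stats, dtype=float).mean(axis=0)
    return float(np.sum((np.asarray(historical_stat, dtype=float) - g) ** 2))
```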

Table 4.6 Summary table for various models for Sydney rainfall occurrence: sum
of squared residuals (SSR) for daily mean and daily lag-one correlation.^a

Model      SSR of daily mean   SSR of daily correlation
ROG(0)     0.012               39.4
ROG(1)     0.013               0.028
ROG(2)     0.018               0.032
ROG(3)     0.017               0.040
ROG(4)     0.018               0.045
ROG(1A)    0.025               0.049
ROG(2A)    0.017               0.053
ROG(3A)    0.024               0.048

^a A smaller number indicates a better-fitting model.

Using the three-step checklist for quality of the generated sequences, we should
look first at bias in the annual means (column (g) in Table 4.4), and then look at
the sum of squared residuals from the distribution of wet days per year (Table
4.5). Referring to Table 4.4, column (g) shows that biases are close to zero for all of
the models shown, with the possible exception of ROG(1A) and ROG(3A). Table
4.5 shows that the sum of squared residuals from the probability plot is indeed
smallest for the chosen model (i.e. ROG(4)), although it can be seen from Table
4.4 that the main difference between ROG(4) and ROG(3) (which is the next-best
model) is in the autocorrelation function of wet days per year. From the analysis it
was found that if generated sequences provide a good fit to the distribution of wet
days per year (Table 4.5), then the other statistics (especially the means and the
seasonal and annual standard deviations) are also matched. The sums of squared
residuals for the daily level statistics (Table 4.6) are considered to be small for all
of the models, except for ROG(0), which has no mechanism to reproduce daily
level correlations.

It can be seen from Table 4.4 and Table 4.6 that unconditional resampling from a
seasonal subset of the historic record (i.e. ROG(0)) produces generated sequences
that reproduce daily and annual means. However, none of the other statistics
shown in Tables 4.4, 4.5, and 4.6 are matched. ROG(1) is an improvement over
the unconditional model, producing generated sequences that reproduce historical
mean dry and wet spell lengths and historical daily lag-one correlation. However,
all of the standard deviations and longer-term statistics for the generated
sequences from ROG(1) are still well below the historical levels. This is because
the ROG(1) model has no mechanism for reproducing low- frequency variability.
Adding the seasonal- level predictor to the predictor set to form ROG(2) greatly
improves the match of the distribution of wet days per year (Table 4.5). Other
generated statistics have also improved, indicating that the variability of the
generated sequences from ROG(2) is closer to the historical variability, compared
to generated sequences from the simpler models. Addition of an annual level
predictor to the predictor set to form ROG(3) again improves the representation of
the distribution of wet days per year, and the addition of the multi- year predictor
(to form ROG(4)) greatly improves the reproduction of the autocorrelation
function of the number of wet days per year.

It is interesting to note that, as the number of predictors used in the models
increases, the variability of the sample statistics among the 100 generated
sequences increases, as shown by the standard deviation rows in Table 4.4. This
occurs because the addition of each extra long-term predictor allows the model to
capture and reproduce more of the natural longer-term variability that is present in
the original data.

Tables 4.4, 4.5, and 4.6 show that the models for generation of long synthetic
sequences formed from predictors identified using PIC (ROG(1A), ROG(2A), and
ROG(3A)) are not as good as models formed using the method of predictor
selection described in this chapter. The models selected from the PIC predictor
identification did not reproduce the seasonal and annual statistics well, and even
the daily-level statistics (Table 4.6) of the sequences generated by these models
are poor. This shows that, even though the predictors selected using
PIC are valid for forecasting occurrence one-day-ahead, they are not the best
predictors to use for generation of long sequences of data. This may also apply, in
general, to any predictor set selected based on the quality of one-day-ahead
forecasts rather than based on the quality of the generated sequences produced by
the model, since the quality of one-day-ahead forecasts gives little indication as to
how well the observed variability at longer timescales is reproduced. In contrast,
the summary tables show that the 1d predictor is good at matching daily level
dependencies, the 90d predictor is good at matching seasonal level dependencies,
the 1y predictor is good at matching annual level dependencies, and the 5y
predictor is good at matching inter-annual level dependencies. A combination of
these four predictors forms a model (denoted ROG(4)) that captures much of the
natural variability of the historical record, and reproduces this variability in the
synthetic sequences that are generated by the model.

4.4.3 Results for Melbourne


For the 144 year long record of Melbourne rainfall occurrence (1855-1998), the
models shown in Table 4.7 were selected at each step of the proposed approach
for stepwise selection of predictors. All except the fourth predictor are the same as
those chosen for Sydney. For the fourth predictor, the 4-year wetness state was
chosen for Melbourne, while the 5-year wetness state was chosen for Sydney.

Also note that the α values for selection of the number of nearest neighbours were
different from those selected for Sydney.

The models for Melbourne selected using PIC (i.e. ROG(1A), ROG(2A), and
ROG(3A); see chapter 3 for details on how these models were selected) use
different predictors depending on the season; the specific predictors that were
used are shown in Table 4.8.

Table 4.7 Resampling models selected for Melbourne rainfall occurrence.

Number of     Model
predictors    name      α(a)   Predictors(b)
1             ROG(1)    1      1d (rainfall occurrence on the previous day)
2             ROG(2)    8      1d; 90d (wetness state for the previous 90 days)
3             ROG(3)    2      1d; 90d; 365d (wetness state for the previous
                               365 days)
4             ROG(4)    0.5    1d; 90d; 365d; 4y (wetness state for the
                               previous 4 years)

(a) α is a smoothing parameter used to specify the number of nearest
neighbours k used in each model; k = α√n, where n is the sample size.
(b) “1d” refers to the rainfall occurrence state (dry or wet) on the previous
day, and “90d”, “365d”, and “4y” refer to the wetness state (very dry, dry,
average, wet, or very wet) for the previous 90 days, 365 days, and four years,
respectively.

Table 4.8 Predictors used in models ROG(1A), ROG(2A), ROG(3A) for Melbourne
rainfall occurrence.

                                          Predictors used for(b)
Model     Description            α(a)   Sep-Nov   Dec-Feb   Mar-May   Jun-Aug
ROG(1A)   Predictors identified  1      1d        1d        1d        1d, 4y
          as significant
          using PIC
ROG(2A)   Two best predictors    8      1d, 5y    1d, 6y    1d, 10y   1d, 4y
          found using PIC
ROG(3A)   Three best predictors  2      1d, 5y,   1d, 6y,   1d, 10y,  1d, 4y,
          found using PIC               2d        183d      60d       30d

(a) α is a smoothing parameter used to specify the number of nearest
neighbours k used in each model; k = α√n, where n is the sample size.
(b) “1d” refers to the rainfall occurrence state (dry or wet) on the previous
day, “2d” refers to the number of wet days in the past two days (0, 1, or 2),
and “30d”, “60d”, “183d”, “4y”, “5y”, “6y”, and “10y” refer to the wetness
state (very dry, dry, average, wet, or very wet) for the previous 30 days,
60 days, 183 days, four years, five years, six years, and ten years,
respectively.

A summary of results for the resampling models for Melbourne rainfall
occurrence is presented in Tables 4.9, 4.10 and 4.11. The layout of the tables is
similar to the tables presented for Sydney. These tables show that the
one-predictor model (ROG(1)) for Melbourne matches the daily mean, the daily
lag-one correlation, and the mean dry and wet spell lengths. In contrast to the Sydney

results, the standard deviations of dry and wet spell lengths are also matched by
the one-predictor model. However, the longer-term statistics shown in columns
(h) to (n) in Table 4.9 are not matched. Adding the seasonal-level predictor to the
predictor set (ROG(2)) improves the representation of the seasonal variability
(columns (h) to (k) in Table 4.9) and the match of the distribution of wet days per
year (Table 4.10). The addition of the annual-level predictor (ROG(3)) gives
further improvement to the representation of the distribution of wet days per year.
However, the very-long-term statistics (columns (l), (m), and (n) in Table 4.9) are
not matched. Adding the inter-annual predictor to form ROG(4) gives generated
sequences that reproduce much of the historical variability over short, medium,
and long time scales, as shown in the tables. The overall fit of the range of
statistics shown in the tables is good. The very-long-term statistics of the
generated sequences (columns (l), (m), and (n) in Table 4.9) are closer to the
historical values than for any model with fewer predictors; however, they still fall
short of the historical values.

Table 4.9 Summary results for ROG models for Melbourne: bias and standard
deviation in spell lengths and longer-term statistics.(a)

column          a     b     c       d     e     f       g      h     i     j     k      l     m     n
                dry spells          wet spells          wet    standard deviation of   autocorrelation of
                mean  SD    longest mean  SD    longest days   wet days per season     wet days per year
                                                        /year  Spr   Sum   Aut   Win   lag 1 lag 2 lag 3
historical(b)   3.51  3.10  40      2.33  1.74  18      149.2  7.11  6.22  7.21  8.81  0.53  0.46  0.43
ROG(0) bias(c) -0.91 -1.00 -12.7   -0.58 -0.59 -7.6    -0.1   -2.40 -1.95 -2.65 -4.05 -0.53 -0.48 -0.45
       SD(d)    0.08  0.13  4.4     0.04  0.07  1.5     0.8    0.30  0.42  0.31  0.28  0.07  0.07  0.08
ROG(1) bias     0.00 -0.04 -2.5     0.00  0.03 -2.5     0.08  -1.03 -0.53 -1.18 -2.92 -0.53 -0.47 -0.43
       SD       0.14  0.20  5.3     0.07  0.11  2.5     0.9    0.38  0.41  0.38  0.37  0.08  0.08  0.09
ROG(2) bias     0.00  0.00 -2.4    -0.01  0.06 -1.7     0.0    0.06  0.08 -0.27 -1.11 -0.47 -0.48 -0.44
       SD       0.17  0.24  5.5     0.09  0.12  2.9     1.5    0.40  0.36  0.40  0.42  0.08  0.10  0.09
ROG(3) bias    -0.01  0.04 -2.7     0.00  0.09 -0.8     0.2    0.22 -0.10 -0.17 -0.71 -0.25 -0.42 -0.42
       SD       0.21  0.25  5.6     0.14  0.17  2.6     2.4    0.46  0.36  0.38  0.47  0.08  0.09  0.08
ROG(4) bias     0.01  0.19  0.6     0.00  0.17  0.0    -0.1    0.37  0.06 -0.18 -0.51 -0.15 -0.17 -0.14
       SD       0.38  0.44  6.8     0.22  0.26  2.7     4.1    0.45  0.41  0.45  0.56  0.09  0.10  0.10
ROG(1A) bias    0.00 -0.04 -4.0     0.00  0.03 -2.5     0.04  -1.09 -0.51 -1.08 -2.28 -0.43 -0.38 -0.33
        SD      0.14  0.21  5.7     0.10  0.13  2.1     1.5    0.32  0.34  0.37  0.43  0.09  0.10  0.09
ROG(2A) bias    0.08  0.07 -3.3    -0.04  0.01 -2.6    -3.1   -0.65 -0.47 -0.55 -1.90 -0.17 -0.10 -0.07
        SD      0.66  0.68  5.5     0.38  0.43  2.6     7.3    0.48  0.33  0.51  0.61  0.16  0.15  0.16
ROG(3A) bias    0.06  0.14 -0.7    -0.04  0.01 -2.4    -3.2   -0.96 -0.05 -0.64 -1.47 -0.18 -0.13 -0.10
        SD      0.64  0.65  6.3     0.40  0.44  2.9     7.5    0.45  0.44  0.74  0.64  0.15  0.14  0.15

(a) The units for all columns (except l-n) are days. Smaller biases indicate a
better-fitting model.
(b) Historical values of each statistic are shown in this row of the table.
(c) “Bias” refers to the difference between the historical value of each
statistic and the mean of 100 values of each statistic, corresponding to each of
100 sequences generated by the given model. Each generated sequence is of the
same length as the historical sequence.
(d) “SD” refers to the standard deviation of the 100 generated values of each
statistic.

Table 4.10 Summary table for various models for Melbourne rainfall occurrence:
sum of squared residuals (SSR) for the distribution of wet days per year.a
Model SSR
ROG(0) 11624
ROG(1) 5949
ROG(2) 1227
ROG(3) 374
ROG(4) 306
ROG(1A) 4217
ROG(2A) 2988
ROG(3A) 2961
(a) A smaller number indicates a better-fitting model.
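The SSR statistic reported in Table 4.10 can be sketched in Python. This is a minimal illustration under assumed conventions (binned counts of wet days per year, with the generated distribution averaged over the ensemble of generated sequences); the exact residual definition and binning used in the thesis may differ.

```python
import numpy as np

def wet_days_per_year(occ, days_per_year=365):
    """Count wet days in each year of a binary occurrence series
    (leap days ignored for simplicity)."""
    n_years = len(occ) // days_per_year
    occ = np.asarray(occ[:n_years * days_per_year]).reshape(n_years, days_per_year)
    return occ.sum(axis=1)

def ssr_wet_day_distribution(historical, generated_list, bins=np.arange(0, 366, 5)):
    """Sum of squared residuals between the historical distribution of wet
    days per year and the mean distribution over generated sequences."""
    hist_counts, _ = np.histogram(wet_days_per_year(historical), bins=bins)
    gen_counts = np.mean(
        [np.histogram(wet_days_per_year(g), bins=bins)[0] for g in generated_list],
        axis=0,
    )
    return float(np.sum((hist_counts - gen_counts) ** 2))
```

An SSR of zero would indicate that the ensemble-mean distribution exactly matches the historical one.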

Table 4.11 Summary table for various models for Melbourne rainfall occurrence:
sum of squared residuals (SSR) for daily mean and daily lag-one correlation.(a)

Model      SSR of        SSR of daily
           daily mean    correlation
ROG(0)     0.013         24.0
ROG(1)     0.013         0.038
ROG(2)     0.014         0.039
ROG(3)     0.015         0.039
ROG(4)     0.020         0.052
ROG(1A)    0.018         0.047
ROG(2A)    0.051         0.055
ROG(3A)    0.077         0.076

(a) A smaller number indicates a better-fitting model.

Using the three-step checklist for quality of the generated sequences, column (g)
in Table 4.9 shows that the chosen model is unbiased, but that models ROG(2A)
and ROG(3A) (which incorporate predictors chosen using the PIC procedure
described in chapter 3) are biased. Table 4.10 shows that the sum of squared residuals
from the distribution of wet days per year is smallest for the ROG(4) model. The
tables agree with the Sydney results since they show that the models for
generation of long sequences of synthetic data formed from predictors identified
using PIC (ROG(1A), ROG(2A) and ROG(3A)) are not as good as models formed
using the method of predictor selection described in this chapter.

4.5 Conclusions
This chapter has presented a model for generating long synthetic sequences of
daily rainfall occurrence that reproduce both the short-term and longer-term
variability of the historical record. A method for selecting the predictors used in
the model has also been presented. These generated sequences provide a better
representation of droughts and sustained wet periods than was previously
possible. Such features are of great interest in catchment management studies, and
the generated sequences can be used in such studies to enable better quantification
of the uncertainty in the catchment response that is due to climatic variability.

The model resamples from a seasonal subset of the historical record of rainfall
occurrence, conditional to the values of a set of multiple predictors. The predictors
are formed solely from previous values in the sequence, and represent short-term,
seasonal, annual, and inter-annual features of the rainfall sequence. Predictors are
selected in a stepwise procedure, based on comparison of generated synthetic
sequences with the observed rainfall occurrence time series, the comparison
including measures of location, dependence and variability at daily, seasonal,
annual, and inter-annual time scales. It was shown that the use of these multiple
predictors in the resampling model produces generated sequences that more
closely reproduce the historical longer-term variability, compared to sequences
from a model incorporating fewer predictors. For example, for both Sydney and
Melbourne, the addition of a seasonal-level predictor to an existing one-predictor
set of “rainfall occurrence yesterday” results in a model that produces generated
sequences that more closely match the historical variability of wet days per

season. Further addition of an annual-level predictor to give a three-predictor
model results in improvements to the variability of wet days per year, and
subsequent addition of a multi-year predictor to give a four-predictor model
results in generated sequences which more closely reproduce the autocorrelation
function of the number of wet days per year. These improvements to the
long-term behaviour of the generated sequences do not compromise the short-term
performance of the model (features such as daily means and lag-one correlations
were similar regardless of the number of predictors used). The final resampling
models recommended for both Sydney and Melbourne each use four predictors.

The predictor selection methods described in this chapter are designed to give a
model for generation of long sequences of synthetic data that adequately
reproduces both short-term and longer-term statistical properties of the historical
series in the generated sequences. In contrast, traditionally used parametric model
order-selection criteria (as discussed in section 4.3, and chapter 3) are designed to
give a model that produces accurate forecasts of rainfall occurrence one day ahead
of the present. Results presented here show that predictors chosen by forecasts of
the rainfall one day ahead (using the generic method described in chapter 3), do
not result in a model for generation of long sequences of synthetic data that
appropriately characterises rainfall variability at longer time scales. Thus it is
shown that traditionally used criteria based on one-day-ahead forecasts are not
good indicators of whether synthetic sequences produced by the chosen model
will reproduce the longer-term statistical properties of the historical series.

Some possible shortcomings of the predictor selection methods described in this


chapter are:

1. Selection of a key smoothing parameter used in the resampling model: the
number of nearest neighbours (k). The method for choosing k involves
trialling a range of possible values, and selecting the value that gives the
smallest bias in the generated sequences. Rules for selection of k are yet to be
developed.

2. The procedure is stepwise. There is a possibility that a number of predictors
may exist that, when combined, could capture more of the dependence
structure than the predictors identified by the stepwise procedure. This is
more of a risk as the number of selected predictors increases. However a large
range of predictor combinations for both Sydney and Melbourne have been
tested, and it was found that the combination of predictors identified by the
stepwise method was the best possible combination, for both of these
locations.
3. The methods are data-intensive, and require several hours of computer time to
run. This is a consequence of using data-driven nonparametric methods,
which avoid the need to make specific assumptions about the distribution of
the data.

The stochastic rainfall generator presented here is designed to reproduce the
features of the observed record. The algorithms used require the data to be
gap-filled (note that when the terms “observed record” or “historical record” are
used in this thesis, we are actually referring to the gap-filled record). The locations
selected for analysis required little or no gap filling. The quality of the resulting
generated sequences emulates the quality of the gap-filled observed record; if
systematic or gross errors exist in the observed record, then these errors will be
propagated in the generated rainfall sequences. Also note that the methods
presented here cannot compensate for a short record of observation. On the
contrary, the longer the observed record, the better the quality of the generated
sequences. For example, if the observed record is too short to contain
low-frequency features such as major droughts, then these features will not be present
in the generated sequences. Additionally, long sequences are required to enable
the incorporation of a multi-year predictor into the model to improve the
reproduction of very-long-term variability.

The occurrence process is binary. This makes it easier in practice to fit a complex
model to this data, since the computation times for binary data are much faster
than for continuous data, and simpler algorithms can be used. Even so, the use of
four predictors in the occurrence model took significant computer time. Adding
further predictors to this model (which is not considered to be desirable) would
further increase the computer processing times. However the limitations of the
historical dataset are even more restrictive. ROG(4) uses a two-state short-term
predictor, and three five-state longer-term predictors, resulting in 2 × 5³ = 250
possible “model states”. The historical dataset is large enough to accommodate this
number of model states, but addition of more predictors would cause the number
of model states to grow to an unsupportable level.

The four-predictor rainfall occurrence generator (ROG(4)) provides an explicit
mechanism to simulate drought and low-frequency wet periods. In ROG(4), if the
predictors indicate that the period leading up to the current day is dry, then the
resampled value to be inserted in the generated sequence will be taken from a part
of the historical record that is also dry. Because of the nature of the resampling
process, transitions from “drought conditions” to “wet conditions” in the
generated sequence will emulate the way that transitions from “drought
conditions” to “wet conditions” occur in the historical record.
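This conditional resampling can be illustrated with a much-simplified nearest-neighbour sketch. The function below, its uniform weighting of the k neighbours, and the absolute-distance metric over numerically coded predictor states are all illustrative assumptions; the actual model resamples within a seasonal window and conditions on the four ROG(4) predictors, with k specified through k = α√n.

```python
import numpy as np

def resample_next_day(state, hist_states, hist_next, k, rng):
    """Resample tomorrow's occurrence from the k historical days whose
    predictor state is closest to the current state (simplified sketch)."""
    hist_states = np.asarray(hist_states, dtype=float)
    # Distance between the current predictor state and each historical state.
    dist = np.abs(hist_states - np.asarray(state, dtype=float)).sum(axis=1)
    neighbours = np.argsort(dist)[:k]
    # Uniform weights among the k neighbours (an assumption; a decreasing
    # weight function over the ranked neighbours is also common).
    return hist_next[rng.choice(neighbours)]
```

Because the resampled value comes from historical days in a similar state, a generated dry period can only end in the ways that dry periods ended historically.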

It is the longer-term predictors used in ROG(4) that contain these low-frequency
signals regarding the persistence of “drought conditions” or “wet conditions”.
These signals are similar to the signal contained in ENSO. However, the
formulation of ROG(4) avoids the use of exogenous variables, with the advantage
that when long sequences of synthetic data are generated, it is not necessary to
first generate a long sequence of the exogenous variable.

The usefulness of the methods should be further validated by carrying out more
testing at more locations, both within and outside Australia. It is expected that the
seasonal-level and annual-level predictors, as used in the proposed rainfall
occurrence model, will be useful longer-term predictors regardless of location,
since they are chosen to reproduce the seasonal-level and annual-level variability
that is lacking in existing rainfall models. The usefulness of the multi-year
predictor may vary depending on location, as it is known that Sydney and
Melbourne are influenced by low-frequency climatic anomalies such as ENSO,
but that other locations are not as strongly influenced by such climatic anomalies.

The ROG(4) occurrence model could be linked with existing event-based
approaches that only work when it is wet, such as the approach of Seed et al.
[1999]. This would introduce longer-term variability into the sequences generated
by the event-based approaches.

The next chapter of this thesis will focus on the problem of stochastically
generating rainfall amount values for all the wet days simulated by the rainfall
occurrence model. Details on the procedure used to generate the rainfall amounts
will be presented.

5. A Model for Stochastic Generation of Daily
Rainfall Amounts

5.1 Introduction
The aim of this thesis is to formulate methods of generating daily rainfall at a
single location within a catchment, such that the generated sequences represent
the historical record both at the daily time scale as well as at longer time scales
(seasonal, annual and inter-annual). As a first part of this work, chapter 4 has
proposed an approach for generating sequences of the rainfall occurrence state
(wet or dry), the approach being structured so as to ensure that the variation in the
number of wet days in a seasonal, annual or longer time period, is of the same
form as is observed in the historical rainfall record. This chapter presents the
second part of the work. The aim here is to generate a sequence of rainfall
amounts for all the wet days in the generated rainfall record, such that the
generated sequence of amounts adequately reproduces the distributional features,
dependence features, and seasonal variations of the observed record, and such that
the resulting synthetic rainfall record (occurrence + amounts) reproduces the
observed features of the historic record at daily, seasonal, annual and inter-annual
timescales.

Previous authors have shown that the distribution of rainfall amount is different
on solitary wet days compared to days in the middle of a wet spell, and to days at
the start or end of a wet spell [Buishand, 1978; Chapman, 1998]. This is due to
the characteristics of the weather patterns that produce the rainfall, and to the fact
that rainfall events may start or end at any time of day, with only a few hours of
the event extending into the first or last day of the wet spell. These issues are
taken into account in the amounts model by considering four classes of wet day,
namely solitary wet days, days at the start of wet spells, days at the end of wet
spells, and days in the middle of wet spells, and then giving separate treatment to
the distributions of amount on each class of wet day.
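The classification into the four wet-day classes can be extracted directly from a binary occurrence sequence, for example as follows (a sketch; treating the first and last days of the record as if bounded by dry days is an assumption):

```python
def classify_wet_days(occ):
    """Label each wet day by its position in a wet spell:
    '0'  solitary wet day, '1a' start of a spell,
    '1b' end of a spell,   '2'  middle of a spell.
    Dry days are labelled None."""
    n = len(occ)
    labels = [None] * n
    for t in range(n):
        if not occ[t]:
            continue
        prev_wet = occ[t - 1] if t > 0 else 0
        next_wet = occ[t + 1] if t < n - 1 else 0
        if not prev_wet and not next_wet:
            labels[t] = "0"
        elif not prev_wet:
            labels[t] = "1a"
        elif not next_wet:
            labels[t] = "1b"
        else:
            labels[t] = "2"
    return labels
```

The amounts model is then fitted separately to the amounts carrying each label.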

The weather patterns that produce rainfall introduce dependence structure into the
record of amounts. For example, slow- moving cyclones or slow- moving frontal
systems can result in several days of high rainfall at a given location. These
effects produce small but significant correlations between amounts within a wet
spell [Buishand, 1978]. Most existing rainfall amount models ignore this
correlation structure and assume that the wet day amounts are independent. The
approach proposed in this chapter accommodates within-spell correlations of
rainfall amounts by using the amount on the previous day as a conditioning
variable. Allan [1991] also notes that the weather patterns that produce rainfall are
influenced by low-frequency ocean-atmosphere interactions such as the El Niño
Southern Oscillation. Such climatic variability can result in a rainfall record that
exhibits complex low-frequency dependence features. This is the case for many
rainfall records in Southeastern Australia, such as the records considered in this
thesis. The occurrence model of chapter 4 has considered low-frequency
dependence features in its formulation. An issue that is addressed in this chapter is
whether it is also necessary to consider such low-frequency dependence features
when constructing a model for rainfall amounts on wet days.

This chapter is organised as follows. Section 5.2 discusses the influence of low-
frequency climatic features such as the El Niño Southern Oscillation on the
longer-term variability of rainfall records in Southeastern Australia, and considers
how this longer-term variability can be reproduced in a daily rainfall model. The
model for rainfall amounts is presented in sections 5.3 and 5.4, with the
implementation of the model discussed in section 5.5. Results from the amounts
model, and from the combined occurrence/amounts model, are given in section
5.6. Section 5.7 is the conclusion to the chapter. An additional supplement to this
chapter is presented in Appendix C, which discusses the selection of a key
smoothing parameter that is used in the amounts model, and the influence of the
El Niño Southern Oscillation on rainfall amounts for each of four wet-day classes,
and also presents detailed results for both Sydney and Melbourne, broken down
into wet-day classes.

5.2 Incorporating longer-term variability into a daily rainfall
model
The El Niño Southern Oscillation (ENSO) is a large-scale coupled ocean-
atmosphere quasi-periodic variation centred in the low latitudes of the Pacific
Ocean [Allan, 1991], which occurs with a frequency of around 3-7 years. ENSO
influences rainfall over Southeastern Australia. Table 5.1 presents the average
annual rainfall at Tenterfield in northern New South Wales, Australia, for El Niño
and La Niña years as classified by Allan [1997]. The differences shown in the
table for the annual rainfall are statistically significant (α = 0.05). These differences
between El Niño and La Niña years show that the rainfall record contains
low-frequency features consistent with significant longer-term variability. This
variability cannot be reproduced by a rainfall generation model unless that model
incorporates mechanisms designed to simulate it.

Table 5.1 Average annual rainfall (mm), average number of wet days per year,
and average rainfall on a wet day (mm), for El Niño and La Niña years at
Tenterfield, 1936-1996.

                              El Niño    Non-ENSO    La Niña
                              years      years       years
annual rainfall (mm)          790        900         1020
number of wet days            90         97          112
rainfall on a wet day (mm)    8.8        9.3         9.2

The association between annual rainfalls and ENSO can be broken down into an
occurrence component and an amount component. This is also shown in Table
5.1. The average numbers of wet days for El Niño and La Niña years are
significantly different (α = 0.05). However, the average rainfall on wet days is not
very different for El Niño and La Niña years. This suggests that longer-term
variability in climate has a more dominant effect on the distribution of the number
of wet or dry days in a year, as compared to the amount of rainfall that is recorded

on any given rainy day. Similar results can be obtained for several locations in
Australia where long daily rainfall records are available. Based on these results,
linking an occurrence model that reproduces longer-term variability with a
simpler model for rainfall amounts on wet days should give a good overall model
for daily rainfall. This will be tested later in this chapter.

5.3 A nonparametric model for rainfall amounts on wet days
As noted in chapter 2, Chapman [1998] shows that stochastic models that treat
rainfall amounts as separate classes, based on the number of adjoining wet days,
generally result in a better fit than stochastic models that treat the data together.
Chapman defines class 0 rainfall as amounts on solitary wet days, class 1 rainfall
as amounts on days at the beginning or end of a wet spell, and class 2 rainfall as
amounts on days in the middle of a wet spell. Note that an advantage of
generating the entire rainfall occurrence sequence before rainfall amounts are
simulated is that four separate classes of rainfall can be identified, including days
at the end of a wet spell. In this thesis, class 1 is subdivided into class 1a (days at
the start of wet spells) and class 1b (days at the end of wet spells). The differences
between class 0, class 1a, class 1b, and class 2 amounts are illustrated in Figure
5.1, which shows the historical mean daily amounts for each class for Sydney
(1859-1998), as they vary with time of year. Note that the calculations for each
day shown in Figure 5.1 have their sample sizes increased by the use of a 31-day
moving window [Rajagopalan et al., 1996] centred on the day. This window
width was chosen in a compromise between obtaining a smoothly varying
seasonal pattern, and having a width small enough to capture any rapidly
changing seasonal effects.
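A moving-window subsample of this kind can be sketched as follows, assuming a 365-day year with wrap-around at the year boundary (the handling of leap days is glossed over here):

```python
def seasonal_window_indices(day_of_year, width=31, days_per_year=365):
    """Day-of-year values (1..365) falling in a window of `width` days
    centred on `day_of_year`, wrapping around the year boundary."""
    half = width // 2
    return {(day_of_year - 1 + offset) % days_per_year + 1
            for offset in range(-half, half + 1)}

def window_subsample(amounts, days_of_year, day, width=31):
    """Pool all observations whose day-of-year lies in the window centred
    on `day` (in the thesis this is done within each wet-day class)."""
    window = seasonal_window_indices(day, width)
    return [a for a, d in zip(amounts, days_of_year) if d in window]
```

For a record of around 140 years, this pooling multiplies the sample size available on each calendar day by roughly the window width.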

[Figure 5.1 appears here: mean daily rainfall (mm) plotted against Julian day,
with separate curves for class 2, class 1a, class 1b, and class 0 wet days.]

Figure 5.1 Historical mean daily rainfall for Sydney on class 0, 1a, 1b, and 2 wet
days, as a function of Julian day.

The model for rainfall amounts that is presented in this chapter is nonparametric,
and uses the kernel density estimation methods of Sharma et al. [1997]. A
univariate kernel density estimator is presented first, which is designed to
adequately reproduce the distributional features of the observed record. In the
section following this, a conditional kernel density estimator is introduced, which
is designed to adequately reproduce both distributional features and the daily
correlation structure of rainfall amounts.

A univariate kernel probability density estimator is written as

    \hat{f}_X(x; h) = \sum_{i=1}^{n} \frac{1}{nh} K\left( \frac{x - x_i}{h} \right)        (5.1)

where x_i is the ith data point in a sample of size n, K(·) is a kernel function that
must integrate to 1, and h is the bandwidth (a smoothing parameter) of the kernel
used in estimating the probability density function.

The density estimate in (5.1) is formed by summing kernels with bandwidth h
centred at each observation x_i. This is similar to the construction of a histogram,
where individual observations contribute to the density by placing a rectangular
box (analogous to the kernel function) in the prespecified bin the observation lies
in [Sharma et al., 1997]. A histogram is discrete and sensitive to the position and
size of each bin. By using smooth kernel functions, the kernel estimate in (5.1) is
smooth and continuous.

We use a Gaussian kernel function in this work. The Gaussian kernel function is
defined as:

    K(x) = \frac{1}{(2\pi)^{0.5}} \exp(-x^2 / 2)        (5.2)

The choice of the bandwidth is the key to an accurate estimate of the probability
density. A large bandwidth results in an oversmoothed probability density, with
subdued modes and over-enhanced tails. A small bandwidth, on the other hand,
can lead to density estimates overly influenced by individual data points, with
noticeable bumps in the tails of the probability density.

Several methods for estimating an “optimal” bandwidth have been proposed in the
statistical literature. A relatively simple bandwidth choice, suggested for use with
samples having a high coefficient of skewness, is used here. This bandwidth can
be estimated as:

    h = \lambda R n^{-1/5}        (5.3)

where R is the sample interquartile range of the data, and λ is a factor that depends
on the sample coefficient of skewness. A value of λ = 0.3 was chosen in the
present study based on the average coefficient of skewness noted for the Sydney
daily rainfall amounts. This skewness was stable across seasons and across the
wet-day classes, with an average value of about 4. The derivation of equation
(5.3) is discussed in Appendix C.
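Equation (5.3) is simple to compute. The sketch below uses λ = 0.3 as chosen for the Sydney amounts, and assumes the interquartile range is estimated with default (linear-interpolation) percentiles:

```python
import numpy as np

def reference_bandwidth(sample, lam=0.3):
    """Bandwidth h = lambda * R * n**(-1/5) from equation (5.3),
    where R is the sample interquartile range and n the sample size."""
    sample = np.asarray(sample, dtype=float)
    r = np.percentile(sample, 75) - np.percentile(sample, 25)
    return lam * r * len(sample) ** (-0.2)
```

Using the interquartile range rather than the standard deviation makes the bandwidth robust to the heavy right tail of daily rainfall amounts.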

5.4 Conditional modelling of wet day amounts


We now develop a conditional model that is designed to reproduce the observed
dependence between precipitation amounts on successive wet days. A conditional
probability density estimator formed from a bivariate density is written as:

    \hat{f}(x_t | x_{t-1}) = \frac{\hat{f}(x_t, x_{t-1})}{\hat{f}_m(x_{t-1})}        (5.4)

where \hat{f}(x_t, x_{t-1}) is the bivariate density estimate formed by summing
bivariate kernels centred at each of n historical observations (x_i, x_{i-1}), and
\hat{f}_m(x_{t-1}) is the estimated marginal density of x_{t-1}.

\hat{f}(x_t, x_{t-1}) is formed in a similar way to the univariate kernel density
estimate in (5.1), except its formation is in two dimensions, and two bandwidths
are used, one in each coordinate direction. \hat{f}(x_t | x_{t-1}) is a slice through
this bivariate density function along the conditioning plane specified by x_{t-1}.
The conditional estimate is itself composed of a sum of slices, along the
conditioning plane, through the n individual kernels that form the bivariate
density estimate. This is illustrated in Figure 5.2.

Figure 5.2 Illustration of the conditional probability density function
\hat{f}(x_t | x_{t-1}) (after Sharma et al. [1997]).

Using the kernel density estimator in (5.1), the conditional density in (5.4) is
estimated as:

    \hat{f}(x_t | x_{t-1}) = \sum_{i=1}^{n} w_i \frac{1}{(2\pi)^{0.5} h_1} \exp\left( -\frac{(x_t - x_i)^2}{2 h_1^2} \right)        (5.5)

where:

\hat{f}(x_t | x_{t-1}) is the conditional probability density estimate;

h_1 is the bandwidth in the X_t coordinate direction (cf. Figure 5.2);

w_i is the weight associated with each kernel slice that constitutes the
conditional probability density:

    w_i = \frac{\exp\left( -\frac{(x_{t-1} - x_{i-1})^2}{2 h_2^2} \right)}{\sum_{j=1}^{n} \exp\left( -\frac{(x_{t-1} - x_{j-1})^2}{2 h_2^2} \right)} ;

h_2 is the bandwidth in the X_{t-1} coordinate direction;

and x_i is the conditional mean associated with each kernel slice.

The derivation of equation (5.5) is presented in Sharma and O’Neill [2002],
except that this thesis uses a bivariate kernel that is scaled by h_1 and h_2 along
each coordinate axis respectively. Sharma and O’Neill used the full covariance
matrix to scale the kernel, giving elliptical axes with orientation specified by the
covariance matrix. This process is known as sphering [Fukunaga, 1972, p. 175]. A
sphered kernel will intercept the conditioning plane (x_{t-1}) at an angle. This can
result in kernel slices that have a conditional mean that is negative, a situation to
be avoided.

As stated above, \hat{f}(x_t | x_{t-1}) is a slice through the bivariate density
function \hat{f}(x_t, x_{t-1}), which is itself formed from bivariate kernels
centred on each data pair (x_i, x_{i-1}). \hat{f}(x_t | x_{t-1}) is therefore
composed of a sum of slices, along the conditioning plane specified by x_{t-1},
through the n individual kernels centred on each data pair (x_i, x_{i-1}). The
weight w_i depends directly on how far the data pair (x_i, x_{i-1}) is from the
conditioning plane (i.e. the distance |x_{t-1} - x_{i-1}|), and it gives the relative
area of each individual kernel slice at the conditioning plane. A smaller w_i
implies that the kernel centred on the data pair is far from the conditioning plane
and does not make up a significant proportion of the conditional density estimate.
On the other hand, a large w_i implies that kernel i is close to the conditioning
plane and constitutes a significant portion of the conditional density estimate.
Consequently, simulation will proceed with more emphasis given to the observed
data points lying closer to the conditioning plane, and less emphasis given to the
data points that lie further away.
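The weights w_i can be computed directly from their definition in equation (5.5):

```python
import numpy as np

def kernel_slice_weights(x_cond, x_prev, h2):
    """Weights w_i of each kernel slice at the conditioning plane
    x_{t-1} = x_cond: Gaussian in the distance |x_cond - x_{i-1}|,
    normalised to sum to one."""
    x_prev = np.asarray(x_prev, dtype=float)
    w = np.exp(-((x_cond - x_prev) ** 2) / (2.0 * h2 ** 2))
    return w / w.sum()
```

Data pairs whose x_{i-1} lies close to the conditioning value receive most of the weight, exactly as described above.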

Both h_1 and h_2 are calculated using equation (5.3), and are estimated based on
the respective samples denoting the X_t and X_{t-1} coordinate directions.
However, the values of R and n used for calculation of h_1 are conditional to the
value of x_{t-1}. Details of these conditional calculations are given in Appendix C.

5.5 Implementation of the model for amounts


The model for rainfall amounts is called RAG to denote “rainfall amount
generator”. The one-predictor RAG(1) model, described here, uses kernel density
estimation to provide an empirical estimate of the distributional features of the
observed data. The model is conditioned on the rainfall amount on the previous
day, so that it can reproduce the short-term dependence structure of the observed
amounts, and seasonality is modelled using the moving window approach that is
described above. The model is applied separately to solitary wet days, days at the
start of wet spells, days at the end of wet spells, and days in the middle of wet
spells, with the amounts on each class of wet day labelled as class 0, class 1a,
class 1b, and class 2, respectively.

RAG(1) generates amounts from a conditional probability density function


formed from the class 0 amounts (or class 1a, class 1b, or class 2 amounts, as
appropriate, along with the rainfall amounts on the previous day) that fall within a
31-day moving window centred on the day of interest. A threshold of 0.3 mm is
used to define a wet day. For class 0 and class 1a amounts, the amount on the
previous day is always zero, and the simpler version of the model described in
section 5.3 is implemented. Simulation of class 0 and class 1a amounts (x t)
proceeds as follows:
1. Form a seasonal subsample of n class 0 (or class 1a) amounts x i from the
historical record. Call this seasonal subsample X.

2. Estimate h from X using equation (5.3).

3. Pick an x i value from X with probability 1/n.

4. Select xt as a random variate from the kernel centred on xi:

   xt = xi + hWt   (5.6)

   where Wt is a random variate from a normal distribution with mean 0 and variance 1.

Equation (5.6) can lead to rainfall amounts that are less than the threshold amount. To get around this problem, a variable kernel and boundary renormalisation are used near the threshold value of 0.3 mm (as described in Sharma and O'Neill [2002]). The bandwidth used for simulating the new value (step 4 of the algorithm) is reduced depending on how far xi is from 0.3. The modified step 4 of the above algorithm is as follows:

4a. Estimate a transformed bandwidth h′ such that:

    h′ = h                                          if FN(xi, h²)(xt < 0.3) ≤ α
    h′ is chosen such that FN(xi, h′²)(xt < 0.3) = α  if FN(xi, h²)(xt < 0.3) > α
    h′ = 0                                          if xi = 0.3

    where FN(µ, σ²) is the cumulative probability of a normal distribution with mean µ and variance σ². The bandwidth is transformed to h′ if, for the selected normal kernel, the probability of the amount xt being less than 0.3 is estimated to be greater than a specified threshold α.

4b. Sample a new value of xt as xt = xi + h′Wt, where Wt is a random variate from a normal distribution with zero mean and unit variance.

4c. Repeat step 4b if the sampled xt is less than 0.3.

The rationale behind the use of the above transformation is to leave the bandwidth unaltered if the selected kernel (centred on xi) is far away from the 0.3 mm amount boundary, but to reduce the bandwidth if that is not the case. A threshold probability α equal to 0.06 has been used here (after Sharma and O'Neill [2002]). If an amount less than 0.3 mm is simulated (as would happen in 6% of all cases for kernels lying close to the threshold boundary), a new value is sampled from the same kernel until an acceptable value results. This renormalisation will result in some bias, but the method is designed to minimise this bias, and it can be seen from the results in Figure 5.3 that any bias resulting from the boundary renormalisation does not substantially affect our results. Note that the use of such a variable kernel ensures that if a significant number of observed data points represent low-rainfall conditions, the nature of dependence that leads to such low rainfalls in the historical record is naturally enforced in the simulations. If such a procedure were not used, the simulation would proceed in the same manner for both low and high rainfall observations, without recognising that the variability associated with low-rainfall values is significantly smaller than that associated with the higher-rainfall values. A related outcome of this procedure is that the transformed bandwidth h′ is set equal to zero when xi = 0.3, so that the simulated values in this case are equal to 0.3. After simulation, the generated rainfall amounts are rounded to the nearest 0.1 mm, because the historical amounts are also rounded to the nearest 0.1 mm.
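As an illustration, steps 3-4 with the boundary correction of steps 4a-4c might be sketched as below. Equation (5.3) is not reproduced in this section, so a Silverman-type reference bandwidth is substituted for it (an assumption); the 0.3 mm threshold and α = 0.06 follow the text, and the function names are ours.

```python
import math
import random
import statistics

def normal_cdf(x, mu, sigma):
    # Cumulative probability FN(mu, sigma^2)(X < x) of a normal kernel.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def simulate_amount(sample, threshold=0.3, alpha=0.06, rng=random):
    """One unconditional draw (class 0 or class 1a) following steps 3-4,
    with the variable-kernel boundary correction of steps 4a-4c."""
    n = len(sample)
    # Reference bandwidth used in place of equation (5.3) (an assumption).
    h = 1.06 * statistics.stdev(sample) * n ** (-0.2)
    xi = rng.choice(sample)  # step 3: pick x_i with probability 1/n
    # Step 4a: shrink the bandwidth if the kernel spills below the threshold.
    if xi == threshold:
        hp = 0.0
    elif normal_cdf(threshold, xi, h) <= alpha:
        hp = h
    else:
        # Solve FN(xi, hp^2)(x < threshold) = alpha for hp; since alpha < 0.5
        # the standard-normal quantile z_alpha is negative, so hp > 0.
        z_alpha = statistics.NormalDist().inv_cdf(alpha)
        hp = (threshold - xi) / z_alpha
    # Steps 4b-4c: sample from the kernel, rejecting values below the threshold.
    while True:
        xt = xi + hp * rng.gauss(0.0, 1.0)
        if xt >= threshold:
            return round(xt, 1)  # amounts are recorded to the nearest 0.1 mm
```

Here `simulate_amount` assumes that step 1 (forming the 31-day seasonal subsample) has already produced `sample`, and that the subsample contains at least two distinct values so that the bandwidth is positive.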

For class 1b and class 2 amounts, the rainfall on the previous day is considered,
and the conditional version of the model described in section 5.4 is implemented.
The simulation of class 1b and class 2 amounts proceeds as follows:

1. Specify the Julian day j corresponding to xt. xt is unknown at this stage of the algorithm, but xt−1 is known.

2. Form a seasonal subsample (centred on Julian day j) of class 1b (or class 2) amounts (xi) from the historical record. Also identify all corresponding xi−1 values. These form n pairs of amounts (xi, xi−1). Call this seasonal subsample X.

3. Estimate h2 from the marginal distribution of xi−1 using equation (5.3);

4. Estimate the weights wi for the kernel slices that are associated with each data
pair ( xi , xi−1 ).

5. Estimate h1 from X using equation (5.3), with values of R and n chosen conditional to the value of xt−1 (cf. Appendix C);

6. Pick a data pair (xi , xi−1 ) from X with probability wi. The bivariate kernel

centred on this data point intersects the conditioning plane specified by xt −1 ,

giving a kernel slice centred at x i. (Parameters x i, h1 , and wi give the centre,


spread, and relative area of each individual kernel slice at the conditioning
plane specified by xt −1 ).

7. Select xt as a random variate from the kernel slice centred on xi:

   xt = xi + h1Wt   (5.7)

   where Wt is a random variate from a normal distribution with mean 0 and variance 1.

The use of a variable kernel and boundary renormalisation modifies step 7 of the
above procedure near the threshold amount value of 0.3 mm, in a way similar to
the unconditional case that has already been described. Note that equation (5.7) is
identical to equation (5.6), with the only differences here being that the x i and h1
values are selected conditionally.
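A sketch of the conditional simulation (steps 3-7) is given below. As before, a reference bandwidth stands in for equation (5.3), the conditional adjustment of R and n (Appendix C) is omitted, and the boundary correction near 0.3 mm is left out for brevity; the function name is ours.

```python
import math
import random
import statistics

def conditional_draw(pairs, x_prev, rng=random):
    """One conditional draw (class 1b or class 2) following steps 3-7.

    `pairs` is the seasonal subsample of (x_i, x_{i-1}) amounts.  A reference
    bandwidth stands in for equation (5.3), and the conditional adjustment of
    R and n (Appendix C) is omitted.
    """
    n = len(pairs)
    xs = [p[0] for p in pairs]
    xprevs = [p[1] for p in pairs]
    h1 = 1.06 * statistics.stdev(xs) * n ** (-0.2)
    h2 = 1.06 * statistics.stdev(xprevs) * n ** (-0.2)
    # Step 4: weight w_i is the relative area of kernel i's slice at the
    # conditioning plane x_{t-1} = x_prev.
    w = [math.exp(-0.5 * ((x_prev - xp) / h2) ** 2) for xp in xprevs]
    total = sum(w)
    w = [wi / total for wi in w]
    # Step 6: pick a data pair with probability w_i.
    xi = rng.choices(xs, weights=w, k=1)[0]
    # Step 7: sample from the kernel slice centred on x_i (equation (5.7)).
    return xi + h1 * rng.gauss(0.0, 1.0)
```

Conditioning on a small x_prev weights the simulation towards data pairs with small previous-day amounts, which is how the lag-one dependence is carried into the generated sequence.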

5.6 Results for the rainfall generator
The ROG(4) model for occurrences was combined with RAG(1) for amounts to
give a combined model for rainfall generation. Results for this combined model
for Sydney daily rainfall (1859-1998) are shown in Figures 5.3 to 5.6.

Figure 5.3 shows daily statistics of class 2 rainfall amounts for the RAG(1) model.
The distribution of each statistic from 100 generated sequences is shown by the 5th
percentile, median, and 95th percentile lines (each of the 100 sequences generated
by the RAG(1) model is of the same length as the historical sequence).
Superimposed on each graph are the historical values (circles). It can be seen that
the historical values vary smoothly with time of year, and so do the generated
values, with RAG(1) adequately reproducing the historical values. The reproduction of the lag-one correlation structure that is shown in Figure 5.3d is better than that reported for a multi-state Markov chain in Gregory et al. [1993]. The calculations for each day shown in Figure 5.3 have their sample sizes increased by the use of a 31-day moving window centred on the day.

The results for class 0, class 1a, and class 1b amounts also reproduced the
corresponding historical daily level statistics (see Appendix C). The only
exception to this was that simulated mean class 1b rainfalls were slightly too high
(cf. Figure C.4a). It was found that the high rainfalls occurred because the model
overestimates the amounts on the days before class 1b days. When the class 1b
amounts are conditionally simulated, the positive correlation structure causes the
class 1b amounts to be similarly overestimated. However this does not
significantly alter the overall model performance.

Note that about 115 days in the Sydney rainfall record have rainfalls of greater
than 100 mm. Amounts over 100 mm usually occur on class 2 wet days, but can
occur on class 1b, or (less commonly) on class 1a wet days. Amounts over 100
mm do not occur on class 0 wet days. Six days in the Sydney record have amounts
over 200 mm. One of these is a class 1b wet day (234 mm on 9 November 1984)
and the rest are class 2 wet days. The two wettest days are 280 mm on 28 March 1942 and 328 mm on 6 August 1986, corresponding to Julian days 88 and 219,
respectively. The increase in skewness (Figure 5.3c) that occurs around Julian day
219 is due to the wettest day in the historical record (328 mm on 6 August 1986).
This also affects the standard deviations (Figure 5.3b). Note that the observed
extreme value on 6 August 1986 affects the results up to 15 days on either side of
Julian day 219, because the 31-day moving window is being used.

It must be noted that the median skewness for Julian day 219 from the 100
sequences generated by RAG is lower than the historical skewness for Julian day
219. This phenomenon (i.e. high historical skewness values are underrepresented
by median values of the generated skewness) can also be observed in results for
the class 0, class 1a, and class 1b amounts, and in results for the Melbourne data,
as reported in appendix C. A possible explanation for this phenomenon is that the
expected skewness for a generated sequence may be smaller than the observed
skewness. Consider a dataset where one observed datapoint is twice the
magnitude of all the other data. Let us label the extreme datapoint as “a”. This one
datapoint causes a substantial proportion of the observed skewness. If we use
resampling with replacement to form 100 generated sequences of the same length
as the original dataset, then we expect datapoint “a” to appear about 100 times in
the generated sequences. But it will not appear in some sequences, and it will
appear more than once in some other sequences. In the sequences where it does
not appear, the skewness will be low. In the sequences where it appears twice or
more, the skewness will be only slightly greater than the skewness for those
sequences where it only appears once. Thus the expected skewness for a generated
sequence is not as high as the observed skewness, in this example.
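This argument can be checked with a small resampling experiment (illustrative only; the synthetic dataset, the moment-based skewness estimator, and the sample sizes are our own choices):

```python
import random
import statistics

def skewness(data):
    # Moment-based sample skewness: m3 / m2**1.5.
    m = statistics.fmean(data)
    m2 = statistics.fmean([(x - m) ** 2 for x in data])
    m3 = statistics.fmean([(x - m) ** 3 for x in data])
    return m3 / m2 ** 1.5

rng = random.Random(1)
# A sample containing one extreme value, as in the thought experiment above.
data = [rng.uniform(1.0, 5.0) for _ in range(99)] + [50.0]
observed = skewness(data)
# Resampling with replacement mimics generated sequences of the same length.
boot = [skewness(rng.choices(data, k=len(data))) for _ in range(500)]
```

For a sample like this, both the mean and the median of the resampled skewness values fall below the observed skewness, which is the behaviour seen in the generated rainfall sequences.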

[Figure: four panels plotting mean amount (mm), sd of amount (mm), skew of amount, and lag-one correlation against Julian day; lines show the 5th percentile, median, and 95th percentile of generated values, and circles show historical values.]

Figure 5.3 RAG(1): Statistics of daily rainfall for Sydney on class 2 wet days, as
a function of Julian day. a. Mean daily rainfall. b. Standard deviation of daily
rainfall. c. Skew of daily rainfall. d. Lag-one correlation of daily rainfall.

Figure 5.4 shows how the combined ROG(4)/RAG(1) model reproduces the
variability of Sydney rainfall at several timescales. Figures 5.4a and 5.4b show the
standard deviation of rainfall per season and per year, respectively. Figure 5.4c is
a plot of standard deviation of rainfall totals versus timescale in years. Standard
deviations are shown on this plot for annual rainfall, biannual rainfall, triannual
rainfall, 4-yearly rainfall, 5-yearly rainfall, 6-yearly rainfall, and 7-yearly rainfall.
The standard deviation of annual rainfall is identical to the plot shown in Figure
5.4b. The other standard deviations are an indicator of the very- long-term
variability of the sequences. Historical values of the standard deviations are
shown by lines (Figures 5.4a and 5.4c) or by a solid circle (Figure 5.4b). The box
plots show the distribution of each statistic from 100 generated sequences, with
the median, 25th percentile, and 75th percentile values forming the box, and 5th
percentile and 95th percentile values forming the whiskers. The seasonal-level standard deviations are reproduced adequately; however, the annual and longer-term standard deviations are slightly overrepresented.

Figure 5.4 Combined ROG(4)/RAG(1): Variability of Sydney rainfall totals at
several timescales. a. Standard deviation of seasonal rainfall. b. Standard
deviation of annual rainfall. c. Standard deviation of rainfall totals at very long
timescales.

The overrepresentation of longer-term variability is unexpected, since the rainfall


occurrence model (ROG(4)) adequately represents the variability of wet days per year, and previous studies found that annual variability was underrepresented by daily rainfall models [Wilks and Wilby, 1999]. Upon investigation, it was found
that the overrepresentation of variability was linked to complex low- frequency
features in the historical rainfall record, relating to differences between the wet
day classes. In particular, the apparent cause of over-representation of annual and
longer-term standard deviations is that the historical class 2 daily amount is lower
in wet years than in dry years (cf. the discussion in Appendix C, and Table C.3).
Without the use of an annual-level predictor, the RAG(1) model is insensitive to this feature of the historical record. To overcome this problem, the amounts model

was conditioned on the 365-day wetness state (very dry, dry, average, wet, or very
wet), based on the number of wet days over the past 365 days. The resulting two-
predictor model, which is denoted as RAG(2), is formulated in the same way as
the one-predictor model described above, except the observations x i are separated
into five datasets according to the historical values of the 365-day wetness state.
The values of the wetness state in the generated sequence determine which dataset
to use in the simulation for a particular day. The combined ROG(4)/RAG(2)
model gives the results shown in Figures 5.5 and 5.6. The daily-level results for this model are similar to those shown in Figure 5.3. The figures show that the use of the annual-level predictor in the amounts model significantly improves the representation of the annual-level variability.
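The 365-day wetness state used as the annual-level predictor could be computed along the following lines (a sketch only; the use of equal-probability quantile boundaries to define the five states is an assumption here, with the exact definition following chapter 4 and Appendix C):

```python
def wetness_states(wet, window=365, n_states=5):
    """Classify each day by the number of wet days in the preceding `window`
    days, using equal-probability (quantile) class boundaries (an assumed
    choice).  `wet` is a 0/1 occurrence sequence; the first `window` days,
    which lack a complete window, are skipped.
    """
    counts = []
    total = sum(wet[:window])
    for t in range(window, len(wet)):
        counts.append(total)            # wet days over days t-window .. t-1
        total += wet[t] - wet[t - window]
    ranked = sorted(counts)
    # Boundaries splitting the observed counts into n_states classes.
    bounds = [ranked[len(ranked) * k // n_states] for k in range(1, n_states)]
    def state(count):
        # 0 = driest class ... n_states-1 = wettest class
        return sum(count >= b for b in bounds)
    return [state(c) for c in counts]
```

In the generated sequences, the state would be recomputed from the simulated occurrences as generation proceeds, so that the amounts model responds to the wetness of the simulated (not historical) past year.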


Figure 5.5 Combined ROG(4)/RAG(2): Variability of Sydney rainfall totals at


several timescales. a. Standard deviation of seasonal rainfall. b. Standard
deviation of annual rainfall. c. Standard deviation of rainfall totals at very long
timescales.

[Figure: rainfall per year (mm) plotted against exceedance probability; lines show the 5th percentile, median, and 95th percentile of generated values, and circles show historical values.]
Figure 5.6 Combined ROG(4)/RAG(2): Distribution of annual rainfall amounts
for Sydney.

The combined ROG(4)/RAG(1) model was also applied to Melbourne daily


rainfall (1855-1998). The ROG(4) results are documented in chapter 4, and show
that the occurrence model reproduces both daily- level statistics and statistics for
long-term wet day totals. The amounts model for Melbourne (RAG(1))
reproduces historical daily level statistics of amounts for the four classes of
Melbourne rainfall. Figure 5.7 shows daily statistics of class 2 rainfall amounts for
the RAG(1) model. The calculations for each day shown in Figure 5.7 have their
sample sizes increased by the use of a 31-day moving window centred on the day.

[Figure: four panels plotting mean amount (mm), sd of amount (mm), skew of amount, and lag-one correlation against Julian day; lines show the 5th percentile, median, and 95th percentile of generated values, and circles show historical values.]

Figure 5.7 RAG(1): Statistics of daily rainfall for Melbourne on class 2 wet days,
as a function of Julian day. a. Mean daily rainfall. b. Standard deviation of daily
rainfall. c. Skew of daily rainfall. d. Lag-one correlation of daily rainfall.

Figure 5.8 shows how the combined ROG(4)/RAG(1) model reproduces the
variability of Melbourne rainfall at several timescales. The seasonal-level standard deviations are reproduced adequately; however, the annual and longer-term variability of Melbourne rainfall amounts is overrepresented, as was the case for Sydney (cf. Figure 5.4). A combined ROG(4)/RAG(2) model for Melbourne
corrected this overrepresentation, as shown in Figure 5.9 and Figure 5.10. This
combined model incorporates the one-year wetness state as a predictor for
amounts. Note that the performance of the ROG(4)/RAG(2) model for Melbourne
is not quite as good as the equivalent model for Sydney rainfall. A 90-day wetness state was also tried in place of the one-year wetness state in the amounts model for Melbourne, but this did not improve the reproduction of longer-term variability.


Figure 5.8 Combined ROG(4)/RAG(1): Variability of Melbourne rainfall totals at


several timescales. a. Standard deviation of seasonal rainfall. b. Standard
deviation of annual rainfall. c. Standard deviation of rainfall totals at very long
timescales.

Figure 5.9 Combined ROG(4)/RAG(2): Variability of Melbourne rainfall totals at
several timescales. a. Standard deviation of seasonal rainfall. b. Standard
deviation of annual rainfall. c. Standard deviation of rainfall totals at very long
timescales.

[Figure: rainfall per year (mm) plotted against exceedance probability; lines show the 5th percentile, median, and 95th percentile of generated values, and circles show historical values.]
Figure 5.10 Combined ROG(4)/RAG(2): Distribution of annual rainfall amounts
for Melbourne.

Detailed results for the combined rainfall generators for both Sydney and Melbourne are given in Appendix C. The appendix includes a breakdown of results into wet day classes, a more detailed discussion of why using an annual-level predictor in the amounts model improves the results, and a discussion of why the combined ROG(4)/RAG(2) model for Melbourne does not perfectly reproduce the historical annual standard deviation shown in Figure 5.10.

5.7 Conclusions
The model for stochastic generation of rainfall amounts that is presented in this
chapter is an improvement over existing models. It can reproduce key day-to-day
features that exist in the historical rainfall record, including the lag-one correlation
structure, and differences in the distribution of amounts between solitary wet
days, days at the start or end of a wet spell, and days in the middle of a wet spell.

Seasonal variations in the rainfall record are also smoothly reproduced. Existing
amounts models are not designed to reproduce all of these features. In addition,
the use of an annual level predictor in the proposed model allows complex low-
frequency features of the historical record to be reproduced. All of this is achieved
within a nonparametric framework that minimises the assumptions made in
simulating the rainfall amounts.

The proposed rainfall amount model is linked with long sequences of rainfall occurrence generated by the model of chapter 4, and it is shown that the longer-term variability present in the historical rainfall record can be reproduced by this
combined model. The resulting generated sequences provide a better
representation of the variability associated with droughts and sustained wet
periods than was previously possible. Such features are of great interest in
catchment management studies, and the generated sequences can be used in
catchment studies to enable better quantification of the uncertainty in the
catchment response that is due to climatic variability.

The combined model is composed of two parts, which are a multi-predictor


rainfall occurrence generator (ROG(4)), and a two-predictor rainfall amount
generator (RAG(2)) for amounts on wet days. In both parts of the model, the use
of a seasonally representative sample at any given time of year ensures accurate
representation of the seasonal variations present in the rainfall time series.
ROG(4) resamples from a seasonal subset of the historical record of rainfall
occurrence, conditional to the values of a set of multiple predictors. The predictors
are formed solely from previous values in the sequence, and represent short-term,
seasonal, annual, and inter-annual features of the rainfall sequence. The use of
these multiple predictors in the resampling model produces generated occurrence
sequences that closely reproduce the historical longer-term variability.

The model for generation of amounts (RAG(2)) works with a seasonal subset of
the historical record of rainfall amounts. This model is conditioned on both daily-level and annual-level predictors. A smoothed empirical estimate of the conditional probability distribution of amounts is formed, and amounts are
generated from this empirical distribution. Separate RAG models are applied to
class 0, class 1a, class 1b, and class 2 wet days, where each wet day is assigned to
a class based on the number of adjacent wet days, and class 1a and class 1b refer
to wet days at the start and at the end of a wet spell, respectively. The use of these
various classes is important since the probability density function of the rainfall
varies between each class.

An important conclusion to be made from this research is that a substantial


portion of the longer-term variability in rainfall is accounted for by the variability
in the rainfall occurrence process. The use of a rainfall occurrence model that is
capable of simulating low- frequency variability ensures that the overall daily
rainfall model can reproduce the observed statistics at seasonal, annual, and
longer timescales. The subsequent simulation of amounts on each wet day is
relatively simple. A second conclusion is that use of separate models for
representing rainfall amounts on solitary wet days, days at the start or end of a wet
spell, and days within the wet spell (denoted class 0, class 1a, class 1b and class 2,
respectively) is essential for properly representing the observed rainfall process,
as the distributional characteristics of each such class vary significantly from one
to the other. Lastly, the use of categorical aggregate variables to represent long-
term features in the record has significant benefits, and ensures an accurate
representation of low-frequency features in the simulations. In our application of
the combined rainfall occurrence and rainfall amounts model to Sydney and
Melbourne rainfall data, we show that the simulations reproduce statistics at both
a daily level and at longer (annual and even inter-annual) time scales. This would
not have been possible without the use of categorical aggregate variables.

The algorithms used here require the data to be gap-filled. The Sydney and
Melbourne data required little or no gap filling. After simulation, the quality of
the generated sequences emulates the quality of the gap-filled observed record, as
noted in the conclusion to the previous chapter.

A feature of the amounts model is that the behaviour of the model for extreme values is essentially assumption-free. The shape of the empirical probability
density function is determined by the observed data, and the largest observed
values form “bumps” in the tail of this function. The result of this is that the
extreme values that are inserted into the generated sequences are closely related to
the observed extreme values. Unfortunately, to obtain extreme values that are
much larger than (or different to) the observed extremes would involve the type of
assumptions that the nonparametric methods used in this thesis try to avoid.

6. Conclusions

6.1 Motivation
The motivation of this research was to develop methods for generating synthetic
sequences of daily rainfall that reproduce the complex short-term and longer-term
features of the historical record. The rainfall generation problem was approached
in two stages: generation of rainfall occurrence, and subsequent generation of
rainfall amounts on the simulated wet days. For both the occurrence and amount
models, the aim was to accurately reproduce the distributional features,
dependence features, and seasonal variations of the observed record.

An approach for generation of daily rainfall occurrence (whether a day will be


“dry” or “wet”) was presented first, with chapter 3 concentrating on the task of
selecting predictor variables for forecasts of the daily rainfall occurrence state.
The next chapter (chapter 4) presented an approach for generating long sequences
of daily rainfall occurrence. Methods for stochastic generation of the rainfall
amounts were presented in chapter 5.

6.2. Selection of predictors for rainfall occurrence


Chapter 3 introduced a generic measure of partial dependence termed “partial
informational correlation” (PIC), developed for discrete random variables, and
applied it to select relevant short and long-term predictors for a one-day-ahead
forecast of rainfall occurrence state. Chapter 3 also presented a set of candidate
predictors based on previous values in the rainfall occurrence series, that
characterised short-term dependence using rainfall occurrence at short time lags
from the present, and longer-term dependence via predictors that describe how
wet it has been over a longer period.

PIC is a partial measure of dependence derived from mutual information theory,
which is sensitive to both linear and non-linear dependence. The PIC predictor
identification methods produced sensible results at all 13 locations where they
were tested. The conclusion from a leave-one-out cross validation analysis of
rainfall occurrence forecasts was that the PIC predictor identification method gave
a valid predictor set for the short-term prediction of daily rainfall occurrence. The
method is a nonparametric alternative to the use of traditionally used order
selection techniques such as the Akaike Information Criterion or the Bayesian
Information Criterion.

An alternative approach for identifying predictors for use in the daily rainfall
occurrence model was presented in chapter 4. This alternative approach was
contrasted against the PIC predictor selection criterion presented in chapter 3; the
alternative approach offers a less mathematical and more intuitive approach
designed specifically for representation of historical variability in the simulations.
This approach involves stepwise selection of short-term, medium-term, long-term,
and very-long-term predictors. Predictors were chosen at each stage of this
method by comparing the statistical characteristics of synthetic sequences
produced by a model incorporating the selected predictors with the corresponding
statistics observed in the historical record.

The results presented in chapter 4 show that predictors chosen by forecasts of the
rainfall one day ahead (using the PIC method described in chapter 3), do not result
in a model for generation of long synthetic sequences which reproduces the
longer-term variability of rainfall. Thus it was shown that traditionally used order
selection techniques (which measure the quality of short-term forecasts) give little
indication of whether generated sequences from the chosen model will emulate
the longer-term features of the observed record.

6.3. The rainfall occurrence model
Chapter 4 proposed an approach for generating sequences of the rainfall
occurrence state (wet or dry). The aim of the approach was to ensure that the
observed variation in the number of historical wet days at seasonal, annual or
longer time scales was reproduced in the generated sequences. The approach was
also designed to accurately represent the seasonal variations of the rainfall time
series, through the use of the moving window methodology. The approach was applied to historical daily rainfall occurrence from Melbourne and Sydney, Australia.

The model developed in chapter 4 resamples from a seasonal subset of the


historical record of rainfall occurrence, conditional to the values of a set of
multiple predictors. The predictors were categorical aggregate variables formed
solely from previous values in the sequence, and represented short-term, seasonal,
annual, and inter-annual features of the rainfall sequence. This approach provided an explicit mechanism to simulate drought and low-frequency wet periods, since the longer-term predictors used in the model contain low-frequency signals that are similar to the signal contained in ENSO. For both Sydney and Melbourne, the addition of the seasonal-level, annual-level, and multi-year predictors resulted in
improvements to the representation of the variability of wet days per season,
variability of wet days per year, and the autocorrelation function of wet days per
year. These improvements did not compromise the short-term performance of the
model.

An important conclusion of this thesis is that the longer-term variability in daily


rainfall is mainly due to variability in the rainfall occurrence process. The use of a
rainfall occurrence model that is able to simulate low-frequency variability is the
key to ensuring that a combined occurrence/amount model can reproduce the
observed statistics of rainfall at seasonal, annual, and longer timescales.

6.4. The model for rainfall amounts
Chapter 5 proposed a model for stochastic generation of rainfall amounts on wet
days that is nonparametric, accommodates seasonality, and reproduces a number
of key aspects of the distributional and dependence properties of observed rainfall.
The model was conditioned on the rainfall amount on the previous day, and was
able to reproduce key day-to-day features that exist in the historical rainfall
record, including the lag-one correlation structure of rainfall amounts. In addition,
the use of an annual level predictor in the proposed model allowed complex low-
frequency features of the historical record to be reproduced. All of this was
achieved within a nonparametric framework that minimised the assumptions made
in simulating the rainfall amounts.

The distribution of rainfall amount is different on solitary wet days compared to


days in the middle of a wet spell, and to days at the start or end of a wet spell.
This was taken into account in the amounts model by considering four classes of
wet day, namely solitary wet days, days at the start of wet spells, days at the end
of wet spells, and days in the middle of wet spells, and then giving separate
treatment to the distributions of amount on each class of wet day.

The rainfall amount model was applied to daily rainfall from Sydney and
Melbourne, Australia, and the performance of the approach was demonstrated by
presentation of model results at daily, seasonal, annual, and inter-annual
timescales. The rainfall amount model was then linked with long sequences of
rainfall occurrence generated by the model of chapter 4, to produce a combined
model that reproduced the observed features of the rainfall record at several
timescales.

6.5 Limitations of the research


Potential shortcomings of the predictor selection methods described in chapter 3
and chapter 4 are listed at the end of each of these chapters. In summary, these
potential shortcomings are:

• The procedures are stepwise, and there is therefore a (remote) possibility that two (or more) predictors may exist that, when combined, could capture more of the dependence structure than the predictors identified by the stepwise procedures.
• The true significance levels of the procedures are hard to determine.
• The methods are data-intensive, and hence computational requirements can be significant.
• Selection of a key smoothing parameter used in chapter 4 is by iteration.

The usefulness of the methods of this thesis should be validated by applying them to data from other sites. It is expected that the seasonal-level and annual-level predictors, as used in the rainfall occurrence model proposed in chapter 4, will be useful longer-term predictors regardless of location, since they are chosen to reproduce the seasonal-level and annual-level variability that is lacking in existing rainfall models.

The proposed model for rainfall occurrence uses a two-state short-term predictor,
and three five-state longer-term predictors, resulting in 2 × 5³ = 250 possible
“model states”. Although this number of model states is large, the historical datasets used
in this thesis were big enough to accommodate this level of complexity. However,
any increase in the number of predictors would escalate the number of model
states beyond acceptable limits.
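For illustration, the combination of one two-state and three five-state predictors can be encoded as a single conditioning state index; the sketch below (names hypothetical, not from the thesis) confirms the count of 250 distinct model states:

```python
def model_state(short_term, s1, s2, s3):
    """Encode a two-state short-term predictor (0 or 1) and three
    five-state longer-term predictors (each 0..4) as one index in 0..249."""
    assert short_term in (0, 1)
    assert all(0 <= s <= 4 for s in (s1, s2, s3))
    return ((short_term * 5 + s1) * 5 + s2) * 5 + s3

# Enumerating every combination gives 2 * 5**3 = 250 distinct model states.
n_states = len({model_state(a, b, c, d)
                for a in range(2) for b in range(5)
                for c in range(5) for d in range(5)})
```

Each added five-state predictor would multiply the state count by five, which is why any increase in the number of predictors quickly becomes impractical for a fixed record length.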

The quality of the generated sequences produced by the methods of chapter 4 and
chapter 5 reflects the quality of the observed record; if systematic or gross errors
exist in the observed record, then these errors will be passed on to the generated
data. Also note that the methods presented here cannot compensate for a short
record of observation. On the contrary, the longer the observed record, the better
the quality of the generated sequences.

The amounts model is formulated without making major assumptions regarding


the shape of the tail of the probability distribution. This means that the largest

daily amounts that are put into the generated sequences are closely related to the
observed extreme values. The production of generated sequences that incorporate
rainfall amounts much larger than (or different to) the observed extremes would
involve the type of assumptions that the nonparametric methods used in this thesis
try to avoid.

6.6 Future work


The proposed rainfall occurrence model (which captures most of the longer-term
variability of the observed rainfall) could be linked with event-based approaches
that only work when it is wet (see, for example, Seed et al. [1999]). The aim of
this would be to inject longer-term variability into the generated sequences that
are produced by the event-based approaches.

The approaches for stochastic data generation outlined in this thesis assume that
the observed data is stationary, i.e. that the probability distributions, dependence
structure, and periodic seasonal patterns do not change over time. This assumes
that any trends or shifts have already been removed from the observed record, and
that the effects of climate change and anthropogenic change on the observed
record are negligible. Under the assumption of stationarity, the aim of the
stochastic analysis is to produce multiple synthetic sequences that represent the
climate variability contained in the observed record. Note that these synthetic
sequences would form a useful base-case for comparing the hydrologic effects of
“climate variability” against the hydrologic effects of “climate change”, if
projected synthetic rainfall under climate change conditions was available from a
climate change study.

Future research could involve extension of the methodology to produce spatially
correlated rainfall at multiple sites. Such an extension will involve consideration
of spatial correlations and spatial variability.

To the knowledge of the author, the use of categorical aggregated variables as
proposed in this thesis is new. This approach should be useful in time series
modelling in general, to introduce seasonal, annual, and longer-term variability
into a synthetic time series of daily data, or even more generally, to introduce
longer-term variability into a model that operates on a short time step.

6.7 Summary
In summary, this research has produced:
• an approach for predictor identification based on the quality of a one-day-
ahead forecast, that is a nonparametric alternative to traditional approaches
such as the Akaike Information Criterion or the Bayesian Information
Criterion.
• an approach for model selection that is based on the quality of long sequences
of data generated by a model.
• models for generation of long sequences of rainfall occurrence and rainfall
amounts that reproduce a number of key aspects of the distributional features,
dependence characteristics, and seasonal variations of the observed record.
• a reliable empirical method for selecting a bandwidth for use when calculating
the mutual information criterion.

Note that the use of predictor selection methods based on the quality of one-day-
ahead forecasts is abandoned after chapter 3. In chapter 4 it is shown that these
methods are not good indicators of whether synthetic sequences produced by a
daily model which incorporates the chosen predictors will reproduce the
properties of the observed record at seasonal, annual, or inter-annual timescales.

The occurrence and amount models presented in this thesis use multiple predictors
to capture the short-term and longer-term dependence features of the daily rainfall
record. The use of nonparametric techniques, where the entire sample is used to
characterise the underlying distribution rather than just a few descriptors, and the

use of the “moving window” approach to model seasonality, help to ensure that
distributional features and seasonal variability are also reproduced.

The model for rainfall occurrence uses one short-term predictor and three longer-
term predictors. The use of the longer-term predictors, which are specified as
categorical aggregated variables, results in generated sequences that can reproduce
the observed variability of seasonal, annual, and longer-term wet day totals.

The model for rainfall amounts on a wet day is conditioned on the rainfall amount
on the previous day, and variations in the rainfall amount distribution that depend
on the position of the day in a rainfall spell are also modelled. The proposed
rainfall amount model can emulate the day-to-day features that exist in the
historical rainfall record, including the lag-one correlation structure of rainfall
amounts.

When the rainfall occurrence and rainfall amount models are linked, the resulting
generated sequences provide a better representation of the statistical
characteristics of the historical record, including the variability associated with
droughts and sustained wet periods, than was previously possible. These synthetic
sequences can provide useful input data for catchment water management studies,
to help quantify the uncertainty that results from climatic variability. It is
therefore hoped that this thesis will be a step towards improved risk-based
management of water resources, and more reliable evaluation of the hydrologic,
environmental, and socioeconomic impacts of alternative water resource
management plans.

7. References

Akaike, H., A new look at the statistical model identification, IEEE Transactions
on Automatic Control, AC-19, 716-723, 1974.
Allan, R.J., Australasia, in Teleconnections Linking Worldwide Climate
Anomalies, edited by H. Glantz, R.W. Katz, and N. Nicholls, pp 73-120,
Cambridge University Press, New York, 1991.
Allan, R., El Niño/La Niña year classification, in Flood, N.R., and A. Peacock,
Twelve month Australian rainfall relative to historical records (poster),
Resource Sciences Centre, Queensland Department of Natural Resources,
http://www.dnr.qld.gov.au/longpdk, 1997.
Beran, J., Statistics for long memory processes, Chapman and Hall, New York,
1994.
Box, G.E.P., and G.M. Jenkins, Time series analysis: forecasting and control,
Holden-Day, Merrifield, Virginia, 1976.
Bras, R.L., and I. Rodriguez-Iturbe, Random functions and hydrology, Dover
Publications, New York, 1985.
Brockwell, P.J., and R.A. Davis, Introduction to time series and forecasting,
Springer-Verlag, New York, 1996.
Buishand, T.A., Some remarks on the use of daily rainfall models, Journal of
Hydrology, 36, 295-308, 1978.
Burnash, R.J.C., R.L. Ferral, and R.A. McGuire, A Generalised Streamflow
Simulation System: Conceptual Modelling for Digital Computers, Report, US
Department of Commerce, National Weather Service in Cooperation with
California, Department of Water Resources, Sacramento, California, 1973.
Chapman, T.G., Entropy as a measure of hydrologic data uncertainty and model
performance, Journal of Hydrology, 85, 111-126, 1986.
Chapman, T.G., Stochastic modelling of daily rainfall: the impact of adjoining
wet days on the distribution of rainfall amounts, Environmental Modelling
and Software, 13, 317-324, 1998.

Chapman, T.G., Refinements to the Srikanthan-McMahon stochastic model for
daily rainfall, MODSIM 2001 Congress, Canberra, 10-13 December, 287-
292, 2001.
Chin, E.H., Modelling daily precipitation occurrence process with Markov chain,
Water Resources Research, 13(6), 949-956, 1977.
CRC for Waste Management and Pollution Control Limited, Description of the
Daily Weather Model, Catchment Processes and Modelling Branch,
Department of Land and Water Conservation, Sydney, 1995.
Croke, B.F.W., and A.J. Jakeman, Predictions in catchment hydrology: An
Australian perspective, Marine and Freshwater Research, 52, 65-79, 2001.
Department of Land and Water Conservation, Integrated Quantity-Quality Model
(IQQM) Reference Manual, Catchment Processes and Modelling Branch,
Department of Land and Water Conservation, Sydney, 1998a.
Department of Land and Water Conservation, Integrated Quantity-Quality Model
(IQQM) User Manual, Catchment Processes and Modelling Branch,
Department of Land and Water Conservation, Sydney, 1998b.
Efron, B., and R.J. Tibshirani, An Introduction to the Bootstrap, Chapman and
Hall, New York, 1993.
Fraser, A.M., and H.L. Swinney, Independent coordinates for strange attractors
from mutual information, Physical Review A, 33(2), 1134-1140, 1986.
Gabriel, K.R., and J. Neumann, A Markov chain model for daily rainfall
occurrence at Tel Aviv, Quarterly Journal of the Royal Meteorological Society, 88, 90-
95, 1962.
Gregory, J.M., T.M.L. Wigley, and P.D. Jones, Application of Markov models to
area-average daily precipitation series and interannual variability in seasonal
totals, Climate Dynamics, 8, 299-310, 1993.
Haan, C.T., D.M. Allen, and J.D. Street, A Markov chain model of daily rainfall,
Water Resources Research, 12(3), 443-449, 1976.
Harrold, T.I., The Department of Land and Water Conservation’s hydroinformatic
system for streamflow data, Unpublished manuscript, University of New
South Wales, Sydney, 2000.

Harrold, T.I., and A. Sharma, Interseasonal and interannual dependence in daily
hydrologic records, Water 99 Joint congress, Hydrology Conference of the
Institution of Engineers Australia, Brisbane, 751-756, 1999.
Harrold, T.I., A. Sharma, and S.J. Sheather, Selection of a kernel bandwidth for
measuring dependence in hydrologic time series using the mutual
information criterion, Stochastic Environmental Research and Risk
Assessment, 15(4), 310-324, 2001.
Hirsch, R.M., D.R. Helsel, T.A. Cohn, and E.J. Gilroy, Statistical analysis of
hydrologic data, in Handbook of Hydrology, edited by D.R. Maidment, 17.1-
17.55, McGraw-Hill, New York, 1993.
Hosking, J.R.M., Modelling persistence in hydrological time series using
fractional differencing, Water Resources Research, 20(12), 1898-1908, 1984.
Hurst, H.E., Long-term storage capacity of reservoirs, Transactions of the
American Society of Civil Engineers, 116, 770-799, 1951.
Hurst, H.E., Methods of using long term storage in reservoirs, Proceedings
Institute of Civil Engineers, 5(5), 519-590, 1956.
Jimoh, O.D., and P. Webster, The optimum order of a Markov chain model for
daily rainfall in Nigeria, Journal of Hydrology, 185, 45-69, 1996.
Katz, R.W., and X. Zheng, Mixture model for overdispersion of precipitation,
Journal of Climate, 12, 2528-2537, 1999.
Lall, U., Recent advances in nonparametric function estimation: hydraulic
applications, US National Report, International Union of Geophysics, 1991-
1994, Review of Geophysics, 33, 1093, 1995.
Lall, U., B. Rajagopalan, and D.G. Tarboton, A nonparametric wet/dry spell
model for resampling daily precipitation, Water Resources Research, 32(9),
2803-2823, 1996.
Lall, U., and A. Sharma, A nearest neighbour bootstrap for resampling of
hydrologic time series, Water Resources Research, 32(3), 679-693, 1996.
Lavery, B., G. Joung and N. Nicholls, An extended high-quality historical rainfall
dataset for Australia, Australian Meteorological Magazine, 46, 27-38, 1997.
Lettenmaier, D.P., and S.J. Burges, Operational assessment of hydrologic models
of long-term persistence, Water Resources Research 13(1), 113-124, 1977.

Linfoot, E.H., An informational measure of correlation, Information and Control,
1, 85-89, 1957.
Maritz, J.S., Distribution-free statistical methods, Chapman and Hall, London,
1981.
McMahon, T.A., Hydrology - some unfinished business, CH Munro Oration,
Hydrology Conference of the Institution of Engineers Australia, 1997.
Montanari, A., R. Rosso, and M.S. Taqqu, Fractionally differenced ARIMA
models applied to hydrologic time series: Identification, estimation and
simulation, Water Resources Research 33(5), 1035-1044, 1997.
Rajagopalan, B., U. Lall, and D.G. Tarboton, Nonhomogeneous Markov Model
for Daily Precipitation, Journal of Hydrologic Engineering, 1(1), 33-40,
1996.
Rajagopalan, B., and U. Lall, A k-nearest neighbour simulator for daily
precipitation and other weather variables, Water Resources Research, 35(10),
3089-3101, 1999.
Roldan, J., and D.A. Woolhiser, Stochastic daily precipitation models 1. A
comparison of occurrence processes, Water Resources Research, 18(5),
1451-1459, 1982.
Salas, J.D., Analysis and modelling of hydrologic time series, in Handbook of
Hydrology, edited by D.R. Maidment, 19.1-19.72, McGraw-Hill, New York,
1993.
Salas, J.D., J.W. Delleur, V. Yevjevich and W.L. Lane, Applied Modelling of
Hydrologic Time Series, Water Resources Publications, Littleton, Colorado,
1980.
Schwarz, G., Estimating the dimension of a model, Annals of Statistics, 6, 461-
464, 1978.
Scott, D.W., Multivariate density estimation - theory, practice and visualisation,
John Wiley, New York, 1992.
Seed, A.W., R. Srikanthan, and M. Menabde, A space and time model for design
storm rainfall, Journal of Geophysical Research, 104(D24), 31623-31630,
1999.

Sharma, A., Seasonal to interannual rainfall ensemble forecasts for improved
water supply management: 1. A strategy for system predictor identification,
Journal of Hydrology, 239, 232-239, 2000.
Sharma, A., and U. Lall, A nonparametric approach for daily rainfall simulation,
Mathematics and Computers in Simulation, 48, 361-371, 1999.
Sharma, A., and R. O’Neill, A nonparametric approach for representing
interannual dependence in monthly streamflow sequences, Water Resources
Research, in press, 2002.
Sharma, A., D.G. Tarboton, and U. Lall, Streamflow Simulation : A
Nonparametric Approach, Water Resources Research, 33(2), 291-308, 1997.
Sheather, S.J., and M.C. Jones, A reliable data-based bandwidth selection method
for kernel density estimation, Journal of the Royal Statistical Society B,
53(3), 683-690, 1991.
Silverman, B.W., Density Estimation for Statistics and Data Analysis, Chapman
and Hall, New York, 1986.
Simpson, H.J., M.A. Cane, A.L. Herczeg, S.E. Zebiak, and J.H. Simpson, Annual
river discharge in Southeastern Australia related to El Niño-Southern
Oscillation Forecasts of Sea Surface Temperatures, Water Resources
Research, 29(11), 3671-3680, 1993.
Singh, V.P., The use of entropy in hydrology and water resources, Hydrological
Processes, 11, 587-626, 1997.
Srikanthan, R., and T.A. McMahon, Stochastic generation of rainfall and
evaporation data, Technical paper 84, Aust. Water Resources Council,
Canberra, 1985.
Srikanthan, R., and T.A. McMahon, Stochastic generation of annual, monthly and
daily climate data: a review, Report 00/16, Cooperative Research Centre for
Catchment Hydrology, Monash University, Australia, 2000.
Thyer, M., and Kuczera, G., Modelling long-term persistence in hydroclimatic
time series using a hidden state Markov model, Water Resources Research,
36(11), 3301-3310, 2000.
Venables, W.N., and B.D. Ripley, Modern Applied Statistics with S-plus, 2nd ed.,
Springer-Verlag, New York, 1997.

Wilks, D.S., Conditioning stochastic daily precipitation models on total monthly
precipitation, Water Resources Research, 25(6), 1429-1439, 1989.
Wilks, D.S., Interannual variability and extreme-value characteristics of several
stochastic daily precipitation models, Agricultural and Forest Meteorology,
93(3), 153-169, 1999.
Wilks, D.S., and R.L. Wilby, The weather generation game: a review of stochastic
weather models, Progress in Physical Geography, 23(3), 329-357, 1999.
Woolhiser, D.A., Modelling daily precipitation - progress and problems, In
Statistics in the Environmental and Earth Sciences, edited by A. Walton and
P. Gutton, Edward Arnold, London, 1992.
Yevjevich, V., Stochastic Processes in Hydrology, Water Resources Publications,
Colorado, 1972.
Yevjevich, V., Structure of Daily Hydrologic Series, Water Resources
Publications, Littleton, Colorado, 1984.

Appendix A. Selection of a Kernel Bandwidth for
Measuring Dependence in Hydrologic Time Series
using the Mutual Information Criterion

Mutual information is a generalised measure of dependence between any two


variables. It can be used to quantify nonlinear as well as linear dependence. This
makes mutual information an attractive
alternative to the use of the correlation coefficient, which can only quantify the
linear dependence pattern. Mutual information is especially suited for application
to hydrological problems, because the dependence between any two hydrologic
variables is seldom linear in nature.

Calculation of the mutual information score involves estimation of the marginal


and joint probability density functions of the two variables. In this appendix,
nonparametric kernel density estimation methods are used to estimate the
probability density functions. Accurate estimation of the mutual information score
using kernel methods requires selection of appropriate smoothing parameters
(bandwidths) for use with the kernels. The aim of this appendix is to obtain a
practical method for bandwidth selection for calculation of the mutual information
score.

The lag-one dependence structures of several autocorrelated time series are


analysed using mutual information (note that this produces the lag-one auto-MI
score, the analog of the lag-one autocorrelation). Empirical trials are used to select
appropriate bandwidths for a range of underlying autoregressive and
autoregressive-moving average models with normal or near-normal parent
distributions. Expressions for reasonable bandwidth choices under these
conditions are proposed.

A.1 Introduction
Identification, analysis, and modelling of dependence are essential parts of the
discipline of hydrology. Statistical dependence is a mathematical description of
the strength of the relationship between a dependent variable and one or more
explanatory (predictor) variables. Examples where the analysis and modelling of
dependence is required include filling gaps in rainfall, evaporation and streamflow
data, forecasting future rainfall and streamflows, and generation of long sequences
of synthetic data. Hirsch et al. [1993] provides a good introduction to the analysis
of relationships between hydrologic variables.

The dependence structure present in hydrologic time series has traditionally been
modelled using autoregressive or autoregressive-moving average models. Such
models characterise the time series by an assumed probability distribution based
on a small number of sample statistics such as mean, variance, correlation and
skewness, calculated from the historical record. The fitted models assume a linear
dependence relationship, and fitting of the models is often based on the
calculation of the correlation or partial correlation between various lags of the
variable being modelled. Measures such as correlation offer a limited
representation of the nature of dependence that may be present, as they can only
represent the quality of a linear relationship between the variables. Nonlinear
relationships between variables (which are common in hydrology) cannot be
adequately detected and quantified by the correlation coefficient. As a result, the
corresponding fitted models may not be fully representative of the system that
they are attempting to model.

There is a need to develop a strategy to measure dependence and fit hydrologic


models in a more general manner. In particular, a more general measure of
dependence that can detect and quantify nonlinear relationships is required. While
such a measure has obvious uses in any application where the dependence
between any two variables needs to be quantified, it has a unique importance in
hydrologic time series modelling applications such as those described in Sharma
et al. [1997]. These applications are designed to reproduce a broad class of

underlying probability density functions, and they therefore require the use of a
generalised measure of dependence for selecting the predictor variables that will
be used in the modelling. The use of a linear measure of dependence (such as the
correlation coefficient) for this task may result in the selection of predictors that
cannot adequately reproduce nonlinear dependence and non-Gaussian probability
density functions.

Several approaches have been used for finding the order of dependence for a time
series model. In a linear context, measures such as the Akaike information
criterion (AIC) [Brockwell and Davis 1996, p171] are used to choose the model
order for parametric models. However, traditional measures such as AIC are not
applicable to the problem of model order selection for nonparametric time series
models, such as the NP(p) model proposed by Sharma et al. [1997]. Estimation of
the model order using a general linear/nonlinear time series model such as the
NP(p) model can be accomplished by use of the partial mutual information
criterion. The partial mutual information criterion (PMI) [Sharma, 2000] is a
useful alternative for identifying a combination of predictors for a model (the
number of predictors being the model order) without making major assumptions
about the underlying model structure. Accurate estimation of the mutual
information criterion is an important part of the calculation of PMI.

The mutual information (MI) criterion [Sharma, 2000; Fraser and Swinney, 1986]
is a measure of dependence that can detect and quantify both linear and nonlinear
relationships. Mutual information is related to entropy, and has also been referred
to as transinformation (see Chapman [1986], Singh [1997]). Sharma [2000] shows
that MI performs better than correlation in detecting and quantifying a range of
nonlinear dependence structures, and that it also performs well in quantifying
linear dependence. The mutual information criterion can quantify a broader range
of underlying dependence structures than any other available method.

A nonparametric implementation of the mutual information is used in this study.


The nonparametric method used is kernel density estimation [Silverman, 1986].

Nonparametric methods avoid the issue of assuming a probability distribution and
use the entire historical sample to estimate the probability densities needed for
simulation. Nonparametric models are constructed with minimal assumptions
regarding the underlying dependence structure and the form of the probability
density function, and are therefore more generally applicable than traditional
parametric models. The implementation of the mutual information criterion
studied here has been used in Moon et al. [1995] and Sharma [2000], and is
sensitive to the choice of a set of smoothing parameters known as the kernel
bandwidths. For example, in a typical result from this study, a 13% decrease in
bandwidth from the best value obtained resulted in a 150% increase in the mean square
error of the estimated mutual information.

This current work presents the results of empirical trials and sensitivity studies.
These results suggest appropriate choices for smoothing parameter (bandwidth)
selection for calculation of the MI score using kernel density estimation methods.
These choices are based on the use of the Gaussian reference bandwidth
[Silverman, 1986 p86; Scott, 1992 p152], multiplied by a scaling factor.

Mutual information is described in the next section of this appendix, followed by


a discussion of kernel density estimation of the mutual information and a
discussion on the selection of the kernel bandwidths. The methodology for the
empirical trials to find practical bandwidth choices for use in calculating the MI
score is discussed next, followed by a presentation of the results. A practical rule
for selecting appropriate bandwidths, based on the results, is presented in the
conclusion to this appendix.

A.2 Background

A.2.1 The mutual information criterion as a measure of dependence


For two variables X and Y, the MI criterion is defined in bits² as:

MI = \int\!\!\int f_{X,Y}(x,y)\, \log_2\left[\frac{f_{X,Y}(x,y)}{f_X(x)\, f_Y(y)}\right] dx\, dy    (A.1)

where:

f_X(x) and f_Y(y) are the marginal probability density functions (PDF’s) of X
and Y respectively, and,

f_{X,Y}(x, y) is the joint (bivariate) PDF of X and Y.

If the two variables X and Y are not related then, by the definition of
independence, the joint PDF is equal to the product of the marginal PDF’s. The
ratio in equation (A.1) would equal one and the log of this would equal zero. Thus
the MI criterion for independent data is expected to be zero. The possible values
that the MI criterion can take range from zero (if no dependence exists between
the variables) to a number approaching positive infinity (if perfect dependence
exists between the variables).

Chapman [1986] showed that, for a bivariate normal distribution, mutual


information is directly related to the correlation coefficient (ρ) as shown in
equation A.2.

MI = -\tfrac{1}{2} \log_2\left(1 - \rho^2\right)    (A.2)

² The units of MI are defined by the base of the logarithm in equation (A.1). If base 2 is
used the units are called bits. This appendix uses base 2 (after Fraser and Swinney [1986]).

It can be seen that mutual information does not distinguish between a positive or
negatively sloped dependence relationship. However, in situations where the
underlying dependence structure is not linear (and therefore equation (A.2) is
invalid), mutual information is a more reliable indicator of the presence of
dependence.
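Equation (A.2) is straightforward to evaluate directly. The small sketch below (written for this appendix, not part of the original text) shows that the score is zero at ρ = 0, grows with |ρ|, and ignores the sign of the correlation:

```python
import math

def mi_bivariate_normal(rho):
    """Mutual information in bits between two jointly normal variables with
    correlation rho, using equation (A.2): MI = -0.5 * log2(1 - rho^2)."""
    return -0.5 * math.log2(1.0 - rho ** 2)
```

For example, ρ = 0.9 gives approximately 1.198 bits, and ρ = −0.9 gives the same value.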

A.2.2 Kernel density estimation of the mutual information criterion

For any given bivariate sample, the MI score in (A.1) can be estimated as:

\widehat{MI} = \frac{1}{n} \sum_{i=1}^{n} \log_2\left[\frac{\hat{f}_{X,Y}(x_i, y_i)}{\hat{f}_X(x_i)\, \hat{f}_Y(y_i)}\right]    (A.3)

where

(x_i, y_i) is the i'th bivariate sample data pair in a sample of size n, and,

f̂_X(x_i), f̂_Y(y_i), and f̂_{X,Y}(x_i, y_i) are the respective marginal and joint probability
densities estimated at the sample data points.

This appendix adopts nonparametric methods for producing estimates of the joint
and marginal densities in equation (A.3). A nonparametric method is defined as
one that can reproduce a broad class of underlying density functions [Scott, 1992
p44]. These methods seek to approximate the underlying density locally using
data from a small neighbourhood near the point of estimate [Lall, 1995], and
avoid making specific assumptions about the form of the underlying PDF.

Some versions of the MI function [Fraser and Swinney, 1986; Osaka et al.,
1997, 1998] use histograms to estimate the joint and marginal probability densities
in equation (A.3). An alternative to this was suggested by Darbellay [1999] and
Darbellay and Vajda [1999], who used an adaptive histogram to estimate the
probability density functions. However, histograms (which count the number of
data points falling into evenly spaced bins) are a crude measure of probability
density because the histogram is not smooth at the bin edges, and because the

location of the bin edges can dramatically affect the estimates. Kernel density
estimation [Silverman, 1986; Scott, 1992; Sharma et al., 1997] is a nonparametric
method that eliminates the bin edge problems that are associated with histograms.
The probability density is estimated by the summation of kernels (smooth
functions) centred at each observed data point. This produces a weighted moving
average of the empirical frequency distribution of the data. The implementation of
the MI criterion using kernel density estimation techniques was first proposed by
Moon et al. [1995].

The kernel density estimate of the underlying univariate PDF at coordinate
location x can be written as follows. A normal kernel³ [Silverman, 1986] is used.

\hat{f}_X(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\, \lambda_1 \hat{\sigma}} \exp\left[\frac{-(x - x_i)^2}{2 (\lambda_1 \hat{\sigma})^2}\right]    (A.4)

where

x_i is the i'th data point in X, for a sample of size n,

λ₁σ̂ is the bandwidth (a smoothing parameter) of the univariate kernel used in
estimating the PDF,

σ̂ is the estimated standard deviation of X,

λ₁ is the univariate bandwidth factor, which represents the bandwidth for a sample
having a unit variance.

³ The form of the kernel (be it normal, Epanechnikov, or the like) has little impact on the final
kernel density estimate, compared to the impact of the choice of smoothing parameterisation. A
normal kernel is chosen because the properties of this kernel are well understood, especially for
bivariate data.
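Equation (A.4) translates directly into code. The sketch below (illustrative, assuming a normal kernel as in the text) estimates the univariate density at a set of evaluation points:

```python
import numpy as np

def kde_univariate(x_eval, data, lam1):
    """Kernel density estimate of f_X (equation A.4) with a normal kernel
    and bandwidth lam1 * sigma_hat, where sigma_hat is the sample standard
    deviation of the data."""
    data = np.asarray(data, dtype=float)
    h = lam1 * data.std(ddof=1)                  # bandwidth = lambda_1 * sigma_hat
    u = (np.asarray(x_eval, dtype=float)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * np.sqrt(2.0 * np.pi) * h)
```

With λ₁ set to the Gaussian reference value 1.06 n^(−1/5) the resulting density estimate is smooth and integrates to one.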

152
The kernel density estimate of the underlying bivariate PDF at (x, y) is given in
equation (A.5). Again, a normal kernel is used.

\hat{f}_{X,Y}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2\pi \sqrt{\det(\lambda_2^2 S)}} \exp\left[-\frac{1}{2} \begin{pmatrix} x - x_i \\ y - y_i \end{pmatrix}^{T} (\lambda_2^2 S)^{-1} \begin{pmatrix} x - x_i \\ y - y_i \end{pmatrix}\right]    (A.5)
where

(x_i, y_i) is the i'th data pair in a sample of size n,

λ₂²S is the bivariate bandwidth, and

S is the sample covariance matrix of the variable set X, Y:

S = \begin{pmatrix} \hat{\sigma}_{xx} & \hat{\sigma}_{xy} \\ \hat{\sigma}_{xy} & \hat{\sigma}_{yy} \end{pmatrix}

This method of representing the bivariate bandwidth, referred to as sphering


[Fukunaga, 1972], is equivalent to using a bandwidth λ₂ on a sample transformed
such that the resulting covariance matrix is the identity (I).
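A sketch of equation (A.5) with the sphering bandwidth λ₂²S (illustrative code written for this summary, not from the thesis):

```python
import numpy as np

def kde_bivariate(points, data, lam2):
    """Kernel estimate of the joint density f_XY (equation A.5), using a
    normal kernel with bandwidth matrix lam2^2 * S, where S is the sample
    covariance matrix of the data (the "sphering" parameterisation)."""
    data = np.asarray(data, dtype=float)          # shape (n, 2)
    pts = np.asarray(points, dtype=float)         # shape (m, 2)
    H = lam2 ** 2 * np.cov(data, rowvar=False)    # bandwidth matrix lam2^2 * S
    Hinv = np.linalg.inv(H)
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(H))
    diff = pts[:, None, :] - data[None, :, :]     # (m, n, 2) differences
    quad = np.einsum("mni,ij,mnj->mn", diff, Hinv, diff)
    return np.exp(-0.5 * quad).mean(axis=1) / norm
```

Because the bandwidth matrix is proportional to the sample covariance, strongly correlated data are smoothed along the direction of the dependence rather than isotropically.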

The choice of the bandwidth (represented by λ₁σ̂ in equation (A.4) and by λ₂²S
in equation (A.5)) is the key to an accurate estimate of the probability density. A
large bandwidth results in an oversmoothed probability density, with subdued
modes and over-enhanced tails. A small bandwidth, on the other hand, can lead to
density estimates overly influenced by individual data points, with noticeable
bumps in the tails of the probability density. The way that the bandwidth is
represented is important. After investigating the performance of several
smoothing parameterisations in a kernel estimator of the bivariate probability

density, Wand and Jones [1993] state that smoothing strategies based on the
covariance matrix are inappropriate in general. However the distributions that
Wand and Jones base this recommendation on are seldom seen in hydrology. For
near-normal, skewed, and strongly autocorrelated data, which are common in
hydrology, sphering (i.e. representing the bandwidth by λ₂²S) is a simple, practical
and efficient way of specifying appropriate smoothing parameters. It is done as an
intermediate step to improve the density estimation before the mutual information
is calculated. Following Moon et al. [1995] and Sharma [2000], the sphering
choice for use in calculating mutual information is adopted here.

A relatively simple bandwidth choice, known as the Gaussian reference


bandwidth [Silverman, 1986 p86; Scott, 1992 p152], is estimated as:

\lambda_{ref} = \left(\frac{4}{d+2}\right)^{1/(d+4)} n^{-1/(d+4)}    (A.6)

where n and d refer to the sample size and dimension of the multivariate variable
set respectively.

For a univariate PDF (d=1) this reduces to

\lambda_{ref1} = 1.06\, n^{-1/5}    (A.7)

For a bivariate PDF (d=2) this reduces to

\lambda_{ref2} = 1.0\, n^{-1/6}    (A.8)
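Equation (A.6) and its univariate and bivariate special cases can be checked numerically with a small sketch (written for this summary):

```python
def gaussian_reference_bandwidth(n, d):
    """Gaussian reference bandwidth factor (equation A.6) for a sample of
    size n in d dimensions; for d = 1 this reduces to 1.06 n^(-1/5)
    (equation A.7) and for d = 2 to 1.0 n^(-1/6) (equation A.8)."""
    return (4.0 / (d + 2.0)) ** (1.0 / (d + 4.0)) * n ** (-1.0 / (d + 4.0))
```

For d = 1 the leading constant is (4/3)^(1/5) ≈ 1.059, which equation (A.7) rounds to 1.06; for d = 2 the constant is exactly 1.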

Use of the Gaussian reference bandwidth is simple and computationally efficient


when compared to other bandwidth selection rules for estimating a probability
density. The Gaussian reference bandwidth for kernel density estimation of a PDF
has been derived by minimising the mean square error assuming that the
underlying PDF is normal, and thus it is the optimal bandwidth for estimating a
PDF if the kernel estimator is normal and if the underlying (population) PDF is
normal. It is a reasonably robust choice even if the underlying PDF is only
approximately normal, for example if it is slightly skewed or slightly multimodal.
For instance,
Sharma et al. [1998] showed that the Gaussian reference bandwidth produced
integrated squared errors (ISE’s) that were comparable to the ISE’s for the
Maximum Likelihood Cross Validation or Least Square Cross Validation
methods, when these methods were applied to a weakly bimodal bivariate
probability density. Equation (A.6) is therefore appropriate for choosing the
bandwidth factor for estimating the PDF of a broad range of datasets. However,
calculation of the MI score involves a ratio of probability densities. When the
sample estimate of the MI score is calculated using equation (A.3), two
bandwidths are used in the calculations: one for calculation of the marginal
(univariate) PDF’s (equation (A.4))⁴, and one for the calculation of the joint
(bivariate) PDF (equation (A.5)). The Gaussian reference bandwidth cannot be
expected to be optimal or near-optimal when calculating the MI score because of
the function of PDF’s involved. Previous papers that used λref1 and λref2 when
calculating the MI score [Moon et al., 1995; Sharma, 2000] did not consider this
issue.

A bandwidth selection rule for calculation of the MI score is required. The
approach that is taken in this appendix is as follows. For simplicity, it is assumed
that the bandwidth choice will be a multiple of the Gaussian reference bandwidth.
This reduces the bandwidth selection problem to selection of a scaling factor for
the bandwidth, which is called α. The bandwidth factors used to calculate the MI
score (i.e. λ1 in equation (A.4) and λ2 in equation (A.5)) are calculated as follows:

λ1 = α λref1     (A.9)

λ2 = α λref2     (A.10)

⁴ The same bandwidth choice is chosen for both fX(x) and fY(y). This simple approach is
warranted if the distributions of x and y are similar. For calculation of the lag-one auto-MI score
(as discussed in this appendix) this assumption is valid because Y is simply a lagged version of X.

where α was selected in a range varying from 0.5 to 2.5, and λref1 and λref2 are as
specified in equations (A.7) and (A.8).

When α is well chosen, the calculation of the ratio in equation (A.3) will not
be overly affected by oversmoothing or undersmoothing in any of the estimated
PDF’s, and the estimated value of the MI score obtained from the data sample will
be close to the “true” (population) value of the MI score. However, if α is poorly
chosen, then the estimated value of the MI score may be very different from the
true value. A poor choice of α may result in a poor estimate of the MI score, even
if the sample size being used for the estimate is large.
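To illustrate how the pieces fit together, the following sketch estimates the lag-one auto-MI score with Gaussian kernels. It is a minimal re-implementation in the spirit of the estimator described above, assuming numpy is available, the common sample form M̂I = (1/n) Σ ln[f̂XY/(f̂X f̂Y)] for equation (A.3), a bandwidth λ1σ̂ for the marginals (equation (A.4)), and sphering with bandwidth λ2 for the joint density (equation (A.5)); the function names are ours:

```python
import numpy as np

def ref_bandwidth(n, d):
    # Gaussian reference bandwidth, equation (A.6)
    return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4))

def lag_one_auto_mi(x, alpha=1.5):
    """Sample lag-one auto-MI score, using Gaussian kernel density
    estimates with lambda1 = alpha*lambda_ref1 for the marginals and
    lambda2 = alpha*lambda_ref2 for the sphered joint density."""
    xt, yt = np.asarray(x[1:], float), np.asarray(x[:-1], float)
    n = len(xt)
    lam1 = alpha * ref_bandwidth(n, 1)  # equation (A.9)
    lam2 = alpha * ref_bandwidth(n, 2)  # equation (A.10)

    def marginal(u):
        # univariate kernel density estimate at the data points, bandwidth lam1 * sigma-hat
        h = lam1 * u.std(ddof=1)
        diff = (u[:, None] - u[None, :]) / h
        return np.exp(-0.5 * diff ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

    # Joint density: sphere the data so the kernel covariance is lam2**2 * S
    z = np.column_stack([xt, yt])
    zc = z - z.mean(axis=0)
    S = np.cov(zc, rowvar=False)
    w = zc @ np.linalg.cholesky(np.linalg.inv(S))  # sphered data, identity covariance
    d2 = ((w[:, None, :] - w[None, :, :]) ** 2).sum(axis=2) / lam2 ** 2
    # Jacobian of the sphering transform contributes the 1/sqrt(det(S)) factor
    fxy = np.exp(-0.5 * d2).sum(axis=1) / (
        n * 2 * np.pi * lam2 ** 2 * np.sqrt(np.linalg.det(S)))

    return float(np.mean(np.log(fxy / (marginal(xt) * marginal(yt)))))
```

For a Gaussian process the estimate can be compared against the exact population value MI = −0.5 ln(1 − ρ²) given by equation (A.2).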

This appendix aims to empirically derive appropriate values for α for a range of
underlying dependence structures for near-normal data. A known lag-one
dependence structure of an autocorrelated time series is analysed using mutual
information (note that this calculates the lag-one auto-MI score, which is the
equivalent of the lag-one autocorrelation). Empirical trials are used to select
appropriate values for α for calculating the lag-one auto-MI score. This is done by
calculating a mean square error based on 500 trials, where the differences between
the small-sample estimate of the MI score and the “true” (population) MI score
are recorded. For a given sample size, the α-value with the smallest mean square
error is selected.

A.3 Empirical trials to find a practical choice of α


As stated in the previous section, a good choice of the scaling factor for the
bandwidth (α) will result in a sample estimate of the lag-one auto-MI score that is
close to the “true” (population) value of the MI score. The aim is to derive
practical choices for α for a range of underlying dependence structures and
sample sizes.

The underlying lag-one dependence structures that were considered in this study
were provided by autoregressive (AR(1)) models and autoregressive moving
average (ARMA(1,1)) models. These models were chosen because equation (A.2)
can then be used to provide an exact expression for the “true” lag-one auto-MI
score (provided the random deviate is normal). Additionally, an AR(1) model
reproduces mainly short-term dependence while an ARMA(1,1) model can
reproduce some longer-term dependence. The ARMA(1,1) models were included
in the study to see if longer-term dependence has any influence on the choice of α
for calculation of the MI score.

The AR(1) models used were of the form

Xt = φXt−1 + √(1 − φ²) et     (A.11)

where φ = the autoregressive parameter
        = the lag-one autocorrelation
      et = the random deviate (either normal, with mean = 0, variance = 1;
           or skewed, from a gamma distribution with mean = 0, variance = 1,
           skew = 0.5)
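To make the trials concrete, equation (A.11) and the two deviate choices might be simulated as follows (an illustrative sketch assuming numpy; the gamma parameters shape = 16 and scale = 0.25 are one way of obtaining the required standardised skewed deviate, since a gamma variate has skewness 2/√shape, variance shape·scale², and mean shape·scale = 4, which is subtracted off):

```python
import numpy as np

def ar1_series(n, phi, skewed=False, seed=None):
    """Generate the zero-mean, unit-variance AR(1) process of equation (A.11)."""
    rng = np.random.default_rng(seed)
    if skewed:
        # gamma deviate rescaled to mean 0, variance 1, skew 0.5
        e = rng.gamma(shape=16.0, scale=0.25, size=n) - 4.0
    else:
        e = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + np.sqrt(1.0 - phi ** 2) * e[t]
    return x
```

The √(1 − φ²) factor scales the innovations so that the generated series has unit variance regardless of φ.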

The ARMA(1,1) models were of the form

Xt = φXt−1 + √((1 − φ²) / (1 + θ² − 2φθ)) (et − θet−1)     (A.12)

where φ = the autoregressive parameter
      θ = the moving average parameter
      et = the random deviate (either normal, with mean = 0, variance = 1;
           or skewed, from a gamma distribution with mean = 0, variance = 1,
           skew = 0.5)

The combination of random deviate, model type, and model parameters led to 24
models being tested in this study, as shown in Table A.1 and Table A.2. Table A.2
also shows the lag-one autocorrelations (ρ) for the ARMA(1,1) models, which
helps to compare these models with the AR(1) models. The models tested all
produced a zero-mean, unit standard deviation time series.
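The lag-one autocorrelations listed in Table A.2 follow from the standard ARMA(1,1) result ρ1 = (1 − φθ)(φ − θ) / (1 + θ² − 2φθ) [e.g. Brockwell and Davis, 1996], which can be checked directly (illustrative code, names ours):

```python
def arma11_lag1_autocorr(phi, theta):
    """Lag-one autocorrelation of the ARMA(1,1) process in equation (A.12)."""
    return (1.0 - phi * theta) * (phi - theta) / (1.0 + theta ** 2 - 2.0 * phi * theta)

# Model 16 (phi = 0.7, theta = 0.3) should give about 0.47, as in Table A.2:
print(round(arma11_lag1_autocorr(0.7, 0.3), 4))  # → 0.4716
```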

Table A.1 The Autoregressive (AR(1)) models.

Model no.   φ      Random deviate
1           0.0    normal
2           0.17   normal
3           0.3    normal
4           0.5    normal
5           0.7    normal
6           0.9    normal
7           0.0    skewed
8           0.17   skewed
9           0.3    skewed
10          0.5    skewed
11          0.7    skewed
12          0.9    skewed

Table A.2 The Autoregressive-Moving Average (ARMA(1,1)) models.

Model no.   φ      θ      Random deviate   ρ
13          0.3    0.1    normal           0.20
14          0.5    0.2    normal           0.32
15          0.5    0.3    normal           0.22
16          0.7    0.3    normal           0.47
17          0.7    0.5    normal           0.24
18          0.9    0.3    normal           0.80
19          0.9    0.6    normal           0.49
20          0.3    0.1    skewed           0.20
21          0.5    0.2    skewed           0.32
22          0.5    0.3    skewed           0.22
23          0.7    0.3    skewed           0.47
24          0.7    0.5    skewed           0.24

For each of the 24 underlying models, the “true” (population) MI score was
calculated. For the models which used a normally distributed random deviate (i.e.
models 1-6 and 13-19), the “true” (population) lag-one auto-MI score was
calculated exactly from the lag-one autocorrelation using equation (A.2). However
it was necessary to estimate the “true” lag-one auto-MI scores for the models
which used a gamma distributed random deviate. The estimates were calculated
using an averaged shifted histogram (ASH) probability density estimator [Scott,
1992], for a sample size of 10⁷. This work was done using the statistical software
package S-Plus [MathSoft, 1997], and the ASH library add-in [Scott, 1993]. ASH
requires a specification of the grid (number of bins) to be used in the calculation
of the histograms (the number of bins used directly affects the bin size, which is
analogous to bandwidth). The shift parameter used in the algorithm was set to the
default value. A 400*400 grid was used.

An example of the lag-one dependence structure for an ARMA(1,1) model with
φ = 0.7 and θ = 0.3, a skewed random deviate (coefficient of skewness = 0.5) and a
sample size of 200 is shown graphically in Figure A.1. For this example (which is
model no. 23; see Table A.2) the “true” (population) lag-one autocorrelation is
0.4716 and the estimated “true” lag-one auto-MI score is 0.194, while the sample
estimates of these parameters were calculated to be 0.474 and 0.209 respectively.
Figure A.1 also shows contours of the estimated bivariate PDF for this sample,
while Figure A.2 shows the estimated univariate PDF for this sample, as
calculated using the kernel density estimator in equation (A.4). A time series plot
for this example is shown in Figure A.3.

[Scatter plot of Xt (horizontal axis) against Xt−1 (vertical axis) for the example
sample, with contours of the estimated bivariate PDF superimposed.]

Figure A.1 Example lag-one dependence structure. Contours of the estimated
bivariate PDF are superimposed on the plot.
[Plot of the estimated univariate PDF of Xt.]

Figure A.2 Estimated univariate PDF for the example in Figure A.1.

[Time series plot of Xt for the example sample (t = 0 to 200).]

Figure A.3 Time series plot for the example in Figure A.1.

For each combination of underlying model and sample size, the procedure
adopted for finding the best value of α was:
1. Generate 500 samples from the underlying model.
2. Calculate the sample lag-one auto-MI scores for a range of possible α for each
of the 500 samples.
3. Calculate the mean square error (MSE) for each possible α.
4. The lowest MSE identifies the best choice of α for that combination of
underlying model and sample size.

The method for calculating the MSE is shown in equation (A.13):

MSE = (1/n) Σi=1..n (M̂Ii − MItrue)²     (A.13)

where n = the number of trials (= 500),
      M̂Ii = the small-sample estimate of the MI score for the i'th trial, and
      MItrue = the “true” MI score.

α was selected in a range varying from 0.5 to 2.5. The sample sizes used were 30,
50, 200, 800, and 1500.
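The loop structure of this selection procedure is simple enough to sketch generically. In the fragment below, `estimate` stands in for any α-dependent MI estimator; the toy estimator used in the demonstration is deliberately artificial (its bias grows away from α = 1.5) and serves only to exercise the machinery, and all names are ours:

```python
import random

def select_alpha(estimate, true_mi, alphas, trials=500, seed=0):
    """Pick the alpha with the smallest mean square error (equation (A.13))
    over repeated trials. `estimate(alpha, rng)` returns one sample
    estimate of the MI score."""
    rng = random.Random(seed)
    mse = {}
    for a in alphas:
        errs = [(estimate(a, rng) - true_mi) ** 2 for _ in range(trials)]
        mse[a] = sum(errs) / trials
    best = min(mse, key=mse.get)
    return best, mse

# Toy estimator: unbiased at alpha = 1.5, increasingly biased elsewhere.
true_mi = 0.2
toy = lambda a, rng: true_mi + 0.3 * (a - 1.5) + rng.gauss(0.0, 0.05)
best, mse = select_alpha(toy, true_mi, alphas=[0.5, 1.0, 1.5, 2.0, 2.5])
print(best)  # → 1.5
```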

A.4 Results

A.4.1 Results of small-sample trials


Lag-one auto-MI scores from the 500 trials for each combination of model (1 to
24), sample size, and α were calculated. These results were used to calculate
MSE’s based on the deviations between the trial values of the MI score and the
“true” MI score. As expected, the calculated MSE values reduced as sample size
increased. However, this aspect of the results is not of interest here. The task is to
identify the value of α that gives the smallest MSE value for a given sample size.
This is the best choice of α for this combination of model and sample size. Values
that are of interest are the smallest MSE obtained for a given combination, the
corresponding α, and the percentage changes in MSE for the other α values
considered for that combination of underlying model and sample size.

A.4.2 MSE values and selection of the best α choices


The smallest MSE’s obtained for given combinations of underlying model and
sample size were calculated, but are not shown here for space reasons. The
corresponding α values, which are the best choice of α for each combination of
model and sample size, are shown in Table A.3 and Table A.4. It can be seen from
the tables that the selected α values are quite stable over the range of sample sizes
and underlying models that were tested, especially for lag-one autocorrelations of
0.7 or less (i.e. all models except 6, 12 and 18). For these cases, if the sample size
is 200 or greater then α = 1.5 is generally selected. For sample sizes of 30 or 50, α
is more variable, but the value obtained for any given case is approximately
(1.8 − ρ) where ρ is the underlying lag-one autocorrelation for that model. This is
shown in Figure A.4 (excluding results from models 6, 12 and 18). A practical
method is therefore proposed for choosing α when the lag-one autocorrelation is
0.7 or less: if the sample size is 200 or greater choose α = 1.5, otherwise choose
α = (1.8 − r), where r is the sample estimate of the lag-one autocorrelation.
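As a sketch, the proposed rule can be written directly (a direct transcription of the rule above; the function name is ours):

```python
def choose_alpha(n, r):
    """Practical scaling factor for the bandwidth, for lag-one
    autocorrelation r <= 0.7: alpha = 1.5 for n >= 200, else 1.8 - r."""
    if r > 0.7:
        raise ValueError("rule not applicable for lag-one autocorrelation > 0.7")
    return 1.5 if n >= 200 else 1.8 - r

print(choose_alpha(800, 0.3))            # → 1.5
print(round(choose_alpha(50, 0.2), 2))   # → 1.6
```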

Table A.3 Best α values obtained – AR(1) models.


Model Sample size
no. 30 50 200 800 1500
1 1.9 1.7 1.6 1.5 1.5
2 1.7 1.6 1.5 1.5 1.5
3 1.6 1.6 1.5 1.5 1.5
4 1.4 1.5 1.5 1.5 1.5
5 1.1 1.1 1.3 1.4 1.4
6 0.6 0.6 0.8 1.1 1.1
7 1.9 1.7 1.5 1.5 1.4
8 1.7 1.6 1.5 1.5 1.4
9 1.6 1.5 1.5 1.5 1.4
10 1.4 1.4 1.4 1.4 1.4
11 1.1 1.1 1.3 1.3 1.5
12 0.6 0.6 0.8 1.2 1.5

Table A.4 Best α values obtained – ARMA(1,1) models.
Model Sample size
no. 30 50 200 800 1500
13 1.7 1.7 1.5 1.5 1.5
14 1.6 1.6 1.5 1.5 1.5
15 1.7 1.7 1.6 1.5 1.5
16 1.4 1.4 1.4 1.5 1.5
17 1.7 1.6 1.5 1.5 1.5
18 0.7 0.8 1.0 1.2 1.3
19 1.1 1.2 1.3 1.4 1.5
20 1.7 1.6 1.5 1.4 1.4
21 1.6 1.6 1.5 1.4 1.4
22 1.7 1.6 1.5 1.5 1.4
23 1.3 1.4 1.4 1.4 1.4
24 1.7 1.6 1.5 1.5 1.4
[Scatter of best α choice against lag-one autocorrelation ρ for n = 30 and n = 50,
with α falling from about 1.8 at ρ = 0 towards 1.1 at ρ = 0.7.]

Figure A.4 Best α choice vs. ρ for small sample sizes (n=30, n=50)

The longer-term dependence present in the ARMA(1,1) models had little
influence on the choice of α. The α selected for the ARMA(1,1) models were
similar to those selected for the AR(1) models with similar lag-one
autocorrelations. The only exception to this was model 19, which had the highest
moving-average component (θ=0.6) and thus had strong long memory. Even so,
this effect was only for small sample sizes (30 and 50) and α = (1.8 − ρ) is still a
reasonable selection rule for these cases.

Figure A.5 plots the percentage changes in MSE against α for one of the models
that was considered in this study. The percentage changes in MSE are plotted
relative to the MSE for the best α choice. This figure contains more information
on the choice of α than is shown in Table A.3 and Table A.4. For example, the
figure shows that for a sample size of 200, a 13% decrease in α from the best
obtained resulted in a 150% increase in the MSE of the estimated MI.

[Percentage increase in MSE relative to the smallest MSE, plotted against α for
sample sizes n = 30, 50, 200, 800, and 1500.]

Figure A.5 Selection of α for Model no. 2.

Table A.5 and Table A.6 show the efficiency loss (in terms of percentage increase
in MSE, compared to the smallest MSE obtained for that combination of model
and sample size) for adopting α = 1.5 (for n ≥ 200) or α = (1.8 − ρ) (for n < 200).
It can be seen that the method for choosing α works well, with little efficiency
loss, for all the cases considered. (The highest efficiency loss is 21%. In most
cases the efficiency loss is below 5%.) Note that the method does not give a value
for α if ρ > 0.7, so no results are recorded for models 6, 12, and 18.

Table A.5 Efficiency loss (%) from selecting α by the suggested method –
AR(1) models.
Model Sample size
no. 30 50 200 800 1500
1 3 9 21 0 0
2 10 0 0 0 0
3 7 2 0 0 0
4 2 7 0 0 0
5 0 0 4 2 1
6 - - - - -
7 2 15 0 0 16
8 10 0 0 0 14
9 3 0 0 0 7
10 1 2 4 4 4
11 0 0 7 6 0
12 - - - - -

Table A.6 Efficiency loss (%) from selecting α by the suggested method –
ARMA(1,1) models.
Model Sample size
no. 30 50 200 800 1500
13 10 1 0 0 0
14 5 3 0 0 0
15 5 0 1 0 0
16 4 3 1 0 0
17 4 0 0 0 0
18 - - - - -
19 14 3 5 2 0
20 4 0 0 10 2
21 2 0 0 5 1
22 0 0 0 0 12
23 0 4 4 2 3
24 4 0 0 0 5

The mean square error consists of a bias component and a variance component.
The bias and variance were analysed for the results from model 2. There was a
small negative bias in the sample estimates of the lag-one auto-MI scores. For all
of the results from model 2, the variance represented more than 98% of the MSE.
It is interesting to compare these results with those presented in Darbellay and
Vajda [1999]. Using an adaptive histogram to estimate the probability density
functions, Darbellay and Vajda [1999] calculate MI scores for samples from a
normal distribution, for sample size 250 and larger. The results reported in
Darbellay and Vajda [1999] show a significantly higher bias which decreases
with increasing sample size. This is to be expected given that a kernel density
estimate is a superior measure of the probability density, as compared to
histogram based methods.

A.5 Conclusions
As a practical method for smoothing parameter selection for use in calculating the
mutual information score, this study found that α = 1.5 (applied to equations (A.9)
and (A.10)) is a good scaling factor for the bandwidth for sample sizes of 200 or
more. For sample sizes less than 200, a value of α = (1.8 − r) should be used, where
r is the sample estimate of the lag-one autocorrelation. This result applies for a
near-normal parent PDF and a range of underlying autoregressive dependence
structures. This scaling factor selection rule is very stable, with efficiency loss of
less than 5% for most of the cases considered in this study. However, the rule
does have limitations. The selection rule was obtained for a normal or near-normal
parent distribution, and may not be appropriate if this assumption is badly
violated. The rule that was obtained is also not appropriate for lag-one
autocorrelations greater than about 0.7.

The longer-term dependence present in the ARMA(1,1) models had little
influence on the choice of α. The α selected for the ARMA(1,1) models were
similar to those selected for the AR(1) models with similar lag-one
autocorrelations. The only exception to this was model 19, which had the highest
moving-average component (θ=0.6) and thus had strong long memory. Even so,
this effect was only for small sample sizes (30 and 50) and α = (1.8 − r) is still a
reasonable selection rule for these small sample sizes.

When using the approach proposed here, data that are strongly skewed should be
transformed to near-normal before the dependence structure is analysed. An
alternative (but less established) approach would be to use adaptive kernel
methods, where the bandwidth is larger in the tails than in the peak of the PDF.

The assumption that the selected bandwidth would be a multiple of the Gaussian
reference bandwidth provided a reasonably consistent performance across a wide
variety of dependence structures that are commonly encountered in hydrologic
practice. The results presented here are specific to time series analysis, where the
data are collected at regular time intervals. Results may vary when the two
variables are time-independent.

The bandwidth selection strategy presented here is a major improvement on the
strategies used previously. To the author’s knowledge, this is the first study of
appropriate kernel bandwidth choice for use in calculating the mutual information
criterion. The proposed bandwidth selection rules provide a sound basis for using
mutual information to quantify a broad range of underlying dependence
structures, especially the type of dependence structures that are typically found in
hydrologic data. While the empirical results presented here are limited to linear
dependence structures for the sake of simplicity, it must be emphasised that the
strong point of mutual information is that it measures both linear and nonlinear
dependence. Sensitivity analysis with nonlinear models is beyond the scope of this
study.

A.6 References
Brockwell, P.J., and R.A. Davis, Introduction to time series and forecasting,
Springer-Verlag, New York, 1996.
Chapman, T.G., Entropy as a measure of hydrologic data uncertainty and model
performance, Journal of Hydrology, 85, 111-126, 1986.
Darbellay, G.A., An estimator of the mutual information based on a criterion for
independence, Computational Statistics and Data Analysis, 32, 1-17, 1999.
Darbellay, G.A., and I. Vajda, Estimation of the information by an adaptive
partitioning of the observation space, IEEE Transactions on Information
Theory, 45(4), 1315-1321, 1999.
Fraser, A.M., and H.L. Swinney, Independent coordinates for strange attractors
from mutual information, Physical Review A, 33(2) 1134-1140, 1986.
Fukunaga, K., Introduction to statistical pattern recognition, Academic Press,
New York, 1972.
Hirsch, R.M., D.R. Helsel, T.A. Cohn, and E.J. Gilroy, Statistical analysis of
hydrologic data, in Handbook of Hydrology, edited by D.R. Maidment, 17.1-
17.55, McGraw-Hill, New York, 1993.
Lall, U., Recent advances in nonparametric function estimation: hydraulic
applications, US National Report, International Union of Geophysics, 1991-
1994, Review of Geophysics, 33, 1093, 1995.
MathSoft, S-Plus 4.5 Statistical Software Package, Data Analysis Products
Division, Math Soft Incorporated, Seattle, Washington, 1997.
Moon, Y.I., B. Rajagopalan, and U. Lall, Estimation of mutual information using
kernel density estimators, Physical Review E, 52(3), 2318-2321, 1995.
Osaka, M., H. Saitoh, T. Yokoshima, H. Kishida, H. Hayakawa, and R.J. Cohen,
Nonlinear pattern analysis of ventricular premature beats by mutual
information, Methods of Information in Medicine, 36, 257-260, 1997.
Osaka, M., T. Yambe, H. Saitoh, M. Yoshizawa, T. Itoh, S. Nitta, H. Kishida, and
H. Hayakawa, Mutual information discloses relationship between
hemodynamic variables in artificial heart-implanted dogs, American Journal
of Physiology, 275, H1419-H1433, 1998.

Scott, D.W., Multivariate density estimation - theory, practice and visualisation,
John Wiley, New York, 1992.
Scott, D.W., ASH (Averaged Shifted Histogram) library add-in for the S-Plus
Statistical Software Package, http://lib.stat.cmu.edu/s/ash, 1993.
Sharma, A., Seasonal to interannual rainfall ensemble forecasts for improved
water supply management: 1. A strategy for system predictor identification,
Journal of Hydrology, 239, 232-239, 2000.
Sharma, A., U. Lall, and D.G. Tarboton, Kernel bandwidth selection for a first
order nonparametric streamflow simulation model, Stochastic Hydrology and
Hydraulics, 12, 33-52, 1998.
Sharma, A., D.G. Tarboton, and U. Lall, Streamflow Simulation: A
Nonparametric Approach, Water Resources Research, 33 (2), 291-308, 1997.
Silverman, B.W., Density estimation for statistics and data analysis, Chapman
and Hall, New York, 1986.
Singh, V.P., The use of entropy in hydrology and water resources, Hydrological
Processes, 11, 587-626, 1997.
Wand, M.P., and Jones, M.C., Comparison of smoothing parameterisations in
bivariate kernel density estimation, Journal of the American Statistical
Association, 88(422), 520-528, 1993.

Appendix B. Supplement to Chapter 4: A “panel of
plots” approach for assessment of the quality of
generated sequences

In chapter 4, it was proposed that predictor selection should be based on the
quality of generated sequences produced by a model that incorporates the
predictors, and statistics that can be used for this assessment were presented. In
this brief appendix, the statistics that were introduced one-at-a-time in chapter 4
are presented in a “panel of plots” approach for assessment of the quality of
generated sequences. To illustrate this approach, a panel of plots for the ROG(1)
model for Sydney rainfall occurrence is presented in Figure B.1, and a panel of
plots for the ROG(4) model for Sydney rainfall occurrence is presented in Figure
B.2. Note that the performance of these models has already been discussed in
chapter 4; the point of this appendix is to present a useful supplementary tool for
presentation of results, that was developed at the time that the research for chapter
4 was undertaken.

The panel of plots is an intermediate stage between the one-at-a-time presentation
(Figures 4.1 to 4.7) and the summary tables (Tables 4.4 to 4.6) of chapter 4. The
one-at-a-time plots give detailed information about one aspect of a model, while
the summary tables give concise information about a number of models at the
same time. In contrast, the panel of plots gives detailed information about many
aspects of the performance of a particular model, all on the one page. The level of
information on the page may be hard to comprehend at first glance, but to an
experienced modeller it has the potential to be a useful tool for assessing a model,
since it places all the relevant and detailed information in the one place.

[Panel of twelve plots: mean dry spell length; mean wet spell length; daily mean
(wet fraction vs julian day); seasonal mean; annual SD; wet days/year; sd of dry
spell length; sd of wet spell length; daily r (lag-1 autocorrelation vs julian day);
seasonal SD; ACF of (wet days/year); SD vs timescale (years).]

Figure B.1 ROG(1): Panel of plots for Sydney rainfall occurrence.

[Panel of twelve plots: mean dry spell length; mean wet spell length; daily mean
(wet fraction vs julian day); seasonal mean; annual SD; wet days/year; sd of dry
spell length; sd of wet spell length; daily r (lag-1 autocorrelation vs julian day);
seasonal SD; ACF of (wet days/year); SD vs timescale (years).]

Figure B.2 ROG(4): Panel of plots for Sydney rainfall occurrence.

Appendix C. Supplement to Chapter 5: Stochastic
generation of daily rainfall amounts

C.1 Bandwidth selection for kernel density estimation


As noted in chapter 5, the choice of the bandwidth is the key to an accurate kernel
density estimate of the probability density. Some possible bandwidth selection
rules are now discussed.

C.1.1 Univariate case

A useful choice of univariate bandwidth h is one that minimises the mean
integrated square error (MISE):

MISE(f̂) = E ∫ {f(x) − f̂X(x; h)}² dx     (C.1)

where f(x) is the underlying (population) probability density and
f̂X(x; h) is the kernel density estimate of f(x).

If f(x) is normally distributed, the choice of bandwidth that minimises MISE is
given by the Gaussian reference bandwidth [Silverman, 1986 p45]:

href = 1.06 σ̂ n^(−1/5)     (C.2)

where σ̂ is the sample standard deviation of the data.

The magnitude of the MISE optimal bandwidth choice is reduced for a skewed
parent distribution. In this case, a robust measure of spread should be used.
Equation (C.2) written in terms of the interquartile range R for a normally
distributed parent distribution becomes:

href = 0.79 R n^(−1/5)     (C.3)

Scott [1979] suggests that a correction factor be applied to adjust for skewness in
the sample. Silverman [1986, p47] presents a graph that shows appropriate
skewness correction factors that can be applied to equation (C.3), for a lognormal
parent distribution of given skewness. This graph is reproduced in Figure C.1.
Applying this correction factor, we obtain:

hrot = 0.79 s R n^(−1/5)     (C.4)

where s is the skewness correction factor obtained from Figure C.1, for a
given sample value of skewness.

Equation (C.4) is referred to as the “rule-of-thumb” (ROT) method for bandwidth
selection (after Silverman [1986]).
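Equations (C.2)-(C.4) can be sketched in a few lines of Python (illustrative only, assuming numpy; the skewness correction factor s must be read from Figure C.1 and is passed in as an argument here):

```python
import numpy as np

def rot_bandwidth(x, s=1.0):
    """Rule-of-thumb bandwidth of equation (C.4):
    h = 0.79 * s * R * n**(-1/5), with R the interquartile range of the
    sample and s the skewness correction factor read from Figure C.1."""
    x = np.asarray(x, float)
    n = len(x)
    q75, q25 = np.percentile(x, [75, 25])
    return 0.79 * s * (q75 - q25) * n ** (-0.2)
```

For near-normal data with s = 1 this approximates the Gaussian reference bandwidth (C.2), since R ≈ 1.35σ for a normal distribution and 0.79 × 1.35 ≈ 1.06.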

[Curve of the skewness correction factor s (vertical axis, 0 to 1.0), decreasing from
1.0 at skewness 0, plotted against skewness (horizontal axis, 0 to 5).]
Figure C.1 Skewness correction factor (s) for rule-of-thumb bandwidth selection,
assuming a log-normal parent distribution. (after Silverman [1986, p47])

A more general univariate bandwidth selection rule is the iterative “solve the
equation” (SJ) method of Sheather and Jones [1991]. The SJ method involves
asymptotic expansion of the MISE and solving for the optimal bandwidth:

hopt = [ ∫K(x)² dx / ( n ∫f″(x)² dx { ∫x²K(x) dx }² ) ]^(1/5)     (C.5)

where K(x) is a kernel function (such as the kernel given in (5.2));
f″(x) is the second derivative of the underlying probability density;
and the bandwidth used in estimating ∫f″(x)² is taken to be proportional to h^(5/7).

C.1.2 Conditional case

Rules for calculating appropriate bandwidths for conditional densities (such as
f̂(xt | xt−1), shown in Figure 5.2) are not well developed. Here, the rule-of-thumb
method given in (C.4) is applied, except the values of R and n that are used are
chosen conditionally. This is consistent with the principle given in section 5.4 of
this thesis, which states that simulation should proceed with more emphasis given
to the observed data points lying closer to the conditioning plane, and less
emphasis given to the data points that lie further away. In this proposed rule, the
interquartile range R can vary with xt−1, and the effective number of data points
neff depends on the weights associated with the kernel slices that constitute the
conditional probability density (i.e. the wi values in equation (5.5)). For a dataset
with positive skew, this rule leads to increasing values of the bandwidth h1 as xt−1
increases.

For simplicity, three broad classes for xt−1 are considered: low, medium, and high.
R values are calculated for each of these data ranges, and h1 is calculated
separately for each range:

hlow = 0.79 s Rlow neff^(−1/5)     (C.6a)
hmed = 0.79 s Rmed neff^(−1/5)     (C.6b)
hhigh = 0.79 s Rhigh neff^(−1/5)     (C.6c)

where neff = 1/max(wi), and Figure C.1 is used to give a correction factor
for skewness.
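The class-wise calculation in equations (C.6a)-(C.6c) can be sketched as follows (illustrative only, assuming numpy; `weights` stands for the wi of equation (5.5), and s is again the correction factor read from Figure C.1):

```python
import numpy as np

def conditional_bandwidth(class_values, weights, s):
    """Bandwidth for one conditioning class (equations (C.6a)-(C.6c)):
    h = 0.79 * s * R_class * n_eff**(-1/5), where R_class is the
    interquartile range of the amounts falling in the class and
    n_eff = 1 / max(w_i) is the effective number of data points."""
    q75, q25 = np.percentile(np.asarray(class_values, float), [75, 25])
    n_eff = 1.0 / max(weights)
    return 0.79 * s * (q75 - q25) * n_eff ** (-0.2)
```

With equal weights wi = 1/n, neff reduces to n and (C.6) collapses back to the unconditional rule (C.4).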

C.1.3 Testing and implementation of the bandwidth selection rules

The bandwidth selection rule used in chapter 5 (equation (5.3)) was chosen after
the rules presented above were evaluated. The criterion for evaluation was
examination of the statistics shown in Figure 5.3; a poor bandwidth selection rule

will result in a poor match between the median simulated values and the historical
values shown in this Figure.

Based on analysis of the skewness of the data used in this study, it was found
appropriate to adopt an average skewness of about 4, for both univariate and
conditional bandwidth calculations. The average skewness was the same for all
rainfall amount classes. From Figure C.1, s is chosen as approximately 0.38. This
gives:

0.79s = 0.3 (C.7)

Application of this factor to (C.4) gives:

h = 0.3 R n^(−1/5)     (C.8)

which is identical to (5.3) and is the rule adopted for use in this thesis.

The following bandwidth selection rules were also tested: h = 0.4 R n^(−1/5); the SJ
bandwidth selection method; and the Gaussian reference bandwidth. Each of the
tested bandwidth selection methods performed adequately, except for the
Gaussian reference bandwidth, which is not appropriate for skewed data. Equation
(5.3) gave equal or better performance than any of the tested methods, and was
chosen because it selected bandwidths that were larger than those selected using
SJ, but smaller than those selected using h = 0.4 R n^(−1/5). Note that the good relative
performance of (5.3) compared to the SJ method occurs because (5.3) is
specifically formulated for the form of probability density encountered in this
application, while the SJ method is recognised as a method which is close to
optimal [Venables and Ripley, 1997 p 184], but much more general.

For the conditional simulation of amounts (xt) for the class 1b and class 2 rainfall,
xt−1 was allocated to one of three broad rainfall ranges: low rainfall was defined
as 0.3 to 4 mm, medium rainfall was defined as 4.1-15 mm, and high rainfall was

defined as anything over 15 mm. R values were calculated for each of these data
ranges. h1 was calculated separately for each range using equations (C.6) and
(C.7). This specification of h1 gave the results presented in this thesis. Results
obtained using a single data range for the rainfall, and using neff = n, were also
adequate; however the chosen specification gave better results, and this way of
defining h1 is closer to the theoretically optimum bandwidth specification.

C.2 Long-term variability and wet day classes


The association between annual rainfalls and ENSO at Tenterfield is shown in
Table 5.1 (this table is reproduced here as Table C.1, for ease of reference). This
association can be broken down into wet day classes based on the number of
adjacent wet days. This is shown in Table C.2 and Table C.3. There are
differences between the classes. For example, the average number of class 0 wet
days in La Niña years is about the same as in El Niño years, but the average
rainfall amount on a class 0 wet day in La Niña years is slightly higher than in El
Niño years. Also note that the average number of class 2 wet days in La Niña
years is much higher than in El Niño years, but the relationship for the average
rainfall amount on a class 2 wet day is reversed; the wet day amounts in La Niña
years are slightly lower than in El Niño years. These complex features of the
historical data cannot be modelled unless the classes are modelled separately, and
an annual-level predictor is introduced to the RAG model.
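The classification by adjacent wet days can be sketched as follows. The 0.3 mm wet-day threshold is an assumption taken from the amount ranges quoted in section C.1, and the 1a/1b split of class 1 matches the usage later in this appendix (1a for the first day of a multi-day wet spell, 1b for the last).

```python
import numpy as np

def classify_wet_days(rain, wet_thresh=0.3):
    """Classify each day by the wetness of its neighbours.
    "0":  solitary wet day (dry on both sides)
    "1a": first day of a multi-day wet spell
    "1b": last day of a multi-day wet spell
    "2":  interior day of a wet spell
    Dry days are labelled "dry"."""
    wet = np.asarray(rain) >= wet_thresh
    n = len(wet)
    labels = np.array(["dry"] * n, dtype=object)
    for t in np.where(wet)[0]:
        left = t > 0 and wet[t - 1]
        right = t < n - 1 and wet[t + 1]
        if not left and not right:
            labels[t] = "0"
        elif right and not left:
            labels[t] = "1a"
        elif left and not right:
            labels[t] = "1b"
        else:
            labels[t] = "2"
    return labels
```

Tables C.2 and C.3 then follow by grouping these labels (with 1a and 1b pooled into class 1) by the ENSO phase of each year.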

Table C.1 Average annual rainfall (mm), average number of wet days per year,
and average rainfall on a wet day (mm), for El Niño and La Niña years at
Tenterfield, 1936-1996

                               El Niño    Non-ENSO    La Niña
                               years      years       years
  annual rainfall (mm)           790         900        1020
  number of wet days              90          97         112
  rainfall on a wet day (mm)     8.8         9.3         9.2

Table C.2 Average number of class 0, class 1, and class 2 wet days for El Niño
and La Niña years at Tenterfield, 1936-1996

  Class      El Niño    Non-ENSO    La Niña
             years      years       years
  0               25          27         24
  1               44          46         53
  2               21          24         35
  all             90          97        112

Table C.3 Average rainfall amount (mm/day) on class 0, class 1, and class 2 wet
days at Tenterfield, 1936-1996

  Class      El Niño    Non-ENSO    La Niña
             years      years       years
  0              6.3         7.3        7.1
  1              8.8         9.0        8.4
  2             13.3        12.6       11.6
  all            8.8         9.3        9.2

C.3 Detailed results for Sydney


In this section, detailed results for the rainfall generator for Sydney are presented,
broken down into results for each wet day class (class 0, class 1a, class 1b, and
class 2).

Results for the rainfall amount generator (RAG(1)) for class 0, class 1a, and class
1b amounts are shown in Figures C.2 to C.4 (class 2 amounts have already been
presented in Figure 5.3). The most notable feature of these results is that the
simulated mean class 1b rainfalls are too high (cf. Figure C.4a). It was found that
the high rainfalls occurred because the model overestimates the amounts on the
days before class 1b days. When the class 1b amounts are conditionally simulated,
the positive correlation structure causes the class 1b amounts to be similarly
overestimated. Possible ways to correct this deficiency in the formulation of the
amounts model are to:

1. Model five separate rainfall classes instead of four, i.e. break class 2 amounts
into “second-last days in wet spells” and “all other days in the middle of a wet
spell”. For Sydney rainfall, the distribution of rainfall in these two sub-classes is
different, with lower means and standard deviations for “second-last days in wet
spells”. However, modelling five separate classes would make the RAG model
formulation more complicated;

2. Simulate class 1b unconditionally. This would correct the problem in the
simulated means; however the simulated lag-one correlations (cf. Figure C.4d)
would be zero.
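To see why option 2 sacrifices the lag-one correlation, compare a conditional resample against an unconditional one. The sketch below uses a simple k-nearest-neighbour resampler as a stand-in for the kernel-based conditional simulation in the RAG model; the function name and the value of k are illustrative assumptions.

```python
import numpy as np

def simulate_class1b_amount(x_prev, hist_prev, hist_amt, k=10, rng=None):
    """Conditional resampling stand-in (not the thesis's kernel method):
    draw a class 1b amount from the k historical class 1b days whose
    preceding-day amounts are closest to x_prev, preserving the positive
    lag-one dependence. Drawing rng.choice(hist_amt) instead
    (unconditional simulation, option 2) would correct the biased means
    but drive the simulated lag-one correlation to zero."""
    rng = rng or np.random.default_rng()
    hist_prev = np.asarray(hist_prev, dtype=float)
    hist_amt = np.asarray(hist_amt, dtype=float)
    idx = np.argsort(np.abs(hist_prev - x_prev))[:k]
    return hist_amt[rng.choice(idx)]
```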

[Plot: lines show the 5%, median, and 95% generated values and circles show
the historical values, plotted against Julian day.]
Figure C.2 RAG(1): Statistics of daily rainfall for Sydney on class 0 wet days, as
a function of Julian day. a. Mean daily rainfall. b. Standard deviation of daily
rainfall. c. Skew of daily rainfall.

[Plot: lines show the 5%, median, and 95% generated values and circles show
the historical values, plotted against Julian day.]

Figure C.3 RAG(1): Statistics of daily rainfall for Sydney on class 1a wet days,
as a function of Julian day. a. Mean daily rainfall. b. Standard deviation of daily
rainfall. c. Skew of daily rainfall.

[Plot: lines show the 5%, median, and 95% generated values and circles show
the historical values, plotted against Julian day.]

Figure C.4 RAG(1): Statistics of daily rainfall for Sydney on class 1b wet days,
as a function of Julian day. a. Mean daily rainfall. b. Standard deviation of daily
rainfall. c. Skew of daily rainfall. d. Lag-one correlation of daily rainfall.

Results for the ROG(4) occurrence model for Sydney broken down into wet day
classes are presented in Figures C.5 and C.6. The breakdown of class 0, class 1a,
class 1b, and class 2 wet days per year is reasonable for Sydney. The next section
shows that this is not the case for the corresponding results for Melbourne, and
that this poor breakdown for Melbourne affects the overall results for its
combined ROG(4)/RAG model.

Figure C.5 ROG(4): Mean wet days per year for Sydney. a. All classes
combined. b. class 0. c. class 1a. d. class 1b. e. class 2.


Figure C.6 ROG(4): Standard deviation of wet days per year for Sydney. a. All
classes combined. b. class 0. c. class 1a. d. class 1b. e. class 2.

Results for the combined ROG(4)/RAG(1) rainfall generator for Sydney are
presented in Figures C.7 and C.8. Figure C.7d shows that the simulated mean
class 1b rainfall per year is a little high. This occurs because the model
overestimates the amounts on the days before class 1b days, as discussed above.

Figure C.7 Combined ROG(4)/RAG(1): Mean rainfall per year for Sydney. a.
All classes combined. b. class 0. c. class 1a. d. class 1b. e. class 2.

The standard deviation of the rainfall per year for each wet-day class generated by
the combined ROG(4)/RAG(1) model does not perfectly reproduce the historical
values, as shown in Figure C.8. This is despite the fact that the simulated wet days
per year for each class gives a reasonable match to the historical values (cf. Figure
C.6). In particular, the variability of the class 2 rainfall makes a dominant
contribution to the overall variability of rainfall per year that is shown in Figure
C.8a, and the model overestimates the annual variability of class 2 rainfall. This
may occur because historical daily amounts on class 2 days tend to be smaller in
wet years than in dry years (cf. Table C.3, and the discussion in section C.2). A
similar argument applies to the annual variability of class 1a rainfall, and the
inverse argument applies to class 0 and class 1b rainfall, i.e. that historical daily
amounts on solitary wet days and days at the end of wet spells may tend to be
larger in wet years than in dry years. The RAG(1) model is insensitive to these
complex low-frequency features of the historical data.


Figure C.8 Combined ROG(4)/RAG(1): Standard deviation of rainfall per year
for Sydney. a. All classes combined. b. class 0. c. class 1a. d. class 1b. e. class 2.

Results for the combined ROG(4)/RAG(2) model are now considered. The
reproduction of daily-level statistics by the RAG(2) rainfall amount generator,
incorporating the annual-level predictor, is very similar to the results obtained for
RAG(1) (cf. Figures 5.3, C.2, C.3 and C.4), and the mean rainfall (class totals) per
year are similar to those for RAG(1) (cf. Figure C.7). A breakdown of the
variability of annual rainfall totals for the combined ROG(4)/RAG(2) model into
wet-day classes is shown in Figure C.9. Panel a shows that this model adequately
reproduces the annual variability of rainfall. The standard deviation of class 0,
class 1a, and class 2 rainfall per year is better than that for ROG(4)/RAG(1) (cf.
Figure C.8). The class 1b variability is slightly worse, which may be due to the
deficiency in the model for class 1b amounts that was noted earlier.

Figure C.9 Combined ROG(4)/RAG(2): Standard deviation of rainfall per year
for Sydney. a. All classes combined. b. class 0. c. class 1a. d. class 1b. e. class 2.

C.4 Detailed results for Melbourne


Results for the rainfall amount generator (RAG(1)) for Melbourne for class 0,
class 1a, and class 1b amounts are shown in Figures C.10 to C.12 (class 2 amounts
are presented in Figure 5.7). These results show that the amounts model
adequately reproduces the historical daily-level statistics.

[Plot: lines show the 5%, median, and 95% generated values and circles show
the historical values, plotted against Julian day.]

Figure C.10 RAG(1): Statistics of daily rainfall for Melbourne on class 0 wet
days, as a function of Julian day. a. Mean daily rainfall. b. Standard deviation of
daily rainfall. c. Skew of daily rainfall.

[Plot: lines show the 5%, median, and 95% generated values and circles show
the historical values, plotted against Julian day.]

Figure C.11 RAG(1): Statistics of daily rainfall for Melbourne on class 1a wet
days, as a function of Julian day. a. Mean daily rainfall. b. Standard deviation of
daily rainfall. c. Skew of daily rainfall.

[Plot: lines show the 5%, median, and 95% generated values and circles show
the historical values, plotted against Julian day.]

Figure C.12 RAG(1): Statistics of daily rainfall for Melbourne on class 1b wet
days, as a function of Julian day. a. Mean daily rainfall. b. Standard deviation of
daily rainfall. c. Skew of daily rainfall. d. Lag-one correlation of daily rainfall.

Results for the ROG(4) occurrence model for Melbourne broken down into wet
day classes are presented in Figures C.13 and C.14. The breakdown of mean wet
days per year is not perfect. The simulated mean class 0 wet days per year are too
high, with a bias of about two days. There is also negative bias of about two days
for both simulated class 1a and simulated class 1b means, and positive bias of
about one day for simulated class 2 means. These biases affect the overall results
for the combined ROG(4)/RAG models. Unfortunately, it is unlikely that
adjustments to the structure of the ROG(4) model will improve this breakdown of
results for the wet day totals per year. An alternative, post-simulation approach to
improve the quality of the generated occurrence sequences would be to: a. add a
new wet day into every year of the generated sequences, next to a class 0 wet day;
and b. remove a class 2 wet day from every year of the generated sequences, and
place it next to a class 0 wet day in that same generated year. This would
artificially adjust the simulated means to the correct levels. This exercise has not
been carried out here; instead, the unadjusted ROG(4) results are used, so that the
effect of using these results can be seen in the output from the combined
ROG(4)/RAG models.
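A sketch of this hypothetical adjustment for a single generated year might look like the following; the adjustment was not implemented in the thesis, and the edge-case handling here is an illustrative choice.

```python
import numpy as np

def adjust_year(wet):
    """Sketch of the hypothetical post-simulation adjustment described in
    the text (not carried out in the thesis). `wet` is one generated
    year's wet/dry sequence; a corrected copy is returned. Step a adds a
    wet day beside a solitary (class 0) wet day; step b removes an
    interior (class 2) wet day and places it beside another solitary
    wet day in the same year."""
    w = np.asarray(wet, dtype=bool).copy()
    n = len(w)

    def solo_days():  # class 0: wet with dry neighbours on both sides
        return [t for t in np.where(w)[0]
                if not (t > 0 and w[t - 1]) and not (t < n - 1 and w[t + 1])]

    def interior_days():  # class 2: wet with wet neighbours on both sides
        return [t for t in np.where(w)[0]
                if t > 0 and w[t - 1] and t < n - 1 and w[t + 1]]

    # a. add a new wet day next to a class 0 wet day
    solo = solo_days()
    if solo and solo[0] + 1 < n:
        w[solo[0] + 1] = True
    # b. move an interior wet day to sit beside another class 0 wet day
    interior, solo = interior_days(), solo_days()
    if interior and solo and solo[0] + 1 < n:
        w[interior[0]] = False
        w[solo[0] + 1] = True
    return w
```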


Figure C.13 ROG(4): Mean wet days per year for Melbourne. a. All classes
combined. b. class 0. c. class 1a. d. class 1b. e. class 2.

Figure C.14 ROG(4): Standard deviation of wet days per year for Melbourne. a.
All classes combined. b. class 0. c. class 1a. d. class 1b. e. class 2.

Results for the combined ROG(4)/RAG(1) rainfall generator for Melbourne are
presented in Figures C.15 and C.16. Figure C.15 shows that although the overall
simulated mean rainfall per year reproduces the historical value, the breakdown
into classes is slightly flawed. This is probably due to the biases in the output
from the occurrence model that were discussed above.

Figure C.15 Combined ROG(4)/RAG(1): Mean rainfall per year for Melbourne.
a. All classes combined. b. class 0. c. class 1a. d. class 1b. e. class 2.

The standard deviation of the rainfall per year for each wet-day class generated by
the combined ROG(4)/RAG(1) model for Melbourne does not perfectly reproduce
the historical values, as shown in Figure C.16. The general pattern of results is
similar to that obtained for the corresponding results for Sydney (i.e. the class 0
and class 1b standard deviations are underestimated, and the class 1a and class 2
standard deviations are overestimated; cf. Figure C.8). The possible causes
identified for the Sydney results apply here as well. In addition, the biases in the
ROG(4) generated occurrence sequences for Melbourne contribute to the over-
and underrepresentation of the variability in annual rainfall for each wet day
class.

Figure C.16 Combined ROG(4)/RAG(1): Standard deviation of rainfall per year
for Melbourne. a. All classes combined. b. class 0. c. class 1a. d. class 1b. e. class
2.

The combined ROG(4)/RAG(2) model for Melbourne, incorporating the one-year
wetness state as an annual-level predictor, reproduces daily-level statistics of
rainfall amounts in a similar way to the ROG(4)/RAG(1) model (cf. Figures 5.7,
C.10, C.11 and C.12), and the mean rainfall (class totals) per year are also similar
(cf. Figure C.15). A breakdown of the variability of annual rainfall totals for the
combined ROG(4)/RAG(2) model into wet-day classes is shown in Figure C.17.
The standard deviation of class 0, class 1a, and class 2 rainfall per year is better
than that for ROG(4)/RAG(1) (cf. Figure C.16). The class 1b variability is slightly
worse. Note that the overall standard deviation of rainfall per year is not perfectly
reproduced, even though the class 2 standard deviation is reproduced adequately.
Further improvement to these results should be possible if the Melbourne
occurrence sequences are adjusted before applying the amounts model, as
discussed above.

Figure C.17 Combined ROG(4)/RAG(2): Standard deviation of rainfall per year
for Melbourne. a. All classes combined. b. class 0. c. class 1a. d. class 1b. e. class
2.

C.5 References
Scott, D.W., On optimal and data-based histograms, Biometrika, 66, 605-610,
1979.
Sheather, S.J., and M.C. Jones, A reliable data-based bandwidth selection method
for kernel density estimation, Journal of the Royal Statistical Society B,
53(3), 683-690, 1991.
Silverman, B.W., Density Estimation for Statistics and Data Analysis, Chapman
and Hall, New York, 1986.
Venables, W.N., and B.D. Ripley, Modern Applied Statistics with S-plus, 2nd ed.,
Springer-Verlag, New York, 1997.

Appendix D. Conference Papers and Presentations

Conference Papers
Harrold, T.I., and A. Sharma, Interseasonal and interannual dependence in daily
hydrologic records, Water 99 Joint congress, Hydrology Conference, the
Institution of Engineers Australia, Brisbane, 751-756, 1999.
Harrold, T.I., A. Sharma, and S.J. Sheather, Identification of predictors for a daily
rainfall simulation model, Hydro 2000 - 3rd International Hydrology and
Water Resources Symposium of the Institution of Engineers, Australia, 161-
166, 2000.
Harrold, T.I., A. Sharma, and S.J. Sheather, Predictor selection for a daily rainfall
occurrence model using partial informational correlation. MODSIM 2001
Congress, Canberra, 10-13 December, 275-280, 2001.
Harrold, T.I., A. Sharma, and S.J. Sheather, A nonparametric model for daily
rainfall occurrence that reproduces long-term variability. MODSIM 2001
Congress, Canberra, 10-13 December, 281-286, 2001.
Harrold, T.I., A. Sharma, and S.J. Sheather, A resampling model for daily rainfall
occurrence, ACE 2002 Conference, India, January 2002.
Harrold, T.I., A. Sharma, and S.J. Sheather, Representation of longer-term
variability in daily rainfall generation, Hydro 2002, 27th Hydrology and
Water Resources Symposium of the Institution of Engineers, Australia,
Melbourne, May 2002.

Other Conference Presentations


Harrold, T.I., A. Sharma, and S.J. Sheather, Selection of predictors for a
simulation model of daily rainfall occurrence, EGS 2002, 27th General
Assembly of the European Geophysical Society, Nice, France, 22-26 April,
2002.
Harrold, T.I., A. Sharma, and S.J. Sheather, Incorporating longer-term variability
into a stochastic model for daily rainfall, EGS 2002, 27th General Assembly
of the European Geophysical Society, Nice, France, 22-26 April, 2002.
