A Model for Sensing and Reasoning in Autonomous Systems

Richard Grover

A thesis submitted in fulfillment
of the requirements for the degree of
Doctor of Philosophy

ARC Centre of Excellence in Autonomous Systems


Australian Centre for Field Robotics
School of Aerospace, Mechanical and Mechatronic Engineering
The University of Sydney

April 2007
Declaration

I hereby declare that this submission is my own work and that, to the best of my knowl-
edge and belief, it contains no material previously published or written by another person
nor material which to a substantial extent has been accepted for the award of any other
degree or diploma of the University or other institute of higher learning, except where due
acknowledgement has been made in the text.

Richard Grover

3 April 2007


Abstract
Richard Grover Doctor of Philosophy
The University of Sydney April 2007

A Model for Sensing and Reasoning in Autonomous Systems

This thesis presents a new model for an engineering analysis of the sensing and reasoning
problem. Fundamentally, the focus of this work is on developing a model which repre-
sents and encapsulates the essential characteristics of the operations of sensing, storage,
manipulation and interpretation of exteroceptive sensory data. The thesis makes two key
conjectures: that these operations constitute the most basic components necessary for the
effective utilisation of sensory data; and that the engineer can assess the viability of a
particular implementation by reference to, and in terms of, these components.
A new mathematical model is introduced which provides the flexibility required to achieve the aims of
the work. The theoretical model is shown to correspond to a formally intractable solution,
thus necessitating the utilisation of sub-optimal approximations if practical and reliable
systems are to be developed. The mathematical framework is such that the effects of
these approximations can be critically and objectively assessed, providing the engineer with
an important tool to guide the development process. Furthermore, several mathematical
tools commonly used in the literature are extended and examined in detail, notably those
measures associated with information-theoretic quantities, such as entropy and mutual in-
formation, and a rigorous development of a theory of functional forms for representing and
reasoning with sensory information is presented.
The model is discussed in detail and the most commonly used approximative techniques are
identified and analysed to demonstrate the flexibility and strengths of the approach, with
particular emphasis on the application of the new model to the development of a system
to provide situational awareness to an autonomous uninhabited ground vehicle (AUGV)
operating in an unstructured, outdoor environment. The new framework is also shown to
give rise to many new avenues of research and to provide important new techniques for the
design, analysis and management of reliable, next-generation autonomous systems.
Acknowledgements

It is not possible to express the degree of gratitude which I owe the vast multitude of people
who have assisted, supported, encouraged, tolerated and wrangled me in the process of
completing this thesis. First and foremost, my supervisor Steven Scheding has shown a
remarkable ability to combine encouragement with rebuke and in doing so, has actually
managed to direct my work successfully; a master cat-herder you have shown yourself to
be. The time can best be summarised by a short conversation we had on several occasions:
RICHARD: “I am trying!”
STEVE: “Yes, you are.”
I am eternally indebted to Hugh Durrant-Whyte, Suresh Kumar and Tim Bailey for their
guidance, encouragement, advice and occasional wild-goose chases - without your assistance
this thesis may never have reached its end. It certainly would not have been possible if it
were not for the generosity, support and well-intentioned ‘emotional readjustment tech-
niques’ of the Argo and CRC Mining teams: Ross Hennessy, Craig Rodgers, James Under-
wood, Graham Brooker, David Rye, Monte (Judas) MacDiarmid, Ben Soon, Tim Harcombe,
Tom Allen, Chris Mifsud, Craig Lobsey, Alan Trinder and Jeremy Randle. Together you
have made the last four years interesting, exciting and ultimately highly rewarding. Bella
Wong, Matt Ridley, Ben Upcroft, Mari Velonaki, Anna Jones, Ruth Olip and Christy Wang
have all made the whole process run much more smoothly than it otherwise might have.
They bring colour, flair, panache (and that’s just Mari), professionalism and dignity to the
cut-throat world of professional research.
To my many friends who have put up with the four and a half years, and particularly the
last twelve months, of this endeavour, thank you for remaining my friends. It hasn’t always
been easy and sometimes even I was ready to murder me. My family are the greatest
support and encouragement to me - particularly when the going gets tough, the end looks
an interminable distance away and the writing is going slowly. I am thankful that I can
guarantee that even a short time with you all restores a sense of balance, meaning and
purpose to life.
This work would also have not been possible without the support and assistance of St Paul’s
College, providing much more than a place to live, a bed and breakfast, or a finishing school
for blokes. Finding that the anachronism of a genuine academic community which supports
its members, encourages excellence in all pursuits and exists for a purpose higher than
passing a degree, can exist and thrive encourages me greatly.

Although this may seem a paradox, all exact science is dominated by the idea of
approximation. When a man tells you that he knows the exact truth about anything, you
are safe in inferring that he is an inexact man.
Bertrand Russell

Writing is easy; all you do is sit staring at a blank sheet of paper until the drops of blood
form on your forehead.
Gene Fowler
Contents

Declaration i

Abstract ii

Acknowledgements iii

Contents v

List of Figures xii

List of Tables xviii

List of Examples xx

List of Symbols xxiii

1 Introduction 1
1.1 The Objectives of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Sensing and Reasoning in Outdoor Autonomous Systems . . . . . . . . . . . 3
1.3 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 The Structure of the Model . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Approximations and Assumptions in the Model . . . . . . . . . . . . 9
1.3.3 Engineering Analysis vs. Machine Learning Methods . . . . . . . . . 11
1.4 Performance Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.1 Approximative Effects . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.2 Off-line Components . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.3 On-line Source Contributions . . . . . . . . . . . . . . . . . . . . . . 14

1.4.4 Situational Dependencies of Measure Quantities . . . . . . . . . . . 14


1.5 Contributions of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Measures, Distances and Information Theory 19


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Measure Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Deviations, Divergences and Distances . . . . . . . . . . . . . . . . . . . . . 21
2.4 Extending Basic Measure Theory . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.1 Coordinate Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.2 Measures and Inner Products . . . . . . . . . . . . . . . . . . . . . . 25
2.4.3 Norms and Distances . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.4 Measures for Functions and Infinite Dimensional Systems . . . . . . 31
2.5 Information Theoretic Measures . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.1 Probability Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.2 Interpreting Probabilities . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.3 Shannon Information . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.4 Distribution Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5.5 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.5.6 Distribution Mutual Information . . . . . . . . . . . . . . . . . . . . 64
2.5.7 Three Variable Expressions . . . . . . . . . . . . . . . . . . . . . . . 67
2.6 Representing the Relationships between Measures . . . . . . . . . . . . . . . 71
2.6.1 Interval Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
2.6.2 Venn Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.6.3 Vector-Space Construction . . . . . . . . . . . . . . . . . . . . . . . 76
2.7 Complete Enumeration of Information Measures . . . . . . . . . . . . . . . 78
2.8 Deviations using Entropy and Mutual Information . . . . . . . . . . . . . . 84
2.8.1 Entropy and Mutual Information . . . . . . . . . . . . . . . . . . . . 84
2.8.2 Statistical Independence Distance . . . . . . . . . . . . . . . . . . . . 87
2.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

3 Sensing and Representation 93


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.2 Data Gathering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.2.1 Task and Sensor-Centric Representations . . . . . . . . . . . . . . . 96
3.2.2 The Sensory Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.2.3 Sensor Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.3 Functional Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.3.1 Function Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.3.2 Transformation of Function Bases . . . . . . . . . . . . . . . . . . . 111
3.3.3 Functional Forms and Distributions . . . . . . . . . . . . . . . . . . 115
3.4 Utilisation of Functional Models . . . . . . . . . . . . . . . . . . . . . . . . 118
3.4.1 Domain Inference using Functional Models . . . . . . . . . . . . . . 118
3.4.2 Functional Sensor Models . . . . . . . . . . . . . . . . . . . . . . . . 123
3.4.3 Sequential Updating . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.5 Functional Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.5.1 Mixture Representations . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.5.2 Graphical Models for Functional Forms . . . . . . . . . . . . . . . . 136
3.5.3 Parameter Space Partitioning . . . . . . . . . . . . . . . . . . . . . . 143
3.6 Performance Measurement and Practical Considerations . . . . . . . . . . . 146
3.6.1 Approximation Effects and Artifacts . . . . . . . . . . . . . . . . . . 147
3.6.2 Representational Conciseness and Sensitivity . . . . . . . . . . . . . 151
3.6.3 Manipulation and Decomposition Complexity . . . . . . . . . . . . . 152
3.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

4 Reasoning with Sensory Data 159


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
4.2 Reasoning as Lossy Reinterpretation . . . . . . . . . . . . . . . . . . . . . . 161
4.2.1 The Feature Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.2.2 The Transformation Approach . . . . . . . . . . . . . . . . . . . . . 165
4.2.3 Reasoning Algorithms as Transformations . . . . . . . . . . . . . . . 166
4.3 Transformation of Functional Representations . . . . . . . . . . . . . . . . . 167

4.3.1 Deterministic Transformations . . . . . . . . . . . . . . . . . . . . . 168


4.3.2 Uncertainty in Value . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.3.3 Domain Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.3.4 Interpreting Feature Transformations . . . . . . . . . . . . . . . . . . 173
4.4 Special Transformation Classes . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.4.1 Independent Transformations . . . . . . . . . . . . . . . . . . . . . . 176
4.4.2 Composite Transformations . . . . . . . . . . . . . . . . . . . . . . . 178
4.4.3 Feature Extraction and Decision Processing . . . . . . . . . . . . . . 179
4.5 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
4.5.1 Task-agnostic Information Preservation . . . . . . . . . . . . . . . . 185
4.5.2 Task-specific Information Preservation . . . . . . . . . . . . . . . . . 188
4.5.3 Task-specific Transformation Costs . . . . . . . . . . . . . . . . . . . 191
4.5.4 Online Performance Of System Components . . . . . . . . . . . . . . 192
4.5.5 Sensor Model Performance . . . . . . . . . . . . . . . . . . . . . . . 194
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

5 Application of Model 201


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
5.2 Implementing Functional Representations . . . . . . . . . . . . . . . . . . . 203
5.2.1 Representing a Functional Form . . . . . . . . . . . . . . . . . . . . 203
5.2.2 Computational Operations on Functional Forms . . . . . . . . . . . 205
5.2.3 Requirements for a Computational Framework . . . . . . . . . . . . 208
5.3 A Computational Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 210
5.3.1 System and Network Structure . . . . . . . . . . . . . . . . . . . . . 210
5.3.2 The Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Multi-resolution Domain Sampling . . . . . . . . . . . . . . . . . . . 213
Thread-safe Data Access . . . . . . . . . . . . . . . . . . . . . . . . . 217
Arbitrary Data Tensor Storage . . . . . . . . . . . . . . . . . . . . . 219
Probabilistic Data Management . . . . . . . . . . . . . . . . . . . . . 220
5.3.3 Sequential State Management . . . . . . . . . . . . . . . . . . . . . . 222
5.3.4 Sensing Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

5.3.5 Reasoning Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 225


5.4 Application: Global Planning for AUGV Systems . . . . . . . . . . . . . . . 225
5.4.1 AUGV Project Description . . . . . . . . . . . . . . . . . . . . . . . 225
5.4.2 The Sensor Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
5.4.3 Task and Representation Selection . . . . . . . . . . . . . . . . . . . 231
5.4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
5.5 Application: Local AUGV Navigation . . . . . . . . . . . . . . . . . . . . . 241
5.5.1 Task Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
5.5.2 Sensor Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
5.5.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . 245
5.5.4 Global and Local AUGV Operation . . . . . . . . . . . . . . . . . . 247
5.6 Application: Terrain Imaging and Estimation . . . . . . . . . . . . . . . . . 247
5.6.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
5.6.2 Sensor Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
5.6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
5.7 Application: Warpable-Domain Function Models . . . . . . . . . . . . . . . 254
5.7.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
5.7.2 Warping the Functional Domain . . . . . . . . . . . . . . . . . . . . 260
5.7.3 Point-based Transformation Operations . . . . . . . . . . . . . . . . 263
5.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265

6 Conclusions and Future Work 267


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
6.2 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
6.2.1 Function-based Measure Theoretic Framework . . . . . . . . . . . . 268
6.2.2 Information-theoretic Quantities . . . . . . . . . . . . . . . . . . . . 268
6.2.3 Distance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
6.2.4 Parameterisations and Function Estimation . . . . . . . . . . . . . . 270
6.2.5 Reasoning as Transformation . . . . . . . . . . . . . . . . . . . . . . 270
6.3 Further Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
6.3.1 Theoretical Development . . . . . . . . . . . . . . . . . . . . . . . . 271

Information Theoretic Quantities for Systems with More Than Two
Random Variables . . . . . . . . . . . . . . . . . . . . . 271
Selection of the Functional Domain and Tensor Range . . . . . . . . 272
Confirmation of Reasoning Operations as Transformations . . . . . . 272
6.3.2 Practical Development . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Measure Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Implementation of Functional Models in Real Systems . . . . . . . . 273
Development of Real Sensor Models . . . . . . . . . . . . . . . . . . 273
Implementation of Real Reasoning Operations . . . . . . . . . . . . 274
Completion of the Warpable-Domain Theories . . . . . . . . . . . . 274

A Tensor Geometry 275

B Ordinate Transformation Invariance for Mutual Information 281


B.1 Simple Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
B.2 General Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286

C Measure and Information Theory Examples 291


C.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
C.2 Measure Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
C.3 Deviations, Divergences and Distances . . . . . . . . . . . . . . . . . . . . . 292
C.4 Extending Basic Measure Theory . . . . . . . . . . . . . . . . . . . . . . . . 294
C.4.1 Coordinate Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
C.4.2 Measures and Inner Products . . . . . . . . . . . . . . . . . . . . . . 296
C.4.3 Norms and Distances . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
C.4.4 Measures for Functions and Infinite Dimensional Systems . . . . . . 301
C.5 Information Theoretic Measures . . . . . . . . . . . . . . . . . . . . . . . . . 304
C.5.1 Shannon Information and Entropy . . . . . . . . . . . . . . . . . . . 305
C.5.2 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
C.6 Deviations Using Entropy and Mutual Information . . . . . . . . . . . . . . 316
C.6.1 Entropy and Mutual Information . . . . . . . . . . . . . . . . . . . . 316

D Measure Comparisons 319


D.1 Measures for Vectors and Functions . . . . . . . . . . . . . . . . . . . . . . 319
D.1.1 Euclidian Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
D.1.2 Closed Form Euclidian Distance . . . . . . . . . . . . . . . . . . . . 326
D.1.3 Ln Distances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
D.1.4 Mahalanobis Distance . . . . . . . . . . . . . . . . . . . . . . . . . . 335
D.1.5 Hellinger-Battacharya Distance . . . . . . . . . . . . . . . . . . . . . 340
D.2 Measures for Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
D.2.1 Kullback-Leibler Divergence . . . . . . . . . . . . . . . . . . . . . . . 345
D.2.2 Csiszár’s Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
D.2.3 Mutual Information as an Affinity . . . . . . . . . . . . . . . . . . . 355
D.2.4 Statistical Information Distance . . . . . . . . . . . . . . . . . . . . 357
D.3 Measure Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362

E Occupancy Sensor Models 365


E.1 Analytic Occupancy Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
E.2 Sampled Occupancy Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
E.3 Fitting the Gaussian Model to Real Sensors . . . . . . . . . . . . . . . . . . 372
E.4 On-Boresight to Global Transformations . . . . . . . . . . . . . . . . . . . . 374
E.5 Batch Updating in Sampled Models . . . . . . . . . . . . . . . . . . . . . . 375
E.5.1 Updating Binary Values . . . . . . . . . . . . . . . . . . . . . . . . . 375
E.5.2 Maintaining Sampled Models . . . . . . . . . . . . . . . . . . . . . . 377

Bibliography 379
List of Figures

1.1 An Autonomous Uninhabited Ground Vehicle (AUGV) shown in a typical
open farmland terrain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Some additional operational environments for a practical AUGV . . . . . . 5
1.3 The High-level System Model . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.1 The triangle inequality for deviation measures . . . . . . . . . . . . . . . . . 22


2.2 Path invariance for differentiable measures on a continuous space . . . . . . 26
2.3 A representation of the differential or difference of a measure defined on
qualitatively different domains . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4 A Venn Diagram for two disjoint sets representing the outcomes of two inde-
pendent experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Conditional and joint Distributions for a deterministic relationship between
Voltage and Current . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6 Conditional and joint Distributions for the relationship between Voltage and
Current when resistance is a third random variable . . . . . . . . . . . . . . 39
2.7 Approximation of a continuous probability density by discrete distribution . 49
2.8 The contributions of different values of probability to distribution entropy. . 50
2.9 A Venn diagram showing the relationships between the information theoretic
quantities defined for three random variables . . . . . . . . . . . . . . . . . 68
2.10 An interval–diagram interpretation of the relationships between common
information–theoretic measures . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.11 Interval diagrams for two cases involving negative quantities . . . . . . . . . 73
2.12 A Venn diagram for two random variables . . . . . . . . . . . . . . . . . . . 74
2.13 A commonly drawn Venn diagram for information measures involving two
random variables, x and y . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.14 A two dimensional vector diagram illustrating a possible interpretation of the
relationship between HP (X), HP (X|Y ) and IP (X; Y ) . . . . . . . . . . . . 77

2.15 The full three-dimensional vector diagram for two random variables x and y 78
2.16 A Venn diagram depiction of the seven ‘base’ measures available for a situa-
tion involving three random variables . . . . . . . . . . . . . . . . . . . . . . 81
2.17 A pictorial representation of the 37 classes of ‘information-theoretic’ measures
available for three random variables . . . . . . . . . . . . . . . . . . . . . . 83

3.1 The data gathering part of the model . . . . . . . . . . . . . . . . . . . . . 95


3.2 The abstract sensory-space . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.3 The independent-sensor high-level model . . . . . . . . . . . . . . . . . . . . 101
3.4 The common-source independent-sensor model . . . . . . . . . . . . . . . . 102
3.5 Domain transformation of functional models . . . . . . . . . . . . . . . . . . 106
3.6 A Dirac basis set for a regular grid . . . . . . . . . . . . . . . . . . . . . . . 110
3.7 Wavelet representations for image data . . . . . . . . . . . . . . . . . . . . . 115
3.8 Inference in functional models . . . . . . . . . . . . . . . . . . . . . . . . . . 124
3.9 A simple Markov chain for a recursive estimation scheme . . . . . . . . . . 138
3.10 A Markov chain for demonstrating conditional independence of non-adjacent
variables, given intermediate variables . . . . . . . . . . . . . . . . . . . . . 138
3.11 A graphical model showing a non-trivial series of statistical dependencies . 140
3.12 Sensitivity of the Fourier transform to small time offsets . . . . . . . . . . . 152
3.13 The Sensing and Data-Gathering Process . . . . . . . . . . . . . . . . . . . 156

4.1 The data abstraction part of the model . . . . . . . . . . . . . . . . . . . . 161


4.2 The Feature Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.3 A deterministic transform between two functional forms. . . . . . . . . . . . 169
4.4 The composite transformation for functional forms . . . . . . . . . . . . . . 174
4.5 Side information in a communications channel . . . . . . . . . . . . . . . . . 174
4.6 Orthogonal bases and statistical independence in feature spaces . . . . . . . 177
4.7 A compound transformation showing pre-processed data stage . . . . . . . . 179
4.8 Discretisation schemes for decision processing . . . . . . . . . . . . . . . . . 180
4.9 An interval-diagram interpretation of the relationships between input and
output distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
4.10 A graphical model describing the transformation of functional forms from a
state-space to a feature-space . . . . . . . . . . . . . . . . . . . . . . . . . . 189

4.11 An alternate form of Figure 4.10 . . . . . . . . . . . . . . . . . . . . . . . . 189


4.12 An approach to multi-modal signal alignment . . . . . . . . . . . . . . . . . 190
4.13 The Data-Abstraction Process . . . . . . . . . . . . . . . . . . . . . . . . . . 198

5.1 Implementing functional representations . . . . . . . . . . . . . . . . . . . . 206


5.2 A communications structure for the main system components . . . . . . . . 211
5.3 A multi-process extension of the system structure of Figure 5.2 . . . . . . . 212
5.4 A ‘delta’ manager approach to efficient distribution of representational data
across a bandwidth limited connection . . . . . . . . . . . . . . . . . . . . . 212
5.5 The structure of a k-D subdivision of the domain W . . . . . . . . . . . . . . 214
5.6 A UML class-diagram for implementing a k-D tree . . . . . . . . . . . . . . 215
5.7 A (k = 3)-D tree for mm-Wave radar data in suburban parkland . . . . . . 216
5.8 Thread-safe access to a k-D tree . . . . . . . . . . . . . . . . . . . . . . . . 218
5.9 A UML class diagram for a subset of the SensorModel part of the software
structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
5.10 The CAS Outdoor Systems Demonstrator AUGV . . . . . . . . . . . . . . . 226
5.11 Sample digital Terrain Model (DTM) data . . . . . . . . . . . . . . . . . . . 228
5.12 Sample aerial imagery data . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
5.13 Hyperspectral aerial data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
5.14 A terrain-imaging radar unit . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5.15 Schematic view of the Global AUGV planning problem . . . . . . . . . . . . 232
5.16 Digital Terrain Model data in framework . . . . . . . . . . . . . . . . . . . . 236
5.17 Radar data in framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
5.18 A view of a small section of radar data . . . . . . . . . . . . . . . . . . . . . 238
5.19 Aerial imagery in framework . . . . . . . . . . . . . . . . . . . . . . . . . . 239
5.20 Combining surface estimation with local radar data . . . . . . . . . . . . . . 240
5.21 The SICK LMS-221 outdoor laser scanner system . . . . . . . . . . . . . . . 243
5.22 2D laser scanner data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
5.23 Laser scanner and camera on AUGV . . . . . . . . . . . . . . . . . . . . . . 244
5.24 Sample image from vehicle camera . . . . . . . . . . . . . . . . . . . . . . . 245
5.25 High-speed Scanning Radar (HSSR) sensor . . . . . . . . . . . . . . . . . . 246
5.26 Schematic view of the Local AUGV planning problem . . . . . . . . . . . . 247

5.27 Laser augmented camera data showing one of the AUGVs . . . . . . . . . . 248
5.28 Laser augmented camera data showing a large open field environment . . . 248
5.29 Laser augmented camera data showing a road traversal . . . . . . . . . . . . 249
5.30 Laser augmented camera data showing a person in the field of view . . . . . 249
5.31 Global and Local representations for AUGV control . . . . . . . . . . . . . 250
5.32 A dragline mining plant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
5.33 A Typical dusty environment in a mining application . . . . . . . . . . . . . 253
5.34 Schematic view of the terrain imaging problem for a mining application . . 254
5.35 Raw occupancy points from a mine trial . . . . . . . . . . . . . . . . . . . . 255
5.36 Surface estimates for mine environment . . . . . . . . . . . . . . . . . . . . 256
5.37 A second surface estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
5.38 Volumetric discrepancies of surfaces . . . . . . . . . . . . . . . . . . . . . . 258
5.39 Schematic representation of the DenseSLAM approach . . . . . . . . . . . . 260
5.40 Inconsistencies in the SLAM algorithm . . . . . . . . . . . . . . . . . . . . . 262
5.41 The two domains introduced by the morphable-domain algorithm . . . . . . 264
5.42 Consistency in location information for two approaches . . . . . . . . . . . . 265

A.1 Contravariant and covariant coordinates for a vector . . . . . . . . . . . . . 276


A.2 A simple change of basis for a unit vector . . . . . . . . . . . . . . . . . . . 278

B.1 Effects of applying several transformations to a discrete probability field . . 285

C.1 An example of a flat measure . . . . . . . . . . . . . . . . . . . . . . . . . . 297


C.2 An example of a non-flat measure . . . . . . . . . . . . . . . . . . . . . . . . 299
C.3 A Measure surface and path equivalent for a continuous set with finite di-
mensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
C.4 A Measure surface and path equivalent for a discrete set with finite dimen-
sionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
C.5 Probability distribution and entropy function for two dimensional indepen-
dent Gaussian distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
C.6 Joint and marginal distributions and mutual information measures for a two-
dimensional discrete system with a Gaussian distribution . . . . . . . . . . 313
C.7 Joint and marginal distributions and mutual information measures for an
independent two-dimensional discrete system with a Gaussian distribution . 313

C.8 Joint and marginal distributions and mutual information measures for a two-
dimensional continuous system with a Gaussian distribution . . . . . . . . . 314

D.1 Synchronous and asynchronous time-series with unknown offset . . . . . . . 321


D.2 Comparing two models to an input signal using Euclidian Distance . . . . . 325
D.3 Construction of the closed–form Euclidian distance for continuous Gaussian
distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
D.4 The Manhattan metric for a two-dimensional space . . . . . . . . . . . . . . 330
D.5 The Ln distance measures for an input signal and two asynchronous models 333
D.6 Euclidian and Mahalanobis metric spaces for correlated two-dimensional Gaus-
sian distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
D.7 Comparison of the Euclidian and Hellinger-Battacharya distance measures
applied to a non-negative signal and model . . . . . . . . . . . . . . . . . . 343
D.8 Comparison of the Euclidian and Hellinger-Battacharya distance measures
applied to a continuous probability distribution and model . . . . . . . . . . 344
D.9 The value of the log-odds for constructing the Kullback-Leibler divergence . 346
D.10 Individual contributions to the Kullback-Leibler divergence . . . . . . . . . 347
D.11 Kullback-Leibler divergences for the dice of Example D.2.1 . . . . . . . . . 351
D.12 The input and model distributions from Example D.2.2 . . . . . . . . . . . 352
D.13 Kullback-Leibler divergences for an input signal with unknown offset and
several candidate models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
D.14 Deterministic, statistically dependent and independent joint distributions
with common marginal distributions . . . . . . . . . . . . . . . . . . . . . . 360
D.15 An input image and two models for investigation of the mutual information
and statistical independence distance . . . . . . . . . . . . . . . . . . . . . . 361
D.16 Mutual information, statistical independence distance and Euclidian distance
for an image-based template-matching problem . . . . . . . . . . . . . . . . 363

E.1 One dimensional occupancy model . . . . . . . . . . . . . . . . . . . . . . . 367


E.2 A Two-dimensional Occupancy Sensor Model . . . . . . . . . . . . . . . . . 369
E.3 A Three-dimensional Occupancy Sensor Model . . . . . . . . . . . . . . . . 371
E.4 Approximation of a radar beam pattern with a Gaussian . . . . . . . . . . . 373
E.5 A specific Gaussian beam approximation . . . . . . . . . . . . . . . . . . . . 373
E.6 In order and out-of-order observations . . . . . . . . . . . . . . . . . . . . . 376
E.7 Parallel observations in a binary model . . . . . . . . . . . . . . . . . . . . . 376
List of Tables

2.1 Two discrete probability distributions for examining mutual information . . 59


2.2 Two discrete probability distributions for examining mutual information . . 68
2.3 Information theoretic quantities calculated for the probability distributions
of Section 2.5.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.4 Partial association of the 37 classes of measure for three random variables . 83

4.1 Three different feature-spaces for 2-D laser scanner data . . . . . . . . . . . 182

B.1 Entropies and Mutual Information measures (in nats) of the distributions
shown in Figure B.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286

D.1 Table of normalised frequencies for the five dice of Example D.2.1. . . . . . 350
D.2 Table of Kullback-Leibler divergences for the dice of Example D.2.1. . . . . 351

List of Examples

2.4.1 Discrete Coordinate System . . . . . . . . . . . . . . . . . . . . . . . . . 24


2.4.2 Continuous Coordinate System . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.1 Voltage and Current in Known Resistor . . . . . . . . . . . . . . . . . . 38
2.5.2 Voltage and Current in a Real Resistor . . . . . . . . . . . . . . . . . . . 39
2.5.3 Event mutual information . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.5.4 Mutual information limits . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.5.5 Positive three-term mutual information . . . . . . . . . . . . . . . . . . . 69
2.5.6 Negative three-term mutual information . . . . . . . . . . . . . . . . . . 69
3.3.1 Domain transformation for functional forms . . . . . . . . . . . . . . . . 106
3.3.2 Wavelet approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
3.4.1 Domain inference in topological space . . . . . . . . . . . . . . . . . . . 122
3.4.2 Single-return range-bearing sensor model . . . . . . . . . . . . . . . . . . 128
3.6.1 Parameter sensitivity in approximations . . . . . . . . . . . . . . . . . . 151
3.6.2 Computational complexity of a Gaussian representation . . . . . . . . . 153
4.4.1 Features from 2-D laser scans . . . . . . . . . . . . . . . . . . . . . . . . 181
C.2.1 Σ-algebras for a discrete set . . . . . . . . . . . . . . . . . . . . . . . . . 291
C.2.2 Σ-algebras for a continuous space . . . . . . . . . . . . . . . . . . . . . . 292
C.3.1 Distance measure for Example C.2.1 . . . . . . . . . . . . . . . . . . . . 292
C.3.2 Divergence measure for Example C.2.1 . . . . . . . . . . . . . . . . . . . 293
C.3.3 An alternative distance measure for Example C.2.1 . . . . . . . . . . . . 294
C.4.1 The Fourier basis for a simple case . . . . . . . . . . . . . . . . . . . . . 295
C.4.2 A flat measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
C.4.3 A non-flat measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298

C.4.4 Norms and distance in discrete system . . . . . . . . . . . . . . . . . . . 300


C.4.5 Measure on a continuous space with finite dimensionality . . . . . . . . . 301
C.4.6 Measure on a discrete space with finite dimensionality . . . . . . . . . . 302
C.4.7 Measure on a discrete space with infinite dimensionality . . . . . . . . . 303
C.4.8 Measure on a continuous space with infinite dimensionality . . . . . . . 304
C.5.1 Probabilities in a discrete system . . . . . . . . . . . . . . . . . . . . . . 304
C.5.2 Event entropy in a discrete system . . . . . . . . . . . . . . . . . . . . . 305
C.5.3 Event entropies in a continuous system . . . . . . . . . . . . . . . . . . . 305
C.5.4 Event entropy in the case of Example 2.5.2 . . . . . . . . . . . . . . . . 307
C.5.5 Distribution entropy from Example C.5.2 . . . . . . . . . . . . . . . . . 309
C.5.6 Entropy of a Gaussian distribution . . . . . . . . . . . . . . . . . . . . . 310
C.5.7 Distribution entropy in the case of Example 2.5.2 . . . . . . . . . . . . . 311
C.5.8 Limiting values for event mutual information . . . . . . . . . . . . . . . 311
C.5.9 Event mutual information with independence . . . . . . . . . . . . . . . 312
C.5.10 Event mutual information for a continuous system . . . . . . . . . . . . 314
C.5.11 Distribution mutual information in Example 2.5.3 . . . . . . . . . . . . . 314
C.5.12 Information theoretic quantities for Gaussian systems . . . . . . . . . . . 315
C.6.1 Mutual information as deviation in continuous systems . . . . . . . . . . 316
D.1.1 Euclidian distance for vectors . . . . . . . . . . . . . . . . . . . . . . . . 322
D.1.2 Euclidian distance in discrete system . . . . . . . . . . . . . . . . . . . . 323
D.1.3 Euclidian distance for functions . . . . . . . . . . . . . . . . . . . . . . . 324
D.1.4 Ln distances for vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
D.1.5 Ln distances for functions . . . . . . . . . . . . . . . . . . . . . . . . . . 332
D.1.6 The Mahalanobis distance . . . . . . . . . . . . . . . . . . . . . . . . . . 339
D.1.7 Template matching with Hellinger-Battacharya distance . . . . . . . . . 342
D.1.8 Hellinger-Battacharya for probability distributions . . . . . . . . . . . . 343
D.2.1 Kullback-Leibler divergence in discrete system . . . . . . . . . . . . . . . 350
D.2.2 Template-matching with the Kullback-Leibler divergence . . . . . . . . . 353
D.2.3 Mutual information and statistical independence distance . . . . . . . . 359
D.2.4 Template matching with I(X; Y ) and DSI (X, Y ) . . . . . . . . . . . . . 359
List of Symbols

S An arbitrary set over which a family of subsets (also called a σ-
algebra) can be defined . . . . . . . . . . . . . . . . . . . . . . . . 20
Σ A family of subsets of S which is closed under application of a
finite number of set operations . . . . . . . . . . . . . . . . . . . . 20
μ A measure mapping elements of the σ-algebra of S to a magnitude 20
(S, Σ, μ) A measure space defined by μ applied to the σ-algebra Σ of the
set S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Si An arbitrary element of the σ-algebra of the set S . . . . . . . . . 20
∅ The empty subset . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
R The set of real numbers . . . . . . . . . . . . . . . . . . . . . . . 20
Z The set of integers . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Z+ The subset of Z corresponding to positive integers . . . . . . . . . 20
Z∗ The subset of Z corresponding to non-negative integers . . . . . . 20
C The set of complex numbers . . . . . . . . . . . . . . . . . . . . . 20
Δ(Si , Sj ) A two-element measure of the ‘deviation’ between the elements Si
and Sj of Σ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
D(Si ‖ Sj ) A divergence measure for the two elements Si , Sj of Σ . . . . . . 21
d(S1 , S2 ) A distance measure for the two elements Si ,Sj of Σ . . . . . . . . 22
si A coordinate (vector) for the ith element of S . . . . . . . . . . . 24
êk The kth basis (vector) defining a coordinate system for the set S 24
⟨si , sj ⟩ The inner product operator applied to the two elements Si and Sj
of Σ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
μkl The metric tensor for a flat bi-linear measure space . . . . . . . . 27
x A random variable for the distribution (or experiment) denoted by
X. This symbol has alternate definitions in different contexts . . 34

X A distribution (or experiment) associated with the random variable
x. This symbol has alternate definitions in different contexts . . . 34
A A discrete ‘alphabet’ describing an arbitrary set . . . . . . . . . . 34
xi A specific value of the discrete random variable x with xi ∈ A . . 34
PA (xi ) The probability of the discrete random variable x having the value
xi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
x̃ A specific value of the continuous random variable x with x̃ ∈ Rn 35
PX (x̃) The probability density of the random variable x having value x̃ 35
x A specific value of the random variable x where the context implies
a continuous or discrete interpretation . . . . . . . . . . . . . . . 36
P (x, y) The probability value (or density) of the random variable taking
on value x where the context implies a continuous or discrete in-
terpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
P (x|y) The conditional probability of random variable x given the value
of the random variable y . . . . . . . . . . . . . . . . . . . . . . . 37
UH (x) The unit Heaviside (generalised-) function . . . . . . . . . . . . . 39
hP (x) The event entropy of the particular outcome, x = x, with respect
to the distribution P (x) . . . . . . . . . . . . . . . . . . . . . . . 43
δD (x) The Dirac delta (generalised-) function . . . . . . . . . . . . . . . 44
Exy {μ(x, y)} The expectation operator applied to the measure μ(x, y) with re-
spect to the random variables x and y described by the probability
distribution P (x, y) . . . . . . . . . . . . . . . . . . . . . . . . . . 48
HP (X) The distribution entropy of the experiment X . . . . . . . . . . . 48
iP (x; y) The event mutual information associated with the outcome (x =
x, y = y) according to the distribution P (x, y) . . . . . . . . . . . 58
IP (X; Y ) The distribution mutual information for the experiment (X, Y )
described by P (x, y) . . . . . . . . . . . . . . . . . . . . . . . . . 64
DSI (X, Y ) The statistical independence distance between distributions (or ex-
periments) X and Y . . . . . . . . . . . . . . . . . . . . . . . . . 89
Y An arbitrary domain over which a functional form is to be defined 103
y An element y ∈ Y representing a ‘localisation’ in that space . . . 103
x(y) A (tensor-valued) function defined over the domain Y ; usually a
random variable referred to as the ‘state’ function . . . . . . . . . 104
X The space defining the range of the state function. x(y) ∈ X ∀ y ∈ Y 104
x̂(y) An estimate of the function defined by the random variable x(y) 107

P [x(y)] The probability distribution over functions for the system de-
scribed by the random variable x(y) . . . . . . . . . . . . . . . . 107
{eδ } The set of basis vectors for a Dirac space; each element is a Dirac
delta function at a single element of the space Y . . . . . . . . . 107
FD The set of functions which can be defined on a Dirac basis set . . 108
FY The set of functions which can be defined over the domain Y using
an implied function basis; a subset of the set FD . . . . . . . . . 108
α(w) The parameter function (or coordinates) of a function according
to a given basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
w An arbitrary element from the domain W , describing a coordinate
system for the parameters α(w) . . . . . . . . . . . . . . . . . . . 109
W An arbitrary domain over which the parameters of a functional
form are defined . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Tbasis A basis transformation operation mapping a parameter function
α(w) to (the implied) Dirac basis form x(y) . . . . . . . . . . . . 112
ẑl (vl ) A sensory observation from the lth source defined as a function
over an arbitrary domain Vl and range Zl . . . . . . . . . . . . . 124
Zl The space over which the range of a sensory observation function
is defined . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
vl An arbitrary location in the domain of a sensory observation; vl ∈ Vl 124
Vl The domain over which a sensory observation function is defined 124
P [zl (vl )|x(y)] The (forward) sensor model for the lth sensor . . . . . . . . . . . 125
TZl The transformation from the state function to observation function
for the lth sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
x̂k (y) An estimate of the value of the random variable x(y) after k ob-
servations have been made . . . . . . . . . . . . . . . . . . . . . . 130
ẑlk (vl ) The kth observation in a sequential system, the subscript l identi-
fies the source of the particular observation . . . . . . . . . . . . 131
Zk The sequence of observations up to and including the kth measure-
ment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
uk A set of parameters defining the transition of the state function
during the kth interval . . . . . . . . . . . . . . . . . . . . . . . . 131
Uk The history of transition parameters up to the kth interval . . . . 131
S A space defining the range of a feature-space vector s ∈ S . . . . 163
s A feature-descriptor vector defining a set of measured properties 163

Q A domain over which feature space values s will be defined as a
functional form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
s(q) A feature-space function defining the feature-descriptor values over
the domain Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
q An arbitrary location in the domain Q . . . . . . . . . . . . . . . 163
TSl The transformation generating the lth feature space . . . . . . . . 167
Tdet The deterministic part of the transform TSl . . . . . . . . . . . . 168
Pvalue [sl (ql )|x(y)] The ‘value’ (or noise) uncertainty introduced during the transfor-
mation TSl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
y A new random variable describing the value of the domain location
y when this is not known deterministically . . . . . . . . . . . . . 172
Py [ y | y ] The distribution defining the values of the random variable y given
the ‘true’ domain location y . . . . . . . . . . . . . . . . . . . . . 172
H [SQl ] The distribution entropy for the lth feature-space output distribu-
tion P [sl (ql )] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
I(SQl ; XY ) The distribution mutual information for the transformation joint
distribution P [sl (ql ), x(y)] . . . . . . . . . . . . . . . . . . . . . . 185
H(SQl |XY ) The conditional distribution entropy for the transformation en-
coded by P [sl (ql )|x(y)] . . . . . . . . . . . . . . . . . . . . . . . 185
FL [sl (ql ), x(y)] The cost function for the result of obtaining the output function
sl (ql ) when the true input function is x(y) . . . . . . . . . . . . . 191
Pbatch [ŝl (ql ), x̂(y)] The distribution of a batch of estimates defining the online perfor-
mance of a transformation operation . . . . . . . . . . . . . . . . 193
Chapter 1

Introduction

1.1 The Objectives of this Thesis

This thesis examines an interpretative model of the operations of sensing, representing, pro-
cessing and reasoning with exteroceptive data. Motivated by the recognition that the de-
velopment of any reliable autonomous or semi-autonomous system depends critically on the
ability to reliably manipulate and interpret the available sensory data, this thesis presents
a new approach to understanding the essential character of these operations. While the
literature is rich in techniques, theories and applications related to performing these tasks
(Thrun 1998, Newman 2000, Stanford Racing Team 2005), it is not clear how the various
methods are related, nor what characteristics define the essential nature of each operation.
This is especially clear in the context of reasoning with sensory data where techniques
include deterministic and probabilistic discriminative models (Hearst et al. 1998, Mackay
1998, Muller et al. 2001), probabilistic generative models (Ng & Jordan 2002, Kumar et al.
2005) and traditional signal processing methods (Middleton 1996). This thesis addresses
the following conjecture: that the operations associated with the capture, storage, manipu-
lation and interpretation of sensory data constitute the ‘elementary’ components necessary
for effective utilisation of this data, and that the engineer can quantitatively assess the
viability of many approaches in terms of these basic operations. This is achieved through
the development of a comprehensive mathematical model which embodies the fundamental
qualities of the entire process and seeks to encompass the corpus of techniques. The pri-
mary purpose of such a model is to identify and clarify the essential characteristics of each
of these components. In doing so, such a model naturally suggests the concept of an ‘ideal’
or ‘optimal’ system against which any particular implementation can be compared.

Though it will be demonstrated that the approach proposed in this thesis is formally in-
tractable in its ideal formulation, explicit recognition of the requirement for approximations
and assumptions regarding each of the elements of the model enables a system developer
to analyse the implications of such ‘sub-optimal’ deviations from the ideal. Furthermore,
while the mathematical treatment is highly abstract (to enable a generality sufficient to
encompass the wide variety of techniques available), the focus of the thesis is on providing
an engineer with the tools and structure to thoroughly assess the impact and performance
of the inevitable assumptions and approximations. These engineering decisions are not lim-
ited to a consideration of the effect of approximation artifacts, manipulation constraints
or computational complexity alone, but include the incorporation of a priori sensor and
task-specific knowledge and the identification of how heuristic or ‘learned’ knowledge can
be reliably incorporated into a complete system.

Middleton (1996, p773) notes that a systematic and unified model must have a character
which addresses three specific requirements: “[embodying] (1) methods for determining the
explicit structure of the optimal, or ‘best’, systems for the problem at hand; (2) procedures
for evaluating the performance of such optimum systems; and (3) techniques of system
comparison, whereby different systems, optimum and suboptimum, may be quantitatively
compared”. While that work was concerned primarily with the problem of communications
theory, these three requirements are directly applicable to this thesis. In this context the
objectives of this thesis can be identified in detail:

• To identify the essential characteristics of the operations of sensing, representation
and reasoning and to develop a cohesive model capturing these properties; and to
employ this model to identify the ideal approach to the management and utilisation
of sensory data;

• To explicitly acknowledge the intractability of the ideal approach and, therefore, to
elucidate the nature and forms of approximations and assumptions required to con-
struct reliable and practical systems; and,

• To provide a framework enabling the engineer to assess and, where possible, quantify
the ‘performance’ of the system or a sub-component, including: the impact of these
approximations on the performance of a sub-optimal technique with respect to an
‘ideal’ or alternative approach; the quality of the output of a sub-component of the
system; and what contribution a particular sensor or information source is making at
any stage throughout the system.

1.2 Sensing and Reasoning in Outdoor Autonomous Systems

While the model presented in this thesis is applicable to a much wider array of systems,
this work is motivated primarily by the particular problem of the development of an au-
tonomous uninhabited ground vehicle (AUGV) system operating in an unstructured, out-
door environment. The success of three teams in the recently completed DARPA Grand
Challenge (DARPA 2005) has demonstrated that high-speed, long-duration autonomous
navigation is readily achievable with current technologies. Although these successes repre-
sent a substantial improvement in the ‘state-of-the-art’ for outdoor autonomous systems,
many fundamental questions remain unanswered in regard to the deployment of functional,
reliable systems operating in a general off-road, outdoor environment. In particular, the
operational environment of the competition remained relatively structured, driving on well-
maintained unsealed roads and containing well-defined navigational obstacles. The most
successful system demonstrated that reliable, yet relatively simple, sub-components tai-
lored to the specific requirements of the competition were sufficient to compete effectively
(Stanford Racing Team 2005).

The extension of the current technology to enable an AUGV system to perform effectively
over a diverse set of terrain, weather and other situational conditions represents a significant
obstacle to the deployment of practically viable systems in the near future. In particular,
such a system must be capable of operating under at least the following conditions:

24-hour Continuous Operation requiring that the system be capable of seamlessly man-
aging the transition between daylight illumination conditions and a variety of night-
time scenarios, including operation with or without vehicle-mounted illumination sys-
tems

All-weather Operation requiring that a system be robust against the sensory effects of
weather conditions such as rain, fog and dust

Diverse Terrains including desert, rural farmland, woodland areas and heavily-vegetated
environments requiring that the system should be capable of adapting to the local
terrain characteristics. Examples of these types of terrain are shown in Figures 1.1
and 1.2

It is clear that the addition of these requirements greatly increases the degree of complexity
in the resulting system, particularly with respect to the selection of sensory capabilities
and the interpretation of the available data. As examples, consider that a visible-spectrum
camera may be of use during daylight or illuminated night-time conditions, but if used in
isolation will be insufficient to enable continued operation under low or no-light conditions.
Likewise, a laser-based range sensor may operate effectively under clear-weather conditions,
but not in dusty or smoky conditions. Furthermore, the requirements of increased relia-
bility suggest that a viable system must be capable of detecting and responding gracefully
to the failure of any particular sensor or algorithm. Finally, the physical properties of the
operational environments have significant structural complexity and great subtlety may be
required for the reliable interpretation of sensory cues. Consider the vehicle in Figure 1.1
and how this vehicle could distinguish between a moss-encrusted rock (corresponding to
an impassable obstacle) and a dense shrub (which can be safely driven over). If only vi-
sual information were available, this distinction could be quite subtle, relying on a non-trivial
combination of colours, textures and other visual cues. Similarly, the distinction between a
dry salt-pan (representing a large open space suitable for high-speed traversal) and a lake
may not be clear if the only available information is remotely-sensed terrain elevation data.
These ambiguities exist within the data of the sensors and as a result can only be overcome
through the utilisation of additional sensing capabilities: radar can reveal that the den-
sity of the shrub is substantially lower than that of the rock, and polarisation information
reveals the water surface clearly.

As there is no single sensor system capable of providing all necessary environmental infor-
mation under all conditions, a practical system must combine the available data from a
diverse suite of sensors and, more importantly, be capable of continued operation when one
or more of the sensors has failed or is otherwise unusable. Reliability in the operation of
the system requires that a sufficiently suite of different sensors be selected such that the
system is not critically dependent on any single sensor. Furthermore, it is not reasonable to
assume that each sensor provides consistent and valid data at all times. A reliable system

Figure 1.1: An Autonomous Uninhabited Ground Vehicle (AUGV) shown in a typical open
farmland terrain. Photo: Alex Green, 2005.


Figure 1.2: Some additional operational environments for a practical AUGV. (a) shows an
open desert landscape, and (b) a heavily vegetated tropical jungle.

must be capable of quantifying the performance of each sensor and of identifying when
any particular sensor has failed. The particular problem of designing a sensory system to
ensure fault-detectability has been examined in detail in Scheding (1997), but the issue of
measuring the contribution of any single sensor to the successful achievement of a particular
task will be considered in Section 4.5.5.

Taken together, the requirement for redundant, reliable sensing and the necessity of captur-
ing a sufficiently rich collection of sensory cues suggests that a viable system will require the
ability to combine and utilise a wide range of sensor systems simultaneously. What aspects
of these different sensory cues are relevant in any particular context will depend directly on
the particular task being performed: a tracking problem may only require measuring the
location of an object, but a threat-assessment system will require additional information
such as radar cross-section or polarisation signature to interpret the ‘nature’ of that object.
This suggests that the utilisation of the sensory data will be highly task-focussed. As noted
above, a viable system will require the completion of many reasoning tasks simultaneously.
For an autonomous uninhabited ground vehicle (AUGV) these include navigation, tracking,
terrain estimation and obstacle avoidance and each of these tasks will necessarily utilise a
different subset of the available data. An application of the model proposed in this thesis
to the AUGV shown in Figure 1.1 will be discussed in Chapter 5.

1.3 The Model

The proposed model represents a structure designed to exemplify the need for an engineering
balance between, on the one hand, the computational and algorithmic costs associated with
increasing the quantity and fidelity of sensory information, and, on the other, the advantages
of supporting a greater flexibility in the interpretation and utilisation of that data. As noted
in Section 1.2, the sensory data typically available from unstructured environments requires
careful interpretation: mistaking a moss-covered rock for a shrub (or vice versa) can have effects ranging
from excessive caution to reckless confidence, neither of which is desirable. These difficulties
of interpretation can be considered to correspond to the problem of ensuring that the sensory
capabilities of the system are sufficient for the achievement of the tasks required.

In an ideal situation where computational and storage resources were unlimited, it would be
possible to maintain all sensory data without compression or loss of any kind and a system

operating in this manner would have access to the most thorough and complete information
for the given set of sensors. Furthermore, the ideal system would have the capability to
achieve the most general utilisation of the data, that is, it could complete any task which
can be performed using the available data. In practice, however, the raw data cannot be
maintained in this manner and some of the available information is necessarily discarded,
reducing the flexibility of the resulting system. This is particularly clearly demonstrated
in systems utilising imaging sensors: a typical camera may record 640 × 480 pixels in three
8-bit channels at 25 Hz, and will generate approximately 22 MB of data per second. While the visual
field conveys a vast amount of information about the operational environment, practical
constraints forbid the direct utilisation of much of this data and various techniques are
employed to reduce the complexity of the stored information; this necessarily reduces the
flexibility of the resulting system (Blakemore 1990).
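As a back-of-the-envelope check of the data rate quoted above (a sketch only; whether the result reads as roughly 22 or 23 megabytes depends on the definition of a megabyte):

# A back-of-the-envelope check of the camera data rate quoted in the text.
width, height = 640, 480      # pixels per frame
channels = 3                  # three 8-bit colour channels (1 byte each)
frame_rate = 25               # frames per second

bytes_per_frame = width * height * channels       # 921,600 bytes
bytes_per_second = bytes_per_frame * frame_rate   # 23,040,000 bytes

print(f"per frame : {bytes_per_frame / 2**20:.2f} MiB")
print(f"per second: {bytes_per_second / 2**20:.1f} MiB (roughly 22 MB)")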

1.3.1 The Structure of the Model

The model is depicted in Figure 1.3 and it is assumed that the precise set of tasks required
for the system to operate effectively are unknown a priori. This last assumption implies
that the ‘ideal’ system is that which supports the most flexible utilisation of the available
sensory data and thus represents the system which is capable of the achievement of the
largest possible number of application-specific tasks.1 This flexibility is provided through
the logical separation of the data-focussed operations of sensing and representation, and
the task-focussed operations of reasoning with the available data. That is, the selection
of which sensory systems are used determines what can be measured, but the tasks which
must be achieved determine how it is measured. In this way the processes involved can be
divided into two distinct phases: the first involving bringing together the sensory data to
generate and maintain a cohesive and consistent summary of the available data; and the
second involving transforming, manipulating and interpreting the resulting information.
Furthermore, at a more detailed level, the new approach represents the operation of a
complex perceptual system as a series of parallel processing streams, with the distinct
phases of the operations clearly identified as sub-components of these streams.

The model shown in Figure 1.3 has three distinguishing characteristics:


1 Applications for which the particular tasks are known correspond to special cases of the reasoning approach outlined in Chapter 4.


Figure 1.3: The High-level System Model. Multiple sensors asynchronously generate ob-
servation values from which likelihood functions are calculated using an appropriate sensor
model. These functions are combined statistically with the state (or summary of informa-
tion), updating the currently stored values. Subsets of the richly-descriptive probabilistic
representation are extracted and processed by specific data transformations to calculate
the feature properties necessary for the achievement of a specific application.

• The data is captured and stored in a sensor-centric manner, that is, the information is
retained in a form equivalent to the observed sensory responses. No attempt is made
to infer the existence of, or to exploit, important characteristics or features within
the data. The sensory data is treated as the best and most complete model of the
external environment;

• The uncertainty and ambiguity inherent in any real sensory process is captured ana-
lytically in the explicit utilisation of probabilistic modelling techniques. Specifically,
sensor models are constructed as likelihood functions relating an underlying random
quantity with the sensory observation.2 Sequential propagation and fusion of sensory data can be achieved by using the Chapman-Kolmogorov and Bayes’ equations respectively (a minimal sketch of such an update is given after this list).

• The process of interpretation of the data is recognised as an operation which abstracts, compresses and transforms the sensor-centric information to obtain some task-centric knowledge used to achieve a specific goal. The transformation can be explicitly encoded in a conditional distribution supporting both probabilistic and non-probabilistic transformations.

2 In this ideal case, the sensor models encode the relationships between the sensory cues and also the noise affecting their measurement. More sophisticated assumptions can be incorporated into these models and the utilisation of these models is considered in detail in Section 3.4.2.
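The fusion step referred to in the second point of the list above can be illustrated with a minimal discrete Bayes update; the three state values, the uniform prior and the likelihood table below are illustrative placeholders only, not a sensor model from this thesis.

import numpy as np

# Minimal discrete Bayes update: fuse a single observation into the stored
# summary of information. All numerical values here are illustrative only.
prior = np.array([1/3, 1/3, 1/3])        # current summary p(x) over three states

# p(z | x): row i gives the observation distribution when the state is x_i.
likelihood = np.array([[0.7, 0.2, 0.1],
                       [0.2, 0.6, 0.2],
                       [0.1, 0.2, 0.7]])

def bayes_update(prior, likelihood, z):
    """Return p(x | z) proportional to p(z | x) p(x), normalised over the states."""
    unnormalised = likelihood[:, z] * prior
    return unnormalised / unnormalised.sum()

posterior = bayes_update(prior, likelihood, z=0)
print(posterior)   # probability mass shifts toward the state most consistent with z = 0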

The distinguishing characteristic of this approach lies in the emphasis on attempting to process the sensory data in as close to its ‘raw’ form as possible and delaying any interpretative
tasks until the time and place when they are required. This is motivated by the belief that
the sensory stimuli themselves represent a consistent, complete summary of the available
data and that the interpretative operations should be dependent on the available sensory
data and not vice-versa. This focus on retaining the most descriptive representation clearly
implies that the representation part of the model will consume the greatest proportion of
the computational resources in a viable system. Interestingly, recent studies from biolog-
ical systems have led to the conjecture that as little as 20% of visual cortex activity is
directly related to the sensory input, with the remaining activity proposed to correspond
to maintaining an internal representation (Fiser et al. 2004).

1.3.2 Approximations and Assumptions in the Model

While the model described in Section 1.3.1 represents a viable ‘ideal’ approach, it is clear
that the retention of all sensory data in a lossless manner for all time represents an impossi-
bly complicated and intractable problem. One important advantage of the model of Figure
1.3, however, is that it is possible to explicitly identify the parts of the system which repre-
sent computational, storage and theoretical ‘bottlenecks’. That is, for a given application,
the model enables the engineer to construct a practical system by analysing and designing
assumptions and approximations which deviate from the ideal case. The most significant
result of this approach is that almost all of the design decisions faced by the engineer can
be clearly identified as particular components of the new model, including many heuristic
methods.

The engineering decisions of interest here are related to the development of the practical
details of the system, such as the representational form, sensory encoding, state selection and
what type of operations are performed on the data. These can be broadly divided into two
categories: those which introduce and apply external knowledge to the system, for example
a sensor model; and those which address ‘internal’ characteristics related to the character

of the data alone, such as computational and storage constraints. In other words, the first
category corresponds to the application of external knowledge to the resulting system in
addition to the data itself and the second to the application of engineering knowledge to
the form and character of the data.3

Assumptions and approximations of the first kind are readily identified as corresponding to
the application of external knowledge to the nature of the basic elements of the system in
Figure 1.3:

• Summary of Information (or state), in which the system developer selects the form
and character of the information which is maintained internally by the system. Among
other things, the representation may: contain data which is either discrete or contin-
uous valued; incorporate temporal effects or be static; correspond only to the local
region near the system, or retain the data in a global form; and/or maintain the data
in the sensory form or transform the data into a different quantity. Each of these is
considered in detail in Chapter 3.

• Sensor Models, in which the characteristics of the transformation from the raw sensory
data to the form maintained in the state are determined. In the ideal case this
corresponds to assessing the effects of noise on the resulting measurement, though
in real systems it corresponds to explicit assumptions including: the relationships
between different sensory cues; how the raw data and state being estimated are related;
and what part (or sub-domain) of the state corresponds to a particular measurement.
These issues, which are clearly related to the selection of the state, will be examined
in detail in Section 3.4.2.

• Data Transformation, in which the transformation from the represented state to gen-
erate ‘feature properties’ suitable for use in particular applications is developed. It is
demonstrated in Chapter 4 that this operation and the sensor models are very closely
related. Of interest here is the sub-domain of the state affecting the value of a given
feature property; how the state information is processed to obtain an estimate of a
feature property value; and how different feature properties are related to one another.

• Application, in which the designer enumerates and specifically encodes the manner by which the system will utilise the transformed feature properties to achieve its goals. Examples include: using the locations of a series of natural objects in the environment (obtained from one or more data transformations) to perform navigation; matching the observed properties of an object within the environment to a database of alternatives to perform threat assessment; or using an extracted terrain model to perform path-planning. While not discussed in detail in this thesis, these engineering decisions strongly affect the development of the data transformations of Chapter 4.

3 In this way the first class corresponds mainly to assumptions, while the second to approximations, though this particular distinction is somewhat arbitrary as discussed in the text.

Engineering judgements of the second class, however, relate primarily to the internal char-
acteristics of the data and how it should be maintained within a representation. For this
reason, the most important of these assumptions and approximations correspond to the op-
erations performed within the Summary of Information part of the model. The particular
approximations of interest in this part include: whether the domain of interest is discrete
or continuous, structured or unstructured; how the stored data is approximated; if it is
compressed; whether (and how) the probabilistic interactions between the stored quantities
are retained; and whether some data sources are retained within the data at all. These
issues are addressed in Section 3.6.

1.3.3 Engineering Analysis vs. Machine Learning Methods

As the focus of this thesis is on understanding the structure of the sensing, representa-
tion and interpretation processes in an engineering context, it is contended that approaches
which allow the application of rigourously-tested and well-established techniques in each of
the identified areas of Section 1.3.2 should be preferred to heuristic or special-case tech-
niques. This assertion is based on the desire to enable the engineer to develop systems of
high-integrity and high-reliability, requiring an understanding of not only the operational
characteristics of the system, but also its modes of failure. It is obvious, however, that the
complexity and subtlety of both the physical world and the sensory data obtained from it
make it difficult to employ ‘classical’ physical models to handle all situations, most notably
with respect to the interpretation of the sensory data for the achievement of a particular
goal.

In fact, there will be many situations in which it is necessary and advantageous to utilise the
computational and interpretative power of machine learning techniques, even though many

of these may not permit rigourous analysis of their reliability or failure modes. For exam-
ple, Support Vector Machines (SVMs) allow complex discriminative tasks to be completed
with very low computational cost and correspond to a high-dimensional generalisation of
threshold selection in one-dimension; however, the discriminative boundary is uniquely tied
to the training data used in its construction and deviations from the behaviour exhibited in
the training set will not be captured (Hearst et al. 1998). The feasibility of the technique
relies on the utilisation of a sufficiently rich training set and unless the scope of applica-
tion of the technique is carefully controlled, it can be very difficult to justify or validate
the reliability of the resulting system. Critically, when this and other machine learning
techniques are used in well-defined and focussed applications, their performance can often
be validated or well-justified. It is contended that it is only when the entire scope of the
sensing, representation and interpretation problem is considered to be a feasible application
of a ‘raw input-to-goal’ learning technique that it becomes difficult to justify and validate
the performance of the system, making such approaches problematic for the engineer.

Furthermore, in the context of the new model, learning techniques can be broadly in-
terpreted as procedures for discovering and making use of complex, usually non-linear,
transformations from one space to another. That is, starting from an arbitrary space (for
example the space spanning input images from a colour CCD camera) a transformation is
developed by way of a series of examples which can subsequently be used to transform new
observations from the input space into the transformed output (a binary decision, for exam-
ple). This interpretation of these techniques enables the methods in question to be directly
employed in the proposed model, and as discussed in Chapter 4, these techniques form a
highly versatile and effective toolkit for finding and employing the data transformation part
of the model.

1.4 Performance Measurement

While the systematic analysis of the structure of the model and the identification of the role
and nature of the assumptions and approximations necessary for construction of a viable
practical system provides the engineer with a qualitative framework for system design, an
important characteristic of the structure of the new model is that it supports the quantitative
assessment of the performance of the system and its sub-components. It is clear that the

ability to quantify the effects of the approximations identified in Chapters 3 and 4 on


the performance of the model corresponds directly with the requirement due to Middleton
(1996, p773) for the quantitative comparison of optimum and sub-optimum systems, but
the notions of performance measurement enabled by the model presented in this thesis are
more general than the calculation of approximation effects in a representation or expected
risk in a decision-making system. In fact, there are three distinct types of performance
measure which are considered in this thesis: approximative effects, off-line (design-time)
component performance, and the online contribution of sources to an application.

1.4.1 Approximative Effects

In the first situation, performance can be interpreted as a measurement of the impact of


the assumptions and approximations of Section 1.3.2 on the operation of the system and
allows the comparison of different approaches. As discussed earlier, two criteria must be
met in order to provide this capability: firstly, there must be some notion of an ‘ideal’
or ‘baseline’ approach; and secondly there must be some appropriate quantity which can
be calculated in order to compare other approaches to the reference. For example, the
effect of an approximation to a function such as using a piecewise-linear or cubic-spline can
be quantified using a measurement of the squared-error between the approximation and
the true function. In such a case it is possible to compare the squared errors for the two
approximations for a variety of ‘typical’ functions and thereby to select the most appropriate
approximation for the given application.
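As a concrete illustration of the squared-error comparison described above, the following minimal sketch measures a piecewise-linear approximation against an arbitrary ‘typical’ function; the test function and knot placement are assumptions chosen for illustration only.

import numpy as np

# Squared-error assessment of a piecewise-linear approximation against a 'true'
# function; the test function and the number of knots are arbitrary choices.
x = np.linspace(0.0, 2.0 * np.pi, 1000)
true_f = np.sin(x)

knots = np.linspace(0.0, 2.0 * np.pi, 8)          # coarse piecewise-linear fit
approx_f = np.interp(x, knots, np.sin(knots))

mean_squared_error = np.mean((true_f - approx_f) ** 2)
print(f"mean squared error: {mean_squared_error:.5f}")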

1.4.2 Off-line Components

Importantly, however, the approach shown in Figure 1.3 represents a systematic approach
to the management of the flow of the data from the sensors to the applications in which
parallel streams act on subsets of the available information. That is, each of the sensors
represents a separate operation adding new information to the summary of information.
Likewise, sequential operations can be completed directly on the summary of information
and a series of parallel transformation streams provide the information necessary for each
application. This immediately suggests that the performance of each of these operations can
be considered in addition to the effects of approximations and assumptions. Specifically, as

will be discussed in Section 4.5, the amount of information transferred through the sensor
model and data-transformation operations can be used to quantify the performance of
that transformation.

1.4.3 On-line Source Contributions

In addition to the design-time assessment of the quality and performance of the system
components, it becomes possible to quantify the online performance of the overall system.
Significantly, the ability to quantify the contribution of a particular information source
(sensory or otherwise) to a decision represents a substantial change in the state-of-the-art
for autonomous systems as it is no longer sufficient to consider that an application can
work; how well it is working at this instant must also be measurable. For example, in
an AUGV system it is highly beneficial if a non-functioning, damaged or obscured sensor
can be recognised as such, so that the overall system reliability is improved. Consider
that identifying when a camera has a cracked or damaged lens is very difficult unless the
contribution of that sensor to an application can be measured. A key advantage of the
new model is that the explicit recognition of the structure and ‘stream-like’ nature of the
entire system allows the contribution of any given sensor to be calculated and measured
with respect to any individual downstream task. This is examined in detail in Section 4.5.4.

1.4.4 Situational Dependencies of Measure Quantities

As the example of measuring the ‘performance’ of a function approximation shows, the


selection of the quantity used to measure the performance of a given operation depends on
both the characteristics of the operation and the given application. Chapter 2 introduces
a single framework for the development of quantities for making these measurements and,
significantly, shows that the selection of a particular measure is effectively arbitrary and
that the utilisation of one over some other can ultimately only be justified through the
demonstrated usefulness of the measure in the given situation. This is clearly displayed by
the differences between the performance of a terrain estimation representation as measured
using a ‘squared-error’4 quantity and the performance of a transformation measured by the
information-transfer ratio, which represents the degree to which the information contained in the signal (with respect to a reference) is preserved across the transformation.

4 Formally, the Euclidean distance between the approximation and the ‘true’ surface treated as multi-dimensional functions. See Section 2.4.4 for more detail.

1.5 Contributions of this Thesis

The primary contribution of this thesis is the introduction of a comprehensive new model
for the interpretation of the sensing, representation and reasoning tasks utilising
a new function-space interpretation. Specifically, it develops viable functional forms and
the utilisation of transformational models for sensor-modelling and reasoning operations.
This approach possesses the added advantage of being able to quantitatively assess the
performance of the components of the system.

In achieving the matters described above it also makes several specific contributions relating
to the new interpretation of the problem. These contributions are:

1. The development of a function-based measure-theoretic framework for reinterpreting


the concepts associated with sensing and reasoning operations (Ch.2);

2. The identification of a formal equivalency between entropy and mutual information


quantities as expectations of ‘event’ defined properties (§2.5.3–2.5.6);

3. The identification of the impact that this interpretation has on the existence of mutual
information and entropy-like quantities for systems of three or more random variables
(§2.5.7);

4. A comprehensive examination of the relationships between these quantities including


a re-interpretation of Venn diagram depictions (§2.6);

5. The introduction of a novel vector-space interpretation of the entropy-like quantities


which can be defined for any number of random variables, and a demonstration that
this interpretation encompasses previous approaches as special cases (§2.6.3, §2.7);

6. The identification of the ‘statistical independence’ distance as a true distance measure


in the information space for discrete systems (§2.8.2);

7. An examination of a broad variety of distance and divergence measures using func-


tional interpretations, including a novel closed-form solution for calculating the Euclidean distance between Gaussian distributions (App.D);

8. The definition of the sensing and representation problem as the estimation of a func-
tional form rather than as a state-based scheme, including defining and interpreting
both the functional domain and the tensor-space corresponding to the range of the
developed functions (Ch.3);

9. The explicit identification of the equivalency between the estimation of the parameters
defining a functional form and the estimation of the pure functional form itself. This
provides an important link between the new interpretations and traditional approaches
(§3.3);

10. The analysis of the tractability of the ideal functional estimation scheme and the engi-
neering of appropriate approximations to achieve practical implementations, including
the identification of the ability to quantify these effects (§3.5–3.6);

11. The recognition of reasoning operations as the application of a general transformation


of functional forms where the transformations encode the biases which the operation
applies to the information contained in the original function (Ch. 4);

12. The identification of the general formulation of these transformations and the devel-
opment of several important simplified forms (§4.3);

13. The identification of performance measures suitable for the examination of the impact
and operation of the transformations associated with sensor modelling, data abstrac-
tion and reasoning (§4.5); and

14. The development of a practical computational framework which embodies the new
model and the examination of how this framework maps to several practical examples
(Ch. 5).

1.6 Thesis Structure

This thesis focusses on the interpretation of the sensing and reasoning problem using the
sensory model proposed in Section 1.3 with particular emphasis on the express identifica-

tion of the assumptions, approximations and sub-optimal approaches required to generate


reliable, practical systems. In many cases the effects of these engineering decisions on the
performance of the resulting system can be quantitatively measured with respect to an
‘ideal’ case.

Significantly, this thesis does not include a distinct literature review as the breadth of the
discussion warrants the introduction of relevant material as required.

Also note that in the heavily theoretical sections, important results and conclusions are
highlighted in boxed summaries to aid the interpretation of the material.

Chapter 2 introduces the mathematical background required for the remaining parts of
the model. A framework is developed with particular focus on measures which correspond
to the deviation between two scalars, vectors, functions or distributions. The same frame-
work is used to re-interpret information-theoretic quantities including entropy and mutual
information and to compare the essential characteristics of several commonly used measures.

Chapter 3 describes the sensing and representation phase of the proposed model and
corresponds to a mathematical model for describing the processes of capturing, representing
and manipulating the sensory information to generate a sensor-centric representation. In
particular, it describes the function-estimation approach and examines the most important
assumptions and approximations which arise in that context.

Chapter 4 examines the interpretation of the data contained in such a sensor-centric model
as a process of abstraction and transformation. In particular, the mathematical forms
describing the process of converting from sensor-focussed to task-focussed representations
are examined in detail.

Chapter 5 outlines the development of a practical implementation of the model proposed


in this thesis and presents an engineering assessment of the decisions and approximations
made within that development. Focussing on the application of the model to an autonomous
uninhabited ground vehicle (AUGV), this practical demonstration outlines many of the
computational and practical implications of the theoretical model presented in Chapters
1-4.

Chapter 6 discusses how the high-level model presented in this thesis gives rise to several
important areas for future research and development, and examines the implications and
conclusions of this model.
Chapter 2

Measures, Distances and Information Theory

2.1 Introduction

The interpretative power of the model presented in this thesis lies in the explicit acknowl-
edgement of the necessity of approximations and simplifications in the development of viable
practical implementations. This requires the quantification of the effects of these approx-
imations in a consistent and comprehensive framework. In the most general sense, this
suggests the application of the theory of measures in the specific context of the ‘deviation’
of an approximation from an underlying mathematical entity such as a function or distri-
bution. Such deviations will necessarily be dependent on the particular context in which
the approximation is made and should ‘map’ the characteristic differences onto a quantity
which allows comparisons to be made between multiple approximations.

This chapter introduces a single framework for capturing many starkly different concepts
of deviation. Of particular interest are probabilistic measures, and the notions of ‘distance’
between vectors, functions and distributions. Many different distances can be defined in
this way, each measuring the proximity or deviation according to different aspects of the
objects and spaces being considered. The subtle but important distinction between measures
defined for functions and those defined for distributions will be considered in detail. Some
measures considered ‘information-theoretic’ can be shown to actually measure the structural
properties of the distributions rather than any explicit underlying statistical relationships.

2.2 Measure Theory

Formal measure theory is well beyond the scope of this thesis, though it is instructive to
consider the basic elements of the mathematics. A measure μ is defined as a mapping
which projects elements of a set S onto the real numbers1 , R, in a manner analogous to
the concept of a quantity such as length or volume. Many measures belong to the special
case of positive measures and map the elements to R+ , the positive real line. Formally, the
general theory requires the definition of a σ-algebra, Σ, on the set S and an appropriate
measure μ which together define a measure space as the triple (S, Σ, μ). A σ-algebra of a
set is defined as a family of subsets Si with Si ⊂ S which are closed under the application
of a countable number of set operations - such as complements, unions and intersections.
The measure μ then maps elements of the σ-algebra to the real line,

μ : Si → R for Si ∈ Σ. (2.1)

A measure has the following important properties:

μ(∅) = 0    (2.2)

μ(∪i Si) = Σi μ(Si)    (2.3)

where Equation 2.2 implies that the empty subset (∅) must have zero measure. Equation
2.3 holds if and only if the elements Si are disjoint subsets of S and in this case the measure
of the union is the sum of the measures of the individual elements; Σ is said to have the
property of σ-additivity. Many of the useful properties of a measure follow immediately
from these properties. This definition means that whenever it is possible to define a closed
family of subsets of S, it is also possible to generate arbitrary measures on that family.2
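As a small illustration of these properties, the following sketch checks Equations 2.2 and 2.3 for a discrete probability measure on the outcomes of a fair die; the particular measure is an arbitrary example chosen only for illustration.

from fractions import Fraction

# A measure on the outcomes of a fair die: each atomic outcome has measure 1/6.
mu_atom = {i: Fraction(1, 6) for i in range(1, 7)}

def mu(subset):
    """Measure of a subset of {1, ..., 6}; the empty set has zero measure (Eq. 2.2)."""
    return sum((mu_atom[i] for i in subset), Fraction(0))

evens, odds = {2, 4, 6}, {1, 3, 5}
assert mu(set()) == 0                               # Equation 2.2
assert mu(evens | odds) == mu(evens) + mu(odds)     # Equation 2.3 for disjoint subsets
print(mu(evens))                                    # 1/2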

A σ-algebra Σ is a family of subsets of an arbitrary set S and a measure μ maps the elements of Σ onto the real line.

1 Throughout this thesis the set of real numbers will be denoted R; the integers by Z; the positive integers by Z+; the non-negative integers by Z*; and the complex numbers by C.
2 See Examples C.2.1 and C.2.2 in Appendix C.

2.3 Deviations, Divergences and Distances

Each of the measures, μ, of the previous section is a function of a single element of Σ and is a
quantity which represents the magnitude of a ‘property’ of that entity such as the probability
of an event or the length of an interval. These quantities, however, have merit only in that
they allow comparisons to be made between multiple entities: P (x1 , y1 ) > P (x2 , y2 ) or
Length([1, 3]) = Length([5, 7]) for the two examples respectively. These comparisons can be
reinterpreted as a single measure which combines the evaluation of the deviation between
the two elements, for example the measure difference μΔ(Si, Sj) = μ(Si) − μ(Sj).

Formally, using the same definition of a set S and a σ-algebra as earlier, a two-variable
measure is defined: let Si , Sj ∈ Σ be two arbitrary elements and define the mapping
Δ(Si , Sj ): Σ × Σ → R+ in place of the one parameter measure μ(Si ) : Σ → R+ . This
function is intended to capture the ‘deviation’ between the two elements and it is desirable
if it satisfies:

Δ(Si , Sj ) ≥ 0 ∀Si , Sj ∈ Σ (2.4)

Δ(Si , Sj ) = 0 ⇐⇒ Si = Sj , (2.5)

where Equation 2.4 ensures that the resulting measure is a positive quantity for every
possible pair of subsets. Equation 2.5 provides the reasonable requirement that the deviation
between equal elements should be zero. The resulting measure is known as a divergence
and is a measure which is zero when the elements are identical and non-zero otherwise; in
this thesis a divergence is denoted by D(Si ‖ Sj) where i and j index the elements of Σ.

A measure which satisfies these properties allows the divergence between subsets Si and Sj ,
D(Si ‖ Sj), to be compared to the divergence D(Si ‖ Sk), but does not require that the rela-
tionship between these divergences bear any specific relationship to the divergence between
Sj and Sk . Practically, although elements can be ordered by their relative divergence from
a common element, the measure does not constrain the space itself and the relationships
between elements may change arbitrarily if the common element changes.

Figure 2.1: A schematic interpretation of the triangle inequality of Equation 2.7. Three
arbitrary elements from the σ-algebra are shown such that the length of the lines can be
interpreted as the distance measure.

The deviation can be further strengthened by imposing two additional constraints:

Δ(S1 , S2 ) = Δ(S2 , S1 ) (2.6)

Δ(S1 , S2 ) ≤ Δ(S1 , S3 ) + Δ(S3 , S2 ); (2.7)

in which case the measure will be a distance and is denoted by d(S1 , S2 ). Equation 2.6
requires that the measure treats both elements equivalently, in contrast to a divergence
where it is common to have a ‘favourable’ or ‘dominant’ element. For example, the Kullback-
Leibler divergence between two distributions S1 ⇔ P1 (x) and S2 ⇔ P2 (x) defined on

the same discrete alphabet x ∈ A is defined as DKL [P1 (x)P2 (x)] = i P1 (xi ) log PP12 (x i)
(xi )
(Mackay 2004, p34). Clearly the element P1 (x) has a different role to that of P2 (x).

The final equation, 2.7, is the triangle-inequality and is represented geometrically in Figure
2.1. This measure forces the elements of the σ-algebra to have a unique relationship to one
another, regardless of which two elements are being considered. In a geometric sense, if the
elements {Si } for which the distance is defined are interpreted as points in space, then the
distance measure will generate a unique configuration for those points - the structure of the
embedding will be fixed. The figure is a simple example of this where knowing the three
distances uniquely specifies the necessary relative configuration of the points.

This is important because it means that if two elements S1 and S3 are close, that is d(S1 , S3 )
is ‘small’, then the distances of these elements from a third, S2 , will be comparable, espe-
cially if d(S1 , S2 ) is large. Moreover the converse is also true: if the two elements are both
close to S2 , then they must be close to each other: the triangle inequality gives an upper-
bound for the divergence between two elements in terms of their separate distances from a
common element. This property does not hold for a divergence. With this final property

it is possible to ‘order’ the elements of the space using the distance, provided that this
ordering is correctly interpreted as the configuration of points in a multi-dimensional space.
It is only under rare circumstances that the distances may impose a configuration in which
all elements are collinear and they will be explicitly ‘ordered’.3

A Deviation Δ(Si, Sj) is a measure which maps two elements of the σ-algebra to a real quantity which corresponds to the degree to which the elements differ from one another.
A Divergence D(Si ‖ Sj) satisfies:
Δ(Si, Sj) ≥ 0 ∀ Si, Sj ∈ Σ    (2.4)
Δ(Si, Sj) = 0 ⇐⇒ Si = Sj.    (2.5)
A Distance d(Si, Sj) satisfies these and also:
Δ(S1, S2) = Δ(S2, S1)    (2.6)
Δ(S1, S2) ≤ Δ(S1, S3) + Δ(S3, S2).    (2.7)

Lastly, it is possible to construct an explicit link between deviations and measures defined
over the same elements. While not strictly required by either definition, it supports the
intuitive relationship between a distance and a ‘norm’ in the material which follows. Specif-
ically, it is possible to arbitrarily enforce the interpretation that the deviation between Si
and the empty set, ∅, is equal to the measure for the element, that is,

Δ(Si , ∅) = μ(Si ). (2.8)

2.4 Extending Basic Measure Theory

The theory introduced to this point defines the measure space in the most general way
possible in terms of the underlying set and the σ-algebra defined on it. This section in-
troduces the three specialisations of the theory which are of particular importance in this
model: the link between formal measure theory and the familiar inner product spaces will
be demonstrated; measures defined for n-tuples, particularly n-dimensional vectors; and
measures defined for continuous and infinite dimensional σ-algebras.
3 See Examples C.3.1, C.3.2 and C.3.3.

2.4.1 Coordinate Systems

Given the space S and the σ-algebra defining the subsets Si ⊂ S, it is possible to define a
‘coordinate system’ for the subsets so that each element Si is uniquely indexed by l integers
(or if the set is formally uncountable, by l real numbers). That is, the coordinate vector
si is defined by Si ⇔ si = {si1 , si2 , . . . , sik , . . . , sil }. Explicitly, such a coordinate system
defines the notion of l bases êk which differentiate elements of the set according to the kth
index. Under this notation it is possible to write the set elements as


si = Σk sik êk    (2.9)

where the summation is interpreted in the space of the set S and in the sense of the basis
vectors êk .

This means that the basis vectors êk can be interpreted in two distinct ways: firstly as a
characteristic property of the subsets Si capable of uniquely identifying each subset; and
secondly as a vector in the space defined by the indices sik . Examples 2.4.1 and 2.4.2
demonstrate how this generates the two most common discrete and continuous coordinate
systems.4

A coordinate system is a set of (possibly continuous) indices {sik} which uniquely specify the elements of the set S. Each coordinate is defined along with a basis element - an ‘elementary’ member of S defining the characteristic property which varies with the value of that particular index. This relationship can be written as

si = Σk sik êk    (2.9)

Example 2.4.1 – Discrete Coordinate System


Consider the set S² = {∅, S1, S2, . . . , S6}² from the example of throwing two dice simulta-
neously. There are 36 elements (Si ) in the σ-algebra defining pairs of outcomes, but the
structure suggests that these can be indexed by two integers, i1 and i2 . If ê1 and ê2 are
identified with die 1 and 2 respectively, then the outcomes can be equivalently represented
4 See Example C.4.1.

as si = {si1 , si2 } with si1 , si2 ∈ {1, 2, 3, 4, 5, 6}, ê1 = {1, 0} and ê2 = {0, 1}. Thus, the mea-
sure can be defined on a two dimensional discrete vector and gives rise to the traditional
representation of the outcome of rolling two dice, P (si1 , si2 ), with si1 , si2 ∈ {1, . . . , 6}.
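A brief sketch of this coordinate representation, enumerating the 36 coordinate pairs and confirming that the (assumed fair-dice) probability measure sums to unity:

from itertools import product

# Example 2.4.1: index each outcome of two dice by the coordinate pair (si1, si2).
outcomes = list(product(range(1, 7), repeat=2))   # the 36 coordinate pairs
P = {s: 1 / 36 for s in outcomes}                 # uniform measure for fair dice

assert len(outcomes) == 36
assert abs(sum(P.values()) - 1.0) < 1e-12         # the measure of the full set is unity
print(P[(3, 5)])                                  # probability of the outcome pair (3, 5)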

Example 2.4.2 – Continuous Coordinate System


Alternatively, consider the case where S = Rl and Σ = {{∅}, {x1 , . . . , xl }}, the set of
all points within the space. As the underlying set Rl is formally uncountable, rather than
indexing each element using integers, they are indexed using the elements of the basic set
S, in this case real numbers. Identifying that it is possible to express an element Si as
x = {x1 , . . . , xl }, where si is replaced by x without a subscript to clearly designate the
distinction of this case from a countable set, it remains to select the basis vectors êk . The
preferred selection is clearly the orthogonal set êk = {0, . . . , 0, 1, 0, . . . , 0} where the 1 is in
the kth position, but there is nothing to prevent the selection of a non-orthogonal set, such
as êk = {0, . . . , 0, 1, 0, . . . , 0, 1} where both the kth and final positions are non-zero.

2.4.2 Measures and Inner Products

Using the basis vectors êk and the indices sik as defined here for k ∈ [1, l] ⊂ Z+ it is possible
to express the resulting measure as a function of the indices directly,

μ(Si ) ⇔ μ(si ) (2.10)

where the function on the right hand side operates on the l elements of the vector si instead
of the single set element Si . Now, in general, the form of μ(si ) will be non-trivial, though
there are several important situations in which the structure of the set and the measure
give rise to simplifications.

The most important of these arises when a measure is finite on a discrete space (that is,
μ(si ) < ∞ ∀ si ∈ Σ) or, when defined on a continuous space, is differentiable everywhere.
In this case the value of the measure for any particular element si can be obtained from
the value of any other measure sj , notably an element with zero measure, by summation


Figure 2.2: Multiple Paths between the origin and element si . Provided the space is differ-
entiable (or finite for discrete sets) the value of the measure at si can be obtained by the
integral of the derivative of the measure along any path to the point.

(integration) of the difference (derivative) over any path joining the elements of a discrete
(continuous) set. As the measure is continuous on the space, guaranteed if the triangle-
inequality is satisfied (Doob 1994, p34), then the actual path traversed is irrelevant and
it is reasonable to consider the path which follows each of the coordinate axes in turn, as
shown in Figure 2.2, and the value of the measure at si is

μ(si) = ∫s′ dμ(s′)    (2.11)
      = Σk ∫s′ [∂μ(s′)/∂sk] dsk    (2.12)

where in the first equation the integral is performed along all points s′ along the path and dμ(s′) is the total differential. Equation 2.12 follows immediately from the definition of the
total differential (Stewart 1995, p788).

If it is further assumed that the measure is flat, that is, the gradient in each dimension is ∂μ(s′)/∂sk = Ck where Ck is a constant that does not depend on s′, then Equation 2.12 reduces to

μ(si) = Σk Ck ∫s′ dsk    (2.13)
      = Σk Ck sik.    (2.14)

It is possible, in addition, to extend this analysis to measures which are functions of two
elements si and sj , in the same way that Section 2.3 defined the notion of a distance
measure. A special case of this function involves defining the measure to be the product
of the measures of Equation 2.11 applied separately to each element. This function can be
interpreted as a generalisation of the inner product ⟨si, sj⟩ and is given by

⟨si, sj⟩ = μ(si) × μ(sj)
        = [Σk ∫s′i (∂μ(s′i)/∂sk) dsk] × [Σl ∫s′j (∂μ(s′j)/∂sl) dsl]    (2.15)
        = Σk Σl ∫s′i ∫s′j (∂μ(s′i)/∂sk) (∂μ(s′j)/∂sl) dsk dsl.    (2.16)

Importantly, this expression is bilinear in si and sj so that it is possible to write

⟨αsi + βsj, γsk⟩ = α⟨si, γsk⟩ + β⟨sj, γsk⟩
                 = αγ⟨si, sk⟩ + βγ⟨sj, sk⟩.    (2.17)

For a ‘flat’ measure the two partial derivatives are constants; moreover, they do not depend
on the elements si or sj . Equation 2.16 becomes


⟨si, sj⟩ = Σk Σl Ck Cl sik sjl
        = Σk Σl μkl sik sjl    (2.18)

where μkl is the constant product of Ck and Cl . Equation 2.18 is the standard definition of
the inner product and μkl is the metric tensor for the space on which it is defined. Appendix
A includes a discussion of the alternative interpretation of inner products and measures in
the context of Tensor Geometry. This definition further highlights the interpretation of the
inner product as a bilinear measure: S × S → R.5

5 See Examples C.4.2 and C.4.3.

The ‘standard’ inner product can be written

⟨si, sj⟩ = Σk Σl μkl sik sjl    (2.18)

where sik is the kth component of si and μkl is the k-lth element of the metric tensor for that space. This equation is valid if:

1. the coordinate system defines a ‘flat’ space, that is ∂μ(s)/∂sk is constant for each of the k components; and

2. the measure is defined on a continuous set and is differentiable everywhere, or is defined on a discrete set and is finite everywhere.

The first condition can be relaxed to define the ‘generalised’ inner product as

⟨si, sj⟩ = Σk Σl ∫s′i ∫s′j (∂μ(s′i)/∂sk) (∂μ(s′j)/∂sl) dsk dsl.    (2.16)
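A minimal numerical sketch of Equation 2.18; the vectors and the non-orthonormal metric tensor below are arbitrary examples, with the identity metric recovering the familiar dot product.

import numpy as np

def inner_product(s_i, s_j, metric):
    """<s_i, s_j> = sum_k sum_l mu_kl s_ik s_jl  (Equation 2.18)."""
    return float(s_i @ metric @ s_j)

s_i = np.array([1.0, 2.0])
s_j = np.array([3.0, -1.0])

identity = np.eye(2)                    # orthonormal coordinates (flat, unit metric)
skewed = np.array([[2.0, 0.5],          # an arbitrary non-orthonormal metric tensor
                   [0.5, 1.0]])

print(inner_product(s_i, s_j, identity))   # 1.0, the familiar dot product
print(inner_product(s_i, s_j, skewed))     # 6.5, the same vectors under another metric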

2.4.3 Norms and Distances

When the space S is closed under addition (and subtraction) then it makes sense to iden-
tify with two arbitrary elements Si and Sj two new elements S+ (i, j) = Si + Sj and
S− (i, j) = Si − Sj both of which are members of S. For example, in the set Rn the
sum or difference between any two elements is also in the set; however, the set S2 represent-
ing the throws of two dice does not have the same concept of addition for elements of the
set; there is no meaning to {1, 4} + {6, 3}. When these operations are meaningful then the
measures associated with the two elements S+ (i, j) and S− (i, j) have special meaning. The
first represents the measure of the ‘sum’ and is also known as the ‘norm’6 and is denoted
s+ (i, j). Likewise, the second is the measure of the ‘difference’ and is a particular type of
deviation measure, as introduced in Section 2.3 and can be treated interchangeably under
the condition that subtraction makes sense on S.

There is a special case in which the sum and difference operations can be written in terms
6 A more typical definition of norm is “a mathematical quantity that in some (possibly abstract) sense describes the length, size or extent of the object” (Weisstein 1999c), which is almost identical to the definition of the measure introduced earlier. For this reason norm is taken in the more restricted sense in this work.

of the coordinates:

S+ (i, j) = Si + Sj

⇒ s+ (i, j) = si + sj (2.19)

and similarly for the subtraction. In these cases, it is possible to write the resulting measures
as:

μ[s+ (i, j)] = μ(si + sj ) (2.20)

μ[s− (i, j)] = μ(si − sj ). (2.21)

Letting s+ = si + sj , the inner product becomes

⟨s+, s+⟩ = ⟨si + sj, si + sj⟩
         = ⟨si, si⟩ + ⟨sj, sj⟩ + 2⟨si, sj⟩.    (2.22)

If the measure is flat, then Equation 2.18 yields

⟨si + sj, si + sj⟩ = Σk Σl μkl sik sil + Σk Σl μkl sjk sjl + 2 Σk Σl μkl sik sjl.    (2.23)

In the special case where the elements μkl are zero when k ≠ l and unity otherwise (so that the coordinate system is orthonormal) then the equation becomes s+(i, j)² = Σk s+k(i, j)². By convention the units of the norm are required to be the same as those of the original measure μ of Equation 2.11 and this is achieved by taking the square root of this particular equation, yielding the L2 norm (Weisstein 1999d)

s+(i, j) = √[ Σk s+k(i, j)² ].    (2.24)

Similar examples include the L1 , Ln and L∞ norms (Weisstein 1999d):


L1[s+(i, j)] = Σk |s+k(i, j)|    (2.25)
Ln[s+(i, j)] = ( Σk |s+k(i, j)|ⁿ )^(1/n)    (2.26)
L∞[s+(i, j)] = maxk |s+k(i, j)|    (2.27)

where the absolute values are required to preserve the positivity7 of the measures and the nth root is taken to ensure the units are the same as for the ‘base’ measure.
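A small sketch evaluating the norms of Equations 2.24-2.27 for the difference between two arbitrary example vectors:

import numpy as np

# The L1, L2, Ln and L-infinity norms (Equations 2.25, 2.24, 2.26 and 2.27) of a
# difference element; the two vectors are arbitrary examples.
s_i = np.array([1.0, 4.0, -2.0])
s_j = np.array([3.0, 1.0, 0.0])
diff = s_i - s_j

l1 = np.sum(np.abs(diff))                          # 7.0
l2 = np.sqrt(np.sum(diff ** 2))                    # ~4.12
l3 = np.sum(np.abs(diff) ** 3) ** (1.0 / 3.0)      # ~3.50 (the general Ln case, n = 3)
linf = np.max(np.abs(diff))                        # 3.0

print(l1, l2, l3, linf)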

In fact, the nth root is required so that the resulting measure μ(si − sj ) satisfies the triangle
inequality of Equation 2.7 and ensures the continuity of the resulting deviation. Further-
more, if μ(0) = 0 then this deviation measure is zero when si = sj and also satisfies the positivity requirement of Equation 2.4. Lastly, the absolute values ensure symmetry in the deviation
and this special case is a distance measure.

Importantly, these are special cases corresponding to situations in which the spaces are
orthonormal. However, more complicated spaces can be considered using more general
forms of μjk , or by using the version of Equation 2.16 directly in spaces with varying partial
derivatives (Heinbockel 2001, §1.3).8

For a space closed under appropriately defined ‘addition’ and ‘subtraction’ operations,
define the ‘sum’ and ‘difference’ elements as S+ (i, j) = Si + Sj and S− (i, j) = Si − Sj .
The ‘norm’ and ‘difference’ are defined as:
μ[s+ (i, j)] = μ(si + sj ) (2.20)
μ[s− (i, j)] = μ(si − sj ) (2.21)
respectively, and if the inner product can be written, then these become:
s+(i, j)² = ⟨s+, s+⟩
s−(i, j)² = ⟨s−, s−⟩.

Further examples can be written when the coordinate system is orthogonal and these
include the L2 , L1 and L∞ norms of Equations 2.24, 2.25 and 2.27.

7 The measure could also become complex if the summation resulted in a negative result.
8 See Example C.4.4.

2.4.4 Measures for Functions and Infinite Dimensional Systems

In Section 2.4.2 it was demonstrated that the measure of an element can be rewritten in
terms of an open path integral taken over the total differential of the measure, that is,
Equations 2.11 and 2.12 yield the measure for a differentiable measure on a continuous
space as


μ(si) = ∫s′ dμ(s′)    (2.11)

and

μ(si) = Σk ∫s′ [∂μ(s′)/∂sk] dsk    (2.12)

where the elements of S can be indexed by k coordinates and the measure is differentiable.
This formulation reveals the link between the measure definitions introduced earlier and
the differential geometric viewpoint. A representation of this case is shown in Figure 2.3 (a)
where the partial differentials are taken in the s direction and the total differential obtained
by summation over k. For completeness, consider the same expressions in the cases where
the set is discrete and when the coordinate system is infinitely-dimensional.

The analogous pair of equations for discrete S and finite dimensionality are found by re-
placing the integral by an equivalent summation and the total differential by a difference
equivalent:


μ(si) = Σs′ Δμ(s′)    (2.28)
      = Σk Σs′ [Δμ(s′)/Δsk] δsk    (2.29)

where Δμ(s′)/Δsk is the kth component of the difference, δsk is the difference in index sk and s′ indexes the elements along the path. This case is shown in Figure 2.3 (b), where once again
the ‘volume’ under the surface is calculated.

In these expressions it was assumed that the coordinate system was finite dimensional, that
is k ∈ [1, l] ⊂ Z+ with l < ∞. If the dimensionality becomes infinite, then the summation

(a) Continuous Set, Finite Dimension    (b) Discrete Set, Finite Dimension
(c) Discrete Set, Infinite Dimension    (d) Continuous Set, Infinite Dimension

Figure 2.3: A representation of the differential or difference of a measure showing the


qualitative difference between continuous and discrete sets, S, with finite or infinite dimen-
sionality. Each figure represents a single path, s , as defined in Section 2.4.2. The left axis
indexes the intermediate states si ∈ s of this path, the right axis indexes the components
of each element and the vertical scale is the differential or difference (for continuous and
discrete dimensions respectively) value of the measure. Each line in (a) therefore, repre-
sents the differential of the measure along the path for a given component. The resulting
measure can be interpreted as the ‘volume’ under the ‘derivative’ surfaces using summation
for discrete axes and integration for continuous ones.

over k in these expressions becomes an integral. Rewriting Equation 2.29 for this case yields


μ(si) = Σs′ Δμ[s′(k)]    (2.30)
      = ∫k Σs′ [Δμ[s′(k)]/Δs(k)] δs(k) dk    (2.31)

where [Δμ[s′(k)]/Δs(k)] δs(k) represents the partial difference between two points s′1 and s′2 on the
path, as a function of the ‘coordinate’ k. An example of this is shown in Figure 2.3c.

Lastly, for a continuous set with an infinite dimensional coordinate system, Equation 2.12
becomes

μ(si) = ∫k ∫s′i [∂μ[s′i(k)]/∂si(k)] dsi(k) dk.    (2.32)

With this change, the definition of the inner product of Equation 2.15 can be written as
     
⟨si, sj⟩ = [∫k ∫s′i (∂μ/∂si(k)) dsi(k) dk] × [∫l ∫s′j (∂μ/∂sj(l)) dsj(l) dl]    (2.33)
         = ∫k ∫l ∫s′i ∫s′j (∂μ/∂si(k)) (∂μ/∂sj(l)) dsi(k) dsj(l) dk dl    (2.34)

where the functional dependence of μ[s′(k)] on s′ has been dropped for notational simplicity. If the space is flat then the partial derivative is a function c(k) and is independent of s′;

∂μ[s′(k)]/∂s(k) = c(k).    (2.35)

Therefore, the analogy of Equation 2.18 becomes


 
⟨si, sj⟩ = ∫k ∫l μij(k, l) si(k) sj(l) dk dl.    (2.36)

It is also normally assumed that the coordinates of the two elements, k and l, are ‘aligned’ so that μij(k, l) = 0 if k ≠ l and the expression simplifies to

⟨si, sj⟩ = ∫k μij(k) si(k) sj(k) dk.    (2.37)

In the same way that Equation 2.37 is equivalent to the (L2 ) inner product for finite
dimensional vectors, this expression is the same as the typical interpretation of the inner

product for functions defined over a continuous space (Weisstein 1999b).9

When defined over the continuous space of functions, the inner product can be written as

⟨si, sj⟩ = ∫k μij(k) si(k) sj(k) dk    (2.37)

for the special case where the functions si and sj are ‘aligned’.
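A sketch of Equation 2.37 evaluated numerically for two example functions with a unit weight μij(k) = 1; the functions and the domain are assumptions chosen only for illustration.

import numpy as np

# Numerical evaluation of the functional inner product of Equation 2.37 with a
# unit weight, using a simple Riemann sum over a regular grid.
n = 2000
k = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
dk = 2.0 * np.pi / n
mu = np.ones_like(k)                     # mu_ij(k) = 1, an orthonormal weighting

s_i = np.sin(k)
s_j = np.cos(k)

print(np.sum(mu * s_i * s_i) * dk)       # ~3.14159: <sin, sin> on [0, 2*pi]
print(np.sum(mu * s_i * s_j) * dk)       # ~0.0    : sin and cos are orthogonal here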

2.5 Information Theoretic Measures

2.5.1 Probability Measures

This work focusses on the explicit application of Bayesian statistical analysis to the con-
sideration of the structure of the perceptual problem, that is, all quantities and elements
within the model are intrinsically probabilistic. The major advantage is that at no stage is
it required to provide a final interpretation of any quantity - each and every aspect can be
re-evaluated given the observation of new data. This section introduces the basic measures
which arise when considering the information theoretic underpinnings of this approach.

Consider a pair of random variables x and y representing (in the first instance) two different
experiments, X and Y. It is helpful to formally distinguish between the cases in which the measured values are discrete and those in which they are continuous. In the discrete case the variables take on
one of the values from the ‘alphabets’ A and B respectively. Using the notation of Section
2.2 the set S = {A, B}, where the obvious σ-algebra is ΣD = {{xi }, {yj }, {xi , yj }}, can be
identified with xi ∈ A, yj ∈ B, i ∈ [1, |A|] and j ∈ [1, |B|], where |.| denotes the cardinality
of the set.10 These two sets are depicted in Figure 2.4 as a Venn diagram in which the two
sets do not overlap.

Now, since only one event can occur at a time, the individual events x_i must be disjoint and by Equation 2.3 any measure applied to a union of atomic events x_i will be the sum of the measure of each outcome: μ[∪_{i′} x_{i′}] = Σ_{i′} μ[x_{i′}], where i′ indexes a subset of A. Likewise, μ[∪_{j′} y_{j′}] = Σ_{j′} μ[y_{j′}] for a subset {y_{j′}} ⊂ B and a particularly useful measure can be identified as the probability associated with each outcome, P_A(x_i) or P_B(y_j). This measure
9 See Examples C.4.5, C.4.6, C.4.7 and C.4.8.
10 The σ-algebra also implicitly includes the null event, ∅, and further unions, complements and intersections of these listed elements.


Figure 2.4: A Venn diagram showing two disjoint sets of events Sx = {x} and Sy = {y}.
Sx represents the set of all outcomes x of experiment X to measure the value of random
variable x, and Sy the outcomes y of experiment Y measuring y. The marginal probabilities
can be generated by mapping the elements of Sx (or Sy ) to the unit interval, and the joint
probability by a separate mapping of the pairs of events (x, y) to the same interval.

maps each element of the appropriate set to a member of the unit interval [0, 1] ⊂ R. Since probabilities should be interpretable for any union of atomic events x_i, their unions should also map to this same interval. This implies in particular that P_A[∪_i x_i], for {x_i} ≡ A, should also map to the interval, and since this exhausts the possible outcomes, this quantity is by convention made equal to unity, that is, Σ_i P_A(x_i) = 1. The probability measure also maps the joint events (x_i, y_j) to the unit interval in a similar manner to yield P_AB(x_i, y_j) with Σ_i Σ_j P_AB(x_i, y_j) = 1.

Discrete probability distributions are defined for random variables x and y which take on values from the discrete alphabets x_i ∈ A and y_j ∈ B respectively. The joint probability measure is P_AB(x_i, y_j) ∈ [0, 1] ⊂ R and can never exceed a value of unity. By convention the measure is normalised according to

    Σ_i Σ_j P_AB(x_i, y_j) = 1.   (2.38)

Likewise, in the continuous case consider the outcomes to be two vectors x̃ ∈ Rn1 and
ỹ ∈ Rn2 , which should not be confused with the random variables themselves, x and y. If
the space is identified as S ≡ Rn1 +n2 , then an obvious σ-algebra is ΣC = {{x̃}, {ỹ}, {x̃, ỹ}}.
In an entirely analogous manner to the discrete case, it is possible to identify first that all
events x̃ or ỹ are disjoint, and subsequently define the probability density measures PX (x̃),
PY (ỹ) and PXY (x̃, ỹ). Note that these are density measures, rather than probabilities

themselves and that this means that the density is not constrained to be in the interval
[0, 1], though it will be in R+ .

Again, by convention, the probability of the entire set should be unity: ∫ P_X(x̃) dx̃ = 1, ∫ P_Y(ỹ) dỹ = 1 and ∫∫ P_XY(x̃, ỹ) dx̃ dỹ = 1. The density P_X(x̃) should not be confused with the probability associated with any subset, x′ ⊂ S, of the domain, which will be given by P(x′) = ∫_{x′} P_X(x) dx and must fall within the range [0, 1]. Therefore, while the density
can have an instantaneous value greater than unity, the probability mass associated with
any particular subset of events (including the subset containing only a single point) can not
be greater than unity. It is important that the values taken on by continuous distributions
are only interpreted in the context of providing a density, rather than actual probabilities
which are the mass of the density function for a realisable subset of the continuous domain.

Continuous probability densities are defined for random variables x and y taking values
from the continuous domain as x̃ ∈ Rn1 and ỹ ∈ Rn2 and are written as PXY (x̃, ỹ). As
with the discrete case, the total probability mass is normalised as

    ∫∫ P_XY(x̃, ỹ) dx̃ dỹ = 1.   (2.39)

However, the density values are not constrained to the interval [0, 1] ⊂ R as before
and PXY (x̃, ỹ) ∈ [0, ∞) ⊂ R.

When the context applies equally well to both continuous and discrete distributions, the
outcomes will be denoted by x and the distribution by P (x, y) without subscripts; it is
assumed that the distribution is appropriately normalised.

The marginal and joint probability measures are further related by noting that the union of all events with a particular value of x and all values of y, {(x = x_0, y)}, forms a disjoint set, so the joint probability is P[∪_j (x_0, y_j)] = Σ_j P[x_0, y_j] ∈ [0, 1]. Now this value depends only on the value of x_0 and so, by convention, is made equal to the marginal probability.11 That is,

    P_A(x_i) = Σ_j P_AB(x_i, y_j)   (2.40)

    P_X(x̃) = ∫ P_XY(x̃, ỹ) dỹ.   (2.41)
11 See Example C.5.1.

The formulation of the probability as a measure defined on the σ-algebra of the underlying
set also extends to families of probability distributions depending on some arbitrary set of
parameters α. In this way, given a particular set of values for the parameters the resulting
functions define a probability distribution and it is possible to write the family as P (x)|α =
P (x|α) where the resulting measure is obtained on the combined set {X, A}. An important
example of this occurs when the parameters of the distribution can be identified with a
second random variable, y say. In this case it is possible to define the conditional probability
distribution P (x|y) according to

    P(x|y) ≜ P(x, y) / P(y)   (2.42)

which captures the fact that knowledge of the joint distribution P (x, y), already seen to
enable the determination of the marginal distributions P (x) and P (y), allows the generation
of the probability distribution of one random variable, given a particular value of the other.
Since this is a probability measure of x alone, albeit with a functional dependence on the
value of y, it must normalise as

    ∫ P(x|y) dx = 1   ∀ y   (2.43)

when the distribution is defined on a continuous set.

Marginal probability distributions P(x) and P(y) are defined as:

    P_A(x_i) = Σ_j P_AB(x_i, y_j)   (2.40)

    P_X(x̃) = ∫ P_XY(x̃, ỹ) dỹ   (2.41)

and are themselves normalised with a total probability mass of unity.

Conditional probability distributions correspond to parameterised families of distributions where the parameters are identified as a second random variable,

    P(x|y) ≜ P(x, y) / P(y)   (2.42)

and are normalised with respect to x, but not y.

(a) Conditional distribution P (v|i) (b) Joint distribution P (v, i)

Figure 2.5: The conditional and joint distributions for the deterministic relationship between
the current i and voltage v through a resistor of known value R from Example 2.5.1

Example 2.5.1 – Voltage and Current in Known Resistor


Let the current through and voltage across a resistor be denoted by i and v respectively. It
is known that these values are related by v = Ri where R is the resistance. If the value of
R is known exactly, then the conditional probability P (v|i) is given by

P (v|i) = δD (v − iR)

where δD (.) is the Dirac delta function, that is, there is a deterministic relationship between
the values of the current and voltage. The joint distribution will, therefore, be given by

    P(v, i) = P(v|i)P(i)
            = δ_D(v − iR) P(i)
            = { 0       if v ≠ iR
              { P(i)    otherwise.

For a resistance of R = 1 kΩ, if the current has a Gaussian distribution with mean ĩ = 5 mA and standard deviation σ_i = 1 mA, then the voltage must have mean ṽ = ĩR = 5 V and standard deviation σ_v = σ_i R = 1 V. The conditional distribution, P(v|i), is shown in Figure 2.5a and the resulting joint distribution, P(v, i), and marginal distributions, P(v) and P(i), are shown in Figure 2.5b.
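To make the deterministic mapping concrete, the following short sketch (hypothetical Python code with an assumed sample size and seed, not part of the thesis software) draws samples of the current from the stated Gaussian and propagates them through v = Ri; the sample mean and standard deviation of the voltage should come out close to 5 V and 1 V respectively.

```python
import numpy as np

# Illustrative values from Example 2.5.1: R = 1 kOhm and i ~ N(5 mA, (1 mA)^2).
# Working in mA and kOhm means v = R * i is directly in volts.
R = 1.0                                                    # kOhm, assumed known exactly
rng = np.random.default_rng(0)
i_samples = rng.normal(loc=5.0, scale=1.0, size=100_000)   # current samples in mA

v_samples = R * i_samples                                  # deterministic mapping v = R i

print(f"mean(v) = {v_samples.mean():.3f} V  (expected 5 V)")
print(f"std(v)  = {v_samples.std():.3f} V  (expected 1 V)")
```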


(a) Conditional distribution P (v|i) (b) Joint distribution P (v, i)

Figure 2.6: The conditional and joint distributions for the deterministic relationship between
the current i and voltage v through a random-valued resistor R from Example 2.5.2

Example 2.5.2 – Voltage and Current in a Real Resistor


If the resistor of Example 2.5.1 is not known exactly, but instead has a probability distri-
bution itself, P (R), and noting that P (v|i, R) = δD (v − iR), that is, the functional rela-
tionship of v on i and R remains deterministic, then it is possible to obtain the conditional
distribution P (v|i) according to

    P(v|i) = ∫ P(v|i, R) P(R) dR.   (2.44)

Figure 2.6(a) shows the resulting conditional distribution for a resistance which is uniformly
distributed in the range [0.995, 1.005] kΩ. The validity of the resulting conditional distribu-
tion can be verified by noting that the resistance marginal distribution is given by

    P(R) ≜ [U_H(R − R_min) − U_H(R − R_max)] / (R_max − R_min)   (2.45)

where UH (x) is the Heaviside function, defined according to UH (x) = 0 for x < 0 and
UH (x) = 1 for x > 0. 12 The Dirac delta function δD (x) is normally identified as the

12 Strictly, the Heaviside function is a ‘generalised function’ and is undefined at x = 0.

‘derivative’ of U_H(x) and the conditional distribution becomes

    P(v|i) = [1 / (R_max − R_min)] ∫_{R_min}^{R_max} δ_D(v − iR) dR
           = [1 / (R_max − R_min)] (−1/i) [U_H(v − iR)]_{R_min}^{R_max}
           = [−1 / (i(R_max − R_min))] [U_H(v − iR_max) − U_H(v − iR_min)]   (2.46)

and the integral over v should yield unity for all i if P(v|i) is properly formed

    ∫ P(v|i) dv = [−1 / (i(R_max − R_min))] [v]_{iR_max}^{iR_min}
                = 1   ∀ i.   (2.47)

Figure 2.6(b) shows the resulting joint distribution if i is distributed as in the previous
example i ∼ N (i; 5, 1)mA; also shown are the marginal distributions of both current and
voltage. These results are intuitive: for a given current, the uniform distribution of resistance produces a uniform distribution of voltages, and the width of that distribution increases with the current because of the linear relationship between the random variables.
Finally, note that the marginal distribution of the voltage is not Gaussian but has a longer
tail for higher values as a result of the same linear relationship.
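As a quick numerical cross-check of Equation 2.46 (again a hypothetical sketch with assumed sample sizes, not part of the thesis), one can fix the current, sample the uniformly distributed resistance and confirm that the resulting voltages are uniform over [iR_min, iR_max] with density 1/(i(R_max − R_min)), and hence that the conditional density integrates to one.

```python
import numpy as np

# Assumed values from Example 2.5.2: R ~ U(0.995, 1.005) kOhm, current fixed at i = 5 mA.
R_min, R_max = 0.995, 1.005      # kOhm
i = 5.0                          # mA, so v = i * R is in volts

rng = np.random.default_rng(1)
R_samples = rng.uniform(R_min, R_max, size=200_000)
v_samples = i * R_samples        # voltages, uniform on [i*R_min, i*R_max]

# Empirical density via a histogram; the predicted height is 1 / (i * (R_max - R_min)).
hist, edges = np.histogram(v_samples, bins=20, density=True)
predicted = 1.0 / (i * (R_max - R_min))
print(f"predicted density  : {predicted:.2f} per volt")
print(f"histogram densities: {hist.min():.2f} .. {hist.max():.2f} per volt")
print(f"total probability  : {np.sum(hist * np.diff(edges)):.3f}  (should be 1)")
```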

At this point it is critical to recognise that the joint probability, though closely related to
the marginal probability, does not simply represent the same measure applied to the union
of events xi ∪ yj . Fundamentally, the probability measures defined here are each a result
of a particular selection of the structure of the σ-algebra and the form of the measure on
that structure, notably the normalisation of the measure implicit in probability. Either or
both of these can be changed arbitrarily and a mathematically acceptable measure will still
be generated. In general, the selection of a measure is essentially arbitrary and ultimately
its usefulness to practical situations is the only viable criterion for deciding between them
(Jaynes 1996, Ch11).

2.5.2 Interpreting Probabilities

The interpretation of these distributions requires careful consideration as there are two
distinct approaches in the literature. A ‘frequentist’ considers the distribution to represent
the result of amassing many different independent measurements of a given experiment, that
is, the values which the distribution takes on are normalised frequencies. Alternatively, a
‘Bayesian’ interprets the distributions as an explicit representation of the combined evidence
about the value of a given random variable.

The most important and significant difference in these interpretations relates to the exis-
tence (or non-existence) of a ‘true’ value of the random variable. A frequentist interprets a
random variable as an accurately measurable quantity which inherently changes value in a
random fashion between measurements. Each experiment finds the actual value of a quan-
tity which changes constantly and the distribution captures the variability in the underlying
value itself. A Bayesian, however, assumes that the random variable represents a single,
true, but unknown value which is measured in a manner which may introduce uncertainty
as to the true value. Here the distribution provides a summary of the evidence relating to
the underlying value.

Consider an experiment involving the throwing of a weighted coin and assume a large num-
ber of experiments have already been performed, resulting in a distribution describing the
results, P{H,T } (xi ). The frequentist interprets this distribution in the sense that another toss
of the coin should conform to the known frequencies. Here the ‘weighted-ness’ of the coin
represents the expected frequencies of heads and tails for subsequent tosses. The Bayesian,
however, views the distribution as the direct evidence for the weighted-ness of the coin:
while additional experiments will improve the accuracy of the estimate, the distribution
needs no further experiment to have meaning.

The distinction becomes clearer in the case of a continuously variable quantity, such as the
measurement of a voltage where the measurement introduces some noise onto the ‘true’
value. The frequentist is able to calculate the probability that a subsequent instantiation
of the experiment will yield a given value, furthermore they can calculate properties of the
ensemble of measurements at hand. The Bayesian can also predict these subsequent values
of the measurement given the evidence, but additionally interprets the distribution itself
as evidence for a particular ‘true’ value of the voltage. This means that in addition to

noting the characteristics of the previously observed experiments, the Bayesian can con-
struct assumptions about the underlying process. The distribution, together with these
assumptions, can be used to make inferences about the underlying system (Mackay 2004,
pp25-26).

2.5.3 Shannon Information

The probability measures of Section 2.5.1 capture the distribution of uncertainty associated
with the possible values of one or more random variables, but consider now the situation
where subsequent measurements of the random variables are to be performed. Two ques-
tions arise in this context: given the distribution, how certain (or uncertain) is a particular
outcome; and how certain is the outcome of the entire experiment? Specifically, the first
question seeks a measure by which different possible outcomes of the measurements can
be compared in terms of their relative certainty of occurrence, while the second seeks a
measure capable of capturing the intrinsic uncertainty of the random variables taking into
account all possible outcomes.

An important alternative interpretation of these measures of ‘unexpectedness’ relates to the


utility of a measurement to the experimenter. Here the second measure relates to the whole
experiment and can be interpreted as the utility of actually performing the test. If the
outcome of an experiment is deterministic and known a priori then it is not necessary to
perform the experiment to know what value will be measured; conversely, if the distribution
favours no particular value, then it is most beneficial to perform the experiment. Under
this interpretation, the first measure of the unlikeliness of a particular event corresponds to
the degree to which the given observation challenges the assumed distribution.

Consider the joint event (x, y) obtained from measuring two random variables x and y.
The probability measures of the previous Sections would seem suitable candidates for the
expectedness of the particular event (the first measure): P (x) implies how expected the
event is with respect to x only; P (y) with respect to y; and P (x, y) with respect to both
random variables. Clearly an event with low probability corresponds to an unlikely event
and unlikely events challenge the assumed distribution most when observed.

A special case of great interest occurs when the random variables x and y are statistically
independent. In this case the joint probability reduces to the product of the marginal

distributions P (x, y) = P (x)P (y) (Papoulis & Pillai 2002, p175). Critically however, in the
case of dependent random variables, the joint probability can be either higher or lower than
the product of the marginals. Indeed, it can take on any value between zero and the smallest
of the marginal probabilities, as noted in Section 2.5.1. The statistical dependence between
the random variables can be determined, therefore, by comparison of the joint probabilities
to the product of the marginals.

While the joint probabilities do capture the necessary relationships, they do so with respect
to the product of the marginals. Intuitively, a measure which captures the same properties
as the probabilities, but which is measured with respect to the sum of the measures in the
case of independent random variables, would provide an alternative measure. Noting that
the joint probability captures all the essential characteristics of the random variables, it
would be reasonable that the desired measure should be a function of the joint probability
alone.13 Using the approach of Middleton (1996, pp292-6) an arbitrary measure of the
uncertainty of an outcome x is denoted by U(x) and will be related to the probability
associated with that particular outcome, that is, U (x) = U[P (x)]. Since events with lower
probability have a greater uncertainty, this function should be monotonically increasing for
decreasing P (x). A convenient formulation results if independent random variables x and
y are required to yield a measure of the form

U (x, y) = U (x) + U(y) (2.48)

= U [P (x)] + U[P (y)] (2.49)

= U[P (x, y)] (2.50)

and since in this case P (x, y) = P (x)P (y) this naturally suggests a new measure, the entropy
or hP (x), defined for a joint experiment by

    h_P(x, y) = log_b [1 / P(x, y)]   (2.51)

where the base b is an arbitrary constant which effectively sets the units of the resulting
values. Note that as P (x, y) → 0, hP (x, y) → ∞.

The behaviour of the entropy can be starkly different for discrete and continuous systems.
13 Note that the marginal probabilities can be obtained analytically from the joint distribution and so can be treated as functions of the joint probability.

Recall from Section 2.5 that discrete distributions and continuous densities have values in
the following ranges:

PA (.) ∈ [0, 1] ⊂R (2.52)

PX (.) ∈ [0, ∞) ⊂ R. (2.53)

Consider first the discrete case, and note that Equation 2.52 ensures that the fraction in
Equation 2.51 is greater than or equal to one. Hence

hPA (xi , yj ) ≥ 0 ∀ PA (xi , yj ) (2.54)

and the discrete entropy measure is a positive measure for any outcome of any distribution.
This is not true for continuous densities where, despite being normalised according to


    ∫∫ P_XY(x̃, ỹ) dx̃ dỹ = 1,   (2.39)

the value is not constrained to the interval [0, 1], as indicated by Equation 2.53. This results
from the fact that arbitrarily ‘compressing’ the density onto a smaller and smaller domain
must result in the increase (without bound) of the instantaneous density value. This is
clearly seen with the Dirac delta function δD (x) which is often defined by starting from
a unit-area rectangular impulse of width d and height 1/d, and taking the limit as d → 0,
resulting in a generalised function with the ‘value’ of ∞ at x = 0. At all stages in this limit,
the area of the ‘function’ remains constant at unity.

Consider the transformation PX (x) → PX (kx) where k is a constant multiplier and the
result is a ‘compression’ of the density onto a smaller part of the integration domain.
Clearly, scaling both the integration domain and the domain of the function results in the
same value for the integral, that is,

    ∫ P_X(x) dx = 1   (2.55)

    ⇒ ∫ P_X[kx] d[kx] = 1.   (2.56)

If only the function is changed (while the integration domain remains the same), the result
2.5 Information Theoretic Measures 45

obtained is

    ∫ P_X(kx) dx = ∫ P_X(kx) (1/k) d(kx)
                 = (1/k) ∫ P_X(kx) d(kx)
                 = 1/k   (2.57)

so that a normalised, transformed distribution can be achieved using

PX (x) → kPX (kx). (2.58)

Indeed, as k → ∞ any distribution will approach the Dirac delta δD (x). This means
that the density values for a continuous system can be greater than unity, implying that
the instantaneous event entropy can be negative. In the limit of the Dirac delta, the
instantaneous values of the entropy are

⎨ log 1 = ∞ if x = 0
b 0
hδD (x) = (2.59)
⎩ log 0 = −∞ if x = 0.
b

This quantity is the Shannon information content of the outcome (x, y) and its viability as
a measure has been justified in many contexts (Cover & Thomas 1991, Mackay 2004, Jaynes
1996). This quantity will also be referred to as the ‘event’ entropy to distinguish it from the
‘distribution’ entropy of the next part. While it has been written in terms of the joint event
(x, y) this quantity is defined for any probability distribution P (.), specifically P (x), P (y),
P (x|y) and P (y|x). It is possible, therefore to extend the definition of the event entropy to
capture the equivalent measure for conditional and marginal distributions as

    h_P(x|y) = log_b [1 / P(x|y)]   (2.60)

and

    h_P(x) = log_b [1 / P(x)]   (2.61)
respectively, though either x or y can be replaced here by any number of terms and the
same expression holds for x = {x1 , . . . , xnx } and y = {y1 , . . . , yny }.

In Section 2.5.1 the marginal probability distributions were shown to be related to the joint

distribution by Equations 2.40 and 2.41;


    P_A(x_i) = Σ_j P_AB(x_i, y_j)   (2.40)

    P_X(x̃) = ∫ P_XY(x̃, ỹ) dỹ.   (2.41)

So that in the discrete case

PAB (xi , yj ) ≤ min {PA (xi ), PB (yj )} (2.62)


    ⇒ 1 / P_AB(x_i, y_j) ≥ 1 / min{P_A(x_i), P_B(y_j)}

    ⇒ log_b [1 / P_AB(x_i, y_j)] ≥ log_b [1 / min{P_A(x_i), P_B(y_j)}].   (2.63)

The joint and marginal entropies are therefore related by

hPAB (x, y) ≥ max {hPA (x), hPB (y)} . (2.64)

In the continuous case, however, Equation 2.41 implies that it is possible for P_XY(x, y) to be greater than either P_X(x) or P_Y(y). Indeed, if P_XY(x, y) ≜ f(x)δ_D(x − y), where f(x) is normalised and finite everywhere, then P_X(x) = P_Y(y) and the instantaneous value of P_XY(x, y) is infinite when x = y and zero otherwise. This means that h_{P_XY}(x, y) → −∞ with h_{P_X}(x) and h_{P_Y}(y) finite, so that in this case there is no lower limit.

In addition to this, for both discrete and continuous systems it is possible to have P(x, y) = 0 with P(x) ≠ 0 and P(y) ≠ 0, so that there can be no upper limit to the event entropy in terms of the marginal entropies.14

14 See Examples C.5.2, C.5.3 and C.5.4.

The event entropy (or Shannon Information) h_P(x, y) is a measure obtained from the probability measure P(x, y) to satisfy the following properties:

1. h_P(x, y) should capture the ‘unexpectedness’ of an event, that is, as P(x, y) → 0, h_P(x, y) → ∞; and

2. for independent random variables x and y it should be additive: h_P(x, y) = h_P(x) + h_P(y).

The event entropy is defined as

    h_P(x, y) = log_b [1 / P(x, y)].   (2.51)

The event entropy has different limits for discrete and continuous cases:

    h_{P_AB}(x_i, y_j) ≥ 0   ∀ P_AB(x_i, y_j)   (2.65)

    h_{P_XY}(x̃, ỹ) ∈ R.   (2.66)

In addition to the joint distribution, it is possible to define the marginal and conditional entropies as

    h_P(x|y) = log_b [1 / P(x|y)]   (2.60)

    h_P(x) = log_b [1 / P(x)].   (2.61)

For the discrete case, the joint entropy is related to the marginal entropies by

    h_{P_AB}(x, y) ≥ max {h_{P_A}(x), h_{P_B}(y)}.   (2.64)
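As an illustrative sketch (hypothetical Python code, not from the thesis), the event entropy of Equation 2.51 can be evaluated directly from a probability value, and its additivity for independent events checked numerically:

```python
import numpy as np

def event_entropy(p, base=np.e):
    """Event entropy (Shannon information) h = log_b(1/p) of an outcome with probability p."""
    return np.log(1.0 / p) / np.log(base)

# Two independent outcomes: the joint probability is the product of the marginals,
# so the event entropies should add (Equations 2.48 and 2.51).
p_x, p_y = 0.2, 0.5
h_x = event_entropy(p_x)
h_y = event_entropy(p_y)
h_xy = event_entropy(p_x * p_y)
print(f"h(x) + h(y) = {h_x + h_y:.4f} nats,  h(x, y) = {h_xy:.4f} nats")
```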

2.5.4 Distribution Entropy

Given that event entropy of Equation 2.51 measures the ‘unexpectedness’ of a particular
outcome, the second problem of how to measure the relative uncertainty inherent in the
experiment {X, Y } should be closely related to this quantity. If the true outcome is dis-
tributed according to P (x, y), then while unlikely events are highly informative, they are
comparatively unlikely to occur when performing the experiment. It was noted in Section
2.5.1 that the random variables are defined in this work so that they take on a single value

from a mutually exclusive set, so that the elements of the set {x} are disjoint. Likewise, the elements of {y} are disjoint and so are those of {(x, y)}. This means that the measure can be applied to a union of these events, ∪_{i′,j′}(x_{i′}, y_{j′}), where i′ and j′ index a subset of S. Because they are disjoint, the measure is given by

    h_P[∪_{i′,j′}(x_{i′}, y_{j′})] = Σ_{i′,j′} h_P[(x_{i′}, y_{j′})].   (2.67)

Specifically, this means that it is possible to define the expectation operator for a measure,
Exy {μ(x, y)}, defined on a discrete space as


    E_xy{μ(x_i, y_j)} = Σ_{i,j} P_AB(x_i, y_j) μ(x_i, y_j)   (2.68)

and for a continuous space

    E_xy{μ(x, y)} = ∫∫ P_XY(x, y) μ(x, y) dx dy   (2.69)

and because of Equation 2.67 this represents a (weighted) measure on the same space as
the original measure.

The expectation operators for measures μ(x, y) defined on a space along with a probability measure P(x, y) are defined for discrete and continuous systems as

    E_xy{μ(x_i, y_j)} = Σ_{i,j} P_AB(x_i, y_j) μ(x_i, y_j)   (2.68)

    E_xy{μ(x, y)} = ∫∫ P_XY(x, y) μ(x, y) dx dy.   (2.69)

The ‘distribution’ entropy of the experiment X is H_P(X), defined as the expected value of the event entropy; for a discrete distribution it is given by

    H_{P_A}(X) = E_x{h_{P_A}(x_i)}   (2.70)
              = E_x{log_b [1 / P_A(x_i)]}
              = Σ_i P_A(x_i) log_b [1 / P_A(x_i)].   (2.71)


Figure 2.7: The approximation of a continuous probability density function P_X(x) by a discrete set P_A(x_i) leads to a divergence in the calculation of the distribution entropy H_P(X).

For the continuous case, it becomes

    H_{P_X}(X) = E_x{h_{P_X}(x)}   (2.72)
              = E_x{log_b [1 / P_X(x)]}
              = ∫_{−∞}^{∞} P_X(x) log_b [1 / P_X(x)] dx.   (2.73)

The discrete distribution entropy is commonly referred to as the ‘Shannon’ entropy and the
continuous as the ‘Boltzmann’ (or ‘differential’) entropy, and while they are defined in an
almost identical manner they are not interchangeable.

In particular, it is tempting to assume that one can derive the continuous form of the
entropy as the limiting case for the discrete case, but it can be shown that the discrete
entropy HA (X) diverges in the limit. To demonstrate this, let the discrete states be given
a constant ‘width’ d, as in figure 2.7. It can readily be shown that

    lim_{d→0} H_A(X) = H_X(X) − log_b(d)   (2.74)

and that lim_{d→0} log_b(d) → −∞. This divergence means that it is not possible to directly
compare entropies measured from continuous pdf’s and a discrete approximation of that

Figure 2.8: The contributions of different values of probability to distribution entropy; the curve shows P(x, y) log_b [1 / P(x, y)] against P(x, y) and clearly demonstrates the change of sign at P(x, y) = 1.

pdf.15
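The divergence of Equation 2.74 is easy to see numerically. The following sketch (hypothetical code, for illustration only) discretises a unit-variance Gaussian with bin width d and compares the resulting Shannon entropy with the differential entropy ½ ln(2πe) ≈ 1.419 nats; the difference should track −ln d as d shrinks.

```python
import numpy as np

def discretised_entropy(d, span=10.0):
    """Shannon entropy (nats) of a unit Gaussian discretised into bins of width d."""
    edges = np.arange(-span, span + d, d)
    centres = 0.5 * (edges[:-1] + edges[1:])
    p = np.exp(-0.5 * centres**2) / np.sqrt(2 * np.pi) * d   # P(bin) ~ pdf * width
    p /= p.sum()                                             # renormalise the small tail loss
    return -np.sum(p * np.log(p))

h_diff = 0.5 * np.log(2 * np.pi * np.e)                      # differential entropy of N(0, 1)
for d in (0.5, 0.1, 0.01):
    h_disc = discretised_entropy(d)
    print(f"d = {d:5.2f}:  H_discrete = {h_disc:.4f} nats,  "
          f"H_diff - ln(d) = {h_diff - np.log(d):.4f} nats")
```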

As with the event entropy, the distribution entropy can display categorically different be-
haviours for discrete and continuous systems. Examination of Equations 2.71 and 2.73 suggests that the behaviour will depend on the values of the expression P(x, y) log_b [1 / P(x, y)].
Figure 2.8 shows the value of this for different values of P (x, y). This curve demonstrates
that there is a change of sign in the contribution of the event (x, y) when the probabil-
ity exceeds unity. In the discrete case, the probabilities are constrained to the interval
[0, 1] ⊂ R and the contribution of every event is non-negative; the distribution entropy
must be non-negative also. This does not hold for the continuous case as the probability
density can be greater than unity, resulting in a negative contribution to the distribution
entropy. While the result suggesting that the entropy measure of Equations 2.71 and 2.73
can become negative is often considered to render it an invalid measure (Kraskov et al.
2004, p3), others such as Cerf & Adami (1997) recognise that positivity is not a necessary
condition for a measure of the form of Equation 2.1, but rather that the measure should be
considered relatively instead of in an absolute manner.

15 This should not be confused with a discrete approximation to the calculation of the integral, such as the use of the trapezoidal rule, where the convergence of the numerical approximation to the integral remains unchanged.

While the event entropy of Section 2.5.3 is meaningful in that the relative ‘unexpectedness’ of
multiple outcomes can be compared directly, the distribution entropy can be interpreted by
comparison with the distribution entropy of alternative distributions. The most important
of these are the ‘deterministic’ and ‘uninformative’ distributions. In the discrete case these
correspond to a distribution with only one event with a non-zero probability and the case
where all events are equally likely, that is,

⎨ 1 for i=i’,j=j’
D
PAB (xi , yj ) = (2.75)
⎩ 0 otherwise
1
U
PAB (xi , yj ) = where Ne is the number of events in S (2.76)
Ne

which have distribution entropies of

    H_{P^D_AB}(X, Y) = −Σ_{ij} P^D_AB(x_i, y_j) log_b P^D_AB(x_i, y_j)
                     = −(1) log_b (1)
                     = 0   (2.77)

and

    H_{P^U_AB}(X, Y) = −Σ_{ij} (1/N_e) log_b (1/N_e)
                     = log_b N_e   (2.78)

since there are N_e events with the same contribution. For a given N_e, H_{P^U_AB}(X, Y) represents the maximum achievable entropy (Mackay 2004, p33) and, because of the non-negativity of entropy, H_{P^D_AB}(X, Y) is the minimum value. That is,

    H_{P_AB}(X, Y) ∈ [0, log_b N_e] ⊂ R.   (2.79)



The entropy of a discrete probability distribution with N_e possible events is defined by

    H_{P_AB}(X, Y) = E_xy{h_{P_AB}(x_i, y_j)} = Σ_{ij} P_AB(x_i, y_j) log_b [1 / P_AB(x_i, y_j)]   (2.71)

and has lower and upper limits corresponding to deterministic and uninformative distributions respectively. The entropies of these are

    H_{P^D_AB}(X, Y) = 0   (2.77)

and

    H_{P^U_AB}(X, Y) = log_b N_e   (2.78)

so that

    H_{P_AB}(X, Y) ∈ [0, log_b N_e] ⊂ R.   (2.79)
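A minimal numerical check of these bounds (hypothetical code, not from the thesis) computes the distribution entropy for a deterministic, a uniform and an intermediate distribution over the same set of N_e events:

```python
import numpy as np

def distribution_entropy(p, base=np.e):
    """H = sum_i p_i log_b(1/p_i), with the 0 log(1/0) terms taken as 0 (Equation 2.89)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz)) / np.log(base)

n_e = 4
deterministic = [1.0, 0.0, 0.0, 0.0]
uniform = [1.0 / n_e] * n_e
intermediate = [0.4, 0.3, 0.2, 0.1]

print(f"deterministic: H = {distribution_entropy(deterministic):.4f} nats (lower bound 0)")
print(f"intermediate : H = {distribution_entropy(intermediate):.4f} nats")
print(f"uniform      : H = {distribution_entropy(uniform):.4f} nats "
      f"(upper bound ln {n_e} = {np.log(n_e):.4f})")
```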

Alternatively, in the continuous case, the deterministic and uninformative distributions


are defined by the Dirac delta function and a uniform distribution respectively and for a
one-dimensional case are given by

    P^D_XY(x, y) = δ_D(x − x′, y − y′)   (2.80)

    P^U_XY(x, y) = U_H(x − x_min, y − y_min) − U_H(x − x_max, y − y_max)   (2.81)

where UH (.) is the Heaviside function defined in Equation 2.45 in Example 2.5.2. Note
that the uninformative density is ill-defined when the domain of either coordinate x or y is
 
infinite. Consider the distribution P*_X(x) ≜ (1/w)[U_H(x + w/2) − U_H(x − w/2)], which is a unit-area rectangular pulse of width w centered at x = 0. This corresponds to the Dirac delta function in the limit that w → 0. The distribution entropy is

    H_{P*_X}(X) = −∫_{−∞}^{∞} (1/w)[U_H(x + w/2) − U_H(x − w/2)] log_b {(1/w)[U_H(x + w/2) − U_H(x − w/2)]} dx
                = −∫_{−∞}^{∞} (1/w)[U_H(x + w/2) − U_H(x − w/2)] log_b (1/w) dx
                  − ∫_{−∞}^{∞} (1/w)[U_H(x + w/2) − U_H(x − w/2)] log_b [U_H(x + w/2) − U_H(x − w/2)] dx
                = log_b w − ∫_{−w/2}^{w/2} (1/w) log_b 1 dx
                = log_b w.   (2.82)

Therefore, it is possible to write the entropies of the deterministic and uninformative densities as

    H_{P^D_X}(X) = lim_{w→0} H_{P*_X}(X)
                 = lim_{w→0} log_b w
                 = −∞   (2.83)

and

    H_{P^U_XY}(X, Y) = log_b [(x_max − x_min)(y_max − y_min)].   (2.84)

As (x_max − x_min) → ∞ and (y_max − y_min) → ∞ the entropy H_{P^U_X}(X) → ∞ also. Therefore, when the domain of a continuous density is known, the entropy is bounded as

    H_{P_XY}(X, Y) ∈ [H_{P^D_XY}(X, Y), H_{P^U_XY}(X, Y)]
                   = [−∞, log_b {(x_max − x_min)(y_max − y_min)}] ⊂ R.   (2.85)

Now consider the effect of applying the transformation of Equation 2.58 to the resulting entropy

    P_X(x) → P_{X′}(x) = kP_X(kx)   (2.58)

which yields

    H_{P_{X′}}(X) = −∫_{−∞}^{∞} kP_X(kx) log_b [kP_X(kx)] dx
                  = −log_b k ∫_{−∞}^{∞} kP_X(kx) dx − ∫_{−∞}^{∞} kP_X(kx) log_b P_X(kx) dx
                  = log_b (1/k) − ∫_{−∞}^{∞} P_X(kx) log_b P_X(kx) d(kx)
                  = log_b (1/k) + H_{P_X}(X)   (2.86)

where the fact that scaling the domain of the function and the differential element simul-
taneously does not change the result has been used. This implies that a scaling of the
domain (such as that involved in a change of units) introduces an additive term into the
entropy. This term will be negative for any ‘compression’ of the distribution, which agrees
intuitively with the notion of the entropy measuring the ‘spread’ or overall uncertainty of

the distribution. This result can be trivially extended to higher-dimensional distributions by recognising that the scaling of each axis introduces an equivalent term,

    P_{X′Y′}(x, y) = k_x k_y P_XY(k_x x, k_y y)   (2.87)

    ⇒ H_{P_{X′Y′}}(X, Y) = log_b [1 / (k_x k_y)] + H_{P_XY}(X, Y)
                         = log_b (1/k_x) + log_b (1/k_y) + H_{P_XY}(X, Y).   (2.88)

The entropy of a continuous probability distribution with a finite domain x ∈ [x_min, x_max] and y ∈ [y_min, y_max] is given by

    H_{P_XY}(X, Y) = E_xy{h_{P_XY}(x, y)} = ∫∫ P_XY(x, y) log_b [1 / P_XY(x, y)] dx dy   (2.73)

and has lower and upper limits corresponding to deterministic and uninformative distributions respectively. The entropies of these are

    H_{P^D_XY}(X, Y) = −∞   (2.83)

and

    H_{P^U_XY}(X, Y) = log_b [(x_max − x_min)(y_max − y_min)]   (2.84)

so that

    H_{P_XY}(X, Y) ∈ [−∞, log_b {(x_max − x_min)(y_max − y_min)}] ⊂ R.   (2.85)

If the domain is infinite, then the upper limit becomes ∞. There is no guarantee that the continuous entropy will be positive and the value depends directly on the units of the domain of the random variable. Applying the scaling transformation P_X(x) → P_{X′}(x) = kP_X(kx) yields

    H_{P_{X′}}(X) = log_b (1/k) + H_{P_X}(X)   (2.86)

which implies that changing the scale of a continuous random variable will shift the entropy by log_b (1/k) where k is the scale factor.
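The shift of Equation 2.86 can be checked numerically; the sketch below (hypothetical code, purely illustrative, with an assumed Gaussian density and grid) integrates the differential entropy before and after the scaling P_X(x) → kP_X(kx) and compares the difference with ln(1/k).

```python
import numpy as np

def differential_entropy(pdf, x):
    """Numerical differential entropy -∫ p ln(p) dx on a regular grid x (simple Riemann sum)."""
    p = pdf(x)
    dx = x[1] - x[0]
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask])) * dx

sigma = 2.0
gauss = lambda t: np.exp(-0.5 * (t / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-40.0, 40.0, 200_001)
k = 4.0
scaled = lambda t: k * gauss(k * t)          # the transformation P_X(x) -> k P_X(k x)

h_original = differential_entropy(gauss, x)
h_scaled = differential_entropy(scaled, x)
print(f"H(original) = {h_original:.4f} nats")
print(f"H(scaled)   = {h_scaled:.4f} nats")
print(f"difference  = {h_scaled - h_original:.4f} nats,  ln(1/k) = {np.log(1.0 / k):.4f} nats")
```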

Additionally, a direct application of l’Hospital’s rule (Stewart 1995, p420) (or inspection of
Figure 2.8) demonstrates that while the information content of an event with zero prob-

ability P (x) → 0 is infinite, the contribution of that outcome to the average entropy is
zero,

    lim_{P(x)→0} P(x) log_b [1 / P(x)] = 0.   (2.89)
That is, while the observation of the impossible event would contribute an infinite quantity
of information (if observed given the assumed distribution) it makes no contribution to the
average as it will never be observed. This is a reasonable result: conceiving of additional
states which cannot occur should have no effect on the actual behaviour of the experiment.

When considering the event entropy in Section 2.5.1, the conditional entropy was defined
as an entropy calculated on the distribution P (x|y) which is a distribution over the random
variable x, parameterised by the value of the random variable y,


    h_P(x|y) = log_b [1 / P(x|y)].   (2.60)

In the same way as the definition of the distribution entropies, the distribution conditional
entropy is defined as the expectation of this quantity with respect to the joint distribution
and is

    H_P(X|Y) = E_xy{log_b [1 / P(x|y)]}   (2.90)
where the expectation is determined by either integration or summation depending on
whether the distribution is continuous or discrete, cf. Equations 2.71 and 2.73. This
expression can be written as

    Σ_i Σ_j P(x_i, y_j) log_b [1 / P(x_i|y_j)] = Σ_i Σ_j P(x_i|y_j) P(y_j) log_b [1 / P(x_i|y_j)]
                                               = Σ_j P(y_j) Σ_i P(x_i|y_j) log_b [1 / P(x_i|y_j)]   (2.91)

where the inner summation over i is the conditional entropy given y = yj , H(X|yj ) (Mackay
2004, p138). The inner summation represents the expectation of the conditional entropy
(over x) of the distribution for a given value of yj and the outer summation represents the
expectation (over y) of these values for all possible values of y.

The distribution conditional entropy is defined for a discrete distribution as

    H_{P_AB}(X|Y) = E_xy{log_b [1 / P_AB(x|y)]} = Σ_{ij} P_AB(x, y) log_b [1 / P_AB(x|y)]   (2.91)

and has the logical continuous counterpart,

    H_{P_XY}(X|Y) = E_xy{log_b [1 / P_XY(x|y)]} = ∫∫ P_XY(x, y) log_b [1 / P_XY(x|y)] dx dy.   (2.92)

Finally, consider the relationship between the distribution joint entropy H_P(X, Y) and the distribution marginal and conditional entropies. The joint entropy can be expanded as

    H_P(X, Y) = E_xy{log_b [1 / P(x, y)]}
              = E_xy{log_b [1 / P(y|x)] + log_b [1 / P(x)]}
              = H_P(Y|X) + H_P(X)   (2.93)

and similarly, H_P(X, Y) = H_P(X|Y) + H_P(Y).   (2.94)

Now it is readily shown for the discrete case that H_{P_AB}(Y|X) ≤ H_{P_B}(Y), with equality if and only if P_AB(x_i, y_j) = P_A(x_i)P_B(y_j) ∀ (x_i, y_j), that is, if x and y are independent (Mackay 2004, p154). Following the same argument, the conditional distribution entropy is

    H_{P_XY}(X|Y) = ∫ P_Y(y) ∫ P_X(x|y) log_b [1 / P_X(x|y)] dx dy
                  = ∫∫ P_X(x) P_Y(y|x) log_b [P_Y(y) / (P_Y(y|x) P_X(x))] dx dy          (using Bayes’ Rule)
                  = ∫∫ P_X(x) P_Y(y|x) log_b [1 / P_X(x)] dx dy + ∫∫ P_X(x) P_Y(y|x) log_b [P_Y(y) / P_Y(y|x)] dx dy
                  = ∫ P_X(x) log_b [1 / P_X(x)] dx + ∫ P_X(x) ∫ P_Y(y|x) log_b [P_Y(y) / P_Y(y|x)] dy dx
                  = H_{P_X}(X) − ∫ P_X(x) D_KL[P_Y(y|x), P_Y(y)] dx   (2.95)

where D_KL(P‖Q) ≜ E_x{log_b [P(x) / Q(x)]} has been written in the last line with P_Y(y|x) and P_Y(y) replacing P and Q in the definition. Now a direct application of Jensen's rule,

Ex {f (x)} ≥ f (Ex {x}), for a convex function f (x) (Mackay 2004, p47) ensures that

DKL (P, Q) ≥ 0 (2.96)

for any two probability distributions P and Q and the integral on the right hand side of
Equation 2.95 must be non-negative, HPXY (X|Y ) ≤ HPX (X). This means that for both
discrete and continuous cases,

HP (X|Y ) ≤ HP (X) (2.97)

and similarly, HP (Y |X) ≤ HP (Y ) (2.98)

and together with Equations 2.93 and 2.94 means that16

HP (X|Y ) + HP (Y |X) ≤ HP (X, Y ) ≤ HP (X) + HP (Y ) (2.99)

For both continuous and discrete distribution entropies, the relationship between the
joint and marginal entropies is given by
HP (X|Y ) + HP (Y |X) ≤ HP (X, Y ) ≤ HP (X) + HP (Y ). (2.99)

Finally, note that there is an inherent dimensionality in the definitions of the entropy. Recall
the definitions of the event entropies in Equation 2.51,

    h_P(x, y) = log_b [1 / P(x, y)].   (2.51)

In the discrete case the probability measure PAB (xi , yj ) is dimensionless and the resulting
entropy must therefore also be dimensionless.17 In the continuous case Smith (2001, pp8-
10) notes that the dimensions of the n−dimensional density function18 PXY (x, y) must
be [length]−n so that an integral over this density results in a dimensionless probability
measure. In turn, this means that the resulting continuous entropy has units of

 
    [h_{P_X}(x, y)] = log_b ([length]^{−n}).   (2.100)
16 See Examples C.5.5, C.5.6 and C.5.7.
17 This must not be confused with the ‘dimensions’ of bits, nats and Hartleys applied to the cases of b = 2, b = e and b = 10 respectively and which represent a multiplicative factor applied to the actual units of the function.
18 Where n = dim(x) + dim(Y).

This, contrary to the assertion in that paper, remains a valid comparative dimensional unit,
in the same way that Radar cross-sections are often measured in dBm2 with reference to a
1m2 target, rather than in m2 . Furthermore, as the entropy has been defined as a measure
in the broadest sense, there is no requirement for it to have any specific units. Overall,
the entropy represents a measure which captures the same fundamental properties of the
distribution as the probabilities, but in a mathematical form which supports a wider range
of interpretations and applications.

2.5.5 Mutual Information

The event entropy, hP (x, y), provides a useful measure of the uncertainty associated with
observing the joint event (x, y) given the knowledge encoded in the joint probability dis-
tribution P (x, y). It was constructed as a replacement for the probability measure so that
in the case of independent random variables, x and y, the measure for the experiments
taken separately would add to give the measure for the joint experiment. Clearly, when
the random variables are dependent, the joint probability is not equal to the product of
the marginals, P (x, y) = P (x)P (y), and the joint entropy is not equal to the sum of the
marginal entropies, hP (x, y) = hP (x) + hP (y).

Let the difference between the joint event entropy and the sum of the marginal event
entropies be denoted by iP (x; y) and let this define the ‘event mutual information’ of the
particular outcome (x, y). Substituting Equation 2.51 into this definition yields

    i_P(x; y) = h_P(x) + h_P(y) − h_P(x, y)
              = log_b [1 / P(x)] + log_b [1 / P(y)] − log_b [1 / P(x, y)]
              = log_b [P(x, y) / (P(x)P(y))].   (2.101)

This expression holds for both discrete and continuous cases; with P (x, y), P (x) and P (y)
replaced by PA,B (xi , yj ), PA (xi ) and PB (yj ), and PXY (x, y), PX (x) and PY (y) respectively.
The relationship between the event entropies and the event mutual information can also be
written as
hP (x, y) = hP (x) + hP (y) − iP (x; y). (2.102)

Formally, Equation 2.101 represents another measure applied to the joint event (x, y). This

Produce Box 1 Box 2


Red Apples 0.25 0.4
Green Apples 0.25 0.1
Red Capsicum 0.25 0.1
Green Capsicum 0.25 0.4

Table 2.1: Two discrete probability distributions for examining mutual information; the nor-
malised frequencies of two boxes of produce from a market selling red and green Capsicum
and Apples are shown. Box 1 is filled in such a way that colour and type are independent
and Box 2 so that this is not true. The numbers in this table can be interpreted as the joint
probability distribution P (xi , yj ) where xi is either ‘Apple’ or ‘Capsicum’ and yj is either
‘Red’ or ‘Green’.

measure increases as the statistical dependence between x and y increases and for indepen-
dent variables, P(x, y) = P(x)P(y), and the expression yields

    i_P(x; y) = log_b [P(x) / P(x)] + log_b [P(y) / P(y)]
              = 0.   (2.103)

Now, just as the event entropy measures (hP (x, y), hP (x) and hP (y)) can be negative quan-
tities, there is no requirement that the mutual information be a non-negative quantity.
Rather than being a problematic result, this is the property which lends the greatest ana-
lytical power to the mutual information. Consider Equation 2.101 in detail. The sign of the resulting measure will depend directly on the magnitude of the fraction K = P(x, y) / [P(x)P(y)].
As noted earlier, when two events are independent the joint distribution is separable,
P (x, y) = P (x)P (y), which means that the denominator of K can be understood as the joint
distribution which would have occurred if the two events were in fact independent. This frac-
tion therefore directly compares the true joint distribution P (x, y) with the ‘independent-
assumption’ joint distribution PI (x, y) = P (x)P (y).

Example 2.5.3 – Event mutual information


Consider a market where Apples and Capsicum can be purchased and apply the measure to
two different boxes of mixed produce with the distributions of Table 2.1. The first box is an
even mixture of the four types of produce with one quarter of the box being each type. The
second is not evenly packed and favours Red Apples and Green Capsicum with 80% of the
box containing these two items. Let the random variables x and y correspond to produce
type and colour denoted by xi ∈ {Apple, Capsicum} and yj ∈ {Red, Green} respectively.

The marginal distributions were defined earlier as the distribution of only one of the random
variables. For example, the marginal probability of randomly selecting an apple, irrespective
of its colour, from Box 1 is


    P_1(x_i = Apple) = Σ_j P_1(x_i = Apple, y_j)
                     = P_1(x_i = Apple, y_j = Red) + P_1(x_i = Apple, y_j = Green)
                     = 0.25 + 0.25
                     = 0.5.   (2.104)

Likewise, for Box 1 the remaining marginal probabilities are P1 (xi = Capsicum) = 0.5,
P1 (yj = Red) = 0.5 and P1 (yj = Green) = 0.5. Note here that the probability of an event
(xi , yj ) is equal to the product of the marginal distributions P1 (xi ) × P1 (yj ). In this case all
marginals are 0.5 so that all joint probabilities are 0.5 × 0.5 = 0.25. This means that the
two random variables are independent: when selecting an item from the box the probability
of its colour being red does not depend on the probability of it being a capsicum or an apple.

In the second box, however, this is not the case. The marginal probabilities are easily verified
to be the same as before: P_2(x_i = Apple) = 0.4 + 0.1 = 0.5, P_2(x_i = Capsicum) = P_2(y_j =
Red) = P2 (yj = Green) = 0.5. If the experiments are assumed independent it is expected
that the joint probabilities will be the same as those for the first box but they are not.

When evaluating the mutual information of the event (xi , yj ) there are three distinct cases
for the value of K:

P (x, y) = PI (x, y) = P (x)P (y): The random variables are independent and the resulting
measure is logb (1) = 0. This corresponds to Box 1 in which the random variables

are independent. Thus, the joint probabilities of Box 1 correspond to the expected
probabilities of Box 2 under the assumption that the variables are independent. The
joint event entropy hP (x, y) is, by Equation 2.102, equal to hP (x) + hP (y).

P (x, y) > PI (x, y): Here the joint distribution indicates that the joint event (x, y) is more
likely to occur than it would have been if the random variables were actually indepen-
dent. The fraction K is larger than unity and the resulting measure will be positive.
For example, the event (xi = Apple, yj = Red) for Box 2 has P2 (x, y) = 0.4, but
P2 (x)P2 (y) = 0.25) and iP (x, y) = loge (1.6) = 0.47 nats. Equation 2.102 means that
the resulting event entropy hP (x, y) of the joint event will be lower than in the case of
independent experiments. The same holds for the event (xi = Capsicum, yj = Green).

P (x, y) < PI (x, y): Here the joint event is less likely to occur than if x and y were inde-
pendent. K will be less than unity and so the resulting measure will be negative. For
example, (x_i = Apple, y_j = Green) has i_P(x; y) = log_e(0.1/0.25) = −0.92 nats. This means that
the joint event entropy of the pair (x, y) is greater than if they had been independent.
In this case, knowing one property makes the probability associated with obtaining the
second at the same time much smaller and the joint event has a high event entropy.

Therefore, the mutual information iP (x; y) represents a measure which captures both the
degree of statistical dependence between the elements and the sign of that dependence for
specific pairs of events. Importantly, the distribution entropy and mutual information are
closely related. Consider two perfectly correlated random variables y ≡ x, for which P (x)
is known. The joint distribution will be equal to P (x) whenever x = y and will be zero
otherwise, that is P (x, y) = P (x)δD (x − y) where δD (x) is the Dirac delta function and this
ensures that P (y) = P (x). For a discrete system the mutual information of the possible
events (xi , yj ) will be

    i_{P_AB}(x_i; y_j) = log_b [P_A(x_i)δ_D(x_i − y_j) / (P_A(x_i)P_B(y_j))]
                       = log_b [1 / P_A(x_i)]
                       = h_{P_A}(x)   (2.105)

which is a remarkable result showing that the discrete event entropy can actually be con-

sidered a mutual information measure when only one random variable is present. In the
continuous case, however, the Dirac delta function has the ‘value’ of δD (0) → ∞ and instead
the relationship becomes

    i_{P_XY}(x; y) = log_b [P_X(x)δ_D(0) / (P_X(x)P_X(x))]
                   = log_b [1 / P_X(x)] + log_b δ_D(0)
                   → h_{P_X}(x) + ∞.   (2.106)

The final term represents the effect of the joint entropy hPXY (x, y) → −∞ because it is a
‘ribbon’ of zero width, see Figure 2.5 in Example 2.5.1 for a specific example of this.

The event ‘mutual information’ captures the difference between the actual joint distribution P(x, y) and the joint distribution which would have occurred if the random variables x and y were independent, P_I(x, y) = P(x)P(y).

    i_P(x; y) ≜ h_P(x) + h_P(y) − h_P(x, y) = log_b [P(x, y) / (P(x)P(y))]   (2.101)

When independent, the event mutual information is

    i_{P_I}(x; y) = log_b [P_I(x, y) / P_I(x, y)] = 0.   (2.103)

When not independent, there are two possible cases:

• P(x, y) < P_I(x, y): The event (x, y) is less likely than if the variables had been independent: i_P(x; y) < 0.

• P(x, y) > P_I(x, y): The event (x, y) is more likely than if the variables had been independent: i_P(x; y) > 0.

As in Section 2.5.3 it is useful to consider the limits which apply to the event mutual
information in terms of the marginal and conditional entropies. Recall that for discrete
systems,

PAB (xi , yj ) ≤ min {PA (xi ), PB (yj )} (2.62)



so that it is possible to write

    i_{P_AB}(x_i; y_j) = log_b [P_AB(x_i, y_j) / (P_A(x_i)P_B(y_j))]
                       ≤ log_b [min{P_A(x_i), P_B(y_j)} / (P_A(x_i)P_B(y_j))]
                       = log_b [1 / max{P_A(x_i), P_B(y_j)}]
                       = min {h_{P_A}(x_i), h_{P_B}(y_j)}.   (2.107)

Similarly, for the discrete case a lower limit to the event mutual information is

    i_{P_AB}(x_i; y_j) ≥ log_b [P_AB(x_i, y_j) / P_A(x_i)]
                       = −log_b [1 / P_AB(y_j|x_i)]
                       = −h_{P_AB}(y_j|x_i)   (2.108)

where the fact that P_B(y_j) ∈ [0, 1] ⊂ R was used in the first line. A similar result can be shown for h_{P_AB}(x_i|y_j), yielding

    i_{P_AB}(x_i; y_j) ≥ −min {h_{P_AB}(x_i|y_j), h_{P_AB}(y_j|x_i)}.   (2.109)

In the case of a continuous density, however, the marginal probabilities are not confined to
the interval [0, 1] ⊂ R, so that the inequality of Equations 2.62 and 2.108 no longer holds.
In fact, there is no lower limit as the value of PXY (x, y) can be zero regardless of the value
of PY (y) (or PX (x)).19

Some additional limits of the event mutual information can be written in terms of the
marginal and conditional entropies. For discrete systems, an upper limit is
iPAB (xi ; yj ) ≤ min {hPA (xi ), hPB (yj )} (2.107)
and a lower limit

    i_{P_AB}(x_i; y_j) ≥ −min {h_{P_AB}(x_i|y_j), h_{P_AB}(y_j|x_i)}.   (2.109)

19 See Examples C.5.8, C.5.9 and C.5.10.

2.5.6 Distribution Mutual Information

Analogously to the definition of the distribution entropy as the expected value of the event
entropy, the distribution mutual information is defined as the expected value of the event
mutual information. Again, the events are disjoint so that Equation 2.67 implies that the
expectation operators of Equations 2.68 and 2.69 generate values which can be interpreted
as a weighted measure on the union of events.

The distribution mutual information, IP (X; Y ), is therefore given by

IP (X; Y ) = Exy {iP (x; y)} . (2.110)

For the discrete and continuous cases this can be written explicitly as

    I_{P_AB}(X; Y) = E_xy{log_b [P_AB(x_i, y_j) / (P_A(x_i)P_B(y_j))]}
                   = Σ_i Σ_j P_AB(x_i, y_j) log_b [P_AB(x_i, y_j) / (P_A(x_i)P_B(y_j))]   (2.111)

and

    I_{P_XY}(X; Y) = E_xy{log_b [P_XY(x, y) / (P_X(x)P_Y(y))]}
                   = ∫∫ P_XY(x, y) log_b [P_XY(x, y) / (P_X(x)P_Y(y))] dx dy.   (2.112)

Now, while iP (x; y) can be less than zero, it is not immediately obvious that IP (X; Y ) may
not; see Cover & Thomas (1991, pp26-7) for details of a proof that IP (X; Y ) ≥ 0 for any
pair of random variables. There is also a simple relationship between the elements HP (X),
HP (Y ), HP (X, Y ) and IP (X; Y ),

    I_P(X; Y) = E_xy{i_P(x; y)}
              = E_xy{log_b [1 / P(x)] + log_b [1 / P(y)] − log_b [1 / P(x, y)]}
              = E_xy{h_P(x) + h_P(y) − h_P(x, y)}
              = H_P(X) + H_P(Y) − H_P(X, Y)   (2.113)

and it is seen that the mutual information, IP (X; Y ), has the same relationship to the

entropy as iP (x; y) has to the Shannon information content of joint events, though notably
with the additional requirement that IP (X; Y ) ≥ 0 with equality if and only if the random
variables are independent.

The distribution mutual information for the random variables x and y is defined as

    I_{P_AB}(X; Y) = Σ_i Σ_j P_AB(x_i, y_j) log_b [P_AB(x_i, y_j) / (P_A(x_i)P_B(y_j))]   (2.111)

for a discrete system and

    I_{P_XY}(X; Y) = ∫∫ P_XY(x, y) log_b [P_XY(x, y) / (P_X(x)P_Y(y))] dx dy   (2.112)

for a continuous one. It is related to the distribution entropies by

    I_P(X; Y) = H_P(X) + H_P(Y) − H_P(X, Y)   (2.113)

and is non-negative, with I_P(X; Y) ≥ 0 for all distributions.
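A short sketch (hypothetical code, reusing the Box 2 joint distribution from Table 2.1) evaluates Equation 2.111 directly and checks the identity of Equation 2.113 against the three distribution entropies:

```python
import numpy as np

# Box 2 joint distribution P(type, colour); rows: Apple, Capsicum; cols: Red, Green.
joint = np.array([[0.40, 0.10],
                  [0.10, 0.40]])

p_x = joint.sum(axis=1)   # marginal over colour
p_y = joint.sum(axis=0)   # marginal over type

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))          # nats

# Equation 2.111: direct evaluation of the distribution mutual information.
I_direct = np.sum(joint * np.log(joint / np.outer(p_x, p_y)))

# Equation 2.113: I(X;Y) = H(X) + H(Y) - H(X,Y).
I_from_entropies = entropy(p_x) + entropy(p_y) - entropy(joint.flatten())

print(f"I(X;Y) directly   = {I_direct:.4f} nats")
print(f"H(X)+H(Y)-H(X,Y)  = {I_from_entropies:.4f} nats")
```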

Once more, consider the relationship between the distribution mutual information and the
marginal and conditional entropies,
 
    I_P(X; Y) = E_xy{log_b [P(x, y) / (P(x)P(y))]}
              = E_xy{log_b [1 / P(x)]} − E_xy{log_b [1 / P(x|y)]}
              = H_P(X) − H_P(X|Y).   (2.114)

Now, Equation 2.97 showed that HP (X|Y ) ≤ HP (X) for both discrete and continuous
systems, reaffirming the positivity of the mutual information. Further, for discrete systems
all entropies are positive (as a result of all probabilities, including PA (x|y), being in the
interval [0, 1] ⊂ R) so that IPAB (X; Y ) ≤ HPA (X). In fact,

IPAB (X; Y ) ≤ min {HPA (X), HPB (Y )} . (2.115)

Unfortunately, the situation is substantially more complicated in the case of continuous


systems, since the quantities HPXY (X, Y ), HPX (X), HPY (Y ), HPX (X|Y ) and HPY (Y |X)
can be either positive or negative, depending on the particular distribution involved. In
principle there are, therefore, 25 = 32 possible combinations of the signs for these quantities,

though application of the following relationships reduces this to twelve distinct cases,

HP (X|Y ) ≤ HP (X) (2.116)

HP (X, Y ) ≥ HP (X|Y ) + HP (Y |X) (2.117)

HP (X, Y ) ≤ HP (X) + HP (Y ) (2.118)

HP (X|Y ) = HP (X, Y ) − HP (Y ). (2.119)

With the changes of sign allowed, there is no longer a simple rule for the relationships
among the quantities and the upper limit for the mutual information varies with each case.
In addition, several configurations introduce additional constraints between the quantities.
It suffices for the purpose of this thesis to note these additional cases exist, but to formally
avoid defining any constraints on the mutual information apart from positivity.

The distribution mutual information is always non-negative for both discrete and con-
tinuous systems. In addition, discrete systems have a well-defined upper limit,
IPAB (X; Y ) ≤ min {HPA (X), HPB (Y )} . (2.115)

No simple limit exists for the continuous case as a scaling of the domain can change
the sign of the various quantities arbitrarily.

Example 2.5.4 – Mutual information limits


Consider the case where

HP (X, Y ), HP (X), HP (Y ) ≥ 0

HP (X|Y ), HP (Y |X) ≤ 0.

Now, Equation 2.119 implies that in this case, HP (X) and HP (Y ) must be greater than
HP (X, Y ). In turn this generates a lower limit for the mutual information as IP (X; Y ) =
H_P(X) + H_P(Y) − H_P(X, Y). Similarly, the upper limit in this case corresponds to H_P(X, Y) =
0 and IP (X; Y ) = HP (X) + HP (Y ), resulting in

IP (X; Y ) ∈ [max {HP (X), HP (Y )} , HP (X) + HP (Y )] . (2.120)



These properties mean that the Mutual Information is a viable measure for the statistical
dependence between the experiments X and Y and also between particular outcomes of
the experiment x and y. It is a measure of ‘proximity’ of the distributions, rather than
a deviation, but can still be used to quantify the relationship between two distributions
(or distribution elements). It is important to note that this quantity is not a divergence
between the two distributions P (x) and P (y), even though it can be shown that mutual in-
formation is related to the Kullback-Leibler divergence between the distribution P (x, y) and
the ‘independent-assumption’ distribution P (x)P (y). This actually highlights the meaning
of mutual information as a comparison between the actual joint distribution and the joint
distribution which would occur if the random variables were independent.20

2.5.7 Three Variable Expressions

It is also possible to extend the definition of Equation 2.101 to the case where the distri-
butions P (x, y), P (x) and P (y) are replaced by P (x, y|z), P (x|z) and P (y|z) respectively.
That is, the mutual information between random variables x and y, given z, is

    i_P(x; y|z) = log_b [P(x, y|z) / (P(x|z)P(y|z))]   (2.121)

and the expectation of this value, taken firstly over P (x, y|z) and then over P (z) as in
Equation 2.91, yields
 
P (x, y|z)
IP (X; Y |Z) = Exyz logb (2.122)
P (x|z)P (y|z)

where the expectation is taken over the joint probability distribution P (x, y, z).

The validity of additional expressions involving three terms (other than the conditional
mutual information) such as I(X; Y ; Z) is often debated. Mackay (2004, p139) asserts its
invalidity as a direct result of not being a necessarily positive quantity; many other authors
acknowledge this fact, but grant that the quantity actually exists (Cover & Thomas 1991,
p45), albeit in some cases “without intuitive meaning” (Csiszar & Korner 1982, pp52-3), or
with only “mathematical significance” (Yeung 1991, p470). Figure 2.9 shows a Venn diagram interpretation of the case involving three random variables.

20 See Examples C.5.11 and C.5.12.

Figure 2.9: A Venn diagram showing the relationships between the information theoretic quantities defined for three random variables x, y and z. The area of each region is interpreted as proportional to the measure by which it is labelled. The three terms H(X), H(Y) and H(Z) correspond to the areas of the three large circles and the other terms to the regions indicated.

Produce           Fresh   Rotten   Total
Red Apples        0.1     0.3      0.4
Green Apples      0.01    0.09     0.1
Red Capsicum      0.05    0.05     0.1
Green Capsicum    0.34    0.06     0.4
Total             0.5     0.5      1.0

Table 2.2: Two discrete probability distributions for examining mutual information; shown are the normalised frequencies of two boxes of produce from a market selling red and green Capsicum and Apples, showing the relative number of pieces of fruit which are rotten when the box is purchased. The final column shows the marginal distribution obtained by summation over the Fresh–Rotten state.

Section 2.6 will examine the validity of the Venn diagram approach in more detail, but here it will be assumed that the areas of the regions in the diagram of Figure 2.9 are proportional to the values of the measures by which they are labelled. This figure provides a valid tool for representing the relationships between the indicated measures. Two cases are of interest here, giving rise to I(X; Y; Z) > 0 and I(X; Y; Z) < 0 respectively.

Quantity       Value (nats)
H(P)           0.693
H(C)           0.693
H(F)           0.693
H(P|C,F)       0.3854
H(C|P,F)       0.4693
H(F|C,P)       0.4958
I(P;C|F)       0.1415
I(C;F|P)       0.0311
I(P;F|C)       0.1150
I(P;C;F)       0.0512
H(P,C,F)       1.6894

Table 2.3: Information theoretic quantities calculated for the probability distributions of Section 2.5.7. The three variables P, C and F refer to the three binary quantities: produce (Apple/Capsicum); colour (Red/Green); and freshness (Fresh/Rotten).

Example 2.5.5 – Positive three-term mutual information


Firstly, consider the greengrocer example of Section 2.5.5, but consider also the statistics of
the second box (with uneven packing of produce) when some of the produce is rotten. Table
2.2 reproduces the joint distribution, which when marginalised over the state of Fresh-Rotten
yields the second column of Table 2.1. Table 2.3 contains the calculated quantities identified
in the Venn diagram of Figure 2.9 and it is easily verified that the Venn diagram gives the
correct relationships between these quantities. Similar agreement can be demonstrated be-
tween several other terms which are represented in the figure, but not labelled: H(X|Y ), the
subset of the area of H(X) which does not overlap with H(Y ); H(X|Z), H(Y |Z) H(Y |X),
H(Z|X) and H(Z|Y ), which are defined similarly; I(X; Y ), the common region shared by
both H(X) and H(Y ); and I(X; Z) and I(Y ; Z) defined in the same way.

Importantly, in this example the central term I(X; Y ; Z) is a positive quantity so that ob-
serving any one variable results in a reduction in the mutual information available between
the remaining two variables.
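The entries of Table 2.3 can be reproduced directly from the joint distribution of Table 2.2. The following minimal sketch (not part of the original text; it assumes NumPy) builds the 2×2×2 joint distribution over produce, colour and freshness and evaluates the seven ‘base’ regions of the Venn diagram of Figure 2.9 in nats, confirming that they sum to the joint entropy as required.

```python
import numpy as np

def H(p):
    """Shannon entropy, in nats, of a discrete (joint) distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Joint distribution of Table 2.2; axes are produce (Apple, Capsicum),
# colour (Red, Green) and freshness (Fresh, Rotten).
P = np.array([[[0.10, 0.30],    # Apple,    Red:   Fresh, Rotten
               [0.01, 0.09]],   # Apple,    Green: Fresh, Rotten
              [[0.05, 0.05],    # Capsicum, Red:   Fresh, Rotten
               [0.34, 0.06]]])  # Capsicum, Green: Fresh, Rotten

Hp, Hc, Hf = H(P.sum(axis=(1, 2))), H(P.sum(axis=(0, 2))), H(P.sum(axis=(0, 1)))
Hpc, Hpf, Hcf = H(P.sum(axis=2)), H(P.sum(axis=1)), H(P.sum(axis=0))
Hpcf = H(P)

# The seven 'base' regions of the three-variable Venn diagram (Figure 2.9).
H_p_given_cf = Hpcf - Hcf                      # H(P|C,F) ~ 0.3854
H_c_given_pf = Hpcf - Hpf                      # H(C|P,F) ~ 0.4693
H_f_given_pc = Hpcf - Hpc                      # H(F|C,P) ~ 0.4958
I_pc_given_f = Hpf + Hcf - Hf - Hpcf           # I(P;C|F) ~ 0.1415
I_pf_given_c = Hpc + Hcf - Hc - Hpcf           # I(P;F|C) ~ 0.1150
I_cf_given_p = Hpc + Hpf - Hp - Hpcf           # I(C;F|P) ~ 0.0311
I_pcf = Hp + Hc + Hf - Hpc - Hpf - Hcf + Hpcf  # I(P;C;F) ~ 0.0512

regions = [H_p_given_cf, H_c_given_pf, H_f_given_pc,
           I_pc_given_f, I_pf_given_c, I_cf_given_p, I_pcf]
assert np.isclose(sum(regions), Hpcf)          # the regions partition H(P,C,F)
print(f"H(P,C,F) = {Hpcf:.4f} nats")
```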

Example 2.5.6 – Negative three-term mutual information


Alternatively, take the example of Mackay (2004, p144) in which x ∈ {0, 1} and y ∈ {0, 1} with I(X; Y) = 0 and P(x = 0) = P(x = 1) = P(y = 0) = P(y = 1) = 0.5; then by inspection, H(X) = H(Y) = 1 bit. Furthermore, H(Y|X) = H(Y) = 1 and H(X|Y) = H(X) = 1. However, if z ∈ {0, 1} and z = x + y mod 2, then H(Z) = 1 bit also.

Observation of z now correlates the two random variables x and y; specifically, y = z − x mod 2, which also has an entropy of 1 bit. In order for I(X; Y) = 0 and I(X; Y|Z) = 1 the diagram requires that I(X; Y; Z) = −1 bit. Here, the opposite effect is present compared to the first example: when any one of the three random variables is observed, there will be an increase in the amount of mutual information between the other two.
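The negative value can be confirmed numerically. The following sketch (not part of the original text; it assumes NumPy) constructs the joint distribution for the parity example and evaluates the central region via the inclusion–exclusion identity implied by the Venn diagram.

```python
import numpy as np

def H_bits(p):
    """Shannon entropy, in bits, of a discrete (joint) distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Joint distribution for x, y independent fair bits and z = (x + y) mod 2.
P = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        P[x, y, (x + y) % 2] = 0.25

Hx   = H_bits(P.sum(axis=(1, 2)))
Hy   = H_bits(P.sum(axis=(0, 2)))
Hz   = H_bits(P.sum(axis=(0, 1)))
Hxy  = H_bits(P.sum(axis=2))
Hxz  = H_bits(P.sum(axis=1))
Hyz  = H_bits(P.sum(axis=0))
Hxyz = H_bits(P)

# Inclusion-exclusion form of the central Venn-diagram region.
I_xyz = Hx + Hy + Hz - Hxy - Hxz - Hyz + Hxyz
I_xy = Hx + Hy - Hxy                    # 0: x and y are independent
I_xy_given_z = Hxz + Hyz - Hz - Hxyz    # 1 bit once z is observed

print(I_xy, I_xy_given_z, I_xyz)        # 0.0, 1.0, -1.0
```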

The three-term mutual information for an event (x, y, z) can be written in a form similar to the two-term quantity $i_P(x;y)$ of Equation 2.101 as

$i_P(x;y;z) = \log_b \frac{P(x,y)\,P(y,z)\,P(x,z)}{P(x)\,P(y)\,P(z)\,P(x,y,z)}.$   (2.123)

It is easily verified that substitution into

$h_P(x,y,z) = h_P(x|y,z) + h_P(y|x,z) + h_P(z|x,y) + i_P(x;z|y) + i_P(y;z|x) + i_P(x;y|z) + i_P(x;y;z)$   (2.124)

yields the correct expression for $h_P(x,y,z)$ as calculated directly from the joint distribution. Note that this expression is written in terms of the individual terms (lower case identifiers) but holds equally for the expectations of these (upper case identifiers) as each of these is given by the expectation over the joint distribution P(x, y, z). Indeed, even single term entropies can be written in terms of an expectation over the joint distribution in any number of variables, for example consider three variables:

$H_P(X) = E_{xyz}\{h_P(x_i)\} = \sum_i \sum_j \sum_k P(x_i, y_j, z_k)\,\log_b \frac{1}{P(x_i)} = \sum_i P(x_i)\,\log_b \frac{1}{P(x_i)}$   (2.125)

where the last step uses the definition of the marginal distribution of a discrete distribution, $P(x_i) = \sum_j \sum_k P(x_i, y_j, z_k)$.

Figure 2.10: An interval-diagram interpretation of the relationships between common information-theoretic measures defined for two random variables x and y: marginal, joint and conditional entropies and mutual information. The length of each interval corresponds to the value of that quantity.

Finally, Amari (2001, p1707) notes the negativity of the region I(X; Y; Z) and proposes a different measure,

$I_A(X;Y;Z) = I(X;Y) + I(Y;Z) + I(X;Z) - I(X;Y;Z),$

which can be interpreted here as corresponding to the sum of the regions labelled I(X; Z|Y), I(X; Y|Z), I(Y; Z|X) and twice I(X; Y; Z). While this is necessarily a positive quantity, it cannot represent the same concept as the mutual information I(X; Y) as it includes regions such as I(X; Y|Z), each of which represents the mutual information remaining between two random variables once the effects of the third have been removed. Section 2.6 will introduce a more formal reason why this quantity, and similarly the two equivalents I(Y; Z|X) and I(X; Z|Y), represents a ‘z-independent’ information measure.21 Since each of these three regions does not include the effect of the third variable, the measure as a whole is somewhat more than the common part of the joint entropy. In fact, it should be clear from this section that the negativity of I(X; Y; Z) is one of its most useful characteristics.

2.6 Representing the Relationships between Measures

The previous section highlighted that the relationships between the information-theoretic
measures introduced so far are subtle and complicated, particularly for continuous systems
and those with more than two random variables involved. In this section several approaches
are introduced for representing these relationships, each of which emphasises different char-
acteristics of the interrelations.
21 Clearly, this corresponds to an ‘x-independent’ and ‘y-independent’ measure in the other two cases.

2.6.1 Interval Diagrams

When only two random variables are considered, the most straightforward interpretation of
the quantities involved follows the approach of Mackay (2004, p140), in which the magnitude
of the distribution entropies and mutual information are shown as the length of partially
overlapping intervals. This diagram is normally only assumed to hold for the distribution
quantities, rather than for the event terms hP (.) or iP (.; .). Figure 2.10 shows the typical
arrangement of these intervals. From this diagram it is clear that

$H_P(X,Y) = H_P(X) + H_P(Y) - I_P(X;Y)$   (2.126)
$\qquad\quad\; = H_P(X) + H_P(Y|X)$   (2.127)
$\qquad\quad\; = H_P(Y) + H_P(X|Y)$   (2.128)
$\qquad\quad\; = H_P(X|Y) + H_P(Y|X) + I_P(X;Y).$   (2.129)

The relationships of Equations 2.126–2.129 and the interval diagram directly correspond to the relationships between the event quantities, for example

$h_P(x,y) = \log_b \frac{1}{P(x,y)} = \log_b \frac{P(x)P(y)}{P(x)P(y)P(x,y)}$
$\qquad\;\; = \log_b \frac{1}{P(x)} + \log_b \frac{1}{P(y)} - \log_b \frac{P(x,y)}{P(x)P(y)} = h_P(x) + h_P(y) - i_P(x;y),$   (2.130)

and the linearity of the expectation operation implies that the same relationship holds for
the distribution quantities. For this reason, these relationships hold for both discrete and
continuous systems and also for the event entropies and mutual informations. The particular
diagram shown in Figure 2.10 corresponds to the case where all quantities are non-negative
and from the discussions of Sections 2.5.3 and 2.5.5 this was seen to hold for all discrete
cases, but is not necessarily true for continuous systems.

In the discrete case it is also possible to infer the limiting relationships between the various
quantities, such as,

$H_P(X|Y) + H_P(Y|X) \le H_P(X,Y) \le H_P(X) + H_P(Y),$   (2.131)

$0 \le I_P(X;Y) \le \min\{H_P(X), H_P(Y)\},$   (2.132)

$0 \le H_P(X|Y) \le H_P(X)$   (2.133)

and

$0 \le H_P(Y|X) \le H_P(Y),$   (2.134)

where the fact that the intervals correspond to positive quantities has been used in each case.

Figure 2.11: Interval diagrams for two cases involving negative quantities, where negative quantities are shown shaded. (a) $H_P(X|Y)$ and $H_P(Y|X)$ both negative; (b) all quantities other than the mutual information negative.

Figure 2.11 shows two other possible interval diagrams, corresponding to the case of the
conditional entropies becoming negative and to all quantities being negative. It is readily
confirmed that the relationships between the quantities still hold, provided that the shaded
intervals are interpreted as negative quantities. With this extension the interval diagrams
can be readily applied to continuous systems as well as discrete ones. The interval diagram is
convenient for consideration of two random variable cases, but cannot be readily extended
into cases involving additional random variables. In fact, when considering even three
random variables it is necessary to generalise the interval diagram to higher dimensionalities,
which actually results in a Venn diagram.

2.6.2 Venn Diagrams

Venn diagrams are generally accepted for cases involving two random variables, where the
same relationships as the interval diagram of Figure 2.10 are embodied in the areas of
the overlapping regions, rather than the length of intervals.

Figure 2.12: A Venn diagram for two random variables. Each of the quantities labelled corresponds to the area of a different region: H(X, Y) corresponds to the total area, H(X) to the area of the left-hand circle and H(Y) to the right. The same relationships as in Figure 2.10 are demonstrated in this diagram.

Figure 2.12 shows this simple
case. When the system is continuous, or when three or more variables are used, however,
these diagrams also give rise to ‘areas’ which can correspond to negative quantities, such
as I(X; Y ; Z). This property is often seen as a disqualifier, though the inability to draw
accurate diagrams of more than three variables in two dimensions is a more significant
limitation (Yeung 1991, p469). Unlike intervals, however, the Venn diagram generalises
to higher dimensionalities as intersecting volumes and hyper-volumes and in each case the
relationships between the information theoretic quantities are appropriately represented.

An important question due to Mackay (2004, p143) requires addressing here: “Venn dia-
grams [depict] sets; but what are the sets H(X) and H(Y )?” Firstly, it is important not
to confuse the measure with the underlying set on which it is defined and this confusion is
compounded by the fact that the regions in Figure 2.12 are labelled by the measure of the
union of all of the elements within that area. In particular, Mackay (2004, p143) identifies
that it is tempting to assume that each point in the diagram corresponds to a particular
event (x, y) and that the statistical dependencies between the random variables cause cer-
tain subsets of these (the two circles) to overlap to a greater or lesser degree. This is a
highly misleading interpretation as all quantities can be shown to be directly dependent
on all possible outcomes (x, y); for example, Section 2.5.3 defines the conditional entropy
H(X|Y) as

$H_P(X|Y) = \sum_{i,j} P(x_i, y_j)\,\log_b \frac{1}{P(x_i|y_j)}$   (2.91)

which clearly depends on all events (xi , yj ).



Figure 2.13: A commonly drawn Venn diagram for information measures involving two random variables, x and y. This diagram is inherently misleading as it actually represents three different measures applied simultaneously to the same set of events, rather than one measure applied to three disjoint subsets of a single set of events. The three disjoint areas $S_{x|y}$, $S_{y|x}$ and $S_{x;y}$ represent these three measures. The joint event (x, y) corresponds to a separate point in each region, yielding measures due to the properties of x alone, y alone and both x and y together, respectively. Under this interpretation, $S_{x,y}$ represents the total of the three measures, $S_x$ the total of the measures with a dependence on x and $S_y$ those with a dependence on y. It is a desirable property to have the region $S_x$ (or $S_y$) in both this and Figure 2.4 correspond to equivalent measures.

In fact this Venn diagram actually corresponds to three different measures, each taken with
respect to the set {(x, y)}. Figure 2.13 shows the same Venn diagram, but now labelled
according to the sets which correspond to the different areas. There are three fundamental
measures which can be defined for the two-variable case and the sets corresponding to these
are labelled Sx|y , Sy|x and Sx;y . Importantly, all three of these measures are defined over
all events {(x, y)}. This means that the event (x, y) corresponds to three points in this
diagram, one in each of the regions. The three regions correspond to:

$S_{x;y}: (x,y) \mapsto i_P(x;y)$   (2.135)
$S_{x|y}: (x,y) \mapsto h_P(x|y)$   (2.136)
$S_{y|x}: (x,y) \mapsto h_P(y|x).$   (2.137)

Now, while this is a misleading diagram in that it involves three measures, the disjoint nature of the individual events (x, y) allows these measures to be added together in an analogous manner to adding measures on disjoint subsets. In fact, doing this justifies the particular selection of the Venn diagram as the representation of these three measures as

it explicitly links the marginal entropies defined in Figure 2.4 to the measures defined for
joint distributions. Consider the Region labelled Sx in these diagrams: in Figure 2.4 this
corresponds to the set of all outcomes {x} for the single random variable x; whereas in
Figure 2.13 it corresponds to the union of the two regions Sx|y and Sx;y . Now, consider the
sum of the two measures, defined by the second interpretation,

$h_P(x|y) + i_P(x;y) = \log_b \frac{1}{P(x|y)} + \log_b \frac{P(x,y)}{P(x)P(y)}$
$\qquad\qquad\quad\;\; = \log_b \frac{P(x,y)}{P(x)P(y)P(x|y)}$
$\qquad\qquad\quad\;\; = \log_b \frac{1}{P(x)}$   (2.138)

where Equation 2.138 uses the fact that P (x, y) = P (x|y)P (y). Remarkably, in both Figure
2.4 and Figure 2.13 the represented quantities are identical and the two regions Sx|y and
Sx;y can be interpreted as independent components of the marginal subset Sx . Similar
relationships hold for the regions Sy and Sx,y corresponding to the measures for the marginal
P (y) and the joint measures on P (x, y). As will be generalised further in Section 2.7 the
regions Sx|y , Sy|x and Sx;y correspond to the contributions to the statistics of the outcomes
from x alone, y alone and both variables together, respectively, which intuitively relates to the three types of independent causes which can be present in a situation involving two random variables.

Finally, all of the properties implied by the Venn diagram construction for two variables
directly reproduce the properties of Equations 2.126-2.129, again provided that it is recog-
nised that the area of any region can correspond to a negative quantity. In general, the Venn
diagram approach is a valid and powerful tool for analysing the interrelations of entropic
measures provided that their foundation in measure theory is kept in mind. A thorough
analysis of the set-theoretic properties of the Venn diagrams can be found in Yeung (1991).

2.6.3 Vector-Space Construction

Both the interval and Venn diagrams of Figures 2.10 and 2.12 identify that for cases in-
volving two random variables there are three independently variable quantities: HP (X|Y ),
HP (Y |X) and I(X; Y ). Significantly, all other quantities can be derived from these three
alone and changes in any one of these has no effect on the remaining two. This suggests
that these quantities could be interpreted as orthogonal basis vectors in a three dimensional space. Under this interpretation all quantities derived from these might be representable using vector constructions in this space.

Figure 2.14: A two dimensional vector diagram illustrating a possible interpretation of the relationship between $H_P(X)$, $H_P(X|Y)$ and $I_P(X;Y)$. Each vector here has a length which is the square root of the associated quantity, e.g. $|\vec{H}_P(X)| = \sqrt{H_P(X)}$, so that the lengths are related by Pythagoras' theorem as $H_P(X) = H_P(X|Y) + I_P(X;Y)$.

Explicitly, note that $H_P(X) = H_P(X|Y) + I_P(X;Y)$; so that if $\vec{H}_P(X|Y)$ and $\vec{I}_P(X;Y)$ are interpreted as orthogonal vectors of length $|\vec{H}_P(X|Y)| = \sqrt{H_P(X|Y)}$ and $|\vec{I}_P(X;Y)| = \sqrt{I_P(X;Y)}$, then the vector summation of these corresponds to a vector $\vec{H}_P(X)$ of length $\sqrt{H_P(X)}$ using Pythagoras' theorem. This is depicted in Figure 2.14 and the full vector construction in three dimensions for two random variables is shown in Figure 2.15. Each vector shown is constructed intuitively in this diagram, but captures exactly the relationships embodied in the Venn or interval diagram methods. In addition a new quantity is explicitly generated in this interpretation, denoted $L_P(X,Y)$ here, which is the only diagonal of the prism unidentified in other approaches.

Furthermore, the limits of Equations 2.131–2.134 are also implied as the prism's edge lengths change. Once again, the basic relationships still hold for the information terms for an individual outcome and the diagram is unchanged for $i_P(x;y) \ge 0$. When this quantity becomes negative, the diagram can be re-interpreted in two different ways: the length of $\vec{i}_P(x;y)$ could be imaginary; or the placements of the vectors $\vec{h}_P(\cdot)$ and $\vec{h}_P(\cdot|\cdot)$ can be interchanged. In this case this involves making the edge vectors correspond to $h_P(x)$, $h_P(y)$ and $i_P(x;y)$ instead of $h_P(x|y)$, $h_P(y|x)$ and $i_P(x;y)$. Both of these approaches preserve the relationships and limits of all quantities involved.

Figure 2.15: The full three-dimensional vector diagram for two random variables x and y. The edge vectors are $\vec{H}_P(X|Y)$, $\vec{H}_P(Y|X)$ and $\vec{I}_P(X;Y)$, with lengths equal to the square root of the associated measure as in Figure 2.14. The body diagonal is readily recognised as having length $\sqrt{H_P(X,Y)}$ and two of the face diagonals have length $\sqrt{H_P(X)}$ and $\sqrt{H_P(Y)}$. The remaining face diagonal, denoted $L_P(X,Y)$, is usually unidentified, but will be introduced formally in Section 2.8.2 as a distance measure.

Thus, the vector-space interpretation emphasises the importance of the quantities which
are ‘orthogonal’ and those which are defined by vector sums of these. This interpretation
becomes difficult to draw with more than two random variables; for example a case involv-
ing three random variables gives rise to seven quantities (depending on one variable, two
variables or all three in turn). The representation remains consistent and applicable, even
if it is not visualisable. Fundamentally, this approach identifies each vector as a different
measure applied to the same set, or more correctly here, the same measure applied to trans-
formed versions of the same set. This is in direct agreement with the detailed analysis of
the Venn diagram approach of the previous Section, and avoids the misleading nature of
the Venn diagrams in the representation.
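The Pythagorean relationships embodied in Figures 2.14 and 2.15 are straightforward to check for any discrete joint distribution. The following sketch (not part of the original text; it assumes NumPy and uses an arbitrary illustrative joint) treats H(X|Y), H(Y|X) and I(X;Y) as squared edge lengths and recovers the face and body diagonals, including the additional diagonal L_P(X,Y).

```python
import numpy as np

def H(p):
    """Shannon entropy, in nats, of a discrete (joint) distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# An arbitrary, purely illustrative discrete joint distribution P(x, y).
Pxy = np.array([[0.30, 0.10],
                [0.05, 0.55]])

Hx, Hy, Hxy = H(Pxy.sum(axis=1)), H(Pxy.sum(axis=0)), H(Pxy)
H_x_given_y = Hxy - Hy
H_y_given_x = Hxy - Hx
I_xy = Hx + Hy - Hxy

# Edge vectors have lengths sqrt(H(X|Y)), sqrt(H(Y|X)) and sqrt(I(X;Y));
# Pythagoras then recovers the diagonals of the prism in Figure 2.15.
assert np.isclose(H_x_given_y + I_xy, Hx)                  # |H_P(X)|^2
assert np.isclose(H_y_given_x + I_xy, Hy)                  # |H_P(Y)|^2
assert np.isclose(H_x_given_y + H_y_given_x + I_xy, Hxy)   # |H_P(X,Y)|^2
L_xy = H_x_given_y + H_y_given_x                           # remaining diagonal
print(f"L_P(X,Y) = {L_xy:.4f} nats")
```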

2.7 Complete Enumeration of Information Measures

To this point the development of the information theoretic quantities when multiple random variables are considered has been introduced piecemeal; in this section, the formal measure theory and the relationship between the mutual information and entropy will be used to

completely enumerate all quantities generated by a single measure for multiple random
variables with highly unexpected implications.

Recall from the previous Section that, when considering two random variables x and y, the joint entropy can be divided into three contributions: $h_P(x|y)$, $h_P(y|x)$ and $i_P(x;y)$. In that context the quantities can be represented by orthogonal vectors and
all other measures are obtained by summing together any two or all three of these ‘basis’
vectors. Recall also from the discussion of Venn diagrams that this approach can also be
considered a direct representation of the relationships between the possible measures and
the same three elementary, disjoint (and hence independently modifiable) measures.

It was contended that when the joint distribution is defined over three random variables, x,
y and z, this gives rise to seven independent measures. Figure 2.9 shows a Venn diagram
for three random variables and clearly identifies seven disjoint regions:

$H(x|y,z) = E_{xyz}\left\{\log_b \frac{1}{P(x|y,z)}\right\}$   (2.139)

$H(y|x,z) = E_{xyz}\left\{\log_b \frac{1}{P(y|x,z)}\right\}$   (2.140)

$H(z|x,y) = E_{xyz}\left\{\log_b \frac{1}{P(z|x,y)}\right\}$   (2.141)

$I(x;y|z) = E_{xyz}\left\{\log_b \frac{P(x,y|z)}{P(x|z)\,P(y|z)}\right\}$   (2.142)

$I(x;z|y) = E_{xyz}\left\{\log_b \frac{P(x,z|y)}{P(x|y)\,P(z|y)}\right\}$   (2.143)

$I(y;z|x) = E_{xyz}\left\{\log_b \frac{P(y,z|x)}{P(y|x)\,P(z|x)}\right\}$   (2.144)

$I(x;y;z) = E_{xyz}\left\{\log_b \frac{P(x,y)\,P(y,z)\,P(x,z)}{P(x)\,P(y)\,P(z)\,P(x,y,z)}\right\}.$   (2.145)

As before in the case of two random variables, these values all involve the entire set of
observations and so correspond to seven different measures applied to the set {x, y, z}.

In the discrete case these seven measures can all be expressed as a single measure ap-
plied to the seven ‘base’ distributions expressible for a joint distribution PABC (x, y, z).
Furthermore, it is contended that this measure corresponds to the three-term mutual
information of Equation 2.145. Several important identities are required in this con-
text. Firstly, it is clear that Equation 2.145 is already in the correct form. Equations
2.142 - 2.144, however, are expressed in terms of two-term conditional joint distributions,

$P_{AB}(x,y|z)$ for example. If this distribution is augmented to generate the new distribution $P_{ABB}(x,y,y'|z) = P_{AB}(x,y|z)\,\delta_D(y-y')$, where $\delta_D(\cdot)$ is the Dirac delta function, then the expression can be written as

$i_{P_{ABB}}(x;y;y'|z) = \log_b \frac{P_{AB}(x,y|z)\,P_{AB}(x,y'|z)\,P_{BB}(y,y'|z)}{P_A(x|z)\,P_B(y|z)\,P_B(y'|z)\,P_{ABB}(x,y,y'|z)}$
$\qquad\qquad\quad\;\; = \log_b \frac{P_{AB}(x,y|z)}{P_A(x|z)\,P_B(y|z)}$   (2.146)
$\qquad\qquad\quad\;\; = i_{P_{AB}}(x;y|z)$   (2.147)

where the fact that $P_{BB}(y,y'|z) = P_B(y|z)\,\delta_D(y-y')$ has been used. Also note that when $y \neq y'$, the expression becomes $\log_b 1 = 0$ and makes no contribution to any sum involving these expressions, allowing us to replace all remaining instances of $y'$ with $y$ to get the desired result. Similar results hold for Equations 2.143 and 2.144.

Next, consider the measure applied to $P_{AAA}(x,x',x''|y,z)$, defined to be equal to $P_A(x|y,z)\,\delta_D(x-x')\,\delta_D(x-x'')$,

$i_{P_{AAA}}(x;x';x''|y,z) = \log_b \frac{P_{AA}(x,x'|y,z)\,P_{AA}(x',x''|y,z)\,P_{AA}(x,x''|y,z)}{P_A(x|y,z)\,P_A(x'|y,z)\,P_A(x''|y,z)\,P_{AAA}(x,x',x''|y,z)}$
$\qquad\qquad\qquad\;\; = \log_b \frac{1}{P_A(x|y,z)}$   (2.148)
$\qquad\qquad\qquad\;\; = h_{P_A}(x|y,z)$   (2.149)

using similar Dirac delta relationships to those in the previous case. Thus, each of the seven measures again corresponds to the same measure $i(\cdot;\cdot;\cdot)$ applied to the distribution $P_{ABC}(x,y,z)$ in the basic forms of $P_A(x|y,z)$, $P_{AB}(x,y|z)$ and $P_{ABC}(x,y,z)$.

Figure 2.16: A Venn diagram depiction of the seven ‘base’ measures available for a situation involving three random variables. Each region is labelled according to the measure it represents when evaluated on the entire set of outcomes (x, y, z).

Now, in general, assume that there are n random variables, $(x_1, \ldots, x_n)$, and that the joint distribution $P(x_1, \ldots, x_n)$ is known. Conditional probabilities of the form of the previous example can be generated and written as (where the number following each is the number of different distributions of that class):

$P(x_1 \,|\, x_2, \ldots, x_n)$ :  $^nC_1$
$P(x_1, x_2 \,|\, x_3, \ldots, x_n)$ :  $^nC_2$
$\vdots$
$P(x_1, \ldots, x_k \,|\, x_{k+1}, \ldots, x_n)$ :  $^nC_k$
$\vdots$
$P(x_1, \ldots, x_n)$ :  $^nC_n = 1.$   (2.150)

This clearly corresponds to the three distributions for n = 2 and the seven distributions of
n = 3. Inspection reveals that n = 4 gives rise to fifteen base distributions, and so on.

Unfortunately, the identification of these independent quantities as the base measures avail-
able for constructing all possible information theoretic measures gives rise to many more
quantities than were expected. Consider the case involving two random variables: here the
full construction revealed a seventh measure, labelled L in Section 2.6. In this case, all
seven possible measures have now been enumerated from all possible constructions of the
three base measures. In the case of three random variables, however, there are seven base
measures, which can be combined in

$^7C_1 + {}^7C_2 + \cdots + {}^7C_7 = 127$   (2.151)

different ways to generate 127 different measures. But which, if any, of these possible
measures correspond to useful quantities and to what do the others correspond?
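The counting argument is easy to mechanise. The sketch below (not part of the original text; plain Python) enumerates the base conditional distributions of Equation 2.150 for a given n and counts the number of non-empty combinations of the corresponding base measures, recovering 7 combined measures for two variables and 127 for three.

```python
from itertools import combinations
from math import comb

def base_distributions(n):
    """All 'base' conditional distributions P(S | complement) for n variables.

    Each non-empty subset S of {x_1, ..., x_n} gives one distribution of the
    form in Equation 2.150; there are 2**n - 1 of them.
    """
    variables = [f"x{i + 1}" for i in range(n)]
    forms = []
    for k in range(1, n + 1):
        for subset in combinations(variables, k):
            rest = [v for v in variables if v not in subset]
            lhs = ", ".join(subset)
            forms.append(f"P({lhs} | {', '.join(rest)})" if rest else f"P({lhs})")
    return forms

for n in (2, 3, 4):
    n_bases = len(base_distributions(n))     # 3, 7, 15, ...
    n_combined = 2 ** n_bases - 1             # 7, 127, 32767, ...
    # the base count is the sum of nCk over k = 1..n, as in Equation 2.151
    assert n_bases == sum(comb(n, k) for k in range(1, n + 1))
    print(f"n = {n}: {n_bases} base measures, {n_combined} combined measures")
```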

The construction involving seven dimensions is inconvenient to represent, but for three ran-
dom variables, the Venn diagram is a concise representation, provided that it is interpreted
appropriately. Figure 2.16 shows the base measures corresponding to the seven regions of
the diagram and Figure 2.17 contains the 37 possible configurations of these summations,
depending on how many and which areas are combined. These constructions arise from se-
lecting the given number of regions, say 2 for class (b), and selecting these from among the
three different types of base measure. As noted in Equation 2.150 these three classes are:
the three outside regions corresponding to P (x|y, z), P (y|x, z) and P (z|x, y); the next three
intermediate regions corresponding to P (x, y|z), P (x, z|y) and P (y, z|x); and the central
region corresponding to P (x, y, z). For example, in class (b) it is possible to select: two of
the first type or outside regions, (b1 ); one outside region and one of the intermediate regions,
(b2 ); one outside region and the central region, (b3 ); two of the intermediate regions, (b4 );
and one intermediate and the central regions, (b5 ).

It is remarkable that the same construction which generates the normal measures such as
I(X; Y ; Z) or H(X, Y |Z) should actually give rise to so many other measures. Table 2.4
contains a list of these 37 regions and to which common measure (if any) they correspond.
Note that the measures lSI (.) correspond to the quantity L from earlier and will be dis-
cussed further in Section 2.8.2. Note that 9 of the 12 listed measures correspond to the
measures ordinarily used in information theoretic analysis, and that some conjugate pairs
exist including (c1 , d1 ) corresponding to H(Z) and H(X, Y |Z), and (b5 , e5 ) where only b5
is an identified measure.

While it is conceded that many of these possible measures will be unsuitable for practical
application, the ability to systematically generate the complete set of information theoretic
quantities is advantageous. The identification of the base measures, expressed as the or-
thogonal basis vectors of the vector construction or Venn diagrams of the previous section,
serves to clarify the nature of those representations and to justify the construction of the
remaining measures.

Figure 2.17: A pictorial representation of the 37 classes of ‘information-theoretic’ measures available for three random variables. The classes are grouped according to the number of regions combined, from 1 in the top group (a) to all 7 in the bottom group (g), and within each group the classes are numbered left to right, top to bottom. The colours group conjugate classes; for example, classes b1 and e1 are conjugate. In each case conjugate pairs have the same numerical index.

Measure     Example Quantity
a1          H(X|Y,Z)
a2          I(X;Y|Z)
a3          I(X;Y;Z)
b1          lSI(X,Y|Z)
b2          H(X|Z)
b3, b4      -
b5          I(X;Y)
c1          H(X,Y|Z)
c2, c3      -
c4          lSI(X,Y,Z)
c5-c10      -
d1          H(X)
d2-d4       -
d5          lSI(X,Y)
d6-d10      -
e1-e5       -
f1          H(X,Y)
f2, f3      -
g1          H(X,Y,Z)

Table 2.4: Partial association of the 37 classes of measure for three random variables with some known quantities, including mutual information, entropy and lSI, introduced in Section 2.8.2.

2.8 Deviations using Entropy and Mutual Information

As noted earlier, the main goal of the measures used in this work is to quantitatively compare different vectors, distributions and functions. Ideally, measures which satisfy the distance axioms of Equations 2.4, 2.5, 2.6 and 2.7 will support this most effectively, for the detailed reasons discussed in Section 2.3. For a deviation measure $\delta(S_i, S_j)$ these axioms are:

$\delta(S_i, S_j) \ge 0 \quad \forall\, S_i, S_j \in \Sigma$   (2.4)
$\delta(S_i, S_j) = 0 \iff S_i = S_j$   (2.5)
$\delta(S_1, S_2) = \delta(S_2, S_1)$   (2.6)
$\delta(S_1, S_2) \le \delta(S_1, S_3) + \delta(S_3, S_2)$   (2.7)

where deviations satisfying Axioms 2.4 and 2.5 are also called divergences, $D(S_i \,\|\, S_j)$; and those also satisfying all four are distances, $d(S_i, S_j)$.

2.8.1 Entropy and Mutual Information

Consider Entropy first as a comparative measure, that is, define a deviation δ(X, Y ) =
|H(X) − H(Y )| which clearly satisfies the first axiom. Now, while the zero deviation is
measured when the distributions are identical, it is also possible for two distributions to
have the same entropy without being identical, so the second axiom is not satisfied. This
measure is similar in this way to the $L_2$ norm of a function, defined by $L_2(f) = \int f^2(x)\,dx$, which can be made the same for dissimilar functions f(x) and g(x). For example, if the functions are defined over the range $x \in [0,1] \subset \mathbb{R}$ and if $f(x) = x$ and $g(x) = \sqrt{\tfrac{1}{2} - \tfrac{1}{3}x}$, then the evaluated $L_2$ norm will be identical, even though the functions are starkly different.
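A quick numerical check of this claim (not part of the original text; it assumes NumPy) is given below; both integrals evaluate to approximately 1/3.

```python
import numpy as np

# Two clearly different functions on [0, 1] with the same L2 norm:
#   f(x) = x               -> integral of f^2 over [0, 1] is 1/3
#   g(x) = sqrt(1/2 - x/3) -> integral of g^2 is 1/2 - 1/6 = 1/3
x = np.linspace(0.0, 1.0, 200_001)
f = x
g = np.sqrt(0.5 - x / 3.0)

dx = x[1] - x[0]
L2_f = np.sum(f ** 2) * dx   # simple Riemann sum, ~1/3
L2_g = np.sum(g ** 2) * dx   # also ~1/3
print(L2_f, L2_g)
```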

Mutual Information, however, measures the statistical proximity of two functions. It is defined by Equation 2.110 as

$I_P(X;Y) = E_{xy}\{i_P(x;y)\},$   (2.110)

that is,

$I_P(X;Y) = E_{xy}\left\{\log_b \frac{P(x,y)}{P(x)\,P(y)}\right\}.$   (2.152)

As noted in Section 2.5.5 the mutual information is non-negative and so satisfies the first axiom. The second axiom requires careful consideration: since this is a measure of the statistical relationships between the random variables, identity is achieved for any situation in which there is a bijection between the random variables, that is, f: x ↔ y. In this case, knowledge of either x or y (and the function f) is sufficient to uniquely specify the other random variable. Letting x = f(y) be the mapping, it is possible to rewrite the joint distribution as22

$P(x,y) = P(x)\,\delta_D[x - f(y)]$   (2.153)

which follows because if $x \neq f(y)$ then the joint event (x, y) must have zero probability. This is guaranteed by the Dirac delta function $\delta_D(x)$, which is defined appropriately for discrete and continuous variables x as: zero when $x \neq 0$; and either $\delta_D(x_i) = 1$ or $\int \delta_D(x)\,dx = 1$ respectively when $x = 0$. The marginal distribution23 is obtained from this joint as

$P_A(x_i) = \sum_j P_{AB}(x_i, y_j) = \sum_j P_A(x_i)\,\delta_D[x_i - f(y_j)] = P_A(x_i)$   (2.154)

since the delta function is zero for all yj except when xi = f (yj ) when it is unity. Likewise,
the marginal distribution for y is

$P_B(y_j) = \sum_i P_A(x_i)\,\delta_D[x_i - f(y_j)] = P_A[f(y_j)]$   (2.155)
$\qquad\;\; = P_A(x_i) \quad \text{when } x_i = f(y_j)$   (2.156)

where the last line holds because of the mapping. The resulting mutual information is given

22 Note that this equation is a generalisation of the simple relationship exploited in Section 2.6 to demonstrate that entropy and mutual information are closely linked.
23 For the discrete case, with the continuous obtained by replacing the summations with appropriate integrals.

by

$I_{AB}(X;Y) = \sum_i P_A(x_i) \sum_j \delta_D[x_i - f(y_j)]\,\log_b \frac{P_A(x_i)\,\delta_D[x_i - f(y_j)]}{P_A(x_i)\,P_B(y_j)}$
$\qquad\qquad\; = \sum_i P_A(x_i)\,\log_b \frac{P_A(x_i)\,\delta_D(0)}{P_A(x_i)\,P_A(x_i)}$

and in the discrete case the result yields $I_{P_{AB}}(X;Y) = H_{P_A}(X) \neq 0$ in general; alternatively, in the continuous case the result is $I_{P_{XY}}(X;Y) \to \infty$. In neither case does the measure satisfy Equation 2.5. In fact, when the bijection exists, this corresponds to a maximum value for $I_P(X,Y)$.

However, if the two random variables are statistically independent, then the resulting joint distribution is P(x, y) = P(x)P(y) and Equation 2.152 becomes

$I_P(X;Y) = E_{xy}\left\{\log_b \frac{P(x)P(y)}{P(x)P(y)}\right\} = E_{xy}\{\log_b 1\} = 0$   (2.157)

and the MI is seen to satisfy the second axiom, but only if the measure is considered as a proximity, rather than a deviation, in which case the roles of dependence and independence are reversed.

The third axiom is obvious from inspection of Equation 2.152 where interchanging x and
y makes no change to the resulting measure. Finally, consider three random variables x, y
and z, where y and z, and x and z are statistically independent pairs, but where x and y
are not independent. Here, I(Y ; Z) + I(X; Z) = 0, but I(X; Y ) ≥ 0 so that the triangle
inequality of Equation 2.7 does not hold.24

24 See Example C.6.1.
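A concrete numerical illustration of this failure (analogous in spirit to the appendix example cited, though constructed here and not part of the original text; it assumes NumPy) takes x and y to be identical fair bits and z an independent fair bit, so that I(X;Z) = I(Y;Z) = 0 while I(X;Y) = 1 bit.

```python
import numpy as np

def H_bits(p):
    """Shannon entropy, in bits, of a discrete (joint) distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# x and y are identical fair bits; z is an independent fair bit.
P = np.zeros((2, 2, 2))
for x in (0, 1):
    for z in (0, 1):
        P[x, x, z] = 0.25

Hx  = H_bits(P.sum(axis=(1, 2)))
Hy  = H_bits(P.sum(axis=(0, 2)))
Hz  = H_bits(P.sum(axis=(0, 1)))
Hxy = H_bits(P.sum(axis=2))
Hxz = H_bits(P.sum(axis=1))
Hyz = H_bits(P.sum(axis=0))

I_xy = Hx + Hy - Hxy   # 1 bit
I_xz = Hx + Hz - Hxz   # 0
I_yz = Hy + Hz - Hyz   # 0
print(I_xy, I_xz, I_yz)        # 1.0 0.0 0.0
print(I_xy <= I_xz + I_yz)     # False: the triangle inequality fails
```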

Entropy does not satisfy Axiom 2.5 which implies the existence of a unique minimum
corresponding to ‘equality’, taken here to mean totally statistically dependent, so that
it is not advantageous to use this quantity as a measure of the deviation between two
random variables.

Mutual information, however, can be used as a proximity measure, noting that statis-
tical independence corresponds to a zero measure and therefore maximisation of the
quantity will increase the statistical dependence between the random variables. This
quantity is not continuous as it does not satisfy the triangle inequality.

2.8.2 Statistical Independence Distance

Section 2.6 introduced a new interpretation of the relationships between the most common
information-theoretic quantities as a vector construction. In this interpretation, the three
independent quantities HP (X|Y ), HP (Y |X) and IP (X; Y ) are represented as orthogonal
vectors and the vector sums based on these give rise to all other quantities. An additional
quantity was identified from this construction, denoted here as LP (X, Y ). Its length is given
by
$|\vec{L}_P(X,Y)|^2 = L_P(X,Y) = H_P(X|Y) + H_P(Y|X)$   (2.158)

using Pythagoras’ theorem on the vector construction of Figure 2.15.

From Equation 2.158 it can be seen that this quantity is also represented in the interval and
Venn diagram approaches of Figures 2.10 and 2.12. In the interval diagram it corresponds
to the summed lengths of the two outside intervals of the bottom line, and in the Venn
diagram the sum of the two areas where the sets have no overlap. Section 2.6 asserted
that the event (x, y) gives rise to the joint event entropy hP (x, y) through three mutually
independent contributions: that due only to the variations exhibited in the random variable
x, hP (x|y); that due to variations in y, hP (y|x); and those common to both variables,
iP (x; y). In this way, the quantity LP (X, Y ) corresponds to the expected value of the
individual contributions and therefore can be interpreted as the entropic measure of the
statistically independent behaviour of the two random variables. This, with Figure 2.9,

suggests an extension of this quantity to three random variables as

$L_P(X,Y,Z) = H_P(X|Y,Z) + H_P(Y|X,Z) + H_P(Z|X,Y).$   (2.159)

The quantity LP is not unknown in the literature, though it is normally considered only as
a curiosity or as an exercise in proving inequalities (Mackay 2004, p140) (Cover & Thomas
1991, pp45-6). Since this quantity represents an entropic distribution measure, as noted
elsewhere, this can also be expressed as an expectation,

$L_P(X,Y) = E_{xy}\left\{\log_b \frac{1}{P(x|y)\,P(y|x)}\right\}$   (2.160)
$\qquad\quad\;\; = E_{xy}\{l_P(x,y)\}$   (2.161)

which defines the ‘event’ quantity lP (x, y). As with the total measure, the event quantity
corresponds to the contributions of the event (x, y) which belong only to the independent
parts of the variables. The exercises mentioned above purportedly show that the quantity
satisfies all of the axioms of distances (Equations 2.4-2.7) and thus corresponds to a true
distance measure. This is true, however, only for discrete cases:

Consider first Axiom 2.4, δ(Si , Sj ) ≥ 0. In the discrete case the event entropies hPA (xi |yj )
and hPB (yj |xi ) are known to be positive since the probabilities must lie in the range [0, 1] ⊂
R; the resulting measure lPAB (xi , yj ) will also be positive, as will the distribution average,
and the axiom is satisfied. However, in the case of continuous variables, Section 2.5 showed
that all entropies can be negative quantities, so that the first axiom does not hold true for
all continuous systems.

When considering the Axiom 2.5 it must be noted that, since this is a statistical measure,
the notion of equality is modified to correspond to the occurrence of a bijection between

the random variables, f: x ↔ y. Rewriting the measure in the discrete case25 as

$L_{P_{AB}}(X,Y) = \sum_i \sum_j P(x_i,y_j)\,\log_b \frac{1}{P(x_i|y_j)\,P(y_j|x_i)}$
$\qquad\qquad\; = \sum_i \sum_j P(x_i,y_j)\,\log_b \frac{P(x_i)\,P(y_j)}{P^2(x_i,y_j)}$
$\qquad\qquad\; = \sum_i P(x_i) \sum_j \delta_D[x_i - f(y_j)]\,\log_b \frac{P(x_i)\,P[f(y_j)]}{P^2(x_i)\,\delta_D^2[x_i - f(y_j)]}$   (2.162)

where the last line has substituted Equations 2.153, 2.154 and 2.155. For the case where the bijection exists, this reduces to

$L_{P_{AB}}(X,Y) = \sum_i P_A(x_i)\,\log_b \frac{P_A^2(x_i)}{P_A^2(x_i)} = \sum_i P_A(x_i)\,\log_b(1) = 0$   (2.163)
where the limit $\lim_{x \to 0^+} x \log_b \frac{1}{x} = \lim_{x \to 0^+}\{-x \log_b x\} = 0$ has been used. Again, when the continuous case is considered, the axiom does not hold; in this case the measure becomes

$L_{P_{XY}}(X,Y) = \int P_{XY}(x)\,\log_b \frac{1}{\delta_D^2(0)}\,dx = -\infty.$   (2.164)

Axiom 2.6, δ(Si , Sj ) = δ(Sj , Si ), can be readily verified for both discrete and continuous
systems by inspection. Finally, the triangle inequality is intimately related to the continuity
of the measure and Equations 2.163 and 2.164 imply that as x and y approach one another,
the measure between them approaches 0+ and −∞ in the discrete and continuous cases
respectively. If z is sufficiently far from both x and hence y so that LP (X, Z) and LP (Y, Z)
are both positive26 , then LP (X, Y ) + LP (Y, Z) ≥ LP (X, Z) in the discrete case only and
this is violated in the continuous case.

Therefore, the quantity LP (X, Y ) corresponds to a legitimate distance measure in the case
of discrete random variables, and will be denoted by DSI (X, Y ) and called the statistical
independence distance because it indicates the measure of the joint distribution which has
25 Once more, the continuous case can be obtained by replacing the summations by the relevant integral.
26 Actually they need only be finite.

no statistical dependence. When the joint distribution PAB (x, y) is available, the statistical
independence distance is a candidate distance measure for comparing two distributions.
Since it depends only on the statistical relationships between the two random variables it
is a purely statistical measure of the distance between two distributions. It will have a
minimum value when one of the two random variables maps to a subset of the other and
will be zero when this mapping is a bijection. Its maximum value corresponds to the case
where the distributions are statistically independent and in that case will be equal to the
sum of the marginal entropies.
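These limiting behaviours are easily verified for discrete joints. The following sketch (not part of the original text; it assumes NumPy) evaluates $D_{SI}(X,Y) = H(X|Y) + H(Y|X)$ for a bijective joint, which yields zero, and for a statistically independent joint, which yields the sum of the marginal entropies.

```python
import numpy as np

def H(p):
    """Shannon entropy, in nats, of a discrete (joint) distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def D_SI(Pxy):
    """Statistical independence distance H(X|Y) + H(Y|X) of a discrete joint."""
    Hx, Hy, Hxy = H(Pxy.sum(axis=1)), H(Pxy.sum(axis=0)), H(Pxy)
    return (Hxy - Hy) + (Hxy - Hx)

px = np.array([0.2, 0.3, 0.5])
py = np.array([0.6, 0.4])

# Bijection between x and y: the joint is diagonal and D_SI is zero.
P_bijection = np.diag(px)
# Statistical independence: D_SI equals H(X) + H(Y).
P_independent = np.outer(px, py)

print(D_SI(P_bijection))                    # 0.0
print(D_SI(P_independent), H(px) + H(py))   # numerically equal values
```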

If the distribution is continuous, however, the quantity is not a distance, though it can
be considered a pseudo-divergence for the purposes of evaluating the degree of agreement
between a signal and several candidate functions. In this case the quantity to be minimised
can be negative and a ‘perfect’ match occurs when there is a bijection between the distribu-
tions and will correspond to a measure of −∞. The lack of a triangle-inequality, however,
means that the measure is not continuous and it will not be possible to determine a gradient
of the measure, as may be required in an optimisation framework.

The quantity $L_P(X,Y) = H_P(X|Y) + H_P(Y|X)$ can be considered a statistical independence deviation for the distributions X and Y. In the discrete case it satisfies all axioms and is a valid distance measure, denoted $D_{SI}(X,Y)$, and can be determined whenever the joint distribution is known.

In the continuous case, however, this measure: can be negative; has the concept of
‘equality’ corresponding to a value of −∞; and does not satisfy the triangle inequality.

2.9 Conclusions

This chapter has introduced a comprehensive theoretical framework for the quantification of
the notion of the ‘deviation’ between mathematical entities including vectors, functions and
distributions. A measure was introduced as a mapping from the elements of an abstract set
to the real line and the nature of that particular mapping with respect to some particular
task was seen to be the sole method for determining the efficacy of any measure. These
abstract notions were then extended to encompass sets for which coordinate systems could

be defined, enabling measures to be defined over arbitrary spaces. The further extension
of the measure theory to encompass inner-products, distances and norms, and application
to systems of infinite dimensionality, provides a sufficient framework for the introduction of
most commonly used measures.

The identification of information-theoretic quantities as basic measures provides a unified
interpretation of probabilities, entropy, mutual information and ‘traditional’ measures such
as the L1 norm or Euclidian distance. The entropy and mutual information measures
were derived from the basic properties of probability measures and their particular forms
shown to be highly flexible and exceptionally useful in the consideration of probabilistic
systems. The relationships between these particular measures were demonstrated to be
much more intricate than often appreciated and a new interpretation of the Venn diagram
was required to rigourously capture these characteristics. Furthermore, a completely new
approach to the representation of the information theoretic quantities was introduced in
the form of a vector diagram and this new approach immediately suggests the existence of
many alternative information-theoretic measures. In particular, the statistical information
distance was shown to be much more significant than ‘an exercise in proving inequalities’ and
should be considered on equal footing with the entropy and mutual information themselves.

It is readily apparent that traditional measures such as the Euclidian distance can be applied directly to measure deviations, though it is less clear how the information-theoretic quantities can be used in a similar manner.
tical information distance, in particular, were shown to have useful characteristics in this
context. Finally, several commonly-used measures were discussed and their distinguishing
properties examined. These various measures are a small subset of the inexhaustible set of
available measures, though their particular characteristics have resulted in their use in a
wide range of applications.
Chapter 3

Sensing and Representation

3.1 Introduction

This chapter examines the detailed structure of the first phase of the high level model
introduced in Chapter 1, in which the raw data from multiple sensors is brought together
to form the summary of information, that is, the data gathering operations.

Implicit in this approach is the parallelisation of the treatment of each individual sensor
and the storage of the information in a sensor-centric form. The critical distinction between
task-centric and sensor-centric representations is drawn, and shown to result directly from
the availability or otherwise of knowledge about the relative meaning of the cues with respect
to any particular task. It is shown that unless all necessary tasks are known a priori it is not
possible to determine, in advance, the exact minimal combination of sensory cues necessary
to support the system goals. Furthermore, sensor-centric representations give rise to the
concept of the ‘sensory space’ which encodes the parallel nature of the sensing process and
enables the clear identification of the relationships between the individual sensors. The
identification of this space is important in that it guides the selection and development of
the representational form used in the summary of information part of the model. It is also
demonstrated that several important deviations from the ideal case can actually be treated
using this new framework and allowing the definition of a ‘pseudo-sensor’.

It is obvious that the information provided by a sensor-centric approach is significantly more complex than traditional approaches; not only is the data to be maintained in a minimally-processed form, but a greater quantity of information must be maintained and represented
simultaneously. The difficulties involved in representing this data are shown to correspond to
the application of a function estimation operation in the sensor-space. While this approach
formally implies that the ideal case is intractable, it gives rise immediately to a general
formulation of approximate solutions and, simultaneously, the ability to critically assess the
impact of those approximations. The method will be shown to transform the sensor-space
function estimation problem into a parameter estimation problem on a transformed space
and formalises well-known effects such as the ‘data-association problem’ and the impact of
non-trivial relationships between the parameters, such as those handled by Markov random
field theories. These two effects, in particular, give rise to a formal partitioning of the
parameter space and it is demonstrated how such partitions can be used in the development
of practical systems.

Finally, in addition to examining the structure of the data gathering process and highlighting
the importance of both the judicious selection of the state-functions to be represented
and the construction of realistic and meaningful sensor models, this chapter examines the
application of the measure-theoretic quantities of Chapter 2 to the three important areas
identified in Section 1.4. In particular, the effects of common approximations are examined
and, where possible, quantified. Additionally, the information-preserving nature of sensor
models will be quantified and the contribution of an individual sensor to the summary of
information will also be considered.

3.2 Data Gathering

The data gathering phase of the model presented in this thesis is reproduced in Figure 3.1,
which implies that these operations act to combine multiple, possibly independent, sensory
streams to generate a consistent ‘summary of information’ or ‘state’. The recognition that
sensory data is supplied by different sensors suggests the parallelism implied by this figure;
however, this parallelism does not require that the actual sensory streams be independent.
Specifically, reliability and other practical considerations suggest that utilising redundant
and complementary sensing capabilities is highly desirable. The explicit utilisation of formal
sensor models directly encodes both the physical characteristics of a single sensor and the
relationships between them through their common interaction and dependence on the form
and character of the properties represented in the state.

Figure 3.1: The data gathering part of the model showing how data from each sensor is transformed using a sensor model to obtain a likelihood function used to update the state (or summary of information). Each sensory cue is assumed to interact with only a subset of the dimensions of the state and the interactions between sensors are encoded in the relationship of their sensor models to the underlying state functions.

This figure highlights the relationships between the selection of: the sensory cues to be
measured; the processing to be performed on those data; and the choice of representation.
Importantly, unlike many traditional approaches to the development of perceptual systems,
it is obvious that all of these decisions must be made together. For example, the design of
a typical military tracking system would begin with the assumption that the state to be
estimated represents a point-based object location, perhaps with the addition of ‘automatic
friend-or-foe identification’, and the sensing and processing of that sensory data follows
from this decision (Blackman 1986). Similarly, several approaches to ‘rich’ environmental
representations begin with the selection of the sensory cues and, given this selection, match
the processing and representation to fit that decision (Grover et al. 2002). The model
proposed here, however, intimately links the three decisions: the selection of the sensory
cues determines what it is possible to measure, but computational and storage limitations
affect what is practical and the tasks at hand determine what is necessary. It is only
through the appropriate balance of the possible, the practical and the necessary together,

that a reliable and flexible system can be constructed.
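To make the information flow of Figure 3.1 concrete, the following is a deliberately minimal sketch; it is not the implementation developed in this thesis, all sensor models and parameters are invented for illustration, and NumPy is assumed. A one-dimensional discretised ‘summary of information’ is updated multiplicatively by the likelihood produced from each sensor's raw datum through a simple Gaussian sensor model.

```python
import numpy as np

# A one-dimensional discretised 'summary of information': a probability mass
# over candidate values of some state-function (e.g. range to a surface).
state = np.linspace(0.0, 10.0, 501)
belief = np.full(state.shape, 1.0 / state.size)   # uninformative prior

def gaussian_likelihood(z, sigma):
    """A hypothetical sensor model: likelihood of raw datum z over the state."""
    return np.exp(-0.5 * ((state - z) / sigma) ** 2)

def update(belief, likelihood):
    """Fold one sensor's likelihood into the summary of information."""
    posterior = belief * likelihood
    return posterior / posterior.sum()

# Two sensors (treated as independent in this sketch) observing the same state.
belief = update(belief, gaussian_likelihood(z=4.2, sigma=0.8))   # sensor 1
belief = update(belief, gaussian_likelihood(z=4.6, sigma=0.3))   # sensor 2

print(state[np.argmax(belief)])   # the fused estimate peaks near 4.55
```

The sketch only illustrates the update flow; the representational and sensor-model questions discussed in the remainder of this chapter determine how such an update is realised in practice.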

3.2.1 Task and Sensor-Centric Representations

The importance of this consideration of the representation, sensing and sensory processing
together is clearly demonstrated in the distinction between the traditional approach of ‘task-
centric’ representations and the alternative ‘sensor-centric’ formulation. Many approaches
to the development of a perceptual system are of the first kind and analysis begins with the
consideration of the task to be achieved. This task-specificity is often implicit, such as in the
selection of geometric feature models for a robotic navigation problem - a line-based model
is convenient because it is being used for navigation or mapping in a structured environment.
A classic example of this involves the utilisation of natural features in a SLAM problem; it
is well known that natural features do not conform well to the assumptions of the SLAM
algorithm, namely that the tracked entities are point targets with Gaussian uncertainty in
their location. In fact, one of the most challenging aspects of the SLAM problem remains
the utilisation of irregularly-shaped or other structurally complex environmental objects as
features. For example, Chapter 5 of Nieto (2005) examines a possible extension to address
this issue.

In all of these approaches, the task remains central to the selection of the representation,
and hence to the construction and management of the sensor models. In contrast, a sensor-
centric representation seeks to find a representation which matches the information provided
by the sensor, without reference to the tasks which may utilise that information. An
example of this approach involves the storage and manipulation of data from a structurally
complex environment, such as where a representation is selected to effectively and efficiently
represent the data itself. In Brooker et al. (2005) the authors seek to represent data from an
imaging radar directly; operations such as terrain estimation or obstacle detection remain
possible, but are considered to be down-stream operations on the data. Furthermore, their
representation of the volumetric radar data allows interpretations of the density of the
environment. Hard surfaces can be identified readily as they will tend to correspond to
clearly defined returns as a result of their high density. The authors further note that
vegetation, in particular, reflects the electromagnetic radiation from within the structure
of the object as a result of having a low density and these objects can again be recognised
through the consideration of these properties. The most significant aspect of these sensor-

centric approaches is the goal of retaining the information characteristic of the sensors
available, rather than of some specific task to which the sensors are applied.

The selection of a sensor-centric approach for the general case involving autonomous sens-
ing and perception is suggested by the inherent flexibility required in practical and reliable
systems, as discussed in Section 1.2 and Chapter 5. Indeed, the problem of identifying
which characteristics of an environment correspond to a particular sensory observation and,
ultimately, to a given decision or operation, is inherently extremely difficult. It has been
known for many years that the identification of the environment from sensory observations
alone is a challenging inverse problem, as it represents an ill-posed, many-to-one physical
interaction. Indeed, if it were possible to model the physical world-to-sensor interactions
exactly, then it would also be possible to determine, in advance, the exact combination
of sensory measurements which would enable the unambiguous reconstruction of an envi-
ronment from the data alone. However, Barshan & Kuc (1990) showed that without the
addition of external knowledge to the task, the problem remains intractable. For this rea-
son, while systems which can be developed to solve highly-specialised and well-defined sets
of operations remain significant in the robotics community, they should be recognised as
specialisations of a more general and flexible approach.

3.2.2 The Sensory Space

The notions of sensor-centricity from the previous section suggest that, in a general model,
the selection of the state of interest should be based on the nature and characteristics of
the sensors. Prior to the consideration of the representational problem, or the development
of viable sensor models for converting the information from the raw sensory form to match
the selection of the state of interest, it is beneficial to formalise the relationships between
the sensors and the quantities to be represented. Several authors have addressed this issue
previously; in particular, Majumder et al. (2001, pp39-41) introduced the concept of the
‘sensory-map’ in order to represent the existence of partial intersections between the sensor
data from different sensors investigating the same environment. The key notion in that
work was identifying a coordinate system with each sensor, so that an observation would
correspond to a point within that sensor’s axes.

The correspondences between the various sensors can then be interpreted as the relationships
between these sets of axes in a high-dimensional space which combines them, an example
[Figure 3.2 diagram: Sensor 1, Sensor 2, …, Sensor N are each related to the combined Sensory Space by a bijection.]

Figure 3.2: The abstract sensory-space representing the formal relationships between each
individual sensor and the most general representation possible for that series of sensors.
While this figure is similar to Figure 3.1 it represents a guide of what relationships are
possible, while the high-level model represents an implementation drawn from these possi-
bilities.

of which is shown in Figure 3.2. To avoid the misinterpretation of this construction, it will
be referred to as the ‘sensory space’ and should be interpreted as directly representing the
relationships between the sensors and the representation. While the mathematical models
presented in this Chapter can support an ‘arbitrary’ sensor domain, the existence of the
relationships as encompassed in the sensory space suggest that the selections available to
the engineer are limited and that this construction should be used to develop insight into the
‘common’ information or data necessary for the combination of multiple disparate sensor
streams. The similarity between this construction and Figure 3.1 is deliberate, though it
should be recalled that the role of the high-level model is to represent the flow of information
in the system, while the construction here relates to the physical and statistical relationships
between the sensors and a particular summary of information.

Interestingly, the important role of the sensor model in the construction of both the sensory
space and the practical systems embodied in the high-level approach of this thesis represents
a rigourous embodiment of the high- and low-level fusion debate. This debate, as noted
in Grover et al. (2002) relates to the processing of the sensory information to generate
task-specific behaviour before or after the information has been combined, respectively. In
developing sensor models, however, the engineer is free to utilise or discard the additional
information which defines the relationships between the sensors, the effect of which is to
assume that the different sensors are independent or orthogonal in the sensory space. It
becomes possible for the designer to be explicitly aware of the cause of the sub-optimality
and conservativeness of a high-level approach.

The relationships between various sensory cues as represented in Figure 3.2 are usually
considered to arise from either deterministic or statistical dependencies. The discussions
of Chapter 2, however, suggest that the deterministic is simply a special case of the more
general statistical relationship. As an example, a range-bearing sensor and a cartesian sen-
sor are usually considered non-orthogonal as they measure the same property, the location
of an item on a plane. Meanwhile, a laser and a radar both measure the electromagnetic
reflectivity of a scene, though at different frequencies, but can be considered statistically
related through the characteristics of the scene being investigated. Although the two re-
flectivities need not be related, real scenes will favour certain relationships among these
quantities and the data will not be independent when measured.

These dependencies are often interpreted as analogous to the existence of non-orthogonalities
between the bases as defined by the individual sensors when considered in the sensory space.
While this is an accurate concept for a deterministic dependency, the same is not true for
a true statistical one, where the degree to which the data follow the geometrical model is
probabilistic. Provided the geometric analogy is interpreted only as the result corresponding
to the mode (or mean) of the statistical relationship, then this remains a helpful engineer-
ing concept. Throughout the remainder of this thesis, when considering the nature of the
sensory relationships the term orthogonal will be generally avoided, though when it is used
it should be interpreted in this manner with respect to a geometric analogy.

The work of Majumder et al. (2001) utilises the fact that the process of sensor fusion would
make no sense if all sensors existed as independent measurements, so that the dependencies
can be exploited to ‘project’ the observations from a single sensor into the observational
space of a different sensor. While the resulting geometric ‘sensory map’ concept represents
a new interpretation of the relationships between the sensors, the presumption that the
information could be ‘projected’ through this mapping is inherently misleading: just be-
cause two observations from different sensing modalities (radar and vision, for example)
yield non-orthogonal information does not validate transforming the information from one
sensor and obtaining the ‘equivalent’ expected observational characteristic from the other
sensor. Instead, as will be discussed in detail in Section 3.4.2, it is the relationships between
the sensor models of each source which generate the statistical dependencies between the
observed values. Thus, while it remains invalid to attempt to calculate the expected visual
appearance (or a partial representation) of an object when given the information from a
radar observation, it is not true that the observation of the radar information has no ef-
fect on subsequent operations involving the visual system. The resolution of this apparent
paradox lies in the interpretation of the high-dimensional space of the sensory-space, not as
the relationships between the sensors, but instead as the relationships between each sensor
and a representation.

The sensor-space is an abstract construction whereby the physical and statistical rela-
tionships between different sensory sources can be examined. Treating the data from
the ith sensor as a vector in the Ni -dimensional space spanned by Ei = {eij } for
j ∈ [1, Ni ], then the sensor space is spanned by {E1 , . . . , Ek } for k sensor sources.
As the dependencies between the sources can be recognised as analogous to the individ-
ual basis vectors eij being non-orthogonal, then the construction allows the engineer
to assess the dependency structures of the selected sources.
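As a purely illustrative sketch of how the dependency structure implied by the sensory space might be assessed in practice, the following Python fragment (all data synthetic, all names hypothetical) estimates the sample correlation between two simulated sensor axes that observe the same underlying scene property; a correlation near zero would support treating the corresponding basis directions as orthogonal, while a large value indicates that an independence approximation would discard real information.

import numpy as np

# Synthetic illustration only: a laser-like and a radar-like sensor both
# respond to the same underlying scene property, so their axes in the
# sensory space are statistically dependent.
rng = np.random.default_rng(0)
true_range = rng.uniform(5.0, 50.0, size=1000)          # underlying property
laser = true_range + rng.normal(0.0, 0.1, size=1000)    # low-noise modality
radar = true_range + rng.normal(0.0, 1.5, size=1000)    # high-noise modality

# Sample correlation between the two sensor axes.
corr = np.corrcoef(laser, radar)[0, 1]
print(f"sample correlation between sensor axes: {corr:.3f}")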

3.2.3 Sensor Independence

In the previous section it was noted that the sensory-space concept enabled the explicit
consideration of the distinction between a low-level data fusion system and a high-level one.
In fact, this distinction represents an important special sub-optimal case of the high-level
model of Figure 1.3. Treating statistically dependent sensory information as independent
results in an important implication for the downstream operations acting on the summary
of information: each can manage the independent cues separately, combining them only if
that is desired. Figure 3.3 shows the structure which results from this approximation.1 In
this figure the three sensors are treated independently throughout the system and it is the
output of the individually determined reasoning processes which are combined to provide
the ‘fused’ result.
1. This figure shows the case where all the sensors are assumed independent at the data-gathering stage; the case for partial independence is obtained by the obvious compromise between this figure and the high-level model of Chapter 1.
[Figure 3.3 diagram: Sensor 1, Sensor 2 and Sensor 3 each feed their own sensor model, representation and data transformation leading to Applications 1–3; the outputs of these are combined in Application 4.]

Figure 3.3: The independent-sensor high-level model. In this figure, each of the sensors
is assumed to be independent of the others. This means that the information from any
one has no effect on the information from any other and, hence, it is possible to parallelise
the reasoning operations, at least partially. A high-level fusion scheme is represented by
the very right-hand side of the diagram in which the task-oriented information from each
sensory stream is combined as a final operation.

It is possible, therefore, to directly represent the cause for the sub-optimality of a high-level
fusion algorithm and, equivalently, any system where sensory cues are treated as if they
were independent quantities. It should be noted that this approach is especially common in
robotics systems, for example analysis of a GPS-aided SLAM implementation reveals that
the structure is remarkably similar, with the observations of the SLAM landmarks treated
separately from the observations of the GPS satellites. In fact, as discussed in (Scheding
2005), treating both the satellite and landmark observations equivalently represents the
low-level fusion of that data.
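The conservatism of the independent-sensor approximation can be made concrete with a small sketch (illustrative numbers only, assuming a uniform prior): three per-sensor posteriors over a binary state are fused by multiplication and renormalisation, which is precisely the high-level combination shown at the right of Figure 3.3.

import numpy as np

# Each sensor stream has already been reduced to its own posterior over a
# binary state (e.g. occupied / free), as in Figure 3.3.
p_sensor1 = np.array([0.7, 0.3])
p_sensor2 = np.array([0.6, 0.4])
p_sensor3 = np.array([0.8, 0.2])

# Independent-cue fusion: multiply and renormalise (uniform prior assumed).
# Any statistical dependency between the streams is discarded at this point,
# which is the source of the sub-optimality discussed in the text.
fused = p_sensor1 * p_sensor2 * p_sensor3
fused /= fused.sum()
print(fused)        # approximately [0.933, 0.067]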

A final special case arises predominantly in the computer vision community and corresponds
directly to the difficulties associated with the extraction and interpretation of the vast
quantity of information contained in the data. Consider a colour image obtained at a
resolution of 640 × 480 pixels, with three 8-bit channels associated with hue, saturation
and value (HSV). At first inspection, an image from the camera can be interpreted as
a point in a 640 × 480 × 3 = 921600 dimensional space; however such an interpretation
deliberately neglects the spatial arrangement of the pixels. This is a particularly striking
example of the independent-sensor model in which each pixel is treated independently. The
dimensionality of the problem ensures that methodologies utilising this approach are limited
to those for which the task-specific transformations can be constructed off-line, though many
[Figure 3.4 diagram: a single sensor is passed through three separate sensor models into three representations; a single data transformation then combines them for the application.]

Figure 3.4: The common-source independent-sensor model where the information from a
single source (sensor) is transformed separately through three different sensor models and
the subsequent information is treated as independent in the representation. The application-
specific transformation is shown as a single transformation to indicate that in this case all
three representations are used to perform the required task.

provide efficient mechanisms for the interpretation of test images and occasionally the on-
line modification of the transformations (Muller et al. 2001, Tenenbaum 1998, Brand 2002).

An alternative approach to dealing with this incredibly rich data-set is to pre-process the
image to calculate new quantities, usually spatially meaningful, in addition to the colour
information. For example, (Kumar et al. 2005) processed visual data to generate texture
information to augment the HSV data at each pixel location. This approach is very similar
to current theories regarding the early stage visual processing of the mammalian visual
system (Stein & Merideth 1992) and the general approach is represented in Figure 3.4
where the information from a single sensor is passed through three separate sensory models
to construct three independently manipulated representations. The data transformation
shown in this figure represents a single operation which both transforms and combines the
separate cues, c.f. the approach of Figure 3.3 in which the data are first transformed to
obtain three separate application-specific values before they are combined to generate a
fused result.

The two schemes shown in Figures 3.3 and 3.4 can be interpreted as the implementation of
sub-optimal approaches to the utilisation of the relationships implied in the Sensory Space
of Figure 3.2. The assumption that independent cues, whether they originate from the same
data or not, should be treated as if they were orthogonal or independent clearly reduces
the effectiveness of the fusion algorithm. Importantly, however, when the sensory space
relationships are not known or are difficult to determine, then this approach represents a
conservative implementation. Finally, storage of multiple cues as independent quantities
does not imply that each cue must be stored separately. In fact, Chapter 5 describes
a complex data structure designed to represent multiple independent quantities within a
single, computationally efficient structure.

3.3 Functional Representations

This section defines a mathematical framework for the manipulation and analysis of the
sensor-centric representations introduced in Section 3.2. An important characteristic im-
plicit in the high-level model of the thesis and the sensory-space construction of that section
relates to the existence, or otherwise, of the concept of the ‘localisation’ of the sensory data.
In this section, it is assumed that the sensory data represents measurements derived from
some subset of the environment in which the system is operating and that the particular
subset can be identified, at least probabilistically.

Formally, let there exist an arbitrary domain Y over which elements y ∈ Y represent the
‘localisation’ of a particular measurement of a series of m ‘properties’.2 In this context
‘localisation’ refers to the existence of a ‘region’ of the environment for which the properties
will be estimated, but there is no requirement that the domain Y correspond to a physical
space, or that temporal coordinates can not form part of the localisation. Consider three
examples: the time-dependent temperature distribution over the surface of a sphere; the
height distribution of the sea-bed in a fixed rectangular region; and the temperature and
pressure in a topological representation of a chemical process. In the first case the domain
is spatio-temporal and continuous and can be parameterised by three coordinate indices; in
the second it is spatial in two dimensions; and in the third the domain is topological and
the space Y defines only local connections between neighbouring elements. The elements y
of the domain Y can be readily identified as special cases of the elements {Si }, with i ∈ Z∗ ,
of the set S of Section 2.2, and all other quantities will correspond to measures defined on
those elements.
2. The vector y is used to represent a location in the domain in the same manner as the coordinate vectors of Section 2.4.1 and can correspond to either a continuous or discrete space Y.
At any particular y each of the properties will take on some value. Let the ith property
correspond to the value xi (y) where writing the value as a function of the domain explicitly
encodes this dependency. Using this notation, the set of all sensory values can be denoted
by the vector-valued3 function x(y),

x(y) ∈ X with y ∈ Y (3.1)

where X represents the space over which the values of the function are defined and the
choice of the domain Y selects the meaning of the ‘instantaneous’ value of the sensory cues.
Significantly, this representation does not make any assumptions regarding the dependencies
between these sensory cues and there is no requirement for the ith and j th properties to
be treated as if they were independent. In the previous examples: the temperature on the
sphere for a given location T (θ, φ, t) for θ ∈ [0, 2π], φ ∈ [−π/2, π/2] and t ∈ R; the height
of the sea-bed h(x, y) for (x, y) ∈ R2 ; and the temperature and pressure P (i) and T (i) for
i ∈ [1, n] ⊂ Z, all define the instantaneous value of the properties at a singular element
of the domain. Elementary thermodynamics implies that these last two quantities will be
intimately linked and it should be clear that the two quantities will be strongly dependent
on one another. Additionally, no assumption is made that the particular functions should
conform to constraints such as continuity or smoothness, nor are they restricted to quantities
defined on a continuous domain. Furthermore, the functional forms can be used to represent
any function from a Dirac delta (existing at a single point in the domain) to a constant
value at all points in Y .

For a given domain, the selection of the coordinate system used to represent the elements of
Y is arbitrary, provided that the system used changes only the index given to each element
of the domain, but does not alter the elements themselves.4 This is equivalent to applying
a bijective transformation of the coordinates of the elements: for y1 , y2 ∈ Y corresponding
to the same element of the domain, this mapping is Ty : y1 → y2 such that

x1 (y1 ) = x2 (y2 ) where y2 = Ty (y1 ) (3.2)

3. If the function is a real vector-valued function, the resulting space will be X = R^m for m cues. More generally, however, the functions can be tensor-valued, for example a stress field in a rigid body. Additionally, the function can be considered as an indicator function (Dirac delta) over the combined domain {X, Y}.
4. That is, the instantaneous value of a continuous function (or the value at a given element in a discrete domain) remains unchanged.
and the parameterisation simply determines the functional form of x. For example, the
sea-bed example can be represented in either the rectangular coordinates, y1 = (x, y), or
polar form, y2 = (r, θ), and while the functional forms are different, the value of the two
functions for the same element (in this case a point in the two-dimensional space) will be
identical.

It is possible, however, to transform not only the parameterisation of the domain, but the
domain itself, that is, T : Y → Y′ where Y′ contains different elements to Y. As this
transformation changes the elements, the simple relationship of the coordinate transform
of Equation 3.2 no longer holds.5 For example, consider the Fourier Transform applied to
a one-dimensional signal f (t). The forward and inverse transformations are given by

F : F(ω) = ∫ f(t) exp{−jωt} dt    (3.3)

and F⁻¹ : f(t) = (1/2π) ∫ F(ω) exp{jωt} dω    (3.4)

for F(ω) ∈ C and where t ∈ Y represents the time-domain representation and ω ∈ Y′
represents the parameterisation of the frequency-domain representation (Bracewell 2000,
p6). Since there is no one-to-one mapping of the elements of the two domains then the
equivalent form of Equation 3.2 is

x(y0) = T⁻¹[ x′(y′) ] |_{y = y0},    (3.5)

that is, in general the instantaneous values of the function in either domain can only be
compared by applying the transformation to the entire function x′(y′). In the case of the
Fourier transform, this is immediately obvious as the value of a signal at a specific time t0
is obtained from the frequency form according to

f(t0) = F⁻¹[F(ω)] |_{t = t0} = (1/2π) ∫ F(ω) exp{jωt0} dω    (3.6)

and clearly depends on the entire function of frequencies. The application of transformations
of this kind will be significant in the development of alternative schemes to represent physical

5. This result is well known in calculus and is the direct cause for the incorporation of the Jacobian in transformed integrals (Stewart 1995, §13.9).

(a) Rectangular (b) Polar

Figure 3.5: Domain transformation of functional models. The same surface is depicted in
both cartesian and polar coordinates.

systems and sensor-centric forms in Section 3.3.

Example 3.3.1 – Domain transformation for functional forms


Consider the surface model for a sea-bed discussed earlier, that is h(x, y) for (x, y) ∈ R2 . Let
Figure 3.5(a) be the rectangular representation of the surface of interest. The transformation
of the domain coordinates is given by


r=(x2 + y 2 )
Ty : " # . (3.7)
θ = tan−1 xy

The same surface is shown in Figure 3.5(b) in polar space obtained using this transforma-
tion and it is clear that while the same function is represented, the particular functional
form differs. It is often true that one particular coordinate system may make an operation
substantially simpler than in other spaces; consider that the analysis of the end effector
location on a robotic arm with a single revolute joint is much simpler when written in a
polar formulation than in cartesian, but when multiple joints are considered the formulation
is simpler in cartesian space.
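Example 3.3.1 can be reproduced numerically. The sketch below uses an assumed synthetic surface (the data behind Figure 3.5 is not tabulated in the text) and applies the coordinate transformation of Equation 3.7, confirming that the functional form changes while the value at any given physical element does not (Equation 3.2).

import numpy as np

def height(x, y):
    # Assumed sea-bed-like surface h(x, y); purely illustrative.
    return 100.0 * np.exp(-(x**2 + y**2) / 50.0) + 2.0 * x

# A single physical element in rectangular coordinates.
x0, y0 = 3.0, 4.0
h_rect = height(x0, y0)

# The same element in polar coordinates via Equation 3.7.
r0 = np.hypot(x0, y0)
theta0 = np.arctan2(y0, x0)

def height_polar(r, theta):
    # Different functional form, same underlying function.
    return height(r * np.cos(theta), r * np.sin(theta))

h_polar = height_polar(r0, theta0)
assert np.isclose(h_rect, h_polar)      # same element, same value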
3.3.1 Function Estimation

As the physical processes which give rise to the individual measurements are affected by
uncertainty and ambiguity, a reliable system must take these effects into consideration.
Specifically, rather than only maintaining an estimate of the function of interest, x̂(y), it is
preferable to maintain the probability distribution over all possible functions,

P [x(y)] . (3.8)

In this notation, the random variable describing the set of possible functions is represented
by x(y).6 From this distribution it is possible to obtain an appropriate estimate such as
the mean of the distribution,
x̂(y) = Ex {x(y)} (3.9)

or the mode,
x̂(y) = arg max_{x(y)} P[x(y)].    (3.10)

The notation of Equation 3.8 represents a distribution over a set whose elements are func-
tions defined on the domain Y . For example, the distribution could be constructed over
the family of sinusoidal functions and to each function, x = sin(2πy) or x = cos(y − y0 )
say, a measure of probability would be defined. As noted in Section 2.4.1 the selection of a
coordinate system for the elements of this set of functions represents a particular method
for distinguishing and indexing those elements. The most common interpretation of the
space of functions was used in the last section where a function is taken to mean the set of
instantaneous values at each singular element of the domain Y . This interpretation amounts
to the identification of a coordinate basis for the set of all possible functions as the set of
delta-functions, or the Dirac function-basis, {eδ }. When defined on a discrete domain Y ,
the basis functions are given by

eδi (yj ) = δD (yj − yi ) ∀ yi ∈ Y (3.11)

and for the continuous domain,

eδ(y, y′) = δD(y − y′)    (3.12)
6. The distribution P[x(y)] should not be confused with a value from it for a given function P[x̂(y)].
where y′ ∈ Y takes on the role of indexing a different element of the domain. Using the
Dirac basis, any (regular) function7 can be written as

f(yj) = Σ_i αi eδi(yj)    where Y is discrete    (3.13)

f(y) = ∫ α(y′) eδ(y, y′) dy′    where Y is continuous    (3.14)

where the α represent the coordinates of the function f (y) according to the discrete basis,
that is, the instantaneous values of the functions at the appropriate locations in the domain.
The sum and integral are taken over all elements yi or y′ of the domain. Inserting Equation
3.12 into Equation 3.14 results in the formal equivalency of the parameters of the Dirac basis
and the ‘ordinary’ interpretation of a function’s value, f(y) = α(y). The instantaneous
‘value’ of the function at an element can be interpreted as the value of the parameter at the
coordinate corresponding to the particular basis function, and vice-versa.
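A minimal numerical illustration of Equations 3.8 to 3.10 follows, assuming a small discrete family of candidate functions with arbitrarily chosen probabilities. Note that each candidate's array of sampled values is exactly its set of coordinates in the discrete Dirac basis just described.

import numpy as np

# Discrete domain Y and an assumed family of three candidate functions.
y = np.linspace(0.0, 1.0, 101)
candidates = [np.sin(2 * np.pi * y),
              np.cos(y - 0.5),
              np.zeros_like(y)]
prob = np.array([0.5, 0.3, 0.2])            # P[x(y)] over the candidates

# Mean estimate (Equation 3.9): expectation over the function family.
x_mean = sum(p * f for p, f in zip(prob, candidates))

# Mode estimate (Equation 3.10): the most probable member of the family.
x_map = candidates[int(np.argmax(prob))]

print(x_mean[:3])
print(x_map[:3])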

In addition to the Dirac basis, there are many alternative bases for representing functions.
While the Dirac basis can represent a very wide range of functions, denoted here by FD ,
these alternative bases often only support a subset FY ⊂ FD . An example of this principle
is the ‘Fourier series’ where the subset FY represents the set of one-dimensional, continuous,
square-integrable functions f(y) with periodicity of length 2L. The basis set is identified as

{ cos(nπy/L), sin(nπy/L) }    for n ∈ [0, ∞)    (3.15)

and the functions can be written as

f(y) = Σ_n [ α1n cos(nπy/L) + α2n sin(nπy/L) ];    (3.16)

the set of parameters {α. } represent the coordinates of the function f (y) in this space
(Kreyszig 1993, p577). Note that this particular change of basis corresponds to a special
case of the transformation of Equation 3.3.
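To make the Fourier-series basis concrete, the sketch below computes the first few coefficients of an assumed periodic function numerically and reconstructs an approximation from them; the coefficients {α} are the coordinates of the function in the basis of Equation 3.15, and the particular function and truncation are illustrative only.

import numpy as np

L = 1.0                                   # half-period of the function
y = np.linspace(-L, L, 2001)
dy = y[1] - y[0]
f = np.abs(y)                             # assumed example function on [-L, L)

def coefficients(n):
    # Coordinates of f in the cos/sin basis of Equation 3.15.
    a_n = np.sum(f * np.cos(n * np.pi * y / L)) * dy / L
    b_n = np.sum(f * np.sin(n * np.pi * y / L)) * dy / L
    return a_n, b_n

# Reconstruct the function from a truncated set of coordinates (Equation 3.16).
approx = np.full_like(y, np.sum(f) * dy / (2 * L))     # constant (n = 0) term
for n in range(1, 10):
    a_n, b_n = coefficients(n)
    approx += a_n * np.cos(n * np.pi * y / L) + b_n * np.sin(n * np.pi * y / L)

print(np.sqrt(np.mean((f - approx) ** 2)))   # RMS error falls as terms are added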

In general, a function-space representation can be written as a generalisation of Equations

7. The definition of the function space spanned by infinitesimal elements of the set formally precludes the Dirac delta function, for example, from being represented in this manner. Such ‘generalised functions’ can, however, be described as limiting cases of regular functions. See Rowland (1999) for a discussion of function spaces and analysis in cases involving generalised functions.
3.13 and 3.14,

f(yj) → fα(yj) = Σ_i αi ei(yj)    (3.17)

f(y) → fα(y) = ∫ α(w) e(y, w) dw,    (3.18)

where the function is subscripted with α to explicitly identify the dependency of the rep-
resentation on the function basis and the function α(w) represents the coordinates of f (y)
in the selected basis. The element y ∈ Y has been replaced by w∈W which represents
an arbitrary coordinate system for the parameters; likewise for the discrete case i repre-
sents an arbitrary index. The Fourier example in which the parameters are defined over a
frequency-space shows that, except in the case of the Dirac basis, these parameters will not
be defined over the function domain Y .

The function f (y) can be written in terms of its coordinates in a different function
space defined by the basis vectors {e} as

f(yj) → fα(yj) = Σ_i αi ei(yj)    (3.17)

f(y) → fα(y) = ∫ α(w) e(y, w) dw    (3.18)

for discrete and continuous domains respectively.

Now, given the representation of a function in the form of Equation 3.17 where the set {ei }
represents an arbitrary function basis, it must be noted that knowledge of the parameters
{α} is sufficient to deterministically encode the function itself. This implies that knowledge
regarding the parameters is equivalent to knowledge of the function itself. Given the random
variable x(y) and the parameters α this can be represented as

P [x(y)] ⇔ P (α) (3.19)

and the determinism can be written explicitly as a conditional probability distribution,

P [x(y)|α] = δD [x(y) − fα (y)] . (3.20)

A critical result of Equation 3.19 is that any deterministic encoding of the function is viable,

Figure 3.6: A Dirac basis set for a regular grid. The twenty-five grid elements are indexed by
j in the construction of Section 3.3.1 and each of the values can change arbitrarily, yielding
a representation in twenty-five dimensions. However, a grid is often interpreted as a two-
dimensional array, indexed by a and b in this case, leading to erroneous interpretations of
the dimensionality of the grid.

in principle. It is also noteworthy that the most common interpretation of the left-hand
side of this relationship is as implicitly parameterised by the Dirac basis; that is, given a
parametric distribution P (α), it is possible to obtain the corresponding distribution over
the values of the function in the original domain.

Consider now the intrinsic dimensionality of a given functional estimation, that is, the
number of degrees of freedom in the development of the functional form. For a discrete
system of the form of Equation 3.17 it is clear that this corresponds directly to the number
of basis functions represented by the basis set {ei }, which for the Dirac basis corresponds
to the cardinality of the set Y . In the case of a continuous parameter system such as that
of Equation 3.18, however, the dimensionality is equal to the product of the number of
parameter functions α(w) and the cardinality of the space W over which the parameter
function is defined. That is,

dim [f (yj )] = |{ei (yj )}| (3.21)

dim [f (y)] = |W | × |{e(y, w)}| (3.22)

where |.| refers to the cardinality of the given set.

Consider a discrete grid representation with 5 × 5 nodes corresponding to a discrete Dirac
basis representation in twenty-five dimensions; this example is shown in Figure 3.6. As
noted in that figure, the construction of this section implies that each node is sequentially
indexed by j. However, an alternative interpretation is that of a two-dimensional indexing
scheme, denoted here by a and b. When estimating the values of αi or αab , however, it
is critical to recognise that the cardinality of the set, rather than the dimensionality of
the indexing elements, determines the dimensionality. In this example, the dimensionality
remains 25, corresponding to the individually modifiable values at each grid location, and
should not be confused with the 2-dimensional index set for those elements. Assume that a
single quantity is estimated over this basis: now while the resulting function does represent
a surface defined over a two-dimensional set, the estimation of the surface represents a
distribution over the twenty-five parameters.
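The distinction between the cardinality of the parameter set and the dimensionality of its index can be checked directly. In this sketch (an illustration only) the 5 × 5 Dirac-basis grid of Figure 3.6 requires a twenty-five-element parameter vector, so a Gaussian belief over the surface carries a 25 × 25 covariance, even though the index set is two-dimensional.

import numpy as np

grid_shape = (5, 5)
n_params = int(np.prod(grid_shape))        # 25: the true dimensionality

# A hypothetical Gaussian belief over the surface therefore needs a
# 25-element mean and a 25 x 25 covariance, not a 5 x 5 one.
mean = np.zeros(n_params)
cov = np.eye(n_params)

# Mapping between the two indexing conventions of Figure 3.6.
a, b = 2, 3
j = np.ravel_multi_index((a, b), grid_shape)
print(n_params, j)                         # 25 13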

Estimation of the random function x(y) is usually written as P [x(y)]. However, since
Equation 3.17 or 3.18 represents a bijection, the estimation is equivalent to that of the
parameters α,
P [x(y)] ⇔ P (α). (3.19)

3.3.2 Transformation of Function Bases

When the function being manipulated represents a property defined over an operating
domain, the quantity of interest to a system designer is usually the current estimate of
that property at specific locations within the domain, rather than the entire function. The
uncertainty is encoded in the distribution maintained over the parameters and it is possible
to obtain an expression for the value distribution, given a location y, as

P[x(y)] = ∫ P[x(y), α] dα
        = ∫ δD[x(y) − fα(y)] P(α) dα    (3.23)

using Equation 3.20. Now, the formal equivalency of Equation 3.19 requires that the map-
ping between the Dirac basis and the function-basis representations is bijective. Specifically,
this requires that the function-basis used be linearly-independent so that a unique function
is representable with a unique set of coordinates (parameters) in the particular basis. Note
that orthogonality, while greatly simplifying the manipulations, is a sufficient, but not a
necessary condition. This is reasonable as it is well-known that non-orthogonal, linearly-
independent, basis vectors uniquely describe a cartesian space, but correspond to more
difficult manipulations than their orthogonal counterparts.8 In this case the mapping can
be explicitly written as a transformation Tbasis ,

Tbasis : α(w) → x(y) = ∫ α(w) e(y, w) dw    (3.24)

using Equation 3.17 and the inverse transform Tbasis⁻¹ can also be defined as

Tbasis⁻¹ : x(y) → α(w) = ∫ x(y) e⁻¹(y, w) dy    (3.25)

where the e−1 (y, w) represent the inverse basis functions and usually differ from the forward
basis functions9 . Using this notation Equation 3.23 becomes

PX[x(y)] = ∫ δD{ x(y) − Tbasis[α(w)] } Pα(α) dα    (3.26)
         = Pα( Tbasis⁻¹[x(y)] )    (3.27)

where the distributions have been subscripted for clarity and the bijective nature of the
transformation allowed the sifting rule for integrals involving the Dirac delta function to be
used. It is absolutely critical that this expression be understood as a bijective transformation
of the density describing a function from the parameter space to the value space and that for
a particular function x(y) there is only one corresponding parameter function α(w). That is,
if it is desired to evaluate the probability that a one-dimensional function is f (x) = sin(2πx),
then this function is transformed and a single probability value is obtained. Given that the
distribution can also be interpreted as the Dirac basis representation, the distribution over
possible function values at a specific location, y0 say, represents the marginalisation of the
resulting high-dimensional distribution PX [x(y)].
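For the special case of a linear basis mapping with a Gaussian parameter distribution, the transformation into the value space can be carried out in closed form. The sketch below uses an arbitrary three-function basis on a four-element domain (all values are assumptions made for illustration) and propagates the parameter mean and covariance into the Dirac-basis value space.

import numpy as np

# Rows of E are three basis functions evaluated on a four-element domain,
# so that x = E.T @ alpha is a discrete analogue of Equation 3.24.
E = np.array([[1.0, 1.0, 1.0, 1.0],        # constant basis function
              [0.0, 1.0, 2.0, 3.0],        # linear basis function
              [0.0, 1.0, 4.0, 9.0]])       # quadratic basis function

alpha_mean = np.array([1.0, 0.5, -0.1])    # assumed parameter belief
alpha_cov = np.diag([0.1, 0.05, 0.01])

# Propagated (Dirac-basis) distribution over the function values. The value-
# space covariance has rank three: only functions representable in this basis
# receive probability mass.
x_mean = E.T @ alpha_mean
x_cov = E.T @ alpha_cov @ E
print(x_mean)
print(np.diag(x_cov))                      # marginal variance at each element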

Examples 2.5.1 and 2.5.2 presented an application of this approach which demonstrated that
the behaviour could be succinctly captured in the deterministic transformation and the dis-
tribution over the parameters. In general, however, it is not possible to evaluate Equation
3.27 analytically; in fact, the only simple solution occurs when the distribution is manipu-
lated in the Dirac basis where the transformation is trivial. Instead, if the distribution is

8. See Appendix A for a discussion of tensor geometry and the role of dual-bases in non-orthogonal systems.
9. These basis functions represent the ‘dual’ basis for the representation. If the basis vectors are orthogonal then the dual basis will be the same (up to some normalising factor). The Fourier transform is a special case involving a complex representation where the dual basis is the complex conjugate of the forward basis.
required at some location y0 , it is possible to use Equation 3.23 directly as,



P[x(y0)] = ∫ δD[x(y0) − fα(y0)] P[α] dα    (3.28)
         = ∫_{α : fα(y0) = x(y0)} P[α] dα    (3.29)

where the second integral is only taken over the values of α where the Dirac delta function
is non-zero. There are two important ways in which this expression can be utilised to
obtain the value-space distribution at y0 : marginalising P [α] to obtain the joint distribution
involving only the parameters on which the function value at y0 depends; or propagating
the values of the density through the transformation and integrating the result.

The first of these is of use when the values of the function at y0 are only dependent on a
subset of the parameters. For example, in the case of a finite dimensional parameter space,
if the basis functions are compact on y, then the value at y0 will only be dependent on the
joint distribution P [αi ] where i is the subset of i for which ei (y0 ) is non-zero. Provided
that the parameter distribution can be readily marginalised10 then it may be possible to
evaluate Equation 3.29 effectively. For example, if a function value is reducible to the sum of
two parameter values, say f (x0 ) = α5 +10α9 , then the distribution depends only on the joint
distribution P (α5 , α9 ). Furthermore, if the parameters were assumed to be independent it
is readily shown that the resulting distribution is obtained from the convolution of their
individual marginal distributions (Papoulis & Pillai 2002, pp181-2), and the same authors
show similar results for other cases.

In the second approach, the delta function of Equation 3.28 is interpreted as a density
transformation, taking specific values of α and mapping them into the value space. Note
that this propagates the values of α into values of x(y0 ), but takes no account of the
probability associated with those points. For this reason, the expectation of the result
is taken with respect to the original distribution. Practically, values of α are selected,
transformed, weighted by their original probability, and the resulting values integrated to
obtain the result. An alternative scheme involves obtaining samples of α according to the
density P [α]; that is, more samples will be obtained in regions where the density is highest.
In effect, this approximates the distribution P [α] by the sampling density. Since the domain
transformation then operates on samples which are distributed over the domain according
10. As an example, a Gaussian of arbitrary dimension can be marginalised by simply removing the covariance and mean terms corresponding to the unwanted dimensions.
to the original distribution, the resulting distribution also represents a valid density over
the output domain.

This is a special case of the importance sampling approaches discussed in Chen (2005, §VI).
There are several caveats associated with this particular method, notably the difficulties
of obtaining appropriate samples from high-dimensional distributions and ensuring that
a sufficient number of samples has been obtained to adequately represent the resulting
distribution. In fact, since the domain can change arbitrarily between the parameter space
and the value-space, there is no guarantee that the domain regions with highest density in
the parameter space will correspond to regions of high density in the resulting distribution.
For example, if a large region of low density in the parameter space is compressed by the
transformation into a very small region of the value space, then, if an insufficient number
of samples are available in the parameter space, the estimates of the value distribution may
be inaccurate.
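The sample-based propagation just described can be sketched as follows. The two-parameter sinusoidal family and the Gaussian parameter density are assumptions chosen only so that the mechanics are visible: parameters are drawn according to P[α], pushed through the basis mapping, and the resulting samples approximate the value distribution at a chosen location y0.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-parameter family: f_alpha(y) = alpha0 * sin(alpha1 * y).
def f(alpha, y):
    return alpha[..., 0] * np.sin(alpha[..., 1] * y)

# Draw parameter samples according to an assumed Gaussian P[alpha].
alpha_mean = np.array([1.0, 2.0])
alpha_cov = np.array([[0.04, 0.00],
                      [0.00, 0.25]])
samples = rng.multivariate_normal(alpha_mean, alpha_cov, size=5000)

# Propagate the samples to the value space at a particular location y0; the
# resulting sample set approximates P[x(y0)].
y0 = 0.7
values = f(samples, y0)
print(values.mean(), values.var())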

The functional forms of Equation 3.1 can be written in terms of any basis set, {e(y, w)}
or {ei(yj)}, as the new function α(w) according to

Tbasis : α(w) → x(y) = ∫ α(w) e(y, w) dw    (3.24)

and the parameter values can be obtained using the inverse transformation

Tbasis⁻¹ : x(y) → α(w) = ∫ x(y) e⁻¹(y, w) dy.    (3.25)

Example 3.3.2 – Wavelet approximations


Consider the application of the two-dimensional wavelet transformation to the image shown
at the left side of Figure 3.7. It was seen that the selection of the function basis to represent
a particular functional form depends strongly on the content of the data to be represented.
In this case, two different transformations are considered. These are the Haar and Meyer
wavelet representations corresponding to the top and bottom rows of the image respectively.

The middle diagram shows the image data transformed into the given basis and it should
be clear that these versions have substantially lower mean values. Finally, the right hand
side shows the resulting images reconstructed from compressed versions of the transformed
image. Specifically, the images are reconstructed after 80% of the pixel values are set to
zero; that is, the images are compressed to 20% of their original size.

Figure 3.7: Wavelet representations for image data. The Haar and Meyer wavelets are used
to construct the basis functions for alternate representations for the images. The input
image is shown on the left and the transformed spaces in the centre. The right-hand side
shows the resulting image after reconstruction using only 20% of the coefficients of the
transformed representation.

Examination of the reconstructed images reveals that the two approximations are suited to
different applications, in that the Haar wavelet results in regularly-shaped artifacts while the
Meyer wavelet smooths the resulting image. There are many different wavelet forms and,
indeed, much of the power of the method lies in the recognition that there is no ideal wavelet
and that the optimal selection depends on the application.
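Example 3.3.2 can be approximated numerically without the original images. The sketch below applies a single-level 2-D Haar transform to a synthetic image, retains only the largest 20% of coefficients and reconstructs; the image, the single decomposition level and the thresholding rule are all simplifying assumptions relative to the full wavelet decompositions behind Figure 3.7.

import numpy as np

def haar_rows(a):
    # Single-level Haar transform along the rows: [averages | details].
    ev, od = a[:, 0::2], a[:, 1::2]
    return np.hstack(((ev + od) / np.sqrt(2), (ev - od) / np.sqrt(2)))

def ihaar_rows(c):
    # Inverse of haar_rows.
    n = c.shape[1] // 2
    s, d = c[:, :n], c[:, n:]
    out = np.empty_like(c)
    out[:, 0::2] = (s + d) / np.sqrt(2)
    out[:, 1::2] = (s - d) / np.sqrt(2)
    return out

def haar2d(img):
    # Transform the rows, then the columns (via transposition).
    return haar_rows(haar_rows(img).T).T

def ihaar2d(coef):
    return ihaar_rows(ihaar_rows(coef.T).T)

# Assumed test image: a smooth ramp with a bright square.
img = np.fromfunction(lambda i, j: i + j, (64, 64), dtype=float)
img[20:40, 20:40] += 50.0

coef = haar2d(img)

# Keep only the largest 20% of coefficients by magnitude, cf. Figure 3.7.
thresh = np.quantile(np.abs(coef), 0.8)
coef_kept = np.where(np.abs(coef) >= thresh, coef, 0.0)

recon = ihaar2d(coef_kept)
print("RMS reconstruction error:", np.sqrt(np.mean((img - recon) ** 2)))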

3.3.3 Functional Forms and Distributions

This section has introduced a general method whereby arbitrary functions can be described
in a flexible and general way using function bases. These arbitrary functions were assumed to
correspond to the relationships between measurable quantities and some domain over which
they are defined. Furthermore, as a function basis encodes a deterministic mapping between
a set of parameters and the represented function, it was shown to be possible to maintain
a viable probabilistic representation of the functions themselves through the equivalency of
knowledge of the function to a distribution over the parameters. A clear distinction was
drawn between the functions being estimated and the parametric distribution.

In fact, this distinction is somewhat arbitrary. Consider the situation in which a function is
defined over a one-dimensional continuous domain, f (t) say. Under the scheme presented,
a function basis would be selected for the domain-dependence and the function would be
represented as a set of parameters α(w). A distribution is then defined on the parameters
and, as a distribution can also be represented using an arbitrary function basis, this could
also be parameterised deterministically by β(v). The β are not probabilistic and define
the distribution on α exactly, which in turn, defines the dependence of the function on
the domain. While it would seem that this approach makes it possible to augment the
domain Y with the value-space dimensions X and subsequently define the functions being
represented to be the probability distributions over X which are functionally dependent
on Y , the β must be recognised as encoding the distribution P [α(w)] and not simply the
resulting distributions after the application of Equation 3.23. For example, if the vector was
of dimension 3 and the domain 2, this conditional distribution can be described as a function
over a 5-dimensional space. Such an interpretation, however, does not capture the statistical
interdependencies between the functional values at different locations. These dependencies
are, however, captured in the dependency of the conditional distributions on the parameter
distribution where they are explicitly encoded. Without this explicit dependence, the low-
dimensional projection of the parameters will make erroneous assumptions regarding the
independence of the function at different localities.

Acknowledging that the domain over which a property is defined represents a functional
dependency rather than a statistical one helps to avoid confusing the probabilistic part with
the functional part. In particular, the literature contains many situations in which authors wish
to augment spatial estimates with property information. A discussion of the inference of
such ‘augmented’ representations is deferred until Section 3.4.1, but the impact of confusion
in this regard can be significant. Equation 3.20 explicitly captures the fact that the resulting
probability distribution can be considered in terms of either the function-space coordinates
or the value of the function at any particular location, but not as a distribution over
space. Several authors have developed functional estimation schemes for which the resulting
function is implicitly assumed to be a distribution over space, notably Majumder et al.
(2001) in which a mixture of Gaussians is used to estimate the sonar return amplitudes.
The flaw in this approach is that treating the function as a distribution over space implies
that, in the limit of perfect knowledge, the function would reduce to a delta function over
space. That is, the resulting function would describe the properties intended, but only at a
single location in space.

Finally, note that there are alternative approaches to the representation of arbitrary func-
tions. The most obvious of these are Gaussian Processes (GPs), which are applied often in
supervised learning contexts. Mackay (1998) provides a comprehensive introduction to the
approach and other authors provide methods which incorporate parts of the theory of GPs,
including Bialek (2002, 48-9). In essence, the method amounts to finding the distribution
over functions of the form
t = y(x) (3.30)

where the t represents the ‘target’ value for the input data x. Writing a series of observations
of both target and data as tN and XN , the goal of the scheme is to evaluate the posterior
distribution over possible functions,

P[ y(x) | tN, XN ] = P[ tN | y(x), XN ] P[ y(x), XN ] / P[ tN, XN ]    (3.31)
                   = P[ tN | y(x), XN ] P[ y(x) ] / P[ tN | XN ],    (3.32)

where Bayes’ rule is used to rewrite this in terms of the generative model P [tN | y(x), XN ]
and the second line follows from assuming statistical independence of the prior function
from the data XN . The scheme is written as a likelihood update in that a prior distribution
for the function (or its parameters) is modified by the supplied data and targets to obtain
the posterior distribution. Mackay (1998) notes that the function y(x) can be replaced in
the approach by its parameters w and that, as with the approach outlined in this thesis, the
evaluation of the distribution for some new location xN +1 requires both the transformation
of the domain and the marginalisation over the parameters.

The Gaussian Processes are fundamentally identical in spirit to the method presented here,
but represent a specialisation of the more general functional models. Firstly, despite claims
of being ‘parameter-free’, the Gaussian Process uses the Dirac function basis, which results
in the marginalisation and transformation of the data becoming trivial: the parameters are
the values of the function at infinitesimal locations in the domain. Secondly, the parameters
are assumed to be Gaussian distributed, which represents a special case of the general theory,
albeit a computationally effective one.
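For comparison with Equations 3.31 and 3.32, the following minimal sketch performs standard Gaussian Process regression with a squared-exponential covariance; the kernel, noise level and training data are illustrative assumptions rather than quantities taken from this thesis or from Mackay (1998).

import numpy as np

def sq_exp(a, b, length=0.5, signal=1.0):
    # Squared-exponential covariance between two sets of scalar inputs.
    d = a[:, None] - b[None, :]
    return signal**2 * np.exp(-0.5 * (d / length) ** 2)

# Assumed training data (X_N, t_N) and query locations.
X = np.array([0.0, 0.4, 1.1, 1.9, 2.5])
t = np.sin(X)
Xq = np.linspace(0.0, 3.0, 7)
noise = 1e-2

# Standard GP posterior mean and covariance at the query points.
K = sq_exp(X, X) + noise * np.eye(len(X))
Ks = sq_exp(Xq, X)
Kss = sq_exp(Xq, Xq)
w = np.linalg.solve(K, t)
mean = Ks @ w
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
print(mean)
print(np.diag(cov))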

The approach outlined in this thesis differs from the Gaussian process approach in that,
rather than apply constraints to a distribution observing points through the use of a ju-
diciously applied prior, the data and targets correspond to the domain locations y and
state functions x and the data is assumed to be dense on the domain Y . Most importantly,
however, the observations are not assumed to correspond to point events, but as discussed
in Section 3.4.2 in fact correspond to the measurement of a series of functions defined over
domains related to the domain of the ‘state’ function.

3.4 Utilisation of Functional Models

While Section 3.3 described a framework for constructing feasible and flexible sensor-centric
models, this section considers how such models can be applied and interpreted in the context
of the development of an autonomous system. In particular, the functional models are of
little use to the developer unless three capabilities are provided, specifically, the ability to:

• infer the statistics describing a location y ∈ Y or the properties x(y) ∈ X, given the
other;

• incorporate new sensory information into the representation; and

• incorporate data sequentially to construct an on-line estimate.

Each of these is examined in detail and the models of Section 3.3 are shown to provide an
intuitive and powerful approach to these considerations.

3.4.1 Domain Inference using Functional Models

While the general topic of inferencing, transformation and decision making is the focus
of Chapter 4, a special case must be considered in the context of the construction and
utilisation of representations and sensor models. A general problem is to facilitate the dual
tasks of utilising spatial data to infer expected property data,11 and to infer the probable
spatial location given some set of sensory cues.

Fortunately, the formulation introduced in this thesis makes the explicit distinction between
the space defining the ‘operational domain’ and the sensory space defining the properties to
be estimated over that domain. For a general parameterisation, the density transformation
of Equation 3.24 can be performed in the region of interest to obtain the (Dirac function
basis) form,
P [x(ỹ)] , (3.33)

where throughout this section the domain location will be written as ỹ, to clearly distinguish
it as a non-random variable. Recall here that the random variable x(ỹ) represents a function.
Equation 3.33 represents a distribution over a set of functions or, under the Dirac basis
interpretation, a set of distributions indexed by each and every value of ỹ. However, ỹ
is not a random variable, so it is not sufficient to simply augment the function with the
domain parameter in the following manner,

P [x(ỹ), ỹ] , (3.34)

as this assumes that the underlying joint distribution has a single solution, implying that
the non-random variable ỹ takes on a single true value; the only functions which can be
represented in this form consist of a single point estimate in the domain Y .

Alternatively, following the approach of Papoulis & Pillai (2002, p123-4), two new random
variables are introduced: y′ ∈ Y, representing a domain location; and x′ = x(y′), representing
the value of the function at a particular value of y′. As Equation 3.33 is to be
interpreted in the Dirac basis sense, then the real variable ỹ can be treated as a constant in
the following derivation and x(ỹ) is interpreted as a family of distributions over X, parameterised
by the value of ỹ. The joint conditional distribution of x′ = x(y′) and the function
x(ỹ), given the location as the random variable y′, is constructed as

P[x(y′), x(ỹ) | y′] = P[x(y′) | x(ỹ), y′] P[x(ỹ) | y′].    (3.35)
11. In the special case of the Dirac basis, or a suitably spatio-property partitioned model (see Section 3.5.3), this amounts to determining the conditional probability of the sensory data z given the parameters α and is a particular instance of the sensor models of Section 3.4.2 which represent the entire conditional probability distribution for all possible observations. In all other cases this corresponds to transformation of the density into the value space prior to performing the inference.
Now, two cases bear closer inspection: when the function x(ỹ) is independent of the location
of interest y′; and when the location information forms a part of the state function x(ỹ). The
first corresponds to the evaluation of the value at some location described by a distribution
P[y′], such as the evaluation of the temperature of a heated sphere at a location defined
by a distribution over the two angles θ and φ. The second, however, occurs when the
location of interest is not independent of the estimated function x, such as when the function
contains information about the shape, location and physical properties of an object in the
environment.

In the first case, Equation 3.35 becomes

P[x(y′), x(ỹ) | y′] = P[x(y′) | x(ỹ), y′] P[x(ỹ)]    (3.36)
                    = δD[x(y′) − x(ỹ)] Px[x(ỹ)]    (3.37)

where in the second line, the fact that knowledge of y′ and the function x(ỹ) deterministically
selects the value of x(y′) has been used. Also, the second distribution has been
subscripted by x to indicate that it corresponds to the distribution over the function. In
this case it is possible to obtain

P[x(y′) | y′] = ∫ δD[x(y′) − x(ỹ)] Px[x(ỹ)] dx(ỹ)
              = Px[x(y′)].    (3.38)

This expression is significant in that it describes the distribution over the value of the
function x(ỹ) at a particular value of the random location y′ in terms of the known distribution
over the functions Px[x(ỹ)]. This in turn means that the problem of inferring the
distribution over the value of the function given the distribution P[y′] can be written as

P[x(y′)] = ∫ Px[x(y′)] P[y′] dy′.    (3.39)
P x(y ) = Px x(y ) P y dy . (3.39)

Similarly, if the goal is to infer the distribution over the location y′ for which the function
takes on the given value of x(y′), then applying Bayes’ rule to Equation 3.38 yields

P[y′ | x(y′)] = P[x(y′) | y′] P[y′] / P[x(y′)]    (3.40)
              ∝ Px[x(y′)] P[y′]    (3.41)
and the prior distribution P[y′] is updated by the likelihood distribution to yield a posterior
distribution. Equations 3.39 and 3.41 represent a rigourous incorporation of a location
described by a random variable with the functional forms introduced in this section.

In the second case, however, the location information may be encoded in the function
distribution (for example in an aircraft identification and tracking application the quantities
defining the characteristics of the object are maintained along with its location). The
function can then be written x(ỹ) = [xs (ỹ), xp (ỹ)] where the subscripts s and p refer to
the spatial and property components of the function. The assumption in this case is that
the location part of the function corresponds to a Dirac delta function12 , though there is
no restriction on the distribution associated with that function. Instead of introducing the
location random variable as a new entity, note that in this case it is given by y′ = xs(ỹ),
corresponding to that Dirac delta function, and introduce the new random variable xp(y′)
corresponding to the value of the property part of the function x(ỹ) at the location selected.
Here Equation 3.37 is replaced by

P[xp(y′), xp(ỹ) | y′] = δD[xp(y′) − xp(ỹ)] Px[xp(ỹ) | y′]    (3.42)

and the conditional distribution of Equation 3.38 becomes

P[xp(y′) | y′] = Px[xp(y′) | y′]    (3.43)
               = Px[xp(y′), y′] / P[y′]    (3.44)

where the last line follows from the definition of y′ = xs(ỹ) and the chain-rule of proba-
bilities, and Px corresponds to the conditional distribution of the entire function Px [x(ỹ)].
The equivalent of Equation 3.39 is

  Px [xp (y ), y ] P [y ]
P xp (y ) = dy (3.45)
P [y ]

 
= Px xp (y ), y dy (3.46)

and is reasonable as the distribution over y′ determines which xp(ỹ) are included in the
marginalisation. Note that this is not simply the marginalisation of the joint distribution,

12. This component represents a particular location in the domain and will be used to determine the instantaneous value of the properties at that location, and so corresponds to the Dirac delta.
but represents the marginalisation of the joint formed by replacing x(ỹ) by the value x(y′).
Finally, using Bayes’ rule on Equation 3.44 yields

P[y′ | xp(y′)] = Px[xp(y′), y′] P[y′] / ( P[y′] P[xp(y′)] )    (3.47)
               ∝ Px[xp(y′), y′]    (3.48)

where the prior is included in the joint distribution Px and the calculation fixes the value
of xp(y′) to obtain the required distribution.

The distribution Px[x(ỹ)] represents a distribution over functions over the domain Y.
Given an unknown location described by P[y′], the distribution over values of the
function can be obtained according to Equation 3.39,

P[x(y′)] = ∫ Px[x(y′)] P[y′] dy′    (3.39)

in which Px[x(y′)] is the value of the probability for a given value of y′. Likewise,
given a value of the function x(y′) the location can be inferred from Equation 3.41,

P[y′ | x(y′)] ∝ Px[x(y′)] P[y′].    (3.41)

The location value y′ may form part of the function, that is, x(ỹ) = [xs(ỹ), xp(ỹ)]
where the subscripts s and p refer to spatial and property respectively. In this case the
equivalent expressions are given by Equations 3.46 and 3.48,

P[xp(y′)] = ∫ Px[xp(y′), y′] dy′    (3.46)

P[y′ | xp(y′)] ∝ Px[xp(y′), y′].    (3.48)

Example 3.4.1 – Domain inference in topological space


Consider a linear topological space defined by the domain Y = {y1 , y2 , y3 } with

y1 ←→ y2 ←→ y3

and let a single quantity x(y) ∈ [0, 1] be a property defined over that space. Following
the conventions above, define the family of distributions as the distribution over the dirac-
functions corresponding to each location. Figure 3.8(a) shows the resulting distributions
over the value of the property x at each location. Recall that this is not equivalent to the
full distribution P [x(y)] for manipulation, but is a viable interpretation for utilising it.

Now, assume that a measurement is taken somewhere in the domain y′ ∈ Y, with the
probability of its actual location being given by

Py[y′] = [0.3, 0.6, 0.1]    (3.49)

In this case the resulting distribution over the property is given by Equation 3.39,

P[x(y′)] = P[x(y1)] × Py[y′ = y1] + P[x(y2)] × Py[y′ = y2] + P[x(y3)] × Py[y′ = y3],    (3.50)
and is shown in Figure 3.8(b). It can be readily verified that the resulting distribution is
normalised and it is possible to infer that the mean yields the expected value of the property
as x̂ = 0.5445.

Conversely, if the same distribution P [x(y)] were known, but a measurement made of the
quantity x, then the resulting distribution describing where in the space the measurement
was taken is required. Here let x = 0.6 and so Equation 3.41 yields,

P [y|x(y) = 0.6] = [ P [x(y1 ) = 0.6] × P [y = y1 ] ,

P [x(y2 ) = 0.6] × P [y = y2 ] ,

P [x(y3 ) = 0.6] × P [y = y3 ] ] (3.51)

and if the prior distribution for the value of y is uniform, then this yields

P[y | x(y) = 0.6] = [ 1.6 × 10⁻⁴, 0.333, ≈ 0 ]    (3.52)
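The calculations of Example 3.4.1 can be mechanised in a few lines. Because the per-node distributions of Figure 3.8(a) are not tabulated in the text, the sketch below substitutes assumed Gaussian-shaped distributions over x ∈ [0, 1]; it therefore illustrates Equations 3.50 and 3.51 rather than reproducing the quoted values (x̂ = 0.5445 and Equation 3.52) exactly.

import numpy as np

# Discretise the property x over [0, 1].
x = np.linspace(0.0, 1.0, 100)
dx = x[1] - x[0]

def density(mu, sigma):
    p = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return p / (p.sum() * dx)

# Assumed per-node distributions P[x(y_i)] (stand-ins for Figure 3.8(a)).
p_x_given_y = [density(0.3, 0.05), density(0.6, 0.08), density(0.8, 0.05)]

# Equation 3.50: value distribution for the uncertain location y'.
p_y = np.array([0.3, 0.6, 0.1])
p_x = sum(py * pxy for py, pxy in zip(p_y, p_x_given_y))
print("expected property value:", np.sum(x * p_x) * dx)

# Equation 3.51: location posterior given an observed value x = 0.6,
# with a uniform prior over the three nodes.
idx = np.argmin(np.abs(x - 0.6))
likelihood = np.array([pxy[idx] for pxy in p_x_given_y])
posterior = likelihood / likelihood.sum()
print("location posterior:", posterior)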

3.4.2 Functional Sensor Models

It was noted in Section 1.3.2 that the application of resource constraints implies that in
any practical system it may be advantageous to store a transformed form of the original
sensory data, rather than the raw data itself. For this reason in this work the lth sensory

(a) Functional distributions (b) Value distribution for unknown y

Figure 3.8: Inference in functional models. (a) shows the functional distributions with y1 , y2
and y3 in red, blue and green respectively; while (b) shows the resulting value distribution
for an unknown location.

observation will be denoted by ẑl (vl ) ∈ Zl for some space Zl of observations with an in-
herent domain represented by vl ∈Vl .13 As with the domain Y , Vl represents an arbitrary
operational domain and can be spatial, spatio-temporal, topological or defined by any other
collection of coordinate systems. However, the quantities being maintained in the state
remain represented by x(y) ∈ X, and the notation is intentionally analogous to traditional
state-space approaches. The ideal structural approach of Figure 3.1 corresponds to the sit-
uation in which the observations and state functions are equivalent (or at least correspond
to the same quantities to within the addition of uncertainty from the sensing process). The
sensing part of the model clearly corresponds to the transformation from the observational
domain to the domain of the state representation.

The observation values in Zl can affect only a subset of the state space X, but for generality
it must be assumed to be capable of affecting any component of X at any location in Y .
In order to utilise the information, it is desired to explicitly determine the effect of the
observation on the estimate of the functions x(y),

P [x(y)|zl (vl )] . (3.53)

Unfortunately, this expression represents an ‘inverse’ problem in that the sensory data
is used to infer the environment which gave rise to that observation and is particularly
difficult when the transformation between the state space {X, Y } and the sensor space
13. As with the functional forms of Section 3.3 the random variable associated with an observation will be
denoted zl (vl ) while a particular value of that variable will be denoted ẑl (vl ).

{Zl , Vl } is many-to-one. In general, these problems are extremely difficult to solve adequately
and model-free approaches may be formally impossible to define (Leonard & Durrant-Whyte
1991, §III).

It is possible, however, to develop an equivalent forward model in which physical effects are
used to construct the distribution

P [zl (vl )|x(y)] (3.54)

which represents a model of the physical interaction between the state function and the
sensor itself. This expression can be interpreted in three important ways:

• as a general relationship between an arbitrary state and an arbitrary observation obtained from it;

• as a probability distribution over possible observations given a particular state function x̂(y); and

• as a likelihood distribution over the state from a given observation ẑl (vl ).

Bayes’ rule can then be utilised to obtain Equation 3.53 in terms of Equation 3.54 as

P[x(y) | zl(vl)] = P[zl(vl) | x(y)] P[x(y)] / P[zl(vl)]
                 ∝ P[zl(vl) | x(y)] P[x(y)]                         (3.55)

where P [x(y)] represents a ‘prior’ distribution, that is the knowledge regarding the state
before receiving the observation. The direct application of the equivalency of Equation 3.19
implies that this can be written as

P [ α | zl ] ∝ P [ zl | α ] P [α] (3.56)

where the functional dependency of the observation on the domain Vl has not been written
explicitly. This expression is exactly analogous to the ‘traditional’ approach in which a
model for the environment is developed as a series of parameters. The term P[zl(vl)] in
Equation 3.55 represents a normalising factor and is often handled implicitly by normalisa-
tion of the posterior distribution P[x(y) | zl(vl)] following the use of Equation 3.55.

When constructing the model it should be noted that the functional approach favoured
in this thesis encodes both the values of the functions and also their dependence on the
domains Y and Vl . A significant advantage of this approach is that models which incorporate
uncertainty and ambiguity in both the values and the domains can be constructed. If each
random quantity represents a function over a given domain, then the conditional distribution
should be seen as a new function, defined over the composite domain {Y, Vl } and taking on
values from the range {X, Zl }. That is, if the function x(y) represented a one-dimensional
function over a two-dimensional space and zl (vl ) a one-dimensional observation along a ray
through that space,14 then the resulting conditional distribution is formed over the space
{Y, X, Vl , Zl } ∈ R(2+1+1+1) = R5 . Setting the values of y, x(y), vl and zl (vl ) therefore yields
a probability measure, and the probability measure will be normalised over the sub-domain
{Vl , Zl }, though it is functionally dependent on the sub-domain {Y, X}.

Section 4.3 derives a convenient formulation for the transform TZl encoded by P [zl (vl ) | x(y)]
according to15


TZl : P[x(y)] → P[zl(vl)] = ∫ P[zl(vl) | x(y)] P[x(y)] dx(y)        (4.3)

and it is shown there that the character of the transformation is captured completely in the
conditional distribution P [ zl (vl ) | x(y) ].16

The conditional probability distribution can be constructed from three parts:

1. A deterministic transform of the domains (Y and Vl ) and function values (X and Zl )

   Tdet : x̂(y) → ẑl(vl) = ∫ fz[x̂(y), y, vl] dy                     (4.5)

and the convenient simplifications of Section 4.3.1. In this equation fz represents an


arbitrary function of the input property values17 (x̂), the input location (y) and the
output location (vl ).
14. Formally, Y ∈ R², X ∈ R, Vl ∈ R and Zl ∈ R.
15. With the feature space function s(q) being replaced directly by the sensor function zl (vl ).
16. This results from the fact that the conditional represents the relationship between arbitrary distributions P [x(y)] and P [zl (vl )] whereas the joint distribution represents a particular instantiation of the two distributions.
17. Since this transform is deterministic the input is assumed to consist of a single function x̂ rather than a distribution.

2. Uncertainties associated with the function values only

Pvalue [ zl (vl ) | x(y) ] = Pvalue [ zl (vl ) − Tdet {x(y)} ] (4.12)

where the distribution Pvalue captures the uncertainty in the value of zl ∈ Zl at


the particular location vl due to the effect of the particular function x(y).18 This
uncertainty is written here in the space of the observation function as this corresponds
well to the noise introduced by a real sensor.

3. Uncertainties in the domain Y (‘spatial’ uncertainty in transform)



     
P[zl(vl) | x(y)] = ∫ Pvalue[zl(vl) − Tdet{x(y′)}] P[y′ | y] dy′     (4.14)

where the two previous parts have been incorporated into the first term inside the
integral. The distribution P[y′|y] represents the uncertainty in the random variable y′
corresponding to the actual location in the domain when the estimate of that location
is the deterministic value y.19

The key to the utilisation of the sensor models using Equation 3.55 revolves around the
careful construction and understanding of the forward model of Equation 3.54. Most im-
portantly, this equation does not represent the transformation of new data from the space
of the sensor into the representation, but rather represents the statistics of the anticipated
sensory observation, given the current value of the state. This distinction is paramount:
the model describes information propagated from the state to the observational space, not
the inverse. It is the utilisation of Bayes’ rule in Equation 3.55 which allows the state
to be updated. In essence the operation of this class of sensory model involves using the
current state to predict the observation (using Equation 3.54) and statistically comparing
the prediction probabilities with an actual observation to determine the degree of evidence
supporting the current state distribution, which is then used to ‘correct’ the state appropri-
ately. It is this effect which requires that a viable estimation scheme begin with a reasonable
prior distribution for the state (often the non-informative, uniform distribution) as there
can be no ‘first’ step of the process.
18. Note that the entire function x(y) can affect any particular function value of the output (e.g. the Fourier transform).
19. Note that this expression of uncertainty explicitly encodes the uncertainty regarding which particular elements of Y affect the given observation ẑl (vl ). That is, which elements Y′ ⊂ Y should be associated with that observation. See Section 4.3.3 for details of how this equation formalises the data-association problem.

New data described by an observation ẑl (vl ) can be combined with a prior state dis-
tribution P [x(y)] according to Equation 3.55,
P [ x(y) | zl (vl ) ] ∝ P [ zl (vl ) | x(y) ] P [x(y)] , (3.55)
where the physical and statistical characteristics of the sensor are entirely contained
within the conditional distribution P[zl(vl) | x(y)]. A convenient form for this condi-
tional capturing the three effects of a deterministic domain transformation, uncertainty
in the resulting value of the function, and uncertainty in the domain elements corre-
sponding to the particular observation, is described extensively in Section 4.3.

Example 3.4.2 – Single-return range-bearing sensor model


Consider a mobile robot moving around a two-dimensional world using a range-bearing
sensor. The transformation from the cartesian space into the observational domain can be
written as

(x, y) → θ = arctan[(y − ys)/(x − xs)]                              (3.57)

(x, y) → r = √[(x − xs)² + (y − ys)²]                               (3.58)

where (x, y) is a location in Y and (xs , ys ) is the known sensor location (this is treated as a
parameter of the transformation here). Since the noise model of the sensor is also defined in
the sensor space and depends only on the difference between this noise-free transformation
and a given location then this corresponds to a kernel form of Section 4.3.1. The transfor-
mations above form the mapping g(y) taking the individual elements of Y and mapping them
to V . Since there is uncertainty about how the elements of the two spaces correspond, but
this uncertainty is well described in terms of the difference between the ‘true’ location
and the measured location, a Gaussian model can be written as

fz [x̂(y), y, v] = N (g[x̂(y)] − v) (3.59)

Treat the function x̂(y) of that section as indicating the presence of an object at the location
y = (x, y); that is,
x̂(y) = δD (y − y0 ) (3.60)

for some ‘true’ location y0 and note that the introduction of the uncertainty in the mapping

above means that the sensor measure z(v) is interpreted as the probability measure associated
with the location corresponding to the true object location. This yields

z(v) = N (g[y0 ] − v) (3.61)

Next, consider that the quantity being measured here is already a probability distribution and
is affected entirely by the transformational uncertainty and so there is no need to consider
the introduction of a measurement value noise term.

Finally, uncertainty associated with the location of the sensor will introduce effects into the
transformation itself and the distribution describing the true sensor location can be written
as
P [ys ] (3.62)

Putting this into the domain uncertainty part of the model yields

P[v|x(y)] = ∫ N(g[y0] − v) P[ys] dys                                (3.63)

because g is parameterised by ys and this corresponds to the distribution resulting from the
sum of two independent random variables (Papoulis & Pillai 2002, pp181-3): the sensor
measurement and the vehicle location.
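A numerical sketch of this sensor model is given below; it evaluates the observation likelihood of Equation 3.63 over a grid of candidate object locations, using a Monte Carlo approximation of the integral over the uncertain sensor position. The noise parameters, grid extents and sample counts are illustrative assumptions rather than values drawn from any particular sensor.

```python
import numpy as np

# Sketch of the single-return range-bearing model of Example 3.4.2.
# Builds the observation likelihood over a grid of candidate object locations,
# marginalising over uncertainty in the sensor position (Equation 3.63).
sigma_r, sigma_theta = 0.2, np.deg2rad(2.0)       # sensor-space noise (cf. Eq 3.59)
sensor_mean = np.array([0.0, 0.0])
sensor_samples = sensor_mean + 0.1 * np.random.randn(200, 2)   # samples of P[y_s]

def g(points, sensor):
    """Deterministic map from cartesian locations to (range, bearing) (Eqs 3.57-3.58)."""
    d = points - sensor
    r = np.hypot(d[:, 0], d[:, 1])
    theta = np.arctan2(d[:, 1], d[:, 0])
    return r, theta

# Grid of candidate 'true' object locations y0.
xs, ys = np.meshgrid(np.linspace(1, 10, 90), np.linspace(-5, 5, 100))
grid = np.column_stack([xs.ravel(), ys.ravel()])

# A single observation z = (r, theta).
z_r, z_theta = 6.0, np.deg2rad(20.0)

# Likelihood of the observation for each candidate location, averaged over
# the sampled sensor positions (Monte Carlo form of the integral in Eq 3.63).
like = np.zeros(len(grid))
for s in sensor_samples:
    r, theta = g(grid, s)
    dtheta = np.angle(np.exp(1j * (theta - z_theta)))            # wrap bearing error
    like += np.exp(-0.5 * ((r - z_r) / sigma_r) ** 2
                   - 0.5 * (dtheta / sigma_theta) ** 2)
like /= like.sum()                                               # normalised over the grid
print("most likely object cell:", grid[np.argmax(like)])
```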

It was noted in Section 3.2.2 that the concept of a sensor space can be used to examine and
consider the relationships, both deterministic and probabilistic, between different sensory
cues. An important consideration in this context relates to the manner through which
different elements of the state x(y) become statistically related to one another, especially
when those interactions are captured by different sensory cues. In considering this issue,
it is necessary to retain a clear distinction between modelling approximations in which the
statistical relationships between different elements are selectively included20 and the case
involving elements which have an inherent statistical dependence. Of interest here is the
methodology through which these statistical dependencies are observed and introduced into
the representation.
20. See Section 3.5.3 for a detailed development of these approaches.

While at first inspection these dependencies appear to be implied by the selection of the
representation itself, they are actually generated and maintained through the sensor models.
Let two elements of the state function be denoted by x1 (y1 ) ∈ X1 ⊂ X and x2 (y2 ) ∈
X2 ⊂ X to indicate that the elements under consideration can be different parts of the
state space X and/or different elements from the functional domain Y . As noted above,
the methodology through which the observation of a particular sensory reading ẑl (vl ) is
determined by the anticipated statistics of the possible observations is described by the
conditional distribution P [zl (vl )|x(y)]. It is clear then that in order for two elements of the
state to be correlated, they must be observed simultaneously in such a model. That is, they
must both affect a given sensory observation. Obviously, these simultaneous dependencies
can be generated in many different ways:

• through the model explicitly involving different domain elements y1 and y2 , such as
a sonar sensor interacting with a series of elements within an environment;

• by dependence on both subspaces X1 and X2 of the state value space, such as a camera
providing both colour and texture information;

• a combination of both location and cue; or

• through correlations with a common third part of the state x3 (y3 ), such as the manner
by which landmarks become correlated in a SLAM implementation (Bailey 2002, p21).

3.4.3 Sequential Updating

While the models of Section 3.4.2 are useful for combining the information from individual
observations, these observations become available to the system sequentially21 and it is often
desirable to obtain the best estimate of the state after k such observations, x̂k (y). It should
be noted that the selection of an updating scheme is related strongly to the task at hand
(in a similar manner to the selection of the actual quantities to store in the representation
in response to resource limitations).
21. This sequence need not correspond to temporal order. In the case of temporally asequent data, a general theory requires the explicit management of both temporal propagation effects in the representation and the effect of the delayed data on the resulting values. See Nettleton & Durrant-Whyte (2001) for an explicit example.

Following this notation let the kth observation be denoted ẑlk (vl )22 and the sequence of
observations up to the kth be Zk = {ẑ.1 (v. ), ẑ.2 (v. ), . . . , ẑ.k (v. )}. Furthermore, there is no
requirement that the state be fixed throughout the sequence {0, 1, . . . , k} and it is assumed
that a set of parameters uk can be used to describe the transition process for a given interval
Δk = (k − 1, k]. The history of these parameters will be written as Uk = {u1 , u2 , . . . , uk }.
If it is assumed that these parameters are known exactly as they become available then this
allows their effect to be summarised as a ‘motion’ model for the distribution,

P [ xk (y) | xk−1 (y), uk ] , (3.64)

and the ‘motion’ model for the state can be any function of the domain y (and time t
if the model domain does not include it). It is assumed in this model that the transition
behaviour of the distribution depends only on the previous state xk−1 (y) and the parameters
uk ; that is, the sequence of distributions in k form a Markov chain.23 This reduces to the
‘no-motion’ case when the distribution of Equation 3.64 reduces to a Dirac delta function
with dependence only on the previous state and not on the motion parameters,

P [ xk (y) | xk−1 (y), uk ] = δD [xk (y) − xk−1 (y)] . (3.65)

Following the methodology of Durrant-Whyte (2001, pp91-3) it is possible to derive a re-


cursive estimation scheme for combining the observations, ‘motion’ parameters24 and an
initial estimate of the state, P [x0 (y)]. In addition to assuming that the evolution of the
distribution is governed by the previous state and parameters uk , the derived methodology
assumes that each observation ẑ.k (v. )25 is conditionally independent of all others given the
‘true’ state xk (y) and does not depend on either the control inputs or the initial state

22. Where the subscript of l remains only to distinguish the different sources of the observations in the sequence.
23. Note that the Markovian assumption is reasonable, but is usually conservative and that better performance can be obtained through other techniques such as higher-order models or bundle-adjustment (Deans & Hebert 2001).
24. In the literature these are normally referred to as the ‘control’ inputs and represent those characteristics affecting the time-evolution of the distribution which are known to the system. Uncertainty in their application and other ‘unknown’ effects are usually lumped together into the probabilistic form of Equation 3.64.
25. The subscripted dot in place of an index is used to indicate that that index can take on any value in this context.

(unless it corresponds to the initial observation); that is,

 
P[z.k(v.) | xk(y), Zk−1, Uk, x0(y)] = P[z.k(v.) | xk(y)].           (3.66)

The resulting estimation scheme can be summarised as follows: given a prior distribution
   
P xk−1 (y) | Zk−1 , Uk−1 , x0 (y) and a motion model P xk (y) | xk−1 (y), uk the state-
prediction, prior to receipt of observation k is given by

 
P[xk(y) | Zk−1, Uk, x0(y)]
    = ∫ P[xk(y)|xk−1(y), uk] P[xk−1(y)|Zk−1, Uk−1, x0(y)] dxk−1(y)   (3.67)

where it has been assumed that the motion model conforms to the restrictions above and
that the prior distribution is unaffected by knowledge of the parameter uk and so is equiv-
alent to

   
P[xk−1(y) | Zk−1, Uk−1, x0(y)] = P[xk−1(y) | Zk−1, Uk, x0(y)].       (3.68)

Secondly, knowledge of this ‘prediction’ and an observation model conforming to Equation


3.66 allows the posterior to be written as

   
P[xk(y) | Zk, Uk, x0(y)]
    = P[z.k(v.)|xk(y)] P[xk(y)|Zk−1, Uk, x0(y)] / P[z.k(v.)|Zk−1, Uk, x0(y)]   (3.69)
    ∝ P[z.k(v.)|xk(y)] P[xk(y)|Zk−1, Uk, x0(y)]                                (3.70)

as the denominator does not depend on the current estimate of the state. The importance
of this result is that the functional form can be written as a recursive estimation scheme
fundamentally identical to the ‘traditional’ Bayesian form.

Furthermore, recognising the explicit equivalency of Equation 3.19 converts the updating
equations into the ‘standard’ form, replacing the functional forms here by their parameters.
An examination of Durrant-Whyte (2001, pp91-3) reveals that applying this equivalency
results in the same formulations, with the state variables x replaced by the functional-form
parameters α.
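The recursion of Equations 3.67 and 3.70 can be illustrated with a simple histogram filter over a discretised scalar state, as sketched below; the Gaussian motion and observation models and the particular sequence of parameters u_k and observations z_k are assumed purely for illustration.

```python
import numpy as np

# Minimal sketch of the recursion in Equations 3.67 and 3.70 for a scalar
# state on a discrete grid (a histogram filter); models are illustrative.
grid = np.linspace(0.0, 10.0, 201)
dx = grid[1] - grid[0]
belief = np.full_like(grid, 1.0 / (grid.size * dx))      # non-informative P[x_0]

def gaussian(mean, sigma):
    p = np.exp(-0.5 * ((grid - mean) / sigma) ** 2)
    return p / (p.sum() * dx)

def predict(belief, u_k, sigma_q=0.3):
    """Equation 3.67: convolve the prior with the motion model P[x_k | x_{k-1}, u_k]."""
    predicted = np.zeros_like(belief)
    for i, x_prev in enumerate(grid):
        predicted += belief[i] * dx * gaussian(x_prev + u_k, sigma_q)
    return predicted

def update(belief, z_k, sigma_r=0.5):
    """Equation 3.70: multiply by the observation likelihood and renormalise."""
    posterior = belief * np.exp(-0.5 * ((z_k - grid) / sigma_r) ** 2)
    return posterior / (posterior.sum() * dx)

# A short sequence of motions u_k and observations z_k (assumed values).
for u_k, z_k in [(1.0, 1.2), (1.0, 2.1), (1.0, 2.9)]:
    belief = update(predict(belief, u_k), z_k)

print("posterior mean:", (grid * belief).sum() * dx)
```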

3.5 Functional Approximations

The functional formulation of the representation problem discussed in Section 3.3 provides
a consistent and thorough approach to understanding the nature of the sensing and rep-
resentation process. Unfortunately, as was shown in Section 3.3.1, this formulation gives
rise to parameter spaces with dimensionality equal to the product of the cardinalities of the
domain set and dimensionality of the property value-space, dim(α) = |Y | × dim(X) and
the approach should be interpreted as ‘ideal’ but intractable26 . This section examines two
methods by which the function space interpretation can be utilised while still enabling the
construction of practical implementations.

3.5.1 Mixture Representations

The function-space representation of the property function f (y) defined over the domain Y
was written in Equation 3.17 and the equivalent form for a discrete domain is


f(yj) → fα(yj) = Σi αi ei(yj)                                       (3.71)

and it was seen in Section 3.3.1 that the coordinate system used to represent the space of
functions FD defined over the domain Y would have a dimensionality equal to the cardinality
of that domain.27

A flexible and convenient approximation scheme involves restricting the functions f (y) to a
subset FY ⊂ FD of the set of functions defined over the domain Y by limiting the number
of basis functions which are available. For example, the functions might be restricted to
piecewise constant with a regular spacing, or to a Fourier series truncated at an upper
frequency limit ωmax . Functions which belong to [FY ]c ⊂ FD , the complement of the
selected subset, cannot be represented under the scheme, though they may be approximated
by selection of a member of the reduced set. The parameterisation becomes


f(yj) → fα(yj) = Σi αi ei(yj)    where {ei(yj)} spans FY ⊂ FD       (3.72)
26. Note that for a sensory-space X and domain Y , the resulting parameter space is given by this expression, but the Dirac-space conditional distribution can be interpreted as being over dim(Y ) + dim(X).
27. Obviously, if the function f (y) is vector or tensor-valued, then the dimensionality is multiplied by the dimensionality of that space, dim(X).

where the parameters now represent a vector in a space of dimension equal to the cardinality
of the basis set. Thus, provided that the subset FY represents a function space which
approximates the ideal case sufficiently accurately for a specific system, then the reduced set
can be used effectively. This implies two issues of particular significance for the developers of
practical systems: what constitutes ‘sufficiently accurately’, and how should computational
constraints be managed in a trade-off between dimensionality and accuracy? These issues
will be considered in detail in Section 3.6 once practical methodologies for achieving the
approximations have been discussed.

The coordinate basis representation of Equation 3.72 is the formal definition of a ‘function-
space’ representation, in which an arbitrary function is represented as a point in a space
spanned by a convenient set of functions. A ‘mixture model’ is a modified form of function-
space representation and is written as


f(y) → fβγ(y) = Σk βk gk(y, γk)                                     (3.73)

where the gk (y, γk ) represent the parameterised mixture functions and the βk the weight
of that particular ‘component’. The basis vectors ei of Equation 3.71 and the mixture
components gk of Equation 3.73 are equivalent representations, provided that the mix-
ture components are identified as corresponding to a set of basis vectors indexed by the
parameters γk .

For example, a Fourier series could be represented using gci (t, γci ) = cos(2πγci t) and
gsj (t, γsj ) = sin(2πγsj t) where the γ. represent the frequencies of the sinusoidal compo-
nents and
 
f(t) → fβγ(t) = Σi βci gci(t, γci) + Σj βsj gsj(t, γsj).            (3.74)
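The following sketch illustrates the mixture form of Equation 3.74: a sampled function is encoded by the weights β of a small set of sinusoidal components whose frequencies γ have been selected in advance. The target function, the chosen frequencies and the least-squares fit used to obtain the weights are illustrative assumptions; the point is the reduction from the cardinality of the domain to the number of mixture parameters.

```python
import numpy as np

# Sketch of the mixture form of Equation 3.74: a function of time is encoded
# by a small set of sinusoidal components with frequencies gamma and weights
# beta. The target function and frequency set are illustrative assumptions.
t = np.linspace(0.0, 1.0, 400)
f_true = np.where(t < 0.5, 0.2, 0.8) + 0.05 * np.sin(2 * np.pi * 7 * t)

gammas = np.array([1.0, 2.0, 3.0, 5.0, 7.0])      # selected component frequencies
basis = np.column_stack([np.ones_like(t)]
                        + [np.cos(2 * np.pi * g * t) for g in gammas]
                        + [np.sin(2 * np.pi * g * t) for g in gammas])

# The weights beta are the coordinates of f in the reduced basis (least squares).
beta, *_ = np.linalg.lstsq(basis, f_true, rcond=None)
f_approx = basis @ beta

print("number of parameters:", beta.size, "(vs", t.size, "domain samples)")
print("RMS approximation error:", np.sqrt(np.mean((f_true - f_approx) ** 2)))
```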

The advantage of Equation 3.73 over Equation 3.72 is that the summation can be taken
over a specific number of components, rather than the entire basis set. It is important to
note that the parameters β and γ represent a two-part coordinate system: the γ index the
elements of ei in a convenient form and, therefore, represent a coordinate-system for the
reduced function basis; while the β represent the coordinates of the function f (y) according
to the basis set selected. This approach is often referred to as an ‘adaptive basis’ set in the
literature (Mackay 1998, p4). It is possible to define the relationship between the value,

domain and parameters in an equivalent form to Equation 3.20. Noting that the relationship
in Equation 3.73 is deterministic, it is possible to write
P[x(y)|β, γ] = δD[ x(y) − Σ_{i=1}^{N} βi gi(y, γi) ]                (3.75)

where the summation is over a finite number, N , of mixture components and provides an
approximation of the arbitrary function x(y) by the nearest member of the set spanned by
the particular basis set selected. If it is desired, the approximation effects could be lumped
with this model through the use of a conditional distribution other than the Dirac delta
and, indeed, the uncertainty introduced can be an arbitrary function of y, β and γ.

This means that it is possible to marginalise over the parameters to obtain the conditional
distribution of the value of the property function, x, given the domain locality, y, as

P[x(y)] = ∫ P[x(y)|β, γ] P(β, γ) dβ dγ                              (3.76)

where it is critical that the joint distribution P (β, γ) is considered as the parameters will not
be independent. The dimensionality of the resulting parameter problem is reduced from the
cardinality of the function basis (equivalently the domain Y ) to the product of the number
of components, N , and the number of parameters for each mixture function gi , M say.
These techniques are used extensively to maintain compact representations of distributions.
In these cases the parameters defining the distribution are assumed to be deterministically
known and the uncertainty is maintained in the represented distribution. The parameters
are updated according to the direct manipulation of the function itself (Ridley et al. 2004).

It should be clear that reducing the basis set used to represent the functions, whether they
be in the function-basis or mixture model form will result in a reduction in the size of the
set of representable functions, but more importantly, if represented in this fashion, it is
no longer possible to represent arbitrary dependencies between the function values at the
individual elements of the domain. That is, it is possible to retain the interactions which
are representable by the interdependencies between mixture components, but if the selected
basis functions are restricted, then there may be interactions which are lost.

As an example, if the function is approximated by piecewise constant functions of width Δ,


then it will not be possible to capture or represent a local deviation over a distance less than

Δ, unless that interaction occurs over the boundary between two constant regions. Another
example of this principle occurs when using a function space defined by a Fourier series
with a maximum frequency of ωmax . In this space, it will not be possible to represent either
functional variations or dependencies which occur with a frequency greater than ωmax .

The functional forms can be written in an arbitrary basis {ei (yj )} according to Equa-
tion 3.71,

f(yj) → fα(yj) = Σi αi ei(yj).                                      (3.71)
Defining a new basis {ei (yj )} spanning the subset FY ⊂ FD of the Dirac function
space, this can be written according to Equation 3.72,

f(yj) → fα(yj) = Σi αi ei(yj)                                       (3.72)

and functions belonging to the complement of this space [FY ]c are approximated by
members of the space.
Similarly, a mixture model can be defined according to the parameters (βk , γk ) and a
parameterised family of basis functions {gk (yj , γk )} using Equation 3.73,

f(y) → fβγ(y) = Σk βk gk(y, γk).                                    (3.73)

Since knowing the parameters is equivalent to knowing the functions it is possible to


obtain Equation 3.75,
P[x(y)|β, γ] = δD[ x(y) − Σ_{i=1}^{N} βi gi(y, γi) ]                (3.75)

and the resulting distribution over the value space for a given parameter distribution
P (β, γ) as Equation 3.76,

P[x(y)] = ∫ P[x(y)|β, γ] P(β, γ) dβ dγ.                             (3.76)

3.5.2 Graphical Models for Functional Forms

The major obstacle to practical implementation of the function estimation approach occurs
as a result of the computational cost of maintaining a distribution in a very high-dimensional

space. Even with the approximations of Section 3.5.1 the distributions remain computation-
ally challenging. For example, even if the parameter distribution is assumed to be jointly
Gaussian and the approximation contains only 100 parameters (which will be a small num-
ber for situations such as a terrain surface estimation) then this requires maintaining a
100 × 100 covariance matrix.

This section examines techniques which are available for developing approximations to the
fully-dependent cases. Specifically, instead of maintaining the large single distribution,
P (αi ) or P [α(w)] for discrete and continuous parameter domains respectively, it is common
to approximate the parameter space by one which limits the direct influence of parameters
according to some criteria of ‘locality’.28 As an example, consider the earlier example of
estimating the heat distribution on the surface of a sphere, where it may be reasonable to
assume that the influence of a local temperature is limited to an immediate neighbourhood
of the domain Y . The partitioning can be made especially clear when the parameterisation
is the Dirac function basis in which the concept of physical proximity in Y is identical to
the parametric proximity in W ≡ Y .

The use of graphical models is a technique which provides a unique insight into the de-
velopment of approximations to a fully-dependent joint distribution through the explicit
application of conditional independence assumptions between the random variables. In
particular, these methods allow the designer to construct approximations which are inter-
mediate between independent parameter groups (including fully-independent cases) and the
fully-dependent joint distributions. The functional representation problem corresponds to a
specific utilisation of the graphical modelling techniques where the dependency structure of
the model is constructed by the system designer subject to analysis of the representational
problem at hand. Critically, however, there is no requirement that the degree of interac-
tion be known or assumed a priori, nor that similar interactions at different points in the
parameter space W should have the same properties.

Formalising this partitioning of the distribution, it is desired to approximate the fully-


dependent distribution with a distribution where influences are maintained only when they
meet some criteria of dependency ‘proximity’. If it is reasonable to approximate the joint
distribution by assuming that a given random variable depends directly on only a subset
28. This locality is defined in the sense of ‘statistical’ dependency, rather than necessarily corresponding to the value of the coordinates (or indices) of the parameters.

[Figure: directed chain xt−1 → xt → xt+1 , each state with an observation zt below it.]

Figure 3.9: A simple Markov chain for a recursive estimation scheme

[Figure: undirected chain x1 — x2 — x3 — x4 .]

Figure 3.10: A Markov chain for demonstrating conditional independence of non-adjacent
variables, given intermediate variables

of the available random variables, then these local interactions can be used to propagate
information over greater ‘distances’ in the parameter space. Consider, for example, a case
involving a set of state estimates Xk and a set of observations of that state Zk . While
the fully-dependent distribution would correspond to the conditional distribution involving
both sets, P (Xk |Zk ), it is reasonable to assume that the conditional distribution of a single
state estimate xt , given all other state estimates and all observations, can be written as

 
P[xt | Xk≠t, Zk] = P[xt | xt−1, zt];                                (3.77)

that is, the value of any estimate at time t is independent of the future and depends only
on the immediate past and the current observation. Figure 3.9 shows a graphical depiction
of this assumption where the arrows depict causality.

The relationship is one of conditional independence, where variables which do not have a
causal relationship between them are assumed conditionally independent, given intermediate
variables(Murphy 2001, Jordan 2004). In addition to causal relationships, it is possible
to represent non-causal dependencies using undirected edges in a graph. A collection of
random variables with undirected interactions is usually known as a Markov Random Field
(MRF) (Boykov et al. 1998) as opposed to the more sophisticated cases where causality
is incorporated. In this case variables which do not have ‘proximity’ to one another are
assumed conditionally independent given the value of intermediate variables. Consider
the one-dimensional Markov random field of Figure 3.10 and note that the conditional

independence results in many different expressions, including

P (x1 , x3 |x2 ) = P (x1 |x2 )P (x3 |x2 ) (3.78)

P (x1 , x4 |x2 ) = P (x1 |x2 , x4 )P (x4 |x2 )

= P (x1 |x2 )P (x4 |x2 ) (3.79)

P (x1 , x4 |x2 , x3 ) = P (x1 |x2 , x3 , x4 )P (x4 |x2 , x3 )

= P (x1 |x2 )P (x4 |x3 ) (3.80)

where in each case it is recognisable that each variable can affect non-adjacent variables,
but only through interactions with its neighbours. Alternatively, if the interactions in this
graph were all directed (towards the right) then a different set of factorisations exist for the
graph. In particular, for the undirected case the joint can be written as

P (x1 , x2 , x3 , x4 ) = P (x1 |x2 )P (x2 |x3 )P (x3 |x4 )P (x4 ) (3.81)

= P (x4 |x3 )P (x3 |x2 )P (x2 |x1 )P (x1 ) (3.82)

but in the directed case, Equation 3.81 is invalid as the causal relationship exists in the
forward direction.
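These factorisations are easily verified numerically. The sketch below constructs a joint distribution over four binary variables from arbitrary pairwise potentials on the chain of Figure 3.10 and checks Equations 3.78, 3.81 and 3.82 directly; the random potentials are illustrative only.

```python
import numpy as np

# Numerical check of the chain factorisations in Equations 3.78-3.82 for four
# binary variables whose joint is built from pairwise potentials (x1-x2-x3-x4).
rng = np.random.default_rng(0)
psi12, psi23, psi34 = (rng.random((2, 2)) for _ in range(3))

# Joint P(x1,x2,x3,x4) proportional to the product of the pairwise potentials.
joint = np.einsum('ab,bc,cd->abcd', psi12, psi23, psi34)
joint /= joint.sum()

def marg(axes):
    """Marginal over the named variables (1-indexed as in the text)."""
    keep = [a - 1 for a in axes]
    drop = tuple(i for i in range(4) if i not in keep)
    return joint.sum(axis=drop)

# Equation 3.78: P(x1, x3 | x2) = P(x1 | x2) P(x3 | x2).
p123 = marg([1, 2, 3])
p2 = marg([2])
lhs = p123 / p2[None, :, None]
rhs = (marg([1, 2]) / p2)[:, :, None] * (marg([2, 3]) / p2[:, None])[None, :, :]
print("Eq 3.78 holds:", np.allclose(lhs, rhs))

# Equations 3.81 / 3.82: both chain factorisations reproduce the joint.
p12, p23, p34 = marg([1, 2]), marg([2, 3]), marg([3, 4])
p1, p3, p4 = marg([1]), marg([3]), marg([4])
f381 = np.einsum('ab,bc,cd,d->abcd', p12 / p2[None, :], p23 / p3[None, :],
                 p34 / p4[None, :], p4)
f382 = np.einsum('a,ab,bc,cd->abcd', p1, p12 / p1[:, None], p23 / p2[:, None],
                 p34 / p3[:, None])
print("Eq 3.81 holds:", np.allclose(f381, joint),
      "| Eq 3.82 holds:", np.allclose(f382, joint))
```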

In a functional form the distribution of interest is the conditional distribution of the pa-
rameters given the data, which for a discrete parameter space can be written as

p(α|Zk) = p(α{j∈[1,N]} |Zk)
        = p(αi |α{j≠i}, Zk) p(α{j≠i} |Zk).                          (3.83)

Now, the influence of particular parameters can be approximated through the use of a
neighbourhood system N in which each parameter has the neighbourhood Ni with i ∈ Ni
and it is assumed that the conditional probability of a parameter depends only on the
parameters within its neighbourhood.29 Explicitly this implies the approximation

p(αi |α{j≠i}) ≈ p(αi |α{j∈Ni})                                      (3.84)

29. The neighbourhoods will differ for directed and undirected dependencies. In an undirected graph each neighbourhood will include the immediate neighbours of the given node, whereas in a directed graph each neighbourhood includes only those parameters on which that node depends directly.

[Figure: six parameters α1 . . . α6 connected by a series of undirected edges, including a
direct edge between α1 and α3 .]

Figure 3.11: A graphical model showing a non-trivial series of statistical dependencies

and Equation 3.83 can be rewritten as

p(α|Zk) ≈ p(αi |α{j∈Ni}, Zk) p(α{j≠i} |Zk)                          (3.85)

which implies that if the neighbourhood system N is an appropriate approximation of


the interdependencies between the parameters, then the effects of all other parameters on
parameter αi can be summarised by the effects of its neighbours α{j∈Ni } . The presence of
the second term P(α{j≠i} |Zk), however, implies that the effects of all other variables are
still present, but are transmitted to the particular variable through these local interactions.

This can be extended to arbitrarily complicated graphs such as that shown in Figure 3.11
in which six random variables are related to one another through a series of interactions
shown as an undirected graph. In this particular example it is specifically represented that
α1 and α3 are related, not only through paths involving intermediate parameters, but also
directly.

In the application of representation of environmental data given computational and storage


constraints, these graphical models provide a uniquely powerful methodology for making
explicit assumptions about the dependencies and relationships present in the parameters
for any particular representational scheme. Directed graphs can be used to directly simplify
the expression of the joint distribution, for example the joint distribution of all states Xk
given all observations Zk and an initial state distribution x0 can be reached from the graph
of Figure 3.9 as
P(Xk |Zk, x0) = P(x0) ∏_{t=1}^{k} P(xt |xt−1, zt).                  (3.86)

Similarly, an undirected graphical model can be used to simplify the joint distribution, in

which case the joint can be written as

P(α|Zk) = (1/Z) ∏_{m=1}^{M} Ψm(αm)                                  (3.87)

Z = Σ_α ∏_{m=1}^{M} Ψm(αm)                                          (3.88)

where the distribution is approximated by M factors, each of which is a function Ψm (αm )


of a subset of the states (Mackay 2004, p334-5)30 . Each of these functions is considered
a ‘factor’ in the resulting joint distribution and the factors can be readily obtained from
the neighbourhood structure imposed on the system. Explicitly, each factor represents a
‘potential’ over a fully-connected subset of the variables; this is alternatively known as a
clique potential. The directed case can be written as an undirected graph by using the
‘moral’ graph equivalent as noted in Jordan (2004, p5) and Equation 3.87 can be used in
both cases.

From the system design perspective, the importance of these methods should not be un-
derestimated and there are many areas of particular interest to functional representation
management using graphical models. Firstly, the factors Ψm of the MRF joint distribution
of Equation 3.87 are normally assumed to be fixed, known functions. These can be used to
impose known constraints or effects into a system. For example, in Rachlin et al. (2005)
an MRF is constructed as a spatially organised rectilinear grid and each node is a trinary
random variable31 . The cliques can only correspond to pairwise interactions, and the paper
uses a fixed potential to describe these interactions. In particular, the potential Ψij repre-
sents an imposed constraint describing the likelihood of an adjacent value j, given the value
of the node i; their fixed potential is
          ⎛ 0.6     0.3    0.1   ⎞
    Ψij = ⎜ 0.475   0.05   0.475 ⎟ .                                (3.89)
          ⎝ 0.1     0.3    0.6   ⎠

30. An alternate form of the factorisation can be written as P(α|Zk) = (1/Z) exp[−Σm Ψm(xm)] where the
product is replaced by the summation (Boykov et al. 1998, p1). This approach is closer to the potential fields
theories of statistical physics on which the MRF theory is based, but the form used in the text supports
efficient message-passing algorithms for computing the joint distribution.
31. The parameterisation of the functional form follows a Dirac delta function defined on a discrete domain
Y corresponding to the rectilinear grid.

Examination of this potential suggests that both smoothing and sharpening effects are
present: for i = 1, 3 the potential favours adjacent values which are the same, whereas for
i = 2 the potential strongly suppresses neighbours of the same value.
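The following sketch applies the fixed potential of Equation 3.89 to a deliberately small rectilinear grid so that the factored joint of Equations 3.87 and 3.88 can be evaluated by exhaustive enumeration; the grid size and the (arbitrary) orientation in which the asymmetric potential is applied along each edge are illustrative choices, and larger fields would of course require the message-passing machinery referred to above.

```python
import numpy as np
from itertools import product

# Sketch of the pairwise MRF of Equations 3.87-3.88 using the fixed potential
# of Equation 3.89 on a tiny 2x3 grid of trinary nodes, small enough that the
# partition function Z can be computed by brute-force enumeration.
psi = np.array([[0.6,   0.3,  0.1],
                [0.475, 0.05, 0.475],
                [0.1,   0.3,  0.6]])           # Psi_ij from Equation 3.89

rows, cols = 2, 3
# Edges of the rectilinear grid; the asymmetric Psi is applied along each edge
# in a fixed (row-major) orientation chosen here for illustration.
edges = [((r, c), (r, c + 1)) for r in range(rows) for c in range(cols - 1)] + \
        [((r, c), (r + 1, c)) for r in range(rows - 1) for c in range(cols)]

def unnormalised(config):
    """Product of clique (here pairwise) potentials for one grid labelling."""
    labels = np.array(config).reshape(rows, cols)
    p = 1.0
    for (a, b) in edges:
        p *= psi[labels[a], labels[b]]
    return p

# Equation 3.88: Z is the sum of the factor product over all configurations.
configs = list(product(range(3), repeat=rows * cols))
weights = np.array([unnormalised(c) for c in configs])
Z = weights.sum()
probs = weights / Z                            # Equation 3.87

# Marginal distribution of one node under the resulting joint.
top_centre = np.zeros(3)
for c, p in zip(configs, probs):
    top_centre[c[1]] += p
print("Z =", Z)
print("marginal of node (0,1):", top_centre)
```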

In Diebel & Thrun (2005), however, the potentials are formed according to a more sophisti-
cated model in which the parameters of interest are the unobserved ranges to the elements
of a rectilinear grid.32 The difference here is that the MRF includes two sources of ob-
servation regarding the resulting distribution: the first of these is laser range information
available over a more coarse scale than the field grid; and the second an indirect observation
of a visual discontinuity field. The resulting field potential incorporates a potential corre-
sponding to the laser range measurements which acts on only those nodes for which laser
information is available. This potential draws the values of the range distribution towards
the laser measurements at the ‘observation’ nodes. Meanwhile, however, the image discon-
tinuity field is used to adjust the ‘smoothing’ potential controlling how adjacent nodes of
the range field are related.

The remarkable insight in their approach is that additional information should be avail-
able to estimate and adjust the contributions of the graph factors across the field. It is
contended, furthermore, that this approach can be utilised to much greater effect, in that,
instead of applying empirical models to manage such effects, many interesting phenomena
can be obtained through the use of the physically-derived sensor models to control these
effects. This is justified by the assertion of Section 3.3.1 that the only source of model-free
interactions between nodes in an MRF parameterisation is a result of sensor models in which
multiple parameters are ‘observed’ together. As an example of this effect, consider that in
an implementation of the SLAM algorithm, the map parameters (formally the parame-
ters describing a functional form corresponding to a high-dimensional Dirac delta function)
describing multiple features become correlated through the fact that each observation is
correlated to the vehicle pose states (Bailey 2002, p21-2). It is, in principle, possible to
utilise the sensor model to update not only the individual variables, but also to increasingly
strengthen the factors as each is observed. Such an approach would act to increase the
effective distance of the influence of an observation as the factors become more significant.

It is well beyond the scope of this thesis to consider the practical realities of utilising such
32. Presumably corresponding to a Dirac basis defined over a rectilinear domain Y defined over the polar
space (r, θ).

graphical models to represent reduced interdependencies between the parameters of a par-


ticular functional representation and many detailed discussions and applications can be
found in the literature including in Jordan (2004), Mackay (1998) and Murphy (2001). It
is sufficient to note, however, that these methods allow approximate33 solutions to evalu-
ating the joint distribution over large systems utilising the effects of ‘parametrically-local’
interactions to propagate effects. These techniques should be viewed as highly promising
techniques for incorporating dependency effects into models without requiring the explicit
maintenance of the entire joint distribution.

3.5.3 Parameter Space Partitioning

The graphical models of Section 3.5.2 represent a sophisticated methodology for examining
the dependency structure of the parameters of a functional representation. This approach
suggests four particular cases of how different parameters can be decoupled to reduce the
computational complexity of maintaining approximations to the true joint distribution. In
each case the distribution of interest is P(α|Zk) where a bijective transformation Tbasis⁻¹ :
x(y) → α(w) maps the vector-valued function x, defined over the situational domain y ∈ Y ,
to the parameters α, defined over the domain w ∈ W .34

Decoupling in the Domain

The simplest methodology for reducing the complexity of the resulting distribution is to
divide the domain Y into separate regions and consider the parameterisations for each of the
regions separately. This represents a special case where the cliques of the joint distribution
are assumed to be independent of one another, that is

P(α|Zk) = P(αi |α{j∈Ni}, Zk) P(α{j≠i} |Zk)
        = ∏m P(αm |Zk)                                              (3.90)

where αm represents all the parameters from a particular fully-connected clique. That
is, each factor of Equation 3.87 is independent of all other factors and represents a fully
connected clique and can be managed separately. The logical extreme of this approach
occurs when every parameter is assumed independent and is normally associated only with
33. Exact solutions are achievable when the graphical models contain no cycles.
34. See Section 3.3.2 for details.

functional forms using the Dirac basis. The most commonly utilised version of this approach
is the Occupancy grid as outlined in Elfes (1995) in which (for a non-dynamic state) the
observation update is approximated by

P(αk |Zk) = P(zk |αk−1) P(αk−1 |Zk−1) / P(zk |Zk−1)
    ⇒ P(αi,k |Zk) = P(zk |αi,k−1) P(αi,k−1 |Zk−1) / P(zk |Zk−1)      (3.91)

where P (αi,k |Zk ) represents the distribution on the value of the ith parameter after k ob-
servations. A discussion of the important distinction between a physical sensor model
describing the operation of a device such as a sonar range finder, and the sensor model for
situations such as this can be found in Elfes (1995, p102).
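A minimal sketch of this decoupled update is given below for a one-dimensional grid; the particular per-cell likelihood values standing in for P(zk |αi) are illustrative assumptions and are not taken from Elfes (1995).

```python
import numpy as np

# Minimal sketch of the decoupled per-parameter update of Equation 3.91 for a
# one-dimensional occupancy grid; the per-cell sensor model values are
# illustrative assumptions.
cells = 10
p_occ = np.full(cells, 0.5)                      # P(alpha_i) prior: unknown occupancy

def cell_likelihoods(z_range, resolution=1.0):
    """P(z_k | alpha_i) for each cell: cells before the return are likely free,
    the returned cell likely occupied, cells beyond are uninformative."""
    hit = int(z_range / resolution)
    like_occ = np.full(cells, 0.5)
    like_free = np.full(cells, 0.5)
    like_occ[:hit], like_free[:hit] = 0.3, 0.7    # observed as free space
    if hit < cells:
        like_occ[hit], like_free[hit] = 0.7, 0.3  # observed as occupied
    return like_occ, like_free

def update(p_occ, z_range):
    """Independent Bayes update for every cell (Equation 3.91)."""
    like_occ, like_free = cell_likelihoods(z_range)
    num = like_occ * p_occ
    den = num + like_free * (1.0 - p_occ)         # per-cell normaliser P(z_k | Z^{k-1})
    return num / den

for z in [6.2, 6.0, 6.1]:                         # repeated range returns near 6 m
    p_occ = update(p_occ, z)

print(np.round(p_occ, 2))
```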

Partial Spatial Decoupling


While these independent sub-domain models represent an efficient mechanism for reducing
the computational and storage complexity by utilising independent subsets of the domain
Yi ⊂ Y , no dependencies are transferred across the boundaries of those regions. This
means that there will, in general, be discontinuities at the boundaries of the Yi , even if the
approximations used to represent the property function guarantee continuity or smoothness
in the reconstruction. These discontinuities will be increasingly pronounced as the separated
regions become larger, especially when sensory observations interact with regions which are
smaller than the Yi . This effect could be reduced using several techniques, including: adding
a few global parameters to provide a smaller degree of dependency across the individual
regions, for example maintaining a distribution describing the mean value of the function
over each region; or constructing overlapping regions and utilising both distributions when
interpreting the resulting properties in the overlapping regions (Rotariu & Vullings 2005).
With an associated increase in complexity, the Markov random fields approaches of the
previous section can be utilised directly and amount to a more rigourous approximation of
the underlying distribution.

Property Decoupling
The previous approaches to approximating the joint distribution involve interactions based
on the domain of the function (or parameters). Instead, consider that, in general, the
system will involve multiple sensory cues, each of which is to be estimated over the same
domain Y , for example the temperature and black-body radiation wavelength over the

surface of a heated body, or the returned radar intensity and visual texture of a vegetated
terrain. In general there will be non-trivial dependencies between the different sensory cues
at any given point in the domain. Interestingly, it is these relationships which are of critical
importance to developing meaningful feature-promotion and detection techniques as will be
seen in Chapter 4.

As noted earlier, dependencies between parameters of the representation can be rigourously


introduced through the utilisation of an appropriate sensory model describing the interaction
of the physics of the technique and the properties defining the state being estimated. In
fact, without applying heuristic models to reflect the dependencies between different sensory
cues, it is necessary to have access to both a sensor making observations of multiple cues and
additionally simultaneous measurements of those properties. In many situations, however,
it is sufficient to manipulate and estimate each property separately for storage purposes and
to apply model-based approaches as an interpretative operation. For example, the storage of
occupancy information from a 2D laser scan and a temperature profile for a system operating
in an industrial environment represent cases where it is unlikely that the properties will affect
one another greatly, and for which the influence between different regions of the domain will
be vastly different for the two properties. Constructing geometric estimates of the shape
and determining the mean temperature in each room, however, represent reasoning tasks
in that they introduce additional knowledge into the system to facilitate interpretation of
the data with respect to a specific task.

Once more, the simplest approach involves treating the individual cues as independent and
orthogonal; that is, if the vector valued function can be represented as

x(y) = [x1 (y), x2 (y), . . . , xn (y)] (3.92)

then the individual functions can be estimated separately, and

P[α|Zk] ≈ ∏i P[αi]                                                  (3.93)

where the function xi (y) is parameterised by αi and each function is independent of all
others. The effect of this approximation may be minimal in that the reasoning stages will
be utilised to re-discover these unrepresented interactions. However, in certain situations
this assumption implies that the interactions between cues will be ignored by the resulting

system. For example, observation of the temperature of the heated body should strongly
affect the estimated black-body radiation frequency and vice versa and while a model could
be used to combine the measurements in this case, there is no reason why the different
sensory cues must be related in a way which is constant over the domain.

Combined Spatial-Property Decoupling

In addition to different sensory cues having non-trivial dependencies between their values
at some location in the domain Y , it is possible that the relationships themselves could vary
over the domain. For example, the relationship between radar intensity profiles and visual
colour will be highly dependent on the particular operating environment: open farmland
and dense forest represent two particular cases where the properties may be strongly related,
though in different ways for the different scenarios. For systems which operate in expansive
environments these changes in dependency may be significant. This leads to the final
parameter decoupling considered explicitly: treating the relationship between sensory cues
as discoverable, but fixed over some region of the domain. For example, observations of
radar intensity and colour may be used to maintain an estimate of the dependency between
those properties in a fixed subset of the operating domain, with these dependencies treated
as independent across different sub-domains.

3.6 Performance Measurement and Practical Considerations

The earlier parts of this chapter have examined the structure and mathematical character-
istics of the sensing and representation processes. In particular, they have focussed on the
identification of the roles of the sensor model and the representational form in the ideal
model. It was noted several times that the representational approach of treating the quan-
tities to be estimated as functions defined over some space of interest results in an ideal case
which is fundamentally intractable. In order to address the tractability and viability of ap-
proximations and sub-optimal implementations, there are several important characteristics
which are significant.

As the selection of the representational form has the greatest impact on the construction and
utilisation of sensory data, a careful analysis of the effect of approximations and assumptions
is critical to the ability to engineer a perception system. Taking the notation of Section

3.3.1 and rewriting the functions to be represented using Equations 3.17 and 3.18 obtains
the following equations for discrete and continuous domains respectively:


xα(yj) = Σi αi ei(yj)                                               (3.71)

xα(y) = ∫ α(w) e(y, w) dw                                           (3.18)

where α represents the function or vector of parameters35 ; the set {e} represents the basis
functions defining the function space; and w or i correspond to the coordinate system for
the function domain W . Recall that FD represents the set of functions representable using
the Dirac basis and FY a subset formed by limiting the elements of the basis set.

It should be clear that the selection of the underlying basis functions for FY will strongly
affect the properties of the resulting representation, in particular what can be represented
and what effects are necessarily lost. In developing any practical system the designer must
evaluate the effects of this decision on the viability of the selected representation to the
particular problem at hand36 . There are three major considerations which arise in this
context:

• the significance of approximation errors and artifacts;

• how concise the representation is with regard to the expected functional forms and
the sensitivity of the parameters to changes in the function;

• the computational complexity involved in manipulation and in transforming the rep-


resentation between the Dirac basis and the function-space (encoding and decoding,
respectively)

3.6.1 Approximation Effects and Artifacts

As discussed in Section 3.5.1, the selection of the set of basis functions {ei (y)} determines
the subset of functions which is representable in that form, FY ⊂ FD . In fact, approximation
35. Formally, the α represent the coordinates of the function x in the space spanned by the set {e} of basis
vectors.
36. Care must be taken here to avoid confusion of the ‘ideal’ representational form (the infinite-dimensional
function approximation) and the ‘ideal’ data management approach in which the sensory data is maintained
in toto by the representation.

effects and artifacts can be identified formally as the result of attempting to represent a
function from the complement of the set FY through the use of an ‘appropriate’ member of
that set.37 That is, if f′(y) ∈ FD represents the ‘true’ function, then

f(y) ≈ f′(y)    where f′(y) ∈ [FY]c and f(y) ∈ FY.                  (3.94)

One of the most important features of the formulations of Equations 3.17 and 3.71 is that
the resulting function is linear in the function basis, implying that all linear characteristics
of the functions e(.) are imparted to the represented function. The most notable of these are
the differentiability and continuity of the function and its derivatives (when they exist). In
this way a function which is known to be C 0 continuous can be represented using a basis set
which enforces this result; conversely, a function involving a discontinuity will necessarily
result in artifacts and errors if the same basis is used. For example, if the basis set is selected
to correspond to the Fourier basis (which is C ∞ -continuous) then any smooth function can
be represented arbitrarily as the number of components N → ∞. Clearly, functions which
do not conform to this property, such as any involving a discontinuity, can not belong to
FY and must be approximated by one which does. Let f′(y) have a formal discontinuity
at y = y0, that is, lim_{y→y0−} f′(y) ≠ lim_{y→y0+} f′(y), where the limits approach y0 from the
negative and positive sides respectively. The Fourier series approximates the function with
f(y) where

lim_{y→y0−} f(y) = lim_{y→y0+} f(y) = ½ [ lim_{y→y0−} f′(y) + lim_{y→y0+} f′(y) ]       (3.95)

and since the discontinuity cannot be represented in the reduced function space any approx-
imation will demonstrate the Gibbs phenomenon at the point of discontinuity. As noted in
Kreyszig (1993, pp602-3) this is not simply a result of the truncation of the frequencies in
the series, and Gibbs’ original work shows that even with N → ∞ the oscillations remain.

In addition to the effects due to the functional forms of the basis functions, most practical
systems require that only a limited number of the parameters are maintained at any one
time, resulting in an even greater impact on the representable functions. For example, a
Dirac basis can be truncated by maintaining the values of the function on some regularly
spaced grid ej (y) = δD [y − y0 − jΔy ] for an integer j. Such a truncation can not, however,
37. It is beyond the scope of this work to consider the nature of the determination of the approximation
process, though it is clear that an optimisation of one of the measures from Chapter 2 would provide the
essential mechanism.

represent any characteristic of the function at any location between the elements. While
these effects are often minimal, there is a particular class of approximation for which the
effects can be dramatic. In the case where a strictly positive function, such as a probability
distribution, is approximated by some basis for which there is no restriction preventing the
resulting representation from being approximated by a function which is not non-negative,
then obvious problems can arise. One such case relates to the utilisation of Gaussian
mixtures for the approximation of probability densities: in most cases the weights of the
various mixture components are generally restricted to be positive so that the resulting sum
is strictly non-negative. As an alternative, a wavelet decomposition does not guarantee that
the resulting function is strictly positive and the developer must be careful that such effects
do not adversely affect the viability of the resulting distributions.

While the effects of the selection of the representation on the functions which can be rep-
resented is clearly of great interest to the engineer, of greater interest is the possibility of
measuring the effect of the approximations and artifacts on the accuracy of the resulting
function. The quantities of Chapter 2 provide a mechanism for measuring these effects
directly. Obviously, since the relevance of a particular approximation technique depends
directly on the nature of the functional forms being approximated, it is necessary to have
access to the ‘true’ function, or preferably an ensemble of such functions, on which to mea-
sure the approximative effects. Furthermore, the nature of the functions under investigation
strongly affects the selection of which measure should be used. Let one of the ‘true’ func-
tions be denoted xT (y) and a given approximation using the function space as xY (y) then
the deviation can be written as

Dμ ≜ Δμ[xT(y), xY(y)].    (3.96)

The most commonly used measure associated with the errors due to an approximation is
the Euclidian distance, which for a continuous domain is given by

D²Euclidian = ∫ [xT(y) − xY(y)]² dy.    (3.97)

As discussed in Appendix D, the Euclidian distance is a particular case of the Ln distance,

DLn = [ ∫ |xT(y) − xY(y)|^n dy ]^(1/n).    (3.98)

It is also noted in the appendix that these two measures implicitly align the two functions,
that is, they assume that the correspondence between the values of y is known for both
functions. Most significantly, however, they are purely structural measures, comparing the
values of the two functions at every given location y0 , but never comparing the values at
different y. These measures are useful when the relationship between the two quantities
being represented is an identity, as effects such as a constant offset will strongly affect
the result. An alternative measure which avoids these issues is the mutual information (or
other distance measures derived from it) as the measure is invariant to an arbitrary bijective
transformation of the value space of either of the functions. This results from the fact that
the mutual information compares the relative joint statistics of the values observed, rather
than the values themselves. Equation 2.112 gives the MI as
 
IXTXY (XT ; XY ) = ExTxY [ logb ( P[xT(y), xY(y)] / ( P[xT(y)] P[xY(y)] ) ) ]    (2.112)
where the expectation is taken over the values of the two functions (for a given y) and is
formulated as required for the continuous and discrete cases. While this measure was shown
to be a similarity measure, rather than a distance, it provides a useful measure when there
may be a transformation between the spaces. The most obvious example of this occurs
when there is a change of datum, or a constant offset in the representation. Note, however,
that the measure still implicitly aligns the functions xT and xY through their common
dependence on the domain Y , though this dependence is hidden in the construction of the
distributions in Equation 2.112 for which the pairs {xT (y0 ), xY (y0 )} are taken together for
each y0 ∈ Y .

Finally, in the discrete case it is possible to utilise the statistical independence distance of
Section 2.8.2 to obtain a true distance measure comparing the statistical independence of
the measures according to Equation 2.158,

DSI(XT , XY ) ≜ HP[XT |XY ] + HP[XY |XT ]
              = ExTxY [ logb ( 1 / ( P[xT(y)|xY(y)] P[xY(y)|xT(y)] ) ) ].    (3.99)

Ultimately, however, the selection of a particular measure depends on the application at
hand; measures applied to the errors associated with an approximative representation are
best suited to determining the average expected error calculated over an ensemble of typical
functions.
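
As a simple illustration of how these deviation measures might be evaluated for sampled functions, the sketch below (an added example with arbitrary test functions, not drawn from the thesis case studies) computes the Ln distances of Equations 3.97-3.98 on a grid and a histogram-based estimate of the mutual information of Equation 2.112 between a 'true' function and a noisy approximation of it.

    import numpy as np

    def l_n_distance(x_true, x_approx, dy, n=2):
        """L_n distance of Equation 3.98 on a regular grid (n = 2 gives Equation 3.97)."""
        return (np.sum(np.abs(x_true - x_approx) ** n) * dy) ** (1.0 / n)

    def mutual_information(x_true, x_approx, bins=32):
        """Histogram estimate of the MI of Equation 2.112 between paired samples."""
        joint, _, _ = np.histogram2d(x_true, x_approx, bins=bins)
        p_joint = joint / joint.sum()
        p_t = p_joint.sum(axis=1, keepdims=True)
        p_a = p_joint.sum(axis=0, keepdims=True)
        mask = p_joint > 0
        return np.sum(p_joint[mask] * np.log2(p_joint[mask] / (p_t @ p_a)[mask]))

    y = np.linspace(0.0, 1.0, 1000)
    dy = y[1] - y[0]
    x_true = np.sin(2 * np.pi * y)
    x_approx = x_true + 0.05 * np.random.default_rng(0).standard_normal(y.size)

    print("L2 deviation :", l_n_distance(x_true, x_approx, dy, n=2))
    print("L4 deviation :", l_n_distance(x_true, x_approx, dy, n=4))
    print("MI (bits)    :", mutual_information(x_true, x_approx))
    # note: the MI estimate is (approximately) unchanged by a constant offset of
    # x_approx, whereas the L_n deviations are not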

3.6.2 Representational Conciseness and Sensitivity

While approximations and artifacts affect the accuracy and flexibility of the resulting rep-
resentation, a significant computational cost can be associated with how concisely the rep-
resentation captures a typical function. Specifically, the smaller the number of parameters
required to encode a given function the lower the computational costs associated with the
storage, transmission and reconstruction of the given function. In considering these effects,
there are two characteristics of interest: how well a typical function is encoded and how
many elements of the representation change for a small change in that function.

The first of these is significant as the complexity associated with maintaining and using
a representation increases dramatically with the size of the representation. For example,
it is well known that the SLAM problem scales poorly as a result of a requirement to
perform an inversion of the N × N covariance matrix at each iteration, an operation with
complexity O(N^3) for most implementations (Press et al. 1992, §2.11). It should be clear
that a mismatch between the functions being represented and the representational form will
give rise to poor performance in this regard.

The second of these relates to the effect which minor38 changes to the represented function
have on the values of the parameters representing it. Once again, the general cause for
computational problems relates to a mismatch between the representation selected and
the functional forms being represented. In particular, given that the model incorporates
observational updates as well as sequential propagation of the representation, small changes
to the function should have a small impact on the parameters required to represent it.

Example 3.6.1 – Parameter sensitivity in approximations


A striking example of the sensitivity of parameters to an apparently subtle change in the
underlying function can be generated by considering the two rectangular pulses shown in
Figure 3.12(a). While the shifting of the left hand edge of the pulse appears insignificant,
the real part of the Fourier transform of these functions is shown in Figure 3.12(b), and a
similar result holds for the imaginary part. It is clear that a small change in the function
has generated a substantial change in the parameter values defining that function in the
Fourier basis.
38
With respect to the typical functions expected in the application.

(a) Input signals    (b) Fourier transform (real part)

Figure 3.12: Sensitivity of the Fourier transform to small time offsets
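
A sketch in the spirit of Example 3.6.1 (the pulse widths and the four-sample shift are arbitrary assumptions): the same small edit to the function touches only a handful of parameters in a Dirac (point-sample) basis but perturbs almost every coefficient in the Fourier basis.

    import numpy as np

    n = 128
    pulse_a = np.zeros(n)
    pulse_b = np.zeros(n)
    pulse_a[40:80] = 1.0            # rectangular pulse
    pulse_b[44:80] = 1.0            # same pulse with its left edge shifted by 4 samples

    # Dirac (point-sample) basis: only the shifted samples change
    changed_dirac = np.count_nonzero(pulse_a != pulse_b)

    # Fourier basis: almost every coefficient is perturbed by the small shift
    coeff_a = np.fft.rfft(pulse_a)
    coeff_b = np.fft.rfft(pulse_b)
    rel = np.abs(coeff_b - coeff_a) / np.abs(coeff_a).max()
    changed_fourier = np.count_nonzero(rel > 0.01)

    print(f"parameters changed in the Dirac basis  : {changed_dirac} of {n}")
    print(f"parameters changed in the Fourier basis: {changed_fourier} of {coeff_a.size}")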

3.6.3 Manipulation and Decomposition Complexity

The computational costs associated with the utilisation of any particular representational
scheme can be identified as those costs associated with the ongoing manipulation of the
representation itself and those related to encoding or decoding the contents of that repre-
sentation into a form which allows interpretation. As noted in Section 3.4.3, the sequential
incorporation of new data into a representation requires that the distributions describing the
functional representation be updated at least as often as new data is available. Fortunately,
once a sensor model is known, the incorporation of the data reduces to the application of
Bayes’ rule to the stored probability distributions according to Equation 3.70,

   
P[ xk(y) | Zk, Uk, x0(y) ] ∝ P[ z.k(v.) | xk(y) ] P[ xk(y) | Zk−1, Uk, x0(y) ],    (3.70)

where the result is normalised to obtain the correct posterior distribution. Now, the func-
tional representations are a general method for writing any particular function and it was
seen in Section 3.3.3 that a probability itself can be encoded as a deterministic functional
form. If the representation is selected to correspond to the state function (rather than the
distribution describing it) then the computational cost of the operation is determined by
the internal representation of the distribution over the parameters, P [α(w)].

Example 3.6.2 – Computational complexity of a Gaussian representation


Consider the temperature distribution in time of a metal rod and let the temperature be
represented according to an exponential mixture as


T(t) = Σi αi1 exp(αi2 t)    (3.100)

where t ∈ R is time, T is the temperature, and αi1 and αi2 are the mixture coefficient and
decay constant for the ith component. Since the temperature of the rod is known to follow
Newton's law of cooling, dT/dt ∝ [T(t) − T0], where T0 is the ambi-
ent temperature, then it is easily shown that the solutions of the differential equation can
be well-approximated by the exponential family of Equation 3.100, and if the system is not
heated externally, can be described by two terms, one of which has α22 = 0. This represen-
tation is, therefore, well-matched to the problem at hand. It is, however, deterministic and
the distributions of interest to this thesis relate to the knowledge of the parameters of the
distribution, P [α11 , α12 , α21 ].

It is the computational cost associated with maintaining this distribution which is relevant
to the complexity of the selected representation.39 Say the parameters are represented by a
Gaussian distribution described by the mean vector μ = [α̂11, α̂12, α̂21] and the covariance
matrix,

        ⎡ σ²_{11,11}   σ²_{11,12}   σ²_{11,21} ⎤
    S = ⎢ σ²_{12,11}   σ²_{12,12}   σ²_{12,21} ⎥ ,
        ⎣ σ²_{21,11}   σ²_{21,12}   σ²_{21,21} ⎦

then the computational costs associated with the sequential updates are dominated by the
cost of inverting S, which is O(N^3) with N = 3.⁴⁰ In a more general system involving the
use of a Gaussian representation with an arbitrary number N of components the limiting
cost is O(N^3).
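
To illustrate where the O(N^3) term arises, the following sketch (illustrative only; the observation model H, noise R and initial values are assumptions, not taken from the thesis) performs one sequential Gaussian update over the three parameters in the information form, making the inversion of the parameter covariance explicit.

    import numpy as np

    # Gaussian belief over the parameters [alpha_11, alpha_12, alpha_21] (values assumed)
    mu = np.array([20.0, -0.1, 5.0])
    S = np.diag([4.0, 0.01, 1.0])

    # assumed linearised observation model: a sensor reads alpha_11 + alpha_21 plus noise
    H = np.array([[1.0, 0.0, 1.0]])
    R = np.array([[0.5]])
    z = np.array([26.2])

    # sequential Bayes update in the information (inverse-covariance) form, which makes
    # the O(N^3) inversion of the N x N parameter covariance explicit
    Y = np.linalg.inv(S)                 # O(N^3) for an N-parameter representation
    y_info = Y @ mu
    Y = Y + H.T @ np.linalg.inv(R) @ H   # the information update itself is additive
    y_info = y_info + H.T @ np.linalg.inv(R) @ z
    S = np.linalg.inv(Y)                 # O(N^3) again to recover the covariance form
    mu = S @ y_info

    print("updated mean:", mu)
    print("updated covariance:\n", S)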

An interesting effect arises, however, when the function being estimated is the probability
distribution. That is, rather than define the functional representation so that the proper-
39
Though, obviously, the complexity of the representation is affected strongly by the conciseness and
relevance of the approximation to the functions themselves.
40
This assumes a state-space model; the update would be substantially faster in the information form,
though the computational cost is increased in the prediction stages.

ties are represented, write the functional decomposition of the distribution formed over it.
Formally, this can be written as

P[x(y)] = ∫ α(w) e[x(y), w] dw    (3.101)

where the basis functions are now defined as functions over the combined space of {X, Y }
and W . In this case it is unnecessary to attempt to encode the parameters in a proba-
bilistic fashion, as the distribution is maintained directly. This approach corresponds to
many different ‘non-parametric’ representational approaches, including sums-of-Gaussians
(Majumder 2001, §4.4), wavelet forms (Grover & Durrant-Whyte 2003) and other mixture
models. In such cases it is possible to directly determine the computational cost of manipu-
lations and transformations. For example in Grover & Durrant-Whyte (2003) a closed form
multiplication algorithm was presented, allowing efficient calculation of the Bayes’ update
step.
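
When the distribution is maintained directly, for instance on a regular grid corresponding to a truncated Dirac basis, the Bayes update of Equation 3.70 reduces to a pointwise multiplication followed by renormalisation. The sketch below is a minimal grid-based illustration of that step (grid size and likelihood shape are arbitrary assumptions; it is not the closed-form wavelet algorithm cited above).

    import numpy as np

    # prior P[x(y)] maintained directly on a regular grid (Dirac-basis approximation)
    n_cells = 200
    prior = np.full(n_cells, 1.0 / n_cells)          # uninformative prior

    # assumed sensor likelihood P[z | x(y)]: a range observation favouring cells near 120
    cells = np.arange(n_cells)
    likelihood = np.exp(-0.5 * ((cells - 120) / 5.0) ** 2)

    # Bayes update (Equation 3.70): multiply and renormalise, O(N) per observation
    posterior = likelihood * prior
    posterior /= posterior.sum()

    print("most probable cell:", posterior.argmax())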

Finally, the developer must be aware of the computational complexity associated with
the transformations between the functional and the value-space forms, such as encoding a
distribution with a wavelet as in the previous example, or obtaining the resulting function
given a Fourier transform of it.

The selection of the form by which the state function x(y) is represented has three
distinct implications for the developer:

Approximation Errors and Artifacts in which the selection of the basis generates
a function space FY ⊂ FD which limits which functions can be represented
exactly. The errors introduced by this approximation include effects such as
discretisation on a grid and the Gibbs phenomenon with a Fourier basis.
These effects can be quantitatively assessed using a spatially aligned deviation
measure such as that of Equation 3.96,

    Dμ ≜ Δμ[xT(y), xY(y)]    (3.96)

which captures the overall effect of errors at each given location y ∈ Y. The
most obvious candidates for Dμ are the Euclidian distance (§D.1.1), Ln distances
(§D.1.3), mutual information (§D.2.3) and the statistical independence distance for
discrete cases (§D.2.4).

Representational Conciseness and Sensitivity in which the selection of representation
affects the complexity of the resulting parameter functions and the sensitivity
of those functions to small changes in the function itself.

Manipulation and Decomposition Complexity in which the computational complexity
associated with the sequential updating using Equations 3.67 and 3.70 in
Section 3.4.3 and with converting the representation to and from the Dirac form using
Equations 3.24 and 3.25 in Section 3.3.2 are considered.

3.7 Conclusions

This chapter has examined the first phase of the new model in which sensory data is
combined to generate a summary of information (or the ‘state’) representation. These
data gathering operations represent the process by which information is introduced into the
system. Separate sensory streams were shown to correspond to parallel operations through
which different subsets of the state information are updated. This interpretation leads to the
concept of the sensory space in which the statistical and physical relationships between the
sensors can be directly represented.

[Figure 3.13 diagram: observations z1(v) and z2(v) from two sensors pass through their physical
models and observation likelihoods P[z.(v.)|x(y)] into Bayes updates against the common prior
P[x(y)]; the summary of information is stored as the functional form x(y) = ∫ α(w) e(y, w) dw with
P[x(y)] ≡ P[α(w)], and the associated performance measures (H(ZV) = I(ZV; XY) + H(ZV|XY),
approximation effects/artifacts, conciseness/sensitivity and manipulation complexity) are annotated
for each stage.]
Figure 3.13: The Sensing and Data-Gathering Process. The data from two (or more) sensory
sources are combined to form a sensor-centric representation. The quantities considered in
each part of the model are written in blue. The current estimate of the state P [x(y)]
(the prior) is combined with an appropriate sensor model P [z. (v. )|x(y)] to obtain the
anticipated statistics of the observation. These values are then compared to the actual
observation and the state is updated appropriately through a Bayes’ update. The sensor
space construction provides a tool to guide the design and selection of the form of the
state. Several approximations and their effects are also noted for the function-basis forms
proposed for storing and manipulating the data. The sections describing each component
of this model are shown in red in the figure.

The structure of the data-gathering parts of the model is summarised
in Figure 3.13, in which two sensors are shown contributing information to a common state
representation.

It was shown that general physical properties can be represented through the use of func-
tional forms. These methods represent an arbitrary function as a point in the abstract space
spanned by a set of functions, with the Dirac-basis corresponding to the ‘traditional’ inter-
pretation of the function x(y). A framework for capturing an arbitrary functional form was
presented and shown to lead to a natural interpretation of the estimation of the functional
form as the estimation of the location of a particular point in this abstract space. Forward
sensor models were demonstrated to provide a consistent method by which observational
data can be used to update the knowledge regarding the underlying state distributions.

The nature of sensor-centric representations suggests that the data maintained in the sys-
tem will correspond to a substantially higher level of complexity when compared to more
traditional approaches, such as parametric environment models. In fact, the ideal case
suggested by the model proposed in this work was demonstrated to be formally intractable;
however, these functional forms were shown to lead naturally to approximation schemes in
which assumptions and external knowledge can be used explicitly to simplify the structure
of any particular scenario. Furthermore, structuring the problem in this way leads to the
ability to directly assess the impact of those assumptions on the reliability and accuracy
of the resulting approximate distributions. The measure theoretic forms of Chapter 2 were
shown to provide convenient and viable quantifications of these effects. The most common
classes of approximation and the important characteristics to be considered when applying
these approximations to a real system are also shown in Figure 3.13.

Lastly, this Figure also clearly identifies the engineering decision making inherent in the
application of the model of this thesis to a practical situation. Most importantly, the
system developer must consider the nature of the sensors available and have some concept
of the task to be achieved. If it is possible to know what many of these tasks are, then the
representational form of the ‘state’ can be readily determined; if not, the model suggests that
the selection of the representation should be guided by supporting the widest gamut of likely
tasks within the computational and storage constraints of the system. An analogy exists
here with the primary (pre-cortex) processing operations of the human visual system. Once
the representation has been selected, there remain two major areas of consideration for the

engineer: the selection of approximation schemes to represent the quantities of interest; and
the assumptions regarding the sensory contributions to those quantities contained explicitly
in the sensor models. The approximation schemes depend strongly on the applications
at hand, for example, in the case studies shown in Sections 5.4–5.6 it is reasonable to
assume that the quantities can be approximated using spatially-independent point sampling
techniques.

With regard to the sensor models, it has been noted that the development of this part of
the system is crucial to a viable implementation. Assuming that the sensors are treated as
independent, the model represents the physical interaction between the environment and the
quantity of interest, such as the relationship between occupied space in the environment
and the resulting radar intensity returns as considered in Section 5.6. However, more
significantly, the sensor models encode the interactions between the sensors and the careful
selection of this interaction remains in the hands of the engineer. For example, the digital
terrain map and radar data of Section 5.4.2 are processed independently in that example,
but it would also be possible to directly combine these as it is possible to characterise the
two sensors’ responses to the same physical quantity: occupancy.

Therefore, this chapter has shown that the process of combining sensory information to
generate and maintain a viable representation of the physical quantities which gave rise
to those observations can be readily described by the data-gathering phase of the model
presented in this thesis. The model is primarily a guide to the development of concrete im-
plementations and does not, inherently, supply a clear sense of which approximations and
assumptions are necessary or appropriate. Instead, treating the state to be represented as a
functional form gives significant insight into the nature of the critical issues of representabil-
ity and the engineered management of the approximations and assumptions necessary for
the construction of practical autonomous systems.
Chapter 4

Reasoning with Sensory Data

4.1 Introduction

This chapter examines the model components associated with the reinterpretation of the
functional forms of Chapter 3, that is those which enable the reliable achievement of
application-specific goals. Critically, these operations can be identified as the sole point
in the model at which task-specific or external knowledge is combined with the stored sen-
sory data. It will be shown that this process involves the utilisation of this information to
guide the reinterpretation and transformation of the data contained in the sensor-centric
function described by P [x(y)]. In this way the data interpretation stages of the model of
Section 1.3 are formalised as an explicit transformation of the data where the properties of
the transformation are determined by the available external knowledge.

The probabilistic functional representations of Section 3.3 are assumed to represent the
complete set of sensory data available to the system and will be termed the ‘state’ in this
chapter. Since the utilisation of the information contained in the state must be intimately
linked to the nature and requirements of the task at hand and because the data itself
contains only task-agnostic information, the transformation must necessarily utilise external
information to control which aspects of the data are enhanced and which are suppressed.
In this sense the external knowledge is not actually introduced into the data, but rather
should be recognised as controlling its re-interpretation. As no additional information is
introduced into the system, the transformation can be, at best, lossless and in most cases
will correspond to a highly lossy process. In fact, a common assumption underlying many

‘machine learning’ algorithms is that, of the richly descriptive data available to a system,
only a small subset is relevant to any particular task. The essential goal of the transformative
process is the enhancement of the contrast, interpretability or discriminative properties of
the data with respect to a certain task.

Now, the functional representations are probabilistic, admitting the deterministic form as a
special case, implying that any formulation of the process must support the incorporation
of statistical, in addition to geometric and functional, effects. It is demonstrated that
the process can be completely described by a conditional probability distribution and an
appropriate marginalisation. Treating the manipulation of data as a lossy transformation
leads to a natural interpretation of the problems of feature selection, extraction and decision
processing as a special case involving a discrete feature space. In fact, the interpretation
of data can be identified as consisting of four increasingly complex processes: thresholding,
where the data is divided into distinct categories based on amplitude alone; feature selection,
in which the space is partitioned according to some set of properties expressed in the feature
space; feature association, where a subset of the feature space is associated with a previously
known subset; and scene interpretation, where the content of the feature space is to be
considered and examined in toto.

The performance of any particular transformation process must necessarily be linked to


the achievement of the specific goals of the task of interest, and it will be shown that
there are many different methods by which this performance can be quantified. When the
statistics of the actual quantities being estimated are known1 then it is possible to define
the transformation performance in terms of error and information theoretic cost functions.
In most cases involving autonomous systems, however, the distribution of the input data
and its relationship to the task at hand is unknown and such cost functions are unavailable.
Instead, the transformations can be considered in terms of two properties: how much of
the data is discarded, which will have properties similar to a compression ratio; and how
individual feature descriptors are related to the individual state quantities. Finally, the
contribution of individual elements of the state to the outputs of transformations provides
a convenient measure of the effectiveness of that element on the overall system performance.
An important special case of this type relates to the contributions of individual sensors to
elements of the state or to particular feature spaces.
1
This is often the case in communications systems or in an environment which has been mapped previ-
ously.

Figure 4.1: The data abstraction part of the model showing how data from the state is
transformed using data transformations to obtain the feature descriptors defining an
output feature space and subsequently to identify the features within that space.

4.2 Reasoning as Lossy Reinterpretation

The data-interpretation stage of the model is shown in Figure 4.1 and can be considered
to represent the application of transformations to the data in the ‘summary of information’
part of the model. A comparison with Figure 3.1 in Section 3.2 reveals a symmetry in
the operations of these two parts of the model. In that section the sensory information
from multiple, possibly independent, sources is combined to form a single consistent and
probabilistic summary; conversely, the reasoning process involves the application of trans-
formations to that summary, generating multiple, possibly independent, ‘feature spaces’.

This symmetry can be strengthened through the recognition that the sensor models of
Section 3.4.2 actually correspond to the transformation of knowledge from the state into the
space of the appropriate sensor and not to the transformation of new information from the
sensor space into the state.2 In this regard the essential characteristics of the two operations
are identical: information from the state is transformed into a new representation, either
with respect to a sensor or a task. The critical difference between the two cases is in the
methodology by which the resulting data is utilised by the system.

An important implication arises in this context: both sensor models and task transforma-
tions require the interpretation of the data with respect to external knowledge not available
2
As noted in Section 3.4.2, the Bayesian update process corresponds to responding to the evidence
supporting the current state distribution in light of a new observation.

to the state. This knowledge includes effects such as: the geometric relationship of a sensor
to the state; the mapping of sensory quantities to target identifications; and the electri-
cal or processing ‘noise’ introduced in a real device. It is critical that this knowledge is
not regarded as being added to the data in the state; instead, the knowledge guides the
re-interpretation of the data, emphasising some characteristics and suppressing others.

While both sensing and reasoning share the same mathematical model to capture how the
probabilistic data is transformed, there are several important differences between the typical
characteristics of these methods. The most important of these is that while sensor models
seek to maintain the integrity of the entire data stream, approaching lossless methods in the
ideal case, a reasoning task transformation attempts to discard as much ‘irrelevant’ data
as possible. Furthermore, it is well known that most sensory systems are not developed
for application to a sole, specific task, but are instead intended to provide task-agnostic
measurements of the operational environment. This is particularly true for sources such
as audio, video or imaging radar systems where the potential applications are numerous.
Alternatively, the interpretation of the data relating to a specific task, such as locating a
beacon in the environment or segmenting an ‘obstacle’ from ‘normal’ terrain, represents a
case where much of the sensory data is irrelevant and should be discarded as ‘noise’.

In this context, the most important challenge to the designer is the identification of which
of the parts of the richly-descriptive functional data of Chapter 3 directly support the
achievement of a specific task. In the communications system literature, this is clearly
recognised as the parts of the data which describe a ‘signal’ present in a ‘noisy’ source.
The robotics community, however, generally regards the items of interest as ‘features’ to be
extracted from the data. In either case, the nature of what constitutes a signal or a feature
depends entirely on the task at hand. Critically, this aspect of the design involves the
application of knowledge external to the represented data, often in the form of a ‘model’.

4.2.1 The Feature Space

When considering the nature of the transformations required to manipulate and interpret
sensory data it is important to maintain a clear distinction between the aspects of the data
which make it possible to obtain useful information and those parts of the data which, when
considered in relation to those aspects, can be separated from one another. For example,

in a natural environment these could be the quantities which allow trees to be identified
(colour, texture, size, etc.) and individual trees in the scene, respectively. In this thesis, the
describing characteristics (such as colour, texture, traversability, temperature or abstract
quantities such as ‘likeness to a tree’) will be termed the ‘feature descriptors’ in that they
describe the gamut of possible features relevant to a particular task. ‘Features’, however,
are taken to refer explicitly to subsets of the data which have some identifying set of feature
descriptor values.3

Each and every quantity which can be derived from the stored information is clearly an
acceptable feature descriptor and any set of these can be interpreted as a viable set of basis
vectors for describing certain characteristics of the data. The space spanned by a set of these
descriptors can be considered to form a ‘feature space’ over some domain and in this space
‘features’ will correspond to subsets or regions within this domain. Let the data be available
according to the models of Section 3.3, that is, described by the distribution P [x(y)] with
y ∈ Y representing a location in the domain of the property function x ∈ X. Recalling that
the space X was interpreted in Section 3.2.2 as being the space spanned by the individual
state quantities4 , the analogous feature space can be defined as S := span{es1 , . . . , esn }
where es. represents a basis vector describing the ‘direction’ of an individual descriptor. In
this way, it is possible to define a feature-descriptor vector s ∈ S.5

Now, while some transformations will result in a single output value, such as the entropy
of the distribution, others require the introduction of a new coordinate system to index the
resulting elements. Consider, for example, that a Fourier transform of a one-dimensional
function transforms the values of the function from real amplitudes in physical space to
complex amplitudes in a frequency space. Denoting the space over which the output func-
tions are evaluated by Q := span {eq1 , . . . , eqm }, with the eq. representing the basis vectors
of the domain, allows the functional form for the output space to be constructed as

s(q) ∈ S with q ∈ Q. (4.1)

This quantity represents a single feature space to be derived from the state distribution
P [x(y)], such as the transformed data necessary for performing a particular task.
3
The identification of features will be examined in Section 4.4.3 as a special case involving a discrete
output space.
4
The space X represents a property space over the domain Y .
5
In general, as with the state x(y), this quantity s(q) will represent a tensor field over Q, where Q is
some domain over which the feature descriptors are to be evaluated.


Figure 4.2: The feature space defines a transformed space obtained from the original data
according to TS : x(y) → s(q). Within this space the domain Q represents the feature
domain, over which different features are identified according to distinctive values of the
feature descriptors, or values of the function s(q) ∈ S.

In general,
the quantities which can be derived from the state distribution can be considered as the set
{sl (ql )} where l indexes the individual feature spaces.
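
As a toy illustration of this construction (entirely illustrative; the two descriptors, window size and threshold are arbitrary choices), the sketch below evaluates a feature-descriptor vector s(q) = (local mean, local variability) of a sampled state function x(y) over a coarse feature domain Q and flags the cells whose descriptor values are characteristic of 'rough' terrain.

    import numpy as np

    rng = np.random.default_rng(1)
    # a scalar state function x(y) sampled on a grid: smooth terrain with a rough patch
    y = np.linspace(0.0, 10.0, 500)
    x = np.sin(y) + np.where((y > 4) & (y < 6), 0.5 * rng.standard_normal(y.size), 0.0)

    def feature_space(x, window=25):
        """Map x(y) to s(q) = (local mean, local std) over a coarser feature domain Q."""
        q_idx = np.arange(0, x.size - window, window)
        s = np.array([[x[i:i + window].mean(), x[i:i + window].std()] for i in q_idx])
        return q_idx, s

    q, s = feature_space(x)
    # a 'feature' is a region of Q with characteristic descriptor values,
    # e.g. cells whose local variability exceeds a threshold
    rough = q[s[:, 1] > 0.25]
    print("feature-domain cells flagged as rough:", rough)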

In this construction there is a clear identification of features as corresponding to regions


in the domain Q for which the feature descriptors described by the location in the feature
space S take on certain characteristic values. An example of this space is shown in Figure
4.2 where two features are identified as two distinct regions in the feature domain according
to the values of the feature descriptors. This interpretation suggests that a transformation
can be considered ‘optimal’ if, with respect to some arbitrary set of feature characteristics,
it generates a space {S, Q} for which the ‘contrast’ of individual features is maximised.

Once more, the structure of the spaces S and Q do not require that the basis vectors be
orthogonal or independent, rather the only requirement is that they span a space sufficiently
descriptive to support the required task. It should be recognised that orthogonality, while
not a necessary property, can be expected to yield spaces which are computationally superior
as a result of reduced redundancy. In the limit of two feature descriptors approaching
a complete physical or statistical dependency, the resulting space will have at least one
redundant dimension. The independence (or orthogonality) of the individual quantities
defining the feature spaces will be discussed in Section 4.4.1.

4.2.2 The Transformation Approach

It is contended that the transformative approach depicted in Figure 4.2 represents a concise
and complete summary of the essential characteristics of all data interpretation techniques.
This conjecture is motivated by the fact that the data obtained from a sensor is inherently
sensor-centric, while the optimal utilisation of that data must be task-centric. Furthermore,
different reasoning tasks involve unique interpretations of not only the sensor values but
also the domain over which those measurements were made. For example, in a situation
involving an audio time–sequence, a ‘voice recognition’ task may not require the ability
to separate the stream into individual phrases, words or syllables, but will likely examine
the frequency spectrum and similar characteristics of an entire sequence. Alternatively, a
‘speech recognition’ task would be required to perform these segmentation operations and
would consider the individual segments to correspond to ‘features’, each of which should
be interpreted from both the temporal context (the feature domain values) and auditory
characteristics (the feature descriptors).

Another implication of considering the reasoning process as an application-specific transfor-


mation relates to the recognition that such transformations can be considered as generali-
sations of available external knowledge. This is obviously true in the case of a simple sensor
model, such as for a temperature transducer, where a viable physical model is available as
a general model. Other cases for which such models are not known a priori represent the
situation for many machine learning applications, for example, determining which charac-
teristics of a sensor stream correspond to a concept of ‘likeness to a tree’ or identifying
what elements within an environment represent an obstacle in a navigation problem. In
these cases, the transformation must be generated from a set of training data considered by
the designer to exemplify the operational characteristics of the system. When the resulting
transformation transcends the specifics of the training set it is often considered as a ‘gen-
eralisation’ of the problem, in that it can be applied to ‘new’ situations not covered during
the learning phase. Such transformations correspond to the identification of the charac-
teristics contained in the data which are salient for a given task and to treating all other
characteristics as ‘noise’ to be discarded, an obvious analogy to the lossy transformation
processes introduced in this chapter.

4.2.3 Reasoning Algorithms as Transformations

As this thesis proposes that a transformational framework can be used to capture concisely
and accurately the nature of the reasoning process, the question of whether current reasoning
techniques can be directly interpreted using this framework must be addressed. There are
four main categories of reasoning framework in the literature:

• Correlation/Convolution filters (Minkoff 2002, §5.4);

• Signal reconstruction techniques from statistical communications theories, including
  matched filters and real-world detectors (Middleton 1996);

• Regressions and discriminative classifiers (Mackay 1998); and

• Generative statistical models (Ng & Jordan 2002).

One final issue which arises in this context is the role of learning methods (or any regressive
or ‘trained’ process) within a well-engineered system. As discussed in Section 1.3.3, the
engineer desires that, where possible, performance guarantees can be provided for system
sub-components. These methods represent cases where this is possible, but only when the
training data is sufficient to support a reasonable gamut of the actual operating scenarios.
In this way, it is preferable to limit carefully the application of these methods to specific
parts of the process to ameliorate any potential adverse effects. Furthermore, if it is possible
to assess the quality of the output from these methods within a given sub-component, then
it is possible to improve overall system-stability.

The utilisation of many learning methods represents the application of a priori knowledge6
to the data. This means that they are directly amenable to application in the parts of the
high-level model where external knowledge is introduced, that is, in the transformations of
this chapter and, consequently, the sensor models of Section 3.4.2. As shown in Section
4.5 the on-line performance of the resulting transformations can be quantified allowing
a quality measure to be defined. Thus, limiting the application of learning methods to
the transformation operations (including sensor modelling and reasoning stages) allows the
methods to be embedded reliably into a system and the quality and performance to be
gauged on-line to provide system-level reliability even if one of these parts fails.
6
That is, knowledge not contained in the data stream itself.

4.3 Transformation of Functional Representations

This section examines the transformation of functional forms for utilisation in sensor mod-
elling and task-specific transformation operations. It is assumed that the ‘input’ distribu-
tion P [x(y)] with y ∈ Y and x(y) ∈ X is known. This distribution can be reduced to a
deterministic form by using a Dirac delta function,

Pdet [x(y)] ≡ δD [x(y) − x̂(y)] , (4.2)

where x̂(y) represents a particular function.

Now, since the input to the transformation is a probability distribution (or density for
continuous cases), then the lth output is also a distribution, P [sl (ql )] say, where ql ∈ Ql
indexes the elements of the feature space over which sl (ql ) ∈ Sl is defined.7 This requires
that the transformation TSl be of the form,

TSl : P [x(y)] → P [sl (ql )] = ∫ P [ sl (ql ) | x(y) ] P [x(y)] dx(y),    (4.3)

where the integral is taken over the functions x(y). Now, the conditional distribution
inside the integral of Equation 4.3 is significant as it represents the relationship between
the quantities independently of the actual statistics involved, that is, of the particular
distribution P [x(y)]. This means that knowledge of this conditional distribution is sufficient
to know the transformation properties,8 that is,

TSl ⇔ P [sl (ql ) | x(y)] . (4.4)

The transformation of a state distribution P [x(y)], for x(y) ∈ X and y ∈ Y , into the
lth feature space defined by sl (ql ) ∈ Sl with ql ∈ Ql is given by Equation 4.3,

TSl : P [x(y)] → P [sl (ql )] = ∫ P [ sl (ql ) | x(y) ] P [x(y)] dx(y),    (4.3)

and the conditional distribution P [sl (ql )|x(y)] defines the properties of the transfor-
mation completely.

7
The equivalent form for a sensor modelling task is P [zl (vl )] with vl ∈ Vl and zl (vl ) ∈ Zl corresponding
to the sensor domain and the sensor observations respectively.
8
The recognition that the transformation process can be described solely by the conditional probability
distribution follows the approach of Middleton (1996, §18.3-2).
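
For a discretised state the transformation of Equation 4.3 reduces to a matrix-vector product in which each column of the conditional table is P[sl(ql) | x] for one state hypothesis. A minimal sketch (the state values, feature values and conditional table are invented for illustration):

    import numpy as np

    # discrete prior over three state hypotheses x (e.g. 'free', 'occupied', 'unknown')
    p_x = np.array([0.6, 0.3, 0.1])

    # conditional P[s | x]: rows index the output feature values s, columns the states;
    # each column sums to one, as required of a conditional distribution
    p_s_given_x = np.array([
        [0.8, 0.1, 0.4],   # s = 'traversable'
        [0.2, 0.9, 0.6],   # s = 'not traversable'
    ])

    # Equation 4.3 in its discrete form: P[s] = sum_x P[s | x] P[x]
    p_s = p_s_given_x @ p_x
    print("P[s]:", p_s)    # -> [0.55, 0.45]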

Under a general function-basis interpretation, this distribution represents a function defined


over the joint function-space spanned by the two domains Y and Ql , {FY , FQl }, and is
normalised over FQl for the selection of any particular function x(y) ∈ FY . This means
that the conditional distribution can represent any relationship between the members of
those function spaces. In practice, however, many common relationships can be adequately
constructed according to a combination of three basic effects:

1. a deterministic transformation of the domains and values of the functions;

2. the effect of uncertainty in the transformed value of the functions themselves; and

3. the effect of uncertainty in the input domain location affecting the deterministic trans-
formations of the first part.

4.3.1 Deterministic Transformations

Consider first the deterministic relationship between two functional forms. Let the input
function be the estimate x̂(y) obtained from the input distribution and note that since the
resulting transformation Tdet must yield a function which depends only on the value of ql ,
then it must be of the form,

Tdet : x̂(y) → ŝl (ql ) = ∫ fsl [x̂(y), y, ql ] dy,    (4.5)

where the function fsl [x̂(y), y, ql ] represents an arbitrary relationship between the values
of the function and the different coordinate locations y and ql . This transformation is
depicted in Figure 4.3 and includes the case in which the value of the input function can
potentially affect the relationship between the locations ql and y. For example, a parity
operation for binary data corresponds to this situation in that if the ql correspond to the
digital values of 1 and 0 and the function ŝl (ql ) represents the count of those values then
the value of the signal determines which ql the sample is mapped to.

The form of Equation 4.5 is entirely general, but admits several simplifications which are
encountered regularly in real systems. These simplifications include linear transformations
of the function values, kernel forms and singular domain transformations, each of which
encompasses a decreasingly small class of possible transformations.


Figure 4.3: A deterministic transform between two functional forms.

Linear Transformations result when the function fsl depends linearly on the values of
the function, though possibly non-linearly on the values of y and ql , yielding

Tdet^lin : x̂(y) → ŝl (ql ) = ∫ x̂(y) fsl^lin [y, ql ] dy.    (4.6)

Examples of this simplification include the Fourier transform (Bracewell 2000, p5),
F : f(t) → F(ω) ≜ ∫ f(t) e^(−jωt) dt, and the Laplace transform (Kreyszig 1993, p262),
L : f(t) → F(s) ≜ ∫0^∞ f(t) e^(−st) dt, where in both cases each and every value of one of
the domains affects each and every value of the other in a strongly non-linear fashion
through the complex exponential functions. Each of these operations is well-known,
however, to be linear in the function itself.

Kernel Forms result when this effect of linearity is combined with a transformation in
which only the ‘difference’ between the values of y and ql is significant. In particular,
when it is possible to write ql = g(y) to represent the projection of the element y ∈ Y
onto the domain9 Ql then this can be written as

Tdet^ker : x̂(y) → ŝl (ql ) = ∫ x̂(y) fsl^ker [g(y) − ql ] dy    (4.7)

and the function fslker is known as the kernel as it is invariant to the values of y and
ql . An important example of this type of transformation is a convolutional model of
a range sensor: consider that the projection of a ray through a two-dimensional plane
corresponds to the mapping of a function defined over that plane onto the ray and
that the resulting sensor model can be accurately expressed as the convolution of a
9
g is written as a function to indicate a fixed relationship between the elements of Y and Q while the
kernel fsker represents the effects which different y have on nearby values of q once transformed.

range-error model and that function.

Singular Forms result when the kernel function admits only one value of ql = g(y) to
affect the value of ŝl (ql ), that is,

Tdet^sing : x̂(y) → ŝl (ql ) = ∫ x̂(y) δD [g(y) − ql ] dy                          (4.8)

                            = ∫ x̂(y) Σi δD [y − yi ] / |dg/dy(yi )| dy            (4.9)

                            = Σi ( 1 / |dg/dy(yi )| ) ∫ x̂(y) δD [y − yi ] dy

                            = Σi x̂(yi ) / |dg/dy(yi )| ,                          (4.10)

where in Equation 4.9, yi is the ith root of [g(y) − ql ] = 0 (Weisstein 1999a) and the
denominator of Equation 4.10 can be identified as corresponding to the Jacobian for
the transformation function g. Clearly, when the function g(y) is bijective then there
is a single root yi and the transformation obtained corresponds to a change of variable.
Examples of this type of transformation include rectangular-to-polar transformations
and other lossless changes of variable.

The deterministic part of a transformation can be written in a general form as

    Tdet : x̂(y) → ŝl (ql ) = ∫ fsl [x̂(y), y, ql ] dy    (4.5)

but can be simplified under various circumstances, including:

    Linear Forms:   Tdet^lin : x̂(y) → ŝl (ql ) = ∫ x̂(y) fsl^lin [y, ql ] dy    (4.6)

    Kernel Forms:   Tdet^ker : x̂(y) → ŝl (ql ) = ∫ x̂(y) fsl^ker [g(y) − ql ] dy    (4.7)

    Singular Forms: Tdet^sing : x̂(y) → ŝl (ql ) = ∫ x̂(y) δD [g(y) − ql ] dy    (4.8)
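
When g is the identity, the kernel form of Equation 4.7 is exactly a convolution. The sketch below is a minimal version of the range-sensor example mentioned earlier (the occupancy profile, grid resolution and error width are assumed values): the occupancy along a ray is smeared by a range-error kernel to give the expected sensor response.

    import numpy as np

    # occupancy x(y) along a ray, sampled every 0.1 m: a thin object at 12.0 m
    ray = np.zeros(300)
    ray[120] = 1.0

    # range-error kernel f_ker: a discretised zero-mean Gaussian (sigma assumed 0.3 m)
    offsets = np.arange(-15, 16)
    kernel = np.exp(-0.5 * (offsets * 0.1 / 0.3) ** 2)
    kernel /= kernel.sum()

    # kernel-form transform (Equation 4.7 with g(y) = y): s(q) = sum_y x(y) f_ker[y - q]
    expected_response = np.convolve(ray, kernel, mode="same")
    print("peak response at %.1f m" % (expected_response.argmax() * 0.1))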

4.3.2 Uncertainty in Value

The deterministic transformation of Section 4.3.1 maps a member of the function space
FY into a member of the output function space FQl , but does so without any concept of
uncertainty. The goal of the probabilistic transformation in Equation 4.3 is to convert the

sensor-centric information contained in the state distribution into a task-centric quantity


such as an expected sensory measurement, an environmental property such as traversabil-
ity or a measure for classification. Since each of these involves the addition of external
knowledge to the system it is important to consider the uncertainty associated with that
knowledge. For example, the electrical noise introduced by a transducer, signal processing
circuitry and data acquisition process can be modelled physically and will introduce an
uncertainty in the value of the function sl (ql ). Likewise, the notion of traversability relates
to a physical interaction between a vehicle and the environment and physical disturbances
will result in uncertainty in the application of the value.

This effect focusses on the uncertainty in the value of the functions and is distinct from that
associated with uncertainty in the accuracy of the deterministic part of the transformation
which is examined in Section 4.3.3. Consider the transformation of the particular function
x̂(y) from the input distribution ŝl (ql ) = Tdet {x̂(y)} and another arbitrary element sl (ql ) ∈
FQl . Since the transformation Tdet is deterministic, then this implies that

P [ sl (ql ) | x(y) ] ⇔ P [ sl (ql ) | Tdet {x(y)} ] . (4.11)

Now, the deterministic transform of the previous section corresponds to the domain depen-
dencies between the two function spaces, but this expression seeks to capture the ‘noise’
introduced regarding the function sl (ql ) in addition to the effects of the transformation.
Therefore, taking the Dirac basis interpretation, it is possible to write

Pvalue [sl (ql )|x(y)] = Pvalue [sl (ql ) − Tdet {x(y)}] (4.12)

where the subtraction is legal as the transformation results in a member of FQl and the
resulting distribution is a function of the functional difference and can be dependent on the
value of ql . An important example is the introduction of Gaussian noise to the value of a
sensory cue such as temperature, sonar reflectivity or pixel intensity.

The first type of uncertainty introduced to the deterministic transform of Section 4.3.1
corresponds to noise on the resulting function sl (ql ), given the transformed function
ŝl (ql ),
Pvalue [ sl (ql ) | x(y) ] = Pvalue [ sl (ql ) − Tdet {x(y)} ] (4.12)
and represents a distribution over the difference function which will be a member of
FQl and can, therefore, depend directly on the value of ql .

4.3.3 Domain Uncertainty

If it is noted that the uncertainty added to the transformation in Section 4.3.2 represents the
effects of noise on the value of the function at a known location, and that the deterministic
transformations of Section 4.3.1 indicate how the function at a known location affects the
noise-free values of the function, then the only remaining source of uncertainty results from
uncertainty in the location itself, that is in the value of y. Noting that Equation 4.3 does not
represent the quantity y as a random variable, but rather as a regular variable over which
an unknown function is defined, then it is necessary to introduce a new random variable to
capture the uncertainty in this value. Specifically, for any particular value of y it is possible
to determine the distribution over the new random variable y′,

    Py′ [ y′ | y ].    (4.13)

The contribution of a particular value of y′ to the resulting function sl (ql ) can be determined
from Equation 4.12,

    Pvalue [ sl (ql ) | x(y′) ] = Pvalue [ sl (ql ) − Tdet {x(y′)} ],    (4.12)

and the contribution can be weighted according to the probability of that value of y′ cor-
responding to the value of y. This yields

    P [ sl (ql ) | x(y) ] = ∫ Pvalue [ sl (ql ) − Tdet {x(y′)} ] P [ y′ | y ] dy′.    (4.14)

Examples of transformations in which this may be important are those which are affected
by the uncertainty in a vehicle’s pose, or in the orientation and location of a sensor. In

these cases the distribution P [y′ |y] may be strongly dependent on the value of y and other
parameters which define it and is sufficiently general to represent all these effects succinctly.

In addition to the uncertainty in the value of the function sl (ql ), it is possible to
consider the effects of uncertainty associated with the location in the domain Y , such as
uncertainty in the location of a vehicle on which a sensor is mounted. This distribution
is given by P [y′ |y] and describes a new random variable y′ given the value of the
non-random variable y. The resulting transformation can be represented as

    P [ sl (ql ) | x(y) ] = ∫ Pvalue [ sl (ql ) − Tdet {x(y′)} ] P [ y′ | y ] dy′.    (4.14)

The distribution Py′ [y′ |y] represents the domain uncertainty, that is, the uncertainty re-
garding which elements of the domain Y′ ⊂ Y affect the given value of the output function.
Significantly, this distribution also formalises the data-association problem, that is, the
association of particular elements of the state distribution to a given observation or trans-
formed output function. Consider that the ‘classic’ data-association problem in which an
observation is identified with one, and only one, element of the state represents a case where
this distribution is a Dirac delta over the discrete space of state elements. Immediately it
becomes clear why the traditional method is susceptible to catastrophic failure in the case
of an incorrect association - the operation is assumed to be deterministic and the result
of ‘perfect’ knowledge. In addition, approaches such as multi-hypothesis theorems can be
identified as computationally tractable approximations of the full instantiation of Equation
4.14.
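
A discretised sketch of Equation 4.14 (all distributions and the identity choice of Tdet are assumptions made for illustration): each candidate domain location y′ contributes its transformed value smeared by a Gaussian value-noise model, weighted by P[y′ | y], which also plays the role of the soft data-association weight discussed above.

    import numpy as np

    def normal_pdf(x, mean, sigma):
        return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    # state function x(y) on a grid of domain locations y (assumed values)
    y_grid = np.linspace(0.0, 10.0, 101)
    x_of_y = np.sin(y_grid)

    # deterministic transform: here simply T_det{x(y')} = x(y') at the queried location
    # domain uncertainty P[y'|y]: Gaussian about the nominal location y = 4.0 (sigma assumed)
    p_y_prime = normal_pdf(y_grid, mean=4.0, sigma=0.5)
    p_y_prime /= p_y_prime.sum()

    # value uncertainty P_value[s - T_det{x(y')}]: Gaussian sensor noise (sigma assumed)
    s_grid = np.linspace(-2.0, 2.0, 201)
    p_s = np.zeros_like(s_grid)
    for x_val, w in zip(x_of_y, p_y_prime):
        p_s += w * normal_pdf(s_grid, mean=x_val, sigma=0.1)   # Equation 4.14, discretised
    p_s /= p_s.sum()

    print("expected output value:", (s_grid * p_s).sum())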

4.3.4 Interpreting Feature Transformations

The composite transformation constructed using all three effects from Sections 4.3.1-4.3.3
is illustrated diagrammatically in Figure 4.4 which demonstrates the three effects: a de-
terministic transformation; the value uncertainty; and the domain uncertainty prior to
transformation.

An important characteristic of the transformations presented in this thesis is their capability


of representing the effects of task-specific knowledge and assumptions on the manipulation
and interpretation of the data. It is commonly assumed that the process of reasoning in-
troduces external knowledge or assumptions into the data stream and should, therefore, be
interpreted as the combination of sensory data and additional knowledge.


Figure 4.4: The composite transformation for functional forms showing all three components
of the construction: 1. the deterministic transformation of a single function; 2. the addition
of value uncertainty to the output function; and 3. the incorporation of domain uncertainty
prior to the transformation.

Figure 4.5: Side information in a communications channel. The information contained in
the random variable A affects the encoding of the signal X and its decoding into S, but
does not actually change the content of those signals.

It is contended here, however, that the interpretation of reasoning as the transformation of the data, par-
ticularly in relation to the symmetry induced between sensor modelling and reasoning,
provides an alternative explanation: the data remains the sole source of knowledge of the
particular situation in which the system is operating, but the external relationships used
when reasoning represent knowledge of a general type. Consider that a sensor model from
Section 3.4.2 represents the situation-independent relationship between an arbitrary state
and the resulting sensory observation. Furthermore, a transformation which captures the
essential situation-agnostic relationships between a given sensory stream and a task is a
generalisation in the sense that the rule remains valid for situations other than the specific
ones from which it was developed.

This is closely related to the notions of ‘side’ information in the communications literature.
A typical arrangement involving this concept is shown in Figure 4.5. The data encoded
in the transmitted signal is denoted by X and the output of the decoder is denoted S
analogously to the reasoning approach outlined in this chapter. Note however, that as well
as the information itself, some additional knowledge relating to the encoding process is
transmitted in the form of A. As drawn, this quantity affects the encoding and decoding
processes but does not change or modify the actual data being transmitted. The formal
notion of side information relates to a system with a known probabilistic structure, in this
case the properties of a communications system, but for which additional information is
available which is capable of improving the performance of the system without changing
the actual data communicated.

Shannon (1958) introduced the notion as related to the encoding of a signal and gave a
particularly striking example. An electromagnetic communications system is subject to
interference and errors due to external interference, but if the system has access to real-
time measurements of the noise characteristics at particular frequencies then the system
can select the best band on which to operate. Alternatively, Cover & Thomas (1991,
pp130-1) discuss the notion with respect to a gambling situation in which data is available
which changes the expected performance of a betting strategy. Here the odds represent the
fixed statistical system and the additional information provides knowledge which affects the
performance of the gambling strategy.

Under this interpretation, the side information corresponds to effects such as physical rela-
tionships, sensory processes, physical constraints and other properties which affect how the
information relevant to a given task is encoded in the sensory stream. Furthermore, this
side information also affects how the information regarding the task can be extracted from
the data. An interesting example of this type of situation relates to the embedding and
recovery of watermarking information in digital media (Cox et al. 1999, Linder et al. 2000).

4.4 Special Transformation Classes

In addition to developing the transformational forms of Section 4.3, three special cases of reasoning highlight the character of several important situations often associated with reasoning operations. These are: independent transformations, which deal with the effects of dependence or independence between the different feature descriptors; composite transformations, in which it is convenient to consider a single transformation as a combination of transformations to simplify analysis and design; and feature extraction and decision processing, where the output space is discretised.

4.4.1 Independent Transformations

The lth transformation of Sections 4.3.1-4.3.3 gives rise to the new functional form sl (ql )
representing a tensor-field over the domain Ql . Recall from Section 2.4.1 that the domain
Ql corresponds to the set over which the function is defined, but that the coordinate system
used to index the elements is arbitrary. Consider that the developer is free to select this
coordinate system arbitrarily10 and it should be clear from the discussions of that section
that the basis vectors {eq } need not be orthogonal to one another. As demonstrated in
Appendix A, the existence of dual bases makes such non-orthogonal systems viable, though
orthogonal coordinates clearly give rise to computationally superior systems. As these
coordinates index the elements of the domain set Ql then the particular values which the
vector ql takes will be independent of the actual data contained in the state.

Each feature space transformation gives rise to a tensor quantity sl ∈ Sl at each domain location ql . Denote the components of this quantity by slk for the kth property. Unlike
the quantities defining the domain location, these do not index the elements of a data-
agnostic set, but instead correspond to measures derived from the actual data. For this
reason, it is assumed that each of the quantities slk forms an orthogonal basis vector for the
sensor space and that any dependencies between the quantities will be related to the actual
characteristics of the data involved, rather than representing some general relationship
between the measures involved11 . For example, the relationship between the correlation
score for two image patches will depend on the nature of the two patches and the background
scenes under consideration. Consider also that if the quantities represented the strain tensor
for a solid material, then it is the characteristics of the material which determines the
relationships between the individual elements (Hibbeler 1997, §10.6).

Consider the three two-dimensional feature spaces shown in Figure 4.6. In this figure
10 This is in contrast to the sensor models of Section 3.4.2 where the coordinates of the observation are determined by the physics of the particular sensor.
11 The same orthogonality applies to the bases of distinct feature spaces, for example those of sl (ql ) and s′l (q′l ).
Figure 4.6: Orthogonal bases and statistical independence in feature spaces. (a) shows the
case where the basis is orthogonal and the data are statistically independent; (b) shows the
case where the axes are non-orthogonal, but for which the data are independent under these
axes; and (c) the case where the basis is orthogonal, but the data is statistically dependent.

three different cases are shown to demonstrate the essential characteristics of the possible
transformations: in (a) the basis of the feature space (the feature descriptors) are orthogonal
and the data are statistically independent; while in (b) the data remains independent, but
the basis is non-orthogonal12 ; finally, (c) shows the case where the axes are orthogonal but
the data are statistically dependent.

While it appears that the first two cases are fundamentally different, consider that the notion
of the axes of case (b) being non-orthogonal depends on an external frame of reference
by which the quantities can be assessed. In both of these cases, however, the data are
independent when measured with respect to the given coordinate axes and the resulting
feature space can be considered ‘optimal’ in that it has no statistical redundancy in the
basis. In the third case, however, the selected axes display a clear statistical dependency
between the resulting data. Examining the figure, it would be possible to obtain independent
axes if they were aligned with the shape of the point cloud, that is, at 45 degrees to the
current axes.

These properties suggest that, when possible, the characteristics of the data should be taken
into account when selecting the axes defining the lth feature space. However, in practice, it is
common to assume that the individual components form an orthogonal and independent set.
The effect of this when the selected transformation is not orthogonal and/or independent
is a reduction in the conciseness and compactness of the representation, but the result will
12
Note that the change of basis affects the placement of the data when they are statistically independent
as this depends on the quantities being measured.

be conservative. Some ‘learning’ methods such as isomap (Tenenbaum 1998) guarantee


orthogonal output spaces, while others do not and a feature space is commonly constructed
from independent instantiations of the method. This is well-known in cases where the
Gaussian processes transformation is used to select the feature descriptors (Mackay 1998).

4.4.2 Composite Transformations

While the transformational approach of Figure 4.2 can utilise any aspect of the input data
to generate a feature space, there is no guarantee that this transformation can be readily
identified or used. An important example of this involves the interpretation of visual data:
consider that a 24-bit colour image with a resolution of 640 × 480 can be represented as
a point within a space of dimension 640 × 480 × 3 = 921600. However it is well known
that transformations of the location, orientation and scale of objects within the scene will
strongly affect the resulting images. This implies that the data does not have the same
number of degrees of freedom as the ‘arbitrary’ image considered. Discovering these ‘hidden’
dependencies between the multitude of potentially independent values can be particularly
challenging.

Many of the obvious relationships, such as neighbouring pixel values in space and time, can
be considered individually and it is common to pre-process the data to obtain a series of de-
rived cues from which considerably simpler transformations can be found, such as in Kumar
et al. (2005) or Karumanchi (2005). These particular approaches rely on the assumption
that these derived cues capture a sufficient quantity of the ‘important’ characteristics that
it becomes computationally easier to find and utilise the secondary transformation. This
situation is shown in Figure 4.7, where data is pre-processed to generate three intermediate
data representations prior to being combined in a final transformation. It should be obvious
that this represents the interpretation-stage version of the independent sensor assumptions
of Section 3.2.2 and the figure can be directly compared with Figure 3.4.

Obviously, effects such as the data processing inequality (Mackay 1998, pp141,4) imply that
such approaches are at best as accurate as the single-stage process and will usually result
in less efficient transformations with greater uncertainty in the result. It is misleading,
however, to consider only the efficiency and accuracy of a single composite transform in
assessing the viability of such approaches. Rather, noting that a real autonomous system can

Figure 4.7: A compound transformation showing the pre-processed data stage. The data from
the state space is transformed through a series of transformations to yield pre-processed
data from which a final transformation generates the desired output feature space. This
figure should be compared with Figure 3.4 where the independence is modelled as part of
the sensing process.

be expected to achieve a very large number of tasks based on the same data, then the issue
of storage limitations can become significant. In particular, there may be a large number
of transformations which are required, each of which requires a subtly different subset of
the original data. Utilising the most efficient transformations for each task may require the
central summary of information to store a very large quantity of data. Alternatively, many
of these tasks may be represented well by a series of composite transforms which rely on
a group of much smaller subsets of the data. In effect, there may be situations in which a
trade-off between the overall performance of individual tasks and the flexibility available to
a system suggests an approach of this form. In fact, there are many biological situations
in which approaches such as these are observed, the most important of which is the
processing of visual cues such as colour, texture and motion before the data reaches the
visual cortex (Croner & Albright 1999).
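The structure of such a composite transformation can be sketched as follows; the particular cues, weightings and data are invented purely for illustration and are not those of any system described here.

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.integers(0, 256, size=(480, 640, 3)).astype(float)   # stand-in for raw data

# Stage 1: pre-process the raw data into a small set of derived cues.  These cue
# definitions are invented stand-ins for the colour, texture and edge cues
# discussed in the text.
cues = {
    "mean_intensity":    float(image.mean()),
    "edge_energy":       float(np.abs(np.diff(image, axis=1)).mean()),
    "vertical_gradient": float(np.abs(np.diff(image, axis=0)).mean()),
}

# Stage 2: several downstream tasks apply cheap secondary transformations to the
# *same* stored cues rather than to the full 640 x 480 x 3 array.  By the
# data-processing inequality each result is at best as informative as a direct
# single-stage transformation of the raw data, but only the cues need be stored.
task_a = 0.5 * cues["edge_energy"] + 0.5 * cues["vertical_gradient"]
task_b = cues["mean_intensity"] / 255.0
print(cues, task_a, task_b)
```

The trade-off described above is visible directly: the cue dictionary is a handful of numbers, while the raw image is nearly a million values, and every additional task re-uses the same intermediate representation.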

4.4.3 Feature Extraction and Decision Processing

The development of the reasoning framework in this chapter has focussed on the trans-
formation of the sensory data into an arbitrary feature space. While the transformation
process is the most important aspect of the data manipulation stages, the interpretation
of transformed data is usually associated with a decision or with the extraction of a ‘fea-
ture’ from the data: Middleton (1996, Ch.18) considers the identification of signals in noise,
Figure 4.8: Discretisation schemes for decision processing: (a) shows the case where the
state is explicitly discretised and the sensor model captures the action; (b) considers a
case where the reasoning process directly discretises the space; and (c) the case where the
discretisation is performed separately.

Mackay (2004, Ch.20) the extraction of clusters from data, and Mackay (2004, Ch.22) the
identification of the mean of an unknown distribution. In the context of the model presented
in this thesis, the application of a discretisation scheme to the data (in the sensory space or
a transformed feature space) represents a special case of the data transformation operation.

Under this interpretation, the three examples of Figure 4.8 show the discretisation as applied
through: (a) a sensor model; (b) a direct transformation; and (c) a composite transformation
respectively. In the first case the role of producing a discrete output x(y) ∈ X from the
sensory observation z(v) ∈ Z occurs through the use of an appropriate sensor model13
and the state function itself performs the discretisation. An example of this case occurs
when utilising an occupancy model, where the sensory observations are used to estimate
the binary property mi = {O+ , O− } (Thrun et al. 2005, p285). Alternatively, in the second
case the primary feature transformation maps the state x(y) to a discrete output space
S. The final case is similar, except that the input to the transformation is another feature
space rather than the state itself.
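As a concrete illustration of case (a), the following sketch applies the standard log-odds form of the binary occupancy update to a single cell; the inverse sensor model values used here are invented for illustration only.

```python
import numpy as np

def logodds(p):
    return np.log(p / (1.0 - p))

def occupancy_update(prior_logodds, p_occ_given_z):
    """Log-odds update for a binary occupancy cell m_i in {O+, O-}.  The argument
    `p_occ_given_z` is the inverse sensor model P(m_i = O+ | z) for this cell;
    with a prior of P(O+) = 0.5 the usual prior-correction term vanishes."""
    return prior_logodds + logodds(p_occ_given_z)

cell = logodds(0.5)                        # uninformative prior: P(O+) = 0.5
for p in [0.7, 0.7, 0.4, 0.8]:             # invented inverse-sensor-model outputs
    cell = occupancy_update(cell, p)

posterior = np.exp(cell) / (1.0 + np.exp(cell))
print("P(O+ | z_1..z_4) =", posterior)
```

The discretisation here happens in the sensor model itself: whatever the raw range measurements were, each cell of the state retains only the binary property and its probability.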

It was noted earlier that the transformation stages of a system can be viewed as corresponding to lossy compression of the data. Clearly, this is not with respect to the raw data, but to the characteristics which are relevant for a particular feature space.14 Furthermore,
this suggests that the application of decision processes will necessarily result in the loss
of information and that these decisions are implemented for the purposes of engineering
practical systems as detailed in Example 4.4.1.

Example 4.4.1 – Features from 2-D laser scans


Consider a navigation problem for an indoor mobile-robot using a 2D laser scanner and,
in particular, the determination of the orientation of the system. In order of decreasing computational cost and decreasing 'performance', the approaches include:

• Scan matching with entire laser scans

• Pre-processing the data to generate line segments which are compared

• Pre-processing to generate ‘corner features’ which are matched


13 Where the domains Y , Z and V can be continuous or discrete as desired.
14 Throughout this section, compression will correspond to this interpretation and not to the storage and maintenance of the raw sensory data.

Features   Domain (Q)                  Descriptors (S)           Example Transformation
Corners    (x, y) ∈ R2                 'corner-ness' (scalar)    kernel-based corner score
Lines      (x1 , y1 , x2 , y2 ) ∈ R4   'line score' (scalar)     score of points lying on each line
Scans      (x, y, θ) ∈ R3              scan data ∈ RN            (no transform)

Table 4.1: Three different feature-spaces for 2-D laser scanner data

Recall from Figure 4.2 and Section 4.2.1 that the transformation TS transforms the distri-
bution P [x(y)] onto the new domain Q with the values of the output function-space defined
over Q spanned by the basis vectors of the space S. Since these basis-vectors {esi } span
the function space of the feature-space, they were considered as the ‘feature descriptors’.
Further, the values of the feature descriptors can be used to identify, separate and compare
particular regions of the domain Q such that these regions can be reasonably considered to
form individual ‘features’. For the three approaches to the laser scanning system above, the
feature domain, feature descriptors and features can be readily identified as shown in Table
4.1.

Although there are many different simplifications which exist for generating the features in
this example efficiently15 , the impact of the increasingly lossy transformations should be
clear.
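As an illustration of the 'line score' descriptor of Table 4.1, the following sketch scores a candidate line by the fraction of scan points lying close to it; the tolerance and the synthetic scan are assumptions made purely for illustration.

```python
import numpy as np

def line_score(points, x1, y1, x2, y2, tol=0.05):
    """'Line score' descriptor over the feature domain (x1, y1, x2, y2): the
    fraction of scan points lying within `tol` metres of the candidate line."""
    p1 = np.array([x1, y1], dtype=float)
    p2 = np.array([x2, y2], dtype=float)
    d = p2 - p1
    rel = points - p1
    # perpendicular distance of every point to the infinite line through p1 and p2
    dist = np.abs(d[0] * rel[:, 1] - d[1] * rel[:, 0]) / np.linalg.norm(d)
    return float(np.mean(dist < tol))

# A hypothetical 2-D scan: points along a wall at y = 2 m, plus uniform clutter.
rng = np.random.default_rng(2)
wall = np.column_stack([np.linspace(0.0, 4.0, 120), np.full(120, 2.0)])
wall += rng.normal(0.0, 0.01, wall.shape)
clutter = rng.uniform(low=[0.0, 0.0], high=[4.0, 4.0], size=(60, 2))
scan = np.vstack([wall, clutter])

print(line_score(scan, 0.0, 2.0, 4.0, 2.0))   # high score: candidate matches the wall
print(line_score(scan, 0.0, 0.5, 4.0, 3.5))   # low score: an arbitrary candidate
```

The scalar descriptor discards everything about the scan except how well it supports each candidate line, which is exactly the lossy character of the transformation discussed above.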

The role of discretisation schemes in the context of the entire model of this thesis can be
identified with the application of computationally advantageous simplifications to the data
obtained through a feature transformation operation. The utilisation of a discretisation
scheme, however, depends greatly on the available a priori data and the complexity of the
downstream processing. Several important classes, in order of increasing computational
cost, are:

• (Simple) thresholding based on scalar value16 ;

• Feature selection, where the boundaries are to be determined;

• Feature association, where observed features are to be associated with existing infor-
mation; and
15 Especially in the line-extraction case, though many of these correspond to 'seeding' a search space within the noted domain, or restricting the domain by another method.
16 This class includes simple decision processing.

• Scene interpretation, where the locations, associations, properties and meaning of the
entire scene is to be determined and utilised.

The details of each of these operations, and the development of a comprehensive approach to
each, is beyond the scope of this thesis, though the application of measures such as entropy
and other quantities from Chapter 2 shows some promise in these operations, most notably
in threshold and feature selection operations. The particular problem of feature association
is generally interpreted as the data-association problem identified in Section 4.3.3, that is,
the identification of which subset of the state Y  ⊂ Y contributes to a given observation
ẑl (vl ) or output function ŝl (ql ). For example Bailey et al. (2006) shows that many methods
exist for evaluating and maintaining such associations under uncertainty in the particular
case of non-Gaussian target tracking. Finally, a practical solution to the final category of
scene interpretation will represent a substantial improvement to the state-of-the-art, in that
it requires not only viable, reliable and meaningful feature identification, but additionally
the ability to interpret the meaning of the feature descriptor values. This, presumably, will
require the system to additionally consider the nature of the ‘background’ or non-feature
elements of the data-stream in order to provide this meaning.

4.5 Performance Measures

This section considers the application of the measures of Chapter 2 to the determination of
the performance of a functional transformation. There are several important characteristics
which are relevant to the ‘performance’ of a given reasoning operation, including: how
much of the content of the raw data is discarded by the transformation; how well the
transformation retains the information that is ‘relevant’ to a given task; and the impact on
the reliable operation of a system associated with the assignment of specific output values.
The first of these addresses the lossy nature of the transformation and considers the level
to which the transformation reduces the descriptive complexity and size of the data. The
remaining two consider the signal with respect to a specific task: evaluating the ability of
the system to retain the information which is relevant to an operation. They consider the
information preserving characteristics and the impact of the transformation in the context
of interpreting the data in the resulting feature space.

When considering the performance of a transformation of the form illustrated in Figure 4.2
the most important characteristics are those which capture the effects of the transformation
itself, rather than the effects due to the particular data obtained in a given scenario. For
example, the average quantity of uncertainty introduced by a given transformation conforms
to this type, while the current statistics relating the information being provided by a given
sensor to a particular task relates to the online performance instead (and is considered in
Section 4.5.4).

The transformation process is summarised by Equation 4.3,



\begin{equation} T_{S_l} : P[x(y)] \;\rightarrow\; P[s_l(q_l)] = \int P[\, s_l(q_l) \mid x(y)\,]\; P[x(y)]\; dx(y), \tag{4.3} \end{equation}

and Section 4.3 argued that the transformational properties of this relationship are sum-
marised effectively and completely in the conditional distribution P [sl (ql )|x(y)]; the ad-
ditional quantities in the expression relate to the application of this transformation to the
particular situation encoded in the distribution P [x(y)].

There are three primary characteristics of this conditional distribution which are relevant
to the performance of a system utilising such a transformation. In the order of increasing
relationship to the specifics of the task these are:

• the task-agnostic information-preservation characteristics of the transformation;

• the task-specific information preservation characteristics of the transformation; and

• the task-specific impact of the transformation on the system.

In a practical system it is also advantageous to consider not only the transformation itself, as
characterised by the three issues above, but also the ‘online’ performance of the individual
components of the system, for example, the current contribution of a given sensor to the
data in the state or the sensitivity of a particular feature-descriptor value to changes in the
state. Furthermore, the contribution of a particular data source to a particular feature-
descriptor or decision process can be established along the same lines. That is, in addition
to the three criteria above, the engineer can assess:

• the contributions of a particular system component to another; and

• the specific contribution of a given sensory source to the state or to a downstream


feature-space.

4.5.1 Task-agnostic Information Preservation

The task-agnostic part of the transformation process is summarised by the conditional dis-
tribution P [ sl (ql ) | x(y) ] as this considers only the propagation of the information from
one functional form to another. Following the concepts of communications theory (Middle-
ton 1996, Chapter 6), the entropy from Section 2.5.3 can be considered as an appropriate
measure of the information content of the distribution.17 In this case the output distribution
P [sl (ql )] can be characterised by,
H [SQl ] , (4.15)

where SQl represents the ensemble of functions sl (ql ) as the analogy to X for the random
variable x in a case such as Mackay (2004, §8.1). Recall from Section 2.6 that the rela-
tionships between the information theoretic quantities can be summarised by the diagram
reproduced in Figure 4.9. Importantly, the entropy of Equation 4.15 can be written as

H(SQl ) = I(SQl ; XY ) + H(SQl |XY ). (4.16)

This expression is significant in that the two terms represent the two most important aspects
of the transformations as developed in Section 4.3: the mutual information term represents
the quantity of information which is deterministically transformed from the input space to
the output; and the conditional entropy represents the informational content which exists
only in the output representation. The first of these captures the transformed information
and the second the ‘noise’ added to the output by the transformation process. This inter-
pretation is justified by the nature of the mutual information as measuring the common
statistical properties which are invariant to an arbitrary deterministic transformation of the
ordinates, as discussed in Appendix B.

As a direct measure of the descriptive complexity of the distributions resulting from the
transformation process the entropy allows the engineer to accurately assess the impact of
the two effects separately. In practice, the most desirable transformation is that which
retains the properties of the input distribution without the addition of significant levels of
‘noise’; that is, it is desirable to maximise the mutual information component and minimise
the entropy part. For the continuous case, these two terms can be expanded using the forms
17
In the sense of the descriptive complexity of Shannon’s definition.
[Figure 4.9 diagram: the interval H(SQl , XY ) is divided into H(SQl |XY ), I(SQl ; XY ) and H(XY |SQl ); H(SQl ) and H(XY ) appear as overlapping sub-intervals.]
Figure 4.9: An interval-diagram interpretation of the relationships between input and out-
put distributions. The length of each interval corresponds to the value of that quantity.

of Equations 2.92 and 2.112 in Sections 2.5.4 and 2.5.6 as


 
\begin{align}
H(S_{Q_l} \mid X_Y) &= E_{S_{Q_l} X_Y}\!\left[ \log_b \frac{1}{P[\, s_l(q_l) \mid x(y)\,]} \right] \nonumber \\
&= \int P[\, s_l(q_l), x(y)\,] \, \log_b \frac{1}{P[\, s_l(q_l) \mid x(y)\,]} \, dx(y)\, ds_l(q_l) \nonumber \\
&= \int P[x(y)] \left\{ \int P[\, s_l(q_l) \mid x(y)\,] \, \log_b \frac{1}{P[\, s_l(q_l) \mid x(y)\,]} \, ds_l(q_l) \right\} dx(y) \tag{4.17}
\end{align}

and

\begin{align}
I(S_{Q_l};\, X_Y) &= E_{S_{Q_l} X_Y}\!\left[ \log_b \frac{P[\, s_l(q_l) \mid x(y)\,]}{P[s_l(q_l)]} \right] \nonumber \\
&= \int P[\, s_l(q_l), x(y)\,] \, \log_b \frac{P[\, s_l(q_l) \mid x(y)\,]}{P[s_l(q_l)]} \, dx(y)\, ds_l(q_l) \nonumber \\
&= \int P[x(y)] \left\{ \int P[\, s_l(q_l) \mid x(y)\,] \, \log_b \frac{P[\, s_l(q_l) \mid x(y)\,]}{P[s_l(q_l)]} \, ds_l(q_l) \right\} dx(y). \tag{4.18}
\end{align}

Examination of the quantities in Equations 4.17 and 4.18 reveals that these quantities
depend on the specific properties of the joint distribution P [sl (ql ), x(y)], rather than simply
on the properties of the conditional distribution. However, the final expression in each
case can be interpreted as the performance of the transformative part, for a given input
function x̂(y), and under this interpretation the resulting quantities are the expected values
of these measures over all possible input functions. For example, the bracketed expression
of Equation 4.17 is

\begin{equation} H[S_{Q_l} \mid \hat{x}(y)] = \int P[\, s_l(q_l) \mid \hat{x}(y)\,]\; \log_b \frac{1}{P[\, s_l(q_l) \mid \hat{x}(y)\,]}\; ds_l(q_l) \tag{4.19} \end{equation}

and represents the entropy in the distribution over sl (ql ) for a given input function x̂(y).
Likewise, the bracketed terms of Equation 4.18 represent the contribution to the mutual
information due to the particular function x̂(y), or I [SQl ; x̂(y)]. Importantly, both of these
expressions express the quantities in terms of the conditional distribution and the output
functions so that their dependence on the input function is clear.

Now, in order to obtain expressions which describe the characteristics of the transformation
rather than the specifics of a given instantiation of the distribution, all input functions
should be treated as equally likely. That is, the marginal probability of the input functions
should be assumed to be uniform over the family of functions x(y) ∈ FY so that Equations
4.17 and 4.18 can be interpreted as the average entropy and mutual information for the
given transformation.
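The decomposition of Equation 4.16 can be checked numerically for a small discrete transformation. The sketch below uses an invented conditional matrix and the uniform input marginal assumed above, and confirms that the output entropy splits into the transformed information and the added 'noise'.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# An invented discrete transformation P[s | x]: rows index input functions x,
# columns index output functions s.
P_s_given_x = np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.8, 0.1],
                        [0.0, 0.1, 0.9]])
P_x = np.full(3, 1.0 / 3.0)                 # uniform marginal over input functions
P_sx = P_s_given_x * P_x[:, None]           # joint distribution P[s, x]
P_s = P_sx.sum(axis=0)                      # output marginal P[s]

H_s = entropy(P_s)                                                             # H(S)
H_s_given_x = float(sum(P_x[i] * entropy(P_s_given_x[i]) for i in range(3)))   # H(S|X)
mask = P_sx > 0
I_sx = float((P_sx[mask] * np.log2((P_sx / np.outer(P_x, P_s))[mask])).sum())  # I(S;X)

print(f"H(S) = {H_s:.3f}, I(S;X) = {I_sx:.3f}, H(S|X) = {H_s_given_x:.3f} bits")
assert abs(H_s - (I_sx + H_s_given_x)) < 1e-9   # Equation 4.16
```

A conditional matrix closer to the identity drives the conditional entropy term towards zero, while a noisier matrix shifts output entropy from the mutual information term into the 'noise' term.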

Finally, consider the effects of the transformation process becoming deterministic in the val-
ues of the functions.18 As discussed in Section 2.5.1 the deterministic form of a conditional
probability is given by

\begin{equation} P[\, s_l(q_l) \mid x(y)\,] = \delta_D\!\left[\, s_l(q_l) - f\{x(y)\} \,\right] \quad \text{for some function } f \tag{4.20} \end{equation}

\begin{equation} = \begin{cases} 0 & \text{for } s_l(q_l) \neq f[x(y)] \\ 1 & \text{for } s_l(q_l) = f[x(y)] \text{ in a discrete system} \\ \infty & \text{for } s_l(q_l) = f[x(y)] \text{ in a continuous system.} \end{cases} \tag{4.21} \end{equation}

It was shown in Section 2.5.3 that the inner expression of Equation 4.17 becomes


\begin{equation} P[\, s_l(q_l) \mid x(y)\,] \, \log_b \frac{1}{P[\, s_l(q_l) \mid x(y)\,]} = \begin{cases} 0 \log_b 0 = 0 & \text{for } s_l(q_l) \neq f[x(y)] \\ 1 \log_b 1 = 0 & \text{for a discrete system} \\ \infty \log_b \frac{1}{\infty} = -\infty & \text{for a continuous system} \end{cases} \tag{4.22} \end{equation}
and that the resulting entropy expression becomes

\begin{equation} H(S_{Q_l} \mid X_Y) = \begin{cases} 0 & \text{for a discrete system} \\ -\infty & \text{for a continuous system.} \end{cases} \tag{4.23} \end{equation}

18
Note that this relationship does not require the distribution over either function to be deterministic,
nor that the functions must exist in the same function space (FY = FQl ) as a result of the transformation
invariance of the information theoretic quantities, as outlined in Appendix B.

Likewise, Section 2.5.5 showed that the mutual information reduces to



\begin{equation} I(S_{Q_l};\, X_Y) = \begin{cases} H(X_Y) & \text{for a discrete system} \\ H(X_Y) + \infty & \text{for a continuous system.} \end{cases} \tag{4.24} \end{equation}

In either case, the expected result that the Shannon information content of the output space
is equal to the input space holds, that is, H(SQl ) = H(XY ).

Taking these effects into consideration, the two terms which contribute to the entropy of the
output space can be associated with the two most important effects of the transformation
process: the quantity of the original information which is preserved through the operation;
and the uncertainty or noise added through the transformation process. If the input dis-
tribution is assumed to be uniform in all possible functions, then these measures can be
interpreted as capturing the effects of the transformation itself, or at least the average effect
of the transformation on the class of functions considered in the expectation operations.

4.5.2 Task-specific Information Preservation

The task-agnostic measures of the previous section are obviously related primarily to the
descriptive complexity of the transformations rather than to the applicability of the data
to the application at hand. In the basic construction there are two random variables of interest, the state function x(y) ∈ FY and the feature-space function sl (ql ) ∈ FQl . Let the information relevant to a particular task be denoted by a new random variable s′l (q′l ) ∈ FQ′l . Represent the part of the input signal which contributes this information by a fourth random variable x′(y′) ∈ FY ′ . The relationship between these four variables can be described by the graphical model in Figure 4.10. A useful measure of the performance of the system, when one of the functions x′(y′) or s′l (q′l ) is known, would capture the degree to which the information contributing to that signal is retained through the transformation operation.
Obviously, in order to examine the effect of the transformation process itself, it is necessary to consider the effects of the upper transformation of Figure 4.10 in relation to the effects of the lower transformation. It is normally assumed that the lower transformation TS′l is deterministic so that knowledge of x′(y′) is sufficient to determine s′l (q′l ). Assuming, therefore, that x′(y′) is known allows the effect of the transformation to be drawn as the Markov chain in Figure 4.11. In this figure the task-specific information is contained in the

[Figure 4.10 diagram: x(y) → sl (ql ) under TSl (upper), and x′(y′) → s′l (q′l ) under TS′l (lower).]
Figure 4.10: A graphical model describing the transformation of functional forms from a state-space to a feature-space. Here, the task-specific parts of the functions are represented by the primed variables.

[Figure 4.11 diagram: the Markov chain x′(y′) → x(y) → sl (ql ), with the second step under TSl .]
Figure 4.11: An alternate form of Figure 4.10 for the case where TS′l is deterministic so that all relevant task-specific information is contained in x′(y′).

random variable x′(y′)19 and it is possible to consider the information about this variable which is preserved through the transformation TSl . The data-processing inequality (Mackay 2004, p141,144) requires that

I(SQl ; X′Y ′ ) ≤ I(XY ; X′Y ′ )    (4.25)
I(SQl ; X′Y ′ ) ≤ I(SQl ; XY )    (4.26)

and implies that the transformation TSl can, at best, retain the information about x′(y′)
which is contained in x(y). Several authors have recognised that the process of transfor-
mation in the context of information processing gives rise to the consideration of these
quantities. Tishby et al. (1999) considers the development of ‘optimal’ transformations in
the sense of minimising the functional

L ≜ I(SQl ; XY ) − β I(SQl ; X′Y ′ )    (4.27)

with the Lagrange multiplier β representing a ‘tuning’ factor to favour data representation
over the data compression.
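The functional of Equation 4.27 can be evaluated directly for a small discrete example. The sketch below simply scores one candidate transformation rather than performing the iterative optimisation described by Tishby et al. (1999); all of the distributions are invented for illustration.

```python
import numpy as np

def mutual_information(P_joint):
    """I between the row- and column-variables of a joint probability table (bits)."""
    Pa = P_joint.sum(axis=1, keepdims=True)
    Pb = P_joint.sum(axis=0, keepdims=True)
    mask = P_joint > 0
    return float((P_joint[mask] * np.log2(P_joint[mask] / (Pa @ Pb)[mask])).sum())

P_x = np.array([0.25, 0.25, 0.25, 0.25])          # four input functions x
P_xp_given_x = np.array([[0.9, 0.1],              # task-relevant variable x' given x
                         [0.8, 0.2],
                         [0.2, 0.8],
                         [0.1, 0.9]])
P_s_given_x = np.array([[1.0, 0.0],               # candidate (lossy) transformation
                        [1.0, 0.0],
                        [0.0, 1.0],
                        [0.0, 1.0]])

P_xs = P_s_given_x * P_x[:, None]                        # joint over (x, s)
P_sxp = P_s_given_x.T @ (P_xp_given_x * P_x[:, None])    # joint over (s, x')

beta = 2.0
L = mutual_information(P_xs) - beta * mutual_information(P_sxp)
print("I(S;X)  =", mutual_information(P_xs))
print("I(S;X') =", mutual_information(P_sxp))
print("L       =", L)
```

Sweeping beta trades compression of the input against retention of the task-relevant information, which is exactly the tuning role described above.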

Alternatively, Butz & Thiran (2002) and Fisher et al. (2002) examine the problem of multi-
modal signal ‘alignment’ as the development of a pair of transformations which maximise
the mutual information between the resulting distributions. Figure 4.12 demonstrates this
19 Obviously this quantity is representative of the task-specific output information in s′l (q′l ) because of the deterministic nature of the lower transformation in Figure 4.10.
Figure 4.12: An approach to multi-modal signal alignment in which the physical signal S
gives rise to two measured signals X and Y and these are transformed through TX and TY
to obtain the feature-space representations FX and FY .

problem where two signals X and Y are transformed through TX and TY to generate the
‘feature-spaces’ FX and FY . This is analogous to the case where each of the quantities is
considered to be the equivalent of x′(y′) for the other transformation. The authors note that
schemes adjusting both transforms simultaneously and which simply maximise I(FX ; FY )
without regard to minimising the entropy of the joint distribution may introduce redundant
information arbitrarily. To overcome this problem, Butz & Thiran (2002) propose a ‘feature-
efficiency’ measure as
\begin{equation} e_{S_{Q_l}, X_Y} = \frac{I(S_{Q_l};\, X_Y)}{H(S_{Q_l},\, X_Y)} \in [0, 1]. \tag{4.28} \end{equation}
The advantage of the measure of Tishby et al. (1999) over this approach is that the opti-
misation occurs over the transformation itself, but that the reference data (X′Y ′ ) remains
external to the process. Furthermore, a consideration of the continuous forms of the in-
formation theoretic quantities in Sections 2.5.3 and 2.5.5 reveal that the expected range of
this quantity is no longer limited to [0, 1], but can be any real number. There are many
other examples involving these quantities, including Kapur & Kesauan (1992, Ch. 5) where
the information content (entropy) of a signal is optimised in a transformation process, and
Torkkola (2001) in which the ‘Renyi’ mutual information is maximised in the context of
constructing optimal feature extraction transformations.

These approaches demonstrate that when considered in conjunction with the task-agnostic
measures of the previous section, these measures give rise to methods for constructing prac-
tical transformations which minimise the quantity of data in the feature space (maximising
the compression of the data) while retaining the ‘important’ characteristics of the data.
That is, when the signal of interest is known, either as a raw form x′(y′) or in the space of the output s′l (q′l ), then the mutual informations between the quantities in Figure 4.11 can be

used effectively to examine the ability of the transformation to propagate the task-specific
information.

4.5.3 Task-specific Transformation Costs

The dependence of the measures of performance on the specifics of the task at hand can
be strengthened through the consideration of the task-specific effects of the given transfor-
mation. Whereas the previous sections have examined measures related to the descriptive
complexity of the data and the preservation of the ‘important’ data (again with respect to
the descriptive complexity), knowledge of the task may imply that certain transformative
operations have a greater impact than others. In essence, when the application is known
in such a way that an appropriate ‘cost’ can be placed on the association of a given output
with a given input, then the expectation of these new measures can be utilised to assess the
resulting effect of the intermediate transformations.

Consider, for example, that a viable transformation may be from the raw input data to a
Boolean decision and that the effect of an incorrect decision may be directly quantifiable.
Middleton (1996, §18.4) considers the performance of the system in terms of the concept
of a ‘cost’ or ‘loss’ functional, where the cost is determined from the application itself.
As described in detail by the author, the notion of a cost implies that for any given in-
put function x(y) and a given output function sl (ql ), the cost associated with obtaining
that particular output function can be determined a priori. This function is denoted by
FL [sl (ql ), x(y)], considered as the effects of mapping from the ‘true’ states to the output
states, and includes effects such as incorrect associations and decisions.

If the designer can assign meaningful values to this function, then it is possible to obtain
expressions quantifying the performance of the given transformation on the system. This
approach can be extended to sequential systems such as the communications systems of
Middleton (1996) by noting that the probability distributions usually form a Markov chain.20
See Middleton (1996, §18.4) for details of such a system. Now, there are two distinct cases:
when the true function is known; and when only a distribution is available to describe it.
In the first case, the expected value of that loss can be evaluated to obtain the ‘conditional
20
Such a cascading is required in assessing the performance of a composite transform from Section 4.4.2,
though the loss function is defined across the composite transform, rather than across each separate trans-
formation.

loss rating’,
\begin{equation} L[F_L, \hat{x}(y)] = \int F_L[\, s_l(q_l), \hat{x}(y)\,]\; P[\, s_l(q_l) \mid \hat{x}(y)\,]\; ds_l(q_l), \tag{4.29} \end{equation}

and it is clear that this is the expected value of the loss function for a given value of the
input function x(y). If the distribution describing the probability of that particular function
is known, then the average loss for the transformation can be calculated according to

\begin{equation} L[F_L] = \int\!\!\int F_L[\, s_l(q_l), x(y)\,]\; P[\, s_l(q_l) \mid x(y)\,]\; P[x(y)]\; ds_l(q_l)\, dx(y). \tag{4.30} \end{equation}

This expression can also be utilised if the input distribution is unknown by assuming that
all functions of a given set are equally likely. In practice, utilising measures of this type
requires the developer to be capable of assigning the loss function for pairs of input and
output functions (such as an input signal to a classification result) and some knowledge of
the information being transformed. When this knowledge is available the designer is able to
consider not only the ability of the transform to retain the data, or to compress it, and the
ability to retain the specific information of relevance to a particular task, but additionally
to consider the task-specific impact of the transformation.
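For a discrete (Boolean) decision these expressions reduce to simple weighted sums. The sketch below, with an invented cost matrix and invented probabilities, evaluates the conditional loss rating of Equation 4.29 and the average loss of Equation 4.30.

```python
import numpy as np

# Invented example: the input is one of two true classes and the transformation
# outputs a Boolean decision.  A missed hazard is assumed to cost far more than
# a false alarm; all numbers are illustrative only.
loss = np.array([[0.0, 1.0],            # rows: true class, columns: decision
                 [10.0, 0.0]])
P_s_given_x = np.array([[0.95, 0.05],   # P[decision | true class]
                        [0.20, 0.80]])
P_x = np.array([0.9, 0.1])              # prior over the true classes

# Conditional loss rating (Equation 4.29): expected loss for each known input.
conditional_loss = (loss * P_s_given_x).sum(axis=1)
print("conditional loss ratings:", conditional_loss)

# Average loss for the transformation (Equation 4.30): expectation over inputs.
average_loss = float((P_x * conditional_loss).sum())
print("average loss:", average_loss)
```

Comparing the average loss of alternative conditional matrices then ranks candidate transformations by their task-specific impact rather than by information content alone.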

4.5.4 Online Performance Of System Components

The measures examined in Sections 4.5.1–4.5.3 consider the average or expected performance
of the transformation process in terms of the descriptive complexity or Shannon information
content of the signals. An important consideration, however, relates to the assessment of the
importance of a given information source to a given information sink, such as the degree
to which a particular sensor affects a certain component of the state or a feature-space
representation. Two issues are of paramount importance here: is it possible to determine
if a particular source is providing erroneous data, and how much does an information sink
rely on that source?

Scheding (1997) examines the detectability of sensor failure in the context of the devel-
opment of high-integrity navigation systems for autonomous uninhabited ground vehicles
(AUGVs). In particular, the detection of errors with zero-frequency (DC) characteristics
was shown to represent a challenging and subtle problem and the author also showed that
various techniques are available for incorporating redundancy to improve the overall per-
formance of the resulting system. Alternatively, Bailey et al. (2006) examines the process

of validation gating in non-linear and non-Gaussian target tracking problems. These tech-
niques exist to “cull very unlikely [sensor-to-state] associations”. In practice, these methods
seek to identify cases where the given sensory measurements are sufficiently unexpected (in
the sense of the statistics of the state distribution and the physical model of the sensor)
that they are likely to be the result of sensory error or an incorrect association between the
observation and a given element of the state. The particular issue of the identification and
recovery from sensory failure is beyond the scope of this thesis and the interested reader
should refer to the discussions in Scheding (1997).

The second issue, however, can be directly assessed through the framework presented in this
thesis. When considering either the data-gathering operations of Figure 3.1 or the data-
abstraction transformations of Figure 4.1, it is possible to identify the functional forms at
both the input and output of the processes. For example, the sensor models of the data-
gathering operations map the sensory function zl (vl ) to the state function x(y); meanwhile
the data-abstraction phase maps this state function to a feature-space representation sl (ql ).

Consider the reasoning operations as representative of both cases and note that since the
function estimates x̂(y) and ŝl (ql ) are available, then it is possible to construct a batch, or
running, ensemble of pairs of these functions21 , { [ŝl (ql ), x̂(y)] }, from which it is possible
to construct a joint distribution describing the statistics of the observed functions,

Pbatch [ŝl (ql ), x̂(y)] . (4.31)

An important quantity which can then be determined from this distribution22 is the mutual information between the two functional forms, I(ŜQl ; X̂Y ), which represents the estimated statistical dependency between the two functional forms for the current ensemble. The
mutual information is also significant because unlike measures which capture the functional
relationships between the input and output data, it is invariant to bijective transformations
in the data. If the input and output data were statistically independent then the resulting
measure would be close to zero, whereas as the statistical dependency increases so does
the value of the mutual information. As discussed in Section 2.5.5, this measure is strictly
positive and has a well-defined upper-limit for the case of a discrete distribution.
21
The quantities ŝl (ql ) and x̂(y) are specific functions drawn from the distributions.
22
Note that the interpretation of this quantity requires the designer to ensure that the number of samples
in the ensemble is sufficient to guarantee that the law of large numbers is satisfied and therefore that the
resulting distribution is a valid approximation of the ‘true’ distribution.

The mutual information can be used to estimate the degree of contribution of a given
information source to a given sink. When considering the operation of an AUGV under
the conditions described in Section 1.2, this allows the system to quantitatively assess the
relative importance of data sources to the current operational scenario. For example, a
visible-spectrum camera may provide no data during night-operation, while a radar device
continues to operate effectively. While this measure is not sufficient to guarantee the quality
of the data being provided, nor can it ensure that the sensor is operating correctly or
consistently, it provides important quantitative information to the developer and to the
system regarding the statistical dependencies between the various quantities.
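One simple realisation of this measure is a plug-in (histogram) estimate of the mutual information computed from the ensemble of Equation 4.31. The sketch below uses scalar summaries of the paired functions and synthetic data; both are assumptions made purely for illustration, and the ensemble is taken large enough for the law-of-large-numbers caveat noted above.

```python
import numpy as np

def mi_from_samples(a, b, bins=16):
    """Plug-in mutual-information estimate (bits) from an ensemble of paired samples."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])).sum())

# Hypothetical ensembles: a day-time camera cue that tracks the state well, and a
# night-time camera cue that is essentially unrelated noise.
rng = np.random.default_rng(3)
state = rng.normal(0.0, 1.0, 20000)
camera_day = state + rng.normal(0.0, 0.3, 20000)
camera_night = rng.normal(0.0, 1.0, 20000)

print("I(state; day camera)   ~", mi_from_samples(state, camera_day))
print("I(state; night camera) ~", mi_from_samples(state, camera_night))
```

The first estimate is large and the second close to zero, quantifying the kind of situation described above in which a visible-spectrum camera contributes little during night operation.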

4.5.5 Sensor Model Performance

While the measures of Sections 4.5.1- 4.5.4 capture the characteristics of the general trans-
forms of Section 4.3, it was noted in Section 3.4.2 that the application of the transforma-
tion concepts to the incorporation of sensory data is an important special case of these
approaches. Recall that the information from a sensor (the measured function ẑl (vl ) say)
can be combined with the current estimate of the state (as described by P [x(y)]) through
the use of Bayes’ rule and a forward or generative sensor model. This model was written
in Equation 3.54 as

P [ zl (vl ) | x(y) ] (3.54)

where the random variable zl (vl ) ∈ Zl with vl ∈ Vl represents the class of possible sen-
sory observations.23 Given a particular observation of the sensor, the resulting likelihood
distribution P [ẑl (vl )|x(y)] is used to update the state according to Equation 3.55,

\begin{equation} P[\, x(y) \mid \hat{z}_l(v_l)\,] = \frac{P[\, \hat{z}_l(v_l) \mid x(y)\,]\; P[x(y)]}{P[\hat{z}_l(v_l)]}. \tag{3.55} \end{equation}

Equation 3.55 should be interpreted as the response to the comparison of the anticipated
statistics of possible sensory observations (given the previous state) to the actual mea-
surement, and it was noted that the essential information preserving characteristics of the
sensing process are encoded entirely in the forward model of Equation 3.54. Since this
23
In the notation of the transformations of Section 4.3 the sensory observation is equivalent to the feature-
space functional form sl (ql ).

conditional distribution is fully determined by the selection of the functional form of the
state and the physics of the sensor it is unnecessary to assess the performance of the trans-
formation with respect to the ‘task’ of obtaining a sensory observation. Instead, consider
the task-agnostic measures of Section 4.5.1,

H(ZV l ) = I(ZV l ; XY ) + H(ZV l |XY ), (4.16)

and recall that the two terms can be interpreted as measures of the quality of the lossless
transformation and the quantity of ‘noise’ added respectively. The notion of performance
implies that these quantities allow the engineer to assess the viability of different sensor
types and implementations, such as when determining which sensors are appropriate for a
given system.

Consider the mutual information term for two different sensing modalities, such as a camera and an imaging radar, denoted by IC (ZVC ; XY ) and IR (ZVR ; XY ) respectively. As these two quantities share the same input distribution (the state) then their values will be commensurate. The mutual information can be interpreted as the quality of an ideal (that is, 'noise-free') sensor of the given type and is not affected by the introduction of additional noise by a given implementation.24 For this reason the comparison of IC (ZVC ; XY ) and IR (ZVR ; XY ) will reveal which sensing modality captures more information about the given state function.

While the mutual information guides the selection of the modalities of sensing, the compar-
ison of particular sensors of the same modality is also of importance. For example, what is
the advantage to a system of utilising a 48-bit camera over a 24-bit one, or a colour camera
over a grey-scale device? Since the entropy term H(ZV |XY ) captures the dimensionality
and structure of the output space, it is misleading to compare the entropies of two devices
of different modality, or to compare the entropy of a continuous device to a discrete one.
If the measures can be made commensurate, however, then the conditional entropy allows
the direct comparison of the quantity of ‘noise’ added to a measurement by the sensor.

Finally, it was noted in Section 3.2.2 that the sensor-space concept allows the designer to
consider the important special case where individual sensors are assumed to be independent
24
This can be seen from the fact that the addition of information (that is, entropy) to either the input
or output marginal distribution affects the joint and conditional entropies, but not the mutual information
itself.

of one another. These cases are shown in Figures 3.3 and 3.4; in these figures the individual
sensors are assumed to correspond to independent information and, in the second case, the
cues were derived from a common source. Firstly, note that the representation of Figure
3.3 was written as a series of separate representations, which is equivalent to a series of
orthogonal, independent quantities in a single representation. Consider the effect on the
expected observation from Sensor 1 of the representational parts associated with Sensor
2,

P [z1 (v1 )|x1 (y), x2 (y)] = P [z1 (v1 )|x1 (y)] , (4.32)

because of the assumed independence. This means that the measures associated with each
of the sensors can be calculated exactly as before with
 
\begin{equation} H[Z_{V_l} \mid X_l] = E_{X_l Z_{V_l}}\!\left[ \log_b \frac{1}{P[\, z_l(v_l) \mid x_l(y)\,]} \right] \tag{4.33} \end{equation}

and each of the independent sensor models can be considered in an identical manner to the
previous cases.

However, the actual degree of dependence of the sensors will have a significant impact on
the performance of the system. The limiting case involves two sensors which are actually
identical, say, as alternate measurements from the same sensor. In this case the two sources
of information will be combined separately and the performance of the system will be
reduced when the data is treated as if it were independent. If the data from the two sensors
can be associated into pairs {z1 (v1 ), z2 (v2 )} then the ensemble of observations can be used
to construct the joint distribution and, if the data were independent, then this would yield

P [z1 (v1 ), z2 (v2 )] = P [z1 (v1 )] P [z2 (v2 )] . (4.34)

Fortunately, the mutual information gives a direct measurement of the degree to which this
expression holds,
 
\begin{equation} I(Z_{V_1};\, Z_{V_2}) = E_{Z_{V_1} Z_{V_2}}\!\left[ \log_b \frac{P[z_1(v_1), z_2(v_2)]}{P[z_1(v_1)]\; P[z_2(v_2)]} \right]. \tag{4.35} \end{equation}

Since the mutual information between two sensors is independent of an arbitrary bijective
transformation between the sources, then this measure is especially useful as it becomes

possible to compare the statistical independence, or otherwise, of sensors of arbitrarily


complicated modalities such as vision, radar and audio. It is necessary, however, to recognise
that this measure should only be applied in this instance to determine the appropriateness
of the independence assumptions. Importantly, this measure does not correspond to the
overall utility of the sensory system, but rather to the dependencies between a pair of
sensors.25

4.6 Conclusions

This chapter has introduced a comprehensive framework for constructing probabilistic trans-
formations of the functional representations introduced in Chapter 3. Specifically, the func-
tion forms of that chapter were shown to provide the input distribution for a probabilistic
transformation. The characteristics of this transformation were shown to be contained in
the conditional probability distribution P [sl (ql )|x(y)]. It was shown that this distribution
contains the situation-independent characteristics of the transformation. Furthermore, the
application of a model of this form was shown to correspond to the utilisation of exter-
nal knowledge to guide the re-interpretation of the data and does not correspond to the
introduction of new information into the distributions.

The process of reasoning based on the data contained in the state was shown to correspond
directly to the application of transformations of this kind. Specifically, each particular task
necessary for the operation of a system can be shown to give rise to a distinct feature space
in which the relevant parts of the state are emphasised and the remainder suppressed. In
general, these transformations represent highly lossy compressions in that the information
relevant to a particular task will usually be a very small subset of the original state. An
important special case of these transformations corresponds to the sensor models of Section
3.4.2 which, counter-intuitively, correspond to a transformation from the state to the sensor
space and not from the sensor space to the state.

While the conditional distribution was shown to correspond to the general principle, three
important effects were proposed as giving rise to a wide class of possible transformations of
functional forms. These were identified as:
25
It is also possible to consider the mutual information between three (or more) sensors using the three-
term expressions of Section 2.5.7, though the measure can now be negative and care must be taken to ensure
that the value is interpreted appropriately.
Figure 4.13: The Data-Abstraction Process. The information contained in the state dis-
tribution P [x(y)] is considered the source for transformations into a new feature space
sl (ql ). Each transformation is generated through a transformation model P [sl (ql )|x(y)]
which can be constructed to take into account deterministic effects, noise models, and un-
certainty in the domain Y . The resulting distributions P [sl (ql )] provide interpretations
of the data based on the external knowledge encoded in the transformation model. The
independence of the feature spaces can be examined, along with the effects of compound
transformations. In this diagram the quantities for each part of the model are shown in
blue and the corresponding sections of the thesis are shown in red.

• a deterministic transformation of the function values, such as in a Fourier or a rectangular-to-polar transformation;

• uncertainty in the resulting function value, such as noise in a sensor system or a conservative probabilistic transformation; and

• uncertainty in the domain prior to the transformation, such as uncertainty in a sensor location or the classic 'data-association' problem.

In addition to the construction of a convenient form for the development of the transforma-
tions, the importance of the independence of the resulting quantities defining the feature
space was examined. It was noted that the abstract quantities of relevance to a particular
application need not be statistically independent, but that transformations which gave rise
to independent quantities would represent optimal approaches in that they minimised the
degree of redundancy between the individual quantities. Alternatively, the computational
difficulties associated with performing, or identifying, certain transformations gives rise to
the notion of cascaded or composite transformations. While such transformations neces-
sarily suffer from the effects of the data-processing inequality and represent sub-optimal
approaches, they may give rise to computational and storage advantages in a complete
system analysis. These concepts were shown to bear remarkable resemblance to current
models of pre-processing in biological visual systems. The important operations of decision
processing and feature extraction were shown to correspond to the application of transfor-
mations which result in a discrete output space, rather than to a special operation involving
the data. In particular, the discretisation could occur at any stage in the model: as a sen-
sor model, as a primary transformation, or as a secondary transformation in a composite
system.

Finally, the performance of the transformations was shown to be readily measurable using
the theory of Chapter 2. Two specific classes of performance were identified: those relating
to the off-line performance of a transformation, and those relating to the performance in
a particular situation. The first class were shown to relate to the characteristics of the
conditional distribution P [sl (ql )|x(y)] and could be divided into three categories: task-
agnostic information preservation, task-specific information preservation, and task-specific
cost evaluation. The second class, however, relates to the actual statistics in the particular
distributions involved, and measures the contributions of individual parts of the model to

other elements within it. This approach was shown to yield the special case which measures
the contributions of individual sensory cues to elements within the state or to a downstream
feature space.

Similarly to Chapter 3, the material presented in this chapter does not represent a recipe for the construction of practical systems; rather, it presents a framework for assessing the assumptions and approximations made in the development of such systems. As such, the
engineer must still consider the application specific details such as computational cost,
storage limitations, likely downstream tasks and the effects of subsystem failures. Since
the model does not provide a mechanism for designing such systems, the limitations of the
presented transformation model are significant - in particular, the three-part transformative
model of this chapter does not represent the most general type of transformation possible,
rather it represents the nature of many commonly used methods. As discussed in the
examples of Chapter 5, however, the transformations necessary for many operations can be
concisely described using this technique. Each of the examples in that chapter are discussed
with reference to the structural elements shown in Figure 4.13.

Overall, the framework of this chapter allows the functional forms of Chapter 3 to be trans-
formed in an arbitrary manner to allow external knowledge to guide the re-interpretation
of the data. This framework allows both downstream interpretations and decisions and also
the sensor models of Section 3.4.2 to be evaluated effectively. Furthermore, utilising the
functional models of Section 3.3 and the measures of Chapter 2 the performance of these
transformations can be explicitly evaluated.
Chapter 5

Application of Model

5.1 Introduction

This chapter examines the application of the theoretical model of Chapters 2–4 to several
interesting problems. The primary goal of this chapter is to balance the detail and subtle
complexities of the material presented in those chapters by considering the application to
several realistic, but simplified, applications. As noted in Section 1.2, this thesis is motivated
by application to an Autonomous Uninhabited Ground Vehicle (AUGV), but it should be
clear that the scope of the model remains substantially more general. Of particular
interest here are the issues associated with the construction, manipulation and interrogation
of functional representations, and the important computational and algorithmic implications
of such models.

The utilisation of functional models naturally suggests a framework which supports asyn-
chronous, simultaneous¹ operation of the procedures of sensing and reasoning as described
in Chapters 3 and 4 respectively. Each source of sensory information and each reasoning
transformation should operate independently. A computational framework for implement-
ing such models has been constructed in the C++ language using threading and will be
examined in detail in Section 5.3. In addition to the organisational structure of the system,
a sophisticated, generalisable and extensible data structure has been constructed to manage
the data stored within such a representation. The data structure allows multi-resolution
domain sampling, efficient thread-safe access, arbitrary data tensor storage and probabilistic
data management. Importantly, each of these aspects is managed independently by the
framework, allowing extensions and specialisations to be incorporated easily. In addition
to the structures of the software, the operations of sensing, sequential manipulations and
reasoning (Sections 3.4.2, 3.4.3 and 4.2 respectively) form logical structures within the
system.

¹ The implementation of this style of computing is achieved using multi-threading and multi-process
frameworks, with an obvious extension to multi-computer configurations. Note that the material discussed
here remains independent of the actual technique utilised and future developments involving computing
architectures are implied also.

The application of the framework will be considered for four important case studies:

Global planning for AUGV systems will consider the construction of large-scale, multi-
resolution terrain and environmental models supporting the development of sophis-
ticated planning and control strategies for ground vehicle deployment in arbitrary
operating environments;

Local AUGV navigation in which data is utilised for the task of supporting autonomous
navigation in outdoor, unstructured environments; particular focus will be placed on
the synergy between this application and the previous case;

Terrain imaging and estimation will be examined in a mining application, both to sup-
port the development of AUGV mining systems, but additionally to facilitate more
detailed resource management techniques; and

Construction of richly descriptive maps where the advantages of the theoretical and
computational frameworks allow the rigourous development of applications such as
DenseSLAM (Nieto 2005).

The results shown for these examples are based on work completed as part of two distinct
projects: the development of an autonomous ground vehicle²; and the development of a
practical computational framework for combining spatially-significant terrain information.
Due to the scope of both projects, the work of several additional people is included in
these sections. Significantly, the framework of the first and third examples was completed
with Ross Hennessy, though the author was the primary developer of the resulting software
system. Likewise, the AUGV project includes the work of many others, notably James
Underwood and Steve Scheding. Much of the discussion of the Local AUGV navigation
problem relates to the application of the model of this thesis to that problem. The results
shown are indicative of this relationship, but are drawn from the work of Underwood and
others rather than having been constructed specifically for this purpose. The final example
is a theoretical discussion only and seeks to identify a particularly interesting extension
based on the implications identified in this thesis.

² This vehicle is described in detail in Chapter 1.

5.2 Implementing Functional Representations

This section summarises several important aspects of implementing functional representations
and examines the practical implications of constructing real systems using the model
presented in this thesis. Specifically, it examines two separate areas of interest: the de-
velopment of implementable functional representations, and the operations which must be
performed on that representation. The approach considered in detail here involves the
development of an asynchronous database to store and manage the data contained in a
functional form. The requirements for this practical implementation are summarised in
Section 5.2.3.

5.2.1 Representing a Functional Form

Section 3.3 introduced the functional forms proposed as an essential part of the model
presented in this thesis. It was argued that the estimation of an arbitrary tensor-valued
function over an arbitrary domain supported interpretations of traditional state-space mod-
els, ‘ordinary’ function models and arbitrary function-space models. The functional forms
of interest were written according to Equation 3.1,

x(y) ∈ X with y ∈ Y, (3.1)

where Y represents the (arbitrary) domain over which the function is defined, y an element
from that domain, and X the space spanned by the components of the function x(y).

The discussions of Section 3.5 identified that the estimation of the functional forms, that
is, P [x(y)], represents a fundamentally intractable problem with a dimensionality scaling
with the cardinality of the domain Y . However, it was also shown that approximation
of the functional form using a truncated mixture form represented a general methodology
for achieving a practical implementation. Furthermore, the formulations of that chapter
allow the measures of Chapter 2 to be utilised to quantify and assess the impact of that
approximation on the fidelity of the resulting implementation. The approximation is written
according to Equation 3.72,


\[ x(y_j) \;\rightarrow\; x_\alpha(y_j) = \sum_i \alpha_i\, e_i(y_j) \qquad \text{where } \{e_i(y_j)\} \text{ spans } F_Y \subset F_D, \tag{3.72} \]

which defines the sub function-space F_Y, and while this is written for a discrete domain Y,
the same superposition holds equally well for continuous domains, replacing y_j by y.

The complexity involved in manipulating and maintaining a functional form of this kind is
clearly related to the number of basis functions which are required to calculate the value
at any given location in the operating domain Y. For example, a piecewise constant model
involves a single basis function for any element of Y, while a Fourier series requires all
frequency components for any element.³ Importantly, the representation transforms the
functional forms of interest into a finite set of parameters and the estimation of the functional
forms is shown to be equivalent to the estimation of the values of those parameters,

P[x(y)] ⇔ P(α). (3.19)

The complexity of evaluating the resulting distribution over the domain X for a specific
location in the environment y ∈ Y is related to the compactness of the region influenced
by individual parameters as examined in Section 3.3.2. For clarity this chapter focusses on
representations which are spatial, that is, directly interpretable according to the Dirac basis
model x(y) and, furthermore, the decompositions will be assumed to be spatial in nature,
providing a domain-compact sampling of the original function.
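To make the preceding cost argument concrete, the following minimal sketch (illustrative only; none of these names are drawn from the thesis software) evaluates a truncated expansion x_α(y) = Σ_i α_i e_i(y) using a piecewise-constant, domain-compact basis, so that only the few basis functions whose support contains y contribute to the value at that location.

```cpp
// Minimal sketch, assuming a piecewise-constant (indicator) basis with compact support.
// All names here are illustrative; this is not the thesis implementation.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

struct BasisFunction {
    double centre;     // centre of the supported interval in the domain Y
    double halfWidth;  // half-width of the supported interval
    // Indicator basis: 1 inside the interval, 0 outside.
    double operator()(double y) const {
        return std::fabs(y - centre) <= halfWidth ? 1.0 : 0.0;
    }
};

// Evaluate x_alpha(y) = sum_i alpha_i * e_i(y). With a domain-compact basis only the
// bases containing y contribute, so the evaluation cost reflects the compactness of
// the decomposition discussed above.
double evaluate(const std::vector<double>& alpha,
                const std::vector<BasisFunction>& basis, double y) {
    double value = 0.0;
    for (std::size_t i = 0; i < basis.size(); ++i)
        value += alpha[i] * basis[i](y);
    return value;
}

int main() {
    // Four piecewise-constant patches covering y in [0, 4).
    std::vector<BasisFunction> basis = {{0.5, 0.5}, {1.5, 0.5}, {2.5, 0.5}, {3.5, 0.5}};
    std::vector<double> alpha = {1.0, 2.0, 0.5, -1.0};   // the parameters being estimated
    std::cout << evaluate(alpha, basis, 2.2) << std::endl;  // prints 0.5
    return 0;
}
```

A Fourier-style basis would require the loop to touch every parameter for every query location, which is precisely the evaluation-cost trade-off noted above.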

An important consideration when selecting the manner by which the operating domain is
approximated relates to the number of parameters required to capture the area of interest.
It is well-known that in realistic situations the environmental information is non-uniformly
distributed through the area of interest to the system, different regions having different char-
acteristic spatial complexities. Consider the difference between representing a flat lake-bed
and a mountain road in a 2.5D terrain model: the lake bed may be well approximated over
distances of tens or hundreds of metres, while the mountain roadway requires a much finer
level of spatial detail. Practical systems operating in expansive environments must, there-
fore, be capable of either adjusting the local resolution of the representation or representing
the data using a multi-resolution approach across the entire domain. The implementation
of a specific data structure which supports efficient spatial operations within such a
multi-resolution framework is presented in Section 5.3.2.

³ Note, however, that this does not necessarily imply that maintaining and utilising such models will be
less efficient: in those cases the sensor models and data transformations all operate directly on the
transformed function, that is, on the {α_i}.

5.2.2 Computational Operations on Functional Forms

Chapters 3 and 4 discussed the three types of operation which are performed on a gen-
eral functional representation: the incorporation of new data using a sensor; the sequential
manipulation of the representation according to some model of the propagation characteris-
tics; and the extraction of task-specific information to generate a feature-space. Figure 5.1
shows a functional representation with three sensors contributing information, the sequen-
tial propagation model and four derived feature spaces. Clearly each of these operations
is performed asynchronously on the representation and no assumption should be made re-
garding the relative data rates of the sensors, propagation models or feature spaces.

Taking the notation of Section 3.4.3, the estimate contained in the state at time k is denoted
by

\[ P\left[\, x_k(y) \mid Z^k, U^k, x_0(y) \,\right], \tag{5.1} \]

where the past observations Z^k, state propagation inputs U^k and initial state x_0(y) are
the only factors which affect the resulting distribution. The nature of the three operations
identified above is significant when considering an implementation of such a model:

Sensor Measurements: Estimates from the state representation corresponding to the
region of influence of the particular observation are transformed to generate the like-
lihood update for that region. Importantly, note that as drawn, the information flows
out of the state, is ‘corrected’ by the observation, and is updated back into the state
representation. The operation is limited to the subset of the domain Y which the
measurement is capable of affecting and since the operation changes the value of the
data, that part of the domain cannot be accessed by another operation until the cur-
rent observation has updated the internal values. Specifically, the update is achieved

Figure 5.1: Implementing functional representations. The representation in the state is
constructed using information from multiple sensors and transformed to generate multiple
feature spaces; all operations are performed asynchronously.

using the models of Section 3.4.2 and Equation 3.70,


\[ P\left[\, x_k(y) \mid Z^k, U^k, x_0(y) \,\right] = \frac{P\left[\, z_{\cdot k}(v_\cdot) \mid x_k(y) \,\right]\, P\left[\, x_k(y) \mid Z^{k-1}, U^k, x_0(y) \,\right]}{P\left[\, z_{\cdot k}(v_\cdot) \mid Z^{k-1}, U^k, x_0(y) \,\right]}, \tag{3.70} \]

where the current state is encoded in the observation prior P[x_k(y) | Z^{k−1}, U^k, x_0(y)]
and the observation is incorporated using the forward sensor model P[z_{·k}(v_·) | x_k(y)].
This forward model represents the introduction of external knowledge into the system
in the form of a re-interpretation of the state information quantities to yield the sensor
measurement quantities.

State Propagation: As with the sensor measurements, state information is modified by
the measured, or estimated, control parameters and the updated information passed
back into the representation. However, the process affects the entire state and care
must be taken to complete the entire update process prior to any new information
becoming available from the sensors. Specifically, Section 3.4.3 demonstrated that the
update took the form of Equation 3.67,

 
\[ P\left[\, x_k(y) \mid Z^{k-1}, U^k, x_0(y) \,\right] = \int P\left[\, x_k(y) \mid x_{k-1}(y), u_k \,\right] P\left[\, x_{k-1}(y) \mid Z^{k-1}, U^{k-1}, x_0(y) \,\right] \mathrm{d}x_{k-1}(y), \tag{3.67} \]

where the state prior is denoted by P[x_{k−1}(y) | Z^{k−1}, U^{k−1}, x_0(y)] and the propagation
model for the system is contained in the transition model P[x_k(y) | x_{k−1}(y), u_k].
As with the sensor measurements above, the state propagation model represents the
introduction of external knowledge into the interpretation of the data, this time causing
the data to be re-interpreted according to a modelled transitional behaviour.

Feature Transformation: Unlike the previous operations, the extraction of information
from the state to generate a feature space distribution does not affect the information
contained in the representation and so can be performed at any time the state repre-
sentation is valid, that is, when not being updated by another operation. The most
significant difference between this operation and the others, however, relates to the
spatial extent of the part of the representation which is used to generate the resulting
feature space. While sensory observations are usually constrained to a small subset of
the space, and temporal propagation affects the entire state, the reasoning operations
can utilise any subset of the space, including the entire space. This means that the
implementation details will directly affect the impact of individual operations on the
performance of the representation, such as when the operation prevents others from
occurring. As the operation only reads information from the representation, these
effects are limited to interactions with those operations that change the data. The
transformation is given by Equation 4.3 from Section 4.3,

\[ T_{S_l} : P[x(y)] \;\rightarrow\; P[s_l(q_l)] = \int P\left[\, s_l(q_l) \mid x(y) \,\right] P[x(y)]\, \mathrm{d}x(y), \tag{4.3} \]

and the characteristics of the transformation are contained in the conditional distribution
P[s_l(q_l) | x(y)]. This can be written in the form of Equation 4.14,

\[ P\left[\, s_l(q_l) \mid x(y) \,\right] = \int P_{\mathrm{value}}\!\left[\, s_l(q_l) - T_{\mathrm{det}}\{x(y')\} \,\right] P\left[\, y' \mid y \,\right] \mathrm{d}y'. \tag{4.14} \]

This distribution is a special case of an arbitrary conditional distribution which explicitly
models three important effects related to functional representations. These are: a
deterministic transformation of the functional values, such as a polar-rectangular
transformation and other characteristics of a physical sensor model, T_det{x(y)}; the
incorporation of the effects of uncertainty or noise in the resulting value of the function,
P_value[s_l(q_l) − T_det{x(y)}]; and management of uncertainty regarding the actual
location in the domain which is transformed, P[y′ | y]. Once more, the operation
represents the introduction of external knowledge into the system as a reinterpretation
of the data in the state, and this effect is summarised by the conditional distribution
of Equation 4.14. (A minimal numerical sketch of the feature-space transformation of
Equation 4.3 for a discretised state is given after this list.)
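The following minimal numerical sketch (illustrative only; the state and feature alphabets are assumptions, not taken from the thesis) shows the discrete analogue of Equation 4.3: a discretised state prior is marginalised through a conditional distribution P[s_l | x] to yield the feature-space distribution.

```cpp
// Sketch of the discrete form of Equation 4.3: P[s] = sum_x P[s | x] P[x].
// The three-valued state and two-valued feature space are purely illustrative.
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    // Prior over a three-valued state x, e.g. {flat, rough, obstacle}.
    std::vector<double> px = {0.6, 0.3, 0.1};
    // Conditional P[s | x] for a two-valued feature s, e.g. {traversable, not traversable};
    // rows index s, columns index x.
    std::vector<std::vector<double>> psGivenX = {
        {0.95, 0.50, 0.05},
        {0.05, 0.50, 0.95}
    };
    // Marginalise the state to obtain the feature-space distribution P[s].
    std::vector<double> ps(psGivenX.size(), 0.0);
    for (std::size_t s = 0; s < ps.size(); ++s)
        for (std::size_t x = 0; x < px.size(); ++x)
            ps[s] += psGivenX[s][x] * px[x];
    std::cout << "P[traversable] = " << ps[0]
              << ", P[not traversable] = " << ps[1] << std::endl;  // 0.725 and 0.275
    return 0;
}
```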

5.2.3 Requirements for a Computational Framework

Taken together, these properties result in several important characteristics which are desir-
able in a framework aiming to implement a practical system using functional models. These
can be divided into five main areas:

1. The System Structure should support the parallelisation of the asynchronous tasks
of incorporating sensory measurements, temporally propagating the state representa-
tion and developing downstream feature spaces, and should take into account the
possible interactions of these operations.

2. The Data Structure must support the efficient implementation of the parallel op-
erations noted above on the system. The structure should also address the issues
associated with the storage of arbitrary functional values in both the domain and
the value-space of the functions. The utilisation of a multi-resolution technique for
managing the domain effects can be expected to provide significant advantages with
respect to both computational and storage costs. Also, the functional forms of the
model suggest that the probabilistic management of an arbitrary tensor-valued func-
tion should be directly supported by the structure.

3. Sequential State Management is required if the representation is non-static and
naturally incorporates effects such as temporal propagation, representational resam-
pling and other algorithmic checks of consistency. Since this operation acts on the
entire state representation, its interactions with other operations may be significant.
In particular, if the entire state must be updated in a single operation during which no
new data can be added and the intermediate representation is not valid for extraction
by a client, then this procedure will be mutually exclusive with any other operation.

4. Sensing Operations incorporate the effects of sensory data on the state. Specifically
these affect the subset of the state to which the given observation corresponds. Since
the operations involve utilising the current data in the state and correcting according
to the forward sensor model and the observation itself, the effect will modify the
contents of that region; this implies that the region under influence should remain
unchanged by any other operation while an update is occurring.

5. Reasoning Operations utilise the current information in the state to generate a
re-interpreted feature-space representation of that data. Importantly, any reasoning
operation may utilise any subset of the original data, including the whole state itself.
These operations do not change the data in the state, but may require substantial
computational manipulation of that data. For this reason, it is advantageous to
consider two separate stages: obtaining a copy of the data, and transforming it into
the new form. Furthermore, it is possible that the operation need not read the entire
structure at a single instant, but could sequentially copy different segments of the data
to limit the influence of the read operation on sensing or state propagation operations.

5.3 A Computational Framework

Section 5.2 highlighted the important characteristics of the model presented when applied
to the construction of a practical implementation and, in particular, the model was shown to
yield the overall structure shown in Figure 5.1. For each element of this figure, the relevant
equations describing its essential nature and the implications of that form were discussed
and Section 5.2.3 summarised the essential requirements for such systems. This section
examines the design of a software architecture developed according to these considerations.
In particular, the five main points of Section 5.2.3 are considered in detail with respect
to the implementation of a point-based sampling approach to the representation problem.
While the data structure discussed allows for polymorphism to provide an extension of the
system to non-point based approaches, the computational and other aspects of this section
are restricted to the simpler models.

5.3.1 System and Network Structure

A significant characteristic of a practical implementation is the ability to support the
asynchronous nature of the operations involved. It is well known that simple systems,
particularly those using traditional state-space models⁴, can be developed to sequentially handle
the arrival of sensory information (Bar-Shalom et al. 2001, Ch.5). However, in cases where
the domain is significantly more expansive than the region of influence of a particular sen-
sory observation, or where that observation affects only a subset of the quantities defined
for that region, there will be advantages to allowing multiple observations to be combined
with the representation simultaneously. Much more importantly, however, the asynchronous
nature of sensory observations is usually only considered with respect to the manipulation
of the state representation itself, and not to the feature-space quantities to be derived from
it. It should be obvious that different tasks require access to a wide variety of different
subsets of a representation at a wide variety of update rates. This suggests that while it is
reasonable to simply interleave the sensory data to maintain the state estimate, the process
of interpreting the data should be capable of running at any rate without reference to the
timing of the sensors.
⁴ That is, a Dirac function basis over the finite-dimensional, discrete set Y consisting of a labelled element
for each quantity of interest. For example, {x, y, z, θ, φ, ψ} in a platform pose estimation case.

Figure 5.2: A communications structure for the main system components. The thick lines
represent communications channels, including function calls, messages or more advanced
communications links such as TCP/IP.

The communications requirements of Figure 5.1 are summarised in Figure 5.2 which shows
that the operations of sensory observation, state manipulation and data extraction have
distinct styles of interaction with the central state representation. The framework of this
section approaches the parallelisation through the utilisation of multiple threads within
a single process. The extension from this arrangement to multiple processes, or multiple
machines, communicating via pipes or network communications rather than function calls,
is straightforward and an example is depicted in Figure 5.3. The major disadvantage of
this approach relates to the naïve implementation of feature-space clients accessing the en-
tire state representation by transferring the whole distribution at each iteration. Rather,
since the data is maintained probabilistically, the entire representation can be maintained
separately by different clients through the transmission of the updates and modifications
applied to the main representation. These messages represent the changes to the informa-
tion content and are denoted as the ‘delta’ information in an example of this ‘distributed’
representational form shown in Figure 5.4.⁵
⁵ An analogy can be drawn here between this approach and the channel-filter implementations for
distributed data fusion algorithms (Nettleton 2003, §3.4).
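As an illustration of the 'delta' scheme of Figure 5.4, the sketch below shows one possible shape for an incremental update message and its application at a client-side mirror of the representation. The type and field names are assumptions made for the example, not the interface of the thesis software.

```cpp
// Illustrative sketch of a 'delta' message and its application at a remote mirror.
// Only the samples changed since the last message are transmitted.
#include <cstdint>
#include <map>
#include <vector>

struct PointDelta {
    std::uint64_t pointId;  // identity of the affected sample in the domain
    double value;           // new or updated functional value at that sample
    bool removed;           // true if the sample was deleted from the representation
};

struct DeltaMessage {
    std::uint64_t sequence;          // lets a client detect missed updates
    std::vector<PointDelta> deltas;  // incremental changes only
};

// The client keeps a mirror of the representation and applies each delta in turn,
// avoiding transmission of the full distribution at every iteration.
void applyDelta(std::map<std::uint64_t, double>& mirror, const DeltaMessage& msg) {
    for (const PointDelta& d : msg.deltas) {
        if (d.removed) mirror.erase(d.pointId);
        else           mirror[d.pointId] = d.value;
    }
}

int main() {
    std::map<std::uint64_t, double> mirror;
    DeltaMessage msg{1, {{42, 0.7, false}}};
    applyDelta(mirror, msg);
    return mirror.size() == 1 ? 0 : 1;
}
```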

Figure 5.3: A multi-process extension of the system structure of Figure 5.2. The central
box represents the structure from that figure as implemented in a central process, while the
sensory gathering and data reasoning operations can be managed in separate processes.

Figure 5.4: A 'delta' manager approach to efficient distribution of representational data
across a bandwidth limited connection. The two processes require extensive access to the
data contained in the representation, but are connected by a bandwidth-limited link (shown
dashed). Instead of transmission of the contents of the DataStorage in either process, the
DataDeltaManager monitors the changes to the representation and only these incremental
changes are transmitted.

5.3.2 The Data Structure

The system structures of the previous section have considered the natural separation of
the operations of sensing, state propagation and feature-space processing and, in order to
support these schemes, the programmer requires access to the data contained in a represen-
tation which allows asynchronous, parallel data access. Section 5.2.3 showed that the four
main requirements for such a data structure are to provide:

• multi-resolution and flexible domain access;

• multi-threaded access;

• the capability to store arbitrary tensor information for each element of the domain;
and

• support for probabilistic methods for manipulating this information.

This section considers the design of a C++ framework for creating and managing represen-
tations of this form.

Multi-resolution Domain Sampling

This aspect of the representation relates to providing access to the individual elements of
W (or to the domain Y directly under certain ‘spatial’ parameterisations) in an efficient
and appropriate manner.⁶ It is assumed that the representations of interest to the user
correspond to those for which this domain is inherently ordered: a spatial grid representation
or a time-series is considered, for example, but arbitrary topological spaces are not. As
noted earlier, the benefits of a multi-resolution data structure for providing an efficient
encoding of the logical elements of the domain W are clear; when the dimensionality of
the domain is k, a k-D tree provides an appropriate implementation.

Under this scheme, a tree structure is used to recursively subdivide the space W according
to the quantities which order the elements of that space. For example, in the case where
W ⊂ R³, the space is subdivided spatially and the degree of subdivision is controlled by
the level of detail required in that region. This scheme is depicted in Figure 5.5 where
⁶ See Section 3.3.1 for details.

Figure 5.5: The structure of a k-D subdivision of the domain W. Each box in this figure
corresponds to a region of the space and the child regions of any box represent a subdivision
of that region. Only the leaf elements of this tree (shown shaded in blue in the figure) contain
the data, with other nodes providing the structure. The number of nodes at each subdivision
is normally given by 2^k for a k-dimensional space.

each box represents a region of the space W . Each of these ‘nodes’ can be subdivided as
necessary into several ‘child’ nodes; there are usually 2^k children for a k-dimensional tree
and the subdivision is usually symmetric, though provided that each node has a mechanism
for storing the subdivision properties this is not essential. Only the ‘leaf’ nodes (coloured
blue in the figure) contain data for the representation; the other nodes exist to impart
spatial structure to the data storage. See Preiss (1998, Ch. 9) for a detailed discussion of
the design of data structures such as this.
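The sketch below illustrates the recursive subdivision of Figure 5.5 for a two-dimensional domain: a node stores samples while it remains a leaf and subdivides into 2^k children once a capacity threshold is exceeded. It is a minimal illustration under assumed names, not the kDTree implementation described in the text (for example, it enforces no minimum cell size).

```cpp
// Minimal 2-D (k = 2) illustration of recursive domain subdivision; names are assumptions.
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

struct Sample { std::array<double, 2> y; double value; };

struct Node {
    std::array<double, 2> lo, hi;                  // region of the domain covered by this node
    std::vector<Sample> samples;                   // non-empty only while this node is a leaf
    std::array<std::unique_ptr<Node>, 4> child;    // 2^k children once subdivided

    bool isLeaf() const { return !child[0]; }

    void insert(const Sample& s, std::size_t maxLeafSize = 8) {
        if (isLeaf() && samples.size() < maxLeafSize) { samples.push_back(s); return; }
        if (isLeaf()) subdivide(maxLeafSize);      // push existing samples down one level
        child[indexOf(s)]->insert(s, maxLeafSize);
    }

    std::size_t indexOf(const Sample& s) const {
        const double mx = 0.5 * (lo[0] + hi[0]), my = 0.5 * (lo[1] + hi[1]);
        return (s.y[0] >= mx ? 1u : 0u) | (s.y[1] >= my ? 2u : 0u);
    }

    void subdivide(std::size_t maxLeafSize) {
        const double mx = 0.5 * (lo[0] + hi[0]), my = 0.5 * (lo[1] + hi[1]);
        for (std::size_t i = 0; i < 4; ++i) {
            child[i] = std::make_unique<Node>();
            child[i]->lo[0] = (i & 1u) ? mx : lo[0];
            child[i]->hi[0] = (i & 1u) ? hi[0] : mx;
            child[i]->lo[1] = (i & 2u) ? my : lo[1];
            child[i]->hi[1] = (i & 2u) ? hi[1] : my;
        }
        for (const Sample& s : samples) child[indexOf(s)]->insert(s, maxLeafSize);
        samples.clear();                           // this node now only imparts structure
    }
};

int main() {
    Node root;
    root.lo = {0.0, 0.0};
    root.hi = {100.0, 100.0};
    for (int i = 0; i < 20; ++i)
        root.insert({{5.0 * i, 2.5 * i}, 1.0});    // dense data forces local subdivision
    return root.isLeaf() ? 1 : 0;
}
```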

This defines the subdivision structure of the representation, but not the actual storage of
the data in the leaf nodes. There are obviously many different methods which can be used
within each leaf, including a single value, a structured series of samples and an unstructured
series of samples. For example, within each leaf node the user may wish to store a single
sample from the function x(y), or the leaf can contain a secondary data structure within it.
This second situation corresponds to using the tree to provide a coarse subdivision of the
domain (for searching and data-management purposes) and then storing ‘local’ data within
this structure. If the basic sample is assumed to correspond to a point⁷ within the space
W , then at each leaf node the structure would contain a collection of points.
⁷ Either as a point sample itself, or as the centre or location of a region within it.

Figure 5.6: A UML class-diagram for implementing a k-D tree. Access to the structure is
provided through the class kDTree which contains a single kdNode called the ‘root’ node.
This node corresponds to the entire domain W and is recursively subdivided as needed into
other kDNodes. Each node contains a PointCollection which can only be non-empty for
leaf nodes and represents a collection of Points defining the parameterisation of W. The
StoragePoint class and StorageType object will be discussed in the context of storing
arbitrary data.

Figure 5.6 shows a description of a class structure for this arrangement, with an extension
to allow polymorphism to be utilised within the structure to store data of an arbitrary
type.⁸ The main class for the data structure, providing the public access methods for
manipulating the data within it, is the class kDTree, which contains a single kdNode within
it. This ‘root’ node corresponds to the entire domain W and is the starting point for the
recursive decomposition of the domain. Each node is capable of acting as a parent to a
number of child nodes and each contains a PointCollection; only leaf nodes may contain
a non-empty collection. The StoragePoint and StorageType objects will be discussed
below in the context of representing arbitrary data.

Given the spatial⁹ structure of the tree, the data interface can be accelerated by the use of
locality-based operations. For example, utilising the efficient spatial quantisation allows for
the efficient implementation of algorithms such as ‘k nearest-neighbours’. Alternatively, dis-
tinct regional operations can be developed; if an application requires access to the functional
values within a given region, then the tree presents an efficient mechanism for obtaining
access to those points. Similarly, adding samples to the representation can be achieved
more efficiently if the region containing those points is known at the time of insertion. As
the application utilising the data structure should be agnostic to the internal structure of
⁸ The diagram is depicted using a Unified Modelling Language (UML) Class Diagram. UML provides
a comprehensive and sophisticated design paradigm for software development (Object Management Group
2006).
⁹ With respect to the domain W.

Figure 5.7 (panels (a) and (b)): A (k = 3)-D tree for mm-Wave radar data in suburban parkland.
The light-blue lines correspond to the edges of the tree structure and the points are individual
samples from the space W ⊂ R³.
the tree, the access to the points is provided by the utilisation of the C++ standard library
object std::list to generate lists containing appropriate references¹⁰ to the points them-
selves. Figure 5.7 shows examples of a (k = 3)-D tree structure implemented using this
framework. The points are stored as an unstructured cloud for each node and the bound-
aries of the nodes are shown in blue. The recursive subdivision of the space according to
the density of points is clearly demonstrated. The data for these figures was obtained using
a mm-Wave imaging radar in a suburban park environment and the colours relate to an
occupancy measure. More data of this type will be examined in Sections 5.4–5.7.
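Reusing the Node and Sample types from the sketch in the previous subsection, a regional read of the kind described here might look as follows (again an assumed interface, not the thesis code): subtrees whose extent does not overlap the query region are pruned, and matching samples are returned as a list of references in the spirit of the std::list usage described above.

```cpp
// Illustrative regional query over the Node/Sample sketch given earlier.
#include <list>

// True when the node's region intersects the axis-aligned query box [qlo, qhi].
bool overlaps(const Node& n, const std::array<double, 2>& qlo, const std::array<double, 2>& qhi) {
    for (int d = 0; d < 2; ++d)
        if (n.hi[d] < qlo[d] || n.lo[d] > qhi[d]) return false;
    return true;
}

// Collect pointers to all samples falling inside the query region, recursing only into
// overlapping subtrees; the spatial structure of the tree performs the pruning.
void query(const Node& n, const std::array<double, 2>& qlo, const std::array<double, 2>& qhi,
           std::list<const Sample*>& out) {
    if (!overlaps(n, qlo, qhi)) return;
    if (n.isLeaf()) {
        for (const Sample& s : n.samples)
            if (s.y[0] >= qlo[0] && s.y[0] <= qhi[0] &&
                s.y[1] >= qlo[1] && s.y[1] <= qhi[1])
                out.push_back(&s);
        return;
    }
    for (const auto& c : n.child)
        if (c) query(*c, qlo, qhi, out);
}
```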

Thread-safe Data Access

The k-D tree structure provides an appropriate mechanism for managing the spatial com-
plexity of the functional representation of the domain W (or Y ), but it was noted in Sec-
tion 5.3.1 that the data structure for any viable representation must support asynchronous
(possibly simultaneous) access by multiple threads. This is not possible with the direct im-
plementation of the tree as discussed above. Since the addition of new data, or the deletion
of any points, may result in the restructuring of the tree, any operation utilising the
structure must be able to guarantee the stability of that structure (and of the data within
it) over the duration of the process of reading from or writing to it. Furthermore, providing
a single mutex¹¹ at the level of the tree clearly prevents any parallelisation of the access
methods, though it does support a type of asynchronous access.

Before considering the application of the recursive data structure to arbitrary data types,
it should be noted that there is a distinction between accessing the tree in order to read
the data, and acting in a way which changes that data, either by changing the structure
of the tree, the value of points within it or adding or deleting points. In the first case,
the access to the structure can be guaranteed not to change either the structure or the
data contained in it and it is acceptable to allow many different threads to perform these
operations simultaneously. However, changes which affect the structure of part of the tree
are mutually-exclusive with all other operations on that part of the structure. To facilitate
this distinction a Multiple-Read, Single-Write (MRSW) mutex is defined. This object acts
¹⁰ Reference here implies a pointer or C++ reference object, that is, a method of accessing the object
directly, rather than only a copy of the data it contains.
¹¹ That is, a mutually-exclusive data access token which is locked and unlocked by each thread accessing
the structure.

Figure 5.8: Thread-safe access to a k-D tree. The data within each node is protected by
a separate Multiple-Read, Single-Write (MRSW) mutex (shown in green); each structural
node is protected by a separate MRSW mutex (shown in blue shading) for managing changes
to the structure of the tree at that node; and the tree itself is protected by an MRSW mutex
(shown in red) to protect changes such as expanding the root node of the tree.

like a semaphore for read access, tracking the number of access tokens granted, and as a
mutex for write access, allowing only one token. Utilising this object much like a standard
mutex, the programmer can readily provide an appropriate mechanism to protect a given
data object, such as the tree itself or a node within it. All necessary logic for managing the
interactions and sequencing of multiple threads requesting tokens is managed internally by
a ReadWriteMutex class.¹²
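A minimal sketch of such a Multiple-Read, Single-Write mutex is given below. The class name follows the text, but the internals are an assumption: a writer-preferring lock built on a standard mutex and condition variable (the standard std::shared_mutex offers comparable behaviour).

```cpp
// Sketch of an MRSW mutex: many concurrent readers, a single exclusive writer, and
// writer preference so that a requested write lock blocks new read locks.
#include <condition_variable>
#include <mutex>

class ReadWriteMutex {
public:
    void lockRead() {
        std::unique_lock<std::mutex> lk(m_);
        // New readers wait while a writer holds, or is waiting for, the lock.
        cv_.wait(lk, [this] { return !writer_ && waitingWriters_ == 0; });
        ++readers_;
    }
    void unlockRead() {
        std::unique_lock<std::mutex> lk(m_);
        if (--readers_ == 0) cv_.notify_all();   // a waiting writer may now proceed
    }
    void lockWrite() {
        std::unique_lock<std::mutex> lk(m_);
        ++waitingWriters_;
        cv_.wait(lk, [this] { return !writer_ && readers_ == 0; });
        --waitingWriters_;
        writer_ = true;
    }
    void unlockWrite() {
        std::unique_lock<std::mutex> lk(m_);
        writer_ = false;
        cv_.notify_all();                        // wake both readers and writers
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int readers_ = 0;
    int waitingWriters_ = 0;
    bool writer_ = false;
};
```

In use, a node would hold one such object and bracket its read or write accesses with the corresponding lock and unlock calls, much as a standard mutex would be used.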

It should be clear that the act of obtaining a write-lock on any element prevents simultaneous
access to that object throughout the duration of that lock, a case which may have significant
impact on the performance of the structure in a multi-threaded application model like those
shown in Section 5.3.1. Fortunately, the recursive structure of the tree lends itself to an
efficient mechanism for minimising these effects and this is shown in Figure 5.8. Structurally,
the tree consists of nodes which can be subdivided as necessary so the structure at each
node (whether it has children or not) should be protected individually by its own MRSW
mutex. This means that nodes which are not changing act as read-locked elements even if
the data in child nodes is held by the same thread using a write-lock. Thus, changes to the
¹² These include preventing new read locks from succeeding when a write lock has been requested or is in
use and waiting for all reading tokens to be returned prior to granting a write token.
data within a subsection of the tree affect only that portion of the tree which is changing
either in value or in structure; two such mutexes are shown in blue in Figure 5.8. Another
MRSW mutex (shown in red in the figure) is provided at the level of the kDTree object to
allow a new root node to be created if the tree is expanded; in this case the old root node
becomes a child of the new root.

Furthermore, in order to prevent copying the data contained in the tree unnecessarily, such
as when adding new data through a sensory observation or modifying individual point val-
ues¹³, access to the points is implemented as a list of pointers to the objects themselves.
This means that each and every Point object within the tree should be protected individ-
ually. However, the structure shown in Figure 5.6 only provides access to the individual
points through individual PointCollection objects and so it is sufficient to manage the
access MRSW mutexes at this level. An example of this is shown in green in the figure.

The appropriate calls to obtain various MRSW tokens during read and write operations
on the data structure are managed internally through access functions at the kDTree level
(the public interface), at each kdNode, and for each PointCollection. The resulting public
interface for managing the data within the tree is composed of methods for: adding new
points to the tree by providing a standard list of new items to add; deleting existing points
from the tree by providing a list of references to the points to be removed; reading the data
from a given region (returning a list of point references); and reading points from the tree
for the purposes of changing their values. Each of these operations can be accelerated using
the inherent physical structure of the tree itself, and appropriate type definitions have been
used to protect the data obtained from the tree using these methods (such as preventing a
list of read-locked points from being modified in any manner).

Arbitrary Data Tensor Storage

The data structure developed so far provides efficient, thread-safe access to elements defining
individual locations within the domain W , but does not specify a mechanism for the storage
of data at those elements. Here it is argued that the use of polymorphism at the level of the
¹³ The distinguishing characteristic of these operations is that they require the data to remain unchanged
during the operations, but do not utilise the data once the immediate operation has completed. Compare
this with a reasoning operation which may operate on the data for an extensive period of time, but does not
require the protection of the tree during that operation; in this case the data can be copied and the lock
released prior to processing.
points allows the data structure to be easily extended to store an arbitrary data structure
at each point without any modification to the tree itself. This approach is depicted in Figure
5.6 where the class StoragePoint extends the Point class by a template variable of type
StorageType. In terms of the theoretical model, the StorageType object corresponds to
the values of the tensor-valued function x(y), or α(w), and the tree Point elements to the
individual elements y ∈ Y , or w ∈ W . For example, if the StorageType was defined to be a
single floating-point number representing the reflected radar intensity of the region of space,
then the representation can be interpreted as containing a deterministic representation of
the function of radar intensity over the space Y .
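The polymorphic storage idea can be sketched as follows. The class names follow Figure 5.6, but the definitions themselves are illustrative assumptions rather than the thesis code.

```cpp
// Sketch of a domain element (Point) extended by an arbitrary stored type (StorageType).
#include <array>

class Point {                         // an element y (or w) of the domain
public:
    explicit Point(std::array<double, 3> location) : location_(location) {}
    const std::array<double, 3>& location() const { return location_; }
private:
    std::array<double, 3> location_;
};

template <typename StorageType>
class StoragePoint : public Point {   // the value x(y) (or alpha(w)) held at that element
public:
    StoragePoint(std::array<double, 3> location, StorageType data)
        : Point(location), data_(data) {}
    StorageType&       data()       { return data_; }
    const StorageType& data() const { return data_; }
private:
    StorageType data_;
};

int main() {
    // A deterministic scalar field, e.g. reflected radar intensity at a location in Y.
    StoragePoint<float> p({10.0, -3.5, 1.2}, 0.83f);
    return p.data() > 0.0f ? 0 : 1;
}
```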

Probabilistic Data Management

The representation above is deterministic rather than probabilistic, but it is not sufficient
simply to change the quantities defined in the StorageType object to correspond to
probabilities, such as the probability of a given quantity having a particular value. This
would correspond to the Dirac value distribution derived from an underlying functional
distribution. Instead, each and every quantity defined as a random variable may be affected
by all others. For example, a grid representation containing N cells defining the height of a
surface does not represent a set of N independent, one-dimensional random variables over
height, but rather, in the most general case, an N-dimensional random variable, as each
cell is capable of directly affecting all other cells.

It is obvious that the kDTree structure does not support these operations directly and that
the management of data contained in the StoragePoint objects will require the consid-
eration of other values from the representation. It is at this point that the effects of the
approximations of Section 3.5 become significant. As will be seen in Sections 5.4-5.7, var-
ious methods can be utilised within the data structure to generate viable approximations
to the fully-dependent cases. The simplest approximation in terms of data management is
the fully-independent domain assumption, where the individual functional values at each
given location are assumed independent of the values at any other point; this assumption
is common in grid representations and, most notably, in the occupancy grid method (Elfes
1987). In that case, the StoragePoint object contains the individual distributions for the
various quantities it represents.¹⁴
¹⁴ Note that the independent domain assumption does not assume that the functional quantities defining
the tensor function x(y), say x₁(y) and x₂(y), are independent as well, and it is possible that the update
behaviour may depend in a non-trivial manner on the other quantities at any particular Point.

A particularly useful approximation in robotic systems, however, corresponds to the
application of Markov field assumptions, that is, to defining ‘neighbourhoods’ over which
interactions are maintained by message passing methods. When these neighbourhoods are
well-defined in terms of the locality of the domain value y, then since the tree is inherently
spatial, the structure can be used to manage the data probabilistically. The message-passing
operations can be managed in the propagation model part of the implementation. In prac-
tice, the probabilistic manipulations can be defined as a class which operates on a specific
list of points retrieved from the tree. In this way, operations such as neighbourhood inter-
polations, local interactions and other effects can be maintained using the representation,
but without changes to the tree data structure.

As an example, a particular special case of the approximations represents a point-based
sampling of a likelihood field. While the individual values are maintained according to the
independent domain model, the instantiation of a new point should be considered in terms
of the local properties of the field, rather than by assuming the non-informative prior.
In the software framework developed, the basic behaviour is defined by a template class
LikelihoodValue which defines the operations of updating a sample given a particular
observation value¹⁵, interpolating to find the prior value, and setting the non-informative
value. In this way, an OccupancyValue class can be defined so that: the update is performed
using any of the standard techniques, Elfes (1987) or Konolige (1997, §3) for example; the
interpolation behaviour can be defined in terms of a set of neighbouring points from the
tree; and the non-informative value is set to the relevant value for maintaining occupancy
values. This model is used extensively in the practical examples to follow.
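A minimal sketch of an OccupancyValue along these lines is shown below, using the standard log-odds form of the occupancy update (Elfes 1987; Konolige 1997). The exact interface of the thesis LikelihoodValue template is not reproduced, so the method names here are assumptions.

```cpp
// Sketch of an occupancy value with the standard log-odds Bayesian update.
#include <cmath>

class OccupancyValue {
public:
    // Non-informative value: P(occupied) = 0.5, i.e. zero log-odds.
    void setNonInformative() { logOdds_ = 0.0; }

    // Interpolation hook: a newly instantiated point may be seeded from a neighbouring
    // sample (here simply copied) rather than from the non-informative prior.
    void interpolateFrom(const OccupancyValue& neighbour) { logOdds_ = neighbour.logOdds_; }

    // Update with the probability of occupancy implied by a single observation of this cell.
    void update(double pOccGivenObservation) {
        const double p = pOccGivenObservation;
        logOdds_ += std::log(p / (1.0 - p));     // accumulate evidence in log-odds form
    }

    double probability() const { return 1.0 - 1.0 / (1.0 + std::exp(logOdds_)); }

private:
    double logOdds_ = 0.0;
};

int main() {
    OccupancyValue v;
    v.setNonInformative();
    v.update(0.7);   // two detections and one miss leave the cell more likely occupied
    v.update(0.7);
    v.update(0.4);
    return v.probability() > 0.5 ? 0 : 1;
}
```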

If a local-interaction model were required, then the StoragePoint object would be required
to maintain not only the local value being estimated, but additionally the interactions
between it and the appropriate neighbours, for example, a covariance matrix defining the
properties of a jointly-Gaussian relationship. While this increases the complexity of the data
interpretation and management operations, the underlying kDTree data structure remains
unchanged and the complexity is managed by the StorageType object and any helper classes
defined for that type.

¹⁵ See Section 5.3.4 for a description of this operation.

5.3.3 Sequential State Management

As noted earlier, the goal of a thread responsible for managing the sequential characteristics
of the state relates to incorporating temporal effects on the data. As this operation will
generally change the values at all locations within the representation, its implementation will
have a highly significant impact on the performance of the resulting application. The severity
of this impact depends on whether the entire update process must complete without any
new data being incorporated into the structure. If this is the case, then
a sequential update will require that the entire data structure be write-locked (preventing
any other sensing or reasoning operations from operating) for the duration of the update
process. Depending on the nature of the quantities being stored and the temporal processes
affecting the result, it is possible that the operation can proceed sequentially through the
entire structure, allowing other operations to occur on other sections as it progresses. While
the examples of Sections 5.4–5.7 all consider static environments, the implementation of a
management thread remains relatively straightforward, and the majority of the complexity
involved in the operation will relate to the situation-specific details of the application.
In fact, the computational cost of operations which adjust the entire state represents a
substantial burden on a system; this is a possible cause for the paucity of research into the
area of non-static environments.
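One way such a management thread might sweep the representation region by region, rather than holding a single global write-lock for the whole update, is sketched below. The structures and the simple decay-style propagation model are illustrative assumptions only.

```cpp
// Sketch of a sequential state-management sweep: each region is write-locked only while
// it is being propagated, so sensing and reasoning threads are blocked briefly at most.
#include <array>
#include <mutex>
#include <vector>

struct Region {
    std::mutex guard;             // stands in for the per-node write lock
    std::vector<double> values;   // functional values stored in this region
};

using State = std::array<Region, 4>;

// Example propagation model: exponential decay of accumulated evidence towards zero.
void propagateAll(State& regions, double decay) {
    for (Region& r : regions) {
        std::lock_guard<std::mutex> lock(r.guard);   // only this region is unavailable
        for (double& v : r.values) v *= decay;
    }
}

int main() {
    State state;
    state[0].values = {1.0, 2.0};
    propagateAll(state, 0.9);
    return 0;
}
```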

5.3.4 Sensing Operations

The incorporation of sensory data into the representation was shown in Equation 3.70 to
correspond to the application of the forward model,

P[ z_l(v_l) | x(y) ].    (5.2)

Given a particular value of the observation function, the distribution represents the like-
lihood distribution over the possible values of the state that could have given rise to this
observation. This means that there are two important operations undertaken by the thread
(or process) managing one of the sensory cues: to determine the appropriate values of the
forward sensor model; and to combine this with the current state to evaluate the posterior
state distribution. When considering the multi-process and multi-threaded options dis-
cussed earlier, it is clear that the second of these must occur in the thread attached to the
Figure 5.9: A UML class diagram for a subset of the SensorModel part of the software
structure. The top row corresponds to a series of types on which a particular instantiation
is specialised (ValueType, LikelihoodType, PositionType, DataType, SampleType). The
middle row describes the abstract classes forming the interface to the kDTree data structure
(LikelihoodValue, SensorModel) and the bottom row a specific instantiation used in an
occupancy model approach (OccupancyValue, ScalarValue, LikelihoodSensor3D).

process maintaining the state distribution, but that the first can be readily performed in a
separate process.

Since the forward sensor models are generated from physical models, the operations required
to determine the likelihood distribution of Equation 5.2 are algorithmically simple,¹⁶ and in
most cases the complexity of the sensory operations relates to the process of performing the
update of Equation 3.70. In the framework developed using Point objects, the occurrence
of a new observation value generates new information in a given region of the space, for
which there may or may not be existing samples. The particular implementation details
of this part of the system are related to computational and numerical efficiency, but a
consideration of these does not lend additional clarity to the interpretation of the model
presented and so is omitted. Instead, consider the following operations which are likely to
be performed by a sensing process:

¹⁶ Often these consist of a relatively simple model which is transformed into the required domain. For
example, a range sensor is well-defined in polar coordinates with respect to the boresight of the beam, and
the resulting likelihood distribution can be readily transformed into a Cartesian space for incorporation
into a representation at a given location and orientation.

Evaluation of the Sensor Model in the Domain of the Sensor, that is, determining
the effect on the quantities described by the function space X. For example, in a range-
bearing sensor, the observation itself corresponds to the particular range and set of
angles. Importantly, while the character of the resulting distribution is invariant to
the angles of the measurement, the same is not true of the range measured. This
stage, therefore, would correspond to the evaluation of the likelihood distribution in
the vicinity of the beam, but without reference to the actual angles involved;

Translation to the State Domain, in which the likelihood values are translated to cor-
respond to the actual elements of the state to be updated¹⁷;

Generation of new samples, in which new samples are instantiated to populate areas
for which samples do not exist; and

Updating existing samples, in which the existing samples in the region affected by the
observation are updated appropriately. (A minimal sketch combining these four steps
is given below.)
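A minimal sketch combining these four steps for a one-dimensional range beam and an independent-cell, log-odds state is given below. The beam likelihood shape, the grid layout and all names are illustrative assumptions, not the sensor models used in the thesis.

```cpp
// Sketch only: a 1-D range observation incorporated into an independent-cell state.
#include <cmath>
#include <map>

struct Cell { double logOdds = 0.0; };             // created with the non-informative value

// Step 1: evaluate the sensor model in the domain of the sensor (along the beam):
// cells well before the return are likely free, cells near the return likely occupied.
double beamLikelihood(double r, double measured, double sigma) {
    if (r < measured - 2.0 * sigma) return 0.3;
    return 0.3 + 0.65 * std::exp(-0.5 * std::pow((r - measured) / sigma, 2.0));
}

void incorporate(std::map<long, Cell>& state, double sensorPos, double measuredRange,
                 double sigma, double cellSize) {
    for (double r = 0.0; r <= measuredRange + 2.0 * sigma; r += cellSize) {
        double pOcc = beamLikelihood(r, measuredRange, sigma);
        // Step 2: translate the beam-local position into the state domain.
        long key = static_cast<long>(std::floor((sensorPos + r) / cellSize));
        // Step 3: a new sample is instantiated (non-informative) if none exists here.
        Cell& c = state[key];
        // Step 4: update the existing (or new) sample with the observation likelihood.
        c.logOdds += std::log(pOcc / (1.0 - pOcc));
    }
}

int main() {
    std::map<long, Cell> state;
    incorporate(state, 0.0, 12.5, 0.2, 0.25);      // one 12.5 m range return
    return state.empty() ? 1 : 0;
}
```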

In the software system used to develop the case studies of Sections 5.4–5.7, a structure which
provides a flexible interface for managing the information in the kDTree data structure has
been developed; a subset of this is shown in Figure 5.9. Specifically, this structure provides
the necessary interfaces to manage the four tasks noted above, while allowing flexibility in the
selection of the precision of the storage, the selection of the quantities of interest and the
nature of the probabilistic updates associated with those values. In this diagram, the lowest
row of classes corresponds to specific instantiations of the abstract classes in the middle row;
the top row shows the types on which a specific instantiation is specialised. Practically, any
sensory observation will necessarily affect the data within the tree and therefore requires
that the operations be protected within a write-lock. This means that each and every time
a region of the representation is affected by a sensory observation, all other operations are
blocked for the duration of the procedure. The implication of this for a viable practical
implementation is associated with the effect of this on the reasoning operations of the next
section.

¹⁷ This operation is strongly affected by the ‘data-association’ effects of Section 4.3.3.

5.3.5 Reasoning Operations

Lastly, the nature of the reasoning operations defined for a functional representation allows
significant flexibility in terms of the development of a viable computational framework.
Specifically, given the thread-safe access methods described earlier, the computational struc-
ture of a reasoning operation is essentially arbitrary. For this reason, the framework de-
veloped defines no specific structure for these operations, other than in considering the
sequencing of these with respect to other operations on the data structure. In particular,
note that access to the entire structure using a read lock prevents any write operation on
any region of the representation; likewise, if a write operation is occurring on any section of
the representation then the read operation will be delayed. For this reason, it is beneficial
to implement a read of the entire distribution by sequentially reading individual parts of the
data. As the reasoning operations correspond to a reinterpretation of the available data, it
may be acceptable for some parts of the representation to be updated during a sequential
reading process.
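A sketch of such a segmented read is given below, with illustrative names only and the standard std::shared_mutex standing in for the MRSW mutex described earlier: each region is read-locked just long enough to copy its contents, and the downstream transformation then operates on the copies without holding any lock.

```cpp
// Sketch of a sequential, segment-by-segment read of the representation.
#include <shared_mutex>
#include <vector>

struct Region {
    mutable std::shared_mutex guard;   // MRSW behaviour via the standard shared mutex
    std::vector<double> values;
};

std::vector<double> snapshotSequentially(const std::vector<Region*>& regions) {
    std::vector<double> copy;
    for (const Region* r : regions) {
        std::shared_lock<std::shared_mutex> lock(r->guard);   // read token for one region only
        copy.insert(copy.end(), r->values.begin(), r->values.end());
    }                                                          // released before the next region
    return copy;                        // the feature-space transformation runs on this copy
}

int main() {
    Region a, b;
    a.values = {1.0, 2.0};
    b.values = {3.0};
    std::vector<Region*> regions = {&a, &b};
    return snapshotSequentially(regions).size() == 3 ? 0 : 1;
}
```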

5.4 Application: Global Planning for AUGV Systems

This section considers the application of the framework and considerations of the first half of
this chapter to the problem of constructing a ‘global’ planning system for an AUGV project.
In this context, global refers to the utilisation of maps covering the entire operating region,
as distinct from a ‘local’ representation describing only a subset of the operating region,
such as the region within a given distance of the vehicle. This alternative approach is
considered in Section 5.5 and the hybrid of both methods will be discussed in Section 5.5.4.

5.4.1 AUGV Project Description

The AUGV to be considered in this section is shown in Figure 5.10 and is part of the ARC
Centre of Excellence in Autonomous Systems (CAS)¹⁸ ‘Outdoor Demonstration Facility’.
The platform is an eight-wheeled, all-terrain vehicle, the ‘Argo Conquest 8×8’ by Ontario
Drive and Gear Limited¹⁹. Capable of speeds of 32 km/h²⁰, the platform is a relatively
¹⁸ Details of the Centre can be found at http://www.cas.edu.au
¹⁹ Product details and company information can be found at the website http://www.argoatv.com
²⁰ That is, approximately 20 m.p.h. or 8.9 m s⁻¹.

Figure 5.10: The CAS Outdoor Systems Demonstrator AUGV shown at The University of
Sydney’s ‘Arthursleigh’ property. The flexibility of the system with regard to the terrain
which it is capable of traversing is evident from the design of the vehicle. Photo: Alex
Green, 2005

slow vehicle, but as an amphibious, eight-wheel-drive vehicle with excellent entry and exit
angles²¹, it is capable of traversing a wide variety of challenging terrain situations includ-
ing stairs, logs, waterways and rocky terrain. Vehicle control is provided through direct
actuation of the throttle and choke of the engine, actuated gear selection (two forward, one
reverse), and skid steering by hydraulically-actuated differential braking. The system is
configured to act as a test-platform for the demonstration of ongoing research projects at
the Centre and defines a power and communications hardware interface for flexible payload
deployment.

As noted in Chapter 1, the deployment of a reliable AUGV system in realistic scenarios
requires several key developments which have yet to be demonstrated in such systems.
Specifically, these are: continuous 24-hour operation, with a graceful transition at dawn
and dusk; all-weather operation; and a greater degree of flexibility in the situational terrain
characteristics. This platform has been developed to provide a facility for deploying and
²¹ That is, the angle between the wheels and the most forward part of the vehicle, corresponding to the
greatest change in terrain gradient which the vehicle can encounter without contact other than the wheels.
Likewise, the exit angle is the same at the rear of the vehicle.

testing the individual components required for these capabilities.

The specific task under consideration here is the autonomous traversal of the terrain from
a given starting position to an appropriate goal location. In this part of the case study,
the offline, or global, pre-operative planning phases are considered. This focusses on the
management of the data available either remotely or from previous missions to enable
the generation of the pre-mission plan. This plan constitutes the desired path for the
vehicle to follow. That is, the goal of the system is to perform the steps of terrain-surface
estimation, terrain classification and traversability/obstacle identification so that a viable
desired trajectory can be determined.

5.4.2 The Sensor Space

The development of the system will be guided by the sensors available for achieving this task,
that is, the nature of the ‘sensory space’ describing the quantities which can be measured,
and from this selection, the possible state representations can be determined. In this case,
the emphasis is on the information which is available from remote sensing operations, such
as satellite or aerial reconnaissance. Specifically, the data under consideration is that which
is available for the ‘Arthursleigh’ test facility:

Digital Terrain Model information has been made available by the New South Wales
Department of Information Technology and Management (Land and Property Infor-
mation) as a 25m×25m spatial grid of terrain heights. Without access to further
information from the department regarding the accuracy associated with this data,
the information is assumed to have Gaussian uncertainty with the variances selected
heuristically. The information is modelled as a range sensor mounted on an airborne
platform, with the angular uncertainty chosen so that the resulting Gaussian has
a 2-σ footprint equal to the grid spacing, and with a range variance of 5m.22 (A small
sketch of this heuristic model is given after the sensor descriptions below.) Figure 5.11
shows an 800 × 600m section of the Arthursleigh test facility as a triangulation of the
raw DTM data.

Airborne Visible-spectrum Imagery In addition to the digital terrain model, the same
government organisation also makes aerial photography information available for the
22. Appendix E describes the model in detail.

Figure 5.11: Sample digital Terrain Model (DTM) data for a section of the Arthursleigh test
facility. The X,Y and Z axes correspond to geo-referenced North, East and Down locations
and the area represented is approximately 800 × 600m in extent.

facility. This information was processed to generate geo-referenced imagery informa-


tion using the same datum as the DTM information. A small section of the aerial
imagery data is reproduced in Figure 5.12. This information is available with a reso-
lution of 2 × 2, 1 × 1 and 0.6 × 0.6 metres per pixel.

Hyperspectral Airborne Imagery represents the final remotely sensed data available
for the site and was commissioned in February 2005. A 125-channel reflectance/radiance
spectrum was generated in an unstructured grid with an approximate resolution of
3 × 3m. Figure 5.13 shows the same section of the test facility for six of the 125
reflectance channels.
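As a rough illustration of the heuristic DTM sensor model described for the first data source above (the actual model is the one given in Appendix E), the following sketch treats each grid height as a downward-looking range observation; the sensing altitude is an assumed value introduced purely for the example.

```python
import numpy as np

# A sketch only: treat one 25 m x 25 m DTM height sample as a noisy, downward-looking
# range observation. The altitude 'h' is a hypothetical value used purely for this
# illustration; it is not taken from the survey data or from Appendix E.
GRID_SPACING = 25.0   # m, spacing of the DTM samples
RANGE_VAR = 5.0       # m^2, heuristic range variance assumed in the text
h = 1000.0            # m, assumed altitude of the notional airborne range sensor

def dtm_sample_as_observation(north, east, height):
    """Convert a single DTM grid height into a range observation with Gaussian uncertainty."""
    r = h - height                           # nominal range to the terrain sample
    # Angular std. dev. chosen so that the 2-sigma lateral footprint on the ground
    # equals the grid spacing: 2 * r * sigma_theta = GRID_SPACING.
    sigma_theta = GRID_SPACING / (2.0 * r)
    return {"mean": np.array([north, east, height]),
            "range_var": RANGE_VAR,
            "angular_std": sigma_theta}
```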

Now, a critical decision at this point is the selection of the operational domain of the
representation, that is, the domain Y over which the functions defining these quantities are to
be defined. While the DTM and hyperspectral imagery are provided as samples in R3, the
visible-spectrum imagery is located only with respect to northing and easting values and is
not isomorphic with the more general representation of the other sensors. For this reason
the domain is selected as Y = R3 and the incorporation of the aerial imagery information
is postponed until after the other data has been incorporated, so that the existing three-dimensional
information can constrain the projection of the two-dimensional imagery into the representation.

Figure 5.12: Aerial imagery data for the test facility covering the same region as the DTM
data in Figure 5.11.

While the global planning operations focus on the utilisation of data obtained remotely
and pre-deployment, an important situation arises when the environment has been pre-
viously explored or investigated. A distinction is made here between local data available
during deployment and information available from previous scenarios or from other sensors.
Consider, for example, that the deployment of an AUGV for ore hauling in a mining en-
vironment will benefit from detailed sensory information regarding the shape of the mined
area following excavation or blasting operations. In order to provide for this capability,
consider a final information source for this implementation: a terrain-imaging radar. The
unit to be considered is shown in Figure 5.14 and is a high-resolution millimeter-wave radar
developed specifically for terrain estimation in mining situations. Data is available as either
a complete range-intensity spectrum, or as an interpolated ‘highest-peak’ return. For this
scenario the data is provided as highest-peak and the information resembles that from a
sonar or laser scanning system. Example data from this sensor was shown in Figure 5.7.
The sensor has a beamwidth of 0.8◦ in azimuth and elevation and a range uncertainty of
±25cm and so represents a source of high resolution information.
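A rough sketch (not the implementation used in the thesis) of how such a highest-peak return might be mapped to a point in R3 with an approximate Gaussian uncertainty is given below; treating the 0.8° beamwidth as a 2-σ width and ±25 cm as one standard deviation are assumptions made only for the example.

```python
import numpy as np

# A sketch (not the thesis implementation) of mapping a 'highest-peak' radar return to a
# point in R^3 with an approximate Gaussian covariance by first-order propagation.
# Treating the 0.8 degree beamwidth as a 2-sigma width and +/-25 cm as one standard
# deviation are assumptions made only for this example.
SIGMA_RANGE = 0.25                     # m
SIGMA_ANG = np.deg2rad(0.8) / 2.0      # rad

def radar_return_to_point(r, az, el):
    """Spherical (range, azimuth, elevation) -> Cartesian point and covariance."""
    p = np.array([r * np.cos(el) * np.cos(az),
                  r * np.cos(el) * np.sin(az),
                  r * np.sin(el)])
    # Jacobian of p with respect to (r, az, el).
    J = np.column_stack([p / r,
                         r * np.array([-np.sin(az) * np.cos(el),
                                        np.cos(az) * np.cos(el),
                                        0.0]),
                         r * np.array([-np.cos(az) * np.sin(el),
                                       -np.sin(az) * np.sin(el),
                                        np.cos(el)])])
    S = np.diag([SIGMA_RANGE**2, SIGMA_ANG**2, SIGMA_ANG**2])
    return p, J @ S @ J.T
```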

Taken together, the sensory space for this idealised example consists of four sensory sources,
three of which provide data which can be directly mapped into a space defined by Y =
R3 and the aerial imagery can be mapped approximately given the prior incorporation

Figure 5.13: Hyperspectral aerial data for the test facility covering the same region as the
DTM data in Figure 5.11. The reflectance magnitude for six different channels is shown.

Figure 5.14: A terrain-imaging radar unit. Photo: Ross Hennessy.

of the other data. Given that the operations to be considered in this example consist of
global or pre-deployment planning, this presents no significant restriction on the viability
of the system. The incorporation of data during a deployment scenario will be considered
separately in Section 5.5.

5.4.3 Task and Representation Selection

While the set of sensors discussed in Section 5.4.2 determines what information is available to
the system, the nature of the tasks to be achieved determines which parts of this information
are necessary. As was noted earlier, if the entire mapping from sensory cues to required tasks
is known, then the representation can be selected optimally to capture only that information
which is required. Unfortunately, this is not possible in practice and it is necessary to select
the representational form so as to support the achievement of a wide variety of possible
tasks under the constraint of limited system resources. For this reason, consider here the
desirable tasks for this example and, given these, a viable practical representation for the
global data.

Recall that the operational goal of the AUGV system under consideration is the autonomous
traversal of a partially-known environment of arbitrary complexity. Since the system is not

[Figure 5.15 diagram: data sources (Survey Data, Mine Plan, Radar System, GPS/INS Data) feeding the global representation x(y), with the tasks Surface Estimation, Terrain Classification, Traversability Estimation and Trajectory Planning drawing from it.]
Figure 5.15: Schematic view of the Global AUGV planning problem. The data sources
feed information into the representation and the four applications transform the data as
necessary to support their tasks.

required to provide additional capabilities such as reconnaissance or terrain-processing (such as
agricultural harvesting or ore collection), it is sufficient to support the determination of
trajectory (or path) options for the traversal task. Figure 5.15 shows the system described
in the previous sections: the four sensory streams are shown feeding information into the
representation and several reasoning tasks are shown, drawing information from that repre-
sentation. Recall from Section 4.2 that each of these operations represents a transformation
of the data contained in the source representations. In this specific example, both primary
tasks (surface estimate and terrain classification) and secondary tasks (traversability and
trajectory planning) are depicted.

Surface Estimation - as the primary goal is terrain traversal, the most important capa-
bility is the determination of the most likely surface to be encountered during the
operational phase. Estimating this from the sensory data can be achieved in many
different ways including: traditional triangulation techniques, such as Delaunay tri-
angulation (Rajan 1991); sequential probabilistic triangulation techniques (Leal 2003,
Ch. 5); traditional visualisation iso-surface and contouring techniques, such as march-
ing cubes (Schroeder et al. 2004, pp148-52) or contour connections (Brodlie et al. 1992,
p58). In each of these cases, however, it must be noted that the scalar quantity which
generates a meaningful ‘terrain-isosurface’ must be derived from the representation.
Candidates include an heuristic approach and machine learning approaches in which

the transformation is discovered through training. (A minimal sketch of the Delaunay-based option listed first appears after this list.)

Terrain Classification - it is well known that surface geometry, while a significant in-
fluence, is not sufficient for determination of terrain traversability, for example, the
distinction between a dry salt-pan and a lake surface as discussed in Chapter 1. There
are many different classification schemes which could be used to augment the struc-
tural information, though these will not be considered in detail here. The literature is
rich in approaches to this problem and several examples include Laine & Fan (1993),
which examines texture classification using wavelet analysis; Wang et al. (2005), who
utilise Bayesian classifiers for hyperspectral imagery; and Lee et al. (2004), where
synthetic-aperture radar data is assessed using a Wishart classifier.

Traversability Estimation and Trajectory Planning - given the terrain geometry and
classification information, and potentially also utilising separate cues from the raw
data, the traversability of the terrain can be estimated and this information used for
trajectory planning (Vandapel et al. 2006).
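As a minimal sketch of the simplest of the surface-estimation options listed above, the following triangulates the horizontal coordinates of samples whose occupancy probability exceeds a threshold; the threshold value is an illustrative choice rather than one taken from the thesis.

```python
import numpy as np
from scipy.spatial import Delaunay

# A minimal sketch of the simplest option above: a 2D Delaunay triangulation over the
# horizontal coordinates of high-occupancy samples. The 0.5 threshold is an
# illustrative choice, not a value taken from the thesis.
def estimate_surface(points, p_occupied, threshold=0.5):
    """points: (N,3) samples; p_occupied: (N,) occupancy probabilities."""
    surface_pts = points[p_occupied > threshold]     # keep likely-occupied samples
    tri = Delaunay(surface_pts[:, :2])               # triangulate in the horizontal plane
    return surface_pts, tri.simplices                # vertices and triangle index triples
```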

This suggests, in particular, that the information available from the DTM model, the struc-
tural components of the hyperspectral data23 and the radar unit should be represented in
a fashion which directly supports these operations. It is beyond the scope of this thesis to
consider such algorithms in detail, but note here that an occupancy-type property provides
a clearly delineated boundary between unoccupied space and the objects which provided
the sensory return. The details of the specific occupancy model used in this example are
included in Appendix E.

In addition to this data, the three-channel visible and 125-channel imagery data is re-
quired in the representation and for the purposes of this example these data sources will
be considered to be deterministic quantities. It should be noted that this is purely for con-
venience in this example as the operational structure directly supports the incorporation
of this information probabilistically. This implies that the functional representation under

23. This data provides a separate terrain estimate in the form of the unstructured grid over which it is determined.

consideration takes the form of


\[
P[x(y)] =
\begin{bmatrix}
p(O_{\mathrm{radar}}(y) = O^{+}) \\
p(O_{\mathrm{dtm}}(y) = O^{+}) \\
p(O_{\mathrm{hyp}}(y) = O^{+}) \\
x_{\mathrm{rgb}}(y) \\
x_{\mathrm{hyp}}(y)
\end{bmatrix}
\tag{5.3}
\]

where it has been assumed that the three occupancy measures are statistically independent
of one another and can, therefore, be estimated separately, and where xrgb(y) = [xr(y), xg(y), xb(y)]^T
and xhyp(y) = [xhyp,1(y), xhyp,2(y), . . . , xhyp,125(y)]^T.

The utilisation of this data by downstream applications depends directly on the task at hand
and represents an ideal case for the application of learning-type approaches as a direct result
of the substantial complexity and subtlety of the operations, as discussed in Section 4.2.2.

5.4.4 Results

The nature of the tasks to be considered and the sensory data available support the selec-
tion of a point-based functional model in the form presented in Section 5.3. Significantly,
however, the values contained in the functional forms of Equation 5.3 are assumed to be
spatially independent. This means that the distributions are maintained independently for
each of the spatial samples in the operating domain, reducing the overall performance of
the estimation scheme as interactions between regions are not maintained. However, the
reduction in computational complexity outweighs this reduction in the rate of convergence
of the estimates; this is the same assumption underpinning algorithms such as occupancy
grids (Elfes 1987).
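A minimal sketch of the spatially-independent update assumed here, in the familiar log-odds form used by occupancy grids, is given below; the inverse-sensor likelihoods are placeholder numbers, and the actual occupancy model is the one described in Appendix E.

```python
import numpy as np

# A sketch of the spatially-independent update assumed here: each sample keeps its own
# occupancy probability and is updated with a Bayesian (log-odds) rule, as in occupancy
# grids (Elfes 1987). The likelihood values below are placeholders; the actual occupancy
# model is the one described in Appendix E.
def logit(p):
    return np.log(p / (1.0 - p))

def update_occupancy(p_prior, p_obs_given_occupied, p_obs_given_empty):
    """One independent Bayesian update of a single sample's occupancy probability."""
    l = logit(p_prior) + np.log(p_obs_given_occupied / p_obs_given_empty)
    return 1.0 / (1.0 + np.exp(-l))

# For example, a strong 'occupied' return pushes a 0.5 prior towards one:
p_new = update_occupancy(0.5, p_obs_given_occupied=0.7, p_obs_given_empty=0.3)
```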

The results of the utilisation of this system to combine a digital terrain model, approxi-
mately 5 hours of local radar data and aerial imagery are discussed in this section. When this
information was loaded into the computational structure developed in this chapter approx-
imately two and a half million samples were generated covering an area of approximately
1 × 1.5km. The digital terrain model (DTM) data, once it has been loaded into the system
is shown in Figure 5.16(a) where the samples are coloured according to their probability
of occupancy. The colour scale runs from red (unoccupied) through green (fifty percent

probability) to blue (occupied). The surface is clearly visible in this data and in Figure
5.16(b) has been triangulated to reveal the terrain shape (the scale has been exaggerated
in this figure).

Figure 5.17(a) shows the entire cloud of samples obtained from loading the radar data
into the representation. Once again, the colours indicate the occupancy probability of
the locations with the same colour scale used as in Figure 5.16. The data was obtained by
driving the imaging radar of Figure 5.14 around the site for approximately five hours on two
different occasions. The utilisation of high-accuracy GPS data was sufficient to provide a
convenient registration for these datasets. It should be noted, when interpreting this figure,
that the visible data covers approximately one and a half kilometers in the depth direction
and nearly four hundred metres across. Figure 5.17(b) shows the radar data points with
high occupancy probability from above. The straight lines which appear in this data are a
direct result of the observation of fences on the property. Figure 5.18 shows a small section
of this radar data demonstrating interesting characteristics of the available information.

Consider the introduction of the aerial imagery data (or similarly, the hyperspectral data)
into the representation. Figures 5.19(a) and 5.19(b) show the data projected onto the points
and a Delaunay triangulated surface. As the visible imagery shows a road passing across
the terrain, the scale of the dataset can be inferred readily.
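One simple way to perform this projection, assuming the imagery is geo-referenced on the same datum as the points, is a nearest-pixel lookup in the northing/easting plane; the origin and resolution arguments below are illustrative and do not correspond to the survey products.

```python
import numpy as np

# A sketch of a nearest-pixel projection of geo-referenced imagery onto the point samples,
# assuming the image shares the datum used for the points. The origin and resolution
# arguments are illustrative; they are not the values of the survey products.
def colour_points(points, image, origin_ne, metres_per_pixel):
    """points: (N,3) north/east/down samples; image: (H,W,3); returns (N,3) colours."""
    rows = ((points[:, 0] - origin_ne[0]) / metres_per_pixel).astype(int)  # northing -> row
    cols = ((points[:, 1] - origin_ne[1]) / metres_per_pixel).astype(int)  # easting  -> column
    rows = np.clip(rows, 0, image.shape[0] - 1)
    cols = np.clip(cols, 0, image.shape[1] - 1)
    return image[rows, cols]
```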

Finally, consider the application of a combined interpretation using the data within the
structure, specifically, the determination of the surface and the identification of the elements
of the data which stand clear of that surface. These represent the data which may correspond
to obstacles in the world and, therefore, the procedure constitutes an important capability
for autonomous terrain traversal as considered here. Figure 5.20 shows two images obtained
from the system configured to achieve this task. The terrain surface is coloured by the
aerial imagery data to assist with interpretation and the samples which are separate from
the surface are shown in red. In Figure 5.20(a) the scope of this data is clear and it appears
that the data describe well the existence of the trees on either side of the road in the centre
of the image. This area is considered at a higher resolution in Figure 5.20(b) where the
radar has successfully identified the front surfaces (and some volumetric information) for
the groups of trees on either side of the roadway.
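A minimal sketch of this surface/obstacle separation, assuming the terrain surface is available as a set of samples that can be interpolated, is shown below; the 0.5 m clearance threshold is an illustrative value only.

```python
import numpy as np
from scipy.interpolate import griddata

# A sketch of the surface/obstacle separation described above: interpolate the estimated
# terrain height at each sample's horizontal location and flag samples standing more than
# a clearance above it. Heights are taken as 'up' values and the 0.5 m clearance is an
# illustrative threshold only.
def points_above_surface(samples, surface_pts, clearance=0.5):
    """samples, surface_pts: (N,3)/(M,3) arrays of north, east, height."""
    terrain_z = griddata(surface_pts[:, :2], surface_pts[:, 2],
                         samples[:, :2], method="linear")   # NaN outside the surface hull
    return samples[(samples[:, 2] - terrain_z) > clearance]
```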

(a) DTM points.

(b) DTM surface.

Figure 5.16: Digital Terrain Model (DTM) data in framework. The figures show the raw
samples in (a) and a surface obtained by using a Delaunay triangulation in (b). In (a)
colours nearer to blue represent higher probabilities of occupancy, while those near red
represent lower. It is possible to identify a ‘surface-like’ region and some sparse samples of
low probability above the surface, though the surface is significantly clearer in (b) where
the vertical scale has been exaggerated by a scale of two.

(a) Radar data samples.


(b) High occupancy radar points.

Figure 5.17: Radar data in framework. All data points are shown in a perspective view
in (a). The complexity and extent of the data contained in this model is apparent in this
figure. Figure (b) shows a simplified version in which only high-occupancy points are visible.
Some of the irregular lines in this image correspond to physical fences.

Figure 5.18: A view of a small section of radar data of Figure 5.17. The ground scanning
pattern is visible in the left-hand-side of the image as a series of curved raster patterns.
Several fence-lines are visible running diagonally towards the top-right corner and the corner
of a building is visible as a sharp right-angle near the right-hand edge. The large expanse
of unoccupied data in front of the building is a result of a large quantity of data being
collected in this vicinity.

(a) Aerial imagery on point samples.

(b) Aerial imagery on reconstructed surface.

Figure 5.19: Aerial imagery in framework . The aerial imagery data is projected onto the
samples following the incorporation of the DTM and radar data. (a) shows the data on the
points and (b) on a reconstructed surface.

[Figure 5.20 annotations: ‘Open Grassland/Small Trees’ and ‘Dense trees (wind-breaks)’.]

(a)

(b)

Figure 5.20: Combining surface estimation with local radar data. The surface is estimated
from the combined DTM and radar information and the remaining samples are coloured
in red and they can clearly be seen to stand vertically proud of the extracted surface. (a)
shows the view from a wide angle and (b) shows the effect near some stands of trees.

5.5 Application: Local AUGV Navigation

In contrast to the example considered in Section 5.4 where the information was desired in
the global frame defining the operational domain of the system, consider now the problem
of controlling the AUGV vehicle of Figure 5.10 to achieve the trajectory obtained from the
previous example. The capabilities necessary for achieving this task are briefly described
in Section 5.5.1. The fact that the global data used for the planning stages is usually of
coarse resolution and may be out of date motivates the selection of sensors
outlined in Section 5.5.2. Finally, Section 5.5.3 will examine the practical implications of
this approach and the selection of the functional forms to be maintained. Note that the
operations involved with the utilisation of local data from the vehicle to modify the global
planning algorithms correspond to an important special case of the incorporation of radar
data discussed in Section 5.5.2.

5.5.1 Task Selection

The primary goal of the AUGV system is reliable traversal of an unknown or partially-
unknown environment. It will be assumed in this section that the global planning algo-
rithms noted above have received a subset of the local data and have responded on a coarse
scale to update the desired trajectory passed to the vehicle. For example, local data may
indicate that a possible path is blocked, say by a closed gate or damaged bridge, requir-
ing an alternative route to be taken. That is, the path which the vehicle is attempting
to navigate is a reasonable global trajectory. The local navigation and control algorithms
will, therefore, correspond to small scale changes to this trajectory to achieve safe, efficient
traversal of approximately the planned path. This suggests four primary tasks for the data
interpretation and a fifth task to combine them. While none are considered in detail in this
thesis, they are:

Navigation - in which the location of the vehicle is determined so that its location can
be compared with the desired trajectory. While it will not be considered here, the
problem of reliable navigation utilising local information in addition to traditional
sources such as Global Positioning Systems (GPS), Inertial Navigation Systems (INS)
and dead-reckoning measurements is examined in detail in Soon et al. (2006).

Surface Updating - in which local data is utilised to correct and augment the surface
information supplied by the higher-level global data so that a high-accuracy model
of the physical surface on which the AUGV is operating can be determined. This
information may be useful at both the global and local level, but for the local form,
only the area surrounding the vehicle is required.

Terrain Characteristics Updating - in which the physical characteristics describing the


quality and nature of the surrounding terrain are estimated by combining the data
from the global system with the local sensory data. Once again, only the character-
istics local to the vehicle are required for this application, though the local data may
be transferred to the global system for planning operations (Vandapel et al. 2006).

Near-field obstacle detection - in which the information regarding terrain character-


istics and surface estimation are combined with local sensory data to identify and
characterise objects which are separate from the terrain and may constitute obsta-
cles, including volumetrically positive and negative structures (Manduchi et al. 2005).

Vehicle Control - in which the information from the previous four tasks and the planned
trajectory is combined with information regarding the dynamics of the vehicle to
provide control inputs for the various actuation systems. While it would be possible
to develop a control system capable of converting the stored sensory data into these
actions, the advantages of utilising the individual stages in developing reliable control
are significant (Ha et al. 2005).

The manner in which these actions, combined with those of the global planning stages,
interact is discussed in Section 5.5.4 which represents the application of the model of this
thesis to a real AUGV system.

5.5.2 Sensor Selection

As noted above, many of the operations specific to the local control problem being consid-
ered here involve the utilisation of local sensory data to modify, update and interpret the
information obtained at the global level of Section 5.4. Since this is the case, the same set
of sensors used in that section will provide information to the local representation, though
this information is provided directly from the global representation and not as new sensory

Figure 5.21: The SICK LMS-221 outdoor laser scanner system.

information. The methodology for information sharing across these distinct representations
will be examined in Section 5.5.4. Instead, consider that local data is available to the AUGV
in the following forms:

Laser Range Scanner data provides high-resolution range information for the local envi-
ronment and these units are used extensively in applications involving AUGV systems.
The most commonly used systems are the SICK Laser Measurement System (LMS)
units. An example of the ruggedised outdoor version (LMS-221) is shown in Figure
5.21. These units are highly configurable with a resolution of as little as 0.25◦ and
1mm at up to 75Hz.24 A single scan of the data from this unit is shown in Figure
5.22 and the unit is mounted on the AUGV as shown in Figure 5.23.

Colour Imagery is obtained from a camera mounted with the LMS system as shown in
Figure 5.23 and provides imagery at 30f.p.s. with a resolution of 640 × 480 pixels
and a colour depth of 8bits per channel. A sample image from the same trial is
shown in Figure 5.24. Note that the motion blur from the vibration of the vehicle is
clearly visible and makes the interpretation of this data more difficult. Furthermore,
as noted in Section 5.4.2, visual data is inherently under-defined with respect to range
information and various techniques are required to ameliorate this difficulty. One such
solution is discussed in Section 5.5.3.

Imaging Radar Finally, while most applications involving AUGV systems utilise laser
data extensively, dust and rain are known to strongly affect their performance (Brooker
24. In the configuration used for this application the data is obtained at 1◦ and 1cm resolution over a range of 100◦ and 81m with a scan-rate of 75Hz.


Figure 5.22: 2D laser scanner data obtained from the SICK LMS unit running at 0.5◦
and 1mm resolution over 180◦ and 8.1m.

Figure 5.23: Laser scanner and camera on AUGV. The SICK LMS hangs below the
cross-beam and the camera housing is above. Also visible are the GPS, freeWave
modem and 802.11 WiFi antennae. Photo: Lindsay Donnelly, 2006.

Figure 5.24: Sample image from vehicle camera. The image is from the same trial as
Figure 5.23.

2006, §9.3). For this reason, a high-speed scanning radar system has been developed
and the unit is shown in Figure 5.25. This system is capable of 25Hz scanning with
approximately 240 returns per revolution in a full 360◦ pattern. The data from this
scanner is of the same character as that of the radar system utilised in Section 5.4.2
and the full FFT-spectrum data is made available for a two-dimensional plane near
the vehicle (Widzyk-Capehart et al. 2005).

5.5.3 Implementation Details

Now, the local control and sensing problem only requires that the information strictly
relevant for the immediate future is retained in the representation,25 so that only a local
region is required. Define Ylocal to be a local cartesian space around the vehicle position and
Ylocal ⊂ Yglobal = R3 . In this way the data from the global frame, in the form of Equation
5.3, can be loaded directly into the local frame and augmented with data of the form
\[
P[x_l(y_l)] =
\begin{bmatrix}
p(O_{\mathrm{laser}}(y_l) = O^{+}) \\
p(O_{\mathrm{hssr}}(y_l) = O^{+}) \\
x_{\mathrm{cam}}(y_l)
\end{bmatrix}
\tag{5.4}
\]

25. The passing of information from the local frame into the global representation and the recovery of that information is considered in Section 5.5.4.

Figure 5.25: High-speed Scanning Radar (HSSR) sensor developed by The Australian Cen-
tre for Field Robotics. The unit is shown installed on a bucket shovel in a mining application.
Photo: Ross Hennessy, 2006.

and the algorithms noted in Section 5.5.1 can utilise the information in both this repre-
sentation and the local region of Equation 5.3. Figure 5.26 shows the arrangement of the
sensors, representation and tasks as discussed above.

Now, while the occupancy measures obtained from the radar and laser sensors provide fully-
defined three-dimensional data, the camera suffers from the same problem as in Section
5.4.2 in that the visual field is a two-dimensional projection of the visual scene and does
not have a well-defined mapping into the cartesian representation. In this case a partial
mapping is achieved through the utilisation of accurate calibration between the camera and
the scanning laser. In this way the laser range information is projected into the image
plane and the pixels corresponding to that range measurement can be incorporated into the
representation. In the same way as for the global AUGV sensing problem of Section 5.4,
the state functions of Equation 5.4 are implemented as a point-sampled representation and
each point is assumed to be spatially independent of the points surrounding it. Once more,
while this means that the representation will converge to a stable estimate less quickly,
the computational saving far outweighs this effect. Several scenes constructed using this
technique are shown in Figures 5.27–5.30.
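A rough sketch of this laser-camera association, assuming a calibrated pinhole intrinsic matrix K and an extrinsic rotation and translation (R, t) between the two sensors, is given below; these calibration quantities are stand-ins and no specific values from the trials are implied.

```python
import numpy as np

# A sketch of the laser-camera association described above. K is a pinhole intrinsic
# matrix and (R, t) the calibrated laser-to-camera extrinsics; these are stand-ins and
# no specific calibration values from the trials are implied.
def colour_laser_points(points_laser, image, K, R, t):
    """Attach the pixel colour at each laser point's projection to that point."""
    p_cam = (R @ points_laser.T).T + t                     # laser frame -> camera frame
    in_front = p_cam[:, 2] > 0.0                           # keep points ahead of the camera
    p_cam = p_cam[in_front]
    uv = (K @ p_cam.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)              # perspective division to pixels
    ok = ((uv[:, 0] >= 0) & (uv[:, 0] < image.shape[1]) &
          (uv[:, 1] >= 0) & (uv[:, 1] < image.shape[0]))
    return points_laser[in_front][ok], image[uv[ok, 1], uv[ok, 0]]
```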

[Figure 5.26 diagram: data sources (Laser, Scanning Radar, Camera, GPS/INS) feeding the local representation x(y′) over Y′, with the tasks Terrain Classification, Surface Estimation, Obstacle Detection, Trajectory Control and Navigation drawing from it; Navigation affects all tasks.]

Figure 5.26: Schematic view of the Local AUGV planning problem. The data sources
feed information into the representation and the four applications transform the data as
necessary to support their tasks.

5.5.4 Global and Local AUGV Operation

Finally, consider that the operations of both Section 5.4 and 5.5 together provide a frame-
work for the development of an AUGV system for navigation of partially known terrain.
Specifically, these operations and representations are depicted in Figure 5.31 to show one
possible arrangement of the interactions between them. The upper half of the diagram cor-
responds to processing which takes place separately from the vehicle, while the lower half
is that which is performed on-board. The effect of the transferral of information between
the two representations, in the form outlined in Section 5.3.1, is to maintain a shared infor-
mation repository between the vehicle and any off-board processing and storage facilities.

5.6 Application: Terrain Imaging and Estimation

5.6.1 Problem Description

The third example considered here relates to the application of the functional models to
the highly specialised process of surface estimation in a mining environment. In particular,

(a)

(b)

(c)

Figure 5.27: Laser augmented camera data showing one of the AUGVs. Each is a different
view of the same data and was obtained from a single run of one AUGV around the other.
Image: James Underwood, 2006.

Figure 5.28: Laser augmented camera data showing a large open field environment. Image:
James Underwood, 2006.

Figure 5.29: Laser augmented camera data showing a road traversal. Image: James Under-
wood, 2006.

Figure 5.30: Laser augmented camera data showing a person in the field of view. Image:
James Underwood, 2006.

[Figure 5.31 diagram: the global representation x(y), fed by the Digital Terrain Model, Hyperspectral Imagery, Aerial Imagery and Radar, supports Surface Estimation, Traversability Estimation, Terrain Classification and Trajectory Planning; the local representation x(y′) over Y′, fed by the Laser, Scanning Radar, Camera and GPS/INS, supports Terrain Classification, Surface Estimation, Obstacle Detection, Trajectory Control and Navigation (which affects all tasks); local sensory data and remote data are exchanged between the two representations.]

Figure 5.31: Global and Local representations for AUGV control. The figures in green
represent global sensors and the representation of Section 5.4, while the local sensors are
shown in red. The blue tasks correspond to the planning stages discussed in Section 5.4.3
and the orange tasks to those of Section 5.5.1.

this section considers the application of functional models to the sensory systems described
in Hennessy (2005). There are many parallels between this example and the global AUGV
planning operations described in Section 5.4, particularly with respect to the utilisation of
local ranging sensors to correct and augment a digital terrain model. In this application,
however, the properties of the terrain at given locations are of less interest than an accurate
estimate of the actual surface geometry.

Consider that in a mining application there will be significant savings associated with the
optimisation of excavation patterns and approaches. An accurate estimate of a surface,
whether as the internal surface of an underground void or as the surface of an open-pit site,
can provide several important capabilities including:

• determining the volume of material removed when compared to an historical surface
estimate; for example in a blasting operation, the before and after surfaces can be used
to estimate the available volume of loose material (a small sketch of this before/after
comparison is given after this list);

• determining the volume of material which must be removed to achieve a given
surface geometry, such as in removing a specific volume of material identified according
to the geological characteristics;

• combining the previous two cases to provide planning information to operations re-
garding an appropriate excavation or blasting strategy; and

• optimising the filling strategy of a given void in response to the actual characteristics
of the resulting material.
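As a minimal sketch of the before/after comparison referred to in the first item of the list above, the following resamples two surface estimates onto a common horizontal grid and integrates the height difference; the grid resolution is an illustrative choice.

```python
import numpy as np
from scipy.interpolate import griddata

# A sketch of the before/after comparison in the first item above: resample both surface
# estimates onto a common horizontal grid and integrate the height difference. The 1 m
# cell size is an illustrative choice.
def excavated_volume(before_pts, after_pts, cell=1.0):
    """before_pts, after_pts: (N,3) north, east, height samples of the two surfaces."""
    n = np.arange(before_pts[:, 0].min(), before_pts[:, 0].max(), cell)
    e = np.arange(before_pts[:, 1].min(), before_pts[:, 1].max(), cell)
    nn, ee = np.meshgrid(n, e, indexing="ij")
    grid = np.column_stack([nn.ravel(), ee.ravel()])
    z0 = griddata(before_pts[:, :2], before_pts[:, 2], grid, method="linear")
    z1 = griddata(after_pts[:, :2], after_pts[:, 2], grid, method="linear")
    dz = z0 - z1                           # positive where material has been removed
    return np.nansum(dz) * cell * cell     # volume in cubic metres (NaNs outside hull ignored)
```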

Obviously, the functional models of this thesis can be additionally utilised to estimate
volumetric properties of the mine environment, such as the locations and characteristics of
ore deposits. While these are not considered explicitly in this example, such quantities can
be estimated according to geological models and measurements such as core-samples and
penetrating sensors including sonar and radar systems.

5.6.2 Sensor Systems

The environment considered in this application is the imaging of a dragline excavation


pit. The dragline plant is shown in Figure 5.32 and the scale of the system is evident

Figure 5.32: A dragline mining plant showing the radar unit of Figure 5.14 installed on the
front of the vehicle. The scale of the equipment can be inferred since this radar unit stands
approximately six feet high. Photo: Ross Hennessy, 2004.

from the fact that the radar unit shown in the inset stands approximately six feet high
and is the same unit shown in Figure 5.14. Now, given that the operating environment
is characteristically dusty and dirty, it is extremely difficult to operate visible (or near
visible) spectrum imaging or ranging systems reliably. For example, the outdoor scanning
laser system shown in Figure 5.21, while extremely accurate, requires that the observation
window (the black perspex face) remains clear of dust. Figure 5.33 shows the conditions
typical of the equipment and the level of fine occluding dust is clearly evident. For this
reason, the radar unit utilised in Section 5.4 is well-suited to this application as it is capable
of operation under these conditions, including when the entire unit was covered in black
coal slurry during extended trials.

As the primary goal of this example is the construction of a surface in the operating environ-
ment, the occupancy model of Appendix E is appropriate and the operating domain of
the system can be easily identified as Y = R3 . This means that the functional probability
takes on the form
" #
P [x(y)] = p O(y) = O+ . (5.5)

Figure 5.33: A typical dusty environment in a mining application. The layer of coal dust in
this image is manageable in dry conditions, but becomes a wet slurry in inclement weather.
Photo: Ross Hennessy, 2004.

For this reason, a similar framework can be used in this situation as in Section 5.4.4 and
is shown in Figure 5.34. In particular, the implementation uses a point-based, spatially-
independent approximation. In the context of the model of Chapter 3 the sensory data
is assumed to be spatially well-defined and the sub-optimality inherent in utilising the
spatially-independent parameter partitioning is vastly overcome by the computational ad-
vantages of that approach.

5.6.3 Results

The computational framework of Section 5.3 and the Paraview visualisation system (Par-
aview 2006) were used to generate a series of surface estimates obtained from a trial on an
open-pit coal mine. Figure 5.35 shows the points extracted directly from the system,
coloured by the occupancy information in the model. This data corresponds to an
eighteen-minute sequence obtained during the operation of the dragline and the suitability
of the occupancy information for the interpretation of surface information is clear from the
two views of the data.

A smoothed two-dimensional Delaunay triangulation was used to obtain the surface estimate
shown in Figure 5.36 and the low occupancy points have been superimposed onto the same
representation in Figure 5.36 (b).

Finally, two such surface estimates are used to estimate the volumetric discrepancies between
the surfaces over a period of approximately five hours. The raw samples and estimated

[Figure 5.34 diagram: data sources (Survey Data, Geological Data, Radar System, GPS/INS Data) feeding the global representation x(y), with the tasks Excavated Volume, Dig Planning, Surface Estimation and Back-Filling Plan drawing from it.]

Figure 5.34: Schematic view of the terrain imaging problem for a mining application. The
data sources feed information into the representation and the four applications transform
the data as necessary to support their tasks.

surface from the later time are shown in Figure 5.37 and the difference is shown in Figure
5.38.

These figures clearly show that the methodology and framework described here forms a
viable and powerful approach to the management of mining information as proposed.

5.7 Application: Warpable-Domain Function Models

5.7.1 Problem Description

As a final application example, consider that the models presented previously assume that
the information in the representation becomes immediately uncorrelated with the ongoing
uncertainties associated with the observing platform’s location. Specifically, when sensory
information is introduced into those models, the pose (location and orientation) uncertainty
of the platform is incorporated by the sensor model and the system relies on multiple obser-
vations of each individual cue to improve the resulting functional estimate. The performance
of such systems is critically related to the accuracy of the estimated pose of the platform as
this strongly affects the resulting functional form.

It is well-known that the estimation of the pose of a platform can be achieved in a number of

(a)

(b)

Figure 5.35: Raw occupancy points from a mine trial: (a) shows the top-down view of the
data and (b) the underside. The points are coloured according to occupancy information:
red indicating empty space, blue occupied and green the fifty-percent point.

(a) Estimated surface.

(b) Surface and unoccupied point samples.

Figure 5.36: Surface estimates for the mine environment obtained from the occupancy
information of Figure 5.35.

(a) Point Samples.

(b) Surface estimate.

Figure 5.37: A second surface estimate. This surface was obtained approximately five hours
after that of Figure 5.36.

(a) Overlaid surfaces.

(b) Volumetric discrepancy between initial surface (wireframe) and final surface
(faceted).

Figure 5.38: The volumetric discrepancies of the mine surfaces shown in Figures 5.36 and
5.37.

ways, including: dead-reckoning, in which incremental changes are accumulated to estimate
the location, and which consequently suffers from an incremental growth in uncertainty;
use of an a priori map against which sensory measurements are compared to obtain pose
estimates with bounded errors; and the SLAM26 approach, in which both the locations
of features within the ‘map’ and the vehicle pose are estimated concurrently. Clearly the
first method suffers from unbounded pose uncertainty and will be unsuitable for generating
consistent long-term representations using a moving platform. Alternatively, while the
second method provides bounded pose uncertainty, the locations of the external features
necessary for the localisation must be known a priori. This is acceptable when those features
are well-defined and known, such as in systems like GPS, but prevents exploration of new
areas if that information is unavailable.

The important advantage of the SLAM approach is that the locations of the various land-
marks used for localisation become correlated through the ‘vehicle’ model and observation
process and it becomes possible to correct not only the current location estimate of the plat-
form, but also the location of landmarks observed ‘in the past’. It is desirable, therefore, to
construct and maintain a functional representation such as those in Sections 5.4–5.6 in the
presence of these ‘corrections’ to landmark locations within the domain Y . Furthermore,
if such methods exist, then it will be additionally possible to utilise the richly-descriptive
information of the functional representation to support the robust identification of, and
discrimination between, different landmarks within the operating environment. An impor-
tant solution to this problem was described in detail in Nieto (2005) and is referred to as
DenseSLAM. This approach utilises a tessellation of the operating domain according to the
location of SLAM landmarks27 and a bilinear warping of the regions between them to pro-
vide the flexibility required. A schematic depiction of the DenseSLAM approach is shown
in Figure 5.39 where the overlaid triangles represent the tessellation of the domain Y = R2
into bilinear local regions and where the ‘adjustment’ of any vertex as a result of the SLAM
algorithm will linearly warp the adjoining regions. Section 5.7.2 examines a generalisation
of that methodology using the functional forms of Chapter 3.

26. Simultaneous Localisation and Map Building.
27. Nieto (2005, §3.5) describes in detail the limitations placed on the domain tessellation.

Figure 5.39: Schematic representation of the DenseSLAM approach of Nieto (2005). The
red triangles represent the tessellation of the space according to the locations of the vertices
and the aerial imagery data is stored in local bilinear regions for each triangle. When the
SLAM operations modify the location of a particular vertex the local regions bounded by
that point are modified linearly.

5.7.2 Warping the Functional Domain

The underlying assumption of the DenseSLAM work is that domain locations y1 , y2 ∈ Y


are highly correlated to one another if they are in a local region, which means that “their
relative position is almost independent of the vehicle pose uncertainty” (Nieto 2005, p57)
while their global location is strongly linked to the location of the landmarks defining that
local region Ylocal . Effectively, this means that the information in that local region can be
manipulated independently of the information defining the location of that region and that
separate regions can also be maintained independently. When localisation errors in the
landmark positions are corrected by the SLAM algorithm, the local data does not change
although its projection into the global frame will.

It should be noted that the issue here relates to the management of errors associated with
the location y ∈ Y to which a given sensory observation corresponds. It is convenient,
therefore, to consider the prototypical problem of identifying the location of a point object
within the operating environment - that is, of a single landmark within the SLAM algorithm.
When true functional forms are manipulated the result is that all locations {y} ∈ Y are

considered in the same way and their locations are corrected. It is assumed that the space
corresponding to these locations remains topologically unchanged (there is no tearing or
overlap of the space as it is modified). Clearly, it is intractable to treat all locations as part
of the SLAM algorithm and, indeed, many of the locations will not have characteristics which
will allow reliable observations to be made in order to update their locations individually.

Let the location of the domain element be y ∈ Y and denote the pose of the platform
making sensory observations by yp ∈ Y . The uncertainty associated with the platform will
be defined by the distribution P [yp ] and for convenience is measured with respect to some
deterministically-known location y0 .28 When an observation corresponding to the location
y is made, the resulting distribution utilised in the Bayesian update is obtained through
the application of the total-probability theory29 to combine the platform uncertainty and
the uncertainty associated with the given observation,

\[
P[y] = \int P[y \mid y_p]\, P[y_p]\, \mathrm{d}y_p. \tag{5.6}
\]
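For the Gaussian case noted in footnote 29, where the sensor model depends only on the offset (y − y_p), Equation 5.6 reduces to a convolution and the covariances simply add; a minimal sketch of that special case is given below (the general case requires the full integral).

```python
import numpy as np

# A sketch of the Gaussian special case of Equation 5.6 noted in footnote 29: if the
# sensor model depends only on the offset (y - y_p), the marginal is a convolution and,
# for Gaussian models, means and covariances simply add. This is illustrative only.
def marginal_observation(mu_offset, cov_obs, mu_pose, cov_pose):
    """Combine a sensor-frame offset observation with the platform pose uncertainty."""
    mean = np.asarray(mu_pose) + np.asarray(mu_offset)
    cov = np.asarray(cov_pose) + np.asarray(cov_obs)
    return mean, cov
```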

Prior to considering the application of the functional forms of Chapter 3 to this situation, it
is important to consider the consistency of the SLAM algorithm. There are two particular
cases of the data-association problem in a typical implementation: when the observation for
a known landmark has an uncertainty which fails to overlap with the map uncertainty for
that landmark; and when the uncertainty associated with the observation intersects with
multiple map elements. The second of these represents the mathematical definition of data
association, while the first is the result of an inconsistent model for the given system and is
shown in Figure 5.40. This inconsistency presents a severe problem for mapping algorithms
as it will violate the assumption that the space in which the data are embedded remains
topologically intact throughout the operation. It will be assumed, therefore, that the SLAM
algorithm used in this approach remains consistent.

Note that in the traditional SLAM algorithm, when the landmark at y is added to the
map, it is added such that it and the vehicle pose are correlated. This means that when
any feature location (or the vehicle position) is updated, the statistical relationship will
28. In the traditional SLAM algorithm this corresponds to the initial vehicle location and, since the maps generated are locally consistent, this is often treated as a Dirac delta; alternatively, this is the reason why the SLAM algorithm converges to the initial vehicle uncertainty and not to zero in the limit.
29. This appears as a convolution in the typical SLAM implementation when the sensor uncertainty P[y|y_p] is a function of the difference (y − y_p).


(a) Consistent (b) Inconsistent

Figure 5.40: Inconsistencies in the SLAM algorithm. The first figure shows a consistent al-
gorithm where the observations of the target (shown in red) overlap, while the second shows
the case where the vehicle uncertainty is too small at the time of the second observation (in
blue) and the resulting distributions do not overlap.

update all other features also. Since it is impractical to store all locations as landmarks,
consider the case where the location y is determined according to the uncertainty model of
Equation 5.6 but not added to the SLAM map. Significantly, since the location is known
according to the distribution P [y], which is now not correlated to the SLAM map, then it
can never be updated such that the uncertainty of the location is improved. This obviously
represents a sub-optimal version of the previous approach, in that it utilises a number of
‘key’ landmarks to control the adjustment of the locations {y} but does not maintain the
individual locations separately for computational reasons. Critical to this method, however,
is the fact that this data remains consistent at all future times, provided that the SLAM
algorithm remains consistent also.

In practice, the approach introduced here defines the domain Y as the ‘uncorrected’ ob-
servation domain and introduces the domain Y′, a topologically equivalent domain which
is deformed according to the results of the SLAM algorithm. This naturally defines the
bijective transformation,
\[
T_Y : y \mapsto y', \tag{5.7}
\]
and if all elements of Y were updated using the SLAM algorithm, the transformation would
be ‘completely defined’. Instead, only a set of landmarks, {y_{l_i}} ∈ Y for i ∈ Z+, are
used to estimate this transformation. Provided that the sensory data are added to the

representation using the model of Equation 5.6, then this will ensure consistency in the
final result.
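A minimal sketch of the piecewise-linear special case of T_Y used by the bilinear, tessellated approach of Nieto (2005) is given below: a point keeps its barycentric coordinates within a triangle of landmark locations, so correcting the vertices moves the point linearly with them. This illustrates the idea only; it is not the DenseSLAM implementation.

```python
import numpy as np

# A sketch of the piecewise-linear special case of T_Y used by the bilinear, tessellated
# approach of Nieto (2005): a point keeps its barycentric coordinates within a triangle of
# landmark locations, so correcting the vertices moves the point linearly with them.
def warp_point(y, verts, verts_corrected):
    """y: (2,) point; verts, verts_corrected: (3,2) original and corrected triangle vertices."""
    A = np.column_stack([verts[1] - verts[0], verts[2] - verts[0]])
    u, v = np.linalg.solve(A, y - verts[0])        # barycentric weights relative to vertex 0
    return (verts_corrected[0]
            + u * (verts_corrected[1] - verts_corrected[0])
            + v * (verts_corrected[2] - verts_corrected[0]))
```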

Consider the methodology by which the sensory data is managed in this arrangement: when
information is first made available to the system it is obtained according to the domain Y 30 ;
when the sensory data is interpreted, the information in the map is deformed according to
the transformation TY ; and when an observation occurs in a known region, the observation
model is determined in the space Y, rather than the corrected space Y′. This final operation
means that the data is maintained using a fixed domain and there is no need for the space
Y to be re-sampled or modified within the system. In practice, the observation would
correspond to the determination of P[y′|y_p] which, utilising the bijective nature of the
transformation T_Y, allows the construction of the ‘uncorrected’ distribution P[y|y_p]. This,
when combined with the pose distribution P[y_p], yields the appropriate distribution in Y.
Figure 5.41 shows an example of this approach in which the uncorrected domain is depicted
in black and the corrected domain in red. As the process continues, the domain Y  will
constantly change as the transformation TY approaches the true transformation necessary
to correct the initial domain Y. Each observation is obtained with the platform in the
location estimated by P[y′_p] and the sensor model yields the corrected observation P[y′].
Rather than add this data to the map in this form, however, it is transformed using T_Y^{-1}
to generate P[y], the location in the uncorrected space Y. It is only when the location
estimate is required externally to the system that the resulting distribution P[y|Z^k] is
transformed using the current transformation T_Y to obtain P[y′|Z^k].

5.7.3 Point-based Transformation Operations

Recognising that the transformation TY defined in the previous section describes the effect
of the operation of a SLAM algorithm on the functional forms defined as x(y), it is easy
to see that the nature of the transformation process, given the corrections to the landmark
locations, is an approximation to the case in which all points y are maintained as part of
that algorithm. Specifically, the bilinear, tessellated approach of Nieto (2005) represents
an important special case of this transformation. It must be noted, however, that the
author does not include the platform pose uncertainty into the dense data stored in those
representations, which can cause inconsistency. Consider the example shown in Figure
30. In fact, this is the definition of the domain Y.

[Figure 5.41 diagram: landmark locations y_{l1}, …, y_{l6} in the uncorrected domain Y and their corrected counterparts y′_{l1}, …, y′_{l6} in Y′, together with the distributions P[y], P[y′] and the platform pose estimate P[y′_p].]

Figure 5.41: The two domains introduced by the morphable-domain algorithm. The data
is obtained firstly using the uncorrected domain Y, shown in black; but the addition of
repeated observations of known landmarks, shown as crosses, causes the correction of that
domain to yield Y′. Observations are always obtained with respect to the current estimate
of the true platform location, P[y′_p], with the mean shown as an open circle. Once the
total probability theorem has been applied, this yields the observation distribution P[y′];
however, the internal representation transforms this into P[y] for storage and manipulation.

5.42, where two observations of a single location are made using the estimated trajectory
shown as the solid line, while the true path is shown dashed. The sensor-uncertainty-only approach
of Nieto (2005) will yield the uncertainty ellipses shown in red, while the conservative approach outlined in the
previous section would yield the uncertainty ellipses shown in blue. Clearly the map will
be consistent in the second case, but not in the first.

It should be noted that if the second approach is utilised, then there is no need for the
landmarks defining the transformation TY to reach a certain degree of correlation prior
to storing the dense information, as this appears only necessary if the inconsistent dense
mapping approach is used. Rather, while the localisation is poor (as will be the case prior
to this time), the dense information will be poorly located but consistently maintained, and
as additional data is made available the map will resolve.

Finally, note that a particular methodology for determining the form of the transformation
TY can be identified with recent advances in visualisation using point-based geometries
(Dill 2004). In particular the approaches of Guo et al. (2004) present a comprehensive
methodology for utilising a number of control points to generate warping effects for visual

Figure 5.42: Consistency in location information for two approaches. The platform follows
the ‘true’ path shown dashed and the localisation used is shown as the solid line with
uncertainties shown as black ellipses. The observations in red are those for which only the
sensor effects are considered and the result is a clearly inconsistent estimation, whereas the
blue ellipses show observations made using the platform uncertainty when the data is added
and the result remains consistent. Note that in the SLAM algorithm itself, the observation
model uses the second of these.

objects defined using point sets. While it is beyond the scope of this thesis, the combination
of the approaches outlined in the previous section and these methodologies may present a
powerful new approach to the incorporation of rich environmental data within the context
of uncertain navigation.

5.8 Conclusions

This chapter has described a computational framework developed to address the theoretical
issues associated with the model presented in Chapters 3–4. In particular, it has demon-
strated that the functional models can be mapped in a meaningful way to a computational
framework characterised by parallelisation of the operations and the utilisation of multi-
resolution techniques. Furthermore, the viability of the model presented in this thesis was
examined in the context of four case studies associated with simplified real-world scenarios.
Chapter 6

Conclusions and Future Work

6.1 Introduction

This chapter briefly examines the contributions of this thesis to the field and presents
several suggestions for the direction of future research. Section 6.2 summarises the main
contributions related to the development of the new model, and Section 6.3 outlines a
number of important, unexplored implications and proposes directions for research arising
from this model.

6.2 Summary of Contributions

This thesis has developed a new framework for examining the problem of collecting, storing
and reasoning with sensory information in the context of autonomous systems development.
As a new approach it draws its value from the utilisation of function-based approaches to
describing the constituent elements of the problem. In particular, this approach gives rise
to two significant advances over state of the art:

• The means to estimate over infinite spaces that have previously been addressed only
through discretisation. Specifically, rather than address continuous problems - like
the estimation of a height-map of a ground surface - as inherently discretised, the
model approaches these problems as the estimation of continuous functions through
the analysis of the parameters which describe these functions.

• The means to decouple the sensor representation from the application of task-specific
inference. Defining the role of sensing as the accumulation of a set of ‘sufficient
statistics’ of the sensory data leads to a conception of representation as the finite-
space projection of the infinite input space. Furthermore, defining reasoning tasks as
the task-specific re-interpretation of these statistics allows the assumptions and biases
necessary for achieving a particular task to be clearly identified.

Fundamental to the application of this new approach to the problem as stated is the ex-
plicit recognition of the necessity of approximation. The redefinition of the estimation as
a functional form does not remove the infinite-dimensional nature of the estimation of in-
finitely variable characteristics; rather it directly highlights the complete nature of those
characteristics and provides the mechanism for the informed selection of a finite approx-
imation. Whereas, previously, representations were selected arbitrarily and manipulated
to approximate the infinite problem, this interpretation implies that the selection of the
representation is guided by the quantification of the effects of approximation. Furthermore,
the measure-theoretic nature of those quantifications is readily extended to assessing the
on-line performance of the various components of the model.

The following sections consider the details of several specific contributions made in the
course of achieving this over-arching goal.

6.2.1 Function-based Measure Theoretic Framework

As background to the development of functional estimation schemes, elementary measure


theory was summarised and a framework was constructed which supports the derivations
and interpretations necessary for the new approach. Specifically, the link between the set-
theoretic nature of measure theory and the construction of functions on arbitrary spaces was
shown to suggest new interpretations of 1) the nature and relationships of probabilistic and
information-theoretic measures, and 2) the concepts of distances and divergence measures
between elements which are functions.

6.2.2 Information-theoretic Quantities

As implied above, entropy and mutual information can be reinterpreted in the context of a
framework of function-based measure theory. In particular:

• The nature of the mutual information was identified as the expectation of a new
quantity, the ‘event’ mutual information (analogous to the relationship between the
Shannon information content of an outcome and the distribution entropy). Each of
these four quantities and the relationships between them was examined in detail and
several surprising insights were gained. Most significantly, the potential negativity of
the ‘event’ mutual information was shown to give rise to many of the key properties
of the distribution mutual information.

• This interpretation as an expectation was extended to encompass the case where
more than two random variables were considered simultaneously. Consequently, this
approach justifies the existence and identifies the meaning of these three or more
variable expressions, and asserts that their potential negativity is important.

• Thirdly, the relationships between information-theoretic quantities were examined in
detail and two commonly used approaches (interval diagrams and Venn diagrams) were
considered. Interval diagrams, while accurately depicting the relationship between the
quantities defined for two random variables, fail to generalise to higher-order systems,
and Venn diagrams were shown to rely on a mistaken interpretation of the depiction. A
novel approach utilising a vector-space construction was shown to provide an intuitive
and completely general description of the relationships between the measures for any
number of random variables.

• The construction of the previous case highlighted the existence of a commonly over-
looked information-theoretic quantity defined for two random variable cases. In fact,
this was identified as the only quantity which was not recognised in the literature
as having a distinct practical application. The so-called ‘statistical information dis-
tance’ was demonstrated to have an intuitive interpretation as a divergence measure
capturing the degree of statistical independence between the variables. In fact, for
discrete systems, the quantity was demonstrated to satisfy all necessary axioms for
classification as a true distance measure.

6.2.3 Distance Measures

Section 6.2.1 also demonstrated that the framework allowed the re-interpretation of the
concept of divergence or distance measures. In particular, interpretation of these measures

as applicable to entities corresponding to functions implied that many of the key measures
are misinterpreted in the literature. A review was conducted of a wide variety of important
measures for 1) functions, including Euclidian, Ln, Mahalanobis and Hellinger-Bhattacharyya
distances, and 2) distributions, encompassing Kullback-Leibler divergences, Csiszár’s mea-
sures, mutual information (as an affinity) and the statistical information distance. The
nature and assumptions underlying each of these was considered in detail.

In addition to the foregoing, the interpretation of these measures as functions enabled the
development of a novel closed-form analytic solution for calculating the Euclidian distance
between two Gaussian distributions.

6.2.4 Parameterisations and Function Estimation

As already noted, the rewriting of the representation and estimation problem as that of
estimating functions gave rise to formally intractable solutions in that the infinite-variability
of the systems remained intact. However, the identification of the problem as function
estimation implies the existence of an arbitrary basis for the description of those functions,
for example, utilising a Fourier or wavelet representation for L2 -integrable functions. This
was shown to give rise to the ability to arbitrarily modify the parameterisation of functions
being estimated, and subsequently, to the identification of quantifiable approximations.
Given the ability to measure the impact which a given scheme will have on the validity and
accuracy of a functional representation, the engineer can design a representation scheme to
match a particular problem. Thus, the selection of the representational form is driven by
the assessment of approximations and not by arbitrary selection.

6.2.5 Reasoning as Transformation

The definition of the goal of sensing and reasoning systems as the estimation of functions
gives rise to the obvious extension of considering the arbitrary transformation of the func-
tion data to yield new quantities. A general transformation was introduced and was shown
to give rise to the application of biases to the re-interpretation of the data contained in a
given functional form; that is, rather than adding or changing the inherent content of the
representation, a transformation was recognised as corresponding to a process of enhancing
certain characteristics and suppressing others. As these operations do not add information

to the system, they were shown to be analogous to statistically-justifiable reasoning operations.
In this way, reasoning operations were demonstrated to correspond to the application
of these transformations.

6.3 Further Research

It should be clear from the previous section that this thesis has introduced a new methodology
and framework for considering the nature of the sensing and reasoning problem. A result
of the substantial extent of new material which has been demonstrated is the identification
of significant potential for extending and applying this framework in real-world situations.
Two primary areas deserve consideration: further development of the theoretical framework
and the implementation of this technique in real world examples. Since the acceptance of
this new approach depends critically on the performance of the method when applied to
real scenarios, the practical developments are of greatest interest here.

Several specific areas for future research are identified in the following sections.

6.3.1 Theoretical Development

Information Theoretic Quantities for Systems with More Than Two Random
Variables

As noted in Section 2.7, the new approach highlights the existence of information-theoretic
quantities for combinations of more than two random variables (specifically the mutual
information I(X; Y ; Z) of Section 2.5.7). However, aside from the identification that these
measures exist, many appear to have no intuitive meaning. It remains to be seen what,
if any, meaning can be ascribed to these quantities and to what beneficial use they may
be put. For example, the mutual information I(X; Y ; Z) informs the interpreter about
the degree to which observing a single random variable affects the statistical dependencies
between the remaining variables; but what practical use does this serve?
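
As a concrete illustration of why such terms are subtle, the following sketch (an illustrative
construction, not an example taken from this thesis) evaluates the three-variable quantity under
the inclusion-exclusion convention I(X;Y;Z) = I(X;Y) − I(X;Y|Z) for two independent fair bits
X, Y and Z = X xor Y; the result is negative, because observing Z creates a dependence between
X and Y that is absent marginally.

import numpy as np
from itertools import product

P = np.zeros((2, 2, 2))
for x, y in product([0, 1], [0, 1]):
    P[x, y, x ^ y] = 0.25          # joint distribution of (X, Y, Z = X xor Y)

def H(p):
    """Entropy (in bits) of an array of probabilities, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

Hx, Hy, Hz = H(P.sum(axis=(1, 2))), H(P.sum(axis=(0, 2))), H(P.sum(axis=(0, 1)))
Hxy, Hxz, Hyz, Hxyz = H(P.sum(axis=2)), H(P.sum(axis=1)), H(P.sum(axis=0)), H(P)

# Inclusion-exclusion form of the three-variable mutual information.
I_xyz = Hx + Hy + Hz - Hxy - Hxz - Hyz + Hxyz
print(I_xyz)    # -> -1.0 bit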

Selection of the Functional Domain and Tensor Range

The thesis utilises the function representation x(y) as the most basic element of the inter-
pretation. The selection of this functional form influences the utilisation of sensory data,
the possible transformations and, therefore, reasoning tasks, and it also affects how the
data is maintained practically (as in the case of the spatial representation suggesting a
kD-tree implementation in Chapter 5). However, the identification of this quantity remains
a necessary decision to be made by the designer, rather than being inherently specified by
the nature of the problem at hand. An interesting question is whether there exists any
mechanism by which the functional form does not need to be explicitly selected by the
engineer.

Confirmation of Reasoning Operations as Transformations

It has been hypothesised throughout this thesis that many commonly used learning and
artificial intelligence techniques can be reduced to special cases of the general transforma-
tional approach of Section 4.3. While this is reasonable (as many learning approaches are
inherently related to the discovery of transformations from an input space to an output
space) this conjecture remains to be more fully explored. This is particularly true when
considering the role of semantic logic and other ‘higher-order’ approaches.

6.3.2 Practical Development

Measure Selection

The nature of the measure-theoretic formulation implies that the criteria for selection of
a measure depend solely on the performance of that particular measure in a given situa-
tion. This leaves as an open problem the issue of the selection of which measure(s) to use
in a given scenario. Several practical situations relating to this problem are the approxi-
mation of a probability density function by an alternative form and validation-gating in a
‘traditional’ estimation scheme. In the first case the concept of ‘best fit’ may correspond
to minimising the Euclidian distance between them, though there is no reason why the
Ln norms could not also be utilised. Similarly, for a validation-gating situation it can be

shown that an appropriate measure would emphasise discrepancies in the ‘tails’ of the dis-
tributions. A thorough investigation of which measures are appropriate for a given set of
concrete situations, including those mentioned above, would provide some insight into this
dilemma.

Implementation of Functional Models in Real Systems

Although Chapter 5 showed the application of the models presented in this thesis to simpli-
fied real-world examples, there remain many technical challenges related to the application
of these techniques to actual real-world situations. Consider the AUGV navigation prob-
lem from Sections 5.4–5.5: in practice the data stored in the ‘global’ representation would
contain an order of magnitude more data than is required at any specific instant for any
specific algorithm, suggesting the utilisation of various levels of data caching will be required
in practice. For example, data may be stored in memory, on a hard-disk drive unit or on
a remote storage server and the application of efficient techniques from computer science
to the implementation of these models is required. Likewise, the sharing of data between
the global representation and the vehicle for the local aspect of the problem represents a
technical challenge regarding communications limitations.

Furthermore, what parameterisations and function bases support effective implementations,
and are there some approaches which are better suited to certain applications than others?

Development of Real Sensor Models

Section 3.4.2 introduced the sensor model concept for the general functional form and
the transformations utilised in this model were discussed in Section 4.3. Simple examples
applying these models to range-bearing sensors were discussed in the text; however, the
application of such models to real-world sensors of greater complexity is required. As a
specific example, consider that an FMCW radar unit, such as that described in Section 5.6,
yields not only the range to a target but instead a radar energy spectrum along the path
which the beam follows. Thus, these systems actually measure the retro-reflectivity of the
environment at the wavelength of operation, for all points in the beam at once. The viable
mapping of this information to a reliable sensor model and functional form is yet to be
demonstrated.

Implementation of Real Reasoning Operations

The implementation of the functional transformations of Section 4.3 to real reasoning trans-
formations presents many possibilities. For example, the application of operations ranging
from the relatively simple to the highly complex appear to be describable using this tech-
nique. Consider the following examples:

• The detection of an aircraft in an airport radar system;

• The utilisation of radar information such as polarisation and reflectivity characteristics
to identify the type of aircraft;

• The development of the data obtained from airborne hyperspectral imagery so as to
generate terrain classifications suitable for mission planning;

• The ability to identify environmental characteristics which will be suitable as targets
for navigation algorithms;

• The ability of an AUGV system to perform an operation such as “Follow the road,
turn left at the third tree and cross the creek”.

These examples represent an increasing degree of complexity in the implementation and
there are many other practical applications which can be explored.

Completion of the Warpable-Domain Theories

Finally, Section 5.7 introduced a new approach to the mapping and localisation problem in
the case where data other than that defining geometric features is required. Specifically,
that section examined a functional form which generalised the approach of Nieto (2005)
and shows significant promise in the development of consistent algorithms for such situa-
tions. This theory, however, remains a conceptual one and the development of a practical
implementation would represent a significant contribution.
Appendix A

Tensor Geometry

Tensor geometry is the branch of mathematics dealing with the extension of geometry into
non-Euclidian forms, that is, non-orthogonal and non-flat spaces. This appendix considers
a very limited and introductory analysis of this branch of mathematics and most of the
details here are derived from Amari & Nagaoka (2000, Chapter 1). Consider a vector in an
$n$-dimensional space, $\mathbf{x} = x^1 \hat{e}_1 + x^2 \hat{e}_2 + \cdots + x^n \hat{e}_n$, with unit vectors $\hat{e}_i$; that is, a vector defined
by $n$ contravariant components $x^i$ measured along the unit vectors which span the space. Conversely, it is
possible to describe the same vector using a different set of unit vectors $\hat{e}^j$ and components
$x_j$, where the components are measured perpendicular to each basis vector; these are known
as the covariant components. Figure A.1 demonstrates these contravariant and covariant
coordinates (measured with respect to their different basis vectors); formally,

$$\mathbf{x} = x^1 \hat{e}_1 + x^2 \hat{e}_2 + \cdots + x^n \hat{e}_n \qquad (A.1)$$
$$\phantom{\mathbf{x}} = x_1 \hat{e}^1 + x_2 \hat{e}^2 + \cdots + x_n \hat{e}^n . \qquad (A.2)$$

From the construction shown in the figure, the relationship between the reciprocal bases $\hat{e}_i$
and $\hat{e}^i$ is such that the perpendicular projection of the vector onto one set of basis vectors,
the $\hat{e}_i$ for example, yields the parallel projection on the reciprocal basis. Clearly, if the basis
vectors are perpendicular then the contravariant and covariant bases are identical. The
remarkable relationship between the reciprocal bases allows the construction of an entity
known as the 'metric tensor' which generates a 'metric' (or measure) which is invariant
to the selection of the basis set; obviously the vector $\mathbf{x}$ should have the same 'length'


Figure A.1: Contravariant and covariant coordinates for a vector x are shown using super-
scripts and subscripts respectively, with the covariant basis vectors shown in grey. Note that
the coordinates in one basis can be found by projection onto the reciprocal basis vectors.

regardless of what coordinate system is used. The 'uniform' or Euclidian metric tensor is
defined according to the relationship between these coordinates so that,

$$x_i = \sum_j g_{ij}\, x^j \qquad (A.3)$$

and

$$x^j = \sum_i g^{ij}\, x_i \qquad (A.4)$$

with the matrices with $(i,j)^{th}$ elements $g_{ij}$ and $g^{ij}$ denoted by $G = (g_{ij})$ and $G^{*} = (g^{ij})$
respectively, and requiring that $G G^{*} = I$ so that $x^i$ is recovered when transformed to $x_i$
and back. It is common to utilise Einstein's convention and treat a repeated index as an
implicit summation; that is, Equation A.3 could be rewritten as $x_i = g_{ij} x^j$.

The utility of this formulation is that when such a $G$ exists the value of an inner product on
the space is invariant to changes in the basis used. This can be expressed (using Einstein's
convention) as

$$\langle x, y \rangle_1 = g_{ij}\, x^i y^j = g^{ij}\, x_i y_j = x^i y_i = x_i y^i \qquad (A.5)$$

where the subscript 1 is used to denote that the uniform metric tensor is used. Formally, the
inner product is a bilinear form defined on the 'tangent' space, which is the space spanned
by the linearised coordinate vectors $e$. This means that it is defined as a bilinear function
of two tangent vectors to the space at the point $p$, say. Now, since a general space will be
arbitrarily curved, the linearised coordinates, hence the tangent space and also the inner
product, will depend explicitly on the point $p$ at which the tangent space is taken. The
'Riemannian' metric is defined as the mapping $g : p \mapsto \langle \cdot\,, \cdot \rangle_p$ and, for a given coordinate
system $[\xi^i]$ say, is given by $g_{ij}(p) = \langle \frac{\partial}{\partial \xi^i}, \frac{\partial}{\partial \xi^j} \rangle_p$. While curved spaces require the definition
of affine connections to augment the structure of the space and allow inner products to be
constructed between different points $p_1$ and $p_2$, this construction is beyond the scope of this
document, though extensions of the material presented here to curved spaces will require
this additional complexity.

However, for a flat space, it is possible to show that the tangent space is uniformly deter-
mined and is spanned by the same basis vectors as the space itself. This means that it is
possible to show that the uniform metric tensor $g_{ij}$ here depends only on the selection of
the basis vectors. Specifically, the entries of $G$ are determined by the contravariant basis
vectors $\hat{e}_i$ according to

$$g_{ij} = \langle \hat{e}_i, \hat{e}_j \rangle_1 \qquad (A.6)$$

where the inner product is interpreted as the projection operation between pairs of basis
vectors.

It is also possible to define other metric tensors $\mu_{ij} = f_\mu(x_i, y_j)\, g_{ij}$ which give rise to the
inner product,

$$\langle x, y \rangle_\mu = \mu_{ij}\, x^i y^j . \qquad (A.7)$$

Additionally, we define the concept of a norm in the space according to the expression,

$$\|x\|^2_\mu = \langle x, x \rangle_\mu . \qquad (A.8)$$

Consider the space shown in Figure A.2, which shows the vector $\mathbf{x} = \frac{1}{\sqrt{2}}\hat{i} + \frac{1}{\sqrt{2}}\hat{j}$ and
two basis sets, the orthogonal basis $B_1 = \{\hat{i}, \hat{j}\}$ and the non-orthogonal basis
$B_2 = \{\hat{e}_1 = \frac{\hat{i} + \sqrt{3}\hat{j}}{2},\ \hat{e}_2 = \frac{\sqrt{3}\hat{i} + \hat{j}}{2}\}$.

Figure A.2: A simple change of basis for a unit vector $\mathbf{x} = \frac{1}{\sqrt{2}}\hat{i} + \frac{1}{\sqrt{2}}\hat{j}$. When examined
in two bases, $B_1 = \{\hat{i}, \hat{j}\}$, which is orthogonal, and $B_2 = \{\hat{e}_1, \hat{e}_2\}$, which is not, the norm of
the vector $\|\mathbf{x}\|^2$ does not change.

The uniform metric tensors for these two bases are given by

$$G_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad (A.9)$$

$$G_2 = \begin{bmatrix} \langle \hat{e}_1, \hat{e}_1 \rangle & \langle \hat{e}_1, \hat{e}_2 \rangle \\ \langle \hat{e}_2, \hat{e}_1 \rangle & \langle \hat{e}_2, \hat{e}_2 \rangle \end{bmatrix} = \begin{bmatrix} 1 & \frac{\sqrt{3}}{2} \\ \frac{\sqrt{3}}{2} & 1 \end{bmatrix} . \qquad (A.10)$$

Now, the covariant coordinates can be obtained by taking the inner product of the vector
with the contravariant basis vectors and vice-versa, yielding

$$\mathbf{x}_{\mathrm{covariant}} = \langle \mathbf{x}, \hat{e}_1 \rangle \hat{e}^1 + \langle \mathbf{x}, \hat{e}_2 \rangle \hat{e}^2 = \frac{1 + \sqrt{3}}{2\sqrt{2}}\,\hat{e}^1 + \frac{1 + \sqrt{3}}{2\sqrt{2}}\,\hat{e}^2$$

and

$$\mathbf{x}_{\mathrm{contravariant}} = (g^{1j} x_j)\,\hat{e}_1 + (g^{2j} x_j)\,\hat{e}_2 = \frac{\sqrt{3} - 1}{\sqrt{2}}\,\hat{e}_1 + \frac{\sqrt{3} - 1}{\sqrt{2}}\,\hat{e}_2 \qquad (A.11)$$

and it can be verified that $\|\mathbf{x}\|^2 = 1$, regardless of the basis set used.
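
This change-of-basis calculation is easily checked numerically. The sketch below (using NumPy
purely as an illustrative tool, with the basis vectors assumed as in the example above) forms the
uniform metric tensor, recovers the covariant and contravariant components, and confirms that
the squared norm is unchanged.

import numpy as np

x = np.array([1.0, 1.0]) / np.sqrt(2)                    # the vector in the orthonormal basis
E = np.column_stack([[0.5, np.sqrt(3) / 2],              # e1 = (i + sqrt(3) j)/2
                     [np.sqrt(3) / 2, 0.5]])             # e2 = (sqrt(3) i + j)/2

G = E.T @ E                     # uniform metric tensor, g_ij = <e_i, e_j>
G_star = np.linalg.inv(G)       # reciprocal metric tensor, g^ij

x_cov = E.T @ x                 # covariant components, x_i = <x, e_i>
x_con = G_star @ x_cov          # contravariant components, x^i = g^ij x_j

# The squared norm x^i x_i is invariant and should equal |x|^2 = 1.
print(x_cov)            # -> ~[0.966 0.966], i.e. (1 + sqrt(3)) / (2 sqrt(2))
print(x_con)            # -> ~[0.518 0.518], i.e. (sqrt(3) - 1) / sqrt(2)
print(x_con @ x_cov)    # -> 1.0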

Consider now the extension of the vectors from finite dimensional spaces to infinite ones,
that is, to function spaces. Here define the domain of the functions to be x ∈ [x1 , x2 ]
and consider the functions a(x), b(x) and c(x) defined over that interval. As seen earlier,

these functions can be treated as infinite-dimensional vectors defined over the infinite set
of Dirac delta functions. Applying the conversion from summations to integrals in the
necessary equations allows continuous versions of the tensor approach above to be defined.
Since this interpretation assumes that the Dirac delta functions are independent, then
this is equivalent to using an orthonormal basis and there is no requirement for different
contravariant or covariant functions $a^{*}(x)$ and $a_{*}(x)$. Let $e_i = \delta(x - x_i)$ be considered to
be the 'basis' vectors and write the analogy of Equation A.1 as

$$a(x) = \int_{x_1}^{x_2} a(x_i)\,\delta(x - x_i)\,dx_i . \qquad (A.12)$$

In this case, as noted above, the $e_i$ are orthogonal, so $G = I$ and it is possible to write

$$g(x_i, x_j) = \delta(x_i - x_j) \qquad (A.13)$$

and the inner products of Equation A.5 become

$$\langle a(x), b(x) \rangle_1 = \int_{x_1}^{x_2}\!\!\int_{x_1}^{x_2} a(x_i)\, b(x_j)\, g(x_i, x_j)\,dx_i\,dx_j = \int_{x_1}^{x_2} a(x)\, b(x)\,dx . \qquad (A.14)$$

Defining a non-uniform metric tensor as a continuous function as

$$\mu[a(x_i), b(x_j)] = f_\mu[a(x_i), b(x_j)]\, g(x_i, x_j) = f_\mu[a(x_i), b(x_j)]\, \delta(x_i - x_j) \qquad (A.15)$$

and rewriting the inner product as

$$\langle a(x), b(x) \rangle_\mu = \int_{x_1}^{x_2}\!\!\int_{x_1}^{x_2} a(x_i)\, b(x_j)\, f_\mu[a(x_i), b(x_j)]\, \delta(x_i - x_j)\,dx_i\,dx_j = \int_{x_1}^{x_2} a(x)\, b(x)\, f_\mu[a(x), b(x)]\,dx \qquad (A.16)$$

yields the expected analogy for Equation A.8 as

$$\|a(x)\|^2_\mu = \langle a(x), a(x) \rangle_\mu . \qquad (A.17)$$



Finally, it is also possible to express this final result in a more traditional form which
captures the space-dependent property of the measure, though it obfuscates the dependence
on the measure function $f_\mu$,

$$\langle a(x), b(x) \rangle_\mu = \int_{x_1}^{x_2} a(x)\, b(x)\,d\mu , \qquad (A.18)$$

where $d\mu$ is the infinitesimal element of the measure function $f_\mu$.
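
As a minimal numerical sketch of the function inner product of Equation A.16, the following
approximates $\langle a, b \rangle_\mu$ on a discretised domain; the functions a, b and the weighting f_mu below are
illustrative choices only, not quantities taken from this thesis.

import numpy as np

x = np.linspace(0.0, 1.0, 2001)           # the domain [x1, x2] = [0, 1]
a = np.sin(2 * np.pi * x)                 # an example element a(x)
b = np.cos(2 * np.pi * x)                 # an example element b(x)

def inner(a, b, f_mu, x):
    """Discretised version of <a, b>_mu with metric weighting f_mu."""
    return np.trapz(a * b * f_mu(a, b), x)

uniform = lambda a, b: np.ones_like(a)    # recovers the ordinary L2 inner product
weighted = lambda a, b: 1.0 + a ** 2      # an arbitrary non-uniform weighting

print(inner(a, a, uniform, x))    # ~0.5, the squared L2 norm of sin over one period
print(inner(a, b, uniform, x))    # ~0.0, sin and cos are orthogonal
print(inner(a, a, weighted, x))   # the norm under the non-uniform metric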


Appendix B

Ordinate Transformation
Invariance for Mutual Information

B.1 Simple Transformation

Another remarkable property of the mutual information measure I(.) is that the value is
unchanged by an arbitrary transformation of the coordinates of the individual distributions.
It is sufficient to show that this result holds for I(X; Y ), and by inference extends to all
derived quantities. In fact, it will be demonstrated that application of an arbitrary one-to-
one mapping to both x and y does not change the measured value of the MI and that the
application of many-to-one mappings will necessarily reduce the value of the MI. This is
essentially an extension of the Data Processing Theorem.

X --(fv)--> V
Y --(fw)--> W

Consider the four random variables x, y, v and w related as shown above where the arrows
represent the following functional relationships:

Tv : x → v = fv (x) (B.1)

Tw : y → w = fw (y) . (B.2)

It is desired to demonstrate the relationship between I(X; Y ) and I(V ; W ) given these
relationships. The chain rule for Mutual Information is given by Cover & Thomas (1991,
p22):

$$I(A; B_1, B_2, \ldots, B_n) = \sum_i I(A; B_i \,|\, B_{i-1}, B_{i-2}, \ldots, B_1) . \qquad (B.3)$$

Using this it is possible to write the following expansions of the Mutual Information for the
four-term case:

I(X; Y, V, W ) = I(X; V, W ) + I(X; Y |V, W )

= I(X; Y, W ) + I(X; V |Y, W ) (B.4)

and

I(V ; X, Y, W ) = I(V ; X, Y ) + I(V ; W |X, Y )

= I(V ; Y, W ) + I(V ; X|Y, W ) (B.5)

where the chain rule has been partially expanded in each case. The pair of equations in B.4
suggest that it is possible to write

I(V ; X|Y, W ) = I(X; V, W ) + I(X; Y |V, W ) − I(X; Y, W ) (B.6)

and, using the Equations in B.5,

I(V ; X|Y, W ) = I(V ; X, Y ) + I(V ; W |X, Y ) − I(V ; Y, W ) . (B.7)

But these two expressions must be equal so that

I(X; V, W )+I(X; Y |V, W )−I(X; Y, W ) = I(V ; X, Y )+I(V ; W |X, Y )−I(V ; Y, W ) . (B.8)

However, each of the terms involving three variables can be further expanded with the chain

rule as

I(X; V, W ) = I(X; V ) + I(X; W |V ) (B.9)

I(X; Y, W ) = I(X; Y ) + I(X; W |Y ) (B.10)

I(V ; X, Y ) = I(V ; X) + I(V ; Y |X) (B.11)

I(V ; Y, W ) = I(V ; W ) + I(V ; Y |W ) (B.12)

resulting in

I(X; Y ) = I(V ; W ) + I(X; Y |V, W ) − I(V ; W |X, Y ) + I(X; W |V )

−I(X; W |Y ) + I(V ; Y |W ) − I(V ; Y |X) . (B.13)

If the model is expanded for the probabilities involved, then

P (v, w|x, y) = P (v|x, y)P (w|v, x, y)

= P (v|x, y)P (w|x, y) (B.14)

since the value of w is uniquely determined by x and y (in fact, by y alone). This suggests
that v and w are statistically independent given x and y. It is therefore possible to write

I(V ; W |X, Y ) = 0 . (B.15)

Similarly,

P (x, w|y) = P (x|y)P (w|x, y)

= P (x|y)P (w|y) , (B.16)

since w is uniquely determined by y, thus making x and w statistically independent given
y. Hence,

I(X; W |Y ) = 0 . (B.17)

Also,

P (y, v|x) = P (y|x)P (v|y, x)

= P (y|x)P (v|x) , (B.18)

and since v is uniquely determined from x, then y and v are statistically independent given
x and
I(Y ; V |X) = 0 . (B.19)

Applying Equations B.15, B.17 and B.19 to Equation B.13, yields

I(X; Y ) = I(V ; W ) + I(X; Y |V, W ) + I(X; W |V ) + I(V ; Y |W ) . (B.20)

Therefore, since I(·) is non-negative,

I(X; Y ) ≥ I(V ; W ) . (B.21)

Alternatively, if the mapping is a bijection, that is, it is uniquely reversible, and can be
found by

X <--(fv, fv^-1)--> V
Y <--(fw, fw^-1)--> W,

then in addition to Equations B.15, B.17 and B.19, it is possible to derive (in the same
manner):

I(X; Y |V, W ) = 0 (B.22)

I(X; W |V ) = 0 (B.23)

I(V ; Y |W ) = 0 (B.24)

and
I(X; Y ) = I(V ; W ) . (B.25)

Therefore, I(X; Y ) ≥ I(V ; W ) with equality if and only if the mapping is a bijection. This
amounts to a slightly more generalised version of the Data Processing Inequality (Cover
& Thomas 1991, p32).

Figure B.1: Effects of applying several transformations to the discrete probability field
$P(x_i, y_j)$ in (a): (b) shows a randomly permuted version of the same field $\{T_{vw} : v_i = x_{\mathrm{permute}\{i\}}, w_j = y_{\mathrm{permute}\{j\}}\}$;
(c) shows the result of a two-to-one mapping $\{T_{ab} : a_i = x_{2i} + x_{2i-1}, b_j = y_j\}$, which in this case
does not change the field as the original has $x_{2i} = x_{2i-1}$; and (d) shows the effect of the lossy
transformation $\{T_{cd} : c_i = v_{2i} + v_{2i-1}, d_j = w_j\}$. Table B.1 contains the entropies and mutual
information for these four cases.

Figure B.1 shows the effects of applying such a transformation
to a discrete probability field. In Figure B.1a the original distribution is shown and the
relationship between the two random variables is apparent from the shape of the joint
probability field (where the values of the probabilities are given by the vertices of the mesh).
Figure B.1b shows the same distribution after random permutation of the alphabets; x → v
and y → w. The relationship between the random variables is no longer as apparent, but
as rows (a) and (b) in Table B.1 show, the entropies and MI remain the same as
the original distribution. Appendix B.2 includes a derivation of a stronger version of this
involving the transformations Tv : (x, y) → v = fv (x, y) and Tw : (x, y) → w = fw (x, y)
using an argument involving infinitesimals.

Measure            H(X)     H(Y)     H(X,Y)   I(X;Y)
(a) {x_i, y_j}    2.9881   2.2463   4.8521   0.3823
(b) {v_i, w_j}    2.9881   2.2463   4.8521   0.3823
(c) {a_i, b_j}    2.2950   2.2463   4.1590   0.3823
(d) {c_i, d_j}    2.3001   2.2463   4.3923   0.1542

Table B.1: Entropies and Mutual Information measures (in nats) of the distributions shown
in Figure B.1. The cases are labelled in the same order and correspond to: (a) the original
distribution $P(x_i, y_j)$; (b) a randomly permuted version with $\{T_{vw} : v_i = x_{\mathrm{permute}\{i\}}, w_j = y_{\mathrm{permute}\{j\}}\}$;
(c) a lossless two-to-one projection $\{T_{ab} : a_i = x_{2i} + x_{2i-1}, b_j = y_j\}$; and (d) a
lossy two-to-one projection $\{T_{cd} : c_i = v_{2i} + v_{2i-1}, d_j = w_j\}$.

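
The invariance demonstrated above is straightforward to confirm numerically. The sketch below
(using an arbitrary randomly generated joint field rather than the one shown in Figure B.1)
computes the entropies and mutual information of a discrete P(x, y) in nats and shows that a
random permutation of the alphabets leaves all four quantities of Table B.1 unchanged.

import numpy as np

rng = np.random.default_rng(0)
P = rng.random((20, 10))
P /= P.sum()                                   # an arbitrary joint distribution P(x, y)

def entropies(P):
    """Return H(X), H(Y), H(X,Y) and I(X;Y) in nats for a joint table P."""
    H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    Hx, Hy, Hxy = H(P.sum(axis=1)), H(P.sum(axis=0)), H(P.ravel())
    return Hx, Hy, Hxy, Hx + Hy - Hxy

# Permuting the alphabets (a one-to-one mapping x -> v, y -> w) only relabels
# the outcomes, so all four quantities are unchanged.
Q = P[rng.permutation(20), :][:, rng.permutation(10)]

print(entropies(P))
print(entropies(Q))   # identical to the line above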

B.2 General Transformation

Consider the determination of the mutual information between random variables x and y
distributed according to P (x, y), and by inference P (x) and P (y). For a continuous case,
the value is given by equation 2.112 which is


$$I(X;Y) = \int\!\!\int P(x,y)\,\log_b \frac{P(x,y)}{P(x)\,P(y)}\,dx\,dy . \qquad (2.112)$$

In Section B.1 it was demonstrated that the arbitrary mappings x → v and y → w results
in I(X; Y ) ≥ I(V ; W ) with equality if and only if the mappings are one-to-one. Here a
tentative proof for the case where the more general (assumed one-to-one) mappings Tv and
Tw operate is presented. The transforms are:

Tv : (x, y) → v = fv (x, y) (B.26)

Tw : (x, y) → w = fw (x, y) . (B.27)

It will only be shown that the one-to-one mapping generates the equality. Now, because
these are mappings, the same distributions PXY , PX and PY can be used to give the mutual
information according to

$$I(V;W) = \int\!\!\int P_{XY}(v,w)\,\log_b \frac{P_{XY}(v,w)}{P_X(v)\,P_Y(w)}\,dv\,dw . \qquad (B.28)$$

Consider the following marginal distributions:

$$P_Y(y) = \int P_{XY}(x,y)\,dx = \int P_{XY}(v,y)\,dv \qquad (B.29)$$
$$P_Y(w) = \int P_{XY}(x,w)\,dx = \int P_{XY}(v,w)\,dv \qquad (B.30)$$
$$P_X(x) = \int P_{XY}(x,y)\,dy = \int P_{XY}(x,w)\,dw \qquad (B.31)$$
$$P_X(v) = \int P_{XY}(v,y)\,dy = \int P_{XY}(v,w)\,dw . \qquad (B.32)$$

Now since there is a one-to-one mapping between x and v we can compare the infinitesimals
inside the integrals in equation B.29 and obtain the following relationship,

PXY (x, y) dx = PXY (v, y) dv , (B.33)

likewise

PXY (x, w) dx = PXY (v, w) dv (B.34)

PXY (x, y) dy = PXY (x, w) dw (B.35)

PXY (v, y) dy = PXY (v, w) dw . (B.36)

Thus, the value for an infinitesimal element of the fraction $Q(x,y) = \frac{P_{XY}(x,y)}{P_X(x)\,P_Y(y)}$ can be
written as the fraction of the infinitesimal elements nearby $(x, y)$ as

$$Q(x,y) = \frac{P_{XY}(x,y)\,dx\,dy}{[P_X(x)\,dx]\,[P_Y(y)\,dy]} = \frac{P_{XY}(v,y)\,dv\,dy}{[P_X(x)\,dx]\,[P_Y(y)\,dy]} = \frac{P_{XY}(v,w)\,dv\,dw}{[P_X(x)\,dx]\,[P_Y(y)\,dy]} \qquad (B.37)$$

using Equations B.33 and B.36 to change the numerator. Consider the element $P_X(x)\,dx$
next,

$$P_X(x)\,dx = \left[\int P_{XY}(x,y)\,dy\right] dx = \left[\int P_{XY}(x,w)\,dw\right] dx = \left[\int P_{XY}(x,w)\,dx\right] dw = P_Y(w)\,dw , \qquad (B.38)$$

where firstly the definition of the marginal has been used, then equation B.35 to change
the marginalised variable and finally, the order of the infinitesimal and integration has been
reversed. Likewise, using B.33,

PY (y) dy = PX (v) dv , (B.39)

which makes Equation B.37

$$Q(x,y) = \frac{P_{XY}(v,w)\,dv\,dw}{[P_Y(w)\,dw]\,[P_X(v)\,dv]} = Q(v,w) . \qquad (B.40)$$

Therefore, the infinitesimal element in the fractional part of the expression for Mutual
Information is unchanged by the application of the transformations Tv and Tw . Therefore,
the mutual information of equation B.28 becomes

$$\begin{aligned}
I(V;W) &= \int\!\!\int P_{XY}(v,w)\,\log_b Q(v,w)\,dv\,dw \\
       &= \int\!\!\int P_{XY}(v,w)\,\log_b Q(x,y)\,dv\,dw \\
       &= \int\!\!\int \log_b Q(x,y)\,P_{XY}(v,w)\,dv\,dw \\
       &= \int\!\!\int \log_b Q(x,y)\,P_{XY}(x,w)\,dx\,dw \\
       &= \int\!\!\int \log_b Q(x,y)\,P_{XY}(x,y)\,dx\,dy \\
       &= I(X;Y) \qquad (B.41)
\end{aligned}$$

where Equations B.34 and B.35 have been used. This means that if there is a one-to-one
mapping then the mutual information is invariant to the transformation. This holds
also for the discrete case, where the integrals of Equation 2.112 are replaced by summations
over the alphabet {A, B}. Now, since this expression depends only on the values of the
probabilities and not on the value of the underlying alphabet, then any permutation of the
'points' $T_{vw} : (x_i, y_j) \to (v_i, w_j)$ must have no effect on the value of the mutual information.
Appendix C

Measure and Information Theory Examples

C.1 Introduction

This appendix contains numerous examples applying the theoretical material of Chapter 2
in order to clarify the nature and interpretation of that theory within this thesis.

C.2 Measure Theory

Example C.2.1 – σ-algebras for a discrete set


Consider the rolling of a six-sided die. The process is discrete and the set of all possible
outcomes is S1 = {∅, S1 , S2 , S3 , S4 , S5 , S6 }, where Si represents the roll of the number i.
Each outcome is mutually exclusive, so that the events Si are disjoint. For a single roll of a
die, the σ-algebra can be written as Σ1 = {{∅}, {S1 }, . . . , {S6 }} and this implicitly includes
all complements, unions and intersections of these basic elements. Thus, (S2 ∪ S4 ∪ S6 ),
corresponding to an even roll, is also an element of Σ. Likewise, for two rolls of the die
it is possible to write Σ2 = {{∅}, {Si , Sj }} with i, j ∈ [1, 6] ⊂ Z+ , the set of positive
integers1 . Again, all of these elements are disjoint and the family of subsets implicitly
includes complements and unions of the pairs {Si , Sj }.
1 Though, strictly, Σ2 is actually a family of subsets of S2 = S1 × S1.

Next, define a measure which maps each element of Σ to a real quantity. Possible measures
for Σ1 include the face value of the roll, μ(Si ) : Si → i, and the probability of throwing Si ,
μ(Si ) : Si → P (Si ). It is clear that the set S and the structure of Σ are largely determined by
the nature of the random variable being considered but there is great freedom in the selection
of the measure μ.
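
A toy sketch of this freedom is given below: two different measures, the face value and an
assumed (purely illustrative) probability assignment, are applied to the same events; both are
additive over the disjoint outcomes.

faces = {1, 2, 3, 4, 5, 6}
prob = {1: 0.2, 2: 0.1, 3: 0.15, 4: 0.15, 5: 0.25, 6: 0.15}   # assumed values of P(S_i)

def mu_face(event):
    """Face-value measure: sums the face values of the outcomes in the event."""
    return sum(event)

def mu_prob(event):
    """Probability measure: sums P(S_i) over the outcomes in the event."""
    return sum(prob[i] for i in event)

even = {2, 4, 6}                      # the event (S2 u S4 u S6)
print(mu_face(even), mu_prob(even))   # -> 12, 0.4
print(mu_prob(faces))                 # -> 1.0, the certain event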

Example C.2.2 – σ-algebras for a continuous space


Alternatively, consider an n-dimensional Euclidian space, where the set S consists of the
set of real numbers S ≡ Rn . Let an arbitrary point in this space be represented by x̃ =
(x1 , . . . , xn ) ∈ Rn . In this case there is significantly greater freedom in the selection of the
σ-algebra. Several examples include: points on the real line, ΣV = {{∅}, {x}} where x ∈ R;
intervals on R, ΣI = {{∅}, {[x1 , x2 ]}}; and points in Rn , ΣPn = {{∅}, {x1 , x2 , . . . , xn }}.
These can be readily extended to lines, regions, volumes, surfaces and other geometric enti-
ties in an analogous manner. In each case, arbitrary measures μ can be generated to map
the elements of the σ-algebra to R. For example, for Si ∈ ΣI , the mapping μ : Si → |x2 −x1 |
will measure the ‘length’ of the interval.

C.3 Deviations, Divergences and Distances

Example C.3.1 – Distance measure for Example C.2.1


Consider the die from Example C.2.1 where the six outcomes are S1 = {∅, {Si }} with
$i \in [1, 6] \subset \mathbb{Z}^+$. One possible deviation measure is $\Delta(S_i, S_j) \triangleq |i - j|$. The measure has
the following properties: the absolute value ensures that this satisfies Equation 2.4; clearly
$\Delta(S_i, S_j) = 0 \iff i = j$, satisfying Equation 2.5; $|i - j| = |j - i|$, satisfying Equation 2.6;
and it is readily verified that the triangle inequality of Equation 2.7 is satisfied for all i and
j. This makes the deviation measure a distance, $d(S_i, S_j)$, and all measure results can be
summarised geometrically as

1 --1-- 2 --1-- 3 --1-- 4 --1-- 5 --1-- 6

where the numeral 1 above each link indicates the 'length' of that interval and the distance
between non-neighbouring elements is obtained by following the path between them. In this
case the triangle inequality holds as an equality in that $d(S_i, S_k) = d(S_i, S_j) + d(S_j, S_k)$ for $i \le j \le k$.

Example C.3.2 – Divergence measure for Example C.2.1


An alternative deviation measure can be defined as $\Delta(S_i, S_j) \triangleq i(i - j)^2$. It is clear that
this expression satisfies Equations 2.4 and 2.5 but also that it does not satisfy either of
Equations 2.6 or 2.7, so must represent a divergence, $D(S_i \| S_j)$.

Several divergence measures with respect to the common element $S_i$ are:

D(S1||S2) = 1      D(S6||S1) = 150
D(S1||S3) = 4      D(S6||S2) = 96
D(S1||S4) = 9      D(S6||S3) = 54
D(S1||S5) = 16     D(S6||S4) = 24
D(S1||S6) = 25     D(S6||S5) = 6

These, with the remaining terms readily calculated, suggest that this measure actually does
imply an ordering of the elements $S_i$, in that $S_1$ is 'closest' to $S_2$, followed by $S_3$ and so on;
likewise $S_6$ is 'closest' to $S_5$ and 'furthest' from $S_1$. This is true, but only if the divergences
are compared for a common first element $S_i$ as calculated. If, instead, the divergences are
calculated for a common second element $S_j$,

D(S2||S1) = 2      D(S1||S6) = 25
D(S3||S1) = 12     D(S2||S6) = 32
D(S4||S1) = 36     D(S3||S6) = 27
D(S5||S1) = 80     D(S4||S6) = 16
D(S6||S1) = 150    D(S5||S6) = 5

It is now clear that the order is not preserved in the comparison between divergences with
Sj = S6 . This does not mean that the divergence is meaningless, simply that the divergence
does not generate a geometry or fixed order between the elements; if there was an application-
specific special reason for the divergence of this example then the change of order in the last
case would be both expected and desirable.

Example C.3.3 – An alternative distance measure for Example C.2.1


A third possible deviation measure is obtained by first writing $d = |i - j|$ and defining
$\Delta(S_i, S_j) = 6d - d^2$. For $i, j \in [1, 6] \subset \mathbb{Z}^+$ this must be positive and zero if and only if
$i = j$. Likewise, it is symmetric in $S_i$ and $S_j$ and the triangle inequality is satisfied for all
$i$ and $j$: this measure is a distance. The distances are:

Si\Sj   1   2   3   4   5   6
  1     0   5   8   9   8   5
  2     5   0   5   8   9   8
  3     8   5   0   5   8   9
  4     9   8   5   0   5   8
  5     8   9   8   5   0   5
  6     5   8   9   8   5   0

Now, in this case the structure suggested is expressed diagrammatically as

[Diagram: the six outcomes arranged as the vertices of a graph, with each link labelled by its length]

where the values on each link are the ‘length’ of that link, with respect to S1 . Note that
the length of these elements changes depending on which element is taken as the reference,
but that the order and structure of the space remains fixed. In fact, while the geometry of
the elements is non-Euclidian, the resulting distance measure is continuous over the space
since it satisfies the triangle inequality (Doob 1994, p34).
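
The properties claimed in Examples C.3.1 to C.3.3 can be checked exhaustively over the six
outcomes; the brute-force sketch below confirms that |i − j| and 6d − d² are symmetric and
satisfy the triangle inequality, while the divergence i(i − j)² satisfies neither.

from itertools import product

S = range(1, 7)
measures = {
    "|i-j|":      lambda i, j: abs(i - j),
    "6d - d^2":   lambda i, j: 6 * abs(i - j) - abs(i - j) ** 2,
    "i(i-j)^2":   lambda i, j: i * (i - j) ** 2,
}

for name, d in measures.items():
    symmetric = all(d(i, j) == d(j, i) for i, j in product(S, S))
    triangle = all(d(i, k) <= d(i, j) + d(j, k) for i, j, k in product(S, S, S))
    print(f"{name:10s} symmetric={symmetric} triangle={triangle}")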

C.4 Extending Basic Measure Theory

C.4.1 Coordinate Systems



Example C.4.1 – The Fourier basis for a simple case


Consider an $L$-periodic function $f(x)$ defined over $x \in \mathbb{R}$ and the measure $\mu[f(x)] \triangleq \int_{x'}^{x'+L} f^2(x)\,dx$
for some $x'$. Here the set $S$ corresponds to the set of all $L_2$-integrable (periodic)
functions and the σ-algebra represents single elements from this set. Note that two
elements $S_i$ and $S_j$ are not necessarily disjoint because

$$\mu[f(x) + g(x)] = \mu[f(x)] + \mu[g(x)] + 2\int_{x'}^{x'+L} f(x)\,g(x)\,dx . \qquad (C.1)$$

The functions can, however, be written in terms of their Fourier series according to (Kreyszig
1993, p577):

$$f(x) = a_0 + \sum_{n=1}^{\infty}\left[ a_n \cos\!\left(\frac{2n\pi x}{L}\right) + b_n \sin\!\left(\frac{2n\pi x}{L}\right)\right] \qquad (C.2)$$

with

$$a_0 = \frac{1}{L}\int_{x'}^{x'+L} f(x)\,dx \qquad (C.3)$$
$$a_n = \frac{2}{L}\int_{x'}^{x'+L} f(x)\cos\!\left(\frac{2n\pi x}{L}\right) dx \qquad (C.4)$$
$$b_n = \frac{2}{L}\int_{x'}^{x'+L} f(x)\sin\!\left(\frac{2n\pi x}{L}\right) dx . \qquad (C.5)$$

Thus, the function $f(x)$ is equivalent to the vector of coefficients $\mathbf{f} = (a_0, a_1, \ldots, b_1, \ldots)$ and
the function can be interpreted as having coordinates $\mathbf{f}$ with respect to the basis functions
$\hat{e}_{cn} = \cos\!\left(\frac{2n\pi x}{L}\right)$ and $\hat{e}_{sn} = \sin\!\left(\frac{2n\pi x}{L}\right)$.

Note also that the bases $\hat{e}_{cn}$ and $\hat{e}_{sn}$ are orthogonal: $\int_{x'}^{x'+L} \hat{e}_{cn_1}(x)\,\hat{e}_{sn_2}(x)\,dx = 0$ for any
$n_1$ and $n_2$; and $\int_{x'}^{x'+L} \hat{e}_{\alpha n_1}(x)\,\hat{e}_{\alpha n_2}(x)\,dx = \delta_D(n_1 - n_2)$ for $\alpha$ selecting the cosine or sine
function and where $\delta_D(.)$ is the Dirac delta function. This means that the third term in
Equation C.1 is zero for the basis vectors and they are disjoint. Therefore,

$$\mu[f(x)] = \mu\!\left[ a_0 + \sum_{n=1}^{\infty}\left( a_n \hat{e}_{cn} + b_n \hat{e}_{sn}\right)\right] = \mu[a_0] + \sum_{n=1}^{\infty}\left\{ a_n^2\, \mu[\hat{e}_{cn}] + b_n^2\, \mu[\hat{e}_{sn}]\right\} , \qquad (C.6)$$

demonstrating the effectiveness of an orthogonal coordinate system.
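
A numerical sketch of this decomposition is given below for an illustrative L-periodic function
(not one used in the thesis): the Fourier coefficients are computed by quadrature and the measure
μ[f] = ∫ f² dx is recovered from the squared coefficients, in the Parseval form used in Equation C.6.

import numpy as np

L = 2.0
x = np.linspace(0.0, L, 20001)
f = 1.0 + 2.0 * np.cos(2 * np.pi * x / L) - 0.5 * np.sin(6 * np.pi * x / L)

def coeffs(f, x, L, N=5):
    """Fourier coefficients a0, a_n, b_n of an L-periodic function sampled on x."""
    a0 = np.trapz(f, x) / L
    a = [np.trapz(f * np.cos(2 * n * np.pi * x / L), x) * 2 / L for n in range(1, N + 1)]
    b = [np.trapz(f * np.sin(2 * n * np.pi * x / L), x) * 2 / L for n in range(1, N + 1)]
    return a0, np.array(a), np.array(b)

a0, a, b = coeffs(f, x, L)
mu_f = np.trapz(f ** 2, x)                                         # the measure mu[f]
mu_from_coeffs = a0 ** 2 * L + (L / 2) * np.sum(a ** 2 + b ** 2)   # decomposition over the basis

print(a0, a.round(3), b.round(3))    # ~1.0, a1 ~ 2.0, b3 ~ -0.5, the rest ~ 0
print(mu_f, mu_from_coeffs)          # the two values agree (~6.25)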



C.4.2 Measures and Inner Products

Example C.4.2 – A flat measure


Let $S = \mathbb{R}^2_+$ correspond to the first quadrant of the real plane and let $S_i \in \Sigma$ be a point in
this region with coordinates $s_i = (x_i, y_i)$. Define the measure $\mu(s_i) \triangleq (x_i, y_i) \to (\alpha x_i + \beta y_i)$.
This measure is plotted in Figure C.1 for $\alpha = 2$ and $\beta = -1$. It is clear that this corresponds
to a 'flat' measure, which is readily verified by the fact that

$$\frac{\partial \mu(s_i)}{\partial x} = \alpha = C_1 \qquad (C.7)$$
$$\frac{\partial \mu(s_i)}{\partial y} = \beta = C_2 \qquad (C.8)$$

and the measure can be written in terms of Equation 2.13 as

$$\mu(s_i) = \sum_{k=1}^{2} C_k\, s_{ik} \qquad (2.13)$$

where $s_{i1} = x_i$ and $s_{i2} = y_i$. For example, for $\alpha = 2$ and $\beta = -1$, $\mu(3, 5) = 3\alpha + 5\beta = 1$.

Similarly, the 'generalised inner product' of Equation 2.16 can be evaluated. Here $\mu_{11} = \alpha^2$,
$\mu_{12} = \mu_{21} = \alpha\beta$ and $\mu_{22} = \beta^2$ and Equation 2.18 can be used to calculate the measure for
any points $s_i$ and $s_j$ as

$$\langle s_i, s_j \rangle = \sum_{k=1}^{2}\sum_{l=1}^{2} \mu_{kl}\, s_{ik}\, s_{jl} . \qquad (2.18)$$

For example, the inner product of $s_1 = (3, 5)$ and $s_2 = (1, 2)$, with the same $\alpha$ and $\beta$ as
before, is

$$\begin{aligned}
\langle s_1, s_2 \rangle &= s_{i1} s_{j1} \alpha^2 + (s_{i1} s_{j2} + s_{i2} s_{j1})\,\alpha\beta + s_{i2} s_{j2} \beta^2 \qquad &(C.9) \\
&= 3 \cdot 1 \cdot (2)^2 + (3 \cdot 2 + 5 \cdot 1) \cdot 2 \cdot (-1) + 5 \cdot 2 \cdot (-1)^2 \qquad &(C.10) \\
&= 0 . \qquad &(C.11)
\end{aligned}$$
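
The same numbers can be reproduced directly, as in the short sketch below, by forming
μ_kl = C_k C_l from C = (α, β) and evaluating Equation 2.18.

import numpy as np

C = np.array([2.0, -1.0])            # (alpha, beta)
mu = lambda s: C @ s                 # the flat measure of Equation 2.13

M = np.outer(C, C)                   # mu_kl = C_k C_l, as used in Equation 2.18
inner = lambda si, sj: si @ M @ sj   # <s_i, s_j> = sum_kl mu_kl s_ik s_jl

s1, s2 = np.array([3.0, 5.0]), np.array([1.0, 2.0])
print(mu(s1))          # -> 1.0, matching mu(3, 5) = 3*alpha + 5*beta
print(inner(s1, s2))   # -> 0.0, matching Equation C.11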


Figure C.1: An example of a flat measure: si ∈ R2+ and μ(si ) = αsi1 + βsi2 .

Example C.4.3 – A non-flat measure


Consider the same space as from Example C.4.2: si ∈ R2+ , but let the measure be defined as
$\mu(s_i) = |s_i|^2$, where $|.|$ represents the $L_2$ norm of the vector: $|s_i|^2 = s_{i1}^2 + s_{i2}^2$. The measure
for this space is shown in Figure C.2 and it is clear that this is no longer flat.

Writing Equation 2.12 for two dimensions yields

$$\mu(s_i) = \sum_{k=1}^{2} \int_{s'} \frac{\partial \mu(s')}{\partial s_k}\,ds_k = \int_{s'_1} \frac{\partial \mu(s')}{\partial s_{i1}}\,ds_{i1} + \int_{s'_2} \frac{\partial \mu(s')}{\partial s_{i2}}\,ds_{i2} \qquad (C.12)$$

where $s'$ traces a path from $s' = 0$ to the given point, first following the x-axis then the
y-axis. The paths $s'_1$ and $s'_2$ represent these two parts respectively, with $ds_{i1} = 0$ on $s'_2$ and
vice-versa, allowing the path to be broken into these two parts. The partial derivatives are

$$\frac{\partial \mu(s')}{\partial s_{i1}} = 2 s_{i1} \qquad (C.13)$$
$$\frac{\partial \mu(s')}{\partial s_{i2}} = 2 s_{i2} \qquad (C.14)$$

so that

$$\mu[(3, 5)] = 2\int_0^3 s_{i1}\,ds_{i1} + 2\int_0^5 s_{i2}\,ds_{i2} = \left[s_{i1}^2\right]_0^3 + \left[s_{i2}^2\right]_0^5 = 34 . \qquad (C.15)$$

C.4.3 Norms and Distances




Figure C.2: An example of a non-flat measure: $s_i \in \mathbb{R}^2_+$ and $\mu(s_i) = s_{i1}^2 + s_{i2}^2$. Also shown
in green is a path from the origin $s' = 0$ to the point $s_i = (3, 5)$.

Example C.4.4 – Norms and distance in discrete system


Let $s_i, s_j \in \mathbb{R}^n$ be defined analogously to the example in $\mathbb{R}^2$ in Example C.4.2. Define
the general form of the $\alpha$-$\beta$ measure of the same example as: $\mu(s_i) = \sum_k \alpha_k\, s_{ik}$ for $s_i =
(s_{i1}, s_{i2}, \ldots, s_{in})$. Consider three particular vectors for the case where $n = 5$:

s1 = (1, 5, 3, 4, 6)

s2 = (3, 12, 2, 1, 3)

s3 = (2, 5, 2, 3, 3)

and therefore:

s1 + s2 = (4, 17, 5, 5, 9)

s1 + s3 = (3, 10, 5, 7, 9)

s1 − s2 = (−2, −7, 1, 3, 3)

s1 − s3 = (−1, 0, 1, 1, 3) .

The results of applying the L1, L2 and L∞ norms to these composite elements yield

L1 [s1 + s2 ] = 40 L1 [s1 + s3 ] = 34
L2 [s1 + s2 ] = 20.88 L2 [s1 + s3 ] = 16.25
L∞ [s1 + s2 ] = 17 L∞ [s1 + s3 ] = 10
L1 [s1 − s2 ] = 16 L1 [s1 − s3 ] = 6
L2 [s1 − s2 ] = 8.49 L2 [s1 − s3 ] = 3.46
L∞ [s1 − s2 ] = 7 L∞ [s1 − s3 ] = 3

Each of these emphasises different characteristics of the relationship between the vectors: the
L∞ norm finds the maximum component of the vector and can be considered the ‘extreme-
case’ measure; the L2 norm is the Euclidian distance for this example and represents the
length of the resulting vector; and, by comparison, the L1 norm also utilises all components
of the vector, but gives an increased emphasis to the smaller differences than the L2 norm.
In a similar manner, the Ln for n > 2 will further accentuate the larger elements and in
the limit the maximum component will dominate, yielding L∞ .
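
For reference, the table above can be reproduced directly:

import numpy as np

s1 = np.array([1, 5, 3, 4, 6])
s2 = np.array([3, 12, 2, 1, 3])
s3 = np.array([2, 5, 2, 3, 3])

for label, v in [("s1+s2", s1 + s2), ("s1+s3", s1 + s3),
                 ("s1-s2", s1 - s2), ("s1-s3", s1 - s3)]:
    print(label,
          np.linalg.norm(v, 1),               # L1 norm
          round(np.linalg.norm(v, 2), 2),     # L2 norm
          np.linalg.norm(v, np.inf))          # L-infinity norm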

(a) Measure for Example C.4.5     (b) Path (0, 0, 0, 0, 0) → (8, 6, 4, 5, 6)

Figure C.3: The measure $\mu[s] = \sum_{k=1}^{n} \sin(s_{ik})$ for Example C.4.5. (a) shows the actual
measure for the entire set $(s_i)$. A path is marked in green from $\mathbf{0}$ to $(8, 6, 4, 5, 6)$ for $n = 5$
and the 'measure' along this path is shown separately for each component in (b). Note that
each element in (b) corresponds to the term $\frac{\partial \mu(s')}{\partial s_k}$ from Equation 2.12 with $k$ the component
and $s'$ the path.

C.4.4 Measures for Functions and Infinite Dimensional Systems

Example C.4.5 – Measure on a continuous space with finite dimensionality



Let $s \in \mathbb{R}^n_+$ be the $n$-dimensional continuous set and define the measure $\mu(s) = \sum_k \sin(s_k)$
for $k \in [1, n]$. The partial derivatives are

$$\frac{\partial \mu(s)}{\partial s_k} = \frac{\partial}{\partial s_k} \sum_k \sin(s_k) = \cos(s_k) \, ; \qquad (C.16)$$

inserting this into Equation 2.12 yields

$$\mu(s_i) = \sum_k \int_{s'} \cos(s_{ik})\,ds_k . \qquad (C.17)$$

This measure for $n = 2$ is shown in Figure C.3(a) and an axis-aligned path is shown on it
in green. Figure C.3(b) shows the local differential $\frac{\partial \mu(s)}{\partial s_k}$ for a similar path when $n = 5$ for
the path from $\mathbf{0} = (0, 0, 0, 0, 0)$ to $(8, 6, 4, 5, 6)$.

Here the measure can be calculated as before, integrating along the continuous axis and
summing the results. Here the result is

$$\mu[(8, 6, 4, 5, 6)] = \int_0^8 \cos(x)\,dx + \int_0^6 \cos(x)\,dx + \int_0^4 \cos(x)\,dx + \int_0^5 \cos(x)\,dx + \int_0^6 \cos(x)\,dx \approx -1.285 .$$
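
The same value can be obtained numerically by integrating along the axis-aligned path, as in
the sketch below (using SciPy quadrature purely as an illustrative tool):

import numpy as np
from scipy.integrate import quad

target = [8, 6, 4, 5, 6]
# Integrate cos along each coordinate axis in turn and sum the contributions.
mu = sum(quad(np.cos, 0.0, upper)[0] for upper in target)
print(mu)   # ~ -1.285, i.e. sin(8) + 2*sin(6) + sin(4) + sin(5)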

(a) Measure for Example C.4.6     (b) Measure for path from (1, 1) to (3, 5)

Figure C.4: The measure $\mu[S_i, S_j] = 6 \times i + j$ for the example of two dice in Example C.4.6.
(a) shows the actual measure for the entire set $(S_i, S_j)$ and $i, j \in [1, 6]$. A path is marked
in green from $(1, 1)$ to $(3, 5)$ and the 'measure' along this path is shown separately for each
component in (b). Note that each element in (b) corresponds to the term $\frac{\Delta \mu(s')}{\Delta s_k}\,\delta s_k$ from
Equation 2.29 with $k \in [1, 2]$ the component and $s'$ the path.

Note that the differential in Figure C.3(b) has a trivial form because of the selection of the
path as following each orthogonal coordinate axis in turn.

Example C.4.6 – Measure on a discrete space with finite dimensionality


Consider once more the example of throwing two dice simultaneously from Example C.2.1:
$S_2 = \{S_i, S_j\}$ with integers $i, j \in [1, 6]$. Let the measure being considered be $\mu[S_i, S_j] \triangleq
6 \times i + j$. The measure is shown in Figure C.4(a) and the path marked on it in green is
redrawn in the form of Figure 2.3 in Figure C.4(b), where each element corresponds to the
term $\frac{\Delta \mu(s')}{\Delta s_k}\,\delta s_k$ from Equation 2.29 with $k \in [1, 2]$ the component, and let $s' \in [1, 8]$ be the
index along the path². The resulting measure is calculated from the double sum of this
discrete surface, resulting in $\mu[3, 5] = 23$, as expected.

2 Note that the element $(0, 0) = \emptyset$ is inserted into the path explicitly with index 1 so that the non-zero
measure of $(1, 1)$ is taken into account.

Example C.4.7 – Measure on a discrete space with infinite dimensionality


The third arrangement corresponds to an infinite-dimensional discrete set. Let $S = \{0, 1\}$
correspond to the state of an asynchronous digital signal $s = s(t)$ with length $T$. Let a
measure $\mu[s(t)] = \frac{1}{T}\int_0^T s(t)\,dt$ correspond to the duty-cycle of the signal. It is not possible
to generate the equivalent of Figures C.4(a) or C.3(a) because of the infinite dimensionality.
It can be shown, however, that this measure is flat. Consider the discrete equivalent of the
partial derivative $\frac{\partial \mu[s']}{\partial s_k}$, denoted by $\frac{\Delta \mu[s(t)]}{\Delta s(t)}$: to calculate this difference, first consider two
functions $s_0(t)$ and $s_1(t)$ which differ only in that $s_0(t') = 0$ and $s_1(t') = 1$. The partial
difference between the two measures will be given by

$$\frac{\mu[s_1(t)] - \mu[s_0(t)]}{\Delta[s_1(t) - s_0(t)]} \qquad (C.18)$$

where the numerator is the (infinitesimal) difference in the measure and the denominator
the (infinitesimal) difference in the set value. This fraction is independent of $t$ and has a
value of unity.

Without loss of generality an infinitesimal step along the path can be considered to change
only one of the elements, $s(t')$ for some $t'$ say, so that the summation in Equation 2.31 can
be ignored and the resulting expression is

$$\mu(s) = \int_0^T \frac{\mu[s_1(t)] - \mu[s_0(t)]}{\Delta[s_1(t) - s_0(t)]}\, \delta[s_1(t) - s_0(t)]\,dt = \int_0^T \delta[s_1(t) - s_0(t)]\,dt . \qquad (C.19)$$

Now, since the δ term represents the ‘actual’ change in the observed function at the given
value of t, then this will be δ [s1 (t) − s0 (t)] = s(t) and the resulting expression is the same
as the definition of the measure. This verifies that the partial difference which is unity for
all t is constant and the measure is flat.

Example C.4.8 – Measure on a continuous space with infinite dimensionality


Let the set S correspond to the set of all L2 -integrable functions f (x) for x ∈ R and let the

measure correspond to $\mu[f(x)] \triangleq \int_{-\infty}^{\infty} x\,f(x)\,dx$. Noting that the partial differential from
Equation 2.32 must be evaluated for each value of $x = x_0$,

$$\frac{\partial \mu[f(x)]}{\partial f(x_0)} = \frac{\partial}{\partial f(x_0)} \int_{-\infty}^{\infty} x\,f(x)\,dx \qquad (C.20)$$
$$\phantom{\frac{\partial \mu[f(x)]}{\partial f(x_0)}} = \int_{-\infty}^{\infty} x\, \frac{\partial}{\partial f(x_0)} f(x)\,dx \qquad (C.21)$$
$$\phantom{\frac{\partial \mu[f(x)]}{\partial f(x_0)}} = x_0 , \qquad (C.22)$$

where the final step follows because the partial derivative with respect to f (x0 ) will yield zero
for all values of x = x0 , one otherwise and x is a constant with respect to f (x0 ). Clearly this
corresponds to a flat measure and Equation 2.37 can be used to calculate the inner product
for any two elements.

C.5 Information Theoretic Measures

Example C.5.1 – Probabilities in a discrete system


Let the outcomes of the roll of two dice be denoted by S = {Si , Sj } where i, j ∈ [1, 6] ⊂
Z+ . If the trivial σ-algebra is defined as the set of individual elements (and their unions,
complements and intersections), then the measure μ(Si , Sj ) should map each element onto
the interval [0, 1] ⊂ R. Let the probabilities be

S.       i=1      2      3      4      5      6    P(Sj)
j=1     0.05   0.04   0.03   0.03   0.02   0.02    0.19
  2     0.04   0.03   0.02   0.02   0.01   0.01    0.13
  3     0.04   0.04   0.02   0.02   0.02   0.01    0.15
  4     0.03   0.02   0.03   0.03   0.02   0.02    0.15
  5     0.05   0.05   0.04   0.04   0.05   0.05    0.28
  6     0.02   0.02   0.02   0.02   0.01   0.01    0.10
P(Si)   0.23   0.20   0.16   0.16   0.13   0.12    1.0

The centre of this table represents the probability of each event (Si , Sj ); for example, the
probability of rolling two 3’s is P (S3 , S3 ) = 0.02 or 2%, the far right column and bottom
row represent the marginal probabilities P (Sj ) and P (Si ) respectively and are found by
summation through the table. Finally, the bottom right corner shows the probability of the
union of all possible outcomes, by convention set to unity.
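
The table and its marginals can be reproduced directly:

import numpy as np

# P[j-1, i-1] = P(S_i, S_j): rows are indexed by j, columns by i.
P = np.array([[0.05, 0.04, 0.03, 0.03, 0.02, 0.02],
              [0.04, 0.03, 0.02, 0.02, 0.01, 0.01],
              [0.04, 0.04, 0.02, 0.02, 0.02, 0.01],
              [0.03, 0.02, 0.03, 0.03, 0.02, 0.02],
              [0.05, 0.05, 0.04, 0.04, 0.05, 0.05],
              [0.02, 0.02, 0.02, 0.02, 0.01, 0.01]])

print(P.sum(axis=1))   # P(S_j) -> [0.19 0.13 0.15 0.15 0.28 0.10]
print(P.sum(axis=0))   # P(S_i) -> [0.23 0.20 0.16 0.16 0.13 0.12]
print(P.sum())         # -> 1.0, the certain event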

C.5.1 Shannon Information and Entropy

Example C.5.2 – Event entropy in a discrete system


Consider the marginal distribution P (Sj ) from Example C.5.1. The event Sj with j ∈
[1, 6] ⊂ Z+ represents a roll of a die and j the face which is showing. Let the distribution
and event entropies (for b = e) be given by

Sj          1      2      3      4      5      6    ∪j Sj
P(Sj)     0.19   0.13   0.15   0.15   0.28   0.10    1.0
hP(Sj)    1.66   2.04   1.90   1.90   1.27   2.30    0.0

where the final column shows the probability and event entropy for the event consisting of
the union of all other events and represents the ‘certain’ event. It is immediately clear that
the event entropy measures the unexpectedness of the outcomes, with the highest value of
2.30 occurring for S6 with the smallest probability value of 0.1, and the smallest value of
1.27 when P (S5 ) = 0.28 is the largest value. The entropy of the certain event is zero as
demonstrated in the last column suggesting that if the outcome is known a priori then there
is no unexpectedness in observing the result.
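
The event entropies above, and their expectation, can be reproduced as follows (in nats); the
expectation is the distribution entropy of roughly 1.74 nats quoted in Example C.5.5.

import numpy as np

P = np.array([0.19, 0.13, 0.15, 0.15, 0.28, 0.10])
h = -np.log(P)               # event entropies h_P(S_j)
H = np.sum(P * h)            # distribution entropy H_P(S) = E[h_P(S_j)]

print(h.round(2))            # -> [1.66 2.04 1.9 1.9 1.27 2.3]
print(round(H, 2))           # -> 1.74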

Example C.5.3 – Event entropies in a continuous system


Consider the process of measuring the range, $z$, to a stationary target with a sensor known
to yield a value distributed according to

$$P(z|x) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(z - x)^2\right] \qquad (C.23)$$

where x is the true distance to the object, z the measured distance and the distribution
is Gaussian with a mean equal to the true distance with a variance of σz = 1. Consider
also that the equivalent expression for the joint probability distribution of two (or more)
measurements can be written as

$$P(\mathbf{z}) = \frac{1}{|2\pi\Sigma_N|^{\frac{1}{2}}} \exp\left[-\frac{1}{2}(\mathbf{z} - x)^T \Sigma_N^{-1} (\mathbf{z} - x)\right] \qquad (C.24)$$

where the vector $\mathbf{z}$ represents the vector of measurements $\mathbf{z} = \{z_1, \ldots, z_n\}$ for some $n$, $x$ is
a scalar which remains constant (assuming that the true range is fixed), and $\Sigma_N$ represents
the covariance matrix describing the $n$-dimensional Gaussian. The operations $(.)^T$ and $|.|$
represent the matrix transpose and determinant respectively.

If this distribution is assumed to arise as a result of the combination of a deterministic


observation and Gaussian noise with mean zero, then the relationship between multiple
observations of the same range will be related to how the noise part changes between readings.
Assume that the noise is Gaussian, zero mean and does not depend on the noise at any
previous time - the noise process is said to be uncorrelated. Here the covariance matrix
will be diagonal with a constant value, σ 2 , along that diagonal and the expression can be
rewritten as

$$P(\mathbf{z}) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(z_i - x)^2}{2\sigma^2}\right] = \prod_i P(z_i) . \qquad (C.25)$$

Thus, if the noise process causing the spread in the distribution P (z) is independent between
multiple measurements, then the resulting distribution is statistically independent with re-
spect to the measured values, zi .

Measured in ‘nats’, that is taken with the base b = e, the entropy of a particular outcome z
is given by
 
$$h_P(\mathbf{z}) = -\log_e\left\{ \frac{1}{|2\pi\Sigma_N|^{\frac{1}{2}}} \exp\left[-\frac{1}{2}(\mathbf{z} - x)^T \Sigma_N^{-1}(\mathbf{z} - x)\right]\right\} = -\log_e \frac{1}{|2\pi\Sigma_N|^{\frac{1}{2}}} + \frac{1}{2}(\mathbf{z} - x)^T \Sigma_N^{-1}(\mathbf{z} - x) . \qquad (C.26)$$

When the covariance matrix is diagonal with $\sigma^2$ along the diagonal, the entropy is given by

$$\begin{aligned}
h_P(\mathbf{z}) &= -\log_e \frac{1}{|2\pi\Sigma_N|^{\frac{1}{2}}} + \sum_{i=1}^{n} \frac{(z_i - x)^2}{2\sigma^2} \\
&= -\sum_{i=1}^{n} \log_e\left\{ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(z_i - x)^2}{2\sigma^2}\right]\right\} \\
&= -\sum_{i=1}^{n} \log_e P(z_i) \qquad (C.27)
\end{aligned}$$

which shows that when the noise terms are uncorrelated, the distributions are independent
and the resulting entropy is the sum of the entropies of the individual outcomes. Figure
C.5 shows the distribution P (z1 , z2 ) and entropy function hP (z1 , z2 ) for two independent
measurements with σ 2 = 1 and x = 5. The entropy of the outcome (z1 , z2 ) is clearly seen to
increase as the observations deviate further from the true value of (5, 5). Figure C.5b also
shows the marginal entropies on the diagram and the joint entropy can be seen to be the
sum of the two curves.

Finally, note that the entropies here are positive, but that for a sufficiently compressed
distribution the values of P (z) can be greater than unity and the resulting entropy will be
negative.

Example C.5.4 – Event entropy in the case of Example 2.5.2


In Example 2.5.2 it was shown that for a three random variable situation involving: v, the
voltage across a resistor; i the current through it; and R the value of the resistance, the
resulting probability distributions are obtained from the combination of the deterministic
relationship between the variables, v = iR, and the statistical properties of the individual
variables. In particular, given the distribution for the value of the resistance as P (R), it is
possible to write the conditional distribution for the voltage, given the current according to
Equation 2.44,

$$P(v|i) = \int P(v|i, R)\,P(R)\,dR , \qquad (2.44)$$

where $P(v|i, R) = \delta_D(v - iR)$ is the deterministic relationship between the variables.

(a) Probability distribution $P(z_1, z_2)$     (b) Entropy function $h_P(z_1, z_2)$

Figure C.5: Probability distribution and entropy function for the two-dimensional independent
Gaussian distribution from Example C.5.3 with true range $x = 10$ and variance $\sigma^2 = 1$.
(a) also shows the marginal distributions $P(z_1) = P(z_2)$ in red and in (b) the marginal
entropies are shown in blue and red.

Given a flat distribution $P(R) \triangleq \frac{1}{R_{max} - R_{min}}\left[U_H(R - R_{min}) - U_H(R - R_{max})\right]$, the resulting
distribution was shown to be given by Equation 2.46,

$$P(v|i) = \frac{U_H(v - iR_{min}) - U_H(v - iR_{max})}{i(R_{max} - R_{min})} = \begin{cases} \frac{1}{i(R_{max} - R_{min})} & \text{if } v \in [iR_{min}, iR_{max}] \\ 0 & \text{otherwise} \end{cases} . \qquad (C.28)$$

The event entropy function is given by

$$h_P(v|i) = \begin{cases} \log_b i(R_{max} - R_{min}) & \text{if } v \in [iR_{min}, iR_{max}] \\ \infty & \text{otherwise} \end{cases} . \qquad (C.29)$$

Example C.5.5 – Distribution entropy from Example C.5.2


The probabilities and event entropies from Example C.5.2 are reproduced below, with the
addition of the impossible event of rolling a ‘7’ on a six-sided die and the union of all
possible events in the last two columns,

Sj               1       2       3       4       5       6      7    ∪j Sj
P(Sj)          0.19    0.13    0.15    0.15    0.28    0.10    0.0     1.0
hP(Sj)         1.66    2.04    1.90    1.90    1.27    2.30     ∞      0.0
P(Sj)hP(Sj)   0.316   0.265   0.285   0.285   0.356   0.230    0.0      -

Also added are the values of P (Sj )hP (Sj ), which are the individual contributions to the
distribution entropy. It can be verified by the addition of the values in the bottom row that
the total entropy of the distribution is 1.74 nats. The impossible event contributes nothing
to the distribution entropy3 .

If, however, the distribution was flat, that is, with $P(S_j) = \frac{1}{6}\ \forall j$, then the entropy of
each event will be $h_P(S_j) = \log_b 6 = 1.79$ nats and the resulting distribution entropy will be
$6 \times \frac{1}{6} \times \log_b 6 = 1.79$, that is, the distribution entropy is the same as the event entropy of
each outcome. It is readily shown that this is the maximum possible value of the distribution
entropy (though individual event entropies can be higher than this) and intuitively corre-
sponds to the situation in which all outcomes are equally likely and, therefore, the expected
benefit of actually making a measurement is greatest.

If the distribution was far from flat, say with the following values,

Sj 1 2 3 4 5 6 ∪j Sj
P (Sj ) 0.01 0.01 0.01 0.48 0.48 0.01 1.0
hP (Sj ) 4.605 4.605 4.605 0.734 0.734 4.605 0.0
P (Sj )hP (Sj ) 0.046 0.046 0.046 0.352 0.352 0.046 0.0

giving a total distribution entropy of 0.889 nats, which is substantially lower than 1.79. This
suggests that the outcomes of the experiment are expected to be highly predictable, which
agrees with the fact that the distribution strongly favours rolls of 4 and 5.

Finally, if the distribution was deterministic, say with P(S3) = 1 and P(Sj) = 0 for all j ≠ 3, the
contributions are P(Sj) logb P(Sj) = 0 ∀ j and the distribution entropy will be zero also.
3. Using L'Hospital's rule, for example.

Example C.5.6 – Entropy of a Gaussian distribution


Further consider the situation of Example C.5.3. It can be readily shown (Durrant-Whyte
2001, pp. 33-34) that the distribution entropy of an n-dimensional Gaussian distribution in
natural units (nats) is given by

HP(Z) = (1/2) loge[(2πe)^n |ΣN|]    (C.30)

so that the entropy of a Gaussian distribution (irrespective of the independence or otherwise
of its components) depends only on the dimensionality and the covariance matrix ΣN. Intuitively
this is reasonable: firstly, the addition of a further degree of freedom should make the resulting
distribution less defined and the distribution entropy should rise; likewise, the determinant
of the covariance is proportional to the 'hyper-volume' of a rectangle enclosing the Gaussian,
so that the more uncertainty there is in the estimate, the greater the volume and hence the
entropy.

In the case of independent, identically distributed (i.i.d.) measurements as earlier, the
covariance matrix is diagonal, with the value σ² on the diagonal, and this becomes

HP(Z) = (1/2) loge[(2πe)^n ∏(i=1..n) σ²]    (C.31)
      = (1/2) Σ(i=1..n) loge[(2πe) σ²]      (C.32)
      = (n/2) loge[(2πe) σ²]                (C.33)
      = n HP(Zi)                             (C.34)

where HP (zi ) is the entropy of the ith component, which will be independent of i in this
case.

Also note that if |ΣN| < 1/(2πe)^n, or in the case of i.i.d. measurements if σ² < 1/(2πe), then
HP(Z) will be negative.
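As a numerical check on Equations C.30 to C.34, the Python sketch below (an illustrative fragment, not from the original text) evaluates the Gaussian entropy from a covariance matrix, confirms the i.i.d. factorisation, and exhibits a variance small enough to make the entropy negative.

    import numpy as np

    def gaussian_entropy(cov):
        # HP(Z) = 0.5 * ln[(2*pi*e)^n * |Sigma|], in nats (Equation C.30).
        cov = np.atleast_2d(cov)
        n = cov.shape[0]
        return 0.5 * np.log((2.0 * np.pi * np.e) ** n * np.linalg.det(cov))

    sigma2, n = 1.0, 3
    print(gaussian_entropy(sigma2 * np.eye(n)))            # joint entropy of n i.i.d. components
    print(n * gaussian_entropy([[sigma2]]))                # n * HP(Zi): identical value
    print(gaussian_entropy([[0.5 / (2 * np.pi * np.e)]]))  # sigma^2 < 1/(2*pi*e): negative entropy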

Example C.5.7 – Distribution entropy in the case of Example 2.5.2


Recall from Example C.5.4 that for a resistor with voltage v, current i and resistance R the
event entropy is given by Equation C.29,

hP(v|i) = logb[i(Rmax − Rmin)]    if v ∈ [iRmin, iRmax]
        = ∞                        otherwise .    (C.29)

Here the distribution conditional entropy (for a given i) is


HP(V|i) = ∫_{−∞}^{∞} P(v|i) hP(v|i) dv
        = ∫_{iRmin}^{iRmax} [1 / (i(Rmax − Rmin))] logb[i(Rmax − Rmin)] dv
        = logb[i(Rmax − Rmin)] .    (C.35)

If the current is also distributed according to a flat distribution over some range i ∈ [imin, imax], or,

P(i) ≜ 1 / (imax − imin)    if i ∈ [imin, imax], and 0 otherwise,    (C.36)

then the distribution conditional entropy is given by



HP(V|I) = ∫ P(i) HP(V|i) di
        = [1 / (imax − imin)] ∫_{imin}^{imax} logb[i(Rmax − Rmin)] di
        = logb(Rmax − Rmin) + (imax logb imax − imin logb imin) / (imax − imin) − logb e .    (C.37)

C.5.2 Mutual Information

Example C.5.8 – Limiting values for event mutual information


Consider a discrete system of two variables x, y ∈ [−2, 2] with 25 possible steps in that
interval for each variable. Let the un-normalised values be given by the continuous Gaussian
distribution
[1 / (2π |ΣN|^(1/2))] exp[−(1/2)(x − μ)^T ΣN⁻¹ (x − μ)] .    (C.38)

Here let

μ = [0, 0]^T    (C.39)

Σ0 = [ 0.2    0
       0      0.02 ]    (C.40)

but rotate the covariance matrix through θ = π/6 using ΣN = R Σ0 R^T where R is the rotation
matrix,

R = [ cos(θ)   −sin(θ)
      sin(θ)    cos(θ) ] .    (C.41)

The normalised distribution is shown in Figure C.6(a) along with the marginal distributions
and it is clear that the joint distribution is always less than the minimum of the marginal
values. Figure C.6(b) shows the joint distribution in red and the independent-assumption
joint distribution, PAB(xi, yj), in blue. Finally, Figure C.6(c) shows the event mutual infor-
mation in blue and the upper and lower limits of Equations 2.107 and 2.108 in red, showing
that the event mutual information does, in fact, lie between the limits as expected.
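A numerical sketch of this construction is given below (Python, using the parameter values quoted above). It discretises the rotated Gaussian on a 25 × 25 grid, normalises it, and checks two properties that hold for any discrete joint distribution: the joint probability never exceeds the smaller of the two marginals, and the event mutual information never exceeds the smaller of the two marginal event entropies. These are standard discrete-system bounds used only as an illustration here; the precise limits of Equations 2.107 and 2.108 are those defined in Chapter 2.

    import numpy as np

    theta = np.pi / 6.0
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Sigma = R @ np.diag([0.2, 0.02]) @ R.T

    xs = np.linspace(-2.0, 2.0, 25)
    X, Y = np.meshgrid(xs, xs, indexing='ij')
    pts = np.stack([X, Y], axis=-1)
    quad = np.einsum('...i,ij,...j->...', pts, np.linalg.inv(Sigma), pts)
    P = np.exp(-0.5 * quad)
    P /= P.sum()                      # normalised discrete joint distribution P(xi, yj)

    Px = P.sum(axis=1)                # marginal P(xi)
    Py = P.sum(axis=0)                # marginal P(yj)

    i_xy = np.log(P) - np.log(Px)[:, None] - np.log(Py)[None, :]    # event mutual information
    upper = np.minimum(-np.log(Px)[:, None], -np.log(Py)[None, :])  # min of marginal event entropies

    print(np.all(P <= np.minimum(Px[:, None], Py[None, :]) + 1e-12))  # True
    print(np.all(i_xy <= upper + 1e-12))                              # True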

Example C.5.9 – Event mutual information with independence


Let the same system as in Example C.5.8 be changed so that the covariance matrix Σ0 is
used without rotation. If x = [xi , yj ]T , the distribution (without normalisation) is


PAB(xi, yj) = exp[−(1/2)(x − μ)^T Σ0⁻¹ (x − μ)]
            = exp[−( (xi − μx)² / (2σx²) + (yj − μy)² / (2σy²) )]
            = exp[−(xi − μx)² / (2σx²)] × exp[−(yj − μy)² / (2σy²)]    (C.42)

where the fact that Σ0 is diagonal and contains the covariances σx2 and σy2 on that diagonal
has been used. This implies that an axis-aligned jointly Gaussian distribution is independent.
Figures C.7(a) and (b) demonstrates this independence with the mutual information being
identically zero for all events (xi , yj ).

[Figure C.6 appears here: (a) the joint distribution and marginal distributions; (b) the joint distribution with the independent-assumption joint distribution; (c) event mutual information with its upper and lower limits]

Figure C.6: Joint and marginal distributions and mutual information measures for the
two-dimensional discrete system with a Gaussian distribution of Example C.5.8

[Figure C.7 appears here: (a) the joint distribution and marginal distributions; (b) the event mutual information surface]

Figure C.7: Joint and marginal distributions and mutual information measures for the
two-dimensional discrete system with a Gaussian distribution of Example C.5.9

[Figure C.8 appears here: (a) the joint distribution and marginal distributions; (b) event mutual information with its upper and lower limits]

Figure C.8: Joint and marginal distributions and mutual information measures for the
two-dimensional continuous system with a Gaussian distribution of Example C.5.10

Example C.5.10 – Event mutual information for a continuous system


Finally, consider the same distribution as from Example C.5.8, but instead generated on a
continuous domain. The joint and marginal distributions are shown in Figure C.8(a) and
comparison with Figure C.6(a) clearly demonstrates that the joint distribution is no longer
limited to be smaller than the minimum of the marginal values. In turn this means that the
event mutual information is no longer limited in the same way as before and Figure C.8(b)
indicates that the mutual information surface (blue) is no longer constrained to fall within
the limits (red), in contrast with Figure C.6(c).

Example C.5.11 – Distribution mutual information in Example 2.5.3


In Example 2.5.3 there were two boxes of produce available, one of which had statistically
independent distributions for colour and vegetable type, and one in which these were no
longer independent. The distribution mutual information for the first box can be calculated
as4

I1(X; Y) = Σi Σj P1(xi, yj) loge[ P1(xi, yj) / (P1(xi) P1(yj)) ]
         = 4 × 0.25 × loge[ 0.25 / (0.5 × 0.5) ]
         = 0    (C.43)
4. With b = e for natural units (nats).

since the argument of the logarithm is unity. Likewise, for the second box,

I2(X; Y) = Σi Σj P2(xi, yj) loge[ P2(xi, yj) / (P2(xi) P2(yj)) ]
         = 2 × 0.4 × loge[ 0.4 / (0.5 × 0.5) ] + 2 × 0.1 × loge[ 0.1 / (0.5 × 0.5) ]
         = 0.1927 (nats) .    (C.44)

This indicates that the distribution mutual information of the first box, corresponding to inde-
pendent random variables, confirms the lack of statistical dependence between the variables,
while for the second box a clear statistical relationship is demonstrated.
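The two values can be checked with a few lines of Python (a sketch only; the joint probability tables below are those implied by Equations C.43 and C.44).

    import numpy as np

    def mutual_information(P):
        # I(X;Y) = sum_ij P(xi, yj) * ln[ P(xi, yj) / (P(xi) P(yj)) ], in nats.
        Px = P.sum(axis=1, keepdims=True)
        Py = P.sum(axis=0, keepdims=True)
        mask = P > 0
        return float(np.sum(P[mask] * np.log(P[mask] / (Px @ Py)[mask])))

    box1 = np.array([[0.25, 0.25],
                     [0.25, 0.25]])    # independent colour / vegetable type
    box2 = np.array([[0.4, 0.1],
                     [0.1, 0.4]])      # correlated colour / vegetable type

    print(mutual_information(box1))    # 0.0
    print(mutual_information(box2))    # ~0.1927 nats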

Example C.5.12 – Information theoretic quantities for Gaussian systems


Consider now the distribution entropies and mutual information for a two-dimensional
Gaussian distribution from Examples C.5.8, C.5.9 and C.5.10. Denoting the three dis-
tributions as P11 (x, y), P12 (x, y) and P13 (x, y) respectively, the various quantities (in nats)
are,

Quantity P11 (x, y) P12 (x, y) P13 (x, y)


HP (X, Y ) 3.661 3.661 0.077
HP (X) 2.279 2.406 0.487
HP (Y ) 1.844 1.255 0.052
HP (X|Y ) 1.817 2.406 0.025
HP (Y |X) 1.382 1.255 -0.410
IP (X; Y ) 0.462 0 0.462

The basic relationships of the previous sections can be readily verified using the quantities
here, noting that P11 (x, y) and P12 (x, y) are discrete while P13 (x, y) is continuous. Note,
significantly, that this continuous distribution has a negative conditional entropy HP13 (Y |X)
and that HP13 (X) > HP13 (X, Y ).

Furthermore, Example C.5.6 showed that the entropy of a Gaussian distribution is given by

HP(Z) = (1/2) loge[(2πe)^n |ΣN|]    (C.30)

and the covariance matrix is

ΣN = [ 0.155    0.078
       0.078    0.065 ]

so that |ΣXY| = 0.004, |ΣX| = 0.155 and |ΣY| = 0.065, and the values in the table can be
directly verified using Equation C.30.
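The continuous-case column of the table can be reproduced directly from the covariance matrix using Equation C.30; the Python sketch below (illustrative only) rebuilds ΣN from the rotation and prints the joint, marginal and conditional entropies and the mutual information.

    import numpy as np

    def h_gauss(cov):
        # Differential entropy of an n-dimensional Gaussian (Equation C.30), in nats.
        cov = np.atleast_2d(cov)
        n = cov.shape[0]
        return 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(cov))

    theta = np.pi / 6.0
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Sigma = R @ np.diag([0.2, 0.02]) @ R.T       # ~[[0.155, 0.078], [0.078, 0.065]]

    Hxy, Hx, Hy = h_gauss(Sigma), h_gauss(Sigma[0, 0]), h_gauss(Sigma[1, 1])
    print(round(Hxy, 3), round(Hx, 3), round(Hy, 3))   # ~0.077, 0.487, 0.052
    print(round(Hxy - Hy, 3), round(Hxy - Hx, 3))      # H(X|Y) ~ 0.025, H(Y|X) ~ -0.410
    print(round(Hx + Hy - Hxy, 3))                     # I(X;Y) ~ 0.462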

C.6 Deviations Using Entropy and Mutual Information

C.6.1 Entropy and Mutual Information

Example C.6.1 – Mutual information as deviation in continuous systems


Let x be a three-dimensional Gaussian distribution and consider the relationships between
the components x1 , x2 and x3 , where x = [x1 , x2 , x3 ]. Here,

PX(x) ≜ [1 / ((2π)^(3/2) |ΣN|^(1/2))] exp[−(1/2)(x − x̃)^T ΣN⁻¹ (x − x̃)]    (C.45)

and ΣN and x̃ represent the covariance matrix and the mean respectively. Let an axis-aligned
covariance matrix be given by
Σ0 = [ 2    0    0
       0    1    0
       0    0    3 ]    (C.46)
and rotate this through θ = π/6 about the x3 axis to obtain the covariance matrix as

ΣN = [ 7/4     √3/4    0
       √3/4    5/4     0
       0       0       3 ] .    (C.47)

Equation C.30 gives the entropy of a Gaussian as

HP(x) = (1/2) loge[(2πe)^n |ΣN|] .    (C.30)

The marginalisation of a Gaussian can be obtained directly by extraction of the subset of the
covariance matrix corresponding to the marginal of interest so that it is possible to calculate

the entropies as

HP(x1) = (1/2) loge[(2πe) (7/4)] = 1.6988
HP(x2) = 1.5305
HP(x3) = 1.9682
HP(x1, x2) = (1/2) loge[(2πe)² |[7/4, √3/4; √3/4, 5/4]|] = 3.1845
HP(x1, x3) = 3.6670
HP(x2, x3) = 3.4988

and the mutual information follows immediately as

IP(x1, x2) = HP(x1) + HP(x2) − HP(x1, x2) = 0.0448
IP(x1, x3) = 0
IP(x2, x3) = 0

so that the triangle inequality cannot hold as

IP(x1, x3) + IP(x2, x3) ≱ IP(x1, x2) .    (C.48)
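This failure is easily confirmed numerically; the Python sketch below (illustrative only) rebuilds ΣN from the rotation, extracts the marginal sub-matrices and evaluates the three pairwise mutual informations.

    import numpy as np

    def h_gauss(cov):
        cov = np.atleast_2d(cov)
        n = cov.shape[0]
        return 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(cov))

    theta = np.pi / 6.0
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
    Sigma = Rz @ np.diag([2.0, 1.0, 3.0]) @ Rz.T

    def mi(a, b):
        # IP(xa, xb) = HP(xa) + HP(xb) - HP(xa, xb), marginals by sub-matrix extraction.
        sub = Sigma[np.ix_([a, b], [a, b])]
        return h_gauss(Sigma[a, a]) + h_gauss(Sigma[b, b]) - h_gauss(sub)

    I12, I13, I23 = mi(0, 1), mi(0, 2), mi(1, 2)
    print(round(I12, 4), round(I13, 4), round(I23, 4))   # ~0.0448, 0.0, 0.0
    print(I13 + I23 >= I12)                               # False: the triangle inequality fails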


Appendix D

Measure Comparisons

Chapter 2 dealt primarily with the development of a framework to unify the various compet-
ing approaches to making comparisons between vectors, functions and distributions. This
appendix turns attention to several of the commonly used approaches to this particular prob-
lem. It primarily seeks to identify the particular characteristics of the different measures
and their relative capabilities. Particular focus will be placed on the subtle but important
distinction between measures which capture functional (or ‘structural’) characteristics of
the entities involved, and those which capture statistical relationships.

D.1 Measures for Vectors and Functions

In the first instance consider several measures commonly used in systems involving vectors
and, by extension, any set S for which a coordinate system of Section 2.4.1 exists. Impor-
tantly, as Section 2.4.4 showed, functions can be considered an extension of this idea into
infinite dimensional spaces. In the absence of statistical information, all of these measures
capture the structural properties of the elements being compared. Fundamentally, a struc-
tural measure can be characterised by the manner in which it compares each coordinate
value of one element with the coordinate value(s) of the other. This section is not an ex-
haustive discussion of the commonly used measures, but rather considers some measures of
interest to the applications implied throughout this thesis. Indeed, interpreting the notion
of distances in the measure-theoretic sense implies that there will be an inexhaustible set of

valid measures and the appropriate selection for any particular task will necessarily depend
on the nature of that task.

D.1.1 Euclidian Distance

The Euclidian distance is the most commonly used measure for capturing the proximity of
vectors (and functions). For two vectors Si ↔ si and Sj ↔ sj , the measure is given by


DE²(si, sj) = Σk (sik − sjk)²           for finite dimensionality    (D.1)
            = ∫ [si(k) − sj(k)]² dk      for continuous cases .       (D.2)

Here the components of the vectors are indexed by the parameter k which is discrete or
continuous as necessary. These equations imply three characterising effects: firstly, the
components of the vectors being compared are implicitly aligned by the index k with each
component of each vector, sik or sj (k), only compared with the equivalent component of
the other element; secondly, each component contributes to the total independently of all
others; and finally, all components are considered equally1 .

While the alignment is usually unambiguous in the discrete case, the continuous case involves
an alignment of the abscissae, k. Consider Figure D.1 in which two different time-series are
compared. In Figure D.1(a) the signals are inherently synchronous and are unambiguously
aligned; this case would occur when considering two signals captured simultaneously. In
Figure D.1(b), however, the two signals are inherently asynchronous and the offset between
them, k′, is unknown a priori. For the continuous case,

DE²(si, sj, k′) = ∫ [si(k) − sj(k − k′)]² dk    (D.3)
               = ∫ si²(k) dk + ∫ sj²(k − k′) dk − 2 ∫ si(k) sj(k − k′) dk ,    (D.4)

which can be directly compared to the inner product of Equation 2.37,



⟨[si − sj], [si − sj]⟩ = ∫k μij(k) [si − sj]²(k) dk .    (2.37)
1. That is, there is no favoured value of k; note that in Section D.1.3 this distance will be shown to
non-linearly transform the contributions from each component to induce a circular symmetry so that the
resulting space is Euclidian.

[Figure D.1 appears here: (a) synchronous signals; (b) asynchronous signals]

Figure D.1: Synchronous and asynchronous time-series with unknown offset. In (a) the
signals are explicitly aligned and the Euclidian distance can be calculated directly as a
scalar value. In (b), however, the alignment is not known a priori and the Euclidian
distance generates a function of the offset parameter k′.

Recognising that this inner product requires a flat space and that all coordinates are
weighted equally, that is, μij (k) ≡ 1, implies that the Euclidian distance is valid for an
isotropic, flat space and the connection with Euclidian geometry is directly and unambigu-
ously clear.

Furthermore, the Euclidian distance of Equation D.4 contains only one term which actually
depends on k′. Note that if the function sj(k − k′) is non-zero only over the domain of k
then it is possible to write

∫ sj(k − k′) dk = ∫ sj(k″) dk″    (D.5)

and the result will be independent of k′. Given two fixed functions si(k) and sj(k), therefore,
the only free parameter in the Euclidian distance will be the offset k′ and its entire effect
will be encapsulated in the final term. This term is recognisable as the correlation function
and can be written

(si ⋆ sj)(k′) = ∫ si(k) sj(k − k′) dk    (D.6)

and in this case of a fixed function, finding the value of k′ which maximises the correlation
score corresponds directly to the value which minimises the Euclidian distance.

Finally, note that the form of Equations D.1 and D.2 implies that the Euclidian distance is

related to the mean squared error (MSE) of the two vectors. In the discrete case this distance
is proportional to the MSE with the constant of proportionality equal to the dimensionality,
whereas in the continuous case the value is the MSE directly.

The Euclidian distance calculates the deviation between two vectors in a flat, isotropic
space according to

DE²(si, sj) = Σk (sik − sjk)²    (D.1)
DE²(si, sj) = ∫ [si(k) − sj(k)]² dk    (D.2)

depending on whether the system is discrete or continuous, respectively. It is a measure
of the structural deviation which is proportional to the mean squared error (MSE) and
has three defining characteristics:

1. The index k implicitly aligns the components of the two vectors
2. Each component is compared only to the equivalent component of the other vector
3. All components are identically weighted

The Euclidian distance can also be rewritten in terms of a correlation function as

DE²(si, sj) = ∫ si²(k) dk + ∫ sj²(k − k′) dk − 2 (si ⋆ sj)(k′)    (D.7)

and maximising the correlation between two fixed functions si(k) and sj(k) is equivalent
to minimising the Euclidian distance between them.

Example D.1.1 – Euclidian distance for vectors


Consider first a simple geometrical example in three dimensions. Let two points A and B
be defined by their displacements a and b from an appropriate origin. Let their values be

a = [ax, ay, az]^T = [1.0, 2.0, −0.5]^T
b = [bx, by, bz]^T = [5.0, −3.0, 2.5]^T .



The Euclidian distance will be


DE²(a, b) = Σk (ak − bk)²
          = 4.0² + 5.0² + 3.0²
          = 50

so that the distance between them is √50 units.

Example D.1.2 – Euclidian distance in discrete system


Once more consider the example of throwing two dice simultaneously and let the rolls be
represented by si = [Si1 , Si2 ] with Sik ∈ [1, 6] ⊂ Z+ . In the first instance let the two dice
have a different colour so that they can be distinguished and consider the two rolls,

s1 = [2, 5]

s2 = [6, 3]

where the first component corresponds to k = 1 and the first die and the second die to k = 2.
The Euclidian distance can be calculated between the two rolls as

DE²(s1, s2) = (2 − 6)² + (5 − 3)²
            = 20 .

Now, note that the implicit alignment of the components gives rise to an interesting effect.
Say that there is a third roll of s3 = [3, 6]; here the distance will be given by

DE²(s1, s3) = (2 − 3)² + (5 − 6)²
            = 2

and despite the fact that the coordinates have simply been reversed between s2 and s3 there
is a substantial difference in the resulting distance. While this is undesirable in the case of
the tossing of a pair of dice, it is critical to many other applications, including scoring the
distance between words so that they may be arranged in lexicographical order.

Consider instead the case when the two dice are indistinguishable so that either die could
give rise to either of the two components. It would be appropriate to consider the rolls
such that once rolled, the components are constructed from the values of the faces shown in
increasing order. This means that if the roll included a 3 and a 6, then s3 would be valid,
but s2 would not. Consider the extension of this to the rolling of six dice with the following
values obtained:

s4 = [1 2 3 4 5 6]

s5 = [1 1 2 4 5 5]

s6 = [1 1 1 5 6 6] .

The two distances of interest are

DE²(s4, s5) = 3
DE²(s4, s6) = 7 .

Here it is clear that the ordered sets yield a result which is intuitive and the effect of the
implicit coordinate alignment may be significant depending on the context.

Example D.1.3 – Euclidian distance for functions


Consider the input signal and ‘models’ of Figure D.2(a) where the top graph represents the
input signal of interest and the lower two graphs two possible models for an asynchronous
event expected to occur in the input signal. The goal of a detection system is to determine
firstly which model matches the signal most closely and at what time offset that occurs.
Noting the effect of the first two terms of Equation D.4 on the resulting distance and because
the dependence of the distance on the time offset is limited to the final term, the models
have been normalised so that they yield the same constant term in the distance. This allows
the effect of the final term to be examined in isolation.

A visual inspection suggests that the second model should provide a better match and hence
a smaller distance measure and that the asynchronous signal appears to occur at around
3.5 seconds. The output correlation scores and Euclidian distances are shown in Figure
D.2(b). In these graphs the left-hand axis (blue) represents the distance measure, while the

[Figure D.2 appears here: (a) the input signal and the two models; (b) correlation scores (green) and Euclidian distance measures (blue) against time offset for each model]

Figure D.2: Comparing two models to an input signal using Euclidian Distance. (a) shows
the input signal (top) with the two models for an asynchronous event expected in the signal
below, while (b) shows the correlation scores and Euclidian distance measures for the two
signals for the offsets along the x-axis

right hand side (green) represents the correlation score. The relationship of Equation D.4
is immediately apparent as is the fact that the constant terms are equal in the two cases.
Furthermore, the distance measure attains a lower value in the case of model #2 and this
value occurs at about 3.5 seconds, which is reasonable given the noise present on the input
signal.
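The equivalence between maximising the correlation score and minimising the Euclidian distance (Equation D.7) is straightforward to demonstrate numerically. The Python sketch below uses a synthetic signal and model (the pulse shapes and the 3.5 second offset are illustrative choices, not the signals of Figure D.2) and locates the offset preferred by each criterion.

    import numpy as np

    dt = 0.01
    t = np.arange(0.0, 10.0, dt)
    tau = np.arange(-5.0, 5.0, dt)

    # Illustrative input: a pulse centred at 3.5 s plus a little noise.
    rng = np.random.default_rng(0)
    signal = np.exp(-0.5 * ((t - 3.5) / 0.2) ** 2) + 0.05 * rng.standard_normal(t.size)
    model = np.exp(-0.5 * (tau / 0.2) ** 2)        # model of the event, centred at zero offset

    offsets = np.arange(0.0, 10.0, 0.05)
    dist = np.empty(offsets.size)
    corr = np.empty(offsets.size)
    for n, k0 in enumerate(offsets):
        shifted = np.interp(t, tau + k0, model, left=0.0, right=0.0)   # model shifted by k0
        dist[n] = np.sum((signal - shifted) ** 2) * dt                 # Euclidian (squared) distance
        corr[n] = np.sum(signal * shifted) * dt                        # correlation score

    print(offsets[np.argmin(dist)], offsets[np.argmax(corr)])          # both close to 3.5 s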

D.1.2 Closed Form Euclidian Distance

One of the most commonly used representations in robotics systems is the Gaussian distri-
bution defined over x ∈ Rn with mean μ and covariance matrix C,
 
P(x) ≜ [1 / ((2π)^(n/2) |C|^(1/2))] exp[−(1/2)(x − μ)^T C⁻¹ (x − μ)] .    (D.8)

Now, given two such distributions with means μ1 , μ2 and covariances C1 , C2 it is possible
to determine the Euclidian distance between them as a closed-form algebraic solution. Two
such one-dimensional Gaussians are shown in Figure D.3(a) and their difference in Figure
D.3(b). The Euclidian distance is given by Equation D.2 and in this case becomes

DE²[P1(x), P2(x)] = ∫ [P1(x) − P2(x)]² dx
                  = ∫ P1²(x) dx + ∫ P2²(x) dx − 2 ∫ P1(x) P2(x) dx .    (D.9)

Each of the first two terms can be evaluated according to


 
∫ P²(x) dx = [1 / ((2π)^n |C|)] ∫ exp[−(x − μ)^T C⁻¹ (x − μ)] dx    (D.10)

and the integral part can be obtained by noting that it corresponds to an un-normalised

[Figure D.3 appears here: (a) two Gaussian distributions; (b) their difference]

Figure D.3: Construction for the closed–form Euclidian distance for Gaussian distributions

Gaussian function with a covariance given by C′ = C/2, which yields

∫ P²(x) dx = [1 / ((2π)^n |C|)] × (2π)^(n/2) |C′|^(1/2)    (D.11)
           = |C/2|^(1/2) / [(2π)^(n/2) |C|]                (D.12)
           = [1 / (2^(2n) π^n |C|)]^(1/2) .                (D.13)

The third term of Equation D.9 contains the product of two Gaussians and Roweis (1999)
shows that the product is given by

N (a, A) × N (b, B) = KN (d, D) (D.14)



with

D = [A⁻¹ + B⁻¹]⁻¹    (D.15)
d = D A⁻¹ a + D B⁻¹ b    (D.16)
K = [ |D| / ((2π)^n |A| |B|) ]^(1/2) exp[−(1/2)(a^T A⁻¹ a + b^T B⁻¹ b − d^T D⁻¹ d)]    (D.17)
  = [ 1 / ((2π)^n |A + B|) ]^(1/2) exp[−(1/2)(a − b)^T (A + B)⁻¹ (a − b)]    (D.18)

using the result of Bailey (2002, pp191–2) to simplify the result in terms of the parameters
of the original Gaussians. The value of the third term is then given by

∫ P1(x) P2(x) dx = [ 1 / ((2π)^n |C1 + C2|) ]^(1/2) exp[−(1/2)(μ1 − μ2)^T (C1 + C2)⁻¹ (μ1 − μ2)] .    (D.19)

The resulting distance is given algebraically as

DE²[P1(x), P2(x)] = [ 1 / (2^(2n) π^n |C1|) ]^(1/2) + [ 1 / (2^(2n) π^n |C2|) ]^(1/2)
                    − 2 [ 1 / ((2π)^n |C1 + C2|) ]^(1/2) exp[−(1/2)(μ1 − μ2)^T (C1 + C2)⁻¹ (μ1 − μ2)] .    (D.20)

Because of the positivity of the Euclidian distance, the limits of this expression can be
calculated directly, with the distance having a minimum of zero for P1 (x) = P2 (x) and a
maximum value occurring when the third term is zero, that is, as the overlap of the two
Gaussians approaches zero. The maximum value is given by

DE,max²[P1, P2] = [ 1 / (2^(2n) π^n |C1|) ]^(1/2) + [ 1 / (2^(2n) π^n |C2|) ]^(1/2) .    (D.21)
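Equation D.20 is simple to evaluate, and in one dimension it can be checked against a direct numerical integration of Equation D.9; the Python sketch below (with illustrative means and variances) performs both calculations.

    import numpy as np

    def closed_form_de2(mu1, C1, mu2, C2):
        # Squared Euclidian distance between two n-dimensional Gaussians (Equation D.20).
        mu1, mu2 = np.atleast_1d(mu1).astype(float), np.atleast_1d(mu2).astype(float)
        C1, C2 = np.atleast_2d(C1).astype(float), np.atleast_2d(C2).astype(float)
        n = mu1.size
        term1 = np.sqrt(1.0 / (2.0 ** (2 * n) * np.pi ** n * np.linalg.det(C1)))
        term2 = np.sqrt(1.0 / (2.0 ** (2 * n) * np.pi ** n * np.linalg.det(C2)))
        d = mu1 - mu2
        cross = np.sqrt(1.0 / ((2.0 * np.pi) ** n * np.linalg.det(C1 + C2))) \
            * np.exp(-0.5 * d @ np.linalg.solve(C1 + C2, d))
        return term1 + term2 - 2.0 * cross

    def gauss(x, mu, var):
        return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

    x = np.linspace(-20.0, 20.0, 20001)
    numeric = np.sum((gauss(x, -1.0, 1.0) - gauss(x, 2.0, 2.0)) ** 2) * (x[1] - x[0])
    print(closed_form_de2(-1.0, 1.0, 2.0, 2.0), numeric)   # the two values agree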

Finally, note that an analytic solution exists also for Gaussian mixture models of the form


PN(x) = Σ(i=1..N) αi N(μi, Ci)    (D.22)

as the equivalent expression of Equation D.9 is given by


DE²[PN, PM] = ∫ [ Σ(i=1..N) αi N(μi, Ci) − Σ(j=1..M) αj N(μj, Cj) ]² dx    (D.23)
            = ∫ [ (Σi αi Ni)² + (Σj αj Nj)² − 2 (Σi αi Ni)(Σj αj Nj) ] dx
            = Σi Σi′ αi αi′ ∫ Ni Ni′ dx + Σj Σj′ αj αj′ ∫ Nj Nj′ dx − 2 Σi Σj αi αj ∫ Ni Nj dx    (D.24)

where for two arbitrary Gaussians Na and Nb ,

∫ Na Nb dx = [ 1 / ((2π)^n |Ca + Cb|) ]^(1/2) exp[−(1/2)(μa − μb)^T (Ca + Cb)⁻¹ (μa − μb)]    (D.25)
∫ Na² dx = [ 1 / (2^(2n) π^n |Ca|) ]^(1/2) ,    (D.26)

and the maximum value occurs at

   
DE,max² = Σi Σi′ αi αi′ ∫ Ni Ni′ dx + Σj Σj′ αj αj′ ∫ Nj Nj′ dx .    (D.27)

D.1.3 Ln Distances

The Euclidian distance is a special case of a more general class of distance measures, the Ln
(normed) distances. These distances can be interpreted as the application of the appropriate
measure of Section 2.4.3 to the vector difference; for the discrete case the result is

DLn(si, sj) = Ln[si − sj]
            = [ Σk |sik − sjk|^n ]^(1/n)    (D.28)

[Figure D.4 appears here]

Figure D.4: The Manhattan metric shown for a two-dimensional space. The measure is
determined from the integral along any valid axis-aligned path from the origin to the element
of interest; that is, no diagonal paths are allowed. In this figure both the red and blue paths
are valid and would result in the same measure for the element si .

and the continuous case is


DLn(si, sj) = [ ∫ |si(k) − sj(k)|^n dk ]^(1/n) .    (D.29)

The Euclidian distance obviously follows for n = 2 and corresponds to the ordinary ge-
ometrical concept of distance. Other values of n effectively apply non-linear scalings to
the difference space, thereby changing how much any one component contributes to the
resulting distance according to its value. Consider the discrete case for n = 1,


DL1(si, sj) = Σk |sik − sjk| ,    (D.30)

where each difference contributes to the total distance equally. This measure is equivalent
to the Manhattan metric (Krause 1986), a two dimensional example of which is shown
in Figure D.4 where the path over which the measure is calculated is restricted to travel
parallel to the coordinate axes. In the Euclidian case, however, a circular symmetry is
induced and the contribution of each component is scaled accordingly so that the length of
the radius of the resulting hyper-sphere corresponds to the calculated measure.

As n grows, however, the contributions of the smaller components fall, relative to the
contribution of the larger ones. Thus, these higher order measures emphasise the effects of
the components with the greatest disparity. In the limit that n → ∞, the L∞ norm is given

by

DL∞(si, sj) = maxk { |sik − sjk| }    (D.31)

and the result depends only on the largest component of the vector difference.

Note that the three characteristics of the Euclidian distance all apply to the Ln distances:
the components are implicitly aligned; each component of each vector is only compared with
the equivalent component of the other vector; and there is no favoured dimension. While
the resulting contribution of the different components depends directly on their relative
magnitudes, they do not depend in any way on the value of the index k.

The Ln (normed) distances correspond to a generalisation of the Euclidian distance
(n = 2) to non-flat spaces. The measure is given by

DLn(si, sj) = [ Σk |sik − sjk|^n ]^(1/n)    (D.28)

for the discrete case, and for the continuous by

DLn(si, sj) = [ ∫ |si(k) − sj(k)|^n dk ]^(1/n) .    (D.29)

The measure is characterised by four properties:

1. The index k implicitly aligns the components of the two vectors
2. Each component is compared only to the equivalent component of the other vector
3. All components are identically weighted (no dependence on k)
4. The contributions of each component depend on their relative magnitudes and
the value of n

Example D.1.4 – Ln distances for vectors


Example D.1.1 considered the Euclidian, L2 , distance between two vectors in R3 which had
the specific values of

a = [ax, ay, az]^T = [1.0, 2.0, −0.5]^T
b = [bx, by, bz]^T = [5.0, −3.0, 2.5]^T
⇒ |a − b| = [4.0, 5.0, 3.0]^T .

Applying the Ln measure to the difference vector above for n = {1, 2, 3, 4, ∞} yields the
following measures:

n      DLn(a, b)
1      12.0
2      7.071 = √50
3      6.0
4      5.569
∞      5.0

It is readily seen that L2 corresponds to the value obtained in the earlier example for the
Euclidian distance measure. As the value of n increases the resulting measure monotonically
approaches the largest element of the difference vector.
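The table can be reproduced with a few lines of Python (a sketch only; the L∞ value is taken as the limiting maximum component of the difference).

    import numpy as np

    a = np.array([1.0, 2.0, -0.5])
    b = np.array([5.0, -3.0, 2.5])
    diff = np.abs(a - b)                          # [4.0, 5.0, 3.0]

    def ln_distance(d, n):
        return np.sum(d ** n) ** (1.0 / n)

    for n in (1, 2, 3, 4):
        print(n, round(ln_distance(diff, n), 3))  # 12.0, 7.071, 6.0, 5.569
    print('inf', diff.max())                      # 5.0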

Example D.1.5 – Ln distances for functions


Example D.1.3 evaluated the Euclidian distance between a signal and two asynchronous
models for different time offsets. The Ln distances for the same input signals of Figure
D.2(a) over the same range of time (or domain) offsets are shown in Figure D.5 for n =
{1, 2, 3, 4, 5}. The upper two graphs in this figure represent the Ln distances between the
input signal and models #1 and #2 respectively. The bottom graph shows the L2 distances
separately and comparison with Figure D.2(b) confirms the assertion that the Euclidian
distance is, indeed, the same as the L2 distance.

[Figure D.5 appears here: Ln distances (n = 1, ..., 5) between the input signal and models #1 and #2 against time offset, with the L2 distances shown separately in the bottom panel]

Figure D.5: The Ln distance measures for an input signal and two asynchronous models.
The distances are calculated for n = {1, 2, 3, 4, 5} and offsets from 0 to 10 seconds and
these distances can be compared to the Euclidian distances of Figure D.2(b). The upper
two graphs show the five distance measures for the varying offset for models #1 and #2
respectively. The bottom graph, however, shows the L2 distance separately and yields
identical results to the Euclidian distance of Example D.1.3.

The monotonic convergence of the distances as n → ∞ is immediately clear. In addition,
while the relative weighting of the components changes, the relationships between the
measures at different offsets remain qualitatively unchanged as the value of n changes.

The upper graph of Figure D.5 highlights an important drawback of the Ln distance mea-
sures: the measure is strongly affected by transformations of the ordinate dimension, such
as an inversion, a scaling or an offset. In the continuous case the measure is given by


DLnn (si , sj ) = |si (k) − sj (k)|n dk . (D.29)

If the two signals, si and sj , are the input signals and model respectively, then leaving the
model fixed and applying transformations to the values of the signal will have a direct effect
on the value of the resulting measure. For example, applying an offset sl = si + C for some
constant C yields

DLn^n(sl, sj) = ∫ |sl(k) − sj(k)|^n dk
              = ∫ |si(k) − sj(k) + C|^n dk    (D.32)

and the value of C will affect the resulting measure significantly. This is a common effect
seen in the application of correlation template matching in image processing applications
where constant offsets in the image and template result in biased results (Lewis 1995).

A similar effect is demonstrated in the application of Model #1 to the input signal in


Figure D.5 where the second part of the signal (the downward going peak at approximately
4.25 seconds) results in an increase in the distance value, despite the fact that the signal
appears to have the same shape (with an inversion) as the model. This is acceptable
when the signals and models are known to have the same sign, but in other applications
this will introduce unfortunate consequences. For example, if the application involved an
image-based identification of a tree in a natural environment, there is no guarantee that the
relationship of the background to the tree will retain the same sign (or the same magnitude)
and identifying the same object viewed under different illumination conditions would require
an ensemble of models for these conditions if the Ln distances were utilised.

A significant drawback of the Ln distances (including the Euclidian distance) is their


sensitivity to any transformation in the ordinate values of the signals or vectors under
consideration. Specifically, the resulting measure will be strongly affected by offsets,
inversions and scalings and this measure will perform poorly when such variations are
expected in the data.

D.1.4 Mahalanobis Distance

The Euclidian and Ln (for a fixed n) distance measures share the property that the contri-
bution of any particular component to the resulting measure depends only on their relative
magnitudes and not on the value of k, that is, there are no ‘preferred’ or ‘weighted’ coordi-
nates. For instance, when considering the distance between two vectors in a 3-space, each of
the three coordinate directions should be expected to yield equally weighted contributions
to the resulting distance, though the resulting distance will be dominated by the largest
components. This results in a spherical symmetry in the coordinate axes defining the space
and an arbitrary rotation of these axes will have no effect on the resulting measure, in
accordance with the fact that the physical distance between two objects in the real world
does not depend on which way is ‘up’.

It has already been seen that the Ln distances for n ≠ 1 apply a non-linear scaling to the
component differences sik − sjk; for example, when n = 2 each component difference
is squared before being combined. As the functions |x|^n are all monotonically increasing for
increasing |x|, the larger the value of a component difference, the larger its contribution
to the resulting measure, with the value of n skewing the weighting towards the maximum
component difference, DLn → maxk{|sik − sjk|} as n → ∞.

In many applications these ‘isotropic’ distance measures are appropriate, though compar-
isons involving coordinates (or components) which have vastly different characteristic value
or range can skew the results. The goal of a distance measure is to provide an informative
comparison of the relationships between the various elements under consideration, so if the
resulting measure is excessively weighted towards particular components, then the resulting
measure can have a lower efficacy than one which avoids this skew. This means that the
isotropic distances implicitly suggest that on average all component differences should have
values of approximately equivalent magnitude.

If this is not the case then the component differences with the larger values will always
dominate the resulting measure. Consider, for example, the distance between a set of points
in a two-dimensional plane where the points have a large range in the x direction and a
substantially smaller range in the y direction. In this case the average contribution of the x
component difference will dominate the distance measure. To overcome these difficulties it
is possible to deliberately weight those components whose contributions would be otherwise
ignored: where the contributions are small the weighting should be large and where the
contribution is large, the weighting should be comparatively smaller. While this generates
a measure which has equal average contributions from the different components, it destroys
the spherical symmetry in favour of an ellipsoidal weighting. Unless both the coordinates
and the weighting effects are rotated together the measure will no longer be invariant to an
arbitrary change of the coordinate basis vectors.

The Mahalanobis distance is an extension of the Euclidian distance to take this into account.
It assumes that reasonable statistics for the expected measurements are available a priori
or can be adequately estimated on-line. In the simplest case the variance in each of the
coordinate directions is calculated from an expectation over the distribution or ensemble of
points {si },

σk² ≜ E{(sik − sμk)²}    (D.33)

where sμk is the kth coordinate of the mean of the (estimated) distribution of points (Pa-
poulis & Pillai 2002, p144). The resulting measure is calculated for the discrete case (with
the immediate and obvious extension for the continuous case) as

D(si, sj) = Σk |sik − sjk|² / σk² .    (D.34)

This measure can be further generalised by replacing the variance σk2 for individual coor-
dinates with the covariance matrix for all pairs of coordinates. The (k, l)th element in the
covariance matrix is the covariance between the kth and lth coordinates,

Σ ≜ [σkl²]    (D.35)
where σkl² = E{(sik − sμk)(sil − sμl)}    (D.36)

where the expectation is taken over the ensemble of {si } so that the variances of Equation

D.33 correspond to the diagonal entries of this matrix and the measure of Equation D.34 to
the case where this matrix is diagonal. The full matrix is able to consider not only which
components should have higher weighting, but also how each should be weighted to take
the correlations between the different coordinates into account. The resulting measure is
the Mahalanobis distance and can be expressed as a matrix equation,

DM (si , sj ) = (si − sj )T Σ−1 (si − sj ) . (D.37)

Clearly, the utility of this distance measure depends on the accuracy of the determination of
the covariance matrix entries and several methods are available for estimating the weights,
including direct calculation of the mean and variance for an ensemble of data, Gaussian
approximations to a sample ensemble, and learning the weights in a classification problem
(Tsang & Kwok 2003).

In the continuous case the general distance of Equation D.37 becomes a double integral
equation, 
DM(si, sj) = ∫∫ [si(k) − sj(k)] [si(l) − sj(l)] Σ⁻¹(k, l) dk dl    (D.38)

where Σ⁻¹(k, l) is the generalisation of the inverse covariance matrix element Σ⁻¹kl. The
terms of the covariance function are given by

Σ(k, l) = E {[si (k) − sμ (k)] [si (l) − sμ (l)]} (D.39)

where the expectation is taken over the set {si}. Finding the inverse of an infinite-
dimensional, symmetric, positive-definite matrix is difficult, and so consider here only the
simplified case of a symmetric covariance function Σ(k), with the trivial inverse
Σ⁻¹(k) = 1/Σ(k), yielding


DM,diagonal(si, sj) = ∫ [si(k) − sj(k)]² / Σ(k) dk    (D.40)
where Σ(k) = E{[si(k) − sμ(k)]²} .    (D.41)

The Mahalanobis distance, therefore, applies a non-isotropic scaling to the measure space
so that correlations and variances can be used to artificially scale the data to yield the max-
imum spread of distances for an ensemble or expected distribution of points si of interest.

The isotropic nature of the Ln distance measures results in a measure which treats
all components (or coordinates) equally. When the magnitudes of components differ
significantly, this results in a measure which is skewed towards the larger magnitude
components. In applications where it is desirable to give equal weighting to all compo-
nents, taking their relative magnitudes and correlations into account, the L2 distance,
specifically, can be extended to develop the Mahalanobis distance for the discrete case
as

DM(si, sj) = (si − sj)^T Σ⁻¹ (si − sj) ,    (D.37)

where the (k, l)th element of the covariance matrix is given by

Σkl = E{(sik − sμk)(sil − sμl)} .    (D.36)

In the continuous case, the general form is difficult to write as a result of the complexity
of the inversion of an infinite-dimensional, symmetric, positive-definite matrix; instead
the simplified Mahalanobis distance is determined taking only the diagonal entries
(variances) into account,

DM,diagonal(si, sj) = ∫ [si(k) − sj(k)]² / Σ(k) dk    (D.40)

and the variance function Σ(k) is given by

Σ(k) = E{[si(k) − sμ(k)]²} .    (D.41)

The Mahalanobis distance is characterised by the following properties:

1. The index k implicitly aligns the components of the two vectors
2. Each component is compared only to the equivalent component of the other vector
3. All components are weighted according to the inverse of the variances and covari-
ances associated with them so that the resulting distance is not skewed towards
components with the largest variance or covariances.

Example D.1.6 – The Mahalanobis distance


Let a two-dimensional plane be parameterised by the coordinates of points within it as si =
(xi , yi ). Consider the application of distance measures to a jointly-Gaussian distribution of
points,
{si } ∼ N (si ; μ, Σ)

and let the mean of the distribution be μ = (0, 0) and let an axis-aligned covariance matrix
be

Σ0 = [ 10     0
       0      0.5 ] .

Following a rotation of this distribution through π/6 radians, the resulting covariance matrix
is

Σ = [ 7.625     4.1136
      4.1136    2.875 ] .

Randomly sampling 100 points from this distribution yields the ensemble shown in Figure
D.6(a) and it is clearly seen that the distribution is Gaussian, is not axis-aligned and shows
a strong correlation between the two dimensions. The physical distances between the points
in this figure correspond to the Euclidian metric and this diagram represents the Euclidian
metric space for this distribution.

Analogously, the Euclidian space can be transformed to correspond to the simplified Ma-
halanobis space (where only the variances are taken into account). This will generate a
Euclidian space, but one in which the resulting distances will correspond to the application
of the simplified Mahalanobis distance to the original points. This is achieved by scaling
each of the coordinate axes by the inverse of the relevant standard deviation term,

x → x′ = x / √Σ11
y → y′ = y / √Σ22 .

The simplified Mahalanobis metric space is shown in Figure D.6(b). It is apparent that
the x and y variances are now made equal, but that the correlation effects are still strongly
visible. While the contributions of x and y independently will now have comparable average
magnitude, deviations along the line y = x will dominate deviations perpendicular to this
line.

Finally, applying a transformation to generate the Mahalanobis metric space taking the full
covariance matrix into account yields the results shown in Figure D.6(c). This figure is
obtained by removing the rotation from the original samples to generate an axis-aligned
distribution, applying the simplified transformation and rotating the samples back into the
original space. While this final rotation is not necessary as the resulting Euclidian distances
are isotropic and are equivalent to the Mahalanobis distance for the original points, it serves
to leave the samples in their original orientation. The resulting metric space represents the
points as a circularly symmetric Gaussian distribution.
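The three metric spaces of Figure D.6 correspond to progressively stronger whitening of the samples. The Python sketch below (illustrative only, using the covariance matrix of this example) confirms that the squared Euclidian distance computed after the full whitening transformation equals the Mahalanobis distance of Equation D.37 applied to the original points.

    import numpy as np

    theta = np.pi / 6.0
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Sigma = R @ np.diag([10.0, 0.5]) @ R.T          # the rotated covariance of this example

    rng = np.random.default_rng(1)
    pts = rng.multivariate_normal([0.0, 0.0], Sigma, size=100)

    def mahalanobis(si, sj, cov):
        # DM(si, sj) = (si - sj)^T Sigma^{-1} (si - sj), Equation D.37.
        d = si - sj
        return float(d @ np.linalg.solve(cov, d))

    # Whitening transform: after it, ordinary Euclidian geometry reproduces DM.
    L = np.linalg.cholesky(Sigma)
    white = np.linalg.solve(L, pts.T).T

    print(mahalanobis(pts[0], pts[1], Sigma))
    print(float(np.sum((white[0] - white[1]) ** 2)))   # identical value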

D.1.5 Hellinger-Battacharya Distance

The Hellinger-Battacharya distance is included here as a functional distance measure, rather


than as a statistical or distribution measure as it is normally considered. This is because
the mathematical form of the measure is of a similar form to the Ln distance measures and
many of the same properties are implied. This distance measure is often defined in terms
of the Hellinger Affinity as in Hero et al. (2001, p3) and the resulting distance measure is

DHB(si, sj) = ∫ [ √(si(k)) − √(sj(k)) ]² dk    (D.42)
DHB(si, sj) = Σk [ √(sik) − √(sjk) ]²    (D.43)

for continuous and discrete systems respectively.

Firstly, note that this measure will be complex-valued if the elements being considered have
any negative components. Thus, the measure is real-valued for all probability distributions
because of their non-negativity. Consider the application of the transformation,


T : si(k) → si′(k) = √(si(k))    (D.44)

to the input space. This results in the distance measure becoming



DHB(si, sj) = ∫ [ si′(k) − sj′(k) ]² dk    (D.45)

which is the Euclidian distance measure in the transformed space. Since the transforma-

[Figure D.6 appears here: (a) Euclidian metric space; (b) simplified Mahalanobis metric space; (c) Mahalanobis metric space]

Figure D.6: Euclidian and Mahalanobis metric spaces for correlated two-dimensional Gaus-
sian distribution. (a) shows the original samples and distances in this figure correspond
to the Euclidian distance between the actual samples. However, in (b) the space has been
transformed using a simplified Mahalanobis metric to make the resulting variances equal;
the distances in this figure correspond to the simplified Mahalanobis distance metric for the
original samples. Finally, (c) shows the (full) Mahalanobis metric space and distances on
this figure correspond to the Mahalanobis distances between the original samples.

tion is monotonic then the measure is order-preserving. This means that the Hellinger-
Battacharya distance can be interpreted as the Euclidian distance calculated on the points
after the application of the square-root operator on the original elements. This unexpected
result implies that the Hellinger-Battacharya shares the structural properties of the Euclid-
ian distance. The square root operation acts to emphasise the components of si and sj which
have small magnitudes; that is, the ‘tails’ of the distributions make a greater contribution
to the resulting measure than in the Euclidian case.

The Hellinger-Battacharya distance measure can be obtained for any non-negative func-
tions (or vectors), si(k) and sj(k) for example, according to

DHB(si, sj) = ∫ [ √(si(k)) − √(sj(k)) ]² dk    (D.42)

and can be interpreted as the Euclidian distance applied to the transformed space
obtained by the application of the square root to the ordinates of the function (or
vector),

T : si(k) → si′(k) = √(si(k)) ,    (D.44)

yielding

DHB(si, sj) = ∫ [ si′(k) − sj′(k) ]² dk = DE²(si′, sj′) .    (D.45)

The structural properties of the Ln and Euclidian distances apply also here: the index
k implicitly aligns the two elements; each component (or value of k) is compared only
with the equivalent component; and there are no 'favoured' values of k.

Example D.1.7 – Template matching with Hellinger-Battacharya distance


Consider a template matching problem equivalent to that in Example D.1.3, in which the
model and signal are non-negative. Figure D.7(a) shows both the inputs in blue and the
signals after the application of the square-root transformation, T , in red. Note that the
transformation is applied to each of the signals individually and since the range of the sig-
nals exceeds unity this results in a compression of the range while maintaining the ordering
of the ordinate values. The Hellinger-Battacharya distance for a varying time offset is
compared to the Euclidian distance for the same signals in Figure D.7(b), with the Euclid-
ian distance shown in blue and the Hellinger-Battacharya distance in red. The resulting

[Figure D.7 appears here: (a) input signals (blue) and square-root transformed signals (red); (b) Euclidian distance measures for the input signals (blue) and Hellinger-Battacharya distances for the same signals (red)]

Figure D.7: Comparison of the Euclidian and Hellinger-Battacharya distance measures


applied to a non-negative signal and model. The input signals are shown in (a) and the
resulting measures are shown in (b).

measures are clearly related and demonstrate the same characteristic relationships, further
justifying the conjecture that this particular measure can be interpreted as a structural rather
than probabilistic measure.

Example D.1.8 – Hellinger-Battacharya for probability distributions


The previous example considered the application of the Hellinger-Battacharya distance met-
ric to non-negative signals or arbitrary magnitude. In this example, consider the same
signals, only in this case the signals are interpreted as probability distributions and must be
normalised appropriately by treating the distribution as continuous and using

∫ si(k) dk = 1 .    (2.39)

The input signals and their transformed versions are shown in Figure D.8(a). The square
root transformation maps all positive values to a value which is closer to unity than the
original and leaves zero unchanged. In the previous example the range of the signals exceeded
unity, while in this case2 the density values are all lower than unity and all values are
increased, though as before the smaller ordinate values are emphasised more than the larger
values and the ordering is maintained.
2. This also occurs for all discrete probability distributions as all ordinates must be less than unity.

[Figure D.8 appears here: (a) input distributions (blue) and square-root transformed distributions (red); (b) Euclidian distance measures for the input distributions (blue) and Hellinger-Battacharya distances for the same distributions (red)]

Figure D.8: Comparison of the Euclidian and Hellinger-Battacharya distance measures


applied to a continuous probability distribution and model. The input distributions are
shown in (a) and the resulting measures are shown in (b)

Figure D.8(b) shows the resulting distance measures applied to this distribution. Clearly this
figure shows the same characteristic relationship between the values of the distance at the
various offsets. Thus, even when the functions being compared are interpreted as probability
distributions the Hellinger-Battacharya distance measure captures the structural properties
of the distributions, albeit with a greater emphasis on the tails of the functions than when
calculating the Euclidian distance.
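The structural interpretation is easy to verify in code: applying the ordinary Euclidian machinery of Equation D.2 to square-root transformed densities reproduces Equation D.42 exactly, while applying it to the raw densities weights the tails differently. The Python sketch below uses two illustrative Gaussian-shaped densities (not the signals of Figure D.8).

    import numpy as np

    k = np.linspace(0.0, 10.0, 1001)
    dk = k[1] - k[0]

    def euclidian_sq(f, g):
        # Equation D.2: squared Euclidian distance between two functions of k.
        return np.sum((f - g) ** 2) * dk

    def normalised(f):
        return f / (np.sum(f) * dk)

    s_i = normalised(np.exp(-0.5 * ((k - 3.5) / 0.4) ** 2))
    s_j = normalised(np.exp(-0.5 * ((k - 5.0) / 0.8) ** 2))

    d_hb = np.sum((np.sqrt(s_i) - np.sqrt(s_j)) ** 2) * dk     # Equation D.42
    print(d_hb, euclidian_sq(np.sqrt(s_i), np.sqrt(s_j)))      # identical: D.42 is D.2 after sqrt
    print(euclidian_sq(s_i, s_j))                              # differs: tails weighted differently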

D.2 Measures for Distributions

While the structural distance measures are the most commonly used measures in geometry
and simple signal analysis, it is possible to extend the notions to encompass the distance
between probability distributions. While many of the following measures can also be inter-
preted as structural measures, their statistical nature is considered more important in this
context. Once more, the measures listed here are not exhaustive, rather they are simply
commonly used in approaches to operations identified in the later parts of this thesis.

D.2.1 Kullback-Leibler Divergence

One of the most commonly used divergence measures for comparisons involving probability
distributions is the Kullback-Leibler divergence (Mackay 2004, p34),
 
DKL(si ‖ sj) = Esi[ logb(si / sj) ]    (D.46)

which becomes

DKL(si ‖ sj) = Σk sik logb(sik / sjk)    (D.47)
DKL(si ‖ sj) = ∫ si(k) logb[ si(k) / sj(k) ] dk    (D.48)

for discrete and continuous systems, respectively. As noted in Section 2.5.5, Jensen’s in-
equality can be used to show that the Kullback-Leibler divergence is strictly non-negative
(Mackay 2004, p44). This measure does not satisfy either the triangle inequality or sym-
metry and so is a divergence rather than a distance measure. These properties imply that
the divergences of a series of distributions from a reference distribution can be directly
compared, but their divergences from an alternative reference can not be compared with
the divergence measures from the first.

One immediately obvious property of this measure is that, in exactly the same way as the
structural measures of the previous section, the distributions are implicitly aligned by the
index k and each component (or value of the distribution at a specific k) is compared only
with the equivalent member of the alternative distribution.

The measure can, however, be interpreted in information-theoretic ways. Firstly, Equation


D.46 calculates the expected value (with respect to si ) of the ‘log-odds’ of si and sj . For
each component k, the log-odds is a measure capturing the degree to which the distribution
si is more likely than distribution sj . The values of the log-odds for various values of si (k)
and sj (k) are shown in Figure D.9. The logarithm of the fraction induces a symmetry on
the result so that a positive log-odds indicates that si is favoured, whereas a negative value
favours sj . Switching si and sj results in an inversion of the sign only.

Under this interpretation the average value of the log-odds would represent the average
amount by which one distribution is favoured over the other. Since the distributions are

[Figure D.9 appears here: the log-odds surface log[si(k)/sj(k)] over a range of values of si(k) and sj(k)]

Figure D.9: The value of the log-odds for constructing the Kullback-Leibler divergence for
various si (k) and sj (k).

normalised, if one distribution is significantly greater than the other at a small number
of values of k, then it must be smaller at all remaining values of k. Without loss of
generality, let the distribution si be uniformly distributed. In this case the peaks of sj
will be significantly greater than the values of si , but at all remaining values of k the
uniform distribution will have a greater probability than for sj , though this difference will
be somewhat smaller than for the peaks of sj . Now, the logarithm has the mathematical
effect of reducing the relative contribution of ratios which are very large (that is, at the
peaks of sj ) and the average log-odds will favour the uniform distribution and the result will
be positive. Switching the roles of si and sj will reverse the sign of the average log-odds. In
this way, the average log-odds will indicate, in general, which distribution has the greater
distribution entropy.

The Kullback-Leibler divergence is, however, the statistical expectation of the log-odds and
not the average value. That is, the measure assumes that the ‘true’ distribution is si and
weights the log-odds contributions in light of this. This has a significant effect on the
resulting measure: the contributions of the log-odds for which sj is favoured over si are
strongly suppressed. The expression for each contribution becomes
si(k) logb[ si(k) / sj(k) ] = logb[ (si(k) / sj(k))^si(k) ]    (D.49)

and the contributions for a series of values of the distributions are shown in Figure D.10.

[Figure D.10 appears here: individual contributions to the Kullback-Leibler divergence over a range of values of si(k) and sj(k)]

Figure D.10: Individual contributions to the Kullback-Leibler divergence for various si (k)
and sj (k).

It is clear that as si → 0 the contribution approaches zero. The measure, therefore, can be
interpreted as an asymmetric measure of the degree to which the assumed distribution si
has a greater distribution entropy than the reference distribution sj .

Alternatively, the expectation can be manipulated to yield


 
DKL(si ‖ sj) = Esi[ logb(si / sj) ]
             = Esi[ logb(1 / sj) − logb(1 / si) ]
             = Esi[ hsj(sj) ] − Hsi(Si)    (D.50)

where hsj(sj) and Hsi(Si) represent the event entropy of the kth element of sj and the
distribution entropy of si respectively. The event entropy in the first term represents the
unexpectedness of the event sjk occurring and the expectation is, therefore, the average
unexpectedness of the events occurring with probabilities sj when the events were expected
to have been observed with probabilities si . The second term represents the average un-
expectedness of the events occurring with the expected probabilities and so the measure is
related to the relative compactness of the distribution sj with respect to the distribution si .
Conversely, and more commonly, this measure is interpreted as the relative ‘spread’ of the
distribution si with respect to the distribution sj , which justifies the measure often being
called the ‘relative entropy’ of si with respect to sj .

Note that the Kullback-Leibler divergence will be infinite whenever there is any value of k
for which sjk = 0 with sik ≠ 0, that is, whenever the distribution si suggests that an event
is possible while sj implies that it should be impossible. This implies that the reference
density sj should be carefully selected so that the measure is meaningful. A particularly
interesting case occurs when the reference density is uninformative and all events occur with
equal probability. In this case the divergence measure will capture the compactness of the
density si directly and it can be shown that

1
DKL (si u) = −Hsi (Si ) + logb (D.51)
Uk

where u is the uniform distribution and Uk is the value of that density for any particular
outcome. Note that the final term is zero only when the reference distribution is the unit
function u ≡ 1 ∀ k. While this is not a valid distribution, its use is justified by noting that
the final term is a constant for any particular selection of Uk and so arbitrarily selecting
Uk = 1 does not change the ordering of the divergence measures calculated with respect to
the common distribution. Importantly, this means that the entropy of the distribution si is
linearly related to the Kullback-Leibler divergence between that distribution and a uniform
one.

The characteristic properties of the Kullback-Leibler divergence include: the two distribu-
tions are implicitly aligned through the value of the index (or coordinate) k; only compo-
nents with the same value of k are compared; the contribution of any particular component
depends only on the values of the functions, that is, there are no preferred values of k; and
finally, the measure represents an expectation taken of the log-odds and therefore, unlike the purely structural measures discussed earlier, is related to the marginal statistics of the two distributions. Significantly, this means that the measure takes no account of the joint statistical properties of the two distributions and, therefore, cannot reflect any statistical dependencies.

The Kullback-Leibler divergence is given by the expectation of the log-odds of the two distributions, taken with respect to the first distribution,

$$D_{KL}(s_i \,\|\, s_j) = E_{s_i}\left[\log_b \frac{s_i}{s_j}\right] \qquad \text{(D.46)}$$

where the expectation takes the form of a summation or an integration depending on whether the system is discrete or continuous respectively. This measure is a divergence, and does not satisfy either the symmetry property or the triangle inequality necessary for it to be considered a distance measure. It is possible to rank a set of distributions according to their divergences from a common reference, but only if a single reference distribution is used.
The Kullback-Leibler divergence has the following characteristic properties:

1. The index k implicitly aligns the components of the two distributions

2. Each component is compared only to the equivalent component of the other vector
3. The contribution of each component depends on the marginal statistics of the
two distributions and has no dependence on their joint statistical properties.

4. The resulting measure is the statistical expectation of the log-odds of the distri-
butions with respect to si and can, therefore, be interpreted as a relative entropy
of this distribution with respect to sj .

Die     1       2       3       4       5       6
s1      1/6     1/6     1/6     1/6     1/6     1/6
s2      1/3     1/3     1/12    1/12    1/12    1/12
s3      1/12    1/12    1/12    1/12    1/3     1/3
s4      1/4     1/4     0       0       1/4     1/4
s5      0       1/4     1/3     0       1/4     1/6

Table D.1: Table of normalised frequencies for the five dice of Example D.2.1.

Example D.2.1 – Kullback-Leibler divergence in discrete system


Let there be four dice which are weighted differently and consider to what degree these dice differ from the distribution expected for a fair die. Specifically, let the statistics of the five dice involved be as summarised in Table D.1. Now, consider the Kullback-Leibler divergence measure applied to all possible pairings of these statistical distributions. Applying the discrete measure of Equation D.47 yields the results shown in Table D.2 and graphically in Figure D.11.

Firstly, note that although the divergence measures for (i, j) ∈ {1, 2, 3}² are symmetric, this is not true in general, as seen in the divergences involving distributions 4 and 5. Furthermore, the triangle inequality would require that DKL(si, sj) + DKL(sj, sk) ≥ DKL(si, sk). However, it is readily confirmed that

$$D_{KL}(s_5, s_3) + D_{KL}(s_3, s_1) = 5.713 + 0.231 = 5.944 < D_{KL}(s_5, s_1) = 11.17$$

and the triangle inequality cannot hold in general.



Dice Number (i)     1       2       3       4       5
j = 1               0       0.231   0.231   11.15   11.17
j = 2               0.231   0       0.693   5.602   14.28
j = 3               0.231   0.693   0       5.602   5.713
j = 4               ∞       ∞       ∞       0       ∞
j = 5               ∞       ∞       ∞       ∞       0

Table D.2: Table of Kullback-Leibler divergences for the dice of Example D.2.1.

Figure D.11: Kullback-Leibler divergences for the dice of Example D.2.1 in which the i
axis corresponds to the distribution si and j to sj in the expression of Equation D.47. The
values of j = 4, 5 are not shown as these are infinite.
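The symmetric sub-block of these results can be checked directly from the definition. The following Python fragment is a minimal sketch, not drawn from the original implementation, which evaluates the discrete divergence, in nats, for the first three dice of Table D.1 using the convention 0 log 0 = 0. Distributions containing zero entries, such as s4 and s5, additionally require the divergence to be treated as infinite whenever the reference distribution assigns zero probability to a possible outcome.

    import numpy as np

    # Normalised frequencies for dice 1-3 of Table D.1 (no zero entries).
    s1 = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
    s2 = np.array([1/3, 1/3, 1/12, 1/12, 1/12, 1/12])
    s3 = np.array([1/12, 1/12, 1/12, 1/12, 1/3, 1/3])

    def kl(si, sj):
        """Discrete Kullback-Leibler divergence D_KL(si || sj) in nats."""
        mask = si > 0                         # the convention 0 log 0 = 0
        if np.any(sj[mask] == 0):
            return np.inf                     # si allows an event that sj forbids
        return float(np.sum(si[mask] * np.log(si[mask] / sj[mask])))

    for a, b in [(s1, s2), (s1, s3), (s2, s3)]:
        print(kl(a, b), kl(b, a))             # 0.231, 0.231 / 0.231, 0.231 / 0.693, 0.693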

Figure D.12: The input and model distributions from Example D.2.2

Example D.2.2 – Template-matching with the Kullback-Leibler divergence


Assume there is a task which requires determining which of a series of analog voltage signals
best matches an input signal, but based on the statistical distribution of the observed values,
not simply on the values themselves. For example, let a signal be observed with an unknown
offset. The distribution of values obtained by sampling from this signal represents the ‘input’
probability distribution. Say that there are three signals corresponding to models of the signal
and it is desired to see which model has statistics which most closely resemble that of the
input signal.

Since the offset of the signal is unknown, the models can conveniently be offset so that they
have a mean value of zero. In addition to determining which model best fits the statistical
properties of the input, the value of the offset is also of interest. The distributions shown
in Figure D.12 correspond to these signals: the top distribution is the input distribution
and the offset can be clearly seen, while the remaining three distributions represent zero-
mean signals to which the original signal is to be compared. Note that only the statistics of
the marginal distributions are compared and the actual time-series values of the signals are
irrelevant here.

The results of the application of the Kullback-Leibler divergence to these distributions are
shown in Figure D.13 which shows the divergences over the full range of k-offsets and a
magnified view of the region in the vicinity of the true offset. The behaviour near k-offset
= 0 occurs as a result of the numerical implementation of the calculation of the distance,
but the region of Figure D.13(b) is of greater interest. It is immediately obvious that the
best model in this case is Model #1 and comparison of the actual distributions supports this
conjecture. Model #2 has a greater spread over values of k and so it is reasonable that the
resulting divergences are larger for this model than for Model #1. Finally, the highly-peaked
Model #3 is seen to have two minima corresponding to when the uni-modal model is aligned
with each of the peaks in the bi-modal input distribution.

These results justify the utility of the Kullback-Leibler divergence as a similarity measure
for comparing two distributions.


Figure D.13: Kullback-Leibler divergences for an input signal with unknown offset and
several candidate models. (a) shows the full range of the k-offset used in the calculations
and (b) shows only the region in the vicinity of the applied offset.
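The offset-sweep comparison used in this example can be sketched as follows. The code is illustrative only: the histogram binning, sample counts, candidate offset grid and the synthetic bi-modal signals are assumptions standing in for the actual signals of Figure D.12.

    import numpy as np

    def distribution(samples, bins):
        """Histogram estimate of the marginal distribution of a signal's values."""
        hist, _ = np.histogram(samples, bins=bins)
        return hist / hist.sum()

    def kl(si, sj, eps=1e-12):
        """Discrete D_KL(si || sj) with a small floor to guard against empty bins."""
        si, sj = si + eps, sj + eps
        si, sj = si / si.sum(), sj / sj.sum()
        return float(np.sum(si * np.log(si / sj)))

    # Bimodal 'input' signal with an unknown offset, and a zero-mean model of it.
    rng = np.random.default_rng(0)
    signal = np.concatenate([rng.normal(-1, 0.3, 5000), rng.normal(1, 0.3, 5000)]) + 3.7
    model = np.concatenate([rng.normal(-1, 0.3, 5000), rng.normal(1, 0.3, 5000)])

    bins = np.linspace(-3, 13, 240)
    p_signal = distribution(signal, bins)

    offsets = np.linspace(0, 10, 201)
    divs = [kl(distribution(model + k, bins), p_signal) for k in offsets]
    print("estimated offset:", offsets[int(np.argmin(divs))])   # close to the applied 3.7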

D.2.2 Csiszár’s Measures

Csiszár's measures, or f-measures, are a general class of measure encompassing many available measures. They are generally defined according to

$$D_f(s_i, s_j) = E_{s_i}\left[f\!\left(\frac{s_j}{s_i}\right)\right] \qquad \text{(D.52)}$$

where f(.) is a continuous, convex, real function defined on R+ (Dragomir et al. 2001, p2377). It can be shown that the following properties hold for the f-measures (Basseville 1989, p351):

• The measure implies that the index k aligns the two distributions;

• The measure compares the kth component of one distribution with the kth component
of the other distribution only;

• Df (si , sj ) is a minimum when the distributions are equal and a maximum when they
are orthogonal, si ⊥sj ; and

• The lower limit can be further strengthened (for valid densities si and sj) by noting that

$$D_f(s_i, s_j) = E_{s_i}\left[f\!\left(\frac{s_j}{s_i}\right)\right] \ge f\!\left(E_{s_i}\left[\frac{s_j}{s_i}\right]\right) = f(1) \qquad \text{(D.53)}$$

with equality if and only if si = sj almost everywhere3 (Basseville 1989, p351).

The functions f(.) are considered as ‘dispersions’ as they are a convex mapping of the odds-ratio of the distributions for a particular component. The f-measures are not exhaustive of all dispersion-based divergence measures, but a very large class of measures fall into this category. These include: the L1 (or variational) distance, with f(x) = |1 − x| yielding $D_f(s_i, s_j) = \int |s_j(k) - s_i(k)|\, dk$; the Hellinger-Bhattacharyya distance with $f(x) = (\sqrt{x} - 1)^2$; the Kullback-Leibler divergence, with $f(x) = -\log_b x$; the α-divergence (or Rényi divergence) with $f(x) = x^{\alpha}$ and applying the function $g(x) = \frac{1}{\alpha - 1}\log_b(x)$ to the result; and many others. See Basseville (1989) for a thorough discussion of the merits of the various alternatives available under the class of f-divergences.
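The unifying structure of the family is easily made concrete. The sketch below evaluates several members of the class for discrete distributions by supplying different convex functions f to a single routine implementing Equation D.52. The test distributions are arbitrary assumptions, and distributions with zero-valued components would require limiting arguments that depend on the particular f.

    import numpy as np

    def f_divergence(si, sj, f):
        """Csiszar f-divergence D_f(si, sj) = E_si[ f(sj/si) ] for strictly positive distributions."""
        si, sj = np.asarray(si, float), np.asarray(sj, float)
        return float(np.sum(si * f(sj / si)))

    si = np.array([0.1, 0.2, 0.3, 0.4])
    sj = np.array([0.25, 0.25, 0.25, 0.25])

    variational = f_divergence(si, sj, lambda x: np.abs(1 - x))           # L1 distance
    hellinger = f_divergence(si, sj, lambda x: (np.sqrt(x) - 1) ** 2)     # Hellinger-Bhattacharyya
    kullback = f_divergence(si, sj, lambda x: -np.log(x))                 # D_KL(si || sj)

    print(variational, hellinger, kullback)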

D.2.3 Mutual Information as an Affinity

It was noted in the discussion of the Kullback-Leibler divergence that the measures discussed
so far capture the characteristics of the relationships between the marginal distributions
only. Consider instead that the mutual information between two arbitrary random variables
can be interpreted as an affinity measure capturing the statistical dependency of one random
variable on the other. The mutual information is given by

$$I_P(S_i; S_j) = E_{s_{ij}}\left[\log_b \frac{s_{ij}(k, l)}{s_i(k)\,s_j(l)}\right] \qquad \text{(D.54)}$$

where the distributions sij (k, l), si (k) and sj (l) represent the joint and marginal distribu-
tions of the random variables Si and Sj . The most critical difference between this measure
and the previous ones is that the measure depends not only on the marginal statistics
(which are easily obtained for many systems), but also requires the availability of the joint
statistics.

Under this measure the notion of ‘equality’ is replaced by ‘deterministic dependence’ and
‘orthogonality’ by ‘statistical independence’. This implies that the measure will be invariant
to an arbitrary bijection applied to the data involved. This is because the degree of statistical
determinism is not affected by the addition of such a bijection: the operation merely replaces
the labels associated with the ordinates of the original signals (or equivalently, the domain
3. Almost everywhere convergence requires convergence at all points except possibly those which have zero measure.

of the distribution). A corollary of this property is that the measure implies nothing about
the actual shapes of the marginal distributions, but rather is the direct comparison of the
instantaneous values of the joint distribution and the ‘independent joint-distribution’, the
distribution which would have occurred had the random variables been independent.

Unlike the previous measures, the two distributions are not ‘aligned’ by an index k, but
rather each is parameterised separately by indices k and l respectively and the interpretation
of the marginal statistics requires no further assumptions. Evaluation or estimation of the
joint statistical properties of a pair of random variables, however, requires analysis of the
outcome pairs [x(t), y(t)] and so requires that the input signals have an explicit alignment
in their sample space. Consider a time-series signal S(t) and model M (t + τ ) where τ is the
time-offset of the centre of the model from the start of the signal. For a given value of τ it
is possible to construct estimates of the joint and marginal distributions (parameterised by
k and l, the indices on the value-space of the signals).

The distribution Mutual Information of two random variables Si and Sj depends directly on their joint and marginal distributions sij, si and sj according to

$$I_P(S_i; S_j) = E_{s_{ij}}\left[\log_b \frac{s_{ij}(k, l)}{s_i(k)\,s_j(l)}\right] \qquad \text{(D.54)}$$
and represents the degree of statistical dependence between the random variables. Sig-
nificantly, the notions of ‘equality’ and ‘orthogonality’ are replaced by ‘determinis-
tic dependence’ and ‘statistical independence’. The mutual information represents an
affinity rather than a deviation, with the highest values obtained as the dependency
increases. Fortunately, the mutual information is strictly non-negative with a value of
zero only when the random variables are independent. The mutual information also
satisfies the symmetry conditions of a distance, but the triangle-inequality does not
hold. It can, therefore, be interpreted as a symmetric statistical affinity.
The mutual information has the following properties:

1. The two distributions are parameterised separately using indices k and l and there
is no implicit alignment of the components

2. The generation of the joint distribution sij (k, l) requires an explicit association
between the samples from the two processes - this could be a time offset between
a model and a signal, or a spatial offset between an image and a pattern

3. The contribution of each pair of components (k, l) depends only on the marginal
and joint probabilities for those values of k and l

4. The mutual information may have an upper bound; see Section 2.5.5 for a thorough discussion of the numerical properties of the measure.

D.2.4 Statistical Information Distance

While the mutual information is a useful statistical measure for the degree of statistical
dependence between two random variables, it does not satisfy the triangle inequality and
is not a true distance measure. It was noted in Section 2.8.2 that for a discrete system the
quantity LP(Si, Sj) can be interpreted as a distance, the statistical-independence distance,

$$d_{SI}(S_i, S_j) = E_{s_{ij}}\left[\log_b \frac{s_i(k)\,s_j(l)}{s_{ij}^2(k, l)}\right] = E_{s_{ij}}\left[\log_b \frac{1}{s_i(k|l)\,s_j(l|k)}\right] \qquad \text{(D.55)}$$

where si (k|l) and sj (l|k) represent the conditional distributions. It was shown in that
section that this particular measure satisfies all four axioms of a distance. Significantly,
the measure represents the combination of the parts of the two random variables which are
independent. In this way, the measure is related to the mutual information by

DSI (Si , Sj ) = HAB (Si , Sj ) − IAB (Si ; Sj ) . (D.56)

The statistical independence distance satisfies all four axioms of a distance and re-
flects the statistical properties of the two random variables. The measure replaces the
concepts of ‘equality’ and ‘orthogonality’ by ‘deterministic dependence’ and ‘statistical
independence’. The measure is given by

$$d_{SI}(S_i, S_j) = E_{s_{ij}}\left[\log_b \frac{s_i(k)\,s_j(l)}{s_{ij}^2(k, l)}\right] \qquad \text{(D.55)}$$
The measure has the following properties:

1. The two distributions are parameterised separately using indices k and l and there
is no implicit alignment of the components

2. The generation of the joint distribution sij (k, l) requires an explicit association
between the samples from the two processes - this could be a time offset between
a model and a signal, or a spatial offset between an image and a pattern

3. The contribution of each pair of components (k, l) depends only on the marginal
and joint probabilities for those values of k and l

4. The measure can only be applied as a distance to discrete systems and will have,
under those circumstances, an upper-limit of HAB (Si , Sj ).

Example D.2.3 – Mutual information and statistical independence distance


Consider a collection of coloured, numbered balls and let the space of possible combinations of
colour and number be parameterised by number (k ∈ [0, 9] ⊂ Z) and colour (l ∈ [0, 9] ⊂ Z).
Consider the three joint-distributions shown in Figure D.14. In each case the marginal dis-
tributions remain unchanged, but the joint distributions show three specific cases: determin-
istic dependency, partial statistical dependency and independence. Using these distributions,
the marginal entropies can be calculated to be

HP(k) = 2.0901 (nats)
HP(l) = 2.0901 (nats)

and the joint entropy, mutual information and statistical independence distances are

Measure        (a)       (b)       (c)
HP(k, l)       2.0901    3.5830    4.1801
IP(k; l)       2.0901    1.0197    0
DSI(k, l)      0         2.5633    4.1801

Firstly, note that the relationship HP(k, l) = IP(k; l) + DSI(k, l) holds for all three cases. In addition, it is clear that the statistical independence distance and mutual information are complementary measures.
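The decomposition HP(k, l) = IP(k; l) + DSI(k, l) can also be checked numerically. The following sketch computes the mutual information and the statistical independence distance (the latter directly from Equation D.55) for an illustrative joint distribution; the particular distribution is an assumption standing in for case (b) of Figure D.14, not the data used to generate the values above.

    import numpy as np

    def entropy(p):
        """Entropy (nats) of a discrete distribution given as an array of probabilities."""
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    def measures(joint):
        """Joint entropy, mutual information and statistical independence distance."""
        pk = joint.sum(axis=1)                        # marginal over l
        pl = joint.sum(axis=0)                        # marginal over k
        H_joint = entropy(joint.ravel())
        I = entropy(pk) + entropy(pl) - H_joint       # mutual information
        # Statistical independence distance evaluated directly from Equation D.55.
        mask = joint > 0
        outer = np.outer(pk, pl)
        D_SI = float(np.sum(joint[mask] * np.log(outer[mask] / joint[mask] ** 2)))
        return H_joint, I, D_SI

    # A partially dependent joint distribution over (number k, colour l), assumed for illustration.
    joint = np.array([[0.20, 0.05, 0.00],
                      [0.05, 0.20, 0.05],
                      [0.00, 0.05, 0.40]])

    H, I, D = measures(joint)
    print(H, I, D, np.isclose(H, I + D))              # the identity H = I + D_SI holds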

Example D.2.4 – Template matching with I(X; Y ) and DSI (X, Y )


Figure D.15 shows a subset of the well-known image “Lena” (Rosenberg 2001) and two
‘models’ of her face: the first being cropped directly from the image, while the second has
been inverted. The original image is a three-channel colour image of size 128 × 128 pixels
and the model patches are 32 × 32 pixels. Assume that the task of a system is to determine
the quality of a match between the models and an equivalently-sized patch from the original
image. This patch, and the model, can be considered to represent 32 × 32 = 1024 samples
from a three-dimensional space. Note that in this context each of the pixel locations in the
input ‘patch’, (x, y) ∈ [1, 32]2 ⊂ Z2 , is explicitly aligned with the associated pixel location in
the model.

(a) Deterministic Relationship    (b) Partial Statistical Dependency    (c) Statistically independent

Figure D.14: Deterministic, statistically dependent and independent joint distributions with
common marginal distributions for Example D.2.3.

(a) Input image

(b) Model #1 - cropped only (c) Model #2 - chromatically inverted

Figure D.15: An input image and two models for investigation of the mutual information
and statistical independence distance. Example D.2.4 compares these quantities and the
Euclidian distance in the context of template-matching.

In this way, each of the 32 × 32 × 3 = 3072 ‘components’ of the patch and model contribute
to the distance measure between them and for this example it will be sufficient to interpret
the (i, j)th image patch and either model as instances of a 3072-dimensional vector.

The results of applying each model to the image are shown in Figure D.16. It is immediately
clear that the two left-most surfaces are unchanged from model #1 to model #2, as expected
from the signal-ordinate invariance properties of the mutual information and entropy mea-
sures. For both models, the mutual information and statistical independence distance have a
similar structure, most of the image is a poor match but a substantially better match is seen
to occur when the image patch coincides with the face in the original image. In the case of
the Euclidian distance, however, the ‘true’ offset is much less apparent for model #1 and is
not obvious at all for model #2. This difference occurs as a result of the Euclidian distance
being sensitive to offsets applied to the signals. In this context illumination or large-scale
features (with respect to the size of the image patch) cause a non-zero offset to appear in
the resulting distance measures.

The mutual information and statistical independence distance are, therefore, viable affinity
and distance measures for many applications. The mutual information has the advantage
that it is well-defined for both discrete and continuous systems and both measures are in-
variant to an application of a bijective transformation of the input space.
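A histogram-based estimate of the mutual information between an image patch and a model is sufficient to illustrate these properties. The sketch below is illustrative only: the synthetic image, the 32-bin joint histogram and the single-channel treatment are assumptions, whereas the original example uses the three-channel "Lena" image.

    import numpy as np

    def mutual_information(patch, model, bins=32):
        """Mutual information (nats) between corresponding pixel values of two equal-sized patches."""
        joint, _, _ = np.histogram2d(patch.ravel(), model.ravel(), bins=bins)
        joint /= joint.sum()
        pk = joint.sum(axis=1)
        pl = joint.sum(axis=0)
        mask = joint > 0
        return float(np.sum(joint[mask] * np.log(joint[mask] / np.outer(pk, pl)[mask])))

    rng = np.random.default_rng(1)
    image = rng.integers(0, 256, size=(128, 128)).astype(float)   # synthetic stand-in image
    model = image[40:72, 40:72]                                   # 32 x 32 patch cropped from it
    inverted = 255.0 - model                                      # inverted model

    # High MI at the correct location for both models (MI is invariant to the bijective
    # inversion); near-zero MI for a mismatched patch elsewhere in the image.
    print(mutual_information(image[40:72, 40:72], model),
          mutual_information(image[40:72, 40:72], inverted),
          mutual_information(image[0:32, 0:32], model))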

D.3 Measure Selection

This section has examined the characteristics and properties of various commonly-used
measures for comparing vectors, functions and distributions. Understanding their origins
in the measure-theoretic framework of the earlier sections enables the realisation that there
can be no ‘universal’ distance (or affinity) measure. Rather, each measure emphasises a
unique set of properties of the entities being compared and the viability of any particular
measure will depend intimately on the task at hand.

(a) Model #1    (b) Model #2

Figure D.16: Mutual information, statistical independence distance and Euclidian distance
for the image-based template-matching problem of Example D.2.4. The left-most surface
represents the mutual information, the centre the statistical information distance and the
right-most the Euclidian distance.
Appendix E

Occupancy Sensor Models

This appendix contains various derivations and proofs for the sensor models used in the
examples of Sections 5.4–5.7. Specifically, it examines the process of drawing the values
from a Gaussian model and mapping likelihood values to them according to an occupancy
model. While it is well-known that these models are overly simplified and usually obtained
heuristically rather than analytically, they are sufficient for demonstrating the advantages
of using the model proposed in this thesis.

Specifically, this appendix examines the analytic nature of the model proposed; the method-
ology for sampling from that model for the point-based estimation used in the implementa-
tions; the mapping of that methodology to real sensors; and an algorithm for updating the
samples obtained from that model with new observations.

E.1 Analytic Occupancy Model

The occupancy approach pioneered by Moravec & Elfes (1985) essentially corresponds to
the estimation of a binary decision process (in this case the occupancy or otherwise of a
location in space) under the assumption that all locations in the operating domain are inde-
pendently estimated. That is, using the terminology of this thesis, the functional model is
approximated by a fully-independent functional representation, most commonly a regularly-
spaced piecewise-constant grid. Under this assumption, the sensor model is obtained as the
collection of likelihood values for each domain location, and each is updated independently.

The full derivation of viable sensor models is not relevant here, but it should be noted that
there are several different approximations to the work of the original authors. In particular,
Leal (2003, Appendix A) elucidates the work of the original authors and demonstrates an
approximation based on a normalised range uncertainty model. Alternatively, Ribo & Pinz
(2001, §2.1) presents a piecewise model using quadratic and linear elements. Finally, Kono-
lige (1997) presents a much more sophisticated model seeking to address the characteristics
of a real sensor system. The model used in this thesis is a piecewise model utilising Gaus-
sian and constant elements and is expressed in terms of a range uncertainty model. If the
measured distance to an object is denoted by rz and the true distance by r, then the range
model is usually written as P(rz | r), and this model assumes it takes a Gaussian form,

$$P(r_z \,|\, r) = \frac{1}{\sigma_r\sqrt{2\pi}} \exp\left(-\frac{1}{2}\frac{(r - r_z)^2}{\sigma_r^2}\right). \qquad \text{(E.1)}$$

That is, the sensor is assumed to have only zero-mean Gaussian noise added to the true
value during measurement. While this is not true in real systems, it presents a viable
model for the purposes of this thesis. Denoting the estimated quantities as O(r), the approximation of the resulting occupancy likelihoods is given by

$$P(r_z \,|\, O(r) = O_+) = \begin{cases} p_{o-}, & r \le (r_z - 2\sigma_r) \\ p_{o-} + (p_{o+} - p_{o-}) \exp\left(-\frac{1}{2}\frac{(r_z - r)^2}{\sigma_r^2}\right), & (r_z - 2\sigma_r) < r \le r_z \\ p_{ou} + (p_{o+} - p_{ou}) \exp\left(-\frac{1}{2}\frac{(r_z - r)^2}{\sigma_r^2}\right), & r_z < r \le (r_z + 2\sigma_r) \\ p_{ou}, & (r_z + 2\sigma_r) < r \end{cases} \qquad \text{(E.2)}$$
where po− , po+ and pou represent the minimum, maximum and uninformative probabilities
of occupancy assigned by this model. That is, the lower represents a conservative restriction
on the model to prevent it from diminishing the near-field (pre-return ranges) values to zero.
Likewise, the maximum value represents the ‘strength-of-belief’ in the particular sensory
value and can be adjusted to modify the significance of any particular observation. The
uninformative value corresponds to the value which implies no bias towards either occupied
or unoccupied and for a binary decision is set to pou = 0.5. This model is shown in
Figure E.1 along with the numerical solution of Leal (2003, Equation A.10) for po− = 0.1
and po+ = 0.516. This figure demonstrates that the approximation is reasonable for the
demonstration in this thesis.

Figure E.1: One dimensional occupancy model. This figure shows a one-dimensional system
defined by the range coordinate r and the result of an observation rz = 50. At each range
location r the occupancy of the location is to be estimated, O(r). The blue curve is obtained
from the numerical model of Leal (2003, Equation A.10) while the red is the approximation
of Equation E.2.
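A direct transcription of Equation E.2 is straightforward. The following Python sketch evaluates the piecewise occupancy likelihood over a range axis; the values po− = 0.1 and po+ = 0.516 follow the text above, while σr = 2 is an assumption made purely for illustration.

    import numpy as np

    def occupancy_likelihood(r, rz, sigma_r, p_min=0.1, p_max=0.516, p_unknown=0.5):
        """Piecewise Gaussian/constant occupancy likelihood of Equation E.2."""
        r = np.asarray(r, float)
        gauss = np.exp(-0.5 * (rz - r) ** 2 / sigma_r ** 2)
        return np.where(r <= rz - 2 * sigma_r, p_min,                      # well short of the return
               np.where(r <= rz, p_min + (p_max - p_min) * gauss,          # rising edge up to rz
               np.where(r <= rz + 2 * sigma_r,
                        p_unknown + (p_max - p_unknown) * gauss,           # just beyond the return
                        p_unknown)))                                       # far beyond the return

    # Evaluate the model of Figure E.1: rz = 50, with sigma_r = 2 assumed.
    r = np.linspace(0, 100, 1001)
    likelihood = occupancy_likelihood(r, rz=50.0, sigma_r=2.0)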

In order to extend this model to three-dimensional scenarios, consider that the regions
out of the beam should be assigned to the uninformative value pou and that the angular
uncertainty of a ranging sensor can be approximated as a Gaussian,

$$P(\theta_z \,|\, \theta) \approx \frac{1}{2\pi|\Sigma_\theta|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\theta - \theta_z)^T \Sigma_\theta^{-1} (\theta - \theta_z)\right) \qquad \text{(E.3)}$$

where θ represents the two-dimensional angle vector; θz the measured beam-axis direction1 ;
and Σθ the covariance matrix defining the resulting Gaussian.2 Section E.3 considers the
determination of the covariance values for real sensors.

This model suggests that as this value reduces towards zero, the influence of the sensory
model P (rz , θz | O(r, θ) = O+ ) should be reduced and the resulting value should approach

1. This can be set arbitrarily to zero to obtain the model with respect to boresight.
2. Note that the product of two Gaussians results in a Gaussian, so this pattern is reasonable for systems with rectangular apertures and also for a Gaussian approximation to the circular Bessel functions which result from circular apertures. See Johnson (1991, §2.4).

pou. Write Equation E.2 as

$$P(r_z \,|\, O(r) = O_+) = p_{ou} + P'(r_z \,|\, O(r) = O_+) \qquad \text{(E.4)}$$

where P' represents the variation from the uninformative value. Weighting the variation from this term by the angular model of Equation E.3 yields the three-dimensional model used in this thesis,

$$P(r_z, \theta_z \,|\, O(r, \theta) = O_+) = p_{ou} + P'(r_z \,|\, O(r) = O_+)\,P_\theta(\theta_z \,|\, \theta) \qquad \text{(E.5)}$$

where P' and Pθ are given by

$$P'(r_z \,|\, O(r) = O_+) = P(r_z \,|\, O(r) = O_+) - p_{ou} \qquad \text{(E.6)}$$

$$P_\theta(\theta_z \,|\, \theta) = \exp\left(-\frac{1}{2}\,\theta^T \Sigma_\theta^{-1}\, \theta\right). \qquad \text{(E.7)}$$

Figure E.2 shows an example of this model for two dimensions, dim(θ) = 1, with po− = 0.1, po+ = 0.9, pou = 0.5, σr = 2, Σθ = 5◦ and rz = 50.³

3. This model is ‘on-boresight’ so the value of θz is treated as zero.

(a) Polar Form    (b) Cartesian Form

Figure E.2: A Two-dimensional Occupancy Sensor Model. Here po− = 0.1, po+ = 0.9, pou = 0.5, σr = 2, Σθ = 5◦ and rz = 50. The model has been constructed using Equation E.5. Note that the axes of the cartesian form are not to equal scale.
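Building on the previous sketch, the combined range-bearing model of Equation E.5 can be evaluated on a polar grid as follows. This is again illustrative only: the on-boresight, one-dimensional angular case of Figure E.2 is assumed, with Σθ treated as a scalar standard deviation.

    import numpy as np

    def occupancy_likelihood_2d(r, theta, rz, sigma_r, sigma_theta,
                                p_min=0.1, p_max=0.9, p_unknown=0.5):
        """Polar occupancy model of Equation E.5 for dim(theta) = 1, on boresight (theta_z = 0)."""
        # Range variation P'(rz | O(r) = O+) of Equation E.6, reusing occupancy_likelihood above.
        p_range_var = occupancy_likelihood(r, rz, sigma_r, p_min, p_max, p_unknown) - p_unknown
        p_theta = np.exp(-0.5 * (theta / sigma_theta) ** 2)    # Equation E.7, scalar case
        return p_unknown + p_range_var * p_theta               # Equation E.5

    # Parameters of Figure E.2: p_min = 0.1, p_max = 0.9, sigma_r = 2, 5 degree beam, rz = 50.
    r, theta = np.meshgrid(np.linspace(0, 100, 400), np.radians(np.linspace(-90, 90, 181)))
    p = occupancy_likelihood_2d(r, theta, rz=50.0, sigma_r=2.0, sigma_theta=np.radians(5.0))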

E.2 Sampled Occupancy Model

While Equation E.5 defines the analytic values of the occupancy properties for a given
situation, the implementation considered in this thesis uses a point-based sampled model.
This section describes the approach used to draw a set of spatial samples which correspond
to a particular observation. Specifically, this approach includes two distinct effects: a group
of samples corresponding to the uncertainties in the angle and range to the actual object
sensed; and a group of samples corresponding to the empty space between the sensor and
that object. An examination of Figure E.2 suggests that the observation values will have
greatest influence in those two regions, justifying the approach.

The first component can be identified as a cloud of points drawn from the model describing the uncertainty in the measured range to the target, that is, defined by the distributions

$$P(r_z \,|\, r) = \frac{1}{\sigma_r\sqrt{2\pi}} \exp\left(-\frac{1}{2}\frac{(r - r_z)^2}{\sigma_r^2}\right) \qquad \text{(E.8)}$$

$$P(\theta_z \,|\, \theta) = \frac{1}{2\pi|\Sigma_\theta|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\theta - \theta_z)^T \Sigma_\theta^{-1} (\theta - \theta_z)\right) \qquad \text{(E.9)}$$

where the range r is a scalar, but the angle θ = [θaz, θel]^T is a two-dimensional vector corresponding to azimuth and elevation angles. The first distribution from which the samples are drawn (the ‘target’ part) corresponds to the product of these two distributions, or

$$P_t(\mathbf{z} \,|\, \mathbf{r}) = \frac{1}{(2\pi)^{\frac{3}{2}}|\Sigma_t|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{r} - \mathbf{z})^T \Sigma_t^{-1} (\mathbf{r} - \mathbf{z})\right) \qquad \text{(E.10)}$$

where z = [rz, θz]^T, r = [r, θ]^T and

$$\Sigma_t = \begin{bmatrix} \sigma_r^2 & 0 \\ 0 & \Sigma_\theta \end{bmatrix} \qquad \text{(E.11)}$$

The second component (the ‘space’ part) corresponds to the area between the sensor and the target, that is, the wedge in the bottom part of Figure E.2. Since the samples of the first case are developed in the polar form, the samples from this second region will be obtained in the same way. Note that the polar to cartesian transformation means that the number of samples required to generate an approximately uniform cartesian sampling reduces significantly in the region close to the sensor location, and that the greatest number of samples is required at ranges close to that of the observed distance. For this reason, a half-Gaussian model is used,

$$P_s(\mathbf{z} \,|\, \mathbf{r}) = \begin{cases} \frac{1}{(2\pi)^{\frac{3}{2}}|\Sigma_s|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{r} - \mathbf{z})^T \Sigma_s^{-1} (\mathbf{r} - \mathbf{z})\right), & r \le r_z \\ 0, & r > r_z \end{cases} \qquad \text{(E.12)}$$

with

$$\Sigma_s = \begin{bmatrix} \left(\frac{r_z}{2}\right)^2 & 0 \\ 0 & \Sigma_\theta \end{bmatrix}. \qquad \text{(E.13)}$$

This corresponds to half of the Gaussian with mean given by the observation value z and with the covariance in range set so that the range to the target corresponds to two standard deviations. Figure E.3 shows this model generated in three dimensions using the values of po− = 0.2, pou = 0.5, po+ = 0.8, rz = 200m, σr = 2.5m and σaz = σel = 0.1rad. Each diagram represents the application of a repeated observation and the values are represented by colour, with red, green and blue corresponding to p = 0, p = 0.5 and p = 1 respectively.

(a) Iteration 1   (b) Iteration 2   (c) Iteration 3   (d) Iteration 4   (e) Iteration 5

Figure E.3: A Three-dimensional Occupancy Sensor Model. Here po− = 0.2, po+ = 0.8, pou = 0.5, σr = 2.5m, Σθ = 0.1rad and rz = 200m. The model has been constructed using Equation E.5 and the colours red, green and blue correspond to p = 0, p = 0.5 and p = 1 respectively.
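A sampling scheme of this form can be sketched as follows. The sample counts, the use of a rejection step to realise the half-Gaussian, and the treatment of Σθ as a 2 × 2 covariance over azimuth and elevation are assumptions introduced for illustration; the original implementation may generate the samples differently.

    import numpy as np

    def draw_observation_samples(rz, theta_z, sigma_r, Sigma_theta,
                                 n_target=200, n_space=400, seed=0):
        """Draw (r, az, el) samples for the 'target' and 'space' parts of one observation."""
        rng = np.random.default_rng(seed)
        mean = np.array([rz, theta_z[0], theta_z[1]])
        # Target part: Gaussian about the measured range and bearing (Equations E.10-E.11).
        cov_target = np.zeros((3, 3))
        cov_target[0, 0] = sigma_r ** 2
        cov_target[1:, 1:] = Sigma_theta
        target = rng.multivariate_normal(mean, cov_target, size=n_target)
        # Space part: half-Gaussian towards the sensor, sigma = rz/2 in range (Equations E.12-E.13),
        # realised here by rejecting samples beyond the measured range.
        cov_space = cov_target.copy()
        cov_space[0, 0] = (rz / 2.0) ** 2
        space = rng.multivariate_normal(mean, cov_space, size=n_space)
        space = space[space[:, 0] <= rz]
        return target, space

    # Values of Figure E.3: rz = 200 m, sigma_r = 2.5 m, 0.1 rad angular deviations.
    target, space = draw_observation_samples(200.0, np.zeros(2), 2.5,
                                             np.diag([0.1 ** 2, 0.1 ** 2]))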

E.3 Fitting the Gaussian Model to Real Sensors

When considering a real system, particularly the radar system utilised as an example sensor
in this thesis, there are multiple contributions to the uncertainties modelled as a Gaussian
distribution in the previous section. Specifically, it is common to define the beamwidth of
a sensor according to the −3dB points and this section derives a correspondence between
this measurement and the angular variances of the Gaussian approximation. Following the
approach of Johnson (1991, §A.13.3) it is noted that the Gaussian approximation (in a
single dimension and assuming separability in the angular components) is defined by the
un-normalised one dimensional equivalent (denoted by φ to distinguish it from the angular
vector θ) of Equation E.3

$$P(\phi_z \,|\, \phi) = \exp\left(-\frac{1}{2}\frac{(\phi - \phi_z)^2}{\sigma_\phi^2}\right) \qquad \text{(E.3)}$$

Since the measurement of the beam is always made with respect to boresight, the value of φz is zero, yielding

$$P(\phi_z \,|\, \phi) = \exp\left(-\frac{1}{2}\frac{\phi^2}{\sigma_\phi^2}\right). \qquad \text{(E.14)}$$

Now Figure E.4 shows a two-way power beam pattern from a radar system (in blue) and an approximating Gaussian (in red). Converting the Gaussian into decibels yields

$$P_{dB} = 10\log_{10}\left[\exp\left(-\frac{1}{2}\frac{\phi^2}{\sigma_\phi^2}\right)\right] \qquad \text{(E.15)}$$

$$= \frac{-5\phi^2}{\sigma_\phi^2 \ln 10} \qquad \text{(E.16)}$$

Figure E.4: Approximation of a radar beam pattern with a Gaussian. The blue line repre-
sents the pattern from a real aperture, while the red is the Gaussian approximation selected
so that the curves coincide at the −3dB point.

Figure E.5: A specific Gaussian beam approximation for φ3dB = 5◦. The marked data point (φ = 2.5, P = 0.5012) confirms the half-power point.

which for the −3dB point has φ = φ3dB/2 and yields

$$\sigma_\phi^2 = \frac{5\,\phi_{3dB}^2}{12\ln 10} \qquad \text{(E.17)}$$

$$\Rightarrow \sigma_\phi = \sqrt{\frac{5}{12\ln 10}}\;\phi_{3dB} \qquad \text{(E.18)}$$

$$\approx 0.42539\,\phi_{3dB}. \qquad \text{(E.19)}$$

Figure E.5 shows the resulting Gaussian for a beamwidth of φ3dB = 5◦ , which yields a
standard deviation of σφ ≈ 2.13◦ . Inspection of the figure reveals that the −3dB (or 50%
power) point occurs at φ = 2.5◦ confirming the beamwidth of 5◦ .
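This conversion is easily automated; a minimal sketch, with the 5◦ beamwidth of Figure E.5 used as the test value:

    import numpy as np

    def beamwidth_to_sigma(phi_3db):
        """Gaussian standard deviation matching a -3dB (half-power) beamwidth, Equation E.19."""
        return np.sqrt(5.0 / (12.0 * np.log(10.0))) * phi_3db   # approximately 0.42539 * phi_3db

    sigma = beamwidth_to_sigma(5.0)                              # degrees in, degrees out
    half_power = np.exp(-0.5 * (5.0 / 2.0) ** 2 / sigma ** 2)    # evaluates to about 0.501
    print(sigma, half_power)                                     # sigma is about 2.13 degrees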

E.4 On-Boresight to Global Transformations

The models presented above provide the samples (and analytical occupancy values) in a
polar space relative to the bore-sight direction of the sensor itself. However, the model of
this thesis requires that this information be transformed into a consistent global frame, in
this case as a cartesian 3-space. For a scanning sensor there are four coordinate frames
of interest: the ‘on-boresight’ frame, which defines the model above; the ‘sensor’ frame,
defining the entire scanned space with respect to the sensor itself; the ‘platform’ frame,
defining the space relative to the vehicle or platform on which the sensor is operating; and
the ‘world’ frame. Each of the transformations between them has a specific character:

On-boresight to Sensor Since the model is defined in terms of the on-boresight frame,
the conversion to the sensor frame consists of a rotation alone4 and using the coordi-
nate transformation notation from Nettleton (2003, §5.5.2) where the rotation order
is yaw, pitch and roll (ψ, θ and φ) and cos(.) and sin(.) are written c(.) and s(.),

$$P^s = C^s_b\, P^b \qquad \text{(E.20)}$$

where P^b and P^s represent a given location in the boresight and sensor frames respectively and the matrix C^s_b is the rotation matrix between those frames,

$$C^s_b = \begin{bmatrix} c(\psi)c(\theta) & c(\psi)s(\theta)s(\phi) - s(\psi)c(\phi) & c(\psi)s(\theta)c(\phi) + s(\psi)s(\phi) \\ s(\psi)c(\theta) & s(\psi)s(\theta)s(\phi) + c(\psi)c(\phi) & s(\psi)s(\theta)c(\phi) - c(\psi)s(\phi) \\ -s(\theta) & c(\theta)s(\phi) & c(\theta)c(\phi) \end{bmatrix} \qquad \text{(E.21)}$$

Sensor to Platform This transformation represents a six-degree-of-freedom, fixed transformation relating the coordinate system of the sensor to the coordinate frame defined for the platform on which it is mounted. This transformation can be written

$$P^p = P^p_s + C^p_s\, P^s \qquad \text{(E.22)}$$

where P^p represents the location in the platform frame and P^p_s the location of the sensor frame origin in the platform frame.
4. This conversion is required as the spherical polar space defined by the sensor models contains a singularity point for elevation angles approaching ±π/2. See Griffiths (1989, §1.4.1) for further details.

Platform to World Finally, assuming that the platform is capable of moving in the world
frame, then the final transformation is given by

$$P^w = P^w_p + C^w_p\, P^p \qquad \text{(E.23)}$$

with P^w_p and C^w_p corresponding to the navigation solution for the platform in the world frame.

The resulting compound transformation is therefore given by

$$P^w = P^w_p + C^w_p\left(P^p_s + C^p_s\, C^s_b\, P^b\right). \qquad \text{(E.24)}$$
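The chain of transformations can be written compactly in code. The following sketch implements Equations E.21 and E.24; the particular offsets and rotation angles in the usage example are hypothetical and serve only to show the calling convention.

    import numpy as np

    def rotation(yaw, pitch, roll):
        """Rotation matrix of Equation E.21 (yaw-pitch-roll order, angles in radians)."""
        cy, sy = np.cos(yaw), np.sin(yaw)
        cp, sp = np.cos(pitch), np.sin(pitch)
        cr, sr = np.cos(roll), np.sin(roll)
        return np.array([
            [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
            [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
            [-sp,     cp * sr,                cp * cr]])

    def boresight_to_world(P_b, C_sb, P_ps, C_ps, P_wp, C_wp):
        """Compound transformation of Equation E.24."""
        return P_wp + C_wp @ (P_ps + C_ps @ (C_sb @ P_b))

    # Hypothetical configuration: sensor yawed 30 degrees on the platform, platform at (10, 5, 0).
    P_w = boresight_to_world(P_b=np.array([20.0, 0.0, 0.0]),          # point 20 m along boresight
                             C_sb=np.eye(3),
                             P_ps=np.array([1.0, 0.0, 0.5]),
                             C_ps=rotation(np.radians(30.0), 0.0, 0.0),
                             P_wp=np.array([10.0, 5.0, 0.0]),
                             C_wp=np.eye(3))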

E.5 Batch Updating in Sampled Models

E.5.1 Updating Binary Values

Consider the management of a single probability value p for a binary decision such as occupancy. Given an observation o1 and the prior value p0, the updated value is

$$p_1 = \frac{p_0\, o_1}{p_0\, o_1 + (1 - p_0)(1 - o_1)}. \qquad \text{(E.25)}$$

Now, consider the two alternative observation orders shown in Figure E.6, where the solid line represents the ‘ordered’ incorporation of the three observations and the dotted line an out-of-order version. For the ordered observations,

$$p_2 = \frac{p_1\, o_2}{p_1\, o_2 + (1 - p_1)(1 - o_2)} \qquad \text{(E.26)}$$

$$= \frac{p_0\, o_1 o_2}{p_0\, o_1 o_2 + (1 - p_0)(1 - o_1)(1 - o_2)} \qquad \text{(E.27)}$$

$$p_3 = \frac{p_0\, o_1 o_2 o_3}{p_0\, o_1 o_2 o_3 + (1 - p_0)(1 - o_1)(1 - o_2)(1 - o_3)}. \qquad \text{(E.28)}$$

Figure E.6: In-order and out-of-order observations. The in-order observation path is shown in solid lines and the out-of-order path is dotted.

Figure E.7: Parallel observations in a binary model.

Now, this expression has no inherent order to the observations o1, o2 and o3, so that the application of the out-of-order case will result in

$$p_3 = \frac{p_0\, o_2 o_1 o_3}{p_0\, o_2 o_1 o_3 + (1 - p_0)(1 - o_2)(1 - o_1)(1 - o_3)} \qquad \text{(E.30)}$$

which will yield the same result, since o2 o1 = o1 o2.

This means that provided all observations are combined together, the order in which they
are combined is irrelevant to the final result. However, consider the case shown in Figure
E.7 where the original value p0 is updated using two sets of observations to obtain the
parallel values p2 and p4 . The dotted lines represent the process of combining these values
to obtain the result that would have been obtained if the four observations were incorporated
sequentially. However, it is readily shown that

$$p_5 = \frac{p_2\, p_4}{p_2\, p_4 + (1 - p_2)(1 - p_4)} \qquad \text{(E.31)}$$

$$= \frac{p_0\, o_1 o_2 o_3 o_4}{p_0\, o_1 o_2 o_3 o_4 + \frac{(1 - p_0)}{p_0}(1 - p_0)(1 - o_1)(1 - o_2)(1 - o_3)(1 - o_4)} \qquad \text{(E.32)}$$

which is equivalent to the observations being obtained sequentially, if and only if p0 = 0.5.
This means that observations can be batch processed only when the prior value of the
observation is uninformative, that is, when the result of observation o0 yields p0 = 0.5.
In practice, this is not met, so that the updating of the observations must proceed in a
sequential fashion.
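The distinction is easily demonstrated numerically. The following sketch, with arbitrary assumed observation likelihoods, applies four observations sequentially and then combines the two parallel branches of Figure E.7, showing that the two agree only for the uninformative prior p0 = 0.5.

    def update(p, o):
        """Binary Bayes update of Equation E.25."""
        return p * o / (p * o + (1.0 - p) * (1.0 - o))

    def combine(pa, pb):
        """Naive combination of two parallel branches (Equation E.31)."""
        return pa * pb / (pa * pb + (1.0 - pa) * (1.0 - pb))

    observations = [0.7, 0.6, 0.8, 0.55]        # assumed, arbitrary likelihood values
    for p0 in (0.5, 0.3):
        sequential = p0
        for o in observations:
            sequential = update(sequential, o)
        branch_a = update(update(p0, observations[0]), observations[1])
        branch_b = update(update(p0, observations[2]), observations[3])
        parallel = combine(branch_a, branch_b)
        print(p0, sequential, parallel)         # equal only for the uninformative prior p0 = 0.5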

E.5.2 Maintaining Sampled Models

This section considers two distinct approaches to combining observations of the form of
Section E.3 in a sampled situation. The first combines an individual observation with the
data in the representation, while the second seeks to batch process several observations
prior to combining them with the information already in the representation.

In the first case, let the model generate newSamples, a list of samples obtained from the
observation, and consider the methodology for combining these values with repSamples
the points already in the representation. Two operations are required: updating the values
of the existing points according to the analytical model (since the occupancy likelihoods
are known exactly); and adding new samples to the representation corresponding to the
observation. Since the second operation requires that the prior values for the new sam-
ples be obtained from the existing information in the representation, these priors must be
determined before the existing points are updated. The resulting algorithm is:

1. Generate newSamples from the sensor model;

2. Interrogate repSamples to obtain the priors for the elements of newSamples and
update the new values;

3. Use the analytical model to update the existing elements in repSamples; and

4. Insert the members of newSamples into repSamples.

This ensures that the prior values from the representation are used to spawn the new samples
and the observation is used to update both the existing and new samples appropriately.
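A skeleton of this procedure is sketched below. The container, query and sensor-model interfaces (rep_samples, likelihood, draw_samples, prior_of) are hypothetical simplifications introduced only for illustration; the actual representation used in the implementation is considerably more elaborate.

    import numpy as np

    def update(p, o):
        """Binary Bayes update of Equation E.25."""
        return p * o / (p * o + (1.0 - p) * (1.0 - o))

    def incorporate_observation(rep_samples, likelihood, draw_samples, prior_of):
        """Single-observation update following the four steps above (hypothetical interfaces).

        rep_samples  : list of dicts {'position': array, 'value': occupancy probability}
        likelihood   : maps a position to the occupancy likelihood of this observation
        draw_samples : returns the positions of the new samples for this observation
        prior_of     : interpolates a prior for a new position from rep_samples
        """
        # 1. Generate new samples from the sensor model.
        new_positions = draw_samples()
        # 2. Obtain priors for the new samples before the representation is modified.
        new_samples = [{'position': x,
                        'value': update(prior_of(rep_samples, x), likelihood(x))}
                       for x in new_positions]
        # 3. Update the existing samples using the analytical model.
        for s in rep_samples:
            s['value'] = update(s['value'], likelihood(s['position']))
        # 4. Insert the new samples into the representation.
        rep_samples.extend(new_samples)

    # Minimal usage with stand-in functions: nearest-sample prior, constant likelihood field.
    rep = [{'position': np.array([float(x), 0.0]), 'value': 0.5} for x in range(5)]
    incorporate_observation(
        rep,
        likelihood=lambda x: 0.7,
        draw_samples=lambda: [np.array([2.5, 0.0])],
        prior_of=lambda rep_s, x: min(rep_s, key=lambda s: np.linalg.norm(s['position'] - x))['value'])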

Consider, however, the situation in which it is desired to combine several observations as a


‘block’, prior to incorporating these new samples into the representation. In the implemen-
tation of this thesis this is the result of wishing to improve the threading performance of the
representational structure. Recall from the previous section that parallel operations are not
valid, so that the resulting algorithm must combine the information in a manner consistent
with the previous algorithm. Denoting the list of samples from the block by blockSamples
consider the following algorithm:

1. Generate newSamples from the sensor model;



2. Search blockSamples to obtain priors for members of newSamples and update those values;

3. Update the members of blockSamples using the analytical model;

4. Search repSamples and save repPriors, the priors from the representation which
will be applied to the new samples;

5. Find the elements of repSamples which are influenced by the observation

(a) If they are already in repModified, update the value using the analytical model
OR

(b) Update the value (assuming a prior of 0.5) and add the element to repModified
(this is a list of points from the representation which will be modified by the
observations); and

6. Insert newSamples onto the end of blockSamples.

The result of this is three lists of elements: blockSamples, containing the new samples,
updated by the entire series of observations, but not adjusted by the priors contained in the
representation; repPriors, the priors drawn from the representation for the members of
blockSamples; and repModified, the list of points from the tree as they would have been
modified from the individual observations. This effectively corresponds to an individual ob-
servation containing all of the block observations and the information necessary to combine
them. These are combined using the following steps:

1. Apply repPriors to blockSamples and insert into repSamples; and

2. Use repModified to update the members of repSamples.

This algorithm achieves the fully sequential updates of the previous section in a convenient
form and has been used throughout the practical examples discussed in this thesis.
Bibliography

Amari, S. (2001), ‘Information geometry on hierarchy of probability distributions’, IEEE


Trans. Information Theory 47(5), 1701–1711.
Amari, S. & Nagaoka, H. (2000), Methods of Information Geometry, Translations of Math-
ematical Monographs, Oxford University Press. Translated by Daishi Harada.
Bailey, T. (2002), Mobile Robot Localisation and Mapping in Extensive Outdoor Environ-
ments, PhD thesis, Australian Centre for Field Robotics, The University of Sydney.
Bailey, T., Upcroft, B. & Durrant-Whyte, H. (2006), Validation gating for non-linear non-
gaussian target tracking, in ‘IEEE Conference Proceedings on Information Fusion’. Avail-
able at: http://www.cas.edu.au/content.php/237.html?publicationid=351.
Bar-Shalom, Y., Li, X. R. & Kirubarajan, T. (2001), Estimation with Applications to Track-
ing and Navigation: Theory, Algorithms and Software, John Wiley and Sons.
Barshan, B. & Kuc, R. (1990), ‘Differentiating sonar reflections from corners and planes by
employing an intelligent sensor’, IEEE Trans. Pattern Analysis and Machine Intelligence
12(6), pp560–569.
Basseville, M. (1989), ‘Distance measures for signal processing and pattern recognition’,
Signal Processing 18, 349–369.
Bialek, W. (2002), Thinking about the brain, Technical report, Department of Physics,
Princeton University. Based on lectures at Les Houches Session LXXV, July 2001.
Blackman, S. (1986), Multiple-Target Tracking with Radar Application, Artech House.
Blakemore, C., ed. (1990), Vision: Coding and Efficiency, Cambridge University Press.
Boykov, Y., Veksler, O. & Zabih, R. (1998), ‘Markov random fields with efficient approxi-
mations’, IEEE Conf. on Computer Vision and Pattern Recognition pp. 648–655.
Bracewell, R. (2000), The Fourier Transform and Its Applications, 3rd edn, McGraw-Hill.
Brand, M. (2002), Incremental singular value decomposition of uncertain data with missing
values, Technical Report TR-2002-24, Mitsubishi Electric Research Laboratory (MERL).
Brodlie, K., Carpenter, L., Earnshaw, R., Gallop, J., Hubbold, R., Mumford, A., Osland, C.
& Quarendon, P. (1992), Scientific Visualisation: Techniques and Applications, Springer-
Verlag.

Brooker, G. (2006), ‘Sensors and signals’, Lecture Notes: The University of


Sydney. Available at: http://www.acfr.usyd.edu.au/teaching/4th-year/mech4721-
Signals/material/lecture notes/index.html.

Brooker, G., Scheding, S., Bishop, M. & Hennessy, R. (2005), ‘Development and applica-
tion of millimeter wave radar sensors for underground mining’, IEEE Sensors Journal
5(6), 1270–1280.

Butz, T. & Thiran, J.-P. (2002), ‘Multi-modal signal processing: An information theoretical
framework’, Technical Report 02.01, Signal Processing Institute, Swiss Federal Institute
of Technology (EPFL), Laussanne, Switzerland.

Cerf, N. & Adami, C. (1997), ‘Negative entropy and information in quantum mechanics’,
Physical Review Letters 79, 5194.
URL: http://www.citebase.org/cgi-bin/citations?id=oai:arXiv.org:quant-ph/9512022

Chen, Z. (2005), Bayesian filtering: From kalman filters to particle filters, and beyond,
Manuscript, Communications Research Laboratory, McMaster University.

Cover, T. M. & Thomas, J. A. (1991), Elements of Information Theory, John Wiley & Sons.

Cox, I., Miller, M. & McKellips, A. (1999), ‘Watermarking as communications with side
information’, Proc. IEEE 87(7), 1127–1141.

Croner, L. & Albright, T. (1999), ‘Seeing the big picture: Integration of image cues in the
primate visual system’, Neuron 24, 777–789.

Csiszar, I. & Korner, J. G. (1982), Information Theory: Coding Theorems for Discrete
Memoryless Systems, Academic Press.

DARPA (2005), ‘The darpa grand challenge website’, Website:


http://www.darpa.mil/grandchallenge. Last Accessed: 16-Feb-06, Last Modified:
28-Dec-05.

Deans, M. & Hebert, M. (2001), ‘Experimental comparison of techniques for localization


and mapping using a bearing-only sensor’, Lecture Notes in Control and Information
Sciences 271, 395–404.

Diebel, J. & Thrun, S. (2005), ‘An application of markov random fields to range sensing’,
In Proc. Conf. Neural Inf. Proc. Sys. (NIPS) .

Dill, J., ed. (2004), IEEE Computer Graphics and Applications: Point-Based Computer
Graphics, IEEE. Guest Editors: H. Pfister and M. Gross.

Doob, J. L. (1994), Measure Theory, Graduate Texts in Mathematics, Springer-Verlag.

Dragomir, S. S., Gluščević, V. & Pearce, C. E. M. (2001), ‘Csiszár f -divergence, ostrowski’s


inequality and mutual information’, Nonlinear Analysis 47, 2375–2386.

Durrant-Whyte, H. F. (2001), Multi sensor data fusion, Technical report, Australian Centre
for Field Robotics, The University of Sydney.

Elfes, A. (1987), ‘Sonar-based real-world mapping and navigation’, IEEE Trans. Robotics
and Automation 3(3), 249–265.

Elfes, A. (1995), Robot navigation: Integrating perception, environmental constraints and


task execution within a probabilistic framework, in ‘Reasoning With Uncertainty in
Robotics’, Springer-Verlag, Berlin, Germany, pp. 93 – 130. Invited Paper.

Fiser, J., Chiu, C. & Weliky, M. (2004), ‘Small modulation of ongoing cortical dynamics by
sensory input during natural vision’, Nature 431, 573–578.

Fisher, J., Wainwright, M., Sudderth, E. & Willsky, A. (2002), ‘Statistical and information-
theoretic methods for self-organization and fusion of multimodal, networked sensors’, Int.
J. High Performance Computing Applications 16(3), 337–353.

Griffiths, D. (1989), Introduction to Electrodynamics, 2nd edn, Prentice-Hall.

Grover, R., Brooker, G. & Durrant-Whyte, H. (2002), Environmental representation for


fused millimetre wave radar and nightvision data, in ‘Int. Conf. Control, Automation,
Robotics and Computer Vision (ICARCV), Singapore’.

Grover, R. & Durrant-Whyte, H. (2003), Efficient representations for learning unstructured


environments, in ‘Learning’03 Workshop, Snowbird, Utah, USA’.

Guo, X., Hua, J. & Qin, H. (2004), ‘Scalar-function-driven editing on point set surfaces’,
IEEE Computer Graphics and Applications pp. 43–52.

Ha, Q. P., Tran, T. H., Scheding, S., Dissanayake, G. & Durrant-Whyte, H. F. (2005),
‘Control issues of an autonomous vehicle’, 22nd Int. Symp. Automation and Robotics in
Construction .

Hearst, M., Dumais, S. T., Osuna, E., Platt, J. & Schölkopf, B. (1998), ‘Support vector
machines’, IEEE Intelligent Systems pp. 18–28.

Heinbockel, J. H. (2001), Introduction to Tensor Calculus and Continuum Mechanics, Traf-


ford Publishing.

Hennessy, R. (2005), A generic architecture for scanning radar sensors, Masters dissertation,
Australian Centre for Field Robotics, The University of Sydney.

Hero, A., Ma, B., Michel, O. & Gorman, J. (2001), Alpha-divergence for classificiation,
indexing and retrieval, Technical Report Technical Report CSPL-328, Communications
and Signal Processing Laboratory.

Hibbeler, R. (1997), Mechanics of Materials, 3rd (international) edn, Prentice-Hall.

Jaynes, E. T. (1996), Probability Theory: The Logic of Science, Cambridge University Press.

Johnson, R. (1991), Designer Notes for Microwave Antennas, Artech House.

Jordan, M. (2004), ‘Graphical models’, Statistical Science 19, 140–155. Special issue on
Bayesian Statistics.

Kapur, J. N. & Kesavan, H. K. (1992), Entropy Optimisation Principles with Applications,


Academic Press.

Karumanchi, S. (2005), Bio-inspired visual feature extraction on an augv, Undergraduate


thesis, Australian Centre for Field Robotics, The University of Sydney.

Konolige, K. (1997), ‘Improved occupancy grids for map building’, Autonomous Robots
4, 351–367.

Kraskov, A., Stögbauer, H., Andrzejak, R. & Grassberger, P. (2004), ‘Hierarchical clustering
based on mutual information’, E-print, arxiv.org/abs/q-bio.QM/0311037 .

Krause, E. (1986), Taxicab Geometry: An Adventure in Non-Euclidian Geometry, Dover.


Reprint of Addison-Wesley publication in 1975.

Kreyszig, E. (1993), Advanced Engineering Mathematics, 2nd edn, John Wiley and Sons.

Kumar, S., Ramos, F., Upcroft, B. & Durrant-Whyte, H. (2005), ‘A statistical framework
for natural feature representation’, Proc. IEEE/RSJ Int. Conf. Intelligent Robotics and
Systems .

Laine, A. & Fan, J. (1993), ‘Texture classification by wavelet packet signatures’, IEEE
Trans. Pattern Analysis and Machine Intelligence 15(11), 1186–1191.

Leal, J. (2003), Stochastic Environment Representation, Phd dissertation, Aus-


tralian Centre for Field Robotics, The Univeristy of Sydney. Available online at
http://www.cas.edu.au.

Lee, J.-S., Grunes, M. R., Pottier, E. & Ferro-Famil, L. (2004), ‘Unsupervised terrain
classification preserving polarimetric scattering characteristics’, IEEE Trans. Geoscience
and Remote Sensing 42(4), 722–731.

Leonard, J. & Durrant-Whyte, H. (1991), ‘Mobile robot localization by tracking geometric


beacons’, IEEE Trans. Robotics and Automation 7(3), 376–382.

Lewis, J. P. (1995), ‘Fast normalized cross-correlation’, Vision Interface .

Linder, T., Zamir, R. & Zeger, K. (2000), ‘On source coding with side information dependent
distortion measures’, IEEE Trans. Inf. Theory 46(7), 2697–2704.

Mackay, D. (1998), Introduction to gaussian processes, Technical report, Depart-


ment of Physics (Cavendish Laboratory), Cambridge University. Available at:
http://wol.ra.phy.cam.ac.uk/mackay/BayesGP.html.

Mackay, D. (2004), Information Theory, Inference and Learning Algorithms, Cambridge


University Press.

Majumder, S. (2001), Sensor Fusion and Feature-Based Navigation for Subsea Robots, Phd
dissertation, The Australian Centre for Field Robotics, The University of Sydney.

Majumder, S., Scheding, S. & Durrant-Whyte, H. (2001), ‘Multi-sensor data fusion for
underwater navigation’, Robotics and Autonomous Systems 35, 97–108.

Manduchi, R., Castano, A., Talukder, A. & Matthies, L. (2005), ‘Obstacle detection and ter-
rain classification for autonomous off-road navigation’, Autonomous Robotics 18(1), 81–
102.
Middleton, D. (1996), An Introduction to Statistical Communication Theory, 2nd reprint
edn, IEEE Press.
Minkoff, J. (2002), Signal Processing Fundamentals and Applications for Communications
and Sensing Systems, Artech House.

Moravec, H. & Elfes, A. (1985), High resolution maps from wide angle sonar, in ‘Proc.
IEEE Int. Conf. on Robotics and Automation (ICRA)’, pp. 116–121.
Muller, K.-R., Mika, S., Ratsch, G., Tsuda, K. & Scholkopf, B. (2001), ‘An introduction to
kernel-based learning algorithms’, IEEE Trans. Neural Networks 12(2), 181–202.
Murphy, K. (2001), An introduction to graphical models, Technical tutorial, University of
British Columbia.
Nettleton, E. (2003), Decentralised Architectures for Tracking and Navigation with Multiple
Flight Vehicles, Phd dissertation, Australian Centre for Field Robotics, The University
of Sydney.
Nettleton, E. & Durrant-Whyte, H. (2001), ‘Delayed and asequent data in decentralised
sensing networks’, Sensor Fusion and Decentralized Control in Robotics Systems IV
4571, 1–9. Presented at Photonics Boston, Boston MA, USA, 28 Oct. - 02 Nov. 2001.
Newman, P. (2000), On the Structure and Solution of the Simultaneous Localisation and
Map Building Problem, Phd dissertation, Australian Centre for Field Robotics, The
University of Sydney.
Ng, A. & Jordan, M. (2002), ‘On discriminative vs. generative classifiers: A comparison of
logistic regression and naive bayes’, Advances in Neural Information Processing Systems
14(2), 841–848.
Nieto, J. (2005), Detailed Environment Representation for the SLAM Problem, Phd disser-
tation, Australian Centre for Field Robotics, The University of Sydney.
Object Management Group (2006), ‘Uml resource page’, Internet site: http://www.uml.org.
Last Accessed: 26-Aug-2006.
Papoulis, A. & Pillai, U. (2002), Probability, Random Variables and Stochastic Processes,
4th edn, McGraw-Hill.
Paraview (2006), ‘Paraview website’, http://www.paraview.org. Last Accessed: 31 Aug
2006.

Preiss, B. (1998), Data Structures and Algorithms with Object-Oriented Design Patterns in
C++, Wiley.

Press, W., Teukolsky, S., Vetterling, W. & Flannery, B. (1992), Numerical Recipes in C,
Cambridge University Press.

Rachlin, Y., Dolan, J. & Khosla, P. (2005), ‘Efficient mapping through exploitation of
spatial dependencies’, In Proc. Int. Conf. Intelligent Robots and Systems - IROS .

Rajan, V. T. (1991), ‘Optimality of the delaunay triangulation in Rn ’, Proc. Seventh Annual


Symp. on Computational Geometry pp. 357–363.

Ribo, M. & Pinz, A. (2001), ‘A comparison of three uncertainty calculi for building sonar-
based occupancy grids’, Robotics and Autonomous Systems 35, 201–209.

Ridley, M., Upcroft, B., Ong, L. L., Kumar, S. & Sukkarieh, S. (2004), ‘Decentralised
data fusion with parzen density estimates’, Proc. Int. Conf. Intelligent Sensors, Sensor
Networks and Information Processing pp. 161–166.

Rosenberg, C. (2001), ‘The lenna story’, http://www.cs.cmu.edu/ chuck/lennapg/. Last


Accessed: 28 Aug 2006.

Rotariu, I. & Vullings, E. (2005), ‘Multi-dictionary matching pursuit for servo error anal-
ysis applied to iterative learning control’, 2005 IEEE Int. Workshop Intelligent Signal
Processing pp. 86–91.

Roweis, S. (1999), ‘Gaussian identities’, Tutorial Notes. Available at:


http://www.cs.toronto.edu/ roweis/notes.html Last Accessed: 16 March 2006.

Rowland, T. (1999), ‘Generalized function’, From MathWorld–A Wolfram Web Resource,


created by Eric W. Weisstein. http://mathworld.wolfram.com/GeneralizedFunction.html.
Last Accessed: 28 Aug 2006.

Scheding, S. (1997), High Integrity Navigation, Phd dissertation, Australian Centre for
Field Robotics, The University of Sydney.

Scheding, S. (2005), High integrity outdoor navigation, Technical report, Australian Centre
for Field Robotics, The University of Sydney.

Schroeder, W., Martin, K. & Lorensen, B. (2004), The Visualization Toolkit, 3rd edn,
Kitware Inc.

Shannon, C. (1958), ‘Channels with side information at the transmitter’, IBM Journal
pp. 289–293.

Smith, J. (2001), ‘Some observations on the concepts of information-theoretic entropy and


randomness’, Entropy 3, pp1–11. Available online at: www.mdpi.org/entropy.

Soon, B., Scheding, S. & Connolly, L. (2006), Error analysis of an integrated inertial naviga-
tion system and pseudoslam during gps outages, in ‘Proc. Int. Global Navigation Satellite
Systems Society (IGNSS) Symposium’.

Stanford Racing Team (2005), Stanford racing team’s entry in the 2005 darpa
grand challenge, Technical paper, Stanford University and DARPA. Available at:
http://www.darpa.mil/grandchallenge05/TechPapers/Stanford.pdf, Last Accessed: 16-
Feb-06.

Stein & Meredith (1992), Merging of the Senses, MIT Press.

Stewart, J. (1995), Calculus, 3rd edn, Brooks/Cole.

Tenenbaum, J. (1998), ‘Mapping a manifold of perceptual observations’, Advances in Neural


Information Processing Systems .

Thrun, S. (1998), ‘Learning metric-topological maps for indoor mobile robot navigation’,
Artificial Intelligence 99(1), 21–71.

Thrun, S., Burgard, W. & Fox, D. (2005), Probabilistic Robotics, MIT Press.

Tishby, N., Pereira, F. & Bialek, W. (1999), The information bottleneck method, in ‘Pro-
ceedings of the 37-th Annual Allerton Conference on Communication, Control and Com-
puting’, pp. 368–377.
URL: citeseer.ist.psu.edu/tishby99information.html

Torkkola, K. (2001), Nonlinear feature transforms using maximum mutual information,


in ‘Proc. IJCNN’01 Int. Joint Conf. on Neural Networks’, Vol. 4, pp. 2756–2761.

Tsang, I. & Kwok, J. (2003), ‘Distance metric learning with kernels’, Proc. Int. Conf.
Artificial Neural Networks pp. pp126–9.

Vandapel, N., Donamukkala, R. R. & Hebert, M. (2006), ‘Unmanned ground vehicle navi-
gation using aerial ladar data’, The Int. J. Robotics Research 25(1), 31–51.

Wang, X. R., Brown, A. & Upcroft, B. (2005), ‘Applying incremental em to bayesian


classifiers in the learning of hyperspectral remote sensing data’, Proc. 8th Int. Conf.
Information Fusion .

Weisstein, E. (1999a), ‘Delta function’, From MathWorld–A Wolfram Web Resource.


http://mathworld.wolfram.com/DeltaFunction.html. Last Accessed: 15 Mar 2006.

Weisstein, E. (1999b), ‘Inner product’, From MathWorld–A Wolfram Web Resource.


http://mathworld.wolfram.com/Inner Product.html. Last Accessed: 14 Oct 2005.

Weisstein, E. (1999c), ‘Norm’, From MathWorld–A Wolfram Web Resource.


http://mathworld.wolfram.com/ Norm.html. Last Accessed: 13 Oct 2005.

Weisstein, E. (1999d), ‘Vector norm’, From MathWorld–A Wolfram Web Resource.


http://mathworld.wolfram.com/VectorNorm.html. Last Accessed: 28 Aug 2006.

Widzyk-Capehart, E., Brooker, G., Hennessy, R. & Lobsey, C. (2005), ‘Rope shovel envi-
ronment mapping for improved operation using millimetre wave radar’, Proc. Australian
Mining Technology Conf. .

Yeung, R. W. (1991), ‘A new outlook on shannon’s information measures’, IEEE Trans.


Inf. Theory 37(3), 466–474.
